SAS/ETS 9.2 User's Guide
The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2008. SAS/ETS 9.2 User's Guide. Cary, NC: SAS Institute Inc.
Contents

I General Information
What's New in SAS/ETS
Introduction
Working with Time Series Data
Date Intervals, Formats, and Functions
SAS Macros and Functions

II Procedure Reference
Chapter 7. The ARIMA Procedure
Chapter 8. The AUTOREG Procedure
Chapter 9. The COMPUTAB Procedure
Chapter 10. The COUNTREG Procedure
Chapter 11. The DATASOURCE Procedure
Chapter 12. The ENTROPY Procedure (Experimental)
Chapter 13. The ESM Procedure
Chapter 14. The EXPAND Procedure
Chapter 15. The FORECAST Procedure
Chapter 16. The LOAN Procedure
Chapter 17. The MDC Procedure
Chapter 18. The MODEL Procedure
Chapter 19. The PANEL Procedure
Chapter 20. The PDLREG Procedure
Chapter 21. The QLIM Procedure
Chapter 22. The SIMILARITY Procedure (Experimental)
Chapter 23. The SIMLIN Procedure
Chapter 24. The SPECTRA Procedure
Chapter 26. The SYSLIN Procedure
Chapter 27. The TIMESERIES Procedure
Chapter 28. The TSCSREG Procedure
Chapter 29. The UCM Procedure
Chapter 30. The VARMAX Procedure
Chapter 31. The X11 Procedure
Chapter 32. The X12 Procedure

Chapter 37. Getting Started with Time Series Forecasting
Chapter 38. Creating Time ID Variables
Chapter 39. Specifying Forecasting Models
Chapter 40. Choosing the Best Forecasting Model
Chapter 41. Using Predictor Variables
Chapter 42. Command Reference
Chapter 43. Window Reference

V Investment Analysis
Chapter 45. Overview
Chapter 46. Portfolios
Chapter 47. Investments

Subject Index
Syntax Index
Credits
Documentation
Editing: Ed Huddleston, Anne Jones
Technical Review: Jackie Allen, Evan L. Anderson, Ming-Chun Chang, Jan Chvosta, Brent Cohen, Allison Crutchfield, Paige Daniels, Gül Ege, Bruce Elsheimer, Donald J. Erdman, Kelly Fellingham, Laura Jackson, Wilma S. Jackson, Wen Ji, Kurt Jones, Kathleen Kiernan, Jennifer Sloan, Michael J. Leonard, Mark R. Little, Kevin Meyer, Gina Marie Mondello, Steve Morrison, Youngjin Park, David Schlotzhauer, Jim Seabolt, Rajesh Selukar, Arthur Sinko, Michele A. Trovero, Charles Sun, Donna E. Woodward
Documentation Production
Software
The procedures in SAS/ETS software were implemented by members of the Advanced Analytics department. Program development includes design, programming, debugging, support, documentation, and technical review. In the following list, the names of the developers who currently support the procedure are listed first.
ARIMA
AUTOREG: Xilong Chen, Arthur Sinko, Jan Chvosta, John P. Sall
COMPUTAB: Michael J. Leonard, Alan R. Eaton, David F. Ross
COUNTREG: Jan Chvosta, Laura Jackson
DATASOURCE: Kelly Fellingham, Meltem Narter
ENTROPY: Xilong Chen, Arthur Sinko, Greg Sterijevski, Donald J. Erdman
ESM: Michael J. Leonard
EXPAND: Marc Kessler, Michael J. Leonard, Mark R. Little
FORECAST: Michael J. Leonard, Mark R. Little, John P. Sall
LOAN: Gül Ege
MDC: Jan Chvosta
MODEL: Donald J. Erdman, Mark R. Little, John P. Sall
PANEL: Jan Chvosta, Greg Sterijevski
PDLREG: Xilong Chen, Arthur Sinko, Jan Chvosta, Leigh A. Ihnen
QLIM: Jan Chvosta
SIMILARITY: Michael J. Leonard
SIMLIN: Mark R. Little, John P. Sall
SPECTRA: Rajesh Selukar, Donald J. Erdman, John P. Sall
STATESPACE: Donald J. Erdman, Michael J. Leonard
SYSLIN: Donald J. Erdman, Leigh A. Ihnen, John P. Sall
TIMESERIES: Marc Kessler, Michael J. Leonard
TSCSREG: Jan Chvosta, Meltem Narter
UCM: Rajesh Selukar
VARMAX: Youngjin Park
X11: Wilma S. Jackson, R. Bart Killam, Leigh A. Ihnen, Richard D. Langston
X12: Wilma S. Jackson
Time Series Forecasting System: Evan L. Anderson, Michael J. Leonard, Meltem Narter, Gül Ege
Investment Analysis System: Gül Ege, Scott Gray, Michael J. Leonard
Compiler and Symbolic Differentiation: Stacey Christian
SASEHAVR: Kelly Fellingham
SASECRSP: Kelly Fellingham, Peng Zang
SASEFAME: Kelly Fellingham
Testing: Jackie Allen, Ming-Chun Chang, Bruce Elsheimer, Kelly Fellingham, Jennifer Sloan, Charles Sun, Linda Timberlake, Mark Traccarella, Peng Zang
Technical Support
Members: Paige Daniels, Wen Ji, Kurt Jones, Kathleen Kiernan, Kevin Meyer, Gina Marie Mondello, David Schlotzhauer, Donna E. Woodward
Acknowledgments
Hundreds of people have helped the SAS System in many ways since its inception. The following individuals have been especially helpful in the development of the procedures in SAS/ETS software. Acknowledgments for the SAS System generally appear in Base SAS software documentation and SAS/ETS software documentation.
David Amick             Idaho Office of Highway Safety
David M. DeLong         Duke University
David Dickey            North Carolina State University
Douglas J. Drummond     Center for Survey Statistics
William Fortney         Boeing Computer Services
Wayne Fuller            Iowa State University
A. Ronald Gallant       The University of North Carolina at Chapel Hill
Phil Hanser             Sacramento Municipal Utilities District
Marvin Jochimsen        Mississippi R&O Center
Jeff Kaplan             Sun Guard
Ken Kraus               Center for Research in Security Prices
George McCollister      San Diego Gas & Electric
Douglas Miller          Purdue University
Brian Monsell           U.S. Census Bureau
Robert Parks            Washington University
Gregory Sali            Idaho Office of Highway Safety
Bob Spatz               Center for Research in Security Prices
Mary Young              Salt River Project
The final responsibility for the SAS System lies with SAS Institute alone. We hope that you will always let us know your opinions about the SAS System and its documentation. It is through your participation that SAS software is continuously improved.
Part I
General Information
Chapter 1
What's New in SAS/ETS
This chapter summarizes the new features available in SAS/ETS software with SAS 9.2. It also describes other new features that were added with SAS 9.1.
Overview
Many SAS/ETS procedures now produce graphical output using the SAS Output Delivery System. This output is produced when you turn on ODS graphics with the following ODS statement:
ods graphics on;
Several procedures now support the PLOTS= option to control the graphical output produced. (See the chapters for individual SAS/ETS procedures for details on the plots supported.)

With SAS 9.2, SAS/ETS offers three new modules:
- The new ESM procedure provides forecasting using exponential smoothing models with optimized smoothing weights.
- The SASEHAVR interface engine is now production and available to Windows users for accessing economic and financial data residing in a Haver Analytics Data Link Express (DLX) database.
- The new SIMILARITY (experimental) procedure provides similarity analysis of time series data.

New features have been added to the following SAS/ETS components:
- PROC AUTOREG
- PROC COUNTREG
- PROC DATASOURCE
- PROC MODEL
- PROC PANEL
- PROC QLIM
- SASECRSP Interface Engine
- SASEFAME Interface Engine
- SASEHAVR Interface Engine
- PROC UCM
- PROC VARMAX
- PROC X12
AUTOREG Procedure
Two new features have been added to the AUTOREG procedure.
- An alternative test for stationarity, proposed by Kwiatkowski, Phillips, Schmidt, and Shin (KPSS), is implemented. The null hypothesis for this test is a stationary time series, which is a natural choice for many applications. Bartlett and quadratic spectral kernels for estimating long-run variance can be used. Automatic bandwidth selection is an option.
- Corrected Akaike information criterion (AICC) is implemented. This modification of AIC corrects for small-sample bias. Along with the corrected Akaike information criterion, the mean absolute error (MAE) and mean absolute percentage error (MAPE) are now included in the summary statistics.
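For orientation, these summary statistics are commonly written as shown below. These are the standard textbook forms, given here as an assumption; the exact expressions that PROC AUTOREG reports are defined in the AUTOREG chapter. Here L is the maximized likelihood, k is the number of estimated parameters, n is the number of observations, y_t is the observed value, and \hat{y}_t is the predicted value.

```latex
\mathrm{AIC} = -2\ln L + 2k, \qquad
\mathrm{AICC} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}

\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t-\hat{y}_t\right|, \qquad
\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t-\hat{y}_t}{y_t}\right|
```

The correction term 2k(k+1)/(n-k-1) shrinks toward zero as n grows, so AICC and AIC agree in large samples.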
COUNTREG Procedure
Often the data being analyzed take the form of nonnegative integer (count) values. The new COUNTREG procedure implements count data models that take this discrete nature of the data into consideration. The dependent variable in these models is a count that represents various discrete events (such as number of accidents, number of doctor visits, or number of children). The conditional mean of the dependent variable is a function of various covariates. Typically, you are interested in estimating the probability of the number of event occurrences using maximum likelihood estimation.

The COUNTREG procedure supports the following types of models:
- Poisson regression
- negative binomial regression with linear (NEGBIN1) and quadratic (NEGBIN2) variance functions (Cameron and Trivedi 1986)
- zero-inflated Poisson (ZIP) model (Lambert 1992)
- zero-inflated negative binomial (ZINB) model
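As a minimal sketch of how two of these models might be requested, the following statements fit a Poisson regression and then a zero-inflated Poisson model. The data set VISITS and the variables NVISITS, AGE, and INCOME are hypothetical names chosen for illustration; see the COUNTREG chapter for the complete syntax.

```sas
/* Hypothetical data set WORK.VISITS: count response NVISITS,
   covariates AGE and INCOME (all names are illustrative). */
proc countreg data=visits;
   model nvisits = age income / dist=poisson;
run;

/* Zero-inflated Poisson: the ZEROMODEL statement names the
   covariates assumed to drive the excess-zero process. */
proc countreg data=visits;
   model nvisits = age income / dist=zip;
   zeromodel nvisits ~ age;
run;
```

Comparing the fit statistics of the two runs is one way to judge whether the excess-zero component is needed.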
DATASOURCE Procedure
PROC DATASOURCE now supports the newest Compustat Industrial Universal Character Annual and Quarterly data by providing the new filetypes CSAUCY3 for annual data and CSQUCY3 for quarterly data.
MODEL Procedure
The t copula and the normal mixture copula have been added to the MODEL procedure. Both copulas support asymmetric parameters. The copula is used to modify the correlation structure of the model residuals for simulation.

Starting with SAS 9.2, the MODEL procedure stores MODEL files in SAS data sets using an XML-like format instead of in SAS catalogs. This makes MODEL files more readily extendable in the future and enables Java-based applications to read the MODEL files directly. More information is stored in the new format MODEL files; this enables some features that are not available when the catalog format is used.

The MODEL procedure continues to read and write old-style catalog MODEL files, and model files created by previous releases of SAS/ETS continue to work, so you should experience no direct impact from this change.

The CMPMODEL= option can be used in an OPTIONS statement to modify the behavior of the MODEL procedure when reading and writing MODEL files. The values allowed are CMPMODEL= BOTH | XML | CATALOG. For example, the following statement restores the previous behavior:
options cmpmodel=catalog;
The CMPMODEL= option defaults to BOTH in SAS 9.2; this option is intended for transitional use while customers become accustomed to the new file format. If CMPMODEL=BOTH, the MODEL procedure writes both formats; when loading model files, PROC MODEL attempts to load the XML version first and the CATALOG version second (if the XML version is not found). If CMPMODEL=XML, the MODEL procedure reads and writes only the XML format. If CMPMODEL=CATALOG, only the catalog format is used.
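The following sketch shows one way the option might be combined with a stored model. The data set SALES, the variables Q and PRICE, and the model name WORK.DEMAND are hypothetical; only the CMPMODEL= values come from the text above.

```sas
/* Read and write MODEL files in the XML-like format only. */
options cmpmodel=xml;

/* Hypothetical example: fit a simple linear demand equation
   and store the model definition in WORK.DEMAND. */
proc model data=sales outmodel=work.demand;
   parms a b;
   q = a + b * price;
   fit q;
run;
quit;
```

Setting the option back to BOTH afterward returns to the SAS 9.2 default described above.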
PANEL Procedure
The PANEL procedure expands the estimation capability of the TSCSREG procedure in the time series cross-sectional framework. The new methods include between estimators, pooled estimators, and dynamic panel estimators that use the generalized method of moments (GMM). Creating lags of variables in a panel setting is simplified by the LAG statement. Because the presence of heteroscedasticity can result in inefficient and biased estimates of the variance covariance matrix in the OLS framework, several methods that produce heteroscedasticity-corrected covariance matrices (HCCME) are added. The new RESTRICT statement specifies linear restrictions on the parameters. New ODS Graphics plots simplify model development by providing visual analytical tools.
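A minimal sketch of two of the new estimators follows. The data set FIRMS and the variables FIRM, YEAR, Y, X1, and X2 are hypothetical; the option names shown (POOLED, BTWNG, HCCME=) should be checked against the PANEL chapter before use.

```sas
/* Hypothetical panel WORK.FIRMS: cross-section ID FIRM, time ID YEAR.
   The data are sorted by the ID variables before the analysis. */
proc sort data=firms;
   by firm year;
run;

proc panel data=firms;
   id firm year;
   /* Pooled OLS with a heteroscedasticity-corrected covariance matrix */
   model y = x1 x2 / pooled hccme=1;
   /* Between-groups estimator */
   model y = x1 x2 / btwng;
run;
```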
QLIM Procedure
Stochastic frontier models are now available in the QLIM procedure. Specification of these models allows for random shocks of production or cost along with technological or cost inefficiencies. The nonnegative error-term component that represents technological or cost inefficiencies can have a half-normal, exponential, or truncated normal distribution.
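As a sketch, a production frontier with a half-normal inefficiency term might be requested as follows. The data set PLANTS and the variables LNQ, LNK, and LNL are hypothetical; consult the QLIM chapter for the full set of FRONTIER options.

```sas
/* Hypothetical data set WORK.PLANTS: log output LNQ,
   log capital LNK, and log labor LNL. */
proc qlim data=plants;
   model lnq = lnk lnl;
   /* TYPE= selects the inefficiency distribution;
      PRODUCTION requests a production (not cost) frontier. */
   endogenous lnq ~ frontier(type=half production);
run;
```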
SASECRSP Engine
The SASECRSP interface now supports reading of CRSP stock, indices, and combined stock/indices databases by using a variety of keys, not just CRSP's primary key PERMNO. In addition, SASECRSP can now read the CRSP/Compustat Merged (CCM) database and fully supports cross-database access, enabling you to access the CCM database by CRSP's main identifiers PERMNO and PERMCO, as well as to access the CRSP Stock databases by Compustat's GVKEY identifier.

A list of other new features follows:
- SASECRSP now fully supports access of fiscal CCM data members by both fiscal and calendar date range restrictions. Fiscal to calendar date shifting has been added as well.
- New date fields have been added for CCM fiscal members. Now fiscal members have three different dates: a CRSP date, a fiscal integer date, and a calendar integer date.
- An additional date function has been added that enables you to convert from fiscal to calendar dates.
- Date range restriction for segment members has also been added.
SASEFAME Engine
The SASEFAME interface enables you to access and process financial and economic time series data that resides in a FAME database. SASEFAME for SAS 9.2 supports Windows, Solaris, AIX, Linux, Linux Opteron, and HP-UX hosts.

You can now use the SAS windowing environment to view FAME data and use the SAS viewtable commands to navigate your FAME database. You can select the time span of data by specifying a range of dates in the RANGE= option. You can use an input SAS data set with a WHERE clause to specify selection of variables based on BY variables, such as tickers or issues stored in a FAME string case series. You can use a FAME crosslist to perform selection based on the crossproduct of two FAME namelists. The new FAMEOUT= option now supports the following classes and types of data series objects: FORMULA, TIME, BOOLEAN, CASE, DATE, and STRING.

It is easy to use a SAS input data set with the INSET= option to create a specific view of your FAME data. Multiple views can be created by using multiple LIBNAME statements with customized options tailored to the unique view that you want to create. See Example 34.10, "Selecting Time Series Using CROSSLIST= Option with INSET= and WHERE=TICK," on page 2325 in Chapter 34, "The SASEFAME Interface Engine."

The INSET variables define the BY variables that enable you to view cross sections or slices of your data. When used in conjunction with the WHERE clause and the CROSSLIST= option, SASEFAME can show any or all of your BY groups in the same view or in multiple views. The INSET= option is invalid without a WHERE clause that specifies the BY variables you want to use in your view, and it must be used with the CROSSLIST= option.

The CROSSLIST= option provides a more efficient means of selecting cross sections of financial time series data. This option can be used without using the INSET= option. There are two methods for performing the crosslist selection function. The first method uses two FAME namelists, and the second method uses one namelist and one BY group specified in the WHERE= clause of the INSET= option. See Example 34.9, "Selecting Time Series Using CROSSLIST= Option with a FAME Namelist of Tickers," on page 2322 in Chapter 34, "The SASEFAME Interface Engine."

The FAMEOUT= option provides efficient selection of the class and type of the FAME data series objects you want in your SAS output data set. The possible values for fame_data_object_class_type are FORMULA, TIME, BOOLEAN, CASE, DATE, and STRING. If the FAMEOUT= option is not specified, numeric time series are output to the SAS data set. FAMEOUT=CASE defaults to case series of numeric type, so if you want another type of case series in your output, then you must specify it. Scalar data objects are not supported. See Example 34.6, "Reading Other FAME Data Objects with the FAMEOUT= Option," on page 2316 in Chapter 34, "The SASEFAME Interface Engine."
SASEHAVR Engine
The SASEHAVR interface engine is now production, giving Windows users random access to economic and financial data that resides in a Haver Analytics Data Link Express (DLX) database. You can now use the SAS windowing environment to view HAVERDLX data and use the SAS viewtable commands to navigate your Haver database. You can use the SQL procedure to create a view of your resulting SAS data set. You can limit the range of data that is read from the time series and specify a desired conversion frequency. Start dates are recommended in the LIBNAME statement to help you save resources when processing large databases or when processing a large number of observations. You can further subset your data by using the WHERE, KEEP, or DROP statements in your DATA step. New options are provided for more efficient subsetting by time series variables, groups, or sources. You can force the aggregation of all variables selected to be of the frequency specified by the FREQ= option if you also specify the FORCE=FREQ option. Aggregation is supported only from a more frequent time interval to a less frequent time interval, such as from weekly to monthly.

A list of other new features follows:
- You can see the available data sets in the SAS LIBNAME window of the SAS windowing environment by selecting the SASEHAVR libref in the LIBNAME window that you have previously used in your LIBNAME statement. You can view your SAS output observations by double clicking on the desired output data set libref in the LIBNAME window of the SAS windowing environment. You can type Viewtable on the SAS command line to view any of your SASEHAVR tables, views, or librefs, both for input and output data sets.
- By default, the SASEHAVR engine reads all time series in the Haver database that you reference by using your SASEHAVR libref. The START= option is specified in the form YYYYMMDD, as is the END= option. The start and end dates are used to limit the time span of data; they can help you save resources when processing large databases or when processing a large number of observations.
- It is also possible to select specific variables to be included or excluded from the SAS data set by using the KEEP= or the DROP= option. When the KEEP= or the DROP= option is used, the resulting SAS data set keeps or drops the variables that you select in that option. There are three wildcards currently available: *, ?, and #. The * wildcard corresponds to any character string and will include any string pattern that corresponds to that position in the matching variable name. The ? wildcard means that any single alphanumeric character is valid. The # wildcard corresponds to a single numeric character.
- You can also select time series in your data by using the GROUP= or the SOURCE= option to select on group name or on source name. Alternatively, you can deselect time series by using the DROPGROUP= or the DROPSOURCE= option. These options also support the wildcards *, ?, and #.
- By default, SASEHAVR selects only the variables that are of the frequency specified in the FREQ= option. If this option is not specified, SASEHAVR selects the variables that match the frequency of the first selected variable. If no other selection criteria are specified, the first selected variable is the first physical DLXRecord read from the Haver database.
- The FORCE=FREQ option can be specified to force the aggregation of all variables selected to be of the frequency specified by the FREQ= option. Aggregation is supported only from a more frequent time interval to a less frequent time interval, such as from weekly to monthly. The FORCE= option is ignored if the FREQ= option is not specified.
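The options described above might be combined as in the following sketch. The libref HAV, the physical path, and the database name HAVERW are hypothetical; the option names FREQ=, START=, END=, and FORCE=FREQ are the ones described in this section.

```sas
/* Hypothetical path: point it at the folder that holds your Haver DLX files. */
libname hav sasehavr 'C:\haver\data'
        freq=monthly       /* target frequency                     */
        start=19900101     /* YYYYMMDD form                        */
        end=19991231       /* YYYYMMDD form                        */
        force=freq;        /* aggregate more frequent series to monthly */

/* HAVERW is a hypothetical database name inside that folder. */
data work.monthly;
   set hav.haverw;
run;
```

Because FORCE=FREQ is ignored without FREQ=, the two options are specified together here.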
UCM Procedure
The following features are new to the UCM procedure:
- The new RANDOMREG statement enables specification of regressors with time-varying regression coefficients. The coefficients are assumed to follow independent random walks. Multiple RANDOMREG statements can be specified, and each statement can specify multiple regressors. The regression coefficient random walks for regressors specified in the same RANDOMREG statement are assumed to have the same disturbance variance parameter. This arrangement enables a very flexible specification of regressors with time-varying coefficients.
- The new SPLINEREG statement enables specification of a spline regressor that can optionally have time-varying coefficients. The spline specification is useful when the series being forecast depends on a regressor in a nonlinear fashion.
- The new SPLINESEASON statement enables parsimonious modeling of long and complex seasonal patterns using the spline approximation.
- The SEASON statement now has options that enable complete control over the constituent harmonics that make up the trigonometric seasonal model.
- It is now easy to obtain diagnostic test statistics useful for detecting structural breaks such as additive outliers and level shifts.
- As an experimental feature, you can now model the irregular component as an autoregressive moving-average (ARMA) process.
- The memory management and numerical efficiency of the underlying algorithms have been improved.
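A minimal sketch of a model with one time-varying regression coefficient follows. The data set SALES and the variables DATE, UNITS, and PRICE are hypothetical names used only for illustration.

```sas
/* Hypothetical monthly data set WORK.SALES: response UNITS,
   regressor PRICE, and a SAS date variable DATE. */
proc ucm data=sales;
   id date interval=month;
   model units;
   irregular;
   level;
   season length=12 type=trig;
   /* The coefficient of PRICE is assumed to follow a random walk. */
   randomreg price;
   estimate;
   forecast lead=12;
run;
```

Listing a second regressor in the same RANDOMREG statement would, as described above, make both coefficient random walks share one disturbance variance; a separate RANDOMREG statement gives each its own.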
VARMAX Procedure
The VARMAX procedure now enables independent (exogenous) variables with their distributed lags to influence dependent (endogenous) variables in various models, such as VARMAX, BVARX, VECMX, BVECMX, and GARCH-type multivariate conditional heteroscedasticity models.
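The VARMAX(p,q,s) model is defined as

\[ y_t = \sum_{i=1}^{p} \Phi_i y_{t-i} + \sum_{i=0}^{s} \Theta_i^{*} x_{t-i} + \epsilon_t - \sum_{i=1}^{q} \Theta_i \epsilon_{t-i} \]

or

\[ \Phi(B) y_t = \Theta^{*}(B) x_t + \Theta(B) \epsilon_t \]

where

\[ \Phi(B) = I_k - \sum_{i=1}^{p} \Phi_i B^i, \quad \Theta^{*}(B) = \Theta_0^{*} + \Theta_1^{*} B + \cdots + \Theta_s^{*} B^s, \quad \text{and} \quad \Theta(B) = I_k - \sum_{i=1}^{q} \Theta_i B^i \]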
If the Kalman filtering method is used for the parameter estimation of the VARMAX(p,q,s) model, then the dimension of the state-space vector is large, which takes time and memory for computing. For convenience, the parameter estimation of the VARMAX(p,q,s) model uses the two-stage estimation method, which computes the estimation of deterministic terms and exogenous parameters and then maximizes the log-likelihood function of the VARMA(p,q) model. Some examples of VARMAX modeling are:
model y1 y2 = x1 / q=1;
nloptions tech=qn;

model y1 y2 = x1 / p=1 q=1 xlag=1 nocurrentx;
nloptions tech=qn;
X12 Procedure
The X12 procedure has many new statements and options. Many of the new features are related to the regARIMA modeling, which is used to extend the series to be seasonally adjusted. A new experimental input and output data set has been added that describes the time series model fit to the series.

The following miscellaneous statements and options are new:
- The NOINT option on the AUTOMDL statement suppresses the fitting of a constant term in automatically identified models.
- The following tables are now available through the OUTPUT statement: A7, A9, A10, C20, D1, and D7.
- The TABLES statement enables you to display some tables that represent intermediate calculations in the X11 method and that are not displayed by default.

The following statements and options related to the regression component of regARIMA modeling are new:
- The SPAN= option on the OUTLIER statement can be used to limit automatic outlier detection to a subset of the time series observations.
- The following predefined variables have been added to the PREDEFINED option on the REGRESSION statement: EASTER(value), SCEASTER(value), LABOR(value), THANK(value), TDSTOCK(value), SINCOS(value ...).
- User-defined regression variables can be included in the regression model by specifying them in the USERVAR=(variables) option in the REGRESSION statement or the INPUT statement.
- Events can be included as user-defined regression variables in the regression model by specifying them in the EVENT statement. SAS predefined events do not require an INEVENT= data set, but an INEVENT= data set can be specified to define other events.
- You can now supply initial or fixed parameter values for regression variables by using the B=(value <F> ...) option in the EVENT statement, the B=(value <F> ...) option in the INPUT statement, the B=(value <F> ...) option in the REGRESSION statement, or by using the MDLINFOIN= data set in the PROC X12 statement. Some regression variable parameters can be fixed while others are estimated.
- You can now assign user-defined regression variables to a group by the USERTYPE= option in the EVENT statement, the USERTYPE= option in the INPUT statement, the USERTYPE= option in the REGRESSION statement, or by using the MDLINFOIN= data set in the PROC X12 statement. Census Bureau predefined variables are automatically assigned to a regression group, and this cannot be modified. But assigning user-defined regression variables to a regression group allows them to be processed similarly to the predefined variables.
- You can now supply initial or fixed parameters for ARMA coefficients using the MDLINFOIN= data set in the PROC X12 statement. Some ARMA coefficients can be fixed while others are estimated.
- The INEVENT= option on the PROC X12 statement enables you to supply an EVENT definition data set so that the events defined in the EVENT definition data set can be used as user-defined regressors in the regARIMA model.
- User-defined regression variables in the input data set can be identified by specifying them in the USERDEFINED statement. User-defined regression variables specified in the USERVAR=(variables) option of the REGRESSION statement or the INPUT statement do not need to be specified in the USERDEFINED statement, but user-defined variables specified only in the MDLINFOIN= data set need to be identified in the USERDEFINED statement.

The following new experimental options specify input and output data sets that describe the time series model:
- The MDLINFOIN= and MDLINFOOUT= data sets specified in the PROC X12 statement enable you to store the results of model identification and use the stored information as input when executing the X12 procedure.
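The following sketch shows how a few of these features might appear together in one run. The data set SALES and the variables DATE and UNITS are hypothetical, and the particular combination of statements is an illustration, not a recommended specification.

```sas
/* Hypothetical monthly data set WORK.SALES: series UNITS, date variable DATE. */
proc x12 data=sales date=date;
   var units;
   transform function=log;
   /* One of the predefined regressors added in this release */
   regression predefined=easter(8);
   /* Suppress the constant term in automatically identified models */
   automdl noint;
   x11;
   /* A1 = original series, D11 = final seasonally adjusted series */
   output out=adjusted a1 d11;
run;
```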
Chapter 2
Introduction
Contents
Overview of SAS/ETS Software
  Uses of SAS/ETS Software
  Contents of SAS/ETS Software
Experimental Software
About This Book
  Chapter Organization
  Typographical Conventions
Where to Turn for More Information
  Accessing the SAS/ETS Sample Library
  Online Help System
  SAS Short Courses
  SAS Technical Support Services
Major Features of SAS/ETS Software
  Discrete Choice and Qualitative and Limited Dependent Variable Analysis
  Regression with Autocorrelated and Heteroscedastic Errors
  Simultaneous Systems Linear Regression
  Linear Systems Simulation
  Polynomial Distributed Lag Regression
  Nonlinear Systems Regression and Simulation
  ARIMA (Box-Jenkins) and ARIMAX (Box-Tiao) Modeling and Forecasting
  Vector Time Series Analysis
  State Space Modeling and Forecasting
  Spectral Analysis
  Seasonal Adjustment
  Structural Time Series Modeling and Forecasting
  Time Series Cross-Sectional Regression Analysis
  Automatic Time Series Forecasting
  Time Series Interpolation and Frequency Conversion
  Trend and Seasonal Analysis on Transaction Databases
  Access to Financial and Economic Databases
  Spreadsheet Calculations and Financial Report Generation
  Loan Analysis, Comparison, and Amortization
  Time Series Forecasting System
  Investment Analysis System
  ODS Graphics
Related SAS Software
  Base SAS Software
  SAS Forecast Studio
  SAS High-Performance Forecasting
  SAS/GRAPH Software
  SAS/STAT Software
  SAS/IML Software
  SAS Stat Studio
  SAS/OR Software
  SAS/QC Software
  MLE for User-Defined Likelihood Functions
  JMP Software
  SAS Enterprise Guide
  SAS Add-In for Microsoft Office
  Enterprise Miner - Time Series nodes
  SAS Risk Products
References
- forecasting dairy milk yields and composition (Benseman 1990)
- predicting the gloss of coated aluminum products subject to weathering (Khan 1990)
- learning curve analysis for predicting manufacturing costs of aircraft (Le Bouton 1989)
- analyzing Dow Jones stock index trends (Early, Sweeney, and Zekavat 1989)
- analyzing the usefulness of the composite index of leading economic indicators for forecasting the economy (Lin and Myers 1988)
- analysis of time-stamped transactional data
- time series cross-sectional regression analysis
- unobserved components analysis of time series
- vector autoregressive and moving-average modeling and forecasting
- seasonal adjustment (Census X-11 and X-11 ARIMA)
- seasonal adjustment (Census X-12 ARIMA)
Macros
SAS/ETS software includes the following SAS macros:

%AR          generates statements to define autoregressive error models for the MODEL procedure
%BOXCOXAR    investigates Box-Cox transformations useful for modeling and forecasting a time series
%DFPVALUE    computes probabilities for Dickey-Fuller test statistics
%DFTEST      performs Dickey-Fuller tests for unit roots in a time series process
%LOGTEST     tests to determine whether a log transformation is appropriate for modeling and forecasting a time series
%MA          generates statements to define moving-average error models for the MODEL procedure
%PDL         generates statements to define polynomial distributed lag models for the MODEL procedure
These macros are part of the SAS AUTOCALL facility and are automatically available for use in your SAS program. Refer to SAS Macro Language: Reference for information about the SAS macro facility.
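As a hedged illustration of how these AUTOCALL macros might be invoked, the sketch below runs %DFTEST and %LOGTEST on SASHELP.AIR. The choice of that sample data set and the DIF= and AR= settings are assumptions made for the example, not recommendations from this chapter. Each macro returns its result in a macro variable of the same name.

```sas
/* Dickey-Fuller test on the series after simple and seasonal differencing */
/* (data set and option values are illustrative assumptions)               */
%dftest(sashelp.air, air, dif=(1,12), ar=5);
%put Dickey-Fuller p-value: &dftest;

/* Check whether a log transformation appears appropriate */
%logtest(sashelp.air, air);
%put Suggested transformation: &logtest;
```

The %PUT statements write the returned values to the SAS log, where they can be inspected or used to drive later steps.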
Experimental Software
Experimental software is sometimes included as part of a production-release product. It is provided to (sometimes targeted) customers in order to obtain feedback. All experimental uses are marked "Experimental" in this document. Whenever an experimental procedure, statement, or option is used, a message is printed to the SAS log to indicate that it is experimental. The design and syntax of experimental software might change before any production release. Experimental software has been tested prior to release, but it has not necessarily been tested to production-quality standards, and so should be used with care.
Chapter Organization
Following a brief "What's New," this book is divided into five major parts. Part I contains general information to aid you in working with SAS/ETS software. Part II explains the SAS procedures of SAS/ETS software. Part III describes the available data access interfaces for economic and financial databases. Part IV is the reference for the Time Series Forecasting System, an interactive forecasting menu system that uses PROC ARIMA and other routines to perform time series forecasting. Finally, Part V is the reference for the Investment Analysis System.

The new features added to SAS/ETS software since the publication of SAS/ETS Software: Changes and Enhancements for Release 8.2 are summarized in Chapter 1, "What's New in SAS/ETS." If you have used SAS/ETS software in the past, you may want to skim this chapter to see what's new.

Part I contains the following chapters.

Chapter 2, the current chapter, provides an overview of SAS/ETS software and summarizes related SAS publications, products, and services.

Chapter 3, "Working with Time Series Data," discusses the use of SAS data management and programming features for time series data.

Chapter 4, "Date Intervals, Formats, and Functions," summarizes the time intervals, date and datetime informats, date and datetime formats, and date and datetime functions available in the SAS System.

Chapter 5, "SAS Macros and Functions," documents SAS macros and DATA step financial functions provided with SAS/ETS software. The macros use SAS/ETS procedures to perform Dickey-Fuller tests, test for the need for log transformations, or select optimal Box-Cox transformation parameters for time series data.

Chapter 6, "Nonlinear Optimization Methods," documents the NonLinear Optimization subsystem used by some ETS procedures to perform nonlinear optimization tasks.

Part II contains chapters that explain the SAS procedures that make up SAS/ETS software. These chapters appear in alphabetical order by procedure name.
Part III contains chapters that document the ETS access interfaces to economic and financial databases.

Each of the chapters that document the SAS/ETS procedures (Part II) and the SAS/ETS access interfaces (Part III) is organized as follows:

1. The "Overview" section gives a brief description of the procedure.
2. The "Getting Started" section provides a tutorial introduction on how to use the procedure.
3. The "Syntax" section is a reference to the SAS statements and options that control the procedure.
4. The "Details" section discusses various technical details.
5. The "Examples" section contains examples of the use of the procedure.
6. The "References" section contains technical references on methodology.

Part IV contains the chapters that document the features of the Time Series Forecasting System.

Part V contains chapters that document the features of the Investment Analysis System.
Typographical Conventions
This book uses several type styles for presenting information. The following list explains the meaning of the typographical conventions used in this book:

roman              is the standard type style used for most text.
UPPERCASE ROMAN    is used for SAS statements, options, and other SAS language elements when they appear in the text. However, you can enter these elements in your own SAS programs in lowercase, uppercase, or a mixture of the two.
UPPERCASE BOLD     is used in the "Syntax" sections' initial lists of SAS statements and options.
oblique            is used for user-supplied values for options in the syntax definitions. In the text, these values are written in italic.
helvetica          is used for the names of variables and data sets when they appear in the text.
bold               is used to refer to matrices and vectors and to refer to commands.
italic             is used for terms that are defined in the text, for emphasis, and for references to publications.
bold monospace     is used for example code. In most cases, this book uses lowercase type for SAS statements.
- linear regression model with heteroscedasticity
- probit with heteroscedasticity
- logit with heteroscedasticity
- Tobit (censored and truncated) with heteroscedasticity
- Box-Cox regression with heteroscedasticity
- bivariate probit
- bivariate Tobit
- sample selection models
- multivariate limited dependent models

The COUNTREG procedure provides regression models in which the dependent variable takes nonnegative integer count values. The COUNTREG procedure supports the following models:

- Poisson regression
- negative binomial regression with quadratic and linear variance functions
- zero-inflated Poisson (ZIP) model
- zero-inflated negative binomial (ZINB) model
- fixed and random effect Poisson panel data models
- fixed and random effect NB (negative binomial) panel data models

The PANEL procedure deals with panel data sets that consist of time series observations on each of several cross-sectional units. The PANEL procedure supports the following models and methods:

- one-way and two-way models
- fixed and random effects
- autoregressive models
- the Parks method
- dynamic panel estimator
- the Da Silva method for moving-average disturbances
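To make the count-data case concrete, here is a minimal sketch of a Poisson regression with PROC COUNTREG. The data set VISITS, its variables, and the simulated coefficients are all hypothetical and exist only to give the procedure something to fit.

```sas
/* Simulated count data; all names and coefficients are hypothetical */
data visits;
   call streaminit(2008);
   do id = 1 to 200;
      age    = 20 + 40*rand('uniform');
      income = rand('normal', 50, 10);
      nvisit = rand('poisson', exp(0.5 + 0.01*age));
      output;
   end;
run;

/* Poisson regression of the count on two regressors */
proc countreg data=visits;
   model nvisit = age income / dist=poisson;
run;
```

Changing the DIST= setting (for example, to DIST=NEGBIN(P=2)) would request one of the negative binomial variants listed above instead.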
- tests for any linear hypothesis that involves the structural coefficients
- restrictions for any linear combination of the structural coefficients
- forecasts with confidence limits
- estimation and forecasting of ARCH (autoregressive conditional heteroscedasticity), GARCH (generalized autoregressive conditional heteroscedasticity), I-GARCH (integrated GARCH), E-GARCH (exponential GARCH), and GARCH-M (GARCH in mean) models
- combination of ARCH and GARCH models with autoregressive models, with or without regressors
- estimation and testing of general heteroscedasticity models
- variety of model diagnostic information including the following:
  - autocorrelation plots
  - partial autocorrelation plots
  - Durbin-Watson test statistic and generalized Durbin-Watson tests to any order
  - Durbin h and Durbin t statistics
  - Akaike information criterion
  - Schwarz information criterion
  - tests for ARCH errors
  - Ramsey's RESET test
  - Chow and PChow tests
  - Phillips-Perron stationarity test
  - CUSUM and CUSUMSQ statistics
- exact significance levels (p-values) for the Durbin-Watson statistic
- embedded missing values
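A short sketch may help show how a few of these features are requested in PROC AUTOREG. The data set SERIES and its simulated values are hypothetical. The first step fits a regression with an AR(2) error model and asks for Durbin-Watson p-values; the second fits the same regression with GARCH(1,1) errors.

```sas
/* Simulated trend plus AR(1) noise; all names and values are hypothetical */
data series;
   call streaminit(92);
   e_prev = 0;
   do time = 1 to 120;
      e = 0.6*e_prev + rand('normal');
      y = 10 + 0.5*time + e;
      output;
      e_prev = e;
   end;
   keep time y;
run;

/* AR(2) error model fit by maximum likelihood;            */
/* DWPROB requests p-values for the Durbin-Watson statistic */
proc autoreg data=series;
   model y = time / nlag=2 method=ml dwprob;
run;

/* Same regression with GARCH(1,1) errors */
proc autoreg data=series;
   model y = time / garch=(p=1, q=1);
run;
```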
- test linear hypotheses about the parameters
- write predicted and residual values to an output SAS data set
- write parameter estimates to an output SAS data set
- write the crossproducts matrix (SSCP) to an output SAS data set
- use raw data, correlations, covariances, or cross products as input

The ENTROPY procedure supports the following models and features:

- generalized maximum entropy (GME) estimation
- generalized cross entropy (GCE) estimation
- normed moment generalized maximum entropy
- maximum entropy-based seemingly unrelated regression (MESUR) estimation
- pure inverse estimation
- estimation of parameters in simultaneous systems of linear equations
- Markov models
- unordered multinomial choice problems
- weighted regression
- any number of restrictions for any linear combination of coefficients, within a single model or across equations
- tests for any linear hypothesis, for the parameters of a single model or across equations
- hypothesis tests of nonlinear functions of the parameter estimates
- linear and nonlinear restrictions of the parameter estimates
- bounds imposed on the parameter estimates
- computation of estimates and standard errors of nonlinear functions of the parameter estimates
- estimation and simulation of ordinary differential equations (ODEs)
- vector autoregressive error processes and polynomial lag distributions easily specified for the nonlinear equations
- variance modeling (ARCH, GARCH, and others)
- computation of goal-seeking solutions of nonlinear systems to find input values needed to produce target outputs
- dynamic, static, or n-period-ahead-forecast simulation modes
- simultaneous solution or single equation solution modes
- Monte Carlo simulation using parameter estimate covariance and across-equation residuals covariance matrices or user-specified random functions
- a variety of diagnostic statistics including the following:
  - model R-square statistics
  - general Durbin-Watson statistics and exact p-values
  - asymptotic standard errors and t tests
  - first-stage R-square statistics
  - covariance estimates
  - collinearity diagnostics
  - simulation goodness-of-fit statistics
  - Theil inequality coefficient decompositions
  - Theil relative change forecast error measures
  - heteroscedasticity tests
  - Godfrey test for serial correlation
  - Hausman specification test
  - Chow tests
- block structure and dependency structure analysis for the nonlinear system
- listing and cross-reference of fitted model
- automatic calculation of needed derivatives by using exact analytic formula
- efficient sparse matrix methods used for model solution; choice of other solution methods

Model definition, parameter estimation, simulation, and forecasting can be performed interactively in a single SAS session, or models can be stored in files and reused and combined in later runs.
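As a minimal sketch of nonlinear estimation with PROC MODEL, the example below fits an exponential growth curve. The data set GROWTH, the functional form, and the starting values are hypothetical choices for illustration only.

```sas
/* Simulated exponential growth data; names and values are hypothetical */
data growth;
   call streaminit(18);
   do x = 1 to 50;
      y = 2*exp(0.05*x) + rand('normal', 0, 0.2);
      output;
   end;
run;

proc model data=growth;
   parms a 1 b 0.1;       /* parameters with starting values       */
   y = a * exp(b * x);    /* the nonlinear equation to be fit      */
   fit y;                 /* ordinary least squares is the default */
run;
quit;
```

Because PROC MODEL is interactive, the QUIT statement ends the step; further FIT or SOLVE statements could be submitted before it.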
- Akaike's information criterion (AIC)
- Schwarz's Bayesian criterion (SBC or BIC)
- Box-Ljung chi-square test statistics for white-noise residuals
- autocorrelation function of residuals
- partial autocorrelation function of residuals
- inverse autocorrelation function of residuals
- automatic outlier detection
- Johansen cointegration test for nonstationary vector processes of integrated order one
- Stock-Watson common trends test for the possibility of cointegration among nonstationary vector processes of integrated order one
- Johansen cointegration test for nonstationary vector processes of integrated order two
- model parameter estimation methods:
  - least squares (LS)
  - maximum likelihood (ML)
- model checks and residual analysis using the following tests:
  - Durbin-Watson (DW) statistics
  - F test for autoregressive conditional heteroscedastic (ARCH) disturbance
  - F test for AR disturbance
  - Jarque-Bera normality test
  - Portmanteau test
- seasonal deterministic terms
- subset models
- multiple regression with distributed lags
- dead-start model that does not have present values of the exogenous variables
- Granger-causal relationships between two distinct groups of variables
- infinite order AR representation
- impulse response function (or infinite order MA representation)
- decomposition of the predicted error covariances
- roots of the characteristic functions for both the AR and MA parts to evaluate the proximity of the roots to the unit circle
- contemporaneous relationships among the components of the vector time series
- forecasts
- conditional covariances for GARCH models
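The following sketch suggests how a Johansen cointegration test and a Granger-causality test might be requested in PROC VARMAX. The data set PAIR is simulated from a shared random walk so that the two series are cointegrated by construction; every name and value in it is hypothetical.

```sas
/* Two series driven by one shared random walk (hypothetical data) */
data pair;
   call streaminit(7);
   trend = 0;
   do t = 1 to 200;
      trend = trend + rand('normal');
      y1 = trend + rand('normal');
      y2 = 0.5*trend + rand('normal');
      output;
   end;
   keep t y1 y2;
run;

proc varmax data=pair;
   /* VAR(2) with the Johansen cointegration rank test */
   model y1 y2 / p=2 cointtest=(johansen);
   /* Granger-causality test between the two groups of variables */
   causal group1=(y1) group2=(y2);
run;
```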
- multivariate ARIMA modeling by using the general state space representation of the stochastic process
- automatic model selection using Akaike's information criterion (AIC)
- user-specified state space models including restrictions
- transfer function models with random inputs
- any combination of simple and seasonal differencing; input series can be differenced to any order for any lag lengths
- forecasts with confidence limits
- ability to save selected and fitted model in a data set and reuse for forecasting
- wide range of output options including the ability to print any statistics concerning the data and their covariance structure, the model selection process, and the final model fit
Spectral Analysis
The SPECTRA procedure provides spectral analysis and cross-spectral analysis of time series. The SPECTRA procedure includes the following features:

- efficient calculation of periodogram and smoothed periodogram using fast finite Fourier transform and Chirp-Z algorithms
- multiple spectral analysis, including raw and smoothed spectral and cross-spectral function estimates, with user-specified window weights
- choice of kernel for smoothing
- output of the following spectral estimates to a SAS data set:
  - Fourier sine and cosine coefficients
  - periodogram
  - smoothed periodogram
  - cospectrum
  - quadrature spectrum
  - amplitude
  - phase spectrum
  - squared coherency
- Fisher's Kappa and Bartlett's Kolmogorov-Smirnov test statistic for testing a null hypothesis of white noise
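A brief sketch of a univariate spectral analysis follows. It assumes the SASHELP.AIR sample data set; the output data set name SPEC and the triangular window weights are arbitrary choices for the example.

```sas
/* P = periodogram, S = smoothed spectral density estimate,     */
/* ADJMEAN = subtract the series mean, WHITETEST = Fisher's     */
/* Kappa and Bartlett's Kolmogorov-Smirnov white noise tests    */
proc spectra data=sashelp.air out=spec p s adjmean whitetest;
   var air;
   weights 1 2 3 2 1;   /* user-specified window weights */
run;
```

The OUT= data set holds the frequencies and periods along with the requested estimates, so it can be plotted or analyzed in later steps.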
Seasonal Adjustment
The X11 procedure provides seasonal adjustment of time series by using the Census X-11 or X-11 ARIMA method. The X11 procedure is based on the U.S. Bureau of the Census X-11 seasonal adjustment program and also supports the X-11 ARIMA method developed by Statistics Canada. The X11 procedure includes the following features:

- decomposition of monthly or quarterly series into seasonal, trend, trading day, and irregular components
- both multiplicative and additive form of the decomposition
- all the features of the Census Bureau program
- support of the X-11 ARIMA method
- support of sliding spans analysis
- processing of any number of variables at once with no maximum length for a series
- computation of tests for stable, moving, and combined seasonality
- optional printing or storing in SAS data sets of the individual X11 tables that show the various components at different stages of the computation; full control over what is printed or output
- ability to project seasonal component one year ahead, which enables reintroduction of seasonal factors for an extrapolated series

The X12 procedure provides seasonal adjustment of time series using the X-12 ARIMA method. The X12 procedure is based on the U.S. Bureau of the Census X-12 ARIMA seasonal adjustment program (version 0.3). It also supports the X-11 ARIMA method developed by Statistics Canada and the previous X-11 method of the U.S. Census Bureau. The X12 procedure includes the following features:

- decomposition of monthly or quarterly series into seasonal, trend, trading day, and irregular components
- support of multiplicative, additive, pseudo-additive, and log additive forms of decomposition
- support of the X-12 ARIMA method
- support of regARIMA modeling
- automatic identification of outliers
- support of TRAMO-based automatic model selection
- use of regressors to process missing values within the span of the series
- processing of any number of variables at once with no maximum length for a series
- computation of tests for stable, moving, and combined seasonality
- spectral analysis of original, seasonally adjusted, and irregular series
- optional printing or storing in a SAS data set of the individual X11 tables that show the various components at different stages of the decomposition; full control over what is printed or output
- optional projection of seasonal component one year ahead, which enables reintroduction of seasonal factors for an extrapolated series
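As a hedged sketch of a basic seasonal adjustment run, the example below applies PROC X11 to the monthly SASHELP.AIR series. The output data set name ADJUSTED and the output variable names are hypothetical; table B1 holds the original series and table D11 the final seasonally adjusted series.

```sas
proc x11 data=sashelp.air;
   monthly date=date;      /* monthly series identified by DATE */
   var air;
   tables d11;             /* print only the adjusted series    */
   output out=adjusted b1=original d11=seas_adj;
run;
```

A comparable run with PROC X12 would use its own statements (for example, X11 and ARIMA) rather than the MONTHLY statement shown here.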
- simple
- double
- linear
- damped trend
- seasonal
- Winters method (additive and multiplicative)

Additionally, PROC ESM can transform the data before applying the smoothing methods using any of these transformations:

- log
- square root
- logistic
- Box-Cox

In addition to forecasting, the ESM procedure can also produce graphic output.

The ESM procedure can forecast both time series data, whose observations are equally spaced at a specific time interval (for example, monthly or weekly), and transactional data, whose observations are not spaced with respect to any particular time interval. (Internet, inventory, sales, and similar data are typical examples of transactional data. For transactional data, the data are accumulated based on a specified time interval to form a time series.)

The ESM procedure is a replacement for the older FORECAST procedure. ESM is often more convenient to use than PROC FORECAST, but it supports only exponential smoothing models.

The FORECAST procedure provides forecasting of univariate time series using automatic trend extrapolation. PROC FORECAST is an easy-to-use procedure for automatic forecasting and uses simple popular methods that do not require statistical modeling of the time series, such as exponential smoothing, time trend with autoregressive errors, and the Holt-Winters method.

The FORECAST procedure supplements the powerful forecasting capabilities of the econometric and time series analysis procedures described previously. You can use PROC FORECAST when you have many series to forecast and you want to extrapolate trends without developing a model for each series.

The FORECAST procedure includes the following features:

- choice of the following forecasting methods:
  - EXPO method: exponential smoothing (single, double, triple, or Holt two-parameter smoothing)
  - exponential smoothing as an ARIMA model
  - WINTERS method: using updating equations similar to exponential smoothing to fit model parameters
  - ADDWINTERS method: like the WINTERS method except that the seasonal parameters are added to the trend instead of multiplied with the trend
  - STEPAR method: stepwise autoregressive models with constant, linear, or quadratic trend and autoregressive errors to any order
  - Holt-Winters forecasting method with constant, linear, or quadratic trend
  - additive variant of the Holt-Winters method
- support for up to three levels of seasonality for Holt-Winters method: time-of-year, day-of-week, or time-of-day
- ability to forecast any number of variables at once
- forecast confidence limits for all methods
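To compare the two procedures side by side, the sketch below forecasts the SASHELP.AIR series twelve months ahead with each. The output data set names are hypothetical, and the model and method choices are illustrative rather than recommended.

```sas
/* PROC ESM: multiplicative Winters method on the log-transformed series */
proc esm data=sashelp.air out=esm_fcst lead=12;
   id date interval=month;
   forecast air / model=winters transform=log;
run;

/* PROC FORECAST: WINTERS method with monthly seasonality;       */
/* OUTLIMIT adds the confidence limits to the OUT= data set      */
proc forecast data=sashelp.air interval=month method=winters
              seasons=month lead=12 out=fc_fcst outlimit;
   id date;
   var air;
run;
```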
- choice of four interpolation methods:
  - cubic splines
  - linear splines
  - step functions
  - simple aggregation
- ability to perform extrapolation by a linear projection of the trend of the cubic spline curve fit to the input data
- ability to transform series before and after interpolation (or without interpolation) by using any of the following:
  - constant shift or scale
  - sign change or absolute value
  - logarithm, exponential, square root, square, logistic, inverse logistic
  - lags, leads, differences
  - classical decomposition
  - bounds, trims, reverse series
  - centered moving, cumulative, or backward moving average
  - centered moving, cumulative, or backward moving range
  - centered moving, cumulative, or backward moving geometric mean
  - centered moving, cumulative, or backward moving maximum
  - centered moving, cumulative, or backward moving median
  - centered moving, cumulative, or backward moving minimum
  - centered moving, cumulative, or backward moving product
  - centered moving, cumulative, or backward moving corrected sum of squares
  - centered moving, cumulative, or backward moving uncorrected sum of squares
  - centered moving, cumulative, or backward moving rank
  - centered moving, cumulative, or backward moving standard deviation
  - centered moving, cumulative, or backward moving sum
  - centered moving, cumulative, or backward moving t-value
  - centered moving, cumulative, or backward moving variance
- support for a wide range of time series frequencies:
  - YEAR
  - SEMIYEAR
  - QUARTER
  - MONTH
  - SEMIMONTH
  - TENDAY
  - WEEK
  - WEEKDAY
  - DAY
  - HOUR
  - MINUTE
  - SECOND
- support for repeating or shifting the basic interval types to define a great variety of different frequencies, such as fiscal years, biennial periods, work shifts, and so forth

Refer to Chapter 3, "Working with Time Series Data," and Chapter 4, "Date Intervals, Formats, and Functions," for more information about time series data transformations.
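The sketch below illustrates frequency conversion with PROC EXPAND by interpolating a quarterly series to monthly values. The data set QUARTERLY and the variable names are hypothetical. The first CONVERT statement uses the cubic spline method; the second also applies a three-period moving average to the interpolated result.

```sas
/* Hypothetical quarterly totals */
data quarterly;
   do i = 0 to 15;
      date  = intnx('qtr', '01jan2005'd, i);
      sales = 100 + 5*i + 10*sin(i);
      output;
   end;
   format date yyq6.;
   drop i;
run;

proc expand data=quarterly out=monthly from=qtr to=month;
   id date;
   /* OBSERVED=TOTAL treats each input value as a period total */
   convert sales = sales_m  / observed=total method=spline;
   convert sales = sales_ma / observed=total transformout=(movave 3);
run;
```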
- provides a seamless interface between FAME and SAS data processing
- uses the LIBNAME statement to enable you to specify which time series you would like to read from the FAME database
- enables you to convert the selected time series to the same time scale
- works with the SAS DATA step to perform further subsetting and to store the resulting time series into a SAS data set
- performs more analysis if desired either in the same SAS session or in another session at a later time
- supports the FAME CROSSLIST function for subsetting via BYGROUPS using the CROSSLIST= option:
  - you can use a FAME namelist that contains your BY variables for selection in the CROSSLIST
  - you can use a SAS input dataset, INSET, that contains the BY selection variables along with the WHERE= option in your SASEFAME libref
- supports the use of FAME in a client/server environment that uses the FAME CHLI capability on your FAME server
- enables access to your FAME remote data when you specify the port number of the TCP/IP service that is defined for your FAME server and the node name of your FAME master server in your SASEFAME libref's physical path

The SASEHAVR interface LIBNAME engine includes the following features:

- enables Windows users random access to economic and financial data residing in a HAVER ANALYTICS Data Link Express (DLX) database
- the following types of HAVER data sets are available:
  - United States Economic Indicators
  - Specialized Databases
  - Financial Indicators
  - Industry
  - Industrial Countries
  - Emerging Markets
  - International Organizations
  - Forecasts and As Reported Data
  - United States Regional
- enables you to limit the range of data that is read from the time series
- enables you to specify a desired conversion frequency. Start dates are recommended on the LIBNAME statement to help you save resources when processing large databases or when processing a large number of observations.
- enables you to use the WHERE, KEEP, or DROP statements in your DATA step to further subset your data
- supports use of the SQL procedure to create a view of your resulting SAS data set
- balloon payment
- full control over adjustment terms for adjustable rate loans: life caps, adjustment frequency, and maximum and minimum rates
- support for a wide variety of payment and compounding intervals
- ability to incorporate initialization costs, discount points, down payments, and prepayments (uniform or lump-sum) in loan calculations
- analysis of different rate adjustment scenarios for variable rate loans including the following:
  - worst case
  - best case
  - fixed rate case
  - estimated case
- ability to make loan comparisons at different points in time
- ability to make loan comparisons at each analysis date on the basis of five different economic criteria:
  - present worth of cost (net present value of all payments to date)
  - true interest rate (internal rate of return to date)
  - current periodic payment
  - total interest paid to date
  - outstanding balance
- ability to base loan comparisons on either after-tax or before-tax analysis
- report of the best alternative when loans of equal amount are compared
- amortization schedules for each loan contract
- output that shows payment dates, rather than just payment sequence numbers, when starting date is specified
- optional printing or output of the amortization schedules, loan summaries, and loan comparison information to SAS data sets
- ability to specify rounding of payments to any number of decimal places
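As a rough sketch of a loan comparison, the example below defines two fixed rate loans of equal amount in PROC LOAN and compares them. The amounts, rates, terms, and labels are hypothetical figures chosen only for illustration.

```sas
proc loan start=2008:1;
   /* 15-year loan, with its amortization schedule printed */
   fixed amount=100000 rate=7.5 life=180 label='15-year fixed' schedule;
   /* 30-year loan for comparison */
   fixed amount=100000 rate=7.0 life=360 label='30-year fixed';
   /* compare the loans on the default economic criteria */
   compare;
run;
```

Because a START= date is given, the schedule shows payment dates rather than payment sequence numbers.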
best predict your time series. The system provides both graphical and statistical features to help you choose the best forecasting method for each series.

The system can be invoked by selecting Analysis ► Solutions, by the FORECAST command, or by clicking the Forecasting icon in the Data Analysis folder of the SAS Desktop.

The following is a brief summary of the features of the Time Series Forecasting System. With the system you can:

- use a wide variety of forecasting methods, including several kinds of exponential smoothing models, Winters method, and ARIMA (Box-Jenkins) models. You can also produce forecasts by combining the forecasts from several models.
- use predictor variables in forecasting models. Forecasting models can include time trend curves, regressors, intervention effects (dummy variables), adjustments you specify, and dynamic regression (transfer function) models.
- view plots of the data, predicted versus actual values, prediction errors, and forecasts with confidence limits. You can plot changes or transformations of series, zoom in on parts of the graphs, or plot autocorrelations.
- use hold-out samples to select the best forecasting method
- compare goodness-of-fit measures for any two forecasting models side-by-side or list all models sorted by a particular fit statistic
- view the predictions and errors for each model in a spreadsheet or view and compare the forecasts from any two models in a spreadsheet
- examine the fitted parameters of each forecasting model and their statistical significance
- control the automatic model selection process: the set of forecasting models considered, the goodness-of-fit measure used to select the best model, and the time period used to fit and evaluate models
- customize the system by adding forecasting models for the automatic model selection process and for point-and-click manual selection
- save your work in a project catalog
- print an audit trail of the forecasting process
- save and print system output including spreadsheets and graphs
- savings
- depreciations
- bonds
- generic cash flows

Various tools are provided to help analyze the value of investment alternatives: time value, periodic equivalent, internal rate of return, benefit-cost ratio, and breakeven analysis.

These analyses can help answer a number of questions you might have about your investments:

- Which option is more profitable or less costly?
- Is it better to buy or rent?
- Are the extra fees for refinancing at a lower interest rate justified?
- What is the balance of this account after saving this amount periodically for so many years?
- How much is legally tax-deductible?
- Is this a reasonable price?

Investment Analysis can be beneficial to users in many industries for a variety of decisions:

- manufacturing: cost justification of automation or any capital investment, replacement analysis of major equipment, or economic comparison of alternative designs
- government: setting funds for services
- finance: investment analysis and portfolio management for fixed-income securities
ODS Graphics
Many SAS/ETS procedures produce graphical output using the SAS Output Delivery System (ODS). The ODS Graphics system provides several advantages:

- Plots and graphs are output objects in the Output Delivery System (ODS) and can be manipulated with ODS commands.
- There is no need to write SAS/GRAPH statements or use special plotting macros.
- There are multiple formats to choose from: html, gif, and rtf.
- Templates control the appearance of plots.
- Styles control the color scheme.
You can edit or create templates and styles for all graphs. To enable graphical output from SAS/ETS procedures, you must use the following statement in your SAS program.
ods graphics on;
The graphical output produced by many SAS/ETS procedures can be controlled using the PLOTS= option on the PROC statement. For more information about the features of the ODS Graphics system, including the many ways that you can control or customize the plots produced by SAS procedures, refer to Chapter 21, "Statistical Graphics Using ODS" (SAS/STAT User's Guide). For more information about the SAS Output Delivery System, refer to the SAS Output Delivery System: User's Guide.
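As a small sketch of the PLOTS= option in use, the example below enables ODS Graphics and requests all available plots from the identification stage of PROC ARIMA. The SASHELP.AIR data set and the differencing orders are assumptions made for the example.

```sas
ods graphics on;

/* PLOTS=ALL requests every plot the procedure can produce */
proc arima data=sashelp.air plots=all;
   identify var=air(1,12);
run;
quit;

ods graphics off;
```

Narrower plot requests are also possible; the available plot names differ by procedure and are listed in each procedure's "Syntax" section.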
RANK SORT
for processing SAS data sets with Structured Query Language
for standardizing variables to a fixed mean and variance
for printing descriptive statistics in tabular format
for plotting variables over time
for transposing SAS data sets
for computing descriptive statistics
Global Statements
Global statements can be specified anywhere in your SAS program, and they remain in effect until changed. Global statements are documented in SAS Language Reference: Dictionary. You may find the following SAS global statements useful:

FILENAME    for accessing data files
FOOTNOTE    for printing footnote lines at the bottom of each page
%INCLUDE    for including files of SAS statements
LIBNAME     for accessing SAS data libraries
OPTIONS     for setting various SAS system options
QUIT        for ending an interactive procedure step
RUN         for executing the preceding SAS statements
TITLE       for printing title lines at the top of each page
X           for issuing host operating system commands from within your SAS session
Some Base SAS statements can be used with any SAS procedure, including SAS/ETS procedures. These statements are not global, and they affect only the SAS procedure they are used with. These statements are documented in SAS Language Reference: Dictionary. The following Base SAS statements are useful with SAS/ETS procedures:

BY          for computing separate analyses for groups of observations
FORMAT      for assigning formats to variables
LABEL       for assigning descriptive labels to variables
WHERE       for subsetting data to restrict the range of data processed or to select or exclude observations from the analysis
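The sketch below shows one way the BY, WHERE, FORMAT, and LABEL statements, together with the global TITLE statement, might be combined with a SAS/ETS procedure. The data set SALES, its variables, and the PROC FORECAST settings are hypothetical.

```sas
/* Hypothetical monthly sales for two regions, already in BY order */
data sales;
   call streaminit(123);
   do region = 'East', 'West';
      do i = 0 to 35;
         date  = intnx('month', '01jan2006'd, i);
         units = 100 + 2*i + rand('normal', 0, 10);
         output;
      end;
   end;
   drop i;
run;

title 'Trend extrapolation by region';
proc forecast data=sales interval=month method=stepar trend=2
              lead=6 out=pred;
   by region;                      /* separate analysis per region */
   where date >= '01jan2007'd;     /* restrict the range processed */
   id date;
   var units;
   format date monyy7.;
   label units = 'Units sold';
run;
title;                             /* clear the global title */
```

The TITLE statement stays in effect until it is changed or cleared, whereas BY, WHERE, FORMAT, and LABEL apply only to the procedure step in which they appear.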
SAS Functions
SAS functions can be used in DATA step programs and in the COMPUTAB and MODEL procedures. The following kinds of functions are available:

character functions               for manipulating character strings
date and time functions           for performing date and calendar calculations
financial functions               for performing financial calculations such as depreciation, net present value, periodic savings, and internal rate of return
lagging and differencing functions   for computing lags and differences
mathematical functions            for computing data transformations and other mathematical calculations
probability functions             for computing quantiles of statistical distributions and the significance of test statistics
random number functions           for simulation experiments
sample statistics functions       for computing means, standard deviations, kurtosis, and so forth

SAS functions are documented in SAS Language Reference: Dictionary. Chapter 3, "Working with Time Series Data," discusses the use of date, time, lagging, and differencing functions. Chapter 4, "Date Intervals, Formats, and Functions," contains a reference list of date and time functions.
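As an illustration of calling such functions in a DATA step, the following sketch applies a lagging function, a differencing function, a mathematical function, and a date function to a short series. The data set name FUNCDEMO, the variable names, and the values are hypothetical.

```sas
/* FUNCDEMO and its values are hypothetical, for illustration only. */
data funcdemo;
   input date : monyy7. x;
   lagx = lag(x);        /* value of X from the previous observation */
   difx = dif(x);        /* X minus its previous value               */
   logx = log(x);        /* natural logarithm of X                   */
   mon  = month(date);   /* calendar month of the SAS date value     */
   format date monyy7.;
   datalines;
jan1991 10
feb1991 12
mar1991 15
apr1991 19
;
run;

proc print data=funcdemo;
run;
```

LAGX and DIFX are missing for the first observation because there is no previous value to refer to.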
   accumulates the time-stamped data to form a fixed-interval time series
   diagnoses the time series using time series analysis techniques
   creates a list of candidate model specifications based on the diagnostics
   fits each candidate model specification to the time series
   generates forecasts for each candidate fitted model
   selects the most appropriate model specification based on either in-sample or holdout-sample evaluation using a model selection criterion
   refits the selected model specification to the entire range of the time series
   creates a forecast score from the selected fitted model
   generates forecasts from the forecast score
   evaluates the forecast using in-sample analysis
   provides for out-of-sample forecast performance analysis
   performs top-down, middle-out, or bottom-up reconciliations of forecasts in the hierarchy
SAS/GRAPH Software
SAS/GRAPH software includes procedures that create two- and three-dimensional high-resolution color graphics plots and charts. You can generate output that graphs the relationship of data values to one another, enhance existing graphs, or simply create graphics output that is not tied to data.

With the addition of ODS Graphics features to SAS/ETS procedures, there is now less need for the use of SAS/GRAPH procedures with SAS/ETS. However, SAS/GRAPH procedures allow you to create additional graphical displays of your results. SAS/GRAPH software can produce the following types of output:

   charts
   plots
   maps
   text
   three-dimensional graphs

With SAS/GRAPH software you can produce high-resolution color graphics plots of time series data.
SAS/STAT Software
SAS/STAT software is of interest to users of SAS/ETS software because many econometric and other statistical methods not included in SAS/ETS software are provided in SAS/STAT software. SAS/STAT software includes procedures for a wide range of statistical methodologies including the following:

   logistic regression
   censored regression
   principal component analysis
   structural equation models using covariance structure analysis
   factor analysis
   survival analysis
   discriminant analysis
   cluster analysis
   categorical data analysis; log-linear and conditional logistic models
   general linear models
   mixed linear and nonlinear models
   generalized linear models
   response surface analysis
   kernel density estimation
   LOESS regression
   spline regression
   two-dimensional kriging
   multiple imputation for missing values
   survey data analysis
SAS/IML Software
SAS/IML software gives you access to a powerful and flexible programming language (Interactive Matrix Language) in a dynamic, interactive environment. The fundamental object of the language is a data matrix. You can use SAS/IML software interactively (at the statement level) to see results immediately, or you can store statements in a module and execute them later. The programming is dynamic because necessary activities such as memory allocation and dimensioning of matrices are done automatically.

You can access built-in operators and call routines to perform complex tasks such as matrix inversion or eigenvector generation. You can define your own functions and subroutines using SAS/IML modules. You can perform operations on an entire data matrix. You have access to a wide choice of data management commands. You can read, create, and update SAS data sets from inside SAS/IML software without ever using the DATA step.

SAS/IML software is of interest to users of SAS/ETS software because it enables you to program your own econometric and time series methods in the SAS System. It contains subroutines for time series operators and for general function optimization. If you need to perform a statistical calculation not provided as an automated feature by SAS/ETS or other SAS software, you can use SAS/IML software to program the matrix equations for the calculation.
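As a small sketch of what programming the matrix equations can look like, the following SAS/IML statements compute ordinary least squares coefficients by solving the normal equations. The matrices X and Y hold hypothetical values chosen only for illustration; this is not a method prescribed by the guide.

```sas
proc iml;
   /* Hypothetical data: an intercept column and one regressor */
   x = {1 1,
        1 2,
        1 3,
        1 4};
   y = {2, 4, 6, 8};

   /* Solve the normal equations (X'X) b = X'y for b */
   b = solve(t(x)*x, t(x)*y);
   print b;
quit;
```

Because Y is exactly twice the regressor in this toy data, the estimated intercept should be close to 0 and the slope close to 2.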
   transform data
   subset data
   analyze univariate distributions
   discover structure and features in multivariate data
   fit and evaluate explanatory models
   create your own customized statistical graphics
   add legends, curves, maps, or other custom features to statistical graphics
   develop interactive programs that use dialog boxes
   extend the built-in analyses by calling SAS procedures
   create custom analyses
   repeat an analysis on different data
   extend the results of SAS procedures by using IML
   share analyses with colleagues who also use Stat Studio
   call functions from libraries written in C/C++, FORTRAN, or Java

See for more information.
SAS/OR Software
SAS/OR software provides SAS procedures for operations research and project planning and includes a menu-driven system for project management. SAS/OR software has features for the following:

   solving transportation problems
   linear, integer, and mixed-integer programming
   nonlinear programming and optimization
   scheduling projects
   plotting Gantt charts
   drawing network diagrams
   solving optimal assignment problems
   network flow programming
SAS/OR software might be of interest to users of SAS/ETS software for its mathematical programming features. In particular, the NLP and OPTMODEL procedures in SAS/OR software solve nonlinear programming problems and can be used for constrained and unconstrained maximization of user-defined likelihood functions. See SAS/OR User's Guide: Mathematical Programming for more information.
SAS/QC Software
SAS/QC software provides a variety of procedures for statistical quality control and quality improvement. SAS/QC software includes procedures for the following:

   Shewhart control charts
   cumulative sum control charts
   moving average control charts
   process capability analysis
   Ishikawa diagrams
   Pareto charts
   experimental design

SAS/QC software also includes the SQC menu system for interactive application of statistical quality control methods and the ADX Interface for experimental design.
JMP Software
JMP software uses a flexible graphical interface to display and analyze data. JMP dynamically links statistics and graphics so you can easily explore data, make discoveries, and gain the knowledge you need to make better decisions. JMP provides a comprehensive set of statistical tools as well as design of experiments (DOE) and advanced quality control (QC and SPC) tools for Six Sigma in a single package. JMP is software for interactive statistical graphics and includes:

   a data table window for editing, entering, and manipulating data
   a broad range of graphical and statistical methods for data analysis
   a facility for grouping data and computing summary statistics
   JMP scripting language (JSL), a scripting language for saving and creating frequently used routines
   JMP automation
   Formula Editor, a formula editor for each table column to compute values as needed
   linear models, correlations, and multivariate
   design of experiments module
   options to highlight and display subsets of data
   statistical quality control and variability charts: special plots, charts, and communication capability for quality-improvement techniques
   survival analysis
   time series analysis, which includes the following:
      Box-Jenkins ARIMA forecasting
      seasonal ARIMA forecasting
      transfer function modeling
      smoothing models: Winters method, single, double, linear, damped trend linear, and seasonal exponential smoothing
      diagnostic charts (autocorrelation, partial autocorrelation, and variogram) and statistics of fit
      a model comparison table to compare all forecasts generated
      spectral density plots and white noise tests
   tools for printing and for moving analyses results between applications
SAS Enterprise Guide has the following features:

   integration with the SAS9 platform:
      open metadata repository (OMR) integration
      SAS report integration
      create report interface
      ODS support
      Web report studio integration
      access to information maps
      ETL studio impact analysis
      ESRI integration within the OLAP analyzer
      data mining scoring task
   the user interface and workflow:
      process flow
      ability to create stored processes from process flows
      SAS folders window
      project parameters
      query builder interface
      code node
      OLAP analyzer
      ESRI integration
      tree-diagram-based OLAP explorer
      SAS report snapshots
      SAS Web OLAP viewer for .NET
      ability to create EG projects
      workspace maximization

With Enterprise Guide, you can perform time series analysis with the following EG procedures:

   prepare time series data: the Prepare Time Series Data task can be used to make data more suitable for analysis by other time series tasks.
   create time series data: the Create Time Series Data wizard helps you convert transactional data into fixed-interval time series. Transactional data are time-stamped data collected over time with irregular or varied frequency.
   ARIMA Modeling and Forecasting task
   Basic Forecasting task
   Regression Analysis with Autoregressive Errors
   Regression Analysis of Panel Data
   optimize the allocation of regulatory capital and economic capital
   facilitate regulatory compliance and risk disclosure requirements for a wide variety of regulations such as Basel I, Basel II, and the Capital Requirements Directive (CAD III)
Chapter 3
   Converting between Date, Datetime, and Time Values
   Computing Datetime Values
   Computing Calendar and Time Variables
Interval Functions INTNX and INTCK
   Incrementing Dates by Intervals
   Alignment of SAS Dates
   Computing the Width of a Time Interval
   Computing the Ceiling of an Interval
   Counting Time Intervals
   Checking Data Periodicity
   Filling In Omitted Observations in a Time Series Data Set
   Using Interval Functions for Calendar Calculations
Lags, Leads, Differences, and Summations
   The LAG and DIF Functions
   Multiperiod Lags and Higher-Order Differencing
   Percent Change Calculations
   Leading Series
   Summing Series
Transforming Time Series
   Log Transformation
   Other Transformations
   The EXPAND Procedure and Data Transformations
Manipulating Time Series Data Sets
   Splitting and Merging Data Sets
   Transposing Data Sets
Time Series Interpolation
   Interpolating Missing Values
   Interpolating to a Higher or Lower Frequency
   Interpolating between Stocks and Flows, Levels and Rates
Reading Time Series Data
   Reading a Simple List of Values
   Reading Fully Described Time Series in Transposed Form
Overview
This chapter discusses working with time series data in the SAS System. The following topics are included:

   dating time series and working with SAS date and datetime values
   subsetting data and selecting observations
   storing time series data in SAS data sets
   specifying time series periodicity and time intervals
   plotting time series
   using calendar and time interval functions
   computing lags and other functions across time
   transforming time series
   transposing time series data sets
   interpolating time series
   reading time series data recorded in different ways

In general, this chapter focuses on using features of the SAS programming language and not on features of SAS/ETS software. However, since SAS/ETS procedures are used to analyze time series, understanding how to use the SAS programming language to work with time series data is important for the effective use of SAS/ETS software. You do not need to read this chapter to use SAS/ETS procedures. If you are already familiar with SAS programming you might want to skip this chapter, or you can refer to sections of this chapter for help on specific time series data processing questions.
Introduction
To analyze data with the SAS System, data values must be stored in a SAS data set. A SAS data set is a matrix (or table) of data values organized into variables and observations. The variables in a SAS data set label the columns of the data matrix, and the observations in a SAS data set are the rows of the data matrix. You can also think of a SAS data set as a kind of file, with the observations representing records in the file and the variables representing fields in the records. (See SAS Language Reference: Concepts for more information about SAS data sets.) Usually, each observation represents the measurement of one or more variables for the individual subject or item observed. Often, the values of some of the variables in the data set are used to identify the individual subjects or items that the observations measure. These identifying variables are referred to as ID variables. For many kinds of statistical analysis, only relationships among the variables are of interest, and the identity of the observations does not matter. ID variables might not be relevant in such a case.
However, for time series data the identity and order of the observations are crucial. A time series is a set of observations made at a succession of equally spaced points in time. For example, if the data are monthly sales of a company's product, the variable measured is sales of the product and the unit observed is the operation of the company during each month. These observations can be identified by year and month. If the data are quarterly gross national product, the variable measured is final goods production and the unit observed is the economy during each quarter. These observations can be identified by year and quarter. For time series data, the observations are identified and related to each other by their position in time. Since SAS does not assume any particular structure to the observations in a SAS data set, there are some special considerations needed when storing time series in a SAS data set. The main considerations are how to associate dates with the observations and how to structure the data set so that SAS/ETS procedures and other SAS procedures recognize the observations of the data set as constituting time series. These issues are discussed in following sections.
When a time series is stored in the manner shown by this example, the terms series and variable can be used interchangeably. There is one observation per row and one series/variable per column.
Dating Observations
The SAS System supports special date, datetime, and time values, which make it easy to represent dates, perform calendar calculations, and identify the time period of observations in a data set. The preceding example uses the ID variables YEAR and MONTH to identify the time periods of the observations. For a quarterly data set, you might use YEAR and QTR as ID variables. A daily data set might have the ID variables YEAR, MONTH, and DAY. Clearly, it would be more convenient to have a single ID variable that could be used to identify the time period of observations, regardless of their frequency.

The following section, "SAS Date, Datetime, and Time Values" on page 68, discusses how the SAS System represents dates and times internally and how to specify date, datetime, and time values in a SAS program. The section "Reading Date and Datetime Values with Informats" on page 69 discusses how to read in date and time values from data records and how to control the display of date and datetime values in SAS output. Later sections discuss other issues concerning date and datetime values, specifying time intervals, data periodicity, and calendar calculations.

SAS date and datetime values and the other features discussed in the following sections are also described in SAS Language Reference: Dictionary. Reference documentation on these features is also provided in Chapter 4, "Date Intervals, Formats, and Functions."
The year value can be given with two or four digits, so '17OCT91'D is the same as '17OCT1991'D.
Time
32 seconds past 2:45 p.m., 17 October 1991
12:05 p.m., 17 October 1991
2:00 a.m., 17 October 1991
midnight, 17 October 1991
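To make the constant notation concrete, the following sketch assigns a date constant, a datetime constant, and a time constant to variables and writes them to the SAS log, first as unformatted numbers and then with formats. The variable names are hypothetical, and the formats used for display are one reasonable choice among several.

```sas
/* A sketch only: writes values to the SAS log, creates no data set. */
data _null_;
   d  = '17OCT1991'D;             /* SAS date constant     */
   dt = '17OCT1991:14:45:32'DT;   /* SAS datetime constant */
   t  = '14:45:32'T;              /* SAS time constant     */

   put d= dt= t=;                 /* internal numeric values      */
   put d= date9. dt= datetime18. t= time8.;   /* formatted values */
run;
```

The first PUT statement shows that each constant is stored as an ordinary number; the second shows the same values displayed through formats.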
For example, the following SAS statements read monthly values of the U.S. Consumer Price Index. Since the data are monthly, you could identify the date with the variables YEAR and MONTH, as in the previous example. Instead, in this example the time periods are coded as a three-letter month abbreviation followed by the year. The informat MONYY. is used to read month-year dates coded this way and to express them as SAS date values for the first day of the month, as follows:
   data uscpi;
      input date : monyy7. cpi;
      format date monyy7.;
      label cpi = "US Consumer Price Index";
   datalines;
   jun1990 129.9
   jul1990 130.4
   ... more lines ...
The SAS System provides informats for most common notations for dates and times. See Chapter 4 for more information about the date and datetime informats available.
      label ... with MONYY7. format"
            ... with DATE9. format"
            ... with no format";
      format ... date1 date9.;
         DATE with          DATE with      DATE with
Obs   MONYY7. format        no format    DATE9. format

  1      JUN1990              11109        01JUN1990
  2      JUL1990              11139        01JUL1990
  3      AUG1990              11170        01AUG1990
  4      SEP1990              11201        01SEP1990
  5      OCT1990              11231        01OCT1990
  6      NOV1990              11262        01NOV1990
  7      DEC1990              11292        01DEC1990
  8      JAN1991              11323        01JAN1991
  9      FEB1991              11354        01FEB1991
 10      MAR1991              11382        01MAR1991
 11      APR1991              11413        01APR1991
 12      MAY1991              11443        01MAY1991
 13      JUN1991              11474        01JUN1991
 14      JUL1991              11504        01JUL1991
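A listing of this kind can be produced by copying one date variable and attaching a different format to each copy. The following is a sketch under stated assumptions, not the guide's own code: the data set name FMTTEST and the variable names DATE0 and DATE1 are hypothetical, and only three months are read.

```sas
/* FMTTEST, DATE0, and DATE1 are assumed names for this sketch. */
data fmttest;
   input date : monyy7.;
   date0 = date;                      /* copy left with no format    */
   date1 = date;                      /* copy displayed with DATE9.  */
   format date monyy7. date1 date9.;
   label date  = "DATE with MONYY7. format"
         date0 = "DATE with no format"
         date1 = "DATE with DATE9. format";
   datalines;
jun1990
jul1990
aug1990
;
run;

proc print data=fmttest label;
run;
```

All three variables hold the same number in each observation; only the attached format changes how the value is displayed.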
The appropriate format to use for SAS date or datetime valued ID variables depends on the sampling frequency or periodicity of the time series. Table 3.2 shows recommended formats for common data sampling frequencies and shows how the date '17OCT1991'D or the datetime value '17OCT1991:14:45:32'DT is displayed by these formats.
Table 3.2 Formats for Different Sampling Frequencies
ID Values       Example

SAS date        1991
                1991:4
                OCT1991
                Thursday, 17 Oct 1991
                17OCT1991
SAS datetime    17OCT91:14
                17OCT91:14:45
                17OCT91:14:45:32
See Chapter 4, Date Intervals, Formats, and Functions, for more information about the date and datetime formats available.
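The following sketch writes the same date and datetime value to the SAS log with a series of formats intended to produce displays like the examples in Table 3.2. The pairing of each format and periodicity label with an example is an assumption made for this sketch; consult the format reference before relying on a particular width.

```sas
/* Format choices and periodicity labels are assumptions for this sketch. */
data _null_;
   d  = '17OCT1991'D;
   dt = '17OCT1991:14:45:32'DT;

   put "annual:    " d  year4.;
   put "quarterly: " d  yyqc6.;
   put "monthly:   " d  monyy7.;
   put "weekly:    " d  weekdatx23.;
   put "daily:     " d  date9.;
   put "hourly:    " dt datetime10.;
   put "minutes:   " dt datetime13.;
   put "seconds:   " dt datetime16.;
run;
```

A coarser format hides detail in the display only; the stored value is unchanged.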
Sorting by Time
Many SAS/ETS procedures assume the data are in chronological order. If the data are not in time order, you can use the SORT procedure to sort the data set. For example,
   proc sort data=a;
      by date;
   run;
There are many ways of coding the time ID variable or variables, and some ways do not sort correctly. If you use SAS date or datetime ID values as suggested in the preceding section, you do not need to be concerned with this issue. But if you encode date values in nonstandard ways, you need to consider whether your ID variables will sort. SAS date and datetime values always sort correctly, as do combinations of numeric variables such as YEAR, MONTH, and DAY used together. Julian dates also sort correctly. (Julian dates are numbers of the form yyddd, where yy is the year and ddd is the day of the year. For example, 17 October 1991 has the Julian date value 91290.)
Calendar dates such as numeric values coded as mmddyy or ddmmyy do not sort correctly. Character variables that contain display values of dates, such as dates in the notation produced by SAS date formats, generally do not sort correctly.
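The sorting problem can be seen in a small sketch. The data set CODED and its three values are hypothetical; the dates are stored as numbers in mmddyy form and then converted to SAS date values and Julian dates. The conversion assumes a YEARCUTOFF= setting under which the two-digit years 90 and 91 are read as 1990 and 1991.

```sas
/* CODED and its values are hypothetical, for illustration only. */
data coded;
   input mmddyy_num;
   /* Convert the numeric mmddyy code to a SAS date value */
   date   = input(put(mmddyy_num, z6.), mmddyy6.);
   julian = juldate(date);            /* Julian date in yyddd form */
   format date date9.;
   datalines;
011591
120190
101791
;
run;

/* Sorting by the mmddyy code places 01DEC1990 last, */
/* even though it is the earliest date.              */
proc sort data=coded out=by_code;
   by mmddyy_num;
run;

/* Sorting by the SAS date value gives chronological order. */
proc sort data=coded out=by_date;
   by date;
run;
```

Sorting BY JULIAN would give the same order as sorting BY DATE here, in line with the statement that Julian dates sort correctly.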
For complete reference documentation on the WHERE statement, see SAS Language Reference: Dictionary.
   data subset;
      set full;
      where date >= '1jan1970'd;
      if state = 'NC';
   run;
In this case, it makes no difference logically whether the WHERE statement or the IF statement is used, and you can combine several conditions in one subsetting statement. The following statements produce the same results as the previous example:
   data subset;
      set full;
      if date >= '1jan1970'd & state = 'NC';
   run;
The WHERE statement acts on the input data sets specified in the SET statement before observations are processed by the DATA step program, whereas the IF statement is executed as part of the DATA step program. If the input data set is indexed, using the WHERE statement can be more efficient than using the IF statement. However, the WHERE statement can refer only to variables in the input data set, not to variables computed by the DATA step program.

To subset the variables of a data set, use KEEP or DROP statements or use KEEP= or DROP= data set options. See SAS Language Reference: Dictionary for information about KEEP and DROP statements and SAS data set options. For example, suppose you want to subset the data set as in the preceding example, but you want to include in the subset data set only the variables DATE, X, and Y. You could use the following statements:
   data subset;
      set full;
      if date >= '1jan1970'd & state = 'NC';
      keep date x y;
   run;
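The difference between WHERE and IF described earlier in this section can be sketched as follows. The contents of FULL and the computed variable RATIO are hypothetical and are created here only so that the step runs on its own.

```sas
/* FULL is a hypothetical input data set created for illustration. */
data full;
   input date : date9. state $ x y;
   format date date9.;
   datalines;
01jan1969 NC 1 2
01jan1971 NC 3 4
01jan1972 VA 5 6
01jan1973 NC 7 1
;
run;

data subset;
   set full;
   where date >= '1jan1970'd & state = 'NC';  /* input variables only        */
   ratio = y / x;                             /* computed in this DATA step  */
   if ratio > 1;                              /* IF can test a computed value */
   keep date x y ratio;
run;
```

The WHERE statement filters on DATE and STATE before the step processes each observation, while the test on RATIO has to be an IF statement because RATIO does not exist in the input data set.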
You can specify any number of conditions in the WHERE statement. For example, suppose that a strike created an outlier in May 1975, and you want to exclude that observation. You could use the following statements:

   proc autoreg data=full;
      where date >= '1jan1970'd & state = 'NC' & date ^= '1may1975'd;
      ... additional statements ...
   run;
You can use KEEP= and DROP= data set options to exclude variables from the input data set. See SAS Language Reference: Dictionary for information about SAS data set options.
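As a brief sketch of a data set option on a procedure's input, the following step drops one variable before the procedure reads the data and then selects observations with WHERE. The data set FULL, the variable Z, and the values are hypothetical; PROC PRINT is used only because it needs no further statements.

```sas
/* FULL and the variable Z are hypothetical, for illustration only. */
data full;
   input date : date9. state $ x y z;
   format date date9.;
   datalines;
01jan1969 NC 1 2 9
01jan1971 NC 3 4 9
01jan1972 VA 5 6 9
;
run;

/* DROP= removes Z from the data the procedure reads; */
/* WHERE then selects the observations to process.    */
proc print data=full(drop=z);
   where date >= '1jan1970'd & state = 'NC';
run;
```

Variables named in the WHERE statement should not be excluded by the data set option.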
   data usprice;
      input date : monyy7. cpi ppi;
      format date monyy7.;
      label cpi = "Consumer Price Index"
            ppi = "Producer Price Index";
   datalines;
   jun1990 129.9 114.3
   jul1990 130.4 114.5
   ... more lines ...
monthly, quarterly, yearly, and so forth. This means that time series with different sampling frequencies are not mixed in the same SAS data set. Most SAS/ETS procedures that process time series expect the input data set to contain time series in this standard form, and this is the simplest way to store time series in SAS data sets. (The EXPAND and TIMESERIES procedures can be helpful in converting your data to this standard form.)

There are more complex ways to represent time series in SAS data sets. You can incorporate cross-sectional dimensions with BY groups, so that each BY group is like a standard form time series data set. This method is discussed in the section "Cross-Sectional Dimensions and BY Groups" on page 79.

You can interleave time series, with several observations for each time period identified by another ID variable. Interleaved time series data sets are used to store several series in the same SAS variable. Interleaved time series data sets are often used to store series of actual values, predicted values, and residuals, or series of forecast values and confidence limits for the forecasts. This is discussed in the section "Interleaved Time Series" on page 80.
The decimal points with no digits in the data records represent missing data and are read by SAS as missing value codes. In this example, the time range of the USPRICE data set is June 1990 through July 1991, but the time range of the CPI variable is August 1990 through July 1991, and the time range of the PPI variable is June 1990 through June 1991. SAS/ETS procedures ignore missing values at the beginning or end of a series. That is, the series is considered to begin with the first nonmissing value and end with the last nonmissing value.
In this example, the series CPI has one embedded missing value, and the series PPI has two embedded missing values. The ranges of the two series are the same as before. Note that the observation for November 1990 has missing values for both CPI and PPI; there is no data for this period. This is an example of a missing observation.

You might ask why the data record for this period is included in the example at all, since the data record contains no data. However, deleting the data record for November 1990 from the example would cause an omitted observation in the USPRICE data set. SAS/ETS procedures expect input data sets to contain observations for a contiguous time sequence. If you omit observations from a time series data set and then try to analyze the data set with SAS/ETS procedures, the omitted observations will cause errors. When all data are missing for a period, a missing observation should be included in the data set to preserve the time sequence of the series.

If observations are omitted from the data set, the EXPAND procedure can be used to fill in the gaps with missing values (or to interpolate nonmissing values) for the time series variables and with the appropriate date or datetime values for the ID variable.
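The idea of restoring an omitted observation can be sketched without PROC EXPAND. The following statements use a different technique from the one named above: a DATA step builds a complete monthly calendar with the INTNX and INTCK functions, and a MERGE adds the omitted period with missing values. The data set names OMITTED, CALENDAR, and FILLED and the values of X are hypothetical.

```sas
/* OMITTED is a hypothetical monthly series with no record for NOV1990. */
data omitted;
   input date : monyy7. x;
   format date monyy7.;
   datalines;
sep1990 100
oct1990 101
dec1990 103
jan1991 104
;
run;

/* Build one observation for every month in the range */
data calendar;
   do i = 0 to intck('month', '01sep1990'd, '01jan1991'd);
      date = intnx('month', '01sep1990'd, i);
      output;
   end;
   format date monyy7.;
   drop i;
run;

/* Months absent from OMITTED appear in FILLED with X missing */
data filled;
   merge calendar omitted;
   by date;
run;
```

FILLED then contains a contiguous monthly sequence from September 1990 through January 1991, with a missing observation for November 1990.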
The second way is to store the data in a time series cross-sectional form. In this form, the series for all cross sections are stored in one variable and a cross section ID variable is used to identify observations for the different series. The observations are sorted by the cross section ID variable and by time within each cross section. The following statements indicate how to read the CPI series for U.S. cities in time series cross-sectional form:
   data cpicity;
      length city $11;
      input city $11. date : monyy. cpi;
      format date monyy.;
   datalines;
   New York    JAN1990 135.100
   New York    FEB1990 135.300
   ... more lines ...
When processing a time series cross-sectional form data set with most SAS/ETS procedures, use the cross section ID variable in a BY statement to process the time series separately. The data set must be sorted by the cross section ID variable and sorted by date within each cross section. The PROC SORT step in the preceding example ensures that the CPICITY data set is correctly sorted.

When the cross section ID variable is used in a BY statement, each BY group in the data set is like a standard form time series data set. Thus, SAS/ETS procedures that expect a standard form time series data set can process time series cross-sectional data sets when a BY statement is used, producing an independent analysis for each cross section.

It is also possible to analyze time series cross-sectional data jointly. The PANEL procedure (and the older TSCSREG procedure) expects the input data to be in the time series cross-sectional form described here. See Chapter 19, "The PANEL Procedure," for more information.
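The BY-group approach can be sketched as follows. The sketch assumes that the CPICITY data set described above has been read in full, with enough observations per city to fit a forecasting model; the PROC FORECAST options repeat those used elsewhere in this chapter, and the output data set name FOREOUT is an assumption.

```sas
/* Assumes CPICITY exists with variables CITY, DATE, and CPI. */
proc sort data=cpicity;
   by city date;        /* cross section first, then time */
run;

proc forecast data=cpicity interval=month lead=12
              out=foreout outfull outresid;
   var cpi;
   id date;
   by city;             /* independent forecast for each city */
run;
```

Each value of CITY is processed as if it were a separate standard form time series data set.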
   proc forecast data=uscpi interval=month lead=12
                 out=foreout outfull outresid;
      var cpi;
      id date;
   run;

   proc print data=foreout(obs=6);
   run;
Figure 3.5 Partial Listing of Output Data Set Produced by PROC FORECAST
Obs    date       _TYPE_      _LEAD_        cpi
  1    JUN1990    ACTUAL         0        129.900
  2    JUN1990    FORECAST       0        130.817
  3    JUN1990    RESIDUAL       0         -0.917
  4    JUL1990    ACTUAL         0        130.400
  5    JUL1990    FORECAST       0        130.678
  6    JUL1990    RESIDUAL       0         -0.278
Observations with _TYPE_=ACTUAL contain the values of CPI read from the input data set. Observations with _TYPE_=FORECAST contain one-step-ahead predicted values for observations with dates in the range of the input series and contain forecast values for observations for dates beyond the range of the input series. Observations with _TYPE_=RESIDUAL contain the difference between the actual and one-step-ahead predicted values. Observations with _TYPE_=U95 and _TYPE_=L95 contain the upper and lower bounds, respectively, of the 95% confidence interval for the forecasts.
The output data set FOREOUT contains many different time series in the single variable CPI. (The first few observations of FOREOUT are shown in Figure 3.6.) BY groups that are identified by the variable CITY contain the result series for the different cities. Within each value of CITY, the actual, forecast, residual, and confidence limits series are stored in interleaved form, with the observations for the different series identified by the values of _TYPE_.
Figure 3.6 Combined Cross Sections and Interleaved Time Series Data
FORECAST Output Data Set with BY Groups

Obs    city       date     _TYPE_      _LEAD_        cpi
  1    Chicago    JAN90    ACTUAL         0        128.100
  2    Chicago    JAN90    FORECAST       0        128.252
  3    Chicago    JAN90    RESIDUAL       0         -0.152
  4    Chicago    FEB90    ACTUAL         0        129.200
  5    Chicago    FEB90    FORECAST       0        128.896
  6    Chicago    FEB90    RESIDUAL       0          0.304
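To pull one result series out of an interleaved, cross-sectional output data set such as the one in Figure 3.6, you can subset on both the cross section ID and _TYPE_. The following is a sketch only; the output data set name CHICAGO_FCST is a hypothetical name chosen for illustration.

```sas
/* Keep only the forecast series for one cross section */
data chicago_fcst;
   set foreout;
   where city = "Chicago" and _type_ = "FORECAST";
run;
```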
uses the standard form because PROC ARIMA is designed for the detailed analysis of one series at a time and so forecasts only one series at a time. The following statements show the use of the ARIMA procedure to produce a forecast of the USCPI data set. Figure 3.7 shows part of the output data set that is produced by the ARIMA procedure's FORECAST statement. (The printed output from PROC ARIMA is not shown.) Compare the PROC ARIMA output data set shown in Figure 3.7 with the PROC FORECAST output data set shown in Figure 3.6.
title "PROC ARIMA Output Data Set";
proc arima data=uscpi;
   identify var=cpi(1);
   estimate q=1;
   forecast id=date interval=month lead=12 out=arimaout;
run;

proc print data=arimaout(obs=6);
run;
Figure 3.7 Partial Listing of Output Data Set Produced by PROC ARIMA
PROC ARIMA Output Data Set

Obs    date        cpi     FORECAST      STD         L95        U95      RESIDUAL
  1    JUN1990    129.9        .          .           .          .           .
  2    JUL1990    130.4     130.368    0.36160     129.660    131.077     0.03168
  3    AUG1990    131.6     130.881    0.36160     130.172    131.590     0.71909
  4    SEP1990    132.7     132.354    0.36160     131.645    133.063     0.34584
  5    OCT1990    133.5     133.306    0.36160     132.597    134.015     0.19421
  6    NOV1990    133.8     134.046    0.36160     133.337    134.754    -0.24552
The output data set produced by the ARIMA procedure's FORECAST statement stores the actual values in a variable with the same name as the response series, stores the forecast series in a variable named FORECAST, stores the residuals in a variable named RESIDUAL, stores the 95% confidence limits in variables named L95 and U95, and stores the standard error of the forecast in the variable STD.
This method of storing several different result series as a standard form time series data set is simple and convenient. However, it works well only for a single input series. The forecast of a single series can be stored in the variable FORECAST. But if two series are forecast, two different FORECAST variables are needed. The STATESPACE procedure handles this problem by generating forecast variable names FOR1, FOR2, and so forth. The SPECTRA procedure uses a similar method. Names such as FOR1, FOR2, RES1, RES2, and so forth require you to remember the order in which the input series are listed. This is why PROC FORECAST, which is designed to forecast a whole list of input series at once, stores its results in interleaved form.
Other SAS/ETS procedures are often used for a single input series but can also be used to process several series in a single step. Thus, they are not clearly like PROC FORECAST nor clearly like PROC ARIMA in the number of input series they are designed to work with. These procedures use a third method for storing multiple result series in an output data set. These procedures store output time series in standard form (as PROC ARIMA does) but require an OUTPUT statement to give names to the result series.
Table 3.3
Name         Periodicity
YEAR         yearly
SEMIYEAR     semiannual
QTR          quarterly
MONTH        monthly
SEMIMONTH    1st and 16th of each month
TENDAY       1st, 11th, and 21st of each month
WEEK         weekly
WEEKDAY      daily ignoring weekend days
DAY          daily
HOUR         hourly
MINUTE       every minute
SECOND       every second
Interval names can be abbreviated in various ways. For example, you could specify monthly intervals as MONTH, MONTHS, MONTHLY, or just MON. SAS accepts all these forms as equivalent. Interval names can also be qualified with a multiplier to indicate multi-period intervals. For example, biennial intervals are specified as YEAR2. Interval names can also be qualified with a shift index to indicate intervals with different starting points. For example, fiscal years starting in July are specified as YEAR.7.
Intervals are classified as either date or datetime intervals. Date intervals are used with SAS date values, while datetime intervals are used with SAS datetime values. The interval types YEAR, SEMIYEAR, QTR, MONTH, SEMIMONTH, TENDAY, WEEK, WEEKDAY, and DAY are date intervals. HOUR, MINUTE, and SECOND are datetime intervals. Date intervals can be turned into datetime intervals for use with datetime values by prefixing the interval name with DT. Thus DTMONTH intervals are like MONTH intervals but are used with datetime ID values instead of date ID values. See Chapter 4, Date Intervals, Formats, and Functions, for more information about specifying time intervals and for a detailed reference to the different kinds of intervals available.
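The following sketch shows how abbreviated, multiplied, and shifted interval names might be passed to the INTNX function. The variable names and the date constant are assumptions made for illustration; the values are written to the SAS log so that you can inspect them.

```sas
data _null_;
   d = '15oct2007'd;
   fy_start = intnx( 'year.7', d, 0 );  /* start of the July-based fiscal year that contains D */
   bi_start = intnx( 'year2',  d, 0 );  /* start of the two-year interval that contains D */
   next_mon = intnx( 'mon',    d, 1 );  /* MON is an abbreviation of MONTH */
   put fy_start= date9. bi_start= date9. next_mon= date9.;
run;
```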
to label forecast observations in the output data set. The values of the ID variable for the forecast observations after the end of the input data set are extrapolated according to the frequency specifications of the INTERVAL= option.
Time Intervals, the Time Series Forecasting System, and the Time Series Viewer
Time intervals are used in the Time Series Forecasting System and Time Series Viewer to identify the number of seasonal cycles or seasonality associated with a DATE, DATETIME, or TIME ID variable. For example, monthly time series have a seasonality of 12 because there are 12 months in a year; quarterly time series have a seasonality of 4 because there are four quarters in a year. The seasonality is used to analyze seasonal properties of time series data and to estimate seasonal forecasting methods.
The TSVIEW DATA= option specifies the data set to be viewed; the VAR= option specifies the variable that contains the time series observations; the TIMEID= option specifies the time series ID variable.
The Time Series Viewer can also be invoked by selecting Solutions → Analyze → Time Series Viewer from the menu in the SAS Display Manager.
period. Quarterly tick marks with YYQC format date values are used.
title "ARIMA Forecasts of CPI";
proc arima data=uscpi;
   identify var=cpi(1);
   estimate q=1;
   forecast id=date interval=month lead=12 out=arimaout;
run;

title "ARIMA forecasts of CPI";
proc sgplot data=arimaout noautolegend;
   scatter x=date y=cpi;
   scatter x=date y=forecast / markerattrs=(symbol=asterisk);
   scatter x=date y=l95 / markerattrs=(symbol=asterisk color=green);
   scatter x=date y=u95 / markerattrs=(symbol=asterisk color=green);
   format date yyqc4.;
   xaxis values=('1jan90'd to '1jul92'd by qtr);
   refline '15jul91'd / axis=x;
run;
Residual Plots
The following example plots the residuals series that was excluded from the plot in the previous example. The NEEDLE statement specifies a needle plot, so that each residual point is plotted as a vertical line showing deviation from zero.
proc sgplot data=foreout;
   where _type_ = 'RESIDUAL';
   needle x=date y=cpi / markers;
   format date yyqc4.;
   xaxis values=('1jan90'd to '1jul91'd by qtr);
run;
[Plot of U.S. Consumer Price Index (vertical axis) against date (horizontal axis), June 1990 through July 1991]
The TIMEPLOT procedure has several interesting features not discussed here. See The TIMEPLOT Procedure in the Base SAS Procedures Guide for more information.
For more information about the GPLOT procedure, see SAS/GRAPH Software: Reference.
The monthly USCPI data shown in a previous example contained time ID values represented in the MONYY format. If the data records instead contain separate year and month values, the data can be read in and the DATE variable computed with the following statements:
data uscpi;
   input month year cpi;
   date = mdy( month, 1, year );
   format date monyy.;
datalines;
6 90 129.9
7 90 130.4
... more lines ...
Returning to the example of reading the USCPI data from records that contain date values represented in the MONYY format, you can find the month and year of each observation from the SAS dates of the observations by using the following statements.
data uscpi;
   input date monyy7. cpi;
   format date monyy7.;
   year = year( date );
   month = month( date );
datalines;
jun1990 129.9
jul1990 130.4
... more lines ...
interval
   is a character constant or variable that contains an interval name

from
   is a SAS date value (for date intervals) or datetime value (for datetime intervals)

n
   is the number of intervals to increment from the interval that contains the from value

alignment
   controls the alignment of SAS dates, within the interval, used to identify output observations. Allowed values are BEGINNING, MIDDLE, END, and SAMEDAY.

The number of intervals to increment, n, can be positive, negative, or zero. For example, the statement NEXTMON=INTNX('MONTH',DATE,1) assigns to the variable NEXTMON the date of the first day of the month following the month that contains the value of DATE. Thus INTNX('MONTH','21OCT2007'D,1) returns the date 1 November 2007.

The INTCK function counts the number of interval boundaries between two date values or between two datetime values. The form of the INTCK function is
INTCK ( interval, from, to ) ;
from
   is the starting date value (for date intervals) or datetime value (for datetime intervals)

to
   is the ending date value (for date intervals) or datetime value (for datetime intervals)

For example, the statement NEWYEARS=INTCK('YEAR',DATE1,DATE2) assigns to the variable NEWYEARS the number of New Year's Days between the two dates.
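The following sketch exercises both functions with date constants so that the results can be checked by hand. The variable names are chosen for illustration; the values are written to the SAS log.

```sas
data _null_;
   nextmon  = intnx( 'month', '21oct2007'd, 1 );             /* 01NOV2007 */
   monthend = intnx( 'month', '21oct2007'd, 0, 'end' );      /* 31OCT2007 */
   newyears = intck( 'year',  '31dec2006'd, '01jan2007'd );  /* 1: one January 1 boundary is crossed */
   months   = intck( 'month', '01jan2007'd, '31jan2007'd );  /* 0: no month boundary is crossed */
   put nextmon= date9. monthend= date9. newyears= months=;
run;
```

Note that INTCK counts boundaries, not elapsed time: the two dates in the NEWYEARS example are one day apart, yet the count is 1.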
data uscpi;
   input cpi;
   date = intnx( 'month', '1jun1990'd, _n_-1 );
   format date monyy7.;
datalines;
129.9
130.4
... more lines ...
The automatic variable _N_ counts the number of times the DATA step program has executed; in this case _N_ contains the observation number. Thus _N_-1 is the increment needed from the first observation date. Alternatively, you could increment from the month before the first observation, in which case the INTNX function in this example would be written INTNX('MONTH','1MAY1990'D,_N_).
The results of using this version of the USCPI data set for analysis with SAS/ETS procedures would be the same as with first-of-month values for DATE. Although you can use any date within the interval as an ID value for the interval, you might find working with time series in SAS less confusing if you always use date ID values normalized to the start of the interval.
For some applications it might be preferable to use end of period dates, such as 31Jan1994, 28Feb1994, 31Mar1994, . . . , 31Dec1994. For other applications, such as plotting time series, it might be more convenient to use interval midpoint dates to identify the observations. (Some SAS/ETS procedures provide an ALIGN= option to control the alignment of dates for output time series observations. In addition, the INTNX library function supports an optional argument to specify the alignment of the returned date value.)
To normalize date values to the start of intervals, use the INTNX function with a 0 increment. The INTNX function with an increment of 0 computes the date of the first day of the interval (or the first second of the interval for datetime values). For example, INTNX('MONTH','17OCT1991'D,0,'BEG') returns the date '1OCT1991'D.
The following statements show how the preceding example can be changed to normalize the midmonth DATE values to first-of-month and end-of-month values. For exposition, the first-of-month value is transformed back into a middle-of-month value.
data uscpi;
   input date : date9. cpi;
   format date monyy7.;
   monthbeg = intnx( 'month', date, 0, 'beg' );
   midmonth = intnx( 'month', monthbeg, 0, 'mid' );
   monthend = intnx( 'month', date, 0, 'end' );
datalines;
15jun1990 129.9
15jul1990 130.4
... more lines ...
If you want to compute the date of a particular day within an interval, you can use calendar functions, or you can increment the starting date of the interval by a number of days. The following example shows three ways to compute the seventh day of the month:
data test;
   set uscpi;
   mon07_1 = mdy( month(date), 7, year(date) );
   mon07_2 = intnx( 'month', date, 0, 'beg' ) + 6;
   /* gives the seventh day only when DATE is the first day of the month */
   mon07_3 = intnx( 'day', date, 6 );
run;
that contains the number of days in the month for each observation:
data uscpi;
   input date : date9. cpi;
   format date monyy7.;
   width = intnx( 'month', date, 1 ) - intnx( 'month', date, 0 );
datalines;
15jun1990 129.9
15jul1990 130.4
15aug1990 131.6
... more lines ...
The following example shows how to use the INTCK function with shifted interval specifications to count the number of Sundays, Mondays, Tuesdays, and so forth, in each month. The variables NSUNDAY, NMONDAY, NTUESDAY, and so forth, are added to the USCPI data set.
data uscpi;
   set uscpi;
   d0 = intnx( 'month', date, 0 ) - 1;
   d1 = intnx( 'month', date, 1 ) - 1;
   nSunday  = intck( 'week.1', d0, d1 );
   nMonday  = intck( 'week.2', d0, d1 );
   nTuesday = intck( 'week.3', d0, d1 );
   nWedday  = intck( 'week.4', d0, d1 );
   nThurday = intck( 'week.5', d0, d1 );
   nFriday  = intck( 'week.6', d0, d1 );
   nSatday  = intck( 'week.7', d0, d1 );
   drop d0 d1;
run;
Since the INTCK function counts the number of interval beginning dates between two dates, the number of Sundays is computed by counting the number of week boundaries between the last day of the previous month and the last day of the current month. To count Mondays, Tuesdays, and so forth, shifted week intervals are used. The interval type WEEK.2 specifies weekly intervals starting on Mondays, WEEK.3 specifies weeks starting on Tuesdays, and so forth.
This data set is converted to a standard form time series data set by the following PROC EXPAND step. The TO= option specifies that monthly data is to be output, while the METHOD=NONE option specifies that no interpolation is to be performed, so that the variables X, Y, and Z in the output data set STANDARD will have missing values for the omitted time periods that are filled in by the EXPAND procedure.
proc expand data=omitted out=standard
            to=month method=none;
   id date;
run;
For example, suppose you want to know the date of the third Wednesday in the month of October 1991. The answer can be computed as
intnx( 'week.4', '1oct91'd - 1, 3 )
which returns the SAS date value '16OCT91'D. Consider this more complex example: how many weekdays are there between 17 October 1991 and the second Friday in November 1991, inclusive? The following formula computes the number of weekdays between the date value contained in the variable DATE and the second Friday of the following month (including the ending dates of this period):
n = intck( 'weekday', date - 1, intnx( 'week.6', intnx( 'month', date, 1 ) - 1, 2 ) + 1 );
Setting DATE to '17OCT91'D and applying this formula produces the answer, N=17.
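The third-Wednesday calculation generalizes to the k-th occurrence of any day of the week. The following sketch assumes hypothetical variables MONTHSTART, W, and K, and builds the shifted interval name with the CATS function; it illustrates the idea rather than a prescribed method.

```sas
data _null_;
   monthstart = '1oct1991'd;  /* first day of the month of interest */
   w = 4;                     /* day of week: 1=Sunday, ..., 4=Wednesday, ..., 7=Saturday */
   k = 3;                     /* which occurrence within the month */
   /* Build the shifted interval name, for example WEEK.4 */
   nthday = intnx( cats( 'week.', w ), monthstart - 1, k );
   put nthday= date9.;        /* 16OCT1991 for these inputs */
run;
```

Starting from the last day of the previous month ensures that a target weekday falling on the first of the month is counted as the first occurrence.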
data uscpi;
   set uscpi;
   cpilag = lag( cpi );
   cpidif = dif( cpi );
run;

proc print data=uscpi;
run;
Figure 3.16 USCPI Data Set with Lagged and Differenced Series
Plot of USCPI Data

Obs    date        cpi     cpilag    cpidif
  1    JUN1990    129.9       .         .
  2    JUL1990    130.4     129.9      0.5
  3    AUG1990    131.6     130.4      1.2
  4    SEP1990    132.7     131.6      1.1
  5    OCT1990    133.5     132.7      0.8
  6    NOV1990    133.8     133.5      0.3
  7    DEC1990    133.8     133.8      0.0
  8    JAN1991    134.6     133.8      0.8
  9    FEB1991    134.8     134.6      0.2
 10    MAR1991    135.0     134.8      0.2
 11    APR1991    135.2     135.0      0.2
 12    MAY1991    135.6     135.2      0.4
 13    JUN1991    136.0     135.6      0.4
 14    JUL1991    136.2     136.0      0.2
The DATA step is a powerful tool that can read any number of observations from any number of input files or data sets, can create any number of output data sets, and can write any number of output observations to any of the output data sets, all in the same program. Thus, in general, it is not clear what "previous observation" means in a DATA step program. In a DATA step program, the "previous observation" exists only if you write the program in a simple way that makes this concept meaningful. Since, in general, the previous observation is not clearly defined, it is not possible to make true lag or difference functions for the DATA step. Instead, the DATA step provides queuing functions that make it easy to compute lags and differences.
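The queuing behavior can be seen in a small sketch. The data set DEMO and its variables are hypothetical; the point is that each LAG call keeps its own queue, which is updated only when that call actually executes.

```sas
data demo;
   input x @@;
   /* This queue is updated only on even-numbered observations */
   if mod( _n_, 2 ) = 0 then lagx_cond = lag( x );
   /* This queue is updated on every observation */
   lagx_all = lag( x );
datalines;
1 2 3 4 5 6
;

proc print data=demo;
run;
```

LAGX_ALL holds the value of X from the previous observation, while LAGX_COND on observation 4 holds 2 and on observation 6 holds 4, the values of X at the previous executions of that particular LAG call.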
If the subsetting IF statement comes before the LAG function call, the value of CPILAG will be missing for January 1991, even though a value for December 1990 is available in the USCPI data set. To avoid losing this value, you must rearrange the statements to ensure that the LAG function is actually executed for the December 1990 observation.
data subset;
   set uscpi;
   cpilag = lag( cpi );
   if date >= '1jan1991'd;
run;
In other cases, the subsetting statement should come before the LAG and DIF functions. For example, the following statements subset the FOREOUT data set shown in a previous example to select only _TYPE_=RESIDUAL observations and also to compute the variable LAGRESID:
data residual;
   set foreout;
   if _type_ = "RESIDUAL";
   lagresid = lag( cpi );
run;
Another pitfall of LAG and DIF functions arises when they are used to process time series cross-sectional data sets. For example, suppose you want to add the variable CPILAG to the CPICITY data set shown in a previous example. You might use the following statements:
data cpicity;
   set cpicity;
   cpilag = lag( cpi );
run;
However, these statements do not yield the desired result. In the data set produced by these statements, the value of CPILAG for the first observation for the first city is missing (as it should be), but in the first observation for all later cities, CPILAG contains the last value for the previous city. To correct this, set the lagged variable to missing at the start of each cross section, as follows:
data cpicity;
   set cpicity;
   by city date;
   cpilag = lag( cpi );
   if first.city then cpilag = .;
run;
You can also calculate lags and differences in the DATA step without using LAG and DIF functions. For example, the following statements add the variables CPILAG and CPIDIF to the USCPI data set:
data uscpi;
   set uscpi;
   retain cpilag;
   cpidif = cpi - cpilag;
   output;
   cpilag = cpi;
run;
The RETAIN statement prevents the DATA step from reinitializing CPILAG to a missing value at the start of each iteration and thus allows CPILAG to retain the value of CPI assigned to it in the last statement. The OUTPUT statement causes the output observation to contain values of the variables before CPILAG is reassigned the current value of CPI in the last statement. This is the approach that must be used if you want to build a variable that is a function of its previous lags.
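As an illustration of a variable that is a function of its own previous value, the following sketch computes a simple exponentially smoothed series with RETAIN. The variable name CPISMOOTH, the output data set name SMOOTH, and the weight 0.3 are assumptions chosen for the example.

```sas
data smooth;
   set uscpi;
   retain cpismooth;
   /* Start the recursion at the first observed value */
   if _n_ = 1 then cpismooth = cpi;
   /* Otherwise combine the current value with the retained previous result */
   else cpismooth = 0.3 * cpi + 0.7 * cpismooth;
run;
```

The LAG function cannot express this recursion directly, because the right-hand side needs the previous value of the variable that is being computed.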
To compute second differences, take the difference of the difference. To compute higher-order differences, nest DIF functions to the order needed. For example, the following statements compute the second difference of CPI:
data uscpi;
   set uscpi;
   cpi2dif = dif( dif( cpi ) );
run;
Multiperiod lags and higher-order differencing can be combined. For example, the following statements compute monthly changes in the inflation rate, with inflation rate computed as percent change
Often, changes from the previous period are expressed at annual rates. This is done by exponentiation of the current-to-previous period ratio to the number of periods in a year and expressing the result as a percent change. For example, the following statements compute the month-over-month change in CPI as a percent change at annual rates:
data uscpi;
   set uscpi;
   pctchng = ( ( cpi / lag( cpi ) ) ** 12 - 1 ) * 100;
   label pctchng = "Monthly Percent Change, At Annual Rates";
run;
   pctchng = dif12( cpi ) / lag12( cpi ) * 100;
   label pctchng = "Percent Change from One Year Ago";
run;
To compute year-over-year percent change measured at a given period within the year, subset the series of percent changes from the same period in the previous year to form a yearly data set. Use an IF or WHERE statement to select observations for the period within each year on which the year-over-year changes are based. For example, the following statements compute year-over-year percent change in CPI from December of the previous year to December of the current year:
data annual;
   set uscpi;
   pctchng = dif12( cpi ) / lag12( cpi ) * 100;
   label pctchng = "Percent Change: December to December";
   if month( date ) = 12;
   format date year4.;
run;
It is also possible to compute percent change in the average over the most recent yearly span. For example, the following statements compute monthly percent change in the average of CPI over the most recent 12 months from the average over the previous 12 months:
data uscpi;
   retain sum12 0;
   drop sum12 ave12 cpilag12;
   set uscpi;
   sum12 = sum12 + cpi;
   cpilag12 = lag12( cpi );
   if cpilag12 ^= . then sum12 = sum12 - cpilag12;
   if lag11( cpi ) ^= . then ave12 = sum12 / 12;
   pctchng = dif12( ave12 ) / lag12( ave12 ) * 100;
run;
This example is a complex use of LAG and DIF functions that requires care in handling the initialization of the moving-window averaging process. The LAG12 of CPI is checked for missing values to determine when more than 12 values have been accumulated, and older values must be removed from the moving sum. The LAG11 of CPI is checked for missing values to determine when at least 12 values have been accumulated; AVE12 will be missing when LAG11 of CPI is missing. The DROP statement prevents temporary variables from being added to the data set. Note that the DIF and LAG functions must execute for every observation, or the queues of remembered values will not operate correctly. The CPILAG12 calculation must be separate from the IF statement. The PCTCHNG calculation must not be conditional on the IF statement. The EXPAND procedure provides an alternative way to compute moving averages.
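As a rough sketch of the PROC EXPAND alternative, the following step computes a 12-month backward moving average of CPI. The output data set name USCPI_MA and the variable name AVE12 are assumptions, and the TRIMLEFT operation is included on the understanding that the first 11 averages are otherwise based on fewer than 12 values; check the transformation operations documented for PROC EXPAND before relying on it.

```sas
proc expand data=uscpi out=uscpi_ma method=none;
   id date;
   /* Backward 12-month moving average; TRIMLEFT sets the first 11
      values, which rest on fewer than 12 observations, to missing */
   convert cpi = ave12 / transformout=( movave 12 trimleft 11 );
run;
```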
Leading Series
Although the SAS System does not provide a function to look ahead at the next value of a series, there are a couple of ways to perform this task. The most direct way to compute leads is to use the EXPAND procedure. For example:
proc expand data=uscpi out=uscpi method=none;
   id date;
   convert cpi=cpilead1 / transform=( lead 1 );
   convert cpi=cpilead2 / transform=( lead 2 );
run;
Another way to compute lead series in SAS software is by lagging the time ID variable, renaming the series, and merging the result data set back with the original data set. For example, the following statements add the variable CPILEAD to the USCPI data set. The variable CPILEAD contains the value of CPI in the following month. (The value of CPILEAD is missing for the last observation, of course.)
data temp;
   set uscpi;
   keep date cpi;
   rename cpi = cpilead;
   date = lag( date );
   if date ^= .;
run;

data uscpi;
   merge uscpi temp;
   by date;
run;
To compute leads at different lead lengths, you must create one temporary data set for each lead length. For example, the following statements compute CPILEAD1 and CPILEAD2, which contain leads of CPI for 1 and 2 periods, respectively:
data temp1(rename=(cpi=cpilead1))
     temp2(rename=(cpi=cpilead2));
   set uscpi;
   keep date cpi;
   date = lag( date );
   if date ^= . then output temp1;
   date = lag( date );
   if date ^= . then output temp2;
run;

data uscpi;
   merge uscpi temp1 temp2;
   by date;
run;
Summing Series
Simple cumulative sums are easy to compute using SAS sum statements. The following statements show how to compute the running sum of variable X in data set A, adding XSUM to the data set.
data a;
   set a;
   xsum + x;
run;
The SAS sum statement automatically retains the variable XSUM and initializes it to 0, and the sum statement treats missing values as 0. The sum statement is equivalent to using a RETAIN statement and the SUM function. The previous example could also be written as follows:
data a;
   set a;
   retain xsum;
   xsum = sum( xsum, x );
run;
You can also use the EXPAND procedure to compute summations. For example:
proc expand data=a out=a method=none;
   convert x=xsum / transform=( sum );
run;
Like differencing, summation can be done at different lags and can be repeated to produce higher-order sums. To compute sums over observations separated by lags greater than 1, use the LAG and
SUM functions together, and use a RETAIN statement that initializes the summation variable to zero. For example, the following statements add the variable XSUM2 to data set A. XSUM2 contains the sum of every other observation, with even-numbered observations containing a cumulative sum of values of X from even observations, and odd-numbered observations containing a cumulative sum of values of X from odd observations.
data a;
   set a;
   retain xsum2 0;
   /* LAG returns the value of XSUM2 saved at the previous call,
      which is the running sum as of two observations ago */
   xsum2 = sum( lag( xsum2 ), x );
run;
Assuming that A is a quarterly data set, the following statements compute running sums of X for each quarter. XSUM4 contains the cumulative sum of X for all observations for the same quarter as the current quarter. Thus, for a first-quarter observation, XSUM4 contains a cumulative sum of current and past first-quarter values.
data a;
   set a;
   retain xsum4 0;
   xsum4 = sum( lag3( xsum4 ), x );
run;
To compute higher-order sums, repeat the preceding process and sum the summation variable. For example, the following statements compute the first and second summations of X:
data a;
   set a;
   xsum + x;
   x2sum + xsum;
run;
You can also use PROC EXPAND to compute cumulative statistics and moving window statistics. See Chapter 14, The EXPAND Procedure, for details.
Log Transformation
The logarithmic transformation is often useful for series that must be greater than zero and that grow exponentially. For example, Figure 3.17 shows a plot of an airline passenger miles series. Notice that the series has exponential growth and the variability of the series increases over time. Airline passenger miles must also be zero or greater.
Figure 3.18 shows a plot of the log-transformed airline series. Notice that the log series has a linear trend and constant variance.
The %LOGTEST macro can help you decide if a log transformation is appropriate for a series. See Chapter 5, SAS Macros and Functions, for more information about the %LOGTEST macro.
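A log transformation itself takes only one DATA step statement. The following sketch assumes a data set A with a positive-valued series X; both names, and the new variable LOGX, are placeholders used for illustration.

```sas
data a;
   set a;
   /* LOG is the natural logarithm; LOGX stays missing when X is not positive */
   if x > 0 then logx = log( x );
run;
```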
Other Transformations
The Box-Cox transformation is a general class of transformations that includes the logarithm as a special case. The %BOXCOXAR macro can be used to find an optimal Box-Cox transformation for a time series. See Chapter 5 for more information about the %BOXCOXAR macro.
The logistic transformation is useful for variables with both an upper and a lower bound, such as market shares. The logistic transformation is useful for proportions, percent values, relative frequencies, or probabilities. The logistic function transforms values between 0 and 1 to values that can range from -∞ to +∞. For example, the following statements transform the variable SHARE from percent values to an unbounded range:
data a;
Many other data transformations can be used. You can create virtually any desired data transformation using DATA step statements.
See Table 14.2 in Chapter 14, The EXPAND Procedure, for a complete list of transformations supported by PROC EXPAND.
If the series have different time ranges, you can subset the time ranges of the output data sets accordingly. For example, if you know that CPI in USPRICE has the range August 1990 through the end of the data set, while PPI has the range from the beginning of the data set through June 1991, you could write the previous example as follows:
data uscpi(keep=date cpi)
     usppi(keep=date ppi);
   set usprice;
   if date >= '1aug1990'd then output uscpi;
   if date <= '1jun1991'd then output usppi;
run;
To combine time series from different data sets into one data set, list the data sets to be combined in a MERGE statement and specify the dating variable in a BY statement. The following statements show how to combine the USCPI and USPPI data sets to produce the USPRICE data set. It is important to use the BY DATE statement so that observations are matched by time before merging.
data usprice;
   merge uscpi usppi;
   by date;
run;
   var cpi;
   id _type_;
   by date;
   where date > '1may1991'd & date < '1oct1991'd;
run;

title "Transposed Data Set";
proc print data=trans(obs=10);
run;
The TRANSPOSE procedure adds the variables _NAME_ and _LABEL_ to the output data set. These variables contain the names and labels of the variables that were transposed. In this example, there is only one transposed variable, so _NAME_ has the value CPI for all observations. Thus, _NAME_ and _LABEL_ are of no interest and are dropped from the output data set by using the DROP= data set option. (If none of the variables transposed have a label, PROC TRANSPOSE does not output the _LABEL_ variable and the DROP=_LABEL_ option produces a warning message. You can ignore this message, or you can prevent the message by omitting _LABEL_ from the DROP= list.)
The original and transposed data sets are shown in Figure 3.19 and Figure 3.20. (The observation numbers shown for the original data set reflect the operation of the WHERE statement.)
Figure 3.19 Original Data Sets
Original Data Set

Obs    date       _TYPE_      _LEAD_        cpi
 37    JUN1991    ACTUAL         0        136.000
 38    JUN1991    FORECAST       0        136.146
 39    JUN1991    RESIDUAL       0         -0.146
 40    JUL1991    ACTUAL         0        136.200
 41    JUL1991    FORECAST       0        136.566
 42    JUL1991    RESIDUAL       0         -0.366
 43    AUG1991    FORECAST       1        136.856
 44    AUG1991    L95            1        135.723
 45    AUG1991    U95            1        137.990
 46    SEP1991    FORECAST       2        137.443
FORECAST    RESIDUAL      L95        U95
 136.146    -0.14616       .          .
 136.566    -0.36635       .          .
 136.856       .        135.723    137.990
 137.443       .        136.126    138.761
The names of the variables in the transposed data sets are taken from the city names in the ID variable CITY. The original and the transposed data sets are shown in Figure 3.21 and Figure 3.22.
The following statements transpose the CITYCPI data set back to the original form of the CPICITY data set. The variable _NAME_ is added to the data set to tell PROC TRANSPOSE the name of the variable in which to store the observations in the transposed data set. (If the (DROP=_NAME_ _LABEL_) option were omitted from the first PROC TRANSPOSE step, this would not be necessary. PROC TRANSPOSE assumes ID _NAME_ by default.) The NAME=CITY option in the PROC TRANSPOSE statement causes PROC TRANSPOSE to store the names of the transposed variables in the variable CITY. Because PROC TRANSPOSE recodes the values of the CITY variable to create valid SAS variable names in the transposed data set, the values of the variable CITY in the retransposed data set are not the same as in the original.
Interpolated values are computed only for embedded missing values in the input time series. Missing values before or after the range of a series are ignored by the EXPAND procedure. In the preceding example, PROC EXPAND assumes that all series are measured at points in time given by the value of the ID variable. In fact, the series in the USPRICE data set are monthly averages. PROC EXPAND can produce a better interpolation if this is taken into account. The following example uses the FROM=MONTH option to tell PROC EXPAND that the series is monthly and uses the CONVERT statement with the OBSERVED=AVERAGE to specify that the series values are averages over each month:
proc expand data=usprice out=interpl
            from=month;
   id date;
   convert cpi ppi / observed=average;
run;
Several time series databases distributed by major data vendors can be read into SAS data sets by the DATASOURCE procedure. See Chapter 11, The DATASOURCE Procedure, for more information. The SASECRSP, SASEFAME, and SASEHAVR interface engines enable SAS users to access and process time series data in CRSPAccess data files, FAME databases, and Haver Analytics Data Link Express (DLX) databases, respectively. See Chapter 33, The SASECRSP Interface Engine, Chapter 34, The SASEFAME Interface Engine, and Chapter 35, The SASEHAVR Interface Engine, for more details.
'1jun1990'd, _n_-1 );
   data temp;
      length _name_ $8 _label_ $40;
      keep _name_ _label_ date value;
      format date monyy.;
      input _name_ month year nval _label_ &;
      date = mdy( month, 1, year );
      do i = 1 to nval;
         input value @;
         output;
         date = intnx( 'month', date, 1 );
      end;
   datalines;
   cpi 8 90 12 Consumer Price Index
   131.6 132.7 133.5 133.8 133.8 134.6 134.8 135.0
   135.2 135.6 136.0 136.2
   ppi 6 90 13 Producer Price Index
   114.3 114.5 116.5 118.4 120.8 120.1 118.7 119.0
   117.2 116.2 116.0 116.5 116.3
   ;
The following statements sort the data set by date and series name, and the TRANSPOSE procedure is used to transpose the data into a standard form time series data set.
   proc sort data=temp;
      by date _name_;
   run;

   proc transpose data=temp out=usprice(drop=_name_);
      by date;
      var value;
   run;

   proc contents data=usprice;
   run;

   proc print data=usprice;
   run;
Chapter 4
Overview
This chapter summarizes the time intervals, date and datetime informats, date and datetime formats, and date, time, and datetime functions available in SAS software. The use of these features is explained in Chapter 3, Working with Time Series Data. The material in this chapter is also contained in SAS Language Reference: Concepts and SAS Language Reference: Dictionary. Because these features are useful for work with time series data, documentation of these features is consolidated and repeated here for easy reference.
Time Intervals
This section provides a reference for the different kinds of time intervals supported by SAS but does not cover how they are used. For an introduction to the use of time intervals see Chapter 3.
Some interval names are used with SAS date values, while other interval names are used with SAS datetime values. The interval names used with SAS date values are YEAR, SEMIYEAR, QTR, MONTH, SEMIMONTH, TENDAY, WEEK, WEEKDAY, DAY, YEARV, R445YR, R454YR, R544YR, R445QTR, R454QTR, R544QTR, R445MON, R454MON, R544MON, and WEEKV. The interval names used with SAS datetime or time values are HOUR, MINUTE, and SECOND. Various abbreviations of these names are also allowed, as described in the section "Summary of Interval Types" on page 132. Interval names for use with SAS date values can be prefixed with "DT" to construct interval names for use with SAS datetime values. The interval names DTYEAR, DTSEMIYEAR, DTQTR, DTMONTH, DTSEMIMONTH, DTTENDAY, DTWEEK, DTWEEKDAY, DTDAY, DTYEARV, DTR445YR, DTR454YR, DTR544YR, DTR445QTR, DTR454QTR, DTR544QTR, DTR445MON, DTR454MON, DTR544MON, and DTWEEKV are used with SAS datetime or time values.
Both the multiplier n and the shift index s are optional and default to 1. For example, YEAR, YEAR1, YEAR.1, and YEAR1.1 are all equivalent ways of specifying ordinary calendar years. To test for a valid interval specification, use the INTTEST function:
   interval = 'MONTH3.2';
   valid = INTTEST( interval );
   valid = INTTEST( 'YEAR4' );
INTTEST returns a value of 0 if the argument is not a valid interval specification and 1 if the argument is a valid interval specification. The INTTEST function can also be used in a DATA step to test an interval before calling an interval function:
   valid = INTTEST( interval );
   if ( valid = 1 ) then do;
      end = INTNX( interval, date, 0, 'E' );
   end;
For more information about the INTTEST function, see SAS Language Reference: Dictionary.
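As a minimal sketch of the check described above, the following DATA step tests a few candidate interval names and writes the results to the SAS log. The variable names (name, valid) and the list of candidates are illustrative and are not part of the original text.

```sas
data _null_;
   length name $16;
   /* candidate interval names; YEAR2.25 is expected to be rejected because */
   /* a two-year interval has only 24 monthly subperiods                    */
   do name = 'MONTH3.2', 'YEAR4', 'WEEK.7', 'YEAR2.25';
      valid = INTTEST( name );   /* 1 = valid, 0 = not valid */
      put name= valid=;
   end;
run;
```

Testing a name this way before passing it to INTNX or INTCK avoids errors from a misspelled interval or an out-of-range shift index.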
Shifted Intervals
Different kinds of intervals are shifted by different subperiods. YEAR, SEMIYEAR, QTR, and MONTH intervals are shifted by calendar months. WEEK and DAY intervals are shifted by days. SEMIMONTH intervals are shifted by semi-monthly periods. TENDAY intervals are shifted by 10-day periods. YEARV intervals are shifted by WEEKV intervals. R445YR, R445QTR, and R445MON intervals are shifted by R445MON intervals. R454YR, R454QTR, and R454MON intervals are shifted by R454MON intervals. R544YR, R544QTR, and R544MON intervals are shifted by R544MON intervals. WEEKV intervals are shifted by days. WEEKDAY intervals are shifted by weekdays. HOUR intervals are shifted by hours. MINUTE intervals are shifted by minutes. SECOND intervals are shifted by seconds. The INTSHIFT function returns the shift interval:
   interval = 'MONTH3.2';
   shift_interval = INTSHIFT( interval );
In this example, the value of shift_interval is 'MONTH'. For more information about the INTSHIFT function see SAS Language Reference: Dictionary. If a subperiod is specified, the shift index cannot be greater than the number of subperiods in the whole interval. For example, you could use YEAR2.24, but YEAR2.25 would be an error because there is no 25th month in a two-year interval. For interval types that shift by subperiods that are the same as the basic interval type, only multiperiod intervals can be shifted.
For example, MONTH type intervals shift by MONTH subintervals; thus, monthly intervals cannot be shifted because there is only one month in MONTH. However, bimonthly intervals can be shifted because there are two MONTH intervals in each MONTH2 interval. The interval name MONTH2.2 specifies bimonthly periods that start on the first day of even-numbered months.
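The effect of a shift index can be seen by asking INTNX for the start of successive shifted intervals. The following sketch, which uses an arbitrarily chosen date and illustrative variable names, lists the starting dates of three consecutive MONTH2.2 intervals.

```sas
data _null_;
   format start date9.;
   do i = 0 to 2;
      /* start of the MONTH2.2 interval that lies i intervals after */
      /* the interval containing 15 March 2008                      */
      start = INTNX( 'MONTH2.2', '15MAR2008'd, i );
      put i= start=;
   end;
run;
```

Because MONTH2.2 periods start on the first day of even-numbered months, the dates written to the log should be 01FEB2008, 01APR2008, and 01JUN2008.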
YEAR
specifies yearly intervals. Abbreviations are YEAR, YEARS, YEARLY, YR, ANNUAL, ANNUALLY, and ANNUALS. The starting subperiod s is in months (MONTH).
YEARV
specifies ISO 8601 yearly intervals. The ISO 8601 year starts on the Monday on or immediately preceding January 4th. Note that it is possible for the ISO 8601 year to start in December of the preceding year. Also, some ISO 8601 years contain a leap week. For further discussion of ISO weeks, see Technical Committee ISO/TC 154, Documents in Commerce, and Administration (2004). The starting subperiod s is in ISO 8601 weeks (WEEKV).
R445YR
is the same as YEARV except that the starting subperiod s is in retail 4-4-5 months (R445MON).
R454YR
is the same as YEARV except that the starting subperiod s is in retail 4-5-4 months (R454MON). For a discussion of the retail 4-5-4 calendar, see Federation (2007).
R544YR
is the same as YEARV except that the starting subperiod s is in retail 5-4-4 months (R544MON).
SEMIYEAR
specifies semiannual intervals (every six months). Abbreviations are SEMIYEAR, SEMIYEARS, SEMIYEARLY, SEMIYR, SEMIANNUAL, and SEMIANN. The starting subperiod s is in months (MONTH). For example, SEMIYEAR.3 intervals are March–August and September–February.
QTR
specifies quarterly intervals (every three months). Abbreviations are QTR, QUARTER, QUARTERS, QUARTERLY, QTRLY, and QTRS. The starting subperiod s is in months (MONTH).
R445QTR
specifies retail 4-4-5 quarterly intervals (every 13 ISO 8601 weeks). Some fourth quarters will contain a leap week. The starting subperiod s is in retail 4-4-5 months (R445MON).
R454QTR
specifies retail 4-5-4 quarterly intervals (every 13 ISO 8601 weeks). Some fourth quarters will contain a leap week. For a discussion of the retail 4-5-4 calendar, see Federation (2007). The starting subperiod s is in retail 4-5-4 months (R454MON).
R544QTR
specifies retail 5-4-4 quarterly intervals (every 13 ISO 8601 weeks). Some fourth quarters will contain a leap week. The starting subperiod s is in retail 5-4-4 months (R544MON).
MONTH
specifies monthly intervals. Abbreviations are MONTH, MONTHS, MONTHLY, and MON. The starting subperiod s is in months (MONTH). For example, MONTH2.2 intervals are February–March, April–May, June–July, August–September, October–November, and December–January of the following year.
R445MON
specifies retail 4-4-5 monthly intervals. The 3rd, 6th, 9th, and 12th months are five ISO 8601 weeks long with the exception that some 12th months contain leap weeks. All other months are four ISO 8601 weeks long. R445MON intervals begin with the 1st, 5th, 9th, 14th, 18th, 22nd, 27th, 31st, 35th, 40th, 44th, and 48th weeks of the ISO year. The starting subperiod s is in retail 4-4-5 months (R445MON).
R454MON
specifies retail 4-5-4 monthly intervals. The 2nd, 5th, 8th, and 11th months are five ISO 8601 weeks long. All other months are four ISO 8601 weeks long with the exception that some 12th months contain leap weeks. R454MON intervals begin with the 1st, 5th, 10th, 14th, 18th, 23rd, 27th, 31st, 36th, 40th, 44th, and 49th weeks of the ISO year. For a discussion of the retail 4-5-4 calendar, see Federation (2007). The starting subperiod s is in retail 4-5-4 months (R454MON).
R544MON
specifies retail 5-4-4 monthly intervals. The 1st, 4th, 7th, and 10th months are five ISO 8601 weeks long. All other months are four ISO 8601 weeks long with the exception that some 12th months contain leap weeks. R544MON intervals begin with the 1st, 6th, 10th, 14th, 19th, 23rd, 27th, 32nd, 36th, 40th, 45th, and 49th weeks of the ISO year. The starting subperiod s is in retail 5-4-4 months (R544MON).
SEMIMONTH
specifies semimonthly intervals. SEMIMONTH breaks each month into two periods, starting on the 1st and 16th days. Abbreviations are SEMIMONTH, SEMIMONTHS, SEMIMONTHLY, and SEMIMON. The starting subperiod s is in SEMIMONTH periods. For example, SEMIMONTH2.2 specifies intervals from the 16th of one month through the 15th of the next month.
TENDAY
specifies 10-day intervals. TENDAY breaks the month into three periods, the 1st through the 10th day of the month, the 11th through the 20th day of the month, and the remainder of the month. (TENDAY is a special interval typically used for reporting automobile sales data.) The starting subperiod s is in TENDAY periods. For example, TENDAY4.2 defines 40-day periods that start at the second TENDAY period.
WEEK
specifies weekly intervals of seven days. Abbreviations are WEEK, WEEKS, and WEEKLY. The starting subperiod s is in days (DAY), with the days of the week numbered as 1=Sunday, 2=Monday, 3=Tuesday, 4=Wednesday, 5=Thursday, 6=Friday, and 7=Saturday. For example, WEEK.7 means weekly with Saturday as the first day of the week.
WEEKV
specifies ISO 8601 weekly intervals of seven days. Each week starts on Monday. The starting subperiod s is in days (DAY). Note that WEEKV differs from WEEK in that WEEKV.1 starts on Monday, WEEKV.2 starts on Tuesday, and so forth.
WEEKDAY
specifies daily intervals with weekend days included in the preceding weekday. Note that for a five-day work week that starts on Monday, the appropriate interval is WEEKDAY5.2. Abbreviations are WEEKDAY and WEEKDAYS. The starting subperiod s is in weekdays (WEEKDAY). The WEEKDAY interval is the same as DAY except that weekend days are absorbed into the preceding weekday. Thus there are five WEEKDAY intervals in a calendar week: Monday, Tuesday, Wednesday, Thursday, and the three-day period Friday-Saturday-Sunday. The default weekend days are Saturday and Sunday, but any one to six weekend days can be listed after the WEEKDAY string and followed by a W. Weekend days are specified as 1 for Sunday, 2 for Monday, and so forth. For example, WEEKDAY67W specifies a Friday-Saturday weekend. WEEKDAY1W specifies a six-day work week with a Sunday weekend. WEEKDAY17W is the same as WEEKDAY.
DAY
specifies daily intervals. Abbreviations are DAY, DAYS, and DAILY. The starting subperiod s is in days (DAY).
HOUR
specifies hourly intervals. Abbreviations are HOUR, HOURS, HOURLY, and HR. The starting subperiod s is in hours (HOUR).
MINUTE
specifies minute intervals. Abbreviations are MINUTE, MINUTES, and MIN. The starting subperiod s is in minutes (MINUTE).
SECOND
specifies second intervals. Abbreviations are SECOND, SECONDS, and SEC. The starting subperiod s is in seconds (SECOND).
Table 4.1 Examples of Intervals

Name          Kind of Interval
YEAR          years that start in January
YEAR.10       fiscal years that start in October
YEAR2.7       biennial intervals that start in July of even years
YEAR2.19      biennial intervals that start in July of odd years
YEAR4.11      four-year intervals that start in November of leap years (frequency of U.S. presidential elections)
YEAR4.35      four-year intervals that start in November of even years between leap years (frequency of U.S. midterm elections)
WEEK          weekly intervals that start on Sundays
WEEK2         biweekly intervals that start on first Sundays
WEEK1.1       same as WEEK
WEEK.2        weekly intervals that start on Mondays
WEEK6.3       six-week intervals that start on first Tuesdays
WEEK6.11      six-week intervals that start on second Wednesdays
WEEKDAY       daily with Friday-Saturday-Sunday counted as the same day (five-day work week with a Saturday-Sunday weekend)
WEEKDAY17W    same as WEEKDAY
WEEKDAY5.2    five weekdays that start on Monday. If WEEKDAY data are accumulated into weekly data, the interval of the accumulated data is WEEKDAY5.2
WEEKDAY67W    daily with Thursday-Friday-Saturday counted as the same day (five-day work week with a Friday-Saturday weekend)
WEEKDAY1W     daily with Saturday-Sunday counted as the same day (six-day work week with a Sunday weekend)
WEEKDAY3.2    three-weekday intervals (with Friday-Saturday-Sunday counted as one weekday) with the cycle three-weekday periods aligned to Monday 4 Jan 1960
HOUR8.7       eight-hour intervals that start at 6 a.m., 2 p.m., and 10 p.m. (might be used for work shifts)
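To make the interval names concrete, the following sketch computes the beginning and end of the fiscal year (starting in October, interval name YEAR.10 under the shift rules described earlier) that contains a given date. The date and the variable names fy_start and fy_end are illustrative choices, not part of the original text.

```sas
data _null_;
   format fy_start fy_end date9.;
   d = '17OCT1991'd;
   /* n=0 keeps the interval that contains d; 'B' and 'E' pick its ends */
   fy_start = INTNX( 'YEAR.10', d, 0, 'B' );
   fy_end   = INTNX( 'YEAR.10', d, 0, 'E' );
   put fy_start= fy_end=;
run;
```

For this date, the fiscal year should run from 01OCT1991 through 30SEP1992.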
Table 4.2
Informat     Example           Width Range  Default Width  Description
DATEw.       17oct91                        7              day, month abbreviation, and year: ddmonyy
DATETIMEw.d  17oct91:14:45:32  13-40        18             date and time: ddmonyy:hh:mm:ss
DDMMYYw.     17/10/91          6-32                        day, month, year: ddmmyy, dd/mm/yy, dd-mm-yy, or dd mm yy
JULIANw.     91290             5-32                        year and day of year (Julian dates): yyddd
MMDDYYw.     10/17/91          6-32                        month, day, year: mmddyy, mm/dd/yy, mm-dd-yy, or mm dd yy
MONYYw.      Oct91             5-32                        month abbreviation and year
NENGOw.      H.03/10/17        7-32         10
TIMEw.d      14:45:32          5-32                        hours, minutes, seconds: hh:mm:ss or hours, minutes: hh:mm
WEEKVw.      1991-W42-04       3-200        11             ISO 8601 year, week, day of week: yyyy-Www-dd
YYMMDDw.     91/10/17          6-32                        year, month, day: yymmdd, yy/mm/dd, yy-mm-dd, or yy mm dd
YYQw.        91Q4              4-32                        year and quarter of year: yyQq
For example, the format MMDDYY8. writes the date 17 October 1991 as 10/17/91, while the format MMDDYY6. writes this date as 101791. In particular, formats that display the year show two-digit or four-digit year values depending on the width option. The examples shown in the tables are for the default width. The interval function INTFMT returns a recommended format for time ID values based on the interval that describes the frequency of the values. The following example uses INTFMT to select a format to display the quarterly time ID variable qtrDate. This selected format is stored in a macro variable created by the CALL SYMPUT statement. The macro variable &FMT is then used in the FORMAT statement in the PROC PRINT step.
   data b(keep=qtrDate);
      interval = 'QTR';
      form = INTFMT( interval, 'long' );
      call symput('fmt',form);
      do i=1 to 4;
         qtrDate = INTNX( interval, '01jan00'd, i-1 );
         output;
      end;
   run;

   proc print;
      format qtrDate &fmt;
   run;
The second argument to INTFMT is 'long', 'l', 'short', or 's' and controls the width of the year (2 or 4) for date formats. For more information about the INTFMT function, see SAS Language Reference: Dictionary. See SAS Language Reference: Concepts for a complete description of these formats, including the variations of the formats produced by different width options. See Chapter 3, "Working with Time Series Data," for a discussion of the use of date and datetime formats.
Date Formats
Table 4.3 lists some of the date formats available in SAS. For each format, an example is shown of a date value in the notation produced by the format. The date '17OCT91'D is used as the example.
Table 4.3 Frequently Used SAS Date Formats

Format       Example                     Width Range  Default Width  Description
DATEw.       17OCT91                     5-9          7
DAYw.        17                          2-32
DDMMYYw.     17/10/91                    2-8          8
DOWNAMEw.    Thursday                    1-32
JULDAYw.     290                         3-32                        day of year
JULIANw.     91290                       5-7
MMDDYYw.     10/17/91                    2-8
MMYYw.       10M1991                     5-32
MMYYCw.      10:1991                     5-32
MMYYDw.      10-1991                     5-32
MMYYPw.      10.1991                     5-32
MMYYSw.      10/1991                     5-32
MMYYNw.      101991                      5-32
MONNAMEw.    October                     1-32                        name of month
MONTHw.      10                          1-32                        month of year
MONYYw.      OCT91                       5-7
QTRw.        4                                        1
QTRRw.       IV                          3-32
NENGOw.      H.03/10/17                  2-10         10
WEEKDATEw.   Thursday, October 17, 1991  3-37         29
WEEKDATXw.   Thursday, 17 October 1991   3-37         29             day-of-week, dd month-name yy
WEEKDAYw.    5                           1-32                        day of week
WEEKVw.      1991-W42-04                 3-200        11
WORDDATEw.   October 17, 1991            3-32         18
WORDDATXw.   17 October 1991             3-32         18             dd month-name yy
YEARw.       1991                        2-32                        year
YYMMw.       1991M10                     5-32
YYMMCw.      1991:10                     5-32
YYMMDw.      1991-10                     5-32
YYMMPw.      1991.10                     5-32
YYMMSw.      1991/10                     5-32
YYMMNw.      199110                                   7
YYMONw.      1991OCT                     5-32
YYMMDDw.     91/10/17                    2-8
YYQw.        91Q4                        4-6
YYQCw.       1991:4                      4-32
YYQDw.       1991-4                      4-32
YYQPw.       1991.4                      4-32
YYQSw.       1991/4                      4-32
YYQNw.       19914                       3-32
YYQRw.       1991QIV                     6-32                        year and quarter in roman numerals: yyQrr
YYQRCw.      1991:IV                     6-32                        year and quarter in roman numerals: yy:rr
YYQRDw.      1991-IV                     6-32                        year and quarter in roman numerals: yy-rr
YYQRPw.      1991.IV                     6-32                        year and quarter in roman numerals: yy.rr
YYQRSw.      1991/IV                     6-32                        year and quarter in roman numerals: yy/rr
YYQRNw.                                  6-32                        year and quarter in roman numerals
Format       Example                     Width Range  Default Width  Description
DATETIMEw.d  17OCT91:14:25:32                         16             ddMONyy:hh:mm:ss
DTWKDATXw.   Thursday, 17 October 1991   3-37         29
HHMMw.d      14:25                       2-20
HOURw.d      14                          2-20                        hour
MMSSw.d      25:32                       2-20
TIMEw.d      14:25:32                    2-20
TODw.d       14:25:32                    2-20
that correspond to the observations. For example, for monthly data for 1994, the date values that identify the observations are 1Jan94, 1Feb94, 1Mar94, . . . , 1Dec94. However, for some applications it might be preferable to use end-of-period dates, such as 31Jan94, 28Feb94, 31Mar94, . . . , 31Dec94. For other applications, such as plotting time series, it might be more convenient to use interval midpoint dates to identify the observations. Many SAS/ETS and SAS High-Performance Forecasting procedures provide an ALIGN= option to control the alignment of dates for outputting time series observations. SAS/ETS procedures that support the ALIGN= option are ARIMA, DATASOURCE, ESM, EXPAND, FORECAST, SIMILARITY, TIMESERIES, UCM, and VARMAX. SAS High-Performance Forecasting procedures that support the ALIGN= option are HPFRECONCILE, HPF, HPFDIAGNOSE, HPFENGINE, and HPFEVENTS.
ALIGN=
The ALIGN= option can have the following values:
BEGINNING
specifies that dates be aligned to the start of the interval. This is the default. BEGINNING can be abbreviated as BEGIN, BEG, or B.
MIDDLE
specifies that dates be aligned to the interval midpoint, the average of the beginning and ending values. MIDDLE can be abbreviated as MID or M.
ENDING
specifies that dates be aligned to the end of the interval. ENDING can be abbreviated as END or E.
For information about the calculation of the beginning and ending values of intervals, see the section Beginning Dates and Datetimes of Intervals on page 132.
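The following sketch shows one way the ALIGN= option might be used, here in the ID statement of PROC TIMESERIES so that the output dates fall on the last day of each month. The input data set SASHELP.AIR (monthly data with variables DATE and AIR) and the output data set name air_end are assumptions made for illustration.

```sas
/* write the monthly series with end-of-month time ID values */
proc timeseries data=sashelp.air out=air_end;
   id date interval=month align=ending;
   var air;
run;

/* inspect the first few aligned dates */
proc print data=air_end(obs=3);
run;
```

Replacing ALIGN=ENDING with ALIGN=MIDDLE or ALIGN=BEGINNING changes only the date that labels each observation, not the values of the series.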
quoted string or as a SAS character variable. However, when you use a character variable, you should set the length of the character variable to at least the length of the longest string for that variable used in the DATA step. Also, to ensure correct results when using interval functions, use date intervals with date values and datetime intervals with datetime values. See SAS Language Reference: Dictionary for a complete description of these functions.
DATEJUL( yyddd )
returns the SAS date value given the Julian date in yyddd or yyyyddd format. For example, DATE = DATEJUL(99001); assigns the SAS date value '01JAN99'D to DATE, and DATE = DATEJUL(1999365); assigns the SAS date value '31DEC1999'D to DATE.
DATEPART( datetime )
returns the date part of a SAS datetime value as a date value.
DATETIME()
returns the current date and time of day as a SAS datetime value.
DAY( date )
returns the day of the month from a SAS date value.
DHMS( date, hour, minute, second )
returns a SAS datetime value for date, hour, minute, and second values.
HMS( hour, minute, second )
returns a SAS time value for hour, minute, and second values.
HOLIDAY( holiday , year )
returns a SAS date value for the holiday and year specified. Valid values for holiday are BOXING, CANADA, CANADAOBSERVED, CHRISTMAS, COLUMBUS, EASTER, FATHERS, HALLOWEEN, LABOR, MLK, MEMORIAL, MOTHERS, NEWYEAR, THANKSGIVING, THANKSGIVINGCANADA, USINDEPENDENCE, USPRESIDENTS, VALENTINES, VETERANS, VETERANSUSG, VETERANSUSPS, and VICTORIA. For example: EASTER2000 = HOLIDAY('EASTER', 2000);
HOUR( datetime )
returns the hour from a SAS datetime or time value.
INTCINDEX( date-interval, date ) INTCINDEX( datetime-interval, datetime )
returns the index of the seasonal cycle given an interval and an appropriate SAS date, datetime, or time value. For example, the seasonal cycle for INTERVAL=DAY is WEEK, so INTCINDEX('DAY','01SEP78'D); returns 35 since September 1, 1978, is the sixth day of the 35th week of the year. For correct results, date intervals should be used with date values, and datetime intervals should be used with datetime values.
INTCK( date-interval, date1, date2 ) INTCK( datetime-interval, datetime1, datetime2 )
returns the number of boundaries of intervals of the given kind that lie between the two date or datetime values.
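Because INTCK counts interval boundaries rather than elapsed time, two dates that are one day apart can be one interval apart, and two dates that are nearly a year apart can be zero intervals apart. The following sketch illustrates this with arbitrarily chosen dates and illustrative variable names.

```sas
data _null_;
   /* one YEAR boundary (1 January 2007) lies between these two dates */
   n_year  = INTCK( 'YEAR',  '31DEC2006'd, '01JAN2007'd );
   /* no YEAR boundary lies between these dates, 364 days apart */
   n_same  = INTCK( 'YEAR',  '01JAN2007'd, '31DEC2007'd );
   /* two MONTH boundaries: 1 February and 1 March */
   n_month = INTCK( 'MONTH', '15JAN2007'd, '01MAR2007'd );
   put n_year= n_same= n_month=;
run;
```

The log should show n_year=1, n_same=0, and n_month=2.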
INTCYCLE( interval )
returns the interval of the seasonal cycle, given a date, time, or datetime interval. For example, INTCYCLE('MONTH') returns 'YEAR' since the months January, February, . . . , December constitute a yearly cycle. INTCYCLE('DAY') returns 'WEEK' since Sunday, Monday, . . . , Saturday is a weekly cycle.
INTFIT( date1, date2, 'D' ) INTFIT( datetime1, datetime2, 'DT' ) INTFIT( obs1, obs2, 'OBS' )
returns an interval that fits exactly between two SAS date, datetime, or observation values, in the sense used by the INTNX function with SAMEDAY alignment. In the following example, result1 is the same as date1 and result2 is the same as date2.
   FitInterval = INTFIT( date1, date2, 'D' );
   result1 = INTNX( FitInterval, date1, 0, 'SAMEDAY' );
   result2 = INTNX( FitInterval, date1, 1, 'SAMEDAY' );
More than one interval can fit the preceding definition. For instance, two SAS date values that are seven days apart could be fit with either 'DAY7' or 'WEEK'. The INTFIT algorithm chooses the more common interval, so 'WEEK' is the result when the dates are seven days apart. The INTFIT function can be used to detect the possible frequency of the time series or to analyze frequencies of other events in a time series, such as outliers or missing values.
INTFMT(interval ,size)
returns a recommended format, given a date, time, or datetime interval for displaying the time ID values associated with a time series of the given interval. The valid values of size ('long', 'l', 'short', 's') specify whether the returned format uses a two-digit or four-digit year to display the SAS date value.
INTGET( date1, date2, date3 ) INTGET( datetime1, datetime2, datetime3 )
returns an interval that fits three consecutive SAS date or datetime values. The INTGET function examines two intervals: the first interval between date1 and date2, and the second interval between date2 and date3. In order for an interval to be detected, either the two intervals must be the same or one interval must be an integer multiple of the other interval. That is, INTGET assumes that at least two of the dates are consecutive points in the time series, and that the other two dates are also consecutive or represent the points before and after missing observations. The INTGET algorithm assumes that large values are SAS datetime values, which are measured in seconds, and that smaller values are SAS date values, which are measured in days. The INTGET function can be used to detect the possible frequency of the time series or to analyze frequencies of other events in a time series, such as outliers or missing values.
INTINDEX( date-interval, date ) INTINDEX( datetime-interval, datetime )
returns the seasonal index, given a date, time, or datetime interval and an appropriate date, time, or datetime value. The seasonal index is a number that represents the position of the date, time, or datetime value in the seasonal cycle of the specified interval. For example, INTINDEX('MONTH','01DEC2000'D); returns 12 because monthly data is yearly periodic and December is the 12th month of the year. However, INTINDEX('DAY','01DEC2000'D); returns 6 because daily data is weekly periodic and December 01, 2000, is a Friday, the sixth day of the week. To correctly identify the seasonal index, the interval specification should agree with the date, time, or datetime value. For example, INTINDEX('DTMONTH','01DEC2000'D); and INTINDEX('MONTH','01DEC2000:00:00:00'DT); do not return the expected value of 12. However, both INTINDEX('MONTH','01DEC2000'D); and INTINDEX('DTMONTH','01DEC2000:00:00:00'DT); return the expected value of 12.
INTNX( date-interval, date, n < , alignment > ) INTNX( datetime-interval, datetime, n < , alignment > )
returns the date or datetime value of the beginning of the interval that is n intervals from the interval that contains the given date or datetime value. The optional alignment argument specifies that the returned date is aligned to the beginning, middle, or end of the interval. Beginning is the default. In addition, you can specify SAME (or S) alignment. The SAME alignment bases the alignment of the calculated date or datetime value on the alignment of the input date or datetime value. As illustrated in the following example, the SAME alignment can be used to calculate the meaning of "same day next year" or "same day 2 weeks from now."
   nextYear = INTNX( 'YEAR', '15Apr2007'D, 1, 'S' );
   TwoWeeks = INTNX( 'WEEK', '15Apr2007'D, 2, 'S' );
The preceding example returns '15Apr2008'D for nextYear and '29Apr2007'D for TwoWeeks. For all values of alignment, the number of intervals n between the input date and the resulting date agrees with the input value. In the following statements, n2 equals n1:
   date2 = INTNX( interval, date1, n1, align );
   n2 = INTCK( interval, date1, date2 );
INTSEAS( interval )
returns the length of the seasonal cycle, given a date, time, or datetime interval. The length of a seasonal cycle is the number of intervals in a seasonal cycle. For example, when the interval for a time series is described as monthly, many procedures use the option INTERVAL=MONTH to indicate that each observation in the data then corresponds to a particular month. Monthly data are considered to be periodic for a one-year seasonal cycle. There are 12 months in one year, so the number of intervals (months) in a seasonal cycle (year) is 12. For quarterly data, there are 4 quarters in one year, so the number of intervals in a seasonal cycle is 4. The periodicity is not always one year. For example, INTERVAL=DAY is considered to have a seasonal cycle of one week, and because there are 7 days in a week, the number of intervals in a seasonal cycle is 7.
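The seasonal functions can be compared side by side. The following sketch loops over three interval names and reports the seasonal cycle (INTCYCLE), the cycle length (INTSEAS), and the seasonal index of one sample date (INTINDEX). The variable names and the sample date are illustrative.

```sas
data _null_;
   length interval cycle $16;
   do interval = 'MONTH', 'QTR', 'DAY';
      cycle     = INTCYCLE( interval );               /* for example, YEAR for MONTH      */
      cycle_len = INTSEAS( interval );                /* intervals per seasonal cycle     */
      index     = INTINDEX( interval, '01DEC2000'd ); /* position of the date in a cycle  */
      put interval= cycle= cycle_len= index=;
   end;
run;
```

For MONTH the cycle length should be 12 and the index of 1 December 2000 should be 12; for DAY the cycle is weekly, the length is 7, and the index is 6 because that date is a Friday.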
INTSHIFT( interval )
returns the shift interval that applies to the shift index if a subperiod is specified. For example, YEAR intervals are shifted by MONTH, so INTSHIFT('YEAR') returns 'MONTH'.
INTTEST( interval )
returns 1 if the interval name is a valid interval, 0 otherwise. For example, VALID = INTTEST('MONTH'); should set VALID to 1, while VALID = INTTEST('NOTANINTERVAL'); should set VALID to 0. The INTTEST function can be useful in verifying which values of multiplier n and the shift index s are valid in constructing an interval name.
JULDATE( date )
returns the Julian date from a SAS date value. The format of the Julian date is either yyddd or yyyyddd depending on the value of the system option YEARCUTOFF=. For example, using the default system option values, JULDATE( '31DEC1999'D ); returns 99365, while JULDATE( '31DEC1899'D ); returns 1899365.
MDY( month, day, year )
returns a SAS date value for month, day, and year values.
MINUTE( datetime )
returns the minute from a SAS time or datetime value.
MONTH( date )
returns the numerical value for the month of the year from a SAS date value. For example, MONTH=MONTH('01JAN2000'D); returns 1, the numerical value for January.
NWKDOM( n, weekday, month, year )
returns a SAS date value for the nth weekday of the month and year specified. For example, Thanksgiving is always the fourth (n = 4) Thursday (weekday = 5) in November (month = 11). Thus THANKS2000 = NWKDOM( 4, 5, 11, 2000); returns the SAS date value for Thanksgiving in the year 2000. The last weekday of a month can be specified using n = 5. Memorial Day in the United States is the last (n = 5) Monday (weekday = 2) in May (month = 5), and so MEMORIAL2002 = NWKDOM( 5, 2, 5, 2002); returns the SAS date value for Memorial Day in 2002. Because n = 5 always specifies the last occurrence of the month and most months have only 4 instances of each day, the result for n = 5 is often the same as the result for n = 4. NWKDOM is useful for calculating the SAS date values of holidays that are defined in this manner.
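The two NWKDOM calls described above can be run together, and the Thanksgiving result can be cross-checked against the HOLIDAY function. This is a small sketch; the variable name same is illustrative.

```sas
data _null_;
   format thanks2000 memorial2002 date9.;
   thanks2000   = NWKDOM( 4, 5, 11, 2000 );   /* 4th Thursday in November 2000 */
   memorial2002 = NWKDOM( 5, 2, 5, 2002 );    /* last Monday in May 2002       */
   /* 1 if NWKDOM and HOLIDAY agree on the date of Thanksgiving 2000 */
   same = ( thanks2000 = HOLIDAY( 'THANKSGIVING', 2000 ) );
   put thanks2000= memorial2002= same=;
run;
```

The dates written to the log should be 23NOV2000 and 27MAY2002.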
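QTR( date )
returns the quarter of the year from a SAS date value.
TIME()
returns the current time of day as a SAS time value.
TODAY()
returns the current date as a SAS date value. (TODAY is another name for the DATE function.)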
WEEK( date < , descriptor > )
returns the week of year from a SAS date value. The algorithm used to calculate the week depends on the descriptor. If the descriptor is 'U', weeks start on Sunday and the range is 0 to 53. If weeks 0 and 53 exist, they are only partial weeks. Week 52 can be a partial week. If the descriptor is 'V', the result is equivalent to the ISO 8601 week of year definition. The range is 1 to 53. Week 53 is a leap week. The first week of the year, Week 1, and the last week of the year, Week 52 or 53, can include days in another Gregorian calendar year. If the descriptor is 'W', weeks start on Monday and the range is 0 to 53. If weeks 0 and 53 exist, they are only partial weeks. Week 52 can be a partial week.
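The three descriptors can give different week numbers for the same date. The following sketch evaluates all three for the example date used throughout this chapter; the variable names are illustrative.

```sas
data _null_;
   d = '17OCT1991'd;
   week_u = WEEK( d, 'U' );   /* weeks start on Sunday      */
   week_v = WEEK( d, 'V' );   /* ISO 8601 week of the year  */
   week_w = WEEK( d, 'W' );   /* weeks start on Monday      */
   put week_u= week_v= week_w=;
run;
```

For this date the 'V' result should be 42, which agrees with the WEEKVw. example value 1991-W42-04.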
WEEKDAY( date )
returns the day of the week from a SAS date value. For example WEEKDAY=WEEKDAY('17OCT1991'D); returns 5, the numerical value for Thursday.
YEAR( date )
returns the year from a SAS date value.
References
Federation, N. R. (2007), National Retail Federation 4-5-4 Calendar, Washington, DC: NRF.
Technical Committee ISO/TC 154, D. E., Processes, Documents in Commerce, I., and Administration (2004), ISO 8601:2004 Data Elements and Interchange Formats - Information Interchange - Representation of Dates and Times, 3rd Edition, Technical report, International Organization for Standardization.
Chapter 5
SAS Macros
This chapter describes several SAS macros and the SAS function PROBDF that are provided with SAS/ETS software. A SAS macro is a program that generates SAS statements. Macros make it easy to produce and execute complex SAS programs that would be time-consuming to write yourself. SAS/ETS software includes the following macros:
%AR          generates statements to define autoregressive error models for the MODEL procedure.
%BOXCOXAR    investigates Box-Cox transformations useful for modeling and forecasting a time series.
%DFPVALUE    computes probabilities for Dickey-Fuller test statistics.
%DFTEST      performs Dickey-Fuller tests for unit roots in a time series process.
%LOGTEST     tests to see if a log transformation is appropriate for modeling and forecasting a time series.
%MA          generates statements to define moving-average error models for the MODEL procedure.
%PDL         generates statements to define polynomial-distributed lag models for the MODEL procedure.
These macros are part of the SAS AUTOCALL facility and are automatically available for use in your SAS program. See SAS Macro Language: Reference for information about the SAS macro facility. Since the %AR, %MA, and %PDL macros are used only with PROC MODEL, they are documented with the MODEL procedure. See the sections on the %AR, %MA, and %PDL macros in Chapter 18, The MODEL Procedure, for more information about these macros. The %BOXCOXAR, %DFPVALUE, %DFTEST, and %LOGTEST macros are described in the following sections.
BOXCOXAR Macro
The %BOXCOXAR macro finds the optimal Box-Cox transformation for a time series. Transformations of the dependent variable are a useful way of dealing with nonlinear relationships or heteroscedasticity. For example, the logarithmic transformation is often used for modeling and forecasting time series that show exponential growth or that show variability proportional to the level of the series. The Box-Cox transformation is a general class of power transformations that include the log transformation and no transformation as special cases. The Box-Cox transformation is

\[
Y_t =
\begin{cases}
\dfrac{(X_t + c)^{\lambda} - 1}{\lambda} & \text{for } \lambda \neq 0 \\[1ex]
\ln(X_t + c) & \text{for } \lambda = 0
\end{cases}
\]
The parameter $\lambda$ controls the shape of the transformation. For example, $\lambda = 0$ produces a log transformation, while $\lambda = 0.5$ results in a square root transformation. When $\lambda = 1$, the transformed series differs from the original series by $c - 1$.

The constant $c$ is optional. It can be used when some $X_t$ values are negative or 0. You choose $c$ so that the series $X_t$ is always greater than $-c$.

The %BOXCOXAR macro tries a range of $\lambda$ values and reports which of the values tried produces the optimal Box-Cox transformation. To evaluate different $\lambda$ values, the %BOXCOXAR macro transforms the series with each $\lambda$ value and fits an autoregressive model to the transformed series. It is assumed that this autoregressive model is a reasonably good approximation to the true time series model appropriate for the transformed series. The likelihood of the data under each autoregressive model is computed, and the $\lambda$ value that produces the maximum likelihood over the values tried is reported as the optimal Box-Cox transformation for the series.

The %BOXCOXAR macro prints and optionally writes to a SAS data set all of the $\lambda$ values tried, the corresponding log-likelihood value, and related statistics for the autoregressive model.

You can control the range and number of $\lambda$ values tried. You can also control the order of the autoregressive models fit to the transformed series. You can difference the transformed series before the autoregressive model is fit.

Note that the Box-Cox transformation might be appropriate when the data have a common distribution (apart from heteroscedasticity) but not when groups of observations for the variable are quite different. Thus the %BOXCOXAR macro is more often appropriate for time series data than for cross-sectional data.
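The transformation itself is simple to apply by hand, which can help when inspecting what a particular $\lambda$ does to a series. The following DATA step is a minimal sketch, not part of the %BOXCOXAR macro: the data set name SERIES, the variable X, and the values of LAMBDA and C are all assumptions chosen for illustration.

```sas
/* Sketch only: SERIES and X are hypothetical names for the input  */
/* data set and series; LAMBDA and C are illustrative values.      */
data boxcox;
   set series;
   lambda = 0.5;     /* power parameter; 0 selects the log case    */
   c      = 0;       /* shift constant; x + c must be positive     */
   if x + c > 0 then do;
      if lambda = 0 then y = log(x + c);
      else y = ((x + c)**lambda - 1) / lambda;
   end;
   /* y is left missing when x is missing or x + c <= 0 */
run;
```

With LAMBDA=1 the output variable Y equals X + C - 1, which matches the statement above that the transformed series then differs from the original by $c - 1$.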
Syntax
The form of the %BOXCOXAR macro is
%BOXCOXAR ( SAS-data-set, variable < , options > ) ;
The first argument, SAS-data-set, specifies the name of the SAS data set that contains the time series to be analyzed. The second argument, variable, specifies the time series variable name to be analyzed. The first two arguments are required.

The following options can be used with the %BOXCOXAR macro. Options must follow the required arguments and are separated by commas.
AR=n
specifies the order of the autoregressive model fit to the transformed series. The default is AR=5.
CONST=value
specifies a constant c to be added to the series before transformation. Use the CONST= option when some values of the series are 0 or negative. The default is CONST=0.
DIF=( differencing-list )
specifies the degrees of differencing to apply to the transformed series before the autoregressive model is fit. The differencing-list is a list of positive integers separated by commas and enclosed in parentheses. For example, DIF=(1,12) specifies that the transformed series be differenced once at lag 1 and once at lag 12. For more details, see the section "IDENTIFY Statement" on page 228 in Chapter 7, "The ARIMA Procedure."
LAMBDAHI=value
specifies the maximum value of lambda for the grid search. The default is LAMBDAHI=1. A large (in magnitude) LAMBDAHI= value can result in problems with floating-point arithmetic.
LAMBDALO=value
specifies the minimum value of lambda for the grid search. The default is LAMBDALO=0. A large (in magnitude) LAMBDALO= value can result in problems with floating-point arithmetic.
NLAMBDA=value
specifies the number of lambda values considered, including the LAMBDALO= and LAMBDAHI= option values. The default is NLAMBDA=2.
OUT=SAS-data-set
writes the results to an output data set. The output data set includes the lambda values tried (LAMBDA), and for each lambda value, the log likelihood (LOGLIK), residual mean squared error (RMSE), Akaike Information Criterion (AIC), and Schwarz's Bayesian Criterion (SBC).
PRINT=YES | NO
specifies whether results are printed. The default is PRINT=YES. The printed output contains the lambda values, log likelihoods, residual mean square errors, Akaike Information Criterion (AIC), and Schwarz's Bayesian Criterion (SBC).
Results
The value of $\lambda$ that produces the maximum log likelihood is returned in the macro variable &BOXCOXAR. The value of the variable &BOXCOXAR is ERROR if the %BOXCOXAR macro is unable to compute the best transformation due to errors. This might be the result of large lambda values. The Box-Cox transformation involves exponentiation of the data, so large lambda values can cause floating-point overflow.

Results are printed unless the PRINT=NO option is specified. Results are also stored in SAS data sets when the OUT= option is specified.
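As an illustration of the syntax and results described above, the following statements sketch one possible invocation. The choice of the SASHELP.AIR sample data set, the output data set name BCGRID, and the particular option values are assumptions made for this example only.

```sas
/* Try five lambda values from 0 to 1 (endpoints included) on a   */
/* monthly series, differencing at lags 1 and 12 before the       */
/* autoregressive fit. SASHELP.AIR / AIR are assumed sample data. */
%boxcoxar( sashelp.air, air, ar=5, dif=(1,12),
           lambdalo=0, lambdahi=1, nlambda=5, out=bcgrid );

/* &BOXCOXAR holds the selected lambda, or ERROR on failure */
%put Selected lambda: &boxcoxar;

/* One row per lambda tried: LAMBDA, LOGLIK, RMSE, AIC, SBC */
proc print data=bcgrid;
run;
```

Because the default NLAMBDA=2 evaluates only the two endpoints of the grid, a larger NLAMBDA= value is usually needed to distinguish intermediate powers such as the square root.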
Details
Assume that the transformed series $Y_t$ is a stationary $p$th-order autoregressive process generated by independent normally distributed innovations:

$$
(1 - \Theta(B))(Y_t - \mu) = \epsilon_t, \qquad \epsilon_t \sim \text{iid } N(0, \sigma^2)
$$
Given these assumptions, the log-likelihood function of the transformed data $Y_t$ is

$$
l_Y(\cdot) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln(|\Sigma|) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(Y - 1\mu)'\,\Sigma^{-1}(Y - 1\mu)
$$

In this equation, $n$ is the number of observations, $\mu$ is the mean of $Y_t$, $1$ is the $n$-dimensional column vector of 1s, $\sigma^2$ is the innovation variance, $Y = (Y_1, \ldots, Y_n)'$, and $\Sigma$ is the covariance matrix of $Y$.

The log-likelihood function of the original data $X_1, \ldots, X_n$ is

$$
l_X(\cdot) = l_Y(\cdot) + (\lambda - 1)\sum_{t=1}^{n} \ln(X_t + c)
$$
where $c$ is the value of the CONST= option.

For each value of $\lambda$, the maximum log-likelihood of the original data is obtained from the maximum log-likelihood of the transformed data given the maximum likelihood estimate of the autoregressive model.

The maximum log-likelihood values are used to compute the Akaike Information Criterion (AIC) and Schwarz's Bayesian Criterion (SBC) for each $\lambda$ value. The residual mean squared error based on the maximum likelihood estimator is also produced. To compute the mean squared error, the predicted values from the model are transformed again to the original scale (Pankratz 1983, pp. 256-258, and Taylor 1986).

After differencing as specified by the DIF= option, the process is assumed to be a stationary autoregressive process. You can check for stationarity of the series with the %DFTEST macro. If the process is not stationary, differencing with the DIF= option is recommended. For a process with moving-average terms, a large value for the AR= option might be appropriate.
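The extra term in the log likelihood of the original data is the log of the Jacobian of the Box-Cox transformation. The short derivation below is offered as a supporting sketch; it uses only the transformation defined earlier in this section and the standard change-of-variables rule for densities.

$$
\frac{\partial Y_t}{\partial X_t} = (X_t + c)^{\lambda - 1}
\quad\Longrightarrow\quad
\ln \prod_{t=1}^{n} \left| \frac{\partial Y_t}{\partial X_t} \right| = (\lambda - 1) \sum_{t=1}^{n} \ln(X_t + c)
$$

The derivative has the same form for $\lambda = 0$, where $Y_t = \ln(X_t + c)$ and $\partial Y_t / \partial X_t = (X_t + c)^{-1}$, so the adjustment reduces to $-\sum_{t=1}^{n} \ln(X_t + c)$. For $\lambda = 1$ the adjustment vanishes, as expected when no transformation is applied.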
DFPVALUE Macro
The %DFPVALUE macro computes the significance of the Dickey-Fuller test. The %DFPVALUE macro evaluates the p-value for the Dickey-Fuller test statistic $\tau$ for the test of H0: "The time series has a unit root" versus Ha: "The time series is stationary" using tables published by Dickey (1976) and Dickey, Hasza, and Fuller (1984).

The %DFPVALUE macro can compute p-values for tests of a simple unit root with lag 1 or for seasonal unit roots at lags 2, 4, or 12. The %DFPVALUE macro takes into account whether an intercept or deterministic time trend is assumed for the series.

The %DFPVALUE macro is used by the %DFTEST macro described later in this chapter.

Note that the %DFPVALUE macro has been superseded by the PROBDF function described later in this chapter. It remains for compatibility with past releases of SAS/ETS.
Syntax
The %DFPVALUE macro has the following form:
%DFPVALUE ( tau, nobs < , options > ) ;
The first argument, tau, specifies the value of the Dickey-Fuller test statistic. The second argument, nobs, specifies the number of observations on which the test statistic is based. The first two arguments are required.

The following options can be used with the %DFPVALUE macro. Options must follow the required arguments and are separated by commas.
DLAG=1 | 2 | 4 | 12
specifies the lag period of the unit root to be tested. DLAG=1 specifies a one-period unit root test. DLAG=2 specifies a test for a seasonal unit root with lag 2. DLAG=4 specifies a test for a seasonal unit root with lag 4. DLAG=12 specifies a test for a seasonal unit root with lag 12. The default is DLAG=1.
TREND=0 | 1 | 2
specifies the degree of deterministic time trend included in the model. TREND=0 specifies no trend and assumes the series has a zero mean. TREND=1 includes an intercept term. TREND=2 specifies both an intercept and a deterministic linear time trend term. The default is TREND=1. TREND=2 is not allowed with DLAG=2, 4, or 12.
Results
The computed p-value is returned in the macro variable &DFPVALUE. If the p-value is less than 0.01 or larger than 0.99, the macro variable &DFPVALUE is set to 0.01 or 0.99, respectively.
Minimum Observations
The minimum number of observations required by the %DFPVALUE macro depends on the value of the DLAG= option. The minimum observations are as follows:

DLAG=    Minimum Observations
1        9
2        6
4        4
12       12
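The following statements sketch a typical call. The test statistic value of -2.86 and the count of 100 observations are made-up inputs used only to show the argument order and how the result is retrieved.

```sas
/* tau = -2.86 and nobs = 100 are illustrative values only.      */
/* DLAG=1 and TREND=1 are the defaults, written out for clarity. */
%dfpvalue( -2.86, 100, dlag=1, trend=1 );

/* The p-value is returned in &DFPVALUE, clipped to [0.01, 0.99] */
%put Dickey-Fuller p-value: &dfpvalue;
```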
DFTEST Macro
The %DFTEST macro performs the Dickey-Fuller unit root test. You can use the %DFTEST macro to decide whether a time series is stationary and to determine the order of differencing required for the time series analysis of a nonstationary series.

Most time series analysis methods require that the series to be analyzed is stationary. However, many economic time series are nonstationary processes. The usual approach to this problem is to difference the series. A time series that can be made stationary by differencing is said to have a unit root. For more information, see the discussion of this issue in the section "Getting Started: ARIMA Procedure" on page 191 of Chapter 7, "The ARIMA Procedure."

The Dickey-Fuller test is a method for testing whether a time series has a unit root. The %DFTEST macro tests the hypothesis H0: "The time series has a unit root" versus Ha: "The time series is stationary" based on tables provided in Dickey (1976) and Dickey, Hasza, and Fuller (1984). The test can be applied for a simple unit root with lag 1, or for seasonal unit roots at lag 2, 4, or 12.

Note that the %DFTEST macro has been superseded by the PROC ARIMA stationarity tests. See Chapter 7, "The ARIMA Procedure," for details.
Syntax
The %DFTEST macro has the following form:
%DFTEST ( SAS-data-set, variable < , options > ) ;
The first argument, SAS-data-set, specifies the name of the SAS data set that contains the time series variable to be analyzed. The second argument, variable, specifies the time series variable name to be analyzed. The first two arguments are required.

The following options can be used with the %DFTEST macro. Options must follow the required arguments and are separated by commas.
AR=n
specifies the order of autoregressive model fit after any differencing specified by the DIF= and DLAG= options. The default is AR=3.
DIF=( differencing-list )
specifies the degrees of differencing to be applied to the series. The differencing list is a list of positive integers separated by commas and enclosed in parentheses. For example, DIF=(1,12) specifies that the series be differenced once at lag 1 and once at lag 12. For more details, see the section "IDENTIFY Statement" on page 228 in Chapter 7, "The ARIMA Procedure." If the option DIF=( $d_1$, ..., $d_k$ ) is specified, the series analyzed is $(1 - B^{d_1}) \cdots (1 - B^{d_k}) Y_t$, where $Y_t$ is the variable specified, and $B$ is the backshift operator defined by $B Y_t = Y_{t-1}$.
DLAG=1 | 2 | 4 | 12
specifies the lag to be tested for a unit root. The default is DLAG=1.
OUT=SAS-data-set
writes the test statistic, parameter estimates, and other statistics to an output data set.
TREND=0 | 1 | 2
specifies the degree of deterministic time trend included in the model. TREND=0 includes no deterministic term and assumes the series has a zero mean. TREND=1 includes an intercept term. TREND=2 specifies an intercept and a linear time trend term. The default is TREND=1. TREND=2 is not allowed with DLAG=2, 4, or 12.
Results
The computed p-value is returned in the macro variable &DFTEST. If the p-value is less than 0.01 or larger than 0.99, the macro variable &DFTEST is set to 0.01 or 0.99, respectively. (The same value is given in the macro variable &DFPVALUE returned by the %DFPVALUE macro, which is used by the %DFTEST macro to compute the p-value.) Results can be stored in SAS data sets with the OUT= and OUTSTAT= options.
Minimum Observations
The minimum number of observations required by the %DFTEST macro depends on the value of the DLAG= option. Let s be the sum of the differencing orders specified by the DIF= option, let t be the value of the TREND= option, and let p be the value of the AR= option. The minimum number of observations required is as follows:

DLAG=    Minimum Observations
1        1 + p + s + max(9, p + t + 2)
2        2 + p + s + max(6, p + t + 2)
4        4 + p + s + max(4, p + t + 2)
12       12 + p + s + max(12, p + t + 2)
Observations are not used if they have missing values for the series or for any lag or difference used in the autoregressive model.
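The minimum-observation rule is easy to evaluate for a planned call. The following DATA step is a small sketch that applies the formulas above; the values of p, s, t, and dlag correspond to a hypothetical call with AR=3, DIF=(1), TREND=1, and DLAG=1, and the variable names are chosen only for this example.

```sas
data _null_;
   p = 3;  s = 1;  t = 1;  dlag = 1;   /* AR=3, DIF=(1), TREND=1, DLAG=1 */
   select (dlag);
      when (1)  minobs = 1  + p + s + max(9,  p + t + 2);
      when (2)  minobs = 2  + p + s + max(6,  p + t + 2);
      when (4)  minobs = 4  + p + s + max(4,  p + t + 2);
      when (12) minobs = 12 + p + s + max(12, p + t + 2);
      otherwise minobs = .;            /* DLAG= value not supported */
   end;
   put minobs=;   /* here: 1 + 3 + 1 + max(9, 6) = 14 */
run;
```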
LOGTEST Macro
The %LOGTEST macro tests whether a logarithmic transformation is appropriate for modeling and forecasting a time series. The logarithmic transformation is often used for time series that show exponential growth or variability proportional to the level of the series.

The %LOGTEST macro fits an autoregressive model to a series and fits the same model to the log of the series. Both models are estimated by the maximum-likelihood method, and the maximum log-likelihood values for both autoregressive models are computed. These log-likelihood values are then expressed in terms of the original data and compared.

You can control the order of the autoregressive models. You can also difference the series and the log-transformed series before the autoregressive model is fit.

You can print the log-likelihood values and related statistics (AIC, SBC, and MSE) for the autoregressive models for the series and the log-transformed series. You can also output these statistics to a SAS data set.
Syntax
The %LOGTEST macro has the following form:
%LOGTEST ( SAS-data-set, variable, < options > ) ;
The first argument, SAS-data-set, specifies the name of the SAS data set that contains the time series variable to be analyzed. The second argument, variable, specifies the time series variable name to be analyzed. The first two arguments are required.

The following options can be used with the %LOGTEST macro. Options must follow the required arguments and are separated by commas.
AR=n
specifies the order of the autoregressive model fit to the series and the log-transformed series. The default is AR=5.
CONST=value
specifies a constant to be added to the series before transformation. Use the CONST= option when some values of the series are 0 or negative. The series analyzed must be greater than the negative of the CONST= value. The default is CONST=0.
DIF=( differencing-list )
specifies the degrees of differencing applied to the original and log-transformed series before fitting the autoregressive model. The differencing-list is a list of positive integers separated by commas and enclosed in parentheses. For example, DIF=(1,12) specifies that the transformed series be differenced once at lag 1 and once at lag 12. For more details, see the section "IDENTIFY Statement" on page 228 in Chapter 7, "The ARIMA Procedure."
OUT=SAS-data-set
writes the results to an output data set. The output data set includes a variable TRANS that identifies the transformation (LOG or NONE), the log-likelihood value (LOGLIK), residual mean squared error (RMSE), Akaike Information Criterion (AIC), and Schwarz's Bayesian Criterion (SBC) for the log-transformed and untransformed cases.
PRINT=YES | NO
specifies whether the results are printed. The default is PRINT=NO. The printed output shows the log-likelihood value, residual mean squared error, Akaike Information Criterion (AIC), and Schwarz's Bayesian Criterion (SBC) for the log-transformed and untransformed cases.
Results
The result of the test is returned in the macro variable &LOGTEST. The value of the &LOGTEST variable is LOG if the model fit to the log-transformed data has a larger log likelihood than the model fit to the untransformed series. The value of the &LOGTEST variable is NONE if the model fit to the untransformed data has a larger log likelihood. The variable &LOGTEST is set to ERROR if the %LOGTEST macro is unable to compute the test due to errors.

Results are printed when the PRINT=YES option is specified. Results are stored in SAS data sets when the OUT= option is specified.
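The following statements sketch one way to call the macro and read its result. The SASHELP.AIR sample data set, the output data set name LTOUT, and the option values are assumptions made for illustration.

```sas
/* Compare AR models for the series and its log after differencing */
/* at lags 1 and 12. SASHELP.AIR / AIR are assumed sample data.    */
%logtest( sashelp.air, air, dif=(1,12), print=yes, out=ltout );

/* &LOGTEST is LOG, NONE, or ERROR */
%put Suggested transformation: &logtest;
```

PRINT=YES is given explicitly because the default for this macro is PRINT=NO, unlike the %BOXCOXAR macro.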
Details
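Assume that a time series $X_t$ is a stationary $p$th-order autoregressive process with normally distributed white noise innovations. That is,

$$
(1 - \Theta(B))(X_t - \mu_x) = \epsilon_t
$$

where $\mu_x$ is the mean of $X_t$.

The log-likelihood function of $X_t$ is

$$
l_1(\cdot) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln(|\Sigma_{xx}|) - \frac{n}{2}\ln(\sigma_e^2) - \frac{1}{2\sigma_e^2}(X - 1\mu_x)'\,\Sigma_{xx}^{-1}(X - 1\mu_x)
$$

where $n$ is the number of observations, $1$ is the $n$-dimensional column vector of 1s, $\sigma_e^2$ is the variance of the white noise, $X = (X_1, \ldots, X_n)'$, and $\Sigma_{xx}$ is the covariance matrix of $X$.

On the other hand, if the log-transformed time series $Y_t = \ln(X_t + c)$ is a stationary $p$th-order autoregressive process, the log-likelihood function of $X_t$ is

$$
l_0(\cdot) = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln(|\Sigma_{yy}|) - \frac{n}{2}\ln(\sigma_e^2) - \frac{1}{2\sigma_e^2}(Y - 1\mu_y)'\,\Sigma_{yy}^{-1}(Y - 1\mu_y) - \sum_{t=1}^{n}\ln(X_t + c)
$$

where $\mu_y$ is the mean of $Y_t$, $Y = (Y_1, \ldots, Y_n)'$, and $\Sigma_{yy}$ is the covariance matrix of $Y$.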
The %LOGTEST macro compares the maximum values of $l_1(\cdot)$ and $l_0(\cdot)$ and determines which is larger.

The %LOGTEST macro also computes the Akaike Information Criterion (AIC), Schwarz's Bayesian Criterion (SBC), and residual mean squared error based on the maximum likelihood estimator for the autoregressive model. For the mean squared error, retransformation of forecasts is based on Pankratz (1983, pp. 256-258).

After differencing as specified by the DIF= option, the process is assumed to be a stationary autoregressive process. You might want to check for stationarity of the series using the %DFTEST macro. If the process is not stationary, differencing with the DIF= option is recommended. For a process with moving average terms, a large value for the AR= option might be appropriate.
Functions
Syntax
PROBDF( x, n < , d < , type > > )
x
is the test statistic.
n
is the sample size. The minimum value of n allowed depends on the value specified for the third argument d. For d in the set (1,2,4,6,12), n must be an integer greater than or equal to max(2d, 5); for other values of d the minimum value of n is 24.
d
is an optional integer giving the degree of the unit root tested for. Specify d = 1 for tests of a simple unit root $(1 - B)$. Specify d equal to the seasonal cycle length for tests for a seasonal unit root $(1 - B^d)$. The default value of d is 1; that is, a test for a simple unit root $(1 - B)$ is assumed if d is not specified. The maximum value of d allowed is 12.
type
is an optional character argument that specifies the type of test statistic used. The values of type are the following:

SZM    studentized test statistic for the zero mean (no intercept) case
RZM    regression test statistic for the zero mean (no intercept) case
SSM    studentized test statistic for the single mean (intercept) case
RSM    regression test statistic for the single mean (intercept) case
STR    studentized test statistic for the deterministic time trend case
RTR    regression test statistic for the deterministic time trend case

The values STR and RTR are allowed only when d = 1. The default value of type is SZM.
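The argument defaults can be seen by calling the function with and without the optional arguments. The following DATA step is a minimal sketch; the test statistic -2.90 and the sample size 100 are illustrative values, not results from any data set.

```sas
data _null_;
   /* x = -2.90 and n = 100 are illustrative values only */
   p_default  = probdf( -2.90, 100 );             /* d=1, type="SZM" */
   p_mean     = probdf( -2.90, 100, 1, "SSM" );   /* intercept case  */
   p_trend    = probdf( -2.90, 100, 1, "STR" );   /* trend case, d=1 */
   p_seasonal = probdf( -2.90, 100, 4, "SSM" );   /* lag-4 seasonal  */
   put p_default= p_mean= p_trend= p_seasonal=;
run;
```

The same statistic generally yields different probabilities under different type values, so the type should match the regression that produced x.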
Details
Theoretical Background
When a time series has a unit root, the series is nonstationary and the ordinary least squares (OLS) estimator is not normally distributed. Dickey (1976) and Dickey and Fuller (1979) studied the limiting distribution of the OLS estimator of autoregressive models for time series with a simple unit root. Dickey, Hasza, and Fuller (1984) obtained the limiting distribution for time series with seasonal unit roots.

Consider the ($p$+1)th-order autoregressive time series

$$
Y_t = \alpha_1 Y_{t-1} + \alpha_2 Y_{t-2} + \cdots + \alpha_{p+1} Y_{t-p-1} + e_t
$$

and its characteristic equation

$$
m^{p+1} - \alpha_1 m^{p} - \alpha_2 m^{p-1} - \cdots - \alpha_{p+1} = 0
$$
If all the characteristic roots are less than 1 in absolute value, $Y_t$ is stationary. $Y_t$ is nonstationary if there is a unit root. If there is a unit root, the sum of the autoregressive parameters is 1, and hence you can test for a unit root by testing whether the sum of the autoregressive parameters is 1 or not. The no-intercept model is parameterized as

$$
\nabla Y_t = \delta Y_{t-1} + \theta_1 \nabla Y_{t-1} + \cdots + \theta_p \nabla Y_{t-p} + e_t
$$

where $\nabla Y_t = Y_t - Y_{t-1}$ and

$$
\delta = \alpha_1 + \cdots + \alpha_{p+1} - 1, \qquad \theta_k = -\alpha_{k+1} - \cdots - \alpha_{p+1}
$$
The estimators are obtained by regressing $\nabla Y_t$ on $Y_{t-1}, \nabla Y_{t-1}, \ldots, \nabla Y_{t-p}$. The t statistic of the ordinary least squares estimator of $\delta$ is the test statistic for the unit root test.

If the type argument value specifies a test for a nonzero mean (intercept case), the autoregressive model includes a mean term $\alpha_0$. If the type argument value specifies a test for a time trend, the model also includes a time trend term $\gamma t$ and the model is as follows:

$$
\nabla Y_t = \alpha_0 + \gamma t + \delta Y_{t-1} + \theta_1 \nabla Y_{t-1} + \cdots + \theta_p \nabla Y_{t-p} + e_t
$$
For testing for a seasonal unit root, consider the multiplicative model

$$
(1 - \alpha_d B^d)(1 - \theta_1 B - \cdots - \theta_p B^p) Y_t = e_t
$$

Let $\nabla^d Y_t \equiv Y_t - Y_{t-d}$. The test statistic is calculated in the following steps:

1. Regress $\nabla^d Y_t$ on $\nabla^d Y_{t-1}, \ldots, \nabla^d Y_{t-p}$ to obtain the initial estimators $\hat{\theta}_i$ and compute residuals $\hat{e}_t$. Under the null hypothesis that $\alpha_d = 1$, $\hat{\theta}_i$ are consistent estimators of $\theta_i$.

2. Regress $\hat{e}_t$ on $(1 - \hat{\theta}_1 B - \cdots - \hat{\theta}_p B^p) Y_{t-d}, \nabla^d Y_{t-1}, \ldots, \nabla^d Y_{t-p}$ to obtain estimates of $\delta = \alpha_d - 1$ and $\theta_i - \hat{\theta}_i$.

The t ratio for the estimate of $\delta$ produced by the second step is used as a test statistic for testing for a seasonal unit root. The estimates of $\theta_i$ are obtained by adding the estimates of $\theta_i - \hat{\theta}_i$ from the second step to $\hat{\theta}_i$ from the first step.

The series $(1 - B^d) Y_t$ is assumed to be stationary, where d is the value of the third argument to the PROBDF function.

If the series is an ARMA process, a large value of p might be desirable in order to obtain a reliable test statistic. To determine an appropriate value for p, see Said and Dickey (1984).
Test Statistics
The Dickey-Fuller test is used to test the null hypothesis that the time series exhibits a lag d unit root against the alternative of stationarity. The PROBDF function computes the probability of observing a test statistic more extreme than x under the assumption that the null hypothesis is true. You should reject the unit root hypothesis when PROBDF returns a small (significant) probability value.

There are several different versions of the Dickey-Fuller test. The PROBDF function supports six versions, as selected by the type argument. Specify the type value that corresponds to the way that you calculated the test statistic x.

The last two characters of the type value specify the kind of regression model used to compute the Dickey-Fuller test statistic. The meaning of the last two characters of the type value is as follows:

ZM
zero mean or no-intercept case. The test statistic x is assumed to be computed from the regression model

$$
y_t = \alpha_d y_{t-d} + e_t
$$

SM
single mean or intercept case. The test statistic x is assumed to be computed from the regression model

$$
y_t = \alpha_0 + \alpha_d y_{t-d} + e_t
$$

TR
intercept and deterministic time trend case. The test statistic x is assumed to be computed from the regression model

$$
y_t = \alpha_0 + \gamma t + \alpha_1 y_{t-1} + e_t
$$

The first character of the type value specifies whether the regression test statistic or the studentized test statistic is used. Let $\hat{\alpha}_d$ be the estimated regression coefficient for the dth lag of the series, and let $\mathrm{se}_{\hat{\alpha}}$ be the standard error of $\hat{\alpha}_d$. The meaning of the first character of the type value is as follows:

R
the regression-coefficient-based test statistic. The test statistic is

$$
x = n(\hat{\alpha}_d - 1)
$$

S
the studentized test statistic. The test statistic is

$$
x = \frac{\hat{\alpha}_d - 1}{\mathrm{se}_{\hat{\alpha}}}
$$
See Dickey and Fuller (1979), Dickey, Hasza, and Fuller (1984), and Hamilton (1994) for more information about the Dickey-Fuller test null distribution. The preceding formulas are for the basic Dickey-Fuller test. The PROBDF function can also be used for the augmented Dickey-Fuller test, in which the error term $e_t$ is modeled as an autoregressive process; however, the test statistic is computed somewhat differently for the augmented Dickey-Fuller test. See Dickey, Hasza, and Fuller (1984) and Hamilton (1994) for information about seasonal and nonseasonal augmented Dickey-Fuller tests.

The PROBDF function is calculated from approximating functions fit to empirical quantiles that are produced by a Monte Carlo simulation that employs $10^8$ replications for each simulation. Separate simulations were performed for selected values of n and for d = 1, 2, 4, 6, 12 (where n and d are the second and third arguments to the PROBDF function).

The maximum error of the PROBDF function is approximately $\pm 10^{-3}$ for d in the set (1,2,4,6,12) and can be slightly larger for other d values. (Because the number of simulation replications used to produce the PROBDF function is much greater than the 60,000 replications used by Dickey and Fuller (1979) and Dickey, Hasza, and Fuller (1984), the PROBDF function can be expected to produce results that are substantially more accurate than the critical values reported in those papers.)
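To make the distinction between the regression-coefficient and studentized statistics concrete, the following DATA step computes both forms from one estimate and passes each to PROBDF with the matching type value. This is a sketch only: the coefficient estimate, its standard error, and the sample size are hypothetical numbers, and the variable names are chosen for this example.

```sas
data _null_;
   n         = 100;    /* observations used in the regression (assumed) */
   alpha_hat = 0.92;   /* hypothetical estimate of the lag 1 coefficient */
   se        = 0.04;   /* hypothetical standard error of alpha_hat       */

   x_r = n * (alpha_hat - 1);     /* regression-coefficient statistic */
   x_s = (alpha_hat - 1) / se;    /* studentized statistic            */

   p_r = probdf( x_r, n, 1, "RSM" );   /* first character R matches x_r */
   p_s = probdf( x_s, n, 1, "SSM" );   /* first character S matches x_s */
   put x_r= x_s= p_r= p_s=;
run;
```

With these inputs the two statistics are about -8 and -2. Passing a studentized statistic with an R type value, or the reverse, would compare the statistic against the wrong null distribution.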
Examples
Suppose the data set TEST contains 104 observations of the time series variable Y, and you want to test the null hypothesis that there exists a lag 4 seasonal unit root in the Y series. The following statements illustrate how to perform the single-mean Dickey-Fuller regression coefficient test using PROC REG and PROBDF.

    /* create the lag 4 regressor */
    data test1;
       set test;
       y4 = lag4(y);
    run;

    /* regress y on its lag 4 value; estimates go to ALPHA */
    proc reg data=test1 outest=alpha;
       model y = y4 / noprint;
    run;

    /* regression-coefficient statistic n*(alpha-1), n = 104 - 4 */
    data _null_;
       set alpha;
       x = 100 * ( y4 - 1 );
       p = probdf( x, 100, 4, "RSM" );
       put p= pvalue5.3;
    run;
To perform the augmented Dickey-Fuller test, regress the differences of the series on lagged differences and on the lagged value of the series, and compute the test statistic from the regression coefficient for the lagged series. The following statements illustrate how to perform the single-mean augmented Dickey-Fuller studentized test for a simple unit root using PROC REG and PROBDF:

    /* lagged level, first difference, and four lagged differences */
    data test1;
       set test;
       yl  = lag(y);
       yd  = dif(y);
       yd1 = lag1(yd);
       yd2 = lag2(yd);
       yd3 = lag3(yd);
       yd4 = lag4(yd);
    run;

    /* COVOUT adds the covariance matrix of the estimates to ALPHA */
    proc reg data=test1 outest=alpha covout;
       model yd = yl yd1-yd4 / noprint;
    run;

    data _null_;
       set alpha;
       retain a;
       /* the dependent variable is the difference, so the coefficient */
       /* of yl already estimates (alpha - 1)                          */
       if _type_ = 'PARMS' then a = yl;
       if _type_ = 'COV' & upcase(_NAME_) = 'YL' then do;
          x = a / sqrt(yl);   /* estimate divided by its standard error */
          p = probdf( x, 99, 1, "SSM" );
          put p= pvalue5.3;
       end;
    run;
The %DFTEST macro provides an easier way to perform Dickey-Fuller tests. The following statements perform the same tests as the preceding example:
    %dftest( test, y, ar=4 );
    %put p=&dftest;
References
Dickey, D. A. (1976), "Estimation and Testing of Nonstationary Time Series," Unpublished Ph.D. Thesis, Iowa State University, Ames.

Dickey, D. A. and Fuller, W. A. (1979), "Distribution of the Estimators for Autoregressive Time Series with a Unit Root," Journal of the American Statistical Association, 74, 427-431.

Dickey, D. A., Hasza, D. P., and Fuller, W. A. (1984), "Testing for Unit Roots in Seasonal Time Series," Journal of the American Statistical Association, 79, 355-367.

Hamilton, J. D. (1994), Time Series Analysis, Princeton, NJ: Princeton University Press.

Microsoft Excel 2000 Online Help, Redmond, WA: Microsoft Corp.

Pankratz, A. (1983), Forecasting with Univariate Box-Jenkins Models: Concepts and Cases, New York: John Wiley.

Said, S. E. and Dickey, D. A. (1984), "Testing for Unit Roots in ARMA Models of Unknown Order," Biometrika, 71, 599-607.

Taylor, J. M. G. (1986), "The Retransformed Mean After a Fitted Power Transformation," Journal of the American Statistical Association, 81, 114-118.
Chapter 6
Overview
Several SAS/ETS procedures (COUNTREG, ENTROPY, MDC, QLIM, UCM, and VARMAX) use the nonlinear optimization (NLO) subsystem to perform nonlinear optimization. This chapter describes the options of the NLO system and some technical details of the available optimization methods. Note that not all options have been implemented for all procedures that use the NLO subsystem. You should check each procedure chapter for more details about which options are available.
Options
The following table summarizes the options available in the NLO system.
Table 6.1  NLO Options

Option          Description

Optimization Specifications
LINESEARCH=     line-search method
LSPRECISION=    line-search precision
HESCAL=         type of Hessian scaling
INHESSIAN=      start for approximated Hessian
RESTART=        iteration number for update restart

Termination Criteria Specifications
MAXFUNC=        maximum number of function calls
MAXITER=        maximum number of iterations
MINITER=        minimum number of iterations
MAXTIME=        upper limit seconds of CPU time
ABSCONV=        absolute function convergence criterion
ABSFCONV=       absolute function convergence criterion
ABSGCONV=       absolute gradient convergence criterion
ABSXCONV=       absolute parameter convergence criterion
FCONV=          relative function convergence criterion
FCONV2=         relative function convergence criterion
GCONV=          relative gradient convergence criterion
XCONV=          relative parameter convergence criterion
FSIZE=          used in FCONV, GCONV criterion
XSIZE=          used in XCONV criterion

Step Length Options
DAMPSTEP=       damped steps in line search
MAXSTEP=        maximum trust region radius
INSTEP=         initial trust region radius

Printed Output Options
PALL            display (almost) all printed optimization-related output
PHISTORY        display optimization history
PHISTPARMS      display parameter estimates in each iteration
PSHORT          reduce some default optimization-related output
PSUMMARY        reduce most default optimization-related output
NOPRINT         suppress all printed optimization-related output

Remote Monitoring Options
SOCKET=         specify the fileref for remote monitoring
ABSCONV=r
ABSTOL=r
specifies an absolute function convergence criterion. For minimization, termination requires $f(\theta^{(k)}) \le r$. The default value of r is the negative square root of the largest double-precision value, which serves only as a protection against overflows.
ABSFCONV=r [n]
ABSFTOL=r [n]
specifies an absolute function convergence criterion. For all techniques except NMSIMP, termination requires a small change of the function value in successive iterations:

$$
|f(\theta^{(k-1)}) - f(\theta^{(k)})| \le r
$$

The same formula is used for the NMSIMP technique, but $\theta^{(k)}$ is defined as the vertex with the lowest function value, and $\theta^{(k-1)}$ is defined as the vertex with the highest function value in the simplex. The default value is r = 0. The optional integer value n specifies the number of successive iterations for which the criterion must be satisfied before the process can be terminated.
ABSGCONV=r [n]
ABSGTOL=r [n]
specifies an absolute gradient convergence criterion. Termination requires the maximum absolute gradient element to be small:

$$
\max_j |g_j(\theta^{(k)})| \le r
$$

This criterion is not used by the NMSIMP technique. The default value is r = 1E-5. The optional integer value n specifies the number of successive iterations for which the criterion must be satisfied before the process can be terminated.
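The absolute gradient criterion amounts to a check on the largest gradient component. The following DATA step is a sketch of that check outside any optimizer; the three gradient values are made up for illustration, and the tolerance is the documented default.

```sas
data _null_;
   /* hypothetical gradient elements at the current iterate */
   array g[3] _temporary_ (2e-6, -7e-6, 1e-6);
   r = 1e-5;                          /* ABSGCONV= default */
   maxabs = 0;
   do j = 1 to dim(g);
      maxabs = max(maxabs, abs(g[j]));
   end;
   converged = (maxabs <= r);         /* 1 if the criterion holds */
   put maxabs= converged=;
run;
```

Here the largest absolute element is 7E-6, which is below the tolerance, so the check succeeds for this iterate.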
ABSXCONV=r [n]
ABSXTOL=r [n]
specifies an absolute parameter convergence criterion. For all techniques except NMSIMP, termination requires a small Euclidean distance between successive parameter vectors,

$$
\| \theta^{(k)} - \theta^{(k-1)} \|_2 \le r
$$

For the NMSIMP technique, termination requires either a small length $\alpha^{(k)}$ of the vertices of a restart simplex,

$$
\alpha^{(k)} \le r
$$

or a small simplex size,

$$
\delta^{(k)} \le r
$$

where the simplex size $\delta^{(k)}$ is defined as the L1 distance from the simplex vertex $\xi^{(k)}$ with the smallest function value to the other n simplex points $\theta_l^{(k)} \ne \xi^{(k)}$:

$$
\delta^{(k)} = \sum_{\theta_l \ne \xi} \| \theta_l^{(k)} - \xi^{(k)} \|_1
$$

The default is r = 1E-8 for the NMSIMP technique and r = 0 otherwise. The optional integer value n specifies the number of successive iterations for which the criterion must be satisfied before the process can terminate.
DAMPSTEP[=r ]
specifies that the initial step length value $\alpha^{(0)}$ for each line search (used by the QUANEW, HYQUAN, CONGRA, or NEWRAP technique) cannot be larger than r times the step length value used in the former iteration. If the DAMPSTEP option is specified but r is not specified, the default is r = 2. The DAMPSTEP=r option can prevent the line-search algorithm from repeatedly stepping into regions where some objective functions are difficult to compute or where they could lead to floating-point overflows during the computation of objective functions and their derivatives. The DAMPSTEP=r option can save time-costly function calls during the line searches of objective functions that result in very small steps.
FCONV=r [n]
FTOL=r [n]
specifies a relative function convergence criterion. For all techniques except NMSIMP, termination requires a small relative change of the function value in successive iterations,

$$
\frac{|f(\theta^{(k)}) - f(\theta^{(k-1)})|}{\max(|f(\theta^{(k-1)})|, \mathrm{FSIZE})} \le r
$$

where FSIZE is defined by the FSIZE= option. The same formula is used for the NMSIMP technique, but $\theta^{(k)}$ is defined as the vertex with the lowest function value, and $\theta^{(k-1)}$ is defined as the vertex with the highest function value in the simplex. The default value may depend on the procedure. In most cases, you can use the PALL option to find it.
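The relative function criterion can be evaluated by hand from two successive objective values. The following DATA step is a sketch under stated assumptions: the two function values are made up, and the tolerance 1E-6 is an illustrative choice rather than the default of any particular procedure.

```sas
data _null_;
   f_prev = 105.2000;   /* f at iteration k-1 (made up)           */
   f_curr = 105.1999;   /* f at iteration k   (made up)           */
   fsize  = 0;          /* FSIZE= default                         */
   r      = 1e-6;       /* illustrative tolerance, not a default  */

   rel_change = abs(f_curr - f_prev) / max(abs(f_prev), fsize);
   converged  = (rel_change <= r);
   put rel_change= converged=;
run;
```

The relative change here is roughly 9.5E-7. A positive FSIZE= value matters mainly when the objective function is close to zero, where it keeps the denominator from becoming very small.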
FCONV2=r [n]
FTOL2=r [n]
specifies another function convergence criterion. For all techniques except NMSIMP, termination requires a small predicted reduction

$$
df^{(k)} \approx f(\theta^{(k)}) - f(\theta^{(k)} + s^{(k)})
$$

of the objective function. The predicted reduction

$$
df^{(k)} = -g^{(k)T} s^{(k)} - \frac{1}{2} s^{(k)T} H^{(k)} s^{(k)} = -\frac{1}{2} s^{(k)T} g^{(k)} \le r
$$

is computed by approximating the objective function f by the first two terms of the Taylor series and substituting the Newton step

$$
s^{(k)} = -[H^{(k)}]^{-1} g^{(k)}
$$

For the NMSIMP technique, termination requires a small standard deviation of the function values of the n + 1 simplex vertices $\theta_l^{(k)}$, $l = 0, \ldots, n$,

$$
\sqrt{\frac{1}{n+1} \sum_l \left[ f(\theta_l^{(k)}) - \bar{f}(\theta^{(k)}) \right]^2} \le r
$$

where $\bar{f}(\theta^{(k)}) = \frac{1}{n+1} \sum_l f(\theta_l^{(k)})$. If there are $n_{act}$ boundary constraints active at $\theta^{(k)}$, the mean and standard deviation are computed only for the $n + 1 - n_{act}$ unconstrained vertices.

The default value is r = 1E-6 for the NMSIMP technique and r = 0 otherwise. The optional integer value n specifies the number of successive iterations for which the criterion must be satisfied before the process can terminate.
FSIZE=r
specifies the FSIZE parameter of the relative function and relative gradient termination criteria. The default value is r = 0. For more details, see the FCONV= and GCONV= options.
GCONV=r [n]
GTOL=r [n]
specifies a relative gradient convergence criterion. For all techniques except CONGRA and NMSIMP, termination requires that the normalized predicted function reduction is small,

$$
\frac{g(\theta^{(k)})^T [H^{(k)}]^{-1} g(\theta^{(k)})}{\max(|f(\theta^{(k)})|, \mathrm{FSIZE})} \le r
$$

where FSIZE is defined by the FSIZE= option. For the CONGRA technique (where a reliable Hessian estimate H is not available), the following criterion is used:

$$
\frac{\| g(\theta^{(k)}) \|_2^2 \; \| s(\theta^{(k)}) \|_2}{\| g(\theta^{(k)}) - g(\theta^{(k-1)}) \|_2 \, \max(|f(\theta^{(k)})|, \mathrm{FSIZE})} \le r
$$

This criterion is not used by the NMSIMP technique. The default value is r = 1E-8. The optional integer value n specifies the number of successive iterations for which the criterion must be satisfied before the process can terminate.
HESCAL=0|1|2|3 HS=0|1|2|3
specifies the scaling version of the Hessian matrix used in NRRIDG, TRUREG, NEWRAP, or DBLDOG optimization. If HS is not equal to 0, the first iteration and each restart iteration sets the diagonal scaling matrix $D^{(0)} = \mathrm{diag}(d_i^{(0)})$:
$$d_i^{(0)} = \sqrt{\max(|H_{i,i}^{(0)}|, \epsilon)}$$
where $H_{i,i}^{(0)}$ are the diagonal elements of the Hessian. In every other iteration, the diagonal scaling matrix $D^{(k+1)} = \mathrm{diag}(d_i^{(k+1)})$ is updated depending on the HS option:
HS=0    specifies that no scaling is done.
HS=1    specifies the Moré (1978) scaling update:
$$d_i^{(k+1)} = \max\left[ d_i^{(k)}, \sqrt{\max(|H_{i,i}^{(k)}|, \epsilon)} \right]$$
HS=2    specifies the Dennis, Gay, and Welsch (1981) scaling update:
$$d_i^{(k+1)} = \max\left[ 0.6\, d_i^{(k)}, \sqrt{\max(|H_{i,i}^{(k)}|, \epsilon)} \right]$$
HS=3    specifies that $d_i$ is reset in each iteration:
$$d_i^{(k+1)} = \sqrt{\max(|H_{i,i}^{(k)}|, \epsilon)}$$
In each scaling update, $\epsilon$ is the relative machine precision. The default value is HS=0. Scaling of the Hessian can be time consuming in the case where general linear constraints are active.
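The three scaling updates differ only in how much of the previous scale value they retain. The following DATA step is a sketch that applies each rule to one diagonal element; the previous scale value and the Hessian element are hypothetical, and CONSTANT('MACEPS') stands in for the relative machine precision.

```sas
/* Compare the HS=1, HS=2, and HS=3 updates for one diagonal element.
   d_prev and h_ii are hypothetical illustration values. */
data _null_;
   eps    = constant('maceps');   /* relative machine precision            */
   d_prev = 2.0;                  /* scale value d_i at iteration k        */
   h_ii   = 1.0;                  /* diagonal Hessian element, iteration k */

   root  = sqrt(max(abs(h_ii), eps));
   d_hs1 = max(d_prev, root);         /* HS=1: never decreases          */
   d_hs2 = max(0.6 * d_prev, root);   /* HS=2: can shrink by up to 40%  */
   d_hs3 = root;                      /* HS=3: reset in every iteration */

   put d_hs1= d_hs2= d_hs3=;
run;
```

For these inputs the rules give 2, 1.2, and 1, respectively, which shows that HS=1 keeps the largest scale seen so far, HS=2 lets it decay gradually, and HS=3 tracks the current Hessian only.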
INHESSIAN[=r] INHESS[=r]
specifies how the initial estimate of the approximate Hessian is defined for the quasi-Newton techniques QUANEW and DBLDOG. There are two alternatives:
- If you do not use the r specification, the initial estimate of the approximate Hessian is set to the Hessian at $\theta^{(0)}$.
- If you do use the r specification, the initial estimate of the approximate Hessian is set to the multiple of the identity matrix rI.
By default, if you do not specify the option INHESSIAN=r, the initial estimate of the approximate Hessian is set to the multiple of the identity matrix rI, where the scalar r is computed from the magnitude of the initial gradient.
INSTEP=r
reduces the length of the first trial step during the line search of the first iterations. For highly nonlinear objective functions, such as the EXP function, the default initial radius of the trust-region algorithm TRUREG or DBLDOG or the default step length of the line-search algorithms can result in arithmetic overflows. If this occurs, you should specify decreasing values of 0 < r < 1 such as INSTEP=1E-1, INSTEP=1E-2, INSTEP=1E-4, and so on, until the iteration starts successfully.
- For trust-region algorithms (TRUREG, DBLDOG), the INSTEP= option specifies a factor r > 0 for the initial radius $\Delta^{(0)}$ of the trust region. The default initial trust-region radius is the length of the scaled gradient. This step corresponds to the default radius factor of r = 1.
- For line-search algorithms (NEWRAP, CONGRA, QUANEW), the INSTEP= option specifies an upper bound for the initial step length for the line search during the first five iterations. The default initial step length is r = 1.
- For the Nelder-Mead simplex algorithm, using TECH=NMSIMP, the INSTEP=r option defines the size of the start simplex.
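As a hedged sketch of the retry pattern described for INSTEP=, the following step passes a reduced value through an NLOPTIONS statement. The data set CHOICES, the variables Y, X1, and X2, and the use of PROC QLIM are assumptions made only for illustration; check the documentation of the procedure you use to confirm that it accepts INSTEP=.

```sas
/* Hypothetical data set CHOICES with a binary response Y.
   If the run overflows in the first iterations, rerun with a
   smaller value such as INSTEP=1E-4. */
proc qlim data=choices;
   model y = x1 x2 / discrete;
   nloptions tech=trureg instep=1e-2;   /* factor for the initial trust-region radius */
run;
```

Because TRUREG is a trust-region technique, the value here scales the initial radius rather than bounding a line-search step.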
LINESEARCH=i LIS=i
specifies the line-search method for the CONGRA, QUANEW, and NEWRAP optimization techniques. Refer to Fletcher (1987) for an introduction to line-search techniques. The value of i can be 1, ..., 8. For CONGRA, QUANEW, and NEWRAP, the default value is i = 2.
LIS=1    specifies a line-search method that needs the same number of function and gradient calls for cubic interpolation and cubic extrapolation; this method is similar to one used by the Harwell subroutine library.
LIS=2    specifies a line-search method that needs more function than gradient calls for quadratic and cubic interpolation and cubic extrapolation; this method is implemented as shown in Fletcher (1987) and can be modified to an exact line search by using the LSPRECISION= option.
LIS=3    specifies a line-search method that needs the same number of function and gradient calls for cubic interpolation and cubic extrapolation; this method is implemented as shown in Fletcher (1987) and can be modified to an exact line search by using the LSPRECISION= option.
LIS=4    specifies a line-search method that needs the same number of function and gradient calls for stepwise extrapolation and cubic interpolation.
LIS=5    specifies a line-search method that is a modified version of LIS=4.
LIS=6    specifies golden section line search (Polak 1971), which uses only function values for linear approximation.
LIS=7    specifies bisection line search (Polak 1971), which uses only function values for linear approximation.
LIS=8    specifies the Armijo line-search technique (Polak 1971), which uses only function values for linear approximation.
LSPRECISION=r LSP=r
specifies the degree of accuracy that should be obtained by the line-search algorithms LIS=2 and LIS=3. Usually an imprecise line search is inexpensive and successful. For more difficult optimization problems, a more precise and expensive line search may be necessary (Fletcher 1987). The second line-search method (which is the default for the NEWRAP, QUANEW, and CONGRA techniques) and the third line-search method approach exact line search for small LSPRECISION= values. If you have numerical problems, you should try to decrease the LSPRECISION= value to obtain a more precise line search. The default values are shown in the following table.
Table 6.2 Line Search Precision Defaults
MAXFUNC=i
specifies the maximum number i of function calls in the optimization process. The default values are as follows:
TRUREG, NRRIDG, NEWRAP: 125
QUANEW, DBLDOG: 500
CONGRA: 1000
NMSIMP: 3000
Note that the optimization can terminate only after completing a full iteration. Therefore, the number of function calls that is actually performed can exceed the number that is specified by the MAXFUNC= option.
MAXITER=i MAXIT=i
specifies the maximum number i of iterations in the optimization process. The default values are as follows:
TRUREG, NRRIDG, NEWRAP: 50
QUANEW, DBLDOG: 200
CONGRA: 400
NMSIMP: 1000
These default values are also valid when i is specified as a missing value.
MAXSTEP=r[n]
specifies an upper bound for the step length of the line-search algorithms during the first n iterations. By default, r is the largest double-precision value and n is the largest integer available. Setting this option can improve the speed of convergence for the CONGRA, QUANEW, and NEWRAP techniques.
MAXTIME=r
specifies an upper limit of r seconds of CPU time for the optimization process. The default value is the largest floating-point double representation of your computer. Note that the time specified by the MAXTIME= option is checked only once at the end of each iteration. Therefore, the actual running time can be much longer than that specified by the MAXTIME= option. The actual running time includes the rest of the time needed to finish the iteration and the time needed to generate the output of the results.
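The MAXFUNC=, MAXITER=, and MAXTIME= limits are typically set together. The following sketch assumes a procedure that supports the NLOPTIONS statement (PROC QLIM is used here) and a hypothetical data set CHOICES with variables Y, X1, and X2; the limit values are arbitrary.

```sas
/* Hypothetical resource limits for one optimization run */
proc qlim data=choices;
   model y = x1 x2 / discrete;
   nloptions maxiter=100    /* stop after at most 100 iterations         */
             maxfunc=400    /* function-call budget, checked per iteration */
             maxtime=60;    /* CPU-time budget in seconds, checked per iteration */
run;
```

Because both the function-call and the time limits are tested only at the end of an iteration, treat them as soft budgets rather than hard caps.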
MINITER=i MINIT=i
specifies the minimum number of iterations. The default value is 0. If you request more iterations than are actually needed for convergence to a stationary point, the optimization algorithms can behave strangely. For example, the effect of rounding errors can prevent the algorithm from continuing for the required number of iterations.
NOPRINT
suppresses the output. (See procedure documentation for availability of this option.)
PALL
displays all optional output for optimization. (See procedure documentation for availability of this option.)
PHISTORY
displays the optimization history. (See procedure documentation for availability of this option.)
PHISTPARMS
displays parameter estimates in each iteration. (See procedure documentation for availability of this option.)
PINIT
displays the initial values and derivatives (if available). (See procedure documentation for availability of this option.)
PSHORT
restricts the amount of default output. (See procedure documentation for availability of this option.)
PSUMMARY
restricts the amount of default displayed output to a short form of iteration history and notes, warnings, and errors. (See procedure documentation for availability of this option.)
RESTART=i > 0 REST=i > 0
specifies that the QUANEW or CONGRA algorithm is restarted with a steepest descent/ascent search direction after, at most, i iterations. Default values are as follows:
CONGRA with UPDATE=PB: restart is performed automatically; i is not used.
CONGRA with UPDATE≠PB: i = min(10n, 80), where n is the number of parameters.
QUANEW: i is the largest integer available.
SOCKET=fileref
specifies the fileref that contains the information needed for remote monitoring. See the section "Remote Monitoring" for more details.
TECHNIQUE=value TECH=value
specifies the optimization technique. Valid values are as follows:
CONGRA    performs a conjugate-gradient optimization, which can be more precisely specified with the UPDATE= option and modified with the LINESEARCH= option. When you specify this option, UPDATE=PB by default.
DBLDOG    performs a version of double-dogleg optimization, which can be more precisely specified with the UPDATE= option. When you specify this option, UPDATE=DBFGS by default.
NMSIMP    performs a Nelder-Mead simplex optimization.
NONE      does not perform any optimization. This option can be used as follows:
          - to perform a grid search without optimization
          - to compute estimates and predictions that cannot be obtained efficiently with any of the optimization techniques
NEWRAP    performs a Newton-Raphson optimization that combines a line-search algorithm with ridging. The line-search algorithm LIS=2 is the default method.
NRRIDG    performs a Newton-Raphson optimization with ridging.
QUANEW    performs a quasi-Newton optimization, which can be defined more precisely with the UPDATE= option and modified with the LINESEARCH= option. This is the default estimation method.
TRUREG    performs a trust region optimization.
UPDATE=method UPD=method
specifies the update method for the QUANEW, DBLDOG, or CONGRA optimization technique. Not every update method can be used with each optimizer. Valid methods are as follows:
BFGS     performs the original Broyden, Fletcher, Goldfarb, and Shanno (BFGS) update of the inverse Hessian matrix.
DBFGS    performs the dual BFGS update of the Cholesky factor of the Hessian matrix. This is the default update method.
DDFP     performs the dual Davidon, Fletcher, and Powell (DFP) update of the Cholesky factor of the Hessian matrix.
DFP      performs the original DFP update of the inverse Hessian matrix.
PB       performs the automatic restart update method of Powell (1977) and Beale (1972).
FR       performs the Fletcher-Reeves update (Fletcher 1987).
PR       performs the Polak-Ribiere update (Fletcher 1987).
CD       performs a conjugate-descent update of Fletcher (1987).
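Because update methods are tied to particular techniques, TECH= and UPDATE= are usually specified as a pair. The sketch below selects the conjugate gradient technique with the Polak-Ribiere update; the data set CHOICES, its variables, and the use of PROC QLIM with an NLOPTIONS statement are assumptions for illustration only.

```sas
/* Hypothetical example: pair a technique with a compatible update method.
   Per the algorithm descriptions, CONGRA takes PB, FR, PR, or CD;
   QUANEW takes DBFGS, DDFP, BFGS, or DFP; DBLDOG takes DBFGS or DDFP. */
proc qlim data=choices;
   model y = x1 x2 / discrete;
   nloptions tech=congra update=pr;
run;
```

Omitting UPDATE= falls back to the default for the chosen technique (PB for CONGRA, DBFGS for QUANEW and DBLDOG).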
XCONV=r[n] XTOL=r[n]
specifies the relative parameter convergence criterion. For all techniques except NMSIMP, termination requires a small relative parameter change in subsequent iterations,
$$\max_j \frac{|\theta_j^{(k)} - \theta_j^{(k-1)}|}{\max(|\theta_j^{(k)}|, |\theta_j^{(k-1)}|, \mathrm{XSIZE})} \le r$$
For the NMSIMP technique, the same formula is used, but $\theta_j^{(k)}$ is defined as the vertex with the lowest function value and $\theta_j^{(k-1)}$ is defined as the vertex with the highest function value in the simplex. The default value is r = 1E-8 for the NMSIMP technique and r = 0 otherwise. The optional integer value n specifies the number of successive iterations for which the criterion must be satisfied before the process can be terminated.
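The relative parameter criterion takes the largest relative change over all parameters. The DATA step below is a minimal sketch of that calculation for three parameters; the parameter values, XSIZE, and the threshold r are hypothetical, and the formula is applied here as the maximum over j of the per-parameter ratio.

```sas
/* Hand evaluation of the relative parameter convergence test.
   Parameter vectors and thresholds are hypothetical illustration values. */
data _null_;
   array theta_prev[3] _temporary_ (1.000, -0.5000, 20.00);  /* iteration k-1 */
   array theta_curr[3] _temporary_ (1.001, -0.5002, 20.01);  /* iteration k   */
   xsize = 0;       /* value of the XSIZE= option */
   r     = 1e-2;    /* threshold given by XCONV=  */

   worst = 0;
   do j = 1 to 3;
      ratio = abs(theta_curr[j] - theta_prev[j])
              / max(abs(theta_curr[j]), abs(theta_prev[j]), xsize);
      worst = max(worst, ratio);   /* keep the largest relative change */
   end;

   satisfied = (worst <= r);
   put worst= satisfied=;
run;
```

Here the first parameter changes the most in relative terms (about 1E-3), so the test passes for r = 1E-2. A positive XSIZE value guards against division by very small parameter values.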
XSIZE=r > 0
specifies the XSIZE parameter of the relative parameter termination criterion. The default value is r = 0. For more detail, see the XCONV= option.
Overview
There are several optimization techniques available. You can choose a particular optimizer with the TECH=name option in the PROC statement or NLOPTIONS statement.
Table 6.3 Optimization Techniques
Algorithm                                         TECH=
trust region method                               TRUREG
Newton-Raphson method with line search            NEWRAP
Newton-Raphson method with ridging                NRRIDG
quasi-Newton methods (DBFGS, DDFP, BFGS, DFP)     QUANEW
double-dogleg method (DBFGS, DDFP)                DBLDOG
conjugate gradient methods (PB, FR, PR, CD)       CONGRA
Nelder-Mead simplex method                        NMSIMP
No algorithm for optimizing general nonlinear functions exists that always finds the global optimum for a general nonlinear minimization problem in a reasonable amount of time. Since no single optimization technique is invariably superior to others, NLO provides a variety of optimization techniques that work well in various circumstances. However, you can devise problems for which none of the techniques in NLO can find the correct solution. Moreover, nonlinear optimization can be computationally expensive in terms of time and memory, so you must be careful when matching an algorithm to a problem.

All optimization techniques in NLO use $O(n^2)$ memory except the conjugate gradient methods, which use only $O(n)$ memory and are designed to optimize problems with many parameters. These iterative techniques require repeated computation of the following:
- the function value (optimization criterion)
- the gradient vector (first-order partial derivatives)
- for some techniques, the (approximate) Hessian matrix (second-order partial derivatives)

However, since each of the optimizers requires different derivatives, some computational efficiencies can be gained. Table 6.4 shows, for each optimization technique, which derivatives are required. (FOD means that first-order derivatives or the gradient is computed; SOD means that second-order derivatives or the Hessian is computed.)
Table 6.4 Optimization Computations
Algorithm    FOD    SOD
TRUREG       x      x
NEWRAP       x      x
NRRIDG       x      x
QUANEW       x      -
DBLDOG       x      -
CONGRA       x      -
NMSIMP       -      -
Each optimization method employs one or more convergence criteria that determine when it has converged. The various termination criteria are listed and described in the previous section. An algorithm is considered to have converged when any one of the convergence criteria is satisfied. For example, under the default settings, the QUANEW algorithm will converge if ABSGCONV < 1E-5, FCONV < $10^{-\mathrm{FDIGITS}}$, or GCONV < 1E-8.
less reliable. For example, they can more easily terminate at stationary points rather than at global optima.

A few general remarks about the various optimization techniques follow.
- The second-derivative methods TRUREG, NEWRAP, and NRRIDG are best for small problems where the Hessian matrix is not expensive to compute. Sometimes the NRRIDG algorithm can be faster than the TRUREG algorithm, but TRUREG can be more stable. The NRRIDG algorithm requires only one matrix with $n(n+1)/2$ double words; TRUREG and NEWRAP require two such matrices.
- The first-derivative methods QUANEW and DBLDOG are best for medium-sized problems where the objective function and the gradient are much faster to evaluate than the Hessian. The QUANEW and DBLDOG algorithms, in general, require more iterations than TRUREG, NRRIDG, and NEWRAP, but each iteration can be much faster. The QUANEW and DBLDOG algorithms require only the gradient to update an approximate Hessian, and they require slightly less memory than TRUREG or NEWRAP (essentially one matrix with $n(n+1)/2$ double words). QUANEW is the default optimization method.
- The first-derivative method CONGRA is best for large problems where the objective function and the gradient can be computed much faster than the Hessian and where too much memory is required to store the (approximate) Hessian. The CONGRA algorithm, in general, requires more iterations than QUANEW or DBLDOG, but each iteration can be much faster. Since CONGRA requires only a factor of n double-word memory, many large applications can be solved only by CONGRA.
- The no-derivative method NMSIMP is best for small problems where derivatives are not continuous or are very difficult to compute.
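The memory remarks are easier to weigh with numbers. The arithmetic below is a worked sketch for a hypothetical problem with n = 10,000 parameters, assuming that one double word occupies 8 bytes.

```latex
\begin{aligned}
\tfrac{n(n+1)}{2} &= \tfrac{10{,}000 \times 10{,}001}{2}
                   = 50{,}005{,}000 \ \text{double words} \\
50{,}005{,}000 \times 8 \ \text{bytes} &= 400{,}040{,}000 \ \text{bytes}
                   \approx 400 \ \text{MB per triangular matrix} \\
n \times 8 \ \text{bytes} &= 80{,}000 \ \text{bytes}
                   \approx 80 \ \text{KB per vector of length } n
\end{aligned}
```

Under these assumptions, TRUREG and NEWRAP would need roughly 800 MB for their two matrices and NRRIDG, QUANEW, and DBLDOG roughly 400 MB, whereas CONGRA needs only a small multiple of 80 KB. This is why CONGRA can be the only feasible choice for very large n.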
Algorithm Descriptions
Some details about the optimization techniques are as follows.
Trust Region Optimization (TRUREG)
The trust region method uses the gradient $g(\theta^{(k)})$ and the Hessian matrix $H(\theta^{(k)})$; thus, it requires that the objective function $f(\theta)$ have continuous first- and second-order derivatives inside the feasible region. The trust region method iteratively optimizes a quadratic approximation to the nonlinear objective function within a hyperelliptic trust region with radius $\Delta$ that constrains the step size that corresponds to the quality of the quadratic approximation. The trust region method is implemented using Dennis, Gay, and Welsch (1981), Gay (1983), and Moré and Sorensen (1983). The trust region method performs well for small- to medium-sized problems, and it does not need many function, gradient, and Hessian calls. However, if the computation of the Hessian matrix is computationally expensive, one of the (dual) quasi-Newton or conjugate gradient algorithms may be more efficient.
Newton-Raphson Optimization with Line Search (NEWRAP)
The NEWRAP technique uses the gradient $g(\theta^{(k)})$ and the Hessian matrix $H(\theta^{(k)})$; thus, it requires that the objective function have continuous first- and second-order derivatives inside the feasible region. If second-order derivatives are computed efficiently and precisely, the NEWRAP method can perform well for medium-sized to large problems, and it does not need many function, gradient, and Hessian calls. This algorithm uses a pure Newton step when the Hessian is positive definite and when the Newton step reduces the value of the objective function successfully. Otherwise, a combination of ridging and line search is performed to compute successful steps. If the Hessian is not positive definite, a multiple of the identity matrix is added to the Hessian matrix to make it positive definite. In each iteration, a line search is performed along the search direction to find an approximate optimum of the objective function. The default line-search method uses quadratic interpolation and cubic extrapolation (LIS=2).
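A small worked example shows why adding a multiple of the identity matrix (ridging) can restore positive definiteness. The 2-by-2 matrix and the ridge value below are illustrative assumptions; the text does not state how the multiple is chosen in practice.

```latex
H = \begin{pmatrix} 1 & 0 \\ 0 & -2 \end{pmatrix}
\quad\Longrightarrow\quad
H + \lambda I = \begin{pmatrix} 1 + \lambda & 0 \\ 0 & \lambda - 2 \end{pmatrix},
\qquad
H + \lambda I \ \text{is positive definite} \iff \lambda > 2 .
```

For instance, $\lambda = 3$ gives $\mathrm{diag}(4, 1)$. With a positive definite ridged matrix, the step $-(H + \lambda I)^{-1} g$ is a descent direction for a minimization problem, because $g^T (H + \lambda I)^{-1} g > 0$ for any nonzero gradient $g$.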
Newton-Raphson Ridge Optimization (NRRIDG)
The NRRIDG technique uses the gradient $g(\theta^{(k)})$ and the Hessian matrix $H(\theta^{(k)})$; thus, it requires that the objective function have continuous first- and second-order derivatives inside the feasible region. This algorithm uses a pure Newton step when the Hessian is positive definite and when the Newton step reduces the value of the objective function successfully. If at least one of these two conditions is not satisfied, a multiple of the identity matrix is added to the Hessian matrix. The NRRIDG method performs well for small- to medium-sized problems, and it does not require many function, gradient, and Hessian calls. However, if the computation of the Hessian matrix is computationally expensive, one of the (dual) quasi-Newton or conjugate gradient algorithms might be more efficient. Since the NRRIDG technique uses an orthogonal decomposition of the approximate Hessian, each iteration of NRRIDG can be slower than that of the NEWRAP technique, which works with Cholesky decomposition. Usually, however, NRRIDG requires fewer iterations than NEWRAP.
Quasi-Newton Optimization (QUANEW)
The (dual) quasi-Newton method uses the gradient $g(\theta^{(k)})$, and it does not need to compute second-order derivatives since they are approximated. It works well for medium to moderately large optimization problems where the objective function and the gradient are much faster to compute than the Hessian; but, in general, it requires more iterations than the TRUREG, NEWRAP, and NRRIDG techniques, which compute second-order derivatives. QUANEW is the default optimization algorithm because it provides an appropriate balance between the speed and stability required for most nonlinear mixed model applications.

The QUANEW technique is one of the following, depending upon the value of the UPDATE= option:
- the original quasi-Newton algorithm, which updates an approximation of the inverse Hessian
- the dual quasi-Newton algorithm, which updates the Cholesky factor of an approximate Hessian (default)

You can specify four update formulas with the UPDATE= option:
DBFGS    performs the dual Broyden, Fletcher, Goldfarb, and Shanno (BFGS) update of the Cholesky factor of the Hessian matrix. This is the default.
DDFP     performs the dual Davidon, Fletcher, and Powell (DFP) update of the Cholesky factor of the Hessian matrix.
BFGS     performs the original BFGS update of the inverse Hessian matrix.
DFP      performs the original DFP update of the inverse Hessian matrix.

In each iteration, a line search is performed along the search direction to find an approximate optimum. The default line-search method uses quadratic interpolation and cubic extrapolation to obtain a step size $\alpha$ satisfying the Goldstein conditions. One of the Goldstein conditions can be violated if the feasible region defines an upper limit of the step size. Violating the left-side Goldstein condition can affect the positive definiteness of the quasi-Newton update. In that case, either the update is skipped or the iterations are restarted with an identity matrix, resulting in the steepest descent or ascent search direction. You can specify line-search algorithms other than the default with the LIS= option.

The QUANEW algorithm performs its own line-search technique. All options and parameters (except the INSTEP= option) that control the line search in the other algorithms do not apply here. In several applications, large steps in the first iterations are troublesome. You can use the INSTEP= option to impose an upper bound for the step size $\alpha$ during the first five iterations. You can also use the INHESSIAN[=r] option to specify a different starting approximation for the Hessian. If you specify only the INHESSIAN option, the Cholesky factor of a (possibly ridged) finite difference approximation of the Hessian is used to initialize the quasi-Newton update process.

The values of the LCSINGULAR=, LCEPSILON=, and LCDEACT= options, which control the processing of linear and boundary constraints, are valid only for the quadratic programming subroutine used in each iteration of the QUANEW algorithm.
Double-Dogleg Optimization (DBLDOG)
The double-dogleg optimization method combines the ideas of the quasi-Newton and trust region methods. In each iteration, the double-dogleg algorithm computes the step $s^{(k)}$ as the linear combination of the steepest descent or ascent search direction $s_1^{(k)}$ and a quasi-Newton search direction $s_2^{(k)}$:
$$s^{(k)} = \alpha_1 s_1^{(k)} + \alpha_2 s_2^{(k)}$$
The step is requested to remain within a prespecified trust region radius; see Fletcher (1987, p. 107). Thus, the DBLDOG subroutine uses the dual quasi-Newton update but does not perform a line search. You can specify two update formulas with the UPDATE= option:
DBFGS    performs the dual Broyden, Fletcher, Goldfarb, and Shanno update of the Cholesky factor of the Hessian matrix. This is the default.
DDFP     performs the dual Davidon, Fletcher, and Powell update of the Cholesky factor of the Hessian matrix.

The double-dogleg optimization technique works well for medium to moderately large optimization problems where the objective function and the gradient are much faster to compute than the Hessian. The implementation is based on Dennis and Mei (1979) and Gay (1983), but it is extended for dealing with boundary and linear constraints. The DBLDOG technique generally requires more iterations than the TRUREG, NEWRAP, or NRRIDG technique, which requires second-order derivatives; however, each of the DBLDOG iterations is computationally cheap. Furthermore, the DBLDOG technique requires only gradient calls for the update of the Cholesky factor of an approximate Hessian.
Conjugate Gradient Optimization (CONGRA)
Second-order derivatives are not required by the CONGRA algorithm and are not even approximated. The CONGRA algorithm can be expensive in function and gradient calls, but it requires only $O(n)$ memory for unconstrained optimization. In general, many iterations are required to obtain a precise solution, but each of the CONGRA iterations is computationally cheap. You can specify four different update formulas for generating the conjugate directions by using the UPDATE= option:
PB    performs the automatic restart update method of Powell (1977) and Beale (1972). This is the default.
FR    performs the Fletcher-Reeves update (Fletcher 1987).
PR    performs the Polak-Ribiere update (Fletcher 1987).
CD    performs a conjugate-descent update of Fletcher (1987).

The default, UPDATE=PB, behaved best in most test examples. You are advised to avoid the option UPDATE=CD, which behaved worst in most test examples.

The CONGRA subroutine should be used for optimization problems with large n. For the unconstrained or boundary constrained case, CONGRA requires only $O(n)$ bytes of working memory, whereas all other optimization methods require order $O(n^2)$ bytes of working memory. During n successive iterations, uninterrupted by restarts or changes in the working set, the conjugate gradient algorithm computes a cycle of n conjugate search directions. In each iteration, a line search is performed along the search direction to find an approximate optimum of the objective function. The default line-search method uses quadratic interpolation and cubic extrapolation to obtain a step size $\alpha$ satisfying the Goldstein conditions. One of the Goldstein conditions can be violated if the feasible region defines an upper limit for the step size. Other line-search algorithms can be specified with the LIS= option.
Nelder-Mead Simplex Optimization (NMSIMP)
The Nelder-Mead simplex method does not use any derivatives and does not assume that the objective function has continuous derivatives. The objective function itself needs to be continuous. This technique is quite expensive in the number of function calls, and it might be unable to generate precise results for n much greater than 40. The original Nelder-Mead simplex algorithm is implemented and extended to boundary constraints. This algorithm does not compute the objective for infeasible points, but it changes the shape of the simplex by adapting to the nonlinearities of the objective function, which contributes to an increased speed of convergence. It uses a special termination criterion.
Remote Monitoring
The SAS/EmMonitor is an application for Windows that enables you to monitor and stop from your PC a CPU-intensive application performed by the NLO subsystem that runs on a remote server. On the server side, a FILENAME statement assigns a fileref to a SOCKET-type device that defines the IP address of the client and the port number for listening. The fileref is then specified in the SOCKET= option in the PROC statement to control the EmMonitor. The following statements show an example of server-side statements for PROC ENTROPY.
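   /* Simulate a small data set with three regressors and two responses */
   data one;
      do t = 1 to 10;
         x1 = 5 * ranuni(456);
         x2 = 10 * ranuni(456);
         x3 = 2 * rannor(1456);
         e1 = rannor(1456);
         e2 = rannor(4560);
         tmp1 = 0.5 * e1 - 0.1 * e2;
         tmp2 = -0.1 * e1 - 0.3 * e2;
         y1 = 7 + 8.5*x1 + 2*x2 + tmp1;
         y2 = -3 + -2*x1 + x2 + 3*x3 + tmp2;
         output;
      end;
   run;

   /* Assign a fileref to the client PC address and listening port */
   filename sock socket 'your.pc.address.com:6943';

   /* Pass the fileref to the procedure through the SOCKET= option */
   proc entropy data=one tech=tr gmenm gconv=2.e-5 socket=sock;
      model y1 = x1 x2 x3;
   run;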
On the client side, the EmMonitor application is started with the following syntax:
   EmMonitor options

The options are as follows:
-p port_number    defines the port number
-t title          defines the title of the EmMonitor window
-k                keeps the monitor alive when the iteration is completed
The default port number is 6943. The server does not need to be running when you start the EmMonitor, and you can start or dismiss the server at any time during the iteration process. You only need to remember the port number. Starting the PC client, or closing it prematurely, does not have any effect on the server side. In other words, the iteration process continues until one of the criteria for termination is met. Figure 6.1 through Figure 6.4 show screenshots of the application on the client side.
Figure 6.1 Graph Tab Group 0
Table 6.5

ODS Table Name               Description
ConvergenceStatus            Convergence status
InputOptions                 Input options
IterHist                     Iteration history
IterStart                    Iteration start
IterStop                     Iteration stop
Lagrange                     Lagrange multipliers at the solution
LinCon                       Linear constraints
LinConDel                    Deleted linear constraints
LinConSol                    Linear constraints at the solution
ParameterEstimatesResults    Estimates at the results
ParameterEstimatesStart      Estimates at the start of the iterations
ProblemDescription           Problem description
ProjGrad                     Projected gradients
References
Beale, E.M.L. (1972), "A Derivation of Conjugate Gradients," in Numerical Methods for Nonlinear Optimization, ed. F.A. Lootsma, London: Academic Press.
Dennis, J.E., Gay, D.M., and Welsch, R.E. (1981), "An Adaptive Nonlinear Least-Squares Algorithm," ACM Transactions on Mathematical Software, 7, 348–368.
Dennis, J.E. and Mei, H.H.W. (1979), "Two New Unconstrained Optimization Algorithms Which Use Function and Gradient Values," J. Optim. Theory Appl., 28, 453–482.
Dennis, J.E. and Schnabel, R.B. (1983), Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Englewood Cliffs, NJ: Prentice-Hall.
Fletcher, R. (1987), Practical Methods of Optimization, Second Edition, Chichester: John Wiley & Sons, Inc.
Gay, D.M. (1983), "Subroutines for Unconstrained Minimization," ACM Transactions on Mathematical Software, 9, 503–524.
Moré, J.J. (1978), "The Levenberg-Marquardt Algorithm: Implementation and Theory," in Lecture Notes in Mathematics 630, ed. G.A. Watson, Berlin-Heidelberg-New York: Springer Verlag.
Moré, J.J. and Sorensen, D.C. (1983), "Computing a Trust-region Step," SIAM Journal on Scientific and Statistical Computing, 4, 553–572.
Polak, E. (1971), Computational Methods in Optimization, New York: Academic Press.
Powell, M.J.D. (1977), "Restart Procedures for the Conjugate Gradient Method," Math. Prog., 12, 241–254.
Part II
Procedure Reference
Chapter 7
Missing Values and Autocorrelations
Estimation Details
Specifying Inputs and Transfer Functions
Initial Values
Stationarity and Invertibility
Naming of Model Parameters
Missing Values and Estimation and Forecasting
Forecasting Details
Forecasting Log Transformed Data
Specifying Series Periodicity
Detecting Outliers
OUT= Data Set
OUTCOV= Data Set
OUTEST= Data Set
OUTMODEL= SAS Data Set
OUTSTAT= Data Set
Printed Output
ODS Table Names
Statistical Graphics
Examples: ARIMA Procedure
   Example 7.1: Simulated IMA Model
   Example 7.2: Seasonal Model for the Airline Series
   Example 7.3: Model for Series J Data from Box and Jenkins
   Example 7.4: An Intervention Model for Ozone Data
   Example 7.5: Using Diagnostics to Identify ARIMA Models
   Example 7.6: Detection of Level Changes in the Nile River Data
   Example 7.7: Iterative Outlier Detection
References
The ARIMA procedure provides a comprehensive set of tools for univariate time series model identification, parameter estimation, and forecasting, and it offers great flexibility in the kinds of ARIMA or ARIMAX models that can be analyzed. The ARIMA procedure supports seasonal, subset, and factored ARIMA models; intervention or interrupted time series models; multiple regression analysis with ARMA errors; and rational transfer function models of any complexity. The design of PROC ARIMA closely follows the Box-Jenkins strategy for time series modeling with features for the identification, estimation and diagnostic checking, and forecasting steps of the Box-Jenkins method. Before you use PROC ARIMA, you should be familiar with Box-Jenkins methods, and you should exercise care and judgment when you use the ARIMA procedure. The ARIMA class of time series models is complex and powerful, and some degree of expertise is needed to use them correctly.
3. In the forecasting stage, you use the FORECAST statement to forecast future values of the time series and to generate confidence intervals for these forecasts from the ARIMA model produced by the preceding ESTIMATE statement. These three steps are explained further and illustrated through an extended example in the following sections.
Identification Stage
Suppose you have a variable called SALES that you want to forecast. The following example illustrates ARIMA modeling and forecasting by using a simulated data set TEST that contains a time series SALES generated by an ARIMA(1,1,1) model. The output produced by this example is explained in the following sections. The simulated SALES series is shown in Figure 7.1.
proc sgplot data=test;
   scatter y=sales x=date;
run;
Descriptive Statistics
The IDENTIFY statement first prints descriptive statistics for the SALES series. This part of the IDENTIFY statement output is shown in Figure 7.2.
Figure 7.2 IDENTIFY Statement Descriptive Statistics Output
The ARIMA Procedure
Name of Variable = sales
Mean of Working Series    137.3662
Standard Deviation        17.36385
Number of Observations         100

The IDENTIFY statement next produces a panel of plots used for its autocorrelation and trend analysis. The panel contains the following plots: the time series plot of the series, the sample autocorrelation function plot (ACF), the sample inverse autocorrelation function plot (IACF), and the sample partial autocorrelation function plot (PACF). This correlation analysis panel is shown in Figure 7.3.
These autocorrelation function plots show the degree of correlation with past values of the series as a function of the number of periods in the past (that is, the lag) at which the correlation is computed. The NLAG= option controls the number of lags for which the autocorrelations are shown. By default, the autocorrelation functions are plotted to lag 24. Most books on time series analysis explain how to interpret the autocorrelation and the partial autocorrelation plots. See the section "The Inverse Autocorrelation Function" on page 239 for a discussion of the inverse autocorrelation plots. By examining these plots, you can judge whether the series is stationary or nonstationary. In this case, a visual inspection of the autocorrelation function plot indicates that the SALES series is nonstationary, since the ACF decays very slowly. For more formal stationarity tests, use the STATIONARITY= option. (See the section "Stationarity" on page 210.)
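As a minimal sketch of the two options mentioned above, the following IDENTIFY statement shortens the correlation plots and requests augmented Dickey-Fuller tests. The lag count of 12 and the autoregressive orders 0, 1, and 2 are arbitrary values chosen for illustration; the data set TEST and the variable SALES are those of the running example.

```sas
proc arima data=test;
   /* NLAG=12 limits the correlation plots to 12 lags (illustrative value). */
   /* ADF=(0,1,2) requests augmented Dickey-Fuller tests at AR orders 0-2   */
   /* (illustrative orders).                                                */
   identify var=sales nlag=12 stationarity=(adf=(0,1,2));
run;
```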
The last part of the default IDENTIFY statement output is the check for white noise. This is an approximate statistical test of the hypothesis that none of the autocorrelations of the series up to a given lag are significantly different from 0. If this is true for all lags, then there is no information in the series to model, and no ARIMA model is needed for the series. The autocorrelations are checked in groups of six, and the number of lags checked depends on the NLAG= option. The check for white noise output is shown in Figure 7.4.
Figure 7.4 IDENTIFY Statement Check for White Noise
Autocorrelation Check for White Noise

To Lag  Chi-Square  DF  Pr > ChiSq  ----------------Autocorrelations----------------
     6      426.44   6      <.0001   0.957   0.907   0.852   0.791   0.726   0.659
    12      547.82  12      <.0001   0.588   0.514   0.440   0.370   0.303   0.238
    18      554.70  18      <.0001   0.174   0.112   0.052  -0.004  -0.054  -0.098
    24      585.73  24      <.0001  -0.135  -0.167  -0.192  -0.211  -0.227  -0.240
In this case, the white noise hypothesis is rejected very strongly, which is expected since the series is nonstationary. The p-value for the test of the first six autocorrelations is printed as <.0001, which means the p-value is less than 0.0001.
The second IDENTIFY statement produces the same information as the first, but for the change in SALES from one period to the next rather than for the total SALES in each period. The summary statistics output from this IDENTIFY statement is shown in Figure 7.5. Note that the period of differencing is given as 1, and one observation was lost through the differencing operation.
Figure 7.5 IDENTIFY Statement Output for Differenced Series
The ARIMA Procedure
Name of Variable = sales
Period(s) of Differencing                           1
Mean of Working Series                       0.660589
Standard Deviation                           2.011543
Number of Observations                             99
Observation(s) eliminated by differencing           1
The autocorrelation plots for the differenced series are shown in Figure 7.6.
Figure 7.6 Correlation Analysis of the Change in SALES
The autocorrelations decrease rapidly in this plot, indicating that the change in SALES is a stationary time series. The next step in the Box-Jenkins methodology is to examine the patterns in the autocorrelation plot to choose candidate ARMA models to fit to the series. The partial and inverse autocorrelation function plots are also useful aids in identifying appropriate ARMA models for the series.
In the usual Box-Jenkins approach to ARIMA modeling, the sample autocorrelation function, inverse autocorrelation function, and partial autocorrelation function are compared with the theoretical correlation functions expected from different kinds of ARMA models. This matching of theoretical autocorrelation functions of different ARMA models to the sample autocorrelation functions computed from the response series is the heart of the identification stage of Box-Jenkins modeling. Most textbooks on time series analysis, such as Pankratz (1983), discuss the theoretical autocorrelation functions for different kinds of ARMA models. Since the input data are only a limited sample of the series, the sample autocorrelation functions computed from the input series only approximate the true autocorrelation function of the process that generates the series. This means that the sample autocorrelation functions do not exactly match the theoretical autocorrelation functions for any ARMA model and can have a pattern similar to that of several different ARMA models.
If the series is white noise (a purely random process), then there is no need to fit a model. The check for white noise, shown in Figure 7.7, indicates that the change in SALES is highly autocorrelated. Thus, an autocorrelation model, for example an AR(1) model, might be a good candidate model to fit to this process.
Figure 7.7 IDENTIFY Statement Check for White Noise
Autocorrelation Check for White Noise

To Lag  Chi-Square  DF  Pr > ChiSq  ----------------Autocorrelations----------------
     6      154.44   6      <.0001   0.828   0.591   0.454   0.369   0.281   0.198
    12      173.66  12      <.0001   0.151   0.081  -0.039  -0.141  -0.210  -0.274
    18      209.64  18      <.0001  -0.305  -0.271  -0.218  -0.183  -0.174  -0.161
    24      218.04  24      <.0001  -0.144  -0.141  -0.125  -0.085  -0.040  -0.032
The ESTIMATE statement fits the model to the data and prints parameter estimates and various diagnostic statistics that indicate how well the model fits the data. The first part of the ESTIMATE statement output, the table of parameter estimates, is shown in Figure 7.8.
Parameter  Lag
MU           0
AR1,1        1
The table of parameter estimates is titled "Conditional Least Squares Estimation," which indicates the estimation method used. You can request different estimation methods with the METHOD= option. The table of parameter estimates lists the parameters in the model; for each parameter, the table shows the estimated value and the standard error and t value for the estimate. The table also indicates the lag at which the parameter appears in the model. In this case, there are two parameters in the model. The mean term is labeled MU; its estimated value is 0.90280. The autoregressive parameter is labeled AR1,1; this is the coefficient of the lagged value of the change in SALES, and its estimate is 0.86847. The t values provide significance tests for the parameter estimates and indicate whether some terms in the model might be unnecessary. In this case, the t value for the autoregressive parameter is 15.83, so this term is highly significant. The t value for MU indicates that the mean term adds little to the model. The standard error estimates are based on large sample theory. Thus, the standard errors are labeled as approximate, and the standard errors and t values might not be reliable in small samples. The next part of the ESTIMATE statement output is a table of goodness-of-fit statistics, which aid in comparing this model to other models. This output is shown in Figure 7.9.
Figure 7.9 Goodness-of-Fit Statistics for AR(1) Model
Constant Estimate      0.118749
Variance Estimate       1.15794
Std Error Estimate     1.076076
AIC                    297.4469
SBC                    302.6372
Number of Residuals          99

The "Constant Estimate" is a function of the mean term MU and the autoregressive parameters. This estimate is computed only for AR or ARMA models, but not for strictly MA models. See the section "General Notation for ARIMA Models" on page 207 for an explanation of the constant estimate. The "Variance Estimate" is the variance of the residual series, which estimates the innovation variance. The item labeled "Std Error Estimate" is the square root of the variance estimate. In general, when you are comparing candidate models, smaller AIC and SBC statistics indicate the better fitting model. The section "Estimation Details" on page 248 explains the AIC and SBC statistics. The ESTIMATE statement next prints a table of correlations of the parameter estimates, as shown in Figure 7.10. This table can help you assess the extent to which collinearity might have influenced the results. If two parameter estimates are very highly correlated, you might consider dropping one of them from the model.
Figure 7.10 Correlations of the Estimates for AR(1) Model
Correlations of Parameter Estimates

Parameter      MU   AR1,1
MU          1.000   0.114
AR1,1       0.114   1.000
The next part of the ESTIMATE statement output is a check of the autocorrelations of the residuals. This output has the same form as the autocorrelation check for white noise that the IDENTIFY statement prints for the response series. The autocorrelation check of residuals is shown in Figure 7.11.
Figure 7.11 Check for White Noise Residuals for AR(1) Model
Autocorrelation Check of Residuals

To Lag  Chi-Square  DF  Pr > ChiSq  ----------------Autocorrelations----------------
     6       19.09   5      0.0019   0.327  -0.220  -0.128   0.068  -0.002  -0.096
    12       22.90  11      0.0183   0.072   0.116  -0.042  -0.066   0.031  -0.091
    18       31.63  17      0.0167  -0.233  -0.129  -0.024   0.056  -0.014  -0.008
    24       32.83  23      0.0841   0.009  -0.057  -0.057  -0.001   0.049  -0.015
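As a rough check on the first row of Figure 7.11, the chi-square value can be reproduced from the six printed residual autocorrelations. This worked calculation assumes the statistic has the Ljung-Box form, with n = 99 residuals and r_k the lag-k residual autocorrelation; the form is an assumption here rather than something stated in this section.

$$Q_6 = n(n+2)\sum_{k=1}^{6}\frac{r_k^2}{n-k} = 99 \times 101 \times \left(\frac{0.327^2}{98} + \frac{0.220^2}{97} + \frac{0.128^2}{96} + \frac{0.068^2}{95} + \frac{0.002^2}{94} + \frac{0.096^2}{93}\right) \approx 19.08$$

The result agrees with the printed 19.09 up to rounding of the displayed autocorrelations. The 5 degrees of freedom are consistent with 6 lags minus the one estimated autoregressive parameter.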
The $\chi^2$ test statistics for the residuals series indicate whether the residuals are uncorrelated (white noise) or contain additional information that might be used by a more complex model. In this case, the test statistics reject the no-autocorrelation hypothesis at a high level of significance (p = 0.0019 for the first six lags). This means that the residuals are not white noise, and so the AR(1) model is not a fully adequate model for this series. The ESTIMATE statement output also includes graphical analysis of the residuals. It is not shown here. The graphical analysis also reveals the inadequacy of the AR(1) model. The final part of the ESTIMATE statement output is a listing of the estimated model, using the backshift notation. This output is shown in Figure 7.12.
This listing combines the differencing specification given in the IDENTIFY statement with the parameter estimates of the model for the change in SALES. Since the AR(1) model is for the change in SALES, the final model for SALES is an ARIMA(1,1,0) model. Using B, the backshift operator, the mathematical form of the estimated model shown in this output is as follows:
$$(1 - B)\,\mathrm{sales}_t = 0.902799 + \frac{1}{(1 - 0.86847B)}\,a_t$$
See the section "General Notation for ARIMA Models" on page 207 for further explanation of this notation.
The parameter estimates table and goodness-of-fit statistics for this model are shown in Figure 7.13.
[Figure 7.13: parameter estimates at lags 0, 1, and 1, followed by the Constant Estimate, Variance Estimate, Std Error Estimate, AIC, SBC, and Number of Residuals]
The moving-average parameter estimate, labeled MA1,1, is -0.58935. Both the moving-average and the autoregressive parameters have significant t values. Note that the variance estimate, AIC, and SBC are all smaller than they were for the AR(1) model, indicating that the ARMA(1,1) model fits the data better without over-parameterizing. The graphical check of the residuals from this model is shown in Figure 7.14 and Figure 7.15. The residual correlation and white noise test plots show that you cannot reject the hypothesis that the residuals are uncorrelated. The normality plots also show no departure from normality. Thus, you conclude that the ARMA(1,1) model is adequate for the change in SALES series, and there is no point in trying more complex models.
Figure 7.14 White Noise Check of Residuals for the ARMA(1,1) Model
The form of the estimated ARIMA(1,1,1) model for SALES is shown in Figure 7.16.
Figure 7.16 Estimated ARIMA(1,1,1) Model for SALES
Model for variable sales
Estimated Mean               0.892875
Period(s) of Differencing           1

The estimated model shown in this output is
$$(1 - B)\,\mathrm{sales}_t = 0.892875 + \frac{(1 + 0.58935B)}{(1 - 0.74755B)}\,a_t$$
In addition to the residual analysis of a model, it is often useful to check whether there are any changes in the time series that are not accounted for by the currently estimated model. The OUTLIER statement enables you to detect such changes. For a long series, this task can be computationally burdensome. Therefore, in general, it is better done after a model that fits the data reasonably well has been found. Figure 7.17 shows the output of the simplest form of the OUTLIER statement:
outlier;
run;
Two possible outliers have been found for the model in question. See the section "Detecting Outliers" on page 260, and Example 7.6 and Example 7.7, for more details about modeling in the presence of outliers. In this illustration these outliers are not discussed any further.
Figure 7.17 Outliers for the ARIMA(1,1,1) Model for SALES
The ARIMA Procedure

Outlier Detection Summary
Maximum number searched       2
Number found                  2
Significance used          0.05

Outlier Details
Obs  Chi-Square  Approx Prob>ChiSq
 10        4.20             0.0403
 67        4.42             0.0355

Since the model diagnostic tests show that all the parameter estimates are significant and the residual series is white noise, the estimation and diagnostic checking stage is complete. You can now proceed to forecasting the SALES series with this ARIMA(1,1,1) model.
Forecasting Stage
To produce the forecast, use a FORECAST statement after the ESTIMATE statement for the model you decide is best. If the last model fit is not the best, then repeat the ESTIMATE statement for the best model before you use the FORECAST statement. Suppose that the SALES series is monthly, that you want to forecast one year ahead from the most recently available SALES figure, and that the dates for the observations are given by a variable DATE in the input data set TEST. You use the following FORECAST statement:
forecast lead=12 interval=month id=date out=results;
run;
The LEAD= option specifies how many periods ahead to forecast (12 months, in this case). The ID= option specifies the ID variable, which is typically a SAS date, time, or datetime variable, used to date the observations of the SALES time series. The INTERVAL= option indicates that data are monthly and enables PROC ARIMA to extrapolate DATE values for forecast periods. The OUT= option writes the forecasts to the output data set RESULTS. See the section "OUT= Data Set" on page 262 for information about the contents of the output data set. By default, the FORECAST statement also prints and plots the forecast values, as shown in Figure 7.18 and Figure 7.19. The forecast table shows for each forecast period the observation number, forecast value, standard error estimate for the forecast value, and lower and upper limits for a 95% confidence interval for the forecast.
Figure 7.18 Forecasts for ARIMA(1,1,1) Model for SALES
The ARIMA Procedure
Forecasts for variable sales

Obs   Forecast  Std Error   95% Confidence Limits
101   171.0320     0.9508   169.1684     172.8955
102   174.7534     2.4168   170.0165     179.4903
103   177.7608     3.9879   169.9445     185.5770
104   180.2343     5.5658   169.3256     191.1430
105   182.3088     7.1033   168.3866     196.2310
106   184.0850     8.5789   167.2707     200.8993
107   185.6382     9.9841   166.0698     205.2066
108   187.0247    11.3173   164.8433     209.2061
109   188.2866    12.5807   163.6289     212.9443
110   189.4553    13.7784   162.4501     216.4605
111   190.5544    14.9153   161.3209     219.7879
112   191.6014    15.9964   160.2491     222.9538
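To see how the confidence limits in Figure 7.18 relate to the other columns, the following worked calculation for observation 101 assumes that the limits are the forecast plus or minus the standard normal 0.975 quantile (about 1.96) times the standard error. The quantile is an assumption made for this illustration; it is not printed in the output.

$$171.0320 \pm 1.96 \times 0.9508 \approx 171.0320 \pm 1.8636 \;\Rightarrow\; (169.1684,\ 172.8956)$$

These values agree with the printed limits 169.1684 and 172.8955 up to rounding. The interval widens at longer leads because the standard error grows with the forecast horizon.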
Normally, you want the forecast values stored in an output data set, and you are not interested in seeing this printed list of the forecast. You can use the NOPRINT option in the FORECAST statement to suppress this output.
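A minimal sketch of the same FORECAST statement with the printed list suppressed, assuming it follows the ESTIMATE statement of the running example:

```sas
/* NOPRINT suppresses the printed forecast table; */
/* the forecasts are still written to RESULTS.    */
forecast lead=12 interval=month id=date out=results noprint;
run;
```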
with a RUN statement. The output for each statement or group of statements is produced when the RUN statement is entered. A RUN statement does not terminate the PROC ARIMA step but tells the procedure to execute the statements given so far. You can end PROC ARIMA by submitting a QUIT statement, a DATA step, another PROC step, or an ENDSAS statement. The example in the preceding section illustrates the interactive use of ARIMA procedure statements. The complete PROC ARIMA program for that example is as follows:
proc arima data=test;
   identify var=sales nlag=24;
   run;
   identify var=sales(1);
   run;
   estimate p=1;
   run;
   estimate p=1 q=1;
   run;
   outlier;
   run;
   forecast lead=12 interval=month id=date out=results;
   run;
quit;
If no differencing is done (d = 0), the models are usually referred to as ARMA(p, q) models. The final model in the preceding example is an ARIMA(1,1,1) model since the IDENTIFY statement specified d = 1, and the final ESTIMATE statement specified p = 1 and q = 1.
$t$ indexes time
$W_t$ is the response series $Y_t$ or a difference of the response series
$\mu$ is the mean term
$B$ is the backshift operator; that is, $B X_t = X_{t-1}$
$\phi(B)$ is the autoregressive operator, represented as a polynomial in the backshift operator: $\phi(B) = 1 - \phi_1 B - \ldots - \phi_p B^p$
$\theta(B)$ is the moving-average operator, represented as a polynomial in the backshift operator: $\theta(B) = 1 - \theta_1 B - \ldots - \theta_q B^q$
$a_t$ is the independent disturbance, also called the random error
The series $W_t$ is computed by the IDENTIFY statement and is the series processed by the ESTIMATE statement. Thus, $W_t$ is either the response series $Y_t$ or a difference of $Y_t$ specified by the differencing operators in the IDENTIFY statement. For simple (nonseasonal) differencing, $W_t = (1 - B)^d Y_t$. For seasonal differencing, $W_t = (1 - B)^d (1 - B^s)^D Y_t$, where d is the degree of nonseasonal differencing, D is the degree of seasonal differencing, and s is the length of the seasonal cycle. For example, the mathematical form of the ARIMA(1,1,1) model estimated in the preceding example is
$$(1 - B)\,Y_t = \mu + \frac{(1 - \theta_1 B)}{(1 - \phi_1 B)}\,a_t$$
$$\phi(B)(W_t - \mu) = \theta(B)\,a_t$$
Thus, when an autoregressive operator and a mean term are both included in the model, the constant term for the model can be represented as $\phi(B)\mu = \mu - \phi_1\mu - \ldots - \phi_p\mu$. This value is printed with the label "Constant Estimate" in the ESTIMATE statement output.
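As a worked check, the AR(1) fit reported earlier for the change in SALES has MU = 0.90280 and AR1,1 = 0.86847. Assuming the constant for a single autoregressive parameter is the mean term multiplied by one minus that parameter, the calculation is:

$$\mu\,(1 - \phi_1) = 0.90280 \times (1 - 0.86847) = 0.90280 \times 0.13153 \approx 0.11875$$

This agrees with the Constant Estimate of 0.118749 printed for the AR(1) model to about five decimal places; the small remaining difference presumably reflects rounding in the displayed parameter estimates.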
where
$X_{i,t}$ is the ith input time series or a difference of the ith input series at time t
$k_i$ is the pure time delay for the effect of the ith input series
$\omega_i(B)$ is the numerator polynomial of the transfer function for the ith input series
$\delta_i(B)$ is the denominator polynomial of the transfer function for the ith input series.
The model can also be written more compactly as
$$W_t = \mu + \sum_i \Psi_i(B)\,X_{i,t} + n_t$$
where
$\Psi_i(B)$ is the transfer function for the ith input series modeled as a ratio of the $\omega$ and $\delta$ polynomials: $\Psi_i(B) = (\omega_i(B)/\delta_i(B))\,B^{k_i}$
$n_t$ is the noise series: $n_t = (\theta(B)/\phi(B))\,a_t$
This model expresses the response series as a combination of past values of the random shocks and past values of other input series. The response series is also called the dependent series or output series. An input time series is also referred to as an independent series or a predictor series. Response variable, dependent variable, independent variable, or predictor variable are other terms often used.
1 .B/ 2 .B/
When an ARIMA model is expressed in factored form, the order of the model is usually expressed by using a factored notation also. The order of an ARIMA model expressed as the product of two factors is denoted as ARIMA(p,d,q)$\times$(P,D,Q).
the number of observations in a seasonal cycle: 12 for monthly series, 4 for quarterly series, 7 for daily series with day-of-week effects, and so forth. For example, the notation ARIMA(0,1,2)$\times$(0,1,1)$_{12}$ describes a seasonal ARIMA model for monthly data with the following mathematical form:
$$(1 - B)(1 - B^{12})\,Y_t = \mu + (1 - \theta_{1,1}B - \theta_{1,2}B^2)(1 - \theta_{2,1}B^{12})\,a_t$$
Stationarity
The noise (or residual) series for an ARMA model must be stationary, which means that both the expected values of the series and its autocovariance function are independent of time. The standard way to check for nonstationarity is to plot the series and its autocorrelation function. You can visually examine a graph of the series over time to see if it has a visible trend or if its variability changes noticeably over time. If the series is nonstationary, its autocorrelation function will usually decay slowly. Another way of checking for stationarity is to use the stationarity tests described in the section "Stationarity Tests" on page 246.
Most time series are nonstationary and must be transformed to a stationary series before the ARIMA modeling process can proceed. If the series has a nonstationary variance, taking the log of the series can help. You can compute the log values in a DATA step and then analyze the log values with PROC ARIMA. If the series has a trend over time, seasonality, or some other nonstationary pattern, the usual solution is to take the difference of the series from one period to the next and then analyze this differenced series. Sometimes a series might need to be differenced more than once or differenced at lags greater than one period. (If the trend or seasonal effects are very regular, the introduction of explanatory variables can be an appropriate alternative to differencing.)
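The log transformation described above might be sketched as follows. The data set name TEST_LOG and the variable name LOGSALES are hypothetical names introduced for this illustration, and the sketch assumes that all SALES values are positive.

```sas
/* Compute the log of SALES in a DATA step        */
/* (TEST_LOG and LOGSALES are hypothetical names) */
data test_log;
   set test;
   logsales = log(sales);
run;

/* Analyze the first difference of the logged series */
proc arima data=test_log;
   identify var=logsales(1);
run;
```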
Differencing
Differencing of the response series is specified with the VAR= option of the IDENTIFY statement by placing a list of differencing periods in parentheses after the variable name. For example, to take a simple first difference of the series SALES, use the statement
identify var=sales(1);
In this example, the change in SALES from one period to the next is analyzed. A deterministic seasonal pattern also causes the series to be nonstationary, since the expected value of the series is not the same for all time periods but is higher or lower depending on the season. When the series has a seasonal pattern, you might want to difference the series at a lag that corresponds to the length of the seasonal cycle. For example, if SALES is a monthly series, the statement
identify var=sales(12);
takes a seasonal difference of SALES, so that the series analyzed is the change in SALES from its value in the same month one year ago. To take a second difference, add another differencing period to the list. For example, the following statement takes the second difference of SALES:
identify var=sales(1,1);
That is, SALES is differenced once at lag 1 and then differenced again, also at lag 1. The statement
identify var=sales(2);
creates a 2-span difference; that is, current period SALES minus SALES from two periods ago. The statement
identify var=sales(1,12);
takes a second-order difference of SALES, so that the series analyzed is the difference between the current period-to-period change in SALES and the change 12 periods ago. You might want to do this if the series had both a trend over time and a seasonal pattern. There is no limit to the order of differencing and the degree of lagging for each difference. Differencing not only affects the series used for the IDENTIFY statement output but also applies to any following ESTIMATE and FORECAST statements. ESTIMATE statements fit ARMA models to the differenced series. FORECAST statements forecast the differences and automatically sum these differences back to undo the differencing operation specified by the IDENTIFY statement, thus producing the final forecast result. Differencing of input series is specified by the CROSSCORR= option and works just like differencing of the response series. For example, the statement
identify var=y(1) crosscorr=(x1(1) x2(1));
takes the first difference of Y, the first difference of X1, and the first difference of X2. Whenever X1 and X2 are used in INPUT= options in following ESTIMATE statements, these names refer to the differenced series.
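The differencing specifications discussed in this section can be written out in backshift notation, with $Y_t$ standing for SALES at time t. The expansions below follow directly from multiplying out the operators:

$$
\begin{aligned}
\texttt{var=sales(1)}: &\quad (1-B)\,Y_t = Y_t - Y_{t-1} \\
\texttt{var=sales(12)}: &\quad (1-B^{12})\,Y_t = Y_t - Y_{t-12} \\
\texttt{var=sales(1,1)}: &\quad (1-B)^2\,Y_t = Y_t - 2Y_{t-1} + Y_{t-2} \\
\texttt{var=sales(2)}: &\quad (1-B^{2})\,Y_t = Y_t - Y_{t-2} \\
\texttt{var=sales(1,12)}: &\quad (1-B)(1-B^{12})\,Y_t = Y_t - Y_{t-1} - Y_{t-12} + Y_{t-13}
\end{aligned}
$$

Note that SALES(1,1) and SALES(2) are different series: the first applies the lag-1 difference twice, while the second takes a single difference at lag 2.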
Subset Models
You can control which lags have parameters by specifying the P= or Q= option as a list of lags in parentheses. A model that includes parameters for only some lags is sometimes called a subset or additive model. For example, consider the following two ESTIMATE statements:
identify var=sales;
estimate p=4;
estimate p=(1 4);
Both specify AR(4) models, but the first has parameters for lags 1, 2, 3, and 4, while the second has parameters for lags 1 and 4, with the coefficients for lags 2 and 3 constrained to 0. The mathematical form of the autoregressive models produced by these two specifications is shown in Table 7.1.
Table 7.1 Saturated versus Subset Models
Option     Autoregressive Operator
P=4        $(1 - \phi_1 B - \phi_2 B^2 - \phi_3 B^3 - \phi_4 B^4)$
P=(1 4)    $(1 - \phi_1 B - \phi_4 B^4)$
Seasonal Models
One particularly useful kind of subset model is a seasonal model. When the response series has a seasonal pattern, the values of the series at the same time of year in previous years can be important for modeling the series. For example, if the series SALES is observed monthly, the statements
identify var=sales;
estimate p=(12);
model SALES as an average value plus some fraction of its deviation from this average value a year ago, plus a random error. Although this is an AR(12) model, it has only one autoregressive parameter.
Factored Models
A factored model (also referred to as a multiplicative model) represents the ARIMA model as a product of simpler ARIMA models. For example, you might model SALES as a combination of an AR(1) process that reflects short term dependencies and an AR(12) model that reflects the seasonal pattern. It might seem that the way to do this is with the option P=(1 12), but the AR(1) process also operates in past years; you really need autoregressive parameters at lags 1, 12, and 13. You can specify a subset model with separate parameters at these lags, or you can specify a factored model that represents the model as the product of an AR(1) model and an AR(12) model. Consider the following two ESTIMATE statements:
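A sketch of the two specifications discussed here, assuming the same IDENTIFY statement as in the preceding subset and seasonal examples. The first lists lags 1, 12, and 13 in a single factor; the second uses the factored form P=(1)(12).

```sas
identify var=sales;
estimate p=(1 12 13);   /* subset model: separate parameters at lags 1, 12, and 13 */
estimate p=(1)(12);     /* factored model: AR(1) factor times AR(12) factor        */
```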
The mathematical form of the autoregressive models produced by these two specifications is shown in Table 7.2.
Table 7.2 Subset versus Factored Models
Option         Autoregressive Operator
P=(1 12 13)    $(1 - \phi_1 B - \phi_{12} B^{12} - \phi_{13} B^{13})$
P=(1)(12)      $(1 - \phi_1 B)(1 - \phi_{12} B^{12})$
Both models fit by these two ESTIMATE statements predict SALES from its values 1, 12, and 13 periods ago, but they use different parameterizations. The first model has three parameters, whose meanings may be hard to interpret. The factored specification P=(1)(12) represents the model as the product of two different AR models. It has only two parameters: one that corresponds to recent effects and one that represents seasonal effects. Thus the factored model is more parsimonious, and its parameter estimates are more clearly interpretable.
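Multiplying out the factored autoregressive operator shows why lag 13 appears and how the two parameterizations are related:

$$(1 - \phi_1 B)(1 - \phi_{12} B^{12}) = 1 - \phi_1 B - \phi_{12} B^{12} + \phi_1\phi_{12} B^{13}$$

Compared with the subset operator $(1 - \phi_1 B - \phi_{12} B^{12} - \phi_{13} B^{13})$, the factored model is the special case in which the lag-13 coefficient is not free but is constrained to $\phi_{13} = -\phi_1\phi_{12}$.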
This example performs a simple linear regression of SALES on PRICE; it produces the same results as PROC REG or another SAS regression procedure. The mathematical form of the model estimated by these statements is
$$Y_t = \mu + \omega_0 X_t + a_t$$
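A regression of this kind might be specified as in the following sketch, which follows the pattern of the multiple regression example later in this section. The data set A and the variables SALES and PRICE are assumed to match that example.

```sas
proc arima data=a;
   /* List PRICE in CROSSCORR= so that it can be used as an input */
   identify var=sales crosscorr=price;
   /* No P= or Q= option: regression on PRICE with white noise errors */
   estimate input=price;
run;
```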
The parameter estimates table for this example (using simulated data) is shown in Figure 7.20. The intercept parameter is labeled MU. The regression coefficient for PRICE is labeled NUM1. (See the section "Naming of Model Parameters" on page 256 for information about how parameters for input series are named.)
Figure 7.20 Parameter Estimates Table for Regression Model
The ARIMA Procedure
Conditional Least Squares Estimation

Parameter  Standard Error  Approx Pr > |t|  Lag  Shift
MU                2.99463           <.0001    0      0
NUM1              0.02885           <.0001    0      0
Any number of input variables can be used in a model. For example, the following statements fit a multiple regression of SALES on PRICE and INCOME:
proc arima data=a;
   identify var=sales crosscorr=(price income);
   estimate input=(price income);
run;
The mathematical form of the regression model estimated by these statements is
$$Y_t = \mu + \omega_1 X_{1,t} + \omega_2 X_{2,t} + a_t$$
C at
proc arima data=a;
   identify var=sales crosscorr=(price income) noprint;
   estimate input=(price income) plot;
   run;
   estimate p=1 q=1 input=(price income);
run;
In this example, the IDENTIFY statement includes the NOPRINT option since the autocorrelation plots for the response series are not useful when you know that the response series depends on input series. The first ESTIMATE statement fits the regression model with no model for the noise process. The PLOT option produces plots of the autocorrelation function, inverse autocorrelation function, and partial autocorrelation function for the residual series of the regression on PRICE and INCOME. By examining the PLOT option output for the residual series, you verify that the residual series is stationary and identify an ARMA(1,1) model for the noise process. The second ESTIMATE statement fits the final model. Although this discussion addresses regression models, the same remarks apply to identifying an ARIMA model for the noise process in models that include input series with complex transfer functions.
Impulse Interventions
The intervention can be a one-time event. For example, you might want to study the effect of a short-term advertising campaign on the sales of a product. In this case, the input variable has the value of 1 for the period during which the advertising campaign took place and the value 0 for all other periods. Intervention variables of this kind are sometimes called impulse functions or pulse functions. Suppose that SALES is a monthly series, and a special advertising effort was made during the month of March 1992. The following statements estimate the effect of this intervention by assuming an ARMA(1,1) model for SALES. The model is specified just like the regression model, but the intervention variable AD is constructed in the DATA step as a zero-one indicator for the month of the advertising effort.
data a;
   set a;
   ad = (date = '1mar1992'd);
run;
proc arima data=a;
   identify var=sales crosscorr=ad;
   estimate p=1 q=1 input=ad;
run;
Continuing Interventions
Other interventions can be continuing, in which case the input variable flags periods before and after the intervention. For example, you might want to study the effect of a change in tax rates on some economic measure. Another example is a study of the effect of a change in speed limits on the rate of traffic fatalities. In this case, the input variable has the value 1 after the new speed limit went into effect and the value 0 before. Intervention variables of this kind are called step functions. Another example is the effect of news on product demand. Suppose it was reported in July 1996 that consumption of the product prevents heart disease (or causes cancer), and SALES is consistently higher (or lower) thereafter. The following statements model the effect of this news intervention:
data a;
   set a;
   news = (date >= '1jul1996'd);
run;
proc arima data=a;
   identify var=sales crosscorr=news;
   estimate p=1 q=1 input=news;
run;
Interaction Effects
You can include any number of intervention variables in the model. Intervention variables can have any pattern; impulse and continuing interventions are just two possible cases. You can mix discrete valued intervention variables and continuous regressor variables in the same model. You can also form interaction effects by multiplying input variables and including the product variable as another input. Indeed, as long as the dependent measure is continuous and forms a regular time series, you can use PROC ARIMA to fit any general linear model in conjunction with an ARMA model for the error process by using input variables that correspond to the columns of the design matrix of the linear model.
Numerator Factors
For example, suppose you want to model the effect of PRICE on SALES as taking place gradually with the impact distributed over several past lags of PRICE. This is illustrated by the following statements:
   proc arima data=a;
      identify var=sales crosscorr=price;
      estimate input=( (1 2 3) price );
   run;
This example models the effect of PRICE on SALES as a linear function of the current and three most recent values of PRICE. It is equivalent to a multiple linear regression of SALES on PRICE, LAG(PRICE), LAG2(PRICE), and LAG3(PRICE). This is an example of a transfer function with one numerator factor. The numerator factors for a transfer function for an input series are like the MA part of the ARMA model for the noise series.
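Written out, and assuming the sign convention in which lagged terms of a numerator factor enter with minus signs, the transfer function in this example can be sketched as follows. Here $X_t$ stands for PRICE and $B$ is the backshift operator; the $\omega$ symbols are notation assumed for this sketch.

```latex
\[
(\omega_0 - \omega_1 B - \omega_2 B^2 - \omega_3 B^3)\, X_t
  = \omega_0 X_t - \omega_1 X_{t-1} - \omega_2 X_{t-2} - \omega_3 X_{t-3}
\]
```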
Denominator Factors
You can also use transfer functions with denominator factors. The denominator factors for a transfer function for an input series are like the AR part of the ARMA model for the noise series. Denominator factors introduce exponentially weighted, infinite distributed lags into the transfer function. To specify transfer functions with denominator factors, place the denominator factors after a slash (/) in the INPUT= option. For example, the following statements estimate the PRICE effect as an infinite distributed lag model with exponentially declining weights:
   proc arima data=a;
      identify var=sales crosscorr=price;
      estimate input=( / (1) price );
   run;
The transfer function specified by these statements is as follows:
$$\frac{\omega_0}{(1 - \delta_1 B)} X_t$$
This transfer function also can be written in the following equivalent form:
$$\omega_0 \left( 1 + \sum_{i=1}^{\infty} \delta_1^i B^i \right) X_t$$
This transfer function can be used with intervention inputs. When it is used with a pulse function input, the result is an intervention effect that dies out gradually over time. When it is used with a step function input, the result is an intervention effect that increases gradually to a limiting value.
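To make the two intervention responses concrete, here is a short derivation under the assumption that $|\delta_1| < 1$. The symbols $P_t$ (a pulse equal to 1 only at time $T$) and $S_t$ (a step equal to 1 for all $t \ge T$) are introduced here only for illustration. Expanding $\omega_0 / (1 - \delta_1 B)$ as a geometric series gives the effect $k$ periods after the intervention:

```latex
\[
\left. \frac{\omega_0}{1 - \delta_1 B} P_t \right|_{t = T + k} = \omega_0\, \delta_1^{\,k},
\qquad
\left. \frac{\omega_0}{1 - \delta_1 B} S_t \right|_{t = T + k}
  = \omega_0 \sum_{i=0}^{k} \delta_1^{\,i}
  = \omega_0\, \frac{1 - \delta_1^{\,k+1}}{1 - \delta_1}
  \;\longrightarrow\; \frac{\omega_0}{1 - \delta_1}
  \quad (k \to \infty)
\]
```

For $0 < \delta_1 < 1$, the pulse response decays geometrically toward zero, and the step response rises toward the limiting value $\omega_0 / (1 - \delta_1)$.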
See the section Specifying Inputs and Transfer Functions on page 253 for more information.
If you do not have future values of the input variables in the input data set used by the FORECAST statement, the input series must be forecast before the ARIMA procedure can forecast the response series. If you fit an ARIMA model to each of the input series for which you need forecasts before fitting the model for the response series, the FORECAST statement automatically uses the ARIMA models for the input series to generate the needed forecasts of the inputs.
For example, suppose you want to forecast SALES for the next 12 months. In this example, the change in SALES is predicted as a function of the change in PRICE, plus an ARMA(1,1) noise process. To forecast SALES by using PRICE as an input, you also need to fit an ARIMA model for PRICE. The following statements fit an AR(2) model to the change in PRICE before fitting and forecasting the model for SALES. The FORECAST statement automatically forecasts PRICE using this AR(2) model to get the future inputs needed to produce the forecast of SALES.
   proc arima data=a;
      identify var=price(1);
      estimate p=2;
      identify var=sales(1) crosscorr=price(1);
      estimate p=1 q=1 input=price;
      forecast lead=12 interval=month id=date out=results;
   run;
Fitting a model to the input series is also important for identifying transfer functions. (See the section "Prewhitening" on page 246 for more information.)
Input values from the DATA= data set and input values forecast by PROC ARIMA can be combined. For example, a model for SALES might have three input series: PRICE, INCOME, and TAXRATE. For the forecast, you assume that the tax rate will be unchanged. You have a forecast for INCOME from another source but only for the first few periods of the SALES forecast you want to make. You have no future values for PRICE, which needs to be forecast as in the preceding example.
In this situation, you include observations in the input data set for all forecast periods, with SALES and PRICE set to a missing value, with TAXRATE set to its last actual value, and with INCOME set to forecast values for the periods you have forecasts for and set to missing values for later periods. In the PROC ARIMA step, you estimate ARIMA models for PRICE and INCOME before you estimate the model for SALES, as shown in the following statements:
   proc arima data=a;
      identify var=price(1);
      estimate p=2;
      identify var=income(1);
      estimate p=2;
      identify var=sales(1) crosscorr=( price(1) income(1) taxrate );
      estimate p=1 q=1 input=( price income taxrate );
      forecast lead=12 interval=month id=date out=results;
   run;
In forecasting SALES, the ARIMA procedure uses as inputs the value of PRICE forecast by its ARIMA model, the value of TAXRATE found in the DATA= data set, and the value of INCOME found
in the DATA= data set, or, when the INCOME variable is missing, the value of INCOME forecast by its ARIMA model. (Because SALES is missing for future time periods, the estimation of model parameters is not affected by the forecast values for PRICE, INCOME, or TAXRATE.)
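The future observations described above can be appended in a DATA step. The following is a minimal sketch, not a prescribed method: the data set names FUTURE and A_EXT are hypothetical, DATE is assumed to be a monthly SAS date variable, and INCOME is simply set to missing where a real application would fill in the outside forecasts for the periods that have them.

```sas
/* Sketch only: FUTURE and A_EXT are hypothetical data set names */
data future;
   set a end=last;
   if last then do i = 1 to 12;
      date   = intnx('month', date, 1);  /* advance one month per iteration */
      sales  = .;                        /* response is unknown in the future */
      price  = .;                        /* to be forecast by its ARIMA model */
      income = .;                        /* replace with outside forecasts where available */
      output;                            /* TAXRATE keeps its last actual value */
   end;
   keep date sales price income taxrate;
run;

data a_ext;
   set a future;                         /* history followed by 12 future rows */
run;
```

The extended data set would then be named in the DATA= option of the PROC ARIMA statement in place of the original data set.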
Data Requirements
PROC ARIMA can handle time series of moderate size; there should be at least 30 observations. With fewer than 30 observations, the parameter estimates might be poor. With thousands of observations, the method requires considerable computer time and memory.
Functional Summary
The statements and options that control the ARIMA procedure are summarized in Table 7.3.
Table 7.3 Functional Summary

Statement    Option          Description

Data Set Options
PROC ARIMA   DATA=           specify the input data set
IDENTIFY     DATA=           specify the input data set
PROC ARIMA   OUT=            specify the output data set
FORECAST     OUT=            specify the output data set
FORECAST     NOOUTALL        include only forecasts in the output data set
IDENTIFY     OUTCOV=         write autocovariances to output data set
ESTIMATE     OUTEST=         write parameter estimates to an output data set
ESTIMATE     OUTCORR         write correlation of parameter estimates
ESTIMATE     OUTCOV          write covariance of parameter estimates
ESTIMATE     OUTMODEL=       write estimated model to an output data set
ESTIMATE     OUTSTAT=        write statistics of fit to an output data set

Options for Identifying the Series
IDENTIFY                     difference time series and plot autocorrelations
IDENTIFY     VAR=            specify response series and differencing
IDENTIFY     CROSSCORR=      specify and cross-correlate input series
IDENTIFY     CENTER          center data by subtracting the mean
IDENTIFY     NOMISS          exclude missing values
IDENTIFY     CLEAR           delete previous models and start fresh
IDENTIFY     ALPHA=          specify the significance level for tests
IDENTIFY     ESACF           perform tentative ARMA order identification by using the ESACF method
IDENTIFY     MINIC           perform tentative ARMA order identification by using the MINIC method
IDENTIFY     SCAN            perform tentative ARMA order identification by using the SCAN method
IDENTIFY     PERROR=         specify the range of autoregressive model orders for estimating the error series for the MINIC method
IDENTIFY     P=              determine the AR dimension of the SCAN, ESACF, and MINIC tables
IDENTIFY     Q=              determine the MA dimension of the SCAN, ESACF, and MINIC tables
IDENTIFY     STATIONARITY=   perform stationarity tests
IDENTIFY     WHITENOISE=     selection of white noise test statistic in the presence of missing values

Options for Defining and Estimating the Model
ESTIMATE                     specify and estimate ARIMA models
ESTIMATE     P=              specify autoregressive part of model
ESTIMATE     Q=              specify moving-average part of model
ESTIMATE     INPUT=          specify input variables and transfer functions
ESTIMATE     NOINT           drop mean term from the model
ESTIMATE     METHOD=         specify the estimation method
ESTIMATE     ALTPARM         use alternative form for transfer functions
ESTIMATE     NODF            suppress degrees-of-freedom correction in variance estimates
ESTIMATE     WHITENOISE=     selection of white noise test statistic in the presence of missing values

Options for Outlier Detection
OUTLIER      ALPHA=          specify the significance level for tests
OUTLIER      ID=             identify detected outliers with variable
OUTLIER      MAXNUM=         limit the number of outliers
OUTLIER      MAXPCT=         limit the number of outliers to a percentage of the series
OUTLIER      SIGMA=          specify the variance estimator used for testing
OUTLIER      TYPE=           specify the type of level shifts

Printing Control Options
IDENTIFY     NLAG=           limit number of lags shown in correlation plots
IDENTIFY     NOPRINT         suppress printed output for identification
ESTIMATE     PLOT            plot autocorrelation functions of the residuals
ESTIMATE     GRID            print log-likelihood around the estimates
ESTIMATE     GRIDVAL=        control spacing for GRID option
ESTIMATE     PRINTALL        print details of the iterative estimation process
ESTIMATE     NOPRINT         suppress printed output for estimation
FORECAST     NOPRINT         suppress printing of the forecast values
FORECAST     PRINTALL        print the one-step forecasts and residuals

Plotting Control Options
PROC ARIMA   PLOTS=          request plots associated with model identification, residual analysis, and forecasting

Options to Specify Parameter Values
ESTIMATE     AR=             specify autoregressive starting values
ESTIMATE     MA=             specify moving-average starting values
ESTIMATE     MU=             specify a starting value for the mean parameter
ESTIMATE     INITVAL=        specify starting values for transfer functions

Options to Control the Iterative Estimation Process
ESTIMATE     CONVERGE=       specify convergence criterion
ESTIMATE     MAXITER=        specify the maximum number of iterations
ESTIMATE     SINGULAR=       specify criterion for checking for singularity
ESTIMATE     NOEST           suppress the iterative estimation process
ESTIMATE     BACKLIM=        omit initial observations from objective
ESTIMATE     DELTA=          specify perturbation for numerical derivatives
ESTIMATE     NOSTABLE        omit stationarity and invertibility checks
ESTIMATE     NOLS            use preliminary estimates as starting values for ML and ULS

Options for Forecasting
FORECAST                     forecast the response series
FORECAST     LEAD=           specify how many periods to forecast
FORECAST     ID=             specify the ID variable
FORECAST     INTERVAL=       specify the periodicity of the series
FORECAST     ALPHA=          specify size of forecast confidence limits
FORECAST     BACK=           start forecasting before end of the input data
FORECAST     SIGSQ=          specify the variance term used to compute forecast standard errors and confidence limits
FORECAST     ALIGN=          control the alignment of SAS date values

BY Groups
BY                           specify BY group processing
PROC ARIMA Statement
PROC ARIMA options ;
DATA=SAS-data-set
specifies the name of the SAS data set that contains the time series. If different DATA= specifications appear in the PROC ARIMA and IDENTIFY statements, the one in the IDENTIFY statement is used. If the DATA= option is not specified in either the PROC ARIMA or IDENTIFY statement, the most recently created SAS data set is used.
PLOTS< (global-plot-options) > < = plot-request < (options) > >
PLOTS< (global-plot-options) > < = (plot-request < (options) > < ... plot-request < (options) > >) >
controls the plots produced through ODS Graphics. When you specify only one plot request, you can omit the parentheses around the plot request. Here are some examples:
   plots=none
   plots=all
   plots(unpack)=series(corr crosscorr)
   plots(only)=(series(corr crosscorr) residual(normal smooth))
You must enable ODS Graphics before requesting plots as shown in the following statements. For general information about ODS Graphics, see Chapter 21, "Statistical Graphics Using ODS" (SAS/STAT User's Guide). If you have enabled ODS Graphics but do not specify any specific plot request, then the default plots associated with each of the PROC ARIMA statements used in the program are produced. The old line printer plots are suppressed when ODS Graphics is enabled.
   ods graphics on;

   proc arima;
      identify var=y(1 12);
      estimate q=(1)(12) noint;
   run;
Since no specific plot is requested in this program, the default plots associated with the identification and estimation stages are produced.
Global Plot Options: The global-plot-options apply to all relevant plots generated by the ARIMA procedure. The following global-plot-options are supported:
ONLY
suppresses the default plots. Only the plots specifically requested are produced.
UNPACK
Specific Plot Options: The following list describes the specific plots and their options.
ALL
SERIES(< series-plot-options >)
produces plots associated with the identification stage of the modeling. The panel plots corresponding to the CORR and CROSSCORR options are produced by default. The following series-plot-options are available:
ACF
CORR
produces a panel of plots that are useful in the trend and correlation analysis of the series. The panel consists of the following: the time series plot the series-autocorrelation plot the series-partial-autocorrelation plot the series-inverse-autocorrelation plot
CROSSCORR
RESIDUAL(< residual-plot-options >)
produces the residuals plots. The residual correlation and normality diagnostic panels are produced by default. The following residual-plot-options are available:
ACF
produces all the residual diagnostics plots appropriate for the particular analysis.
CORR
produces a summary panel of the residual correlation diagnostics that consists of the following: the residual-autocorrelation plot the residual-partial-autocorrelation plot the residual-inverse-autocorrelation plot a plot of Ljung-Box white-noise test p-values at different lags
HIST
produces a summary panel of the residual normality diagnostics that consists of the following:
produces a scatter plot of the residuals against time, which has an overlaid smooth fit.
WN
FORECAST(< forecast-plot-options >)
produces the forecast plots in the forecasting stage. The forecast-only plot that shows the multistep forecasts in the forecast region is produced by default. The following forecast-plot-options are available:
ALL
produces a plot that shows the one-step-ahead forecasts as well as the multistep-ahead forecasts.
FORECASTONLY
produces a plot that shows only the multistep-ahead forecasts in the forecast region.
OUT=SAS-data-set
specifies a SAS data set to which the forecasts are output. If different OUT= specifications appear in the PROC ARIMA and FORECAST statements, the one in the FORECAST statement is used.
BY Statement
BY variables ;
A BY statement can be used in the ARIMA procedure to process a data set in groups of observations defined by the BY variables. Note that all IDENTIFY, ESTIMATE, and FORECAST statements specified are applied to all BY groups.
Because of the need to make data-based model selections, BY-group processing is not usually done with PROC ARIMA. You usually want to use different models for the different series contained in different BY groups, and the PROC ARIMA BY statement does not let you do this.
Using a BY statement imposes certain restrictions. The BY statement must appear before the first RUN statement. If a BY statement is used, the input data must come from the data set specified in the PROC statement; that is, no input data sets can be specified in IDENTIFY statements.
When a BY statement is used with PROC ARIMA, interactive processing applies only to the first BY group. Once the end of the PROC ARIMA step is reached, all ARIMA statements specified are executed again for each of the remaining BY groups in the input data set.
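As a minimal sketch of BY-group processing that respects these restrictions, the following statements apply one model to every group. The BY variable REGION is hypothetical, and the input data set is assumed to be sorted by it.

```sas
/* Sketch only: REGION is a hypothetical BY variable */
proc sort data=a;
   by region date;
run;

proc arima data=a;                  /* input data named in the PROC statement */
   by region;                       /* BY appears before the first RUN */
   identify var=sales(1) noprint;
   estimate q=1;
   forecast lead=12 interval=month id=date out=results;
run;
```

The same IDENTIFY, ESTIMATE, and FORECAST statements are applied to each REGION group, so this approach suits only cases where a common model form is acceptable.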
IDENTIFY Statement
IDENTIFY VAR=variable options ;
The IDENTIFY statement specifies the time series to be modeled, differences the series if desired, and computes statistics to help identify models to fit. Use an IDENTIFY statement for each time series that you want to model. If other time series are to be used as inputs in a subsequent ESTIMATE statement, they must be listed in a CROSSCORR= list in the IDENTIFY statement. The following options are used in the IDENTIFY statement. The VAR= option is required.
ALPHA=significance-level
The ALPHA= option specifies the significance level for tests in the IDENTIFY statement. The default is 0.05.
CENTER
centers each time series by subtracting its sample mean. The analysis is done on the centered data. Later, when forecasts are generated, the mean is added back. Note that centering is done after differencing. The CENTER option is normally used in conjunction with the NOCONSTANT option of the ESTIMATE statement.
CLEAR
deletes all old models. This option is useful when you want to delete old models so that the input variables are not prewhitened. (See the section Prewhitening on page 246 for more information.)
CROSSCORR=variable (d11, d12, . . . , d1k )
CROSSCORR= (variable (d11, d12, . . . , d1k ) ... variable (d21, d22, . . . , d2k ))
names the variables cross-correlated with the response variable given by the VAR= specification. Each variable name can be followed by a list of differencing lags in parentheses, the same as for the VAR= specification. If differencing is specified for a variable in the CROSSCORR= list, the differenced series is cross-correlated with the VAR= option series, and the differenced series is used when the ESTIMATE statement INPUT= option refers to the variable.
DATA=SAS-data-set
specifies the input SAS data set that contains the time series. If the DATA= option is omitted, the DATA= data set specified in the PROC ARIMA statement is used; if the DATA= option is omitted from the PROC ARIMA statement as well, the most recently created data set is used.
ESACF
computes the extended sample autocorrelation function and uses these estimates to tentatively identify the autoregressive and moving-average orders of mixed models. The ESACF option generates two tables. The first table displays extended sample autocorrelation estimates, and the second table displays probability values that can be used to test the significance of these estimates. The P=($p_{min}$ : $p_{max}$) and Q=($q_{min}$ : $q_{max}$) options determine the size of the table. The autoregressive and moving-average orders are tentatively identified by finding a triangular pattern in which all values are insignificant. The ARIMA procedure finds these patterns based on the IDENTIFY statement ALPHA= option and displays possible recommendations for the orders. The following code generates an ESACF table with dimensions of p=(0:7) and q=(0:8).
   proc arima data=test;
      identify var=x esacf p=(0:7) q=(0:8);
   run;
See the section The ESACF Method on page 241 for more information.
MINIC
uses information criteria or penalty functions to provide tentative ARMA order identification. The MINIC option generates a table that contains the computed information criterion associated with various ARMA model orders. The PERROR=($p_{\epsilon,min}$ : $p_{\epsilon,max}$) option determines the range of the autoregressive model orders used to estimate the error series. The P=($p_{min}$ : $p_{max}$) and Q=($q_{min}$ : $q_{max}$) options determine the size of the table. The ARMA orders are tentatively identified by those orders that minimize the information criterion. The following statements generate a MINIC table with default dimensions of p=(0:5) and q=(0:5) and with the error series estimated by an autoregressive model with an order, $p_{\epsilon}$, that minimizes the AIC in the range from 8 to 11.
   proc arima data=test;
      identify var=x minic perror=(8:11);
   run;
See the section The MINIC Method on page 243 for more information.
NLAG=number
indicates the number of lags to consider in computing the autocorrelations and cross-correlations. To obtain preliminary estimates of an ARIMA(p, d, q) model, the NLAG= value must be at least p + q + d. The number of observations must be greater than or equal to the NLAG= value. The default value for NLAG= is 24 or one-fourth the number of observations, whichever is less. Even though the NLAG= value is specified, the NLAG= value can be changed according to the data set.
NOMISS
uses only the first continuous sequence of data with no missing values. By default, all observations are used.
NOPRINT
suppresses the normal printout (including the correlation plots) generated by the IDENTIFY statement.
OUTCOV=SAS-data-set
writes the autocovariances, autocorrelations, inverse autocorrelations, partial autocorrelations, and cross covariances to an output SAS data set. If the OUTCOV= option is not specied, no covariance output data set is created. See the section OUTCOV= Data Set on page 263 for more information.
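P=($p_{min}$ : $p_{max}$)
determines the AR dimension of the SCAN, ESACF, and MINIC tables.
PERROR=($p_{\epsilon,min}$ : $p_{\epsilon,max}$)
determines the range of the autoregressive model orders used to estimate the error series in MINIC, a tentative ARMA order identification method. See the section "The MINIC Method" on page 243 for more information. By default $p_{\epsilon,min}$ is set to $p_{max}$ and $p_{\epsilon,max}$ is set to $p_{max} + q_{max}$, where $p_{max}$ and $q_{max}$ are the maximum settings of the P= and Q= options on the IDENTIFY statement.
Q=($q_{min}$ : $q_{max}$)
determines the MA dimension of the SCAN, ESACF, and MINIC tables.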
SCAN
computes estimates of the squared canonical correlations and uses these estimates to tentatively identify the autoregressive and moving-average orders of mixed models. The SCAN option generates two tables. The first table displays squared canonical correlation estimates, and the second table displays probability values that can be used to test the significance of these estimates. The P=($p_{min}$ : $p_{max}$) and Q=($q_{min}$ : $q_{max}$) options determine the size of each table. The autoregressive and moving-average orders are tentatively identified by finding a rectangular pattern in which all values are insignificant. The ARIMA procedure finds these patterns based on the IDENTIFY statement ALPHA= option and displays possible recommendations for the orders. The following code generates a SCAN table with default dimensions of p=(0:5) and q=(0:5). The recommended orders are based on a significance level of 0.1.
   proc arima data=test;
      identify var=x scan alpha=0.1;
   run;
See the section The SCAN Method on page 244 for more information.
STATIONARITY=
performs stationarity tests. Stationarity tests can be used to determine whether differencing terms should be included in the model specification. In each stationarity test, the autoregressive orders can be specified by a range, test=$ar_{max}$, or as a list of values, test=($ar_1, \ldots, ar_n$), where test is ADF, PP, or RW. The default is (0,1,2). See the section "Stationarity Tests" on page 246 for more information.
STATIONARITY=(ADF= AR orders DLAG= s ) STATIONARITY=(DICKEY= AR orders DLAG= s )
performs augmented Dickey-Fuller tests. If the DLAG=s option is specified with s greater than one, seasonal Dickey-Fuller tests are performed. The maximum allowable value of s is 12. The default value of s is 1. The following code performs augmented Dickey-Fuller tests with autoregressive orders 2 and 5.
   proc arima data=test;
      identify var=x stationarity=(adf=(2,5));
   run;
STATIONARITY=(PP= AR orders )
performs Phillips-Perron tests. The following statements perform augmented Phillips-Perron tests with autoregressive orders ranging from 0 to 6.
   proc arima data=test;
      identify var=x stationarity=(pp=6);
   run;
STATIONARITY=(RW= AR orders )
performs random-walk-with-drift tests. The following statements perform random-walk-with-drift tests with autoregressive orders ranging from 0 to 2.
   proc arima data=test;
      identify var=x stationarity=(rw);
   run;
VAR=variable
VAR=variable ( d1, d2, . . . , dk )
names the variable that contains the time series to analyze. The VAR= option is required. A list of differencing lags can be placed in parentheses after the variable name to request that the series be differenced at these lags. For example, VAR=X(1) takes the first differences of X. VAR=X(1,1) requests that X be differenced twice, both times with lag 1, producing a second difference series, which is $(X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) = X_t - 2X_{t-1} + X_{t-2}$. VAR=X(2) differences X once at lag two: $(X_t - X_{t-2})$.
If differencing is specified, it is the differenced series that is processed by any subsequent ESTIMATE statement.
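As a further worked case, consider differencing at two different lags, as in a specification such as VAR=X(1,12) for monthly data. Writing $B$ for the backshift operator (notation assumed here), the resulting series is:

```latex
\[
(1 - B)(1 - B^{12})\, X_t
  = (X_t - X_{t-1}) - (X_{t-12} - X_{t-13})
  = X_t - X_{t-1} - X_{t-12} + X_{t-13}
\]
```

The first 13 observations of the original series therefore have no differenced value.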
WHITENOISE=ST | IGNOREMISS
specifies the type of test statistic that is used in the white noise test of the series when the series contains missing values. If WHITENOISE=IGNOREMISS, the standard Ljung-Box test statistic is used. If WHITENOISE=ST, a modification of this statistic suggested by Stoffer and Toloi (1992) is used. The default is WHITENOISE=ST.
ESTIMATE Statement
< label: >ESTIMATE options ;
The ESTIMATE statement specifies an ARMA model or transfer function model for the response variable specified in the previous IDENTIFY statement, and produces estimates of its parameters. The ESTIMATE statement also prints diagnostic information by which to check the model. The label in the ESTIMATE statement is optional. Include an ESTIMATE statement for each model that you want to estimate. Options used in the ESTIMATE statement are described in the following sections.
ALTPARM
specifies the alternative parameterization of the overall scale of transfer functions in the model. See the section "Alternative Model Parameterization" on page 253 for details.
INPUT=variable INPUT=( transfer-function variable . . . )
specifies input variables and their transfer functions. The variables used on the INPUT= option must be included in the CROSSCORR= list in the previous IDENTIFY statement. If any differencing is specified in the CROSSCORR= list, then the differenced series is used as the input to the transfer function. The transfer function specification for an input variable is optional. If no transfer function is specified, the input variable enters the model as a simple regressor. If specified, the transfer function specification has the following syntax:
   S $ (L1,1, L1,2, . . .)(L2,1, . . .) . . . / (Lj,1, . . .) . . .
Here, S is a shift or lag of the input variable, the terms before the slash (/) are numerator factors, and the terms after the slash (/) are denominator factors of the transfer function. All three parts are optional. See the section "Specifying Inputs and Transfer Functions" on page 253 for details.
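The following sketch combines all three parts of the syntax for a single input. It assumes data set A with SALES and PRICE as in the earlier examples; the shift and lag values are chosen only for illustration.

```sas
proc arima data=a;
   identify var=sales crosscorr=price;
   /* shift of 2 periods, one numerator factor (1), one denominator factor (1) */
   estimate input=( 2 $ (1) / (1) price );
run;
```

Under the conventions described in this chapter, PRICE would begin to affect SALES after a two-period delay, with one numerator lag term and exponentially declining weights from the denominator factor.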
METHOD=value
specifies the estimation method to use. METHOD=ML specifies the maximum likelihood method. METHOD=ULS specifies the unconditional least squares method. METHOD=CLS specifies the conditional least squares method. METHOD=CLS is the default. See the section "Estimation Details" on page 248 for more information.
NOCONSTANT NOINT
suppresses the fitting of a constant (or intercept) parameter in the model. (That is, the parameter is omitted.)
NODF
estimates the variance by dividing the error sum of squares (SSE) by the number of residuals. The default is to divide the SSE by the number of residuals minus the number of free parameters in the model.
NOPRINT
suppresses the normal printout generated by the ESTIMATE statement. If the NOPRINT option is specified for the ESTIMATE statement, then any error and warning messages are printed to the SAS log.
P=order P=(lag, . . . , lag ) . . . (lag, . . . , lag )
specifies the autoregressive part of the model. By default, no autoregressive parameters are fit. P=($l_1$, $l_2$, . . . , $l_k$) defines a model with autoregressive parameters at the specified lags. P=order is equivalent to P=(1, 2, . . . , order). A concatenation of parenthesized lists specifies a factored model. For example, P=(1,2,5)(6,12) specifies the autoregressive model
$$(1 - \phi_{1,1} B - \phi_{1,2} B^2 - \phi_{1,3} B^5)(1 - \phi_{2,1} B^6 - \phi_{2,2} B^{12})$$
PLOT
plots the residual autocorrelation functions. The sample autocorrelation, the sample inverse autocorrelation, and the sample partial autocorrelation functions of the model residuals are plotted.
Q=order Q=(lag, . . . , lag ) . . . (lag, . . . , lag )
specifies the moving-average part of the model. By default, no moving-average part is included in the model. Q=($l_1$, $l_2$, . . . , $l_k$) defines a model with moving-average parameters at the specified lags. Q=order is equivalent to Q=(1, 2, . . . , order). A concatenation of parenthesized lists specifies a factored model. The interpretation of factors and lags is the same as for the P= option.
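To see what a factored specification implies, it can help to multiply the factors out. As a sketch, take a specification such as Q=(1)(12); the symbols $\theta_{1,1}$ and $\theta_{2,1}$ are notation assumed here for the two moving-average parameters.

```latex
\[
(1 - \theta_{1,1} B)(1 - \theta_{2,1} B^{12})
  = 1 - \theta_{1,1} B - \theta_{2,1} B^{12} + \theta_{1,1}\theta_{2,1} B^{13}
\]
```

The factored model thus has terms at lags 1, 12, and 13 but only two free parameters, because the lag 13 coefficient is constrained to be the product of the other two.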
WHITENOISE=ST | IGNOREMISS
specifies the type of test statistic that is used in the white noise test of the series when the series contains missing values. If WHITENOISE=IGNOREMISS, the standard Ljung-Box test statistic is used. If WHITENOISE=ST, a modification of this statistic suggested by Stoffer and Toloi (1992) is used. The default is WHITENOISE=ST.
OUTEST=SAS-data-set
writes the parameter estimates to an output data set. If the OUTCORR or OUTCOV option is used, the correlations or covariances of the estimates are also written to the OUTEST= data set. See the section "OUTEST= Data Set" on page 264 for a description of the OUTEST= output data set.
OUTCORR
writes the correlations of the parameter estimates to the OUTEST= data set.
OUTCOV
writes the covariances of the parameter estimates to the OUTEST= data set.
OUTMODEL=SAS-data-set
writes the model and parameter estimates to an output data set. If OUTMODEL= is not specified, no model output data set is created. See the section "OUTMODEL= SAS Data Set" on page 267 for a description of the OUTMODEL= output data set.
OUTSTAT=SAS-data-set
writes the model diagnostic statistics to an output data set. If OUTSTAT= is not specified, no statistics output data set is created. See the section "OUTSTAT= Data Set" on page 268 for a description of the OUTSTAT= output data set.
AR=value . . .
lists starting values for the autoregressive parameters. See the section "Initial Values" on page 254 for more information.
INITVAL=(initializer-spec variable . . . )
specifies starting values for the parameters in the transfer function parts of the model. See the section "Initial Values" on page 254 for more information.
MA=value . . .
lists starting values for the moving-average parameters. See the section Initial Values on page 254 for more information.
MU=value
specifies a starting value for the mean parameter.
NOEST
uses the values specified with the AR=, MA=, INITVAL=, and MU= options as final parameter values. The estimation process is suppressed except for estimation of the residual variance. The specified parameter values are used directly by the next FORECAST statement. When NOEST is specified, standard errors, t values, and the correlations between estimates are displayed as 0 or missing. (The NOEST option is useful, for example, when you want to generate forecasts that correspond to a published model.)
BACKLIM= -n
omits the specified number of initial residuals from the sum of squares or likelihood function. Omitting values can be useful for suppressing transients in transfer function models that are sensitive to start-up values.
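As a sketch of the published-model use case, the following statements supply parameter values directly and skip estimation. The numeric values are hypothetical placeholders rather than estimates from any real model, and data set A with SALES and DATE is assumed as in the earlier examples.

```sas
proc arima data=a;
   identify var=sales(1) noprint;
   /* Parameter values below are hypothetical placeholders */
   estimate p=1 q=1 ar=0.7 ma=0.4 mu=0.5 noest;
   forecast lead=12 interval=month id=date out=results;
run;
```

Because NOEST is specified, the forecasts are computed from the supplied AR=, MA=, and MU= values; only the residual variance is estimated from the data.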
CONVERGE=value
specifies the convergence criterion. Convergence is assumed when the largest change in the estimate for any parameter is less than the CONVERGE= option value. If the absolute value of the parameter estimate is greater than 0.01, the relative change is used; otherwise, the absolute change in the estimate is used. The default is CONVERGE=0.001.
DELTA=value
The default is
GRID
prints the error sum of squares (SSE) or concentrated log-likelihood surface in a small grid of the parameter space around the final estimates. For each pair of parameters, the SSE is printed for the nine parameter-value combinations formed by the grid, with a center at the final estimates and with spacing given by the GRIDVAL= specification. The GRID option can help you judge whether the estimates are truly at the optimum, since the estimation process does not always converge. For models with a large number of parameters, the GRID option produces voluminous output.
GRIDVAL=number
controls the spacing in the grid printed by the GRID option. The default is GRIDVAL=0.005.
MAXITER=n MAXIT=n
NOLS
begins the maximum likelihood or unconditional least squares iterations from the preliminary estimates rather than from the conditional least squares estimates that are produced after four iterations. See the section Estimation Details on page 248 for more information.
NOSTABLE
specifies that the autoregressive and moving-average parameter estimates for the noise part of the model not be restricted to the stationary and invertible regions, respectively. See the section "Stationarity and Invertibility" on page 255 for more information.
PRINTALL
prints preliminary estimation results and the iterations in the final estimation process.
NOTFSTABLE
specifies that the parameter estimates for the denominator polynomial of the transfer function part of the model not be restricted to the stability region. See the section "Stationarity and Invertibility" on page 255 for more information.
SINGULAR=value
specifies the criterion for checking singularity. If a pivot of a sweep operation is less than the SINGULAR= value, the matrix is deemed singular. Sweep operations are performed on the Jacobian matrix during final estimation and on the covariance matrix when preliminary estimates are obtained. The default is SINGULAR=1E-7.
OUTLIER Statement
OUTLIER options ;
The OUTLIER statement can be used to detect shifts in the level of the response series that are not accounted for by the previously estimated model. An ESTIMATE statement must precede the OUTLIER statement. The following options are used in the OUTLIER statement:
TYPE=ADDITIVE
TYPE=SHIFT
TYPE=TEMP ( d1, . . . , dk )
TYPE=( < ADDITIVE > < SHIFT > < TEMP ( d1, . . . , dk ) > )
specifies the types of level shifts to search for. The default is TYPE=(ADDITIVE SHIFT), which requests searching for additive outliers and permanent level shifts. The option TEMP( d1, . . . , dk ) requests searching for temporary changes in the level of durations d1, . . . , dk. These options can also be abbreviated as AO, LS, and TC.
ALPHA=significance-level
specifies the significance level for tests in the OUTLIER statement. The default is 0.05.
SIGMA=ROBUST | MSE
specifies the type of error variance estimate to use in the statistical tests performed during the outlier detection. SIGMA=MSE corresponds to the usual mean squared error (MSE) estimate, and SIGMA=ROBUST corresponds to a robust estimate of the error variance. The default is SIGMA=ROBUST.
MAXNUM=number
limits the number of outliers to search for. The default is MAXNUM=5.
MAXPCT=number
limits the number of outliers to search for according to a percentage of the series length. The default is MAXPCT=2. When both the MAXNUM= and MAXPCT= options are specified, the minimum of the two search numbers is used.
ID=Date-Time ID variable
specifies a SAS date, time, or datetime identification variable to label the detected outliers. This variable must be present in the input data set.
The following examples illustrate a few possibilities for the OUTLIER statement. The most basic usage, shown as follows, sets all the options to their default values.
outlier;
The following statement requests a search for permanent level shifts and for temporary level changes of durations 6 and 12. The search is limited to at most three changes, and the significance level of the underlying tests is 0.001. MSE is used as the estimate of error variance. The statement also requests labeling of the detected shifts by using an ID variable date.
outlier type=(ls tc(6 12)) alpha=0.001 sigma=mse maxnum=3 ID=date;
FORECAST Statement
FORECAST options ;
The FORECAST statement generates forecast values for a time series by using the parameter estimates produced by the previous ESTIMATE statement. See the section Forecasting Details on page 257 for more information about calculating forecasts. The following options can be used in the FORECAST statement:
ALIGN=option
controls the alignment of SAS dates used to identify output observations. The ALIGN= option allows the following values: BEGINNING|BEG|B, MIDDLE|MID|M, and ENDING|END|E. BEGINNING is the default.
ALPHA=n
sets the size of the forecast confidence limits. The ALPHA= value must be between 0 and 1. When you specify ALPHA=$\alpha$, the upper and lower confidence limits have a $1 - \alpha$ confidence level. The default is ALPHA=0.05, which produces 95% confidence intervals. ALPHA values are rounded to the nearest hundredth.
BACK=n
specifies the number of observations before the end of the data where the multistep forecasts are to begin. The BACK= option value must be less than or equal to the number of observations minus the number of parameters. The default is BACK=0, which means that the forecast starts at the end of the available data. The end of the data is the last observation for which a noise value can be calculated. If there are no input series, the end of the data is the last nonmissing value of the response time series. If there are input series, this observation can precede the last nonmissing value of the response variable, since there may be missing values for some of the input series.
ID=variable
names a variable in the input data set that identifies the time periods associated with the observations. The ID= variable is used in conjunction with the INTERVAL= option to extrapolate ID values from the end of the input data to identify forecast periods in the OUT= data set. If the INTERVAL= option specifies an interval type, the ID variable must be a SAS date or datetime variable with the spacing between observations indicated by the INTERVAL= value. If the INTERVAL= option is not used, the last input value of the ID= variable is incremented by one for each forecast period to extrapolate the ID values for forecast observations.
INTERVAL=interval
INTERVAL=n
specifies the time interval between observations. See Chapter 4, "Date Intervals, Formats, and Functions," for information about valid INTERVAL= values. The value of the INTERVAL= option is used by PROC ARIMA to extrapolate the ID values for forecast observations and to check that the input data are in order with no missing periods. See the section "Specifying Series Periodicity" on page 259 for more details.
LEAD=n
specifies the number of multistep forecast values to compute. For example, if LEAD=10, PROC ARIMA forecasts for ten periods beginning with the end of the input series (or earlier if BACK= is specified). It is possible to obtain fewer than the requested number of forecasts if a transfer function model is specified and insufficient data are available to compute the forecast. The default is LEAD=24.
NOOUTALL
includes only the final forecast observations in the OUT= output data set, not the one-step forecasts for the data before the forecast period.
NOPRINT
suppresses the normal printout of the forecast and associated values.
OUT=SAS-data-set
writes the forecast (and other values) to an output data set. If OUT= is not specified, the OUT= data set specified in the PROC ARIMA statement is used. If OUT= is also not specified in the PROC ARIMA statement, no output data set is created. See the section "OUT= Data Set" on page 262 for more information.
PRINTALL
prints the FORECAST computation throughout the whole data set. The forecast values for the data before the forecast period (specified by the BACK= option) are one-step forecasts.
SIGSQ=value
specifies the variance term used in the formula for computing forecast standard errors and confidence limits. The default value is the variance estimate computed by the preceding ESTIMATE statement. This option is useful when you wish to generate forecast standard errors and confidence limits based on a published model. It would often be used in conjunction with the NOEST option in the preceding ESTIMATE statement.
Notice that if the original model is a pure autoregressive model, then the IACF is an ACF that corresponds to a pure moving-average model. Thus, it cuts off sharply when the lag is greater than p; this behavior is similar to the behavior of the partial autocorrelation function (PACF).
The sample inverse autocorrelation function (SIACF) is estimated in the ARIMA procedure by the following steps. A high-order autoregressive model is fit to the data by means of the Yule-Walker equations. The order of the autoregressive model used to calculate the SIACF is the minimum of the NLAG= value and one-half the number of observations after differencing. The SIACF is then calculated as the autocorrelation function that corresponds to this autoregressive operator when treated as a moving-average operator. That is, the autoregressive coefficients are convolved with themselves and treated as autocovariances.
Under certain conditions, the sampling distribution of the SIACF can be approximated by the sampling distribution of the SACF of the dual model (Bhansali 1980). In the plots generated by ARIMA, the confidence limit marks (.) are located at $\pm 2/\sqrt{n}$. These limits bound an approximate 95% confidence interval for the hypothesis that the data are from a white noise process.
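The final SIACF step, treating the fitted autoregressive operator as a moving-average operator, can be sketched in a few lines of Python (not SAS). This is an illustration only: it assumes that the high-order AR coefficients have already been estimated (for example, by the Yule-Walker equations), and the function name `inverse_autocorrelations` is hypothetical rather than part of PROC ARIMA.

```python
def inverse_autocorrelations(ar_coefs, nlag):
    """Inverse autocorrelations implied by AR coefficients phi_1..phi_p.

    The AR operator (1 - phi_1*B - ... - phi_p*B**p) is treated as a
    moving-average operator: its coefficients are convolved with
    themselves to give autocovariances, which are then normalized.
    """
    op = [1.0] + [-phi for phi in ar_coefs]  # operator coefficients
    p = len(ar_coefs)
    gamma = []
    for k in range(nlag + 1):
        if k <= p:
            # Lag-k autocovariance of the dual moving-average model.
            gamma.append(sum(op[i] * op[i + k] for i in range(p + 1 - k)))
        else:
            gamma.append(0.0)  # an MA(p) autocovariance vanishes beyond lag p
    return [g / gamma[0] for g in gamma]


# AR(1) with phi_1 = 0.5: the inverse autocorrelation cuts off after lag 1.
print(inverse_autocorrelations([0.5], 3))
```

The sharp cutoff after lag p is the behavior that makes the IACF of a pure autoregressive model resemble a PACF.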
plots the cross-correlation function of Y and X, $\mathrm{Cor}(y_t, x_{t-s})$, for $s = -L$ to $L$, where $L$ is the value of the NLAG= option. Study of the cross-correlation functions can indicate the transfer functions through which the input series should enter the model for the response series.
The cross-correlation function is computed after any specified differencing has been done. If differencing is specified for the VAR= variable or for a variable in the CROSSCORR= list, it is the differenced series that is cross-correlated (and the differenced series is processed by any following ESTIMATE statement). For example,
identify var=y(1) crosscorr=x(1);
computes the cross-correlations of the changes in Y with the changes in X. When differencing is specified, the subsequent ESTIMATE statement models changes in the variables rather than the variables themselves.
\[ w_t^{(m,j)} = \hat{\Phi}_{(m,j)}(B)\,\tilde{z}_t = \tilde{z}_t - \sum_{i=1}^{m} \hat{\phi}_i^{(m,j)}\,\tilde{z}_{t-i} \]
where $B$ represents the backshift operator, where $m = p_{min}, \ldots, p_{max}$ are the autoregressive test orders, where $j = q_{min}+1, \ldots, q_{max}+1$ are the moving-average test orders, and where $\hat{\phi}_i^{(m,j)}$ are the autoregressive parameter estimates under the assumption that the series is an ARMA($m, j$) process.
For purely autoregressive models ($j = 0$), ordinary least squares (OLS) is used to consistently estimate $\hat{\phi}_i^{(m,0)}$. For ARMA models, consistent estimates are obtained by the iterated least squares recursion formula, which is initiated by the pure autoregressive estimates:
\[ \hat{\phi}_i^{(m,j)} = \hat{\phi}_i^{(m+1,j-1)} - \hat{\phi}_{i-1}^{(m,j-1)}\,\frac{\hat{\phi}_{m+1}^{(m+1,j-1)}}{\hat{\phi}_m^{(m,j-1)}} \]
The $j$th lag of the sample autocorrelation function of the filtered series $w_t^{(m,j)}$ is the extended sample autocorrelation function, and it is denoted as $r_{j(m)} = r_j(w^{(m,j)})$.
The standard errors of $r_{j(m)}$ are computed in the usual way by using Bartlett's approximation of the variance of the sample autocorrelation function, $\mathrm{var}(r_{j(m)}) \approx (1 + \sum_{t=1}^{j-1} r_j^2(w^{(m,j)}))$.
If the true model is an ARMA($p+d, q$) process, the filtered series $w_t^{(m,j)}$ follows an MA($q$) model for $j \ge q$ so that
\[ r_{j(p+d)} \approx 0, \quad j > q \]
\[ r_{j(p+d)} \ne 0, \quad j = q \]
Additionally, Tsay and Tiao (1984) show that the extended sample autocorrelation satisfies
\[ r_{j(m)} \approx \begin{cases} 0, & j - q > m - p - d \ge 0 \\ c(m-p-d,\, j-q), & 0 \le j - q \le m - p - d \end{cases} \]
where $c(m-p-d, j-q)$ is a nonzero constant or a continuous random variable bounded by $-1$ and 1.
An ESACF table is then constructed by using the $r_{j(m)}$ for $m = p_{min}, \ldots, p_{max}$ and $j = q_{min}+1, \ldots, q_{max}+1$ to identify the ARMA orders (see Table 7.4). The orders are tentatively identified by finding a right (maximal) triangular pattern with vertices located at $(p+d, q)$ and $(p+d, q_{max})$ and in which all elements are insignificant (based on asymptotic normality of the autocorrelation function). The vertex $(p+d, q)$ identifies the order. Table 7.5 depicts the theoretical pattern associated with an ARMA(1,2) series.
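Table 7.4 ESACF Table
[Table body not recovered. Rows: AR test orders 0, 1, 2, 3.]

Table 7.5 Theoretical ESACF Table for an ARMA(1,2) Series
[Only the MA = 6 and MA = 7 columns are recovered.]
        MA
AR      6    7
0       X    X
1       0    0
2       0    0
3       0    0
4       0    0
X = significant terms, 0 = insignificant terms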
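where the parameter estimates $\hat{\Phi}_{(p_\epsilon,q)}$ are obtained from the Yule-Walker estimates. The choice of the autoregressive order $p_\epsilon$ is determined by the order that minimizes the Akaike information criterion (AIC) in the range $p_{\epsilon,min} \le p_\epsilon \le p_{\epsilon,max}$
\[ \mathrm{AIC}(p_\epsilon, 0) = \ln(\tilde{\sigma}^2_{(p_\epsilon,0)}) + 2(p_\epsilon + 0)/n \]
where
\[ \tilde{\sigma}^2_{(p_\epsilon,0)} = \frac{1}{n} \sum_{t=p_\epsilon+1}^{n} \hat{\epsilon}_t^2 \]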
Note that Hannan and Rissanen (1982) use the Bayesian information criterion (BIC) to determine the autoregressive order used to estimate the error series. Box, Jenkins, and Reinsel (1994) and Choi (1992) recommend the AIC.
Once the error series has been estimated for autoregressive test order $m = p_{min}, \ldots, p_{max}$ and for moving-average test order $j = q_{min}, \ldots, q_{max}$, the OLS estimates $\hat{\phi}^{(m,j)}$ and $\hat{\theta}^{(m,j)}$ are computed from the regression model
\[ \tilde{z}_t = \sum_{i=1}^{m} \phi_i^{(m,j)}\,\tilde{z}_{t-i} + \sum_{k=1}^{j} \theta_k^{(m,j)}\,\hat{\epsilon}_{t-k} + \mathrm{error} \]
From the preceding parameter estimates, the BIC is then computed as
\[ \mathrm{BIC}(m,j) = \ln(\tilde{\sigma}^2_{(m,j)}) + 2(m+j)\ln(n)/n \]
where
\[ \tilde{\sigma}^2_{(m,j)} = \frac{1}{n} \sum_{t=t_0}^{n} \left( \tilde{z}_t - \sum_{i=1}^{m} \phi_i^{(m,j)}\,\tilde{z}_{t-i} - \sum_{k=1}^{j} \theta_k^{(m,j)}\,\hat{\epsilon}_{t-k} \right)^2 \]
where $t_0 = p_\epsilon + \max(m,j)$.
A MINIC table is then constructed using $\mathrm{BIC}(m,j)$; see Table 7.6. If $p_{max} > p_{\epsilon,min}$, the preceding regression might fail due to linear dependence on the estimated error series and the mean-corrected series. Values of $\mathrm{BIC}(m,j)$ that cannot be computed are set to missing. For large autoregressive and moving-average test orders with relatively few observations, a nearly perfect fit can result. This condition can be identified by a large negative $\mathrm{BIC}(m,j)$ value.
Table 7.6 MINIC Table
        MA
AR      0           1           2           3
0       BIC(0,0)    BIC(0,1)    BIC(0,2)    BIC(0,3)
1       BIC(1,0)    BIC(1,1)    BIC(1,2)    BIC(1,3)
2       BIC(2,0)    BIC(2,1)    BIC(2,2)    BIC(2,3)
3       BIC(3,0)    BIC(3,1)    BIC(3,2)    BIC(3,3)
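1. Let $Y_{m,t} = (\tilde{z}_t, \tilde{z}_{t-1}, \ldots, \tilde{z}_{t-m})'$, and compute the following $(m+1) \times (m+1)$ matrices:
\[ \hat{\beta}(m, j+1) = \left( \sum_t Y_{m,t-j-1} Y'_{m,t-j-1} \right)^{-1} \left( \sum_t Y_{m,t-j-1} Y'_{m,t} \right) \]
\[ \hat{\beta}^*(m, j+1) = \left( \sum_t Y_{m,t} Y'_{m,t} \right)^{-1} \left( \sum_t Y_{m,t} Y'_{m,t-j-1} \right) \]
\[ \hat{A}^*(m,j) = \hat{\beta}^*(m, j+1)\,\hat{\beta}(m, j+1) \]
where $t$ ranges from $j+m+2$ to $n$.
2. Find the smallest eigenvalue, $\hat{\lambda}^*(m,j)$, of $\hat{A}^*(m,j)$ and its corresponding normalized eigenvector, $\Phi_{m,j} = (1, -\phi_1^{(m,j)}, -\phi_2^{(m,j)}, \ldots, -\phi_m^{(m,j)})$. The squared canonical correlation estimate is $\hat{\lambda}^*(m,j)$.
3. Using the $\Phi_{m,j}$ as AR($m$) coefficients, obtain the residuals for $t = j+m+1$ to $n$, by following the formula: $w_t^{(m,j)} = \tilde{z}_t - \phi_1^{(m,j)} \tilde{z}_{t-1} - \phi_2^{(m,j)} \tilde{z}_{t-2} - \cdots - \phi_m^{(m,j)} \tilde{z}_{t-m}$.
4. From the sample autocorrelations of the residuals, $r_k(w)$, approximate the standard error of the squared canonical correlation estimate by
\[ \mathrm{var}(\hat{\lambda}^*(m,j)^{1/2}) \approx d(m,j)/(n-m-j) \]
where $d(m,j) = (1 + 2 \sum_{i=1}^{j-1} r_k(w^{(m,j)}))$.
The test statistic to be used as an identification criterion is
\[ c(m,j) = -(n-m-j)\,\ln(1 - \hat{\lambda}^*(m,j)/d(m,j)) \]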
which is asymptotically $\chi^2_1$ if $m = p+d$ and $j \ge q$ or if $m \ge p+d$ and $j = q$. For $m > p$ and $j < q$, there is more than one theoretical zero canonical correlation between $Y_{m,t}$ and $Y_{m,t-j-1}$. Since the $\hat{\lambda}^*(m,j)$ are the smallest canonical correlations for each $(m,j)$, the percentiles of $c(m,j)$ are less than those of a $\chi^2_1$; therefore, Tsay and Tiao (1985) state that it is safe to assume a $\chi^2_1$. For $m < p$ and $j < q$, no conclusions about the distribution of $c(m,j)$ are made.
A SCAN table is then constructed using $c(m,j)$ to determine which of the $\hat{\lambda}^*(m,j)$ are significantly different from zero (see Table 7.7). The ARMA orders are tentatively identified by finding a (maximal) rectangular pattern in which the $\hat{\lambda}^*(m,j)$ are insignificant for all test orders $m \ge p+d$ and $j \ge q$. There may be more than one pair of values ($p+d, q$) that permit such a rectangular pattern. In this case, parsimony and the number of insignificant items in the rectangular pattern should help determine the model order. Table 7.8 depicts the theoretical pattern associated with an ARMA(2,2) series.
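Table 7.7 SCAN Table
[Table body not recovered. Rows: AR test orders 0, 1, 2, 3.]

Table 7.8 Theoretical SCAN Table for an ARMA(2,2) Series
[Only the MA = 6 and MA = 7 columns are recovered.]
        MA
AR      6    7
0       X    X
1       X    X
2       0    0
3       0    0
4       0    0
X = significant terms, 0 = insignificant terms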
Stationarity Tests
When a time series has a unit root, the series is nonstationary and the ordinary least squares (OLS) estimator is not normally distributed. Dickey (1976) and Dickey and Fuller (1979) studied the limiting distribution of the OLS estimator of autoregressive models for time series with a simple unit root. Dickey, Hasza, and Fuller (1984) obtained the limiting distribution for time series with seasonal unit roots. Hamilton (1994) discusses the various types of unit root testing.
For a description of Dickey-Fuller tests, see the section "PROBDF Function for Dickey-Fuller Tests" on page 158 in Chapter 5. See Chapter 8, "The AUTOREG Procedure," for a description of Phillips-Perron tests.
The random-walk-with-drift test indicates whether or not an integrated time series has a drift term. Hamilton (1994) discusses this test.
Prewhitening
If, as is usually the case, an input series is autocorrelated, the direct cross-correlation function between the input and response series gives a misleading indication of the relation between the input and response series.
One solution to this problem is called prewhitening. You first fit an ARIMA model for the input series sufficient to reduce the residuals to white noise; then, filter the input series with this model to get the white noise residual series. You then filter the response series with the same model and cross-correlate the filtered response with the filtered input series.
The ARIMA procedure performs this prewhitening process automatically when you precede the IDENTIFY statement for the response series with IDENTIFY and ESTIMATE statements to fit a model for the input series. If a model with no inputs was previously fit to a variable specified by the CROSSCORR= option, then that model is used to prewhiten both the input series and the response series before the cross-correlations are computed for the input series.
For example,
proc arima data=in;
   identify var=x;
   estimate p=1 q=1;
   identify var=y crosscorr=x;
run;
Both X and Y are filtered by the ARMA(1,1) model fit to X before the cross-correlations are computed. Note that prewhitening is done to estimate the cross-correlation function; the unfiltered series are used in any subsequent ESTIMATE or FORECAST statements, and the correlation functions of Y with its own lags are computed from the unfiltered Y series. But initial values in the ESTIMATE statement are obtained with prewhitened data; therefore, the result with prewhitening can be different from the result without prewhitening.
To suppress prewhitening for all input variables, use the CLEAR option in the IDENTIFY statement to make PROC ARIMA disregard all previous models.
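The prewhitening idea can be illustrated outside SAS with a short Python sketch. It is a simplification under stated assumptions: the input model is taken to be a pure AR(1) with a known coefficient `phi` (the example above fits an ARMA(1,1)), the data values are made up, and the helper names `ar1_filter` and `cross_correlation` are hypothetical.

```python
def ar1_filter(series, phi):
    """Apply the filter (1 - phi*B); the first observation is dropped."""
    return [series[t] - phi * series[t - 1] for t in range(1, len(series))]


def cross_correlation(y, x, lag):
    """Sample correlation of y[t] with x[t - lag], for lag >= 0."""
    n = len(y)
    y_bar, x_bar = sum(y) / n, sum(x) / n
    cov = sum((y[t] - y_bar) * (x[t - lag] - x_bar) for t in range(lag, n))
    scale = (sum((v - y_bar) ** 2 for v in y)
             * sum((v - x_bar) ** 2 for v in x)) ** 0.5
    return cov / scale


phi = 0.6  # assumed AR(1) coefficient from a model fit to the input series
x = [1.0, 1.4, 0.9, 1.7, 2.1, 1.2, 0.8, 1.5]  # made-up input series
y = [0.5, 2.1, 2.9, 1.7, 3.5, 4.1, 2.6, 1.5]  # made-up response series

x_white = ar1_filter(x, phi)  # prewhitened input
y_filt = ar1_filter(y, phi)   # response filtered with the SAME model
ccf = [cross_correlation(y_filt, x_white, s) for s in range(3)]
print(ccf)
```

The essential design point is that one filter, derived from the input series alone, is applied to both series before they are cross-correlated.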
Estimation Details
The ARIMA procedure primarily uses the computational methods outlined by Box and Jenkins. Marquardt's method is used for the nonlinear least squares iterations. Numerical approximations of the derivatives of the sum-of-squares function are taken by using a fixed delta (controlled by the DELTA= option).
The methods do not always converge successfully for a given set of data, particularly if the starting values for the parameters are not close to the least squares estimates.
Back-Forecasting
The unconditional sum of squares is computed exactly; thus, back-forecasting is not performed. Early versions of SAS/ETS software used the back-forecasting approximation and allowed a positive value of the BACKLIM= option to control the extent of the back-forecasting. In the current version, requesting a positive number of back-forecasting steps with the BACKLIM= option has no effect.
Preliminary Estimation
If an autoregressive or moving-average operator is specified with no missing lags, preliminary estimates of the parameters are computed by using the autocorrelations computed in the IDENTIFY stage. Otherwise, the preliminary estimates are arbitrarily set to values that produce stable polynomials.
When preliminary estimation is not performed by PROC ARIMA, then initial values of the coefficients for any given autoregressive or moving-average factor are set to 0.1 if the degree of the polynomial associated with the factor is 9 or less. Otherwise, the coefficients are determined by expanding the polynomial $(1 - 0.1B)$ to an appropriate power by using a recursive algorithm.
These preliminary estimates are the starting values in an iterative algorithm to compute estimates of the parameters.
Estimation Methods
Maximum Likelihood
The METHOD=ML option produces maximum likelihood estimates. The likelihood function is maximized via nonlinear least squares using Marquardt's method. Maximum likelihood estimates are more expensive to compute than the conditional least squares estimates; however, they may be preferable in some cases (Ansley and Newbold 1980; Davidson 1981).
The maximum likelihood estimates are computed as follows. Let the univariate ARMA model be
\[ \phi(B)(W_t - \mu_t) = \theta(B)\,a_t \]
where $a_t$ is an independent sequence of normally distributed innovations with mean 0 and variance $\sigma^2$. Here $\mu_t$ is the mean parameter $\mu$ plus the transfer function inputs. The log-likelihood function can be written as follows:
\[ -\frac{1}{2\sigma^2}\,x'\Omega^{-1}x - \frac{1}{2}\ln(|\Omega|) - \frac{n}{2}\ln(\sigma^2) \]
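In this equation, $n$ is the number of observations, $\sigma^2\Omega$ is the variance of $x$ as a function of the $\phi$ and $\theta$ parameters, and $|\Omega|$ denotes the determinant. The vector $x$ is the time series $W_t$ minus the structural part of the model $\mu_t$, written as a column vector, as follows:
\[ x = \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_n \end{bmatrix} - \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix} \]
The maximum likelihood estimate (MLE) of $\sigma^2$ is
\[ s^2 = \frac{1}{n}\,x'\Omega^{-1}x \]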
Note that the default estimator of the variance divides by $n - r$, where $r$ is the number of parameters in the model, instead of by $n$. Specifying the NODF option causes a divisor of $n$ to be used.
The log-likelihood concentrated with respect to $\sigma^2$ can be taken up to additive constants as
\[ -\frac{n}{2}\ln(x'\Omega^{-1}x) - \frac{1}{2}\ln(|\Omega|) \]
Let $H$ be the lower triangular matrix with positive elements on the diagonal such that $HH' = \Omega$. Let $e$ be the vector $H^{-1}x$. The concentrated log-likelihood with respect to $\sigma^2$ can now be written as
\[ -\frac{n}{2}\ln(e'e) - \ln(|H|) \]
or
\[ -\frac{n}{2}\ln(|H|^{1/n}\,e'e\,|H|^{1/n}) \]
The MLE is produced by using a Marquardt algorithm to minimize the following sum of squares:
\[ |H|^{1/n}\,e'e\,|H|^{1/n} \]
The subsequent analysis of the residuals is done by using $e$ as the vector of residuals.
Unconditional Least Squares
The METHOD=ULS option produces unconditional least squares estimates. The ULS method is also referred to as the exact least squares (ELS) method. For METHOD=ULS, the estimates minimize
\[ \sum_{t=1}^{n} \tilde{a}_t^2 = \sum_{t=1}^{n} \left( x_t - C_t V_t^{-1} (x_1, \ldots, x_{t-1})' \right)^2 \]
where $C_t$ is the covariance matrix of $x_t$ and $(x_1, \ldots, x_{t-1})$, and $V_t$ is the variance matrix of $(x_1, \ldots, x_{t-1})$. In fact, $\sum_{t=1}^{n} \tilde{a}_t^2$ is the same as $x'\Omega^{-1}x$, and hence $e'e$. Therefore, the unconditional least squares estimates are obtained by minimizing the sum of squared residuals rather than using the log-likelihood as the criterion function.
Conditional Least Squares
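The METHOD=CLS option produces conditional least squares estimates. The CLS estimates are conditional on the assumption that the past unobserved errors are equal to 0. The series $x_t$ can be represented in terms of the previous observations, as follows:
\[ x_t = a_t + \sum_{i=1}^{\infty} \pi_i x_{t-i} \]
The $\pi$ weights are computed from the ratio of the $\phi$ and $\theta$ polynomials, as follows:
\[ \frac{\phi(B)}{\theta(B)} = 1 - \sum_{i=1}^{\infty} \pi_i B^i \]
The CLS method produces estimates minimizing
\[ \sum_{t=1}^{n} \hat{a}_t^2 = \sum_{t=1}^{n} \left( x_t - \sum_{i=1}^{\infty} \hat{\pi}_i x_{t-i} \right)^2 \]
where the unobserved past values of $x_t$ are set to 0 and $\hat{\pi}_i$ are computed from the estimates of $\phi$ and $\theta$ at each iteration.
For METHOD=ULS and METHOD=ML, initial estimates are computed using the METHOD=CLS algorithm.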
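As a rough illustration of the conditional least squares criterion, the following Python sketch (not SAS) expands the ratio phi(B)/theta(B) = 1 - sum_i pi_i B^i into its leading pi weights and then forms the residuals with unobserved past values set to 0. The function names and the sample values are hypothetical, and the sketch ignores the mean and any transfer function inputs.

```python
def pi_weights(ar_coefs, ma_coefs, count):
    """First `count` pi weights of phi(B)/theta(B) = 1 - sum_i pi_i * B**i.

    ar_coefs = [phi_1, ..., phi_p] and ma_coefs = [theta_1, ..., theta_q],
    with phi(B) = 1 - phi_1*B - ... and theta(B) = 1 - theta_1*B - ...
    """
    phi_poly = [1.0] + [-v for v in ar_coefs]
    theta_poly = [1.0] + [-v for v in ma_coefs]
    ratio = [1.0]  # coefficients c_k of phi(B)/theta(B); c_0 = 1
    for k in range(1, count + 1):
        target = phi_poly[k] if k < len(phi_poly) else 0.0
        # theta(B)*c(B) = phi(B)  =>  c_k = phi_k - sum_{j>=1} theta_j*c_{k-j}
        carry = sum(theta_poly[j] * ratio[k - j]
                    for j in range(1, min(k, len(theta_poly) - 1) + 1))
        ratio.append(target - carry)
    return [-c for c in ratio[1:]]


def cls_residuals(x, weights):
    """a_t = x_t - sum_i pi_i * x_{t-i}, with unobserved past values set to 0."""
    return [x[t] - sum(w * x[t - i]
                       for i, w in enumerate(weights, start=1) if t - i >= 0)
            for t in range(len(x))]


# ARMA(1,1) with phi_1 = 0.5 and theta_1 = 0.2 (illustrative values only).
weights = pi_weights([0.5], [0.2], 3)
residuals = cls_residuals([1.0, 1.0, 0.5, 0.2], weights)
print(weights, sum(a * a for a in residuals))  # pi weights and the CLS criterion
```

An optimizer would repeat this calculation for each trial set of parameters and keep the set with the smallest sum of squares.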
Information Criteria
PROC ARIMA computes and prints two information criteria, Akaike's information criterion (AIC) (Akaike 1974; Harvey 1981) and Schwarz's Bayesian criterion (SBC) (Schwarz 1978). The AIC and SBC are used to compare competing models fit to the same series. The model with the smaller information criteria is said to fit the data better. The AIC is computed as
\[ -2\ln(L) + 2k \]
where $L$ is the likelihood function and $k$ is the number of free parameters. The SBC is computed as
\[ -2\ln(L) + \ln(n)\,k \]
where $n$ is the number of residuals that can be computed for the time series. Sometimes Schwarz's Bayesian criterion is called the Bayesian information criterion (BIC).
If METHOD=CLS is used to do the estimation, an approximation value of $L$ is used, where $L$ is based on the conditional sum of squares instead of the exact sum of squares, and a Jacobian factor is left out.
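The two criteria are simple functions of the log-likelihood. The following Python sketch (not SAS) evaluates them for two candidate models; the log-likelihood values, parameter counts, and residual count are invented for illustration and do not come from any PROC ARIMA run.

```python
import math


def aic(log_likelihood, k):
    """Akaike's information criterion: -2 ln(L) + 2k."""
    return -2.0 * log_likelihood + 2.0 * k


def sbc(log_likelihood, k, n):
    """Schwarz's Bayesian criterion: -2 ln(L) + ln(n) k."""
    return -2.0 * log_likelihood + math.log(n) * k


# Hypothetical candidates fit to the same series with n = 100 residuals.
candidates = {"AR(1)": (-152.3, 2), "ARMA(2,1)": (-150.9, 4)}
for name, (loglik, k) in candidates.items():
    print(name, round(aic(loglik, k), 2), round(sbc(loglik, k, 100), 2))
```

Because ln(n) exceeds 2 once n is 8 or more, the SBC penalizes additional parameters more heavily than the AIC for all but very short series.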
Tests of Residuals
A table of test statistics for the hypothesis that the model residuals are white noise is printed as part of the ESTIMATE statement output. The chi-square statistics used in the test for lack of fit are computed using the Ljung-Box formula
\[ \chi^2_m = n(n+2) \sum_{k=1}^{m} \frac{r_k^2}{(n-k)} \]
where
\[ r_k = \frac{\sum_{t=1}^{n-k} a_t a_{t+k}}{\sum_{t=1}^{n} a_t^2} \]
and $a_t$ is the residual series.
This formula has been suggested by Ljung and Box (1978) as yielding a better fit to the asymptotic chi-square distribution than the Box-Pierce Q statistic. Some simulation studies of the finite sample properties of this statistic are given by Davies, Triggs, and Newbold (1977) and by Ljung and Box (1978). When the time series has missing values, Stoffer and Toloi (1992) suggest a modification of this test statistic that has improved distributional properties over the standard Ljung-Box formula given above. When the series contains missing values, this modified test statistic is used by default.
Each chi-square statistic is computed for all lags up to the indicated lag value and is not independent of the preceding chi-square values. The null hypothesis tested is that the current set of autocorrelations is white noise.
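The Ljung-Box formula can be checked by hand on a tiny series. The Python sketch below (not SAS) implements the formula exactly as written above, without mean-centering the residuals and without the Stoffer-Toloi modification for missing values; the function name `ljung_box` and the four-point residual series are illustrative assumptions.

```python
def ljung_box(residuals, m):
    """Ljung-Box chi-square statistic for lags 1..m of a residual series."""
    n = len(residuals)
    denom = sum(a * a for a in residuals)
    total = 0.0
    for k in range(1, m + 1):
        # Lag-k residual autocorrelation r_k.
        r_k = sum(residuals[t] * residuals[t + k] for t in range(n - k)) / denom
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


# Alternating residuals: r_1 = -0.75 and r_2 = 0.5.
a = [1.0, -1.0, 1.0, -1.0]
print(ljung_box(a, 1), ljung_box(a, 2))
```

The statistic for lag 2 includes the lag 1 term, which is why successive chi-square values in the printed table are not independent.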
t-values
The t values reported in the table of parameter estimates are approximations whose accuracy depends on the validity of the model, the nature of the model, and the length of the observed series. When the length of the observed series is short and the number of estimated parameters is large with respect to the series length, the t approximation is usually poor. Probability values that correspond to a t distribution should be interpreted carefully because they may be misleading.
If the data are differenced and a moving-average model is fit, the parameter estimates might try to converge exactly on the invertibility boundary. In this case, the standard error estimates that are based on derivatives might be inaccurate.
The transfer function for an input variable is optional. The name of a variable by itself can be used to specify a pure regression term for the variable. If specified, the syntax of the transfer function is
\[ S \,\$\, (L_{1,1}, L_{1,2}, \ldots)(L_{2,1}, \ldots)\ldots / (L_{i,1}, L_{i,2}, \ldots)(L_{i+1,1}, \ldots)\ldots \]
S is the number of periods of time delay (lag) for this input series. Each term in parentheses specifies a polynomial factor with parameters at the lags specified by the $L_{i,j}$ values. The terms before the slash (/) are numerator factors. The terms after the slash (/) are denominator factors. All three parts are optional. Commas can optionally be used between input specifications to make the INPUT= option more readable. The $ sign after the shift is also optional.
Except for the first numerator factor, each of the terms $L_{i,1}, L_{i,2}, \ldots, L_{i,k}$ indicates a factor of the form
\[ (1 - \omega_{i,1} B^{L_{i,1}} - \omega_{i,2} B^{L_{i,2}} - \cdots - \omega_{i,k} B^{L_{i,k}}) \]
The form of the first numerator factor depends on the ALTPARM option. By default, the constant 1 in the first numerator factor is replaced with a free parameter $\omega_0$.
The ALTPARM option does not materially affect the results; it just presents the results differently. Some people prefer to see the model written one way, while others prefer the alternative representation. Table 7.9 illustrates the effect of the ALTPARM option.
Table 7.9 The ALTPARM Option
INPUT= Option: INPUT=((1 2)(12)/(1)X);
ALTPARM    Model
No         $(\omega_0 - \omega_1 B - \omega_2 B^2)(1 - \omega_3 B^{12})/(1 - \delta_1 B)\,X_t$
Yes        $\omega_0 (1 - \omega_1 B - \omega_2 B^2)(1 - \omega_3 B^{12})/(1 - \delta_1 B)\,X_t$
which is a very different model. The point to remember is that a differencing operation requested for the response variable specified by the VAR= option is applied only to that variable and not to the noise term of the model.
Initial Values
The syntax for giving initial values to transfer function parameters in the INITVAL= option parallels the syntax of the INPUT= option. For each transfer function in the INPUT= option, the INITVAL= option should give an initialization specification followed by the input series name. The initialization specification for each transfer function has the form
\[ C \,\$\, (V_{1,1}, V_{1,2}, \ldots)(V_{2,1}, \ldots)\ldots / (V_{i,1}, \ldots)\ldots \]
where C is the lag 0 term in the first numerator factor of the transfer function (or the overall scale factor if the ALTPARM option is specified) and $V_{i,j}$ is the coefficient of the $L_{i,j}$ element in the transfer function.
To illustrate, suppose you want to fit the model
\[ Y_t = \mu + \frac{(\omega_0 - \omega_1 B - \omega_2 B^2)}{(1 - \delta_1 B - \delta_2 B^2 - \delta_3 B^3)}\,X_{t-3} + \frac{1}{(1 - \phi_1 B - \phi_2 B^3)}\,a_t \]
and start the estimation process with the initial values $\mu$=10, $\omega_0$=1, $\omega_1$=0.5, $\omega_2$=0.03, $\delta_1$=0.8, $\delta_2$=-0.1, $\delta_3$=0.002, $\phi_1$=0.1, $\phi_2$=0.01. (These are arbitrary values for illustration only.) You would use the following statements:
identify var=y crosscorr=x;
estimate p=(1,3) input=(3$(1,2)/(1,2,3)x)
         mu=10 ar=.1 .01
         initval=(1$(.5,.03)/(.8,-.1,.002)x);
Note that the lags specified for a particular factor are sorted, so initial values should be given in sorted order. For example, if the P= option had been entered as P=(3,1) instead of P=(1,3), the model would be the same and so would the AR= option. Sorting is done within all factors, including transfer function factors, so initial values should always be given in order of increasing lags.
Here is another illustration, showing initialization for a factored model with multiple inputs. The model is
\[ Y_t = \mu + \frac{\omega_{1,0}}{(1 - \delta_{1,1} B)}\,W_t + (\omega_{2,0} - \omega_{2,1} B)\,X_{t-3} + \frac{1}{(1 - \phi_1 B)(1 - \phi_2 B^6 - \phi_3 B^{12})}\,a_t \]
and the initial values are $\mu$=10, $\omega_{1,0}$=5, $\delta_{1,1}$=0.8, $\omega_{2,0}$=1, $\omega_{2,1}$=0.5, $\phi_1$=0.1, $\phi_2$=0.05, and $\phi_3$=0.01. You would use the following statements:
identify var=y crosscorr=(w x);
estimate p=(1)(6,12) input=(/(1)w, 3$(1)x)
         mu=10 ar=.1 .05 .01
         initval=(5$/(.8)w 1$(.5)x);
(possibly with lag shifts) without any numerator or denominator transfer function factors. One-step-ahead forecasts are not available for the response variable when one or more of the input variables have missing values.
When embedded missing values are present for a model with complex transfer functions, PROC ARIMA uses the first continuous nonmissing piece of each series to do the analysis. That is, PROC ARIMA skips observations at the beginning of each series until it encounters a nonmissing value and then uses the data from there until it encounters another missing value or until the end of the data is reached. This makes the current version of PROC ARIMA compatible with earlier releases that did not allow embedded missing values.
Forecasting Details
If the model has input variables, a forecast beyond the end of the data for the input variables is possible only if univariate ARIMA models have previously been fit to the input variables or future values for the input variables are included in the DATA= data set.
If input variables are used, the forecast standard errors and confidence limits of the response depend on the estimated forecast error variance of the predicted inputs. If several input series are used, the forecast errors for the inputs should be independent; otherwise, the standard errors and confidence limits for the response series will not be accurate. If future values for the input variables are included in the DATA= data set, the standard errors of the forecasts will be underestimated since these values are assumed to be known with certainty.
The forecasts are generated using forecasting equations consistent with the method used to estimate the model parameters. Thus, the estimation method specified in the ESTIMATE statement also controls the way forecasts are produced by the FORECAST statement. If METHOD=CLS is used, the forecasts are infinite memory forecasts, also called conditional forecasts. If METHOD=ULS or METHOD=ML, the forecasts are finite memory forecasts, also called unconditional forecasts.
A complete description of the steps to produce the series forecasts and their standard errors by using either of these methods is quite involved, and only a brief explanation of the algorithm is given in the next two sections. Additional details about the finite and infinite memory forecasts can be found in Brockwell and Davis (1991). The prediction of stationary ARMA processes is explained in Chapter 5, and the prediction of nonstationary ARMA processes is given in Chapter 9 of Brockwell and Davis (1991).
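For infinite memory forecasts, the series $x_t$ is represented as $x_t = a_t + \sum_{i=1}^{\infty} \pi_i x_{t-i}$, where $\phi(B)/\theta(B) = 1 - \sum_{i=1}^{\infty} \pi_i B^i$. The $k$-step forecast of $x_{t+k}$ is computed as follows:
\[ \hat{x}_{t+k} = \sum_{i=1}^{k-1} \hat{\pi}_i\,\hat{x}_{t+k-i} + \sum_{i=k}^{\infty} \hat{\pi}_i\,x_{t+k-i} \]
where unobserved past values of $x_t$ are set to zero and $\hat{\pi}_i$ is obtained from the estimated parameters $\hat{\phi}$ and $\hat{\theta}$.
For finite memory forecasts, the $k$-step forecast of $x_{t+k}$, given $(x_1, \ldots, x_{t-1})$, is
\[ \tilde{x}_{t+k} = C_{k,t} V_t^{-1} (x_1, \ldots, x_{t-1})' \]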
where $C_{k,t}$ is the covariance of $x_{t+k}$ and $(x_1, \ldots, x_{t-1})$ and $V_t$ is the covariance matrix of the vector $(x_1, \ldots, x_{t-1})$. $C_{k,t}$ and $V_t$ are derived from the estimated parameters.
Finite memory forecasts minimize the mean squared error of prediction if the parameters of the ARMA model are known exactly. (In most cases, the parameters of the ARMA model are estimated, so the predictors are not true best linear forecasts.)
If the response series is differenced, the final forecast is produced by summing the forecast of the differenced series. This summation and the forecast are conditional on the initial values of the series. Thus, when the response series is differenced, the final forecasts are not true finite memory forecasts because they are derived by assuming that the differenced series begins in a steady-state condition. Thus, they fall somewhere between finite memory and infinite memory forecasts. In practice, there is seldom any practical difference between these forecasts and true finite memory forecasts.
As one alternative, you can simply exponentiate the forecast series. This procedure gives a forecast for the median of the series, but the antilog of the forecast log series underpredicts the mean of the original series. If you want to predict the expected value of the series, you need to take into account the standard error of the forecast, as shown in the following example, which uses an AR(2) model to forecast the log of a series Y:
data in;
   set in;
   ylog = log( y );
run;
proc arima data=in;
   identify var=ylog;
   estimate p=2;
   forecast lead=10 out=out;
run;
data out;
   set out;
   y = exp( ylog );
   l95 = exp( l95 );
   u95 = exp( u95 );
   forecast = exp( forecast + std*std/2 );
run;
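The difference between the two back-transformations can be seen numerically. The Python sketch below (not SAS) assumes that the forecast error of the log series is normally distributed, so that the original series is lognormal; the function name `back_transform` and the input values are hypothetical.

```python
import math


def back_transform(log_forecast, log_std):
    """Return (median, mean) forecasts of the original series.

    exp(forecast) estimates the median; adding half the forecast error
    variance before exponentiating estimates the expected value.
    """
    median = math.exp(log_forecast)
    mean = math.exp(log_forecast + log_std ** 2 / 2.0)
    return median, mean


# Illustrative values: log-scale forecast 4.0 with standard error 0.3.
median, mean = back_transform(4.0, 0.3)
print(round(median, 3), round(mean, 3))
```

The gap between the two values grows with the forecast standard error, so the correction matters most at long lead times.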
according to the frequency specifications of the INTERVAL= option. If the INTERVAL= option is not specified, PROC ARIMA extrapolates the ID variable by incrementing the ID variable value for the last observation in the input data by 1 for each forecast period. Values of the ID variable over the range of the input data are copied to the output data set. The ALIGN= option is used to align the ID variable to the beginning, middle, or end of the time ID interval specified by the INTERVAL= option.
Detecting Outliers
You can use the OUTLIER statement to detect changes in the level of the response series that are not accounted for by the estimated model. The types of changes considered are additive outliers (AO), level shifts (LS), and temporary changes (TC).
Let $\xi_t$ be a regression variable that describes some type of change in the mean response. In time series literature $\xi_t$ is called a shock signature. An additive outlier at some time point $s$ corresponds to a shock signature $\xi_t$ such that $\xi_s = 1.0$ and $\xi_t$ is 0.0 at all other points. Similarly a permanent level shift that originates at time $s$ has a shock signature such that $\xi_t$ is 0.0 for $t < s$ and 1.0 for $t \ge s$. A temporary level shift of duration $d$ that originates at time $s$ has $\xi_t$ equal to 1.0 between $s$ and $s + d$ and 0.0 otherwise.
Suppose that you are estimating the ARIMA model
\[ D(B)Y_t = \mu_t + \frac{\theta(B)}{\phi(B)}\,a_t \]
where $Y_t$ is the response series, $D(B)$ is the differencing polynomial in the backward shift operator $B$ (possibly identity), $\mu_t$ is the transfer function input, $\phi(B)$ and $\theta(B)$ are the AR and MA polynomials, respectively, and $a_t$ is the Gaussian white noise series.
The problem of detection of level shifts in the OUTLIER statement is formulated as a problem of sequential selection of shock signatures that improve the model in the ESTIMATE statement. This is similar to the forward selection process in the stepwise regression procedure. The selection process starts with considering shock signatures of the type specified in the TYPE= option, originating at each nonmissing measurement. This involves testing $H_0\colon \beta = 0$ versus $H_a\colon \beta \ne 0$ in the model
\[ D(B)(Y_t - \beta\xi_t) = \mu_t + \frac{\theta(B)}{\phi(B)}\,a_t \]
for each of these shock signatures. The most significant shock signature, if it also satisfies the significance criterion in the ALPHA= option, is included in the model. If no significant shock signature is found, then the outlier detection process stops; otherwise this augmented model, which incorporates the selected shock signature in its transfer function input, becomes the null model for the subsequent selection process. This iterative process stops if at any stage no more significant shock signatures are found or if the number of iterations exceeds the maximum search number that results from the MAXNUM= and MAXPCT= settings. In all these iterations, the parameters of the ARIMA model in the ESTIMATE statement are held fixed.
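The options named in this section can be combined in one OUTLIER statement. The following is a minimal sketch, not a recommended configuration: the data set IN and the variables Y and DATE are hypothetical, the model is only a placeholder, and the option values are chosen arbitrarily for illustration.

```sas
/* Hypothetical data set IN with response Y and time ID variable DATE */
proc arima data=in;
   identify var=y(1);
   estimate q=1 noint method=ml;

   /* Search for additive outliers and level shifts only.      */
   /* ALPHA= sets the significance criterion; MAXNUM= and      */
   /* MAXPCT= together bound the number of search iterations;  */
   /* SIGMA=ROBUST is the default and is shown for clarity.    */
   outlier type=(additive shift) alpha=0.01
           maxnum=3 maxpct=5 sigma=robust id=date;
run;
```

Because the ARIMA parameters are held fixed during the search, the detected shock signatures are candidates only; the model should be reestimated after any of them is added as a regressor.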
The precise details of the testing procedure for a given shock signature $\xi_t$ are as follows: The preceding testing problem is equivalent to testing $H_0\colon \beta = 0$ versus $H_a\colon \beta \neq 0$ in the following "regression with ARMA errors" model

$$N_t = \beta \zeta_t + \frac{\theta(B)}{\phi(B)} a_t$$

where $N_t = D(B) Y_t - \mu_t$ is the noise process and $\zeta_t = D(B) \xi_t$ is the effective shock signature.

In this setting, under $H_0$, $N = (N_1, N_2, \ldots, N_n)^T$ is a mean zero Gaussian vector with variance-covariance matrix $\sigma^2 \Omega$. Here $\sigma^2$ is the variance of the white noise process $a_t$ and $\Omega$ is the variance-covariance matrix associated with the ARMA model. Moreover, under $H_a$, $N$ has $\beta \zeta$ as the mean vector, where $\zeta = (\zeta_1, \zeta_2, \ldots, \zeta_n)^T$. Additionally, the generalized least squares estimate of $\beta$ and its variance are given by

$$\hat{\beta} = \delta / \kappa, \qquad \operatorname{Var}(\hat{\beta}) = \sigma^2 / \kappa$$

where $\delta = \zeta^T \Omega^{-1} N$ and $\kappa = \zeta^T \Omega^{-1} \zeta$. The test statistic $\tau^2 = \delta^2 / (\sigma^2 \kappa)$ is used to test the significance of $\beta$; it has an approximate chi-squared distribution with 1 degree of freedom under $H_0$. The type of estimate of $\sigma^2$ used in the calculation of $\tau^2$ can be specified by the SIGMA= option. The default setting is SIGMA=ROBUST, which corresponds to a robust estimate suggested in an outlier detection procedure in X-12-ARIMA, the Census Bureau's time series analysis program; see Findley et al. (1998) for additional information. The robust estimate of $\sigma^2$ is computed by the formula

$$\hat{\sigma}^2 = \left( 1.49 \times \operatorname{Median}(|\hat{a}_t|) \right)^2$$

where $\hat{a}_t$ are the standardized residuals of the null ARIMA model. The setting SIGMA=MSE corresponds to the usual mean squared error estimate (MSE) computed the same way as in the ESTIMATE statement with the NODF option. The quantities $\delta$ and $\kappa$ are efficiently computed by a method described in de Jong and Penzer (1998); see also Kohn and Ansley (1985).
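Two numerical facts help in reading these formulas. The following sketch assumes Gaussian white noise with standard deviation $\sigma$ and uses standard normal and chi-squared quantiles; the symbol $\Phi$ for the standard normal distribution function is introduced here for illustration.

```latex
% Scale factor of the robust estimate: for Gaussian a_t with
% standard deviation \sigma,
\operatorname{Median}(|a_t|) = \Phi^{-1}(0.75)\,\sigma \approx 0.6745\,\sigma
\quad\Longrightarrow\quad
\sigma \approx 1.4826 \times \operatorname{Median}(|a_t|) .

% Decision rule for one shock signature at significance level \alpha:
\text{reject } H_0 \text{ if } \tau^2 > \chi^2_{1,\,1-\alpha},
\qquad
\chi^2_{1,\,0.95} \approx 3.84 .
```

The constant 1.49 in the robust estimate is close to the Gaussian scale factor 1.4826, so the robust estimate behaves like a median-based standard deviation that large residuals at outlying observations do not inflate.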
1. Proceed with model identification and estimation as usual. Suppose this results in a tentative ARIMA model, say M.
2. Check for additive and permanent level shifts unaccounted for by the model M by using the OUTLIER statement. In this step, unless there is evidence to justify it, the number of level shifts searched should be kept small.
3. Augment the original data set with the regression variables that correspond to the detected outliers.
4. Include the first few of these regression variables in M, and call this model M1. Reestimate all the parameters of M1. It is important not to include too many of these outlier variables in the model in order to avoid the danger of over-fitting.
5. Check the adequacy of M1 by examining the parameter estimates, residual analysis, and outlier detection. Refine it more if necessary.
Unless the NOOUTALL option is specified, the data set contains the whole time series. The FORECAST variable has the one-step forecasts (predicted values) for the input periods, followed by n forecast values, where n is the LEAD= value. The actual and RESIDUAL values are missing beyond the end of the series.

If you specify the same OUT= data set in different FORECAST statements, the later FORECAST statements overwrite the output from the previous FORECAST statements. If you want to combine the forecasts from different FORECAST statements in the same output data set, specify the OUT= option once in the PROC ARIMA statement and omit the OUT= option in the FORECAST statements.

When a global output data set is created by the OUT= option in the PROC ARIMA statement, the variables in the OUT= data set are defined by the first FORECAST statement that is executed. The results of subsequent FORECAST statements are vertically concatenated onto the OUT= data set. Thus, if no ID variable is specified in the first FORECAST statement that is executed, no ID variable appears in the output data set, even if one is specified in a later FORECAST statement. If an ID variable is specified in the first FORECAST statement that is executed but not in a later FORECAST statement, the value of the ID variable is the same as the last value processed for the ID variable for all observations created by the later FORECAST statement. Furthermore, even if the response variable changes in subsequent FORECAST statements, the response variable name in the output data set is that of the first response variable analyzed.
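The global OUT= behavior can be sketched as follows. This is an illustration under stated assumptions: the data set IN, the variables Y and DATE, the output data set FCST, and the two models are all hypothetical.

```sas
/* OUT= is given once in the PROC ARIMA statement, so the output */
/* of both FORECAST statements is concatenated in FCST.          */
proc arima data=in out=fcst;
   /* First model: its FORECAST statement defines the variables  */
   /* of FCST, including the ID variable DATE.                   */
   identify var=y;
   estimate p=1;
   forecast lead=6 id=date interval=month;

   /* Second model: the same ID= specification is repeated so    */
   /* that DATE is also extrapolated for these observations.     */
   identify var=y(1);
   estimate q=1;
   forecast lead=6 id=date interval=month;
run;
```

Repeating the same ID= and INTERVAL= specification in every FORECAST statement avoids the situation described above, in which observations from a later statement carry a stale ID value.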
CORR, a numeric variable that contains the autocorrelation or cross-correlation function values. CORR contains the autocorrelations of the VAR= variable when the value of the CROSSVAR variable is blank. Otherwise CORR contains the cross-correlations between the VAR= variable and the variable named by the CROSSVAR variable.

STDERR, a numeric variable that contains the standard errors of the autocorrelations. The standard error estimate is based on the hypothesis that the process that generates the time series is a pure moving-average process of order LAG−1. For the cross-correlations, STDERR contains the value $1/\sqrt{n}$, which approximates the standard error under the hypothesis that the two series are uncorrelated.

INVCORR, a numeric variable that contains the inverse autocorrelation function values of the VAR= variable. For cross-correlation observations (that is, when the value of the CROSSVAR variable is not blank), INVCORR contains missing values.

PARTCORR, a numeric variable that contains the partial autocorrelation function values of the VAR= variable. For cross-correlation observations (that is, when the value of the CROSSVAR variable is not blank), PARTCORR contains missing values.
MU
MAj_k      These numeric variables contain values for the moving-average parameters. The variables for moving-average parameters are named MAj_k, where j is the factor number and k is the index of the parameter within a factor.

ARj_k      These numeric variables contain values for the autoregressive parameters. The variables for autoregressive parameters are named ARj_k, where j is the factor number and k is the index of the parameter within a factor.

Ij_k       These variables contain values for the transfer function parameters. Variables for transfer function parameters are named Ij_k, where j is the number of the INPUT variable associated with the transfer function component and k is the number of the parameter for the particular INPUT variable. INPUT variables are numbered according to the order in which they appear in the INPUT= list.

_STATUS_   This variable describes the convergence status of the model. A value of 0_CONVERGED indicates that the model converged.
The value of the _TYPE_ variable for each observation indicates the kind of value contained in the variables for model parameters for the observation. The OUTEST= data set contains observations with the following _TYPE_ values:

EST      The observation contains parameter estimates.
STD      The observation contains approximate standard errors of the estimates.
CORR     The observation contains correlations of the estimates. OUTCORR must be specified to get these observations.
COV      The observation contains covariances of the estimates. OUTCOV must be specified to get these observations.
FACTOR   The observation contains values that identify for each parameter the factor that contains it. Negative values indicate denominator factors in transfer function models.
LAG      The observation contains values that identify the lag associated with each parameter.
SHIFT    The observation contains values that identify the shift associated with the input series for the parameter.
The values given for _TYPE_=FACTOR, _TYPE_=LAG, or _TYPE_=SHIFT observations enable you to reconstruct the model employed when provided with only the OUTEST= data set.
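As a minimal sketch of how these observations can be inspected, the following step prints only the rows needed to reconstruct a model. It assumes that an OUTEST= data set named EST (a hypothetical name) has already been created by an ESTIMATE statement.

```sas
/* EST is a hypothetical OUTEST= data set from an earlier ESTIMATE */
/* statement. The EST row holds the estimates; the FACTOR, LAG,    */
/* and SHIFT rows locate each parameter within the model.          */
proc print data=est noobs;
   where _TYPE_ in ('EST', 'FACTOR', 'LAG', 'SHIFT');
run;
```

Reading each parameter column from top to bottom then gives the estimate, its factor, its lag, and its shift.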
OUTEST= Examples
This section clarifies how model parameters are stored in the OUTEST= data set with two examples. Consider the following example:
proc arima data=input;
   identify var=y cross=(x1 x2);
   estimate p=(1)(6) q=(1,3)(12)
            input=(x1 x2) outest=est;
The model specified by these statements is

$$Y_t = \mu + \omega_{1,0} X_{1,t} + \omega_{2,0} X_{2,t} + \frac{(1 - \theta_{11} B - \theta_{12} B^3)(1 - \theta_{21} B^{12})}{(1 - \phi_{11} B)(1 - \phi_{21} B^6)} a_t$$
The OUTEST= data set contains the values shown in Table 7.10.
Table 7.10
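Obs  _TYPE_   MU          MA1_1               MA1_2               MA2_1               AR1_1             AR2_1
1    EST      $\mu$       $\theta_{11}$       $\theta_{12}$       $\theta_{21}$       $\phi_{11}$       $\phi_{21}$
2    STD      se $\mu$    se $\theta_{11}$    se $\theta_{12}$    se $\theta_{21}$    se $\phi_{11}$    se $\phi_{21}$
3    FACTOR   0           1                   1                   2                   1                 2
4    LAG      0           1                   3                   12                  1                 6
5    SHIFT    0           0                   0                   0                   0                 0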
Note that the symbols in the rows for _TYPE_=EST and _TYPE_=STD in Table 7.10 would be numeric values in a real data set. Next, consider the following example:
proc arima data=input;
   identify var=y cross=(x1 x2);
   estimate p=1 q=1
            input=(2 $ (1)/(1,2)x1 1 $ /(1)x2) outest=est;
run;

proc print data=est;
run;
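The model specified by these statements is

$$Y_t = \mu + \frac{\omega_{10} - \omega_{11} B}{1 - \delta_{11} B - \delta_{12} B^2} X_{1,t-2} + \frac{\omega_{20}}{1 - \delta_{21} B} X_{2,t-1} + \frac{(1 - \theta_1 B)}{(1 - \phi_1 B)} a_t$$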
The OUTEST= data set contains the values shown in Table 7.11.
Table 7.11
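Obs  _TYPE_   MU          MA1_1             AR1_1           I1_3                I1_4                I2_2
1    EST      $\mu$       $\theta_1$        $\phi_1$        $\delta_{11}$       $\delta_{12}$       $\delta_{21}$
2    STD      se $\mu$    se $\theta_1$     se $\phi_1$     se $\delta_{11}$    se $\delta_{12}$    se $\delta_{21}$
3    FACTOR   0           1                 1               -1                  -1                  -1
4    LAG      0           1                 1               1                   2                   1
5    SHIFT    0           0                 0               2                   2                   1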
NUMRESID   _VALUE_ contains the number of residuals.
NPARMS     _VALUE_ contains the number of parameters in the model.
NDIFS      _VALUE_ contains the sum of the differencing lags employed for the response variable.
ERRORVAR   _VALUE_ contains the estimate of the innovation variance.
MU         _VALUE_ contains the estimate of the mean term.
AR         _VALUE_ contains the estimate of the autoregressive parameter indexed by the _FACTOR_ and _LAG_ variable values.
MA         _VALUE_ contains the estimate of a moving-average parameter indexed by the _FACTOR_ and _LAG_ variable values.
NUM        _VALUE_ contains the estimate of the parameter in the numerator factor of the transfer function of the input variable indexed by the _FACTOR_, _LAG_, and _SHIFT_ variable values.
DEN        _VALUE_ contains the estimate of the parameter in the denominator factor of the transfer function of the input variable indexed by the _FACTOR_, _LAG_, and _SHIFT_ variable values.
DIF        _VALUE_ contains the difference operator defined by the difference lag given by the value in the _LAG_ variable.
AIC        Akaike's information criterion
SBC        Schwarz's Bayesian criterion
LOGLIK     the log-likelihood, if METHOD=ML or METHOD=ULS is specified
SSE        the sum of the squared residuals
NUMRESID   the number of residuals
NPARMS     the number of parameters in the model
NDIFS      the sum of the differencing lags employed for the response variable
ERRORVAR   the estimate of the innovation variance
MU         the estimate of the mean term
CONV       tells if the estimation converged. The value of 0 signifies that estimation converged. Nonzero values reflect convergence problems.
NITER      the number of iterations
Remark. CONV takes an integer value that corresponds to the error condition of the parameter estimation process. The value of 0 signifies that the estimation process has converged. Higher values signify convergence problems of increasing severity. Specifically:

CONV = 0 indicates that the estimation process has converged.
CONV = 1 or 2 indicates that the estimation process has run into numerical problems (such as encountering an unstable model or a ridge) during the iterations.
CONV >= 3 indicates that the estimation process has failed to converge.
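A convergence check based on this remark might look like the following sketch. It rests on assumptions that should be verified against your release: the data set IN, the variable Y, and the output name STAT are hypothetical, and the statistic name is assumed to be stored in a character variable called _STAT_ alongside _VALUE_.

```sas
/* Hypothetical input data set IN with response variable Y */
proc arima data=in;
   identify var=y;
   estimate p=2 method=ml outstat=stat;
run;

/* Assumption: OUTSTAT= stores the statistic name in _STAT_ */
/* and its value in _VALUE_.                                */
data _null_;
   set stat;
   where _STAT_ = 'CONV';
   if _VALUE_ = 0 then
      put 'Estimation converged.';
   else if _VALUE_ in (1, 2) then
      put 'Numerical problems during the iterations, CONV=' _VALUE_;
   else
      put 'Estimation failed to converge, CONV=' _VALUE_;
run;
```

The three branches mirror the three cases listed in the remark, so a batch job can decide whether the parameter estimates are safe to use.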
Printed Output
The ARIMA procedure produces printed output for each of the IDENTIFY, ESTIMATE, and FORECAST statements. The output produced by each ARIMA statement is described in the following sections. If ODS Graphics is enabled, the line printer plots mentioned below are replaced by the corresponding ODS plots.
plot if the value of LINESIZE= option is sufficiently large. The standard errors are derived using Bartlett's approximation (Box and Jenkins 1976, p. 177). The approximation for a standard error for the estimated autocorrelation function at lag k is based on a null hypothesis that a pure moving-average Gaussian process of order k−1 generated the time series. The relative position of an approximate 95% confidence interval under this null hypothesis is indicated by the dots in the plot, while the asterisks represent the relative magnitude of the autocorrelation value.

a plot of the sample inverse autocorrelation function. See the section "The Inverse Autocorrelation Function" on page 239 for more information about the inverse autocorrelation function.

a plot of the sample partial autocorrelation function

a table of test statistics for the hypothesis that the series is white noise. These test statistics are the same as the tests for white noise residuals produced by the ESTIMATE statement and are described in the section "Estimation Details" on page 248.

a plot of the sample cross-correlation function for each series specified in the CROSSCORR= option. If a model was previously estimated for a variable in the CROSSCORR= list, the cross-correlations for that series are computed for the prewhitened input and response series. For each input variable with a prewhitening filter, the cross-correlation report for the input series includes the following:
   a table of test statistics for the hypothesis of no cross-correlation between the input and response series
   the prewhitening filter used for the prewhitening transformation of the predictor and response variables

ESACF tables if the ESACF option is used

MINIC table if the MINIC option is used

SCAN table if the SCAN option is used

STATIONARITY test results if the STATIONARITY option is used
the estimates of the constant term, the innovation variance (Variance Estimate), the innovation standard deviation (Std Error Estimate), Akaike's information criterion (AIC), Schwarz's Bayesian criterion (SBC), and the number of residuals

the correlation matrix of the parameter estimates

a table of test statistics for the hypothesis that the residuals of the model are white noise. The table is titled "Autocorrelation Check of Residuals."

if the PLOT option is specified, autocorrelation, inverse autocorrelation, and partial autocorrelation function plots of the residuals

if an INPUT variable has been modeled in such a way that prewhitening is performed in the IDENTIFY step, a table of test statistics titled "Crosscorrelation Check of Residuals." The test statistic is based on the chi-square approximation suggested by Box and Jenkins (1976, pp. 395–396). The cross-correlation function is computed by using the residuals from the model as one series and the prewhitened input variable as the other series.

if the GRID option is specified, the sum-of-squares or likelihood surface over a grid of parameter values near the final estimates

a summary of the estimated model that shows the autoregressive factors, moving-average factors, and transfer function factors in backshift notation with the estimated parameter values.
The Approx Prob > ChiSq column lists the approximate p-value of the test statistic.
a summary of the estimated model

a table of forecasts with the following columns:
   The Obs column contains the observation number.
   The Forecast column contains the forecast values.
   The Std Error column contains the forecast standard errors.
   The Lower and Upper columns contain the approximate 95% confidence limits. The ALPHA= option can be used to change the confidence interval for forecasts.
   If the PRINTALL option is specified, the forecast table also includes columns for the actual values of the response series (Actual) and the residual values (Residual).
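The effect of the ALPHA= and PRINTALL options on the forecast table can be sketched as follows. The data set IN, the variable Y, and the AR(1) model are hypothetical placeholders.

```sas
/* Hypothetical data set IN with response variable Y */
proc arima data=in;
   identify var=y;
   estimate p=1;

   /* ALPHA=0.10 requests 90% rather than 95% confidence limits. */
   /* PRINTALL adds the Actual and Residual columns.             */
   forecast lead=12 alpha=0.10 printall;
run;
```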
Table 7.12

ODS Table Name       Description                                           Statement   Option
ChiSqAuto            chi-square statistics table for autocorrelation       IDENTIFY
ChiSqCross           chi-square statistics table for cross-correlations    IDENTIFY    CROSSCORR
CorrGraph            Correlations graph                                    IDENTIFY
DescStats            Descriptive statistics                                IDENTIFY
ESACF                Extended sample autocorrelation function              IDENTIFY    ESACF
ESACFPValues         ESACF probability values                              IDENTIFY    ESACF
IACFGraph            Inverse autocorrelations graph                        IDENTIFY
InputDescStats       Input descriptive statistics                          IDENTIFY
MINIC                Minimum information criterion                         IDENTIFY    MINIC
PACFGraph            Partial autocorrelations graph                        IDENTIFY
SCAN                 Squared canonical correlation estimates               IDENTIFY
SCANPValues          SCAN chi-square probability values                    IDENTIFY
StationarityTests    Stationarity tests                                    IDENTIFY
TentativeOrders      Tentative order selection                             IDENTIFY
ARPolynomial         Filter equations                                      ESTIMATE
ChiSqAuto            chi-square statistics table for autocorrelation       ESTIMATE
ChiSqCross           chi-square statistics table for cross-correlations    ESTIMATE
CorrB                Correlations of the estimates                         ESTIMATE
DenPolynomial        Filter equations                                      ESTIMATE
FitStatistics        Fit statistics                                        ESTIMATE
IterHistory          Iteration history                                     ESTIMATE    PRINTALL
InitialAREstimates   Initial autoregressive parameter estimates            ESTIMATE
InitialMAEstimates   Initial moving-average parameter estimates            ESTIMATE
InputDescription     Input description                                     ESTIMATE
MAPolynomial         Filter equations                                      ESTIMATE
ModelDescription     Model description                                     ESTIMATE
NumPolynomial        Filter equations                                      ESTIMATE
ParameterEstimates   Parameter estimates                                   ESTIMATE
PrelimEstimates      Preliminary estimates                                 ESTIMATE
ObjectiveGrid        Objective function grid matrix                        ESTIMATE    GRID
OptSummary           ARIMA estimation optimization                         ESTIMATE    PRINTALL
OutlierDetails       Detected outliers                                     OUTLIER
Forecasts            Forecast                                              FORECAST
Statistical Graphics
This section provides information about the basic ODS statistical graphics produced by the ARIMA procedure. To request graphics with PROC ARIMA, you must first enable ODS Graphics by specifying the ODS GRAPHICS ON; statement. See Chapter 21, "Statistical Graphics Using ODS" (SAS/STAT User's Guide), for more information. The main types of plots available are as follows:

plots useful in the trend and correlation analysis of the dependent and input series
plots useful for the residual analysis of an estimated model
forecast plots
You can obtain most plots relevant to the specified model by default if ODS Graphics is enabled. For finer control of the graphics, you can use the PLOTS= option in the PROC ARIMA statement. The following example is a simple illustration of how to use the PLOTS= option.
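A minimal sketch of such a request is shown here. The data set SERIESG and the variables XLOG and DATE are borrowed from the airline example in this chapter and are assumed to exist; the particular plot requests are illustrative choices and should be checked against the PLOTS= syntax of your release.

```sas
ods graphics on;

/* ONLY suppresses the default plots so that just the listed ones */
/* are produced: the series correlation panel, the residual       */
/* normality and smoothed residual plots, and the plot of         */
/* one-step-ahead plus multistep forecasts.                       */
proc arima data=seriesg
           plots(only)=( series(corr)
                         residual(normal smooth)
                         forecast(forecast) );
   identify var=xlog(1,12);
   estimate q=(1)(12) noint method=ml;
   forecast id=date interval=month;
run;
```

Listing the plot requests explicitly keeps the output small when only a few diagnostics are of interest.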
Table 7.13

Plot Description                                                       ODS Graph Name      Option
Time series plot of the dependent series
Autocorrelation plot of the dependent series
Partial-autocorrelation plot of the dependent series
Inverse-autocorrelation plot of the dependent series
Series trend and correlation analysis panel
Cross-correlation plots, either individual or paneled.                                     Default
   They are numbered 1, 2, and so on as needed.
Residual-autocorrelation plot                                          ResidualACFPlot     PLOTS(UNPACK)
Residual-partial-autocorrelation plot                                  ResidualPACFPlot    PLOTS(UNPACK)
Residual-inverse-autocorrelation plot                                  ResidualIACFPlot    PLOTS(UNPACK)
Residual-white-noise-probability plot                                  ResidualWNPlot      PLOTS(UNPACK)
Residual histogram
Residual normal Q-Q plot
Time series plot of residuals with a superimposed smoother
Time series plot of multistep forecasts                                ForecastsOnlyPlot   Default
Time series plot of one-step-ahead as well as multistep forecasts      ForecastsPlot       PLOTS=FORECAST(FORECAST)
The following ARIMA procedure statements identify and estimate the model:
/*-- Simulated IMA Model --*/
proc arima data=a;
   identify var=u;
run;
   identify var=u(1);
run;
   estimate q=1 ;
run;
quit;
The graphical series correlation analysis output of the first IDENTIFY statement is shown in Output 7.1.1. The output shows the behavior of the sample autocorrelation function when the process is nonstationary. Note that in this case the estimated autocorrelations are not very high, even at small lags. Nonstationarity is reflected in a pattern of significant autocorrelations that do not decline quickly with increasing lag, not in the size of the autocorrelations.
The second IDENTIFY statement differences the series. The results of the second IDENTIFY statement are shown in Output 7.1.2. This output shows autocorrelation, inverse autocorrelation, and partial autocorrelation functions typical of MA(1) processes.
The ESTIMATE statement fits an ARIMA(0,1,1) model to the simulated data. Note that in this case the parameter estimates are reasonably close to the values used to generate the simulated data ($\mu = 0$, $\hat{\mu} = 0.02$; $\theta_1 = 0.8$, $\hat{\theta}_1 = 0.79$; $\sigma^2 = 1$, $\hat{\sigma}^2 = 0.82$). Moreover, the graphical analysis of the residuals shows no model inadequacies (see Output 7.1.4 and Output 7.1.5). The ESTIMATE statement results are shown in Output 7.1.3.
Output 7.1.3 Output from Fitting ARIMA(0,1,1) Model
Conditional Least Squares Estimation

Parameter   Standard Error   Approx Pr > |t|   Lag
MU          0.01972          0.2997            0
MA1,1       0.06474          <.0001            1
The following PROC TIMESERIES step plots the series, as shown in Output 7.2.1:
proc timeseries data=seriesg plot=series;
   id date interval=month;
   var x;
run;
The following statements fit an ARIMA(0,1,1)×(0,1,1)$_{12}$ model without a mean term to the logarithms of the airline passengers series, xlog. The model is forecast, and the results are stored in the data set B.
/*-- Seasonal Model for the Airline Series --*/
proc arima data=seriesg;
   identify var=xlog(1,12);
   estimate q=(1)(12) noint method=ml;
   forecast id=date interval=month printall out=b;
run;
The output from the IDENTIFY statement is shown in Output 7.2.2. The autocorrelation plots shown are for the twice differenced series $(1 - B)(1 - B^{12})$XLOG. Note that the autocorrelation functions have the pattern characteristic of a first-order moving-average process combined with a seasonal moving-average process with lag 12.
Output 7.2.3 Trend and Correlation Analysis for the Twice Differenced Series
The results of the ESTIMATE statement are shown in Output 7.2.4, Output 7.2.5, and Output 7.2.6. The model appears to fit the data quite well.
Lag 1 12
Model for variable xlog

Period(s) of Differencing   1,12

Moving Average Factors
Factor 1:  1 - 0.40194 B**(1)
Factor 2:  1 - 0.55686 B**(12)
The forecasts and their confidence limits for the transformed series are shown in Output 7.2.7.
Output 7.2.7 Forecast Plot for the Transformed Series
The following statements retransform the forecast values to get forecasts in the original scale. See the section "Forecasting Log Transformed Data" on page 258 for more information.
data c;
   set b;
   x        = exp( xlog );
   forecast = exp( forecast + std*std/2 );
   l95      = exp( l95 );
   u95      = exp( u95 );
run;
The forecasts and their confidence limits are plotted by using the following PROC SGPLOT step. The plot is shown in Output 7.2.8.
proc sgplot data=c;
   where date >= '1jan58'd;
   band Upper=u95 Lower=l95 x=date
      / LegendLabel="95% Confidence Limits";
   scatter x=date y=x;
   series x=date y=forecast;
run;
Example 7.3: Model for Series J Data from Box and Jenkins
This example uses the Series J data from Box and Jenkins (1976). First, the input series X is modeled with a univariate ARMA model. Next, the dependent series Y is cross-correlated with the input series. Since a model has been fit to X, both Y and X are prewhitened by this model before the sample cross-correlations are computed. Next, a transfer function model is fit with no structure on the noise term. The residuals from this model are analyzed; then, the full model, transfer function and noise, is fit to the data. The following statements read 'Input Gas Rate' and 'Output CO2' from a gas furnace. (Data values are not shown. The full example including data is in the SAS/ETS sample library.)
title1 'Gas Furnace Data';
title2 '(Box and Jenkins, Series J)';

data seriesj;
   input x y @@;
   label x = 'Input Gas Rate'
         y = 'Output CO2';
datalines;
The results of the first IDENTIFY statement for the input series X are shown in Output 7.3.1. The correlation analysis suggests an AR(3) model.
Output 7.3.1 IDENTIFY Statement Results for X
Gas Furnace Data
(Box and Jenkins, Series J)

The ARIMA Procedure

Name of Variable = x

Mean of Working Series    -0.05683
Standard Deviation        1.070952
Number of Observations         296
The ESTIMATE statement results for the AR(3) model for the input series X are shown in Output 7.3.3.
Output 7.3.3 Estimates of the AR(3) Model for X
Conditional Least Squares Estimation

Standard Error   t Value   Approx Pr > |t|   Lag
0.10902            -1.13   0.2609            0
0.05499            35.94   <.0001            1
0.09967           -13.80   <.0001            2
0.05502             6.24   <.0001            3

Constant Estimate      -0.00682
Variance Estimate      0.035797
Std Error Estimate       0.1892
AIC                    -141.667
SBC                    -126.906
Number of Residuals         296
The IDENTIFY statement results for the dependent series Y cross-correlated with the input series X are shown in Output 7.3.4, Output 7.3.5, Output 7.3.6, and Output 7.3.7. Since a model has been fit to X, both Y and X are prewhitened by this model before the sample cross-correlations are computed.
Output 7.3.4 Summary Table: Y Cross-Correlated with X
Correlation of y and x

Number of Observations                   296
Variance of transformed series y    0.131438
Variance of transformed series x    0.035357
The ESTIMATE statement results for the transfer function model with no structure on the noise term are shown in Output 7.3.8, Output 7.3.9, and Output 7.3.10.
Output 7.3.8 Estimation Output of the First Transfer Function Model
Conditional Least Squares Estimation

Standard Error   Approx Pr > |t|   Lag   Variable   Shift
0.04926          <.0001            0     y          0
0.22405          0.0123            0     x          3
0.46472          0.3598            1     x          3
0.35506          0.4002            2     x          3
0.04101          <.0001            1     x          3

Constant Estimate      53.32256
Variance Estimate      0.702625
Std Error Estimate     0.838227
AIC                    728.0754
SBC                     746.442
Number of Residuals         291
Numerator Factors
Factor 1:  -0.5647 - 0.42623 B**(1) - 0.29914 B**(2)

Denominator Factors
Factor 1:  1 - 0.60073 B**(1)
The residual correlation analysis suggests an AR(2) model for the noise part of the model. The ESTIMATE statement results for the final transfer function model with AR(2) noise are shown in Output 7.3.11.
Output 7.3.11 Estimation Output of the Final Model
Conditional Least Squares Estimation

Standard Error   Approx Pr > |t|   Lag   Variable   Shift
0.11929          <.0001            0     y          0
0.04754          <.0001            1     y          0
0.05006          <.0001            2     y          0
0.07482          <.0001            0     x          3
0.10287          0.0003            1     x          3
0.10783          <.0001            2     x          3
0.03822          <.0001            1     x          3

Constant Estimate      5.329425
Variance Estimate      0.058828
Std Error Estimate     0.242544
AIC                    8.292809
SBC                    34.00607
Number of Residuals         291
Autoregressive Factors
Factor 1:  1 - 1.53291 B**(1) + 0.63297 B**(2)

Input Number 1
Input Variable    x
Shift             3

Numerator Factors
Factor 1:  -0.5352 - 0.37603 B**(1) - 0.51895 B**(2)

Denominator Factors
Factor 1:  1 - 0.54841 B**(1)
            crosscorr=( x1(12) summer winter ) noprint;

   /* Fit a multiple regression with a seasonal MA model */
   /* by the maximum likelihood method                   */
   estimate q=(1)(12) input=( x1 summer winter )
            noconstant method=ml;

   /* Forecast */
   forecast lead=12 id=date interval=month;
run;
The ESTIMATE statement results are shown in Output 7.4.1 and Output 7.4.2.
Output 7.4.1 Parameter Estimates
Intervention Data for Ozone Concentration
(Box and Tiao, JASA 1975 P.70)

The ARIMA Procedure

Maximum Likelihood Estimation

Standard Error   Approx Pr > |t|   Lag   Shift
0.06710          <.0001            1     0
0.05973          <.0001            12    0
0.19236          <.0001            0     0
0.05952          <.0001            0     0
0.04978          0.1071            0     0

Variance Estimate      0.634506
Std Error Estimate     0.796559
AIC                    501.7696
SBC                    518.3602
Number of Residuals         204
Input Number 1
Input Variable                     x1
Period(s) of Differencing          12
Overall Regression Factor    -1.33062
SCAN Chi-Square[1] Probability Values

Lags    MA 0     MA 1     MA 2     MA 3     MA 4     MA 5
AR 0    <.0001   <.0001   <.0001   0.0007   0.0037   0.0024
AR 1    0.0003   0.6649   0.5194   0.9235   0.3993   0.8528
AR 2    0.2754   0.5106   0.5860   0.7346   0.6782   0.2766
AR 3    0.2349   0.9812   0.7667   0.7861   0.6810   0.6546
AR 4    0.3297   0.7154   0.7113   0.6995   0.5807   0.2205
AR 5    0.0477   0.7254   0.6652   0.9576   0.2660   0.9168
In Output 7.5.1, there is one (maximal) rectangular region in which all the elements are insignificant with 95% confidence. This region has a vertex at (1,1). Output 7.5.2 gives recommendations based on the significance level specified by the ALPHA=siglevel option.
Output 7.5.2 Example of SCAN Option Tentative Order Selection
ARMA(p+d,q) Tentative Order Selection Tests

----SCAN----
p+d     q
  1     1
Another order identification diagnostic is the extended sample autocorrelation function or ESACF method. See the section "The ESACF Method" on page 241 for details of the ESACF method. The following statements generate Output 7.5.3 and Output 7.5.4:
/*-- Order Identification Diagnostic with ESACF Method --*/
ESACF Probability Values

Lags    MA 0     MA 1     MA 2     MA 3     MA 4     MA 5
AR 0    <.0001   <.0001   0.0001   0.0014   0.0053   0.0041
AR 1    <.0001   0.5974   0.4622   0.9198   0.4292   0.8768
AR 2    <.0001   0.0002   0.6106   0.9182   0.5683   0.8592
AR 3    <.0001   0.9022   0.2400   0.8713   0.8930   0.7372
AR 4    <.0001   0.8380   0.3180   0.7737   0.8913   0.6213
AR 5    <.0001   <.0001   0.0765   0.9142   0.1038   0.8103
In Output 7.5.3, there are three right-triangular regions in which all elements are insignificant at the 5% level. The triangles have vertices (1,1), (3,1), and (4,1). Since the triangle at (1,1) covers more insignificant terms, it is recommended first. Similarly, the remaining recommendations are ordered by the number of insignificant terms contained in the triangle. Output 7.5.4 gives recommendations based on the significance level specified by the ALPHA=siglevel option.
Output 7.5.4 Example of ESACF Option Tentative Order Selection
ARMA(p+d,q) Tentative Order Selection Tests ----SCAN--p+d q 1 1
If you also specify the SCAN option in the same IDENTIFY statement, the two recommendations are printed side by side:
/*-- Combination of SCAN and ESACF Methods --*/
proc arima data=SeriesA;
   identify var=x scan esacf;
run;
From Output 7.5.5, the autoregressive and moving-average orders are tentatively identified by both SCAN and ESACF tables to be $(p + d, q) = (1, 1)$. Because both the SCAN and ESACF indicate a $p + d$ term of 1, a unit root test should be used to determine whether this autoregressive term is a unit root. Since a moving-average term appears to be present, a large autoregressive term is appropriate for the augmented Dickey-Fuller test for a unit root. Submitting the following statements generates Output 7.5.6:
/*-- Augmented Dickey-Fuller Unit Root Tests --*/
proc arima data=SeriesA;
   identify var=x stationarity=(adf=(5,6,7,8));
run;
The preceding test results show that a unit root is very likely given that none of the p-values are small enough to cause you to reject the null hypothesis that the series has a unit root. Based on this test and the previous results, the series should be differenced, and an ARIMA(0,1,1) would be a good choice for a tentative model for Series A. Using the recommendation that the series be differenced, the following statements generate Output 7.5.7:
/*-- Minimum Information Criterion --*/
proc arima data=SeriesA;
   identify var=x(1) minic;
run;
The error series is estimated by using an AR(7) model, and the minimum of this MINIC table is BIC(0,1). This diagnostic confirms the previous result, which indicates that an ARIMA(0,1,1) is a tentative model for Series A. If you also specify the SCAN or ESACF option in the same IDENTIFY statement as follows, the BIC associated with the SCAN table and ESACF table recommendations is listed. Output 7.5.8 shows the results.
/*-- Combination of MINIC, SCAN, and ESACF Options --*/
proc arima data=SeriesA;
   identify var=x(1) minic scan esacf;
run;
The following program fits an ARIMA model, ARIMA(0,1,1), similar to the structural model suggested in de Jong and Penzer (1998). This model is also suggested by the usual correlation analysis of the series. By default, the OUTLIER statement requests detection of additive outliers and level shifts, assuming that the series follows the estimated model.
/*-- ARIMA(0, 1, 1) Model --*/
proc arima data=nile;
   identify var=level(1);
   estimate q=1 noint method=ml;
   outlier maxnum=5 id=year;
run;
[Outlier Details table: outliers detected at observations 29, 43, 7, 94, and 18]
Note that the first three outliers detected are indeed the ones discussed earlier. You can include the shock signatures that correspond to these three outliers in the Nile data set as follows:
data nile;
   set nile;
   AO1877 = ( year  = '1jan1877'd );
   AO1913 = ( year  = '1jan1913'd );
   LS1899 = ( year >= '1jan1899'd );
run;
Now you can refine the earlier model by including these outliers. After examining the parameter estimates and residuals (not shown) of the ARIMA(0,1,1) model with these regressors, the following stationary MA1 model (with regressors) appears to fit the data well:
/*-- MA1 Model with Outliers --*/
proc arima data=nile;
   identify var=level crosscorr=( AO1877 AO1913 LS1899 );
   estimate q=1 input=( AO1877 AO1913 LS1899 ) method=ml;
   outlier maxnum=5 alpha=0.01 id=year;
run;
The relevant outlier detection process output is shown in Output 7.6.2. No outliers were detected at significance level 0.01.
Output 7.6.2 MA1 Model with Outliers
SERIES A: Chemical Process Concentration Readings
The ARIMA Procedure

Outlier Detection Summary
Maximum number searched       5
Number found                  0
Significance used          0.01
In Example 7.2 the airline model, ARIMA(0,1,1)×(0,1,1)_12, was seen to be a good fit to the unmodified log-transformed airline passenger series. The preliminary identification steps (not shown) again suggest the airline model as a suitable initial model for the modified data. The following statements specify the airline model and request an outlier search.
/*-- Outlier Detection --*/
proc arima data=airline;
   identify var=logair( 1, 12 ) noprint;
   estimate q= (1)(12) noint method= ml;
Clearly the level shift at observation number 100 and the additive outlier at observation number 50 are the dominant outliers. Moreover, the corresponding regression coefficients seem to correctly estimate the size and sign of the change. You can augment the airline data with these two regressors, as follows:
data airline;
   set airline;
   if _n_ = 50 then AO = 1;
   else AO = 0.0;
   if _n_ >= 100 then LS = 1;
   else LS = 0.0;
run;
You can now refine the previous model by including these regressors, as follows. Note that the differencing order of the dependent series is matched to the differencing orders of the outlier regressors to get the correct effective outlier signatures.
/*-- Airline Model with Outliers --*/
proc arima data=airline;
   identify var=logair(1, 12)
            crosscorr=( AO(1, 12) LS(1, 12) )
            noprint;
   estimate q= (1)(12) noint
            input=( AO LS )
            method=ml plot;
   outlier maxnum=3 alpha=0.01;
run;
[Outlier Details table: outliers detected at observations 135, 62, and 29]
The output shows that a few outliers still remain to be accounted for and that the model could be refined further.
References
Akaike, H. (1974), "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, AC-19, 716-723.
Anderson, T. W. (1971), The Statistical Analysis of Time Series, New York: John Wiley & Sons.
Andrews and Herzberg (1985), A Collection of Problems from Many Fields for the Student and Research Worker, New York: Springer-Verlag.
Ansley, C. (1979), "An Algorithm for the Exact Likelihood of a Mixed Autoregressive Moving-Average Process," Biometrika, 66, 59.
Ansley, C. and Newbold, P. (1980), "Finite Sample Properties of Estimators for Autoregressive Moving-Average Models," Journal of Econometrics, 13, 159.
Bhansali, R. J. (1980), "Autoregressive and Window Estimates of the Inverse Correlation Function," Biometrika, 67, 551-566.
Box, G. E. P. and Jenkins, G. M. (1976), Time Series Analysis: Forecasting and Control, San Francisco: Holden-Day.
Box, G. E. P., Jenkins, G. M., and Reinsel, G. C. (1994), Time Series Analysis: Forecasting and Control, Third Edition, Englewood Cliffs, NJ: Prentice Hall, 197-199.
Box, G. E. P. and Tiao, G. C. (1975), "Intervention Analysis with Applications to Economic and Environmental Problems," JASA, 70, 70-79.
Brocklebank, J. C. and Dickey, D. A. (2003), SAS System for Forecasting Time Series, Second Edition, Cary, North Carolina: SAS Institute Inc.
Brockwell, P. J. and Davis, R. A. (1991), Time Series: Theory and Methods, Second Edition, New York: Springer-Verlag.
Chatfield, C. (1980), "Inverse Autocorrelations," Journal of the Royal Statistical Society, A142, 363-377.
Choi, ByoungSeon (1992), ARMA Model Identification, New York: Springer-Verlag, 129-132.
Cleveland, W. S. (1972), "The Inverse Autocorrelations of a Time Series and Their Applications," Technometrics, 14, 277.
Cobb, G. W. (1978), "The Problem of the Nile: Conditional Solution to a Change Point Problem," Biometrika, 65, 243-251.
Davidson, J. (1981), "Problems with the Estimation of Moving-Average Models," Journal of Econometrics, 16, 295.
Davies, N., Triggs, C. M., and Newbold, P. (1977), "Significance Levels of the Box-Pierce Portmanteau Statistic in Finite Samples," Biometrika, 64, 517-522.
de Jong, P. and Penzer, J. (1998), "Diagnosing Shocks in Time Series," Journal of the American Statistical Association, Vol. 93, No. 442.
Dickey, D. A. (1976), "Estimation and Testing of Nonstationary Time Series," unpublished Ph.D. thesis, Iowa State University, Ames.
Dickey, D. A., and Fuller, W. A. (1979), "Distribution of the Estimators for Autoregressive Time Series with a Unit Root," Journal of the American Statistical Association, 74 (366), 427-431.
Dickey, D. A., Hasza, D. P., and Fuller, W. A. (1984), "Testing for Unit Roots in Seasonal Time Series," Journal of the American Statistical Association, 79 (386), 355-367.
Dunsmuir, William (1984), "Large Sample Properties of Estimation in Time Series Observed at Unequally Spaced Times," in Time Series Analysis of Irregularly Observed Data, Emanuel Parzen, ed., New York: Springer-Verlag.
Findley, D. F., Monsell, B. C., Bell, W. R., Otto, M. C., and Chen, B. C. (1998), "New Capabilities and Methods of the X-12-ARIMA Seasonal Adjustment Program," Journal of Business and Economic Statistics, 16, 127-177.
Fuller, W. A. (1976), Introduction to Statistical Time Series, New York: John Wiley & Sons.
Hamilton, J. D. (1994), Time Series Analysis, Princeton: Princeton University Press.
Hannan, E. J. and Rissanen, J. (1982), "Recursive Estimation of Mixed Autoregressive Moving-Average Order," Biometrika, 69 (1), 81-94.
Harvey, A. C. (1981), Time Series Models, New York: John Wiley & Sons.
Jones, Richard H. (1980), "Maximum Likelihood Fitting of ARMA Models to Time Series with Missing Observations," Technometrics, 22, 389-396.
Kohn, R. and Ansley, C. (1985), "Efficient Estimation and Prediction in Time Series Regression Models," Biometrika, 72, 3, 694-697.
Ljung, G. M. and Box, G. E. P. (1978), "On a Measure of Lack of Fit in Time Series Models," Biometrika, 65, 297-303.
Montgomery, D. C. and Johnson, L. A. (1976), Forecasting and Time Series Analysis, New York: McGraw-Hill.
Morf, M., Sidhu, G. S., and Kailath, T. (1974), "Some New Algorithms for Recursive Estimation on Constant Linear Discrete Time Systems," IEEE Transactions on Automatic Control, AC-19, 315-323.
Nelson, C. R. (1973), Applied Time Series for Managerial Forecasting, San Francisco: Holden-Day.
Newbold, P. (1981), "Some Recent Developments in Time Series Analysis," International Statistical Review, 49, 53-66.
Newton, H. Joseph and Pagano, Marcello (1983), "The Finite Memory Prediction of Covariance Stationary Time Series," SIAM Journal of Scientific and Statistical Computing, 4, 330-339.
Pankratz, Alan (1983), Forecasting with Univariate Box-Jenkins Models: Concepts and Cases, New York: John Wiley & Sons.
Pankratz, Alan (1991), Forecasting with Dynamic Regression Models, New York: John Wiley & Sons.
Pearlman, J. G. (1980), "An Algorithm for the Exact Likelihood of a High-Order Autoregressive Moving-Average Process," Biometrika, 67, 232-233.
Priestley, M. B. (1981), Spectral Analysis and Time Series, Volume 1: Univariate Series, New York: Academic Press.
Schwarz, G. (1978), "Estimating the Dimension of a Model," Annals of Statistics, 6, 461-464.
Stoffer, D. and Toloi, C. (1992), "A Note on the Ljung-Box-Pierce Portmanteau Statistic with Missing Data," Statistics & Probability Letters, 13, 391-396.
Tsay, R. S. and Tiao, G. C. (1984), "Consistent Estimates of Autoregressive Parameters and Extended Sample Autocorrelation Function for Stationary and Nonstationary ARMA Models," JASA, 79 (385), 84-96.
Tsay, R. S. and Tiao, G. C. (1985), "Use of Canonical Analysis in Time Series Model Identification," Biometrika, 72 (2), 299-315.
Woodfield, T. J. (1987), "Time Series Intervention Analysis Using SAS Software," Proceedings of the Twelfth Annual SAS Users Group International Conference, 331-339. Cary, NC: SAS Institute Inc.
Chapter 8
Example 8.3: Lack-of-Fit Study
Example 8.4: Missing Values
Example 8.5: Money Demand Model
Example 8.6: Estimation of ARCH(2) Process
Example 8.7: Illustration of ODS Graphics
References
   generalized ARCH (GARCH)
   integrated GARCH (IGARCH)
   exponential GARCH (EGARCH)
   GARCH-in-mean (GARCH-M)

For GARCH-type models, the AUTOREG procedure produces the conditional prediction error variances as well as parameter and covariance estimates. The AUTOREG procedure can also analyze models that combine autoregressive errors and GARCH-type heteroscedasticity. PROC AUTOREG can output predictions of the conditional mean and variance for models with autocorrelated disturbances and changing conditional error variances over time.

Four estimation methods are supported for the autoregressive error model:

   Yule-Walker
   iterated Yule-Walker
   unconditional least squares
   exact maximum likelihood

The maximum likelihood method is used for GARCH models and for mixed AR-GARCH models.

The AUTOREG procedure produces forecasts and forecast confidence limits when future values of the independent variables are included in the input data set. PROC AUTOREG is a useful tool for forecasting because it uses the time series part of the model as well as the systematic part in generating predicted values. The autoregressive error model takes into account recent departures from the trend in producing forecasts.

The AUTOREG procedure permits embedded missing values for the independent or dependent variables. The procedure should be used only for ordered and equally spaced time series data.
time series data since the assumptions on which the classical linear regression model is based will usually be violated. Violation of the independent errors assumption has three important consequences for ordinary regression. First, statistical tests of the significance of the parameters and the confidence limits for the predicted values are not correct. Second, the estimates of the regression coefficients are not as efficient as they would be if the autocorrelation were taken into account. Third, since the ordinary regression residuals are not independent, they contain information that can be used to improve the prediction of future values.

The AUTOREG procedure solves this problem by augmenting the regression model with an autoregressive model for the random error, thereby accounting for the autocorrelation of the errors. Instead of the usual regression model, the following autoregressive error model is used:

$$ y_t = \mathbf{x}_t' \beta + \nu_t $$
$$ \nu_t = -\varphi_1 \nu_{t-1} - \varphi_2 \nu_{t-2} - \cdots - \varphi_m \nu_{t-m} + \epsilon_t $$
$$ \epsilon_t \sim \mathrm{IN}(0, \sigma^2) $$
By simultaneously estimating the regression coefficients $\beta$ and the autoregressive error model parameters $\varphi_i$, the AUTOREG procedure corrects the regression estimates for autocorrelation. Thus, this kind of regression analysis is often called autoregressive error correction or serial correlation correction.
The series Y is a time trend plus a second-order autoregressive error. The model simulated is

$$ y_t = 10 + 0.5\,t + \nu_t $$
$$ \nu_t = 1.3\,\nu_{t-1} - 0.5\,\nu_{t-2} + \epsilon_t $$
$$ \epsilon_t \sim \mathrm{IN}(0, 4) $$
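A data set like A could be simulated along the following lines. This is only a sketch patterned on the heteroscedastic DATA step that appears later in this chapter; the seed, the burn-in starting at time = -10, and the multiplier 2 (chosen so that the innovation variance is 4) are assumptions rather than values stated here. The end point 36 matches the 36 observations implied by the later forecasting example.

```sas
/* Sketch: simulate a time trend plus AR(2) error, 36 observations */
data a;
   ul = 0; ull = 0;                 /* lagged errors nu(t-1), nu(t-2) */
   do time = -10 to 36;             /* negative times act as burn-in (assumed) */
      u = 1.3 * ul - 0.5 * ull + 2 * rannor(12346);   /* std. dev. 2, variance 4 */
      y = 10 + 0.5 * time + u;
      if time > 0 then output;      /* keep only periods 1 through 36 */
      ull = ul;
      ul  = u;
   end;
run;
```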
The following statements plot the simulated time series Y. A linear regression trend line is shown for reference.
title 'Autocorrelated Time Series';
proc sgplot data=a noautolegend;
The plot of series Y and the regression line are shown in Figure 8.1.
Figure 8.1 Autocorrelated Time Series
Note that when the series is above (or below) the OLS regression trend line, it tends to remain above (below) the trend for several periods. This pattern is an example of positive autocorrelation. Time series regression usually involves independent variables other than a time trend. However, the simple time trend model is convenient for illustrating regression with autocorrelated errors, and the series Y shown in Figure 8.1 is used in the following introductory examples.
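For reference, an ordinary least squares fit of Y on TIME can be requested by giving PROC AUTOREG a MODEL statement with no error-model options. The sketch below assumes the simulated data set is named A, as in the plotting statements above.

```sas
/* Sketch: OLS regression of Y on TIME (no NLAG= option, so no AR error model) */
proc autoreg data=a;
   model y = time;
run;
```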
Autocorrelated Time Series
The AUTOREG Procedure

Ordinary Least Squares Estimates
SSE              214.953429    DFE                       34
MSE                 6.32216    Root MSE             2.51439
SBC              173.659101    AIC               170.492063
MAE              2.01903356    AICC              170.855699
MAPE             12.5270666    Regress R-Square      0.8200
Durbin-Watson        0.4752    Total R-Square        0.8200

DF    Standard Error    Approx Pr > |t|
 1            0.8559             <.0001
 1            0.0403             <.0001
The output first shows statistics for the model residuals. The model root mean square error (Root MSE) is 2.51, and the model $R^2$ is 0.82. Notice that two $R^2$ statistics are shown, one for the regression model (Reg Rsq) and one for the full model (Total Rsq) that includes the autoregressive error process, if any. In this case, an autoregressive error model is not used, so the two $R^2$ statistics are the same.

Other statistics shown are the sum of square errors (SSE), mean square error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), error degrees of freedom (DFE, the number of observations minus the number of parameters), the information criteria SBC, AIC, and AICC, and the Durbin-Watson statistic. (Durbin-Watson statistics, MAE, MAPE, SBC, AIC, and AICC are discussed in the section "R-Square Statistics and Other Measures of Fit" on page 367 later in this chapter.)

The output then shows a table of regression coefficients, with standard errors and t tests. The estimated model is

$$ y_t = 8.23 + 0.502\,t + \epsilon_t $$
$$ \text{Est. Var}(\epsilon_t) = 6.32 $$
The OLS parameter estimates are reasonably close to the true values, but the estimated error variance, 6.32, is much larger than the true value, 4.
The first part of the results is shown in Figure 8.3. The initial OLS results are produced first, followed by estimates of the autocorrelations computed from the OLS residuals. The autocorrelations are also displayed graphically.
Figure 8.3 Preliminary Estimate for AR(2) Error Model
Autocorrelated Time Series
The AUTOREG Procedure

Dependent Variable    y

Ordinary Least Squares Estimates
SSE              214.953429    DFE                       34
MSE                 6.32216    Root MSE             2.51439
SBC              173.659101    AIC               170.492063
MAE              2.01903356    AICC              170.855699
MAPE             12.5270666    Regress R-Square      0.8200
Durbin-Watson        0.4752    Total R-Square        0.8200

DF    Standard Error    Approx Pr > |t|
 1            0.8559             <.0001
 1            0.0403             <.0001

Estimates of Autocorrelations
Lag    Covariance    Correlation
  0        5.9709       1.000000
  1        4.5169       0.756485
  2        2.0241       0.338995
The maximum likelihood estimates are shown in Figure 8.4. Figure 8.4 also shows the preliminary Yule-Walker estimates used as starting values for the iterative computation of the maximum likelihood estimates.
Figure 8.4 Maximum Likelihood Estimates of AR(2) Error Model
Estimates of Autoregressive Parameters
Lag    Standard Error
  1          0.148172
  2          0.148172

Algorithm converged.

Autocorrelated Time Series
The AUTOREG Procedure

Maximum Likelihood Estimates
SSE              54.7493022    DFE                       32
MSE                 1.71092    Root MSE             1.30802
SBC              133.476508    AIC               127.142432
MAE              0.98307236    AICC              128.432755
MAPE             6.45517689    Regress R-Square      0.7280
Durbin-Watson        2.2761    Total R-Square        0.9542

DF    Standard Error    Approx Pr > |t|
 1            1.1693             <.0001
 1            0.0551             <.0001
 1            0.1385             <.0001
 1            0.1366             <.0001

Autoregressive parameters assumed given.
DF    Standard Error    Approx Pr > |t|
 1            1.1678             <.0001
 1            0.0551             <.0001
The diagnostic statistics and parameter estimates tables in Figure 8.4 have the same form as in the OLS output, but the values shown are for the autoregressive error model. The MSE for the autoregressive model is 1.71, which is much smaller than the true value of 4. In small samples, the
2,
2.
Notice that the total $R^2$ statistic computed from the autoregressive model residuals is 0.954, reflecting the improved fit from the use of past residuals to help predict the next Y value. The Reg Rsq value 0.728 is the $R^2$ statistic for a regression of transformed variables adjusted for the estimated autocorrelation. (This is not the $R^2$ for the estimated trend line. For details, see the section "R-Square Statistics and Other Measures of Fit" on page 367 later in this chapter.)

The parameter estimates table shows the ML estimates of the regression coefficients and includes two additional rows for the estimates of the autoregressive parameters, labeled AR(1) and AR(2). The estimated model is

$$ y_t = 7.88 + 0.5096\,t + \nu_t $$
$$ \nu_t = 1.25\,\nu_{t-1} - 0.628\,\nu_{t-2} + \epsilon_t $$
$$ \text{Est. Var}(\epsilon_t) = 1.71 $$

Note that the signs of the autoregressive parameters shown in this equation for $\nu_t$ are the reverse of the estimates shown in the AUTOREG procedure output. Figure 8.4 also shows the estimates of the regression coefficients with the standard errors recomputed on the assumption that the autoregressive parameter estimates equal the true values.
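The sign reversal can be made explicit by matching the fitted error equation, term by term, against the procedure's parameterization of the autoregressive error model. The short derivation below uses only the rounded values 1.25 and 0.628 quoted in the text; the hats denote estimates.

```latex
\begin{align*}
\text{AUTOREG parameterization:}\quad
  \nu_t &= -\varphi_1 \nu_{t-1} - \varphi_2 \nu_{t-2} + \epsilon_t \\
\text{Fitted error equation:}\quad
  \nu_t &= 1.25\,\nu_{t-1} - 0.628\,\nu_{t-2} + \epsilon_t \\
\text{Matching coefficients:}\quad
  \hat{\varphi}_1 &= -1.25, \qquad \hat{\varphi}_2 = 0.628
\end{align*}
```

So the values reported in the AR(1) and AR(2) rows carry the opposite sign from the coefficients that multiply the lagged errors in the fitted equation.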
The following statements plot the predicted values from the regression trend line and from the full model together with the actual values:
title 'Predictions for Autocorrelation Model';
proc sgplot data=p;
   scatter x=time y=y / markerattrs=(color=blue);
   series x=time y=yhat / lineattrs=(color=blue);
   series x=time y=trendhat / lineattrs=(color=black);
run;
In Figure 8.5 the straight line is the autocorrelation-corrected regression line, traced out by the structural predicted values TRENDHAT. The jagged line traces the full model prediction values. The actual values are marked by asterisks. This plot graphically illustrates the improvement in fit provided by the autoregressive error process for highly autocorrelated data.
For the time trend model, the only regressor is time. The following statements add observations for time periods 37 through 46 to the data set A to produce an augmented data set B:
data b;
   y = .;
   do time = 37 to 46;
      output;
   end;
run;

data b;
   merge a b;
   by time;
run;
To produce the forecast, use the augmented data set as input to PROC AUTOREG, and specify the appropriate options in the OUTPUT statement. The following statements produce forecasts for the time trend with autoregressive error model. The output data set includes all the variables in the input data set, the forecast values (YHAT), the predicted trend (YTREND), and the upper (UCL) and lower (LCL) 95% confidence limits.
proc autoreg data=b;
   model y = time / nlag=2 method=ml;
   output out=p p=yhat pm=ytrend
          lcl=lcl ucl=ucl;
run;
The following statements plot the predicted values and confidence limits, and they also plot the trend line for reference. The actual observations are shown for periods 16 through 36, and a reference line is drawn at the start of the out-of-sample forecasts.
title 'Forecasting Autocorrelated Time Series';
proc sgplot data=p;
   band x=time upper=ucl lower=lcl;
   scatter x=time y=y;
   series x=time y=yhat;
   series x=time y=ytrend / lineattrs=(color=black);
run;
The plot is shown in Figure 8.6. Notice that the forecasts take into account the recent departures from the trend but converge back to the trend line for longer forecast horizons.
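The convergence of the forecasts to the trend line can be seen from the standard forecast recursion for an AR(2) error. The sketch below is a textbook illustration under the assumption of a stationary error process; it is not a statement of the exact computations that PROC AUTOREG performs. Here $T$ is the last observed period, $h \ge 1$ is the forecast horizon, and $\hat{a}$, $\hat{b}$ denote the estimated intercept and slope.

```latex
\begin{align*}
\hat{y}_{T+h} &= \hat{a} + \hat{b}\,(T+h) + \hat{\nu}_{T+h} \\
\hat{\nu}_{T+h} &= -\hat{\varphi}_1 \hat{\nu}_{T+h-1} - \hat{\varphi}_2 \hat{\nu}_{T+h-2},
  \qquad \hat{\nu}_{T},\ \hat{\nu}_{T-1} \text{ taken from the last two structural residuals} \\
\hat{\nu}_{T+h} &\to 0 \quad \text{as } h \to \infty \text{ (stationary AR error)}
\end{align*}
```

Because future innovations have expected value zero, the error forecast decays toward zero, so the full-model forecast approaches the structural trend prediction at long horizons.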
/*-- Durbin-Watson test for autocorrelation --*/
proc autoreg data=a;
   model y = time / dw=4 dwprob;
run;
The AUTOREG procedure output is shown in Figure 8.7. In this case, the first-order Durbin-Watson test is highly significant, with p < .0001 for the hypothesis of no first-order autocorrelation. Thus, autocorrelation correction is needed.
Figure 8.7 Durbin-Watson Test Results for OLS Residuals
Forecasting Autocorrelated Time Series
The AUTOREG Procedure

Dependent Variable    y

Ordinary Least Squares Estimates
SSE              214.953429    DFE                       34
MSE                 6.32216    Root MSE             2.51439
SBC              173.659101    AIC               170.492063
MAE              2.01903356    AICC              170.855699
MAPE             12.5270666    Regress R-Square      0.8200
                               Total R-Square        0.8200

Durbin-Watson Statistics
Order        DW    Pr < DW    Pr > DW
    1    0.4752     <.0001     1.0000
    2    1.2935     0.0137     0.9863
    3    2.0694     0.6545     0.3455
    4    2.5544     0.9818     0.0182

NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is the p-value for testing negative autocorrelation.

DF    Standard Error    Approx Pr > |t|
 1            0.8559             <.0001
 1            0.0403             <.0001
Using the Durbin-Watson test, you can decide if autocorrelation correction is needed. However, generalized Durbin-Watson tests should not be used to decide on the autoregressive order. The higher-order tests assume the absence of lower-order autocorrelation. If the ordinary Durbin-Watson test indicates no first-order autocorrelation, you can use the second-order test to check for second-order autocorrelation. Once autocorrelation is detected, further tests at higher orders are not appropriate.
In Figure 8.7, since the first-order Durbin-Watson test is significant, the order 2, 3, and 4 tests can be ignored. When using Durbin-Watson tests to check for autocorrelation, you should specify an order at least as large as the order of any potential seasonality, since seasonality produces autocorrelation at the seasonal lag. For example, for quarterly data use DW=4, and for monthly data use DW=12.
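As an illustration of the monthly case, the statements below request generalized Durbin-Watson statistics through order 12 together with their p-values. The data set name MONTHLY and the variable names SALES and TIME are hypothetical and stand in for whatever monthly series is being analyzed.

```sas
/* Sketch: Durbin-Watson tests through the seasonal lag for monthly data */
/* (monthly, sales, and time are hypothetical names)                     */
proc autoreg data=monthly;
   model sales = time / dw=12 dwprob;
run;
```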
The results are shown in Figure 8.8. The Durbin h statistic 2.78 is significant with a p-value of 0.0027, indicating autocorrelation.
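Statements of roughly the following form could produce such a test. This is a sketch under assumptions: the lagged dependent variable is built with the LAG1 function in a DATA step, the regression is Y on its own lag only (which would be consistent with the 33 error degrees of freedom reported in Figure 8.8 for 35 usable observations), and the names LAGGED and YLAG are hypothetical.

```sas
/* Sketch: Durbin h test when a lagged dependent variable is a regressor */
data lagged;                        /* hypothetical data set name */
   set a;
   ylag = lag1( y );                /* first lag of the dependent variable */
run;

proc autoreg data=lagged;
   model y = ylag / lagdep=ylag;    /* LAGDEP= names the lagged dependent regressor */
run;
```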
Figure 8.8 Durbin h Test with a Lagged Dependent Variable
Forecasting Autocorrelated Time Series
The AUTOREG Procedure

Dependent Variable    y

Ordinary Least Squares Estimates
SSE               97.711226    DFE                       33
MSE                 2.96095    Root MSE             1.72074
SBC              142.369787    AIC               139.259091
MAE              1.29949385    AICC              139.634091
MAPE              8.1922836    Regress R-Square      0.9109
                               Total R-Square        0.9109
Stepwise Autoregression
Once you determine that autocorrelation correction is needed, you must select the order of the autoregressive error model to use. One way to select the order of the autoregressive error model is stepwise autoregression. The stepwise autoregression method initially fits a high-order model with many autoregressive lags and then sequentially removes autoregressive parameters until all remaining autoregressive parameters have significant t tests. To use stepwise autoregression, specify the BACKSTEP option, and specify a large order with the NLAG= option. The following statements show the stepwise feature, using an initial order of 5:
/*-- stepwise autoregression --*/
proc autoreg data=a;
   model y = time / method=ml nlag=5 backstep;
run;
Estimates of Autocorrelations
Lag    Covariance    Correlation
  0        5.9709       1.000000
  1        4.5169       0.756485
  2        2.0241       0.338995
  3       -0.4402      -0.073725
  4       -2.1175      -0.354632
  5       -2.8534      -0.477887

Backward Elimination of Autoregressive Terms
Lag     Estimate    t Value    Pr > |t|
  4    -0.052908      -0.20      0.8442
  3     0.115986       0.57      0.5698
  5     0.131734       1.21      0.2340
The estimates of the autocorrelations are shown for 5 lags. The backward elimination of autoregressive terms report shows that the autoregressive parameters at lags 3, 4, and 5 were insignificant and eliminated, resulting in the second-order model shown previously in Figure 8.4. By default, retained autoregressive parameters must be significant at the 0.05 level, but you can control this with the SLSTAY= option. The remainder of the output from this example is the same as that in Figure 8.3 and Figure 8.4, and it is not repeated here.
The stepwise autoregressive process is performed using the Yule-Walker method. The maximum likelihood estimates are produced after the order of the model is determined from the significance tests of the preliminary Yule-Walker estimates.
When using stepwise autoregression, it is a good idea to specify an NLAG= option value larger than the order of any potential seasonality, since seasonality produces autocorrelation at the seasonal lag. For example, for monthly data use NLAG=13, and for quarterly data use NLAG=5.
The MODEL statement specifies the following fifth-order autoregressive error model:

$$ y_t = a + b\,t + \nu_t $$
$$ \nu_t = -\varphi_1 \nu_{t-1} - \varphi_4 \nu_{t-4} - \varphi_5 \nu_{t-5} + \epsilon_t $$
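A subset model of this kind, with autoregressive terms only at lags 1, 4, and 5, can be requested by giving the NLAG= option a list of lags in parentheses. The following is a minimal sketch; the data set A and the regressor TIME are assumed to be the ones used in the earlier examples.

```sas
/* Sketch: subset AR error model with terms at lags 1, 4, and 5 only */
proc autoreg data=a;
   model y = time / nlag=(1 4 5);
run;
```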
data a;
   ul = 0; ull = 0;
   do time = -10 to 120;
      s = 1 + (time >= 60 & time < 90);
      u = + 1.3 * ul - .5 * ull + s*rannor(12346);
      y = 10 + .5 * time + u;
      if time > 0 then output;
      ull = ul; ul = u;
   end;
run;

title 'Heteroscedastic Autocorrelated Time Series';
proc sgplot data=a noautolegend;
   series x=time y=y / markers;
   reg x=time y=y / lineattrs=(color=black);
run;
To test for heteroscedasticity with PROC AUTOREG, specify the ARCHTEST option. The following statements regress Y on TIME and use the ARCHTEST option to test for heteroscedastic OLS
The PROC AUTOREG output is shown in Figure 8.11. The Q statistics test for changes in variance across time by using lag windows ranging from 1 through 12. (See the section Heteroscedasticity and Normality Tests on page 374 for details.) The p-values for the test statistics are given in parentheses. These tests strongly indicate heteroscedasticity, with p < 0.0001 for all lag windows. The Lagrange multiplier (LM) tests also indicate heteroscedasticity. These tests can also help determine the order of the ARCH model appropriate for modeling the heteroscedasticity, assuming that the changing variance follows an autoregressive conditional heteroscedasticity model.
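A minimal sketch of statements that request these tests is shown below. It assumes the heteroscedastic data set A created above and uses only the ARCHTEST option on the MODEL statement; any additional options used to produce Figure 8.11 are not recoverable from the text here.

```sas
/* Sketch: Q and LM tests for ARCH disturbances on the OLS residuals */
proc autoreg data=a;
   model y = time / archtest;
run;
```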
Figure 8.11 Heteroscedasticity Tests
Heteroscedastic Autocorrelated Time Series
The AUTOREG Procedure

Dependent Variable    y

Ordinary Least Squares Estimates
SSE              690.266009    DFE                      118
MSE                 5.84971    Root MSE             2.41862
SBC              560.070468    AIC               554.495484
MAE              3.72760089    AICC              554.598048
MAPE             5.24916664    Regress R-Square      0.9814
Durbin-Watson        0.4060    Total R-Square        0.9814

Q and LM Tests for ARCH Disturbances
Order          Q    Pr > Q         LM    Pr > LM
    1    37.5445    <.0001    37.0072     <.0001
    2    40.4245    <.0001    40.9189     <.0001
    3    41.0753    <.0001    42.5032     <.0001
    4    43.6893    <.0001    43.3822     <.0001
    5    55.3846    <.0001    48.2511     <.0001
    6    60.6617    <.0001    49.7799     <.0001
    7    62.9655    <.0001    52.0126     <.0001
    8    63.7202    <.0001    52.7083     <.0001
    9    64.2329    <.0001    53.2393     <.0001
   10    66.2778    <.0001    53.2407     <.0001
   11    68.1923    <.0001    53.5924     <.0001
   12    69.3725    <.0001    53.7559     <.0001
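$$ \nu_t = \epsilon_t - \varphi_1 \nu_{t-1} - \cdots - \varphi_m \nu_{t-m} $$
$$ \epsilon_t = \sqrt{h_t}\, e_t $$
$$ h_t = \omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{j=1}^{p} \gamma_j h_{t-j} $$
$$ e_t \sim \mathrm{IN}(0, 1) $$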
This model combines the mth-order autoregressive error model with the GARCH(p,q) variance model. It is denoted as the AR(m)-GARCH(p,q) regression model.

The Lagrange multiplier (LM) tests shown in Figure 8.11 can help determine the order of the ARCH model appropriate for the data. The tests are significant (p < .0001) through order 12, which indicates that a very high-order ARCH model is needed to model the heteroscedasticity.

The basic ARCH(q) model (p = 0) is a short memory process in that only the most recent q squared residuals are used to estimate the changing variance. The GARCH model (p > 0) allows long memory processes, which use all the past squared residuals to estimate the current variance. The LM tests in Figure 8.11 suggest the use of the GARCH model (p > 0) instead of the ARCH model.

The GARCH(p,q) model is specified with the GARCH=(P=p,Q=q) option in the MODEL statement. The basic ARCH(q) model is the same as the GARCH(0,q) model and is specified with the GARCH=(Q=q) option.

The following statements fit an AR(2)-GARCH(1,1) model for the Y series regressed on TIME. The GARCH=(P=1,Q=1) option specifies the GARCH(1,1) conditional variance model. The NLAG=2 option specifies the AR(2) error process. Only the maximum likelihood method is supported for GARCH models; therefore, the METHOD= option is not needed. The CEV= option in the OUTPUT statement stores the estimated conditional error variance at each time period in the variable VHAT in an output data set named OUT. The data set is the same as in the section "Testing for Heteroscedasticity" on page 329.
/*-- AR(2)-GARCH(1,1) model for the Y series regressed on TIME --*/
proc autoreg data=a;
   model y = time / nlag=2 garch=(q=1,p=1) maxit=50;
   output out=out cev=vhat;
run;
The results for the GARCH model are shown in Figure 8.12. (The preliminary estimates are not shown.)
Figure 8.12 AR(2)-GARCH(1,1) Model
Heteroscedastic Autocorrelated Time Series
The AUTOREG Procedure

GARCH Estimates
SSE               218.860964    Observations             120
MSE                  1.82384    Uncond Var        1.62996534
Log Likelihood    -187.44013    Total R-Square        0.9941
SBC               408.392693    AIC                388.88025
MAE               0.97051351    AICC               389.88025
MAPE              2.75945185    Normality Test        0.0839
                                Pr > ChiSq            0.9589

DF    Standard Error    Approx Pr > |t|
 1            0.7235             <.0001
 1            0.0107             <.0001
 1            0.1078             <.0001
 1            0.1057             <.0001
 1            0.0757             0.2614
 1            0.0847             0.0130
 1            0.0960             <.0001
The normality test is not significant (p = 0.959), which is consistent with the hypothesis that the residuals from the GARCH model, $\epsilon_t / \sqrt{h_t}$, are normally distributed. The parameter estimates table includes rows for the GARCH parameters. ARCH0 represents the estimate for the parameter $\omega$, ARCH1 represents $\alpha_1$, and GARCH1 represents $\gamma_1$.

The following statements transform the estimated conditional error variance series VHAT to the estimated standard deviation series SHAT. Then, they plot SHAT together with the true standard deviation S used to generate the simulated data.
data out;
   set out;
   shat = sqrt( vhat );
run;

title 'Predicted and Actual Standard Deviations';
proc sgplot data=out noautolegend;
   scatter x=time y=s;
   series x=time y=shat / lineattrs=(color=black);
run;
Note that in this example the form of heteroscedasticity used in generating the simulated series Y does not fit the GARCH model. The GARCH model assumes conditional heteroscedasticity, with homoscedastic unconditional error variance. That is, the GARCH model assumes that the changes in variance are a function of the realizations of preceding errors and that these changes represent temporary and random departures from a constant unconditional variance. The data-generating process used to simulate series Y, contrary to the GARCH model, has exogenous unconditional heteroscedasticity that is independent of past errors.

Nonetheless, as shown in Figure 8.13, the GARCH model does a reasonably good job of approximating the error variance in this example, and some improvement in the efficiency of the estimator of the regression parameters can be expected.

The GARCH model may perform better in cases where theory suggests that the data-generating process produces true autoregressive conditional heteroscedasticity. This is the case in some economic theories of asset returns, and GARCH-type models are often used for analysis of financial market data.
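The "constant unconditional variance" that the GARCH model assumes can be written out for the GARCH(1,1) case. The following is the standard textbook result, given here as an illustration; it holds under the assumption $\alpha_1 + \gamma_1 < 1$, with $\omega$, $\alpha_1$, and $\gamma_1$ as in the conditional variance equation.

```latex
\begin{align*}
h_t &= \omega + \alpha_1 \epsilon_{t-1}^2 + \gamma_1 h_{t-1} \\
\mathrm{E}[h_t] &= \omega + \alpha_1 \mathrm{E}[\epsilon_{t-1}^2] + \gamma_1 \mathrm{E}[h_{t-1}]
  \quad\text{with}\quad \mathrm{E}[\epsilon_t^2] = \mathrm{E}[h_t] = \sigma^2 \\
\sigma^2 &= \frac{\omega}{1 - \alpha_1 - \gamma_1}
\end{align*}
```

The conditional variance $h_t$ moves around this fixed level, whereas the simulated series has an error standard deviation that is deliberately doubled for periods 60 through 89, which no single constant level can describe.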
See the section GARCH, IGARCH, EGARCH, and GARCH-M Models on page 362 later in this chapter for details.
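For comparison with the GARCH(1,1) specification, a basic ARCH(q) variance model is requested by giving only Q= in the GARCH= option. The sketch below fits an AR(2)-ARCH(2) model to the same data; the choice q=2 and the output names OUT2 and VHAT2 are illustrative assumptions, not values taken from the text.

```sas
/* Sketch: AR(2) errors with a basic ARCH(2) conditional variance model */
proc autoreg data=a;
   model y = time / nlag=2 garch=(q=2);
   output out=out2 cev=vhat2;      /* hypothetical output data set and variable */
run;
```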
At least one MODEL statement must be specified. One OUTPUT statement can follow each MODEL statement. One HETERO statement can follow each MODEL statement.
Functional Summary
The statements and options used with the AUTOREG procedure are summarized in the following table.
Table 8.1 AUTOREG Functional Summary
Description | Statement | Option

Data Set Options
specify the input data set | AUTOREG | DATA=
write parameter estimates to an output data set | AUTOREG | OUTEST=
include covariances in the OUTEST= data set | AUTOREG | COVOUT
request that the procedure produce graphics via the Output Delivery System | AUTOREG | PLOTS=
write predictions, residuals, and confidence limits to an output data set | OUTPUT | OUT=

Declaring the Role of Variables
specify BY-group processing | BY |

Printing Control Options
request all printing options | MODEL | ALL
print transformed coefficients | MODEL | COEF
print correlation matrix of the estimates | MODEL | CORRB
print covariance matrix of the estimates | MODEL | COVB
print DW statistics up to order j | MODEL | DW=j
print marginal probability of the generalized Durbin-Watson test statistics for large sample sizes | MODEL | DWPROB
compute p-values for the Durbin-Watson test by using a linearized approximation of the design matrix | MODEL | LDW
print inverse of Toeplitz matrix | MODEL | GINV
print the Godfrey LM serial correlation test | MODEL | GODFREY=
print details at each iteration step | MODEL | ITPRINT
print the Durbin t statistic | MODEL | LAGDEP
print the Durbin h statistic | MODEL | LAGDEP=
print the log likelihood value of the regression model | MODEL | LOGLIKL
print the Jarque-Bera normality test | MODEL | NORMAL
print tests for ARCH process | MODEL | ARCHTEST
print the Lagrange multiplier test | HETERO | TEST=LM
print the Chow test | MODEL | CHOW=
print the predictive Chow test | MODEL | PCHOW=
suppress printed output | MODEL | NOPRINT
print partial autocorrelations | MODEL | PARTIAL
print Ramsey's RESET test | MODEL | RESET
print Phillips-Perron tests for stationarity or unit roots | MODEL | STATIONARITY=(PHILLIPS=)
print KPSS tests for stationarity or unit roots | MODEL | STATIONARITY=(KPSS=)
print tests of linear hypotheses | TEST |
specify the test statistics to use | TEST | TYPE=
print the uncentered regression $R^2$ | MODEL | URSQ

Model Estimation Options
specify the order of autoregressive process | MODEL | NLAG=
center the dependent variable | MODEL | CENTER
suppress the intercept parameter | MODEL | NOINT
remove nonsignificant AR parameters | MODEL | BACKSTEP
specify significance level for BACKSTEP | MODEL | SLSTAY=
specify the convergence criterion | MODEL | CONVERGE=
specify the type of covariance matrix | MODEL | COVEST=
set the initial values of parameters used by the iterative optimization algorithm | MODEL | INITIAL=
specify iterative Yule-Walker method | MODEL | ITER
specify maximum number of iterations | MODEL | MAXITER=
specify the estimation method | MODEL | METHOD=
use only first sequence of nonmissing data | MODEL | NOMISS
specify the optimization technique | MODEL | OPTMETHOD=
impose restrictions on the regression estimates | RESTRICT |
estimate and test heteroscedasticity models | HETERO |

GARCH Related Options
specify order of GARCH process | MODEL | GARCH=(Q=,P=)
specify type of GARCH model | MODEL | GARCH=(...,TYPE=)
specify various forms of the GARCH-M model | MODEL | GARCH=(...,MEAN=)
suppress GARCH intercept parameter | MODEL | GARCH=(...,NOINT)
specify the trust region method | MODEL | GARCH=(...,TR)
estimate the GARCH model for the conditional t distribution | MODEL | GARCH=(...) DIST=
estimate the start-up values for the conditional variance equation | MODEL | GARCH=(...,STARTUP=)
specify the functional form of the heteroscedasticity model | HETERO | LINK=
specify that the heteroscedasticity model does not include the unit term | HETERO | NOCONST
impose constraints on the estimated parameters of the heteroscedasticity model | HETERO | COEF=
impose constraints on the estimated standard deviation of the heteroscedasticity model | HETERO | STD=
output conditional error variance | OUTPUT | CEV=
output conditional prediction error variance | OUTPUT | CPEV=
specify the flexible conditional variance form of the GARCH model | HETERO |

Output Control Options
specify confidence limit size | OUTPUT | ALPHACLI=
specify confidence limit size for structural predicted values | OUTPUT | ALPHACLM=
specify the significance level for the upper and lower bounds of the CUSUM and CUSUMSQ statistics | OUTPUT | ALPHACSM=
specify the name of a variable to contain the values of Theil's BLUS residuals | OUTPUT | BLUS=
output the value of the error variance $\sigma_t^2$ | OUTPUT | CEV=
output transformed intercept variable | OUTPUT | CONSTANT=
specify the name of a variable to contain the CUSUM statistics | OUTPUT | CUSUM=
specify the name of a variable to contain the CUSUMSQ statistics | OUTPUT | CUSUMSQ=
specify the name of a variable to contain the upper confidence bound for the CUSUM statistic | OUTPUT | CUSUMUB=
specify the name of a variable to contain the lower confidence bound for the CUSUM statistic | OUTPUT | CUSUMLB=
specify the name of a variable to contain the upper confidence bound for the CUSUMSQ statistic | OUTPUT | CUSUMSQUB=
specify the name of a variable to contain the lower confidence bound for the CUSUMSQ statistic | OUTPUT | CUSUMSQLB=
output lower confidence limit | OUTPUT | LCL=
output lower confidence limit for structural predicted values | OUTPUT | LCLM=
output predicted values | OUTPUT | PREDICTED=
output predicted values of structural part | OUTPUT | PREDICTEDM=
output residuals | OUTPUT | RESIDUAL=
output residuals from structural predictions | OUTPUT | RESIDUALM=
specify the name of a variable to contain the part of the predictive error variance ($v_t$) | OUTPUT | RECPEV=
specify the name of a variable to contain recursive residuals | OUTPUT | RECRES=
output transformed variables | OUTPUT | TRANSFORM=
output upper confidence limit | OUTPUT | UCL=
output upper confidence limit for structural predicted values | OUTPUT | UCLM=
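PROC AUTOREG Statement
PROC AUTOREG options ;
The following options can be used in the PROC AUTOREG statement.
DATA= SAS-data-set
specifies the input SAS data set. If the DATA= option is not specified, PROC AUTOREG uses the most recently created SAS data set.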
OUTEST= SAS-data-set
writes the parameter estimates to an output data set. See the section "OUTEST= Data Set" on page 384 later in this chapter for information on the contents of this data set.
COVOUT
writes the covariance matrix for the parameter estimates to the OUTEST= data set. This option is valid only if the OUTEST= option is specified.
PLOTS<(global-plot-options)> <= (specific-plot-options)>
requests that the AUTOREG procedure produce statistical graphics via the Output Delivery System, provided that the ODS GRAPHICS statement has been specified. For general information about ODS Graphics, see Chapter 21, "Statistical Graphics Using ODS" (SAS/STAT User's Guide). The global-plot-options apply to all relevant plots generated by the AUTOREG procedure. The global-plot-options supported by the AUTOREG procedure follow.
UNPACKPANEL breaks a graphic that is otherwise paneled into individual component plots.
STUDENTRESIDUAL plots the studentized residuals. For the models with the NLAG= or GARCH= options in the MODEL statement or with the HETERO statement, this option is replaced by the STANDARDRESIDUAL option.
STANDARDRESIDUAL plots the standardized residuals.
RESIDUALHISTOGRAM | RESIDHISTOGRAM plots the histogram of residuals.
WHITENOISE plots the white noise probabilities.
NONE suppresses all plots.
In addition, any of the following MODEL statement options can be specified in the PROC AUTOREG statement, which is equivalent to specifying the option for every MODEL statement: ALL, ARCHTEST, BACKSTEP, CENTER, COEF, CONVERGE=, CORRB, COVB, DW=, DWPROB, GINV, ITER, ITPRINT, MAXITER=, METHOD=, NOINT, NOMISS, NOPRINT, and PARTIAL.
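As a rough illustration of the PLOTS option syntax described above, the following sketch unpacks the paneled graphics and requests two of the listed plots. The data set name a and the variables y, x1, and x2 are hypothetical, and the particular combination of plot options is only an assumption about what a user might want.

```sas
/*-- hypothetical data set and variable names --*/
ods graphics on;

proc autoreg data=a plots(unpackpanel)=(standardresidual whitenoise);
   /* NLAG= is specified, so standardized residuals are the relevant plot */
   model y = x1 x2 / nlag=2;
run;

ods graphics off;
```

Because the model uses the NLAG= option, STANDARDRESIDUAL is requested here rather than STUDENTRESIDUAL.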
BY Statement
BY variables ;
A BY statement can be used with PROC AUTOREG to obtain separate analyses on observations in groups defined by the BY variables.
MODEL Statement
MODEL dependent = regressors / options ;
The MODEL statement specifies the dependent variable and independent regressor variables for the regression model. If no independent variables are specified in the MODEL statement, only the mean is fitted. (This is a way to obtain autocorrelations of a series.) Models can be given labels of up to eight characters. Model labels are used in the printed output to identify the results for different models. The model label is specified as follows:
label : MODEL . . . ;
The following options can be used in the MODEL statement after a slash (/).
CENTER
centers the dependent variable by subtracting its mean and suppresses the intercept parameter from the model. This option is valid only when the model does not have regressors (explanatory variables).
NOINT
suppresses the intercept parameter.
NLAG= number
NLAG= (number-list)
specifies the order of the autoregressive error process or the subset of autoregressive error lags to be fitted. Note that NLAG=3 is the same as NLAG=(1 2 3). If the NLAG= option is not specified, PROC AUTOREG does not fit an autoregressive model.
DIST= value
specifies the distribution assumed for the error term in GARCH-type estimation. If no GARCH= option is specified, the option is ignored. If EGARCH is specified, the distribution is always the normal distribution. The values of the DIST= option are as follows:
T specifies Student's t distribution.
NORMAL specifies the standard normal distribution. The default is DIST=NORMAL.
GARCH= ( option-list )
specifies a GARCH-type conditional heteroscedasticity model. The GARCH= option in the MODEL statement specifies the family of ARCH models to be estimated. The GARCH(1,1) regression model is specified in the following statement:
model y = x1 x2 / garch=(q=1,p=1);
When you want to estimate the subset of ARCH terms, such as ARCH(1,3), you can write the SAS statement as follows:
model y = x1 x2 / garch=(q=(1 3));
With the TYPE= option, you can specify various GARCH models. The IGARCH(2,1) model without trend in variance is estimated as follows:
model y = / garch=(q=2,p=1,type=integ,noint);
The following options can be used in the GARCH=( ) option. The options are listed within parentheses and separated by commas.
Q= number
Q= (number-list)
specifies the order of the process or the subset of ARCH terms to be fitted.
P= number
P= (number-list)
specifies the order of the process or the subset of GARCH terms to be fitted. If only the P= option is specified, Q=1 is assumed.
TYPE= value
specifies the type of GARCH model. The values of the TYPE= option are as follows:
EXP specifies the exponential GARCH or EGARCH model.
INTEGRATED specifies the integrated GARCH or IGARCH model.
NELSON | NELSONCAO specifies the Nelson-Cao inequality constraints.
NONNEG specifies the GARCH model with nonnegativity constraints.
STATIONARY constrains the sum of GARCH coefficients to be less than 1.
MEAN= value
specifies the functional form of the GARCH-M model. The values of the MEAN= option are as follows:
LINEAR specifies the linear function: $y_t = x_t'\beta + \delta h_t + \epsilon_t$
LOG specifies the log function: $y_t = x_t'\beta + \delta \ln(h_t) + \epsilon_t$
SQRT specifies the square root function: $y_t = x_t'\beta + \delta \sqrt{h_t} + \epsilon_t$
NOINT
suppresses the intercept parameter in the conditional variance model. This option is valid only with the TYPE=INTEG option.
STARTUP= MSE | ESTIMATE
STARTUP=ESTIMATE requests that the positive constant c for the start-up values of the GARCH conditional error variance process be estimated. By default or if STARTUP=MSE is specified, the value of the mean squared error is used as the constant.
TR
uses the trust region method for GARCH estimation. This algorithm is numerically stable, though computation is expensive. The double quasi-Newton method is the default.
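The GARCH= suboptions and the DIST= option can be combined in one MODEL statement. The following is a minimal sketch, assuming a hypothetical data set a with variables y, x1, and x2, of a GARCH(1,1)-M model with the square root mean form and Student's t errors. Note that DIST= is a MODEL statement option and is written outside the GARCH=( ) parentheses.

```sas
/*-- hypothetical data set and variable names --*/
proc autoreg data=a;
   /* GARCH(1,1)-M model, square root form, conditional t distribution */
   model y = x1 x2 / garch=(q=1, p=1, mean=sqrt) dist=t;
run;
```

Adding TR inside the GARCH=( ) list would request the trust region method instead of the default double quasi-Newton method, at a higher computational cost.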
Printing Options
ALL
requests all printing options.
ARCHTEST
requests the Q and LM statistics testing for the absence of ARCH effects.
CHOW= ( obs1 . . . obsn )
computes Chow tests to evaluate the stability of the regression coefficient. The Chow test is also called the analysis-of-variance test. Each value obsi listed on the CHOW= option specifies a break point of the sample. The sample is divided into parts at the specified break point, with observations before obsi in the first part and obsi and later observations in the second part, and the fits of the model in the two parts are compared to determine whether both parts of the sample are consistent with the same model. The break points obsi refer to observations within the time range of the dependent variable, ignoring missing values before the start of the dependent series. Thus, CHOW=20 specifies the 20th observation after the first nonmissing observation for the dependent variable. For example, if the dependent variable Y contains 10 missing values before the first observation with a nonmissing Y value, then CHOW=20 actually refers to the 30th observation in the data set. When you specify the break point, you should note the number of presample missing values.
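The break point arithmetic can be illustrated with a short sketch. It assumes a hypothetical data set a in which Y has 10 missing values before its first nonmissing observation, so that a break at data set observation 30 must be requested as 20. The parenthesized list form of the option value is an assumption based on the option taking several obsi values.

```sas
/*-- hypothetical data: Y is missing in observations 1 through 10 --*/
proc autoreg data=a;
   /* break points are counted from the first nonmissing Y,      */
   /* so 20 and 40 refer to data set observations 30 and 50 here */
   model y = x1 x2 / chow=(20 40);
run;
```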
COEF
prints the transformation coefficients for the first p observations. These coefficients are formed from a scalar multiplied by the inverse of the Cholesky root of the Toeplitz matrix of autocovariances.
CORRB
prints the correlation matrix of the parameter estimates.
COVB
prints the covariance matrix of the parameter estimates.
COVEST= OP | HESSIAN | QML
specifies the type of covariance matrix for the GARCH or heteroscedasticity model. When COVEST=OP is specified, the outer product matrix is used to compute the covariance matrix of the parameter estimates. The COVEST=HESSIAN option produces the covariance matrix by using the Hessian matrix. The quasi-maximum likelihood estimates are computed with COVEST=QML. The default is COVEST=OP.
DW= n
prints Durbin-Watson statistics up to the order n. The default is DW=1. When the LAGDEP option is specified, the Durbin-Watson statistic is not printed unless the DW= option is explicitly specified.
DWPROB
now produces p-values for the generalized Durbin-Watson test statistics for large sample sizes. Previously, the Durbin-Watson probabilities were calculated only for small sample sizes. The new method of calculating Durbin-Watson probabilities is based on the algorithm of Ansley, Kohn, and Shively (1992).
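The DW= and DWPROB options are typically used together. The following sketch, with a hypothetical data set a and variables y, x1, and x2, requests generalized Durbin-Watson statistics up to order 4 along with their p-values.

```sas
/*-- hypothetical data set and variable names --*/
proc autoreg data=a;
   /* Durbin-Watson statistics for orders 1 through 4, with p-values */
   model y = x1 x2 / dw=4 dwprob;
run;
```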
GINV
prints the inverse of the Toeplitz matrix of autocovariances for the Yule-Walker solution. See the section Computational Methods on page 359 later in this chapter for details.
GODFREY
GODFREY= r
prints the Godfrey Lagrange multiplier (LM) test for serial correlation.
ITPRINT
prints the objective function and parameter estimates at each iteration. The objective function is the full log likelihood function for the maximum likelihood method, while the error sum of squares is produced as the objective function of unconditional least squares. For the ML method, the ITPRINT option prints the value of the full log likelihood function, not the concentrated likelihood.
LAGDEP LAGDV
prints the Durbin t statistic, which is used to detect residual autocorrelation in the presence of lagged dependent variables. See the section Generalized Durbin-Watson Tests on page 370 later in this chapter for details.
LAGDEP= name LAGDV= name
prints the Durbin h statistic for testing the presence of first-order autocorrelation when regressors contain the lagged dependent variable whose name is specified as LAGDEP=name. If the Durbin h statistic cannot be computed, the asymptotically equivalent t statistic is printed instead. See the section "Generalized Durbin-Watson Tests" on page 370 for details. When the regression model contains several lags of the dependent variable, specify the lagged dependent variable for the smallest lag in the LAGDEP= option. For example:
model y = x1 x2 ylag2 ylag3 / lagdep=ylag2;
LOGLIKL
prints the log likelihood value of the regression model, assuming normally distributed errors.
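NOPRINT
suppresses all printed output.
NORMAL
prints the Jarque-Bera normality test.
PARTIAL
prints partial autocorrelations.
PCHOW= ( obs1 . . . obsn )
computes the predictive Chow test. The form of the PCHOW= option is the same as the CHOW= option; see the discussion of the CHOW= option earlier in this chapter.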
RESET
produces Ramsey's RESET test statistics. The RESET option tests the null model
$$y_t = x_t\beta + u_t$$
against the alternative
$$y_t = x_t\beta + \sum_{j=2}^{p} \phi_j \hat{y}_t^{\,j} + u_t$$
where $\hat{y}_t$ is the predicted value from the OLS estimation of the null model. The RESET option produces three RESET test statistics for $p = 2$, 3, and 4.
STATIONARITY= ( PHILLIPS )
STATIONARITY= ( PHILLIPS=( value . . . value ) )
STATIONARITY= ( KPSS )
STATIONARITY= ( KPSS=(KERNEL=TYPE ) )
STATIONARITY= ( KPSS=(KERNEL=TYPE TRUNCPOINTMETHOD) )
STATIONARITY= ( PHILLIPS<=(. . . )>, KPSS<=(. . . )> )
specifies tests of stationarity or unit roots. The STATIONARITY= option provides Phillips-Perron, Phillips-Ouliaris, and KPSS tests.
The PHILLIPS or PHILLIPS= suboption of the STATIONARITY= option produces the Phillips-Perron unit root test when there are no regressors in the MODEL statement. When the model includes regressors, the PHILLIPS option produces the Phillips-Ouliaris cointegration test. The PHILLIPS option can be abbreviated as PP.
The PHILLIPS option performs the Phillips-Perron test for three null hypothesis cases: zero mean, single mean, and deterministic trend. For each case, the PHILLIPS option computes two test statistics, $\hat{Z}_\rho$ and $\hat{Z}_\tau$ (in the original paper they are referred to as $\hat{Z}_\alpha$ and $\hat{Z}_t$), and reports their p-values. These test statistics have the same limiting distributions as the corresponding Dickey-Fuller tests.
The three types of the Phillips-Perron unit root test reported by the PHILLIPS option are as follows.
Zero mean
computes the Phillips-Perron test statistic based on the zero mean autoregressive model:
$$y_t = \rho y_{t-1} + u_t$$
Single mean
computes the Phillips-Perron test statistic based on the autoregressive model with a constant term:
$$y_t = \mu + \rho y_{t-1} + u_t$$
Trend
computes the Phillips-Perron test statistic based on the autoregressive model with constant and time trend terms:
$$y_t = \mu + \rho y_{t-1} + \delta t + u_t$$
You can specify several truncation points l for weighted variance estimators by using the PHILLIPS=(l1 . . . ln) specification. The statistic for each truncation point l is computed as
$$\sigma_{Tl}^2 = \frac{1}{T}\sum_{i=1}^{T}\hat{u}_i^2 + \frac{2}{T}\sum_{s=1}^{l} w_{sl} \sum_{t=s+1}^{T} \hat{u}_t \hat{u}_{t-s}$$
where $w_{sl} = 1 - s/(l+1)$ and $\hat{u}_t$ are OLS residuals. If you specify the PHILLIPS option without specifying truncation points, the default truncation point is $\max(1, \sqrt{T}/5)$, where T is the number of observations.
The Phillips-Perron test can be used in general time series models since its limiting distribution is derived in the context of a class of weakly dependent and heterogeneously distributed data. The marginal probability for the Phillips-Perron test is computed assuming that error disturbances are normally distributed.
When there are regressors in the MODEL statement, the PHILLIPS option computes the Phillips-Ouliaris cointegration test statistic by using the least squares residuals. The normalized cointegrating vector is estimated using OLS regression. Therefore, the cointegrating vector estimates might vary with the regressand (normalized element) unless the regression R-square is 1.
The marginal probabilities for cointegration testing are not produced. You can refer to Phillips and Ouliaris (1990) tables Ia-Ic for the $\hat{Z}_\rho$ test and tables IIa-IIc for the $\hat{Z}_\tau$ test. The standard residual-based cointegration test can be obtained using the NOINT option in the MODEL statement, while the demeaned test is computed by including the intercept term. To obtain the demeaned and detrended cointegration tests, you should include the time trend variable in the regressors. Refer to Phillips and Ouliaris (1990) or Hamilton (1994, Tbl. 19.1) for information about the Phillips-Ouliaris cointegration test. Note that Hamilton (1994, Tbl. 19.1) uses $Z_\rho$ and $Z_t$ instead of the original Phillips and Ouliaris (1990) notation. We adopt the notation introduced in Hamilton. To distinguish from Student's t distribution, these two statistics are named accordingly as $\rho$ (rho) and $\tau$ (tau).
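The two uses of the PHILLIPS suboption can be sketched as follows. The data set a and the variables y, y1, and y2 are hypothetical, and the truncation points 2 and 4 are arbitrary choices for illustration.

```sas
/*-- hypothetical data set and variable names --*/

/* no regressors: Phillips-Perron unit root tests at truncation points 2 and 4 */
proc autoreg data=a;
   model y = / stationarity=(phillips=(2 4));
run;

/* with a regressor: Phillips-Ouliaris cointegration test (demeaned, */
/* because the intercept is kept), using the default truncation point */
proc autoreg data=a;
   model y1 = y2 / stationarity=(phillips);
run;
```

Adding the NOINT option to the second MODEL statement would give the standard residual-based test instead of the demeaned one.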
The KPSS, KPSS=(KERNEL=TYPE), or KPSS=(KERNEL=TYPE TRUNCPOINTMETHOD) specifications of the STATIONARITY= option produce the Kwiatkowski, Phillips, Schmidt, and Shin (1992) (KPSS) unit root test.
Unlike the null hypothesis of the Dickey-Fuller and Phillips-Perron tests, the null hypothesis of the KPSS test states that the time series is stationary. As a result, it tends to reject a random walk more often. If the model does not have an intercept, the KPSS option performs the KPSS test for three null hypothesis cases: zero mean, single mean, and deterministic trend. Otherwise, it reports single mean and deterministic trend only. It computes a test statistic and tabulated p-values (see Hobijn, Franses, and Ooms (2004)) for the hypothesis that the random walk component of the time series is equal to zero in the following cases (for details, see "Kwiatkowski, Phillips, Schmidt, and Shin (KPSS) Unit Root Test" on page 378):
Zero mean
computes the KPSS test statistic based on the zero mean autoregressive model. The p-value reported is taken from Hobijn, Franses, and Ooms (2004).
$$y_t = u_t$$
Single mean
computes the KPSS test statistic based on the autoregressive model with a constant term. The p-value reported is taken from Kwiatkowski et al. (1992).
$$y_t = \mu + u_t$$
Trend
computes the KPSS test statistic based on the autoregressive model with constant and time trend terms. The p-value reported is taken from Kwiatkowski et al. (1992).
$$y_t = \mu + \delta t + u_t$$
This test depends on the long-run variance of the series being defined as
$$\sigma_{Tl}^2 = \frac{1}{T}\sum_{i=1}^{T}\hat{u}_i^2 + \frac{2}{T}\sum_{s=1}^{l} w_{sl} \sum_{t=s+1}^{T} \hat{u}_t \hat{u}_{t-s}$$
where $w_{sl}$ is a kernel, $l$ is the maximum lag (truncation point), and $\hat{u}_t$ are OLS residuals or the original data series. You can specify two types of the kernel:
KERNEL=NW | BART
Newey-West (or Bartlett) kernel
$$w(s,l) = 1 - \frac{s}{l+1}$$
KERNEL=QS
Quadratic spectral kernel
$$w(s/l) = w(x) = \frac{25}{12\pi^2 x^2}\left(\frac{\sin(6\pi x/5)}{6\pi x/5} - \cos(6\pi x/5)\right)$$
You can set the truncation point by using three different methods:
SCHW=c
Schwert maximum lag formula, $\max\{1, \lfloor c\,(T/100)^{1/4} \rfloor\}$
LAG=s
manually defined number of lags
AUTO
automatic bandwidth selection (Hobijn, Franses, and Ooms 2004) (for details, see "Kwiatkowski, Phillips, Schmidt, and Shin (KPSS) Unit Root Test" on page 378)
If STATIONARITY=KPSS is defined without additional parameters, the Newey-West kernel is used. For the Newey-West kernel the default is the Schwert truncation point method with c = 4. For the quadratic spectral kernel the default is AUTO.
The KPSS test can be used in general time series models since its limiting distribution is derived in the context of a class of weakly dependent and heterogeneously distributed data. The limiting probability for the KPSS test is computed assuming that error disturbances are normally distributed.
The asymptotic distribution of the test does not depend on the presence of regressors in the MODEL statement.
The marginal probabilities for the test are reported. They are copied from Kwiatkowski et al. (1992) and Hobijn, Franses, and Ooms (2004). When there is an intercept in the model, results for mean and trend test statistics are provided.
Examples: To test for stationarity of regression residuals, using default KERNEL=NW and SCHW=4, you can use the following code:
/*-- test for stationarity of regression residuals --*/
proc autoreg data=a;
   model y= / stationarity = (KPSS);
run;
To test for stationarity of regression residuals, using quadratic spectral kernel and automatic bandwidth selection, you can use:
/*-- test for stationarity using quadratic
     spectral kernel and automatic bandwidth selection --*/
proc autoreg data=a;
   model y= / stationarity = (KPSS=(KERNEL=QS AUTO));
run;
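The truncation point can also be set explicitly rather than left to the defaults. The following sketch uses the same hypothetical data set a and series y as the examples above; the values SCHW=12 and LAG=6 are arbitrary and chosen only to show the syntax.

```sas
/*-- Newey-West kernel with Schwert constant c=12 instead of the default 4 --*/
proc autoreg data=a;
   model y = / stationarity = (KPSS=(KERNEL=NW SCHW=12));
run;

/*-- quadratic spectral kernel with a manually chosen lag --*/
proc autoreg data=a;
   model y = / stationarity = (KPSS=(KERNEL=QS LAG=6));
run;
```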
URSQ
prints the uncentered regression $R^2$. The uncentered regression $R^2$ is useful to compute Lagrange multiplier test statistics, since most LM test statistics are computed as T*URSQ, where T is the number of observations used in estimation.
BACKSTEP
removes insignificant autoregressive parameters. The parameters are removed in order of least significance. This backward elimination is done only once on the Yule-Walker estimates computed after the initial ordinary least squares estimation. The BACKSTEP option can be used with all estimation methods since the initial parameter values for other estimation methods are estimated using the Yule-Walker method.
SLSTAY= value
specifies the significance level criterion to be used by the BACKSTEP option. The default is SLSTAY=.05.
CONVERGE= value
specifies the convergence criterion. If the maximum absolute value of the change in the autoregressive parameter estimates between iterations is less than this amount, then convergence is assumed. The default is CONVERGE=.001. If the GARCH= option and/or the HETERO statement is specified, convergence is assumed when the absolute maximum gradient is smaller than the value specified by the CONVERGE= option or when the relative gradient is smaller than 1E-8. In that case, the default is CONVERGE=1E-5.
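A short sketch of backward elimination follows. It assumes a hypothetical data set a with a dependent variable y and a regressor named time; the starting order of 5 and the 0.10 significance level are arbitrary.

```sas
/*-- hypothetical data set and variable names --*/
proc autoreg data=a;
   /* start from an AR(5) error model and drop lags that are */
   /* not significant at the 0.10 level                      */
   model y = time / nlag=5 backstep slstay=0.10;
run;
```

Raising SLSTAY= above the default of .05 keeps more autoregressive lags in the final model.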
INITIAL= ( initial-values ) START= ( initial-values )
specifies initial values for some or all of the parameter estimates. The values specified are assigned to model parameters in the same order as the parameter estimates are printed in the AUTOREG procedure output. The order of values in the INITIAL= or START= option is as follows: the intercept, the regressor coefficients, the autoregressive parameters, the ARCH parameters, the GARCH parameters, the inverted degrees of freedom for Student's t distribution, the start-up value for conditional variance, and the heteroscedasticity model parameters specified by the HETERO statement.
The following is an example of specifying initial values for an AR(1)-GARCH(1,1) model with regressors X1 and X2:
/*-- specifying initial values --*/
model y = x1 x2 / nlag=1 garch=(p=1,q=1)
                  initial=(1 1 1 .5 .8 .1 .6);
The model specified by this MODEL statement is
$$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \nu_t$$
$$\nu_t = \epsilon_t - \phi_1 \nu_{t-1}$$
$$\epsilon_t = \sqrt{h_t}\, e_t$$
$$h_t = \omega + \alpha_1 \epsilon_{t-1}^2 + \gamma_1 h_{t-1}$$
$$\epsilon_t \sim N(0, \sigma_t^2)$$
The initial values for the regression parameters, INTERCEPT ($\beta_0$), X1 ($\beta_1$), and X2 ($\beta_2$), are specified as 1. The initial value of the AR(1) coefficient ($\phi_1$) is specified as 0.5. The initial value of ARCH0 ($\omega$) is 0.8, the initial value of ARCH1 ($\alpha_1$) is 0.1, and the initial value of GARCH1 ($\gamma_1$) is 0.6.
When you use the RESTRICT statement, the initial values specified by the INITIAL= option should satisfy the restrictions specified for the parameter estimates. If they do not, the initial values you specify are adjusted to satisfy the restrictions.
LDW
specifies that p-values for the Durbin-Watson test be computed using a linearized approximation of the design matrix when the model is nonlinear due to the presence of an autoregressive error process. (The Durbin-Watson tests of the OLS linear regression model residuals are not affected by the LDW option.) Refer to White (1992) for Durbin-Watson testing of nonlinear models.
MAXITER= number
sets the maximum number of iterations allowed.
METHOD= value
requests the type of estimates to be computed. The values of the METHOD= option are as follows:
METHOD=ML specifies maximum likelihood estimates.
METHOD=ULS specifies unconditional least squares estimates.
METHOD=YW specifies Yule-Walker estimates.
METHOD=ITYW specifies iterative Yule-Walker estimates.
If the GARCH= or LAGDEP option is specified, the default is METHOD=ML. Otherwise, the default is METHOD=YW.
NOMISS
restricts the estimation to the first contiguous sequence of data with no missing values. Otherwise, all complete observations are used.
OPTMETHOD= QN | TR
specifies the optimization technique when the GARCH or heteroscedasticity model is estimated. The OPTMETHOD=QN option specifies the quasi-Newton method. The OPTMETHOD=TR option specifies the trust region method. The default is OPTMETHOD=QN.
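The estimation control options can be combined in a single MODEL statement. The following is a sketch only, with a hypothetical data set a and variables y and x1; the iteration limit of 200 is an arbitrary illustration, not a recommended setting.

```sas
/*-- hypothetical data set and variable names --*/
proc autoreg data=a;
   /* AR(1)-GARCH(1,1) model estimated with the trust region method */
   /* and a raised iteration limit                                  */
   model y = x1 / nlag=1 garch=(q=1, p=1)
                  optmethod=tr maxiter=200;
run;
```

Because the GARCH= option is specified, METHOD=ML is the default and does not need to be requested.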
HETERO Statement
The HETERO statement specifies variables that are related to the heteroscedasticity of the residuals and the way these variables are used to model the error variance of the regression. The syntax of the HETERO statement is
HETERO variables / options ;
The heteroscedastic regression model supported by the HETERO statement is
$$y_t = x_t'\beta + \epsilon_t$$
$$\epsilon_t \sim N(0, \sigma_t^2)$$
$$\sigma_t^2 = \sigma^2 h_t$$
$$h_t = l(z_t'\eta)$$
The HETERO statement specifies a model for the conditional variance $h_t$. The vector $z_t$ is composed of the variables listed in the HETERO statement, $\eta$ is a parameter vector, and $l(\cdot)$ is a link function that depends on the value of the LINK= option. In the printed output, HET0 represents the estimate of sigma, while HET1 through HETn are the estimates of parameters in the $\eta$ vector.
The keyword XBETA can be used in the variables list to refer to the model predicted value $x_t'\beta$. If XBETA is specified in the variables list, other variables in the HETERO statement will be ignored. In addition, XBETA cannot be specified in the GARCH process.
For heteroscedastic regression models without GARCH effects, the errors $\epsilon_t$ are assumed to be uncorrelated; the heteroscedasticity models specified by the HETERO statement cannot be combined with an autoregressive model for the errors. Thus, when a HETERO statement is used, the NLAG= option cannot be specified unless the GARCH= option is also specified.
You can specify the following options in the HETERO statement.
LINK= value
specifies the functional form of the heteroscedasticity model. By default, LINK=EXP. If you specify a GARCH model with the HETERO statement, the model is estimated using LINK=LINEAR only. For details, see the section "Using the HETERO Statement with GARCH Models" on page 364. Values of the LINK= option are as follows:
EXP specifies the exponential link function. The following model is estimated when you specify LINK=EXP:
$$h_t = \exp(z_t'\eta)$$
SQUARE specifies the square link function. The following model is estimated when you specify LINK=SQUARE:
$$h_t = (1 + z_t'\eta)^2$$
LINEAR specifies the linear function; that is, the HETERO statement variables predict the error variance linearly. The following model is estimated when you specify LINK=LINEAR:
$$h_t = (1 + z_t'\eta)$$
COEF= value
imposes constraints on the estimated parameters $\eta$ of the heteroscedasticity model. The values of the COEF= option are as follows:
NONNEG specifies that the estimated heteroscedasticity parameters $\eta$ must be nonnegative. When the HETERO statement is used in conjunction with the GARCH= option, the default is COEF=NONNEG.
UNIT constrains all heteroscedasticity parameters $\eta$ to equal 1.
ZERO constrains all heteroscedasticity parameters $\eta$ to equal 0.
UNREST specifies unrestricted estimation of $\eta$. When the GARCH= option is not specified, the default is COEF=UNREST.
STD= value
imposes constraints on the estimated standard deviation $\sigma$ of the heteroscedasticity model. The values of the STD= option are as follows:
NONNEG specifies that the estimated standard deviation parameter $\sigma$ must be nonnegative.
UNIT constrains the standard deviation parameter $\sigma$ to equal 1.
UNREST specifies unrestricted estimation of $\sigma$. This is the default.
TEST= LM
produces a Lagrange multiplier test for heteroscedasticity. The null hypothesis is homoscedasticity; the alternative hypothesis is heteroscedasticity of the form specified by the HETERO statement. The power of the test depends on the variables specified in the HETERO statement. The test may give different results depending on the functional form specified by the LINK= option. However, in many cases the test does not depend on the LINK= option. The test is invariant to the form of $h_t$ when $h_t(0) = 1$ and $h_t'(0) \ne 0$. (The condition $h_t(0) = 1$ is satisfied except when the NOCONST option is specified with LINK=SQUARE or LINK=LINEAR.)
NOCONST
specifies that the heteroscedasticity model does not include the unit term for the LINK=SQUARE and LINK=LINEAR options. For example, the following model is estimated when you specify the options LINK=SQUARE NOCONST:
$$h_t = (z_t'\eta)^2$$
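The HETERO statement options described above can be put together as in the following sketch. The data set a, the regressors x1 and x2, and the variance-related variables z1 and z2 are hypothetical names used only for illustration.

```sas
/*-- hypothetical data set and variable names --*/
proc autoreg data=a;
   model y = x1 x2;
   /* error variance modeled with the exponential link (the default), */
   /* plus an LM test of the null hypothesis of homoscedasticity      */
   hetero z1 z2 / link=exp test=lm;
run;
```

No NLAG= option appears in the MODEL statement, because an autoregressive error model cannot be combined with the HETERO statement unless GARCH= is also specified.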
RESTRICT Statement
The RESTRICT statement provides constrained estimation. The syntax of the RESTRICT statement is
RESTRICT equation , . . . , equation ;
The RESTRICT statement places restrictions on the parameter estimates for covariates in the preceding MODEL statement. Any number of RESTRICT statements can follow a MODEL statement. Several restrictions can be specified in a single RESTRICT statement by separating the individual restrictions with commas.
Each restriction is written as a linear equation composed of constants and parameter names. Refer to model parameters by the name of the corresponding regressor variable. Each name used in the equation must be a regressor in the preceding MODEL statement. Use the keyword INTERCEPT to refer to the intercept parameter in the model.
The following is an example of a RESTRICT statement:
model y = a b c d;
restrict a+b=0, 2*d-c=0;
When restricting a linear combination of parameters to be 0, you can omit the equal sign. For example, the following RESTRICT statement is equivalent to the preceding example:
restrict a+b, 2*d-c;
The following RESTRICT statement constrains the parameter estimates for three regressors (X1, X2, and X3) to be equal:
restrict x1 = x2, x2 = x3;
The preceding restriction can be abbreviated as follows:
restrict x1 = x2 = x3;
Only simple linear combinations of parameters can be specified in RESTRICT statement expressions; complex expressions involving parentheses, division, functions, or complex products are not allowed.
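Restrictions can also involve nonzero constants and the INTERCEPT keyword. The following sketch, with a hypothetical data set a and regressors x1, x2, and x3, forces the regression through the origin and constrains two coefficients to sum to 1.

```sas
/*-- hypothetical data set and variable names --*/
proc autoreg data=a;
   model y = x1 x2 x3;
   /* two linear restrictions in one statement, separated by a comma */
   restrict intercept = 0, x1 + x2 = 1;
run;
```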
TEST Statement
The AUTOREG procedure now supports a TEST statement for linear hypothesis tests. The syntax of the TEST statement is
TEST equation , . . . , equation / option ;
The TEST statement tests hypotheses about the covariates in the model estimated by the preceding MODEL statement. Each equation specifies a linear hypothesis to be tested. If more than one equation is specified, the equations are separated by commas.
Each test is written as a linear equation composed of constants and parameter names. Refer to parameters by the name of the corresponding regressor variable. Each name used in the equation must be a regressor in the preceding MODEL statement. Use the keyword INTERCEPT to refer to the intercept parameter in the model.
You can specify the following option in the TEST statement:
TYPE= value
specifies the test statistics to use, F or Wald. TYPE=F produces an F test. TYPE=WALD produces a Wald test. The default is TYPE=F.
The following example of a TEST statement tests the hypothesis that the coefficients of two regressors A and B are equal:
model y = a b c d;
test a = b;
To test separate null hypotheses, use separate TEST statements. To test a joint hypothesis, specify the component hypotheses on the same TEST statement, separated by commas. For example, consider the following linear model:
$$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \epsilon_t$$
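For a model of this form, the distinction between separate and joint tests might be sketched as follows. The data set a and the specific hypotheses are hypothetical choices for illustration: the first two TEST statements test $\beta_0 = 1$ and $\beta_1 + \beta_2 = 0$ one at a time, and the third tests both restrictions jointly with a Wald statistic.

```sas
/*-- hypothetical data set and hypotheses --*/
proc autoreg data=a;
   model y = x1 x2;
   /* two separate null hypotheses, each with its own F test */
   test intercept = 1;
   test x1 + x2 = 0;
   /* one joint null hypothesis, tested with a Wald statistic */
   test intercept = 1, x1 + x2 = 0 / type=wald;
run;
```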
OUTPUT Statement
OUTPUT OUT=SAS-data-set keyword = options . . . ;
The OUTPUT statement creates an output SAS data set as specified by the following options.
OUT= SAS-data-set
names the output SAS data set containing the predicted and transformed values. If the OUT= option is not specified, the new data set is named according to the DATAn convention.
ALPHACLI= number
sets the confidence limit size for the estimates of future values of the response time series. The ALPHACLI= value must be between 0 and 1. The resulting confidence interval has 1-number confidence. The default is ALPHACLI=.05, corresponding to a 95% confidence interval.
ALPHACLM= number
sets the confidence limit size for the estimates of the structural or regression part of the model. The ALPHACLM= value must be between 0 and 1. The resulting confidence interval has 1-number confidence. The default is ALPHACLM=.05, corresponding to a 95% confidence interval.
ALPHACSM= .01 | .05 | .10
specifies the significance level for the upper and lower bounds of the CUSUM and CUSUMSQ statistics output by the CUSUMLB=, CUSUMUB=, CUSUMSQLB=, and CUSUMSQUB= options. The significance level specified by the ALPHACSM= option can be .01, .05, or .10. Other values are not supported.
The following options are of the form KEYWORD=name, where KEYWORD specifies the statistic to include in the output data set and name gives the name of the variable in the OUT= data set containing the statistic.
BLUS= variable
specifies the name of a variable to contain the values of Theil's BLUS residuals. Refer to Theil (1971) for more information on BLUS residuals.
CEV= variable
HT= variable
writes to the output data set the value of the error variance $\sigma_t^2$ from the heteroscedasticity model specified by the HETERO statement or the value of the conditional error variance $h_t$ specified by the GARCH= option in the MODEL statement.
CPEV= variable
writes the conditional prediction error variance to the output data set. The value of the conditional prediction error variance is equal to that of the conditional error variance when there are no autoregressive parameters. For the exponential GARCH model, the conditional prediction error variance cannot be calculated. See the section Predicted Values on page 381 later in this chapter for details.
CONSTANT= variable
writes the transformed intercept to the output data set. The details of the transformation are described in Computational Methods on page 359 later in this chapter.
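CUSUM= variable
specifies the name of a variable to contain the CUSUM statistics.
CUSUMSQ= variable
specifies the name of a variable to contain the CUSUMSQ statistics.
CUSUMUB= variable
specifies the name of a variable to contain the upper confidence bound for the CUSUM statistic.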
CUSUMLB= variable
specifies the name of a variable to contain the lower confidence bound for the CUSUM statistic.
CUSUMSQUB= variable
specifies the name of a variable to contain the upper confidence bound for the CUSUMSQ statistic.
CUSUMSQLB= variable
specifies the name of a variable to contain the lower confidence bound for the CUSUMSQ statistic.
LCL= name
writes the lower confidence limit for the predicted value (specified in the PREDICTED= option) to the output data set. The size of the confidence interval is set by the ALPHACLI= option. When a GARCH model is estimated, the lower confidence limit is calculated assuming that the disturbances have homoscedastic conditional variance. See the section Predicted Values on page 381 later in this chapter for details.
LCLM= name
writes the lower confidence limit for the structural predicted value (specified in the PREDICTEDM= option) to the output data set under the name given. The size of the confidence interval is set by the ALPHACLM= option.
PREDICTED= name
P= name
writes the predicted values to the output data set. These values are formed from both the structural and autoregressive parts of the model. See the section Predicted Values on page 381 later in this chapter for details.
PREDICTEDM= name
PM= name
writes the structural predicted values to the output data set. These values are formed from only the structural part of the model. See the section Predicted Values on page 381 later in this chapter for details.
RECPEV= variable
specifies the name of a variable to contain the part of the predictive error variance ($v_t$) that is used to compute the recursive residuals.
RECRES= variable
specifies the name of a variable to contain recursive residuals. The recursive residuals are used to compute the CUSUM and CUSUMSQ statistics.
RESIDUAL= name
R= name
writes the residuals from the predicted values based on both the structural and time series parts of the model to the output data set.
RESIDUALM= name
RM= name
writes the residuals from the structural prediction to the output data set.
TRANSFORM= variables
transforms the specified variables from the input data set by the autoregressive model and writes the transformed variables to the output data set. The details of the transformation are described in Computational Methods on page 359 later in this chapter. If you need to reproduce the data suitable for reestimation, you must also transform an intercept variable. To do this, transform a variable that is all 1s or use the CONSTANT= option.
UCL= name
writes the upper confidence limit for the predicted value (specified in the PREDICTED= option) to the output data set. The size of the confidence interval is set by the ALPHACLI= option. When the GARCH model is estimated, the upper confidence limit is calculated assuming that the disturbances have homoscedastic conditional variance. See the section Predicted Values on page 381 later in this chapter for details.
UCLM= name
writes the upper confidence limit for the structural predicted value (specified in the PREDICTEDM= option) to the output data set. The size of the confidence interval is set by the ALPHACLM= option.
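The following sketch shows how several of the OUTPUT options described above might be combined. The data set A, the regressor X, and all output variable names are hypothetical; only the option keywords come from this section.

```sas
proc autoreg data=a;                 /* A, Y, X are hypothetical names      */
   model y = x / nlag=2;             /* regression with AR(2) errors        */
   output out=pred                   /* output data set                     */
          p=yhat   pm=ytrend         /* full and structural predictions     */
          r=resid  rm=residm         /* corresponding residuals             */
          lcl=lower ucl=upper        /* limits for the full prediction      */
          alphacli=.10;              /* 90% limits instead of the default   */
run;
```

Here YHAT uses both the structural and autoregressive parts of the model, whereas YTREND uses only the structural part, so the two differ whenever the autoregressive parameters are nonzero.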
Missing Values
PROC AUTOREG skips any missing values at the beginning of the data set. If the NOMISS option is specified, the first contiguous set of data with no missing values is used; otherwise, all data with nonmissing values for the independent and dependent variables are used. Note, however, that the observations containing missing values are still needed to maintain the correct spacing in the time series. PROC AUTOREG can generate predicted values when the dependent variable is missing.
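Because predicted values are produced even when the dependent variable is missing, forecasts can be obtained by appending future observations with a missing response. The sketch below assumes a hypothetical data set A holding 36 observations of Y and a trend variable TIME; the data set names and the forecast horizon are illustrative only.

```sas
/* Hypothetical future periods: regressor known, response missing */
data future;
   y = .;
   do time = 37 to 46;
      output;
   end;
run;

/* Append the future periods to the historical data */
data b;
   set a future;
run;

/* Missing Y values are not used in estimation but are predicted */
proc autoreg data=b;
   model y = time / nlag=2;
   output out=pred p=yhat;
run;
```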
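$$y_t = x_t'\beta + \nu_t$$
$$\nu_t = \epsilon_t - \varphi_1 \nu_{t-1} - \ldots - \varphi_m \nu_{t-m}$$
$$\epsilon_t \sim \mathrm{IN}(0, \sigma^2)$$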
In these equations, $y_t$ are the dependent values, $x_t$ is a column vector of regressor variables, $\beta$ is a column vector of structural parameters, and $\epsilon_t$ is normally and independently distributed with a mean of 0 and a variance of $\sigma^2$. Note that in this parameterization, the signs of the autoregressive parameters are reversed from the parameterization documented in most of the literature.
PROC AUTOREG offers four estimation methods for the autoregressive error model. The default method, Yule-Walker (YW) estimation, is the fastest computationally. The Yule-Walker method used by PROC AUTOREG is described in Gallant and Goebel (1976). Harvey (1981) calls this method the two-step full transform method. The other methods are iterated YW, unconditional least squares (ULS), and maximum likelihood (ML). The ULS method is also referred to as nonlinear least squares (NLS) or exact least squares (ELS).
You can use all of the methods with data containing missing values, but you should use ML estimation if the missing values are plentiful. See the section Alternative Autocorrelation Correction Methods on page 361 later in this chapter for further discussion of the advantages of different methods.
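As a sketch of how the four estimation methods might be requested, the following statements fit the same model four times. The data set A and the variables Y and TIME are hypothetical; the METHOD= values are assumed to be those accepted by the MODEL statement in your release.

```sas
proc autoreg data=a;                     /* A, Y, TIME are hypothetical     */
   model y = time / nlag=2;              /* Yule-Walker (the default)       */
   model y = time / nlag=2 method=ityw;  /* iterated Yule-Walker            */
   model y = time / nlag=2 method=uls;   /* unconditional least squares     */
   model y = time / nlag=2 method=ml;    /* maximum likelihood              */
run;
```

Comparing the four sets of estimates on the same data is a simple way to see how sensitive the results are to the choice of method.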
Let the variance matrix of the error vector $\nu = (\nu_1, \ldots, \nu_N)'$ be $\Sigma$,
$$E(\nu\nu') = \Sigma = \sigma^2 V$$
If the vector of autoregressive parameters $\varphi$ is known, the matrix $V$ can be computed from the autoregressive parameters. $\Sigma$ is then $\sigma^2 V$. Given $\Sigma$, the efficient estimates of the regression parameters $\beta$ can be computed using generalized least squares (GLS). The GLS estimates then yield the unbiased estimate of the variance $\sigma^2$.
The Yule-Walker method alternates estimation of $\beta$ using generalized least squares with estimation of $\varphi$ using the Yule-Walker equations applied to the sample autocorrelation function. The YW method starts by forming the OLS estimate of $\beta$. Next, $\varphi$ is estimated from the sample autocorrelation function of the OLS residuals by using the Yule-Walker equations. Then $V$ is estimated from the estimate of $\varphi$, and $\Sigma$ is estimated from $V$ and the OLS estimate of $\sigma^2$. The autocorrelation corrected estimates of the regression parameters $\beta$ are then computed by GLS, using the estimated $\Sigma$ matrix. These are the Yule-Walker estimates.
If the ITER option is specified, the Yule-Walker residuals are used to form a new sample autocorrelation function, the new autocorrelation function is used to form a new estimate of $\varphi$ and $V$, and the GLS estimates are recomputed using the new variance matrix. This alternation of estimates continues until either the maximum change in the $\hat{\varphi}$ estimate between iterations is less than the value specified by the CONVERGE= option or the maximum number of allowed iterations is reached. This produces the iterated Yule-Walker estimates. Iteration of the estimates may not yield much improvement.
The Yule-Walker equations, solved to obtain $\hat{\varphi}$ and a preliminary estimate of $\sigma^2$, are
$$R\hat{\varphi} = -r$$
Here $r = (r_1, \ldots, r_m)'$, where $r_i$ is the lag $i$ sample autocorrelation. The matrix $R$ is the Toeplitz matrix whose $i,j$th element is $r_{|i-j|}$. If you specify a subset model, then only the rows and columns of $R$ and $r$ corresponding to the subset of lags specified are used.
If the BACKSTEP option is specified, for purposes of significance testing, the matrix $[R \;\; r]$ is treated as a sum-of-squares-and-crossproducts matrix arising from a simple regression with $N - k$ observations, where $k$ is the number of estimated parameters.
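As a worked illustration of the Yule-Walker equations for $m = 2$, assume the reversed-sign convention $R\hat{\varphi} = -r$ used in this chapter and two hypothetical sample autocorrelations, $r_1 = 0.5$ and $r_2 = 0.4$ (the numbers are chosen only for the example).

```latex
\begin{bmatrix} 1 & r_1 \\ r_1 & 1 \end{bmatrix}
\begin{bmatrix} \hat{\varphi}_1 \\ \hat{\varphi}_2 \end{bmatrix}
= - \begin{bmatrix} r_1 \\ r_2 \end{bmatrix}
\quad\Longrightarrow\quad
\hat{\varphi}_1 = -\frac{r_1 (1 - r_2)}{1 - r_1^2}, \qquad
\hat{\varphi}_2 = -\frac{r_2 - r_1^2}{1 - r_1^2}

% With r_1 = 0.5 and r_2 = 0.4:
\hat{\varphi}_1 = -\frac{0.5 \times 0.6}{0.75} = -0.4, \qquad
\hat{\varphi}_2 = -\frac{0.4 - 0.25}{0.75} = -0.2

% Check: 1(-0.4) + 0.5(-0.2) = -0.5 = -r_1, \quad 0.5(-0.4) + 1(-0.2) = -0.4 = -r_2
```

Positive sample autocorrelations therefore produce negative estimates of $\varphi$ under this parameterization, which is the sign reversal relative to most of the literature.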
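Define the transformed error, $e$, as
$$e = L^{-1} n$$
where $n = y - X\beta$. The unconditional sum of squares for the model, $S$, is
$$S = n' V^{-1} n = e'e$$
The ULS estimates are computed by minimizing $S$ with respect to the parameters $\beta$ and $\varphi_i$.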
The full log likelihood function for the autoregressive error model is
$$l = -\frac{N}{2}\ln(2\pi) - \frac{N}{2}\ln(\sigma^2) - \frac{1}{2}\ln(|V|) - \frac{S}{2\sigma^2}$$
where $|V|$ denotes the determinant of $V$. For the ML method, the likelihood function is maximized by minimizing an equivalent sum-of-squares function. Maximizing $l$ with respect to $\sigma^2$ (and concentrating $\sigma^2$ out of the likelihood) and dropping the constant term $-\frac{N}{2}\left[\ln(2\pi) + 1 - \ln(N)\right]$ produces the concentrated log likelihood function
$$l_c = -\frac{N}{2}\ln\left(S|V|^{1/N}\right)$$
Rewriting the variable term within the logarithm gives
$$S_{ml} = |L|^{1/N} e'e |L|^{1/N}$$
PROC AUTOREG computes the ML estimates by minimizing the objective function $S_{ml} = |L|^{1/N} e'e |L|^{1/N}$.
The maximum likelihood estimates may not exist for some data sets (Anderson and Mentz 1980). This is the case for very regular data sets, such as an exact linear trend.
Computational Methods
Sample Autocorrelation Function
The sample autocorrelation function is computed from the structural residuals or noise $n_t = y_t - x_t'b$, where $b$ is the current estimate of $\beta$. The sample autocorrelation function is the sum of all available lagged products of $n_t$ of order $j$ divided by $\ell + j$, where $\ell$ is the number of such products. If there are no missing values, then $\ell + j = N$, the number of observations. In this case, the Toeplitz matrix of autocorrelations, $R$, is at least positive semidefinite. If there are missing values, these autocorrelation estimates of $r$ can yield an $R$ matrix that is not positive semidefinite. If such estimates occur, a warning message is printed, and the estimates are tapered by exponentially declining weights until $R$ is positive definite.
The calculation of $V$ from $\varphi$ for the general AR($m$) model is complicated, and the size of $V$ depends on the number of observations. Instead of actually calculating $V$ and performing GLS in the usual way, in practice a Kalman filter algorithm is used to transform the data and compute the GLS results through a recursive process.
In all of the estimation methods, the original data are transformed by the inverse of the Cholesky root of $V$. Let $L$ denote the Cholesky root of $V$; that is, $V = LL'$ with $L$ lower triangular. For an AR($m$) model, $L^{-1}$ is a band diagonal matrix with $m$ anomalous rows at the beginning and the autoregressive parameters along the remaining rows. Thus, if there are no missing values, after the first $m - 1$ observations the data are transformed as
$$z_t = x_t + \hat{\varphi}_1 x_{t-1} + \ldots + \hat{\varphi}_m x_{t-m}$$
The transformation is carried out using a Kalman filter, and the lower triangular matrix $L$ is never directly computed. The Kalman filter algorithm, as it applies here, is described in Harvey and Phillips (1979) and Jones (1980). Although $L$ is not computed explicitly, for ease of presentation the remaining discussion is in terms of $L$. If there are missing values, then the submatrix of $L$ consisting of the rows and columns with nonmissing values is used to generate the transformations.
Gauss-Newton Algorithms
The ULS and ML estimates employ a Gauss-Newton algorithm to minimize the sum of squares and maximize the log likelihood, respectively. The relevant optimization is performed simultaneously for both the regression and AR parameters. The OLS estimates of $\beta$ and the Yule-Walker estimates of $\varphi$ are used as starting values for these methods.
The Gauss-Newton algorithm requires the derivatives of $e$ or $|L|^{1/N} e$ with respect to the parameters. The derivatives with respect to the parameter vector $\beta$ are
$$\frac{\partial e}{\partial \beta'} = -L^{-1} X$$
$$\frac{\partial |L|^{1/N} e}{\partial \beta'} = -|L|^{1/N} L^{-1} X$$
These derivatives are computed by the transformation described previously. The derivatives with respect to $\varphi$ are computed by differentiating the Kalman filter recurrences and the equations for the initial conditions.
The estimates of the standard errors calculated with the ULS or ML method take into account the joint estimation of the AR and the regression parameters and may give more accurate standard-error values than the YW method. At the same values of the autoregressive parameters, the ULS and ML standard errors will always be larger than those computed from Yule-Walker. However, simulations of the models used by Park and Mitchell (1980) suggest that the ULS and ML standard error estimates can also be underestimates. Caution is advised, especially when the estimated autocorrelation is high and the sample size is small.
High autocorrelation in the residuals is a symptom of lack of fit. An autoregressive error model should not be used as a nostrum for models that simply do not fit. It is often the case that time series variables tend to move as a random walk. This means that an AR(1) process with a parameter near one absorbs a great deal of the variation. See Example 8.3 later in this chapter, which fits a linear trend to a sine wave.
For ULS or ML estimation, the joint variance-covariance matrix of all the regression and autoregression parameters is computed. For the Yule-Walker method, the variance-covariance matrix is computed only for the regression parameters.
The unconditional least squares (ULS) method, which minimizes the error sum of squares for all observations, is referred to as the nonlinear least squares (NLS) method by Spitzer (1979).
The Hildreth-Lu method (Hildreth and Lu 1960) uses nonlinear least squares to jointly estimate the parameters with an AR(1) model, but it omits the first transformed residual from the sum of squares. Thus, the Hildreth-Lu method is a more primitive version of the ULS method supported by PROC AUTOREG in the same way Cochrane-Orcutt is a more primitive version of Yule-Walker.
The maximum likelihood method is also widely cited in the literature. Although the maximum likelihood method is well defined, some early literature refers to estimators that are called maximum likelihood but are not full unconditional maximum likelihood estimates. The AUTOREG procedure produces full unconditional maximum likelihood estimates.
Harvey (1981) and Judge et al. (1985) summarize the literature on various estimators for the autoregressive error model. Although asymptotically efficient, the various methods have different small sample properties. Several Monte Carlo experiments have been conducted, although usually for the AR(1) model.
Harvey and McAvinchey (1978) found that for a one-variable model, when the independent variable is trending, methods similar to Cochrane-Orcutt are inefficient in estimating the structural parameter. This is not surprising since a pure trend model is well modeled by an autoregressive process with a parameter close to 1.
Harvey and McAvinchey (1978) also made the following conclusions:
• The Yule-Walker method appears to be about as efficient as the maximum likelihood method. Although Spitzer (1979) recommended ML and NLS, the Yule-Walker method (labeled Prais-Winsten) did as well or better in estimating the structural parameter in Spitzer's Monte Carlo study (table A2 in their article) when the autoregressive parameter was not too large. Maximum likelihood tends to do better when the autoregressive parameter is large.
• For small samples, it is important to use a full transformation (Yule-Walker) rather than the Cochrane-Orcutt method, which loses the first observation. This was also demonstrated by Maeshiro (1976), Chipman (1979), and Park and Mitchell (1980).
• For large samples (Harvey and McAvinchey used 100), losing the first few observations does not make much difference.
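$$h_t = \omega + \sum_{i=1}^{q} \alpha_i y_{t-i}^2 + \sum_{j=1}^{p} \gamma_j h_{t-j}$$
where
$$p \ge 0, \quad q > 0$$
$$\omega > 0, \quad \alpha_i \ge 0, \quad \gamma_j \ge 0$$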
The GARCH($p,q$) model reduces to the ARCH($q$) process when $p = 0$. At least one of the ARCH parameters must be nonzero ($q > 0$). The GARCH regression model can be written
$$y_t = x_t'\beta + \epsilon_t$$
$$\epsilon_t = \sqrt{h_t}\, e_t$$
$$h_t = \omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{j=1}^{p} \gamma_j h_{t-j}$$
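where $e_t \sim \mathrm{IN}(0, 1)$.
In addition, you can consider the model with disturbances following an autoregressive process and with the GARCH errors. The AR($m$)-GARCH($p,q$) regression model is denoted
$$y_t = x_t'\beta + \nu_t$$
$$\nu_t = \epsilon_t - \varphi_1 \nu_{t-1} - \ldots - \varphi_m \nu_{t-m}$$
$$\epsilon_t = \sqrt{h_t}\, e_t$$
$$h_t = \omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{j=1}^{p} \gamma_j h_{t-j}$$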
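$$h_t = \left(1 - \sum_{j=1}^{p} \gamma_j B^j\right)^{-1} \left[\omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2\right] = \omega^* + \sum_{i=1}^{\infty} \phi_i \epsilon_{t-i}^2$$
where $B$ is a backshift operator. Therefore, $h_t \ge 0$ if $\omega^* \ge 0$ and $\phi_i \ge 0$ for all $i$. Assume that the roots of the following polynomial equation are inside the unit circle:
$$-\sum_{j=0}^{p} \gamma_j Z^{p-j} = 0$$
where $\gamma_0 = -1$ and $Z$ is a complex scalar. $-\sum_{j=0}^{p} \gamma_j Z^{p-j}$ and $\sum_{i=1}^{q} \alpha_i Z^{q-i}$ do not share common factors. Under these conditions, $|\omega^*| < \infty$, $|\phi_i| < \infty$, and these coefficients of the ARCH($\infty$) process are well defined.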
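Define $n = \max(p, q)$. The coefficient $\phi_i$ is written
$$\phi_0 = \alpha_1$$
$$\phi_1 = \gamma_1 \phi_0 + \alpha_2$$
$$\cdots$$
$$\phi_{n-1} = \gamma_1 \phi_{n-2} + \gamma_2 \phi_{n-3} + \cdots + \gamma_{n-1} \phi_0 + \alpha_n$$
$$\phi_k = \gamma_1 \phi_{k-1} + \gamma_2 \phi_{k-2} + \cdots + \gamma_n \phi_{k-n} \quad \text{for } k \ge n$$
where $\alpha_i = 0$ for $i > q$ and $\gamma_j = 0$ for $j > p$.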
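Nelson and Cao (1992) proposed the finite inequality constraints for GARCH($1,q$) and GARCH($2,q$) cases. However, it is not straightforward to derive the finite inequality constraints for the general GARCH($p,q$) model.
For the GARCH($1,q$) model, the nonlinear inequality constraints are
$$\omega \ge 0$$
$$\gamma_1 \ge 0$$
$$\phi_k \ge 0 \quad \text{for } k = 0, 1, \ldots, q-1$$
For the GARCH($2,q$) model, the nonlinear inequality constraints are
$$\Delta_i \in R \quad \text{for } i = 1, 2$$
$$\omega^* \ge 0$$
$$\Delta_1 > 0$$
$$\sum_{j=0}^{q-1} \Delta_1^{-j} \alpha_{j+1} > 0$$
$$\phi_k \ge 0 \quad \text{for } k = 0, 1, \ldots, q$$
where $\Delta_1$ and $\Delta_2$ are the roots of $(Z^2 - \gamma_1 Z - \gamma_2)$.
For the GARCH($p,q$) model with $p > 2$, only $\max(q-1, p) + 1$ nonlinear inequality constraints ($\phi_k \ge 0$ for $k = 0$ to $\max(q-1, p)$) are imposed, together with the in-sample positivity constraints of the conditional variance $h_t$.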
$$\epsilon_t = \sqrt{h_t}\, e_t$$
$$h_t = \omega + \alpha_1 \epsilon_{t-1}^2 + \gamma_1 h_{t-1} + \eta_1 D1_t + \eta_2 D2_t$$
The parameters for the variables D1 and D2 can be constrained using the COEF= option. For example, the constraints $\eta_1 = \eta_2 = 1$ are imposed by the following statements:
   proc autoreg data=one;
      model y = x z / garch=(p=1,q=1);
      hetero d1 d2 / coef=unit;
   run;
When
$$\sum_{i=1}^{q} \alpha_i + \sum_{j=1}^{p} \gamma_j < 1$$
the unconditional variance is computed as
$$\frac{\omega}{1 - \sum_{i=1}^{q} \alpha_i - \sum_{j=1}^{p} \gamma_j}$$
Sometimes the multistep forecasts of the variance do not approach the unconditional variance when the model is integrated in variance; that is, $\sum_{i=1}^{q} \alpha_i + \sum_{j=1}^{p} \gamma_j = 1$.
The unconditional variance for the IGARCH model does not exist. However, it is interesting that the IGARCH model can be strongly stationary even though it is not weakly stationary. Refer to Nelson (1990) for details.
EGARCH Model
The EGARCH model was proposed by Nelson (1991). Nelson and Cao (1992) argue that the nonnegativity constraints in the linear GARCH model are too restrictive. The GARCH model imposes the nonnegative constraints on the parameters, $\alpha_i$ and $\gamma_j$, while there are no restrictions on these parameters in the EGARCH model. In the EGARCH model, the conditional variance, $h_t$, is an asymmetric function of lagged disturbances $\epsilon_{t-i}$:
$$\ln(h_t) = \omega + \sum_{i=1}^{q} \alpha_i g(z_{t-i}) + \sum_{j=1}^{p} \gamma_j \ln(h_{t-j})$$
where
$$g(z_t) = \theta z_t + \gamma\left[|z_t| - E|z_t|\right]$$
$$z_t = \epsilon_t / \sqrt{h_t}$$
The coefficient of the second term in $g(z_t)$ is set to be 1 ($\gamma = 1$) in our formulation. Note that $E|z_t| = (2/\pi)^{1/2}$ if $z_t \sim N(0, 1)$. The properties of the EGARCH model are summarized as follows:
• The function $g(z_t)$ is linear in $z_t$ with slope coefficient $\theta + 1$ if $z_t$ is positive while $g(z_t)$ is linear in $z_t$ with slope coefficient $\theta - 1$ if $z_t$ is negative.
• Suppose that $\theta = 0$. Large innovations increase the conditional variance if $|z_t| - E|z_t| > 0$ and decrease the conditional variance if $|z_t| - E|z_t| < 0$.
• Suppose that $\theta < 1$. The innovation in variance, $g(z_t)$, is positive if the innovations $z_t$ are less than $(2/\pi)^{1/2}/(\theta - 1)$. Therefore, the negative innovations in returns, $\epsilon_t$, cause the innovation to the conditional variance to be positive if $\theta$ is much less than 1.
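A minimal sketch of requesting an EGARCH(1,1) model follows. The data set A and the variables Y and X are hypothetical, and TYPE=EXP is assumed to be the GARCH= suboption that selects the exponential form in your release.

```sas
proc autoreg data=a;                          /* A, Y, X are hypothetical  */
   /* EGARCH(1,1): ln(h_t) depends on g(z_{t-1}) and ln(h_{t-1})           */
   model y = x / garch=(p=1, q=1, type=exp);
run;
```

Because the model is written in terms of $\ln(h_t)$, no nonnegativity constraints on the ARCH and GARCH parameters are needed to keep the conditional variance positive.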
GARCH-in-Mean
The GARCH-M model has the added regressor that is the conditional standard deviation:
$$y_t = x_t'\beta + \delta\sqrt{h_t} + \epsilon_t$$
$$\epsilon_t = \sqrt{h_t}\, e_t$$
where $h_t$ follows the ARCH or GARCH process.
When $e_t$ is assumed to have a standard normal distribution, the log-likelihood function is
$$l = \sum_{t=1}^{N} \frac{1}{2}\left[-\ln(2\pi) - \ln(h_t) - \frac{\epsilon_t^2}{h_t}\right]$$
where $\epsilon_t = y_t - x_t'\beta$ and $h_t$ is the conditional variance. When the GARCH($p,q$)-M model is estimated, $\epsilon_t = y_t - x_t'\beta - \delta\sqrt{h_t}$. When there are no regressors, the residuals $\epsilon_t$ are denoted as $y_t$ or $y_t - \delta\sqrt{h_t}$.
If $e_t$ has the standardized Student's $t$ distribution, the log-likelihood function for the conditional $t$ distribution is
$$\ell = \sum_{t=1}^{N}\left[\ln\Gamma\left(\frac{\nu+1}{2}\right) - \ln\Gamma\left(\frac{\nu}{2}\right) - \frac{1}{2}\ln\left((\nu-2)\pi h_t\right) - \frac{1}{2}(\nu+1)\ln\left(1 + \frac{\epsilon_t^2}{h_t(\nu-2)}\right)\right]$$
where $\Gamma(\cdot)$ is the gamma function and $\nu$ is the degree of freedom ($\nu > 2$). Under the conditional $t$ distribution, the additional parameter $1/\nu$ is estimated. The log-likelihood function for the conditional $t$ distribution converges to the log-likelihood function of the conditional normal GARCH model as $1/\nu \to 0$.
The likelihood function is maximized via either the dual quasi-Newton or trust region algorithm. The default is the dual quasi-Newton algorithm. The starting values for the regression parameters are obtained from the OLS estimates. When there are autoregressive parameters in the model, the initial values are obtained from the Yule-Walker estimates. The starting value 1.0E-6 is used for the GARCH process parameters.
The variance-covariance matrix is computed using the Hessian matrix. The dual quasi-Newton method approximates the Hessian matrix while the quasi-Newton method gets an approximation of the inverse of Hessian. The trust region method uses the Hessian matrix obtained using numerical differentiation. When there are active constraints, that is, $q(\theta) = 0$, the variance-covariance matrix is given by
$$V(\hat{\theta}) = H^{-1}\left[I - Q'(QH^{-1}Q')^{-1}QH^{-1}\right]$$
where $H = -\partial^2 l / \partial\theta\,\partial\theta'$ and $Q = \partial q(\theta) / \partial\theta'$. Therefore, the variance-covariance matrix without active constraints reduces to $V(\hat{\theta}) = H^{-1}$.
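The following sketch requests a GARCH(1,1)-M model with conditionally $t$-distributed errors and the trust region optimizer. The data set A and the variables Y and X are hypothetical. The MEAN=SQRT suboption and the DIST= and OPTMETHOD= options are assumed to be available in your release; check the MODEL statement syntax before relying on them.

```sas
proc autoreg data=a;                       /* A, Y, X are hypothetical     */
   model y = x / garch=(p=1, q=1,
                        mean=sqrt)         /* adds delta*sqrt(h_t) term    */
                 dist=t                    /* standardized Student's t     */
                 optmethod=tr;             /* trust region, not dual QN    */
run;
```

With DIST=T, one additional parameter (the inverse of the degrees of freedom) is estimated along with the regression and GARCH parameters.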
Total R-Square
The total $R^2$ statistic (Total Rsq) is computed as
$$R^2_{tot} = 1 - \frac{SSE}{SST}$$
where SST is the sum of squares for the original response variable corrected for the mean and SSE is the final error sum of squares. The Total Rsq is a measure of how well the next value can be predicted using the structural part of the model and the past values of the residuals. If the NOINT option is specified, SST is the uncorrected sum of squares.
Regression R-Square
The regression $R^2$ (Reg RSQ) is computed as
$$R^2_{reg} = 1 - \frac{TSSE}{TSST}$$
where TSST is the total sum of squares of the transformed response variable corrected for the transformed intercept, and TSSE is the error sum of squares for this transformed regression problem. If the NOINT option is requested, no correction for the transformed intercept is made. The Reg RSQ is a measure of the fit of the structural part of the model after transforming for the autocorrelation and is the $R^2$ for the transformed regression.
The regression $R^2$ and the total $R^2$ should be the same when there is no autocorrelation correction (OLS regression).
Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE)
The mean absolute error (MAE) is computed as
$$\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T} |e_t|$$
where $e_t$ are the estimated model residuals and $T$ is the number of observations.
The mean absolute percentage error (MAPE) is computed as
$$\mathrm{MAPE} = \frac{1}{T'}\sum_{t=1}^{T} \delta_{y_t \ne 0} \frac{|e_t|}{|y_t|}$$
where $e_t$ are the estimated model residuals, $y_t$ are the original response variable observations, $\delta_{y_t \ne 0} = 1$ if $y_t \ne 0$, $\delta_{y_t \ne 0}|e_t / y_t| = 0$ if $y_t = 0$, and $T'$ is the number of nonzero original response variable observations.
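As a small worked illustration of the two error measures, take three hypothetical residuals $e = (1, -2, 0.5)$ with responses $y = (10, 0, 5)$. The numbers are invented for the example. The observation with $y_t = 0$ contributes to MAE but is excluded from MAPE, so $T = 3$ and $T' = 2$.

```latex
\mathrm{MAE} = \frac{1}{3}\left(|1| + |-2| + |0.5|\right) = \frac{3.5}{3} \approx 1.167

\mathrm{MAPE} = \frac{1}{2}\left(\frac{|1|}{|10|} + \frac{|0.5|}{|5|}\right)
             = \frac{1}{2}\,(0.1 + 0.1) = 0.1
```

Note that MAPE as defined here is a proportion, so the value 0.1 corresponds to an average absolute error of 10% of the response.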
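The recursive residuals $w_t$ are computed as
$$w_t = \frac{e_t}{\sqrt{v_t}}$$
$$e_t = y_t - x_t'\beta^{(t)}$$
$$\beta^{(t)} = \left[\sum_{i=1}^{t-1} x_i x_i'\right]^{-1}\left(\sum_{i=1}^{t-1} x_i y_i\right)$$
$$v_t = 1 + x_t'\left[\sum_{i=1}^{t-1} x_i x_i'\right]^{-1} x_t$$
Note that the first $\beta^{(t)}$ can be computed for $t = p + 1$, where $p$ is the number of regression coefficients. As a result, the first $p$ recursive residuals are not defined. Note also that the forecast error variance of $e_t$ is the scalar multiple of $v_t$ such that $V(e_t) = \sigma^2 v_t$.
The CUSUM and CUSUMSQ statistics are computed using the preceding recursive residuals.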
$$\mathrm{CUSUM}_t = \sum_{i=k+1}^{t} \frac{w_i}{\sigma_w}$$
$$\mathrm{CUSUMSQ}_t = \frac{\sum_{i=k+1}^{t} w_i^2}{\sum_{i=k+1}^{T} w_i^2}$$
where $w_i$ are the recursive residuals,
$$\sigma_w = \sqrt{\frac{\sum_{i=k+1}^{T} (w_i - \hat{w})^2}{T - k - 1}}$$
$$\hat{w} = \frac{1}{T - k}\sum_{i=k+1}^{T} w_i$$
and $k$ is the number of regressors.
The CUSUM statistics can be used to test for misspecification of the model. The upper and lower critical values for $\mathrm{CUSUM}_t$ are
$$\pm a\left[\sqrt{T - k} + 2\frac{(t - k)}{(T - k)^{1/2}}\right]$$
where $a$ = 1.143 for a significance level 0.01, 0.948 for 0.05, and 0.850 for 0.10. These critical values are output by the CUSUMLB= and CUSUMUB= options for the significance level specified by the ALPHACSM= option.
The upper and lower critical values of $\mathrm{CUSUMSQ}_t$ are given by
$$\pm a + \frac{(t - k)}{T - k}$$
where the value of $a$ is obtained from the table by Durbin (1969) if $\frac{1}{2}(T - k) - 1 \le 60$. Edgerton and Wells (1994) provided the method of obtaining the value of $a$ for large samples. These critical values are output by the CUSUMSQLB= and CUSUMSQUB= options for the significance level specified by the ALPHACSM= option.
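The statistics and bounds described above can be written to a data set for plotting. The sketch below assumes a hypothetical data set A with variables Y and X; all output variable names are illustrative.

```sas
proc autoreg data=a;                      /* A, Y, X are hypothetical      */
   model y = x;
   output out=stab
          recres=w                        /* recursive residuals           */
          cusum=cs     cusumub=cs_ub     cusumlb=cs_lb
          cusumsq=csq  cusumsqub=csq_ub  cusumsqlb=csq_lb
          alphacsm=.05;                   /* 5% bounds (.01, .05, or .10)  */
run;
```

A path of CS or CSQ that crosses its upper or lower bound is evidence against a stable specification at the chosen significance level.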
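where $\hat{\nu}$ are OLS residuals. Using the matrix notation,
$$d_j = \frac{Y'MA_j'A_jMY}{Y'MY}$$
where $M = I_N - X(X'X)^{-1}X'$ and $A_j$ is a $(N - j) \times N$ matrix:
$$A_j = \begin{bmatrix} -1 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 0 & \cdots & 0 & 1 & 0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & \cdots & 0 & -1 & 0 & \cdots & 0 & 1 \end{bmatrix}$$
and there are $j - 1$ zeros between $-1$ and $1$ in each row of matrix $A_j$.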
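The QR factorization of the design matrix $X$ yields an $N \times N$ orthogonal matrix $Q$:
$$X = QR$$
where $R$ is an $N \times k$ upper triangular matrix. There exists an $N \times (N - k)$ submatrix of $Q$ such that $Q_1 Q_1' = M$ and $Q_1' Q_1 = I_{N-k}$. Consequently, the generalized Durbin-Watson statistic is stated as a ratio of two quadratic forms:
$$d_j = \frac{\sum_{l=1}^{n} \lambda_{jl}\, \xi_l^2}{\sum_{l=1}^{n} \xi_l^2}$$
where $\lambda_{j1}, \ldots, \lambda_{jn}$ are the upper $n$ eigenvalues of $MA_j'A_jM$, $\xi_l$ is a standard normal variate, and $n = \min(N - k, N - j)$. These eigenvalues are obtained by a singular value decomposition of $Q_1'A_j'$ (Golub and Van Loan 1989; Savin and White 1978). The marginal probability (or p-value) for $d_j$ given $c_0$ is
$$\mathrm{Prob}\left(\frac{\sum_{l=1}^{n} \lambda_{jl}\, \xi_l^2}{\sum_{l=1}^{n} \xi_l^2} < c_0\right) = \mathrm{Prob}(q_j < 0)$$
where
$$q_j = \sum_{l=1}^{n} (\lambda_{jl} - c_0)\, \xi_l^2$$
When the null hypothesis $H_0: \varphi_j = 0$ holds, the quadratic form $q_j$ has the characteristic function
$$\phi_j(t) = \prod_{l=1}^{n} \left(1 - 2(\lambda_{jl} - c_0)it\right)^{-1/2}$$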
The distribution function is uniquely determined by this characteristic function:
$$F(x) = \frac{1}{2} + \frac{1}{2\pi}\int_0^{\infty} \frac{e^{itx}\phi_j(-t) - e^{-itx}\phi_j(t)}{it}\, dt$$
For example, to test $H_0: \varphi_4 = 0$ given $\varphi_1 = \varphi_2 = \varphi_3 = 0$ against $H_1: -\varphi_4 > 0$, the marginal probability (p-value) can be used:
$$F(0) = \frac{1}{2} + \frac{1}{2\pi}\int_0^{\infty} \frac{\phi_4(-t) - \phi_4(t)}{it}\, dt$$
where
$$\phi_4(t) = \prod_{l=1}^{n} \left(1 - 2(\lambda_{4l} - \hat{d}_4)it\right)^{-1/2}$$
and $\hat{d}_4$ is the calculated value of the fourth-order Durbin-Watson statistic.
In the Durbin-Watson test, the marginal probability indicates positive autocorrelation ($-\varphi_j > 0$) if it is less than the level of significance ($\alpha$), while you can conclude that a negative autocorrelation ($-\varphi_j < 0$) exists if the marginal probability based on the computed Durbin-Watson statistic is greater than $1 - \alpha$. Wallis (1972) presented tables for bounds tests of fourth-order autocorrelation, and Vinod (1973) has given tables for a 5% significance level for orders two to four. Using the AUTOREG procedure, you can calculate the exact p-values for the general order of Durbin-Watson test statistics. Tests for the absence of autocorrelation of order $p$ can be performed sequentially; at the $j$th step, test $H_0: \varphi_j = 0$ given $\varphi_1 = \ldots = \varphi_{j-1} = 0$ against $\varphi_j \ne 0$. However, the size of the sequential test is not known.
The Durbin-Watson statistic is computed from the OLS residuals, while that of the autoregressive error model uses residuals that are the difference between the predicted values and the actual values. When you use the Durbin-Watson test from the residuals of the autoregressive error model, you must be aware that this test is only an approximation. See Autoregressive Error Model on page 357 earlier in this chapter. If there are missing values, the Durbin-Watson statistic is computed using all the nonmissing values and ignoring the gaps caused by missing residuals. This does not affect the significance level of the resulting test, although the power of the test against certain alternatives may be adversely affected. Savin and White (1978) have examined the use of the Durbin-Watson statistic with missing values.
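As a sketch of the fourth-order example above, the following statements request generalized Durbin-Watson statistics up to order 4 together with their marginal probabilities. The data set A and the variables Y and X are hypothetical.

```sas
proc autoreg data=a;             /* A, Y, X are hypothetical names         */
   /* DW=4 prints statistics for orders 1 through 4;                       */
   /* DWPROB adds the p-values for each order                              */
   model y = x / dw=4 dwprob;
run;
```

Orders 1 through 4 are a natural choice for quarterly data, where fourth-order autocorrelation often reflects seasonality.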
$$\cdots\; \epsilon_t, \qquad t = 1, \ldots, N$$
where $X$ is an $N \times k$ data matrix, $\beta$ is a $k \times 1$ coefficient vector, $u$ is an $N \times 1$ disturbance vector, and $\epsilon_t$ is a sequence of independent normal error terms with mean 0 and variance $\sigma^2$. The generalized Durbin-Watson statistic is written as
$$DW_j = \frac{\hat{u}'A_j'A_j\hat{u}}{\hat{u}'\hat{u}}$$
where $\hat{u}$ is a vector of OLS residuals and $A_j$ is a $(T - j) \times T$ matrix. The generalized Durbin-Watson statistic $DW_j$ can be rewritten as
$$DW_j = \frac{Y'MA_j'A_jMY}{Y'MY} = \frac{\eta'(Q_1'A_j'A_jQ_1)\eta}{\eta'\eta}$$
where $Q_1'Q_1 = I_{T-k}$, $Q_1'X = 0$, and $\eta = Q_1'u$.
The marginal probability for the Durbin-Watson statistic is
$$\Pr(DW_j < c) = \Pr(h < 0)$$
where $h = \eta'(Q_1'A_j'A_jQ_1 - cI)\eta$.
The p-value or the marginal probability for the generalized Durbin-Watson statistic is computed by numerical inversion of the characteristic function $\phi(u)$ of the quadratic form $h = \eta'(Q_1'A_j'A_jQ_1 - cI)\eta$. The trapezoidal rule approximation to the marginal probability $\Pr(h < 0)$ is
$$\Pr(h < 0) = \frac{1}{2} - \sum_{k=0}^{K} \frac{\mathrm{Im}\left[\phi\left((k + \frac{1}{2})\Delta\right)\right]}{\pi(k + \frac{1}{2})} + E_I(\Delta) + E_T(K)$$
where $\mathrm{Im}[\phi(\cdot)]$ is the imaginary part of the characteristic function, and $E_I(\Delta)$ and $E_T(K)$ are integration and truncation errors, respectively. Refer to Davies (1973) for numerical inversion of the characteristic function.
Ansley, Kohn, and Shively (1992) proposed a numerically efficient algorithm that requires O($N$) operations for evaluation of the characteristic function $\phi(u)$. The characteristic function is denoted as
$$\phi(u) = \left|I - 2iu(Q_1'A_j'A_jQ_1 - cI_{N-k})\right|^{-1/2} = |V|^{-1/2}\left|X'V^{-1}X\right|^{-1/2}\left|X'X\right|^{1/2}$$
where $V = (1 + 2iuc)I - 2iuA_j'A_j$ and $i = \sqrt{-1}$. By applying the Cholesky decomposition to the complex matrix $V$, you can obtain the lower triangular matrix $G$ that satisfies $V = GG'$. Therefore, the characteristic function can be evaluated in O($N$) operations by using the following formula:
$$\phi(u) = |G|^{-1}\left|X^{*\prime}X^{*}\right|^{-1/2}\left|X'X\right|^{1/2}$$
where $X^{*} = G^{-1}X$. Refer to Ansley, Kohn, and Shively (1992) for more information on evaluation of the characteristic function.
Testing
Heteroscedasticity and Normality Tests
Portmanteau Q Test
For nonlinear time series models, the portmanteau test statistic based on squared residuals is used to test for independence of the series (McLeod and Li 1983):
$$Q(q) = N(N + 2)\sum_{i=1}^{q} \frac{r^2(i; \hat{\nu}_t^2)}{(N - i)}$$
where
$$r(i; \hat{\nu}_t^2) = \frac{\sum_{t=i+1}^{N} (\hat{\nu}_t^2 - \hat{\sigma}^2)(\hat{\nu}_{t-i}^2 - \hat{\sigma}^2)}{\sum_{t=1}^{N} (\hat{\nu}_t^2 - \hat{\sigma}^2)^2}$$
$$\hat{\sigma}^2 = \frac{1}{N}\sum_{t=1}^{N} \hat{\nu}_t^2$$
This Q statistic is used to test the nonlinear effects (for example, GARCH effects) present in the residuals. The GARCH($p,q$) process can be considered as an ARMA($\max(p,q), p$) process. See the section Predicting the Conditional Variance on page 383 later in this chapter. Therefore, the Q statistic calculated from the squared residuals can be used to identify the order of the GARCH process.
Engle (1982) proposed a Lagrange multiplier test for ARCH disturbances. The test statistic is asymptotically equivalent to the test used by Breusch and Pagan (1979). Engle's Lagrange multiplier test for the $q$th order ARCH process is written
$$LM(q) = \frac{N\, W'Z(Z'Z)^{-1}Z'W}{W'W}$$
where
$$W = \left(\frac{\hat{\nu}_1^2}{\hat{\sigma}^2}, \ldots, \frac{\hat{\nu}_N^2}{\hat{\sigma}^2}\right)'$$
and
$$Z = \begin{bmatrix} 1 & \hat{\nu}_0^2 & \cdots & \hat{\nu}_{-q+1}^2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \hat{\nu}_{N-1}^2 & \cdots & \hat{\nu}_{N-q}^2 \end{bmatrix}$$
The presample values ($\nu_0^2, \ldots, \nu_{-q+1}^2$) have been set to 0. Note that the $LM(q)$ tests may have different finite-sample properties depending on the presample values, though they are asymptotically equivalent regardless of the presample values. The LM and Q statistics are computed from the OLS residuals assuming that disturbances are white noise. The Q and LM statistics have an approximate $\chi^2_{(q)}$ distribution under the white-noise null hypothesis.
Normality Test
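Based on skewness and kurtosis, Jarque and Bera (1980) calculated the test statistic
$$T_N = \frac{N}{6} b_1^2 + \frac{N}{24}(b_2 - 3)^2$$
where
$$b_1 = \frac{\sqrt{N}\sum_{t=1}^{N} \hat{u}_t^3}{\left(\sum_{t=1}^{N} \hat{u}_t^2\right)^{3/2}}$$
$$b_2 = \frac{N\sum_{t=1}^{N} \hat{u}_t^4}{\left(\sum_{t=1}^{N} \hat{u}_t^2\right)^{2}}$$
The $\chi^2(2)$ distribution gives an approximation to the normality test $T_N$.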
When the GARCH model is estimated, the normality test is obtained using the standardized residuals $\hat{u}_t = \hat{\epsilon}_t / \sqrt{h_t}$. The normality test can be used to detect misspecification of the family of ARCH models.
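A minimal sketch of requesting the heteroscedasticity and normality diagnostics follows. The data set A and the variables Y and X are hypothetical; ARCHTEST and NORMAL are assumed to be the MODEL statement options that print the Q and LM statistics and the normality test, respectively.

```sas
proc autoreg data=a;              /* A, Y, X are hypothetical names        */
   /* ARCHTEST: portmanteau Q and Engle LM statistics for several orders   */
   /* NORMAL:   Jarque-Bera normality test of the residuals                */
   model y = x / archtest normal;
run;
```

Significant Q and LM statistics at low orders suggest ARCH effects, in which case a GARCH= specification is a natural next step.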
$$y_2 = X_2\beta_2 + u_2$$
where the number of observations from the first set is $n_1$ and the number of observations from the second set is $n_2$.
The Chow test statistic is used to test the null hypothesis $H_0: \beta_1 = \beta_2$ conditional on the same error variance $V(u_1) = V(u_2)$. The Chow test is computed using three sums of square errors:
$$F_{chow} = \frac{(\hat{u}'\hat{u} - \hat{u}_1'\hat{u}_1 - \hat{u}_2'\hat{u}_2)/k}{(\hat{u}_1'\hat{u}_1 + \hat{u}_2'\hat{u}_2)/(n_1 + n_2 - 2k)}$$
where $\hat{u}$ is the regression residual vector from the full set model, $\hat{u}_1$ is the regression residual vector from the first set model, and $\hat{u}_2$ is the regression residual vector from the second set model. Under the null hypothesis, the Chow test statistic has an F distribution with $k$ and $(n_1 + n_2 - 2k)$ degrees of freedom, where $k$ is the number of elements in $\beta$.
Chow (1960) suggested another test statistic that tests the hypothesis that the mean of prediction errors is 0. The predictive Chow test can also be used when $n_2 < k$.
The PCHOW= option computes the predictive Chow test statistic
$$F_{pchow} = \frac{(\hat{u}'\hat{u} - \hat{u}_1'\hat{u}_1)/n_2}{\hat{u}_1'\hat{u}_1/(n_1 - k)}$$
The predictive Chow test has an F distribution with $n_2$ and $(n_1 - k)$ degrees of freedom.
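The following sketch requests both forms of the Chow test at a single candidate break point. The data set A, the variables Y and X, and the break at observation 20 are hypothetical choices for illustration.

```sas
proc autoreg data=a;                  /* A, Y, X are hypothetical names    */
   /* CHOW=(20):  observation 20 starts the second subsample               */
   /* PCHOW=(20): predictive Chow test at the same break point             */
   model y = x / chow=(20) pchow=(20);
run;
```

The predictive form remains usable when the second subsample has fewer observations than regressors, which is the case where the ordinary Chow test cannot be computed.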
C ut
where the disturbances might be serially correlated with possible heteroscedasticity. Phillips and Perron (1988) proposed the unit root test of the OLS regression model,

$$y_t = \rho y_{t-1} + u_t$$
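Let $s^2 = \frac{1}{T-k} \sum_{t=1}^{T} \hat{u}_t^2$ and let $\hat{\sigma}^2$ be the variance estimate of the OLS estimator $\hat{\rho}$, where $\hat{u}_t$ is the OLS residual. You can estimate the asymptotic variance of $\frac{1}{T} \sum_{t=1}^{T} \hat{u}_t^2$ by using the truncation lag $l$:

$$\hat{\lambda} = \sum_{j=0}^{l} \kappa_j \left[ 1 - j/(l+1) \right] \hat{\gamma}_j$$

where $\kappa_0 = 1$, $\kappa_j = 2$ for $j > 0$, and $\hat{\gamma}_j = \frac{1}{T} \sum_{t=j+1}^{T} \hat{u}_t \hat{u}_{t-j}$.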
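Then the Phillips-Perron $Z(\hat{\rho})$ (defined here as $\hat{Z}_\rho$) test (zero mean case) is written

$$\hat{Z}_\rho = T(\hat{\rho} - 1) - \frac{1}{2}\, T^2 \hat{\sigma}^2 (\hat{\lambda} - \hat{\gamma}_0)/s^2$$

and has the following limiting distribution:

$$\frac{\frac{1}{2}\{B(1)^2 - 1\}}{\int_0^1 B(x)^2\,dx}$$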
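where $B(\cdot)$ is a standard Brownian motion. Note that the realization $z$ from the stochastic process $B(x)$ is distributed as $N(0, x)$ and thus $B(1)^2 \sim \chi^2_1$. Note that $P(\hat{\rho} < 1) \approx 0.68$ as $T \to \infty$, which shows that the limiting distribution is skewed to the left.

Let $t_{\hat{\rho}}$ be the $t$ statistic for $\hat{\rho}$. The Phillips-Perron $Z(t_{\hat{\rho}})$ (defined here as $\hat{Z}_\tau$) test is written

$$\hat{Z}_\tau = (\hat{\gamma}_0/\hat{\lambda})^{1/2}\, t_{\hat{\rho}} - \frac{1}{2}\, T \hat{\sigma} (\hat{\lambda} - \hat{\gamma}_0)/(s \hat{\lambda}^{1/2})$$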
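The zero-mean Phillips-Perron statistics can be sketched as follows. This is a rough Python illustration under stated assumptions, not the procedure's code: the function name is hypothetical, the regression has no intercept (so $k = 1$), the long-run variance uses Bartlett-type weights (1 for lag 0 and $2[1 - j/(l+1)]$ for lag $j > 0$), and the $t$ statistic is taken as $(\hat{\rho} - 1)/\hat{\sigma}$.

```python
import math


def phillips_perron_zero_mean(y, lag):
    """Z_rho and Z_tau for the regression y_t = rho * y_{t-1} + u_t."""
    y_prev, y_curr = y[:-1], y[1:]
    T = len(y_curr)
    k = 1  # one estimated regression parameter (rho)
    sxx = sum(v * v for v in y_prev)
    rho = sum(a * b for a, b in zip(y_prev, y_curr)) / sxx
    u = [b - rho * a for a, b in zip(y_prev, y_curr)]  # OLS residuals
    s2 = sum(e * e for e in u) / (T - k)
    sigma2 = s2 / sxx  # variance estimate of rho-hat

    def gamma(j):
        # sample autocovariance of the residuals at lag j
        return sum(u[t] * u[t - j] for t in range(j, T)) / T

    gamma0 = gamma(0)
    lam = gamma0 + sum(
        2.0 * (1.0 - j / (lag + 1.0)) * gamma(j) for j in range(1, lag + 1)
    )
    t_rho = (rho - 1.0) / math.sqrt(sigma2)
    z_rho = T * (rho - 1.0) - 0.5 * T * T * sigma2 * (lam - gamma0) / s2
    z_tau = (math.sqrt(gamma0 / lam) * t_rho
             - 0.5 * T * math.sqrt(sigma2) * (lam - gamma0) / math.sqrt(s2 * lam))
    return {"T": T, "rho": rho, "t_rho": t_rho, "lambda": lam,
            "z_rho": z_rho, "z_tau": z_tau}
```

With a truncation lag of 0 the correction terms vanish, so the statistics reduce to $T(\hat{\rho} - 1)$ and the ordinary $t$ ratio.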
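When you test the regression model $y_t = \mu + \rho y_{t-1} + u_t$ for the true random walk process (single mean case), the limiting distribution of the statistic $\hat{Z}_\rho$ is written

$$\frac{\frac{1}{2}\{B(1)^2 - 1\} - B(1)\int_0^1 B(x)\,dx}{\int_0^1 B(x)^2\,dx - \left[\int_0^1 B(x)\,dx\right]^2}$$

while the limiting distribution of the statistic $\hat{Z}_\tau$ is given by

$$\frac{\frac{1}{2}\{B(1)^2 - 1\} - B(1)\int_0^1 B(x)\,dx}{\left\{\int_0^1 B(x)^2\,dx - \left[\int_0^1 B(x)\,dx\right]^2\right\}^{1/2}}$$

Finally, the limiting distribution of the Phillips-Perron test for the random walk with drift process $y_t = \mu + y_{t-1} + u_t$ (trend case) can be derived as

$$\begin{bmatrix} 0 & c & 0 \end{bmatrix} V^{-1}
\begin{bmatrix} B(1) \\ \frac{B(1)^2 - 1}{2} \\ B(1) - \int_0^1 B(x)\,dx \end{bmatrix}$$

where $c = 1$ for $\hat{Z}_\rho$ and $c = \frac{1}{\sqrt{Q}}$ for $\hat{Z}_\tau$,

$$V = \begin{bmatrix}
1 & \int_0^1 B(x)\,dx & 1/2 \\
\int_0^1 B(x)\,dx & \int_0^1 B(x)^2\,dx & \int_0^1 x B(x)\,dx \\
1/2 & \int_0^1 x B(x)\,dx & 1/3
\end{bmatrix}$$

$$Q = \begin{bmatrix} 0 & c & 0 \end{bmatrix} V^{-1} \begin{bmatrix} 0 \\ c \\ 0 \end{bmatrix}$$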
When several variables $z_t = (z_{1t}, \ldots, z_{kt})'$ are cointegrated, there exists a $(k \times 1)$ cointegrating vector $c$ such that $c'z_t$ is stationary and $c$ is a nonzero vector. The residual based cointegration test assumes the following regression model:

$$y_t = \beta_1 + x_t'\beta + u_t$$

where $y_t = z_{1t}$, $x_t = (z_{2t}, \ldots, z_{kt})'$, and $\beta = (\beta_2, \ldots, \beta_k)'$. You can estimate the consistent cointegrating vector by using OLS if all variables are difference stationary, that is, I(1). The Phillips-Ouliaris test is computed using the OLS residuals from the preceding regression model, and it performs the test for the null hypothesis of no cointegration. The estimated cointegrating vector is $\hat{c} = (1, -\hat{\beta}_2, \ldots, -\hat{\beta}_k)'$.

You need to refer to the tables by Phillips and Ouliaris (1990) to obtain the p-value of the cointegration test. Before you apply the cointegration test, you may want to perform the unit root test for each variable (see the option STATIONARITY= ( PHILLIPS )).
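with $r_t = r_{t-1} + e_t$, $e_t \sim \mathrm{iid}(0, \sigma_e^2)$, and an intercept $\mu$ (in the original paper, the authors use $r_0$ instead of $\mu$). Under stronger assumptions of normality and iid of $u_t$ and $e_t$, a one-sided LM test of the null that there is no random walk ($e_t = 0, \forall t$) can be constructed as follows:

$$\widehat{LM} = \frac{1}{T^2} \sum_{t=1}^{T} \frac{S_t^2}{s^2(l)}$$

$$s^2(l) = \frac{1}{T} \sum_{t=1}^{T} \hat{u}_t^2 + \frac{2}{T} \sum_{s=1}^{l} w(s, l) \sum_{t=s+1}^{T} \hat{u}_t \hat{u}_{t-s}$$

$$S_t = \sum_{\tau=1}^{t} \hat{u}_\tau$$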
Following the original work of Kwiatkowski, Phillips, Schmidt, and Shin, under the null ($\sigma_e^2 = 0$), the LM statistic converges asymptotically to three different distributions depending on whether the model is trend-stationary, level-stationary ($\delta = 0$), or zero-mean stationary ($\delta = 0$, $\mu = 0$). The trend-stationary model is denoted by subscript $\tau$ and the level-stationary model is denoted by subscript $\mu$. The case when there is no trend and zero intercept is denoted by subscript 0. The three models and the corresponding limiting distributions are:

$$y_t = u_t: \qquad \widehat{LM}_0 \xrightarrow{D} \int_0^1 B^2(r)\,dr$$

$$y_t = \mu + u_t: \qquad \widehat{LM}_\mu \xrightarrow{D} \int_0^1 V^2(r)\,dr$$

$$y_t = \mu + \delta t + u_t: \qquad \widehat{LM}_\tau \xrightarrow{D} \int_0^1 V_2^2(r)\,dr$$

with:

$$V(r) = B(r) - rB(1)$$

$$V_2(r) = B(r) + (2r - 3r^2)B(1) + (-6r + 6r^2) \int_0^1 B(s)\,ds$$
where $V(r)$ is a standard Brownian bridge, $V_2(r)$ is a Brownian bridge of a second-level, $B(r)$ is a Brownian motion (Wiener process), and $\xrightarrow{D}$ is convergence in distribution. Using the notation of Kwiatkowski et al. (1992), the $\widehat{LM}$ statistic is named as $\hat{\eta}$. This test depends on the computational method used to compute the long-run variance $s(l)$, that is, the window width $l$ and the kernel type $w(\cdot,\cdot)$. You can specify the kernel used in the test by using the KERNEL option:

Newey-West/Bartlett (KERNEL=NW | BART), default:

$$w(s, l) = 1 - \frac{s}{l + 1}$$

Quadratic spectral (KERNEL=QS):

$$w(s/l) = w(x) = \frac{25}{12\pi^2 x^2} \left( \frac{\sin(6\pi x/5)}{6\pi x/5} - \cos(6\pi x/5) \right)$$
You can specify the number of lags, $l$, in three different ways:

Schwert (SCHW = c) (default for NW, c=4):

$$l = \mathrm{floor}\left\{ c \left( \frac{T}{100} \right)^{1/4} \right\}$$

Manual (LAG = l)

Automatic selection (AUTO) (default for QS), Hobijn, Franses, and Ooms (2004)

The last option (AUTO) needs more explanation, summarized in the following table.
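NW Kernel:

$$l = \min(T, \mathrm{floor}(\hat{\gamma} T^{1/3})) \qquad
\hat{\gamma} = 1.1447 \left\{ \left( \frac{\hat{s}^{(1)}}{\hat{s}^{(0)}} \right)^2 \right\}^{1/3} \qquad
n = \mathrm{floor}(T^{2/9})$$

QS Kernel:

$$l = \min(T, \mathrm{floor}(\hat{\gamma} T^{1/5})) \qquad
\hat{\gamma} = 1.3221 \left\{ \left( \frac{\hat{s}^{(2)}}{\hat{s}^{(0)}} \right)^2 \right\}^{1/5} \qquad
n = \mathrm{floor}(T^{2/25})$$

For both kernels,

$$\hat{s}^{(j)} = \delta_{0,j}\,\hat{\gamma}_0 + 2 \sum_{i=1}^{n} i^j \hat{\gamma}_i$$

where $T$ is the number of observations,

$$\delta_{0,j} = \begin{cases} 1 & j = 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
\hat{\gamma}_i = \frac{1}{T} \sum_{t=1}^{T-i} u_t u_{t+i}$$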
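To make the pieces of the KPSS computation concrete, here is a minimal Python sketch that combines the Schwert lag rule, the Bartlett (Newey-West) kernel weight, and the LM statistic. The function names are hypothetical, and the residuals are assumed to come from a regression fitted elsewhere (on an intercept for the level-stationary case, or on an intercept and trend for the trend-stationary case).

```python
import math


def schwert_lag(T, c=4):
    """Schwert rule: l = floor(c * (T/100)^(1/4))."""
    return int(math.floor(c * (T / 100.0) ** 0.25))


def bartlett_weight(s, l):
    """Newey-West/Bartlett kernel weight w(s, l) = 1 - s/(l + 1)."""
    return 1.0 - s / (l + 1.0)


def kpss_lm(resid, l):
    """KPSS LM statistic from residuals, using the Bartlett kernel."""
    T = len(resid)
    # partial sums S_t of the residuals
    partial_sums = []
    running = 0.0
    for u in resid:
        running += u
        partial_sums.append(running)
    # long-run variance estimate s^2(l)
    s2 = sum(u * u for u in resid) / T
    for s in range(1, l + 1):
        cross = sum(resid[t] * resid[t - s] for t in range(s, T))
        s2 += 2.0 / T * bartlett_weight(s, l) * cross
    return sum(v * v for v in partial_sums) / (T * T * s2)
```

For a series of 79 observations, the Schwert rule with c = 4 gives a window width of 3.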
$$y_t = x_t \beta + \sum_{j=2}^{p} \phi_j \hat{y}_t^{\,j} + u_t$$

where $\hat{y}_t$ is the predicted value from the linear model and $p$ is the power of $\hat{y}_t$ in the unrestricted model equation starting from 2. The number of higher-ordered terms to be chosen depends on the discretion of the analyst. The RESET option produces test results for $p = 2$, 3, and 4.

The RESET test is an F statistic for testing $H_0: \phi_j = 0, \forall j = 2, \ldots, p$, against $H_1: \phi_j \neq 0$ for at least one $j = 2, \ldots, p$ in the unrestricted model and is computed as follows:

$$F_{(p-1,\, n-k-p+1)} = \frac{(SSE_R - SSE_U)/(p - 1)}{SSE_U/(n - k - p + 1)}$$

where $SSE_R$ is the sum of squared errors due to the restricted model, $SSE_U$ is the sum of squared errors due to the unrestricted model, $n$ is the total number of observations, and $k$ is the number of parameters in the original linear model.

Ramsey's test can be viewed as a linearity test that checks whether any nonlinear transformation of the specified independent variables has been omitted, but it need not help in identifying a new relevant variable other than those already specified in the current model.
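The RESET F statistic is simple arithmetic once the two sums of squared errors are known. The following Python sketch is illustrative only; the function name `reset_f` is hypothetical and the restricted and unrestricted models are assumed to have been fitted elsewhere.

```python
def reset_f(sse_restricted, sse_unrestricted, n, k, p):
    """Return (F, df1, df2) for Ramsey's RESET test with powers 2..p."""
    df1 = p - 1
    df2 = n - k - p + 1
    f_stat = ((sse_restricted - sse_unrestricted) / df1) / (sse_unrestricted / df2)
    return f_stat, df1, df2
```

For instance, with restricted and unrestricted SSEs of 120 and 100, n = 30, k = 3, and p = 3, the statistic is (20/2)/(100/25) = 2.5 with 2 and 25 degrees of freedom.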
Predicted Values
The AUTOREG procedure can produce two kinds of predicted values for the response series and corresponding residuals and confidence limits. The residuals in both cases are computed as the actual value minus the predicted value. In addition, when GARCH models are estimated, the AUTOREG procedure can output predictions of the conditional error variance.
where $v^2$ is an estimate of the variance of $\hat{y}_t$ and $t_{\alpha/2}$ is the upper $\alpha/2$ percentage point of the $t$ distribution:

$$\mathrm{Prob}(T > t_{\alpha/2}) = \alpha/2$$

where $T$ is an observation from a $t$ distribution with $q$ degrees of freedom. The value of $\alpha$ can be set with the ALPHACLM= option. The degrees of freedom parameter, $q$, is taken to be the number of observations minus the number of free parameters in the regression and autoregression parts of the model. For the YW estimation method, the value of $v$ is calculated as

$$v = \sqrt{s^2\, x_t'(X'V^{-1}X)^{-1} x_t}$$

where $s^2$ is the error sum of squares divided by $q$. For the ULS and ML methods, it is calculated as

$$v = \sqrt{s^2\, x_t' W x_t}$$

where $W$ is the $k \times k$ submatrix of $(J'J)^{-1}$ that corresponds to the regression parameters. For details, see the section "Computational Methods" on page 359 earlier in this chapter.
where $x_t$ is the vector of independent variables and $\nu_{t|t-1}$ is the minimum variance linear predictor of the error term given the available past values of $\nu_{t-j}$, $j = 1, 2, \ldots, t-1$, and the autoregressive model for $\nu_t$. If the $m$ previous values of the structural residuals are available, then

$$\nu_{t|t-1} = -\hat{\varphi}_1 \nu_{t-1} - \ldots - \hat{\varphi}_m \nu_{t-m}$$

where $\hat{\varphi}_1, \ldots, \hat{\varphi}_m$ are the estimated AR parameters. The upper and lower confidence limits are computed as

$$\tilde{u}_t = \tilde{y}_t + t_{\alpha/2}\, v$$

$$\tilde{l}_t = \tilde{y}_t - t_{\alpha/2}\, v$$
where $v$, in this case, is computed as

$$v = \sqrt{s^2 \left( x_t'(X'V^{-1}X)^{-1} x_t + r \right)}$$

where the value $rs^2$ is the estimate of the variance of $\nu_{t|t-1}$. At the start of the series, and after missing values, $r$ is generally greater than 1. See the section "Predicting the Conditional Variance" on page 383 for computational details of $r$. The plot of residuals and confidence limits in Example 8.4 later in this chapter illustrates this behavior.

Except to adjust the degrees of freedom for the error sum of squares, the preceding formulas do not account for the fact that the autoregressive parameters are estimated. In particular, the confidence limits are likely to be somewhat too narrow. In large samples, this is probably not an important effect, but it may be appreciable in small samples. Refer to Harvey (1981) for some discussion of this problem for AR(1) models.

Note that at the beginning of the series (the first $m$ observations, where $m$ is the value of the NLAG= option) and after missing values, these residuals do not match the residuals obtained by using OLS on the transformed variables. This is because, in these cases, the predicted noise values must be based on less than a complete set of past noise values and, thus, have larger variance. The GLS transformation for these observations includes a scale factor as well as a linear combination of past values. Put another way, the $L^{-1}$ matrix defined in the section "Computational Methods" on page 359 has the value 1 along the diagonal, except for the first $m$ observations and after missing values.
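The sign convention in the error predictor is easy to get wrong, so a small sketch may help. The Python functions below are illustrative, with hypothetical names; they assume the structural prediction $x_t'\hat{\beta}$, the past structural residuals, the critical value $t_{\alpha/2}$, and the standard error $v$ are already available.

```python
def predict_with_ar_errors(xbeta_t, past_resid, phi):
    """Full-model prediction: structural part plus predicted error.

    past_resid holds nu_{t-1}, nu_{t-2}, ... (most recent first) and
    phi holds the estimated AR parameters phi_1, ..., phi_m.
    """
    nu_pred = -sum(p * r for p, r in zip(phi, past_resid))
    return xbeta_t + nu_pred


def confidence_limits(y_pred, t_crit, v):
    """Return (lower, upper) limits y_pred -/+ t_crit * v."""
    return y_pred - t_crit * v, y_pred + t_crit * v
```

Because the predictor enters with a minus sign, a negative AR parameter combined with a positive past residual raises the prediction, which is how a negative coefficient corresponds to positive autocorrelation.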
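$$\epsilon_t^2 = \omega + \sum_{i=1}^{n} (\alpha_i + \gamma_i)\, \epsilon_{t-i}^2 - \sum_{j=1}^{p} \gamma_j \eta_{t-j} + \eta_t$$

where $\eta_t = \epsilon_t^2 - h_t$ and $n = \max(p, q)$. This representation shows that the squared residual $\epsilon_t^2$ follows an ARMA$(n, p)$ process. Then for any $d > 0$, the conditional expectations are as follows:

$$E(\epsilon_{t+d}^2 | \Psi_t) = \omega + \sum_{i=1}^{n} (\alpha_i + \gamma_i)\, E(\epsilon_{t+d-i}^2 | \Psi_t) - \sum_{j=1}^{p} \gamma_j E(\eta_{t+d-j} | \Psi_t)$$

The $d$-step-ahead prediction error, $\xi_{t+d} = y_{t+d} - y_{t+d|t}$, has the conditional variance

$$V(\xi_{t+d} | \Psi_t) = \sum_{j=0}^{d-1} g_j^2\, \sigma_{t+d-j|t}^2$$

where

$$\sigma_{t+d-j|t}^2 = E(\epsilon_{t+d-j}^2 | \Psi_t)$$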
Coefficients in the conditional $d$-step prediction error variance are calculated recursively using the following formula:

$$g_j = -\varphi_1 g_{j-1} - \ldots - \varphi_m g_{j-m}$$

where $g_0 = 1$ and $g_j = 0$ if $j < 0$; $\varphi_1, \ldots, \varphi_m$ are autoregressive parameters. Since the parameters are not known, the conditional variance is computed using the estimated autoregressive parameters. The $d$-step-ahead prediction error variance is simplified when there are no autoregressive terms:

$$V(\xi_{t+d} | \Psi_t) = \sigma_{t+d|t}^2$$

Therefore, the one-step-ahead prediction error variance is equivalent to the conditional error variance defined in the GARCH process:

$$h_t = E(\epsilon_t^2 | \Psi_{t-1}) = \sigma_{t|t-1}^2$$
Note that the conditional prediction error variance of the EGARCH and GARCH-M models cannot be calculated using the preceding formula. Therefore, the confidence intervals for the predicted values are computed assuming the homoscedastic conditional error variance. That is, the conditional prediction error variance is identical to the unconditional prediction error variance:

$$V(\xi_{t+d} | \Psi_t) = V(\xi_{t+d}) = \sigma^2 \sum_{j=0}^{d-1} g_j^2$$

since $\sigma_{t+d-j|t}^2 = \sigma^2$. You can compute $s^2 r$, which is the second term of the variance for the predicted value $\tilde{y}_t$ explained previously in the section "Predicting Future Series Realizations" on page 382, using the formula $\sigma^2 \sum_{j=0}^{d-1} g_j^2$; $r$ is estimated from $\sum_{j=0}^{d-1} g_j^2$ by using the estimated autoregressive parameters.

Consider the following conditional prediction error variance:

$$V(\xi_{t+d} | \Psi_t) = \sigma^2 \sum_{j=0}^{d-1} g_j^2 + \sum_{j=0}^{d-1} g_j^2 \left( \sigma_{t+d-j|t}^2 - \sigma^2 \right)$$
The second term in the preceding equation can be interpreted as the noise from using the homoscedastic conditional variance when the errors follow the GARCH process. However, it is expected that if the GARCH process is covariance stationary, the difference between the conditional prediction error variance and the unconditional prediction error variance disappears as the forecast horizon d increases.
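The recursion for the weights and the resulting prediction error variance can be sketched as follows. This Python sketch is illustrative; the function names are hypothetical, and the conditional variances $\sigma^2_{t+1|t}, \ldots, \sigma^2_{t+d|t}$ are assumed to be supplied by the caller.

```python
def prediction_weights(phi, d):
    """Return g_0, ..., g_{d-1} with g_j = -phi_1*g_{j-1} - ... - phi_m*g_{j-m}."""
    g = [1.0]  # g_0 = 1; terms with a negative index are treated as 0
    for j in range(1, d):
        g.append(-sum(phi[i] * g[j - 1 - i]
                      for i in range(len(phi)) if j - 1 - i >= 0))
    return g


def prediction_error_variance(phi, cond_var):
    """d-step prediction error variance, where d = len(cond_var).

    cond_var[i] is the conditional variance for horizon i + 1, so the
    weight g_j is paired with the variance for horizon d - j.
    """
    d = len(cond_var)
    g = prediction_weights(phi, d)
    return sum(g[j] ** 2 * cond_var[d - 1 - j] for j in range(d))
```

With no autoregressive terms the result is just the conditional variance at horizon d, and with a constant conditional variance it reduces to that constant times the sum of squared weights.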
the intercept estimate. INTERCEP contains a missing value for models for which the NOINT option is specified.

the estimation method that is specified in the METHOD= option

the label of the MODEL statement if one is given, or blank otherwise

the value of the mean square error for the model

the name of the row of covariance matrix for the parameter estimate, if the COVOUT option is specified

the log-likelihood value of the GARCH model

the value of the error sum of squares

This variable indicates the optimization status.
   _STATUS_ = 0 indicates that there were no errors during the optimization and the algorithm converged.
   _STATUS_ = 1 indicates that the optimization could not improve the function value and means that the results should be interpreted with caution.
   _STATUS_ = 2 indicates that the optimization failed due to the number of iterations exceeding either the maximum default or the specified number of iterations or the number of function calls allowed.
   _STATUS_ = 3 indicates that an error occurred during the optimization process. For example, this error message is obtained when a function or its derivatives cannot be calculated at the initial values or during the iteration process, when an optimization step is outside of the feasible region, or when active constraints are linearly dependent.

standard error of the parameter estimate, if the COVOUT option is specified

the estimate of the parameter in the EGARCH model, if an EGARCH model is specified

OLS for observations containing parameter estimates, or COV for observations containing covariance matrix elements.
The OUTEST= data set contains one observation for each MODEL statement giving the parameter estimates for that model. If the COVOUT option is specified, the OUTEST= data set includes additional observations for each MODEL statement giving the rows of the covariance of parameter estimates matrix. For covariance observations, the value of the _TYPE_ variable is COV, and the _NAME_ variable identifies the parameter associated with that row of the covariance matrix.
Printed Output
The AUTOREG procedure prints the following items:

1. the name of the dependent variable
2. the ordinary least squares estimates
3. estimates of autocorrelations, which include the estimates of the autocovariances, the autocorrelations, and (if there is sufficient space) a graph of the autocorrelation at each LAG
4. if the PARTIAL option is specified, the partial autocorrelations
5. the preliminary MSE, which results from solving the Yule-Walker equations. This is an estimate of the final MSE.
6. the estimates of the autoregressive parameters (Coefficient), their standard errors (Std Error), and the ratio of estimate to standard error (t Ratio)
7. the statistics of fit for the final model. These include the error sum of squares (SSE), the degrees of freedom for error (DFE), the mean square error (MSE), the mean absolute error (MAE), the mean absolute percentage error (MAPE), the root mean square error (Root MSE), the Schwarz information criterion (SBC), the Akaike information criterion (AIC), the corrected Akaike information criterion (AICC), the regression R2 (Reg Rsq), and the total R2 (Total Rsq). For GARCH models, the following additional items are printed:
   the value of the log-likelihood function
   the number of observations that are used in estimation (OBS)
   the unconditional variance (UVAR)
   the normality test statistic and its p-value
8. the parameter estimates for the structural model (B Value), a standard error estimate (Std Error), the ratio of estimate to standard error (t Ratio), and an approximation to the significance probability for the parameter being 0 (Approx Prob)
9. If the NLAG= option is specified with METHOD=ULS or METHOD=ML, the regression parameter estimates are printed again, assuming that the autoregressive parameter estimates are known. In this case, the Std Error and related statistics for the regression estimates will, in general, be different from the case when they are estimated. Note that from a standpoint of estimation, Yule-Walker and iterated Yule-Walker methods (NLAG= with METHOD=YW, ITYW) generate only one table, assuming AR parameters are given.
10. If you specify the NORMAL option, the Bera-Jarque normality test statistics are printed. If you specify the LAGDEP option, Durbin's h or Durbin's t is printed.
Table 8.2 continued

ODS Table Name               Description                                            Option

ODS Tables Created by the MODEL Statement
FitSummary                   Summary of regression
SummaryDepVarCen             Summary of regression (centered dependent var)
SummaryNoIntercept           Summary of regression (no intercept)                   NOINT
YWIterSSE                    Yule-Walker iteration sum of squared error             METHOD=ITYW
PreMSE                       Preliminary MSE                                        NLAG=
Dependent                    Dependent variable                                     default
DependenceEquations          Linear dependence equation
ARCHTest                     Q and LM tests for ARCH disturbances                   ARCHTEST
ChowTest                     Chow test and predictive Chow test                     CHOW= PCHOW=
Godfrey                      Godfrey's serial correlation test                      GODFREY<=>
PhilPerron                   Phillips-Perron unit root test                         STATIONARITY= (PHILLIPS<=()>) (no regressor)
PhilOul                      Phillips-Ouliaris cointegration test                   STATIONARITY= (PHILLIPS<=()>) (has regressor)
KPSS                         Kwiatkowski, Phillips, Schmidt, and Shin test          STATIONARITY= (KPSS<=()>)
ResetTest                    Ramsey's RESET test                                    RESET
ARParameterEstimates         Estimates of autoregressive parameters                 NLAG=
CorrGraph                    Estimates of autocorrelations                          NLAG=
BackStep                     Backward elimination of autoregressive terms           BACKSTEP
ExpAutocorr                  Expected autocorrelations                              NLAG=
IterHistory                  Iteration history                                      ITPRINT
ParameterEstimates           Parameter estimates                                    default
ParameterEstimatesGivenAR    Parameter estimates assuming AR parameters are given   NLAG=, METHOD= ULS | ML
PartialAutoCorr              Partial autocorrelation                                PARTIAL
CovB                         Covariance of parameter estimates                      COVB
CorrB                        Correlation of parameter estimates                     CORRB
CholeskyFactor               Cholesky root of gamma                                 ALL
Coefficients                 Coefficients for first NLAG observations               COEF
GammaInverse                 Gamma inverse                                          GINV
ConvergenceStatus            Convergence status table                               default
MiscStat                     Durbin t or Durbin h, Bera-Jarque normality test       LAGDEP=; NORMAL
DWTest                       Durbin-Watson statistics                               DW=

ODS Tables Created by the TEST Statement
FTest                        F test
WaldTest                     Wald test
ODS Graphics
This section describes the use of ODS for creating graphics with the AUTOREG procedure. To request these graphs, you must specify the ODS GRAPHICS statement. By default, only the residual, predicted versus actual, and autocorrelation of residuals plots are produced. If, in addition to the ODS GRAPHICS statement, you also specify the ALL option in either the PROC AUTOREG statement or MODEL statement, all plots are created. For HETERO, GARCH, and AR models, studentized residuals are replaced by standardized residuals. For the autoregressive models, the conditional variance of the residuals is computed as described in the section "Predicting Future Series Realizations" on page 382. For the GARCH and HETERO models, residuals are assumed to have $h_t$ conditional variance invoked by the HT= option of the OUTPUT statement. For all these cases, the Cook's D plot is not produced.
ODS Graph Name           Plot Description                        Option
ACFPlot                  Autocorrelation of residuals            ACF
FitPlot                  Predicted versus actual plot            Default
CooksD                   Cook's D plot                           ALL (no NLAG=)
IACFPlot                 Inverse autocorrelation of residuals    ALL
QQPlot                   Q-Q plot of residuals                   ALL
PACFPlot                 Partial autocorrelation of residuals    ALL
ResidualHistogram        Histogram of the residuals              ALL
ResidualPlot             Residual plot                           Default
StudentResidualPlot      Studentized residual plot               ALL (no NLAG=/HETERO=/GARCH=)
StandardResidualPlot     Standardized residual plot              ALL
WhiteNoiseLogProbPlot    Tests for white noise residuals         ALL
proc sgplot data=gnp noautolegend;
   scatter x=date y=y;
   xaxis grid values=('01jan1901'd '01jan1911'd '01jan1921'd '01jan1931'd
                      '01jan1941'd '01jan1951'd '01jan1961'd '01jan1971'd
                      '01jan1981'd '01jan1991'd);
run;
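The (linear) trend-stationary process is estimated using the following form:

$$y_t = \beta_0 + \beta_1 t + \nu_t$$

where

$$\nu_t = \epsilon_t - \varphi_1 \nu_{t-1} - \varphi_2 \nu_{t-2}$$

$$\epsilon_t \sim \mathrm{IN}(0, \sigma_\epsilon^2)$$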
The preceding trend-stationary model assumes that uncertainty over future horizons is bounded since the error term, $\nu_t$, has a finite variance. The maximum likelihood AR estimates from the statements that follow are shown in Output 8.1.2:
proc autoreg data=gnp; model y = t / nlag=2 method=ml; run;
Autoregressive parameters assumed given.

Variable     DF   Standard Error   Approx Pr > |t|   Variable Label
Intercept     1   0.0661           <.0001
t             1   0.001346         <.0001            Time Trend
Nelson and Plosser (1982) failed to reject the hypothesis that macroeconomic time series are nonstationary and have no tendency to return to a trend line. In this context, the simple random walk process can be used as an alternative process:

$$y_t = \beta_0 + y_{t-1} + \nu_t$$

In general, the difference-stationary process is written as

$$\phi(L)(1 - L) y_t = \beta_0 \phi(1) + \theta(L) \epsilon_t$$

where $L$ is the lag operator. You can observe that the class of a difference-stationary process should have at least one unit root in the AR polynomial $\phi(L)(1 - L)$.

The Dickey-Fuller procedure is used to test the null hypothesis that the series has a unit root in the AR polynomial. Consider the following equation for the augmented Dickey-Fuller test:

$$\Delta y_t = \beta_0 + \delta t + \beta_1 y_{t-1} + \sum_{i=1}^{m} \gamma_i \Delta y_{t-i} + \epsilon_t$$

where $\Delta = 1 - L$.
The %DFTEST macro in the following statements computes the test statistic and its p-value to perform the Dickey-Fuller test. The default value of m is 3, but you can specify m with the AR= option. The option TREND=2 implies that the Dickey-Fuller test equation contains a linear time trend. See Chapter 5, "SAS Macros and Functions," for details.
%dftest(gnp,y,trend=2,outstat=stat1); proc print data=stat1; run;
The augmented Dickey-Fuller test indicates that the output series may have a difference-stationary process. The statistic _TAU_ has a value of -2.61903 and its p-value is 0.29104. (See Output 8.1.3.)
Output 8.1.3 Augmented Dickey-Fuller Test Results
Analysis of Real GNP

Obs   Intercept   AR_V   _NOBS_   _TREND_   _DLAG_
 1     0.76919     -1      79        2         1
 2     0.08085      .      79        2         1
 3     0.00051      .      79        2         1
 4    -0.01695      .      79        2         1
 5     0.00549      .      79        2         1
 6     0.00842      .      79        2         1
 7     0.01056      .      79        2         1
The AR(1) model for the differenced series DY is estimated using the maximum likelihood method for the period 1902 to 1983. The difference-stationary process is written

$$\Delta y_t = \beta_0 + \nu_t$$

$$\nu_t = \epsilon_t - \varphi_1 \nu_{t-1}$$

The estimated value of $\varphi_1$ is -0.297 and that of $\beta_0$ is 0.0293. All estimated values are statistically significant. The PROC step follows:
proc autoreg data=gnp; model dy = / nlag=1 method=ml; run;
The printed output produced by the PROC step is shown in Output 8.1.4.
Output 8.1.4 Estimating the Differenced Series with AR(1) Error
Analysis of Real GNP
The AUTOREG Procedure
Maximum Likelihood Estimates

SSE              0.27107673    DFE                        80
MSE                 0.00339    Root MSE              0.05821
SBC              -226.77848    AIC                -231.59192
MAE              0.04333026    AICC               -231.44002
MAPE             153.637587    Regress R-Square       0.0000
Durbin-Watson        1.9268    Total R-Square         0.0900

DF   Standard Error   Approx Pr > |t|
 1   0.009093         0.0018
 1   0.1067           0.0067

Autoregressive parameters assumed given.

Variable     DF   Estimate   Standard Error   t Value   Approx Pr > |t|
Intercept     1   0.0293     0.009093         3.22      0.0018
option, the p-value of the Durbin-Watson statistic is printed. The Durbin-Watson test indicates the positive autocorrelation of the regression residuals. The DATA and PROC steps follow:
title "Grunfeld's Investment Models Fit with Autoregressive Errors";
data grunfeld;
   input year gei gef gec;
   label gei = 'Gross investment GE'
         gec = 'Lagged Capital Stock GE'
         gef = 'Lagged Value of GE shares';
datalines;
... more lines ...

proc autoreg data=grunfeld;
   model gei = gef gec / nlag=1 dwprob;
   model gei = gef gec / nlag=1 method=uls;
   model gei = gef gec / nlag=1 method=ml;
run;
The printed output produced by each of the MODEL statements is shown in Output 8.2.1 through Output 8.2.4.
Output 8.2.1 OLS Analysis of Residuals
Grunfeld's Investment Models Fit with Autoregressive Errors
The AUTOREG Procedure
Dependent Variable   gei   Gross investment GE

Ordinary Least Squares Estimates

SSE              13216.5878    DFE                        17
MSE               777.44634    Root MSE             27.88272
SBC              195.614652    AIC                192.627455
MAE              19.9433255    AICC               194.127455
MAPE             23.2047973    Regress R-Square       0.7053
Durbin-Watson        1.0721    Total R-Square         0.7053

Variable     DF   Standard Error   t Value   Approx Pr > |t|   Variable Label
Intercept     1   31.3742          -0.32     0.7548
gef           1   0.0156            1.71     0.1063            Lagged Value of GE shares
gec           1   0.0257            5.90     <.0001            Lagged Capital Stock GE

Estimates of Autoregressive Parameters

Lag   Coefficient   t Value
  1   -0.460867     -2.08

Yule-Walker Estimates

SSE              10238.2951    DFE                        16
MSE               639.89344    Root MSE             25.29612
SBC              193.742396    AIC                189.759467
MAE              18.0715195    AICC               192.426133
MAPE             21.0772644    Regress R-Square       0.5717
Durbin-Watson        1.3321    Total R-Square         0.7717

Variable     DF   Standard Error   t Value   Approx Pr > |t|   Variable Label
Intercept     1   33.2511          -0.55     0.5911
gef           1   0.0158            2.10     0.0523            Lagged Value of GE shares
gec           1   0.0383            3.63     0.0022            Lagged Capital Stock GE

Unconditional Least Squares Estimates

Algorithm converged.

DF   Standard Error   t Value   Approx Pr > |t|   Variable Label
 1   34.8101          -0.54     0.5993
 1   0.0179            1.89     0.0769            Lagged Value of GE shares
 1   0.0449            3.05     0.0076            Lagged Capital Stock GE
 1   0.2592           -1.93     0.0718

Autoregressive parameters assumed given.

DF   Standard Error   t Value   Approx Pr > |t|   Variable Label
 1   33.7567          -0.55     0.5881
 1   0.0159            2.13     0.0486            Lagged Value of GE shares
 1   0.0404            3.39     0.0037            Lagged Capital Stock GE

Maximum Likelihood Estimates

Algorithm converged.

SSE              10229.2303    DFE                        16
MSE               639.32689    Root MSE             25.28491
SBC              193.738877    AIC                189.755947
MAE              18.0892426    AICC               192.422614
MAPE             21.0978407    Regress R-Square       0.5656
Durbin-Watson        1.3385    Total R-Square         0.7719

Autoregressive parameters assumed given.

DF   Standard Error   t Value   Approx Pr > |t|   Variable Label
 1   33.3931          -0.55     0.5897
 1   0.0158            2.11     0.0512            Lagged Value of GE shares
 1   0.0389            3.56     0.0026            Lagged Capital Stock GE
proc autoreg data=a plots;
   model y = x / nlag=1;
   output out=b p=pred pm=xbeta;
run;

proc sgplot data=b;
   scatter y=y x=time / markerattrs=(color=black);
   series y=pred x=time / lineattrs=(color=blue);
   series y=xbeta x=time / lineattrs=(color=red);
run;
The printed output produced by PROC AUTOREG is shown in Output 8.3.1 and Output 8.3.2. Plots of observed and predicted values are shown in Output 8.3.3 and Output 8.3.4. Note: the plot Output 8.3.3 can be viewed in the Autoreg.Model.FitDiagnosticPlots category by selecting View→Results.
Output 8.3.1 Results of OLS Analysis: No Autoregressive Model Fit
Lack of Fit Study
Fitting White Noise Plus Autoregressive Errors to a Sine Wave
The AUTOREG Procedure
Dependent Variable   y

Ordinary Least Squares Estimates

SSE              34.8061005    DFE                        73
MSE                 0.47680    Root MSE              0.69050
SBC              163.898598    AIC                159.263622
MAE              0.59112447    AICC               159.430289
MAPE             117894.045    Regress R-Square       0.0008
Durbin-Watson        0.0057    Total R-Square         0.0008

Variable     DF   Standard Error   Approx Pr > |t|
Intercept     1   0.1584           0.1367
x             1   0.2771           0.8109

Estimates of Autocorrelations

Lag   Covariance   Correlation
  0   0.4641       1.000000
  1   0.4531       0.976386

Preliminary MSE   0.0217

Estimates of Autoregressive Parameters

Lag   Coefficient   t Value
  1   -0.976386     -38.35

Yule-Walker Estimates

SSE              0.18304264    DFE                        72
MSE                 0.00254    Root MSE              0.05042
SBC              -222.30643    AIC                 -229.2589
MAE              0.04551667    AICC               -228.92087
MAPE             29145.3526    Regress R-Square       0.0001
Durbin-Watson        0.0942    Total R-Square         0.9947

Variable     DF   Standard Error   Approx Pr > |t|
Intercept     1   0.1702           0.3898
x             1   0.0141           0.9315
The model is estimated using maximum likelihood, and the residuals are plotted with 99% confidence limits. The PARTIAL option prints the partial autocorrelations. The following statements fit the model:
proc autoreg data=ar partial;
   model y = / nlag=(1 4 5) method=ml;
   output out=a predicted=p residual=r ucl=u lcl=l alphacli=.01;
run;
The printed output produced by the AUTOREG procedure is shown in Output 8.4.1 and Output 8.4.2. Note: the plot Output 8.4.2 can be viewed in the Autoreg.Model.FitDiagnosticPlots category by selecting View→Results.
Simulated Time Series with Roots: (X-1.25)(X**4-1.25)
With 15% Missing Values
The AUTOREG Procedure

Ordinary Least Squares Estimates

SSE              182.972379    DFE                        40
MSE                 4.57431    Root MSE              2.13876
SBC               181.39282    AIC                179.679248
MAE              1.80469152    AICC               179.781813
MAPE             270.104379    Regress R-Square       0.0000
Durbin-Watson        1.3962    Total R-Square         0.0000

Variable     DF   Estimate   Standard Error   t Value   Approx Pr > |t|
Intercept     1   -2.2387    0.3340           -6.70     <.0001

Estimates of Autocorrelations

Lag   Covariance   Correlation
  0    4.4627       1.000000
  1    1.4241       0.319109
  2    1.6505       0.369829
  3    0.6808       0.152551
  4    2.9167       0.653556
  5   -0.3816      -0.085519

Expected Autocorrelations

Lag   Autocorr
  0   1.0000
  1   0.4204
  2   0.2480
  3   0.3160
  4   0.6903
  5   0.0228

Algorithm converged.

Maximum Likelihood Estimates

SSE              48.4396756    DFE                        37
MSE                 1.30918    Root MSE              1.14419
SBC              146.879013    AIC                140.024725
MAE              0.88786192    AICC               141.135836
MAPE             141.377721    Regress R-Square       0.0000
Durbin-Watson        2.9457    Total R-Square         0.7353

DF   Standard Error   Approx Pr > |t|
 1   0.5239           0.0001
 1   0.1129           <.0001
 1   0.0914           <.0001
 1   0.1202           <.0001

Autoregressive parameters assumed given.

Variable     DF   Estimate   Standard Error   t Value   Approx Pr > |t|
Intercept     1   -2.2370    0.5225           -4.28     0.0001
The plot of the predicted values and the upper and lower confidence limits is shown in Output 8.4.3. Note that the confidence interval is wider at the beginning of the series (when there are no past noise values to use in the forecast equation) and after missing values where, again, there is an incomplete set of past residuals.
Output 8.4.3 Plot of Predicted Values and Confidence Interval
The money demand equation is first estimated using OLS. The DW=4 option produces generalized Durbin-Watson statistics up to the fourth order. Their exact marginal probabilities (p-values) are also calculated with the DWPROB option. The Durbin-Watson test indicates positive first-order autocorrelation at, say, the 10% significance level. You can use the Durbin-Watson table, which is available only for 1% and 5% significance points. The relevant upper ($d_U$) and lower ($d_L$) bounds are $d_U = 1.731$ and $d_L = 1.471$, respectively, at the 5% significance level. However, the bounds test is inconvenient, since sometimes you may get the statistic in the inconclusive region while the interval between the upper and lower bounds becomes smaller with the increasing sample size. The PROC step follows:
title 'Partial Adjustment Money Demand Equation';
title2 'Quarterly Data - 1968:2 to 1983:4';
proc autoreg data=money outest=est covout;
   model m = m1cp y intr infr / dw=4 dwprob;
run;
Output 8.5.2 OLS Estimation of the Partial Adjustment Money Demand Equation
Partial Adjustment Money Demand Equation
Quarterly Data - 1968:2 to 1983:4
The AUTOREG Procedure
Dependent Variable   m   Real Money Stock (M1)

Ordinary Least Squares Estimates

SSE              0.00271902    DFE                        58
MSE               0.0000469    Root MSE              0.00685
SBC              -433.68709    AIC                -444.40276
MAE              0.00483389    AICC               -443.35013
MAPE             0.08888324    Regress R-Square       0.9546
                               Total R-Square         0.9546

Durbin-Watson Statistics

Order   DW       Pr < DW   Pr > DW
  1     1.7355   0.0607    0.9393
  2     2.1058   0.5519    0.4481
  3     2.0286   0.5002    0.4998
  4     2.2835   0.8880    0.1120

NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is the p-value for testing negative autocorrelation.

Variable labels: Lagged M1/Current GDF; Real GNP; Yield on Corporate Bonds; Rate of Prices Changes
The autoregressive model is estimated using the maximum likelihood method. Though the Durbin-Watson test statistic is calculated after correcting the autocorrelation, it should be used with care since the test based on this statistic is not justified theoretically. The PROC step follows:
proc autoreg data=money;
   model m = m1cp y intr infr / nlag=1 method=ml maxit=50;
   output out=a p=p pm=pm r=r rm=rm ucl=ucl lcl=lcl
          uclm=uclm lclm=lclm;
run;

proc print data=a(obs=8);
   var p pm r rm ucl lcl uclm lclm;
run;
A difference is shown between the OLS estimates in Output 8.5.2 and the AR(1)-ML estimates in Output 8.5.3. The estimated autocorrelation coefficient is significantly negative (-0.88345). Note that the negative coefficient of AR(1) should be interpreted as a positive autocorrelation.

Two predicted values are produced: predicted values computed for the structural model and predicted values computed for the full model. The full model includes both the structural and error-process parts. The predicted values and residuals are stored in the output data set A, as are the upper and lower 95% confidence limits for the predicted values. Part of the data set A is shown in Output 8.5.4. The first observation is missing since the explanatory variables, M1CP and INFR, are missing for the corresponding observation.
Output 8.5.3 Estimated Partial Adjustment Money Demand Equation
Partial Adjustment Money Demand Equation
Quarterly Data - 1968:2 to 1983:4

The AUTOREG Procedure

Estimates of Autoregressive Parameters

Lag    Coefficient    Standard Error    t Value
  1      -0.126273          0.131393      -0.96

Algorithm converged.
DF
Estimate
Standard Approx Error t Value Pr > |t| Variable Label 0.4880 0.0908 0.0411 0.0159 0.001834 0.0686 4.94 4.50 3.67 -6.92 -3.46 -12.89 <.0001 <.0001 0.0005 <.0001 0.0010 <.0001
Lagged M1/Current GDF Real GNP Yield on Corporate Bonds Rate of Prices Changes
Autoregressive parameters assumed given. Standard Approx Error t Value Pr > |t| Variable Label 0.4685 0.0840 0.0402 0.0155 0.001828 5.15 4.87 3.75 -7.08 -3.47 <.0001 <.0001 0.0004 <.0001 0.0010
DF
Estimate
Lagged M1/Current GDF Real GNP Yield on Corporate Bonds Rate of Prices Changes
proc sgplot data=ibm;
   series y=r x=time / lineattrs=(color=blue);
   refline 0 / axis=y lineattrs=(pattern=ShortDash);
run;
The simple ARCH(2) model is estimated using the AUTOREG procedure. The MODEL statement option GARCH=(Q=2) specifies the ARCH(2) model. The OUTPUT statement with the CEV= option produces the conditional variances V. The conditional variance and its forecast are calculated using parameter estimates:

\[ h_t = \hat{\omega} + \hat{\alpha}_1 \epsilon_{t-1}^2 + \hat{\alpha}_2 \epsilon_{t-2}^2 \]

\[ E(\epsilon_{t+d}^2 \mid \Psi_t) = \hat{\omega} + \sum_{i=1}^{2} \hat{\alpha}_i \, E(\epsilon_{t+d-i}^2 \mid \Psi_t) \]
The parameter estimates for $\omega$, $\alpha_1$, and $\alpha_2$ are 0.00011, 0.04136, and 0.06976, respectively. The normality test indicates that the conditional normal distribution may not fully explain the leptokurtosis in the stock returns (Bollerslev 1987).
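The forecast recursion for the conditional variance can be sketched in a few lines. The example is in Python rather than SAS and is only an illustration: the function name `arch_forecast` is hypothetical, and it assumes that expectations of squared residuals at or before time t are replaced by their observed values, while later ones are replaced by earlier forecasts.

```python
def arch_forecast(omega, alphas, squared_residuals, horizon):
    """Multi-step forecasts of the conditional variance for ARCH(q).

    squared_residuals holds observed squared residuals up to time t,
    most recent last. Each forecast uses observed values where they
    exist and previously computed forecasts otherwise.
    """
    q = len(alphas)
    if len(squared_residuals) < q:
        raise ValueError("need at least q observed squared residuals")
    history = list(squared_residuals)
    forecasts = []
    for _ in range(horizon):
        # alphas[0] multiplies the most recent value, alphas[1] the one before.
        nxt = omega + sum(a * history[-i]
                          for i, a in enumerate(alphas, start=1))
        forecasts.append(nxt)
        history.append(nxt)
    return forecasts
```

With the estimates quoted above (0.00011, 0.04136, 0.06976), long-horizon forecasts from this recursion settle at omega / (1 - alpha_1 - alpha_2), the implied unconditional variance.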
The ARCH model estimates are shown in Output 8.6.2, and conditional variances are also shown in Output 8.6.3. The code that generates Output 8.6.3 is shown below.
data b;
   set a;
   length type $ 8.;
   if r ^= . then do;
      type = 'ESTIMATE';
      output;
   end;
   else do;
      type = 'FORECAST';
      output;
   end;
run;

proc sgplot data=b;
   series x=time y=v / group=type;
   refline 254 / axis=x lineattrs=(pattern=ShortDash);
run;
IBM Stock Returns (daily)
29jun1959 - 30jun1960

The AUTOREG Procedure

Ordinary Least Squares Estimates

SSE              0.03214307    DFE                       254
MSE               0.0001265    Root MSE              0.01125
SBC               -1558.802    AIC                 -1558.802
MAE              0.00814086    AICC                -1558.802
MAPE             100.378566    Regress R-Square       0.0000
Durbin-Watson        2.1377    Total R-Square         0.0000
NOTE: No intercept term is used. R-squares are redefined. Standard Error 7.5608E-6 0.0511 0.0432 Approx Pr > |t| <.0001 0.4181 0.1062
DF 1 1 1
      if time > 0 then output;
      ull = ul;
      ul = u;
   end;
run;

data b;
   y = .;
   do time = 37 to 46;
      output;
   end;
run;

data b;
   merge a b;
   by time;
run;

proc autoreg data=b all plots(unpack);
   model y = time / nlag=2 method=ml;
   output out=p p=yhat pm=ytrend lcl=lcl ucl=ucl;
run;
References
Anderson, T. W. and Mentz, R. P. (1980), "On the Structure of the Likelihood Function of Autoregressive and Moving Average Models," Journal of Time Series, 1, 83-94.
Ansley, C. F., Kohn, R., and Shively, T. S. (1992), "Computing p-Values for the Generalized Durbin-Watson and Other Invariant Test Statistics," Journal of Econometrics, 54, 277-300.
Baillie, R. T. and Bollerslev, T. (1992), "Prediction in Dynamic Models with Time-Dependent Conditional Variances," Journal of Econometrics, 52, 91-113.
Balke, N. S. and Gordon, R. J. (1986), "Historical Data," in R. J. Gordon, ed., The American Business Cycle, 781-850, Chicago: The University of Chicago Press.
Beach, C. M. and MacKinnon, J. G. (1978), "A Maximum Likelihood Procedure for Regression with Autocorrelated Errors," Econometrica, 46, 51-58.
Bollerslev, T. (1986), "Generalized Autoregressive Conditional Heteroskedasticity," Journal of Econometrics, 31, 307-327.
Bollerslev, T. (1987), "A Conditionally Heteroskedastic Time Series Model for Speculative Prices and Rates of Return," The Review of Economics and Statistics, 69, 542-547.
Box, G. E. P. and Jenkins, G. M. (1976), Time Series Analysis: Forecasting and Control, Revised Edition, San Francisco: Holden-Day.
Breusch, T. S. and Pagan, A. R. (1979), "A Simple Test for Heteroscedasticity and Random Coefficient Variation," Econometrica, 47, 1287-1294.
Chipman, J. S. (1979), "Efficiency of Least Squares Estimation of Linear Trend When Residuals Are Autocorrelated," Econometrica, 47, 115-128.
Chow, G. (1960), "Tests of Equality between Sets of Coefficients in Two Linear Regressions," Econometrica, 28, 531-534.
Cochrane, D. and Orcutt, G. H. (1949), "Application of Least Squares Regression to Relationships Containing Autocorrelated Error Terms," Journal of the American Statistical Association, 44, 32-61.
Davies, R. B. (1973), "Numerical Inversion of a Characteristic Function," Biometrika, 60, 415-417.
Duffie, D. (1989), Futures Markets, Englewood Cliffs, NJ: Prentice Hall.
Durbin, J. (1969), "Tests for Serial Correlation in Regression Analysis Based on the Periodogram of Least-Squares Residuals," Biometrika, 56, 1-15.
Durbin, J. (1970), "Testing for Serial Correlation in Least-Squares Regression When Some of the Regressors Are Lagged Dependent Variables," Econometrica, 38, 410-421.
Edgerton, D. and Wells, C. (1994), "Critical Values for the CUSUMSQ Statistic in Medium and Large Sized Samples," Oxford Bulletin of Economics and Statistics, 56, 355-365.
Engle, R. F. (1982), "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation," Econometrica, 50, 987-1007.
Engle, R. F. and Bollerslev, T. (1986), "Modelling the Persistence of Conditional Variances," Econometric Review, 5, 1-50.
Engle, R. F., Lilien, D. M., and Robins, R. P. (1987), "Estimating Time Varying Risk in the Term Structure: The ARCH-M Model," Econometrica, 55, 391-407.
Fuller, W. A. (1976), Introduction to Statistical Time Series, New York: John Wiley & Sons.
Gallant, A. R. and Goebel, J. J. (1976), "Nonlinear Regression with Autoregressive Errors," Journal of the American Statistical Association, 71, 961-967.
Godfrey, L. G. (1978a), "Testing against General Autoregressive and Moving Average Error Models When the Regressors Include Lagged Dependent Variables," Econometrica, 46, 1293-1301.
Godfrey, L. G. (1978b), "Testing for Higher Order Serial Correlation in Regression Equations When the Regressors Include Lagged Dependent Variables," Econometrica, 46, 1303-1310.
Godfrey, L. G. (1988), Misspecification Tests in Econometrics, Cambridge: Cambridge University Press.
Golub, G. H. and Van Loan, C. F. (1989), Matrix Computations, Second Edition, Baltimore: Johns Hopkins University Press.
Greene, W. H. (1993), Econometric Analysis, Second Edition, New York: Macmillan.
Hamilton, J. D. (1994), Time Series Analysis, Princeton, NJ: Princeton University Press.
Harvey, A. C. (1981), The Econometric Analysis of Time Series, New York: John Wiley & Sons.
Harvey, A. C. (1990), The Econometric Analysis of Time Series, Second Edition, Cambridge: MIT Press.
Harvey, A. C. and McAvinchey, I. D. (1978), "The Small Sample Efficiency of Two-Step Estimators in Regression Models with Autoregressive Disturbances," Discussion Paper No. 78-10, University of British Columbia.
Harvey, A. C. and Phillips, G. D. A. (1979), "Maximum Likelihood Estimation of Regression Models with Autoregressive-Moving Average Disturbances," Biometrika, 66, 49-58.
Hildreth, C. and Lu, J. Y. (1960), "Demand Relations with Autocorrelated Disturbances," Technical Report 276, Michigan State University Agricultural Experiment Station, East Lansing, MI.
Hobijn, B., Franses, P. H., and Ooms, M. (2004), "Generalization of the KPSS-Test for Stationarity," Statistica Neerlandica, 58, 483-502.
Hurvich, C. M. and Tsai, C.-L. (1989), "Regression and Time Series Model Selection in Small Samples," Biometrika, 76, 297-307.
Inder, B. A. (1984), "Finite-Sample Power of Tests for Autocorrelation in Models Containing Lagged Dependent Variables," Economics Letters, 14, 179-185.
Inder, B. A. (1986), "An Approximation to the Null Distribution of the Durbin-Watson Statistic in Models Containing Lagged Dependent Variables," Econometric Theory, 2, 413-428.
Jarque, C. M. and Bera, A. K. (1980), "Efficient Tests for Normality, Homoskedasticity and Serial Independence of Regression Residuals," Economics Letters, 6, 255-259.
Johnston, J. (1972), Econometric Methods, Second Edition, New York: McGraw-Hill.
Jones, R. H. (1980), "Maximum Likelihood Fitting of ARMA Models to Time Series with Missing Observations," Technometrics, 22, 389-396.
Judge, G. G., Griffiths, W. E., Hill, R. C., Lutkepohl, H., and Lee, T. C. (1985), The Theory and Practice of Econometrics, Second Edition, New York: John Wiley & Sons.
King, M. L. and Wu, P. X. (1991), "Small-Disturbance Asymptotics and the Durbin-Watson and Related Tests in the Dynamic Regression Model," Journal of Econometrics, 47, 145-152.
Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., and Shin, Y. (1992), "Testing the Null Hypothesis of Stationarity against the Alternative of a Unit Root," Journal of Econometrics, 54, 159-178.
L'Esperance, W. L., Chall, D., and Taylor, D. (1976), "An Algorithm for Determining the Distribution Function of the Durbin-Watson Test Statistic," Econometrica, 44, 1325-1326.
Maddala, G. S. (1977), Econometrics, New York: McGraw-Hill.
Maeshiro, A. (1976), "Autoregressive Transformation, Trended Independent Variables and Autocorrelated Disturbance Terms," Review of Economics and Statistics, 58, 497-500.
McLeod, A. I. and Li, W. K. (1983), "Diagnostic Checking ARMA Time Series Models Using Squared-Residual Autocorrelations," Journal of Time Series Analysis, 4, 269-273.
Nelson, C. R. and Plosser, C. I. (1982), "Trends and Random Walks in Macroeconomic Time Series: Some Evidence and Implications," Journal of Monetary Economics, 10, 139-162.
Nelson, D. B. (1990), "Stationarity and Persistence in the GARCH(1,1) Model," Econometric Theory, 6, 318-334.
Nelson, D. B. (1991), "Conditional Heteroskedasticity in Asset Returns: A New Approach," Econometrica, 59, 347-370.
Nelson, D. B. and Cao, C. Q. (1992), "Inequality Constraints in the Univariate GARCH Model," Journal of Business & Economic Statistics, 10, 229-235.
Park, R. E. and Mitchell, B. M. (1980), "Estimating the Autocorrelated Error Model with Trended Data," Journal of Econometrics, 13, 185-201.
Phillips, P. C. and Ouliaris, S. (1990), "Asymptotic Properties of Residual Based Tests for Cointegration," Econometrica, 58, 165-193.
Phillips, P. C. and Perron, P. (1988), "Testing for a Unit Root in Time Series Regression," Biometrika, 75, 335-346.
Prais, S. J. and Winsten, C. B. (1954), "Trend Estimators and Serial Correlation," Cowles Commission Discussion Paper No. 383.
Ramsey, J. B. (1969), "Tests for Specification Errors in Classical Linear Least Squares Regression Analysis," Journal of Royal Statistical Society.
Savin, N. E. and White, K. J. (1978), "Testing for Autocorrelation with Missing Observations," Econometrica, 46, 59-67.
Schwarz, G. (1978), "Estimating the Dimension of a Model," Annals of Statistics, 6, 461-464.
Shively, T. S. (1990), "Fast Evaluation of the Distribution of the Durbin-Watson and Other Invariant Test Statistics in Time Series Regression," Journal of the American Statistical Association, 85, 676-685.
Spitzer, J. J. (1979), "Small-Sample Properties of Nonlinear Least Squares and Maximum Likelihood Estimators in the Context of Autocorrelated Errors," Journal of the American Statistical Association, 74, 41-47.
Theil, H. (1971), Principles of Econometrics, New York: John Wiley & Sons.
Vinod, H. D. (1973), "Generalization of the Durbin-Watson Statistic for Higher Order Autoregressive Process," Communication in Statistics, 2, 115-144.
Wallis, K. F. (1972), "Testing for Fourth Order Autocorrelation in Quarterly Regression Equations," Econometrica, 40, 617-636.
White, H. (1982), "Maximum Likelihood Estimation of Misspecified Models," Econometrica, 50, 1-25.
White, K. J. (1992), "The Durbin-Watson Test for Autocorrelation in Nonlinear Models," Review of Economics and Statistics, 74, 370-373.
Chapter 9
Travel Expenses within U.S. Advertising Permanent Staff Salaries Benefits Including Insurance Total
You can get a listing of the data set by using the PRINT procedure, as follows:
title 'Listing of Monthly Divisional Expense Data';
proc print data=report;
run;
To get a simple, transposed report of the same data set, use the following PROC COMPUTAB statement:
title 'Monthly Divisional Expense Report';
proc computab data=report;
run;
When a COLUMNS statement is not specified, each observation becomes a new column. If you use a COLUMNS statement, you must specify to which column each observation belongs by using program statements for column selection. When more than one observation is selected for the same column, values are summed. The following statements produce Figure 9.5:
proc computab data=report;
   rows travel advrtise salary insure;
   columns a b c;
   *----select column for company division, based on value of compdiv----*;
   a = compdiv = 'A';
   b = compdiv = 'B';
   c = compdiv = 'C';
run;
The statement A=COMPDIV='A'; illustrates the use of logical operators as a selection technique. If COMPDIV='A', then the current observation is added to the A column. See SAS Language: Reference, Version 6, First Edition for more information about logical operators.
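The selection-and-sum behavior described above can be mimicked outside SAS. The following is a minimal Python sketch, not PROC COMPUTAB itself: the function name `tabulate`, the dictionary layout, and the sample expense figures are all hypothetical. Each selector plays the role of a statement such as A=COMPDIV='A', and an observation is added to every column whose selector is true.

```python
def tabulate(observations, row_names, selectors):
    """Accumulate observations into report columns.

    selectors maps a column name to a function that returns True when
    an observation belongs in that column. Values from several
    observations selected for the same column are summed.
    """
    table = {row: {col: 0 for col in selectors} for row in row_names}
    for obs in observations:
        for col, is_selected in selectors.items():
            if is_selected(obs):
                for row in row_names:
                    table[row][col] += obs[row]
    return table


# Hypothetical data: two observations fall in division A, one in B.
data = [
    {"compdiv": "A", "travel": 10, "salary": 100},
    {"compdiv": "B", "travel": 20, "salary": 200},
    {"compdiv": "A", "travel": 5, "salary": 50},
]
report = tabulate(
    data,
    ["travel", "salary"],
    {
        "a": lambda o: o["compdiv"] == "A",
        "b": lambda o: o["compdiv"] == "B",
        "c": lambda o: o["compdiv"] == "C",
    },
)
```

Column C stays at 0 in this sketch because no observation selects it, which matches the default initialization of table values to 0.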
   rows sum / 'Total';
   a = compdiv = 'A';
   b = compdiv = 'B';
   c = compdiv = 'C';
   colblk: total = a + b + c;
   rowblk: sum = travel + advrtise + salary + insure;
run;
Travel Expenses within U.S. Advertising Permanent Staff Salaries Benefits Including Insurance Total
The PROC COMPUTAB statement is the only required statement. The COLUMNS, ROWS, and CELL statements define the COMPUTAB table. The INIT statement initializes the COMPUTAB table values. Programming statements process COMPUTAB table values. The BY and SUMBY statements provide BY-group processing and consolidation (roll up) tables.
Functional Summary
Table 9.1 summarizes the COMPUTAB procedure statements and options.
Table 9.1

Description                                          Statement   Option

Statements
specify BY-group processing                          BY
specify the format for printing a particular cell    CELL
define columns of the report                         COLUMNS
initialize values in the COMPUTAB data table         INIT
define rows of the report                            ROWS
produce consolidation tables                         SUMBY

Data Set Options
specify the input data set                           COMPUTAB    DATA=
specify an output data set                           COMPUTAB    OUT=

Input Options
specify a value to use when testing for 0            COMPUTAB    FUZZ=
initialize the data table to missing                 COMPUTAB    INITMISS
prevent the transposition of the input data set      COMPUTAB    NOTRANS

Printing Control Options
suppress printing of the listed columns              COLUMNS     NOPRINT
suppress all printed output                          COMPUTAB    NOPRINT
suppress printing of the listed rows                 ROWS        NOPRINT
suppress columns with all 0 or missing values        COLUMNS     NOZERO
suppress rows with all 0 or missing values           ROWS        NOZERO
list option values                                   COMPUTAB    OPTIONS
overprint titles, values, overlining, and            ROWS        OVERPRINT
  underlining associated with listed rows
print only consolidation tables                      COMPUTAB    SUMONLY

Report Formatting Options
specify number of decimal places to print            COMPUTAB    CDEC=
specify number of spaces between columns             COMPUTAB    CSPACE=
specify column width for the report                  COMPUTAB    CWIDTH=
overline the listed rows with double lines           ROWS        DOL
underline the listed rows with double lines          ROWS        DUL
specify a format for printing the cell values        CELL        FORMAT=
specify a format for printing column values          COLUMNS     FORMAT=
specify a format for printing the row values         ROWS        FORMAT=
left-align the column headings                       COLUMNS     LJC
left-justify character rows in each column           ROWS        LJC
specify indentation from the margin                  ROWS        +n
suppress printing of row titles on later pages       COMPUTAB    NORTR
overline the listed rows with a single line          ROWS        OL
start a new page before printing the listed rows     ROWS        _PAGE_
specify number of spaces before row titles           COMPUTAB    RTS=
print a blank row                                    ROWS        SKIP
underline the listed rows with a single line         ROWS        UL
specify text to print if column is 0 or missing      COLUMNS     ZERO=
specify text to print if row is 0 or missing         ROWS        ZERO=

Row and Column Type Options
specify that columns contain character data          COLUMNS     CHAR
specify that rows contain character data             ROWS        CHAR

Options for Column Headings
specify literal column headings                      COLUMNS     'column heading'
use variable labels in column headings               COLUMNS     _LABEL_
specify a master title centered over columns         COLUMNS     MTITLE=
use column names in column headings                  COLUMNS     _NAME_

Options for Row Titling
use labels in row titles                             ROWS        _LABEL_
use row names in row titles                          ROWS        _NAME_
specify literal row titles                           ROWS        'row title'
Input Options
DATA=SAS-data-set
names the SAS data set that contains the input data. If this option is not specified, the last created data set is used. If you are not reading a data set, use DATA=_NULL_.
FUZZ=value
specifies the criterion to use when testing for 0. If a number is within the FUZZ= value of 0, the number is set to 0.
INITMISS
initializes the COMPUTAB data table to missing rather than to 0. The COMPUTAB data table is discussed further in the section Details: COMPUTAB Procedure on page 448.
NOTRANSPOSE
NOTRANS
prevents the transposition of the input data set in building the COMPUTAB report tables. The NOTRANS option causes input data set variables to appear among the columns of the report rather than among the rows.
CDEC=d
specifies the default number of decimal places for printing. The default is CDEC=2. See the FORMAT= option in the sections on COLUMNS, ROWS, and CELL statements later in this chapter.
CSPACE=n
specifies the default number of spaces to insert between columns. The value of the CSPACE= option is used as the default value for the +n option in the COLUMNS statement. The default is CSPACE=2.
CWIDTH=w
specifies a default column width for the report. The default is CWIDTH=9. The width must be in the range of 1-32.
NORTR
suppresses the printing of row titles on each page. The NORTR (no row-title repeat) option is useful to suppress row titles when report pages are to be joined together in a larger report.
RTS=n
specifies the default number of spaces to be inserted before row titles when row titles appear after the first printed column. The default row-title spacing is RTS=2.
Output Options
NOPRINT
suppresses all printed output. Use the NOPRINT option with the OUT= option to produce an output data set but no printed reports.
OPTIONS
lists PROC COMPUTAB option values. The option values appear on a separate page preceding the procedure's normal output.
OUT=SAS-data-set
names the SAS data set to contain the output data. See the section Details: COMPUTAB Procedure on page 448 for a description of the structure of the output data set.
SUMONLY
suppresses printing of detailed reports. When the SUMONLY option is used, PROC COMPUTAB generates and prints only consolidation tables as specified in the SUMBY statement.
COLUMNS Statement
COLUMNS column-list / options ;
COLUMNS statements define the columns of the report. The COLUMNS statement can be abbreviated COLUMN, COLS, or COL. The specified column names must be valid SAS names. Abbreviated lists, as described in SAS Language: Reference, can also be used. You can use as many COLUMNS statements as you need. A COLUMNS statement can describe more than one column, and one column of the report can be described with several different COLUMNS statements. The order of the columns on the report is determined by the order of appearance of column names in COLUMNS statements. The first occurrence of the name determines where in the sequence of columns a particular column is located. The following options can be used in the COLUMNS statement.
'column heading'
specifies that the characters enclosed in quotes are to be used in the column heading for the variable or variables listed in the COLUMNS statement. Each quoted string appears on a separate line of the heading.
_LABEL_
uses labels, if provided, in the heading for the column or columns listed in the COLUMNS statement. If a label has not been provided, the name of the column is used. See SAS Language: Reference for information about the LABEL statement.
MTITLE=text
specifies that the string of characters enclosed in quotes is a master title to be centered over all the columns listed in the COLUMNS statement. The list of columns must be consecutive. Special characters (+, *, =, and so forth) placed on either side of the text expand to fill the space. The MTITLE= option can be abbreviated M=.
_NAME_
uses column names in column headings for the columns listed in the COLUMNS statement. This option allows headings (text) and names to be combined in a heading.
+n
inserts n spaces before each column listed in the COLUMNS statement. The default spacing is given by the CSPACE= option in the PROC COMPUTAB statement.
NOPRINT
suppresses printing of columns listed in the COLUMNS statement. This option enables you to create columns to be used for intermediate calculations without having those columns printed.
NOZERO
suppresses printing of columns when all the values in a column are 0 or missing. Numbers within the FUZZ= value of 0 are treated as 0.
_PAGE_
starts a new page of the report before printing each of the columns in the list that follows.
_TITLES_
prints row titles before each column in the list. The _TITLES_ option can be abbreviated as _TITLE_.
FORMAT=format
specifies a format for printing the values of the columns listed in the COLUMNS statement. The FORMAT= option can be abbreviated F=.
LJC
left-justifies the column headings for the columns listed. By default, columns are right-justified. When the LJC (left-justify character) option is used, any character row values in the column are also left-justified rather than right-justified.
ZERO=text
ROWS Statement
ROWS row-list / options ;
ROWS statements define the rows of the report. The ROWS statement can be abbreviated ROW. The specified row names must be valid SAS names. Abbreviated lists, as described in SAS Language: Reference, can also be used. You can use as many ROWS statements as you need. A ROWS statement can describe more than one row, and one row of the report can be described with several different ROWS statements. The order of the rows in the report is determined by the order of appearance of row names in ROWS statements. The first occurrence of the name determines where the row is located. The following options can be used in the ROWS statement.
_LABEL_
uses labels as row titles for the row or rows listed in the ROWS statement. If a label is not provided, the name of the row is substituted. See SAS Language: Reference for more information about the LABEL statement.
_NAME_
uses row names in row titles for the row or rows listed in the ROWS statement.
'row title'
specifies that the string of characters enclosed in quotes is to be used in the row title for the row or rows listed in the ROWS statement. Each quoted string appears on a separate line of the heading.
+n
indents n spaces from the margin for the rows in the ROWS statement.
DOL
overlines the rows listed in the ROWS statement with double lines. Overlines are printed on the line before any row titles or data for the row.
DUL
underlines the rows listed in the ROWS statement with double lines. Underlines are printed on the line after the data for the row. A row can have both an underline and an overline option.
NOPRINT
suppresses printing of the rows listed in the ROWS statement. This option enables you to create rows to be used for intermediate calculations without having those rows printed.
NOZERO
suppresses the printing of a row when all the values are 0 or missing.
OL
overlines the rows listed in the ROWS statement with a single line. Overlines are printed on the line before any row titles or data for the row.
OVERPRINT
overprints titles, values, overlining, and underlining associated with rows listed in the ROWS statement. The OVERPRINT option can be abbreviated OVP. This option is valid only when the system option OVP is in effect. See SAS Language: Reference for more information about the OVP option.
_PAGE_
starts a new page of the report before printing the listed rows.
SKIP
prints a blank line after the data lines for these rows.
UL
underlines the rows listed in the ROWS statement with a single line. Underlines are printed on the line after the data for the row. A row can have both an underline and an overline option.
FORMAT=format
specifies a format for printing the values of the rows listed in the ROWS statement. The FORMAT= option can be abbreviated as F=.
LJC
CELL Statement
CELL cell_names / FORMAT= format ;
The CELL statement specifies the format for printing a particular cell in the COMPUTAB data table. Cell variable names are compound SAS names of the form name1.name2, where name1 is the name of a row variable and name2 is the name of a column variable. Formats specified with the FORMAT= option in CELL statements override formats specified in ROWS and COLUMNS statements.
INIT Statement
INIT anchor-name [locator-name] values [locator-name values] ;
The INIT statement initializes values in the COMPUTAB data table at the beginning of each execution of the procedure and at the beginning of each BY group if a BY statement is present. The INIT statement in the COMPUTAB procedure is similar in function to the RETAIN statement in the DATA step, which initializes values in the program data vector. The INIT statement can be used at any point after the variable to which it refers has been defined in COLUMNS or ROWS statements. Each INIT statement initializes one row or column. Any number of INIT statements can be used. The first term after the keyword INIT, anchor-name, anchors initialization to a row or column. If anchor-name is a row name, then all locator-name values in the statement are columns of that row. If anchor-name is a column name, then all locator-name values in the statement are rows of that column. The following terms appear in the INIT statement:
anchor-name
names the row or column in which values are to be initialized. This term is required.
locator-name
identifies the starting column in the row (or starting row in the column) into which values are to be placed. For example, in a table with a row SALES and a column for each month of the year, the following statement initializes values for columns JAN, FEB, and JUN:
init sales jan 500 feb 600 jun 800;
If you do not specify locator-name values, the first value is placed into the first row or column, the second value into the second row or column, and so on. For example, the following statement assigns 500 to column JAN, 600 to FEB, and 450 to MAR:
init sales 500 600 450;
+n
specifies the number of columns in a row (or rows in a column) that are to be skipped when initializing values. For example, the following statement assigns 500 to JAN and 900 to JUL:
init sales jan 500 +5 900;
n*value
assigns value to n columns in the row (or rows in the column). For example, both of the following statements assign 500 to columns JAN through JUN and 1000 to JUL through DEC:
init sales jan 6*500 jul 6*1000;
init sales 6*500 6*1000;
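The placement rules for locator-name, +n, and n*value can be traced with a small sketch. It is written in Python rather than SAS and is only an approximation of the behavior described above: the function name `apply_init` is hypothetical, it handles a single anchor row whose columns are the twelve months, and it takes the text that follows the anchor name as one string.

```python
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def apply_init(names, spec):
    """Return {name: value} for an INIT-style specification.

    names lists the columns of the anchor row in report order.
    spec is the text after the anchor name, e.g. "jan 500 +5 900".
    """
    values = {}
    pos = 0  # index of the next column to receive a value
    for tok in spec.split():
        if tok in names:
            # locator-name: jump to the named column
            pos = names.index(tok)
        elif tok.startswith("+"):
            # +n: skip n columns
            pos += int(tok[1:])
        else:
            # plain value, or n*value repeated over n columns
            count, _sep, value = tok.rpartition("*")
            for _ in range(int(count) if count else 1):
                values[names[pos]] = float(value)
                pos += 1
    return values
```

For instance, `apply_init(MONTHS, "jan 500 +5 900")` places 500 in JAN, then skips five columns so that 900 lands in JUL, and the two 6*500 forms shown above produce the same assignments.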
Programming Statements
You can use most SAS programming statements the same way you use them in the DATA step. Also, all DATA step functions can be used in the COMPUTAB procedure. Lines written by the PUT statement are not integrated with the COMPUTAB report. PUT statement output is written to the SAS log. The automatic variable _N_ can be used; its value is the number of observations read or the number read in the current BY group, if a BY statement is used. FIRST.variable and LAST.variable references cannot be used. The following statements are also available in PROC COMPUTAB:
ABORT ARRAY ATTRIB assignment statement CALL DELETE DO iterative DO DO UNTIL DO WHILE END FOOTNOTE FORMAT GOTO IF-THEN/ELSE LABEL LINK PUT RETAIN SELECT STOP sum statement TITLE
The programming statements can be assigned labels ROWxxxxx: or COLxxxxx: to indicate the start of a row and column block, respectively. Statements in a row block create or change values in all the columns in the specified rows. Similarly, statements in a column block create or change values in all the rows in the specified columns. There is an implied RETURN statement before each new row or column block. Thus, the flow of execution does not leave the current row (column) block before the block repeats for all columns (rows). Row and column variables and nonretained variables are initialized prior to each execution of the block. The next COLxxxxx: label, ROWxxxxx: label, or the end of the PROC COMPUTAB step signals the end of a row (column) block. Column blocks and row blocks can be mixed in any order. In some cases, performing calculations in different orders can lead to different results.
See the sections Program Flow Example on page 448, Order of Calculations on page 451, and Controlling Execution within Row and Column Blocks on page 453 for more information.
BY Statement
BY variables ;
A BY statement can be used with PROC COMPUTAB to obtain separate reports for observations in groups defined by the BY variables. At the beginning of each BY group, before PROC COMPUTAB reads any observations, all table values are set to 0 unless the INITMISS option or an INIT statement is specified.
SUMBY Statement
SUMBY variables ;
The SUMBY statement produces consolidation tables for variables whose names are in the SUMBY list. Only one SUMBY statement can be used. To use a SUMBY statement, you must use a BY statement. The SUMBY and BY variables must be in the same relative order in both statements. For example:
by a b c;
sumby a b;
This SUMBY statement produces tables that consolidate over values of C within levels of B and over values of B within levels of A. Suppose A has values 1, 2; B has values 1, 2; and C has values 1, 2, 3. Table 9.2 indicates the consolidation tables produced by the SUMBY statement.
Table 9.2

SUMBY Consolidations    Consolidated BY Groups
A=1, B=1                C=1  C=2  C=3
A=1, B=2                C=1  C=2  C=3
A=1                     B=1,C=1  B=1,C=2  B=1,C=3  B=2,C=1  B=2,C=2  B=2,C=3
A=2, B=1                C=1  C=2  C=3
A=2, B=2                C=1  C=2  C=3
A=2                     B=1,C=1  B=1,C=2  B=1,C=3  B=2,C=1  B=2,C=2  B=2,C=3
Two consolidation tables for B are produced for each value of A. The first table consolidates the three tables produced for the values of C while B is 1; the second table consolidates the three tables produced for C while B is 2. Tables are similarly produced for values of A. Nested consolidation tables are produced for B (as described previously) for each value of A. Thus, this SUMBY statement produces a total of six consolidation tables in addition to the tables produced for each BY group. To produce a table that consolidates the entire data set (the equivalent of using PROC COMPUTAB with neither BY nor SUMBY statements), use the special name _TOTAL_ as the first entry in the SUMBY variable list. For example,
sumby _total_ a b;
PROC COMPUTAB then produces consolidation tables for SUMBY variables as well as a consolidation table for all observations. To produce only consolidation tables, use the SUMONLY option in the PROC COMPUTAB statement.
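The set of consolidation tables implied by a BY and SUMBY pair can be enumerated with a short sketch. This is Python rather than SAS, and it rests on a stated assumption: the SUMBY variables are the leading BY variables, as in the example above. The function name `consolidation_tables` and its return layout are hypothetical, and the order in which the procedure prints its tables may differ from the order produced here.

```python
from itertools import product


def consolidation_tables(by_levels, sumby, total=False):
    """List (key, consolidated BY groups) pairs for a SUMBY request.

    by_levels maps each BY variable, in BY order, to its values.
    sumby lists the SUMBY variables; they are assumed to be the
    leading BY variables. total=True mimics listing _TOTAL_ first.
    """
    by_vars = list(by_levels)
    if by_vars[:len(sumby)] != list(sumby):
        raise ValueError("sketch assumes SUMBY variables lead the BY list")
    tables = []

    def walk(prefix):
        depth = len(prefix)
        var = sumby[depth]
        for value in by_levels[var]:
            key = prefix + ((var, value),)
            if depth + 1 < len(sumby):
                walk(key)  # nested tables for the next SUMBY variable
            # Everything after the current variable is consolidated over.
            rest = by_vars[depth + 1:]
            groups = [tuple(zip(rest, combo))
                      for combo in product(*(by_levels[v] for v in rest))]
            tables.append((key, groups))

    if sumby:
        walk(())
    if total:
        groups = [tuple(zip(by_vars, combo))
                  for combo in product(*(by_levels[v] for v in by_vars))]
        tables.append(((("_TOTAL_", None),), groups))
    return tables


levels = {"a": [1, 2], "b": [1, 2], "c": [1, 2, 3]}
tables = consolidation_tables(levels, ["a", "b"])
```

With A and B each taking two values and C taking three, the sketch yields six tables, and adding the grand total gives a seventh that covers all twelve BY groups.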
NOTRANS Option
The NOTRANS option in the PROC COMPUTAB statement prevents the transposition of the input data set. NOTRANS affects the input block, the precedence of row and column options, and the structure of the output data set if the OUT= option is specified. When the input data set is transposed, input variables are among the rows of the COMPUTAB report, and observations compose columns. The reverse is true if the data set is not transposed; therefore, the input block must select rows to receive data values, and input variables are among the columns. Variables from the input data set dominate the format specification and data type. When the input data set is transposed, input variables are among the rows of the report, and row options take precedence over column options. When the input data set is not transposed, input variables are among the columns, and column options take precedence over row options. Variables for the output data set are taken from the dimension (row or column) that contains variables from the input data set. When the input data set is transposed, this dimension is the row dimension; otherwise, the output variables come from the column dimension.
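The precedence rules for formats can be summarized in a small lookup. The sketch below is Python rather than SAS and models only what the text states: a CELL format overrides ROWS and COLUMNS formats, row options win when the input is transposed, and column options win under NOTRANS. The function name `resolve_format`, the dictionary arguments, and the format strings are hypothetical placeholders.

```python
def resolve_format(row, col, cell_formats, row_formats, col_formats,
                   notrans=False):
    """Pick the format that applies to one cell of the report.

    Returns None when no format was given at any level, in which case
    the procedure defaults would apply.
    """
    # A CELL format always wins.
    if (row, col) in cell_formats:
        return cell_formats[(row, col)]
    row_fmt = row_formats.get(row)
    col_fmt = col_formats.get(col)
    # The dimension holding the input variables takes precedence.
    first, second = (col_fmt, row_fmt) if notrans else (row_fmt, col_fmt)
    return first if first is not None else second


# Hypothetical format assignments.
cells = {("pctmarg", "total"): "6.2"}
rows = {"pctmarg": "5.1"}
cols = {"total": "8.0", "c88": "8.0"}
```

For the cell at row PCTMARG and column C88, the row format applies by default and the column format applies when `notrans=True`.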
Table 9.3 shows the CDT before any observation is read in. All the columns and rows are defined with the values initialized to 0.

Table 9.3

             C88     C89     C90   TOTAL
SALES          0       0       0       0
CGS            0       0       0       0
GPROFIT        0       0       0       0
PCTMARG        0       0       0       0
When the first input is read in (year=1988, sales=83, and cgs=52), the input block puts the values for SALES and CGS in the C88 column since year=1988. Also the value for the gross profit for that year (GPROFIT) is calculated as indicated in the following statements:

gprofit = sales-cgs;
c88 = year = 1988;
c89 = year = 1989;
c90 = year = 1990;
Table 9.4 shows the CDT after the first observation is input.

Table 9.4

             C88     C89     C90   TOTAL
SALES         83       0       0       0
CGS           52       0       0       0
GPROFIT       31       0       0       0
PCTMARG        0       0       0       0
Similarly, the second observation (year=1989, sales=106, cgs=85) is put in the second column, and the GPROFIT is calculated to be 21. The third observation (year=1990, sales=120, cgs=114) is put in the third column, and the GPROFIT is calculated to be 6. Table 9.5 shows the CDT after all observations are input.
Table 9.5
            C88    C89    C90    TOTAL
SALES        83    106    120        0
CGS          52     85    114        0
GPROFIT      31     21      6        0
PCTMARG       0      0      0        0
After the input block is executed for each observation in the input data set, the first row or column block is processed. In this case, the column block is
The column block executes for each row, calculating the TOTAL column for each row. Table 9.6 shows the CDT after the column block has executed for the first row (total=83 + 106 + 120). The total sales for the three years is 309.
Table 9.6
            C88    C89    C90    TOTAL
SALES        83    106    120      309
CGS          52     85    114        0
GPROFIT      31     21      6        0
PCTMARG       0      0      0        0
Table 9.7 shows the CDT after the column block has executed for all rows and the values for total cost of goods sold and total gross profit have been calculated.
Table 9.7
            C88    C89    C90    TOTAL
SALES        83    106    120      309
CGS          52     85    114      251
GPROFIT      31     21      6       58
PCTMARG       0      0      0        0
After the column block has been executed for all rows, the next block is processed. The row block is
row: pctmarg = gprofit / cgs * 100;
The row block executes for each column, calculating the PCTMARG for each year and the total (TOTAL column) for three years. Table 9.8 shows the CDT after the row block has executed for all columns.
Table 9.8
            C88      C89      C90    TOTAL
SALES        83      106      120      309
CGS          52       85      114      251
GPROFIT      31       21        6       58
PCTMARG   59.62    24.71     5.26    23.11
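The walkthrough in Tables 9.3 through 9.8 can be checked outside SAS with a small Python model. This is only a sketch of the arithmetic, not of how PROC COMPUTAB is implemented: the dictionary `cdt` stands in for the COMPUTAB data table, and the column-block statement (TOTAL as the sum of C88, C89, and C90) is inferred from the text "total=83 + 106 + 120".

```python
# Hypothetical model of the CDT: report rows by report columns, all cells start at 0.
rows = ["SALES", "CGS", "GPROFIT", "PCTMARG"]
cols = ["C88", "C89", "C90", "TOTAL"]
cdt = {r: {c: 0.0 for c in cols} for r in rows}

observations = [
    {"year": 1988, "sales": 83, "cgs": 52},
    {"year": 1989, "sales": 106, "cgs": 85},
    {"year": 1990, "sales": 120, "cgs": 114},
]

# Input block: executed once per observation.
for obs in observations:
    data = {
        "SALES": obs["sales"],
        "CGS": obs["cgs"],
        "GPROFIT": obs["sales"] - obs["cgs"],   # gprofit = sales-cgs;
    }
    selectors = {
        "C88": obs["year"] == 1988,             # c88 = year = 1988;
        "C89": obs["year"] == 1989,
        "C90": obs["year"] == 1990,
    }
    for col, selected in selectors.items():
        if selected:
            for row, value in data.items():
                cdt[row][col] += value

# Column block: executed once per row (assumed statement: total = c88 + c89 + c90).
for r in rows:
    cdt[r]["TOTAL"] = cdt[r]["C88"] + cdt[r]["C89"] + cdt[r]["C90"]

# Row block: executed once per column (pctmarg = gprofit / cgs * 100).
for c in cols:
    cdt["PCTMARG"][c] = cdt["GPROFIT"][c] / cdt["CGS"][c] * 100

pctmarg = [round(cdt["PCTMARG"][c], 2) for c in cols]
```

Running the blocks in this order reproduces the totals 309, 251, and 58 and a percentage margin of 59.62 for C88.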
Order of Calculations
The COMPUTAB procedure provides alternative programming methods for performing most calculations. New column and row values are formed by adding values from the input data set, directly or with modification, into existing columns or rows. New columns can be formed in the input block or in column blocks. New rows can be formed in the input block or in row blocks. This example illustrates the different ways to collect totals. Table 9.9 is the total sales report for two products, SALES1 and SALES2, during the years 1988–1990. The values for SALES1 and SALES2 in columns C88, C89, and C90 come from the input data set.
Table 9.9
            C88    C89    C90    SALESTOT
SALES1       15     45     80         140
SALES2       30     40     50         120
YRTOT        45     85    130         260
The new column SALESTOT, which is the total sales for each product over three years, can be computed in several different ways:
- in the input block by selecting SALESTOT for each observation:
   salestot = 1;
- in a column block:
   coltot: salestot = c88 + c89 + c90;
In a similar fashion, the new row YRTOT, which is the total sales for each year, can be formed as follows:
- in the input block:
   yrtot = sales1 + sales2;
- in a row block:
   rowtot: yrtot = sales1 + sales2;
Performing some calculations in PROC COMPUTAB in different orders can yield different results, because many operations are not commutative. Be sure to perform calculations in the proper sequence. It might take several column and row blocks to produce the desired report values. Notice that in the previous example, the grand total for all rows and columns is 260 and is the same whether it is calculated from row subtotals or column subtotals. It makes no difference in this case whether you compute the row block or the column block first. However, consider the following example where a new column and a new row are formed:
Table 9.10
              STORE1    STORE2    STORE3    MAX
(row 1)           12        13        27     27
(row 2)           11        15        14     15
TOTAL             23        28        41      ?
The new column MAX contains the maximum value in each row, and the new row TOTAL contains the column totals. MAX is calculated in a column block:
col: max = max(store1,store2,store3);
Notice that either of two values, 41 or 42, is possible for the element in column MAX and row TOTAL. If the row block is first, the value is the maximum of the column totals (41). If the column block is first, the value is the sum of the MAX values (42). Whether to compute a column block before a row block can be a critical decision.
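The effect of block order can be reproduced with a short Python sketch. The row names `ROW1` and `ROW2` and the summing row block are assumptions made for illustration; only the MAX column block appears in the text.

```python
def build_table():
    # Two data rows (hypothetical names) plus the TOTAL row; MAX is a computed column.
    return {
        "ROW1": {"STORE1": 12, "STORE2": 13, "STORE3": 27, "MAX": 0},
        "ROW2": {"STORE1": 11, "STORE2": 15, "STORE3": 14, "MAX": 0},
        "TOTAL": {"STORE1": 0, "STORE2": 0, "STORE3": 0, "MAX": 0},
    }

def column_block(table):
    # Executes for every row: max = max(store1, store2, store3);
    for row in table.values():
        row["MAX"] = max(row["STORE1"], row["STORE2"], row["STORE3"])

def row_block(table):
    # Executes for every column: TOTAL is the sum of the two data rows (assumed statement).
    for col in ("STORE1", "STORE2", "STORE3", "MAX"):
        table["TOTAL"][col] = table["ROW1"][col] + table["ROW2"][col]

row_first = build_table()
row_block(row_first)
column_block(row_first)

col_first = build_table()
column_block(col_first)
row_block(col_first)

print(row_first["TOTAL"]["MAX"], col_first["TOTAL"]["MAX"])
```

With the row block first, the corner cell is the maximum of the column totals; with the column block first, it is the sum of the per-row maxima.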
Column Selection
The following discussion assumes that the NOTRANS option has not been specified. When NOTRANS is specified, this section applies to rows rather than columns. If a COLUMNS statement appears in PROC COMPUTAB, a target column must be selected for the incoming observation. If there is no COLUMNS statement, a new column is added for each observation. When a COLUMNS statement is present and the selection criteria fail to designate a column, the current observation is ignored. Faulty column selection can result in columns or entire tables of 0s (or missing values if the INITMISS option is specified). During execution of the input block, when an observation is read, its values are copied into row variables in the program data vector (PDV).
To select columns, use either the column variable names themselves or the special variable _COL_. Use the column names by setting a column variable equal to some nonzero value. The example in the section "Getting Started: COMPUTAB Procedure" on page 430 uses the logical expression COMPDIV=value, and the result is assigned to the corresponding column variable.
   a = compdiv = 'A';
   b = compdiv = 'B';
   c = compdiv = 'C';
IF statements can also be used to select columns. The following statements are equivalent to the preceding example:
   if      compdiv = 'A' then a = 1;
   else if compdiv = 'B' then b = 1;
   else if compdiv = 'C' then c = 1;
At the end of the input block for each observation, PROC COMPUTAB multiplies numeric input values by any nonzero selector values and adds the result to selected columns. Character values simply overwrite the contents already in the table. If more than one column is selected, the values are added to each of the selected columns. Use the _COL_ variable to select a column by assigning the column number to it. The COMPUTAB procedure automatically initializes column variables and sets the _COL_ variable to 0 at the start of each execution of the input block. At the end of the input block for each observation, PROC COMPUTAB examines the value of _COL_. If the value is nonzero and within range, the row variable values are added to the CDT cells of the _COL_th column, for example,
data rept;
   input div sales cgs;
datalines;
2 106 85
3 120 114
1 83 52
;

proc computab data=rept;
   row div sales cgs;
   columns div1 div2 div3;
   _col_ = div;
run;
The code in this example places the first observation (DIV=2) in column 2 (DIV2), the second observation (DIV=3) in column 3 (DIV3), and the third observation (DIV=1) in column 1 (DIV1).
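As a rough illustration of selection by column number, the following Python sketch mimics the effect of the `_col_ = div;` statement on the three observations. The dictionary `table` is a stand-in for the report table, and skipping an out-of-range column number is an assumption based on the "nonzero and within range" condition described above.

```python
observations = [(2, 106, 85), (3, 120, 114), (1, 83, 52)]   # div, sales, cgs
columns = ["DIV1", "DIV2", "DIV3"]
table = {name: {"DIV": 0, "SALES": 0, "CGS": 0} for name in columns}

for div, sales, cgs in observations:
    col = div                          # plays the role of _col_ = div;
    if 1 <= col <= len(columns):       # zero or out of range: nothing is added
        target = table[columns[col - 1]]
        target["DIV"] += div
        target["SALES"] += sales
        target["CGS"] += cgs

for name in columns:
    print(name, table[name])
```

Each observation lands in the column whose position equals its DIV value, regardless of the order in which the observations are read.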
A row block operates on all columns of the table for a specified row unless restricted in some way. Likewise, a column block operates on all rows for a specified column. Use column names or _COL_ in a row block to execute programming statements conditionally; use row names or _ROW_ in a column block. For example, consider a simple column block that consists of only one statement:
col: total = qtr1 + qtr2 + qtr3 + qtr4;
This column block assigns a value to each row in the TOTAL column. As each row participates in the execution of a column block, the following changes occur:
- Its row variable in the program data vector is set to 1.
- The value of _ROW_ is the number of the participating row.
- The value from each column of the row is copied from the COMPUTAB data table to the program data vector.
To avoid calculating TOTAL on particular rows, use row names or _ROW_. For example,
col: if sales|cost then total = qtr1 + qtr2 + qtr3 + qtr4;
or
col: if _row_ < 3 then total = qtr1 + qtr2 + qtr3 + qtr4;
Row and column blocks can appear in any order, and rows and columns can be selected in each block.
Program Flow
This section describes in detail the different steps in PROC COMPUTAB execution.
Step 1: Define Report Organization and Set Up the COMPUTAB Data Table
Before the COMPUTAB procedure reads in data or executes programming statements, the columns list from the COLUMNS statements and the rows list from the ROWS statements are used to set up a matrix of all columns and rows in the report. This matrix is called the COMPUTAB data table (CDT). When you define columns and rows of the CDT, the COMPUTAB procedure also sets up corresponding variables in working storage called the program data vector (PDV) for programming statements. Data values reside in the CDT but are copied into the program data vector as they are needed for calculations.
The input block is executed once for each observation in the input data set. If there is no input data set, the input block is not executed. The program logic of the input block is as follows:
1. Determine which variables, row or column, are selector variables and which are data variables. Selector variables determine which rows or columns receive values at the end of the block. Data variables contain the values that the selected rows or columns receive. By default, column variables are selector variables and row variables are data variables. If the input data set is not transposed (the NOTRANS option is specified), the roles are reversed.
2. Initialize nonretained program variables (including selector variables) to 0 (or missing if the INITMISS option is specified). Selector variables are temporarily associated with a numeric data item supplied by the procedure. Using these variables to control row and column selection does not affect any other data values.
3. Transfer data from an observation in the data set to data variables in the PDV.
4. Execute the programming statements in the input block by using values from the PDV and storing results in the PDV.
5. Transfer data values from the PDV into the appropriate columns of the CDT. If a selector variable for a row or column has a nonmissing and nonzero value, multiply each PDV value for variables used in the report by the selector variable and add the results to the selected row or column of the CDT.
Step 3: Calculate Final Values by Using Column Blocks and Row Blocks
Column Blocks
A column block is executed once for each row of the CDT. The program logic of a column block is as follows:
1. Indicate the current row by setting the corresponding row variable in the PDV to 1 and the other row variables to missing. Assign the current row number to the special variable _ROW_.
2. Move values from the current row of the CDT to the respective column variables in the PDV.
3. Execute programming statements in the column block by using the column values in the PDV. Here new columns can be calculated and old ones adjusted.
4. Move the values back from the PDV to the current row of the CDT.
Row Blocks
A row block is executed once for each column of the CDT. The program logic of a row block is as follows:
1. Indicate the current column by setting the corresponding column variable in the PDV to 1 and the other column variables to missing. Assign the current column number to the special variable _COL_.
2. Move values from the current column of the CDT to the respective row variables in the PDV.
3. Execute programming statements in the row block by using the row values in the PDV. Here new rows can be calculated and old ones adjusted.
4. Move the values back from the PDV to the current column of the CDT.
See the section "Controlling Execution within Row and Column Blocks" on page 453.
Any number of column blocks and row blocks can be used. Each can include any number of programming statements. The values of row variables and column variables are determined by the order in which different row-block and column-block programming statements are processed. These values can be modified throughout the COMPUTAB procedure, and final values are printed in the report.
where row-index and column-index can be numbers, character literals, numeric variables, character variables, or expressions that produce a number or a name. If an index is numeric, it must be within range; if it is character, it must name a row or column. References to TABLE elements can appear on either side of an equal sign in an assignment statement and can be used in a SAS expression.
Reserved Words
Certain words are reserved for special use by the COMPUTAB procedure, and using these words as variable names can lead to syntax errors or warnings. They are:
COLUMN COLUMNS COL COLS _COL_ ROW ROWS _ROW_ INIT _N_ TABLE
Missing Values
Missing values for variables in programming statements are treated in the same way that missing values are treated in the DATA step; that is, missing values used in expressions propagate missing values to the result. See SAS Language: Reference for more information about missing values. Missing values in the input data are treated as follows in the COMPUTAB report table. At the end of the input block, either one or more rows or one or more columns can have been selected to receive values from the program data vector (PDV). Numeric data values from variables in the PDV are added into selected report table rows or columns. If a PDV value is missing, the values already in the selected rows or columns for that variable are unchanged by the current observation. Other values from the current observation are added to table values as usual.
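The rule for missing input values can be pictured with a minimal Python sketch. Here `None` stands in for a SAS missing value and `add_observation` is a hypothetical helper, not part of any SAS interface; it only mirrors the behavior described above, where a missing PDV value leaves the corresponding table cell unchanged while the other values are added as usual.

```python
def add_observation(column_cells, pdv_values):
    """Add PDV values into one selected column; None models a missing value."""
    for name, value in pdv_values.items():
        if value is None:
            continue                  # missing: the existing cell is left unchanged
        column_cells[name] += value   # nonmissing: added to the table value
    return column_cells

selected_column = {"SALES": 100, "CGS": 60}
add_observation(selected_column, {"SALES": None, "CGS": 15})
print(selected_column)
```

The SALES cell keeps its earlier value because the incoming value is missing, and only CGS changes.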
The BY variables contain values for the current BY group. For observations in the output data set from consolidation tables, the consolidated BY variables have missing values. The special variable _TYPE_ is a numeric variable that can have one of three values: 1, 2, or 3. _TYPE_= 1 indicates observations from the normal report table produced for each BY group; _TYPE_= 2 indicates observations from the _TOTAL_ consolidation table; _TYPE_= 3 indicates observations from other consolidation tables. _TYPE_= 2 and _TYPE_= 3 observations have one or more BY variables with missing values. The special variable _NAME_ is a character variable of length 8 that contains the row or column name associated with the observation from the report table. If the input data set is transposed, _NAME_ contains column names; otherwise, _NAME_ contains row names. If the input data set is transposed, the remaining variables in the output data set are row variables from the report table. They are column variables if the input data set is not transposed.
   date9. la atl ch ny;
120 160 200 240 280 320 360 400 440 480 520
130 170 210 250 290 330 370 410 450 490 530
The following PROC COMPUTAB statements select columns by setting _COL_ to an appropriate value. The PCT1, PCT2, PCT3, and PCT4 columns represent the percentage contributed by each city to the total for the quarter. These statements produce Output 9.1.1.
proc computab data=bookings cspace=1 cwidth=6;

   columns qtr1 pct1 qtr2 pct2 qtr3 pct3 qtr4 pct4;
   columns qtr1-qtr4 / format=6.;
   columns pct1-pct4 / format=6.2;
   rows    la atl ch ny total;

   /* column selection */
   _col_ = qtr( reptdate ) * 2 - 1;

   /* copy qtr column values temporarily into pct columns */
   colcopy:
      pct1 = qtr1;
      pct2 = qtr2;
      pct3 = qtr3;
      pct4 = qtr4;

   /* calculate total row for all columns */
   /* calculate percentages for all rows in pct columns only */
   rowcalc:
      total = la + atl + ch + ny;
      if mod( _col_, 2 ) = 0 then do;
         la  = la  / total * 100;
         atl = atl / total * 100;
         ch  = ch  / total * 100;
         ny  = ny  / total * 100;
         total = 100;
      end;
run;
Output 9.1.1
          QTR1    PCT1    QTR2    PCT2    QTR3    PCT3    QTR4    PCT4
LA         420   22.58     780   23.64    1140   24.05    1500   24.27
ATL        450   24.19     810   24.55    1170   24.68    1530   24.76
CH         480   25.81     840   25.45    1200   25.32    1560   25.24
NY         510   27.42     870   26.36    1230   25.95    1590   25.73
TOTAL     1860  100.00    3300  100.00    4740  100.00    6180  100.00
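The column selection expression and the percentage calculation can be verified with a few lines of Python. This is an illustrative sketch only: `qtr` here is a local helper that takes a month number, whereas the SAS QTR function takes a date value, and the first-quarter figures are taken from the report.

```python
def qtr(month):
    # Quarter number (1-4) for a month number (1-12).
    return (month - 1) // 3 + 1

def target_column(month):
    # Mirrors _col_ = qtr( reptdate ) * 2 - 1; quarter columns sit at odd positions.
    return qtr(month) * 2 - 1

column_names = ["QTR1", "PCT1", "QTR2", "PCT2", "QTR3", "PCT3", "QTR4", "PCT4"]
selected = [column_names[target_column(m) - 1] for m in (1, 4, 7, 10)]

# First-quarter bookings by city, as shown in the QTR1 column of the report.
qtr1 = {"LA": 420, "ATL": 450, "CH": 480, "NY": 510}
total = sum(qtr1.values())
pct1 = {city: round(value / total * 100, 2) for city, value in qtr1.items()}

print(selected)
print(total, pct1)
```

Because the PCT columns occupy the even positions, the test `mod(_col_, 2) = 0` in the row block restricts the percentage conversion to those columns.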
Using the same input data, the next set of statements shows the usefulness of arrays in allowing PROC COMPUTAB to work in two directions at once. Arrays in larger programs can both reduce the amount of program source code and simplify otherwise complex methods of referring to rows and columns. The same report as in Output 9.1.1 is produced.
proc computab data=bookings cspace=1 cwidth=6;

   columns qtr1 pct1 qtr2 pct2 qtr3 pct3 qtr4 pct4;
   columns qtr1-qtr4 / format=6.;
   columns pct1-pct4 / format=6.2;
   rows    la atl ch ny total;

   array pct[4] pct1-pct4;
   array qt[4] qtr1-qtr4;
   array rowlist[5] la atl ch ny total;

   /* column selection */
   _col_ = qtr(reptdate) * 2 - 1;

   /* copy qtr column values temporarily into pct columns */
   colcopy:
      do i = 1 to 4;
         pct[i] = qt[i];
      end;

   /* calculate total row for all columns */
   /* calculate percentages for all rows in pct columns only */
   rowcalc:
      total = la + atl + ch + ny;
      if mod(_col_,2) = 0 then
         do i = 1 to 5;
            rowlist[i] = rowlist[i] / total * 100;
         end;
run;
datalines; BUDGET JAN1989 BUDGET FEB1989 BUDGET MAR1989 ACTUAL JAN1989 ACTUAL FEB1989 ;
title 'Computab Report without Any Specifications';
proc computab data=incomrep;
run;
To exclude the budgeted values from your report, select columns for ACTUAL observations only. To remove unwanted variables, specify the variables you want in a ROWS statement.
title 'Column Selection by Month';

proc computab data=incomrep;
   rows sales--other;
   columns jana feba mara;
   mnth = month(date);
   if type = 'ACTUAL';
   jana = mnth = 1;
   feba = mnth = 2;
   mara = mnth = 3;
run;
To complete the report, compute new rows from existing rows. This is done in a row block (although it can also be done in the input block). Add a new column (QTR1) that accumulates all the actual data. The NOZERO option suppresses the zero column for March. The output produced by these statements is shown in Output 9.2.3.
proc computab data=incomrep;

   /* add a new column to be selected */
   /* qtr1 column will be selected several times */
   columns actual1-actual3 qtr1 / nozero;
   array collist[3] actual1-actual3;
   rows sales retdis netsales tcos grosspft selling randd general
        admin deprec operexp operinc other taxblinc taxes netincom;

   if type='ACTUAL';
   i = month(date);
   if i <= 3 then qtr1 = 1;
   collist[i]=1;

   rowcalc:
      if sales = . then return;
      netsales = sales - retdis;
      grosspft = netsales - tcos;
      operexp  = selling + randd + general + admin + deprec;
      operinc  = grosspft - operexp;
      taxblinc = operinc + other;
      netincom = taxblinc - taxes;
run;
Now that you have all the numbers calculated, add specifications to improve the report's appearance. Specify titles, row and column labels, and formats. The report produced by these statements is shown in Output 9.2.4.
/* now get the report to look the way you want it */
title  'Pro Forma Income Statement';
title2 'XYZ Computer Services, Inc.';
title3 'Period to Date Actual';
title4 'Amounts in Thousands';
proc computab data=incomrep;

   columns actual1-actual3 qtr1 / nozero f=comma7. +3;
   array collist[3] actual1-actual3;
   columns actual1 / 'Jan';
   columns actual2 / 'Feb';
   columns actual3 / 'Mar';
   columns qtr1    / 'Total Qtr 1';
   rows sales    / 'Gross Sales';
   rows retdis   / 'Less Returns & Discounts';
   rows netsales / 'Net Sales' +3 ol;
   rows tcos     / 'Total Cost of Sales';
   rows grosspft / 'Gross Profit';
   rows selling  / 'Operating Expenses: Selling';
   rows randd    / 'R & D';
   rows general  / +3;
   rows admin    / 'Administrative';
   rows deprec   / 'Depreciation' ul;
   rows operexp  / ' ';
   rows operinc  / 'Operating Income';
   rows other    / 'Other Income/-Expense' ul;
   rows taxblinc / 'Taxable Income';
   rows taxes    / 'Income Taxes' ul;
   rows netincom / 'Net Income' dul;

   if type = 'ACTUAL';
   i = month( date );
   collist[i] = 1;

   colcalc:
      qtr1 = actual1 + actual2 + actual3;

   rowcalc:
      if sales = . then return;
      netsales = sales - retdis;
      grosspft = netsales - tcos;
      operexp  = selling + randd + general + admin + deprec;
      operinc  = grosspft - operexp;
      taxblinc = operinc + other;
      netincom = taxblinc - taxes;
run;
Output 9.2.4 Specifying Titles, Row and Column Labels, and Formats
                   Pro Forma Income Statement
                   XYZ Computer Services, Inc.
                      Period to Date Actual
                      Amounts in Thousands

                                                       Total
                                    Jan       Feb      Qtr 1

Gross Sales                       4,900     5,100     10,000
Less Returns & Discounts            505       480        985
                              --------- --------- ---------
Net Sales                         4,395     4,620      9,015
Total Cost of Sales               2,100     2,400      4,500
Gross Profit                      2,295     2,220      4,515
Operating Expenses:
   Selling                          430       510        940
   R & D                            130       110        240
   GENERAL                          410       390        800
   Administrative                   200       230        430
   Depreciation                      14        15         29
                              --------- --------- ---------
                                  1,184     1,255      2,439
Operating Income                  1,111       965      2,076
Other Income/-Expense                -8         2         -6
                              --------- --------- ---------
Taxable Income                    1,103       967      2,070
Income Taxes                        500       490        990
                              --------- --------- ---------
Net Income                          603       477      1,080
                              ========= ========= =========
data incomrep;
   length type $ 8;
   input type :$8. date :monyy7.
         sales retdis tcos selling randd
         general admin deprec other taxes;
   format date monyy7.;
datalines;
BUDGET JAN1989 4600 300 2200 480 110 500 210
BUDGET FEB1989 4700 330 2300 500 110 500 200
BUDGET MAR1989 4800 360 2600 500 120 600 250
ACTUAL JAN1989 4900 505 2100 430 130 410 200
ACTUAL FEB1989 5100 480 2400 510 110 390 230
;

title  'Pro Forma Income Statement';
title2 'XYZ Computer Services, Inc.';
title3 'Budget Analysis';
title4 'Amounts in Thousands';
options linesize=96;
proc computab data=incomrep;

   columns cmbud cmact cmpct ytdbud ytdact ytdpct / zero=' ';
   columns cmbud--cmpct   / mtitle='- Current Month: February -';
   columns ytdbud--ytdpct / mtitle='- Year To Date -';
   columns cmbud ytdbud   / 'Budget' f=comma6.;
   columns cmact ytdact   / 'Actual' f=comma6.;
   columns cmpct ytdpct   / '%' f=7.2;
   columns cmbud--ytdpct  / '-';
   columns ytdbud         / _titles_;
   retain curmo 2;   /* current month: February */

   rows sales    / 'Gross Sales';
   rows retdis   / 'Less Returns & Discounts';
   rows netsales / 'Net Sales' +3 ol;
   rows tcos     / 'Total Cost of Sales';
   rows grosspft / 'Gross Profit' +3;
   rows selling  / 'Operating Expenses: Selling';
   rows randd    / 'R & D';
   rows general  / +3;
   rows admin    / 'Administrative';
   rows deprec   / 'Depreciation' ul;
   rows operexp  / ' ';
   rows operinc  / 'Operating Income' ol;
   rows other    / 'Other Income/-Expense' ul;
   rows taxblinc / 'Taxable Income';
   rows taxes    / 'Income Taxes' ul;
   rows netincom / 'Net Income' dul;

   cmbud  = type = 'BUDGET' & month(date) = curmo;
   cmact  = type = 'ACTUAL' & month(date) = curmo;
   ytdbud = type = 'BUDGET' & month(date) <= curmo;
   ytdact = type = 'ACTUAL' & month(date) <= curmo;

   rowcalc:
      if cmpct | ytdpct then return;
      netsales = sales - retdis;
      grosspft = netsales - tcos;
      operexp  = selling + randd + general + admin + deprec;
      operinc  = grosspft - operexp;
      taxblinc = operinc + other;
      netincom = taxblinc - taxes;

   colpct:
      if cmbud & cmact then cmpct = 100 * cmact / cmbud;
      if ytdbud & ytdact then ytdpct = 100 * ytdact / ytdbud;
run;
                         - Current Month: February -      ----- Year To Date -----
                          Budget    Actual         %      Budget    Actual         %

   Selling                   500       510    102.00         980       940     95.92
   R & D                     110       110    100.00         220       240    109.09
   GENERAL                   500       390     78.00       1,000       800     80.00
   Administrative            200       230    115.00         410       430    104.88
   Depreciation               14        15    107.14          28        29    103.57
                       --------- --------- ---------   --------- --------- ---------
                           1,324     1,255     94.79       2,638     2,439     92.46
                       --------- --------- ---------   --------- --------- ---------
Operating Income             746       965    129.36       1,532     2,076    135.51
Other Income/-Expense                    2                    -8        -6     75.00
                       --------- --------- ---------   --------- --------- ---------
Taxable Income               746       967    129.62       1,524     2,070    135.83
Income Taxes                 480       490    102.08         990       990    100.00
                       --------- --------- ---------   --------- --------- ---------
Net Income                   266       477    179.32         534     1,080    202.25
                       ========= ========= =========   ========= ========= =========
region month sold revenue recd cost; 2465 2550 5525 4900 1666
proc format;
   value divfmt 1='Equipment'
                2='Publishing';
   value regfmt 1='North Central'
                2='Northeast'
                3='South'
                4='West';
run;

proc sort data=product;
   by div region pcode;
run;

title1 'XYZ Development Corporation';
title2 'Corporate Headquarters: New York, NY';
title3 'Profit Summary';
title4 ' ';

options linesize=96;
proc computab data=product sumonly;
   by div region pcode;
   sumby _total_ div region;

   format div divfmt.;
   format region regfmt.;
   label div = 'DIVISION';

   /* specify order of columns and column titles */
   columns jan feb mar qtr1 / mtitle='- first quarter -' nozero;
   columns apr may jun qtr2 / mtitle='- second quarter -' nozero;
   columns jul aug sep qtr3 / mtitle='- third quarter -' nozero;
   columns oct nov dec qtr4 / mtitle='- fourth quarter -' nozero;
   column jan  / 'January' '=';
   column feb  / 'February' '=';
   column mar  / 'March' '=';
   column qtr1 / 'Quarter Summary' '=';
   column apr  / 'April' '=' _page_;
   column may  / 'May' '=';
   column jun  / 'June' '=';
   column qtr2 / 'Quarter Summary' '=';
   column jul  / 'July' '=' _page_;
   column aug  / 'August' '=';
   column sep  / 'September' '=';
   column qtr3 / 'Quarter Summary' '=';
   column oct  / 'October' '=' _page_;
   column nov  / 'November' '=';
   column dec  / 'December' '=';
   column qtr4 / 'Quarter Summary' '=';

   /* specify order of rows and row titles */
   row sold    / 'Number Sold' f=8.;
   row revenue / 'Sales Revenue';
   row recd    / 'Number Received' f=8.;
   row cost    / 'Cost of Items Received';
   row profit  / 'Profit Within Period' ol;
   row pctmarg / 'Profit Margin' dul;

   /* select column for appropriate month */
   _col_ = month + ceil( month / 3 ) - 1;

   /* calculate quarterly summary columns */
   colcalc:
      qtr1 = jan + feb + mar;
      qtr2 = apr + may + jun;
      qtr3 = jul + aug + sep;
      qtr4 = oct + nov + dec;

   /* calculate profit rows */
   rowcalc:
      profit = revenue - cost;
      if cost > 0 then pctmarg = profit / cost * 100;
run;
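The selection expression `_col_ = month + ceil( month / 3 ) - 1` skips over the quarterly summary columns that are interleaved with the months. A short Python check, written only as an illustration of the arithmetic, confirms that each month number maps to its own month column in the order given by the COLUMNS statements.

```python
import math

# Column order as declared by the four COLUMNS statements.
columns = [
    "JAN", "FEB", "MAR", "QTR1",
    "APR", "MAY", "JUN", "QTR2",
    "JUL", "AUG", "SEP", "QTR3",
    "OCT", "NOV", "DEC", "QTR4",
]

def select_column(month):
    # Mirrors _col_ = month + ceil( month / 3 ) - 1; result is a 1-based position.
    return month + math.ceil(month / 3) - 1

selected = [columns[select_column(m) - 1] for m in range(1, 13)]
print(selected)
```

Positions 4, 8, 12, and 16 are never selected by an observation; they are filled later by the column block that sums each quarter.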
SUMMARY TABLE:  DIVISION=Equipment region=North Central

                          ------------- first quarter -------------
                                                            Quarter
                           January   February      March    Summary
                          =========  =========  =========  =========
Number Sold                     198                              540
Sales Revenue              22090.00                         62940.00
Number Received                 255                              682
Cost of Items Received     24368.00   20104.00   19405.00   63877.00

SUMMARY TABLE:  DIVISION=Equipment region=Northeast

                                                            Quarter
                           January   February      March    Summary
                          =========  =========  =========  =========
Number Sold                      82
Sales Revenue               9860.00
Number Received                 162
Cost of Items Received     16374.00    6325.00   12333.00   35032.00

SUMMARY TABLE:  DIVISION=Equipment

                                                            Quarter
                           January   February      March    Summary
                          =========  =========  =========  =========
Number Sold                     280
Sales Revenue              31950.00
Number Received                 417
Cost of Items Received     40742.00   26429.00   31738.00   98909.00
Output 9.4.3 shows the consolidation report of profit summary over both divisions and regions.
SUMMARY TABLE:  TOTALS

                                                            Quarter
                           January   February      March    Summary
                          =========  =========  =========  =========
Number Sold                     590
Sales Revenue              41790.00
Number Received                 656
Cost of Items Received     46360.00   35359.00   40124.00  121843.00
region month sold revenue recd cost; 2465 2550 5525 4900 1666
/* create data set, profit */
proc computab data=sorted notrans out=profit noprint;
   by div region;
   sumby div;

   /* specify order of rows and row titles */
   row jan feb mar qtr1;
   row apr may jun qtr2;
   row jul aug sep qtr3;
   row oct nov dec qtr4;

   /* specify order of columns and column titles */
   columns sold revenue recd cost profit pctmarg;

   /* select row for appropriate month */
   _row_ = month + ceil( month / 3 ) - 1;

   /* calculate quarterly summary rows */
   rowcalc:
      qtr1 = jan + feb + mar;
      qtr2 = apr + may + jun;
      qtr3 = jul + aug + sep;
      qtr4 = oct + nov + dec;

   /* calculate profit columns */
   colcalc:
      profit = revenue - cost;
      if cost > 0 then pctmarg = profit / cost * 100;
run;

/* make a partial listing of the output data set */
options linesize=96;
proc print data=profit(obs=10) noobs;
run;
Because the NOTRANS option is specified, column names become variables in the data set. REGION has missing values in the output data set for observations associated with consolidation tables. The output data set PROFIT, in conjunction with the option NOPRINT, illustrates how you can use the computational features of PROC COMPUTAB for creating additional rows and columns as in a spreadsheet without producing a report. Output 9.5.1 shows a partial listing of the output data set PROFIT.
div  region  _TYPE_  _NAME_   sold  revenue    cost   PROFIT   PCTMARG
  1       1       1  JAN       198    22090   24368    -2278    -9.348
  1       1       1  FEB       223    26830   20104     6726    33.456
  1       1       1  MAR       119    14020   19405    -5385   -27.751
  1       1       1  QTR1      540    62940   63877     -937    -1.467
  1       1       1  APR        82     9860   16374    -6514   -39.783
  1       1       1  MAY       180    21330    6325    15005   237.233
  1       1       1  JUN       183    21060   12333     8727    70.761
  1       1       1  QTR2      445    52250   35032    17218    49.149
  1       1       1  JUL       194    23210   10310    12900   125.121
  1       1       1  AUG       153    17890   16704     1186     7.100
The following statements illustrate how PROC FORECAST makes a total market forecast for the next four quarters:
/* forecast the total number of units to be */
/* sold in the next four quarters           */
proc forecast out=outcome trend=2 interval=qtr lead=4;
   id date;
   var units;
run;
The macros WHATIF and SHOW build a report table and provide the flexibility of examining alternate what-if situations. The row and column calculations of PROC COMPUTAB compute the income statement. With macros stored in a macro library, the only statements required with PROC COMPUTAB are macro invocations and TITLE statements.
/* set up rows and columns of report and initialize */
/* market share and program constants               */
%macro whatif(mktshr=,price=,ucost=,taxrate=,numshar=,overhead=);

   columns mar   / 'March';
   columns jun   / 'June';
   columns sep   / 'September';
   columns dec   / 'December';
   columns total / 'Calculated Total';
   rows mktshr / 'Market Share' f=5.2;
   rows tunits / 'Market Forecast';
   rows units  / 'Items Sold';
   rows sales  / 'Sales';
   rows cost   / 'Cost of Goods';
   rows ovhd   / 'Overhead';
   rows gprof  / 'Gross Profit';
   rows tax    / 'Tax';
   rows pat    / 'Profit After Tax';
   rows earn   / 'Earnings per Share';

   rows mktshr--earn / skip;
   rows sales--earn  / f=dollar12.2;
   rows tunits units / f=comma12.2;

   /* initialize market share values */
   init mktshr &mktshr;

   /* define constants */
   retain price &price ucost &ucost taxrate &taxrate
          numshar &numshar;

   /* retain overhead and sales from previous quarter */
   retain prevovhd &overhead prevsale;
%mend whatif;

/* perform calculations and print the specified rows */
%macro show(rows);

   /* initialize list of row names */
   %let row1 = mktshr;
   %let row2 = tunits;
   %let row3 = units;
   %let row4 = sales;
   %let row5 = cost;
   %let row6 = ovhd;
   %let row7 = gprof;
   %let row8 = tax;
   %let row9 = pat;
   %let row10 = earn;

   /* find parameter row names in list and eliminate */
   /* them from the list of noprint rows             */
   %let n = 1;
   %let word = %scan(&rows,&n);
   %do %while(&word NE );
      %let i = 1;
      %let row11 = &word;
      %do %while(&&row&i NE &word);
         %let i = %eval(&i+1);
      %end;
      %if &i<11 %then %let row&i = ;
      %let n = %eval(&n+1);
      %let word = %scan(&rows,&n);
   %end;

   rows &row1 &row2 &row3 &row4 &row5 &row6 &row7
        &row8 &row9 &row10 dummy / noprint;

   /* select column using lead values from proc forecast */
   mar = _lead_ = 1;
   jun = _lead_ = 2;
   sep = _lead_ = 3;
   dec = _lead_ = 4;

   rowreln:;
      /* inter-relationships */
      share = round( mktshr, 0.01 );
      tunits = units;
      units = share * tunits;
      sales = units * price;
      cost = units * ucost;

      /* calculate overhead */
      if mar then prevsale = sales;
      if sales > prevsale
         then ovhd = prevovhd + .05 * ( sales - prevsale );
         else ovhd = prevovhd;
      prevovhd = ovhd;
      prevsale = sales;
      gprof = sales - cost - ovhd;
      tax = gprof * taxrate;
      pat = gprof - tax;
      earn = pat / numshar;

   coltot:;
      if mktshr
         then total = ( mar + jun + sep + dec ) / 4;
         else total = mar + jun + sep + dec;

%mend show;
run;
The following PROC COMPUTAB statements use the PROC FORECAST output data set with invocations of the macros defined previously to perform a what-if analysis of the predicted income statement. The report is shown in Output 9.6.1.
title1 'Fleet Footwear, Inc.';
title2 'Marketing Analysis Income Statement';
title3 'Based on Forecasted Unit Sales';
title4 'All Values Shown';

options linesize=96;
proc computab data=outcome cwidth=12;
   %whatif(mktshr=.02 .07 .15 .25,price=38.00,
           ucost=20.00,taxrate=.48,numshar=15000,overhead=5000);
   %show(mktshr tunits units sales cost ovhd gprof tax pat earn);
run;
                            March         June    September     December
Market Share                 0.02         0.07         0.15         0.25
Market Forecast         23,663.94    24,169.61    24,675.27    25,180.93
Items Sold                 473.28     1,691.87     3,701.29     6,295.23
Sales                  $17,984.60   $64,291.15  $140,649.03  $239,218.83
Cost of Goods           $9,465.58   $33,837.45   $74,025.80  $125,904.65
Overhead                $5,000.00    $7,315.33   $11,133.22   $16,061.71
Gross Profit            $3,519.02   $23,138.38   $55,490.00   $97,252.47
Tax                     $1,689.13   $11,106.42   $26,635.20   $46,681.19
Profit After Tax        $1,829.89   $12,031.96   $28,854.80   $50,571.28
Earnings per Share          $0.12        $0.80        $1.92        $3.37
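The row relationships in the SHOW macro can be traced in plain Python to see how the retained overhead and previous-quarter sales carry from one column to the next. This is a sketch of the arithmetic only: it starts from the Market Forecast values as printed (rounded to two decimals), so results can differ from the report in the last cent, and it ignores the TOTAL column.

```python
forecast = [23663.94, 24169.61, 24675.27, 25180.93]   # Market Forecast, as printed
shares = [0.02, 0.07, 0.15, 0.25]                      # mktshr values passed to %whatif
price, ucost, taxrate, numshar = 38.00, 20.00, 0.48, 15000

prev_ovhd = 5000.0     # retain prevovhd &overhead
prev_sale = None       # retain prevsale

report = []
for quarter, (tunits, mktshr) in enumerate(zip(forecast, shares)):
    share = round(mktshr, 2)
    units = share * tunits
    sales = units * price
    cost = units * ucost
    if quarter == 0:                       # if mar then prevsale = sales;
        prev_sale = sales
    if sales > prev_sale:
        ovhd = prev_ovhd + 0.05 * (sales - prev_sale)
    else:
        ovhd = prev_ovhd
    prev_ovhd, prev_sale = ovhd, sales     # carried into the next quarter
    gprof = sales - cost - ovhd
    tax = gprof * taxrate
    pat = gprof - tax
    report.append({"ovhd": ovhd, "gprof": gprof, "pat": pat, "earn": pat / numshar})

for row in report:
    print({name: round(value, 2) for name, value in row.items()})
```

Overhead grows by 5% of each quarter's increase in sales, which is why it rises from 5,000.00 in March to about 16,061.71 in December.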
The following statements produce a similar report for different values of market share and unit costs. The report in Output 9.6.2 displays the values for the market share, market forecast, sales, after-tax profit, and earnings per share.
title3 'Revised';
title4 'Selected Values Shown';

options linesize=96;
proc computab data=outcome cwidth=12;
   %whatif(mktshr=.01 .06 .12 .20,price=38.00,
           ucost=23.00,taxrate=.48,numshar=15000,overhead=5000);
   %show(mktshr tunits sales pat earn);
run;
Output 9.6.2 Report That Uses Macro Invocations for Selected Values
             Fleet Footwear, Inc.
     Marketing Analysis Income Statement
                   Revised
            Selected Values Shown

                        Calculated Total
Market Share                        0.10
Market Forecast                97,689.75
Sales                        $367,993.28
Profit After Tax              $56,502.84
Earnings per Share                 $3.77
   col qtr2 / 'Two';
   col qtr3 / 'Three';
   col qtr4 / 'Four';
   row begcash / 'Beginning Cash';
   row netinc  / 'Income' 'Net income';
   row depr    / 'Depreciation';
   row borrow;
   row subtot1 / 'Subtotal';
   row invest  / 'Expenditures' 'Investment';
   row tax     / 'Taxes';
   row div     / 'Dividend';
   row adv     / 'Advertising';
   row subtot2 / 'Subtotal';
   row cashflow/ skip;
   row irret   / 'Internal Rate of Return' zero=' ';
   rows depr borrow subtot1 tax div adv subtot2 / +3;

   retain cashin -5;
   _col_ = qtr( date );

   rowblock:
      subtot1 = netinc + depr + borrow;
      subtot2 = tax + div + adv;
      begcash = cashin;
      cashflow = begcash + subtot1 - subtot2;
      irret = cashflow;
      cashin = cashflow;

   colblock:
      if begcash then cashin = qtr1;
      if irret then do;
         temp = irr( 4, cashin, qtr1, qtr2, qtr3, qtr4 );
         qtr1 = temp;
         qtr2 = 0; qtr3 = 0; qtr4 = 0;
      end;
run;
Output 9.7.1 Report That Uses a RETAIN Statement and the IRR Financial Function

                    Blue Sky Endeavors
                    Financial Summary
              (Dollar Figures in Thousands)

                            Quarter   Quarter   Quarter   Quarter
                                One       Two     Three      Four
Beginning Cash                 -5.0      -1.0       1.0       2.0
Income
Net income                     65.0      68.0      70.0      73.0
   Depreciation                42.0      47.0      49.0      49.0
   BORROW                      32.0      32.0      30.0      30.0
   Subtotal                   139.0     147.0     149.0     152.0
Expenditures
Investment                    126.0     144.0     148.0     148.0
   Taxes                       43.0      45.0      46.0      48.0
   Dividend                    51.0      54.0      55.0      55.0
   Advertising                 41.0      46.0      47.0      47.0
   Subtotal                   135.0     145.0     148.0     150.0
CASHFLOW                       -1.0       1.0       2.0       4.0

Internal Rate of Return        20.9
Chapter 10
the mean of the counts is high, in which case the Gaussian approximation and linear regression may be satisfactory.
The response variable y is numeric and has nonnegative integer values. To allow for variance greater than the mean, specify the dist=negbin option to fit the negative binomial model instead of the Poisson. The following example illustrates the use of PROC COUNTREG. The data are taken from Long (1997). This study examines how factors such as gender (fem), marital status (mar), number of young children (kid5), prestige of the graduate program (phd), and number of articles published by a scientist's mentor (ment) affect the number of articles (art) published by the scientist. The first 10 observations are shown in Figure 10.1.
Figure 10.1 Article Count Data

Obs   art   fem   mar   kid5       phd      ment
  1     3     0     1      2   1.38000    8.0000
  2     0     0     0      0   4.29000    7.0000
  3     4     0     0      0   3.85000   47.0000
  4     1     0     1      1   3.59000   19.0000
  5     1     0     1      0   1.81000    0.0000
  6     1     0     1      1   3.59000    6.0000
  7     0     0     1      1   2.12000   10.0000
  8     0     0     1      0   4.29000    2.0000
  9     3     0     1      2   2.58000    2.0000
 10     3     0     1      1   1.80000    4.0000
The Model Fit Summary, shown in Figure 10.2, lists several details about the model. By default, the COUNTREG procedure uses the Newton-Raphson optimization technique. The maximum log-likelihood value is shown, as well as two information measures, Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (SBC), which can be used to compare competing Poisson models. Smaller values of these criteria indicate better models.
Figure 10.2 Estimation Summary Table for a Poisson Regression

The COUNTREG Procedure

Model Fit Summary
Dependent Variable                      art
Number of Observations                  915
Data Set                    WORK.LONG97DATA
Model                               Poisson
Log Likelihood                        -1651
Maximum Absolute Gradient           0.00181
Number of Iterations                     13
Optimization Method            Quasi-Newton
AIC                                    3314
SBC                                    3343
The parameter estimates of the model and their standard errors are shown in Figure 10.3. All covariates are significant predictors of the number of articles, except for the prestige of the program (phd), which has a p-value of 0.6271.
Figure 10.3 Parameter Estimates of Poisson Regression

Parameter Estimates
                   Standard      Approx
Parameter    DF       Error    Pr > |t|
Intercept     1    0.102982      0.0031
fem           1    0.054614      <.0001
mar           1    0.061375      0.0114
kid5          1    0.040127      <.0001
phd           1    0.026397      0.6271
ment          1    0.002006      <.0001
The following statements fit the negative binomial model. While the Poisson model requires that the conditional mean and conditional variance be equal, the negative binomial model allows for overdispersion; that is, the conditional variance may exceed the conditional mean.
/*-- Negative Binomial Regression --*/
proc countreg data=long97data;
   model art = fem mar kid5 phd ment / dist=negbin(p=2) method=quanew;
run;
The fit summary is shown in Figure 10.4, and parameter estimates are listed in Figure 10.5.
The parameter estimate for _Alpha of 0.4416 is an estimate of the dispersion parameter in the negative binomial distribution. A t test for the hypothesis $H_0\colon \alpha = 0$ is provided. It is highly significant, indicating overdispersion ($p < 0.0001$). The null hypothesis $H_0\colon \alpha = 0$ can also be tested against the alternative $\alpha > 0$ by using the likelihood ratio test, as described by Cameron and Trivedi (1998, pp. 45, 77–78). The likelihood ratio test statistic is equal to $-2(L_P - L_{NB}) = -2(-1651 + 1561) = 180$, which is highly significant, providing strong evidence of overdispersion.
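The likelihood ratio arithmetic can be reproduced with a short DATA step. This is only a sketch: the two log-likelihood values are the rounded figures quoted in the text, the variable names are illustrative, and the p-value uses the usual chi-square reference distribution with one degree of freedom.

```sas
data _null_;
   ll_poisson = -1651;   /* Poisson log likelihood (Figure 10.2)            */
   ll_negbin  = -1561;   /* negative binomial log likelihood used in text   */
   lr = -2 * (ll_poisson - ll_negbin);      /* likelihood ratio statistic   */
   p_value = 1 - probchi(lr, 1);            /* chi-square(1) upper tail     */
   put lr= p_value=;
run;
```

The statistic evaluates to 180, matching the value in the text.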
PROC COUNTREG options ;
   BOUNDS bound1 [ , bound2 . . . ] ;
   BY variables ;
   INIT initvalue1 [ , initvalue2 . . . ] ;
   MODEL dependent variable = regressors / options ;
   OUTPUT options ;
   RESTRICT restriction1 [, restriction2 . . . ] ;
   ZEROMODEL dependent variable ~ zero-inflated regressors / options ;
There can only be one MODEL statement. The ZEROMODEL statement, if used, must appear after the MODEL statement.
Functional Summary
The statements and options used with the COUNTREG procedure are summarized in the following table.
Table 10.1 COUNTREG Functional Summary
Description                                             Statement    Option

Data Set Options
specify the input data set                              COUNTREG     DATA=
write parameter estimates to an output data set         COUNTREG     OUTEST=
write estimates of x_i'β and z_i'γ to an output
   data set                                             OUTPUT       OUT=

Declaring the Role of Variables
specify BY-group processing                             BY

Printing Control Options
print the correlation matrix of the estimates           MODEL        CORRB
print the covariance matrix of the estimates            MODEL        COVB
print a summary iteration listing                       MODEL        ITPRINT
suppress the normal printed output                      COUNTREG     NOPRINT
request all printing options                            MODEL        PRINTALL

Options to Control the Optimization Process
specify maximum number of iterations allowed            COUNTREG     MAXITER=
select the iterative minimization method to use         COUNTREG     METHOD=
set boundary restrictions on parameters                 BOUNDS
set initial values for parameters                       INIT
set linear restrictions on parameters                   RESTRICT
specify other optimization options                      COUNTREG

Model Estimation Options
specify the type of model                               COUNTREG     DIST=
specify the type of model                               MODEL        DIST=
specify the type of covariance matrix                   COUNTREG     COVEST=
suppress the intercept parameter                        MODEL        NOINT
specify the offset variable                             MODEL        OFFSET=
specify the zero-inflated offset variable               ZEROMODEL    OFFSET=
specify the zero-inflated link function                 ZEROMODEL    LINK=

Output Control Options
include covariances in the OUTEST= data set             COUNTREG     COVOUT
output probability of response variable taking
   the current value                                    OUTPUT       PROB=
output expected value of response variable              OUTPUT       PRED=
output estimates of XBeta = x_i'β                       OUTPUT       XBETA=
output estimates of ZGamma = z_i'γ                      OUTPUT       ZGAMMA=
output probability of response variable taking a
   zero value as a result of the zero-generating
   process                                              OUTPUT       PROBZERO=
specifies the input SAS data set. If the DATA= option is not specified, PROC COUNTREG uses the most recently created SAS data set.
writes the covariance matrix for the parameter estimates to the OUTEST= data set. This
Printing Options
NOPRINT
The COVEST= option specifies the type of covariance matrix of the parameter estimates. The quasi-maximum likelihood estimates are computed with COVEST=QML. The default is COVEST=HESSIAN. The supported covariance types are as follows:
   OP        specifies the covariance from the outer product matrix.
   HESSIAN   specifies the covariance from the Hessian matrix.
   QML       specifies the covariance from the outer product and Hessian matrices.
specifies the iterative minimization method to use. The default is METHOD=NRA.
   QN        specifies the quasi-Newton method.
   NRA       specifies the Newton-Raphson method.
   TR        specifies the trust region method.
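As a minimal sketch of how these two options combine, the following step requests quasi-maximum likelihood standard errors and the quasi-Newton optimizer for the Poisson model from the Getting Started example. It assumes that the long97data data set is available and that both options are given in the PROC COUNTREG statement.

```sas
/* Sketch: QML covariance with quasi-Newton optimization */
proc countreg data=long97data covest=qml method=qn;
   model art = fem mar kid5 phd ment / dist=poisson;
run;
```

The parameter estimates do not depend on COVEST=; only the reported standard errors and the statistics derived from them change.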
BOUNDS Statement
BOUNDS bound1 [, bound2 . . . ] ;
The BOUNDS statement imposes simple boundary constraints on the parameter estimates. BOUNDS statement constraints refer to the parameters estimated by the COUNTREG procedure. You can specify any number of BOUNDS statements. Each bound is composed of parameter names, constants, and inequality operators:
item operator item [ operator item [ operator item . . . ] ]
Each item is a constant, a parameter name, or a list of parameter names. Each operator is <, >, <=, or >=. Parameter names are as shown in the ESTIMATE column of the Parameter Estimates table.

You can use both the BOUNDS statement and the RESTRICT statement to impose boundary constraints; however, the BOUNDS statement provides a simpler syntax for specifying these kinds of constraints. See the section RESTRICT Statement on page 493 as well.

The following example illustrates the use of parameter lists to specify boundary constraints. The BOUNDS statement constrains the estimate of the parameter for z to be negative, the parameters for x1 through x10 to be between zero and one, and the parameter for x1 in the zero-inflation model to be less than one:
bounds z < 0, 0 < x1-x10 < 1, Inf_x1 < 1;
BY Statement
BY variables ;
A BY statement can be used with PROC COUNTREG to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the input data set should be sorted in order of the BY variables.
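A sketch of BY-group processing, assuming the long97data data set from the Getting Started example: the data are first sorted by gender (fem), and a separate Poisson model is then fit within each group. The output data set name long97sorted is illustrative, and fem is dropped from the regressors because it is constant within each BY group.

```sas
/* Sort by the BY variable before calling PROC COUNTREG */
proc sort data=long97data out=long97sorted;
   by fem;
run;

/* One Poisson fit per value of fem */
proc countreg data=long97sorted;
   by fem;
   model art = mar kid5 phd ment / dist=poisson;
run;
```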
INIT Statement
INIT initvalue1 [ , initvalue2 . . . ] ;
The INIT statement is used to set initial values for parameters in the optimization. Each initvalue is written as a parameter or parameter list, followed by an optional equal sign (=), followed by a number:
parameter [=] number
Parameter names are as shown in the ESTIMATE column of the Parameter Estimates table.
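For illustration only, the following sketch supplies starting values for two parameters of the negative binomial model from the Getting Started example. The numeric values are arbitrary placeholders rather than recommended starting points; parameters not listed keep their default starting values.

```sas
/* Sketch: user-supplied starting values (placeholder numbers) */
proc countreg data=long97data;
   model art = fem mar kid5 phd ment / dist=negbin(p=2);
   init fem = -0.2, ment = 0.02;
run;
```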
MODEL Statement
MODEL dependent = regressors / options ;
The MODEL statement specifies the dependent variable and independent regressor variables for the regression model. The dependent count variable should take on only nonnegative integer values in the input data set. PROC COUNTREG rounds any positive noninteger count values to the nearest integer. PROC COUNTREG discards any observations with a negative count. Only one MODEL statement can be specified. The following options can be used in the MODEL statement after a slash (/).
DIST=value, D=value
specifies a type of model to be analyzed. If you specify this option in both the MODEL statement and the PROC COUNTREG statement, then only the value in the MODEL statement is used. The supported model types are as follows:
   POISSON | P             Poisson regression model
   NEGBIN(P=1)             negative binomial regression model with a linear variance function
   NEGBIN(P=2) | NEGBIN    negative binomial regression model with a quadratic variance function
   ZIPOISSON | ZIP         zero-inflated Poisson regression
   ZINEGBIN | ZINB         zero-inflated negative binomial regression
NOINT
specifies a variable in the input data set to be used as an offset variable. The offset variable appears as a covariate in the model with its parameter restricted to 1. The offset variable cannot be the response variable, the zero-inflation offset variable (if any), or one of the explanatory variables. The Model Fit Summary gives the name of the data set variable used as the offset variable; it is labeled as Offset.
Printing Options
CORRB
prints the correlation matrix of the parameter estimates. It can also be specified in the PROC COUNTREG statement.
COVB
prints the covariance matrix of the parameter estimates. It can also be specified in the PROC COUNTREG statement.
ITPRINT
prints the objective function and parameter estimates at each iteration. The objective function is the negative log-likelihood function. It can also be specified in the PROC COUNTREG statement.
PRINTALL
requests all printing options. It can also be specified in the PROC COUNTREG statement.
OUTPUT Statement
OUTPUT < OUT=SAS-data-set > < output-options > ;
The OUTPUT statement creates a new SAS data set containing all the variables in the input data set and, optionally, the estimates of $x_i'\beta$, the expected value of the response variable, and the probability that the response variable will take on the current value. Furthermore, if a zero-inflated model was fit, you can request that the output data set contain the estimates of $z_i'\gamma$, and the probability that the response is zero as a result of the zero-generating process. These statistics can be computed for all observations in which the regressors are not missing, even if the response is missing. By adding observations with missing response values to the input data set, you can compute these statistics for new observations or for settings of the regressors not present in the data without affecting the model fit. You can specify only one OUTPUT statement. OUTPUT statement options are as follows:
OUT=SAS-data-set
names the variable containing the predicted value of the response variable.
PROB=name
names the variable containing the probability of the response variable taking the current value, $\Pr(Y = y_i)$.
ZGAMMA=name
names the variable containing the value of $\varphi_i$, the probability that the response variable will take on the value of zero as a result of the zero-generating process. It is written to the output file only if the model is zero-inflated.
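A minimal sketch of the OUTPUT statement, using only the OUT=, PROB=, and ZGAMMA= options named in this section. It assumes the long97data data set from the Getting Started example; the output data set name (artout) and the variable names (p_current, zg) are illustrative.

```sas
/* Sketch: save per-observation statistics from a ZIP model */
proc countreg data=long97data;
   model art = fem mar kid5 phd ment / dist=zip;
   zeromodel art ~ fem mar kid5 phd ment;
   output out=artout prob=p_current zgamma=zg;
run;

/* Inspect the first few rows of the output data set */
proc print data=artout(obs=5);
run;
```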
RESTRICT Statement
RESTRICT restriction1 [, restriction2 . . . ] ;
The RESTRICT statement is used to impose linear restrictions on the parameter estimates. You can specify any number of RESTRICT statements. Each restriction is written as an expression, followed by an equality operator (=) or an inequality operator (<, >, <=, >=), followed by a second expression:

expression operator expression

The operator can be =, <, >, <=, or >=. Restriction expressions can be composed of parameter names, constants, and the operators times (*), plus (+), and minus (-). Parameter names are as shown in the ESTIMATE column of the Parameter Estimates table. The restriction expressions must be a linear function of the variables.

Lagrange multipliers are reported in the Parameter Estimates table for all the active linear constraints. They are identified with the names Restrict1, Restrict2, and so on. The probabilities of these Lagrange multipliers are computed using a beta distribution (LaMotte 1994). Nonactive, or nonbinding, restrictions have no effect on the estimation results and are not noted in the output.

The following RESTRICT statement constrains the negative binomial dispersion parameter, $\alpha$, to 1, which restricts the conditional variance to be $\mu + \mu^2$:
restrict _Alpha = 1;
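In context, the restriction might be used as in the following sketch, which assumes the long97data data set and the NEGBIN(P=2) model from the Getting Started example.

```sas
/* Sketch: negative binomial fit with the dispersion parameter fixed at 1 */
proc countreg data=long97data;
   model art = fem mar kid5 phd ment / dist=negbin(p=2);
   restrict _Alpha = 1;
run;
```

If the restriction is active, its Lagrange multiplier appears in the Parameter Estimates table under the name Restrict1.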
ZEROMODEL Statement
ZEROMODEL dependent variable ~ zero-inflated regressors / options ;
The ZEROMODEL statement is required if either ZIP or ZINB is specified in the DIST= option in the MODEL statement. If ZIP or ZINB is specified, then the ZEROMODEL statement must follow immediately after the MODEL statement. The dependent variable in the ZEROMODEL statement must be the same as the dependent variable in the MODEL statement.

The zero-inflated (ZI) regressors appear in the equation that determines the probability ($\varphi_i$) of a zero count. Each of these q variables has a parameter to be estimated in the regression. For example, let $z_i'$ be the ith observation's $1 \times (q+1)$ vector of values of the q ZI explanatory variables ($w_0$ is set to 1 for the intercept term). Then $\varphi_i$ will be a function of $z_i'\gamma$, where $\gamma$ is the $(q+1) \times 1$ vector of parameters to be estimated. (The zero-inflated intercept is $\gamma_0$; the coefficients for the q zero-inflated covariates are $\gamma_1, \ldots, \gamma_q$.) If this option is omitted, then only the intercept term $\gamma_0$ is estimated.

The Parameter Estimates table in the displayed output gives the estimates for the ZI intercept and ZI explanatory variables; they are labeled with the prefix Inf_. For example, the ZI intercept is labeled Inf_intercept. If you specify Age (a variable in your data set) as a ZI explanatory variable, then the Parameter Estimates table labels the corresponding parameter estimate Inf_Age.

The following options can be specified in the ZEROMODEL statement following a slash (/):
LINK=value
specifies the distribution function used to compute probability of zeros. The supported distribution functions are as follows:
   LOGISTIC   specifies logistic distribution.
   NORMAL     specifies standard normal distribution.
specifies a variable in the input data set to be used as a zero-inflated (ZI) offset variable. The ZI offset variable is included as a term, with coefficient restricted to 1, in the equation that determines the probability ($\varphi_i$) of a zero count. The ZI offset variable cannot be the response variable, the offset variable (if any), or one of the explanatory variables. The name of the data set variable used as the ZI offset variable is displayed in the Model Fit Summary output, where it is labeled as Inf_offset.
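As a sketch of the LINK= option, the following step fits a zero-inflated Poisson model with a standard normal (probit) link for the zero-generating process. It assumes the docvisit data set and the variables used in Example 10.1.

```sas
/* Sketch: ZIP model with a probit link for the probability of a zero */
proc countreg data=docvisit;
   model doctorco = sex illness income hscore / dist=zip;
   zeromodel doctorco ~ age / link=normal;
run;
```

Omitting LINK= gives the logistic link.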
Missing Values
Any observation in the input data set with a missing value for one or more of the regressors is ignored by PROC COUNTREG and not used in the model fit. PROC COUNTREG rounds any positive noninteger count values to the nearest integer. PROC COUNTREG ignores any observations with a negative count.

If there are observations in the input data set with missing response values but with nonmissing regressors, PROC COUNTREG can compute several statistics and store them in an output data set by using the OUTPUT statement. For example, you can request that the output data set contain the estimates of $x_i'\beta$, the expected value of the response variable, and the probability of the response variable taking on the current value. Furthermore, if a zero-inflated model was fit, you can request that the output data set contain the estimates of $z_i'\gamma$, and the probability that the response is zero as a result of the zero-generating process. Note that the presence of such observations (with missing response values) does not affect the model fit.
Poisson Regression
The most widely used model for count data analysis is Poisson regression. This assumes that $y_i$, given the vector of covariates $x_i$, is independently Poisson distributed with

$$P(Y_i = y_i \mid x_i) = \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots$$

and the mean parameter (that is, the mean number of events per period) is given by

$$\mu_i = \exp(x_i'\beta)$$

where $\beta$ is a $(k+1) \times 1$ parameter vector. (The intercept is $\beta_0$; the coefficients for the k regressors are $\beta_1, \ldots, \beta_k$.) Taking the exponential of $x_i'\beta$ ensures that the mean parameter $\mu_i$ is nonnegative. The conditional mean is therefore

$$E(y_i \mid x_i) = \mu_i = \exp(x_i'\beta)$$

The name log-linear model is also used for the Poisson regression model since the logarithm of the conditional mean is linear in the parameters:

$$\ln E(y_i \mid x_i) = \ln(\mu_i) = x_i'\beta$$

Note that the conditional variance of the count random variable is equal to the conditional mean in the Poisson regression model:

$$V(y_i \mid x_i) = E(y_i \mid x_i) = \mu_i$$

The equality of the conditional mean and variance of $y_i$ is known as equidispersion. The marginal effect of a regressor is given by

$$\frac{\partial E(y_i \mid x_i)}{\partial x_{ji}} = \exp(x_i'\beta)\,\beta_j = E(y_i \mid x_i)\,\beta_j$$

Thus, a one-unit change in the jth regressor leads to a proportional change in the conditional mean $E(y_i \mid x_i)$ of $\beta_j$.

The standard estimator for the Poisson model is the maximum likelihood estimator (MLE). Since the observations are independent, the log-likelihood function is written as

$$L = \sum_{i=1}^{N} \left( -\mu_i + y_i \ln \mu_i - \ln y_i! \right) = \sum_{i=1}^{N} \left( -e^{x_i'\beta} + y_i x_i'\beta - \ln y_i! \right)$$

The gradient and the Hessian are, respectively,

$$\frac{\partial L}{\partial \beta} = \sum_{i=1}^{N} (y_i - \mu_i)\,x_i = \sum_{i=1}^{N} \left( y_i - e^{x_i'\beta} \right) x_i$$

$$\frac{\partial^2 L}{\partial \beta\,\partial \beta'} = -\sum_{i=1}^{N} \mu_i\,x_i x_i' = -\sum_{i=1}^{N} e^{x_i'\beta}\,x_i x_i'$$
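The Poisson probability, conditional mean, and marginal effect can be evaluated numerically with a short DATA step. This is a sketch under stated assumptions: the linear predictor (xbeta) and the coefficient (beta_j) are hypothetical numbers, not estimates from this chapter.

```sas
data _null_;
   xbeta  = 0.5;     /* hypothetical value of the linear predictor x'beta */
   beta_j = 0.2;     /* hypothetical coefficient of the j-th regressor     */
   mu = exp(xbeta);  /* conditional mean and conditional variance          */
   do y = 0 to 3;
      p_formula = exp(-mu) * mu**y / fact(y);   /* density written out     */
      p_builtin = pdf('POISSON', y, mu);        /* SAS built-in density    */
      put y= p_formula= p_builtin=;
   end;
   marginal = mu * beta_j;   /* marginal effect: E(y|x) times beta_j       */
   put mu= marginal=;
run;
```

The two probability columns should agree, which is a quick check that the density formula has been transcribed correctly.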
The Poisson model has been criticized for its restrictive property that the conditional variance equals the conditional mean. Real-life data are often characterized by overdispersion; that is, the variance exceeds the mean. Allowing for overdispersion can improve model predictions since the Poisson restriction of equal mean and variance results in the underprediction of zeros when overdispersion exists. The most commonly used model that accounts for overdispersion is the negative binomial model.
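$$E(y_i \mid x_i, \tau_i) = \mu_i \tau_i = e^{x_i'\beta + \epsilon_i}$$

where the unobserved heterogeneity term $\tau_i = e^{\epsilon_i}$ is independent of the vector of regressors $x_i$. Then the distribution of $y_i$ conditional on $x_i$ and $\tau_i$ is Poisson with conditional mean and conditional variance $\mu_i \tau_i$:

$$f(y_i \mid x_i, \tau_i) = \frac{\exp(-\mu_i \tau_i)\,(\mu_i \tau_i)^{y_i}}{y_i!}$$

Let $g(\tau_i)$ be the probability density function of $\tau_i$. Then, the distribution $f(y_i \mid x_i)$ (no longer conditional on $\tau_i$) is obtained by integrating $f(y_i \mid x_i, \tau_i)$ with respect to $\tau_i$:

$$f(y_i \mid x_i) = \int_0^\infty f(y_i \mid x_i, \tau_i)\,g(\tau_i)\,d\tau_i$$

An analytical solution to this integral exists when $\tau_i$ is assumed to follow a gamma distribution. This solution is the negative binomial distribution. When the model contains a constant term, it is necessary to assume that $E(e^{\epsilon_i}) = E(\tau_i) = 1$, in order to identify the mean of the distribution. Thus, it is assumed that $\tau_i$ follows a gamma($\theta$, $\theta$) distribution with $E(\tau_i) = 1$ and $V(\tau_i) = 1/\theta$:

$$g(\tau_i) = \frac{\theta^\theta}{\Gamma(\theta)}\,\tau_i^{\theta - 1} \exp(-\theta \tau_i)$$

where $\Gamma(x) = \int_0^\infty z^{x-1} \exp(-z)\,dz$ is the gamma function and $\theta$ is a positive parameter. Then, the density of $y_i$ given $x_i$ is derived as

$$\begin{aligned}
f(y_i \mid x_i) &= \int_0^\infty f(y_i \mid x_i, \tau_i)\,g(\tau_i)\,d\tau_i \\
&= \frac{\theta^\theta \mu_i^{y_i}}{y_i!\,\Gamma(\theta)} \int_0^\infty e^{-(\mu_i + \theta)\tau_i}\,\tau_i^{\theta + y_i - 1}\,d\tau_i \\
&= \frac{\theta^\theta \mu_i^{y_i}\,\Gamma(y_i + \theta)}{y_i!\,\Gamma(\theta)\,(\theta + \mu_i)^{\theta + y_i}} \\
&= \frac{\Gamma(y_i + \theta)}{y_i!\,\Gamma(\theta)} \left( \frac{\theta}{\theta + \mu_i} \right)^{\theta} \left( \frac{\mu_i}{\theta + \mu_i} \right)^{y_i}
\end{aligned}$$

With the substitution $\alpha = 1/\theta$ ($\alpha > 0$), the negative binomial distribution can be rewritten as

$$f(y_i \mid x_i) = \frac{\Gamma(y_i + \alpha^{-1})}{y_i!\,\Gamma(\alpha^{-1})} \left( \frac{\alpha^{-1}}{\alpha^{-1} + \mu_i} \right)^{\alpha^{-1}} \left( \frac{\mu_i}{\alpha^{-1} + \mu_i} \right)^{y_i}, \qquad y_i = 0, 1, 2, \ldots$$

Thus, the negative binomial distribution is derived as a gamma mixture of Poisson random variables. It has conditional mean

$$E(y_i \mid x_i) = \mu_i = e^{x_i'\beta}$$

and conditional variance

$$V(y_i \mid x_i) = \mu_i \left[ 1 + \frac{1}{\theta}\,\mu_i \right] = \mu_i\,[1 + \alpha \mu_i] > E(y_i \mid x_i)$$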
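A numerical check of the gamma-mixture density is sketched below. It evaluates the negative binomial probabilities on the log scale with the LGAMMA function, sums them over a long range of counts, and compares the resulting mean and variance with $\mu$ and $\mu + \alpha\mu^2$. The value of alpha is the _Alpha estimate quoted in the Getting Started section; the value of mu and the truncation point of 200 are assumptions made for illustration.

```sas
data _null_;
   alpha = 0.4416;    /* dispersion estimate quoted in Getting Started */
   mu    = 1.5;       /* hypothetical conditional mean                 */
   theta = 1 / alpha;
   total = 0; ymean = 0; ysq = 0;
   do y = 0 to 200;   /* truncated sum; the tail beyond 200 is negligible here */
      logp = lgamma(y + theta) - lgamma(theta) - lgamma(y + 1)
             + theta * log(theta / (theta + mu))
             + y * log(mu / (theta + mu));
      p = exp(logp);
      total = total + p;
      ymean = ymean + y * p;
      ysq   = ysq + y * y * p;
   end;
   yvar = ysq - ymean**2;
   var_formula = mu + alpha * mu**2;   /* NEGBIN2 variance function */
   put total= ymean= yvar= var_formula=;
run;
```

The probabilities should sum to approximately 1, and yvar should agree with var_formula up to truncation and rounding error.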
The conditional variance of the negative binomial distribution exceeds the conditional mean. Overdispersion results from neglected unobserved heterogeneity. The negative binomial model with variance function $V(y_i \mid x_i) = \mu_i + \alpha \mu_i^2$, which is quadratic in the mean, is referred to as the NEGBIN2 model (Cameron and Trivedi 1986). To estimate this model, specify DIST=NEGBIN(p=2) in the MODEL statement. The Poisson distribution is a special case of the negative binomial distribution where $\alpha = 0$. A test of the Poisson distribution can be carried out by testing the hypothesis that $\alpha = 1/\theta = 0$. A Wald test of this hypothesis is provided (it is the reported t statistic for the estimated $\alpha$ in the negative binomial model).

The log-likelihood function of the negative binomial regression model (NEGBIN2) is given by

$$L = \sum_{i=1}^{N} \left\{ \sum_{j=0}^{y_i - 1} \ln(j + \alpha^{-1}) - \ln(y_i!) - (y_i + \alpha^{-1}) \ln\left(1 + \alpha \exp(x_i'\beta)\right) + y_i \ln(\alpha) + y_i x_i'\beta \right\}$$

where use of the following fact is made:

$$\Gamma(y + a)/\Gamma(a) = \prod_{j=0}^{y-1} (j + a)$$

if y is an integer. The gradient is

$$\frac{\partial L}{\partial \beta} = \sum_{i=1}^{N} \frac{y_i - \mu_i}{1 + \alpha \mu_i}\,x_i$$

and

$$\frac{\partial L}{\partial \alpha} = \sum_{i=1}^{N} \left\{ -\alpha^{-2} \sum_{j=0}^{y_i - 1} \frac{1}{j + \alpha^{-1}} + \alpha^{-2} \ln(1 + \alpha \mu_i) + \frac{y_i - \mu_i}{\alpha (1 + \alpha \mu_i)} \right\}$$
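Cameron and Trivedi (1986) consider a general class of negative binomial models with mean $\mu_i$ and variance function $\mu_i + \alpha \mu_i^p$. The NEGBIN2 model, with $p = 2$, is the standard formulation of the negative binomial model. Models with other values of p, $-\infty < p < \infty$, have the same density $f(y_i \mid x_i)$ except that $\alpha^{-1}$ is replaced everywhere by $\alpha^{-1} \mu_i^{2-p}$. The negative binomial model NEGBIN1, which sets $p = 1$, has variance function $V(y_i \mid x_i) = \mu_i + \alpha \mu_i$, which is linear in the mean. To estimate this model, specify DIST=NEGBIN(p=1) in the MODEL statement.

The log-likelihood function of the NEGBIN1 regression model is given by

$$L = \sum_{i=1}^{N} \left\{ \sum_{j=0}^{y_i - 1} \ln\left(j + \alpha^{-1} \exp(x_i'\beta)\right) - \ln(y_i!) - \left(y_i + \alpha^{-1} \exp(x_i'\beta)\right) \ln(1 + \alpha) + y_i \ln(\alpha) \right\}$$

The gradient is

$$\frac{\partial L}{\partial \beta} = \sum_{i=1}^{N} \left\{ \left( \sum_{j=0}^{y_i - 1} \frac{\mu_i}{j\alpha + \mu_i} \right) x_i - \alpha^{-1} \ln(1 + \alpha)\,\mu_i x_i \right\}$$

and

$$\frac{\partial L}{\partial \alpha} = \sum_{i=1}^{N} \left\{ -\left( \sum_{j=0}^{y_i - 1} \frac{\alpha^{-1} \mu_i}{j\alpha + \mu_i} \right) + \alpha^{-2} \mu_i \ln(1 + \alpha) - \frac{y_i + \alpha^{-1} \mu_i}{1 + \alpha} + \frac{y_i}{\alpha} \right\}$$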
Therefore, the probability of $\{Y_i = y_i\}$ can be described as

$$P(y_i = 0 \mid x_i) = \varphi_i + (1 - \varphi_i)\,g(0)$$

$$P(y_i \mid x_i) = (1 - \varphi_i)\,g(y_i), \qquad y_i > 0$$

where $g(y_i)$ follows either the Poisson or the negative binomial distribution.

When the probability $\varphi_i$ depends on the characteristics of observation i, $\varphi_i$ is written as a function of $z_i'\gamma$, where $z_i'$ is the $1 \times (q+1)$ vector of zero-inflated covariates and $\gamma$ is the $(q+1) \times 1$ vector of zero-inflated coefficients to be estimated. (The zero-inflated intercept is $\gamma_0$; the coefficients for the q zero-inflated covariates are $\gamma_1, \ldots, \gamma_q$.) The function F relating the product $z_i'\gamma$ (which is a scalar) to the probability $\varphi_i$ is called the zero-inflated link function,

$$\varphi_i = F_i = F(z_i'\gamma)$$

In the COUNTREG procedure, the zero-inflated covariates are indicated in the ZEROMODEL statement. Furthermore, the zero-inflated link function F can be specified as either the logistic function,

$$F(z_i'\gamma) = \Lambda(z_i'\gamma) = \frac{\exp(z_i'\gamma)}{1 + \exp(z_i'\gamma)}$$

or the standard normal cumulative distribution function (also called the probit function),

$$F(z_i'\gamma) = \Phi(z_i'\gamma) = \int_{-\infty}^{z_i'\gamma} \frac{1}{\sqrt{2\pi}} \exp(-u^2/2)\,du$$

The zero-inflated link function is indicated in the ZEROMODEL statement, using the LINK= option. The default ZI link function is the logistic function.
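In the zero-inflated Poisson (ZIP) model, $g(y_i)$ is the Poisson density with mean $\mu_i = e^{x_i'\beta}$, so that

$$P(y_i = 0 \mid x_i, z_i) = F_i + (1 - F_i) \exp(-\mu_i)$$

$$P(y_i \mid x_i, z_i) = (1 - F_i)\,\frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!}, \qquad y_i > 0$$

The conditional expectation and conditional variance of $y_i$ are given by

$$E(y_i \mid x_i, z_i) = \mu_i (1 - F_i)$$

$$V(y_i \mid x_i, z_i) = E(y_i \mid x_i, z_i)\,(1 + \mu_i F_i)$$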
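The effect of zero inflation can be seen in a small numerical sketch. The Poisson mean (mu) and the zero-inflation probability (phi, playing the role of $F_i$) are hypothetical values chosen for illustration.

```sas
data _null_;
   mu  = 1.5;   /* hypothetical Poisson mean                        */
   phi = 0.3;   /* hypothetical zero-inflation probability F_i      */
   p0_poisson = exp(-mu);                     /* P(y=0), plain Poisson     */
   p0_zip     = phi + (1 - phi) * exp(-mu);   /* P(y=0), ZIP model         */
   ey = mu * (1 - phi);                       /* ZIP conditional mean      */
   vy = ey * (1 + mu * phi);                  /* ZIP conditional variance  */
   put p0_poisson= p0_zip= ey= vy=;
run;
```

Whenever phi and mu are both positive, vy exceeds ey.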
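Note that the ZIP model (as well as the ZINB model) exhibits overdispersion since $V(y_i \mid x_i, z_i) > E(y_i \mid x_i, z_i)$. In general, the log-likelihood function of the ZIP model is

$$L = \sum_{i=1}^{N} \ln\left[ P(y_i \mid x_i, z_i) \right]$$

Once a specific link function (either logistic or standard normal) for the probability $\varphi_i$ is chosen, it is possible to write the exact expressions for the log-likelihood function and the gradient.

For the ZIP model with the logistic link function, the log-likelihood function is

$$L = \sum_{\{i: y_i = 0\}} \ln\left[ \exp(z_i'\gamma) + \exp(-\exp(x_i'\beta)) \right] + \sum_{\{i: y_i > 0\}} \left[ y_i x_i'\beta - \exp(x_i'\beta) - \sum_{k=2}^{y_i} \ln(k) \right] - \sum_{i=1}^{N} \ln\left[ 1 + \exp(z_i'\gamma) \right]$$

The gradient for this model is given by

$$\frac{\partial L}{\partial \gamma} = \sum_{\{i: y_i = 0\}} \left[ \frac{\exp(z_i'\gamma)}{\exp(z_i'\gamma) + \exp(-\exp(x_i'\beta))} \right] z_i - \sum_{i=1}^{N} \left[ \frac{\exp(z_i'\gamma)}{1 + \exp(z_i'\gamma)} \right] z_i$$

$$\frac{\partial L}{\partial \beta} = \sum_{\{i: y_i = 0\}} \left[ \frac{-\exp(x_i'\beta) \exp(-\exp(x_i'\beta))}{\exp(z_i'\gamma) + \exp(-\exp(x_i'\beta))} \right] x_i + \sum_{\{i: y_i > 0\}} \left[ y_i - \exp(x_i'\beta) \right] x_i$$

For the ZIP model with the standard normal link function, the log-likelihood function is

$$L = \sum_{\{i: y_i = 0\}} \ln\left\{ \Phi(z_i'\gamma) + \left[ 1 - \Phi(z_i'\gamma) \right] \exp(-\exp(x_i'\beta)) \right\} + \sum_{\{i: y_i > 0\}} \left\{ \ln\left[ 1 - \Phi(z_i'\gamma) \right] - \exp(x_i'\beta) + y_i x_i'\beta - \sum_{k=2}^{y_i} \ln(k) \right\}$$

The gradient for this model is given by

$$\frac{\partial L}{\partial \gamma} = \sum_{\{i: y_i = 0\}} \frac{\varphi(z_i'\gamma) \left[ 1 - \exp(-\exp(x_i'\beta)) \right]}{\Phi(z_i'\gamma) + \left[ 1 - \Phi(z_i'\gamma) \right] \exp(-\exp(x_i'\beta))}\,z_i - \sum_{\{i: y_i > 0\}} \frac{\varphi(z_i'\gamma)}{1 - \Phi(z_i'\gamma)}\,z_i$$

$$\frac{\partial L}{\partial \beta} = \sum_{\{i: y_i = 0\}} \frac{-\left[ 1 - \Phi(z_i'\gamma) \right] \exp(x_i'\beta) \exp(-\exp(x_i'\beta))}{\Phi(z_i'\gamma) + \left[ 1 - \Phi(z_i'\gamma) \right] \exp(-\exp(x_i'\beta))}\,x_i + \sum_{\{i: y_i > 0\}} \left[ y_i - \exp(x_i'\beta) \right] x_i$$

where $\varphi(\cdot)$ in these two gradient expressions denotes the standard normal density function.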
In this case, the conditional expectation and conditional variance of $y_i$ are

$$E(y_i \mid x_i, z_i) = \mu_i (1 - F_i)$$

$$V(y_i \mid x_i, z_i) = E(y_i \mid x_i, z_i) \left[ 1 + \mu_i (F_i + \alpha) \right]$$
As with the ZIP model, the ZINB model exhibits overdispersion because the conditional variance exceeds the conditional mean.
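In the following expressions, $\mu_i = \exp(x_i'\beta)$ and $\varphi(\cdot)$ denotes the standard normal density function.

For the ZINB model with the logistic link function, the log-likelihood function is

$$\begin{aligned}
L ={} & \sum_{\{i: y_i = 0\}} \ln\left[ \exp(z_i'\gamma) + (1 + \alpha \mu_i)^{-\alpha^{-1}} \right] + \sum_{\{i: y_i > 0\}} \sum_{j=0}^{y_i - 1} \ln(j + \alpha^{-1}) \\
& + \sum_{\{i: y_i > 0\}} \left\{ -\ln(y_i!) - (y_i + \alpha^{-1}) \ln(1 + \alpha \mu_i) + y_i \ln(\alpha) + y_i x_i'\beta \right\} - \sum_{i=1}^{N} \ln\left[ 1 + \exp(z_i'\gamma) \right]
\end{aligned}$$

The gradient for this model is given by

$$\frac{\partial L}{\partial \gamma} = \sum_{\{i: y_i = 0\}} \left[ \frac{\exp(z_i'\gamma)}{\exp(z_i'\gamma) + (1 + \alpha \mu_i)^{-\alpha^{-1}}} \right] z_i - \sum_{i=1}^{N} \left[ \frac{\exp(z_i'\gamma)}{1 + \exp(z_i'\gamma)} \right] z_i$$

$$\frac{\partial L}{\partial \beta} = \sum_{\{i: y_i = 0\}} \left[ \frac{-\mu_i (1 + \alpha \mu_i)^{-\alpha^{-1} - 1}}{\exp(z_i'\gamma) + (1 + \alpha \mu_i)^{-\alpha^{-1}}} \right] x_i + \sum_{\{i: y_i > 0\}} \left[ \frac{y_i - \mu_i}{1 + \alpha \mu_i} \right] x_i$$

$$\frac{\partial L}{\partial \alpha} = \sum_{\{i: y_i = 0\}} \frac{\alpha^{-2} \left[ (1 + \alpha \mu_i) \ln(1 + \alpha \mu_i) - \alpha \mu_i \right]}{\exp(z_i'\gamma) (1 + \alpha \mu_i)^{(1+\alpha)/\alpha} + (1 + \alpha \mu_i)} + \sum_{\{i: y_i > 0\}} \left\{ -\alpha^{-2} \sum_{j=0}^{y_i - 1} \frac{1}{j + \alpha^{-1}} + \alpha^{-2} \ln(1 + \alpha \mu_i) + \frac{y_i - \mu_i}{\alpha (1 + \alpha \mu_i)} \right\}$$

For the ZINB model with the standard normal link function, the log-likelihood function is

$$\begin{aligned}
L ={} & \sum_{\{i: y_i = 0\}} \ln\left\{ \Phi(z_i'\gamma) + \left[ 1 - \Phi(z_i'\gamma) \right] (1 + \alpha \mu_i)^{-\alpha^{-1}} \right\} + \sum_{\{i: y_i > 0\}} \ln\left[ 1 - \Phi(z_i'\gamma) \right] \\
& + \sum_{\{i: y_i > 0\}} \sum_{j=0}^{y_i - 1} \ln(j + \alpha^{-1}) - \sum_{\{i: y_i > 0\}} \ln(y_i!) - \sum_{\{i: y_i > 0\}} (y_i + \alpha^{-1}) \ln(1 + \alpha \mu_i) \\
& + \sum_{\{i: y_i > 0\}} y_i \ln(\alpha) + \sum_{\{i: y_i > 0\}} y_i x_i'\beta
\end{aligned}$$

The gradient for this model is given by

$$\frac{\partial L}{\partial \gamma} = \sum_{\{i: y_i = 0\}} \left[ \frac{\varphi(z_i'\gamma) \left[ 1 - (1 + \alpha \mu_i)^{-\alpha^{-1}} \right]}{\Phi(z_i'\gamma) + \left[ 1 - \Phi(z_i'\gamma) \right] (1 + \alpha \mu_i)^{-\alpha^{-1}}} \right] z_i - \sum_{\{i: y_i > 0\}} \left[ \frac{\varphi(z_i'\gamma)}{1 - \Phi(z_i'\gamma)} \right] z_i$$

$$\frac{\partial L}{\partial \beta} = \sum_{\{i: y_i = 0\}} \frac{-\left[ 1 - \Phi(z_i'\gamma) \right] \mu_i (1 + \alpha \mu_i)^{-(1+\alpha)/\alpha}}{\Phi(z_i'\gamma) + \left[ 1 - \Phi(z_i'\gamma) \right] (1 + \alpha \mu_i)^{-\alpha^{-1}}}\,x_i + \sum_{\{i: y_i > 0\}} \left[ \frac{y_i - \mu_i}{1 + \alpha \mu_i} \right] x_i$$

$$\frac{\partial L}{\partial \alpha} = \sum_{\{i: y_i = 0\}} \frac{\left[ 1 - \Phi(z_i'\gamma) \right] \alpha^{-2} \left[ (1 + \alpha \mu_i) \ln(1 + \alpha \mu_i) - \alpha \mu_i \right]}{\Phi(z_i'\gamma) (1 + \alpha \mu_i)^{(1+\alpha)/\alpha} + \left[ 1 - \Phi(z_i'\gamma) \right] (1 + \alpha \mu_i)} + \sum_{\{i: y_i > 0\}} \left\{ -\alpha^{-2} \sum_{j=0}^{y_i - 1} \frac{1}{j + \alpha^{-1}} + \alpha^{-2} \ln(1 + \alpha \mu_i) + \frac{y_i - \mu_i}{\alpha (1 + \alpha \mu_i)} \right\}$$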
Computational Resources
The time and memory required by PROC COUNTREG are proportional to the number of parameters in the model and the number of observations in the data set being analyzed. Less time and memory will be required for smaller models and fewer observations. Also affecting these resources are the method chosen to calculate the variance-covariance matrix and the optimization method. All optimization methods available through the METHOD= option have similar memory use requirements. The processing time might differ for each method depending on the number of iterations and functional calls needed.

The data set is read into memory to save processing time. If there is not enough memory available to hold the data, the COUNTREG procedure stores the data in a utility file on disk and rereads the data as needed from this file. When this occurs, the execution time of the procedure increases substantially. The gradient and the variance-covariance matrix must be held in memory. If the model has p parameters including the intercept, then at least $8 \times (p + p \times (p + 1)/2)$ bytes will be needed.

Time is also a function of the number of iterations needed to converge to a solution for the model parameters. The number of iterations needed cannot be known in advance. The MAXITER= option can be used to limit the number of iterations that PROC COUNTREG will do. The convergence criteria can be altered by nonlinear optimization options available in the PROC COUNTREG statement. For a list of all the nonlinear optimization options, see Chapter 6, Nonlinear Optimization Methods.
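The lower bound on memory for the gradient and the variance-covariance matrix can be tabulated for a few model sizes, as in the following sketch. The parameter counts are arbitrary examples; p = 6 corresponds to the Poisson model in the Getting Started section (five regressors plus an intercept).

```sas
data _null_;
   do p = 6, 50, 500;
      /* 8 bytes for each gradient element and each element of the */
      /* lower triangle of the variance-covariance matrix           */
      min_bytes = 8 * (p + p * (p + 1) / 2);
      put p= min_bytes= comma12.;
   end;
run;
```

For p = 6 the bound is 8 × (6 + 21) = 216 bytes, so the input data, not this storage, usually dominates memory use.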
Displayed Output
PROC COUNTREG produces the following displayed output.
   value of the objective function (negative one times the log-likelihood value) at the current solution
   change in the objective function from previous iteration
   value of the maximum absolute gradient element
   step size (for Newton-Raphson and quasi-Newton methods)
   slope of the current search direction (for Newton-Raphson and quasi-Newton methods)
   lambda (for trust region method)
   radius value at current iteration (for trust region method)
Parameter Estimates
The Parameter Estimates table in the displayed output gives the estimates for the ZI intercept and ZI explanatory variables; they are labeled with the prefix Inf_. For example, the ZI intercept is labeled Inf_intercept. If you specify Age (a variable in your data set) as a ZI explanatory variable, then the Parameter Estimates table labels the corresponding parameter estimate Inf_Age. If you do not list any ZI explanatory variables (for the ZI option VAR=), then only the intercept term is estimated. _Alpha is the negative binomial dispersion parameter. The t statistic given for _Alpha is a test of overdispersion.
ODS Table Name               Description                           Option

ODS Tables Created by the MODEL Statement
FitSummary                   Summary of Nonlinear Estimation       default
ConvergenceStatus            Convergence Status                    default
ParameterEstimates           Parameter Estimates                   default
CovB                         Covariance of Parameter Estimates     COVB
CorrB                        Correlation of Parameter Estimates    CORRB
InputOptions                 Input Options                         ITPRINT
IterStart                    Optimization Start                    ITPRINT
IterHist                     Iteration History                     ITPRINT
IterStop                     Optimization Results                  ITPRINT
ParameterEstimatesResults    Parameter Estimates                   ITPRINT
ParameterEstimatesStart      Parameter Estimates                   ITPRINT
ProblemDescription           Problem Description                   ITPRINT
Poisson Models
These statements fit a Poisson model to the data by using the covariates SEX, ILLNESS, INCOME, and HSCORE:
/*-- Poisson Model --*/
proc countreg data=docvisit;
   model doctorco=sex illness income hscore / dist=poisson printall;
run;
In this example, the DIST= option in the MODEL statement specifies the POISSON distribution. In addition, the PRINTALL option displays the correlation and covariance matrices for the parameters, log-likelihood values, and convergence information in addition to the parameter estimates. The parameter estimates for this model are shown in Output 10.1.2.
Output 10.1.2 Parameter Estimates of Poisson Model

The COUNTREG Procedure

Parameter Estimates
                   Standard      Approx
Parameter    DF       Error    Pr > |t|
Intercept     1    0.074545      <.0001
sex           1    0.054362      <.0001
illness       1    0.017080      <.0001
income        1    0.077829      0.0019
hscore        1    0.009089      <.0001
Suppose that you suspect that the population of individuals can be viewed as two distinct groups: a low-risk group, comprising individuals who never go to the doctor, and a high-risk group, comprising individuals who do go to the doctor. You might suspect that the data have this structure both because the sample variance of DOCTORCO (0.64) exceeds its sample mean (0.30), which suggests overdispersion, and because a large fraction of the DOCTORCO observations (80%) have the value zero. Estimating a zero-inflated model is one way to deal with overdispersion that results from excess zeros.

Suppose also that you suspect that the covariate AGE has an impact on whether an individual belongs to the low-risk group. For example, younger individuals might have illnesses of much lower severity when they do get sick and be less likely to visit a doctor, all else being equal. The following statements estimate a zero-inflated Poisson regression with AGE as a covariate in the zero-generation process:
/*-- Zero-Inflated Poisson Model --*/
proc countreg data=docvisit;
   model doctorco=sex illness income hscore / dist=zip;
   zeromodel doctorco ~ age;
run;
In this case, the ZEROMODEL statement following the MODEL statement specifies that both an intercept and the variable AGE be used to estimate the likelihood of zero doctor visits. Output 10.1.3 shows the resulting parameter estimates.
The estimates of the zero-inflated intercept (Inf_Intercept) and the zero-inflated regression coefficient for AGE (Inf_age) are approximately 0.99 and −2.09, respectively. Therefore, you can estimate the probabilities for individuals of ages 20, 50, and 70 as follows:

$$\text{20 years:} \quad \frac{e^{(0.99 - 2.09 \times 0.20)}}{1 + e^{(0.99 - 2.09 \times 0.20)}}$$

$$\text{50 years:} \quad \frac{e^{(0.99 - 2.09 \times 0.50)}}{1 + e^{(0.99 - 2.09 \times 0.50)}}$$

$$\text{70 years:} \quad \frac{e^{(0.99 - 2.09 \times 0.70)}}{1 + e^{(0.99 - 2.09 \times 0.70)}}$$

That is, the estimated probability of belonging to the low-risk group is about 0.64 for a 20-year-old individual, 0.49 for a 50-year-old individual, and only 0.38 for a 70-year-old individual. This supports the suspicion that older individuals are more likely to have a positive number of doctor visits. Alternative models to account for the overdispersion are the negative binomial and the zero-inflated negative binomial models, which can be fit using the DIST=NEGBIN and DIST=ZINB options, respectively.
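The three probabilities can be reproduced with a DATA step that applies the logistic link to the rounded estimates 0.99 and −2.09. This is a sketch: the data set and variable names are illustrative, and age is entered as 0.20, 0.50, and 0.70, which is the scaling the quoted probabilities imply.

```sas
data zi_prob;
   b0    = 0.99;    /* rounded Inf_Intercept estimate */
   b_age = -2.09;   /* rounded Inf_age estimate       */
   do age = 0.20, 0.50, 0.70;
      eta = b0 + b_age * age;                   /* linear predictor z'gamma      */
      p_lowrisk = exp(eta) / (1 + exp(eta));    /* logistic link                 */
      output;
   end;
run;

proc print data=zi_prob noobs;
   var age p_lowrisk;
   format p_lowrisk 6.2;
run;
```

To two decimals the results are 0.64, 0.49, and 0.38, as stated in the text.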
Example 10.2: ZIP and ZINB Models for Data Exhibiting Extra Zeros
In the study by Long (1997) of the number of published articles by scientists (see the section "Getting Started: COUNTREG Procedure" on page 485), the observed proportion of scientists publishing no articles is 0.3005. The following statements use PROC FREQ to compute the proportion of scientists publishing each observed number of articles. Output 10.2.1 shows the results.
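A PROC FREQ step along the following lines would produce such a table. It is a sketch under two assumptions: the input data set is long97data, as in the PROC COUNTREG steps of this example, and the frequencies are saved in a data set named obs, which is the name that the plotting step later in this example merges with the model averages.

```sas
/* Sketch: observed distribution of article counts.                   */
/* OUT= writes one row per value of ART with COUNT and PERCENT.       */
proc freq data=long97data;
   tables art / out=obs;
run;
```

The PERCENT variable written by OUT= is on a 0 to 100 scale, which is consistent with the later DATA step dividing it by 100 to obtain a proportion.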
PROC COUNTREG is then used to fit Poisson and negative binomial models to the data. For each model, the PROBCOUNTS macro computes the probability that the number of published articles is m, where m is a value in a list of nonnegative integers specified in the COUNTS= option. The computations require the parameter estimates of the fitted model. These are saved by using the ODS OUTPUT statement as shown and passed to the PROBCOUNTS macro by using the INMODEL= option. Variables containing the probabilities are created with names beginning with the PREFIX= string followed by the COUNTS= values and are saved in the OUT= data set. For the Poisson model, the variables poi0, poi1, ..., poi10 are created and saved in the data set predpoi, which also contains all of the variables in the DATA= data set. The PROBCOUNTS macro is available from the Samples section at http://support.sas.com. The following statements compute the estimates for Poisson and negative binomial models.
/*-- Poisson Model --*/
proc countreg data=long97data;
   model art=fem mar kid5 phd ment / dist=poisson;
   ods output ParameterEstimates=pe;
run;

%include probcounts;
%probcounts(data=long97data, inmodel=pe, counts=0 to 10, prefix=poi, out=predpoi)

/*-- Negative Binomial Model --*/
proc countreg data=long97data;
   model art=fem mar kid5 phd ment / dist=negbin(p=2);
   ods output ParameterEstimates=pe;
run;

%probcounts(data=predpoi, inmodel=pe, counts=0 to 10, prefix=nb, out=prednb)
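Because the second PROBCOUNTS call reads the OUT= data set of the first, prednb carries both poi0 and nb0, each scientist's predicted probability of publishing zero articles under the respective model. A minimal sketch of averaging them to obtain each model's predicted proportion of zeros follows; the MAXDEC=4 option is only a display choice.

```sas
/* Sketch: average predicted probability of zero articles per model.  */
proc means data=prednb mean maxdec=4;
   var poi0 nb0;
run;
```

These means should correspond to the predicted proportions of zeros reported in this example for the Poisson and negative binomial models (0.2092 and 0.3036).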
Parameter estimates for these two models are shown in the section "Getting Started: COUNTREG Procedure" on page 485. For each model, the predicted proportion of zero articles can be calculated as the average predicted probability of zero articles across all scientists, as shown in the macro probcounts in the following program. Under the Poisson model, the predicted proportion of zero articles is 0.2092, which considerably underestimates the observed proportion. The negative binomial model more closely estimates the proportion of zeros (0.3036). Also, the test of the dispersion parameter, _Alpha, in the negative binomial model indicates significant overdispersion (p < 0.0001). As a result, the negative binomial model is preferred to the Poisson model.

Another way to account for the large number of zeros in this data set is to fit a zero-inflated Poisson (ZIP) or a zero-inflated negative binomial (ZINB) model. In the following statements, DIST=ZIP requests the ZIP model. In the ZEROMODEL statement, you can specify the predictors, z, for the process that generated the additional zeros. The ZEROMODEL statement also specifies the model for the probability φ. By default, a logistic model is used for φ. The default can be changed by using the LINK= option. In this particular ZIP model, all variables used to model the article counts are also used to model φ.
proc countreg data=long97data;
   model art = fem mar kid5 phd ment / dist=zip;
   zeromodel art ~ fem mar kid5 phd ment;
   ods output ParameterEstimates=pe;
run;

%probcounts(data=prednb, inmodel=pe, counts=0 to 10, prefix=zip, out=predzip)
The parameters of the ZIP model are displayed in Output 10.2.2. The first set of parameters gives the estimates of β in the model for the Poisson process mean. Parameters with the prefix "Inf_" are the estimates of γ in the logistic model for φ.
Parameter        DF    Estimate    t Value
Intercept         1    0.640838       5.28
fem               1   -0.209145      -3.30
mar               1    0.103751       1.46
kid5              1   -0.143320      -3.02
phd               1   -0.006166      -0.20
ment              1    0.018098       7.89
Inf_Intercept     1   -0.577060      -1.13
Inf_fem           1    0.109747       0.39
Inf_mar           1   -0.354013      -1.11
Inf_kid5          1    0.217101       1.10
Inf_phd           1    0.001272       0.01
Inf_ment          1   -0.134114      -2.96
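To see how the two blocks of estimates combine, recall the standard ZIP formulation (Lambert 1992): the Poisson mean is μ = exp(x'β), the zero-inflation probability under a logistic link is φ = exp(z'γ) / (1 + exp(z'γ)), and the probability of a zero count is φ + (1 - φ)exp(-μ). The following sketch evaluates these quantities for a hypothetical scientist whose covariates are all zero, so that only the two intercept estimates from the ZIP output enter. The data set and variable names are illustrative.

```sas
/* Sketch: ZIP zero probability for a hypothetical baseline scientist */
/* (all covariates equal to zero), using the two intercept estimates. */
data zip_baseline;
   b_intercept = 0.640838;        /* Intercept (count model)          */
   g_intercept = -0.577060;       /* Inf_Intercept (zero model)       */
   mu     = exp(b_intercept);                          /* Poisson mean */
   phi    = exp(g_intercept) / (1 + exp(g_intercept)); /* logistic     */
   p_zero = phi + (1 - phi) * exp(-mu);                /* P(count = 0) */
   put mu= phi= p_zero=;          /* write the results to the log     */
run;
```

This is a probability for one covariate profile, not the sample-average proportion of zeros, which requires averaging the predicted probabilities over all scientists.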
The proportion of zeros predicted by the ZIP model is 0.2986, which is much closer to the observed proportion than the Poisson prediction. However, Output 10.2.4 shows that both models deviate from the observed proportions at one, two, and three articles. The ZINB model is specified by the DIST=ZINB option. All variables are again used to model both the number of articles and φ. The METHOD=QN option specifies that the quasi-Newton method be used to fit the model rather than the default Newton-Raphson method. These options are implemented in the following program.
proc countreg data=long97data;
   model art=fem mar kid5 phd ment / dist=zinb method=qn;
   zeromodel art ~ fem mar kid5 phd ment;
   ods output ParameterEstimates=pe;
run;

%probcounts(data=predzip, inmodel=pe, counts=0 to 10, prefix=zinb, out=predzinb)
The estimated parameters of the ZINB model are shown in Output 10.2.3. The test for overdispersion again indicates a preference for the negative binomial version of the zero-inflated model (p < 0.0001). The ZINB model also does a good job of estimating the proportion of zeros (0.3119), and it follows the observed proportions well, though possibly not as well as the negative binomial model.
Output 10.2.3 ZINB Model Estimation
The COUNTREG Procedure

Model Fit Summary
Dependent Variable           art
Number of Observations       915
Data Set                     WORK.LONG97DATA
Model                        ZINB
ZI Link Function             Logistic
Log Likelihood               -1550
Maximum Absolute Gradient    0.00263
Number of Iterations         81
Optimization Method          Quasi-Newton
AIC                          3126
SBC                          3189

Algorithm converged.

Parameter Estimates
Parameter        DF    Estimate    Standard Error    t Value    Approx Pr > |t|
Intercept         1    0.416747          0.143596       2.90             0.0037
fem               1   -0.195507          0.075592      -2.59             0.0097
mar               1    0.097582          0.084452       1.16             0.2479
kid5              1   -0.151732          0.054206      -2.80             0.0051
phd               1   -0.000700          0.036270      -0.02             0.9846
ment              1    0.024786          0.003493       7.10             <.0001
Inf_Intercept     1   -0.191684          1.322807      -0.14             0.8848
Inf_fem           1    0.635928          0.848911       0.75             0.4538
Inf_mar           1   -1.499470          0.938661      -1.60             0.1102
Inf_kid5          1    0.628427          0.442780       1.42             0.1558
Inf_phd           1   -0.037715          0.308005      -0.12             0.9025
Inf_ment          1   -0.882291          0.316223      -2.79             0.0053
_Alpha            1    0.376681          0.051029       7.38             <.0001
The following statements compute the average predicted count probability across all scientists for each count 0, 1, ..., 10. The averages for each model, along with the observed proportions, are then arranged for plotting by PROC SGPLOT.
proc summary data=predzinb;
   var poi0-poi10 nb0-nb10 zip0-zip10 zinb0-zinb10;
   output out=mnpoi  mean(poi0-poi10)  =mn0-mn10;
   output out=mnnb   mean(nb0-nb10)    =mn0-mn10;
   output out=mnzip  mean(zip0-zip10)  =mn0-mn10;
   output out=mnzinb mean(zinb0-zinb10)=mn0-mn10;
run;

data means;
   set mnpoi mnnb mnzip mnzinb;
   drop _type_ _freq_;
run;

proc transpose data=means out=tmeans;
run;

data allpred;
   merge obs(where=(art<=10)) tmeans;
   obs=percent/100;
run;

proc sgplot;
   yaxis label='Probability';
   xaxis label='Number of Articles';
   series y=obs x=art  / name='obs' legendlabel='Observed'
                         lineattrs=(color=black thickness=4px);
   series y=col1 x=art / name='poi' legendlabel='Poisson'
                         lineattrs=(color=blue);
   series y=col2 x=art / name='nb' legendlabel='Negative Binomial'
                         lineattrs=(color=red);
   series y=col3 x=art / name='zip' legendlabel='ZIP'
                         lineattrs=(color=blue pattern=2);
   series y=col4 x=art / name='zinb' legendlabel='ZINB'
                         lineattrs=(color=red pattern=2);
   discretelegend 'poi' 'zip' 'nb' 'zinb' 'obs' / title='Models:'
                  location=inside position=ne across=2 down=3;
run;
For each of the four fitted models, Output 10.2.4 shows the average predicted count probability for each article count across all scientists. The Poisson model clearly underestimates the proportion of zero articles published, while the other three models are quite accurate at zero. All of the models do well at the larger numbers of articles.
References
Abramowitz, M. and Stegun, A. (1970), Handbook of Mathematical Functions, New York: Dover Press.
Amemiya, T. (1985), Advanced Econometrics, Cambridge: Harvard University Press.
Cameron, A. C. and Trivedi, P. K. (1986), "Econometric Models Based on Count Data: Comparisons and Applications of Some Estimators and Some Tests," Journal of Applied Econometrics, 1, 29–53.
Cameron, A. C. and Trivedi, P. K. (1998), Regression Analysis of Count Data, Cambridge: Cambridge University Press.
Godfrey, L. G. (1988), Misspecification Tests in Econometrics, Cambridge: Cambridge University Press.
Greene, W. H. (1994), "Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models," Working Paper No. 94-10, New York: Stern School of Business, Department of Economics, New York University.
Greene, W. H. (2000), Econometric Analysis, Upper Saddle River, NJ: Prentice Hall.
Hausman, J. A., Hall, B. H., and Griliches, Z. (1984), "Econometric Models for Count Data with an Application to the Patents-R&D Relationship," Econometrica, 52, 909–938.
King, G. (1989a), "A Seemingly Unrelated Poisson Regression Model," Sociological Methods and Research, 17, 235–255.
King, G. (1989b), Unifying Political Methodology: The Likelihood Theory and Statistical Inference, Cambridge: Cambridge University Press.
Lambert, D. (1992), "Zero-Inflated Poisson Regression with an Application to Defects in Manufacturing," Technometrics, 34, 1–14.
LaMotte, L. R. (1994), "A Note on the Role of Independence in t Statistics Constructed from Linear Statistics in Regression Models," The American Statistician, 48, 238–240.
Long, J. S. (1997), Regression Models for Categorical and Limited Dependent Variables, Thousand Oaks, CA: Sage Publications.
Winkelmann, R. (2000), Econometric Analysis of Count Data, Berlin: Springer-Verlag.
Chapter 11
Example 11.1: BEA National Income and Product Accounts . . . 554
Example 11.2: BLS Consumer Price Index Surveys . . . 557
Example 11.3: BLS State and Area Employment, Hours, and Earnings Surveys . . . 562
Example 11.4: DRI/McGraw-Hill Format CITIBASE Files . . . 564
Example 11.5: DRI Data Delivery Service Database . . . 571
Example 11.6: PC Format CITIBASE Database . . . 573
Example 11.7: Quarterly COMPUSTAT Data Files . . . 575
Example 11.8: Annual COMPUSTAT Data Files, V9.2 New Filetype CSAUC3 . . . 578
Example 11.9: CRSP Daily NYSE/AMEX Combined Stocks . . . 581
Data Elements Reference: DATASOURCE Procedure . . . 586
   BEA Data Files . . . 590
   BLS Data Files . . . 591
   Global Insight DRI Data Files . . . 593
   COMPUSTAT Data Files . . . 595
   CRSP Stock Files . . . 600
   FAME Information Services Databases . . . 605
   Haver Analytics Data Files . . . 607
   IMF Data Files . . . 608
   OECD Data Files . . . 610
References . . . 612
The first three attributes in this list are used to enhance the output; the length attribute is used to control the memory and disk-space usage of the DATASOURCE procedure.

Data files currently supported by the DATASOURCE procedure include the following:

U.S. Bureau of Economic Analysis data files:
   National Income and Product Accounts
   National Income and Product Accounts PC format
   S-pages
U.S. Bureau of Labor Statistics data files:
   Consumer Price Index Surveys
   Producer Price Index Survey
   National Employment, Hours, and Earnings Survey
   State and Area Employment, Hours, and Earnings Survey
Standard & Poor's Compustat Services Financial Database Files:
   COMPUSTAT Annual
   COMPUSTAT 48 Quarter
   COMPUSTAT Full Coverage Annual
   COMPUSTAT Full Coverage 48 Quarter
Center for Research in Security Prices (CRSP) data files:
   Daily Binary Format Files
   Monthly Binary Format Files
   Daily Character Format Files
   Monthly Character Format Files
Global Insight, formerly DRI/McGraw-Hill data files:
   Basic Economics Data (formerly CITIBASE)
   DRI Data Delivery Service files
   CITIBASE Data Files
   DRI Data Delivery Service Time Series
   PC Format CITIBASE Databases
FAME Information Services Databases
Haver Analytics data files:
   United States Economic Indicators
   Specialized Databases
   Financial Indicators
   Industry
   Industrial Countries
   Emerging Markets
   International Organizations
   Forecasts and As Reported Data
   United States Regional
International Monetary Fund's Economic Information System data files:
   International Financial Statistics
   Direction of Trade Statistics
   Balance of Payment Statistics
   Government Finance Statistics
Organization for Economic Cooperation and Development:
   Annual National Accounts
   Quarterly National Accounts
   Main Economic Indicators
Table 11.1
Time Series Variables
EXRJAN     EXRSW
143.290    1.50290
143.320    1.49400
135.400    1.38250
128.240    1.33040
127.690    1.34660
129.170    1.39160
Here, the FILETYPE= option indicates that you want to read DRI's Basic Economics data file, the INFILE= option specifies the fileref CITIFILE of the external file you want to read, and the OUT= option names the SAS data set to contain the time series data.
When the INTERVAL= option is not given, the default frequency defined for the FILETYPE= type file is used. For example, the statements in the previous section extract yearly series because INTERVAL=YEAR is the default frequency for DRI's Basic Economic Data files. To extract data for several frequencies, you need to execute the DATASOURCE procedure once for each frequency.
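As a sketch of the point about frequencies, the following statements run the procedure twice against the same file: once with the default frequency and once with INTERVAL=MONTH. The FILENAME statement and FILETYPE=DRIBASIC follow the CITIBASE example used elsewhere in this chapter; the output data set names annual and monthly are hypothetical.

```sas
filename citifile "citiaf.dat" recfm=f lrecl=80;

/* Yearly series: INTERVAL=YEAR is the default for this file type.    */
proc datasource filetype=dribasic infile=citifile
                out=annual;            /* hypothetical data set name  */
run;

/* Monthly series require a separate invocation.                      */
proc datasource filetype=dribasic infile=citifile
                interval=month
                out=monthly;           /* hypothetical data set name  */
run;
```

Each OUT= data set then holds series of a single periodicity, which is what the procedure requires.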
The KEEP statement also allows input names to be quoted strings. If the name of a series in the input file contains blanks or special characters that are not valid SAS name syntax, put the series name in quotes to select it. Another way to allow the use of special characters in your SAS variable names is to use the SAS OPTIONS statement to designate VALIDVARNAME=ANY. This option allows PROC DATASOURCE to include special characters in your SAS variable names. The following is an example of extracting series from a FAME database by using the DATASOURCE procedure.
proc datasource filetype=fame dbname='fame_nys /disk1/prc/prc'
                interval=weekday out=outds outcont=attrds;
   range '1jan90'd to '1feb90'd;
   keep cci.close
        '{ibm.high,ibm.low,ibm.close}'
        'mave(ibm.close,30)'
The resulting output data set OUTDS contains the following series: DATE, CCI_CLOS, IBM_HIGH, IBM_LOW, IBM_CLOS, IBM30DAY, GM_VOLUM, F_VOLUME, C_VOLUME, CCI_IBM.

Obviously, to be able to use KEEP and DROP statements, you need to know the names of the time series variables available in the data file. The OUTCONT= option gives you this information. More specifically, the OUTCONT= option creates a data set containing descriptive information on the time series of the same frequency. This descriptive information includes series names, a flag indicating whether the series is selected for output, series variable types, lengths, position of series in the OUT= data set, labels, format names, format lengths, format decimals, and a set of FILETYPE= specific descriptor variables. For example, the following statements list some of the monthly series available in the CITIFILE; the output is shown in Figure 11.1.
/*-- Selecting Time Series Variables -- The KEEP and DROP Statements --*/
filename citifile "citiaf.dat" RECFM=F LRECL=80;

proc datasource filetype=dribasic infile=citifile
                interval=month outcont=vars;
   drop e: ;
run;

title1 'Some Time Series Variables Available in CITIFILE';
proc print data=vars;
run;
Cross-section one is identified by (COUNTRY='112' CSC='F' PARTNER=' ' VERSION='Z').
Cross-section two is identified by (COUNTRY='146' CSC='F' PARTNER=' ' VERSION='Z').
Cross-section three is identified by (COUNTRY='158' CSC='F' PARTNER=' ' VERSION='Z').
Table 11.2
BY Variables         Time ID Variable    Time Series Variables
COUNTRY   VERSION    DATE                EFFEXR    EXRINDEX
112       Z          SEP1987               9326       12685
112       Z          OCT1987               9393       12813
112       Z          NOV1987               9626       13694
112       Z          DEC1987               9675       14099
112       Z          JAN1988               9581       13910
112       Z          FEB1988               9493       13549
146       Z          SEP1987              12046       16192
146       Z          OCT1987              12067       16266
146       Z          NOV1987              12558       17596
146       Z          DEC1987              12759       18301
146       Z          JAN1988              12642       18082
146       Z          FEB1988              12409       17470
158       Z          SEP1987              13841       16558
158       Z          OCT1987              13754       16499
158       Z          NOV1987              14222       17505
158       Z          DEC1987              14768       18423
158       Z          JAN1988              14933       18565
158       Z          FEB1988              14915       18331
Note that the data sets in Table 11.1 and Table 11.2 use two different ways of representing time series data for three different countries: the United Kingdom (COUNTRY=112), Switzerland (COUNTRY=146), and Japan (COUNTRY=158). The first representation (Table 11.1) incorporates each country's name into the series names, while the second representation (Table 11.2) represents countries as different cross sections by using the BY variable named COUNTRY. See "Time Series and SAS Data Sets" in Chapter 3, "Working with Time Series Data."
Obs    CSC    PARTNER    VERSION    NSERIES    NSELECT
  1    F                 Z                6          3
  2    F                 Z                6          3
  3    F                 Z                6          3
  4    F                 Z                6          3
  5    F                 Z                6          3
The OUTBY= data set reports the total number of series, NSERIES, defined in each cross section, NSELECT of which represent the selected variables. If you want to see the descriptive information on each of these NSELECT variables for each cross section, specify the OUTALL= option. For example, the following statements print descriptive information on all monthly series defined for all cross sections (COUNTRY=111, COUNTRY=112, COUNTRY=146, COUNTRY=158, and COUNTRY=186), which are shown in Figure 11.4.
filename datafile "imfifs1.dat" RECFM=F LRECL=88;

title3 'Time Series Defined in Cross Section';
proc datasource filetype=imfifsp outselect=on ebcdic
                interval=month outall=ifsall;
run;

title4 'Cross Sections Available in OUTALL=IFSALL Data Set';
proc print data=ifsall;
run;
Figure 11.4 (OUTALL= data set IFSALL, condensed)

The following values are the same in all 15 observations: CSC=F, PARTNER=(blank), VERSION=Z, KEPT=1, TYPE=1, LENGTH=5, VARNUM=., LABEL=MARKET RATE CONVERSION FACTOR, FORMAT=(blank), FORMATL=0, FORMATD=0, ST_DATE=JAN1957, END_DATE=SEP1986, NTIME=357, NOBS=357, DATATYPE=S.

Obs    COUNTRY    CNTYNAME          NAME      BLKNUM    DU_CODE    NDEC
  1    111        UNITED STATES     F___AA         1    E             5
  2    111        UNITED STATES     F___AC         2    F             5
  3    111        UNITED STATES     F___AE         3    A             5
  4    112        UNITED KINGDOM    F___AA         4    E             6
  5    112        UNITED KINGDOM    F___AC         5    F             5
  6    112        UNITED KINGDOM    F___AE         6    A             6
  7    146        SWITZERLAND       F___AA         7    E             4
  8    146        SWITZERLAND       F___AC         8    F             6
  9    146        SWITZERLAND       F___AE         9    A             4
 10    158        JAPAN             F___AA        10    E             3
 11    158        JAPAN             F___AC        11    F             6
 12    158        JAPAN             F___AE        12    A             3
 13    186        TURKEY            F___AA        13    E             3
 14    186        TURKEY            F___AC        14    F             5
 15    186        TURKEY            F___AE        15    A             3
The OUTCONT= data set contains one observation for each time series variable with the descriptive information summarized over BY groups. When the data file contains no cross sections, the OUTCONT= and OUTALL= data sets are equivalent, except that the OUTALL= data set also reports time ranges of available data. The OUTBY= data set in this case contains a single observation reporting the number of series and time ranges for the whole data file.
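The relationship among the three auxiliary data sets can be explored by requesting all of them in one step. The following is a minimal sketch that reuses the IMF file setup from the OUTALL= example; the output data set names are hypothetical. OUT= is named explicitly because, when any auxiliary data set is requested, the OUT= data set is created only if you specify it.

```sas
filename datafile "imfifs1.dat" recfm=f lrecl=88;

/* Sketch: request the series data and all three auxiliary data sets. */
/* All output data set names below are hypothetical.                  */
proc datasource filetype=imfifsp ebcdic interval=month
                out=ifsdata        /* extracted time series           */
                outcont=ifsvars    /* one row per series              */
                outby=ifsxsec      /* one row per cross section       */
                outall=ifsall;     /* one row per series per section  */
run;

proc print data=ifsvars; run;
proc print data=ifsxsec; run;
proc print data=ifsall;  run;
```

Comparing the three printouts side by side shows how OUTALL= combines the per-series information of OUTCONT= with the per-cross-section information of OUTBY=.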
proc datasource filetype=imfifsp infile=ifsfile interval=month
                out=market outcont=mrktvars;
   where country in ('112','146','158') and partner=' ';
   keep f___aa f___ac;
   range from '01jun85'd to '01feb86'd;
   rename f___aa=alphmkt f___ac=charmkt;
   label f___aa='F___AA: Market Rate Conversion Factor Used in Alpha Test'
         f___ac='F___AC: Market Rate Conversion Used in Charlie Test';
run;

title1 'Printout of OUTCONT= Showing New NAMEs and LABELs';
proc print data=mrktvars;
   var name label length;
run;

title1 'Contents of OUT= Showing New NAMEs and LABELs';
proc contents data=market;
run;
The RENAME statement allows input names to be quoted strings. If the name of a series in the input file contains blanks or special characters that are not in valid SAS name syntax, use the SAS option VALIDVARNAME=ANY or put the series name in quotes to rename it. See the FAME example that uses RENAME in the section "Selecting Time Series Variables -- The KEEP and DROP Statements" on page 524.
Figure 11.5 Renaming and Labeling Variables
Printout of OUTCONT= Showing New NAMEs and LABELs

Obs    NAME       KEPT    SELECTED    TYPE    LENGTH    VARNUM    FORMAT    FORMATL    FORMATD    LABEL
  1    alphmkt       1           1       1         5         6                    0          0    F___AA: Market Rate Conversion Factor Used in Alpha Test
  2    charmkt       1           1       1         5         7                    0          0    F___AC: Market Rate Conversion Used in Charlie Test

Contents of OUT= Showing New NAMEs and LABELs

Variable    Len    Format     Label
COUNTRY       3               COUNTRY CODE
CSC           1               CONTROL SOURCE CODE
DATE          4    MONYY7.    Date of Observation
PARTNER       3               PARTNER COUNTRY CODE
VERSION       1               VERSION CODE
alphmkt       5               F___AA: Market Rate Conversion Factor Used in Alpha Test
charmkt       5               F___AC: Market Rate Conversion Used in Charlie Test
Notice that even though you changed the names of F___AA and F___AC to alphmkt and charmkt, respectively, you still use their old names in the KEEP and LABEL statements because renaming takes place at the output stage.
   label f___aa='F___AA: Market Rate Conversion Factor Used in Alpha Test'
         f___ac='F___AC: Market Rate Conversion Used in Charlie Test';
   length _numeric_ 4;
   length f___aa 6;
run;

title1 'Printout of OUTCONT= Showing New NAMEs and LABELs';
proc print data=mrktvars;
   var name label length;
run;

title1 'Contents of OUT= Showing New NAMEs and LABELs';
proc contents data=market;
run;
Printout of OUTCONT= Showing New NAMEs and LABELs

Obs    NAME       KEPT    SELECTED    TYPE    LENGTH    VARNUM    FORMAT    FORMATL    FORMATD    LABEL
  1    alphmkt       1           1       1         6         6                    0          0    F___AA: Market Rate Conversion Factor Used in Alpha Test
  2    charmkt       1           1       1         4         7                    0          0    F___AC: Market Rate Conversion Used in Charlie Test

Contents of OUT= Showing New NAMEs and LABELs

Variable    Len    Format     Label
COUNTRY       3               COUNTRY CODE
CSC           1               CONTROL SOURCE CODE
DATE          4    MONYY7.    Date of Observation
PARTNER       3               PARTNER COUNTRY CODE
VERSION       1               VERSION CODE
alphmkt       6               F___AA: Market Rate Conversion Factor Used in Alpha Test
charmkt       4               F___AC: Market Rate Conversion Used in Charlie Test
The default lengths of the character variables are set to the minimum number of characters that can hold the longest possible value.
The PROC DATASOURCE statement is required. All the rest of the statements are optional. The DATASOURCE procedure uses two kinds of statements: subsetting statements and attribute statements. Subsetting statements provide selection of time series data over selected time periods and cross sections from the input data file. Attribute statements control the attributes of the variables in the output SAS data set. The subsetting statements are the KEEP, DROP, KEEPEVENT, and DROPEVENT statements (which select output variables); the RANGE statement (which selects time ranges); and the WHERE statement (which selects cross sections). The attribute statements are the ATTRIBUTE, FORMAT, LABEL, LENGTH, and RENAME statements. The statements and options used by PROC DATASOURCE are summarized in Table 11.3.
Table 11.3 Summary of Syntax
Option                Description

Input Data File Options
FILETYPE=             type of input data file to read
INFILE=               fileref(s) of the input data
LRECL=                lrecl(s) of the input data
RECFM=                recfm(s) of the input data
ASCII                 character set of the incoming data
EBCDIC                character set of the incoming data

Output Data Set Options
OUT=                  write the extracted time series data
OUTALL=               information on time series and cross sections
OUTBY=                information on only cross sections
OUTCONT=              information on only time series variables
OUTEVENT=             write event-oriented data
OUTSELECT=            control reporting of all or only selected series and cross sections
INDEX                 create single indexes from BY variables for the OUT= data set
ALIGN=                control the alignment of SAS date values

Subsetting Option and Statements
INTERVAL=             select periodicity of series to extract
KEEP                  time series to include in the OUT= data set
DROP                  time series to exclude from the OUT= data set
KEEPEVENT             events to include in the OUTEVENT= data set
DROPEVENT             events to exclude from the OUTEVENT= data set
WHERE                 select cross sections for output
RANGE                 time range of observations to be output

Assigning Attributes Options and Statements
FORMAT                assign formats to variables in the output data sets
ATTRIBUTE FORMAT=     assign formats to variables in the output data sets
LABEL                 assign labels to variables in the output data sets
ATTRIBUTE LABEL=      assign labels to variables in the output data sets
LENGTH                control the lengths of variables in the output data sets
ATTRIBUTE LENGTH=     control the lengths of variables in the output data sets
RENAME                assign new names to variables in the output data sets
ALIGN= option

controls the alignment of SAS dates used to identify output observations. The ALIGN= option allows the following values: BEGINNING | BEG | B, MIDDLE | MID | M, and ENDING | END | E. BEGINNING is the default.

ASCII

specifies that the incoming data are ASCII. This option is used when the native character set of your host machine is EBCDIC.

DBNAME= database name

specifies the FAME database to access. Use this option only with the FILETYPE=FAME option. The character string you specify in the DBNAME= option is passed through to FAME. Specify the value of this option as you would in accessing the database from within FAME software.

EBCDIC

specifies that the incoming data are EBCDIC. This option is needed when the native character set of your host machine is ASCII.

FAMEPRINT

prints the FAME command file generated by PROC DATASOURCE and the log file produced by the FAME component of the interface system. Use this option only with the FILETYPE=FAME option.

FILETYPE= entry
DBTYPE= dbtype

specifies the kind of input data file to process. See "Data Elements Reference: DATASOURCE Procedure" on page 586 for a list of supported file types. The FILETYPE= option is required.
INDEX
creates a set of single indexes from BY variables for the OUT= data set. Under some circumstances, creating indexes for a SAS data set can increase the efficiency in locating observations when BY or WHERE statements are used in subsequent steps. Refer to SAS Language Reference: Concepts for more information about SAS indexes. The INDEX option is ignored when no OUT= data set is created or when the data file does not contain any BY variables. The INDEX= data set option can be used to override the index variable definitions.

INFILE= fileref
INFILE= (fileref1 fileref2 . . . filerefn)

specifies the fileref assigned to the input data file. The default value is DATAFILE. The fileref used in the INFILE= option (or, if no INFILE= option is specified, the fileref DATAFILE) must be associated with the physical data file in a FILENAME statement. (On some operating systems, the fileref assignment can be made with the system's control language, and a FILENAME statement might not be needed.) Refer to SAS Language Reference: Dictionary for more details about the FILENAME statement. Physical data files can reside on DVD, CD-ROM, or other media.

For some file types, the data are distributed over several files. In this case, the INFILE= option is required, and it lists in parentheses the filerefs for each of the files making up the database. The order in which these filerefs are listed is important and must conform to the specifics of each file type as explained in "Data Elements Reference: DATASOURCE Procedure" on page 586.

LRECL= lrecl
LRECL= (lrecl1 lrecl2 . . . lrecln)

specifies the logical record length in bytes of the infile. Use this option only if you need to override the default LRECL of the file. For some file types, the data are distributed over several files. In this case, the LRECL= option lists in parentheses the LRECLs for each of the files making up the database. The order in which these LRECLs are listed is important and must conform to the specifics of each file type as explained in "Data Elements Reference: DATASOURCE Procedure" on page 586.

RECFM= recfm
RECFM= (recfm1 recfm2 . . . recfmn)

specifies the record format of the infile. Use this option only if you need to override the default record format of the file. For some file types, the data are distributed over several files. In this case, the RECFM= option lists in parentheses the RECFMs for each of the files making up the database. The order in which these RECFMs are listed is important and must conform to the specifics of each file type as explained in "Data Elements Reference: DATASOURCE Procedure" on page 586. The possible values of RECFM are as follows:

   F or FIXED                                   for fixed length records
   N or BIN                                     for binary records
   D or VAR                                     for varying length records
   U or DEF                                     for host default record format
   DOM_V or DOMAIN_VAR or BIN_V or BIN_VAR      for UNIX binary record format

INTERVAL= interval
FREQUENCY= interval
TYPE= interval

specifies the periodicity of series selected for output to the OUT= data set. The OUT= data set created by PROC DATASOURCE can contain only time series with the same periodicity. Some data files contain time series with different periodicities; for example, a file can contain both monthly series and quarterly series. Use the INTERVAL= option to indicate which periodicity you want. If you want to extract series with different periodicities, use different PROC DATASOURCE invocations with the desired INTERVAL= options. Common values for INTERVAL= are YEAR, QUARTER, MONTH, WEEK, and DAY. The values allowed, as well as the default value of the INTERVAL= option, depend on the file type. See "Data Elements Reference: DATASOURCE Procedure" on page 586 for the INTERVAL= values appropriate to the data file type you are reading.
OUT= SAS-data-set
names the output data set for the time series extracted from the data file. If none of the output data set options are specified, including the OUT= data set itself, an OUT= data set is created and named according to the DATAn convention. However, when you create any of the other output data sets, such as OUTCONT=, OUTBY=, OUTALL=, or OUTEVENT=, you must explicitly specify the OUT= data set; otherwise, it will not be created. See "OUT= Data Set" on page 548 for further details.
OUTALL= SAS-data-set
writes information on the contents of the input data file to an output data set. The OUTALL= data set includes descriptive information, time ranges, and observation counts for all the time series within each BY group. By default, no OUTALL= data set is created. The OUTALL= data set contains the Cartesian product of the information output by the OUTCONT= and OUTBY= options. In data files for which there are no cross sections, the OUTALL= and OUTCONT= data sets are almost equivalent, except that the OUTALL= data set also reports time ranges and observation counts of series. See "OUTALL= Data Set" on page 552 for further details.
OUTBY= SAS-data-set
writes information on the BY variables to an output data set. The OUTBY= data set contains the list of cross sections in the database delimited by the unique set of values that the BY variables assume. Unless the OUTSELECT=OFF option is present, only the selected BY groups are written to the OUTBY= data set. If you omit the OUTBY= option, no OUTBY= data set is created. See OUTBY= Data Set on page 551 for further details.
OUTCONT= SAS-data-set
writes information on the contents of the input data file to an output data set. The OUTCONT= data set includes descriptive information on the unique series of the selected periodicity in the data file. Unless you specify OUTSELECT=OFF, the OUTCONT= data set includes observations only for the series selected for output to the OUT= data set. By default, no OUTCONT= data set is created. See "OUTCONT= Data Set" on page 550 for further details.
OUTEVENT= SAS-data-set
names the output data set for event-oriented time series data. This option can be used only when CRSP stock files are being processed. For all other file types, it is ignored. See "OUTEVENT= Data Set" on page 553 for further details.
OUTSELECT= ON | OFF
determines whether to output all observations (OUTSELECT=OFF) or only those corresponding to the selected time series and selected BY groups (OUTSELECT=ON) to the OUTCONT=, OUTBY=, and OUTALL= data sets. The OUTSELECT= option is relevant only when at least one of the OUTCONT=, OUTBY=, or OUTALL= options is specified. The default is OUTSELECT=ON.
KEEP Statement
KEEP variable-list ;
The KEEP statement specifies which variables in the data file are to be included in the OUT= data set. Only the time series and event variables can be specified in a KEEP statement. All the BY variables and the time ID variable DATE are always included in the OUT= data set; they cannot be referenced in a KEEP statement. If they are referenced, a warning message is given and the reference is ignored.

The variable list can contain variable names or name range specifications. See "Variable Lists" on page 547 for details.

There is a default KEEP list for each file type. Usually, descriptor type variables, like footnotes, are not included in the default KEEP list. If you give a KEEP statement, the default list becomes undefined. Only one KEEP or one DROP statement can be used. KEEP and DROP are mutually exclusive.

You can also use the KEEP= data set option to control which variables to include in the OUT= data set. However, the KEEP statement differs from the KEEP= data set option in several respects:
- The KEEP statement selection is applied before variables are read from the data file, while the KEEP= data set option selection is applied after variables are read and as they are written to the OUT= data set. Therefore, using the KEEP statement instead of the KEEP= data set option is much more efficient.
- If the KEEP statement causes no series variables to be selected, then no observations are output to the OUT= data set.
- The KEEP statement variable specifications are applied to each cross section independently. This behavior may produce variables different from those produced by the KEEP= data set option when order-range variable list specifications are used.
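The difference between the two selection mechanisms can be sketched as follows. This is only an illustration under stated assumptions: the fileref CITIFILE, the file name, the FILETYPE= value, and the IP: series prefix are placeholders borrowed from other examples in this chapter, not a statement about any particular data file.

```sas
/* Placeholder fileref and file name; substitute your own data file */
filename citifile 'citidem.dat' recfm=d lrecl=80;

/* KEEP statement: the selection is applied before series are read */
proc datasource filetype=citibase infile=citifile
                interval=month out=ip_stmt;
   keep ip: ;
run;

/* KEEP= data set option: all series are read, then filtered as they
   are written. DATE must be listed here because a data set option
   applies to every variable, including the time ID variable. */
proc datasource filetype=citibase infile=citifile
                interval=month out=ip_opt(keep=date ip:);
run;
```

For a file without cross sections, both steps would be expected to produce the same series; the first form avoids reading series that are later discarded.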
DROP Statement
DROP variable-list ;
The DROP statement specifies that some variables be excluded from the OUT= data set. Only the time series and event variables can be specified in a DROP statement. None of the BY variables or the time ID variable DATE can be excluded from the OUT= data set. If they are referenced in a DROP statement, a warning message is given and the reference is ignored. Use the WHERE statement for selection based on BY variables, and use the RANGE statement for date selections.

The variable list can contain variable names or name range specifications. See "Variable Lists" on page 547 for details. Only one DROP or one KEEP statement can be used. KEEP and DROP are mutually exclusive.

There is a default DROP or KEEP list for each file type. Usually, descriptor type variables, like footnotes, are not included in the default KEEP list. If you specify a DROP statement, the default list becomes undefined.

You can also use the DROP= data set option to control which variables to exclude from the OUT= data set. However, the DROP statement differs from the DROP= data set option in several respects:
- The DROP statement selection is applied before variables are read from the data file, while the DROP= data set option selection is applied after variables are read and as they are written to the OUT= data set. Therefore, using the DROP statement instead of the DROP= data set option is much more efficient.
- If the DROP statement causes all series variables to be excluded, then no observations are output to the OUT= data set.
- The DROP statement variable specifications are applied to each cross section independently. This behavior may produce variables different from those produced by the DROP= data set option when order-range variable list specifications are used.
KEEPEVENT Statement
KEEPEVENT variable-list ;
The KEEPEVENT statement specifies which event variables in the data file are to be included in the OUTEVENT= data set. As a result, the KEEPEVENT statement is valid only for data files containing event-oriented time series data. All the BY variables, the time ID variable DATE, and the event-grouping variable EVENT are always included in the OUTEVENT= data set. These variables cannot be referenced in the KEEPEVENT statement. If any of these variables are referenced, a warning message is given and the reference is ignored.

The variable list can contain variable names or name range specifications. See "Variable Lists" on page 547 for details. Only one KEEPEVENT or one DROPEVENT statement can be used. KEEPEVENT and DROPEVENT are mutually exclusive.

You can also use the KEEP= data set option to control which event variables to include in the OUTEVENT= data set. However, the KEEPEVENT statement differs from the KEEP= data set option in several respects:
- The KEEPEVENT statement selection is applied before variables are read from the data file, while the KEEP= data set option selection is applied after variables are read and as they are written to the OUTEVENT= data set. Therefore, using the KEEPEVENT statement instead of the KEEP= data set option is much more efficient.
- If the KEEPEVENT statement causes no event variables to be selected, then no observations are output to the OUTEVENT= data set.
DROPEVENT Statement
DROPEVENT variable-list ;
The DROPEVENT statement specifies that some event variables be excluded from the OUTEVENT= data set. As a result, the DROPEVENT statement is valid only for data files containing event-oriented time series data. All the BY variables, the time ID variable DATE, and the event-grouping variable EVENT are always included in the OUTEVENT= data set. These variables cannot be referenced in the DROPEVENT statement. If any of these variables are referenced, a warning message is given and the reference is ignored.

The variable list can contain variable names or name range specifications. See "Variable Lists" on page 547 for details. Only one DROPEVENT or one KEEPEVENT statement can be used. DROPEVENT and KEEPEVENT are mutually exclusive.

You can also use the DROP= data set option to control which event variables to exclude from the OUTEVENT= data set. However, the DROPEVENT statement differs from the DROP= data set option in several respects:
- The DROPEVENT statement selection is applied before variables are read from the data file, while the DROP= data set option selection is applied after variables are read and as they are written to the OUTEVENT= data set. Therefore, using the DROPEVENT statement instead of the DROP= data set option is much more efficient.
- If the DROPEVENT statement causes all event variables to be excluded, then no observations are output to the OUTEVENT= data set.
WHERE Statement
WHERE where-expression ;
The WHERE statement specifies conditions that BY variables must satisfy in order for a cross section to be included in the OUT= and OUTEVENT= data sets. By default, all BY groups are selected. The where-expression must refer only to BY variables defined for the file type you are reading. The section "Data Elements Reference: DATASOURCE Procedure" on page 586 lists the names of the BY variables for each file type.

For example, DOTS (Direction of Trade Statistics) files, distributed by the International Monetary Fund, have four BY variables: COUNTRY, CSC, PARTNER, and VERSION. Both COUNTRY and PARTNER are three-digit country codes. To select the direction of trade statistics of the United States (COUNTRY=111) with Turkey (PARTNER=186), Japan (PARTNER=158), and the oil exporting countries group (PARTNER=985), you should specify
where country=111 and partner in (186,158,985);
You can use any SAS language operators and special WHERE expression operators in the WHERE statement condition. Refer to SAS Language Reference: Concepts for a more detailed discussion of WHERE expressions. If you want to see the names of the BY variables and the values they assume for each cross section, you can first run PROC DATASOURCE with only the OUTBY= option. The information contained in the OUTBY= data set will aid you in selecting the appropriate BY groups for subsequent PROC DATASOURCE steps.
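The two-pass approach described above might look like the following sketch. The fileref DOTSFILE, the file name, and the FILETYPE=IMFDOTSP value are assumptions made for illustration; check the file type list for the value that matches your DOTS file.

```sas
/* Assumptions: fileref, path, and FILETYPE= value are placeholders */
filename dotsfile 'dots.dat';

/* Pass 1: request only the OUTBY= data set to list the BY groups.
   No OUT= data set is created because OUT= is not specified. */
proc datasource filetype=imfdotsp infile=dotsfile outby=dotskey;
run;

proc print data=dotskey;
run;

/* Pass 2: use the BY values seen in DOTSKEY to select cross sections */
proc datasource filetype=imfdotsp infile=dotsfile out=ustrade;
   where country=111 and partner in (186,158,985);
run;
```

Inspecting the printed OUTBY= data set before writing the WHERE expression avoids guessing at BY-variable names and codes.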
RANGE Statement
RANGE FROM from TO to ;
The RANGE statement selects the time range of observations written to the OUT= and OUTEVENT= data sets. The from and to values can be SAS date, time, or datetime constants, or they can be specified as year or year:period, where year is a two-digit or four-digit year, and period (when specified) is a period within the year corresponding to the INTERVAL= option. (For example, if INTERVAL=QTR, then period refers to quarters.) When period is omitted, the beginning of the year is assumed for the from value, and the end of the year is assumed for the to value. If a two-digit year is specified, PROC DATASOURCE uses the current value of the YEARCUTOFF option to determine the century of your data. Warnings are issued in the SAS log whenever DATASOURCE needs to determine the century from a two-digit year specification. The default YEARCUTOFF value is 1920. To use a different YEARCUTOFF value, specify
options yearcutoff=yyyy;
where YYYY is the YEARCUTOFF value you want to use. See SAS Language Reference: Dictionary for a more detailed discussion of the YEARCUTOFF option. Both the FROM and TO specifications are optional, and both the FROM and TO keywords are optional. If the FROM limit is omitted, the output observations start with the minimum date for which data are available for any selected series. Similarly, if the TO limit is omitted, the output observations end with the maximum date for which data are available. The following are some examples of RANGE statements:
range from 1980 to 1990;
range 1980 - 1990;
range from 1980;
range 1980;
range to 1990;
range to 1990:2;
range from '31aug89'd to '28feb1990'd;
The RANGE statement applies to each BY group independently. If all the selected series contain no data in the specified range for a given BY group, then there will be no observations for that BY group in the OUT= and OUTEVENT= data sets. If you want to know the time ranges for which periodic time series data are available, you can first run PROC DATASOURCE with the OUTBY= or OUTALL= option. The OUTBY= data set reports the union of the time ranges over all the series within each BY group, while the OUTALL= data set gives time ranges for each series separately in each BY group.
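The effect of the YEARCUTOFF option on two-digit years, mentioned in the RANGE statement discussion, can be checked in isolation with a short DATA step. This sketch uses only Base SAS and shows the general YEARCUTOFF rule, not PROC DATASOURCE itself: with YEARCUTOFF=1920, two-digit years fall in the 100-year span 1920 through 2019.

```sas
options yearcutoff=1920;

data _null_;
   /* 19 is below the cutoff year digits, so it resolves to 2019 */
   d1 = input('01JAN19', date7.);
   /* 20 is the first year of the span, so it resolves to 1920 */
   d2 = input('01JAN20', date7.);
   put d1= date9. d2= date9.;
run;
```

The log line reads d1=01JAN2019 d2=01JAN1920. Specifying four-digit years in the RANGE statement avoids both the ambiguity and the warnings in the SAS log.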
ATTRIBUTE Statement
ATTRIBUTE variable-list attribute-list . . . ;
The ATTRIBUTE statement assigns formats, labels, and lengths to variables in the output data sets. The variable-list can contain variable names and variable name range specifications. See "Variable Lists" on page 547 for details. The attributes specified in the following attribute list apply to all variables in the variable list. An attribute-list consists of one or more of the following options:
FORMAT= format
associates a format with variables in variable-list. The format can be either a standard SAS format or a format defined with the FORMAT procedure. The default formats for variables depend on the file type.
LABEL= "label"
assigns a label to the variables in the variable list. The default labels for variables depend on the file type. Labels can be up to 256 bytes in length.
LENGTH= length
specifies the number of bytes used to store the values of variables in the variable list. The default lengths for numeric variables depend on the file type. Usually default lengths are set to 5 bytes. The length specification also controls the amount of memory that PROC DATASOURCE uses to hold variable values while processing the input data file. Thus, specifying a LENGTH= value smaller than the default will reduce both the disk space taken up by the output data sets and the amount of memory used by the PROC DATASOURCE step, at the cost of precision of output data values.
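As a rough illustration of the syntax above, the following sketch applies one attribute list to every variable matched by a prefix range. The fileref, file name, FILETYPE= value, and the IP: prefix are assumptions chosen to match other examples in this chapter; the particular format and length are arbitrary.

```sas
/* Placeholder fileref and file name; substitute your own data file */
filename citifile 'citidem.dat' recfm=d lrecl=80;

proc datasource filetype=citibase infile=citifile
                interval=month out=ipdata;
   keep ip: ;
   /* FORMAT= and LENGTH= both apply to all variables matching IP: */
   attribute ip: format=12.2 length=6;
run;
```

Because the attribute list applies to the whole variable list, LABEL= is usually more useful with a single variable than with a range.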
FORMAT Statement
FORMAT variable-list format . . . ;
The FORMAT statement assigns formats to variables in output data sets. The variable-list can contain variable names and variable name range specifications. See "Variable Lists" on page 547 for details. The format specified applies to all variables in the variable list. A single FORMAT statement can assign the same format to several variables or different formats to different variables. The FORMAT statement can use standard SAS formats or formats defined using the FORMAT procedure. Any later format specification for a variable, using either the FORMAT statement or the FORMAT= option in the ATTRIBUTE statement, always overrides the previous one.
LABEL Statement
LABEL variable = "label" . . . ;
The LABEL statement assigns SAS variable labels to variables in the output data sets. You can give labels for any number of variables in a single LABEL statement. The default labels for variables depend on the file type. Extra-long labels (more than 256 bytes) reside in the OUTCONT= data set as the DESCRIPT variable. Any later label specification for a variable, using either the LABEL statement or the LABEL= option in the ATTRIBUTE statement, always overrides the previous one.
LENGTH Statement
LENGTH variable-list length . . . ;
The LENGTH statement, like the LENGTH= option in the ATTRIBUTE statement, specifies the number of bytes used to store values of variables in output data sets. The default lengths for numeric variables depend on the file type. Usually default lengths are set to 5 bytes. The default lengths of character variables are defined as the minimum number of characters that can hold the longest possible value. For some file types, the LENGTH statement also controls the amount of memory used to store values of numeric variables while processing the input data file. Thus, specifying LENGTH values smaller than the default will reduce both the disk space taken up by the output data sets and the amount of memory used by the PROC DATASOURCE step, at the cost of precision of output data values. Any later length specification for a variable, using either the LENGTH statement or the LENGTH= option in the ATTRIBUTE statement, always overrides the previous one.
RENAME Statement
RENAME old-name = new-name . . . ;
The RENAME statement is used to change the names of variables in the output data sets. Any number of variables can be renamed in a single RENAME statement. The most recent RENAME specification overrides any previous ones for a given variable. The new-name is limited to 32 characters. Renaming of variables is done at the output stage. Therefore, you need to use the old variable names in all other PROC DATASOURCE statements. For example, the series variable names DATA1-DATA350 used with annual COMPUSTAT files are not very descriptive, so you may choose to rename them to reflect the financial aspect they represent. You may rename DATA51 as INVESTTAX with the RENAME statement
rename data51=investtax;
since it contains investment tax credit data. However, in all other DATASOURCE statements, you must use the old name, DATA51.
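The rule that other statements must keep using the old name can be seen in a complete step. In this sketch the fileref COMPFILE, the file name, the FILETYPE=CSAIBM value, and the output data set name are assumptions made for illustration; only DATA51 and INVESTTAX come from the text above.

```sas
/* Assumptions: fileref, path, and FILETYPE= value are placeholders
   for an annual COMPUSTAT file */
filename compfile 'compustat.dat';

proc datasource filetype=csaibm infile=compfile out=taxcredit;
   keep data51;                            /* old name */
   label data51='Investment Tax Credit';   /* old name */
   rename data51=investtax;                /* takes effect on output */
run;

/* Steps that read the output data set see only the new name */
proc print data=taxcredit(obs=10);
   var date investtax;
run;
```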
Variable Lists
Variable lists used in PROC DATASOURCE statements can consist of any combination of variable names and name range specifications. Items in variable lists can have the following forms:
- a name, such as PZU.
- an alphabetic range name1-name2. For example, A-DZZZZZZZ specifies all variables with names starting with A, B, C, or D.
- a prefix range prefix:. For example, IP: selects all variables with names starting with the letters IP.
- an order range name1--name2. For example, GLR72--GLRD72 specifies all variables in the input data file between GLR72 and GLRD72 inclusive.
- a numeric order range name1-NUMERIC-name2. For example, GLR72-NUMERIC-GLRD72 specifies all numeric variables between GLR72 and GLRD72 inclusive.
- a character order range name1-CHARACTER-name2. For example, GLR72-CHARACTER-GLRD72 specifies all character variables between GLR72 and GLRD72 inclusive.
- one of the keywords _NUMERIC_, _CHARACTER_, or _ALL_. The keyword _NUMERIC_ specifies all numeric variables, _CHARACTER_ specifies all character variables, and _ALL_ specifies all variables.

To determine the order of series in a data file, run PROC DATASOURCE with the OUTCONT= option, and print the output data set.

Note that order and alphabetic range specifications are inclusive, meaning that the beginning and ending names of the range are also included in the variable list. For order ranges, the names used to define the range must actually name variables in the input data file. For alphabetic ranges, however, the names used to define the range need not be present in the data file.

Note that variable specifications are applied to each cross section independently. This may cause the order-range variable list specification to behave differently than its DATA step and data set option counterparts. This is because PROC DATASOURCE knows which variables are defined for which cross sections, while the DATA step applies order range specification to the whole collection of time series variables. If the ending variable name in an order range specification is not in the current cross section, all variables starting from the beginning variable to the last variable defined in that cross section get selected. If the first variable is not in the current cross section, then order range specification has no effect for that cross section.

The variable names used in variable list specifications can refer either to series names appearing in the input data file or to the SAS names assigned to series data fields internally if the series names are not recorded in the INFILE= file. When the latter is the case, internally defined variable names are listed in "Data Elements Reference: DATASOURCE Procedure" on page 586 later in this chapter. The following are examples of the use of variable lists:
keep ip: pw112-pw117 pzu;
drop data1-data99 data151-data350;
length data1-numeric-aftnt350 ucode 4;
The first statement keeps all the variables starting with IP, all the variables between PW112 and PW117 including PW112 and PW117 themselves, and a single variable PZU. The second statement drops all the variables that fall alphabetically between DATA1 and DATA99, and between DATA151 and DATA350. Finally, the third statement assigns a length of 4 bytes to all the numeric variables defined between DATA1 and AFTNT350, and UCODE. Variable lists cannot exceed 200 characters in length.
WHERE statement to process the OUT= data set by cross sections. The order in which BY variables are defined in the OUT= data set corresponds to the order in which the data file is sorted.
- DATE, a SAS date-, time-, or datetime-valued variable that reports the time period of each observation. The values of the DATE variable may span different time ranges for different BY groups. The format of the DATE variable depends on the INTERVAL= option.
- the periodic time series variables, which are included in the OUT= data set only if they have data in at least one selected BY group and they are not discarded by a KEEP or DROP statement.
- the event variables, which are included in the OUT= data set if they are not discarded by a KEEP or DROP statement. By default, these variables are not output to the OUT= data set.

The values of BY variables remain constant in each cross section. Observations within each BY group correspond to the sampling of the series variables at the time periods indicated by the DATE variable.

You can create a set of single indexes for the OUT= data set by using the INDEX option, provided there are BY variables. Under some circumstances, this may increase the efficiency of subsequent PROC and DATA steps that use BY and WHERE statements. However, there is a cost associated with creation and maintenance of indexes. The SAS Language Reference: Concepts lists the conditions under which the benefits of indexes outweigh the cost.

With data files containing cross sections, there can be various degrees of overlap among the series variables. One extreme is when all the series variables contain data for all the cross sections. In this case, the output data set is very compact. In the other extreme case, however, the set of time series variables is unique for each cross section, making the output data set very sparse, as depicted in Table 11.4.
[Table 11.4: The OUT= Data Set Containing Unique Series for Each BY Group]
The data in Table 11.4 can be represented more compactly if cross-sectional information is incorporated into series variable names.
By default, observations for only the selected BY groups (where BYSELECT=1) are output to the OUTBY= data set, and the date and time range variables are computed over only the selected time series variables. If the OUTSELECT=OFF option is specified, the OUTBY= data set contains an observation for each BY group, and the date and time range variables are computed over all the time series variables. For file types that have no BY variables, the OUTBY= data set contains one observation giving ST_DATE, END_DATE, NTIME, NOBS, NINRANGE, NSERIES, and NSELECT for all the series in the file. If you do not know the BY variable names or their possible values, you can do an initial run of PROC DATASOURCE with the OUTBY= option. The information contained in the OUTBY= data set can help you design your WHERE expression and RANGE statement for the subsequent executions of PROC DATASOURCE to obtain different subsets of the same data file.
title3 'Annual';
proc datasource filetype=beanipa infile=ascifile
                interval=year
                outselect=off
                outkey=byfor4
                out=foreign4;
   range from 1984 to 1989;
   keep __00100 __00800;
   label __00100='Balance of Payment Accounts';
   label __00800='Exports of Goods and Services';
   rename __00100=BPAs __00800=exports;
run;

proc contents data=foreign4;
run;

proc print data=foreign4;
run;
The results are shown in Output 11.1.1, Output 11.1.2, and Output 11.1.3.
This example illustrates the following features:
- You need to know the series variable names used by a particular vendor in order to construct the KEEP statement.
- You need to know the BY-variable names and their values for the required cross sections.
- You can use RENAME and LABEL statements to associate more meaningful names and labels with your selected series variables.
options yearcutoff = 1900;
filename datafile 'blscpi1.data' recfm=v lrecl=152;

proc datasource filetype=blscpi interval=mon outselect=off
                outby=cpikey(where=( upcase(areaname) in
                   ('NORTHEAST','NORTH CENTRAL','SOUTH','WEST') ))
                outcont=cpicont(where=
                   ( index( upcase(label), 'MEDICAL CARE' ) ));
   where survey='CU';
run;

title1 'OUTBY= Data Set, By AREANAME Selection';
proc print data=cpikey;
run;

title1 'OUTCONT= Data Set, By LABEL Selection';
proc print data=cpicont;
run;
The OUTBY= data set in Output 11.2.1 lists all cross sections available for the four geographical regions: Northeast (AREA=0100), North Central (AREA=0200), Southern (AREA=0300), and Western (AREA=0400). The OUTCONT= data set in Output 11.2.2 gives the variable names for medical care related series.
Output 11.2.1 Partial Listings of the OUTBY= Data Set

Obs  SURVEY  SEASON  AREA  BASPTYPE  BASEPER      BYSELECT  SURTITLE
  1  CU      U       0200  S         1982-84=100         1  ALL URBAN CONSUM
  2  CU      U       0100  S         1982-84=100         1  ALL URBAN CONSUM
  3  CW      U       0400  S         1982-84=100         0  URBAN WAGE EARN
  4  CW      U       0100  S         1982-84=100         0  URBAN WAGE EARN
  5  CW      U       0200  S         1982-84=100         0  URBAN WAGE EARN

Obs  ST_DATE  END_DATE  NTIME  NOBS  NSERIES  NSELECT
  1  DEC1977  JUL1990     152   152        2        2
  2        .         .      .     0        0        0
  3  DEC1977  JUL1990     152     0        1        0
  4        .         .      .     0        0        0
  5        .         .      .     0        0        0

OUTCONT= Data Set, By LABEL Selection

Obs  TYPE  LENGTH  VARNUM  LABEL                        FORMAT  FORMATL  FORMATD
  1     1       5       .  SERVICES LESS MEDICAL CARE                 0        0
  2     1       5       .  MEDICAL CARE SERVICES                      0        0
  3     1       5       .  ALL ITEMS LESS MEDICAL CARE                0        0
The following statements make use of this information to extract the data for A512 and descriptive information on cross sections containing A512. Output 11.2.3 and Output 11.2.4 show these results.
options yearcutoff = 1900;
filename datafile 'blscpi1.data' recfm=v lrecl=152;

proc format;
   /* region names as listed for the AREA codes above */
   value $areafmt '0100' = 'NORTHEAST'
                  '0200' = 'NORTH CENTRAL'
                  '0300' = 'SOUTH'
                  '0400' = 'WEST';
run;

proc datasource filetype=blscpi interval=month
                out=medical outall=medinfo;
   where survey='CU' and area in ( '0100','0200','0300','0400' );
   keep date a512;
   range from 1988:9;
   format area $areafmt.;
   rename a512=medcare;
run;

title1 'Information on Medical Care Service, OUTALL= Data Set';
proc print data=medinfo;
run;

title1 'Medical Care Service By Region, OUT= Data Set';
title2 'Range from September, 1988';
proc print data=medical;
run;
Obs  SURVEY  SEASON  BASEPER      NAME     KEPT  TYPE  LENGTH  VARNUM  BLKNUM
  1  CU      U       1982-84=100  medcare     1     1       5       7      50

Obs  FORMAT  FORMATL  FORMATD  AREANAME       ST_DATE  END_DATE  NTIME  NOBS  NINRANGE
  1                0        0  NORTH CENTRAL  DEC1977  JUL1990     152   152        23

Obs  S_CODE         UNITS  NDEC
  1  CUUR0200SA512            1
The OUTALL= data set in Output 11.2.3 indicates that data values are stored with one decimal place (see the NDEC variable). Therefore, they need to be rescaled, as follows:
data medical;
   set medical;
   medcare = medcare * 0.1;
run;
This example illustrates the following features:
- Descriptive information needed to write KEEP and WHERE statements can be obtained with an initial run of the DATASOURCE procedure.
- The OUTCONT= and OUTALL= data sets contain information on how data values are stored, such as the precision, the units, and so on.
- The OUTCONT= and OUTALL= data sets report the new series names assigned by the RENAME statement, not the old names (see the NAME variable in Output 11.2.3).
- You can use PROC FORMAT to define formats for series or BY variables to enhance your output. Note that PROC DATASOURCE associates a permanent format, $AREAFMT., with the BY variable AREA. As a result, the formatted values are displayed in the printout of the OUTALL=MEDINFO data set (see Output 11.2.3).
Example 11.3: BLS State and Area Employment, Hours, and Earnings Surveys
This example illustrates how to extract specific series from a State and Area Employment, Hours, and Earnings Survey. The series to be extracted is total employment in real estate and construction industries with respect to states from March 1989 to March 1990. The State and Area Employment, Hours, and Earnings Survey designates the totals for statewide figures by AREA=0000. The data type code for total employment is reported to be 1. Therefore, the series name for this variable is SA1, since series names are constructed by adding an SA prefix to the data type codes given by BLS. Output 11.3.1 and Output 11.3.2 show statewide figures for total employment (SA1) in many industries from March 1989 through March 1990.
filename ascifile "blseesa.dat" RECFM=F LRECL=152;

proc datasource filetype=blseesa infile=ascifile
                outall=totkey out=totemp;
   keep sa1;
   range from 1989:3 to 1990:3;
   rename sa1=totemp;
run;

title1 'Information on Total Employment, OUTALL= Data Set';
proc print data=totkey;
run;

title1 'Total Employment, OUT= Data Set';
proc print data=totemp;
run;
Output 11.3.1 Printout of the OUTALL= Data Set for All BY Groups

Obs  STATE  AREA  DIVISION  INDUSTRY  DETAIL  STATEABB  SELECTED  KEPT
  1                      7  0000           1  AR               1     1
  2                      4  2039           6  CA               1     1
  3                      4  2300           2  CA               1     1
  4                      2  0000           1  CA               1     1
  5                      7  6102           6  DE               1     1
  6                      6  5600           2  DC               1     1

Obs  TYPE  LENGTH  VARNUM  BLKNUM  FORMAT  FORMATL  FORMATD  END_DATE  NINRANGE
  1     1       5                                0        0  JUN1990         13
  2     1       5                                0        0  JUN1990         13
  3     1       5                                0        0  JUN1990         13
  4     1       5                                0        0  DEC1987          0
  5     1       5                                0        0  DEC1987          0
  6     1       5                                0        0  JUN1990         13

Obs  SEASON  UNITS  NDEC
  1  U                 1  FINANCE, INSURANCE, AND REAL ESTATE
  2  U                 1  CANNED, CURED, AND FROZEN FOODS
  3  U                 1  APPAREL AND OTHER TEXTILE PRODUCTS
  4  U                 1  CONSTRUCTION
  5  U                 1  NONDEPOS. INSTNS. & SEC. & COM. BRKRS.
  6  U                 1  APPAREL AND ACCESSORY STORES
filename datafile "blseesa.dat" RECFM=F LRECL=152;

proc datasource filetype=blseesa
                outall=totkey out=totemp;
   where industry='0000';
   keep sa1;
   range from 1989:3 to 1990:3;
   rename sa1=totemp;
run;

title1 'Total Employment for Real Estate and Construction, OUT= Data Set';
proc print data=totemp;
run;
Note the following for this example:
- When the INFILE= option is omitted, the fileref assigned to the BLSEESA file is the default value DATAFILE.
- The FROM and TO values in the RANGE statement correspond to monthly data points since the INTERVAL= option defaults to MONTH for the BLSEESA file type.
filename datafile "citidem.dat" RECFM=D LRECL=80;

proc datasource filetype=citibase interval=week
                outall=citiall outby=citikey;
run;

title1 'Summary Information on Weekly Data for CITIDEMO File';
proc print data=citikey;
run;

title1 'Weekly Series Available in CITIDEMO File';
proc print data=citiall( drop=label );
run;
Obs  LABEL
  1  STOCK MKT INDEX:NY DOW JONES COMPOSITE, (WSJ)
  2  STOCK MKT INDEX:NYSE COMPOSITE, (WSJ)
  3  STOCK MKT INDEX:WILSHIRE 500, (WSJ)
  4  FOREIGN EXCH RATE WSJ:CANADA,CANADIAN $/U.S. $,NSA
  5  FOREIGN EXCH RATE WSJ:U.K.,CENTS/POUND(90 DAY FORWARD),NSA
  6  STOCK MKT INDEX:U.K. - ALL SHARES
  7  STOCK MKT INDEX:JAPAN - NIKKEI-DOW
  8  INT.RATE:5-DAY COMM.PAPER, SHORT TERM YIELD
  9  INT.RATE:1MO CERTIFICATES OF DEPOSIT, SHORT TERM YIELD (FBR H.15)
 10  INT.RATE:3MO T-BILL, DISCOUNT YIELD (FRB H.15)

Obs  ST_DATE    END_DATE   NTIME  NOBS  CODE         ATTRIBUT  NDEC
  1  02DEC2019  09FEB2023    834   834  DSIUSNYDJCM         1     2
  2  02DEC2019  09FEB2023    834   834  DSIUSNYSECM         1     2
  3  02DEC2019  09FEB2023    834   834  DSIUSWIL            1     2
  4  29NOV2019  09FEB2023    835   835  DFXWCAN             1     4
  5  29NOV2019  09FEB2023    835   835  DFXWUK90            1     2
  6  29NOV2019  09FEB2023    835   835  DSIUKAS             1     2
  7  29NOV2019  09FEB2023    835   835  DSIJPND             1     2
  8  02DEC2019  22JAN2021    300   300  DCP05               2     2
  9  02DEC2019  03FEB2023    830   830  DCD1M               1     2
 10  02DEC2019  03FEB2023    830   830  DTBD3M              1     2
Note the following from Output 11.4.2:
- The OUTALL= data set reports the time ranges of variables.
- There are six observations in the OUTALL= data set, the same number as reported by NSERIES and NSELECT variables in the OUTBY= data set.
- The VARNUM variable contains all MISSING values, since no OUT= data set is created.

Output 11.4.3 and Output 11.4.4 demonstrate how the OUTSELECT= option affects the contents of the OUTBY= and OUTALL= data sets when a KEEP statement is present. First, set the OUTSELECT= option to OFF.
filename citidemo "citidem.dat" RECFM=D LRECL=80;

proc datasource filetype=citibase infile=citidemo interval=week
                outall=alloff outby=keyoff outselect=off;
   keep WSP:;
run;

title1 'Summary Information on Weekly Data for CITIDEMO File';
proc print data=keyoff;
run;

title1 'Weekly Series Available in CITIDEMO File';
proc print data=alloff( keep=name kept selected st_date
                        end_date ntime nobs );
run;
Obs  LABEL
  1  STOCK MKT INDEX:NY DOW JONES COMPOSITE, (WSJ)
  2  STOCK MKT INDEX:NYSE COMPOSITE, (WSJ)
  3  STOCK MKT INDEX:WILSHIRE 500, (WSJ)
  4  FOREIGN EXCH RATE WSJ:CANADA,CANADIAN $/U.S. $,NSA
  5  FOREIGN EXCH RATE WSJ:U.K.,CENTS/POUND(90 DAY FORWARD),NSA
  6  STOCK MKT INDEX:U.K. - ALL SHARES
  7  STOCK MKT INDEX:JAPAN - NIKKEI-DOW
  8  INT.RATE:5-DAY COMM.PAPER, SHORT TERM YIELD
  9  INT.RATE:1MO CERTIFICATES OF DEPOSIT, SHORT TERM YIELD (FBR H.15)
 10  INT.RATE:3MO T-BILL, DISCOUNT YIELD (FRB H.15)

Obs  ST_DATE    END_DATE   NTIME  NOBS  CODE         ATTRIBUT  NDEC
  1  02DEC2019  09FEB2023    834   834  DSIUSNYDJCM         1     2
  2  02DEC2019  09FEB2023    834   834  DSIUSNYSECM         1     2
  3  02DEC2019  09FEB2023    834   834  DSIUSWIL            1     2
  4  29NOV2019  09FEB2023    835   835  DFXWCAN             1     4
  5  29NOV2019  09FEB2023    835   835  DFXWUK90            1     2
  6  29NOV2019  09FEB2023    835   835  DSIUKAS             1     2
  7  29NOV2019  09FEB2023    835   835  DSIJPND             1     2
  8  02DEC2019  22JAN2021    300   300  DCP05               2     2
  9  02DEC2019  03FEB2023    830   830  DCD1M               1     2
 10  02DEC2019  03FEB2023    830   830  DTBD3M              1     2
Setting the OUTSELECT= option to ON gives the results shown in Output 11.4.5 and Output 11.4.6.
filename citidemo "citidem.dat" RECFM=D LRECL=80;

proc datasource filetype=citibase infile=citidemo interval=week
                outall=allon outby=keyon outselect=on;
   keep WSP:;
run;

title1 'Summary Information on Weekly Data for CITIDEMO File';
proc print data=keyon;
run;

title1 'Weekly Series Available in CITIDEMO File';
proc print data=allon( keep=name kept selected st_date end_date ntime nobs );
run;
Obs  LABEL
  1  STOCK MKT INDEX:NYSE COMPOSITE, (WSJ)
  2  INT.RATE:5-DAY COMM.PAPER, SHORT TERM YIELD
  3  INT.RATE:1MO CERTIFICATES OF DEPOSIT, SHORT TERM YIELD (FBR H.15)

Obs  ST_DATE    END_DATE   NTIME  NOBS  CODE         ATTRIBUT  NDEC
  1                          834   834  DSIUSNYSECM         1     2
  2                          300   300  DCP05               2     2
  3                          830   830  DCD1M               1     2
Comparison of Output 11.4.4 and Output 11.4.6 reveals the following:

   - The OUTALL= data set contains 10 (NSERIES) observations when OUTSELECT=OFF, and three (NSELECT) observations when OUTSELECT=ON.
   - The observations in OUTALL=ALLON are those for which SELECTED=1 in OUTALL=ALLOFF.
   - The time ranges in the OUTBY= data set are computed over all the variables (selected or not) for OUTSELECT=OFF, but only over the selected variables for OUTSELECT=ON. This corresponds to computing time ranges over all the series reported in the OUTALL= data set.
   - The variable NTIME is the number of time periods between ST_DATE and END_DATE, while NOBS is the number of observations the OUT= data set is to contain. Thus, NTIME is different depending on whether the OUTSELECT= option is set to ON or OFF, while NOBS stays the same.

The KEEP statement in the last two examples illustrates the use of an additional variable, KEPT, in the OUTALL= data sets of Output 11.4.4 and Output 11.4.6. KEPT, which reports the outcome of the KEEP statement, is added to the OUTALL= data set only when there is a KEEP statement.

Adding the RANGE statement to the last example generates the data sets in Output 11.4.7 and Output 11.4.8:
filename citidemo "citidem.dat" RECFM=D LRECL=80;

proc datasource filetype=citibase infile=citidemo interval=week
                outby=keyrange out=citiout outselect=on;
   keep WSP:;
   range from '01dec1990'd;
run;

title1 'Summary Information on Weekly Data for CITIDEMO File';
proc print data=keyrange;
run;

title1 'Weekly Data in CITIDEMO File';
proc print data=citiout;
run;
The OUTBY= data set in this last example contains an additional variable NINRANGE. This variable is added since there is a RANGE statement. Its value, 15, is the number of observations in the OUT= data set. In this case, NOBS gives the number of observations the OUT= data set would contain if there were not a RANGE statement.
/*--------------------------------------------------------*
 *   process daily series                                 *
 *--------------------------------------------------------*/
title3 'Reading DAILY Federal Reserve Series with fxrates_.dds';
proc datasource filetype=dridds
                infile=datafile
                interval=day
                out=fixr
                outcont=fixrcnt
                outall=fixrall;
   keep rx: ;
   range from '01jan97'd to '31dec99'd;
   format disttek distekfm.;
   format convtek convtkfm.;
run;

title1 'CONTENTS of FXRATES_.DDS File, KEEP RX:';
proc print data=fixrcnt;
run;

title1 'Daily Series Available in FXRATES_.DDS File, KEEP RX:';
proc print data=fixr;
run;
Obs  FORMATD
  1        0
  2        0
  3        0

Obs  DATE
  1  01JAN1997
  2  02JAN1997
  3  03JAN1997
  4  04JAN1997
  5  05JAN1997
  6  06JAN1997
  7  07JAN1997
  8  08JAN1997
  9  09JAN1997
 10  10JAN1997
 11  11JAN1997
 12  12JAN1997
 13  13JAN1997
 14  14JAN1997
 15  15JAN1997
 16  16JAN1997
 17  17JAN1997
 18  18JAN1997
 19  19JAN1997
 20  20JAN1997
 21  21JAN1997
 22  22JAN1997
 23  23JAN1997
 24  24JAN1997
 25  25JAN1997
 26  26JAN1997
 27  27JAN1997
 28  28JAN1997
 29  29JAN1997
 30  30JAN1997
 31  31JAN1997
filename indfile "baseind.dat" RECFM=F LRECL=84;
filename dbfile  "basedb.dat"  RECFM=F LRECL=4;

proc datasource filetype=citidisk
                infile=( keyfile indfile dbfile )
                out=popest
                outall=popinfo;
run;

proc print data=popinfo;
run;

proc print data=popest;
run;
Obs  FORMATL  FORMATD  ST_DATE  END_DATE  NTIME  NOBS  DISKNUM  ATTRIBUT  NDEC  AGGREGAT
  1        0        0     1980      1989     10    10        1         1     0         0
  2        0        0     1980      1989     10    10        1         1     0         0
  3        0        0     1980      1989     10    10        1         1     0         0
  4        0        0     1980      1989     10    10        1         1     0         0
  5        0        0     1980      1989     10    10        1         1     0         0
This example demonstrates the following:

   - The INFILE= option lists the filerefs of the key, index, and database files, in that order.
   - The INTERVAL= option is omitted since the default interval for CITIDISK type files is YEAR.
/*- add company name to the out= data set -*/
data stocks;
   merge stocks company( keep=dnum cnum cic coname );
   by dnum cnum cic;
run;

title1 'Common Stocks for Last Quarter of 1990';
proc print data=stocks;
run;
Note that quarterly Compustat data are also available in Universal Character format. If you have this type of file instead of IBM 360/370 General format, use the FILETYPE=CS48QUC option instead.
Example 11.8: Annual COMPUSTAT Data Files, V9.2 New Filetype CSAUC3
This example reads annual COMPUSTAT data in Universal Character format, restricted to the years from 2002 onward, so that the output shows PRICE (HIGH), PRICE (LOW), and PRICE (CLOSE) for each company.
filename datafile "csaucy3.dat" RECFM=F LRECL=13612;

/*--------------------------------------------------------------*
 * create OUT=csauy3 data set with ASCII 2003 Industrial Data   *
 * compare it with the OUT=csauc data set created by DATA STEP  *
 *--------------------------------------------------------------*/
proc datasource filetype=csaucy3 ascii
                infile=datafile
                interval=year
                outselect=on
                outkey=y3key
                out=csauy3;
   keep data197-data199 label;
   range from 2002;
run;

proc sort data=csauy3 out=csauy3;
   by dnum cnum cic file zlist smbl xrel stk;
run;

title1 'Price, High, Low and Close for Range from 2002';
proc contents data=csauy3;
run;

proc print data=csauy3;
run;
Output 11.8.1 shows information on the contents of the CSAUY3 data set while Output 11.8.2 shows a listing of the CSAUY3 data set.
[Output 11.8.1 and Output 11.8.2 residue: format YEAR4.; variable labels Fiscal Year - High ($&c,NA), Fiscal Year - Low ($&c,NA), Close - Fiscal Year-End ($&c,NA); listing columns Obs, CPSPIN, CSSPIN, CSSPII for 20 observations.]
Note that annual COMPUSTAT data are available in either IBM 360/370 General format or Universal Character format. The first example expects an IBM 360/370 General format file since the FILETYPE= is set to CSAIBM, while the second example uses a Universal Character format file (FILETYPE=CSAUC).
The data set CALDATA is created by the following statements to contain the calendar/indices file:

proc datasource filetype=crspdci infile=calfile out=caldata;
run;
Here the FILETYPE=CRSPDCI indicates that you are reading a character format (indicated by a C in the 6th position) daily (indicated by a D in the 5th position) calendar/indices file (indicated by an I in the 7th position). The annual data in security files can be obtained by the following statements:

proc datasource filetype=crspdca
                infile=( secfile1 secfile2 secfile3 )
                out=annual;
run;
Similarly, the data sets to contain the daily security data (the OUT= data set) and the event data (the OUTEVENT= data set) are obtained by the following statements:
proc datasource filetype=crspdcs
                infile=( calfile secfile1 secfile2 secfile3 )
                out=periodic index
                outevent=events;
run;
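The position-based FILETYPE= names used in these examples can be decoded mechanically. The DATA step below is a minimal sketch, not part of the DATASOURCE procedure: the variable names FREQ, FMT, and KIND are made up for illustration, the letter meanings are taken from the descriptions in the table of supported file types, and the step assumes a CRSP-prefixed name such as CRSPDCI.

```sas
/* Hypothetical helper step: decode a CRSP FILETYPE= value by position */
data _null_;
   length filetype $7 freq $7 fmt $13 kind $16;
   filetype = 'CRSPDCI';

   /* 5th position: frequency */
   select (substr(filetype, 5, 1));
      when ('D') freq = 'Daily';
      when ('M') freq = 'Monthly';
      otherwise  freq = '?';
   end;

   /* 6th position: storage format */
   select (substr(filetype, 6, 1));
      when ('B') fmt = 'Binary';
      when ('C') fmt = 'Character';
      when ('I') fmt = 'IBM Binary';
      when ('V') fmt = 'VAX Binary';
      when ('U') fmt = 'UNIX Binary';
      when ('O') fmt = 'Old Character';
      otherwise  fmt = '?';
   end;

   /* 7th position: kind of file */
   select (substr(filetype, 7, 1));
      when ('S') kind = 'Security';
      when ('I') kind = 'Calendar/Indices';
      when ('A') kind = 'Annual Data';
      otherwise  kind = '?';
   end;

   /* write the decoded parts to the SAS log */
   put filetype= freq= fmt= kind=;
run;
```

Changing the assigned value to CRSPDCS or CRSPDCA decodes the security and annual file types used in the other statements of this example.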
Note that the FILETYPE= has an S in the 7th position, since you are reading the security files. Also, the INFILE= option first expects the fileref of the calendar/indices file since the dating variable (CALDT) is contained in that file. Following the fileref of the calendar/indices file, you give the list of security files in the order in which you want to read them. When data span more than one physical volume, the filerefs of the security files residing on each volume must be given following the fileref of the calendar/indices file. The DATASOURCE procedure reads each of these files in the order in which they are specified. Therefore, you can request that all three volumes be mounted to the same drive, if you choose to do so.

This sample code illustrates the following points:

   - The INDEX option in the second PROC DATASOURCE run creates an index file for the OUT=PERIODIC data set. This index file provides random access to the OUT= data set and may increase the efficiency of the subsequent PROC and DATA steps that use BY and WHERE statements. The index variables are CUSIP, CRSP permanent number (PERMNO), NASDAQ company number (COMPNO), NASDAQ issue number (ISSUNO), header exchange code (HEXCD), and header SIC code (HSICCD). Each one of these variables forms a different key, which is a single index. If you want to form keys from a combination of variables (composite indexes) or use some other variables as indexes, you should use the INDEX= data set option for the OUT= data set.
   - The OUTEVENT=EVENTS data set is sparse. In fact, for each EVENT type, a unique set of event variables is defined. For example, for EVENT=SHARES, only the variables SHROUT and SHRFLG are defined, and they have missing values for all other EVENT types. Pictorially, this structure is similar to the data set shown in Figure 11.4. Because of this sparse representation, you should create the OUTEVENT= data set only when you need a subset of securities and events.
   - By default, the OUT= data set contains only the periodic data. However, you may also want to include the event-oriented data in the OUT= data set. This is accomplished by listing the event variables together with periodic variables in a KEEP statement. For example, if you want to extract the historical CUSIP (NCUSIP), number of shares outstanding (SHROUT), and dividend cash amount (DIVAMT) together with all the periodic series, use the following statements.
proc datasource filetype=crspdcs
                infile=( calfile secfile1 secfile2 secfile3 )
                out=both
                outevent=events;
   where cusip='09523220';
   keep bidlo askhi prc vol ret sxret bxret ncusip shrout divamt;
run;
The KEEP statement has no effect on the event variables output to the OUTEVENT= data set. If you want to extract only a subset of event variables, you need to use the KEEPEVENT statement. For example, the following sample code outputs only NCUSIP and SHROUT to the OUTEVENT= data set for CUSIP=09523220:
proc datasource filetype=crspdxc
                infile=( calfile secfile )
                outevent=subevts;
   where cusip='09523220';
   keepevent ncusip shrout;
run;
Output 11.9.1, Output 11.9.2, Output 11.9.3, and Output 11.9.4 show how to read the CRSP Daily NYSE/AMEX Combined ASCII Character Files.
filename dxci "dxccal95.dat" RECFM=F LRECL=130;
filename dxc  "dxcsub95.dat" RECFM=F LRECL=400;

/*--- create output data sets from character format DX files ---*/
/*- create securities output data sets using DATASOURCE        -*/
/*- statements                                                 -*/
proc datasource filetype=crspdcs ascii
                infile=( dxci dxc )
                interval=day
                outcont=dxccont
                outkey=dxckey
                outall=dxcall
                out=dxc
                outevent=dxcevent
                outselect=off;
   range from '15aug95'd to '28aug95'd;
   where cusip in ('12709510','35614220');
run;

title3 'DX Security File Outputs';
title4 'OUTKEY= Data Set';
proc print data=dxckey;
run;

title4 'OUTCONT= Data Set';
proc print data=dxccont;
run;

title4 "Listing of OUT= Data Set for cusip in (12709510,35614220)";
proc print data=dxc;
run;

title4 "Listing of OUTEVENT= Data Set for cusip in (12709510,35614220)";
proc print data=dxcevent;
run;
Obs  PERMNO  COMPNO  ISSUNO  HEXCD  HSICCD  NTIME  NOBS  NSERIES  NSELECT
  1   10000    7952    9787      3    3990                    35        7
  2   10010    7967    9809      3    3840                    35        7
  3   10020    7972    9824      3    6710                    35        7
  4   10030   22160       0      1    3310                    35        7
  5   10040    7988    9846      3    6210                    35        7
  6   10050      13      11      3    3448                    35        7
  7   10060    8007    9876      3    1040                    35        7
Obs  NAME    KEPT  TYPE  LENGTH  VARNUM  LABEL                                    FORMAT  FORMATL  FORMATD
  1  BIDLO      1     1       6       8  Bid or Low                                             0        0
  2  ASKHI      1     1       6       9  Ask or High                                            0        0
  3  PRC        1     1       6      10  Closing Price of Bid/Ask average                       0        0
  4  VOL        1     1       6      11  Share Volume                                           0        0
  5  RET        1     1       6      12  Holding Period Return                                  0        0
  6  SXRET      1     1       6      13  Standard Deviation Excess Return                       0        0
  7  BXRET      1     1       6      14  Beta Excess Return                                     0        0
  8  NCUSIP     0     2       8       .  Name CUSIP                                             0        0
  9  TICKER     0     2       5       .  Exchange Ticker Symbol                                 0        0
 10  COMNAM     0     2      32       .  Company Name                                           0        0
 11  SHRCLS     0     2       1       .  Share Class                                            0        0
 12  SHRCD      0     1       6       .  Share Code                                             0        0
 13  EXCHCD     0     1       6       .  Exchange Code                                          0        0
 14  SICCD      0     1       6       .  Standard Industrial Classification Code                0        0
 15  DISTCD     0     1       6       .  Distribution Code                                      0        0
 16  DIVAMT     0     1       6       .  Dividend Cash Amount                                   0        0
 17  FACPR      0     1       6       .  Factor to adjust price                                 0        0
 18  FACSHR     0     1       6       .  Factor to adjust shares outstanding                    0        0
 19  DCLRDT     0     1       6       .  Declaration date                                       7        0
 20  RCRDDT     0     1       6       .  Record date                                            7        0
 21  PAYDT      0     1       6       .  Payment date                                           7        0
 22  SHROUT     0     1       6       .  Number of shares outstanding                           0        0
 23  SHRFLG     0     1       6       .  Share flag                                             0        0
 24  DLSTCD     0     1       6       .  Delisting code                                         0        0
 25  NWPERM     0     1       6       .  New CRSP permanent number                              0        0
 26  NEXTDT     0     1       6       .  Date of next available information                     7        0
 27  DLBID      0     1       6       .  Delisting bid                                          0        0
 28  DLASK      0     1       6       .  Delisting ask                                          0        0
 29  DLPRC      0     1       6       .  Delisting price                                        0        0
 30  DLVOL      0     1       6       .  Delisting volume                                       0        0
 31  DLRET      0     1       6       .  Delisting return                                       0        0
 32  TRTSCD     0     1       6       .  Traits code                                            0        0
 33  NMSIND     0     1       6       .  National Market System Indicator                       0        0
 34  MMCNT      0     1       6       .  Market maker count                                     0        0
 35  NSDINX     0     1       6       .  NASD index                                             0        0
Output 11.9.3 Listing of the OUT= Data Set with OUTSELECT=ON for CUSIPs 12709510 and 35614220
Price, High, Low and Close for Range from 2002
DX Security File Outputs
Listing of OUTEVENT= Data Set for cusip in (12709510,35614220)

Obs  CUSIP     PERMNO  COMPNO  ISSUNO  HEXCD  HSICCD  DATE
  1  12709510   10010    7967    9809      3    3840  15AUG1995
  2  12709510   10010    7967    9809      3    3840  16AUG1995
  3  12709510   10010    7967    9809      3    3840  17AUG1995
  4  12709510   10010    7967    9809      3    3840  18AUG1995
  5  12709510   10010    7967    9809      3    3840  21AUG1995
  6  12709510   10010    7967    9809      3    3840  22AUG1995
  7  12709510   10010    7967    9809      3    3840  23AUG1995
  8  12709510   10010    7967    9809      3    3840  24AUG1995
  9  12709510   10010    7967    9809      3    3840  25AUG1995
 10  12709510   10010    7967    9809      3    3840  28AUG1995
 11  35614220   10060    8007    9876      3    1040  15AUG1995
 12  35614220   10060    8007    9876      3    1040  16AUG1995
 13  35614220   10060    8007    9876      3    1040  17AUG1995
 14  35614220   10060    8007    9876      3    1040  18AUG1995
 15  35614220   10060    8007    9876      3    1040  21AUG1995
 16  35614220   10060    8007    9876      3    1040  22AUG1995
 17  35614220   10060    8007    9876      3    1040  23AUG1995
 18  35614220   10060    8007    9876      3    1040  24AUG1995
 19  35614220   10060    8007    9876      3    1040  25AUG1995
 20  35614220   10060    8007    9876      3    1040  28AUG1995

Obs   BIDLO    ASKHI      PRC     VOL        RET  SXRET  BXRET
  1   7.500   7.8750   7.5625   29200  -0.008197      .      .
  2   7.500   7.8750   7.5000   22365  -0.008264      .      .
  3   7.500   7.8750   7.5000   33416   0.000000      .      .
  4   7.375   7.5000   7.3750   16666  -0.016667      .      .
  5   7.375   7.3750   7.3750    9382   0.000000      .      .
  6   7.250   7.3750   7.2500   33674  -0.016949      .      .
  7   7.250   7.3750   7.3125   22371   0.008621      .      .
  8   7.125   7.5000   7.1250   38621  -0.025641      .      .
  9   6.875   7.3750   7.0000   29713  -0.017544      .      .
 10   7.000   7.1250   7.0000   38798   0.000000      .      .
 11  12.375  12.6875  12.3750   39136   0.000000      .      .
 12  12.125  12.3750  12.2031   45916  -0.013889      .      .
 13  12.250  12.3125  12.2500   43644   0.003841      .      .
 14  12.250  12.6250  12.3750   11027   0.010204      .      .
 15  12.375  12.6250  12.3750    7378   0.000000      .      .
 16  12.250  12.3750  12.2500   99655  -0.010101      .      .
 17  12.125  12.2500  12.1250   95148  -0.010204      .      .
 18  12.125  12.3750  12.3750  185572   0.020619      .      .
 19  12.000  12.2500  12.0000    9575  -0.030303      .      .
 20  12.000  12.0625  12.0625   12854   0.005208      .      .
Obs  CUSIP     PERMNO  COMPNO  ISSUNO  HEXCD  HSICCD  EVENT   DATE
  1  12709510   10010    7967    9809      3    3840  DELIST  28AUG1995
  2  12709510   10010    7967    9809      3    3840  NASDIN  24AUG1995

[Remaining columns of the listing: NEXTDT, SHRCD, SICCD, FACPR, DCLRDT, RCRDDT, SHROUT, SHRFLG, DLSTCD, NWPERM, TRTSCD, NMSIND, NSDINX, PAYDT, DLBID, DLASK, DLPRC, DLVOL, DLRET, MMCNT. Most values are missing (.); the nonmissing values that survive, in the order they appear, are 203, 23588, 0, 0.037500, 1, 2, 17, 2.]
Note in Output 11.9.4 that there were no events in range for cusip 35614220. See Chapter 33, The SASECRSP Interface Engine, for more on CRSPAccess Data access.
FILETYPE=  Description
BEANIPA    National Income and Product Accounts
BEANIPAD   National Income and Product Accounts PC Format
BLSCPI     Consumer Price Index Surveys
BLSWPI     Producer Price Index Survey
BLSEENA    National Employment, Hours, and Earnings Survey
BLSEESA    State and Area Employment, Hours, and Earnings Survey
Table 11.5
continued
FILETYPE=  Description
DRIBASIC   Basic Economic (formerly CITIBASE) Data Files
CITIBASE   CITIBASE Data Files
DRIDDS     DRI Data Delivery Service Time Series
CITIDISK   PC Format CITIBASE Databases
CRY2DBS    Y2K Daily Binary Security File Format
CRY2DBI    Y2K Daily Binary Calendar&Indices File Format
CRY2DBA    Y2K Daily Binary File Annual Data Format
CRY2MBS    Y2K Monthly Binary Security File Format
CRY2MBI    Y2K Monthly Binary Calendar&Indices File Format
CRY2MBA    Y2K Monthly Binary File Annual Data Format
CRY2DCS    Y2K Daily Character Security File Format
CRY2DCI    Y2K Daily Character Calendar&Indices File Format
CRY2DCA    Y2K Daily Character File Annual Data Format
CRY2MCS    Y2K Monthly Character Security File Format
CRY2MCI    Y2K Monthly Character Calendar&Indices File Format
CRY2MCA    Y2K Monthly Character File Annual Data Format
CRY2DIS    Y2K Daily IBM Binary Security File Format
CRY2DII    Y2K Daily IBM Binary Calendar&Indices File Format
CRY2DIA    Y2K Daily IBM Binary File Annual Data Format
CRY2MIS    Y2K Monthly IBM Binary Security File Format
CRY2MII    Y2K Monthly IBM Binary Calendar&Indices File Format
CRY2MIA    Y2K Monthly IBM Binary File Annual Data Format
CRY2MVS    Y2K Monthly VAX Binary Security File Format
CRY2MVI    Y2K Monthly VAX Binary Calendar&Indices File Format
CRY2MVA    Y2K Monthly VAX Binary File Annual Data Format
CRY2DVS    Y2K Daily VAX Binary Security File Format
CRY2DVI    Y2K Daily VAX Binary Calendar&Indices File Format
CRY2DVA    Y2K Daily VAX Binary File Annual Data Format
CRSPDBS    CRSP Daily Binary Security File Format
CRSPDBI    CRSP Daily Binary Calendar&Indices File Format
CRSPDBA    CRSP Daily Binary File Annual Data Format
CRSPMBS    CRSP Monthly Binary Security File Format
CRSPMBI    CRSP Monthly Binary Calendar&Indices File Format
CRSPMBA    CRSP Monthly Binary File Annual Data Format
CRSPDCS    CRSP Daily Character Security File Format
CRSPDCI    CRSP Daily Character Calendar&Indices File Format
CRSPDCA    CRSP Daily Character File Annual Data Format
CRSPMCS    CRSP Monthly Character Security File Format
CRSPMCI    CRSP Monthly Character Calendar&Indices File Format
CRSPMCA    CRSP Monthly Character File Annual Data Format
CRSPDIS    CRSP Daily IBM Binary Security File Format
CRSPDII    CRSP Daily IBM Binary Calendar&Indices File Format
CRSPDIA    CRSP Daily IBM Binary File Annual Data Format
CRSPMIS    CRSP Monthly IBM Binary Security File Format
Supplier: CRSP

FILETYPE=  Description
CRSPMII    CRSP Monthly IBM Binary Calendar&Indices File Format
CRSPMIA    CRSP Monthly IBM Binary File Annual Data Format
CRSPMVS    CRSP Monthly VAX Binary Security File Format
CRSPMVI    CRSP Monthly VAX Binary Calendar&Indices File Format
CRSPMVA    CRSP Monthly VAX Binary File Annual Data Format
CRSPDVS    CRSP Daily VAX Binary Security File Format
CRSPDVI    CRSP Daily VAX Binary Calendar&Indices File Format
CRSPDVA    CRSP Daily VAX Binary File Annual Data Format
CRSPMUS    CRSP Monthly UNIX Binary Security File Format or utility dump of CRSPAccess Monthly Security File Format
CRSPMUI    CRSP Monthly UNIX Binary Calendar&Indices File Format or utility dump of CRSPAccess Monthly Cal&Indices Format
CRSPMUA    CRSP Monthly UNIX Binary File Annual Data Format or utility dump of CRSPAccess Monthly Annual Data Format
CRSPDUS    CRSP Daily UNIX Binary Security File Format or utility dump of CRSPAccess Daily Security Format
CRSPDUI    CRSP Daily UNIX Binary Calendar&Indices File Format or utility dump of CRSPAccess Daily Calendar&Indices Format
CRSPDUA    CRSP Daily UNIX Binary File Annual Data Format or utility dump of CRSPAccess Daily Annual Data Format
CRSPMOS    CRSP Monthly Old Character Security File Format
CRSPMOI    CRSP Monthly Old Character Calendar&Indices File Format
CRSPMOA    CRSP Monthly Old Character File Annual Data Format
CRSPDOS    CRSP Daily Old Character Security File Format
CRSPDOI    CRSP Daily Old Character Calendar&Indices File Format
CRSPDOA    CRSP Daily Old Character File Annual Data Format
CR95MIS    CRSP 1995 Monthly IBM Binary Security File Format
CR95MII    CRSP 1995 Monthly IBM Binary Calendar&Indices File Format
CR95MIA    CRSP 1995 Monthly IBM Binary File Annual Data Format
CR95DIS    CRSP 1995 Daily IBM Binary Security File Format
CR95DII    CRSP 1995 Daily IBM Binary Calendar&Indices File Format
CR95DIA    CRSP 1995 Daily IBM Binary File Annual Data Format
CR95MVS    CRSP 1995 Monthly VAX Binary Security File Format
CR95MVI    CRSP 1995 Monthly VAX Binary Calendar&Indices File Format
CR95MVA    CRSP 1995 Monthly VAX Binary File Annual Data Format
CR95DVS    CRSP 1995 Daily VAX Binary Security File Format
CR95DVI    CRSP 1995 Daily VAX Binary Calendar&Indices File Format
CR95DVA    CRSP 1995 Daily VAX Binary File Annual Data Format
CR95MUS    CRSP 1995 Monthly UNIX Binary Security File Format
CR95MUI    CRSP 1995 Monthly UNIX Binary Calendar&Indices File Format
CR95MUA    CRSP 1995 Monthly UNIX Binary File Annual Data Format
CR95DUS    CRSP 1995 Daily UNIX Binary Security File Format
CR95DUI    CRSP 1995 Daily UNIX Binary Calendar&Indices File Format
CR95DUA    CRSP 1995 Daily UNIX Binary File Annual Data Format
CR95MSS    CRSP 1995 Monthly VMS Binary Security File Format
FILETYPE=  Description
CR95MSI    CRSP 1995 Monthly VMS Binary Calendar&Indices File Format
CR95MSA    CRSP 1995 Monthly VMS Binary File Annual Data Format
CR95DSS    CRSP 1995 Daily VMS Binary Security File Format
CR95DSI    CRSP 1995 Daily VMS Binary Calendar&Indices File Format
CR95DSA    CRSP 1995 Daily VMS Binary File Annual Data Format
CR95MAS    CRSP 1995 Monthly ALPHA Binary Security File Format
CR95MAI    CRSP 1995 Monthly ALPHA Binary Calendar&Indices Format
CR95MAA    CRSP 1995 Monthly ALPHA Binary File Annual Data Format
CR95DAS    CRSP 1995 Daily ALPHA Binary Security File Format
CR95DAI    CRSP 1995 Daily ALPHA Binary Calendar&Indices File Format
CR95DAA    CRSP 1995 Daily ALPHA Binary File Annual Data Format
FAME       FAME Information Services Databases
HAVER      Haver Analytics Data Files
IMFIFSP    International Financial Statistics, Packed Format
IMFDOTSP   Direction of Trade Statistics, Packed Format
IMFBOPSP   Balance of Payment Statistics, Packed Format
IMFGFSP    Government Finance Statistics, Packed Format

Supplier: OECD
OECDANA    OECD Annual National Accounts Format
OECDQNA    OECD Quarterly National Accounts Format
OECDMEI    OECD Main Economic Indicators Format

Supplier: S&P
CSAIBM     COMPUSTAT Annual, IBM 360&370 Format
CS48QIBM   COMPUSTAT 48 Quarter, IBM 360&370 Format
CSAUC      COMPUSTAT Annual, Universal Character Format
CS48QUC    COMPUSTAT 48 Quarter, Universal Character Format
CSAIY2     Y2K COMPUSTAT Annual, IBM 360&370 Format
CSQIY2     Y2K COMPUSTAT 48 Quarter, IBM 360&370 Format
CSAUCY2    Y2K COMPUSTAT Annual, Universal Character Format
CSQUCY2    Y2K COMPUSTAT 48 Quarter, Universal Character Format
Data supplier abbreviations used in Table 11.5 are summarized in Table 11.6.
Table 11.6 Data Supplier Abbreviations
Abbreviation    Supplier
BEA             Bureau of Economic Analysis, U.S. Department of Commerce
BLS             Bureau of Labor Statistics, U.S. Department of Labor
CRSP            Center for Research in Security Prices
DRI             Global Insight (formerly DRI/McGraw-Hill)
FAME            FAME Information Services, Inc.
GLOBAL INSIGHT  Global Insight, Inc.
HAVER           Haver Analytics Inc.
IMF             International Monetary Fund
OECD            Organization for Economic Cooperation and Development
S&P             Standard & Poor's Compustat Services Inc.
                  Metadata Fields  Metadata Labels
Data Files        Database is stored in a single file.
INTERVAL          YEAR (default), QUARTER, MONTH
BY Variables      PARTNO           Part Number of Publication, Integer Portion of the Table Number, 1-9 (character)
                  TABNUM           Table Number Within Part, Decimal Portion of the Table Number, 1-24 (character)
Series Variables  Series variable names are constructed by concatenating table number suffix, line and column numbers within each table. An underscore (_) prefix is also added for readability.
Table 11.8

                  Metadata Fields  Metadata Labels
Data Files        Database is stored in a single file.
INTERVAL          YEAR (default), QUARTER, MONTH
BY Variables      PARTNO           Part Number of Publication, Integer Portion of the Table Number, 1-9 (character)
                  TABNUM           Table Number Within Part, Decimal Portion of the Table Number, 1-24 (character)
Series Variables  Series variable names are constructed by concatenating table number suffix, line and column numbers within each table. An underscore (_) prefix is also added for readability.
                  Metadata Fields  Metadata Labels
Data Files        Database is stored in a single file.
INTERVAL          YEAR, SEMIYEAR1.6, MONTH (default)
BY Variables      SURVEY           Survey type: CU=All Urban Consumers, CW=Urban Wage Earners and Clerical Workers (character)
                  SEASON           Seasonality: S=Seasonally adjusted, U=Unadjusted (character)
                  AREA             Geographic Area (character)
                  BASPTYPE         Index Base Period Type, S=Standard, A=Alternate Reference (character)
                  BASEPER          Index Base Period (character)
Series Variables  Series variable names are the same as consumer item codes listed in the Series Directory shipped with the data.
Missing Codes     A data value of 0 is interpreted as MISSING.
Table 11.10

                  Metadata Fields  Metadata Labels
Data Files        Database is stored in a single file.
INTERVAL          YEAR, MONTH (default)
BY Variables      SEASON           Seasonality: S=Seasonally adjusted, U=Unadjusted (character)
                  MAJORCOM         Major Commodity Group (character)
Sorting Order     BY SEASON MAJORCOM
Series Variables  Series variable names are the same as commodity codes but prefixed by an underscore (_).
Missing Codes     A data value of 0 is interpreted as MISSING.
Table 11.11

                  Metadata Fields  Metadata Labels
Data Files        Database is stored in a single file.
INTERVAL          YEAR, QUARTER, MONTH (default)
BY Variables      SEASON           Seasonality: S=Seasonally adjusted, U=Unadjusted (character)
                  DIVISION         Major Industrial Division (character)
                  INDUSTRY         Industry Code (character)
Sorting Order     BY SEASON DIVISION INDUSTRY
Series Variables  Series variable names are the same as data type codes prefixed by EE.
                  EE01             Total Employment
                  EE02             Employment of Women
                  EE03             Employment of Production or Nonsupervisory Workers
                  EE04             Average Weekly Earnings of Production Workers
                  EE05             Average Weekly Hours of Production Workers
                  EE06             Average Hourly Earnings of Production Workers
                  EE07             Average Weekly Overtime Hours of Production Workers
                  EE40             Index of Aggregate Weekly Hours
                  EE41             Index of Aggregate Weekly Payrolls
                  EE47             Hourly Earnings Index; 1977 Weights; Current Dollars
                  EE48             Hourly Earnings Index; 1977 Weights; Base 1977 Dollars
                  EE49             Average Hourly Earnings; Base 1977 Dollars
                  EE50             Gross Average Weekly Earnings; Current Dollars
                  EE51             Gross Average Weekly Earnings; Base 1977 Dollars
                  EE52             Spendable Average Weekly Earnings; No Dependents; Current Dollars
                  EE53             Spendable Average Weekly Earnings; No Dependents; Base 1977 Dollars
                  EE54             Spendable Average Weekly Earnings; 3 Dependents; Current Dollars
                  EE55             Spendable Average Weekly Earnings; 3 Dependents; Base 1977 Dollars
                  EE60             Average Hourly Earnings Excluding Overtime
                  EE61             Index of Diffusion; 1-month Span; Base 1977
                  EE62             Index of Diffusion; 3-month Span; Base 1977
                  EE63             Index of Diffusion; 6-month Span; Base 1977
                  EE64             Index of Diffusion; 12-month Span; Base 1977
Missing Codes     Series data values are set to MISSING when their status codes are 1.
                  Metadata Fields  Metadata Labels
Data Files        Database is stored in a single file.
INTERVAL          YEAR, MONTH (default)
BY Variables      STATE            State FIPS codes (numeric)
                  AREA             Area codes (character)
                  DIVISION         Major industrial division (character)
                  INDUSTRY         Industry code (character)
                  DETAIL           Private/Government detail
Sorting Order     BY STATE AREA DIVISION INDUSTRY DETAIL
Series Variables  Series variable names are the same as data type codes prefixed by SA.
                  SA1              All employees
                  SA2              Women workers
                  SA3              Production workers
                  SA4              Average weekly earnings
                  SA5              Average weekly hours
Missing Codes     Series data values are set to MISSING when their status codes are 1.
series. Global Insight, formerly DRI/McGraw-Hill, distributes Basic Economic data files on various media. Old DRIDDS data files can be read by DATASOURCE using the DRIDDS filetype. The following DRI file types are supported.
                  Metadata Fields  Metadata Labels
Data Files        Database is stored in a single file.
INTERVAL          YEAR (default), QUARTER, MONTH, WEEK, WEEK1.1, WEEK1.2, WEEK1.3, WEEK1.4, WEEK1.5, WEEK1.6, WEEK1.7, WEEKDAY
BY Variables      None
Series Variables  Variable names are taken from the series descriptor records in the data file. Note that series codes can be 20 bytes.
Missing Codes     MISSING=( 1.000000E9=. NA-ND=. )
Note that when you specify the INTERVAL=WEEK option, all the weekly series will be aggregated, and the DATE variable in the OUT= data set will be set to the date of Sundays. The date of the first observation for each series is the Sunday marking the beginning of the week that contains the starting date of that variable.
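The Sunday alignment described above matches what the INTNX function returns for the WEEK interval with an increment of 0, namely the first day (Sunday) of the week that contains a given date. The following DATA step is a minimal sketch of that calculation only; the starting date and the variable names START and FIRST_DATE are made up for illustration and are not taken from a DRI file.

```sas
data _null_;
   /* hypothetical starting date of a series (a Wednesday) */
   start = '04jan1995'd;

   /* WEEK intervals begin on Sunday by default; an increment of 0
      returns the beginning of the week that contains START        */
   first_date = intnx('week', start, 0);

   /* write both dates, with the day of the week, to the SAS log */
   put start= weekdate. first_date= weekdate.;
run;
```

For this starting date, FIRST_DATE resolves to Sunday, January 1, 1995.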
                  Metadata Fields  Metadata Labels
Data Files        Database is stored in a single file.
INTERVAL          YEAR (default), SEMIYEAR, QUARTER, MONTH, SEMIMONTH, TENDAY, WEEK, WEEK1.1, WEEK1.2, WEEK1.3, WEEK1.4, WEEK1.5, WEEK1.6, WEEK1.7, WEEKDAY, DAY
BY Variables      None
Series Variables  Variable names are taken from the series descriptor records in the data file. Note that series names can be 24 bytes.
Missing Codes     MISSING=( NA-ND=. )
Table 11.15

                  Metadata Fields  Metadata Labels
Data Files        Database is stored in a single file.
INTERVAL          YEAR (default), QUARTER, MONTH
BY Variables      None
Series Variables  Variable names are taken from the series descriptor records in the data file and are the same as the series codes reported in the CITIBASE Directory.
Missing Codes     1.0E9=.
                  Metadata Fields  Metadata Labels
Data Files        Database is stored in groups of three associated files having the same filename but different extensions: KEY, IND, or DB. The INFILE= option should contain three filerefs in the following order: INFILE=(keyfile indfile dbfile).
INTERVAL          YEAR (default), QUARTER, MONTH
BY Variables      None
Series Variables  Series variable names are the same as series codes reported in the CITIBASE Directory.
Missing Codes     1.0E9=.
Series names are internally constructed from the data array names documented in the COMPUSTAT manual. Each column of data array is treated as a SAS variable. The names of these variables are generated by concatenating the corresponding column numbers to the array name.

Missing values use four codes. Missing code .C represents a combined figure where the data item has been combined into another data item, .I reports an insignificant figure, .S represents a semiannual figure in the second and fourth quarters, .A represents an annual figure in the fourth quarter, and . indicates that the data item is not available. The missing codes .C and .I are not used for Aggregate or Prices, Dividends, and Earnings (PDE) files. The missing codes .S and .A are used only on the Industrial Quarterly File and not on the Aggregate Quarterly, Business Information, or PDE files.
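Because .C, .I, .S, and .A are SAS special missing values, later DATA steps can tell them apart from the ordinary missing value. The sketch below is illustrative only: the data set DEMO, the variables ITEM and MEANING, and the input values are made up, and the MISSING statement is used here merely to create special missing values from in-stream data rather than from a COMPUSTAT file.

```sas
data demo;
   /* let C, I, S, A in numeric input be read as .C, .I, .S, .A */
   missing C I S A;
   input item @@;
   datalines;
12.5 C I S A .
;
run;

data _null_;
   set demo;
   length meaning $40;
   /* each special missing value compares equal only to itself */
   if      item = .C then meaning = 'combined into another data item';
   else if item = .I then meaning = 'insignificant figure';
   else if item = .S then meaning = 'semiannual figure';
   else if item = .A then meaning = 'annual figure';
   else if item = .  then meaning = 'not available';
   else                   meaning = 'reported value';
   put item= meaning=;
run;
```

A test such as MISSING(ITEM) is true for all five codes, so code that only needs to skip unavailable values does not have to list them individually.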
FILETYPE=CSAIBM: COMPUSTAT Annual, IBM 360/370 Format
FILETYPE=CSAIY2: Four-Digit Year COMPUSTAT Annual, IBM 360/370 Format
Table 11.17 FILETYPE=CSAIBM,CSAIY2 COMPUSTAT Annual,IBM 360/370 Format
                   Metadata Fields  Metadata Labels
Data Files         Database is stored in a single file.
INTERVAL           YEAR (default)
BY Variables       DNUM             Industry Classification Code (numeric)
                   CNUM             CUSIP Issuer Code (character)
                   CIC              CUSIP Issue Number and Check Digit (numeric)
                   FILE             File Identification Code (numeric)
                   ZLIST            Exchange Listing and S&P Index Code (numeric)
                   CONAME           Company Name (character)
                   INAME            Industry Name (character)
                   SMBL             Stock Ticker Symbol (character)
                   XREL             S&P Industry Index Relative Code (numeric)
                   STK              Stock Ownership Code (numeric)
                   STATE            Company Location Identification Code - State (numeric)
                   COUNTY           Company Location Identification Code - County (numeric)
                   FINC             Incorporation Code - Foreign (numeric)
                   EIN              Employer Identification Number (character)
                   CPSPIN           S&P Index Primary Marker (character)
                   CSSPIN           S&P Index Secondary Identifier (character)
                   CSSPII           S&P Index Subset Identifier (character)
                   SDBT             S&P Senior Debt Rating - Current (character)
                   SDBTIM           Footnote - S&P Senior Debt Rating - Current (character)
                   SUBDBT           S&P Subordinated Debt Rating - Current (character)
                   CPAPER           S&P Commercial Paper Rating - Current (character)
Sorting Order      BY DNUM CNUM CIC
Series Variables   DATA1-DATA350 FYR UCODE SOURCE AFTNT1-AFTNT70
Default DROP List  DROP DATA322-DATA326 DATA338 DATA345-DATA347 DATA350 AFTNT52-AFTNT70;
Missing Codes      0.0001=. 0.0004=.C 0.0008=.I 0.0002=.S 0.0003=.A
FILETYPE=CS48QIBM: COMPUSTAT 48-Quarter, IBM 360/370 Format
FILETYPE=CSQIY2: Four-Digit Year COMPUSTAT 48-Quarter, IBM 360/370 Format
Table 11.18 FILETYPE=CS48QIBM,CSQIY2 COMPUSTAT 48-Quarter, IBM 360/370 Format
                   Metadata Fields  Metadata Labels
Data Files         Database is stored in a single file.
INTERVAL           QUARTER (default)
BY Variables       DNUM             Industry Classification Code (numeric)
                   CNUM             CUSIP Issuer Code (character)
                   CIC              CUSIP Issue Number and Check Digit (numeric)
                   FILE             File Identification Code (numeric)
                   CONAME           Company Name (character)
                   INAME            Industry Name (character)
                   EIN              Employer Identification Number (character)
                   STK              Stock Ownership Code (numeric)
                   SMBL             Stock Ticker Symbol (character)
                   ZLIST            Exchange Listing and S&P Index Code (numeric)
                   XREL             S&P Industry Index Relative Code (numeric)
                   FIC              Incorporation Code - Foreign (numeric)
                   INCORP           Incorporation Code - State (numeric)
                   STATE            Company Location Identification Code - State (numeric)
                   COUNTY           Company Location Identification Code - County (numeric)
                   CANDX            Canadian Index Code - Current (character)
Sorting Order      BY DNUM CNUM CIC;
Series Variables   DATA1-DATA232    Data Array
                   QFTNT1-QFTNT60   Data Footnotes
                   FYR              Fiscal Year-End Month of Data
                   SPCSCYR          SPCS Calendar Year
                   SPCSCQTR         SPCS Calendar Quarter
                   UCODE            Update Code
                   SOURCE           Source Document Code
                   BONDRATE         S&P Bond Rating
                   DEBTCL           S&P Class of Debt
                   CPRATE           S&P Commercial Paper Rating
                   STOCK            S&P Common Stock Ranking
                   MIC              S&P Major Index Code
                   IIC              S&P Industry Index Code
                   REPORTDT         Report Date of Quarterly Earnings
                   FORMAT           Flow of Funds Statement Format Code
                   DEBTRT           S&P Subordinated Debt Rating
                   CANIC            Canadian Index Code
                   CS               Comparability Status
                   CSA              Company Status Alert
                   SENIOR           S&P Senior Debt Rating
Default DROP List  DROP DATA122-DATA232 QFTNT24-QFTNT60;
Missing Codes      0.0001=. 0.0004=.C 0.0008=.I 0.0002=.S 0.0003=.A
FILETYPE=CSAUC: COMPUSTAT Annual, Universal Character Format
FILETYPE=CSAUCY2: Four-Digit Year COMPUSTAT Annual, Universal Character Format

Table 11.19 FILETYPE=CSAUC,CSAUCY2 COMPUSTAT Annual, Universal Character Format

Metadata Field Types    Metadata Fields and Labels

Data Files              Database is stored in a single file.
INTERVAL=               YEAR (default)
BY Variables            DNUM     Industry Classification Code (numeric)
                        CNUM     CUSIP Issuer Code (character)
                        CIC      CUSIP Issue Number and Check Digit (character)
                        FILE     File Identification Code (numeric)
                        ZLIST    Exchange Listing and S&P Index Code (numeric)
                        CONAME   Company Name (character)
                        INAME    Industry Name (character)
                        SMBL     Stock Ticker Symbol (character)
                        XREL     S&P Industry Index Relative Code (numeric)
                        STK      Stock Ownership Code (numeric)
                        STATE    Company Location Identification Code - State (numeric)
                        COUNTY   Company Location Identification Code - County (numeric)
                        FINC     Incorporation Code - Foreign (numeric)
                        EIN      Employer Identification Number (character)
                        CPSPIN   S&P Index Primary Marker (character)
                        CSSPIN   S&P Index Secondary Identifier (character)
                        CSSPII   S&P Index Subset Identifier (character)
                        SDBT     S&P Senior Debt Rating - Current (character)
                        SDBTIM   Footnote - S&P Senior Debt Rating - Current (character)
                        SUBDBT   S&P Subordinated Debt Rating - Current (character)
                        CPAPER   S&P Commercial Paper Rating - Current (character)
Sorting Order           BY DNUM CNUM CIC
Series Variables        DATA1-DATA350 FYR UCODE SOURCE AFTNT1-AFTNT70
Default KEEP List       DROP DATA322-DATA326 DATA338 DATA345-DATA347 DATA350 AFTNT52-AFTNT70;
Missing Codes           -0.001=. -0.004=.C -0.008=.I -0.002=.S -0.003=.A
FILETYPE=CS48QUC: COMPUSTAT 48 Quarter, Universal Character Format
FILETYPE=CSQUCY2: Four-Digit Year COMPUSTAT 48 Quarter, Universal Character Format

Table 11.20 FILETYPE=CS48QUC,CSQUCY2 COMPUSTAT 48 Quarter, Universal Character Format

Metadata Field Types    Metadata Fields and Labels

Data Files              Database is stored in a single file.
INTERVAL=               QUARTER (default)
BY Variables            DNUM     Industry Classification Code (numeric)
                        CNUM     CUSIP Issuer Code (character)
                        CIC      CUSIP Issue Number and Check Digit (character)
                        FILE     File Identification Code (numeric)
                        CONAME   Company Name (character)
                        INAME    Industry Name (character)
                        EIN      Employer Identification Number (character)
                        STK      Stock Ownership Code (numeric)
                        SMBL     Stock Ticker Symbol (character)
                        ZLIST    Exchange Listing and S&P Index Code (numeric)
                        XREL     S&P Industry Index Relative Code (numeric)
                        FIC      Incorporation Code - Foreign (numeric)
                        INCORP   Incorporation Code - State (numeric)
                        STATE    Company Location Identification Code - State (numeric)
                        COUNTY   Company Location Identification Code - County (numeric)
                        CANDXC   Canadian Index Code - Current (numeric)
Sorting Order           BY DNUM CNUM CIC
Series Variables        DATA1-DATA232     Data Array
                        QFTNT1-QFTNT60    Data Footnotes
                        FYR        Fiscal Year-End Month of Data
                        SPCSCYR    SPCS Calendar Year
                        SPCSCQTR   SPCS Calendar Quarter
                        UCODE      Update Code
                        SOURCE     Source Document Code
                        BONDRATE   S&P Bond Rating
                        DEBTCL     S&P Class of Debt
                        CPRATE     S&P Commercial Paper Rating
                        STOCK      S&P Common Stock Ranking
                        MIC        S&P Major Index Code
                        IIC        S&P Industry Index Code
                        REPORTDT   Report Date of Quarterly Earnings
                        FORMAT     Flow of Funds Statement Format Code
                        DEBTRT     S&P Subordinated Debt Rating
                        CANIC      Canadian Index Code - Current
                        CS         Comparability Status
                        CSA        Company Status Alert
                        SENIOR     S&P Senior Debt Rating
Default KEEP List       DROP DATA122-DATA232 QFTNT24-QFTNT60;
Missing Codes           -0.001=. -0.004=.C -0.008=.I -0.002=.S -0.003=.A
files contain annual data fields.

CRSP data files are distributed in CRSPAccess format. See Chapter 33, "The SASECRSP Interface Engine," for more about accessing your CRSPAccess database. You can convert your CRSPAccess data to binary format (SFA format) by using the CRSP-supplied utility (STK_DUMP_BIN). Use the DATASOURCE procedure for SFA format access and use the SASECRSP interface engine for CRSPAccess.

CRSP stock data (in SFA format) are provided in two files: a main data file containing security information, and a calendar/indices file containing a list of trading dates and market information associated with those trading dates.

The file types for CRSP stock files are constructed by concatenating CRSP with a D or M to indicate the frequency of data, followed by B, C, or I to indicate file formats. B is for host binary, C is for character, and I is for IBM binary formats. The last character in the file type indicates whether you are reading the calendar/indices file (I), or extracting the security (S) or annual data (A). For example, the file type for the daily NYSE/AMEX combined data in IBM binary format is CRSPDIS. Its calendar/indices file can be read by CRSPDII, and its annual data can be extracted by CRSPDIA.

Starting in 1995, binary data used split records (RICFAC=2), so the 1995 filetypes (CR95*) should be used for 1995 and 1996 binary data.

If you use utility routines supplied by CRSP to convert a character format file to a binary format file on a given host, then you need to use host binary file types (RIDFAC=1) to read those files in. Note that you cannot do the conversion on one host and then transfer and read the file on another host.

If you are using the CRSPAccess Database, you need to use the utility routine (stk_dump_bin) supplied by CRSP to generate the UNIX binary format of the data. You can access the UNIX (or SUN) binary data by using PROC DATASOURCE with the CRSPDUS file type for daily stock data or the CRSPMUS file type for monthly stock data.
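The naming rule described above can be sketched as a short DATA step that enumerates the combinations it implies. This is only an illustration: the data set and variable names are made up here, the enumeration covers just the B, C, and I formats (not the U variants such as CRSPDUS), and not every generated name is necessarily a file type that PROC DATASOURCE accepts.

```sas
/* Illustrative only: build file-type names from the stated rule      */
/* CRSP + frequency (D or M) + format (B, C, I) + contents (I, S, A). */
data crsp_filetypes;                      /* hypothetical data set name */
   length freq fmt part $ 1 filetype $ 7;
   do freq = 'D', 'M';                    /* daily or monthly */
      do fmt = 'B', 'C', 'I';             /* host binary, character, IBM binary */
         do part = 'I', 'S', 'A';         /* calendar/indices, security, annual */
            filetype = cats('CRSP', freq, fmt, part);
            output;
         end;
      end;
   end;
run;

proc print data=crsp_filetypes noobs;
   var filetype;
run;
```

The listing includes CRSPDIS, CRSPDII, and CRSPDIA, the three names used as examples in the text.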
For the four-digit year data, use the Y2K-compliant filetypes for that data type. For CRSP file types, the INFILE= option must be of the form

INFILE=( calfile security1 < security2 ... > )

where calfile is the fileref assigned to the calendar/indices file, and security1 < security2 ... > are the filerefs given to the security files, in the order in which they should be read.
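As a hedged sketch of how the INFILE= form above might be used, the following statements read a daily character-format security file. The filerefs, paths, and output data set names are placeholders, CRSPDCS is simply the name the file-type rule gives for daily character-format security data, and host-specific FILENAME options (for example RECFM= or LRECL=) might also be needed.

```sas
/* Placeholder filerefs and paths; adjust for your site. */
filename calfile 'path-to-calendar-indices-file';
filename secfile 'path-to-security-file';

proc datasource filetype=crspdcs            /* daily, character format, security data */
                infile=( calfile secfile )  /* calendar/indices file comes first */
                interval=day
                out=work.crsp_daily         /* periodic series */
                outevent=work.crsp_events;  /* event variables */
   keep bidlo askhi prc vol ret;            /* subset of the periodic series */
run;
```

Listing the calendar/indices fileref first follows the required form; additional security filerefs would be appended in the order in which they should be read.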
Table 11.21

Metadata Field Types    Metadata Fields and Labels

Data Files              Database is stored in a single file.
INTERVAL=               DAY for products DA, DR, DX, EX, NX, and RA
                        MONTH for products MA, MX, and MZ
BY Variables            None
Series Variables        VWRETD   Value-Weighted Return (including all distributions)
                        VWRETX   Value-Weighted Return (excluding dividends)
                        EWRETD   Equal-Weighted Return (including all distributions)
                        EWRETX   Equal-Weighted Return (excluding dividends)
                        TOTVAL   Total Market Value
                        TOTCNT   Total Market Count
                        USDVAL   Market Value of Securities Used
                        USDCNT   Count of Securities Used
                        SPINDX   Level of the Standard & Poor's Composite Index
                        SPRTRN   Return on the Standard & Poor's Composite Index
                        NCINDX   NASDAQ Composite Index
                        NCRTRN   NASDAQ Composite Return
Default KEEP List       All variables will be kept.
Table 11.22

Metadata Field Types    Metadata Fields and Labels

Data Files              INFILE=( calfile security1 < security2 ... > )
INTERVAL=               DAY
BY Variables            CUSIP    CUSIP Identifier (character)
                        PERMNO   CRSP Permanent Number (numeric)
                        COMPNO   NASDAQ Company Number (numeric)
                        ISSUNO   NASDAQ Issue Number (numeric)
                        HEXCD    Header Exchange Code (numeric)
                        HSICCD   Header SIC Code (numeric)
Sorting Order           BY CUSIP
Series Variables        BIDLO    Bid or Low
                        ASKHI    Ask or High
                        PRC      Closing Price of Bid/Ask Average
                        VOL      Share Volume
                        RET      Holding Period Return
                                 missing=( -66.0 = .p -77.0 = .t -88.0 = .r -99.0 = .b )
                        BXRET    Beta Excess Return missing=( -44.0 = . )
                        SXRET    Standard Deviation Excess Return missing=( -44.0 = . )
Events
   NAMES                NCUSIP   Name CUSIP
                        TICKER   Exchange Ticker Symbol
                        COMNAM   Company Name
                        SHRCLS   Share Class
                        SHRCD    Share Code
                        EXCHCD   Exchange Code
                        SICCD    Standard Industrial Classification Code
   DIST                 DISTCD   Distribution Code
                        DIVAMT   Dividend Cash Amount
                        FACPR    Factor to Adjust Price
                        FACSHR   Factor to Adjust Shares Outstanding
                        DCLRDT   Declaration Date
                        RCRDDT   Record Date
                        PAYDT    Payment Date
   SHARES               SHROUT   Number of Shares Outstanding
                        SHRFLG   Share Flag
   DELIST               DLSTCD   Delisting Code
                        NWPERM   New CRSP Permanent Number
                        NEXTDT   Date of Next Available Information
                        DLBID    Delisting Bid
                        DLASK    Delisting Ask
                        DLPRC    Delisting Price
                        DLVOL    Delisting Volume missing=( -99 = . )
                        DLRET    Delisting Return
                                 missing=( -55.0=.s -66.0=.t 88.0=.a -99.0=.p );
   NASDIN               TRTSCD   Traits Code
                        NMSIND   National Market System Indicator
                        MMCNT    Market Maker Count
                        NSDINX   NASD Index
Default KEEP Lists      All periodic series variables will be output to the OUT= data set
                        and all event variables will be output to the OUTEVENT= data set.
Table 11.23

Metadata Field Types    Metadata Fields and Labels

Data Files              INFILE=( calfile security1 < security2 ... > )
INTERVAL=               MONTH
BY Variables            CUSIP    CUSIP Identifier (character)
                        PERMNO   CRSP Permanent Number (numeric)
                        COMPNO   NASDAQ Company Number (numeric)
                        ISSUNO   NASDAQ Issue Number (numeric)
                        HEXCD    Header Exchange Code (numeric)
                        HSICCD   Header SIC Code (numeric)
Sorting Order           BY CUSIP
Series Variables        BIDLO    Bid or Low
                        ASKHI    Ask or High
                        PRC      Closing Price of Bid/Ask Average
                        VOL      Share Volume
                        RET      Holding Period Return
                                 missing=( -66.0 = .p -77.0 = .t -88.0 = .r -99.0 = .b );
                        RETX     Return Without Dividends missing=( -44.0 = . )
                        PRC2     Secondary Price missing=( -44.0 = . )
Events
   NAMES                NCUSIP   Name CUSIP
                        TICKER   Exchange Ticker Symbol
                        COMNAM   Company Name
                        SHRCLS   Share Class
                        SHRCD    Share Code
                        EXCHCD   Exchange Code
                        SICCD    Standard Industrial Classification Code
   DIST                 DISTCD   Distribution Code
                        DIVAMT   Dividend Cash Amount
                        FACPR    Factor to Adjust Price
                        FACSHR   Factor to Adjust Shares Outstanding
                        EXDT     Ex-distribution Date
                        RCRDDT   Record Date
                        PAYDT    Payment Date
   SHARES               SHROUT   Number of Shares Outstanding
                        SHRFLG   Share Flag
   DELIST               DLSTCD   Delisting Code
                        NWPERM   New CRSP Permanent Number
                        NEXTDT   Date of Next Available Information
                        DLBID    Delisting Bid
                        DLASK    Delisting Ask
                        DLPRC    Delisting Price
                        DLVOL    Delisting Volume
                        DLRET    Delisting Return
                                 missing=( -55.0=.s -66.0=.t 88.0=.a -99.0=.p );
   NASDIN               TRTSCD   Traits Code
                        NMSIND   National Market System Indicator
                        MMCNT    Market Maker Count
                        NSDINX   NASD Index
Default KEEP Lists      All periodic series variables will be output to the OUT= data set
                        and all event variables will be output to the OUTEVENT= data set.
Metadata Field Types    Metadata Fields and Labels

Data Files              INFILE=( security1 < security2 ... > )
INTERVAL=               YEAR
BY Variables            CUSIP    CUSIP Identifier (character)
                        PERMNO   CRSP Permanent Number (numeric)
                        COMPNO   NASDAQ Company Number (numeric)
                        ISSUNO   NASDAQ Issue Number (numeric)
                        HEXCD    Header Exchange Code (numeric)
                        HSICCD   Header SIC Code (numeric)
Sorting Order           BY CUSIP
Series Variables        CAPV     Year End Capitalization
                        SDEVV    Annual Standard Deviation missing=( -99.0 = . )
                        BETAV    Annual Beta missing=( -99.0 = . )
                        CAPN     Year End Capitalization Portfolio Assignment
                        SDEVN    Standard Deviation Portfolio Assignment
                        BETAN    Beta Portfolio Assignment
Default KEEP List       All variables will be kept.
for SAS 8.2 on Windows, Solaris2, AIX, and HP-UX hosts. SASEFAME for SAS 9 supports Windows, Solaris, AIX, and Linux.

The DATASOURCE interface to FAME requires a component supplied by FAME Information Services, Inc. Once this FAME component is installed on your system, you can use the DATASOURCE procedure to extract data from your FAME databases by giving the following specifications:

   Specify FILETYPE=FAME in the PROC DATASOURCE statement and give the FAME database name to access with a DBNAME='fame-database' option. The character string you specify in the DBNAME= option is passed through to FAME; specify the value of this option as you would in accessing the database from within FAME software.

   Specify the output SAS data set to be created, the frequency of the series to be extracted, and other usual DATASOURCE procedure options as appropriate.

   Specify the time range to extract with a RANGE statement. The RANGE statement is required when extracting series from FAME databases.

   Name the FAME series to be extracted with a KEEP statement. The items in the KEEP statement are passed through to FAME software; therefore, you can use any valid FAME expression to specify the series to be extracted. Enclose in quotes any FAME series name or expression that is not a valid SAS name.

   Name the SAS variable names you want to use for the extracted series in a RENAME statement. Give the FAME series name or expression (in quotes if needed) followed by an equal sign and the SAS name. The RENAME statement is not required; however, if the FAME series name is not a valid SAS variable name, the DATASOURCE procedure constructs a SAS name by translating and truncating the FAME series name. This process might not produce the desired name for the variable in the output SAS data set, so a RENAME statement can be used to produce a more appropriate variable name. The VALIDVARNAME=ANY option in your SAS OPTIONS statement can be used to allow special characters in the SAS variable name.
For an alternative to accessing FAME through PROC DATASOURCE, see Chapter 34, "The SASEFAME Interface Engine."
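The specifications listed above might be combined as in the following sketch. The database name, series names, date range, and output data set are all hypothetical, and the statements can run only on a system where the FAME-supplied component is installed.

```sas
/* Hypothetical FAME extraction; every name below is a placeholder. */
proc datasource filetype=fame
                dbname='hypothetical_db'   /* passed through to FAME as given */
                interval=month
                out=work.fame_series;
   range from 1990:1 to 1991:12;           /* RANGE is required for FAME databases */
   keep gdp 'ip.total';                    /* quote names that are not valid SAS names */
   rename 'ip.total'=ip_total;             /* give the series a valid SAS variable name */
run;
```

Without the RENAME statement, the procedure would construct a name for the second series by translating and truncating the FAME name, which might not be the name you want.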
Table 11.25

Metadata Field Types    Metadata Fields and Labels

INTERVAL=               YEAR        corresponds to FAME's ANNUAL(DECEMBER)
                        YEAR.2      corresponds to FAME's ANNUAL(JANUARY)
                        YEAR.3      corresponds to FAME's ANNUAL(FEBRUARY)
                        YEAR.4      corresponds to FAME's ANNUAL(MARCH)
                        YEAR.5      corresponds to FAME's ANNUAL(APRIL)
                        YEAR.6      corresponds to FAME's ANNUAL(MAY)
                        YEAR.7      corresponds to FAME's ANNUAL(JUNE)
                        YEAR.8      corresponds to FAME's ANNUAL(JULY)
                        YEAR.9      corresponds to FAME's ANNUAL(AUGUST)
                        YEAR.10     corresponds to FAME's ANNUAL(SEPTEMBER)
                        YEAR.11     corresponds to FAME's ANNUAL(OCTOBER)
                        YEAR.12     corresponds to FAME's ANNUAL(NOVEMBER)
                        SEMIYEAR    corresponds to FAME's SEMIYEAR
                        QUARTER     corresponds to FAME's QUARTER
                        MONTH       corresponds to FAME's MONTH
                        SEMIMONTH   corresponds to FAME's SEMIMONTH
                        TENDAY      corresponds to FAME's TENDAY
                        WEEK        corresponds to FAME's WEEKLY(SATURDAY)
                        WEEK.2      corresponds to FAME's WEEKLY(SUNDAY)
                        WEEK.3      corresponds to FAME's WEEKLY(MONDAY)
                        WEEK.4      corresponds to FAME's WEEKLY(TUESDAY)
                        WEEK.5      corresponds to FAME's WEEKLY(WEDNESDAY)
                        WEEK.6      corresponds to FAME's WEEKLY(THURSDAY)
                        WEEK.7      corresponds to FAME's WEEKLY(FRIDAY)
                        WEEK2       corresponds to FAME's BIWEEKLY(ASATURDAY)
                        WEEK2.2     corresponds to FAME's BIWEEKLY(ASUNDAY)
                        WEEK2.3     corresponds to FAME's BIWEEKLY(AMONDAY)
                        WEEK2.4     corresponds to FAME's BIWEEKLY(ATUESDAY)
                        WEEK2.5     corresponds to FAME's BIWEEKLY(AWEDNESDAY)
                        WEEK2.6     corresponds to FAME's BIWEEKLY(ATHURSDAY)
                        WEEK2.7     corresponds to FAME's BIWEEKLY(AFRIDAY)
                        WEEK2.8     corresponds to FAME's BIWEEKLY(BSATURDAY)
                        WEEK2.9     corresponds to FAME's BIWEEKLY(BSUNDAY)
                        WEEK2.10    corresponds to FAME's BIWEEKLY(BMONDAY)
                        WEEK2.11    corresponds to FAME's BIWEEKLY(BTUESDAY)
                        WEEK2.12    corresponds to FAME's BIWEEKLY(BWEDNESDAY)
                        WEEK2.13    corresponds to FAME's BIWEEKLY(BTHURSDAY)
                        WEEK2.14    corresponds to FAME's BIWEEKLY(BFRIDAY)
                        WEEKDAY     corresponds to FAME's WEEKDAY
                        DAY         corresponds to FAME's DAY
BY Variables            None
Series Variables        Variable names are constructed from the FAME series codes.
                        Note that series names are limited to 32 bytes.
your data. The format of Haver Analytics data files is similar to the CITIBASE/DRIBASIC formats.
Metadata Field Types    Metadata Fields and Labels

Data Files              Database is stored in a single file.
INTERVAL=               YEAR (default), QUARTER, MONTH
Series Variables        Variable names are taken from the series descriptor records in the data file.
                        NOTE: The HAVER filetype reports the UPDATE and SOURCE in the OUTCONT=
                        data set, while HAVERO does not.
Missing Codes           1.0E9=.
Table 11.27

Metadata Field Types    Metadata Fields and Labels

Data Files              Database is stored in a single file.
INTERVAL=               YEAR (default), QUARTER, MONTH
BY Variables            COUNTRY   Country Code (character, three digits)
                        CSC       Control Source Code (character)
                        PARTNER   Partner Country Code (character, three digits)
                        VERSION   Version Code (character)
Sorting Order           BY COUNTRY CSC PARTNER VERSION
Series Variables        Series variable names are the same as series codes reported in IMF
                        Documentation prefixed by F for data and F_F for footnote indicators.
Default KEEP List       By default all the footnote indicators will be dropped.
Metadata Field Types    Metadata Fields and Labels

Data Files              Database is stored in a single file.
INTERVAL=               YEAR (default), QUARTER, MONTH
BY Variables            COUNTRY   Country Code (character, three digits)
                        CSC       Control Source Code (character)
                        PARTNER   Partner Country Code (character, three digits)
                        VERSION   Version Code (character)
Sorting Order           BY COUNTRY CSC PARTNER VERSION
Series Variables        Series variable names are the same as series codes reported in IMF
                        Documentation prefixed by D for data and F_D for footnote indicators.
Default KEEP List       By default all the footnote indicators will be dropped.
Table 11.29

Metadata Field Types    Metadata Fields and Labels

Data Files              Database is stored in a single file.
INTERVAL=               YEAR (default), QUARTER, MONTH
BY Variables            COUNTRY   Country Code (character, three digits)
                        CSC       Control Source Code (character)
                        PARTNER   Partner Country Code (character, three digits)
                        VERSION   Version Code (character)
Sorting Order           BY COUNTRY CSC PARTNER VERSION
Series Variables        Series variable names are the same as series codes reported in IMF
                        Documentation prefixed by B for data and F_B for footnote indicators.
Default KEEP List       By default all the footnote indicators will be dropped.
Metadata Field Types    Metadata Fields and Labels

Data Files              Database is stored in a single file.
INTERVAL=               YEAR (default), QUARTER, MONTH
BY Variables            COUNTRY   Country Code (character, three digits)
                        CSC       Control Source Code (character)
                        PARTNER   Partner Country Code (character, three digits)
                        VERSION   Version Code (character)
Sorting Order           BY COUNTRY CSC PARTNER VERSION
Series Variables        Series variable names are the same as series codes reported in IMF
                        Documentation prefixed by G for data and F_G for footnote indicators.
Default KEEP List       By default all the footnote indicators will be dropped.
Table 11.31

Metadata Field Types    Metadata Fields and Labels

Data Files              Database is stored on a single file.
INTERVAL=               YEAR (default), SEMIYR1.6, QUARTER, MONTH, WEEK, WEEKDAY
BY Variables            PREFIX   Table number prefix (character)
                        CNTRYZ   Country Code (character)
Series Variables        Series variable names are the same as the mnemonic name of the element
                        given on the element E record. They are taken from the 12-byte time
                        series T record time series indicative.
Series Renamed          OLDNAME      NEWNAME
                        p0discgdpe   p0digdpe
                        doll2gdpe    dol2gdpe
                        doll3gdpe    dol3gdpe
                        doll1gdpe    dol1gdpe
                        ppp1gdpd     pp1gdpd
                        ppp1gdpd1    pp1gdpd1
                        p0itxgdpc    p0itgdpc
                        p0itxgdps    p0itgdps
                        p0subgdpc    p0sugdpc
                        p0subgdps    p0sugdps
                        p0cfcgdpc    p0cfgdpc
                        p0cfgddps    p0cfgdps
                        p0discgdpc   p0dicgdc
                        p0discgdps   p0dicgds
Missing Codes           A data value of * is interpreted as MISSING.
Table 11.32

Metadata Field Types    Metadata Fields and Labels

Data Files              Database is stored on a single file.
INTERVAL=               QUARTER (default), YEAR
BY Variables            COUNTRY    Country Code (character)
                        SEASON     Seasonality
                                   S=seasonally adjusted
                                   0=raw data, not seasonally adjusted
                        PRICETAG   Prices
                                   C=data at current prices
                                   R,L,M=data at constant prices
                                   P,K,J,V=implicit price index or volume index
Series Variables        Subject code used to distinguish series within countries.
                        Series variables are prefixed by _ for data, C for control codes,
                        and D for relative date.
Default KEEP List       By default all the control codes and relative dates will be dropped.
Missing Codes           A data value of + or - is interpreted as MISSING.
Metadata Field Types    Metadata Fields and Labels

Data Files              Database is stored on a single file.
INTERVAL=               YEAR (default), QUARTER, MONTH
BY Variables            COUNTRY    Country Code (character)
                        CURRENCY   Unit of expression of the series.
                        ADJUST     Adjustment
                                   0,H,S,A,L=no adjustment
                                   1,I=calendar or working day adjusted
                                   2,B,J,M=seasonally adjusted by National Authorities
                                   3,K,D=seasonally adjusted by OECD
Series Variables        Series variables are prefixed by _ for data, C for control codes,
                        and D for relative date in weeks since last updated.
Default KEEP List       By default, all the control codes and relative dates will be dropped.
Missing Codes           A data value of + or - is interpreted as MISSING.
References
Bureau of Economic Analysis (1986), The National Income and Product Accounts of the United States, 1929-82, U.S. Dept of Commerce, Washington, DC.

Bureau of Economic Analysis (1987), Index of Items Appearing in the National Income and Product Accounts Tables, U.S. Dept of Commerce, Washington, DC.

Bureau of Economic Analysis (1991), Survey of Current Business, U.S. Dept of Commerce, Washington, DC.

Center for Research in Security Prices (2006), CRSP Data Description Guide, Chicago, IL.

Center for Research in Security Prices (2006), CRSP Fortran-77 to Fortran-95 Migration Guide, Chicago, IL.

Center for Research in Security Prices (2006), CRSP Programmer's Guide, Chicago, IL.

Center for Research in Security Prices (2006), CRSP Utilities Guide, Chicago, IL.

Center for Research in Security Prices (2000), CRSP SFA Guide, Chicago, IL.

Citibank (1990), CITIBASE Directory, New York, NY.

Citibank (1991), CITIBASE-Weekly, New York, NY.

Citibank (1991), CITIBASE-Daily, New York, NY.

DRI/McGraw-Hill (1997), DataLink, Lexington, MA.

DRI/McGraw-Hill Data Search and Retrieval for Windows (1996), DRIPRO User's Guide, Lexington, MA.

FAME Information Services (1995), User's Guide to FAME, Ann Arbor, MI.

International Monetary Fund (1984), IMF Documentation on Computer Subscription, Washington, DC.

Organization For Economic Cooperation and Development (1992), Annual National Accounts: Volume I. Main Aggregates Content Documentation, Paris, France.

Organization For Economic Cooperation and Development (1992), Annual National Accounts: Volume II. Detailed Tables Technical Documentation, Paris, France.

Organization For Economic Cooperation and Development (1992), Main Economic Indicators Database Note, Paris, France.

Organization For Economic Cooperation and Development (1992), Main Economic Indicators Inventory, Paris, France.

Organization For Economic Cooperation and Development (1992), Main Economic Indicators OECD Statistics Document, Paris, France.

Organization For Economic Cooperation and Development (1992), OECD Statistical Information Research and Inquiry System Documentation, Paris, France.

Organization For Economic Cooperation and Development (1992), Quarterly National Accounts Inventory of Series Codes, Paris, France.

Organization For Economic Cooperation and Development (1992), Quarterly National Accounts Technical Documentation, Paris, France.

Standard & Poor's Compustat Services Inc. (1991), COMPUSTAT II Documentation, Englewood, CO.

Standard & Poor's Compustat Services Inc. (2003), COMPUSTAT Technical Guide, Englewood, CO.
Chapter 12
Example 12.1: Nonnormal Error Estimation
Example 12.2: Unreplicated Factorial Experiments
Example 12.3: Censored Data Models in PROC ENTROPY
Example 12.4: Use of the PDATA= Option
Example 12.5: Illustration of ODS Graphics
References
Generalized maximum entropy (GME) is a means of selecting among probability distributions to choose the distribution that maximizes uncertainty or uniformity remaining in the distribution, subject to information already known about the distribution. Information takes the form of data or moment constraints in the estimation procedure. PROC ENTROPY creates a GME distribution for each parameter in the linear model, based upon support points supplied by the user. The mean of each distribution is used as the estimate of the parameter. Estimates tend to be biased, as they are a type of shrinkage estimate, but they typically exhibit smaller variances than their ordinary least squares (OLS) counterparts, making them more desirable from a mean squared error viewpoint (see Figure 12.1).
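The statement that the mean of each distribution serves as the parameter estimate can be written out as a short sketch. The notation below is introduced here for illustration only: $z_{jm}$ are the $M$ support points for parameter $j$ and $p_{jm}$ are the estimated probabilities on those points. The second line is the usual mean squared error decomposition, which shows how a biased estimator can still be preferred when its variance is small enough.

```latex
\[
  \hat{\beta}_j = \sum_{m=1}^{M} z_{jm}\, p_{jm},
  \qquad \sum_{m=1}^{M} p_{jm} = 1, \quad p_{jm} \ge 0
\]
\[
  \mathrm{MSE}(\hat{\beta}_j)
  = \mathrm{Var}(\hat{\beta}_j) + \bigl[\mathrm{Bias}(\hat{\beta}_j)\bigr]^2
\]
```

Because each estimate is a weighted average of its support points, it always lies between the smallest and largest support point for that parameter.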
Figure 12.1 Distribution of Maximum Entropy Estimates versus OLS
Maximum entropy techniques are most widely used in the econometric and time series fields. Some important uses of maximum entropy include the following:

   size distribution of firms
   stationary Markov process
   social accounting matrix (SAM)
   consumer brand preference
   exchange rate regimes
This data set contains outliers, and the condition number of the matrix of regressors, X, is large, which indicates collinearity among the regressors. Since the maximum entropy estimates are both robust with respect to the outliers and less sensitive to a high condition number of the X matrix, maximum entropy estimation is a good choice for this problem.
To fit a simple linear model to this data by using PROC ENTROPY, use the following statements:
proc entropy data=coleman;
   model test_score = teach_sal prcnt_prof socio_stat teach_score mom_ed;
run;
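This requests the estimation of a linear model for TEST_SCORE with the following form:

\[
  \mathit{test\_score} = \mathit{intercept} + a \cdot \mathit{teach\_sal} + b \cdot \mathit{prcnt\_prof} + c \cdot \mathit{socio\_stat} + d \cdot \mathit{teach\_score} + e \cdot \mathit{mom\_ed} + \epsilon
\]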
This estimation produces the Model Summary table in Figure 12.2, which shows the equation variables used in the estimation.
Figure 12.2 Model Summary Table
Test Scores compiled by Coleman et al. (1966)
The ENTROPY Procedure

Variables(Supports(Weights))   teach_sal prcnt_prof socio_stat teach_score mom_ed Intercept
Equations(Supports(Weights))   test_score
Since support points and prior weights are not specified in this example, they are not shown in the Model Summary table. The next four pieces of information, displayed in Figure 12.3, are the Data Set Options, the Minimization Summary, the Final Information Measures, and the Observations Processed.
Figure 12.3 Estimation Summary Tables
Test Scores compiled by Coleman et al. (1966)
The ENTROPY Procedure
GME-NM Estimation Summary

Data Set Options
DATA=                  WORK.COLEMAN

Minimization Summary
Parameters Estimated   6
Covariance Estimator   GME-NM
Entropy Type           Shannon
Entropy Form           Dual
Numerical Optimizer    Quasi Newton
The item labeled Objective Function Value is the value of the entropy estimation criterion for this estimation problem. This measure is analogous to the log-likelihood value in a maximum likelihood estimation. The Parameter Information Index and the Error Information Index are normalized entropy values that measure the proximity of the solution to the prior or target distributions. The next table displayed is the ANOVA table, shown in Figure 12.4. This is in the same form as the ANOVA table for the MODEL procedure, since this is also a multivariate procedure.
Figure 12.4 Summary of Residual Errors
GME-NM Summary of Residual Errors

Equation     DF Model   DF Error   SSE     MSE      R-Square
test_score   6          14         175.8   8.7881   0.7266
The last table displayed is the Parameter Estimates table, shown in Figure 12.5. The difference between this parameter estimates table and the parameter estimates table produced by other regression procedures is that the standard error and the probabilities are labeled as approximate.
Figure 12.5 Parameter Estimates
GME-NM Variable Estimates

Approx Std Err   Approx Pr > |t|
0.00551          <.0001
0.00323          <.0001
0.0308           <.0001
0.0180           <.0001
0.0921           <.0001
0.3958           <.0001
The parameter estimates produced by the REG procedure for this same model are shown in Figure 12.6. Note that the parameter estimates and standard errors from PROC REG differ substantially from the estimates produced by PROC ENTROPY.
symbol v=dot h=1 c=green;

proc reg data=coleman;
   model test_score = teach_sal prcnt_prof socio_stat teach_score mom_ed;
   plot rstudent.*obs. / vref= -1.714 1.714 cvref=blue lvref=1
                         HREF=0 to 30 by 5 cHREF=red cframe=ligr;
run;
Figure 12.6 (PROC REG parameter estimates): DF = 1 for each of the six parameters.
This data set contains two outliers, observations 3 and 18. These can be seen in a plot of the residuals shown in Figure 12.7.

The presence of outliers suggests that a robust estimator such as the M-estimator in the ROBUSTREG procedure should be used. The following statements use the ROBUSTREG procedure to estimate the model.
proc robustreg data=coleman;
   model test_score = teach_sal prcnt_prof socio_stat teach_score mom_ed;
run;
Note that the coefficients of TEACH_SAL (VAR1) and MOM_ED (VAR5) change greatly when the robust estimation is used. Unfortunately, these two coefficients are negative, which implies that the test scores increase with decreasing teacher salaries and decreasing levels of the mother's education. Since ROBUSTREG is robust to outliers, the outliers are not causing the counterintuitive parameter estimates. The condition number of the regressor matrix X also plays an important role in parameter estimation. The condition number of the matrix can be obtained by specifying the COLLIN option in the PROC ENTROPY statement.
proc entropy data=coleman collin;
   model test_score = teach_sal prcnt_prof socio_stat teach_score mom_ed;
run;
Collinearity Diagnostics

                              -Proportion of Variation-
Number   Condition Number     mom_ed     Intercept
1         1.0000              0.0001     0.0000
2         2.3040              0.0000     0.0001
3         8.6833              0.0000     0.0003
4        17.6191              0.0083     0.0099
5        60.4112              0.3309     0.0282
6        84.8501              0.6607     0.9614
The condition number of the X matrix is reported to be 84.85. This means that the condition number of $X'X$ is $84.85^2 = 7199.5$, which is very large. Ridge regression can be used to offset some of the problems associated with ill-conditioned X matrices. Using the formula for the ridge value as

\[
  \lambda_R = \frac{k S^2}{\hat{\beta}' \hat{\beta}} \approx 0.9
\]

where $\hat{\beta}$ and $S^2$ are the least squares estimators of $\beta$ and $\sigma^2$ and $k = 6$. A ridge regression of the test score model was performed by using the data set with the outliers removed. The following PROC REG code performs the ridge regression:
data coleman;
   set coleman;
   if _n_ = 3 or _n_ = 18 then delete;
run;

proc reg data=coleman ridge=0.9 outest=t noprint;
   model test_score = teach_sal prcnt_prof socio_stat teach_score mom_ed;
run;
Obs   _PCOMIT_
1     .
2     .

Obs
1     0.085118   1.18400
2     0.041889   0.60041
Note that the ridge regression estimates are much closer to the estimates produced by the ENTROPY procedure that uses the original data set. Ridge regressions are not robust to outliers as maximum entropy estimates are. This might explain why the estimates still differ for TEACH_SAL.
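The condition-number arithmetic used earlier in this example can be checked with a short derivation. It is a sketch that assumes the condition number is the ratio of the largest to the smallest singular value of X (written here as $\sigma_{\max}$ and $\sigma_{\min}$, notation introduced for illustration). Since the eigenvalues of $X'X$ are the squared singular values of X, the condition number is squared as well.

```latex
\[
  \kappa(X) = \frac{\sigma_{\max}}{\sigma_{\min}},
  \qquad
  \kappa(X'X) = \frac{\sigma_{\max}^2}{\sigma_{\min}^2} = \kappa(X)^2
\]
\[
  \kappa(X'X) \approx 84.85^2 \approx 7199.5
\]
```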
If you have only a small amount of data, the estimates are very sensitive to your selection of support points and weights. For larger data sets, incorrect priors are discounted if they are not supported by the data. Consider the data set generated by the following SAS statements:
data prior;
   do by = 1 to 100;
      do t = 1 to 10;
         y = 2*t + 5 * rannor(456);
         output;
      end;
   end;
run;
The PRIOR data set contains 100 samples of 10 observations each from the population

\[
  y = 2t + \epsilon, \qquad \epsilon \sim N(0, 5)
\]
The 100 estimates are summarized by using the following SAS statements:
proc univariate data=parm1;
   var t;
run;
The summary statistics from PROC UNIVARIATE are shown in Figure 12.11. The true value of the coefficient of T is 2.0, whereas the mean of the estimates is about 1.67, which illustrates that maximum entropy estimates tend to be biased.
Figure 12.11 No Prior Information Monte Carlo Summary
Test Scores compiled by Coleman et al. (1966)
The UNIVARIATE Procedure
Variable: t

Basic Statistical Measures
Location              Variability
Mean     1.674802     Std Deviation         0.32418
Median   1.708554     Variance              0.10509
Mode     .            Range                 1.80200
                      Interquartile Range   0.34135
Now assume that you have prior information about the slope and the intercept for this model. You are reasonably confident that the slope is 2 and you are less confident that the intercept is zero. To specify prior information about the parameters, use the PRIORS statement. There are two parts to the prior information specified in the PRIORS statement. The first part is the support points for a parameter. The support points specify the domain of the parameter. For example, the following statement sets the support points -1000 and 1000 for the parameter associated with variable T:
priors t -1000 1000;
This means that the coefficient lies in the interval [-1000, 1000]. If the estimated value of the coefficient is actually outside of this interval, the estimation will not converge. In the previous PRIORS statement, no weights were specified for the support points, so uniform weights are assumed. This implies that the coefficient has a uniform probability of being in the interval [-1000, 1000]. The second part of the prior information is the weights on the support points. For example, the following statement sets the support points 10, 15, 20, and 25 with weights 1, 5, 5, and 1, respectively, for the coefficient of T:
priors t 10(1) 15(5) 20(5) 25(1);
This creates the prior distribution on the coefficient shown in Figure 12.12. The weights are automatically normalized so that they sum to one.
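As a worked illustration of that normalization, take the weights 1, 5, 5, and 1 on the support points 10, 15, 20, and 25. The symbols $w_m$, $p_m$, and $z_m$ are introduced here only for the calculation.

```latex
\[
  \sum_m w_m = 1 + 5 + 5 + 1 = 12,
  \qquad
  p = \left( \tfrac{1}{12},\ \tfrac{5}{12},\ \tfrac{5}{12},\ \tfrac{1}{12} \right)
\]
\[
  \sum_m z_m p_m
  = \frac{10 \cdot 1 + 15 \cdot 5 + 20 \cdot 5 + 25 \cdot 1}{12}
  = \frac{210}{12} = 17.5
\]
```

The prior distribution is therefore symmetric about 17.5, with most of its mass on the two interior support points.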
For the PRIOR data set created previously, the expected value of the coefficient of T is 2. The following SAS statements reestimate the parameters with a prior weight specified for each one.
proc entropy data=prior outest=parm2 noprint;
   priors t         0(1) 2(3) 4(1)
          intercept -100(.5) -10(1.5) 0(2) 10(1.5) 100(0.5);
   model y = t;
   by by;
run;
The priors on the coefficient of T express a confident view of the value of the coefficient. The priors on INTERCEPT express a more diffuse view on the value of the intercept. The following PROC UNIVARIATE statement computes summary statistics from the estimations:
proc univariate data=parm2; var t; run;
The summary statistics for the distribution of the estimates of T are shown in Figure 12.13.
The prior information improves the estimation of the coefficient of T dramatically. The downside of specifying priors comes when they are incorrect. For example, say the priors for this model were specified as
priors t -2(1) 0(3) 2(1);
to indicate a prior centered on zero instead of two. The resulting summary statistics shown in Figure 12.14 indicate how the estimation is biased away from the solution.
Figure 12.14 Incorrect Prior Information Monte Carlo Summary

Prior Distribution of Parameter T

The UNIVARIATE Procedure
Variable: t

Basic Statistical Measures

Location                 Variability
Mean      0.061016       Std Deviation          0.00942
Median    0.060133       Variance               0.0000888
Mode      .              Range                  0.05650
                         Interquartile Range    0.01158
The more data available for estimation, the less sensitive the parameters are to the priors. If the number of observations in each sample is 50 instead of 10, then the summary statistics shown in Figure 12.15 are produced. The prior information is not supported by the data, so it is discounted.
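The two PRIORS statements used for T in this example differ only in where the prior mass is centered. As a worked check (using the fact that weights are normalized to sum to one), the prior means are:

```latex
% priors t 0(1) 2(3) 4(1);   weights normalize to 1/5, 3/5, 1/5
E[\beta_T] = \tfrac{1}{5}(0) + \tfrac{3}{5}(2) + \tfrac{1}{5}(4) = 2

% priors t -2(1) 0(3) 2(1);  same weights, supports shifted by -2
E[\beta_T] = \tfrac{1}{5}(-2) + \tfrac{3}{5}(0) + \tfrac{1}{5}(2) = 0
```

The second prior is centered two units away from the value used to generate the data, which is consistent with the mean estimate of 0.061016 reported in Figure 12.14 for the 10-observation samples.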
Variable     Label
x1
x2
x3
x4
x5
x6
Restrict0    x1 + x2 + x3 + x4 + x5 + x6 = 1
Note how the probabilities are skewed to the higher values because of the high average roll provided in the input data.
You can model the probability that a customer moves from one company to another using a first-order Markov model. Mathematically the model is:
$$y_t = P\,y_{t-1}$$
where $y_t$ is a vector of $k$ market shares at time $t$ and $P$ is a $k \times k$ matrix of unknown transition probabilities. The value $p_{ij}$ represents the probability that a customer who is currently using company $j$ at time $t-1$ moves to company $i$ at time $t$. The diagonal elements then represent the probability that a customer stays with the current company. The columns in $P$ sum to one. Given market share information over time, you can estimate the transition probabilities $P$. In order to estimate $P$ using traditional methods, you need at least $k$ observations. If you have fewer than $k$ transitions, you can use the ENTROPY procedure to estimate the probabilities. Suppose you are studying the market share for four companies. If you want to estimate the transition probabilities for these four companies, you need a time series with four observations of the shares. Assume the current transition probability matrix is as follows:
$$\begin{bmatrix} 0.7 & 0.4 & 0.0 & 0.1 \\ 0.1 & 0.5 & 0.4 & 0.0 \\ 0.0 & 0.1 & 0.6 & 0.0 \\ 0.2 & 0.0 & 0.0 & 0.9 \end{bmatrix}$$
The following SAS DATA step statements generate a series of market shares from this probability matrix. A transition is represented as the current period shares, y, and the previous period shares, x.
data m;
   /* Known Transition matrix */
   array p[4,4] (0.7 .4 .0 .1
                 0.1 .5 .4 .0
                 0.0 .1 .6 .0
                 0.2 .0 .0 .9 ) ;
   /* Initial Market shares */
   array y[4] y1-y4 ( .4 .3 .2 .1 );
   array x[4] x1-x4;
   drop p1-p16 i;
   do i = 1 to 3;
      x[1] = y[1]; x[2] = y[2];
      x[3] = y[3]; x[4] = y[4];
      y[1] = p[1,1] * x1 + p[1,2] * x2 + p[1,3] * x3 + p[1,4] * x4;
      y[2] = p[2,1] * x1 + p[2,2] * x2 + p[2,3] * x3 + p[2,4] * x4;
      y[3] = p[3,1] * x1 + p[3,2] * x2 + p[3,3] * x3 + p[3,4] * x4;
      y[4] = p[4,1] * x1 + p[4,2] * x2 + p[4,3] * x3 + p[4,4] * x4;
      output;
   end;
run;
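As a worked check of the first transition this DATA step writes, the known transition matrix applied to the initial shares (0.4, 0.3, 0.2, 0.1) gives:

```latex
\begin{aligned}
y_1 &= 0.7(0.4) + 0.4(0.3) + 0.0(0.2) + 0.1(0.1) = 0.41 \\
y_2 &= 0.1(0.4) + 0.5(0.3) + 0.4(0.2) + 0.0(0.1) = 0.27 \\
y_3 &= 0.0(0.4) + 0.1(0.3) + 0.6(0.2) + 0.0(0.1) = 0.15 \\
y_4 &= 0.2(0.4) + 0.0(0.3) + 0.0(0.2) + 0.9(0.1) = 0.17
\end{aligned}
```

The new shares again sum to one (0.41 + 0.27 + 0.15 + 0.17 = 1), as they must when every column of the transition matrix sums to one.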
The following SAS statements estimate the transition matrix by using only the first transition.
proc entropy markov pure data=m(obs=1);
   model y1-y4 = x1-x4;
run;
The MARKOV option implies NOINT for each model, that the sum of the parameters in each column is one, and chooses support points of 0 and 1. This model can be expressed equivalently as
proc entropy pure data=m(obs=1) ;
   priors y1.x1 0 1 y1.x2 0 1 y1.x3 0 1 y1.x4 0 1;
   priors y2.x1 0 1 y2.x2 0 1 y2.x3 0 1 y2.x4 0 1;
   priors y3.x1 0 1 y3.x2 0 1 y3.x3 0 1 y3.x4 0 1;
   priors y4.x1 0 1 y4.x2 0 1 y4.x3 0 1 y4.x4 0 1;
   model y1 = x1-x4 / noint;
   model y2 = x1-x4 / noint;
   model y3 = x1-x4 / noint;
   model y4 = x1-x4 / noint;
   restrict y1.x1 + y2.x1 + y3.x1 + y4.x1 = 1;
   restrict y1.x2 + y2.x2 + y3.x2 + y4.x2 = 1;
   restrict y1.x3 + y2.x3 + y3.x3 + y4.x3 = 1;
   restrict y1.x4 + y2.x4 + y3.x4 + y4.x4 = 1;
run;
Variable    Estimate
y1.x1       0.463407
y1.x2       0.41055
y1.x3       0.356272
y1.x4       0.302163
y2.x1       0.272755
y2.x2       0.271459
y2.x3       0.267252
y2.x4       0.260084
y3.x1       0.119926
y3.x2       0.148481
y3.x3       0.180224
y3.x4       0.214394
y4.x1       0.143903
y4.x2       0.169504
y4.x3       0.196252
y4.x4       0.223364
Note that P varies greatly from the true solution. If two transitions are used instead (OBS=2), the resulting transition matrix is shown in Figure 12.19.
proc entropy markov pure data=m(obs=2);
   model y1-y4 = x1-x4;
run;

Variable    Estimate
y1.x1       0.721012
y1.x2       0.355703
y1.x3       0.026095
y1.x4       0.096654
y2.x1       0.083987
y2.x2       0.53886
y2.x3       0.373668
y2.x4       0.000133
y3.x1       0.000062
y3.x2       0.099848
y3.x3       0.600104
y3.x4       7.871E-8
y4.x1       0.194938
y4.x2       0.00559
y4.x3       0.000133
y4.x4       0.903214
This transition matrix is much closer to the actual transition matrix. If, in addition to the transitions, you had other information about the transition matrix, such as your own company's transition values, that information can be added as restrictions to the parameter estimates. For noisy data, the PURE option should be dropped. Note that this example has six zero probabilities in the transition matrix; the accurate estimation of transition matrices with fewer zero probabilities generally requires more transition observations.
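Side information of this kind might be supplied with a RESTRICT statement. The following is a minimal sketch under stated assumptions: the value 0.9 for the retention probability of company 4 is hypothetical, and the parameter name y4.x4 is taken from the names printed in the estimates table rather than from a documented naming rule for the MARKOV option.

```sas
/* Sketch: impose a known transition probability as a restriction.  */
/* The value 0.9 is a hypothetical piece of outside information.    */
proc entropy markov pure data=m(obs=1);
   restrict y4.x4 = 0.9;       /* company 4 retains 90% of its customers */
   model y1-y4 = x1-x4;
run;
```

With one element of the last column fixed, the remaining probabilities in that column must share the other 0.1 because each column still sums to one.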
The dependent variable in this data, job, takes on values 0 through 4. Support points are used only for the error terms, so error supports are specified on the MODEL statement.
proc entropy data=kpdata gmed tech=nra;
   model job = x1 x2 x3 x4 / noint
         esupports=( -.1 -0.0666 -0.0333 0 0.0333 0.0666 .1 );
run;

Variable    Estimate    t Value
x1_1        1.885831     1.40
x2_1        -0.00308    -0.20
x3_1        -0.17823    -2.04
x4_1        1.03477      1.49
x1_2        0.187073     0.15
x2_2        0.019307     1.34
x3_2        0.003994     0.05
x4_2        0.269913     0.47
x1_3        -4.47704    -2.76
x2_3        0.025457     1.55
x3_3        0.235925     2.43
x4_3        1.25614      1.51
x1_4        -9.56275    -6.14
x2_4        0.026694     1.73
x3_4        0.650215     6.99
x4_4        1.444471     2.09
Note there are five estimates of the parameters produced for each regressor, one for each choice. The first choice is restricted to zero for normalization purposes. PROC ENTROPY drops the zeroed regressors. PROC ENTROPY also generates tables of marginal effects for each regressor. The following statements generate the marginal effects table for the previous analysis at the means of the variables.
proc entropy data=kpdata gmed tech=nra;
   model job = x1 x2 x3 x4 / noint
         esupports=( -.1 -0.0666 -0.0333 0 0.0333 0.0666 .1 )
         marginals;
run;

Variable    Mean
x1_0        1
x2_0        20.50148
x3_0        13.09496
x4_0        0.916914
x1_1        1
x2_1        20.50148
x3_1        13.09496
x4_1        0.916914
x1_2        1
x2_2        20.50148
x3_2        13.09496
x4_2        0.916914
x1_3        1
x2_3        20.50148
x3_3        13.09496
x4_3        0.916914
x1_4        1
x2_4        20.50148
x3_4        13.09496
x4_4        0.916914
The marginals are derivatives of the probabilities with respect to each variable and so summarize how a small change in each variable affects the overall probability. PROC ENTROPY also enables the user to specify where the derivative is evaluated, as shown below:
proc entropy data=kpdata gmed tech=nra;
   model job = x1 x2 x3 x4 / noint
         esupports=( -.1 -0.0666 -0.0333 0 0.0333 0.0666 .1 )
         marginals=( x2=.4 x3=10 x4=0);
run;

            Marginal                Marginal Effect at
Variable    Effect      Mean        User Supplied Values
x1_0        0.331414    1           -0.10594
x2_0        -0.00186    20.50148    -0.00202
x3_0        -0.02079    13.09496    0.010599
x4_0        -0.09815    0.916914    -0.13625
x1_1        0.853981    1           0.472546
x2_1        -0.00343    20.50148    -0.0032
x3_1        -0.0644     13.09496    -0.04401
x4_1        0.034867    0.916914    0.172753
x1_2        0.853874    1           -0.06716
x2_2        0.000952    20.50148    0.004337
x3_2        -0.04913    13.09496    0.014829
x4_2        -0.16118    0.916914    -0.07592
x1_3        -0.25376    1           -0.16652
x2_3        0.001479    20.50148    0.000628
x3_3        0.008944    13.09496    0.009385
x4_3        0.064326    0.916914    0.026551
x1_4        -1.78551    1           -0.13292
x2_4        0.00286     20.50148    0.000261
x3_4        0.125377    13.09496    0.009199
x4_4        0.160136    0.916914    0.012866
In this example, you evaluate the derivative when x1=1, x2=0.4, x3=10, and x4=0. If the user neglects a variable, PROC ENTROPY uses its mean value.
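One quick consistency check on a marginal effects table, offered here as a reader's aid rather than as part of the procedure output: because the choice probabilities sum to one at every value of the regressors, the marginal effects of any one variable must sum to zero across the five choices. Using the Marginal Effect values reported for x2 (x2_0 through x2_4):

```latex
\sum_{j=0}^{4} \frac{\partial p_j}{\partial x_2}
  = -0.00186 - 0.00343 + 0.000952 + 0.001479 + 0.00286
  = 0.000001 \approx 0
```

The small remainder is rounding in the printed values.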
PROC ENTROPY options ;
   BOUNDS bound1 < , bound2, . . . > ;
   BY variable < variable . . . > ;
   ID variable < variable . . . > ;
   MODEL variable = variable < variable > . . . < / options > ;
   PRIORS variable < support points > variable < value > . . . ;
   RESTRICT restriction1 < , restriction2 . . . > ;
   TEST < name > test1 < , test2 . . . > < / options > ;
   WEIGHT variable ;
Functional Summary
The statements and options in the ENTROPY procedure are summarized in the following table.
Description                                                       Statement   Option

Data Set Options
specify the input data set for the variables                      ENTROPY     DATA=
specify the input data set for support points and priors          ENTROPY     PDATA=
specify the output data set for residual, predicted,              ENTROPY     OUT=
   and actual values
specify the output data set for the support points and priors     ENTROPY     OUTP=
write the covariance matrix of the estimates to                    ENTROPY     OUTCOV
   OUTEST= data set
write the parameter estimates to a data set                        ENTROPY     OUTEST=
write the Lagrange multiplier estimates to a data set              ENTROPY     OUTL=
write the covariance matrix of the equation errors                 ENTROPY     OUTS=
   to a data set
write the S matrix used in the objective function                  ENTROPY     OUTSUSED=
   definition to a data set
read the covariance matrix of the equation errors                  ENTROPY     SDATA=

Printing Options
request that the procedure produce graphics via the                ENTROPY     PLOTS=
   Output Delivery System
print collinearity diagnostics                                     ENTROPY     COLLIN
suppress the normal printed output                                 ENTROPY     NOPRINT

Options to Control Iteration Output
print a summary iteration listing                                  ENTROPY     ITPRINT

Options to Control the Minimization Process
specify the convergence criteria                                   ENTROPY     CONVERGE=
specify the maximum number of iterations allowed                   ENTROPY     MAXITER=
specify the maximum number of subiterations allowed                ENTROPY     MAXSUBITER=
select the iterative minimization method to use                    ENTROPY     METHOD=

Statements That Declare Variables
specify BY-group processing                                        BY
specify a weight variable                                          WEIGHT
specify identifying variables                                      ID

General PROC ENTROPY Statement Options
specify seemingly unrelated regression                             ENTROPY     SUR
specify iterated seemingly unrelated regression                    ENTROPY     ITSUR
specify data-constrained generalized maximum entropy               ENTROPY     GMED
specify normed moment generalized maximum entropy                  ENTROPY     GMENM
specify the denominator for computing variances                    ENTROPY     VARDEF=
   and covariances

General TEST Statement Options
specify that a Wald test be computed                               TEST        WALD
specify that a Lagrange multiplier test be computed                TEST        LM
specify that a likelihood ratio test be computed                   TEST        LR
request all three types of tests                                   TEST        ALL
General Options
COLLIN
prints collinearity diagnostics.
COVBEST=CROSS | GME | GMENM
specifies the method for producing the covariance matrix of parameters for output and for standard error calculations. GMENM and GME are aliases and are the default.
GME | GCE
requests generalized maximum entropy or generalized cross entropy. This is the default estimation method.
GMENM | GCENM
requests normed moment maximum entropy or the normed moment cross entropy.
GMED
requests data-constrained generalized maximum entropy.
VARDEF=N | WGT | DF | WDF
specifies the denominator to be used in computing variances and covariances. VARDEF=N specifies that the number of nonmissing observations be used. VARDEF=WGT specifies that the sum of the weights be used. VARDEF=DF specifies that the number of nonmissing observations minus the model degrees of freedom (number of parameters) be used. VARDEF=WDF specifies that the sum of the weights minus the model degrees of freedom be used. The default is VARDEF=DF.
Data Set Options
DATA=SAS-data-set
specifies the input data set. Values for the variables in the model are read from this data set.
PDATA=SAS-data-set
names the SAS data set that contains the data about priors and supports.
OUT=SAS-data-set
names the SAS data set to contain the residuals from each estimation.
OUTCOV | COVOUT
writes the covariance matrix of the estimates to the OUTEST= data set in addition to the parameter estimates. The OUTCOV option is applicable only if the OUTEST= option is also specified.
OUTEST=SAS-data-set
names the SAS data set to contain the parameter estimates and optionally the covariance of the estimates.
OUTL=SAS-data-set
names the SAS data set to contain the estimated Lagrange multipliers for the models.
OUTP=SAS-data-set
names the SAS data set to contain the support points and estimated probabilities.
OUTS=SAS-data-set
names the SAS data set to contain the estimated covariance matrix of the equation errors. This is the covariance of the residuals computed from the parameter estimates.
OUTSUSED=SAS-data-set
names the SAS data set to contain the S matrix used in the objective function definition. The OUTSUSED= data set is the same as the OUTS= data set for the methods that iterate the S matrix.
SDATA=SAS-data-set
specifies a data set that provides the covariance matrix of the equation errors. The matrix read from the SDATA= data set is used for the equation error covariance matrix (S matrix) in the estimation. The SDATA= matrix is used to provide only the initial estimate of S for the methods that iterate the S matrix.
Printing Options
ITPRINT
prints the parameter estimates, objective function value, and convergence criteria at each iteration.
NOPRINT
suppresses the normal printed output but does not suppress error listings. Using any other print option turns the NOPRINT option off.
PLOTS=global-plot-options | plot-request
requests that the ENTROPY procedure produce statistical graphics via the Output Delivery System, provided that the ODS GRAPHICS statement has been specified. For general information about ODS Graphics, see Chapter 21, "Statistical Graphics Using ODS" (SAS/STAT User's Guide). The global-plot-options apply to all relevant plots generated by the ENTROPY procedure. The global-plot-options supported by the ENTROPY procedure are as follows:
ONLY
suppresses the default plots. Only the plots specifically requested are produced.
UNPACKPANEL
breaks a graphic that is otherwise paneled into individual component plots.
The specific plot-request values supported by the ENTROPY procedure are as follows:
ALL
requests that all plots appropriate for the particular analysis be produced. ALL is equivalent to specifying FITPLOT, COOKSD, QQ, RESIDUALHISTOGRAM, and STUDENTRESIDUAL.
FITPLOT
plots the predicted and actual values.
COOKSD
produces the Cook's D plot.
QQ
produces a Q-Q plot of residuals.
RESIDUALHISTOGRAM
plots the histogram of residuals.
STUDENTRESIDUAL
plots the studentized residuals.
When ODS graphics are enabled, the default behavior is to plot all plots appropriate for the particular analysis (ALL) in a panel.
Options to Control the Minimization Process
CONVERGE=value
specifies the convergence criteria for S-iterated methods. The convergence measure computed during model estimation must be less than value before convergence is assumed. The default value is CONVERGE=0.001.
DUAL | PRIMAL
specifies whether the optimization problem is solved using the dual or primal form. The dual form is the default.
MAXITER=n
specifies the maximum number of iterations allowed.
MAXSUBITER=n
specifies the maximum number of subiterations allowed for an iteration. The MAXSUBITER= option limits the number of step halvings. The default is MAXSUBITER=30.
METHOD=TR | NEWRAP | NRR | QN | CONGR | NSIMP | DBLDOG | LEVMAR
TECHNIQUE=TR | NEWRAP | NRR | QN | CONGR | NSIMP | DBLDOG | LEVMAR
TECH=TR | NEWRAP | NRR | QN | CONGR | NSIMP | DBLDOG | LEVMAR
specifies the iterative minimization method to use. METHOD=TR specifies the trust region method, METHOD=NEWRAP specifies the Newton-Raphson method, METHOD=NRR specifies the Newton-Raphson ridge method, and METHOD=QN specifies the quasi-Newton method. See Chapter 6, "Nonlinear Optimization Methods," for more details about optimization methods. The default is METHOD=QN for the dual form and METHOD=NEWRAP for the primal form.
BOUNDS Statement
BOUNDS bound1 < , bound2 . . . > ;
The BOUNDS statement imposes simple boundary constraints on the parameter estimates. BOUNDS statement constraints refer to the parameters estimated by the ENTROPY procedure. You can specify any number of BOUNDS statements. Each boundary constraint is composed of variables, constants, and inequality operators in the following form:
item operator item <,operator item <,operator item ...> >
Each item is a constant, the name of a regressor variable, or a list of regressor names. Each operator is <, >, <=, or >=. You can use either the BOUNDS statement or the RESTRICT statement to impose boundary constraints; the BOUNDS statement provides a simpler syntax for specifying inequality constraints. See the section "RESTRICT Statement" on page 648 for more information about the computational details of estimation with inequality restrictions. Lagrange multipliers are reported for all the active boundary constraints. In the printed output and in the OUTEST= data set, the Lagrange multiplier estimates are identified with the names BOUND1, BOUND2, and so forth. The probabilities of the Lagrange multipliers are computed using a beta distribution (LaMotte 1994). Nonactive or nonbinding bounds have no effect on the estimation results and are not noted in the output. To give the constraints more descriptive names, use the RESTRICT statement instead of the BOUNDS statement. The following BOUNDS statement constrains the estimates of the coefficients of WAGE and TARGET and the 10 coefficients of x1 through x10 to be between zero and one. This example illustrates the use of parameter lists to specify boundary constraints.
bounds 0 < wage target x1-x10 < 1;
The following is an example of the use of the BOUNDS statement to impose boundary constraints on the variables X1, X2, and X3:
The parameter estimates from this run are shown in Figure 12.23.
Figure 12.23 Output from Bounded Estimation
Prior Distribution of Parameter T

The ENTROPY Procedure

Variables(Supports(Weights))    x1 x2 x3 Intercept
Equations(Supports(Weights))    y

GME-NM Estimation Summary

Data Set Options
DATA=    WORK.ZERO

Minimization Summary
Parameters Estimated    4
Covariance Estimator    GME-NM
Entropy Type            Shannon
Entropy Form            Dual
Numerical Optimizer     Newton-Raphson

Final Information Measures
Objective Function Value       6.292861
Signal Entropy                 6.375715
Noise Entropy                  -0.08285
Normed Entropy (Signal)        0.990364
Normed Entropy (Noise)         1.004172
Parameter Information Index    0.009636
Error Information Index        -0.00417

Observations Processed
Read    20
Used    20

Equation    SSE        MSE        R-Square
y           1665620    83281.0    -0.0013

GME-NM Variable Estimates
Variable           x1  x2  x3  Intercept
Approx Std Err     0  0  0  3.406E-6  9130.3  0  0
Approx Pr > |t|    .  .  .  <.0001  0.9999  .  .
BY Statement
BY variables ;
A BY statement is used to obtain separate estimates for observations in groups defined by the BY variables. To save parameter estimates for each BY group, use the OUTEST= option.
ID Statement
ID variables ;
The ID statement specifies variables to identify observations in error messages or other listings and in the OUT= data set. The ID variables are normally SAS date or datetime variables. If more than one ID variable is used, the first variable is used to identify the observations and the remaining variables are added to the OUT= data set.
MODEL Statement
MODEL dependent = regressors < / options > ;
The MODEL statement specifies the dependent variable and independent regressor variables for the regression model. If no independent variables are specified in the MODEL statement, only the mean (intercept) is estimated. To model a system of equations, specify more than one MODEL statement. The following options can be used in the MODEL statement after a slash (/).
ESUPPORTS=( support (prior) . . . )
specifies the support points and prior weights on the residuals for the specified equation. The default is the following five support values:
$$-10 \times value,\quad -value,\quad 0,\quad value,\quad 10 \times value$$
where value is computed as
$$value = (\max(y) - \bar{y}) \times multiplier$$
for GME, where $y$ is the dependent variable, and
$$value = (\max(y) - \bar{y}) \times multiplier \times nobs \times \max(X) \times 0.1$$
for generalized maximum entropy with normed moments (GME-NM), where X is the information matrix, and nobs is the number of observations. The multiplier depends on the MULTIPLIER= option. The MULTIPLIER= option defaults to 2 for unrestricted models and to 4 for restricted models. The prior probabilities default to the following:
$$0.0005,\quad 0.333,\quad 0.333,\quad 0.333,\quad 0.0005$$
The support points and prior weights are selected so that hypothesis tests can be performed without adding significant bias to the estimation. These prior probability values are ad hoc.
NOINT
suppresses the intercept parameter.
MARGINALS < = ( variable = value . . . ) >
requests that the marginal effects of each variable be calculated for GME-D. Specifying the MARGINALS option with an optional list of values calculates the marginals at that vector of values. For example, if x1 through x4 are explanatory variables, then including MARGINALS = ( x1 = 2, x2 = 4, x3 = 1, x4 = 5) calculates the marginal effects at that vector. A skipped variable implies that its mean value is to be used.
CENSORED ( ( UB | LB) = (variable | value ), ESUPPORTS =( support (prior) . . . ) )
specifies that the dependent variable be observed with censoring and specifies the censoring thresholds and the supports of the censored observations.
CATEGORY= variable
specifies the variable that keeps track of the categories the dependent variable is in when there is range censoring. When the actual value is observed, this variable should be set to MISSING.
RANGE ( ID = (QS | INT) L = ( NUMBER ) R = ( NUMBER ) , ESUPPORTS=( support < (prior) > . . . ) )
specifies that the dependent variable be range bound. The RANGE option defines the range and the key ( RANGE ) that is used to identify the observation as being range bound. The RANGE = value should be some value in the CATEGORY= variable. The L and R define, respectively, the left endpoint of the range and the right endpoint of the range. ESUPPORTS sets the error supports on the variable.
PRIORS Statement
PRIORS variable < support points < (priors) > > variable < support points < (priors) > > . . . ;
The PRIORS statement specifies the support points and prior weights for the coefficients on the variables. Support points for coefficients default to five points, determined as follows:
$$-2 \times value,\quad -value,\quad 0,\quad value,\quad 2 \times value$$
where value is computed from the mean, the stderr, and the multiplier. The mean and the stderr are obtained from OLS, and the multiplier depends on the MULTIPLIER= option. The MULTIPLIER= option defaults to 2 for unrestricted models and to 4 for restricted models. The prior probabilities for each support point default to the uniform distribution. The number of support points must be at least two. If priors are specified, they must be positive and there must be the same number of priors as there are support points. Priors and support points can also be specified through the PDATA= data set.
RESTRICT Statement
RESTRICT restriction1 < , restriction2 . . . > ;
The RESTRICT statement is used to impose linear restrictions on the parameter estimates. You can specify any number of RESTRICT statements. Each restriction is written as an optional name, followed by an expression, followed by an equality operator (=) or an inequality operator (<, >, <=, >=), followed by a second expression:
<name > expression operator expression
The optional name is a string used to identify the restriction in the printed output and in the OUTEST= data set. The operator can be =, <, >, <=, or >=. The operator and second expression are optional, as in the TEST statement, where they default to = 0.
Restriction expressions can be composed of variable names, multiplication (*) and addition (+) operators, and constants. Variable names in restriction expressions must be among the variables whose coefficients are estimated by the model. The restriction expressions must be a linear function of the variables. The following is an example of the use of the RESTRICT statement:
proc entropy data=one;
   restrict y1.x1*2 <= x2 + y2.x1;
   model y1 = x1 x2;
   model y2 = x1 x3;
run;
This example illustrates the use of compound names, y1.x1, to specify coefficients of specific equations.
TEST Statement
TEST < name > test1 < , test2 . . . > < / options > ;
The TEST statement performs tests of linear hypotheses on the model parameters. The TEST statement applies only to parameters estimated in the model. You can specify any number of TEST statements. Each test is written as an expression optionally followed by an equal sign (=) and a second expression:
expression < = expression >
Test expressions can be composed of variable names, multiplication (*), addition (+), and subtraction (-) operators, and constants. Variables named in test expressions must be among the variables estimated by the model. If you specify only one expression in a TEST statement, that expression is tested against zero. For example, the following two TEST statements are equivalent:
test a + b;
test a + b = 0;
When you specify multiple tests on the same TEST statement, a joint test is performed. For example, the following TEST statement tests the joint hypothesis that both of the coefficients on a and b are equal to zero:
test a, b;
To perform separate tests rather than a joint test, use separate TEST statements. For example, the following TEST statements test the two separate hypotheses that a is equal to zero and that b is equal to zero:
test a;
test b;
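The forms above might be combined in a single run as in the following sketch. It reuses the data set ONE and the variables y1, x1, and x2 from the RESTRICT example purely for illustration; the hypotheses themselves are assumptions chosen to show the syntax.

```sas
/* Sketch: a single-hypothesis test and a joint test in one run.     */
/* Data set ONE and the hypotheses tested are illustrative only.     */
proc entropy data=one;
   model y1 = x1 x2;
   test x1 + x2 = 1;    /* one linear hypothesis                     */
   test x1, x2;         /* joint hypothesis: both coefficients zero  */
run;
```

Each TEST statement produces its own result, so the first statement does not affect the joint test in the second.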
OUT=SAS-data-set
specifies the name of an output SAS data set that contains the test results. The format of the OUT= data set produced by the TEST statement is similar to that of the OUTEST= data set.
WEIGHT Statement
WEIGHT variable ;
The WEIGHT statement specifies a variable to supply weighting values to use for each observation in estimating parameters. If the weight of an observation is nonpositive, that observation is not used for the estimation. Variable must be a numeric variable in the input data set. The regressors and the dependent variables are multiplied by the square root of the weight variable to form the weighted X matrix and the weighted dependent variable. The same weight is used for all MODEL statements.
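Written out, the square-root scaling looks as follows. The symbol $\omega_t$ for the weight of observation $t$ is notation introduced here for illustration, chosen to avoid confusion with the error weights $w$ used elsewhere in this chapter.

```latex
\tilde{y}_t = \sqrt{\omega_t}\, y_t ,
\qquad
\tilde{x}_{tk} = \sqrt{\omega_t}\, x_{tk}
\qquad (\omega_t > 0)

% Squared residuals in the scaled variables are weighted squared residuals:
\Bigl(\tilde{y}_t - \sum_{k} \tilde{x}_{tk}\beta_k\Bigr)^{2}
  = \omega_t \Bigl(y_t - \sum_{k} x_{tk}\beta_k\Bigr)^{2}
```

For example, an observation with weight 4 has its dependent variable and regressors multiplied by 2, so its squared residual counts four times as much as that of an observation with weight 1.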
$$\text{maximize}\quad -\sum_i p_i \ln(p_i)$$
subject to
$$\sum_i p_i = 1$$
where $p_i$ is the probability associated with the ith support point. Properties that characterize the entropy measure are set forth by Kapur and Kesavan (1992). The objective is to maximize the entropy of the distribution with respect to the probabilities $p_i$ and subject to constraints that reflect any other known information about the distribution (Jaynes 1957). This measure, in the absence of additional information, reaches a maximum when the probabilities are uniform. A distribution other than the uniform distribution arises from information already known.
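The claim that entropy peaks at the uniform distribution can be verified directly in the simplest case of two support points, where the adding-up constraint leaves a single free probability $p$:

```latex
H(p) = -p\ln(p) - (1-p)\ln(1-p)

\frac{dH}{dp} = \ln\!\left(\frac{1-p}{p}\right) = 0
\;\Longrightarrow\; p = \tfrac{1}{2},
\qquad
H\!\left(\tfrac{1}{2}\right) = \ln 2 \approx 0.693
```

At the other extreme, as $p$ approaches 0 or 1 the entropy falls to zero, which corresponds to a distribution that puts all its mass on one support point.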
With these reparameterizations, the GME estimation problem is
$$
\begin{aligned}
\text{maximize}\quad & H(p,w) = -p'\ln(p) - w'\ln(w) \\
\text{subject to}\quad & y = XZp + Vw \\
& 1_K = (I_K \otimes 1_L')\,p \\
& 1_T = (I_T \otimes 1_L')\,w
\end{aligned}
$$
where $y$ denotes the column vector of length $T$ of the dependent variable; $X$ denotes the $(T \times K)$ matrix of observations of the independent variables; $p$ denotes the $LK$ column vector of weights associated with the points in $Z$; $w$ denotes the $LT$ column vector of weights associated with the points in $V$; $1_K$, $1_L$, and $1_T$ are K-, L-, and T-dimensional column vectors, respectively, of ones; and $I_K$ and $I_T$ are $(K \times K)$ and $(T \times T)$ dimensional identity matrices. These equations can be rewritten using set notation as follows:
$$
\begin{aligned}
\text{maximize}\quad & H(p,w) = -\sum_{l=1}^{L}\sum_{k=1}^{K} p_{kl}\ln(p_{kl}) - \sum_{l=1}^{L}\sum_{t=1}^{T} w_{tl}\ln(w_{tl}) \\
\text{subject to}\quad & y_t = \sum_{l=1}^{L}\left[\sum_{k=1}^{K}\left(X_{kt} Z_{kl}\, p_{kl}\right) + V_{tl}\, w_{tl}\right] \\
& \sum_{l=1}^{L} p_{kl} = 1 \quad\text{and}\quad \sum_{l=1}^{L} w_{tl} = 1
\end{aligned}
$$
The subscript l denotes the support point (l=1, 2, ..., L), k denotes the parameter (k=1, 2, ..., K), and t denotes the observation (t=1, 2, ..., T). The GME objective is strictly concave; therefore, a unique solution exists. The optimal estimated probabilities, p and w, and the prior supports, Z and V, can be used to form the point estimates of the unknown parameters, $\beta$, and the unknown errors, e.
$$\text{minimize}\quad \sum_i p_i \ln(p_i / q_i)$$
subject to
$$\sum_i p_i = 1$$
where $q_i$ is the probability associated with the ith point in the distribution from which the discrepancy is measured. The $q_i$ (in conjunction with the support) are often referred to as the prior distribution. The measure is nonnegative and is equal to zero when $p_i$ equals $q_i$. The properties of the cross entropy measure are examined by Kapur and Kesavan (1992). The principle of minimum cross entropy (Kullback 1959; Good 1963) states that one should choose probabilities that are as close as possible to the prior probabilities. That is, out of all probability distributions that satisfy a given set of constraints which reflect known information about the distribution, choose the distribution that is closest (as measured by $p(\ln(p) - \ln(q))$) to the prior distribution. When the prior distribution is uniform, maximum entropy and minimum cross entropy produce the same results (Kapur and Kesavan 1992), where the higher values for entropy correspond exactly with the lower values for cross entropy. If the prior distributions are nonuniform, the problem can be stated as a generalized cross entropy (GCE) formulation. The cross entropy terminology specifies weights, $q_i$ and $u_i$, for the points Z and V, respectively. Given informative prior distributions on Z and V, the GCE problem is
$$
\begin{aligned}
\text{minimize}\quad & I(p,q,w,u) = p'\ln(p/q) + w'\ln(w/u) \\
\text{subject to}\quad & y = XZp + Vw \\
& 1_K = (I_K \otimes 1_L')\,p \\
& 1_T = (I_T \otimes 1_L')\,w
\end{aligned}
$$
where $y$ denotes the $T$ column vector of observations of the dependent variables; $X$ denotes the $(T \times K)$ matrix of observations of the independent variables; $q$ and $p$ denote $LK$ column vectors of prior and posterior weights, respectively, associated with the points in $Z$; $u$ and $w$ denote the $LT$ column vectors of prior and posterior weights, respectively, associated with the points in $V$; $1_K$, $1_L$, and $1_T$ are K-, L-, and T-dimensional column vectors, respectively, of ones; and $I_K$ and $I_T$ are $(K \times K)$ and $(T \times T)$ dimensional identity matrices.
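The equivalence under a uniform prior can be seen in one line. With $L$ support points and $q_i = 1/L$ for every $i$, and writing $H(p) = -\sum_i p_i \ln(p_i)$ for the entropy:

```latex
\sum_{i=1}^{L} p_i \ln\!\left(\frac{p_i}{1/L}\right)
  = \sum_{i=1}^{L} p_i \ln(p_i) + \ln(L)\sum_{i=1}^{L} p_i
  = \ln(L) - H(p)
```

Because $\ln(L)$ is a constant, minimizing the cross entropy is the same as maximizing $H(p)$, and higher entropy values correspond exactly to lower cross entropy values.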
The optimization problem can be rewritten using set notation as follows:
$$
\begin{aligned}
\text{minimize}\quad & I(p,q,w,u) = \sum_{l=1}^{L}\sum_{k=1}^{K} p_{kl}\ln(p_{kl}/q_{kl}) + \sum_{l=1}^{L}\sum_{t=1}^{T} w_{tl}\ln(w_{tl}/u_{tl}) \\
\text{subject to}\quad & y_t = \sum_{l=1}^{L}\left[\sum_{k=1}^{K}\left(X_{kt} Z_{kl}\, p_{kl}\right) + V_{tl}\, w_{tl}\right] \\
& \sum_{l=1}^{L} p_{kl} = 1 \quad\text{and}\quad \sum_{l=1}^{L} w_{tl} = 1
\end{aligned}
$$
The subscript l denotes the support point (l=1, 2, ..., L), k denotes the parameter (k=1, 2, ..., K), and t denotes the observation (t=1, 2, ..., T). The objective function is strictly convex; therefore, there is a unique global minimum for the problem (Golan, Judge, and Miller 1996). The optimal estimated weights, p and w, and the prior supports, Z and V, can be used to form the point estimates of the unknown parameters, $\beta$, and the unknown errors, e, by using
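$$
\beta = Zp =
\begin{bmatrix}
z_{11} & \cdots & z_{L1} & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
0 & \cdots & 0 & z_{12} & \cdots & z_{L2} & \cdots & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & z_{1K} & \cdots & z_{LK}
\end{bmatrix}
\begin{bmatrix}
p_{11} \\ \vdots \\ p_{L1} \\ p_{12} \\ \vdots \\ p_{L2} \\ \vdots \\ p_{1K} \\ \vdots \\ p_{LK}
\end{bmatrix}
$$
$$
e = Vw =
\begin{bmatrix}
v_{11} & \cdots & v_{L1} & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
0 & \cdots & 0 & v_{12} & \cdots & v_{L2} & \cdots & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & v_{1T} & \cdots & v_{LT}
\end{bmatrix}
\begin{bmatrix}
w_{11} \\ \vdots \\ w_{L1} \\ w_{12} \\ \vdots \\ w_{L2} \\ \vdots \\ w_{1T} \\ \vdots \\ w_{LT}
\end{bmatrix}
$$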
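Each row of these products is simply a probability-weighted average of one parameter's (or one error's) support points. As a small illustration with hypothetical numbers, a single parameter with three support points $(-10, 0, 10)$ and estimated probabilities $(0.1, 0.3, 0.6)$ gives:

```latex
\beta_k = \sum_{l=1}^{3} z_{lk}\, p_{lk}
        = (-10)(0.1) + (0)(0.3) + (10)(0.6)
        = 5
```

The point estimate therefore always lies between the smallest and largest support point, which is why supports that exclude the true value prevent the estimation from reaching it.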
Computational Details
This constrained estimation problem can be solved either directly (primal) or by using the dual form. Either way, it is prudent to factor out one probability for each parameter and each observation as the sum of the other probabilities. This factoring reduces the computational complexity significantly. If the primal formulation is used and two support points are used for the parameters and the errors, the resulting GME problem is $O((nparms + nobs)^3)$. For the dual form, the problem is $O(nobs^3)$. Therefore for large data sets, GME-NM should be used instead of GME.
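$$
\begin{aligned}
\text{maximize}\quad & H(p,w) = -p'\ln(p) - w'\ln(w) \\
\text{subject to}\quad & X'y = X'XZp + X'Vw \\
& 1_K = (I_K \otimes 1_L')\,p \\
& 1_T = (I_T \otimes 1_L')\,w
\end{aligned}
$$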
There is also the cross entropy version of GME-NM, which has the same form as GCE but with the normed constraints.
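$$H(p,w) = -p'\ln(p) - w'\ln(w)$$
where
$$\beta = Zp$$
$$
Z =
\begin{bmatrix}
z^{1}_{11} \cdots z^{1}_{L1} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \ddots & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & z^{K}_{11} \cdots z^{K}_{L1} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \ddots & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & z^{1}_{1M} \cdots z^{1}_{LM} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & \ddots & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & z^{K}_{1M} \cdots z^{K}_{LM}
\end{bmatrix}
$$
$$
p = \begin{bmatrix}
p^{1}_{11} & \cdots & p^{1}_{L1} & \cdots & p^{K}_{11} & \cdots & p^{K}_{L1} & \cdots & p^{1}_{1M} & \cdots & p^{1}_{LM} & \cdots & p^{K}_{1M} & \cdots & p^{K}_{LM}
\end{bmatrix}'
$$
$$e = Vw$$
$$
V =
\begin{bmatrix}
v^{1}_{11} \cdots v^{L}_{11} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \ddots & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & v^{1}_{1T} \cdots v^{L}_{1T} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \ddots & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & v^{1}_{M1} \cdots v^{L}_{M1} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & \ddots & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & v^{1}_{MT} \cdots v^{L}_{MT}
\end{bmatrix}
$$
$$
w = \begin{bmatrix}
w^{1}_{11} & \cdots & w^{L}_{11} & \cdots & w^{1}_{1T} & \cdots & w^{L}_{1T} & \cdots & w^{1}_{M1} & \cdots & w^{L}_{M1} & \cdots & w^{1}_{MT} & \cdots & w^{L}_{MT}
\end{bmatrix}'
$$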
y denotes the MT column vector of observations of the dependent variables; X denotes the (MT x KM ) matrix of observations for the independent variables; p denotes the LKM column vector of weights associated with the points in Z; w denotes the LMT column vector of weights associated with the points in V; 1L , 1KM , and 1M T are L-, KM-, and MT-dimensional column vectors, respectively, of ones; and IKM and IMT are (KM x KM) and (MT x MT) dimensional identity matrices. The subscript l denotes the support point .l D 1; 2; : : : ; L/, k denotes the parameter .k D 1; 2; : : : ; K/, m denotes the equation .m D 1; 2; : : : ; M /, and t denotes the observation .t D 1; 2; : : : ; T /. Using this notation, the maximum entropy problem that is analogous to the OLS problem used as the initial step of the traditional SUR approach is
$$\begin{aligned}
\text{maximize}\quad & H(p,w) = -p'\ln(p) - w'\ln(w) \\
\text{subject to}\quad & (y - XZp) = Vw \\
& 1_{KM} = (I_{KM} \otimes 1_L')\,p \\
& 1_{MT} = (I_{MT} \otimes 1_L')\,w
\end{aligned}$$
The results are GME-SUR estimates with independent errors, the analog of OLS. The covariance matrix $\hat{\Sigma}$ is computed based on the residual of the equations, $Vw = e$. An $L'L$ factorization of $\hat{\Sigma}$ is used to compute the square root of the matrix. After solving this problem, these entropy-based estimates are analogous to the Aitken two-step estimator. For iterative GME-SUR, the covariance matrix of the errors is recomputed, and a new $\hat{\Sigma}$ is computed and factored. As in traditional ITSUR, this process repeats until the covariance matrix and the parameter estimates converge. The estimation of the parameters for the normed-moment version of SUR (GME-SUR-NM) uses an identical process. The constraints for GME-SUR-NM are defined as:
$$X'y = X'(S^{-1} \otimes I)XZp + X'(S^{-1} \otimes I)Vw$$
The estimation of the parameters for GME-SUR-NM uses an identical process as outlined previously for GME-SUR.
$$\cdots = p_{ij} + \varepsilon_{ij}$$
The standard maximum likelihood approach for multinomial logit is equivalent to the maximum entropy solution for discrete choice models. The generalized maximum entropy approach avoids an assumption of the form of the link function $G(\cdot)$. The generalized maximum entropy for discrete choice models (GME-D) is written in primal form as
$$\text{maximize}\quad H(p,w) = -p'\ln(p) - w'\ln(w)$$
Golan, Judge, and Miller (1996) have shown that the dual unconstrained formulation of the GME-D can be viewed as a general class of logit models. Additionally, as the sample size increases, the solution of the dual problem approaches the maximum likelihood solution. Because of these characteristics, only the dual approach is available for the GME-D estimation method. The parameters $\beta_j$ are the Lagrange multipliers of the constraints. The covariance matrix of the parameter estimates is computed as the inverse of the Hessian of the dual form of the objective function.
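You have:
$$\begin{bmatrix} y^u = ub \\ y^a \\ y^l = lb \end{bmatrix} = \begin{bmatrix} X^u \\ X^a \\ X^l \end{bmatrix} Zp + \begin{bmatrix} V^u w^u \\ V^a w^a \\ V^l w^l \end{bmatrix}$$
The primal problem then becomes
$$\begin{aligned}
\text{maximize}\quad & H(p,w) = -p'\ln(p) - w'\ln(w) \\
\text{subject to}\quad & y^a = X^a Zp + V^a w^a \\
& y^u \le X^u Zp + V^u w^u \\
& y^l \ge X^l Zp + V^l w^l \\
& 1_K = (I_K \otimes 1_L')\,p \\
& 1_T = (I_T \otimes 1_L')\,w
\end{aligned}$$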
PROC ENTROPY requires that the number of supports be identical for all three regions. Alternatively, you can think of cases where the dependent variable is observed continuously for most of its range. However, only the variable's range is reported for some observations. Such data are often found in highly disaggregated state-level employment measures.
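The dependent variable is reported only as a range when $l_1 \le y \le r_1, \; \ldots, \; l_k \le y \le r_k$, and is observed as $y$ otherwise.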
Just as in the censored case, each range yields two inequality constraints for each observation in that range.
Information Measures
PROC ENTROPY returns several measures of fit. First, the value of the objective function is returned. Next, the signal entropy is provided followed by the noise entropy. The sum of the noise and signal entropies should equal the value of the objective function. The next two metrics that follow are the normed entropies of both the signal and the noise. Normalized entropy (NE) measures the relative informational content of both the signal and noise components through $p$ and $w$, respectively (Golan, Judge, and Miller 1996). Let $S$ denote the normalized entropy of the signal, $X\beta$, defined as:
$$S(\tilde{p}) = \frac{-\tilde{p}'\ln(\tilde{p})}{-\tilde{q}'\ln(\tilde{q})}$$
where $S(\tilde{p}) \in [0, 1]$. In the case of GME, where uniform priors are assumed, $S$ can be written as:
$$S(\tilde{p}) = \frac{-\tilde{p}'\ln(\tilde{p})}{\sum_i \ln(M_i)}$$
where $M_i$ is the number of support points for parameter $i$. A value of 0 for $S$ implies that there is no uncertainty regarding the parameters; hence, it is a degenerate situation. However, a value of 1 implies that the posterior distributions equal the priors, which indicates total uncertainty if the priors are uniform. Because NE is relative, it can be used for comparing various situations. Consider adding a data point to the model. If $S_{T+1} = S_T$, then there is no additional information contained within that data constraint. However, if $S_{T+1} < S_T$, then the data point gives a more informed set of parameter estimates. NE can be used for determining the importance of particular variables with regard to the reduction of the uncertainty they bring to the model. Each of the $k$ parameters that is estimated has an associated NE defined as
$$S(\tilde{p}_k) = \frac{-\tilde{p}_k'\ln(\tilde{p}_k)}{-\ln(\tilde{q}_k)}$$
where $\tilde{p}_k$ is the vector of support-point probabilities for parameter $k$ and $M$ is the corresponding number of support points. Since a value of 1 implies no relative information for that particular sample, Golan, Judge, and Miller (1996) suggest an exclusion criterion of $S(\tilde{p}_k) > 0.99$ as an acceptable means of selecting noninformative variables. See Golan, Judge, and Miller (1996) for some simulation results. The final set of measures of fit are the parameter information index and error information index. These measures can be best summarized as 1 − the appropriate normed entropy.
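As a minimal sketch of the normalized entropy calculation for a single parameter, the following DATA step evaluates $S$ and the corresponding information index for one set of estimated probabilities. The three probability values are made up for illustration, and a uniform prior is assumed, so the denominator reduces to $\ln(M)$.

```sas
/* Sketch only: the probabilities below are hypothetical, not PROC ENTROPY output */
data _null_;
   array p[3] _temporary_ (0.2 0.5 0.3);   /* estimated probabilities on M=3 supports */
   M = dim(p);
   h = 0;
   do l = 1 to M;
      /* accumulate -p' ln(p), treating 0*ln(0) as 0 */
      if p[l] > 0 then h = h - p[l] * log(p[l]);
   end;
   S = h / log(M);           /* normalized entropy under a uniform prior */
   info_index = 1 - S;       /* information index: 1 - normed entropy    */
   put S= 6.4 info_index= 6.4;
run;
```

A value of S near 1 here would indicate that the data added little information beyond the uniform prior for that parameter.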
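where
$$\hat{\sigma}_\gamma^2(\hat{\beta}) = \frac{1}{N}\sum_{i=1}^{N} \gamma_i^2$$
and $\gamma_i$ is the Lagrange multiplier associated with the $i$th row of the $Vw$ constraint matrix. Also,
$$\hat{\psi}^2(\hat{\beta}) = \left[ \frac{1}{N}\sum_{i=1}^{N} \left( \sum_{j=1}^{J} v_{ij}^2 w_{ij} - \Big( \sum_{j=1}^{J} v_{ij} w_{ij} \Big)^2 \right)^{-1} \right]^2$$
$$\cdots = \cdots + X'X\,\Sigma_z \cdots$$
$$\Sigma_z = \begin{bmatrix}
\sum_{l=1}^{L} z_{1l}^2 p_{1l} - \big(\sum_{l=1}^{L} z_{1l} p_{1l}\big)^2 & 0 & 0 \\
0 & \ddots & 0 \\
0 & 0 & \sum_{l=1}^{L} z_{Kl}^2 p_{Kl} - \big(\sum_{l=1}^{L} z_{Kl} p_{Kl}\big)^2
\end{bmatrix}$$
and
$$\Sigma_v = \begin{bmatrix}
\sum_{j=1}^{J} v_{1j}^2 w_{1j} - \big(\sum_{j=1}^{J} v_{1j} w_{1j}\big)^2 & 0 & 0 \\
0 & \ddots & 0 \\
0 & 0 & \sum_{j=1}^{J} v_{Kj}^2 w_{Kj} - \big(\sum_{j=1}^{J} v_{Kj} w_{Kj}\big)^2
\end{bmatrix}$$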
Statistical Tests
Since the GME estimates have been shown to be asymptotically normally distributed, the classical Wald, Lagrange multiplier, and likelihood ratio statistics can be used for testing linear restrictions on the parameters.
Wald Tests
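Let $H_0: L\beta = m$, where $L$ is a set of linearly independent combinations of the elements of $\beta$. Then under the null hypothesis, the Wald test statistic,
$$T_W = (L\hat{\beta} - m)' \left( L\,\mathrm{Var}(\hat{\beta})\,L' \right)^{-1} (L\hat{\beta} - m)$$
has a central $\chi^2$ limiting distribution with degrees of freedom equal to the rank of $L$.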
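As a worked special case (a sketch, not part of the procedure's documentation), consider a single restriction on one coefficient, $H_0: \beta_k = m$. Here $L$ is a row vector with a 1 in position $k$ and zeros elsewhere, so the quadratic form collapses to the square of the familiar t ratio:

```latex
T_W = \frac{(\hat{\beta}_k - m)^2}{\mathrm{Var}(\hat{\beta}_k)}
    = \left( \frac{\hat{\beta}_k - m}{\mathrm{se}(\hat{\beta}_k)} \right)^2
    \;\xrightarrow{d}\; \chi^2_1 \quad \text{under } H_0
```

With a single restriction the rank of $L$ is 1, which is why the reference distribution has one degree of freedom.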
has the limiting distribution of the Wald statistic when testing the same hypothesis. Note that $F(\hat{\beta})$ and $F(\tilde{\beta})$ are the maximum values of the entropy objective function over the full and restricted parameter spaces, respectively.
$$G(\tilde{\beta})$$
where $G$ is the gradient of $F$, which is being evaluated at the optimum point for the restricted parameters. This test statistic shares the same limiting distribution as the Wald and pseudo-likelihood ratio tests.
Missing Values
If an observation in the input data set contains a missing value for any of the regressors or dependent values, that observation is dropped from the analysis.
BY variables (if any)
_TYPE_, a character variable of length 8 that identifies the estimation method: GME or GMENM.
variable, a character variable of length 32 that indicates the name of the regressor. The regressor name and the equation name identify a unique coefficient.
_OBS_, a numeric variable that is either missing when the probabilities are for coefficients or the observation number when the probabilities are for the residual terms. The _OBS_ and the equation name identify which residual the probability is associated with.
equation, a character variable of length 32 that indicates the name of the dependent variable
NSupport, a numeric variable that indicates the number of support points for each basis
support, a numeric variable that is the support value the probability is associated with
prior, a numeric variable that is the prior probability associated with the probability
Prb, a numeric variable that is the estimated probability
Table 12.2

ODS Table Name       Description                                  Option
ConvCrit             Convergence criteria for estimation          default
ConvergenceStatus    Convergence status                           default
DatasetOptions       Data sets used                               default
MinSummary           Number of parameters, estimation kind        default
ObsUsed              Observations read, used, and missing         default
ParameterEstimates   Parameter estimates                          default
ResidSummary         Summary of the SSE, MSE for the equations    default
TestResults          Test statement table                         TEST statement
ODS Graphics
This section describes the use of ODS for creating graphics with the ENTROPY procedure.
Plot Description
Includes all the plots listed below
Predicted versus actual plot
Cook's D plot
Q-Q plot of residuals
Studentized residual plot
Histogram of the residuals
For the first simulation, the error term $\epsilon$ is distributed normally; a chi-squared distribution with six degrees of freedom is assumed for the second simulation; and finally $\epsilon$ is assumed to have a Cauchy distribution in the third simulation. In each of the three simulations, 100 samples of 10 observations each were simulated. The data for the model with the Cauchy error distribution is generated using the following DATA step code:
   data one;
      call streaminit(156789);
      do by = 1 to 100;              /* 100 simulated samples      */
         do x2 = 1 to 10;            /* 10 observations per sample */
            x1 = 10 * ranuni(512);
            y = x1 + 2*x2 + rand('cauchy');
            output;
         end;
      end;
   run;
The statements for the other distributions are identical except for the argument to the RAND() function. The parameters to the model were estimated by using maximum entropy with the following programming statements:
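For instance, the normal and chi-squared cases might be generated as in the following sketch. It builds both error types in one DATA step for compactness; the data set name ONE_ALT and the variable names Y_NORMAL and Y_CHISQ are illustrative and do not appear in the original example.

```sas
/* Sketch: same design as data set ONE, with different RAND() arguments */
data one_alt;
   call streaminit(156789);
   do by = 1 to 100;
      do x2 = 1 to 10;
         x1 = 10 * ranuni(512);
         y_normal = x1 + 2*x2 + rand('normal');         /* standard normal errors   */
         y_chisq  = x1 + 2*x2 + rand('chisquare', 6);   /* chi-squared errors, 6 df */
         output;
      end;
   end;
run;
```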
   proc entropy data=one gme outest=parm1;
      model y = x1 x2;
      by by;
   run;
The estimation by using moment-constrained maximum entropy was performed by changing the GME option to GMENM. For comparison, the same model was estimated by using OLS with the following PROC REG statements:
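A sketch of what such an OLS comparison step could look like is shown here. It is an assumption rather than the original listing; in particular, the output data set name PARM3 is hypothetical.

```sas
/* Sketch: OLS estimates for each of the 100 simulated samples */
proc reg data=one outest=parm3 noprint;
   model y = x1 x2;
   by by;
run;
```

The OUTEST= data set then holds one row of coefficient estimates per BY group, which is the same layout that the PROC UNIVARIATE summary step expects.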
The 100 estimates of the coefficient on variable x1 are then summarized for each of the three error distributions by using PROC UNIVARIATE, as follows:
   proc univariate data=parm1;
      var x1;
   run;
The following table summarizes the results from the estimations. The true value for the coefficient on x1 is 1.0.

Estimation          Normal               Chi-Squared              Cauchy
Method          Mean   Std Deviation   Mean   Std Deviation   Mean   Std Deviation
GME            0.418      0.117       0.626      0.330       0.818       3.36
GME-NM         0.878      0.116       0.948      0.427       3.03       13.62
OLS            0.973      0.142       1.023      0.467       5.54       26.83
For normally distributed or nearly normally distributed data, moment-constrained maximum entropy is a good choice. For distributions not well described by a normal distribution, data-constrained maximum entropy is a good choice.
   data rate;
      do a=-1,1; do b=-1,1; do c=-1,1; do d=-1,1;
         input y @@;
         ab=a*b; ac=a*c; ad=a*d; bc=b*c; bd=b*d; cd=c*d;
         abc=a*b*c; abd=a*b*d; acd=a*c*d; bcd=b*c*d;
         abcd=a*b*c*d;
         output;
      end; end; end; end;
      datalines;
   45 71 48 65 68 60 80 65 43 100 45 104 75 86 70 96
   ;
   run;
Analyze the data by using PROC REG, then output the resulting estimates.
   proc reg data=rate outest=regout;
      model y=a b c d ab ac ad bc bd cd abc abd acd bcd abcd;
   run;

   proc transpose data=regout out=ploteff name=effect prefix=est;
      var a b c d ab ac ad bc bd cd abc abd acd bcd abcd;
   run;
Now the normal scores for the estimates can be computed with the rank procedure as follows:
   proc rank data=ploteff normal=blom out=qqplot;
      var est1;
      ranks normalq;
   run;
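For reference, the NORMAL=BLOM option computes each normal score from the rank of the estimate. With $r_i$ the rank of the $i$th estimate among $n$ values and $\Phi^{-1}$ the standard normal quantile function, the score is:

```latex
\text{normalq}_i = \Phi^{-1}\!\left( \frac{r_i - 3/8}{n + 1/4} \right)
```

In this example there are $n = 15$ effect estimates, so the largest estimate ($r_i = 15$) is assigned the quantile for probability $14.625 / 15.25 \approx 0.959$.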
To create the probability plot, simply plot the estimates versus their normal scores by using PROC SGPLOT as follows:
   title "Unreplicated Factorial Experiments";
   proc sgplot data=qqplot;
      scatter x=est1 y=normalq / markerchar=effect
                                 markercharattrs=(size=10pt);
      xaxis label="Estimate";
      yaxis label="Normal Quantile";
   run;
The plot shown in Output 12.2.1 displays evidence that the a, b, d, ad, and bd estimates do not fit into the purely random normal model, which suggests that they may have some significant effect on the response variable. To verify this, fit a reduced model that contains only these effects.
   proc reg data=rate;
      model y=a b d ad bd;
   run;
The estimates for the reduced model are shown in Output 12.2.2.
Variable     DF
Intercept     1
a             1
b             1
d             1
ad            1
bd            1
These results support the probability plot methodology. PROC ENTROPY can directly estimate the full model without having to rely upon the probability plot for insight into which effects can be significant. To illustrate this, PROC ENTROPY is run by using default parameter and error supports in the following statements:
   proc entropy data=rate;
      model y=a b c d ab ac ad bc bd cd abc abd acd bcd abcd;
   run;
The resulting GME estimates are shown in Output 12.2.3. Note that the parameter estimates associated with the a, b, d, ad, and bd effects are all significant.
Variable      Estimate    t Value
a             5.688414       7.19
b             2.988032       5.47
c             0.234331       1.70
d             9.627308       9.86
ab            -0.01386      -0.51
ac            -0.00054      -0.16
ad            6.833076       7.92
bc            0.113908       1.21
bd            -7.68105      -8.48
cd             0.00002       0.05
abc           -0.14876      -1.37
abd            -0.0399      -0.77
acd           0.466938       2.38
bcd           0.059581       0.91
abcd          0.024785       0.64
Intercept     69.87294      61.28
The following DATA step generates censored data in which any negative values of the dependent variable, y, are set to a lower bound of 0.
   data cens;
      do t = 1 to 100;
         x1 = 5 * ranuni(456);
         x2 = 10 * ranuni(457);
         y = 4.5*x1 + 2*x2 + 15 * rannor(458);
         if( y<0 ) then y = 0;     /* censor at the lower bound of 0 */
         output;
      end;
   run;
To illustrate the effect of the censored option in PROC ENTROPY, the model is initially estimated without accounting for censoring in the following statements:
   title "Censored Data Estimation";
   proc entropy data = cens gme primal;
      priors intercept -32 32
             x1 -15 15
             x2 -15 15;
      model y = x1 x2 / esupports = (-25 1 25);
   run;
Variable x1 x2 intercept
The previous model is reestimated by using the CENSORED option in the following statements:
   proc entropy data = cens gme primal;
      priors intercept -32 32
             x1 -15 15
             x2 -15 15;
      model y = x1 x2 / esupports = (-25 1 25)
                        censored(lb = 0, esupports=(-15 1 15) );
   run;
Variable x1 x2 intercept
The second set of entropy estimates is much closer to the true parameter values of 4.5 and 2. Since another alternative available for fitting a model of censored data is a tobit model, PROC QLIM is used in the following statements to fit a tobit model to the data:
   proc qlim data=cens;
      model y = x1 x2;
      endogenous y ~ censored(lb=0);
   run;
DF 1 1 1 1
For this data and code, PROC ENTROPY produces estimates that are closer to the true parameter values than those computed by PROC QLIM.
   title "Using a PDATA= data set";
   data a;
      retain seed1 55371 seed2 97335;
      array x[4];
      do t = 1 to 100;
         ys = -5;
         do k = 1 to 4;
            x[k] = rannor( seed1 + k );
            ys = ys + x[k] * k;
         end;
         ys = ys + rannor( seed2 );
         output;
      end;
   run;
Next you fit this data with some arbitrary parameter support points and priors by using the following PROC ENTROPY statements:
   proc entropy data = a gme primal;
      priors x1 -10(2) 30(1)
             x2 -20(3) 30(2)
             x3 -15(4) 30(4)
             x4 -25(3) 30(2)
             intercept -13(4) 30(2);
      model ys = x1 x2 x3 x4 / esupports=(-25 0 25);
   run;
Variable x1 x2 x3 x4 intercept
You can estimate the same model by first creating a PDATA= data set, which includes the same information as the PRIORS statement in the preceding PROC ENTROPY step.
A data set that defines the supports and priors for the model parameters is shown in the following statements:
   data test;
      length Variable $ 12 Equation $ 12;
      input Variable $ Equation $ Nsupport Support Prior;
      datalines;
   Intercept . 2 -13 0.66667
   Intercept . 2 30 0.33333
   x1 . 2 -10 0.66667
   x1 . 2 30 0.33333
   x2 . 2 -20 0.60000
   x2 . 2 30 0.40000
   x3 . 2 -15 0.50000
   x3 . 2 30 0.50000
   x4 . 2 -25 0.60000
   x4 . 2 30 0.40000
   ;
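The Prior values in this data set correspond to the weights given in parentheses in the earlier PRIORS statement, normalized to sum to one for each parameter. A short check of that arithmetic, using the weights from the example:

```latex
\begin{aligned}
\texttt{x1: } -10(2),\ 30(1) &\;\Rightarrow\; \tfrac{2}{2+1} \approx 0.66667,\quad \tfrac{1}{2+1} \approx 0.33333 \\
\texttt{x2: } -20(3),\ 30(2) &\;\Rightarrow\; \tfrac{3}{3+2} = 0.6,\quad \tfrac{2}{3+2} = 0.4 \\
\texttt{x3: } -15(4),\ 30(4) &\;\Rightarrow\; \tfrac{4}{4+4} = 0.5,\quad \tfrac{4}{4+4} = 0.5 \\
\texttt{x4: } -25(3),\ 30(2) &\;\Rightarrow\; \tfrac{3}{3+2} = 0.6,\quad \tfrac{2}{3+2} = 0.4 \\
\texttt{intercept: } -13(4),\ 30(2) &\;\Rightarrow\; \tfrac{4}{4+2} \approx 0.66667,\quad \tfrac{2}{4+2} \approx 0.33333
\end{aligned}
```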
The following statements reestimate the model by using these support points.
   proc entropy data=a gme primal pdata=test;
      model ys = x1 x2 x3 x4 / esupports=(-25 0 25);
   run;
Variable x1 x2 x3 x4 Intercept
These results are identical to the ones produced by the previous PROC ENTROPY step.
displays are requested by specifying the ODS GRAPHICS statement. For information about the graphics available in the ENTROPY procedure, see the section ODS Graphics on page 666. The following statements show how to generate ODS graphics plots with the ENTROPY procedure. The plots are displayed in Output 12.5.1.
proc entropy data=coleman; model test_score = teach_sal prcnt_prof socio_stat teach_score mom_ed; run;
References
Coleman, J. S., Campbell, E. Q., Hobson, C. J., McPartland, J., Mood, A. M., Weinfeld, F. D., and York, R. L. (1966), Equality of Educational Opportunity, Washington, DC: U.S. Government Printing Office.
Deaton, A. and Muellbauer, J. (1980), "An Almost Ideal Demand System," The American Economic Review, 70, 312–326.
Golan, A., Judge, G., and Miller, D. (1996), Maximum Entropy Econometrics: Robust Estimation with Limited Data, Chichester, England: John Wiley & Sons.
Golan, A., Judge, G., and Perloff, J. (1996), "A Generalized Maximum Entropy Approach to Recovering Information from Multinomial Response Data," Journal of the American Statistical Association, 91, 841–853.
Golan, A., Judge, G., and Perloff, J. (1997), "Estimation and Inference with Censored and Ordered Multinomial Response Data," Journal of Econometrics, 79, 23–51.
Golan, A., Judge, G., and Perloff, J. (2002), "Comparison of Maximum Entropy and Higher-Order Entropy Estimators," Journal of Econometrics, 107, 195–211.
Good, I. J. (1963), "Maximum Entropy for Hypothesis Formulation, Especially for Multidimensional Contingency Tables," Annals of Mathematical Statistics, 34, 911–934.
Harmon, A. M., Preckel, P., and Eales, J. (1998), Maximum Entropy-Based Seemingly Unrelated Regression, Master's thesis, Purdue University.
Jaynes, E. T. (1957), "Information Theory and Statistical Mechanics," Physical Review, 106, 620–630.
Jaynes, E. T. (1963), "Information Theory and Statistical Mechanics," in K. W. Ford, ed., Brandeis Lectures in Theoretical Physics, volume 3, Statistical Physics, 181–218, New York, Amsterdam: W. A. Benjamin Inc.
Kapur, J. N. and Kesavan, H. K. (1992), Entropy Optimization Principles with Applications, Boston: Academic Press.
Kullback, S. (1959), Information Theory and Statistics, New York: John Wiley & Sons.
Kullback, S. and Leibler, R. A. (1951), "On Information and Sufficiency," Annals of Mathematical Statistics.
LaMotte, L. R. (1994), "A Note on the Role of Independence in t Statistics Constructed from Linear Statistics in Regression Models," The American Statistician, 48, 238–240.
Miller, D., Eales, J., and Preckel, P. (2003), "Quasi-Maximum Likelihood Estimation with Bounded Symmetric Errors," in Advances in Econometrics, volume 17, 133–148, Elsevier.
Mittelhammer, R. C. and Cardell, S. (2000), "The Data-Constrained GME Estimator of the GLM: Asymptotic Theory and Inference," Working paper of the Department of Statistics, Washington State University, Pullman.
Mittelhammer, R. C., Judge, G. G., and Miller, D. J. (2000), Econometric Foundations, Cambridge: Cambridge University Press.
Myers, R. H. and Montgomery, D. C. (1995), Response Surface Methodology: Process and Product Optimization Using Designed Experiments, New York: John Wiley & Sons.
Shannon, C. E. (1948), "A Mathematical Theory of Communication," Bell System Technical Journal, 27, 379–423 and 623–656.
Chapter 13
Additionally, transformed versions of these models are provided:
log
square root
logistic
Box-Cox
Graphics are available with the ESM procedure. For more information, see the section "ODS Graphics" on page 705. The exponential smoothing models supported in PROC ESM differ from those supported in PROC FORECAST since all parameters associated with the forecasting model are optimized by PROC ESM based on the data. The ESM procedure writes the time series extrapolated by the forecasts, the series summary statistics, the forecasts and confidence limits, the parameter estimates, and the fit statistics to output data sets. The ESM procedure optionally produces printed output for these results by using the Output Delivery System (ODS). The ESM procedure can forecast both time series data, whose observations are equally spaced by a specific time interval (for example, monthly, weekly), and transactional data, whose observations are not spaced with respect to any particular time interval. Internet, inventory, sales, and similar data are typical examples of transactional data. For transactional data, the data is accumulated based on a specified time interval to form a time series prior to modeling and forecasting.
The following examples are more fully illustrated in "Example 13.2: Forecasting of Transactional Data" on page 709. Given an input data set that contains numerous time series variables recorded at a specific frequency, the ESM procedure can forecast the series as follows:
   proc esm data=<input-data-set> out=<output-data-set>;
      id <time-ID-variable> interval=<frequency>;
      forecast <time-series-variables>;
   run;
For example, suppose that the input data set SALES contains sales data recorded monthly, the variable that represents time is DATE, and the forecasts are to be recorded in the output data set NEXTYEAR. The ESM procedure could be used as follows:
   proc esm data=sales out=nextyear;
      id date interval=month;
      forecast _numeric_;
   run;
The preceding statements generate forecasts for every numeric variable in the input data set SALES for the next twelve months and store these forecasts in the output data set NEXTYEAR. Other output data sets can be specified to store the parameter estimates, forecasts, statistics of fit, and summary data. By default, PROC ESM generates no printed output. If you want to print the forecasts by using the Output Delivery System (ODS), then you need to add the PRINT=FORECASTS option to the PROC ESM statement, as shown in the following example:
   proc esm data=sales out=nextyear print=forecasts;
      id date interval=month;
      forecast _numeric_;
   run;
Other PRINT= options can be specified to print the parameter estimates, statistics of fit, and summary data. The ESM procedure can forecast both time series data, whose observations are equally spaced by a specific time interval (for example, monthly, weekly), and transactional data, whose observations are not spaced with respect to any particular time interval. Given an input data set that contains transactional variables not recorded at any specific frequency, the ESM procedure accumulates the data to a specific time interval and forecasts the accumulated series as follows:
   proc esm data=<input-data-set> out=<output-data-set>;
      id <time-ID-variable> interval=<frequency>
                            accumulate=<accumulation>;
      forecast <time-series-variables> / model=<esm>;
   run;
For example, suppose that the input data set WEBSITES contains three variables (BOATS, CARS, PLANES) that are Internet data recorded on no particular time interval, and the variable that represents time is TIME, which records the time of the Web hit. The forecasts for the total daily values are to be recorded in the output data set NEXTWEEK. The ESM procedure could be used as follows:
   proc esm data=websites out=nextweek lead=7;
      id time interval=dtday accumulate=total;
      forecast boats cars planes;
   run;
The preceding statements accumulate the data into a daily time series, generate forecasts for the BOATS, CARS, and PLANES variables in the input data set (WEBSITES) for the next seven days, and store the forecasts in the output data set (NEXTWEEK). Because the MODEL= option is not specified in the FORECAST statement, a simple exponential smoothing model is fit to each series.
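To request a different model for the same series, the MODEL= option can be added to the FORECAST statement. The following sketch reuses the WEBSITES example and asks for linear (Holt) exponential smoothing; the choice of model is arbitrary and only for illustration.

```sas
/* Sketch: same accumulation as above, with an explicit model choice */
proc esm data=websites out=nextweek lead=7;
   id time interval=dtday accumulate=total;
   forecast boats cars planes / model=linear;   /* linear (Holt) smoothing */
run;
```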
Functional Summary
The statements and options that control the ESM procedure are summarized in the following table.
Table 13.1 Syntax Summary
Description                                                   Statement       Option

Statements
specify data sets and options                                 PROC ESM
specify BY-group processing                                   BY
specify variables to forecast                                 FORECAST
specify the time ID variable                                  ID

Data Set Options
specify the input data set                                    PROC ESM        DATA=
specify to output forecasts only                              PROC ESM        NOOUTALL
specify the output data set                                   PROC ESM        OUT=
specify parameter output data set                             PROC ESM        OUTEST=
specify forecast output data set                              PROC ESM        OUTFOR=
specify the forecast procedure information output data set    PROC ESM        OUTPROCINFO=
specify statistics output data set                            PROC ESM        OUTSTAT=
specify summary output data set                               PROC ESM        OUTSUM=
replace actual values held back                               FORECAST        REPLACEBACK
replace missing values                                        FORECAST        REPLACEMISSING
use forecast value to append                                  FORECAST        USE=

Accumulation and Seasonality Options
specify accumulation frequency                                ID              INTERVAL=
specify length of seasonal cycle                              PROC ESM        SEASONALITY=
specify interval alignment                                    ID              ALIGN=
specify that time ID variable values are not sorted           ID              NOTSORTED
specify starting time ID value                                ID              START=
specify ending time ID value                                  ID              END=
specify accumulation statistic                                ID, FORECAST    ACCUMULATE=
specify missing value interpretation                          ID, FORECAST    SETMISSING=
specify zero value interpretation                             ID, FORECAST    ZEROMISS=

Forecasting Horizon, Holdback Options
specify data to hold back                                     PROC ESM        BACK=
specify forecast horizon or lead                              PROC ESM        LEAD=
specify horizon to start summation                            PROC ESM        STARTSUM=

Forecasting Model Options
specify confidence limit width                                FORECAST        ALPHA=
specify forecast model                                        FORECAST        MODEL=
specify median forecasts                                      FORECAST        MEDIAN
specify backcast initialization                               FORECAST        NBACKCAST=
specify model transformation                                  FORECAST        TRANSFORM=

Printing and Plotting Control Options
specify time ID format                                        ID              FORMAT=
specify graphical output                                      PROC ESM        PLOT=
specify printed output                                        PROC ESM        PRINT=
specify detailed printed output                               PROC ESM        PRINTDETAILS

Miscellaneous Options
specify that analysis variables are processed in sorted order PROC ESM        SORTNAMES
limit error and warning messages                              PROC ESM        MAXERROR=
BACK=n
specifies the number of observations before the end of the data where the multistep forecasts are to begin. The default is BACK=0.
DATA=SAS-data-set
names the SAS data set that contains the input data for the procedure to forecast. If the DATA= option is not specified, the most recently created SAS data set is used.
LEAD=n
specifies the number of periods ahead to forecast (forecast lead or horizon). The default is LEAD=12. The LEAD= value is relative to the BACK= option specification and to the last observation in the input data set or the accumulated series, and not to the last nonmissing observation of a particular series. Thus, if a series has missing values at the end, the actual number of forecasts computed for that series is greater than the LEAD= value.
MAXERROR=number
limits the number of warning and error messages produced during the execution of the procedure to the specified value. The default is MAXERROR=50. This option is particularly useful in BY-group processing where it can be used to suppress the recurring messages.
NOOUTALL
specifies that only forecasts are written to the OUT= and OUTFOR= data sets. The NOOUTALL option includes only the final forecast observations in the output data sets; it does not include the one-step forecasts for the data before the forecast period. The OUT= and OUTFOR= data sets contain only the forecast results starting at the next period following the last observation and ending with the forecast horizon specified by the LEAD= option.
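As an illustration of how BACK= and LEAD= interact, the following sketch holds back the last 12 months of a monthly series and forecasts 24 periods from that point, so the first 12 forecasts overlap the held-back data and the remaining 12 extend past the end of the series. It assumes the SASHELP.AIR sample data set (variables DATE and AIR) is available; the output data set name AIRFOR is hypothetical.

```sas
/* Sketch: forecasts start 12 observations before the end of the data */
proc esm data=sashelp.air out=airfor back=12 lead=24 print=forecasts;
   id date interval=month;
   forecast air / model=linear;
run;
```

Comparing the first 12 forecasts with the held-back actual values gives a simple out-of-sample check of the model.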
OUT=SAS-data-set
names the output data set to contain the forecasts of the variables specified in the subsequent FORECAST statements. If an ID variable is specified, it is also included in the OUT= data set. The values are accumulated based on the ACCUMULATE= option, and forecasts are appended to these values based on the FORECAST statement USE= option. The OUT= data set is particularly useful in extending the independent variables. The OUT= data set can be used as the input data set in a subsequent PROC step to forecast a dependent series by using a regression modeling procedure. If the OUT= option is not specified, a default output data set is created by using the DATAn convention. If you do not want the OUT= data set created, use OUT=_NULL_.
OUTEST=SAS-data-set
names the output data set to contain the model parameter estimates and the associated test statistics and probability values. The OUTEST= data set is useful for evaluating the significance of the model parameters and understanding the model dynamics.
OUTFOR=SAS-data-set
names the output data set to contain the forecast time series components (actual, predicted, lower confidence limit, upper confidence limit, prediction error, prediction standard error). The OUTFOR= data set is useful for displaying the forecasts in tabular or graphical form.
OUTPROCINFO=SAS-data-set
names the output data set to contain information in the SAS log, specifically the number of notes, errors, and warnings and the number of series processed, forecasts requested, and forecasts failed.
OUTSTAT=SAS-data-set
names the output data set to contain the statistics of fit (or goodness-of-fit statistics). The OUTSTAT= data set is useful for evaluating how well the model fits the series.
OUTSUM=SAS-data-set
names the output data set to contain the summary statistics and the forecast summation. The summary statistics are based on the accumulated time series when the ACCUMULATE= or SETMISSING= options are specified. The forecast summations are based on the LEAD=, STARTSUM=, and USE= options. The OUTSUM= data set is useful when forecasting large numbers of series and a summary of the results is needed.
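The output data set options can be combined in a single PROC ESM step. The following sketch requests the forecast components, parameter estimates, fit statistics, and summary data while suppressing the OUT= data set. It assumes the SASHELP.AIR sample data set; the output data set names are hypothetical.

```sas
/* Sketch: route each kind of result to its own output data set */
proc esm data=sashelp.air out=_null_
         outfor=airfor     /* actual, predicted, limits, errors   */
         outest=airest     /* parameter estimates                 */
         outstat=airstat   /* statistics of fit                   */
         outsum=airsum     /* summary statistics and forecast sum */
         lead=12;
   id date interval=month;
   forecast air;
run;
```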
PLOT=option | ( options )
specifies the graphical output desired. By default, the ESM procedure produces no graphical output. The following plotting options are available:
ERRORS               plots prediction error time series graphics.
ACF                  plots prediction error autocorrelation function graphics.
PACF                 plots prediction error partial autocorrelation function graphics.
IACF                 plots prediction error inverse autocorrelation function graphics.
WN                   plots white noise graphics.
MODELS               plots model graphics.
FORECASTS            plots forecast graphics.
MODELFORECASTSONLY   plots forecast graphics with confidence limits in the data range.
FORECASTSONLY        plots the forecast in the forecast horizon only.
LEVELS               plots smoothed level component graphics.
SEASONS              plots smoothed seasonal component graphics.
TRENDS               plots smoothed trend (slope) component graphics.
ALL                  is the same as specifying all of the above PLOT= options.
For example, PLOT=FORECASTS plots the forecasts for each series. The PLOT= option produces printed output for these results by using the Output Delivery System (ODS).
PRINT=option | ( options )
specifies the printed output desired. By default, the ESM procedure produces no printed output. The following printing options are available:
ESTIMATES            prints the results of parameter estimation.
FORECASTS            prints the forecasts.
PERFORMANCE          prints the performance statistics for each forecast.
PERFORMANCESUMMARY   prints the performance summary for each BY group.
PERFORMANCEOVERALL   prints the performance summary for all of the BY groups.
STATISTICS           prints the statistics of fit.
STATES               prints the backcast, initial, and final states.
SUMMARY              prints the summary statistics for the accumulated time series.
ALL                  Same as PRINT=(ESTIMATES FORECASTS STATISTICS SUMMARY).
For example, PRINT=FORECASTS prints the forecasts, PRINT=(ESTIMATES FORECASTS) prints the parameter estimates and the forecasts, and PRINT=ALL prints all of the output.
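A short sketch combining the PRINT= and PLOT= options follows. It assumes the SASHELP.AIR sample data set and turns ODS Graphics on for the duration of the step so that the requested plots are produced.

```sas
/* Sketch: printed estimates and forecasts plus two kinds of plots */
ods graphics on;

proc esm data=sashelp.air out=_null_
         print=(estimates forecasts)
         plot=(forecasts errors);
   id date interval=month;
   forecast air;
run;

ods graphics off;
```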
PRINTDETAILS
specifies that output requested with the PRINT= option be printed in greater detail.
SEASONALITY=number
specifies the length of the seasonal cycle. For example, SEASONALITY=3 means that every group of three observations forms a seasonal cycle. The SEASONALITY= option is applicable only for seasonal forecasting models. By default, the length of the seasonal cycle is one (no seasonality) or the length implied by the INTERVAL= option specified in the ID statement. For example, INTERVAL=MONTH implies that the length of the seasonal cycle is twelve.
SORTNAMES
specifies that the variables specified in the FORECAST statements are processed in sorted order.
STARTSUM=n
specifies the starting forecast lead (or horizon) for which to begin summation of the forecasts specified by the LEAD= option. The STARTSUM= value must be less than the LEAD= value. The default is STARTSUM=1; that is, the sum from the one-step-ahead forecast (which is the first forecast in the forecast horizon) to the multistep forecast specified by the LEAD= option. The prediction standard errors of the summation of forecasts take into account the correlation between the multistep forecasts. The section "Forecast Summation" on page 698 describes the STARTSUM= option in more detail.
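In symbols, and as a sketch only (the notation $\hat{y}_{T+h}$ for the $h$-step-ahead forecast from the last observation $T$ and $e_{T+h}$ for its prediction error is introduced here for illustration), the summation and its variance have the form:

```latex
\hat{S} = \sum_{h=\text{STARTSUM}}^{\text{LEAD}} \hat{y}_{T+h},
\qquad
\mathrm{Var}(\hat{S}) = \sum_{h=\text{STARTSUM}}^{\text{LEAD}} \; \sum_{g=\text{STARTSUM}}^{\text{LEAD}} \mathrm{Cov}\!\left(e_{T+h},\, e_{T+g}\right)
```

For example, with LEAD=12 and STARTSUM=7, the sum covers the forecasts for leads 7 through 12. The off-diagonal covariance terms are what distinguish the standard error of the sum from a naive sum of the individual forecast variances.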
BY Statement
BY variables ;
A BY statement can be used with PROC ESM to obtain separate analyses for groups of observations defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:
Sort the data by using the SORT procedure with a similar BY statement.
Specify the option NOTSORTED or DESCENDING in the BY statement for the ESM procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
Create an index on the BY variables by using the DATASETS procedure.
For more information about the BY statement, see SAS Language Reference: Concepts. For more information about the DATASETS procedure, see the discussion in the Base SAS Procedures Guide.
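The first alternative, sorting before forecasting, might look like the following sketch. The data set REGIONSALES and its variables REGION, DATE, and SALES are hypothetical names used only for illustration.

```sas
/* Sketch: sort by the BY variable (and time), then forecast each group */
proc sort data=regionsales out=sorted;
   by region date;
run;

proc esm data=sorted out=regionfor lead=6;
   by region;                     /* one set of forecasts per region */
   id date interval=month;
   forecast sales;
run;
```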
FORECAST Statement
FORECAST variable-list / options ;
The FORECAST statement lists the numeric variables in the DATA= data set whose accumulated values represent time series to be modeled and forecast. The options specify which forecast model is to be used. A data set variable can be specied in only one FORECAST statement. Any number of FORECAST statements can be used. The following options can be used with the FORECAST statement.
ACCUMULATE=option
specifies how the data set observations are accumulated within each time period for the variables listed in the FORECAST statement. If the ACCUMULATE= option is not specified in the FORECAST statement, accumulation is determined by the ACCUMULATE= option of the ID statement. Use the ACCUMULATE= option with multiple FORECAST statements when you want different accumulation specifications for different variables. See the ID statement ACCUMULATE= option for more details.
ALPHA=number
specifies the significance level to use in computing the confidence limits of the forecast. The ALPHA= value must be between 0 and 1. The default is ALPHA=0.05, which produces 95% confidence intervals.
MEDIAN
specifies that the median forecast values are to be estimated. Forecasts can be based on the mean or median. By default, the mean value is provided. If no transformation is applied to the time series by using the TRANSFORM= option, the mean and median forecast values are identical.
MODEL=model-name
specifies the forecasting model to be used to forecast the time series. The default is MODEL=SIMPLE, which performs simple exponential smoothing. The following forecasting models are provided:
   NONE         no forecast
   SIMPLE       simple (single) exponential smoothing. This is the default.
   DOUBLE       double (Brown) exponential smoothing
   LINEAR       linear (Holt) exponential smoothing
   DAMPTREND    damped trend exponential smoothing
Additive seasonal exponential smoothing and multiplicative seasonal exponential smoothing models are also provided.
When the option MODEL=NONE is specified, the time series is appended with missing values in the OUT= data set. This option is useful when the results stored in the OUT= data set are used in a subsequent analysis where forecasts of the independent variables are needed to forecast the dependent variable.
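The following sketch shows how different models might be assigned to different variables by using several FORECAST statements. It assumes a monthly data set named SALES that contains the variables DATE, SHOES, SOCKS, and LACES; the particular model chosen for each variable is illustrative, not a recommendation.

```sas
/* SALES and its variables are assumed; model choices are illustrative */
proc esm data=sales out=nextyear lead=12;
   id date interval=month;
   forecast shoes / model=linear;               /* linear (Holt) smoothing          */
   forecast socks / model=damptrend alpha=0.1;  /* 90% confidence limits            */
   forecast laces / model=none;                 /* series extended with missing values */
run;
```

Each variable appears in exactly one FORECAST statement, as the FORECAST statement requires.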
NBACKCAST=n
specifies the number of observations used to initialize the backcast states. The default is the entire series.
REPLACEBACK
specifies that actual values excluded by the BACK= option are replaced with one-step-ahead forecasts in the OUT= data set.
REPLACEMISSING
specifies that embedded missing values are replaced with one-step-ahead forecasts in the OUT= data set.
SETMISSING=option | number
specifies how missing values (either input or accumulated) are assigned in the accumulated time series for variables listed in the FORECAST statement. If the SETMISSING= option is not specified in the FORECAST statement, missing values are set based on the SETMISSING= option of the ID statement. See the ID statement SETMISSING= option for more details.
TRANSFORM=option
specifies the time series transformation to be applied to the input or accumulated time series. The following transformations are provided:
   NONE         no transformation. This is the default.
   LOG          logarithmic transformation
   SQRT         square-root transformation
   LOGISTIC     logistic transformation
   BOXCOX(n)    Box-Cox transformation with parameter n, where n is a number between -5 and 5
When the TRANSFORM= option is specified, the time series must be strictly positive. After the time series is transformed, the model parameters are estimated by using the transformed series. The forecasts of the transformed series are then computed, and finally the transformed series forecasts are inverse transformed. The inverse transform produces either mean or median forecasts depending on whether the MEDIAN option is specified. The sections "Transformations" on page 697 and "Inverse Transformations" on page 698 describe this in more detail.
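As a brief sketch of a transformed forecast, the following statements model the logarithm of the strictly positive AIR series in the SASHELP.AIR data set and request median forecasts on the original scale. The output data set name AIRFCST and the choice of MODEL=ADDWINTERS are assumptions made for illustration.

```sas
/* AIRFCST is a hypothetical output data set name */
proc esm data=sashelp.air out=airfcst lead=12;
   id date interval=month;
   /* parameters are estimated on log(AIR); forecasts are inverse transformed */
   forecast air / model=addwinters transform=log median;
run;
```

Omitting the MEDIAN option in this step would return mean forecasts instead, which differ from the median forecasts because a transformation is applied.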
USE=option
specifies which forecast values are appended to the actual values in the OUT= and OUTSUM= data sets. The following USE= options are provided:
   PREDICT      The predicted values are appended to the actual values. This option is the default.
   LOWER        The lower confidence limit values are appended to the actual values.
   UPPER        The upper confidence limit values are appended to the actual values.
Thus, the USE= option enables the OUT= and OUTSUM= data sets to be used for worst-case, best-case, average-case, and median-case decisions.
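For instance, a pessimistic planning scenario might append the lower confidence limits rather than the predicted values. The sketch below assumes a monthly data set SALES with variables DATE and SHOES; the output data set name LOWCASE is hypothetical.

```sas
/* SALES, SHOES, and LOWCASE are assumed names */
proc esm data=sales out=lowcase lead=12;
   id date interval=month;
   /* ALPHA=0.1 gives 90% limits; USE=LOWER appends the lower limits */
   forecast shoes / alpha=0.1 use=lower;
run;
```

Running the same step with USE=UPPER would produce the corresponding optimistic scenario.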
ZEROMISS=option
specifies how beginning or ending zero values (either input or accumulated) are interpreted in the accumulated time series for variables listed in the FORECAST statement. If the ZEROMISS= option is not specified in the FORECAST statement, beginning or ending zero values are set to missing values based on the ZEROMISS= option of the ID statement. See the ID statement ZEROMISS= option for more details.
ID Statement
ID variable INTERVAL= interval < options > ;
The ID statement names a numeric variable that identifies observations in the input and output data sets. The ID variable's values are assumed to be SAS date or datetime values. In addition, the ID statement specifies the (desired) frequency associated with the time series. The ID statement options also specify how the observations are accumulated and how the time ID values are aligned to form the time series to be forecast. The information specified affects all variables specified in subsequent FORECAST statements. If the ID statement is specified, the INTERVAL= option must be specified. If an ID statement is not specified, the observation number, with respect to the BY group, is used as the time ID. The following options can be used with the ID statement.
ACCUMULATE=option
specifies how the data set observations are accumulated within each time period. The frequency (width of each time interval) is specified by the INTERVAL= option. The ID variable contains the time ID values. Each time ID variable value corresponds to a specific time period. The accumulated values form the time series, which is used in subsequent model fitting and forecasting. The ACCUMULATE= option is particularly useful when there are gaps in the input data or when there are multiple input observations that coincide with a particular time period (for example, transactional data). The EXPAND procedure offers additional frequency conversions and transformations that can also be useful in creating a time series. The following options determine how the observations are accumulated within each time period based on the ID variable and the frequency specified by the INTERVAL= option:
   NONE             No accumulation occurs; the ID variable values must be equally spaced with respect to the frequency. This is the default option.
   TOTAL            Observations are accumulated based on the total sum of their values.
   AVERAGE | AVG    Observations are accumulated based on the average of their values.
   MINIMUM | MIN    Observations are accumulated based on the minimum of their values.
   MEDIAN | MED     Observations are accumulated based on the median of their values.
   MAXIMUM | MAX    Observations are accumulated based on the maximum of their values.
   N                Observations are accumulated based on the number of nonmissing observations.
   NMISS            Observations are accumulated based on the number of missing observations.
   NOBS             Observations are accumulated based on the number of observations.
   FIRST            Observations are accumulated based on the first of their values.
   LAST             Observations are accumulated based on the last of their values.
   STDDEV | STD     Observations are accumulated based on the standard deviation of their values.
   CSS              Observations are accumulated based on the corrected sum of squares of their values.
   USS              Observations are accumulated based on the uncorrected sum of squares of their values.
If the ACCUMULATE= option is specified, the SETMISSING= option is useful for specifying how accumulated missing values are treated. If missing values should be interpreted as zero, then SETMISSING=0 should be used. The section "Accumulation" on page 695 describes accumulation in greater detail.
ALIGN=option
controls the alignment of SAS dates used to identify output observations. The ALIGN= option accepts the following values: BEGINNING | BEG | B, MIDDLE | MID | M, and ENDING | END | E. BEGINNING is the default.
END=date | datetime
specifies a SAS date or datetime literal value that represents the end of the data. If the last time ID variable value is less than the END= value, the series is extended with missing values. If the last time ID variable value is greater than the END= value, the series is truncated. For example, END='1jan2008'D specifies that data for time periods after the first of January 2008 not be used. The option END="&sysdate"D uses the automatic macro variable SYSDATE to extend or truncate the series to the current date. This option and the START= option can be used to ensure that data associated with each BY group contains the same number of observations.
FORMAT=format
specifies the SAS format for the time ID values. If the FORMAT= option is not specified, the default format is implied from the INTERVAL= option.
INTERVAL=interval
specifies the frequency of the input time series or of the time series to be accumulated from the input data. For example, if the input data set consists of quarterly observations, then INTERVAL=QTR should be used. If the SEASONALITY= option is not specified, the length of the seasonal cycle is implied by the INTERVAL= option. For example, INTERVAL=QTR implies a seasonal cycle of length 4. If the ACCUMULATE= option is also specified, the INTERVAL= option determines the time periods for the accumulation of observations. The basic intervals are YEAR, SEMIYEAR, QTR, MONTH, SEMIMONTH, TENDAY, WEEK, WEEKDAY, DAY, HOUR, MINUTE, and SECOND. See Chapter 4, "Date Intervals, Formats, and Functions," for more information about the intervals that can be specified.
NOTSORTED
specifies that the time ID values are not in sorted order. The ESM procedure sorts the data with respect to the time ID prior to analysis.
SETMISSING=option | number
specifies how missing values (either input or accumulated) are assigned in the accumulated time series. If a number is specified, missing values are set to that number. If a missing value on the input data set indicates an unknown value, the SETMISSING= option should not be used. If a missing value indicates no value, SETMISSING=0 should be used. You typically use SETMISSING=0 for transactional data, because no recorded data usually implies no activity. The following options can also be used to determine how missing values are assigned:
   MISSING            Missing values are set to missing. This is the default option.
   AVERAGE | AVG      Missing values are set to the accumulated average value.
   MINIMUM | MIN      Missing values are set to the accumulated minimum value.
   MEDIAN | MED       Missing values are set to the accumulated median value.
   MAXIMUM | MAX      Missing values are set to the accumulated maximum value.
   FIRST              Missing values are set to the accumulated first nonmissing value.
   LAST               Missing values are set to the accumulated last nonmissing value.
   PREVIOUS | PREV    Missing values are set to the previous accumulated nonmissing value. Missing values at the beginning of the accumulated series remain missing.
   NEXT               Missing values are set to the next accumulated nonmissing value. Missing values at the end of the accumulated series remain missing.
If SETMISSING=MISSING is specified, the missing observations are replaced with predicted values computed from the exponential smoothing model.
START=date | datetime
specifies a SAS date or datetime literal value that represents the beginning of the data. If the first time ID variable value is greater than the START= value, the series is prefixed with missing values. If the first time ID variable value is less than the START= value, the series is truncated. This option and the END= option can be used to ensure that data associated with each BY group contains the same number of observations.
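As a sketch of how START= and END= might be combined with BY-group processing, the following statements force every group onto a common time span. The data set REGIONSALES, its variables REGION, DATE, and UNITS, and the chosen dates are all hypothetical; the data are assumed to be sorted by REGION.

```sas
/* REGIONSALES, REGION, DATE, UNITS, and the dates are hypothetical */
proc esm data=regionsales out=regionfcst;
   by region;
   /* each BY group is padded with missing values or truncated to this span */
   id date interval=month start='01jan1994'd end='01dec1998'd;
   forecast units;
run;
```

A group that begins after January 1994 is prefixed with missing values, and a group that extends past December 1998 is truncated, so every group covers the same 60 monthly periods.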
ZEROMISS=option
specifies how beginning and/or ending zero values (either input or accumulated) are interpreted in the accumulated time series. The following values can be specified for the ZEROMISS= option:
   NONE     Beginning and/or ending zeros are unchanged. This is the default.
   LEFT     Beginning zeros are set to missing.
   RIGHT    Ending zeros are set to missing.
   BOTH     Both beginning and ending zeros are set to missing.
If the accumulated series is all missing and/or zero, the series is not changed.
Table 13.2
   Step   Operation                       Option               Statement
   1      accumulation                    ACCUMULATE=          ID
   2      missing value interpretation    SETMISSING=          ID, FORECAST
   3      transformations                 TRANSFORM=           FORECAST
   4      parameter estimation            MODEL=               FORECAST
   5      forecasting                     MODEL=, LEAD=        FORECAST, PROC ESM
   6      inverse transformation          TRANSFORM=, MEDIAN   FORECAST
   7      summation of forecasts          LEAD=, STARTSUM=     PROC ESM
Each of the steps shown in Table 13.2 is described in the following sections.
Accumulation
If the ACCUMULATE= option is specified in the ID statement, data set observations are accumulated within each time period. The frequency (width of each time interval) is specified by the INTERVAL= option, and the ID variable contains the time ID values. Each time ID value corresponds to a specific time period. Accumulation is particularly useful when the input data set contains transactional data, whose observations are not spaced with respect to any particular time interval. The accumulated values form the time series that is used in subsequent analyses by the ESM procedure. For example, suppose a data set contains the following observations:
   19MAR1999   10
   19MAR1999   30
   11MAY1999   50
   12MAY1999   20
   23MAY1999   20
If the INTERVAL=MONTH option is specified on the ID statement, all of the preceding observations fall within three time periods: March 1999, April 1999, and May 1999. The observations are accumulated within each time period as follows. If the ACCUMULATE=NONE option is specified, an error is generated because the ID variable values are not equally spaced with respect to the specified frequency (MONTH).
As can be seen from the preceding examples, even though the data set observations contained no missing values, the accumulated time series can have missing values.
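The accumulation of these five observations can be reproduced with a short program. The sketch below uses the TIMESERIES procedure, on the assumption that its ID statement accumulates observations in the same way as the ID statement of PROC ESM; the data set names TRANS and MONTHLY and the variable name AMOUNT are hypothetical.

```sas
/* TRANS, MONTHLY, and AMOUNT are hypothetical names */
data trans;
   input date : date9. amount;
   format date date9.;
   datalines;
19MAR1999 10
19MAR1999 30
11MAY1999 50
12MAY1999 20
23MAY1999 20
;
run;

/* accumulate the transactions into a monthly series of totals */
proc timeseries data=trans out=monthly;
   id date interval=month accumulate=total;
   var amount;
run;

proc print data=monthly;
run;
```

With ACCUMULATE=TOTAL, the March 1999 period accumulates to 10 + 30 = 40 and the May 1999 period to 50 + 20 + 20 = 90. April 1999 has no observations, so its accumulated value should be missing unless a SETMISSING= value is also specified.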
Transformations
If the TRANSFORM= option is specified in the FORECAST statement, the time series is transformed prior to model parameter estimation and forecasting. Only strictly positive series can be transformed. An error is generated when the TRANSFORM= option is used with a nonpositive series. (See Chapter 44, "Forecasting Process Details," for more details about forecasting transformed time series.)
Parameter Estimation
All the parameters (smoothing weights) associated with the exponential smoothing model used to forecast the time series (as specified by the MODEL= option) are optimized based on the data, with the default parameter restrictions imposed. If the TRANSFORM= option is specified, the transformed time series data are used to estimate the model parameters. The techniques used in the ESM procedure are identical to those used for exponential smoothing models in the Time Series Forecasting System of SAS/ETS software. See Chapter 36, "Overview of the Time Series Forecasting System," for more information.
NOTE: Even if all of the observed data are nonmissing, the ACCUMULATE= option can create missing values in the accumulated series (when the data contain no observations for some of the time periods specified by the INTERVAL= option).
Forecasting
Once the model parameters are estimated, one-step-ahead forecasts are generated for the full range of the accumulated and optionally transformed time series data, and multistep forecasts are generated from the end of the time series to the future time period specified by the LEAD= option. If there are missing values at the end of the time series, the forecast horizon will be greater than that specified by the LEAD= option.
Inverse Transformations
If the TRANSFORM= option is specified in the FORECAST statement, the forecasts of the transformed time series are inverse transformed. By default, forecasts of the mean (expected value) are generated. If the MEDIAN option is specified, median forecasts are generated. (See Chapter 44, "Forecasting Process Details," for more details about forecasting transformed time series.)
Statistics of Fit
The statistics of fit are computed by comparing the time series data (after accumulation and missing value recoding, if specified) with the generated forecasts. If the TRANSFORM= option is specified, the statistics of fit are based on the inverse transformed forecasts. (See Chapter 44, "Forecasting Process Details," for more details about statistics of fit for forecasting models.)
Forecast Summation
The multistep forecasts generated by the preceding steps can optionally be summed from the STARTSUM= value to the LEAD= value. For example, if the options STARTSUM=4 and LEAD=6 are specified on the PROC ESM statement, the four-step through six-step ahead forecasts are summed. The forecasts are simply summed; however, the prediction error variance of this sum is computed by taking into account the correlation between the individual predictions. (These variance-related computations are performed only when no transformation is specified; that is, when TRANSFORM=NONE.) The upper and lower confidence limits for the sum of the predictions are then computed based on the prediction error variance of the sum.
The forecast summation is particularly useful when it is desirable to model in one frequency but the forecast of interest is another frequency. For example, if a time series has a monthly frequency (INTERVAL=MONTH) and you want a forecast for the third and fourth future months, a forecast summation for the third and fourth month can be obtained by specifying STARTSUM=3 and LEAD=4.
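The monthly case just described might be coded as in the following sketch, which assumes a monthly data set SALES with variables DATE and SHOES; the OUTSUM= data set name SUMFCST is hypothetical.

```sas
/* SALES, SHOES, and SUMFCST are assumed names */
proc esm data=sales out=_null_ outsum=sumfcst lead=4 startsum=3;
   id date interval=month;
   /* no TRANSFORM= option, so variance-related computations are performed */
   forecast shoes;
run;
```

Because STARTSUM=3 and LEAD=4, only the three-step and four-step ahead forecasts contribute to the sum, and the summation results are written to the OUTSUM= data set.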
If the parameter estimation step fails for a particular variable, no observations are output to the OUTEST= data set for that variable.
If the forecasting step fails for a particular variable, no observations are recorded in the OUTFOR= data set for that variable. If the TRANSFORM= option is specified, the values in the preceding variables are the inverse transform forecasts. If the MEDIAN option is specified, the median forecasts are stored; otherwise, the mean forecasts are stored.
DFE N NOBS NMISSA NMISSP NPARMS TSS SST SSE MSE UMSE RMSE URMSE MAPE MAE MASE RSQUARE ADJRSQ AADJRSQ RWRSQ AIC AICC SBC APC MAXERR
   MINERR     minimum error
   MINPE      minimum percent error
   MAXPE      maximum percent error
   ME         mean error
   MPE        mean percent error
   MDAPE      median absolute percent error
   GMAPE      geometric mean absolute percent error
   MINPPE     minimum predictive percent error
   MAXPPE     maximum predictive percent error
   MSPPE      mean predictive percent error
   MAPPE      symmetric mean absolute predictive percent error
   MDAPPE     median absolute predictive percent error
   GMAPPE     geometric mean absolute predictive percent error
   MINSPE     minimum symmetric percent error
   MAXSPE     maximum symmetric percent error
   MSPE       mean symmetric percent error
   SMAPE      symmetric mean absolute percent error
   MDASPE     median absolute symmetric percent error
   GMASPE     geometric mean absolute symmetric percent error
   MINRE      minimum relative error
   MAXRE      maximum relative error
   MRE        mean relative error
   MRAE       mean relative absolute error
   MDRAE      median relative absolute error
   GMRAE      geometric mean relative absolute error
   MINAPES    minimum absolute error percent of standard deviation
   MAXAPES    maximum absolute error percent of standard deviation
   MAPES      mean absolute error percent of standard deviation
   MDAPES     median absolute error percent of standard deviation
   GMAPES     geometric mean absolute error percent of standard deviation
If the statistics of fit cannot be computed for a particular variable, no observations are recorded in the OUTSTAT= data set for that variable. If the TRANSFORM= option is specified, the values in the preceding variables are computed based on the inverse transform forecasts. If the MEDIAN option is specified, the median forecasts are the basis; otherwise, the mean forecasts are the basis. See Chapter 44, "Forecasting Process Details," for more information about the calculation of forecasting statistics of fit.
The following variables related to forecast summation are based on the LEAD= and STARTSUM= options:
   PREDICT    forecast summation predicted values
   STD        forecast summation prediction standard errors
   LOWER      forecast summation lower confidence limits
   UPPER      forecast summation upper confidence limits
Variance-related computations are performed only when no transformation is specified (TRANSFORM=NONE). The following variables related to multistep forecasts are based on the LEAD= and USE= options:
   _LEADn_    multistep forecast (n ranges from one to the value of the LEAD= option). If USE=LOWER, this variable contains the lower confidence limits; if USE=UPPER, this variable contains the upper confidence limits; otherwise, this variable contains the predicted values.
If the forecast step fails for a particular variable, the variables that are related to forecasting are set to missing for that variable. The OUTSUM= data set contains both a summary of the (accumulated) time series and optionally its forecasts for all series.
Printed Output
The ESM procedure optionally produces printed output by using the Output Delivery System (ODS). By default, the procedure produces no printed output. All output is controlled by the PRINT= and PRINTDETAILS options in the PROC ESM statement. In general, if a forecasting step that is related to printed output fails, the values of this step are not printed and appropriate error or warning messages are recorded in the log. The printed output is similar to the output data sets. The printed output produced by the PRINT= option values is described as follows:
   SUMMARY               prints the summary statistics and forecast summaries similar to the OUTSUM= data set.
   ESTIMATES             prints the parameter estimates similar to the OUTEST= data set.
   FORECASTS             prints the forecasts similar to the OUTFOR= data set.
   PERFORMANCE           prints the performance statistics.
   PERFORMANCESUMMARY    prints the performance summary for each BY group.
   PERFORMANCEOVERALL    prints the performance summary for all BY groups.
   STATES                prints the backcast, initial, and final smoothed states.
   STATISTICS            prints the statistics of fit similar to the OUTSTAT= data set.
The PRINTDETAILS option is the opposite of the NOOUTALL option. Specifically, if PRINT=FORECASTS and the PRINTDETAILS options are specified in the PROC ESM statement, the one-step-ahead forecasts through the range of the data are printed in addition to the information related to a specific forecasting model, such as the smoothing states. If the PRINTDETAILS option is not specified, only the multistep forecasts are printed.
Table 13.3
   ODS Table Name           Description                                     PRINT= Option
   DescStats                descriptive statistics                          SUMMARY
   ForecastSummary          forecast summary                                SUMMARY
   ForecastSummation        forecast summation                              SUMMARY
   ParameterEstimates       parameter estimates                             ESTIMATES
   Forecasts                forecasts                                       FORECASTS
   Performance              performance statistics                          PERFORMANCE
   PerformanceSummary       performance summary                             PERFORMANCESUMMARY
   PerformanceOverall       performance overall                             PERFORMANCEOVERALL
   SmoothedStates           smoothed states                                 STATES
   FitStatistics            evaluation statistics of fit                    STATISTICS
   PerformanceStatistics    performance (out-of-sample) statistics of fit   STATISTICS
The ODS table ForecastSummary is related to all time series within a BY group. The other tables are related to a single series within a BY group.
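Because each printed table has an ODS name, it can also be captured in a data set with the ODS OUTPUT statement. The following sketch saves the ParameterEstimates table; the input data set SALES, the variable SHOES, and the data set name ESM_EST are assumptions made for illustration.

```sas
/* SALES, SHOES, and ESM_EST are assumed names */
ods output ParameterEstimates=esm_est;

proc esm data=sales out=_null_ print=estimates;
   id date interval=month;
   forecast shoes / model=linear;
run;
```

The PRINT=ESTIMATES option is needed here because the procedure produces no ODS tables by default.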
ODS Graphics
This section describes the use of ODS for creating graphics with the ESM procedure. To request these graphs, you must specify the statement ODS GRAPHICS ON; in your SAS program before the PROC ESM statement. In addition, you must specify the PLOT= option in the PROC ESM statement.
Table 13.4
   ODS Graph Name                Plot Description                                            PLOT= Option
   ErrorACFNORMPlot              standardized autocorrelation of prediction errors           ACF
   ErrorACFPlot                  autocorrelation of prediction errors                        ACF
   ErrorHistogram                prediction error histogram                                  ERRORS
   ErrorIACFNORMPlot             standardized inverse autocorrelation of prediction errors   IACF
   ErrorIACFPlot                 inverse autocorrelation of prediction errors                IACF
   ErrorPACFNORMPlot             standardized partial autocorrelation of prediction errors   PACF
   ErrorPACFPlot                 partial autocorrelation of prediction errors                PACF
   ErrorPlot                     plot of prediction errors                                   ERRORS
   ErrorWhiteNoiseLogProbPlot    white noise log probability plot of prediction errors       WN
   ErrorWhiteNoiseProbPlot       white noise probability plot of prediction errors           WN
   ForecastsOnlyPlot             forecasts only plot                                         FORECASTSONLY
   ForecastsPlot                 forecasts plot                                              FORECASTS
   LevelStatePlot                smoothed level state plot                                   LEVELS
   ModelForecastsPlot            model and forecasts plot                                    MODELFORECASTS
   ModelPlot                     model plot                                                  MODELS
   SeasonStatePlot               smoothed season state plot                                  SEASONS
   TrendStatePlot                smoothed trend state plot                                   TRENDS
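As a minimal sketch of requesting one of these graphs, the following statements enable ODS Graphics and ask for the forecasts plot. The data set SALES and the variable SHOES are assumed names.

```sas
/* SALES and SHOES are assumed names */
ods graphics on;

proc esm data=sales out=_null_ plot=forecasts;
   id date interval=month;
   forecast shoes;
run;

ods graphics off;
```

Under the mapping in Table 13.4, PLOT=FORECASTS corresponds to the ODS graph named ForecastsPlot.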
data sales;
   format date date9.;
   input date : date9. shoes socks laces dresses coats
         shirts ties belts hats blouses;
   datalines;
... more lines ...
The following ESM procedure statements forecast each of the monthly time series:
proc esm data=sales out=nextyear;
   id date interval=month;
   forecast _numeric_;
run;
The preceding statements generate forecasts for every numeric variable in the input data set SALES for the next twelve months and store these forecasts in the output data set NEXTYEAR. The following statements plot the forecasts:
title1 "Shoe Department Sales";
proc sgplot data=nextyear;
   series x=date y=shoes / markers
          markerattrs=(symbol=circlefilled color=red)
          lineattrs=(color=red);
   series x=date y=socks / markers
          markerattrs=(symbol=asterisk color=blue)
          lineattrs=(color=blue);
   series x=date y=laces / markers
          markerattrs=(symbol=circle color=styg)
          lineattrs=(color=styg);
   refline '01JAN1999'd / axis=x;
   xaxis values=('01JAN1994'd to '01DEC2000'd by year);
   yaxis values=(2000 to 10000 by 1000) minor label='Websites';
run;
The plots are shown in Output 13.1.1. The historical data is shown to the left of the reference line and the forecasts for the next twelve monthly periods are shown to the right.
The default simple exponential smoothing model is used because the MODEL= option is omitted on the FORECAST statement. Note that for simple exponential smoothing the forecasts are constant. The following ESM procedure statements are identical to the preceding statements except that the PRINT=FORECASTS option is specified:
proc esm data=sales out=nextyear print=forecasts;
   id date interval=month;
   forecast _numeric_;
run;
In addition to forecasting each of the monthly time series, the preceding statements print the forecasts by using the Output Delivery System (ODS); the forecasts are partially shown in Output 13.1.2. This output shows the predictions, prediction standard errors, and the upper and lower confidence limits for the next twelve monthly periods.
   Obs    Time       Forecasts    95% Confidence Limits
    62    FEB1999    6009.1986    3913.2016    8105.1956
    63    MAR1999    6009.1986    3900.6996    8117.6976
    64    APR1999    6009.1986    3888.2713    8130.1259
    65    MAY1999    6009.1986    3875.9154    8142.4818
    66    JUN1999    6009.1986    3863.6306    8154.7666
    67    JUL1999    6009.1986    3851.4158    8166.9814
    68    AUG1999    6009.1986    3839.2698    8179.1274
    69    SEP1999    6009.1986    3827.1914    8191.2058
    70    OCT1999    6009.1986    3815.1794    8203.2178
    71    NOV1999    6009.1986    3803.2329    8215.1643
    72    DEC1999    6009.1986    3791.3507    8227.0465
    73    JAN2000    6009.1986    3779.5318    8238.8654
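The constant forecasts and widening limits in this output follow from the form of simple exponential smoothing. As a sketch in standard textbook notation (the symbols $L_t$ for the smoothed level, $\alpha$ for the level smoothing weight, and $\sigma^2$ for the one-step prediction error variance are introduced here for illustration and are not the notation of this chapter):

```latex
\begin{aligned}
L_t &= \alpha\, y_t + (1-\alpha)\, L_{t-1}, \\
\hat{y}_{t+k \mid t} &= L_t, \qquad k = 1, 2, \ldots, \\
\operatorname{Var}\!\left(y_{t+k} - \hat{y}_{t+k \mid t}\right)
   &= \sigma^2 \left[\, 1 + (k-1)\,\alpha^2 \,\right].
\end{aligned}
```

Every lead therefore shares the same point forecast, while the prediction error variance grows with the lead $k$, which is consistent with the identical forecasts and gradually widening confidence limits in the table.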
The preceding statements accumulate the data into a daily time series, generate forecasts for the BOATS, CARS, and PLANES variables in the input data set WEBSITES for the next week, and store the forecasts in the OUT= data set NEXTWEEK. The following statements plot the forecasts related to the Internet data:
title1 "Website Data";
proc sgplot data=nextweek;
   series x=time y=boats / markers
          markerattrs=(symbol=circlefilled color=red)
          lineattrs=(color=red);
   series x=time y=cars / markers
          markerattrs=(symbol=asterisk color=blue)
          lineattrs=(color=blue);
   series x=time y=planes / markers
          markerattrs=(symbol=circle color=styg)
          lineattrs=(color=styg);
   refline '11APR2000:00:00:00'dt / axis=x;
   xaxis values=('13MAR2000:00:00:00'dt to '18APR2000:00:00:00'dt by dtweek);
   yaxis label='Websites' minor;
run;
The plots are shown in Output 13.2.1. The historical data is shown to the left of the reference line and the forecasts for the next seven days are shown to the right.
Output 13.2.1 Internet Data Forecast Plots
The following AUTOREG procedure statements are used to forecast the ENGINES variable by regressing on the independent variables (BOATS, CARS, and PLANES).
proc autoreg data=nextweek;
   model engines = boats cars planes / noprint;
   output out=enginehits p=predicted;
run;
The NEXTWEEK data set created by PROC ESM is used as an input data set to PROC AUTOREG. The output data set from PROC AUTOREG contains the forecast of the variable ENGINES based on the regression model with the variables BOATS, CARS, and PLANES as regressors. See Chapter 8, "The AUTOREG Procedure," for details about autoregression models. The following statements plot the forecasts related to the ENGINES variable:
title1 "Website Data";
proc sgplot data=enginehits;
   series x=time y=boats / markers
          markerattrs=(symbol=circlefilled color=red)
          lineattrs=(color=red);
   series x=time y=cars / markers
          markerattrs=(symbol=asterisk color=blue)
          lineattrs=(color=blue);
   series x=time y=planes / markers
          markerattrs=(symbol=circle color=styg)
          lineattrs=(color=styg);
   scatter x=time y=predicted / markerattrs=(symbol=plus color=black);
   refline '11APR2000:00:00:00'dt / axis=x;
   xaxis values=('13MAR2000:00:00:00'dt to '18APR2000:00:00:00'dt by dtweek);
   yaxis label='Websites' minor;
run;
The plots are shown in Output 13.4.1. The historical data is shown to the left of the reference line and the forecasts for the next seven daily periods are shown to the right.
proc esm data=sashelp.air out=_null_ lead=20 back=20 print=all plots=all;
   id date interval=month;
   forecast air / model=addwinters transform=log;
run;
Chapter 14
The results of the EXPAND procedure are stored in a SAS data set. No printed output is produced.
Note that interpolating values of a time series does not add any real information to the data as the interpolation process is not the same process that generated the other (nonmissing) values in the series. While time series interpolation can sometimes be useful, great care is needed in analyzing time series containing interpolated values.
This example assumes that the variables X, Y, and Z represent point-in-time values observed at the beginning of each month, and that the desired results are point-in-time values observed at the beginning of each year. (The default value of the OBSERVED= option is OBSERVED=(BEGINNING,BEGINNING).) The variables A, B, and C are assumed to represent monthly totals, and the desired results are annual totals; therefore the option OBSERVED=TOTAL is specified. See the section "Specifying Observation Characteristics" on page 724 for more information on the OBSERVED= option. Note that the AGGREGATE method can be used only if the input intervals are nested within the output intervals, as when converting from daily to monthly or from monthly to yearly frequency.
See Chapter 3, Working with Time Series Data, for further discussion of time series periodicity, time series dating, and time series interpolation. See the section Specifying Observation Characteristics on page 724 for more information on the OBSERVED= option.
This example assumes that the variables X, Y, and Z represent point-in-time values observed at the beginning of each year. (The default value of the OBSERVED= option is OBSERVED=BEGINNING.) The variables A, B, and C are assumed to represent annual totals. To interpolate missing values in variables observed at specific points in time, omit both the FROM= and TO= options and use the ID statement to supply time values for the observations. The observations do not need to be periodic or form regular time series, but the data set must be sorted by the ID variable. For example, the following statements interpolate any missing values in the numeric variables in the data set A.
proc expand data=a out=b;
   id date;
run;
If the observations are equally spaced in time, and all the series are observed as beginning-of-period values, only the input and output data sets need to be specified. For example, the following statements interpolate any missing values in the numeric variables in the data set A using a cubic spline function, assuming that the observations are at equally spaced points in time.
proc expand data=a out=b;
run;
Refer to the section Missing Values on page 748 for further information.
proc expand data=annual out=monthly from=year to=month;
   id date;
   convert x y z / method=join;
run;
The following statements estimate the contribution of each month to the annual totals in A, B, and C, and interpolate first-of-month estimates of X, Y, and Z.

proc expand data=annual out=monthly from=year to=month;
   id date;
   convert x y z;
   convert a b c / observed=total;
run;
The EXPAND procedure supports five different observation characteristics. The OBSERVED= option values for these five observation characteristics are:

   BEGINNING    beginning-of-period values
   MIDDLE       period midpoint values
   END          end-of-period values
   TOTAL        period totals
   AVERAGE      period averages

The interpolation of each series is adjusted appropriately for its observation characteristics. When OBSERVED=TOTAL or AVERAGE is specified, the interpolating curve is fit to the data values so that the area under the curve within each input interval equals the value of the series. For OBSERVED=MIDDLE or END, the curve is fit through the data points, with the time position of each data value placed at the specified offset from the start of the interval. See the section "OBSERVED= Option" on page 737 for details.

For example, suppose you are converting quarterly data to monthly and you want both first-of-month and midmonth estimates for a beginning-of-period variable X. The following statements perform this task:

proc expand data=a out=b from=qtr to=month;
   id date;
   convert x=x_begin / observed=beginning;
   convert x=x_mid / observed=(beginning,middle);
run;
Transforming Series
The interpolation methods used by PROC EXPAND assume that there are no restrictions on the range of values that series can have. This assumption can sometimes cause problems if the series must be within a certain range. For example, suppose you are converting monthly sales figures to weekly estimates. Sales estimates should never be less than zero, but since the spline curve ignores this restriction, some interpolated values may be negative. One way to deal with this problem is to transform the input series before fitting the interpolating spline and then reverse transform the output series.

You can apply various transformations to the input series using the TRANSFORMIN= option on the CONVERT statement. (The TRANSFORMIN= option can be abbreviated as TRANSFORM= or TIN=.) You can apply transformations to the output series using the TRANSFORMOUT= option. (The TRANSFORMOUT= option can be abbreviated as TOUT=.)

For example, you might use a logarithmic transformation of the input sales series and exponentiate the interpolated output series. The following statements fit a spline curve to the log of SALES and then exponentiate the output series.

proc expand data=a out=b from=month to=week;
   id date;
   convert sales / observed=total
                   transformin=(log)
                   transformout=(exp);
run;
Note that the transformations specified by the TRANSFORMIN= option are applied before the data are interpolated; the cubic spline curve or other interpolation method is fitted to transformed input data. The transformations specified by the TRANSFORMOUT= option are applied to interpolated values computed from the curves fit to the transformed input data.

As another example, suppose you are interpolating missing values in a series of market share estimates. Market shares must be between 0% and 100%, but applying a spline interpolation to the raw series can produce estimates outside of this range. The following statements use the logistic transformation to transform proportions in the range 0 to 1 to values in the range $-\infty$ to $+\infty$. The TIN= option first divides the market shares by 100 to rescale percent values to proportions and then applies the LOGIT function. The TOUT= option applies the inverse logistic function ILOGIT to the interpolated values to convert back to proportions and then multiplies by 100 to rescale back to percentages.

proc expand data=a out=b;
   id date;
   convert mshare / tin=( / 100 logit )
                    tout=( ilogit * 100 );
run;
When more than one transformation is specified in the TRANSFORMIN= or TRANSFORMOUT= option, the transformations are applied in the order in which they are listed. Thus in the above example the complete input transformation is logit(mshare/100) (and not logit(mshare)/100) because the division operation is listed first in the TIN= option.

You can also use the TRANSFORM= (or TRANSFORMOUT=) option as a convenient way to do calculations normally performed with the SAS DATA step. For example, the following statements add the lead of X to the data set A. The METHOD=NONE option is used to suppress interpolation.

proc expand data=a method=none;
   id date;
   convert x=xlead / transform=(lead);
run;
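As a rough check of the ordering for the market share example, the arithmetic behind TIN=( / 100 logit ) and TOUT=( ilogit * 100 ) can be reproduced by hand in a DATA step. This is only a sketch: the share value of 25 and the variable names are hypothetical, and the formulas are the LOGIT and ILOGIT definitions listed in Table 14.2, written out explicitly rather than taken from PROC EXPAND itself.

```sas
/* Hand computation of the input and output transformations (illustrative only) */
data _null_;
   mshare = 25;                            /* hypothetical market share, in percent */
   p      = mshare / 100;                  /* first TIN= operation:  / 100          */
   z      = log(p / (1 - p));              /* second TIN= operation: LOGIT          */
   back   = 100 * exp(z) / (1 + exp(z));   /* TOUT=: ILOGIT, then * 100             */
   put z= back=;
run;
```

Reversing the two input operations would instead compute logit(mshare)/100, which is undefined for a value of 25 because the argument of the logarithm would be negative.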
Any number of operations can be listed in the TRANSFORMIN= and TRANSFORMOUT= options. See Table 14.2 for a list of the operations supported.
Functional Summary
The statements and options controlling the EXPAND procedure are summarized in the following table.

Description                                        Statement      Option

Statements
specify options                                    PROC EXPAND
specify BY-group processing                        BY
specify conversion options                         CONVERT
specify the ID variable                            ID

Data Set Options
specify the input data set                         PROC EXPAND    DATA=
extrapolate values before or after input series    PROC EXPAND    EXTRAPOLATE
specify the output data set                        PROC EXPAND    OUT=
write interpolating functions to a data set        PROC EXPAND    OUTEST=

Input and Output Frequencies
control the alignment of SAS Date values           PROC EXPAND    ALIGN=
specify frequency conversion factor                PROC EXPAND    FACTOR=
specify input frequency                            PROC EXPAND    FROM=
specify output frequency                           PROC EXPAND    TO=

Interpolation Control Options
specify interpolation method for all series        PROC EXPAND    METHOD=
specify interpolation method for series            CONVERT        METHOD=
specify observation characteristics for series     PROC EXPAND    OBSERVED=
specify observation characteristics for series     CONVERT        OBSERVED=
specify transformations of the input series        CONVERT        TRANSFORMIN=
specify transformations of the output series       CONVERT        TRANSFORMOUT=

Graphical Output Control Options
specify graphical output                           PROC EXPAND    PLOTS=
The following options can be used with the PROC EXPAND statement:
DATA= SAS-data-set
names the input data set. If the DATA= option is omitted, the most recently created SAS data set is used.
OUT= SAS-data-set
names the output data set containing the resulting time series. If OUT= is not specified, the data set is named using the DATAn convention. See the section "OUT= Data Set" on page 756 for details.
OUTEST= SAS-data-set
names an output data set containing the coefficients of the spline curves fit to the input series. If the OUTEST= option is not specified, the spline coefficients are not output. See the section "OUTEST= Data Set" on page 756 for details.
ALIGN= option
controls the alignment of SAS dates used to identify output observations. The ALIGN= option allows the following values: BEGINNING | BEG | B, MIDDLE | MID | M, and ENDING | END | E. BEGINNING is the default.
FACTOR= n FACTOR=( n : m )
specifies the number of output observations to be created from the input observations. FACTOR=n specifies that n output observations are to be produced for each input observation. FACTOR=( n : m ) specifies that n output observations are to be produced for each group of m input observations. FACTOR=n is the same as FACTOR=(n : 1). In the FACTOR=() option, a comma can be used instead of a colon or the delimiter can be omitted. Thus FACTOR=( n, m ) or FACTOR=( n m ) is the same as FACTOR=( n : m ). The FACTOR= option cannot be used if the TO= option is used. The default value is FACTOR=(1:1). For more information, see the section "Frequency Conversion" on page 733.
FROM= interval
specifies the time interval between observations in the input data set. Examples of FROM= values are YEAR, QTR, MONTH, DAY, and HOUR. See Chapter 4, "Date Intervals, Formats, and Functions," for a complete description and examples of interval specifications.
TO= interval
specifies the time interval between observations in the output data set. By default, the TO= interval is generated from the combination of the FROM= and the FACTOR= values or is set to be the same as the FROM= value if FACTOR= is not specified. See Chapter 4, "Date Intervals, Formats, and Functions," for a description of interval specifications.
EXTRAPOLATE
specifies that missing values at the beginning or end of input series be replaced with values produced by a linear extrapolation of the interpolating curve fit to the input series. See the section "Extrapolation" on page 736 later in this chapter for details. By default, PROC EXPAND avoids extrapolating values beyond the first or last input value for a series and only interpolates values within the range of the nonmissing input values. Note that the extrapolated values are often not very accurate, and for the SPLINE method the EXTRAPOLATE option results may be very unreasonable. The EXTRAPOLATE option is rarely used.
METHOD= option METHOD=SPLINE( constraint < , constraint > )
specifies the method used to convert the data series. The methods supported are SPLINE, JOIN, STEP, AGGREGATE, and NONE. The METHOD= option specified on the PROC EXPAND statement can be overridden for particular series by the METHOD= option on the CONVERT statement. The default is METHOD=SPLINE. The constraint specifications for METHOD=SPLINE can have the values NOTAKNOT (the default), NATURAL, SLOPE=value, and/or CURVATURE=value. See the section "Conversion Methods" on page 739 for more information about these methods.
OBSERVED= value OBSERVED=( from-value , to-value )
indicates the observation characteristics of the input time series and of the output series. Specifying the OBSERVED= option on the PROC EXPAND statement sets the default OBSERVED= value for subsequent CONVERT statements. See the sections CONVERT Statement on page 732 and OBSERVED= Option on page 737 later in this chapter for details. The default is OBSERVED=BEGINNING.
PLOTS= option | ( options )
specifies the graphical output desired. If the PLOTS= option is used, the specified graphical output is produced for each output variable specified by a CONVERT statement. By default, the EXPAND procedure produces no graphical output. The following PLOTS= options are available. Required options are listed in parentheses in the plot descriptions when necessary.

   INPUT          plots the input series
   TRANSFORMIN    plots the transformed input series (TRANSFORMIN= option)
   CROSSINPUT     plots both the input series and the transformed input series on one plot with two Y axes. The input and transformed series are shown on separate scales. (TRANSFORMIN= option)
   JOINTINPUT     plots both the input series and the transformed input series on one plot with one Y axis. The input and transformed series are shown on the same scale. (TRANSFORMIN= option)
   CONVERTED      plots the converted series, after input transformations and interpolation but before any TRANSFORMOUT= transformations are applied (METHOD= option)
   TRANSFORMOUT   plots the transformed output series (TRANSFORMOUT= option)
   CROSSOUTPUT    plots both the converted series and the transformed output series on one plot with two Y axes. The converted and transformed output series are shown on separate scales. (TRANSFORMOUT= option)
   JOINTOUTPUT    plots both the converted series and the transformed output series on one plot with one Y axis. The converted and transformed output series are shown on the same scale. (TRANSFORMOUT= option)
   OUTPUT         plots the series stored in the OUT= data set (combination of TRANSFORMIN=, METHOD=, and TRANSFORMOUT= options)
   ALL            produces all plots except the joint and cross plots. (Same as PLOTS=(INPUT TRANSFORMIN CONVERTED TRANSFORMOUT).)
The PLOTS= option produces results associated with each CONVERT statement output variable and the options listed in the PLOTS= specification. See the section "PLOTS= Option Details" on page 758 for more information.
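A minimal sketch of requesting plots follows. It assumes a hypothetical monthly data set A with variables DATE and SALES, and it assumes that ODS Graphics must be enabled for the plots to be produced, which is not stated in this section; check the "PLOTS= Option Details" section before relying on it.

```sas
/* Assumption: ODS Graphics is enabled so that PLOTS= output can be rendered */
ods graphics on;

/* Request the input-series plot and the final output-series plot for SALES */
proc expand data=a out=b from=month to=week plots=(input output);
   id date;
   convert sales / observed=total;
run;

ods graphics off;
```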
BY Statement
BY variables ;
A BY statement can be used with PROC EXPAND to obtain separate analyses on observations in groups defined by the BY variables. The input data set must be sorted by the BY variables and be sorted by the ID variable within each BY group.

Use a BY statement when you want to interpolate or convert time series within levels of a cross-sectional variable. For example, suppose you have a data set STATE containing annual estimates of average disposable personal income per capita (DPI) by state and you want quarterly estimates by state. These statements convert the DPI series within each state:

proc sort data=state;
   by state date;
run;

proc expand data=state out=stateqtr from=year to=qtr;
   convert dpi;
   by state;
   id date;
run;
CONVERT Statement
CONVERT variable = newname . . . < / options > ;
The CONVERT statement lists the variables to be processed. Only numeric variables can be processed. For each of the variables listed, a new variable name can be specified after an equal sign to name the variable in the output data set that contains the converted values. If a name for the output series is not given, the variable in the output data set has the same name as the input variable. Variable lists may be used only when no name is given for the output series. For example, variable lists can be specified as follows:

convert y1-y25 / observed=(beginning,end);
convert x--a / observed=average;
convert x-numeric-a / observed=average;
Any number of CONVERT statements can be used. If no CONVERT statement is used, all the numeric variables in the input data set except those appearing in the BY and ID statements are processed. The following options can be used with the CONVERT statement.
METHOD= option METHOD=SPLINE( constraint < , constraint > )
specifies the method used to convert the data series. (The method specified by the METHOD= option is applied to the input data series after applying any transformations specified by the TRANSFORMIN= option.) The methods supported are SPLINE, JOIN, STEP, AGGREGATE, and NONE. The METHOD= option specified on the PROC EXPAND statement can be overridden for particular series by the METHOD= option on the CONVERT statement. The default is METHOD=SPLINE. The constraint specifications for METHOD=SPLINE can have the values NOTAKNOT (the default), NATURAL, SLOPE=value, and/or CURVATURE=value. See the section "Conversion Methods" on page 739 for more information about these methods.
OBSERVED= value OBSERVED=( from-value , to-value )
indicates the observation characteristics of the input time series and of the output series. The values supported are TOTAL, AVERAGE, BEGINNING, MIDDLE, and END. In addition, DERIVATIVE can be specified as the to-value when the SPLINE method is used. When only one value is specified, that value specifies both the from-value and the to-value. (That is, OBSERVED=value is equivalent to OBSERVED=(value,value).) If the OBSERVED= option is omitted from both the PROC EXPAND and the CONVERT statements, the default is OBSERVED=(BEGINNING,BEGINNING). See the section "OBSERVED= Option" on page 737 later in this chapter for details.
TRANSFORMIN=( operation . . . )
specifies a list of transformations to be applied to the input series before the interpolating function is fit. The operations are applied in the order listed. See the section "Transformation Operations" on page 742 later in this chapter for the operations that can be specified. The TRANSFORMIN= option can be abbreviated as TRANSIN=, TIN=, or TRANSFORM=.
TRANSFORMOUT=( operation . . . )
specifies a list of transformations to be applied to the output series. The operations are applied in the order listed. See the section "Transformation Operations" on page 742 later in this chapter for the operations that can be specified. The TRANSFORMOUT= option can be abbreviated as TRANSOUT= or TOUT=.
ID Statement
ID variable ;
The ID statement names a numeric variable that identifies observations in the input and output data sets. The ID variable's values are assumed to be SAS date or datetime values.

The input data must form time series. This means that the observations in the input data set must be sorted by the ID variable (within the BY variables, if any). Moreover, there should be no duplicate observations, and no two observations should have ID values within the same time interval as defined by the FROM= option.

If the ID statement is omitted, SAS date or datetime values are generated to label the input observations. These ID values are generated by assuming that the input data set starts at a SAS date value of 0, that is, 1 January 1960. This default starting date is then incremented for each observation by the FROM= interval (using the same logic as the DATA step INTNX function). If the FROM= option is not specified, the ID values are generated as the observation count minus 1.

When the ID statement is not used, an ID variable is added to the output data set named either DATE or DATETIME, depending on the value specified in the TO= option. If neither the TO= option nor the FROM= option is given, the ID variable in the output data set is named TIME.
Frequency Conversion
Frequency conversion is controlled by the FROM=, TO=, and FACTOR= options. The possible combinations of these options are explained in the following:

None Used
If FROM=, TO=, and FACTOR= are not specified, no frequency conversion is done. The data are processed to interpolate any missing values and perform any specified transformations. Each input observation produces one output observation.

FACTOR=(n:m)
FACTOR=(n:m) specifies that n output observations are produced for each group of m input observations. The fraction m/n is reduced first: thus FACTOR=(10:6) is equivalent to FACTOR=(5:3). Note that if m/n=1, the result is the same as the case given previously under None Used.

FROM=interval
The FROM= option used alone establishes the frequency and interval widths of the input observations. Missing values are interpolated, and any specified transformations are performed, but no frequency conversion is done.

TO=interval
When the TO= option is used without the FROM= option, output observations with the TO= frequency are generated over the range of input ID values. The first output observation is for the TO= interval containing the ID value of the first input observation; the last output observation is for the TO= interval containing the ID value of the last input observation. The input observations are not assumed to form regular time series and may represent aperiodic points in time. An ID variable is required to give the date or datetime of the input observations.

FROM=interval TO=interval
When both the FROM= and TO= options are used, the input observations have the frequency given by the FROM= interval, and the output observations have the frequency given by the TO= interval.

FROM=interval FACTOR=(n:m)
When both the FROM= and FACTOR= options are used, a TO= interval is inferred from the combination of the FROM=interval and the FACTOR=(n:m) values specified. For example, FROM=YEAR FACTOR=4 is the same as FROM=YEAR TO=QTR. Also, FROM=YEAR FACTOR=(3:2) is the same as FROM=YEAR used with TO=MONTH8. Once the implied TO= interval is determined, this combination operates the same as if FROM= and TO= had been specified. If no valid TO= interval can be constructed from the combination of the FROM= and FACTOR= options, an error is produced.

TO=interval FACTOR=(n:m)
The combination of the TO= option and the FACTOR= option is not allowed and produces an error.

ALIGN= option
Controls the alignment of SAS dates used to identify output observations. The ALIGN= option allows the following values: BEGINNING | BEG | B, MIDDLE | MID | M, and ENDING | END | E. BEGINNING is the default.
When values are missing or the FROM= interval is not nested within the TO= interval (as when aggregating from weekly to monthly), the results depend on an interpolation. The METHOD=AGGREGATE option always produces exact results, never an interpolation. However, this method can only be used if the FROM= interval is nested within the TO= interval.
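As an illustration of the FROM= and FACTOR= combination, the sketch below requests four output observations per input year, which according to the description above is the same as specifying TO=QTR. The data set names ANNUAL and QUARTERLY and the variables DATE and X are hypothetical.

```sas
/* FROM=YEAR with FACTOR=4 implies a quarterly output interval */
proc expand data=annual out=quarterly from=year factor=4;
   id date;
   convert x;
run;

/* Equivalent request that names the output interval directly */
proc expand data=annual out=quarterly2 from=year to=qtr;
   id date;
   convert x;
run;
```

Naming the interval with TO= is usually easier to read; FACTOR= is mainly useful when the output interval is more conveniently expressed as a ratio, such as FACTOR=(3:2).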
Identifying Observations
The variable specified in the ID statement is used to identify the observations. Usually, SAS date or datetime values are used for this variable. PROC EXPAND uses the ID variable to do the following:

- identify the time interval of the input values
- validate the input data set observations
- compute the ID values for the observations in the output data set

PROC EXPAND checks that the ID values are nonmissing and in ascending order. An error message is produced and the observation is ignored when an invalid ID value is found in the input data set.
Extrapolation
The spline functions fit by the EXPAND procedure are very good at approximating continuous curves within the time range of the input data but poor at extrapolating beyond the range of the data. The accuracy of the results produced by PROC EXPAND may be somewhat less at the ends of the output series than at time periods for which there are several input values at both earlier and later times. The curves fit by PROC EXPAND should not be used for forecasting.

PROC EXPAND normally avoids extrapolation of values beyond the time range of the nonmissing input data for a series, unless the EXTRAPOLATE option is used. However, if the start or end of the input series does not correspond to the start or end of an output interval, some output values may depend in part on an extrapolation.

For example, if FROM=YEAR, TO=WEEK, and OBSERVED=BEGINNING are specified, then the first observation output for a series is for the week of 1 January of the first nonmissing input year. If 1 January of that year is not a Sunday, the beginning of this week falls before the date of the first input value, and therefore a beginning-of-period output value for this week is extrapolated.

This extrapolation is made only to the extent needed to complete the terminal output intervals that overlap the endpoints of the input series and is limited to no more than the width of one FROM= interval or one TO= interval, whichever is less. This restriction of the extrapolation to complete terminal output intervals is applied to each series separately, and it takes into account the OBSERVED= option for the input and output series.

When the EXTRAPOLATE option is used, the normal restriction on extrapolation is overridden. Output values are computed for the full time range covered by the input data set.

For the SPLINE method, extrapolation is performed by a linear projection of the trend of the cubic spline curve fit to the input data, not by extrapolation of the first and last cubic segments.

The EXTRAPOLATE option should be used with caution.
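A hedged sketch of the EXTRAPOLATE option follows. It assumes a hypothetical monthly data set A in which the series X has missing values at the start or end. METHOD=JOIN is chosen here only as a cautious illustration, since the text warns that extrapolated SPLINE results can be unreasonable; it is not a recommendation from the manual.

```sas
/* Fill leading and trailing missing values of X by extrapolation */
proc expand data=a out=b from=month extrapolate;
   id date;
   convert x / method=join;   /* illustrative choice of method */
run;
```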
OBSERVED= Option
The values of the CONVERT statement OBSERVED= option are as follows:

   BEGINNING    indicates that the data are beginning-of-period values. OBSERVED=BEGINNING is the default.
   MIDDLE       indicates that the data are period midpoint values.
   ENDING       indicates that the data represent end-of-period values.
   TOTAL        indicates that the data values represent period totals for the time interval corresponding to the observation.
   AVERAGE      indicates that the data values represent period averages.
   DERIVATIVE   requests that the output series be the derivatives of the cubic spline curve fit to the input data by the SPLINE method.
If only one value is specified in the OBSERVED= option, that value applies to both the input and the output series. For example, OBSERVED=TOTAL is the same as OBSERVED=(TOTAL,TOTAL), which indicates that the input values represent totals over the time intervals corresponding to the input observations, and the converted output values also represent period totals. The value DERIVATIVE can be used only as the second OBSERVED= option value, and it can be used only when METHOD=SPLINE is specified or is the default method.

Since the TOTAL, AVERAGE, MIDDLE, and END cases require that the width of each input interval be known, both the FROM= option and an ID statement are normally required if one of these observation characteristics is specified for any series. However, if the FROM= option is not specified, each input interval is assumed to extend from the ID value for the observation to the ID value of the next observation, and the width of the interval for the last observation is assumed to be the same as the width for the next to last observation.
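The two-value form of the OBSERVED= option can be sketched as follows. The data set names ANNUAL and MONTHLY, the input variables A and X, and the output variable names are hypothetical; the default METHOD=SPLINE is left in effect so that DERIVATIVE is allowed as the second value.

```sas
proc expand data=annual out=monthly from=year to=month;
   id date;
   /* input values are annual totals; output values are monthly averages */
   convert a=a_avg  / observed=(total,average);
   /* input values are beginning-of-year levels; output is the spline derivative */
   convert x=x_rate / observed=(beginning,derivative);
run;
```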
Conversion Methods
The SPLINE Method
The SPLINE method fits a cubic spline curve to the input values. A cubic spline is a segmented function consisting of third-degree (cubic) polynomial functions joined together so that the whole curve and its first and second derivatives are continuous.

For point-in-time input data, the spline curve is constrained to pass through the given data points. For interval total or average data, the definite integrals of the spline over the input intervals are constrained to equal the given interval totals.

For boundary constraints, the not-a-knot condition is used by default. This means that the first two spline pieces are constrained to be part of the same cubic curve, as are the last two pieces. Thus the spline used by PROC EXPAND by default is not the same as the commonly used natural spline, which uses zero second-derivative endpoint constraints. While DeBoor (1981) recommends the not-a-knot constraint for cubic spline interpolation, using this constraint can sometimes produce anomalous results at the ends of the interpolated series. PROC EXPAND provides options to specify other endpoint constraints for spline curves.

To specify endpoint constraints, use the following form of the METHOD= option.
METHOD=SPLINE( constraint < , constraint > )
The first constraint specification applies to the lower endpoint, and the second constraint specification applies to the upper endpoint. If only one constraint is specified, it applies to both the lower and upper endpoints. The constraint specifications can have the following values:

NOTAKNOT
specifies the not-a-knot constraint. This is the default.

NATURAL
specifies the natural spline constraint. The second derivative of the spline curve is constrained to be zero at the endpoint.

SLOPE= value
specifies the first derivative of the spline curve at the endpoint. The value specified can be any positive or negative number, but extreme values may produce unreasonable results.

CURVATURE= value
specifies the second derivative of the spline curve at the endpoint. The value specified can be any positive or negative number, but extreme values may produce unreasonable results. Specifying CURVATURE=0 is equivalent to specifying the NATURAL option.

For example, to specify natural spline interpolation, use the following option in the CONVERT or PROC EXPAND statement:
method=spline(natural)
For OBSERVED=BEGINNING, MIDDLE, and END series, the spline knots are placed at the beginning, middle, and end of each input interval, respectively. For total or averaged series, the spline knots are set at the start of the first interval, at the end of the last interval, and at the interval midpoints, except that there are no knots for the first two and last two midpoints.

Once the cubic spline curve is fit to the data, the spline is extended by adding linear segments at the beginning and end. These linear segments are used for extrapolating values beyond the range of the input data.

For point-in-time output series, the spline function is evaluated at the appropriate points. For interval total or average output series, the spline function is integrated over the output intervals.
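To compare endpoint constraints side by side, the same input series can be converted several times under different output names. The following is a sketch only: the data set names ANNUAL and MONTHLY, the variable X, and the output names are hypothetical, and the SLOPE=0 value is an arbitrary illustration.

```sas
proc expand data=annual out=monthly from=year to=month;
   id date;
   convert x=x_nak;                                     /* default not-a-knot constraint   */
   convert x=x_nat / method=spline(natural);            /* natural spline at both ends     */
   convert x=x_mix / method=spline(slope=0, natural);   /* lower: zero slope; upper: natural */
run;
```

Plotting or printing X_NAK, X_NAT, and X_MIX together shows where the choices differ, which is typically near the first and last input observations.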
The AGGREGATE Method

If the input data are totals or averages, the results are the sums or averages, respectively, of the input values for observations corresponding to the output observations. That is, if either TOTAL or AVERAGE is specified for the OBSERVED= option, the METHOD=AGGREGATE result is the sum or mean of the input values corresponding to the output observation. For example, suppose METHOD=AGGREGATE, FROM=MONTH, and TO=YEAR are specified. For OBSERVED=TOTAL series, the result for each output year is the sum of the input values over the months of that year. If any input value is missing, the corresponding sum or mean is also a missing value.

If the input data are point-in-time values, the result value of each output observation equals the input value for a selected input observation determined by the OBSERVED= attribute. For example, suppose METHOD=AGGREGATE, FROM=MONTH, and TO=YEAR are specified. For OBSERVED=BEGINNING series, January observations are selected as the annual values. For OBSERVED=MIDDLE series, July observations are selected as the annual values. For OBSERVED=END series, December observations are selected as the annual values. If the selected value is missing, the output annual value is missing.

The AGGREGATE method can be used only when the FROM= intervals are nested within the TO= intervals. For example, you can use METHOD=AGGREGATE when FROM=MONTH and TO=QTR because months are nested within quarters. You cannot use METHOD=AGGREGATE when FROM=WEEK and TO=QTR because weeks are not nested within quarters.

In addition, the AGGREGATE method cannot convert between point-in-time data and interval total or average data. Conversions between TOTAL and AVERAGE data are allowed, but conversions between BEGINNING, MIDDLE, and END are not.

Missing input values produce missing result values for METHOD=AGGREGATE. However, gaps in the sequence of input observations are not allowed. For example, if FROM=MONTH, you may have a missing value for a variable in an observation for a given February. But if an observation for January is followed by an observation for March, there is a gap in the data, and METHOD=AGGREGATE cannot be used.

When the AGGREGATE method is used, there is no interpolating curve, and therefore the EXTRAPOLATE option is not allowed.

Alternate methods for aggregating or accumulating time series data are supported by the TIMESERIES procedure. See Chapter 27, "The TIMESERIES Procedure," for more information.
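A minimal sketch of METHOD=AGGREGATE for a monthly-to-yearly conversion follows. The data set names MONTHLY and ANNUAL and the variables are hypothetical: A, B, and C are assumed to hold monthly totals, and X, Y, and Z are assumed to hold beginning-of-month values.

```sas
proc expand data=monthly out=annual from=month to=year;
   id date;
   /* each annual value is the sum of the twelve monthly totals */
   convert a b c / observed=total method=aggregate;
   /* each annual value is the January observation */
   convert x y z / observed=beginning method=aggregate;
run;
```

The input must contain an observation for every month in the range, even if some values are missing, because gaps in the observation sequence are not allowed with this method.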
METHOD=NONE
The option METHOD=NONE specifies that no interpolation be performed. This option is normally used in conjunction with the TRANSFORMIN= or TRANSFORMOUT= option. When METHOD=NONE is specified, there is no difference between the TRANSFORMIN= and TRANSFORMOUT= options; if both are specified, the TRANSFORMIN= operations are performed first, followed by the TRANSFORMOUT= operations. TRANSFORM= can be used as an abbreviation for TRANSFORMIN=. METHOD=NONE cannot be used when frequency conversion is specified.
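The sketch below uses METHOD=NONE purely as a calculation tool, applying operations from Table 14.2 without interpolation or frequency conversion. The data set names A and B, the variable X, and the output variable names are hypothetical; X is assumed to be strictly positive so that the LOG operation is defined.

```sas
proc expand data=a out=b method=none;
   id date;
   convert x=x_lag1 / transformout=(lag 1);      /* value one period earlier           */
   convert x=x_ma3  / transformout=(movave 3);   /* backward 3-period moving average   */
   convert x=x_dlog / transformout=(log dif 1);  /* log, then first difference, in order */
run;
```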
Transformation Operations
The operations that can be used in the TRANSFORMIN= and TRANSFORMOUT= options are shown in Table 14.2. Operations are applied to each value of the series. Each value of the series is replaced by the result of the operation.

In Table 14.2, $x_t$ or $x$ represents the value of the series at a particular time period $t$ before the transformation is applied, $y_t$ represents the value of the result series, and $N$ represents the total number of observations.

The notation [ n ] indicates that the argument n is optional; the default is 1. The notation window is used as the argument for the moving statistics operators, and it indicates that you can specify either an integer number of periods n or a list of n weights in parentheses. The notation sequence is used as the argument for the sequence operators, and it indicates that you must specify a sequence of numbers. The notation s indicates the length of seasonality, and it is a required argument.
Table 14.2 Transformation Operations

Syntax               Result

+ number             adds the specified number: $x + \mathit{number}$
- number             subtracts the specified number: $x - \mathit{number}$
* number             multiplies by the specified number: $x \times \mathit{number}$
/ number             divides by the specified number: $x / \mathit{number}$
ABS                  absolute value: $|x|$
ADJUST               indicates that the following moving window summation or product operator should be adjusted for window width
CD_I s               classical decomposition irregular component
CD_S s               classical decomposition seasonal component
CD_SA s              classical decomposition seasonally adjusted series
CD_TC s              classical decomposition trend-cycle component
CDA_I s              classical decomposition (additive) irregular component
CDA_S s              classical decomposition (additive) seasonal component
CDA_SA s             classical decomposition (additive) seasonally adjusted series
CEIL                 smallest integer greater than or equal to $x$: $\mathrm{ceil}(x)$
CMOVAVE window       centered moving average
CMOVCSS window       centered moving corrected sum of squares
CMOVGMEAN window     centered moving geometric mean
CMOVMAX n            centered moving maximum
CMOVMED n            centered moving median
CMOVMIN n            centered moving minimum
CMOVPROD window      centered moving product
CMOVRANGE n          centered moving range
CMOVRANK n           centered moving rank
CMOVSTD window       centered moving standard deviation
CMOVSUM n            centered moving sum
CMOVTVALUE window    centered moving t-value
CMOVUSS window       centered moving uncorrected sum of squares
CMOVVAR window       centered moving variance
CUAVE [ n ]          cumulative average
CUCSS [ n ]          cumulative corrected sum of squares
CUGMEAN [ n ]        cumulative geometric mean
CUMAX [ n ]          cumulative maximum
CUMED [ n ]          cumulative median
CUMIN [ n ]          cumulative minimum
CUPROD [ n ]         cumulative product
CURANK [ n ]         cumulative rank
CURANGE [ n ]        cumulative range
CUSTD [ n ]          cumulative standard deviation
CUSUM [ n ]          cumulative sum
CUTVALUE [ n ]       cumulative t-value
CUUSS [ n ]          cumulative uncorrected sum of squares
CUVAR [ n ]          cumulative variance
DIF [ n ]            span n difference: $x_t - x_{t-n}$
EWMA number          exponentially weighted moving average of $x$ with smoothing weight number, where $0 < \mathit{number} < 1$: $y_t = \mathit{number} \cdot x_t + (1 - \mathit{number}) \, y_{t-1}$. This operation is also called simple exponential smoothing.
EXP                  exponential function: $\exp(x)$
FDIF d               fractional difference with difference order d where $0 < d < 0.5$
FLOOR                largest integer less than or equal to $x$: $\mathrm{floor}(x)$
FSUM d               fractional summation with summation order d where $0 < d < 0.5$
HP_T lambda          Hodrick-Prescott Filter trend component where lambda is the nonnegative filter parameter
HP_C lambda          Hodrick-Prescott Filter cycle component where lambda is the nonnegative filter parameter
ILOGIT               inverse logistic function: $\frac{\exp(x)}{1 + \exp(x)}$
LAG [ n ]            value of the series n periods earlier: $x_{t-n}$
LEAD [ n ]           value of the series n periods later: $x_{t+n}$
LOG                  natural logarithm: $\log(x)$
LOGIT                logistic function: $\log\left(\frac{x}{1 - x}\right)$
MAX number           maximum of $x$ and number: $\max(x, \mathit{number})$
MIN number           minimum of $x$ and number: $\min(x, \mathit{number})$
> number             missing value if $x \le \mathit{number}$, else $x$
>= number            missing value if $x < \mathit{number}$, else $x$
= number             missing value if $x \ne \mathit{number}$, else $x$
^= number            missing value if $x = \mathit{number}$, else $x$
< number             missing value if $x \ge \mathit{number}$, else $x$
<= number            missing value if $x > \mathit{number}$, else $x$
MOVAVE n             backward moving average of n neighboring values: $\frac{1}{n} \sum_{j=0}^{n-1} x_{t-j}$
MOVAVE window        backward weighted moving average of neighboring values:
Table 14.2
continued
Syntax MOVCSS window MOVGMEAN window MOVMAX n MOVMED n MOVMIN n MOVPROD window MOVRANGE n MOVRANK n MOVSTD window MOVSUM n MOVTVALUE window MOVUSS window MOVVAR window MISSONLY <MEAN>
NEG NOMISS PCTDIF n PCTSUM n RATIO n RECIPROCAL REVERSE SCALE n1 n2 SEQADD sequence SEQDIV sequence SEQMINUS sequence SEQMULT sequence SET n1 n2 SETEMBEDDED n1 n2 SETLEFT n1 n2 SETMISS number SETRIGHT n1 n2 SIGN SQRT SQUARE SUM SUM n
Result P P . nD1 wj xt nCj /=. nD1 wj / j j backward moving corrected sum of squares backward moving geometric mean backward moving maximum backward moving median backward moving minimum backward moving product backward moving range backward moving rank backward moving standard deviation backward moving sum backward moving t-value backward moving uncorrected sum of squares backward moving variance indicates that the following moving time window statistic operator should replace only missing values with the moving statistic and should leave nonmissing values unchanged. If the option MEAN is specied, then missing values are replaced by the overall mean of the series. changes the sign: x indicates that the following moving time window statistic operator should not allow missing values percent difference of the current value and lag n percent summation of the current value and cumulative sum n-lag periods ratio of current value to lag n reciprocal: 1=x reverse the series: xN t scale series between n1 and n2 add sequence values to series divide series by sequence values subtract sequence values to series multiply series by sequence values set all values of n1 to n2 set embedded values of n1 to n2 set beginning values of n1 to n2 replaces missing values in the series with the number specied set ending values of n1 to n2 -1, 0, or 1 as x is < 0, equals 0, or > 0 respectively p square root: x square: x 2 P cumulative sum: t D1 xj j cumulative sum of multiples of n-period lags: xt C xt n C xt 2n C : : :
Table 14.2
continued
Result sets xt to missing a value if tn or t N n C 1 sets xt to missing a value if tn sets xt to missing a value if t N n C 1
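For example, the MOVAVE 5 operator computes a five-period backward moving average:

$$y_t = (x_t + x_{t-1} + x_{t-2} + x_{t-3} + x_{t-4})/5$$

The CMOVAVE 5 operator computes a five-period centered moving average:

$$y_t = (x_{t-2} + x_{t-1} + x_t + x_{t+1} + x_{t+2})/5$$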
If the window with a centered moving time window operator is not an odd number, one more lagged value than lead value is included in the time window. For example, the result of the CMOVAVE 4 operator is

$$y_t = (x_{t+1} + x_t + x_{t-1} + x_{t-2})/4$$
You can compute a forward moving time window operation by combining a backward moving time window operator with the REVERSE operator. For example, the following statement computes a five-period forward moving average of X.

convert x=y / transformout=( reverse movave 5 reverse );

In this example, the resulting transformation is

$$y_t = (x_t + x_{t+1} + x_{t+2} + x_{t+3} + x_{t+4})/5$$

Some of the moving time window operators enable you to specify a list of weight values to compute weighted statistics. These are CMOVAVE, CMOVCSS, CMOVGMEAN, CMOVPROD, CMOVSTD, CMOVTVALUE, CMOVUSS, CMOVVAR, MOVAVE, MOVCSS, MOVGMEAN, MOVPROD, MOVSTD, MOVTVALUE, MOVUSS, and MOVVAR. To specify a weighted moving time window operator, enter the weight values in parentheses after the operator name. The window width n is equal to the number of weights that you specify; do not specify n. For example, the following statement computes a weighted five-period centered moving average of X.
convert x=y / transformout=( cmovave( .1 .2 .4 .2 .1 ) );
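In this example, the resulting transformation is

$$y_t = .1x_{t-2} + .2x_{t-1} + .4x_t + .2x_{t+1} + .1x_{t+2}$$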
The weight values must be greater than zero. If the weights do not sum to 1, the weights specified are divided by their sum to produce the weights used to compute the statistic.

A complete time window is not available at the beginning of the series. For the centered operators a complete window is also not available at the end of the series. The computation of the moving time window operators is adjusted for these boundary conditions as follows.

For backward moving window operators, the width of the time window is shortened at the beginning of the series. For example, the results of the MOVSUM 3 operator are

$$\begin{aligned}
y_1 &= x_1 \\
y_2 &= x_1 + x_2 \\
y_3 &= x_1 + x_2 + x_3 \\
y_4 &= x_2 + x_3 + x_4 \\
y_5 &= x_3 + x_4 + x_5
\end{aligned}$$
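The shortened windows at the start of the series can be checked on a small series. The following is a minimal sketch, not taken from the original example set: the data set names TEST and SUMS, the variable names, and the dates are hypothetical. With x = 1, 2, 3, 4, 5, the MOVSUM 3 formulas above imply y = 1, 3, 6, 9, 12.

```sas
/* Hypothetical five-observation monthly series: x = 1, 2, 3, 4, 5 */
data test;
   do t = 1 to 5;
      date = intnx('month', '01jan2000'd, t - 1);
      x = t;
      output;
   end;
   format date monyy7.;
   drop t;
run;

/* METHOD=NONE skips interpolation, so only the transformation is applied */
proc expand data=test out=sums from=month method=none;
   id date;
   convert x = y / transformout=( movsum 3 );
run;

proc print data=sums;
run;
```

Appending TRIMLEFT 2 after MOVSUM 3 would set the two partial-window results to missing.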
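For centered moving window operators, the width of the time window is shortened at the beginning and the end of the series due to unavailable observations. For example, the results of the CMOVSUM 5 operator are

$$\begin{aligned}
y_1 &= x_1 + x_2 + x_3 \\
y_2 &= x_1 + x_2 + x_3 + x_4 \\
y_3 &= x_1 + x_2 + x_3 + x_4 + x_5 \\
y_4 &= x_2 + x_3 + x_4 + x_5 + x_6 \\
&\;\;\vdots \\
y_{N-2} &= x_{N-4} + x_{N-3} + x_{N-2} + x_{N-1} + x_N \\
y_{N-1} &= x_{N-3} + x_{N-2} + x_{N-1} + x_N \\
y_N &= x_{N-2} + x_{N-1} + x_N
\end{aligned}$$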
For weighted moving time window operators, the weights for the unavailable or unused observations are ignored and the remaining weights renormalized to sum to 1.
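As an illustration of this renormalization, consider the weighted centered operator CMOVAVE( .1 .2 .4 .2 .1 ) at the start of a series. This is a worked sketch of the rule just stated, not output from the procedure. At $t = 1$ the two lagged values are unavailable, so only the weights .4, .2, and .1 are used; at $t = 2$ one lagged value is unavailable, so the first weight is dropped:

$$\begin{aligned}
y_1 &= \frac{.4x_1 + .2x_2 + .1x_3}{.4 + .2 + .1} = \frac{.4x_1 + .2x_2 + .1x_3}{.7} \\
y_2 &= \frac{.2x_1 + .4x_2 + .2x_3 + .1x_4}{.2 + .4 + .2 + .1} = \frac{.2x_1 + .4x_2 + .2x_3 + .1x_4}{.9}
\end{aligned}$$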
You can specify a lag increment argument n for the cumulative statistics operators. In this case, the statistic is computed from the current and every nth previous value. When n is specified, these operators compute statistics of the values $x_t, x_{t-n}, x_{t-2n}, \ldots, x_{t-in}$ for $t - in > 0$. For example, the following statement computes $y_t$ as the cumulative sum of nonmissing $x_i$ values for odd $i$ when $t$ is odd and for even $i$ when $t$ is even.
convert x=y / transformout=( cusum 2 );
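A short worked case makes the odd/even split concrete. Assuming a hypothetical series $x = (1, 2, 3, 4, 5)$, the rule above gives:

$$\begin{aligned}
y_1 &= x_1 = 1 \\
y_2 &= x_2 = 2 \\
y_3 &= x_3 + x_1 = 4 \\
y_4 &= x_4 + x_2 = 6 \\
y_5 &= x_5 + x_3 + x_1 = 9
\end{aligned}$$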
Missing Values
You can truncate the length of the result series by using the TRIM, TRIMLEFT, and TRIMRIGHT operators to set values to missing at the beginning or end of the series. You can use these functions to trim the results of moving time window operators so that the result series contains only values computed from a full width time window. For example, the following statements compute a centered five-period moving average of X, and they set to missing the values at the ends of the series that are averages of fewer than five values.
convert x=y / transformout=( cmovave 5 trim 2 );
Normally, the moving time window and cumulative statistics operators ignore missing values and compute their results for the nonmissing values. When preceded by the NOMISS operator, these functions produce a missing result if any value within the time window is missing. The NOMISS operator does not perform any calculations, but serves to modify the operation of the moving time window operator that follows it. The NOMISS operator has no effect unless it is followed by a moving time window operator. For example, the following statement computes a five-period moving average of the variable X but produces a missing value when any of the five values are missing.
convert x=y / transformout=( nomiss movave 5 );
The following statement computes the cumulative sum of the variable X but produces a missing value for all periods after the first missing X value.
convert x=y / transformout=( nomiss cusum );
Similar to the NOMISS operator, the MISSONLY operator does not perform any calculations (unless followed by the MEAN option), but it serves to modify the operation of the moving time window operator that follows it. When preceded by the MISSONLY operator, these moving time window operators replace any missing values with the moving statistic and leave nonmissing values unchanged. For example, the following statement replaces any missing values of the variable X with an exponentially weighted moving average of the past values of X and leaves nonmissing values unchanged. The missing values are interpolated using the specified exponentially weighted moving average. (This is also called simple exponential smoothing.)
convert x=y / transformout=( missonly ewma 0.3 );
The following statement replaces any missing values of the variable X with the overall mean of X.
convert x=y / transformout=( missonly mean );
You can use the SETMISS operator to replace missing values with a specified number. For example, the following statement replaces any missing values of the variable X with the number 8.77.
convert x=y / transformout=( setmiss 8.77 );
For multiplicative decomposition, all of the $x_t$ values should be positive. The CD_TC operator computes the trend-cycle component for both the multiplicative and additive models. When s is odd, this operator computes an s-period centered moving average as follows:

$$TC_t = \sum_{k=-\lfloor s/2 \rfloor}^{\lfloor s/2 \rfloor} x_{t+k} / s$$
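When s is even, the CD_TC s operator computes the average of two adjacent s-period centered moving averages as follows:

$$TC_t = \sum_{k=-\lfloor s/2 \rfloor}^{\lfloor s/2 \rfloor - 1} x_{t+k} / 2s \; + \sum_{k=-\lfloor s/2 \rfloor + 1}^{\lfloor s/2 \rfloor} x_{t+k} / 2s$$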
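As a worked instance of the even case, assume that the two adjacent moving averages are the one spanning $t-2, \ldots, t+1$ and the one spanning $t-1, \ldots, t+2$. For quarterly data ($s = 4$) this would give:

$$TC_t = \frac{1}{2}\left( \frac{x_{t-2} + x_{t-1} + x_t + x_{t+1}}{4} + \frac{x_{t-1} + x_t + x_{t+1} + x_{t+2}}{4} \right) = \frac{x_{t-2} + 2x_{t-1} + 2x_t + 2x_{t+1} + x_{t+2}}{8}$$

The two end points fall in the same quarter and each receives half the weight of an interior point, so every quarter contributes equally to the trend-cycle estimate.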
The CD_S and CDA_S operators compute the seasonal components for the multiplicative and additive models, respectively. First, the trend-cycle component is computed as shown previously. Second, the seasonal-irregular component is computed by $SI_t = x_t / TC_t$ for the multiplicative model and by $SI_t = x_t - TC_t$ for the additive model. The seasonal component is obtained by averaging the seasonal-irregular component for each season.

$$S_{k+js} = \sum_{t = k \bmod s} \frac{SI_t}{n/s}$$

where $0 \le j \le n/s$ and $1 \le k \le s$. The seasonal components are normalized to sum to one (multiplicative) or zero (additive).

The CD_I and CDA_I operators compute the irregular component for the multiplicative and additive models, respectively. First, the seasonal component is computed as shown previously. Next, the irregular component is determined from the seasonal-irregular and seasonal components as appropriate.

$$\begin{aligned}
I_t &= SI_t / S_t && \text{(multiplicative)} \\
I_t &= SI_t - S_t && \text{(additive)}
\end{aligned}$$
The CD_SA and CDA_SA operators compute the seasonally adjusted time series for the multiplicative and additive models, respectively. After decomposition, the original time series can be seasonally adjusted as appropriate.

$$\begin{aligned}
\tilde{x}_t &= x_t / S_t = TC_t \, I_t && \text{(multiplicative)} \\
\tilde{x}_t &= x_t - S_t = TC_t + I_t && \text{(additive)}
\end{aligned}$$
The following statements compute all the multiplicative classical decomposition components for the variable X for s=12.
convert x=tc / transformout=( cd_tc 12 );
convert x=s  / transformout=( cd_s  12 );
convert x=i  / transformout=( cd_i  12 );
convert x=sa / transformout=( cd_sa 12 );
The following statements compute all the additive classical decomposition components for the variable X for s=4.
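convert x=tc / transformout=( cd_tc  4 );
convert x=s  / transformout=( cda_s  4 );
convert x=i  / transformout=( cda_i  4 );
convert x=sa / transformout=( cda_sa 4 );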
The X12 and X11 procedures provide other methods for seasonal decomposition. See Chapter 32, The X12 Procedure, and Chapter 31, The X11 Procedure.
Fractional Operators
For fractional operators, the parameter, d, represents the order of fractional differencing. Fractional summation is the inverse operation of fractional differencing.
Examples of Usage
Suppose that X is a fractionally integrated time series variable of order d=0.25. Fractionally differencing X forms a time series variable Y which is not integrated.
convert x=y / transformout=(fdif 0.25);
Suppose that Z is a non-integrated time series variable. Fractionally summing Z forms a time series W which is fractionally integrated of order d = 0.25.
convert z=w / transformout=(fsum 0.25);
The trades of a particular security are recorded for each weekday in a variable named PRICE. Given the historical daily trades, the ranking of the price of this security for each trading day, considering its entire past history, can be computed as follows:
convert price=history / transformout=( curank );
The ranking of the price of this security for each trading day considering the previous week's history can be computed as follows:

convert price=lastweek / transformout=( movrank 5 );

The ranking of the price of this security for each trading day considering the previous two weeks' history can be computed as follows:
convert price=twoweek / transformout=( movrank 10 );
Examples of Usage
The interest rates for a savings account are recorded for each month in the data set variable RATES. The cumulative interest rate for each month, considering the account's entire past history, can be computed as follows:
convert rates=history / transformout=( + 1 cuprod - 1);
The interest rate for each quarter, considering the previous quarter's history, can be computed as follows:
convert rates=lastqtr / transformout=( + 1 movprod 3 - 1);
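The operator chain + 1, CUPROD, - 1 compounds the rates. As a worked sketch, assume hypothetical monthly rates of 0.01, 0.02, and 0.03 (expressed as fractions, not percentages). The cumulative rate after each month is then:

$$\begin{aligned}
y_1 &= 1.01 - 1 = 0.01 \\
y_2 &= (1.01)(1.02) - 1 = 0.0302 \\
y_3 &= (1.01)(1.02)(1.03) - 1 = 0.061106
\end{aligned}$$

With MOVPROD 3 in place of CUPROD, the product is restricted to the current and two preceding months.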
Sequence Operators
For the sequence operators, the sequence values are used to compute the transformed values from the original values in a sequential fashion. You can add to or subtract from the original series or you can multiply or divide by the sequence values. The first sequence value is applied to the first observation of the series, the second sequence value is applied to the second observation of the series, and so on until the end of the sequence is reached. At this point, the first sequence value is applied to the next observation of the series, the second sequence value to the observation after that, and so on.
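Let $v_1, \ldots, v_m$ be the sequence values and let $x_t$, $t = 1, \ldots, N$, be the original time series. The transformed series, $y_t$, is computed as follows:

$$\begin{aligned}
y_1 &= x_1 \;\mathrm{op}\; v_1 \\
y_2 &= x_2 \;\mathrm{op}\; v_2 \\
&\;\;\vdots \\
y_m &= x_m \;\mathrm{op}\; v_m \\
y_{m+1} &= x_{m+1} \;\mathrm{op}\; v_1 \\
y_{m+2} &= x_{m+2} \;\mathrm{op}\; v_2 \\
&\;\;\vdots \\
y_{2m} &= x_{2m} \;\mathrm{op}\; v_m \\
y_{2m+1} &= x_{2m+1} \;\mathrm{op}\; v_1 \\
y_{2m+2} &= x_{2m+2} \;\mathrm{op}\; v_2 \\
&\;\;\vdots
\end{aligned}$$

where op = +, -, *, or /.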
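The recycling of the sequence can be seen on a series that is longer than the sequence. The following is a minimal sketch under assumed names: the data sets SEQTEST and SEQOUT, the variables, and the dates are hypothetical. With x = 1, ..., 7 and the sequence (10 20 30), the rule above implies y = 11, 22, 33, 14, 25, 36, 17.

```sas
/* Hypothetical seven-observation monthly series: x = 1, ..., 7 */
data seqtest;
   do t = 1 to 7;
      date = intnx('month', '01jan2000'd, t - 1);
      x = t;
      output;
   end;
   format date monyy7.;
   drop t;
run;

/* The three sequence values are reused starting at the fourth observation */
proc expand data=seqtest out=seqout from=month method=none;
   id date;
   convert x = y / transformout=( seqadd (10 20 30) );
run;

proc print data=seqout;
run;
```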
Examples of Usage
The multiplicative seasonal indices are 0.9, 1.2, 0.8, and 1.1 for the four quarters. Let SEASADJ be a quarterly time series variable that has been seasonally adjusted in a multiplicative fashion. To restore the seasonality to SEASADJ, use the following transformation:
convert seasadj=seasonal / transformout=(seqmult (0.9 1.2 0.8 1.1));
The additive seasonal indices are 4.4, -1.1, -2.1, and -1.2 for the four quarters. Let SEASADJ be a quarterly time series variable that has been seasonally adjusted in additive fashion. To restore the seasonality to SEASADJ use the following transformation:
convert seasadj=seasonal / transformout=(seqadd (4.4 -1.1 -2.1 -1.2));
Set Operators
For the set operators, the first parameter, n1, represents the value to be replaced and the second parameter, n2, represents the replacement value. The replacement can be localized to the beginning, middle, or end of the series.
Examples of Usage
Suppose that a store opened recently and that the sales history is stored in a database that does not recognize missing values. Even though demand may have existed prior to the store's opening, this database assigns the value of zero. Modeling the sales history may be problematic because the sales history is mostly zero. To compensate for this deficiency, the leading zero values should be set to missing with the remaining zero values unchanged (representing no demand).
convert sales=demand / transformout=(setleft 0 .);
Likewise, suppose a store closed recently. The demand may still be present, and hence a recorded value of zero does not accurately reflect actual demand.
convert sales=demand / transformout=(setright 0 .);
Scale Operator
For the scale operator, the first parameter, n1, represents the value associated with the minimum value ($x_{min}$) and the second parameter, n2, represents the value associated with the maximum value ($x_{max}$) of the original series ($x_t$). The scale operator rescales the original data to be between the parameters n1 and n2 as follows:

$$y_t = \left( (n_2 - n_1) / (x_{max} - x_{min}) \right) (x_t - x_{min}) + n_1$$
Examples of Usage
Suppose that two new product sales histories are stored in variables X and Y and you wish to determine their adoption rates. In order to compare their adoption histories the variables must be scaled for comparison.
convert x=w / transformout=(scale 0 1);
convert y=z / transformout=(scale 0 1);
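As a worked sketch of SCALE 0 1, assume a hypothetical series $x = (10, 20, 50)$, so that $x_{min} = 10$, $x_{max} = 50$, $n_1 = 0$, and $n_2 = 1$. The formula for the scale operator then gives:

$$y_t = \frac{1 - 0}{50 - 10}\,(x_t - 10) + 0 \quad\Rightarrow\quad y = (0,\; 0.25,\; 1)$$

The smallest value maps to n1 and the largest to n2, which is what puts two sales histories of different magnitude on a common footing.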
Adjust Operator
For the moving summation and product window operators, the window widths at the beginning and end of the series are smaller than those in the middle of the series. Likewise, if there are embedded missing values, the window width is smaller than specified. When preceded by the ADJUST operator, the moving summation operators (MOVSUM and CMOVSUM) and moving product operators (MOVPROD and CMOVPROD) are adjusted by the window width. For example, suppose the variable X has 10 values and the moving summation operator of width 3 is applied to X to create the variable Y with window width adjustment and the variable Z without adjustment.
convert x=y / transformout=(adjust movsum 3);
convert x=z / transformout=(movsum 3);
The above transformations result in the following relationship between Y and Z: $y_1 = 3z_1$, $y_2 = \frac{3}{2}z_2$, and $y_t = z_t$ for $t > 2$, because the first two window widths are smaller than 3.

For example, suppose the variable X has 10 values and the moving multiplicative operator of width 3 is applied to X to create the variable Y with window width adjustment and the variable Z without adjustment.
convert x=y / transformout=(adjust movprod 3);
convert x=z / transformout=(movprod 3);
The above transformations result in the following relationship: $y_1 = z_1^3$, $y_2 = z_2^{3/2}$, and $y_t = z_t$ for $t > 2$, because the first two window widths are smaller than 3.
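A constant series gives a quick check of the window width adjustment. The following is a minimal sketch under assumed names: the data sets CONST and ADJUSTED and the dates are hypothetical. With x = 2 for all 10 observations, the unadjusted sums are z = 2, 4, 6, 6, ..., and, if the stated relationship holds, the adjusted sums are y = 6 for every observation.

```sas
/* Hypothetical constant monthly series of 10 observations */
data const;
   do t = 1 to 10;
      date = intnx('month', '01jan2000'd, t - 1);
      x = 2;
      output;
   end;
   format date monyy7.;
   drop t;
run;

/* Y uses the window width adjustment; Z does not */
proc expand data=const out=adjusted from=month method=none;
   id date;
   convert x = y / transformout=( adjust movsum 3 );
   convert x = z / transformout=( movsum 3 );
run;

proc print data=adjusted;
run;
```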
Percent Operators
The percentage operators compute the percent summation and the percent difference of the current value and the lag(n) value. The percent summation operator (PCTSUM) computes $y_t = 100\, x_t / \mathrm{cusum}(x_{t-n})$. If any of the values of the preceding equation are missing or the cumulative summation is zero, the result is set to missing. The percent difference operator (PCTDIF) computes $y_t = 100\, (x_t - x_{t-n}) / x_{t-n}$. If any of the values of the preceding equation are missing or the lag value is zero, the result is set to missing.

For example, suppose variable X contains the series. The percent summation of lag 4 is applied to X to create the variable Y. The percent difference of lag 4 is applied to X to create the variable Z.
convert x=y / transformout=(pctsum 4);
convert x=z / transformout=(pctdif 4);
Ratio Operators
The ratio operator computes the ratio of the current value and the lag(n) value. The ratio operator (RATIO) computes $y_t = x_t / x_{t-n}$. If any of the values of the preceding equation are missing or the lag value is zero, the result is set to missing.

For example, suppose variable X contains the series. The ratio of the current value and the lag 4 value of X is assigned to the variable Y. The percent ratio of the current value and the lag 4 value of X is assigned to the variable Z.
convert x=y / transformout=(ratio 4);
convert x=z / transformout=(ratio 4 * 100);
the ID variable that contains the lower breakpoint (or knot) of the spline segment to which the coefficients apply. The ID variable has the same name as the variable used in the ID statement. If an ID statement is not used, but the FROM= option is used, then the name of the ID variable is DATE or DATETIME, depending on whether the FROM= option indicates SAS date or SAS datetime values. If neither an ID statement nor the FROM= option is used, the ID variable is named TIME.
CONSTANT, the constant coefficient for the spline segment
LINEAR, the linear coefficient for the spline segment
QUAD, the quadratic coefficient for the spline segment
CUBIC, the cubic coefficient for the spline segment

For each BY group, the OUTEST= data set contains observations for each polynomial segment of the spline curve fit to each input series. To obtain the observations defining the spline curve used for a series, select the observations where the value of VARNAME equals the name of the series.

The observations for a series in the OUTEST= data set encode the spline function fit to the series as follows. Let $a_i$, $b_i$, $c_i$, and $d_i$ be the values of the variables CUBIC, QUAD, LINEAR, and CONSTANT, respectively, for the $i$th observation for the series. Let $x_i$ be the value of the ID variable for the $i$th observation for the series. Let $n$ be the number of observations in the OUTEST= data set for the series. The value of the spline function evaluated at a point $x$ is

$$f(x) = a_i (x - x_i)^3 + b_i (x - x_i)^2 + c_i (x - x_i) + d_i$$

where the segment number $i$ is selected as follows:

$$i = \begin{cases}
i & x_i \le x < x_{i+1},\; 1 \le i < n \\
1 & x < x_1 \\
n & x \ge x_n
\end{cases}$$

In other words, if $x$ is between the first and last ID values ($x_1 \le x < x_n$), use the observation from the OUTEST= data set with the largest ID value less than or equal to $x$. If $x$ is less than the first ID value $x_1$, then $i = 1$. If $x$ is greater than or equal to the last ID value ($x \ge x_n$), then $i = n$.

For METHOD=JOIN, the curve is a linear spline, and the values of CUBIC and QUAD are 0. For METHOD=STEP, the curve is a constant spline, and the values of CUBIC, QUAD, and LINEAR are 0. For METHOD=AGGREGATE, no coefficients are output.
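The segment selection rule can be applied in a DATA step to evaluate the fitted spline at an arbitrary point. The following is a sketch under stated assumptions, not a procedure feature: the data set names ANNUAL, MONTHLY, EST, and SPLINE_VALUE, the sample values, and the evaluation point X0 are all hypothetical, and the code assumes that the OUTEST= observations for a series are stored in increasing order of the ID variable.

```sas
/* Hypothetical annual series to be interpolated to monthly values */
data annual;
   input year x;
   date = mdy(1, 1, year);
   format date date9.;
   datalines;
1988 100
1989 112
1990 120
1991 135
1992 150
;
run;

/* Save the spline coefficients in the OUTEST= data set EST */
proc expand data=annual out=monthly outest=est from=year to=month;
   id date;
   convert x;
run;

/* Evaluate f(x0) using the segment selection rule described above */
data spline_value;
   set est end=last;
   where upcase(varname) = 'X';
   retain a b c d xi;
   x0 = '15mar1990'd;              /* hypothetical evaluation point */
   /* Start with the first segment (covers x0 < x1); afterward keep the
      segment with the largest knot that does not exceed x0.           */
   if _n_ = 1 or date <= x0 then do;
      a = cubic; b = quad; c = linear; d = constant; xi = date;
   end;
   if last then do;
      f = a*(x0 - xi)**3 + b*(x0 - xi)**2 + c*(x0 - xi) + d;
      output;
   end;
   keep x0 f;
   format x0 date9.;
run;
```

Because the knots here are SAS date values, the differences x0 - xi are measured in days.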
ODS Graphics
This section describes the use of ODS for creating graphics with the EXPAND procedure. To request these graphs, you must specify the statement ods graphics on; in your SAS program prior to the PROC EXPAND step and specify the PLOTS= option in the PROC EXPAND statement.
ODS Graph Name                 Plot Description                 PLOTS= Options
ConvertedSeriesPlot            Converted Series Plot            CONVERTED OUTPUT ALL
CrossInputSeriesPlot           Cross Input Series Plot          CROSSINPUT
CrossOutputSeriesPlot          Cross Output Series Plot         CROSSOUTPUT
InputSeriesPlot                Input Series Plot                INPUT JOINTINPUT ALL
JointInputSeriesPlot           Joint Input Series Plot          JOINTINPUT
JointOutputSeriesPlot          Joint Output Series Plot         JOINTOUTPUT
OutputSeriesPlot               Output Series Plot               SERIES|OUTPUT
TransformedInputSeriesPlot     Transformed Input Series Plot    TRANSFORMIN OUTPUT ALL
TransformedOutputSeriesPlot    Transformed Output Series Plot   TRANSFORMOUT OUTPUT ALL
The PLOTS= option values CROSSINPUT and CROSSOUTPUT produce graphs that overlay plots of two series by using two Y axes, with each of the two plots shown at a separate scale. The PLOTS= option values JOINTINPUT and JOINTOUTPUT produce graphs that overlay plots of two series by using a single Y axis, with both of the plots shown on the same scale.

Note that because the joint graphics options (PLOTS=JOINTINPUT or PLOTS=JOINTOUTPUT) plot the (input or converted) series and the transformed series on the same scale, these plots may be hard to read if the transformation changes the range of the series.

The PLOTS=CROSSINPUT option plots both the input series and the series after the input transformation (TRANSFORMIN= option) is applied. The left side vertical axis refers to the input series, while the right side vertical axis refers to the series after the transformation. If the TRANSFORMIN= option is not specified in the CONVERT statement for an output variable, the cross input plot is not produced for that variable.

The PLOTS=JOINTINPUT option jointly plots both the input series and the series after the input transformation (TRANSFORMIN= option) is applied. If the TRANSFORMIN= option is not specified in the CONVERT statement for an output variable, the joint input plot is not produced for that variable.

The PLOTS=CROSSOUTPUT option plots both the converted series and the converted series after the output transformation (TRANSFORMOUT= option) is applied. The left side vertical axis refers to the input series, while the right side vertical axis refers to the series after the transformation. If the TRANSFORMOUT= option is not specified in the CONVERT statement for an output variable, the cross output plot is not produced for that variable.

The PLOTS=JOINTOUTPUT option jointly plots both the converted series and the converted series after the output transformation (TRANSFORMOUT= option) is applied. If the TRANSFORMOUT= option is not specified in the CONVERT statement for an output variable, the joint output plot is not produced for that variable.
The PLOTS=ALL option is a convenient way to specify all the plots except the OUTPUT plots and the joint and cross plots. The option PLOTS=(ALL OUTPUT JOINTINPUT JOINTOUTPUT CROSSINPUT CROSSOUTPUT) requests that all possible plots be produced.
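As a sketch of how these options fit together, the following step requests the transformed output plot and the cross output plot for a smoothed series. The output data set name SMOOTHED and the variable name IP_MA are hypothetical; SASHELP.CITIMON and its IP variable are the sample data used elsewhere in this chapter.

```sas
ods graphics on;

/* Smooth the monthly IP series and plot it against the unsmoothed series */
proc expand data=sashelp.citimon out=smoothed from=month
            plots=(transformout crossoutput);
   id date;
   convert ip = ip_ma / transformout=( cmovave 5 );
run;
```

Because a TRANSFORMOUT= operation is specified for IP_MA, the cross output plot is eligible to be produced for that variable.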
The following statements extract and print the monthly data, shown in Output 14.1.2.
data monthly;
   set sashelp.citimon;
   where date >= '1jan1990'd & date < '1jan1992'd;
   keep date ip lhur;
run;

title "Monthly Data";
proc print data=monthly;
run;
The following statements interpolate monthly estimates for the quarterly series and merge the interpolated series with the monthly data. The resulting combined data set is then printed, as shown in Output 14.1.3.
proc expand data=qtrly out=temp from=qtr to=month;
   convert gdp gd / observed=average;
   id date;
run;

data combined;
   merge monthly temp;
   by date;
run;
To compute the monthly estimates, use PROC EXPAND with the TO=MONTH option and specify OBSERVED=(BEGINNING,AVERAGE). The following statements interpolate the monthly estimates.
proc expand data=samples out=monthly to=month plots=(input output);
   id date;
   convert defects / observed=(beginning,average);
run;
The following PROC PRINT step prints the results, as shown in Output 14.3.2.
title "Estimated Monthly Average Defect Rates";
proc print data=monthly;
run;
The following statements use PROC EXPAND to compute lags and leads and a 3-period moving average of the X series.
proc expand data=test out=out method=none;
   id date;
   convert x = x_lag2   / transformout=(lag 2);
   convert x = x_lag1   / transformout=(lag 1);
   convert x;
   convert x = x_lead1  / transformout=(lead 1);
   convert x = x_lead2  / transformout=(lead 2);
   convert x = x_movave / transformout=(movave 3);
run;

title "Transformed Series";
proc print data=out;
run;
Because there are no missing values to interpolate and no frequency conversion, the METHOD=NONE option is used to prevent PROC EXPAND from performing unnecessary computations. Because no frequency conversion is done, all variables in the input data set are copied to the output data set. The CONVERT X; statement is included to control the position of X in the output data set. This statement can be omitted, in which case X is copied to the output data set following the new variables computed by PROC EXPAND. The results are shown in Output 14.4.1.
Output 14.4.1 Output Data Set with Transformed Variables
Transformed Series

Obs   date     x_lag2   x_lag1   x      x_lead1   x_lead2   x_movave   year   qtr
1     1989:3   .        .        5238   5289      5375      5238.00    1989   3
2     1989:4   .        5238     5289   5375      5443      5263.50    1989   4
3     1990:1   5238     5289     5375   5443      5514      5300.67    1990   1
4     1990:2   5289     5375     5443   5514      5527      5369.00    1990   2
5     1990:3   5375     5443     5514   5527      5557      5444.00    1990   3
6     1990:4   5443     5514     5527   5557      5615      5494.67    1990   4
7     1991:1   5514     5527     5557   5615      .         5532.67    1991   1
8     1991:2   5527     5557     5615   .         .         5566.33    1991   2
References
DeBoor, Carl (1981), A Practical Guide to Splines, New York: Springer-Verlag.
Hodrick, R. J., and Prescott, E. C. (1980), Post-war U.S. Business Cycles: An Empirical Investigation, Discussion Paper 451, Carnegie-Mellon University.
Levenbach, H. and Cleary, J. P. (1984), The Modern Forecaster, Belmont, CA: Lifetime Learning Publications (a division of Wadsworth, Inc.), 129-133.
Makridakis, S. and Wheelwright, S. C. (1978), Interactive Forecasting: Univariate and Multivariate Methods, Second Edition, San Francisco: Holden-Day, 198-201.
Wheelwright, S. C. and Makridakis, S. (1973), Forecasting Methods for Management, Third Edition, New York: Wiley-Interscience, 123-133.
Chapter 15
The following statements forecast 10 observations for the variable SALES by using the default STEPAR method and write the results to the output data set PRED:
proc forecast data=past lead=10 out=pred;
   var sales;
run;
The following statements use the PRINT procedure to print the data set PRED:
proc print data=pred;
run;
The PROC PRINT listing of the forecast data set PRED is shown in Figure 15.2.
proc forecast data=past interval=month lead=10
              out=pred outlimit;
   var sales;
   id date;
run;

proc print data=pred;
run;
The three observations for each forecast period have different values of the variable _TYPE_. For the _TYPE_=FORECAST observation, the value of the variable SALES is the forecast value for the period indicated by the DATE value. For the _TYPE_=L95 observation, the value of the variable SALES is the lower limit of the 95% confidence interval for the forecast. For the _TYPE_=U95 observation, the value of the variable SALES is the upper limit of the 95% confidence interval.

You can control the types of observations written to the OUT= data set with the PROC FORECAST statement options OUTLIMIT, OUTRESID, OUTACTUAL, OUT1STEP, OUTSTD, OUTFULL, and OUTALL. For example, the OUTFULL option outputs the confidence limit values, the one-step-ahead predictions, and the actual data, in addition to the forecast values. See the sections Syntax: FORECAST Procedure on page 788 and OUTEST= Data Set on page 808 for more information.
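When one row per forecast period is more convenient, the stacked _TYPE_ observations can be rearranged with PROC TRANSPOSE. The following is a sketch under stated assumptions: the generated PAST data and the output data set name WIDE are hypothetical and stand in for the data used in this section.

```sas
/* Hypothetical monthly sales history, July 1989 through July 1991 */
data past;
   do i = 0 to 24;
      date = intnx('month', '01jul1989'd, i);
      sales = 9.5 + 0.12*i + 0.2*sin(i);
      output;
   end;
   format date monyy5.;
   drop i;
run;

proc forecast data=past interval=month lead=10
              out=pred outlimit;
   var sales;
   id date;
run;

/* One observation per date, with columns FORECAST, L95, and U95 */
proc transpose data=pred out=wide(drop=_name_);
   by date;
   id _type_;
   var sales;
run;

proc print data=wide;
run;
```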
Plotting Forecasts
The forecasts, confidence limits, and actual values can be plotted on the same graph with the SGPLOT procedure. Use the appropriate output control options in the PROC FORECAST statement to include in the OUT= data set the series you want to plot. Use the _TYPE_ variable in the SGPLOT procedure GROUP option to separate the observations for the different plots.

The OUTFULL option is used in the following statements. The resulting output data set contains the actual and predicted values, as well as the upper and lower 95% confidence limits.
proc forecast data=past interval=month lead=10
              out=pred outfull;
   id date;
   var sales;
run;

proc sgplot data=pred;
   series x=date y=sales / group=_type_ lineattrs=(pattern=1);
   xaxis values=('1jan90'd to '1jan93'd by qtr);
   refline '15jul91'd / axis=x;
run;
The _TYPE_ variable is used in the GROUP= option of the SGPLOT procedure's SERIES statement to make separate plots over time for each type of value. A reference line marks the start of the forecast period. (See SAS/GRAPH Software: Reference for more information about using PROC SGPLOT.) The WHERE statement restricts the range of the actual data shown in the plot. In this example, the variable SALES has monthly data from July 1989 through July 1991, but only the data for 1990 and 1991 are shown in Figure 15.4.
Plotting Residuals
You can plot the residuals from the forecasting model by using PROC SGPLOT and a WHERE statement.

1. Use the OUTRESID option or the OUTALL option in the PROC FORECAST statement to include the residuals in the output data set.
2. Use a WHERE statement to specify the observation type of RESIDUAL in the PROC SGPLOT code.

The following statements add the OUTRESID option to the preceding example and plot the residuals:
proc forecast data=past interval=month lead=10
              out=pred outfull outresid;
   id date;
   var sales;
run;

proc sgplot data=pred;
   where _type_='RESIDUAL';
   needle x=date y=sales / markers;
   xaxis values=('1jan89'd to '1oct91'd by qtr);
run;
The PRINT procedure prints the OUTEST= data set, as shown in Figure 15.6.
Figure 15.6 The OUTEST= Data Set for STEPAR Method
Obs   _TYPE_     date     sales
  1   N          JUL91    25
  2   NRESID     JUL91    25
  3   DF         JUL91    22
  4   SIGMA      JUL91    0.2001613
  5   CONSTANT   JUL91    9.4348822
  6   LINEAR     JUL91    0.1242648
  7   AR1        JUL91    0.5206294
  8   AR2        JUL91    .
  9   AR3        JUL91    .
 10   AR4        JUL91    .
 11   AR5        JUL91    .
 12   AR6        JUL91    .
 13   AR7        JUL91    .
 14   AR8        JUL91    .
 15   SST        JUL91    21.28342
 16   SSE        JUL91    0.8793714
 17   MSE        JUL91    0.0399714
 18   RMSE       JUL91    0.1999286
 19   MAPE       JUL91    1.2280089
 20   MPE        JUL91    -0.050139
 21   MAE        JUL91    0.1312115
 22   ME         JUL91    -0.001811
 23   MAXE       JUL91    0.3732328
 24   MINE       JUL91    -0.551605
 25   MAXPE      JUL91    3.2692294
 26   MINPE      JUL91    -5.954022
 27   RSQUARE    JUL91    0.9586828
 28   ADJRSQ     JUL91    0.9549267
 29   RW_RSQ     JUL91    0.2657801
 30   ARSQ       JUL91    0.9474145
 31   APC        JUL91    0.044768
 32   AIC        JUL91    -77.68559
 33   SBC        JUL91    -74.02897
 34   CORR       JUL91    0.9791313
In the OUTEST= data set, the DATE variable contains the ID value of the last observation in the data set used to fit the forecasting model. The variable SALES contains the statistic indicated by the value of the _TYPE_ variable. The _TYPE_=N, NRESID, and DF observations contain, respectively, the number of observations read from the data set, the number of nonmissing residuals used to compute the goodness-of-fit statistics, and the number of nonmissing observations minus the number of parameters used in the forecasting model. The observation that has _TYPE_=SIGMA contains the estimate of the standard deviation of the one-step prediction error computed from the residuals. The _TYPE_=CONSTANT and _TYPE_=LINEAR observations contain the coefficients of the time trend regression. The _TYPE_=AR1, AR2, ..., AR8 observations contain the estimated autoregressive parameters. A missing autoregressive parameter indicates that the autoregressive term at that lag was not retained in the model by the stepwise model selection method. (See the section STEPAR Method on page 797 for more information.) The other observations in the OUTEST= data set contain various goodness-of-fit statistics that measure how well the forecasting model fits the given data. See the section OUTEST= Data Set on page 808 for details.
The PRINT procedure prints the OUTEST= data set for the EXPO method, as shown in Figure 15.7.
See the section Syntax: FORECAST Procedure on page 788 for other options that control the forecasting method. See the section Introduction to Forecasting Methods on page 783 and the section Forecasting Methods on page 796 for an explanation of the different forecasting methods.
Two approaches to time series modeling and forecasting are time trend models and time series methods.
Suppose that the series exhibits growth over time, as shown in Figure 15.9.
Figure 15.9 Time Series with Linear Trend
A linear model is appropriate for this data. For the linear model, assume the $x_t$ values are generated according to the equation
$$ x_t = b_0 + b_1 t + \epsilon_t $$
The linear model has two parameters. The predicted values for the future are the points on the estimated line. The extension of the polynomial model to three parameters is the quadratic (which forms a parabola). This allows for a constantly changing slope, where the $x_t$ values are generated according to the equation
$$ x_t = b_0 + b_1 t + b_2 t^2 + \epsilon_t $$
PROC FORECAST can fit three types of time trend models: constant, linear, and quadratic. For other kinds of trend models, other SAS procedures can be used.
Exponential smoothing fits a time trend model by using a smoothing scheme in which the weights decline geometrically as you go backward in time. The forecasts from exponential smoothing are a time trend, but the trend is based mostly on the recent observations instead of on all the observations equally. How well exponential smoothing works as a forecasting method depends on choosing a good smoothing weight for the series.
To specify the exponential smoothing method, use the METHOD=EXPO option. Single exponential smoothing produces forecasts with a constant trend (that is, no trend). Double exponential smoothing produces forecasts with a linear trend, and triple exponential smoothing produces a quadratic trend. Use the TREND= option with the METHOD=EXPO option to select single, double, or triple exponential smoothing.
The time trend model can be modified to account for regular seasonal fluctuations of the series about the trend. To capture seasonality, the trend model includes a seasonal parameter for each season. Seasonal models can be additive or multiplicative.
$$ x_t = b_0 + b_1 t + s(t) + \epsilon_t \quad \text{(additive)} $$
$$ x_t = (b_0 + b_1 t)\, s(t) + \epsilon_t \quad \text{(multiplicative)} $$
where $s(t)$ is the seasonal parameter for the season that corresponds to time t.
The Winters method is similar to exponential smoothing, but it includes seasonal factors. The Winters method can use either additive or multiplicative seasonal factors. Like exponential smoothing, good results with the Winters method depend on choosing good smoothing weights for the series to be forecast. To specify the multiplicative or additive versions of the Winters method, use the METHOD=WINTERS or METHOD=ADDWINTERS options, respectively. To specify seasonal factors to include in the model, use the SEASONS= option.
Many observed time series do not behave like constant, linear, or quadratic time trends. However, you can partially compensate for the inadequacies of the trend models by fitting time series models to the departures from the time trend, as described in the following sections.
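As a minimal sketch of how these options combine, the following statements request double exponential smoothing and then a multiplicative Winters model with monthly seasonal factors. The data set PAST and the variables DATE and SALES follow the earlier examples in this chapter; the output data set names PRED_EXPO and PRED_WINT and the LEAD=12 horizon are assumptions chosen for illustration.

```sas
/* Double exponential smoothing: linear trend, no seasonal factors */
proc forecast data=past interval=month lead=12
              method=expo trend=2
              out=pred_expo;
   id date;
   var sales;
run;

/* Multiplicative Winters method with one seasonal factor per month */
proc forecast data=past interval=month lead=12
              method=winters seasons=month
              out=pred_wint;
   id date;
   var sales;
run;
```

Replacing METHOD=WINTERS with METHOD=ADDWINTERS requests the additive form of the seasonal model.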
$$ x_t = a_1 x_{t-1} + a_2 x_{t-2} + \ldots + a_p x_{t-p} + \epsilon_t $$
The coefficients $a_i$ are autoregressive parameters. One of the simplest cases of this model is the random walk, where the series dances around in purely random jumps. This is illustrated in Figure 15.10.
In this type of model, the best forecast of a future value is the present value. However, with other autoregressive models, the best forecast is a weighted sum of recent values. Pure autoregressive forecasts always damp down to a constant (assuming the process is stationary). Autoregressive time series models can also be used to predict seasonal fluctuations.
The combined time trend and autoregressive model is written as follows:
$$ x_t = b_0 + b_1 t + b_2 t^2 + u_t $$
$$ u_t = a_1 u_{t-1} + a_2 u_{t-2} + \ldots + a_p u_{t-p} + \epsilon_t $$
The autoregressive parameters included in the model for each series are selected by a stepwise regression procedure, so that autoregressive parameters are included only at those lags at which they are statistically significant. The stepwise autoregressive method is fully automatic. Unlike the exponential smoothing and Winters methods, it does not depend on choosing smoothing weights. However, the STEPAR method assumes that the long-term trend is stable; that is, the time trend regression is fit to the whole series with equal weights for the observations. The stepwise autoregressive model is used when you specify the METHOD=STEPAR option or do not specify any METHOD= option. To select a constant, linear, or quadratic trend for the time-trend part of the model, use the TREND= option.
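The following sketch requests the stepwise autoregressive method with a quadratic trend and then prints the trend and low-order autoregressive estimates. It reuses the PAST data set from the earlier examples; the output data set names PRED and EST, the LEAD=12 horizon, and the choice to print only the first three AR lags are assumptions made for illustration.

```sas
/* Stepwise autoregressive method with a quadratic time trend.       */
/* METHOD=STEPAR is the default; it is written out here for clarity. */
proc forecast data=past interval=month lead=12
              method=stepar trend=3
              out=pred outest=est;
   id date;
   var sales;
run;

/* AR lags dropped by the stepwise selection appear as missing values */
proc print data=est;
   where _type_ in ('CONSTANT', 'LINEAR', 'QUAD', 'AR1', 'AR2', 'AR3');
run;
```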
Functional Summary
Table 15.1 summarizes the statements and options that control the FORECAST procedure.
Table 15.1 FORECAST Functional Summary
Description                                                 Statement       Option

Statements
specify model and data set options                          PROC FORECAST
specify BY-group processing                                 BY
identify observations                                       ID
specify the variables to forecast                           VAR

Input Data Set Options
specify the input SAS data set                              PROC FORECAST   DATA=
specify frequency of the input time series                  PROC FORECAST   INTERVAL=
specify increment between observations                      PROC FORECAST   INTPER=
specify seasonality                                         PROC FORECAST   SEASONS=
specify number of periods in a season                       PROC FORECAST   SINTPER=
treat zeros at beginning of series as missing               PROC FORECAST   ZEROMISS

Output Data Set Options
specify the number of periods ahead to forecast             PROC FORECAST   LEAD=
name output data set to contain the forecasts               PROC FORECAST   OUT=
write actual values to the OUT= data set                    PROC FORECAST   OUTACTUAL
write confidence limits to the OUT= data set                PROC FORECAST   OUTLIMIT
write residuals to the OUT= data set                        PROC FORECAST   OUTRESID
write standard errors of the forecasts to the               PROC FORECAST   OUTSTD
  OUT= data set
write one-step-ahead predicted values to the                PROC FORECAST   OUT1STEP
  OUT= data set
write predicted, actual, and confidence limit               PROC FORECAST   OUTFULL
  values to the OUT= data set
write all available results to the OUT= data set            PROC FORECAST   OUTALL
specify significance level for confidence limits            PROC FORECAST   ALPHA=
control the alignment of SAS date values                    PROC FORECAST   ALIGN=

Parameters and Statistics Output Data Set Options
write parameter estimates and goodness-of-fit               PROC FORECAST   OUTEST=
  statistics to an output data set
write additional statistics to OUTEST= data set             PROC FORECAST   OUTESTALL
write Theil statistics to OUTEST= data set                  PROC FORECAST   OUTESTTHEIL
write forecast accuracy statistics to OUTEST= data set      PROC FORECAST   OUTFITSTATS

Forecasting Method Options
specify the forecasting method                              PROC FORECAST   METHOD=
specify degree of the time trend model                      PROC FORECAST   TREND=
specify smoothing weights                                   PROC FORECAST   WEIGHT=
specify order of the autoregressive model                   PROC FORECAST   AR=
specify significance level for adding AR lags               PROC FORECAST   SLENTRY=
specify significance level for keeping AR lags              PROC FORECAST   SLSTAY=
start forecasting before the end of data                    PROC FORECAST   START=
specify criterion for judging singularity                   PROC FORECAST   SINGULAR=
limit number of error or warning messages                   PROC FORECAST   MAXERRORS=

Initializing Smoothed Values
specify number of beginning values to use in                PROC FORECAST   NSTART=
  calculating starting values
specify number of beginning values to use in                PROC FORECAST   NSSTART=
  calculating initial seasonal parameters
specify starting values for constant term                   PROC FORECAST   ASTART=
specify starting values for linear trend                    PROC FORECAST   BSTART=
specify starting values for the quadratic trend             PROC FORECAST   CSTART=
ALIGN=option
controls the alignment of SAS dates used to identify output observations. The ALIGN= option allows the following values: BEGINNING | BEG | B, MIDDLE | MID | M, and ENDING | END | E. BEGINNING is the default.
ALPHA=value
specifies the significance level to use in computing the confidence limits of the forecast. The value of the ALPHA= option must be between 0.01 and 0.99. You should use only two digits for the ALPHA= option because PROC FORECAST rounds the value to the nearest percent (ALPHA=0.101 is the same as ALPHA=0.10). The default is ALPHA=0.05, which produces 95% confidence limits.
AR=n
NLAGS=n
specifies the maximum order of the autoregressive model. The AR= option is valid only for METHOD=STEPAR. The default value of n depends on the INTERVAL= option and on the number of observations in the DATA= data set. See the section STEPAR Method on page 797 for details.
ASTART=value
ASTART=( value . . . )
specifies starting values for the constant term for the exponential smoothing, Winters, and additive Winters methods. This option is ignored if METHOD=STEPAR. The values specified are associated with the variables in the VAR statement in the order in which the variables are listed. See the section Starting Values for EXPO, WINTERS, and ADDWINTERS Methods on page 804 for details.
BSTART=value
BSTART=( value . . . )
specifies starting values for the linear trend for the exponential smoothing, Winters, and additive Winters methods. The values specified are associated with the variables in the VAR statement in the order in which the variables are listed. This option is ignored if METHOD=STEPAR or TREND=1. See the section Starting Values for EXPO, WINTERS, and ADDWINTERS Methods on page 804 for details.
CSTART=value
CSTART=( value . . . )
specifies starting values for the quadratic trend for the exponential smoothing, Winters, and additive Winters methods. The values specified are associated with the variables in the VAR statement in the order in which the variables are listed. This option is ignored if METHOD=STEPAR or TREND=1 or 2. See the section Starting Values for EXPO, WINTERS, and ADDWINTERS Methods on page 804 for details.
DATA=SAS-data-set
names the SAS data set that contains the input time series for the procedure to forecast. If the DATA= option is not specified, the most recently created SAS data set is used.
INTERVAL=interval
specifies the frequency of the input time series. For example, if the input data set consists of quarterly observations, then INTERVAL=QTR should be used. See Chapter 4, Date Intervals, Formats, and Functions, for more details about the intervals available.
INTPER=n
when the INTERVAL= option is not used, specifies an increment (other than 1) to use in generating the values of the ID variable for the forecast observations in the output data set.
LEAD=n
specifies the number of periods ahead to forecast. The default is LEAD=12. The LEAD= value is relative to the last observation in the input data set and not to the end of a particular series. Thus, if a series has missing values at the end, the actual number of forecasts computed for that series will be greater than the LEAD= value.
MAXERRORS=n
limits the number of warning and error messages produced during the execution of the procedure to the specified value. The default is MAXERRORS=50. This option is particularly useful in BY-group processing, where it can be used to suppress the recurring messages.
METHOD=method-name
specifies the method to use to model the series and generate the forecasts.
METHOD=STEPAR specifies the stepwise autoregressive method.
METHOD=EXPO specifies the exponential smoothing method.
METHOD=WINTERS specifies the Holt-Winters exponentially smoothed trend-seasonal method.
METHOD=ADDWINTERS specifies the additive seasonal factors variant of the Winters method.
For more information, see the section Forecasting Methods on page 796. The default is METHOD=STEPAR.
NSTART=n
NSTART=MAX
specifies the number of beginning values of the series to use in calculating starting values for the trend parameters in the exponential smoothing, Winters, and additive Winters methods. This option is ignored if METHOD=STEPAR. For METHOD=EXPO, n beginning values of the series are used in forming the exponentially smoothed values S1, S2, and S3, where n is the value of the NSTART= option. The parameters are initialized by fitting a time trend regression to the first n nonmissing values of the series. For METHOD=WINTERS or METHOD=ADDWINTERS, n beginning complete seasonal cycles are used to compute starting values for the trend parameters. For example, for monthly data the seasonal cycle is one year, and NSTART=2 specifies that the first 24 observations at the beginning of each series are used for the time trend regression used to calculate starting values. When NSTART=MAX is specified, all the observations are used. The default for METHOD=EXPO is NSTART=8; the default for METHOD=WINTERS or METHOD=ADDWINTERS is NSTART=2. See the section Starting Values for EXPO, WINTERS, and ADDWINTERS Methods on page 804 for details.
NSSTART=n
NSSTART=MAX
specifies the number of beginning values of the series to use in calculating starting values for seasonal parameters for METHOD=WINTERS or METHOD=ADDWINTERS. The seasonal parameters are initialized by averaging over the first n values of the series for each season, where n is the value of the NSSTART= option. When NSSTART=MAX is specified, all the observations are used. If NSTART= is specified, but NSSTART= is not, NSSTART= defaults to the value specified for NSTART=. If neither NSTART= nor NSSTART= is specified, then the default is NSSTART=2. This option is ignored if METHOD=STEPAR or METHOD=EXPO. See the section Starting Values for EXPO, WINTERS, and ADDWINTERS Methods on page 804 for details.
OUT=SAS-data-set
names the output data set to contain the forecasts. If the OUT= option is not specified, the data set is named by using the DATAn convention. See the section OUTEST= Data Set on page 808 for details.
OUTACTUAL
writes the actual values to the OUT= data set.
OUTALL
provides all the output control options (OUTLIMIT, OUT1STEP, OUTACTUAL, OUTRESID, and OUTSTD).
OUTEST=SAS-data-set
names an output data set to contain the parameter estimates and goodness-of-fit statistics. When the OUTEST= option is not specified, the parameters and goodness-of-fit statistics are not stored. See the section OUTEST= Data Set on page 808 for details.
OUTESTALL
writes additional statistics to the OUTEST= data set. This option is the same as specifying both OUTESTTHEIL and OUTFITSTATS.
OUTESTTHEIL
writes Theil forecast accuracy statistics to the OUTEST= data set.
OUTFITSTATS
writes various R-square-type forecast accuracy statistics to the OUTEST= data set.
OUTFULL
provides OUTACTUAL, OUT1STEP, and OUTLIMIT output control options in addition to the forecast values.
OUTLIMIT
writes the forecast confidence limits to the OUT= data set.
OUTRESID
writes the residuals to the OUT= data set.
OUTSTD
writes the standard errors of the forecasts to the OUT= data set.
OUT1STEP
writes the one-step-ahead predicted values to the OUT= data set.
SEASONS=interval
SEASONS=( interval1 [ interval2 [ interval3 ] ] )
SEASONS=n
SEASONS=( n1 [ n2 [ n3 ] ] )
specifies the seasonality for seasonal models. The interval can be QTR, MONTH, DAY, or HOUR, or multiples of these (for example, QTR2, MONTH2, MONTH3, MONTH4, MONTH6, HOUR2, HOUR3, HOUR4, HOUR6, HOUR8, and HOUR12). Alternatively, seasonality can be specified by giving the length of the seasonal cycles. For example, SEASONS=3 means that every group of three observations forms a seasonal cycle. The SEASONS= option is valid only for METHOD=WINTERS or METHOD=ADDWINTERS. See the section Specifying Seasonality on page 805 for details.
SINGULAR=value
gives the criterion for judging singularity. The default depends on the precision of the computer that you run SAS programs on.
SINTPER=m
SINTPER=( m1 [ m2 [ m3 ] ] )
specifies the number of periods to combine in forming a season. For example, SEASONS=3 SINTPER=2 specifies that each group of two observations forms a season and that the seasonal cycle repeats every six observations. The SINTPER= option is valid only when the SEASONS= option is used. See the section Specifying Seasonality on page 805 for details.
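The SEASONS=3 SINTPER=2 example can be sketched as follows. The DATA step only tabulates which season each observation would fall in under that specification; it assumes that the first observation starts season 1, which is an illustration and not a statement about how PROC FORECAST aligns seasons internally. The data set names A, B, and SEASON_MAP and the variable X are hypothetical.

```sas
/* Season index under SEASONS=3 SINTPER=2:          */
/* observations 1-2 -> season 1, 3-4 -> season 2,   */
/* 5-6 -> season 3, then the cycle repeats at obs 7 */
data season_map;
   do obs = 1 to 12;
      season = mod(floor((obs - 1) / 2), 3) + 1;
      output;
   end;
run;

/* Winters forecast that uses the same seasonal structure */
proc forecast data=a method=winters seasons=3 sintper=2
              lead=6 out=b;
   var x;
run;
```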
SLENTRY=value
controls the significance levels for entry of autoregressive parameters in the STEPAR method. The value of the SLENTRY= option must be between 0 and 1. The default is SLENTRY=0.2. See the section STEPAR Method on page 797 for details.
SLSTAY=value
controls the significance levels for removal of autoregressive parameters in the STEPAR method. The value of the SLSTAY= option must be between 0 and 1. The default is SLSTAY=0.05. See the section STEPAR Method on page 797 for details.
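As an illustration, the following statements tighten the entry criterion and limit the search to six lags. The PAST data set comes from the earlier examples; the values NLAGS=6 and SLENTRY=0.10 and the output data set names are arbitrary choices for this sketch, not recommendations.

```sas
/* Consider at most 6 AR lags; a lag enters only if significant at 0.10 */
/* and is removed again if its significance level exceeds 0.05          */
proc forecast data=past interval=month lead=12
              method=stepar nlags=6
              slentry=0.10 slstay=0.05
              out=pred outest=est;
   id date;
   var sales;
run;
```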
START=n
uses the first n observations to fit the model and begins forecasting with observation n + 1.
TREND=n
specifies the degree of the time trend model. The value of the TREND= option must be 1, 2, or 3. TREND=1 selects the constant trend model; TREND=2 selects the linear trend model; and TREND=3 selects the quadratic trend model. The default is TREND=2, except for METHOD=EXPO, for which the default is TREND=3.
WEIGHT=w
WEIGHT=( w1 [ w2 [ w3 ] ] )
specifies the smoothing weights for the EXPO, WINTERS, and ADDWINTERS methods. For the EXPO method, only one weight can be specified. For the WINTERS or ADDWINTERS method, w1 gives the weight for updating the constant component, w2 gives the weight for updating the linear and quadratic trend components, and w3 gives the weight for updating the seasonal component. The w2 and w3 values are optional. Each value in the WEIGHT= option must be between 0 and 1. For default values, see the section EXPO Method on page 798 and the section WINTERS Method on page 799.
ZEROMISS
treats zeros at the beginning of a series as missing values. For example, a product can be introduced at a date after the date of the first observation in the data set, and the sales variable for the product can be recorded as zero for the observations prior to the introduction date. The ZEROMISS option says to treat these initial zeros as missing values.
BY Statement
BY variables ;
A BY statement can be used with PROC FORECAST to obtain separate analyses on observations in groups defined by the BY variables.
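A minimal sketch of BY-group forecasting follows. The data set REGIONAL and the variable REGION are hypothetical; the input is sorted first because BY-group processing in SAS generally expects the data to be ordered by the BY variables. MAXERRORS=10 is included only to show how recurring messages across groups can be limited.

```sas
/* Order the data by group and then by time within each group */
proc sort data=regional out=regional_sorted;
   by region date;
run;

/* One forecasting model is fit separately for each value of REGION */
proc forecast data=regional_sorted interval=month lead=6
              maxerrors=10 out=pred_by;
   by region;
   id date;
   var sales;
run;
```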
ID Statement
ID variables ;
The first variable listed in the ID statement identifies observations in the input and output data sets. Usually, the first ID variable is a SAS date or datetime variable. Its values are interpreted and extrapolated according to the values of the INTERVAL= option. See the section Data Periodicity and Time Intervals on page 796 for details. If more than one ID variable is specified in the ID statement, only the first is used to identify the observations; the rest are just copied to the OUT= data set and will have missing values for forecast observations.
VAR Statement
VAR variables ;
The VAR statement specifies the variables in the input data set that you want to forecast. If no VAR statement is specified, the procedure forecasts all numeric variables except the ID and BY variables.
Missing Values
The treatment of missing values varies by method. For METHOD=STEPAR, missing values are tolerated in the series; the autocorrelations are estimated from the available data and tapered, if necessary. For the EXPO, WINTERS, and ADDWINTERS methods, missing values after the start of the series are replaced with one-step-ahead predicted values, and the predicted values are applied to the smoothing equations. For the WINTERS method, negative or zero values are treated as missing.
Forecasting Methods
This section explains the forecasting methods used by PROC FORECAST.
STEPAR Method
In the STEPAR method, PROC FORECAST first fits a time trend model to the series and takes the difference between each value and the estimated trend. (This process is called detrending.) Then, the remaining variation is fit by using an autoregressive model. The STEPAR method fits the autoregressive process to the residuals of the trend model by using a backwards-stepping method to select parameters. Because the trend and autoregressive parameters are fit in sequence rather than simultaneously, the parameter estimates are not optimal in a statistical sense. However, the estimates are usually close to optimal, and the method is computationally inexpensive.
The STEPAR method consists of the following computational steps:
1. Fit the trend model as specified by the TREND= option by using ordinary least-squares regression. This step detrends the data. The default trend model for the STEPAR method is TREND=2, a linear trend model.
2. Take the residuals from step 1 and compute the autocovariances to the number of lags specified by the NLAGS= option.
3. Regress the current values against the lags, using the autocovariances from step 2 in a Yule-Walker framework. Do not bring in any autoregressive parameter that is not significant at the level specified by the SLENTRY= option. (The default is SLENTRY=0.20.) Do not bring in any autoregressive parameter that results in a nonpositive-definite Toeplitz matrix.
4. Find the autoregressive parameter that is least significant. If the significance level is greater than the SLSTAY= value, remove the parameter from the model. (The default is SLSTAY=0.05.) Continue this process until only significant autoregressive parameters remain. If the OUTEST= option is specified, write the estimates to the OUTEST= data set.
5. Generate the forecasts by using the estimated model and output to the OUT= data set. Form the confidence limits by combining the trend variances with the autoregressive variances.
Missing values are tolerated in the series; the autocorrelations are estimated from the available data and tapered if necessary. This method requires at least three passes through the data: two passes to fit the model and a third pass to initialize the autoregressive process and write to the output data set.
If the NLAGS= option is not specified, the default value of the NLAGS= option is chosen based on the data frequency specified by the INTERVAL= option and on the number of observations in the input data set, if this can be determined in advance. (PROC FORECAST cannot determine the number of input observations before reading the data when a BY statement or a WHERE statement is used or if the data are from a tape format SAS data set or external database. The NLAGS= value must be fixed before the data are processed.) If the INTERVAL= option is specified, the default NLAGS= value includes lags for up to three years plus one, subject to the maximum of 13 lags or one-third of the number of observations in your data set, whichever is less. If the number of observations in the input data set cannot be determined, the maximum NLAGS= default value is 13. If the INTERVAL= option is not specified, the default is NLAGS=13 or one-third the number of input observations, whichever is less. If the Toeplitz matrix formed by the autocovariance matrix at a given step is not positive definite, the maximal number of autoregressive lags is reduced. For example, for INTERVAL=QTR, the default is NLAGS=13 (that is, $4 \times 3 + 1$) provided that there are at least 39 observations. The NLAGS= option default is always at least 3.
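The default rule can be written out as a small calculation. The DATA step below is one reading of the rule as stated in this section, not the procedure's actual code; in particular, rounding one-third of the observations down with FLOOR is an assumption, and the variable names PER_YEAR, NOBS, CANDIDATE, and NLAGS are hypothetical.

```sas
data nlags_default;
   input interval $ per_year nobs;
   /* lags for three years of data, plus one */
   candidate = 3 * per_year + 1;
   /* cap at 13 and at one-third of the observations; never below 3 */
   nlags = max(3, min(candidate, 13, floor(nobs / 3)));
   datalines;
QTR 4 39
QTR 4 24
MONTH 12 120
YEAR 1 30
;
run;

proc print data=nlags_default noobs;
run;
```

For the first row (quarterly data with 39 observations), the calculation gives NLAGS=13, which agrees with the INTERVAL=QTR example in the text.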
EXPO Method
Exponential smoothing is used when the METHOD=EXPO option is specified. The term exponential smoothing is derived from the computational scheme developed by Brown and others (Brown and Meyers 1961; Brown 1962). Estimates are computed with updating formulas that are developed across time series in a manner similar to smoothing. The EXPO method fits a trend model such that the most recent data are weighted more heavily than data in the early part of the series. The weight of an observation is a geometric (exponential) function of the number of periods that the observation extends into the past relative to the current period. The weight function is
$$ w_\tau = \omega (1 - \omega)^{t - \tau} $$
where $\tau$ is the observation number of the past observation, t is the current observation number, and $\omega$ is the weighting constant specified with the WEIGHT= option. You specify the model with the TREND= option as follows:
TREND=1 specifies single exponential smoothing (a constant model)
TREND=2 specifies double exponential smoothing (a linear trend model)
TREND=3 specifies triple exponential smoothing (a quadratic trend model)
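To see how quickly the weights decline, the weight function can be tabulated directly. In this sketch the smoothing weight 0.2 is an arbitrary illustrative value, and AGE is a hypothetical variable that stands for the number of periods between the current observation and the past observation.

```sas
data weights;
   omega = 0.2;                        /* smoothing weight, as given by WEIGHT= */
   do age = 0 to 10;                   /* periods into the past                 */
      w = omega * (1 - omega)**age;    /* geometric decline in the weight       */
      output;
   end;
run;

proc print data=weights noobs;
run;
```

Each weight is (1 - omega) times the weight of the next more recent observation, so a larger smoothing weight concentrates the fit on recent data.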
Updating Equations
The single exponential smoothing operation is expressed by the formula
$$ S_t = \omega x_t + (1 - \omega) S_{t-1} $$
where $S_t$ is the smoothed value at the current period, t is the time index of the current period, and $x_t$ is the current actual value of the series. The smoothed value $S_t$ is the forecast of $x_{t+1}$ and is calculated as the smoothing constant $\omega$ times the value of the series, $x_t$, in the current period plus $(1 - \omega)$ times the previous smoothed value $S_{t-1}$, which is the forecast of $x_t$ computed at time $t - 1$.
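The single smoothing recursion can be traced by hand in a DATA step. This is a sketch for illustration only: it starts the recursion at the first observation, whereas PROC FORECAST initializes the smoothed values from a trend regression over the first NSTART= values, and it assumes that the first value of SALES is nonmissing. The weight 0.2 and the variable names S and OMEGA are hypothetical.

```sas
data smooth;
   set past;
   retain s;                               /* carry the smoothed value forward  */
   omega = 0.2;
   if _n_ = 1 then s = sales;              /* assumed start: first observation  */
   else if not missing(sales) then
      s = omega * sales + (1 - omega) * s; /* update with the new actual value  */
   /* when SALES is missing, S is left unchanged, which is the same as */
   /* substituting the one-step-ahead prediction for the missing value */
run;
```

After each observation, S holds the smoothed value for that period, which is also the forecast for the next period.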
Double and triple exponential smoothing are derived by applying exponential smoothing to the smoothed series, obtaining smoothed values as follows:
$$ S_t^{[2]} = \omega S_t + (1 - \omega) S_{t-1}^{[2]} $$
$$ S_t^{[3]} = \omega S_t^{[2]} + (1 - \omega) S_{t-1}^{[3]} $$
Missing values after the start of the series are replaced with one-step-ahead predicted values, and the predicted value is then applied to the smoothing equations. The polynomial time trend parameters CONSTANT, LINEAR, and QUAD in the OUTEST= data set are computed from $S_T$, $S_T^{[2]}$, and $S_T^{[3]}$, the final smoothed values at observation T, the last observation used to fit the model. In the OUTEST= data set, the values of $S_T$, $S_T^{[2]}$, and $S_T^{[3]}$ are identified by _TYPE_=S1, _TYPE_=S2, and _TYPE_=S3, respectively.
Smoothing Weights
Exponential smoothing forecasts are forecasts for an integrated moving-average process; however, the weighting parameter is specified by the user rather than estimated from the data. Experience has shown that good values for the WEIGHT= option are between 0.05 and 0.3. As a general rule, smaller smoothing weights are appropriate for series with a slowly changing trend, while larger weights are appropriate for volatile series with a rapidly changing trend. If unspecified, the weight defaults to $(1 - 0.8^{1/\mathit{trend}})$, where trend is the value of the TREND= option. This produces defaults of WEIGHT=0.2 for TREND=1, WEIGHT=0.10557 for TREND=2, and WEIGHT=0.07168 for TREND=3.
The ESM procedure can be used to forecast time series by using exponential smoothing with smoothing weights that are optimized automatically. See Chapter 13, The ESM Procedure. The Time Series Forecasting System provides for exponential smoothing models and enables you to either specify or optimize the smoothing weights. See Chapter 37, Getting Started with Time Series Forecasting, for details.
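The default weights can be checked with a short DATA step that evaluates 1 minus 0.8 raised to the power 1/trend for each value of the TREND= option. The variable names TREND and WEIGHT are used here only for illustration.

```sas
data _null_;
   do trend = 1 to 3;
      weight = 1 - 0.8**(1 / trend);   /* default smoothing weight */
      put trend= weight= 7.5;          /* write the result to the log */
   end;
run;
```

The values written to the log should match the defaults quoted in the text when rounded to five decimal places.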
Confidence Limits
The confidence limits for exponential smoothing forecasts are calculated as they would be for an exponentially weighted time trend regression, using the simplifying assumption of an infinite number of observations. The variance estimate is computed by using the mean square of the unweighted one-step-ahead forecast residuals. More detailed descriptions of the forecast computations can be found in Montgomery and Johnson (1976) and Brown (1962).
WINTERS Method
The WINTERS method uses updating equations similar to exponential smoothing to fit parameters for the model
$$ x_t = (a + b t)\, s(t) + \epsilon_t $$
where a and b are the trend parameters and the function $s(t)$ selects the seasonal parameter for the season that corresponds to time t. The WINTERS method assumes that the series values are positive. If negative or zero values are found in the series, a warning is printed and the values are treated as missing. The preceding standard WINTERS model uses a linear trend. However, PROC FORECAST can also fit a version of the WINTERS method that uses a quadratic trend. When TREND=3 is specified for METHOD=WINTERS, PROC FORECAST fits the following model:
$$ x_t = (a + b t + c t^2)\, s(t) + \epsilon_t $$
The quadratic trend version of the Winters method is often unstable, and its use is not recommended. When TREND=1 is specified, the following constant trend version is fit:
$$ x_t = a\, s(t) + \epsilon_t $$
The default for the WINTERS method is TREND=2, which produces the standard linear trend model.
Seasonal Factors
The notation $s(t)$ represents the selection of the seasonal factor used for different time periods. For example, if INTERVAL=DAY and SEASONS=MONTH, there are 12 seasonal factors, one for each month in the year, and the time index t is measured in days. For any observation, t is determined by the ID variable and $s(t)$ selects the seasonal factor for the month that t falls in. For example, if t is 9 February 1993, then $s(t)$ is the seasonal parameter for February. When there are multiple seasons specified, $s(t)$ is the product of the parameters for the seasons. For example, if SEASONS=(MONTH DAY), then $s(t)$ is the product of the seasonal parameter for the month that corresponds to the period t and the seasonal parameter for the day of the week that corresponds to period t. When the SEASONS= option is not specified, the seasonal factors $s(t)$ are not included in the model. See the section Specifying Seasonality on page 805 for more information about specifying multiple seasonal factors.
Updating Equations
This section shows the updating equations for the Winters method. In the following formula, $x_t$ is the actual value of the series at time t; $a_t$ is the smoothed value of the series at time t; $b_t$ is the smoothed trend at time t; $c_t$ is the smoothed quadratic trend at time t; $s_{t-1}(t)$ selects the old value of the seasonal factor that corresponds to time t before the seasonal factors are updated. The estimates of the constant, linear, and quadratic trend parameters are updated by using the following equations:
For TREND=3,
$$ a_t = \omega_1 \frac{x_t}{s_{t-1}(t)} + (1 - \omega_1)(a_{t-1} + b_{t-1} + c_{t-1}) $$
$$ b_t = \omega_2 (a_t - a_{t-1} + c_{t-1}) + (1 - \omega_2)(b_{t-1} + 2 c_{t-1}) $$
$$ c_t = \omega_2 \frac{1}{2} (b_t - b_{t-1}) + (1 - \omega_2) c_{t-1} $$
For TREND=2,
$$ a_t = \omega_1 \frac{x_t}{s_{t-1}(t)} + (1 - \omega_1)(a_{t-1} + b_{t-1}) $$
$$ b_t = \omega_2 (a_t - a_{t-1}) + (1 - \omega_2) b_{t-1} $$
For TREND=1,
$$ a_t = \omega_1 \frac{x_t}{s_{t-1}(t)} + (1 - \omega_1) a_{t-1} $$
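The linear-trend (TREND=2) updating step can be traced in a DATA step. The sketch below is a simplification made for illustration: the seasonal factor is held fixed at 1, so only the level and the trend are updated; the start values (level equal to the first observation, trend equal to zero) are assumptions and are not the starting values that PROC FORECAST computes; and the weights 0.1 are arbitrary. The variable names A, B, A_PREV, S_OLD, W1, W2, and FORECAST1 are hypothetical, and SALES is assumed to have no missing values.

```sas
data winters_trend2;
   set past;
   retain a b;                  /* smoothed level and smoothed trend      */
   w1 = 0.1;                    /* weight for the level (omega 1)         */
   w2 = 0.1;                    /* weight for the trend (omega 2)         */
   s_old = 1;                   /* seasonal factor held fixed (assumption) */
   if _n_ = 1 then do;
      a = sales;                /* assumed starting level */
      b = 0;                    /* assumed starting trend */
   end;
   else do;
      a_prev = a;
      /* level: deseasonalized actual blended with the previous projection */
      a = w1 * sales / s_old + (1 - w1) * (a_prev + b);
      /* trend: change in level blended with the previous trend */
      b = w2 * (a - a_prev) + (1 - w2) * b;
   end;
   forecast1 = (a + b) * s_old; /* one-period-ahead prediction */
run;
```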
In this updating system, the trend polynomial is always centered at the current period so that the intercept parameter of the trend polynomial for predicted values at times after t is always the updated intercept parameter $a_t$. The predicted value for $\tau$ periods ahead is
$$ x_{t+\tau} = (a_t + b_t \tau)\, s_t(t + \tau) $$
The seasonal parameters are updated when the season changes in the data, using the mean of the ratios of the actual to the predicted values for the season. For example, if SEASONS=MONTH and INTERVAL=DAY, then when the observation for the first of February is encountered, the seasonal parameter for January is updated by using the formula
$$ s_t(t-1) = \omega_3 \frac{1}{31} \sum_{i=t-31}^{t-1} \frac{x_i}{a_i} + (1 - \omega_3)\, s_{t-1}(t-1) $$
where t is February 1 of the current year, $s_t(t-1)$ is the seasonal parameter for January updated with the data available at time t, and $s_{t-1}(t-1)$ is the seasonal parameter for January of the previous year. When multiple seasons are used, $s_t(t)$ is a product of seasonal factors. For example, if SEASONS=(MONTH DAY) then $s_t(t)$ is the product of the seasonal factors for the month and for the day of the week: $s_t(t) = s_t^m(t)\, s_t^d(t)$.
The factor $s_t^m(t)$ is updated at the start of each month by using a modification of the preceding formula that adjusts for the presence of the other seasonal by dividing the summands $x_i / a_i$ by the corresponding day-of-week factor. The day-of-week factor $s_t^d(t)$ is updated by using the formula
$$ s_t^d(t) = \omega_3 \frac{x_t}{a_t\, s_t^m(t)} + (1 - \omega_3)\, s_{t-1}^d(t) $$
where $s_{t-1}^d(t)$ is the seasonal factor for the same day of the previous week.
Missing values after the start of the series are replaced with one-step-ahead predicted values, and the predicted value is substituted for $x_i$ and applied to the updating equations.
Normalization
The parameters are normalized so that the seasonal factors for each cycle have a mean of 1.0. This normalization is performed after each complete cycle and at the end of the data. Thus, if INTERVAL=MONTH and SEASONS=MONTH are specified and a series begins with a July value, then the seasonal factors for the series are normalized at each observation for July and at the last observation in the data set. The normalization is performed by dividing each of the seasonal parameters, and multiplying each of the trend parameters, by the mean of the unnormalized seasonal parameters.
Smoothing Weights
The weight for updating the seasonal factors, $\omega_3$, is given by the third value specified in the WEIGHT= option. If the WEIGHT= option is not used, then $\omega_3$ defaults to 0.25; if the WEIGHT= option is used but does not specify a third value, then $\omega_3$ defaults to $\omega_2$. The weight for updating the linear and quadratic trend parameters, $\omega_2$, is given by the second value specified in the WEIGHT= option; if the WEIGHT= option does not specify a second value, then $\omega_2$ defaults to $\omega_1$. The updating weight for the constant parameter, $\omega_1$, is given by the first value specified in the WEIGHT= option. As a general rule, smaller smoothing weights are appropriate for series with a slowly changing trend, while larger weights are appropriate for volatile series with a rapidly changing trend. If the WEIGHT= option is not used, then $\omega_1$ defaults to $1 - 0.8^{1/\mathit{trend}}$, where trend is the value of the TREND= option. This produces defaults of WEIGHT=0.2 for TREND=1, WEIGHT=0.10557 for TREND=2, and WEIGHT=0.07168 for TREND=3. The ESM procedure and the Time Series Forecasting System provide for generating forecast models that use the Winters method and enable you to specify or optimize the weights. (See Chapter 13, "The ESM Procedure," and Chapter 37, "Getting Started with Time Series Forecasting," for details.)
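The default weight formula above can be checked with a small DATA step. The sketch evaluates $1 - 0.8^{1/\mathit{trend}}$ for each TREND= value; the variable names are arbitrary, and the results can be compared with the defaults quoted in the text.

```sas
/* Sketch: evaluate the default smoothing weight for TREND=1, 2, 3 */
data _null_;
   do trend = 1 to 3;
      w1 = 1 - 0.8**(1 / trend);   /* default weight when WEIGHT= is omitted */
      put trend= w1= 8.5;
   end;
run;
```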
Confidence Limits
A method for calculating exact forecast confidence limits for the WINTERS method is not available. Therefore, the approach taken in PROC FORECAST is to assume that the true seasonal factors have small variability about a set of fixed seasonal factors and that the remaining variation of the series is small relative to the mean level of the series. The equations are written

$$s_t(t) = I(t)(1 + \delta_t)$$
$$x_t = \mu\, I(t)(1 + \gamma_t)$$
$$a_t = \mu (1 + \alpha_t)$$

where $\mu$ is the mean level and $I(t)$ are the fixed seasonal factors. Assuming that $\delta_t$ and $\gamma_t$ are small, the forecast equations can be linearized and only first-order terms in $\delta_t$ and $\gamma_t$ kept. In terms of forecasts for $\gamma_t$, this linearized system is equivalent to a seasonal ARIMA model. Confidence limits for $\gamma_t$ are based on this ARIMA model and converted into confidence limits for $x_t$ using $s_t(t)$ as estimates of $I(t)$.

The exponential smoothing confidence limits are based on an approximation to a weighted regression model, whereas the preceding Winters confidence limits are based on an approximation to an ARIMA model. You can use METHOD=WINTERS without the SEASONS= option to do exponential smoothing and get confidence limits for the EXPO forecasts based on the ARIMA model approximation. These are generally more pessimistic than the weighted regression confidence limits produced by METHOD=EXPO.
ADDWINTERS Method
The ADDWINTERS method is like the WINTERS method except that the seasonal parameters are added to the trend instead of multiplied with the trend. The default TREND=2 model is as follows:

$$x_t = a + bt + s(t) + \epsilon_t$$
The WINTERS method updating equations and confidence limits calculations described in the preceding section are modified accordingly for the additive version.
Although the forecasts are the same, the confidence limits are computed differently.
For the Winters method, many combinations of weight values might produce unstable noninvertible models, even though all three weights are between 0 and 1. When the model is noninvertible, the forecasts depend strongly on values in the distant past, and predictions are determined largely by the starting values. Unstable models usually produce poor forecasts. The Winters model can be unstable even if the weights are optimally chosen to minimize the in-sample MSE. See Archibald (1990) for a detailed discussion of the unstable region of the parameter space of the Winters model. Optimal weights and forecasts for exponential smoothing models can be computed by using the ESM and ARIMA procedures and by the Time Series Forecasting System.
variables in the VAR statement in the order in which the variables are listed (the first value with the first variable, the second value with the second variable, and so on). If there are fewer values than variables, default starting values are used for the later variables. If there are more values than variables, the extra values are ignored.
Specifying Seasonality
Seasonality of a time series is a regular fluctuation about a trend. This is called seasonality because the time of year is the most common source of periodic variation. For example, sales of home heating oil are regularly greater in winter than during other times of the year. Seasonality can be caused by many things other than weather. In the United States, sales of nondurable goods are greater in December than in other months because of the Christmas shopping season. The term seasonality is also used for cyclical fluctuation at periods other than a year. Often, certain days of the week cause regular fluctuation in daily time series, such as increased spending on leisure activities during weekends. Three kinds of seasonality are supported in PROC FORECAST: time-of-year, day-of-week, and time-of-day. The seasonal part of the model is specified by using the SEASONS= option. The values for the SEASONS= option are listed in Table 15.2.
Table 15.2 The SEASONS= Option
SEASONS= Value    Type of Seasonality
QTR               time of year
MONTH             time of year
DAY               day of week
HOUR              time of day
The three kinds of seasonality can be combined. For example, SEASONS=(MONTH DAY HOUR) specifies that 24 hour-of-day seasons are nested within 7 day-of-week seasons, which in turn are nested within 12 month-of-year seasons. The different kinds of intervals can be listed in the SEASONS= option in any order. Thus, SEASONS=(HOUR DAY MONTH) is the same as SEASONS=(MONTH DAY HOUR). Note that the Winters method smoothing equations might be less stable when multiple seasonal factors are used.

Multiple period seasons can also be used. For example, SEASONS=QTR2 specifies two semiannual time-of-year seasons. The grouping of observations into multiple period seasons starts with the first interval in the seasonal cycle. Thus, MONTH2 seasons are January-February, March-April, and so on. (There is no provision for shifting seasonal intervals; thus, there is no way to specify seasons December-January, February-March, April-May, and so on.)

For multiple period seasons, the number of intervals combined to form the seasons must evenly divide and be less than the basic cycle length. For example, with SEASONS=MONTHn, the basic cycle length is 12, so MONTH2, MONTH3, MONTH4, and MONTH6 are valid SEASONS= values (because 2, 3, 4, and 6 evenly divide 12 and are less than 12), but MONTH5 and MONTH12 are not valid SEASONS= values.

The frequency of the seasons must not be greater than the frequency of the input data. For example, you cannot specify SEASONS=MONTH if INTERVAL=QTR or SEASONS=MONTH if INTERVAL=MONTH2. You also cannot specify two seasons of the same basic cycle. For example, SEASONS=(MONTH QTR) or SEASONS=(MONTH2 MONTH4) is not allowed.

Alternatively, the seasonality can be specified by giving the number of seasons in the SEASONS= option. SEASONS=n specifies that there are n seasons, with observations 1, n+1, 2n+1, and so on in the first season, observations 2, n+2, 2n+2, and so on in the second season, and so forth. The options SEASONS=n and SINTPER=m cause PROC FORECAST to group the input observations into n seasons, with m observations to a season, which repeat every nm observations. The options SEASONS=(n1 n2) and SINTPER=(m1 m2) produce n1 seasons with m1 observations to a season nested within n2 seasons with n1 m1 m2 observations to a season.

If the SINTPER=m option is used with the SEASONS= option, the SEASONS= interval is multiplied by the SINTPER= value. For example, specifying both SEASONS=(QTR HOUR) and SINTPER=(2 3) is the same as specifying SEASONS=(QTR2 HOUR3) and also the same as specifying SEASONS=(HOUR3 QTR2).
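The numeric form of the SEASONS= option can be pictured with a DATA step that assigns a season number to each observation. This is a sketch of one reading of the grouping rule quoted above (n seasons, m consecutive observations to a season, repeating every nm observations); the data set name `seasons_demo` and the values n=4 and m=3 are hypothetical, and the procedure's internal bookkeeping might differ.

```sas
/* Sketch: season index implied by SEASONS=n with SINTPER=m */
data seasons_demo;
   n = 4;                       /* hypothetical SEASONS=4 */
   m = 3;                       /* hypothetical SINTPER=3 */
   do i = 1 to 2 * n * m;       /* two full cycles of n*m observations */
      /* m consecutive observations share a season; seasons cycle 1..n */
      season = mod(floor((i - 1) / m), n) + 1;
      output;
   end;
   keep i season;
run;

proc print data=seasons_demo noobs;
run;
```

Setting m to 1 reduces this to the plain SEASONS=n rule, in which observations 1, n+1, 2n+1, and so on fall in the first season.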
Data Requirements
You should have ample data for the series that you forecast by using PROC FORECAST. The procedure runs with the minimum amounts of data listed below, but the results might be poor unless you have a good deal more than the minimum. The minimum number of observations required for the different methods is as follows:

  If METHOD=STEPAR is used, the minimum number of nonmissing observations required for each series forecast is the TREND= option value plus the value of the NLAGS= option. For example, using NLAGS=13 and TREND=2, at least 15 nonmissing observations are needed.

  If METHOD=EXPO is used, the minimum is the TREND= option value.

  If METHOD=WINTERS or ADDWINTERS is used, the minimum number of observations is either the number of observations in a complete seasonal cycle or the TREND= option value, whichever is greater. (However, there should be data for several complete seasonal cycles, or the seasonal factor estimates might be poor.)

For example, for the seasonal specifications SEASONS=MONTH, SEASONS=(QTR DAY), or SEASONS=(MONTH DAY HOUR), the longest cycle length is one year, so at least one year of data is required. At least two years of data is recommended.
  the BY variables
  _TYPE_, a character variable that identifies the type of observation
  _LEAD_, a numeric variable that indicates the number of steps ahead in the forecast. The value of _LEAD_ is 0 for the one-step-ahead forecasts before the start of the forecast period.
  the ID statement variables
  the VAR statement variables, which contain the result values as indicated by the _TYPE_ variable value for the observation

The FORECAST procedure processes each of the input variables listed in the VAR statement and writes several observations for each forecast period to the OUT= data set. The observations are identified by the value of the _TYPE_ variable. The options OUTACTUAL, OUTALL, OUTLIMIT, OUTRESID, OUT1STEP, OUTFULL, and OUTSTD control which types of observations are included in the OUT= data set. The values of the variable _TYPE_ are as follows:

ACTUAL
   The VAR statement variables contain actual values from the input data set. The OUTACTUAL option writes the actual values. By default, only the observations for the forecast period are output.

FORECAST
   The VAR statement variables contain forecast values. The OUT1STEP option writes the one-step-ahead predicted values for the observations used to fit the model.

RESIDUAL
   The VAR statement variables contain residuals. The residuals are computed by subtracting the forecast value from the actual value (residual = actual - forecast). The OUTRESID option writes observations for the residuals.

Lnn
   The VAR statement variables contain lower nn% confidence limits for the forecast values for the future observations specified by the LEAD= option. The value of nn depends on the ALPHA= option; with the default ALPHA=0.05, the _TYPE_ value is L95 for the lower confidence limit observations. The OUTLIMIT option writes observations for the upper and lower confidence limits.

Unn
   The VAR statement variables contain upper nn% confidence limits for the forecast values for the future observations specified by the LEAD= option. The value of nn depends on the ALPHA= option; with the default ALPHA=0.05, the _TYPE_ value is U95 for the upper confidence limit observations. The OUTLIMIT option writes observations for the upper and lower confidence limits.

STD
   The VAR statement variables contain standard errors of the forecast values. The OUTSTD option writes observations for the standard errors of the forecast.
If no output control options are specified, PROC FORECAST outputs only the forecast values for the forecast periods. The _TYPE_ variable can be used to subset the OUT= data set. For example, the following DATA step splits the OUT= data set into two data sets, one that contains the forecast series and the other that contains the residual series:
proc forecast out=out outresid ...;
   ...
run;

/* split the OUT= data set by observation type */
data fore resid;
   set out;
   if _TYPE_='FORECAST' then output fore;
   if _TYPE_='RESIDUAL' then output resid;
run;
See Chapter 3, "Working with Time Series Data," for more information about processing time series data sets in this format.
CONSTANT
LINEAR
N
   The observation contains the number of nonmissing observations used to fit the model for the series.

QUAD
   The observation contains the estimate of the quadratic parameter for the time trend model for the series. This observation is output only if you specify TREND=3.

SIGMA
   The observation contains the estimate of the standard deviation of the error term for the series.

S1-S3
   The observations contain exponentially smoothed values at the last observation. _TYPE_=S1 is the final smoothed value of the single exponential smooth. _TYPE_=S2 is the final smoothed value of the double exponential smooth. _TYPE_=S3 is the final smoothed value of the triple exponential smooth. These observations are output for METHOD=EXPO only.

S_name
   The observation contains estimates of the seasonal parameters. For example, if SEASONS=MONTH, the OUTEST= data set contains observations with _TYPE_=S_JAN, _TYPE_=S_FEB, _TYPE_=S_MAR, and so forth. For multiple-period seasons, the names of the first and last interval of the season are concatenated to form the season name. Thus, for SEASONS=MONTH4, the OUTEST= data set contains observations with _TYPE_=S_JANAPR, _TYPE_=S_MAYAUG, and _TYPE_=S_SEPDEC. When the SEASONS= option specifies numbers, the seasonal factors are labeled _TYPE_=S_i_j. For example, SEASONS=(2 3) produces observations with _TYPE_ values of S_1_1, S_1_2, S_2_1, S_2_2, and S_2_3. The observation with _TYPE_=S_i_j contains the seasonal parameters for the jth season of the ith seasonal cycle. These observations are output only for METHOD=WINTERS or METHOD=ADDWINTERS.
WEIGHT
The observation contains the smoothing weight used for exponential smoothing. This is the value of the WEIGHT= option. This observation is output for METHOD=EXPO only.
WEIGHT1 WEIGHT2 WEIGHT3
   The observations contain the weights used for smoothing the WINTERS or ADDWINTERS method parameters (specified by the WEIGHT= option). _TYPE_=WEIGHT1 is the weight used to smooth the CONSTANT parameter. _TYPE_=WEIGHT2 is the weight used to smooth the LINEAR and QUAD parameters. _TYPE_=WEIGHT3 is the weight used to smooth the seasonal parameters. These observations are output only for the WINTERS and ADDWINTERS methods.

NRESID
   The observation contains the number of nonmissing residuals, n, used to compute the goodness-of-fit statistics. The residuals are obtained by subtracting the one-step-ahead predicted values from the observed values.

SST
   The observation contains the total sum of squares for the series, corrected for the mean. $SST = \sum_{t=1}^{n} (y_t - \bar{y})^2$, where $\bar{y}$ is the series mean.

SSE
   The observation contains the sum of the squared residuals, uncorrected for the mean. $SSE = \sum_{t=1}^{n} (y_t - \hat{y}_t)^2$, where $\hat{y}_t$ is the one-step predicted value for the series.

MSE
   The observation contains the mean squared error, calculated from one-step-ahead forecasts. $MSE = \frac{1}{n-k} SSE$, where k is the number of parameters in the model.

RMSE
   The observation contains the root mean squared error. $RMSE = \sqrt{MSE}$.

MAPE
   The observation contains the mean absolute percent error. $MAPE = \frac{100}{n} \sum_{t=1}^{n} |(y_t - \hat{y}_t)/y_t|$.

MPE
   The observation contains the mean percent error. $MPE = \frac{100}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)/y_t$.

MAE
   The observation contains the mean absolute error. $MAE = \frac{1}{n} \sum_{t=1}^{n} |y_t - \hat{y}_t|$.

ME
   The observation contains the mean error. $ME = \frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)$.

The observation contains the maximum error (the largest residual).

The observation contains the minimum error (the smallest residual).

The observation contains the maximum percent error.

The observation contains the minimum percent error.

The observation contains the R square statistic, $R^2 = 1 - SSE/SST$. If the model fits the series badly, the model error sum of squares SSE might be larger than SST and the R square statistic will be negative.

ADJRSQ
   The observation contains the adjusted R square statistic. $ADJRSQ = 1 - \left(\frac{n-1}{n-k}\right)(1 - R^2)$.

ARSQ
   The observation contains Amemiya's adjusted R square statistic. $ARSQ = 1 - \left(\frac{n+k}{n-k}\right)(1 - R^2)$.

The observation contains the random walk R square statistic (Harvey's $R_D^2$ statistic that uses the random walk model for comparison). $R_D^2 = 1 - \left(\frac{n-1}{n}\right) SSE/RWSSE$, where $RWSSE = \sum_{t=2}^{n} (y_t - y_{t-1} - \mu)^2$ and $\mu = \frac{1}{n-1} \sum_{t=2}^{n} (y_t - y_{t-1})$.

AIC
   The observation contains Akaike's information criterion. $AIC = n \ln(SSE/n) + 2k$.

SBC
   The observation contains Schwarz's Bayesian criterion. $SBC = n \ln(SSE/n) + k \ln(n)$.

APC
   The observation contains Amemiya's prediction criterion. $APC = \frac{1}{n} SST \left(\frac{n+k}{n-k}\right)(1 - R^2) = \left(\frac{n+k}{n-k}\right) \frac{1}{n} SSE$.

The observation contains the correlation coefficient between the actual values and the one-step-ahead predicted values.

The observation contains Theil's U statistic that uses original units. See Maddala (1977, pp. 344-345), and Pindyck and Rubinfeld (1981, pp. 364-365) for more information about Theil statistics.

RTHEILU
   The observation contains Theil's U statistic calculated using relative changes.

THEILUM
   The observation contains the bias proportion of Theil's U statistic.

THEILUS
   The observation contains the variance proportion of Theil's U statistic.

THEILUC
   The observation contains the covariance proportion of Theil's U statistic.

THEILUR
   The observation contains the regression proportion of Theil's U statistic.

THEILUD
   The observation contains the disturbance proportion of Theil's U statistic.

RTHEILUM
   The observation contains the bias proportion of Theil's U statistic, calculated by using relative changes.

RTHEILUS
   The observation contains the variance proportion of Theil's U statistic, calculated by using relative changes.

RTHEILUC
   The observation contains the covariance proportion of Theil's U statistic, calculated by using relative changes.

RTHEILUR
   The observation contains the regression proportion of Theil's U statistic, calculated by using relative changes.

RTHEILUD
   The observation contains the disturbance proportion of Theil's U statistic, calculated by using relative changes.
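Several of the goodness-of-fit statistics defined above can be reproduced by hand from actual and one-step-ahead predicted values. The following DATA steps are a sketch only: the data set `fit_demo`, its four observations, and the parameter count `k=2` are hypothetical, and the code mirrors the definitions in this section rather than the procedure's own implementation.

```sas
/* Hypothetical actual (y) and one-step-ahead predicted (yhat) values */
data fit_demo;
   input y yhat;
   datalines;
10 9
12 13
11 11
15 14
;
run;

/* Accumulate residual sums, then form the statistics at the last row */
data _null_;
   set fit_demo end=last;
   e = y - yhat;               /* residual = actual - forecast          */
   n      + 1;                 /* NRESID: number of residuals           */
   sse    + e**2;              /* SSE: sum of squared residuals         */
   sumabs + abs(e);            /* running sum for MAE                   */
   sumape + abs(e / y);        /* running sum for MAPE                  */
   if last then do;
      k    = 2;                /* assumed number of model parameters    */
      mse  = sse / (n - k);
      rmse = sqrt(mse);
      mae  = sumabs / n;
      mape = 100 * sumape / n;
      put n= sse= mse= rmse= mae= mape=;
   end;
run;
```

The same pattern extends to the other statistics; for example, AIC and SBC need only `n`, `k`, and `sse`.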
The INTERVAL=MONTH option indicates that the data are monthly, and the ID DATE statement gives the dating variable. The METHOD=WINTERS option specifies the Winters smoothing method. The LEAD=12 option forecasts 12 months ahead. The OUT=OUT option specifies the output data set, while the OUTFULL and OUTRESID options include in the OUT= data set the predicted and residual values for the historical period and the confidence limits for the forecast period. The OUTEST= option stores various statistics in an output data set. The WHERE statement is used to include only data from 1980 on. The following statements print the OUT= data set (first 20 observations):
proc print data=out (obs=20) noobs; run;
The listing of the output data set produced by PROC PRINT is shown in part in Output 15.1.2.
Output 15.1.2 The OUT= Data Set Produced by PROC FORECAST (First 20 Observations)
Sales of Passenger Cars

DATE     _TYPE_      _LEAD_    VEHICLES
JAN80    ACTUAL         0       8808.00
JAN80    FORECAST       0       8046.52
JAN80    RESIDUAL       0        761.48
FEB80    ACTUAL         0      10054.00
FEB80    FORECAST       0       9284.31
FEB80    RESIDUAL       0        769.69
MAR80    ACTUAL         0       9921.00
MAR80    FORECAST       0      10077.33
MAR80    RESIDUAL       0       -156.33
APR80    ACTUAL         0       8850.00
APR80    FORECAST       0       9737.21
APR80    RESIDUAL       0       -887.21
MAY80    ACTUAL         0       7780.00
MAY80    FORECAST       0       9335.24
MAY80    RESIDUAL       0      -1555.24
JUN80    ACTUAL         0       7856.00
JUN80    FORECAST       0       9597.50
JUN80    RESIDUAL       0      -1741.50
JUL80    ACTUAL         0       6102.00
JUL80    FORECAST       0       6833.16
The PROC PRINT listing of the OUTEST= data set is shown in Output 15.1.3.
The following statements plot the residuals. The plot is shown in Output 15.1.4.
title1 "Sales of Passenger Cars";
title2 'Plot of Residuals';

proc sgplot data=out;
   where _type_ = 'RESIDUAL';   /* keep only the residual observations */
   needle x=date y=vehicles / markers markerattrs=(symbol=circlefilled);
   xaxis values=('1jan80'd to '1jan92'd by year);
   format date year4.;
run;
The following statements plot the forecast and confidence limits. The last two years of historical data are included in the plot to provide context for the forecast plot. A reference line is drawn at the start of the forecast period.
title1 "Sales of Passenger Cars";
title2 'Plot of Forecast from WINTERS Method';

proc sgplot data=out;
   series x=date y=vehicles / group=_type_ lineattrs=(pattern=1);
   where _type_ ^= 'RESIDUAL';   /* drop the residual observations */
   refline '15dec91'd / axis=x;  /* start of the forecast period   */
   yaxis values=(9000 to 25000 by 1000);
   xaxis values=('1jan90'd to '1jan93'd by qtr);
run;
The PROC PRINT listing of the OUTEST= data set is shown in Output 15.2.3.
The following statements plot the forecasts and confidence limits. The last two years of historical data are included in the plots to provide context for the forecast. A reference line is drawn at the start of the forecast period.
title1 'Plot of Forecasts from STEPAR Method';

proc sgplot data=out;
   series x=date y=durables / group=_type_;
   xaxis values=('1jan90'd to '1jan93'd by qtr);
   yaxis values=(100000 to 150000 by 10000);
   refline '15dec91'd / axis=x;   /* start of the forecast period */
run;

proc sgplot data=out;
   series x=date y=nondur / group=_type_;
   xaxis values=('1jan90'd to '1jan93'd by qtr);
   yaxis values=(100000 to 140000 by 10000);
   refline '15dec91'd / axis=x;   /* start of the forecast period */
run;
The PROC PRINT listing of the output data set is shown in Output 15.3.2.
References
Ahlburg, D. A. (1984), "Forecast Evaluation and Improvement Using Theil's Decomposition," Journal of Forecasting, 3, 345-351.
Aldrin, M. and Damsleth, E. (1989), "Forecasting Non-Seasonal Time Series with Missing Observations," Journal of Forecasting, 8, 97-116.
Archibald, B.C. (1990), "Parameter Space of the Holt-Winters Model," International Journal of Forecasting, 6, 199-209.
Bails, D.G. and Peppers, L.C. (1982), Business Fluctuations: Forecasting Techniques and Applications, New Jersey: Prentice-Hall.
Bartolomei, S.M. and Sweet, A.L. (1989), "A Note on the Comparison of Exponential Smoothing Methods for Forecasting Seasonal Series," International Journal of Forecasting, 5, 111-116.
Bureau of Economic Analysis, U.S. Department of Commerce (1992 and earlier editions), Business Statistics, 27th and earlier editions, Washington: U.S. Government Printing Office.
Bliemel, F. (1973), "Theil's Forecast Accuracy Coefficient: A Clarification," Journal of Marketing Research, 10, 444-446.
Bowerman, B.L. and O'Connell, R.T. (1979), Time Series and Forecasting: An Applied Approach, North Scituate, Massachusetts: Duxbury Press.
Box, G.E.P. and Jenkins, G.M. (1976), Time Series Analysis: Forecasting and Control, Revised Edition, San Francisco: Holden-Day.
Bretschneider, S.I., Carbone, R., and Longini, R.L. (1979), "An Adaptive Approach to Time Series Forecasting," Decision Sciences, 10, 232-244.
Brown, R.G. (1962), Smoothing, Forecasting and Prediction of Discrete Time Series, New York: Prentice-Hall.
Brown, R.G. and Meyer, R.F. (1961), "The Fundamental Theorem of Exponential Smoothing," Operations Research, 9, 673-685.
Chatfield, C. (1978), "The Holt-Winters Forecasting Procedure," Applied Statistics, 27, 264-279.
Chatfield, C., and Prothero, D.L. (1973), "Box-Jenkins Seasonal Forecasting: Problems in a Case Study," Journal of the Royal Statistical Society, Series A, 136, 295-315.
Chow, W.M. (1965), "Adaptive Control of the Exponential Smoothing Constant," Journal of Industrial Engineering, September-October 1965.
Cogger, K.O. (1974), "The Optimality of General-Order Exponential Smoothing," Operations Research, 22, 858.
Cox, D. R. (1961), "Prediction by Exponentially Weighted Moving Averages and Related Methods," Journal of the Royal Statistical Society, Series B, 23, 414-422.
Fair, R.C. (1986), "Evaluating the Predictive Accuracy of Models," in Handbook of Econometrics, Vol. 3, Griliches, Z. and Intriligator, M.D., eds., New York: North Holland.
Fildes, R. (1979), "Quantitative Forecasting - The State of the Art: Extrapolative Models," Journal of Operational Research Society, 30, 691-710.
Gardner, E.S. (1984), "The Strange Case of the Lagging Forecasts," Interfaces, 14, 47-50.
Gardner, E.S., Jr. (1985), "Exponential Smoothing: The State of the Art," Journal of Forecasting, 4, 1-38.
Granger, C.W.J. and Newbold, P. (1977), Forecasting Economic Time Series, New York: Academic Press, Inc.
Harvey, A.C. (1984), "A Unified View of Statistical Forecasting Procedures," Journal of Forecasting, 3, 245-275.
Ledolter, J. and Abraham, B. (1984), "Some Comments on the Initialization of Exponential Smoothing," Journal of Forecasting, 3, 79-84.
Maddala, G.S. (1977), Econometrics, New York: McGraw-Hill.
Makridakis, S., Wheelwright, S.C., and McGee, V.E. (1983), Forecasting: Methods and Applications, 2nd Ed., New York: John Wiley and Sons.
McKenzie, Ed (1984), "General Exponential Smoothing and the Equivalent ARMA Process," Journal of Forecasting, 3, 333-344.
Montgomery, D.C. and Johnson, L.A. (1976), Forecasting and Time Series Analysis, New York: McGraw-Hill.
Muth, J.F. (1960), "Optimal Properties of Exponentially Weighted Forecasts," Journal of the American Statistical Association, 55, 299-306.
Pierce, D.A. (1979), "$R^2$ Measures for Time Series," Journal of the American Statistical Association, 74, 901-910.
Pindyck, R.S. and Rubinfeld, D.L. (1981), Econometric Models and Economic Forecasts, Second Edition, New York: McGraw-Hill.
Raine, J.E. (1971), "Self-Adaptive Forecasting Reconsidered," Decision Sciences, 2, 181-191.
Roberts, S.A. (1982), "A General Class of Holt-Winters Type Forecasting Models," Management Science, 28, 808-820.
Theil, H. (1966), Applied Economic Forecasting, Amsterdam: North Holland.
Trigg, D.W., and Leach, A.G. (1967), "Exponential Smoothing with an Adaptive Response Rate," Operational Research Quarterly, 18, 53-59.
Winters, P.R. (1960), "Forecasting Sales by Exponentially Weighted Moving Averages," Management Science, 6, 324-342.
Chapter 16
Another parameter the PROC LOAN statement can compute is the maximum amount you can borrow given the periodic payment you can afford and the rates available in the market. The following SAS statements analyze a loan for 180 monthly payments of $900, with a nominal annual rate of 7.5%, and compute the maximum amount that can be borrowed:
proc loan;
   fixed payment=900 rate=7.5 life=180;
run;
Assume that you want to borrow $100,000 and can pay $900 a month. You know that the lender charges a 7.5% nominal interest rate compounded monthly. To determine how long it will take you to pay off your debt, use the following statements:
proc loan;
   fixed amount=100000 payment=900 rate=7.5;
run;
Sometimes, a loan is expressed in terms of the amount borrowed and the amount and number of periodic payments. In this case, you want to calculate the annual nominal rate charged on the loan to compare it to other alternatives. The following statements analyze a loan of $100,000 paid in 180 monthly payments of $800:
proc loan;
   fixed amount=100000 payment=800 life=180;
run;
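The first two calculations above (the maximum amount and the number of payments) can be cross-checked by hand with the standard level-payment annuity formula. The DATA step below is a sketch under the assumption that payments and compounding are both monthly and that the ordinary annuity formula applies; it is not a statement of how PROC LOAN computes its results internally, and the variable names are arbitrary.

```sas
/* Sketch: hand check of two loan parameters with the annuity formula */
data _null_;
   r = 0.075 / 12;      /* monthly rate implied by a 7.5% nominal annual rate */

   /* maximum amount supported by 180 monthly payments of $900 */
   amount = 900 * (1 - (1 + r)**(-180)) / r;

   /* number of $900 payments needed to retire $100,000 */
   life = -log(1 - 100000 * r / 900) / log(1 + r);

   put amount= dollar12.2 life= 8.2;
run;
```

Solving for the interest rate, as in the third example, has no closed form and requires an iterative search, which is why it is left to the procedure here.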
There are four basic parameters that define a loan: life (number of periodic payments), principal amount, interest rate, and the periodic payment amount. PROC LOAN calculates the missing parameter among these four. Loan analysis output includes a loan summary table and an amortization schedule. You can use the START= and LABEL= options to enhance your output. The START= option specifies the date of loan initialization and dates all the output accordingly. The LABEL= specification is used to label all output that corresponds to a particular loan; it is especially useful when multiple loans are analyzed. For example, the preceding statements for the first fixed rate loan are revised to include the START= and LABEL= options as follows:
proc loan start=1998:12;
   fixed amount=100000 rate=7.5 life=180
         label='BANK1, Fixed Rate';
run;
Rates and Payments for BANK1, Fixed Rate

Date       Nominal Rate    Effective Rate    Payment
DEC1998    7.5000%         7.7633%           927.01
The loan is initialized in December 1998 and paid off in December 2013. The monthly payment is calculated to be $927.01, and the effective interest rate is 7.7633%. Over the 15 years, $66,862.61 is paid for interest charges on the loan.
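The payment and effective rate reported above can be compared against a hand calculation. The following DATA step is a sketch that assumes monthly payments, monthly compounding, and the standard level-payment annuity formula; the variable names are arbitrary, and the procedure's own rounding rules are not reproduced.

```sas
/* Sketch: hand check of the periodic payment and effective rate */
data _null_;
   amount = 100000;     /* principal                              */
   rate   = 7.5;        /* nominal annual rate, in percent        */
   life   = 180;        /* number of monthly payments             */

   r = rate / 100 / 12;                               /* monthly rate     */
   payment   = amount * r / (1 - (1 + r)**(-life));   /* level payment    */
   effective = 100 * ((1 + r)**12 - 1);               /* effective annual */

   put payment= 10.2 effective= 8.4;
run;
```

The printed values can be set beside the payment and effective rate shown in the loan summary for the BANK1 loan.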
the loan require two balloon payments of $2000 and $1000 at the 15th and 48th payment periods, respectively. These balloon payments keep the periodic payment lower than that of the fixed rate loan. The balloon payment loan is defined by the following BALLOON statement:
proc loan start=1998:12;
   balloon amount=100000 rate=7.5 life=180
           balloonpayment=(15=2000 48=1000)
           label='BANK2, with Balloon Payment';
run;
The periodic payment for the balloon payment loan is $23.76 less than that of the fixed rate loan.
terms of the loan. These assumptions are specified by the WORSTCASE and CAPS= options, as shown in the following statements:
proc loan start=1998:12;
   arm amount=100000 rate=5.5 life=180 worstcase
       caps=(0.5, 2.5)
       label='BANK3, Adjustable Rate';
run;
Notice that the periodic payment of the adjustable rate loan as of January 2004 ($931.03) exceeds that of the fixed rate loan ($927.01).
Date        Payment
DEC1998        0.00
DEC1998        0.00
JAN1999      927.01
FEB1999      927.01
MAR1999      927.01
APR1999      927.01
MAY1999      927.01
JUN1999      927.01
JUL1999      927.01
AUG1999      927.01
SEP1999      927.01
OCT1999      927.01
NOV1999      927.01
DEC1999      927.01
DEC1999    11124.12
The principal balance at the end of one year is $96,248.67. The total payment for the year is $11,124.12, of which $3,751.33 went toward principal repayment. You can also print the amortization schedule with annual summary information or for a specified number of years. The SCHEDULE=YEARLY option produces an annual summary loan amortization schedule, which is useful for loans with a long life. For example, to print the annual summary loan repayment schedule for the buydown loan shown in Figure 16.6, use the following statements:
proc loan start=1998:12;
   buydown amount=100000 rate=6.5 life=180
           buydownrates=(24=8 48=9) pointpct=1
           schedule=yearly
           label='BANK4, Buydown';
run;
Year     Payment
1998     1000.00
1999    10453.32
2000    10528.71
2001    11358.00
2002    11403.51
2003    11904.12
2004    11904.12
2005    11904.12
2006    11904.12
2007    11904.12
2008    11904.12
2009    11904.12
2010    11904.12
2011    11904.12
2012    11904.12
2013    11904.09
Loan Comparison
The LOAN procedure can compare alternative loans on the basis of different economic criteria and help select the most desirable loan. You can compare alternative loans through different points in time. The economic criteria offered by PROC LOAN are:

  outstanding principal balance, that is, the unpaid balance of the loan
  present worth of cost, that is, before-tax or after-tax net value of the loan cash flow through the comparison period. The cash flow includes all payments, discount points, initialization costs, down payment, and the outstanding principal balance at the comparison period.
  true interest rate, that is, before-tax or after-tax effective annual interest rate charged on the loan. The cash flow includes all payments, discount points, initialization costs, and the outstanding principal balance at the specified comparison period.
  periodic payment
  the total interest paid on the loan

The figures for present worth of cost, true interest rate, and interest paid are reported on the cash flow through the comparison period. The reported outstanding principal balance and the periodic payment are the values as of the comparison period.
The COMPARE statement specifies the type of comparison and the periods of comparison. For each period specified in the COMPARE statement, a loan comparison report is printed that also indicates the best alternative. Different criteria can lead to selection of different alternatives. Also, the period of comparison might change the desirable alternative. See the section "Loan Comparison Details" on page 852 for further information.
The default loan comparison report in Figure 16.7 shows the ending outstanding balance, periodic payment, interest paid, and before-tax true rate at the end of 30 years. In the case of the default loan comparison, the selection of the best alternative is based on minimization of the true rate.
Figure 16.7 Default Loan Comparison Report
The LOAN Procedure
Loan Comparison Report
Analysis through DEC2028

Ending Outstanding    Interest Paid    True Rate
              0.00        182058.02         7.76
              0.00         68159.02         6.70
NOTE: "15 year loan" is the best alternative based on true rate analysis through DEC2028.
Based on true rate, the best alternative is the 15-year loan. However, if the objective were to minimize the periodic payment, the 30-year loan would be the more desirable.
The two loan comparison reports in Figure 16.8 and Figure 16.9 show the ending outstanding balance, periodic payment, interest paid, and after-tax true rate at the end of five years and ten years, respectively.
Figure 16.8 Loan Comparison Report as of December 2003
The LOAN Procedure
Loan Comparison Report
Analysis through DEC2003

Ending Outstanding    Interest Paid    True Rate
         113540.74         43884.34         5.54
         112958.49         40701.93         5.11
NOTE: "BANK3, Adjustable Rate" is the best alternative based on true rate analysis through DEC2003.
NOTE: "BANK1, Fixed Rate" is the best alternative based on true rate analysis through DEC2008.
The loan comparison report through December 2003 picks the adjustable rate loan as the best alternative, whereas the report through December 2008 shows the fixed rate loan as the better alternative. This implies that if you intend to keep the loan for 10 years or longer, the best alternative is the fixed rate loan. Otherwise, the adjustable rate loan is the better alternative in spite of the worst-case scenario. Further analysis shows that the actual breakeven of true interest rate occurs in August 2008. That is, the desirable alternative switches from the adjustable rate loan to the fixed rate loan in August 2008. Note that, under the assumption of a worst-case scenario for the rate adjustments, the periodic payment for the adjustable rate loan already exceeds that of the fixed rate loan in December 2003 (as of the rate adjustment in January 2003, to be exact). If the objective were to minimize the periodic payment, the fixed rate loan would have been more desirable as of December 2003. However, all of the other criteria at that point still favor the adjustable rate loan.
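A comparison at several horizons is requested by listing the periods in the AT= option. The sketch below is illustrative only: the amount, rates, and caps are assumed values, not the ones behind Figure 16.8 and Figure 16.9.

```sas
/* Hypothetical fixed versus adjustable comparison at 5 and 10 years */
proc loan start=1998:12 amount=150000 life=360 nosummaryprint;
   fixed rate=7.5 label='Fixed rate';
   /* Worst case: the rate rises by the periodic cap at every adjustment */
   arm rate=5.75 worstcase caps=(0.5, 2.5) adjustfreq=6
       label='Adjustable rate';
   /* Periods 60 and 120 correspond to DEC2003 and DEC2008 for this start date */
   compare at=(60 120) trueinterest breakpayment taxrate=33;
run;
```

Adding further periods to the AT= list is one way to narrow down where the preferred alternative switches.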
Functional Summary
Table 16.1 summarizes the statements and options that control the LOAN procedure. Many of the loan specification options can be used on all of the statements except the COMPARE statement. For these options, the statement column is left blank. Options specific to a type of loan indicate the statement name.
Table 16.1 LOAN Functional Summary
Description                                        Statement    Option

Statements
specify an adjustable rate loan                    ARM
specify a balloon payment loan                     BALLOON
specify a buydown rate loan                        BUYDOWN
specify loan comparisons                           COMPARE
specify a fixed rate loan                          FIXED

Data Set Options
specify output data set for loan summary                        OUTSUM=
specify output data set for repayment schedule                  OUT=
specify output data set for loan comparison        COMPARE      OUTCOMP=

Printing Control Options
suppress printing of loan summary report                        NOSUMMARYPRINT
suppress all printed output                                     NOPRINT
print amortization schedule                                     SCHEDULE
suppress printing of loan comparison report        COMPARE      NOCOMPRINT

Required Specifications
specify the loan amount                                         AMOUNT=
specify life of loan as number of payments                      LIFE=
specify the periodic payment                                    PAYMENT=
specify the initial annual nominal interest rate                RATE=

Loan Specifications Options
specify loan amount as percentage of price                      AMOUNTPCT=
specify time interval between compoundings                      COMPOUND=
specify down payment at loan initialization                     DOWNPAYMENT=
specify down payment as percentage of price                     DOWNPAYPCT=
specify amount paid for loan initialization                     INITIAL=
specify initialization costs as a percent                       INITIALPCT=
specify time interval between payments                          INTERVAL=
specify label for the loan                                      LABEL=
specify amount paid for discount points                         POINTS=
specify discount points as a percent                            POINTPCT=
specify uniform or lump sum prepayments                         PREPAYMENTS=
specify the purchase price                                      PRICE=
specify number of decimal places for rounding                   ROUND=
specify the date of loan initialization                         START=

Balloon Payment Loan Specification Option
specify the list of balloon payments               BALLOON      BALLOONPAYMENT=

Rate Adjustment Terms Options
specify frequency of rate adjustments              ARM          ADJUSTFREQ=
specify periodic and life cap on rate adjustment   ARM          CAPS=
specify maximum rate adjustment                    ARM          MAXADJUST=
specify maximum annual nominal interest rate       ARM          MAXRATE=
specify minimum annual nominal interest rate       ARM          MINRATE=

Rate Adjustment Case Options
specify best-case (optimistic) scenario            ARM          BESTCASE
specify predicted interest rates                   ARM          ESTIMATEDCASE=
specify constant rate                              ARM          FIXEDCASE
specify worst-case (pessimistic) scenario          ARM          WORSTCASE

Buydown Rate Loan Specification Option
specify list of nominal interest rates             BUYDOWN      BUYDOWNRATES=

Loan Comparison Options
specify all comparison criteria                    COMPARE      ALL
specify the loan comparison periods                COMPARE      AT=
specify breakeven analysis of the interest paid    COMPARE      BREAKINTEREST
specify breakeven analysis of periodic payment     COMPARE      BREAKPAYMENT
specify minimum attractive rate of return          COMPARE      MARR=
specify present worth of cost analysis             COMPARE      PWOFCOST
specify the income tax rate                        COMPARE      TAXRATE=
specify true interest rate analysis                COMPARE      TRUEINTEREST
The OUTSUM= option can be used in the PROC LOAN statement. In addition, the following loan specification options can be specified in the PROC LOAN statement to be used as defaults for all loans unless otherwise specified for a given loan:
Output Option
OUTSUM= SAS-data-set
creates an output data set that contains loan summary information for all loans other than those for which a different OUTSUM= output data set is specified.
FIXED Statement
FIXED options ;
The FIXED statement specifies a fixed rate and periodic payment loan. It can be specified using the options that are common to all loan statements. The FIXED statement options are listed in this section. You must specify three of the following options in each loan statement: AMOUNT=, LIFE=, RATE=, and PAYMENT=. The LOAN procedure calculates the fourth parameter based on the values you give the other three. If you specify all four of the options, the PAYMENT= specification is ignored, and the periodic payment is recalculated for consistency. As an alternative to specifying the AMOUNT= option, you can specify the PRICE= option along with one of the following options to facilitate the calculation of the loan amount: AMOUNTPCT=, DOWNPAYMENT=, or DOWNPAYPCT=.
Required Specifications
AMOUNT=amount A=amount
specifies the loan amount (the outstanding principal balance at the initialization of the loan).
LIFE=n L=n
gives the life of the loan in number of payments. (The payment frequency is specified by the INTERVAL= option.) For example, if the life of the loan is 10 years with monthly payments, use LIFE=120 and INTERVAL=MONTH (default) to indicate a 10-year loan in which 120 monthly payments are made.
PAYMENT=amount P=amount
specifies the periodic payment. For ARM and BUYDOWN loans where the periodic payment might change, the PAYMENT= option specifies the initial amount of the periodic payment.
RATE=rate R=rate
specifies the initial annual (nominal) interest rate in percent notation. The rate specified must be in the range 0% to 120%. For example, use RATE=12.75 for a 12.75% loan. For ARM and BUYDOWN loans, where the rate might change over the life of the loan, the RATE= option specifies the initial annual interest rate.
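The rule that any three of AMOUNT=, LIFE=, RATE=, and PAYMENT= determine the fourth can be sketched as follows. All numeric values are hypothetical.

```sas
proc loan;
   /* PAYMENT= omitted: the periodic payment is computed */
   fixed amount=100000 rate=7.5 life=360 label='Payment computed';
   /* LIFE= omitted: the number of payments is computed */
   fixed amount=100000 rate=7.5 payment=800 label='Life computed';
   /* AMOUNT= omitted: the loan amount is computed */
   fixed rate=7.5 life=360 payment=800 label='Amount computed';
run;
```

Each loan summary then reports the computed parameter alongside the three that were supplied.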
Specification Options
AMOUNTPCT=value APCT=value
specifies the loan amount as a percentage of the purchase price (PRICE= option). The AMOUNTPCT= specification is used to calculate the loan amount if the AMOUNT= option is not specified. The value specified must be in the range 1% to 100%. If both the AMOUNTPCT= and DOWNPAYPCT= options are specified and the sum of their values is not equal to 100, the value of the down payment percentage is set equal to 100 minus the value of the amount percentage.
COMPOUND=time-unit
specifies the time interval between compoundings. The default is the time unit given by the INTERVAL= option. If the INTERVAL= option is not used, then the default is COMPOUND=MONTH. The following time units are valid COMPOUND= values: CONTINUOUS, DAY, SEMIMONTH, MONTH, QUARTER, SEMIYEAR, and YEAR. The compounding interval is used to calculate the simple interest rate per payment period from the nominal annual interest rate or vice versa.
DOWNPAYMENT=amount DP=amount
specifies the down payment at the initialization of the loan. The down payment is included in the calculation of the present worth of cost but not in the calculation of the true interest rate. The after-tax analysis assumes that the down payment is not tax-deductible. (Specify after-tax analysis with the TAXRATE= option in the COMPARE statement.)
DOWNPAYPCT=value DPCT=value
specifies the down payment as a percentage of the purchase price (PRICE= option). The DOWNPAYPCT= specification is used to calculate the down payment amount if you do not specify the DOWNPAYMENT= option. The value you specify must be in the range 0% to 99%. If you specify both the AMOUNTPCT= and DOWNPAYPCT= options and the sum of their values is not equal to 100, the value of the down payment percentage is set equal to 100 minus the value of the amount percentage.
INITIAL=amount INIT=amount
specifies the amount paid for loan initialization other than the discount points and down payment. This amount is included in the calculation of the present worth of cost and the true interest rate. The after-tax analysis assumes that the initial amount is not tax-deductible. (After-tax analysis is specified by the TAXRATE= option in the COMPARE statement.)
INITIALPCT=value INITPCT=value
specifies the initialization costs as a percentage of the loan amount (AMOUNT= option). The INITIALPCT= specification is used to calculate the amount paid for loan initialization if you do not specify the INITIAL= option. The value you specify must be in the range of 0% to 100%.
INTERVAL=time-unit
gives the time interval between periodic payments. The default is INTERVAL=MONTH. The following time units are valid INTERVAL= values: SEMIMONTH, MONTH, QUARTER, SEMIYEAR, and YEAR.
LABEL=loan-label
specifies a label for the loan. If you specify the LABEL= option, all output related to the loan is labeled accordingly. If you do not specify the LABEL= option, the loan is labeled by sequence number.
POINTS=amount PNT=amount
specifies the amount paid for discount points at the initialization of the loan. This amount is included in the calculation of the present worth of cost and true interest rate. The amount paid for discount points is assumed to be tax-deductible in after-tax analysis (that is, if the TAXRATE= option is specified in the COMPARE statement).
POINTPCT=value PNTPCT=value
specifies the discount points as a percentage of the loan amount (AMOUNT= option). The POINTPCT= specification is used to calculate the amount paid for discount points if you do not specify the POINTS= option. The value you specify must be in the range of 0% to 100%.
PREPAYMENTS=amount PREPAYMENTS=( date1=prepayment1 date2=prepayment2 ... ) PREPAYMENTS=( period1=prepayment1 period2=prepayment2 ... ) PREP=
specifies either a uniform prepayment p throughout the life of the loan or lump sum prepayments. A uniform prepayment p is assumed to be paid with each periodic payment. Specify lump sum prepayments by pairs of periods (or dates) and respective prepayment amounts. You can specify the prepayment periods as dates if you specify the START= option. Prepayment periods or dates and the respective prepayment amounts must be in time sequence. The prepayments are treated as principal payments, and the outstanding principal balance is adjusted accordingly. In the adjustable rate and buydown rate loans, if there is a rate adjustment after prepayments, the adjusted periodic payment is calculated based on the outstanding principal balance. The prepayments do not result in periodic payment amount adjustments in fixed rate and balloon payment loans.
PRICE=amount PRC=amount
specifies the purchase price, which is the loan amount plus the down payment. If you specify the PRICE= option along with the loan amount (AMOUNT= option) or the down payment (DOWNPAYMENT= option), the value of the other one is calculated. If you specify the PRICE= option with the AMOUNTPCT= or DOWNPAYPCT= options, the loan amount and the down payment are calculated.
ROUND=n ROUND=NONE
specifies the number of decimal places to which the monetary amounts are rounded for the loan. Valid values for n are integers from 0 to 6. If you specify ROUND=NONE, the values are not rounded off internally, but the printed output is rounded off to two decimal places. The default is ROUND=2.
START=SAS-date-literal START=yyyy:period S=
gives the date of loan initialization. The first payment is assumed to be one payment interval after the start date. For example, you can specify the START= option as START='1APR2010'D or as START=2010:3, where 3 is the third payment interval within the year 2010. If INTERVAL=QUARTER, 3 refers to the third quarter. If you specify the START= option, all output for the particular loan is dated accordingly.
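Several of the specification options above can be combined in one step. The following sketch uses made-up amounts and rates and shows a price-based loan amount, uniform and lump sum prepayments, and a payment interval that differs from the compounding interval.

```sas
/* Hypothetical loans; START= applies to all of them as a default */
proc loan start=2010:1;
   /* Loan amount is 80% of the price; points are 1% of the loan amount */
   fixed price=200000 amountpct=80 rate=6.5 life=360
         pointpct=1 initial=1500
         label='80% financed';
   /* Uniform prepayment of 100 with every periodic payment */
   fixed amount=160000 rate=6.5 life=360 prepayments=100
         label='Uniform prepayment';
   /* Lump sum prepayments at periods 24 and 60, listed in time sequence */
   fixed amount=160000 rate=6.5 life=360
         prepayments=(24=5000 60=5000)
         label='Lump sum prepayments';
   /* Quarterly payments with monthly compounding */
   fixed amount=160000 rate=6.5 life=120
         interval=quarter compound=month
         label='Quarterly payments';
run;
```

Because START= is given, the lump sum prepayments could also be keyed by SAS date literals instead of period numbers.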
Output Options
NOSUMMARYPRINT NOSUMPR
suppresses the printing of the loan summary report. The NOSUMMARYPRINT option is usually used when an OUTSUM= data set is created to store loan summary information.
NOPRINT NOP
suppresses all printed output for the loan.
OUT=SAS-data-set
writes the loan repayment schedule to an output data set.
OUTSUM=SAS-data-set
writes the loan summary for the individual loan to an output data set.
SCHEDULE SCHEDULE=nyears SCHEDULE=YEARLY
prints the amortization schedule for the loan. SCHEDULE=nyears specifies the number of years the printed amortization table covers. If you omit the number of years or specify a period longer than the loan life, the schedule is printed for the full term of the loan. SCHEDULE=YEARLY prints yearly summary information in the amortization schedule rather than the full amortization schedule. SCHEDULE=YEARLY is useful for long-term loans.
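The output options can be combined on a single loan statement. The following is a sketch with hypothetical loan terms and data set names (WORK.SCHED and WORK.SUMMARY are illustrative).

```sas
proc loan start=2010:1;
   fixed amount=100000 rate=7 life=360
         schedule=yearly        /* yearly summary instead of every payment */
         out=work.sched         /* repayment schedule data set */
         outsum=work.summary    /* loan summary data set */
         label='30 year fixed';
run;

/* Inspect the first few rows of the repayment schedule */
proc print data=work.sched(obs=5);
run;
```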
BALLOON Statement
BALLOON options ;
The BALLOON statement specifies a fixed rate loan with scheduled balloon payments in addition to the periodic payment. The following option is used in the BALLOON statement, in addition to the required options listed under the FIXED statement:
BALLOONPAYMENT=( date1=payment1 date2=payment2 ... ) BALLOONPAYMENT=( period1=payment1 period2=payment2 ... ) BPAY=( date1=payment1 date2=payment2 ... ) BPAY=( period1=payment1 period2=payment2 ... )
specifies pairs of periods and amounts of balloon (lump sum) payments in excess of the periodic payment during the life of the loan. You can also specify the balloon periods as dates if you specify the START= option. The dates are specified as SAS date literals. For example, BALLOONPAYMENT=( '1MAR2011'D=1000 ) specifies a payment of 1000 in March of 2011. If you do not specify this option, the calculations are identical to a loan specified in a FIXED statement. Balloon periods (or dates) and the respective balloon payments must be in time sequence.
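Both forms of the BALLOONPAYMENT= list can be sketched as follows. The loan terms are hypothetical, and the two loans are not claimed to produce identical schedules.

```sas
proc loan start=2010:1;
   /* Balloon payments keyed by SAS date literals (requires START=) */
   balloon amount=100000 rate=7.5 life=120
           balloonpayment=('1MAR2011'd=1000 '1MAR2013'd=2000)
           label='Balloon payments by date';
   /* Balloon payments keyed by payment period */
   balloon amount=100000 rate=7.5 life=120
           balloonpayment=(14=1000 38=2000)
           label='Balloon payments by period';
run;
```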
ARM Statement
ARM options ;
The ARM statement specifies an adjustable rate loan where the future interest rates are not known with certainty but will vary within specified limits according to the terms stated in the loan agreement. In practice, the adjustment terms vary. Adjustments in the interest rate can be captured using the ARM statement options. In addition to the required specifications and options listed under the FIXED statement, you can use the following options with the ARM statement.
ADJUSTFREQ=n
specifies the number of periods, in terms of the INTERVAL= specification, between rate adjustments. INTERVAL=MONTH ADJUSTFREQ=6 indicates that the nominal interest rate can be adjusted every six months until the life cap or maximum rate (whichever is specified) is reached. The default is ADJUSTFREQ=12. The periodic payment is adjusted every adjustment period even if there is no rate change; therefore, if prepayments are made (as specified with the PREPAYMENTS= option), the periodic payment might change even if the nominal rate does not.
CAPS=( periodic-cap, life-cap )
specifies the maximum interest rate adjustment, in percent notation, allowed by the loan agreement. The periodic cap specifies the maximum adjustment allowed at each adjustment period. The life cap specifies the maximum total adjustment over the life of the loan. For example, a loan specified with CAPS=(0.5, 2) indicates that the nominal interest rate can change by at most 0.5% at each adjustment period, and the annual nominal interest rate throughout the life of the loan will be within a 2% range of the initial annual nominal rate.
MAXADJUST=rate MAXAD=rate
specifies the maximum rate adjustment, in percent notation, allowed at each adjustment period. Use the MAXADJUST= option with the MAXRATE= and MINRATE= options. The initial nominal rate plus the maximum adjustment should not exceed the specified MAXRATE= value. The initial nominal rate minus the maximum adjustment should not be less than the specified MINRATE= value.
MAXRATE=rate MAXR=rate
specifies the maximum annual nominal rate, in percent notation, that might be charged on the loan. The maximum annual nominal rate should be greater than or equal to the initial annual nominal rate specified with the RATE= option.
MINRATE=rate MINR=rate
specifies the minimum annual nominal rate, in percent notation, that might be charged on the loan. The minimum annual nominal rate should be less than or equal to the initial annual nominal rate specified with the RATE= option.
If none of the following rate adjustment case options is specified, worst-case analysis is performed. You can specify options for adjustable rate loans as follows:
BESTCASE B
specifies a best-case analysis. The best-case analysis assumes that the interest rate charged on the loan will reach its minimum allowed limits at each adjustment period and over the life of the loan. If you use the BESTCASE option, you must specify either the CAPS= option or the MINRATE= and MAXADJUST= options.
ESTIMATEDCASE=( date1=rate1 date2=rate2 ... ) ESTIMATEDCASE=( period1=rate1 period2=rate2 ... ) ESTC=
specifies an estimated case analysis that indicates the rate adjustments will follow the rates you predict. This option specifies pairs of periods and estimated nominal interest rates. The ESTIMATEDCASE= option can specify adjustments that do not fit the BESTCASE, WORSTCASE, or FIXEDCASE specifications, or it can be used for what-if analysis. If you specify the START= option, you can also specify the estimation periods as dates, in the form of SAS date literals. Estimated rates and the respective periods must be in time sequence. If the estimated period falls between two adjustment periods (determined by the ADJUSTFREQ= option), the rate is adjusted in the next adjustment period. The nominal interest rate charged on the loan is constant between two adjustment periods. If any of the MAXRATE=, MINRATE=, CAPS=, and MAXADJUST= options are specified to indicate the rate adjustment terms of the loan agreement, these specifications are used to bound the rate adjustments. By using the ESTIMATEDCASE= option, you are predicting what the annual nominal rates in the market will be at different points in time, not necessarily the interest rate on your particular loan. For example, if the initial nominal rate (RATE= option) is 6.0, ADJUSTFREQ=6, MAXADJUST=0.5, and ESTIMATEDCASE=(6=6.5, 12=7.5), the actual nominal rates charged on the loan would be 6.0% initially, 6.5% for the sixth through the eleventh periods, and 7.0% for the twelfth period onward, because MAXADJUST=0.5 limits the second adjustment to 0.5%.
FIXEDCASE FIXCASE
specifies a fixed case analysis that assumes the rate will stay constant. The FIXEDCASE option calculates the ARM loan values similar to a fixed rate loan, but the payments are updated every adjustment period even if the rate does not change, leading to minor differences between the two methods. One such difference is in the way prepayments are handled. In a fixed rate loan, the rate and the payments are never adjusted; therefore, the payment stays the same over the life of the loan even when prepayments are made (instead, the life of the loan is shortened). In an ARM loan with the FIXEDCASE option, on the other hand, if prepayments are made, the payment is adjusted in the following adjustment period, leaving the life of the loan constant.
WORSTCASE W
specifies a worst-case analysis. The worst-case analysis assumes that the interest rate charged on the loan will reach its maximum allowed limits at each rate adjustment period and over the life of the loan. If the WORSTCASE option is used, either the CAPS= option or the MAXRATE= and MAXADJUST= options must be specified.
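The four rate adjustment cases can be compared side by side on otherwise identical loans. The following sketch uses assumed rates, caps, and limits chosen only to satisfy the option requirements described above.

```sas
proc loan start=2010:1 amount=150000 life=360;
   /* Worst case: requires CAPS= (or MAXRATE= and MAXADJUST=) */
   arm rate=5.75 worstcase caps=(0.5, 2.5) adjustfreq=6
       label='Worst case';
   /* Best case: requires CAPS= (or MINRATE= and MAXADJUST=) */
   arm rate=5.75 bestcase minrate=4 maxrate=9 maxadjust=0.5 adjustfreq=6
       label='Best case';
   /* Estimated case: predicted market rates, bounded by the caps */
   arm rate=5.75 estimatedcase=(12=6.5 24=7.5) caps=(0.5, 2.5)
       label='Estimated case';
   /* Fixed case: the rate is held constant */
   arm rate=5.75 fixedcase label='Fixed case';
run;
```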
BUYDOWN Statement
BUYDOWN options ;
The BUYDOWN statement specifies a buydown rate loan. The buydown rate loans are similar to ARM loans, but the interest rate adjustments are predetermined at the initialization of the loan, usually by paying interest points at the time of loan initialization. You must use all the required specifications and options listed under the FIXED statement with the BUYDOWN statement. The following option is specific to the BUYDOWN statement and is required:
BUYDOWNRATES=( date1=rate1 date2=rate2 ... ) BUYDOWNRATES=( period1=rate1 period2=rate2 ... ) BDR=
specifies pairs of periods and the predetermined nominal interest rates that will be charged on the loan starting at the corresponding time periods. You can also specify the buydown periods as dates in the form of SAS date literals if you also specify the date of the initial payment by using a date value in the START= option. Buydown periods (or dates) and the respective buydown rates must be in time sequence.
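A buydown loan with two predetermined rate steps might be specified as in the following sketch. The amount, rates, and points are hypothetical.

```sas
proc loan start=2010:1;
   /* Rate is 6% initially, 7% from period 12, and 8% from period 24 */
   buydown amount=150000 rate=6 life=360
           buydownrates=(12=7 24=8)
           points=3000
           label='Buydown';
run;
```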
COMPARE Statement
COMPARE options ;
The COMPARE statement compares multiple loans, or it can be used with a single loan. You can use only one COMPARE statement. COMPARE statement options specify the periods and desired types of analysis for loan comparison. The default analysis reports the outstanding principal balance, breakeven of payment, breakeven of interest paid, and before-tax true interest rate. The default comparison period corresponds to the first LIFE= option specification. If the LIFE= option is not specified for any loan, the loan comparison period defaults to the first calculated life. You can use the following options with the COMPARE statement. For more detailed information on loan comparison, see the section "Loan Comparison Details" on page 852.
Analysis Options
ALL
is equivalent to specifying the BREAKINTEREST, BREAKPAYMENT, PWOFCOST, and TRUEINTEREST options. The loan comparison report includes all the criteria. You need to specify the MARR= option for present worth of cost calculation.
AT=( date1 date2 ... ) AT=( period1 period2 ... )
specifies the periods for loan comparison reports. If you specify the START= option in the PROC LOAN statement, you can specify the AT= option as a list of dates expressed as SAS date literals instead of periods. The comparison periods do not need to be in time sequence. If you do not specify the AT= option, the comparison period defaults to the first LIFE= option specification. If you do not specify the LIFE= option for any of the loans, the loan comparison period defaults to the first calculated life.
BREAKINTEREST BI
specifies breakeven analysis of the interest paid. The loan comparison report includes the interest paid for each loan through the specified comparison period (AT= option).
BREAKPAYMENT BP
specifies breakeven analysis of payment. The periodic payment for each loan is reported for every comparison period specified in the AT= option.
MARR=rate
specifies the MARR (minimum attractive rate of return) in percent notation. The MARR reflects the cost of capital or the opportunity cost of money. The MARR= option is used in calculating the present worth of cost.
PWOFCOST PWC
calculates the present worth of cost (net present value of costs) for each loan based on the cash flow through the specified comparison periods. The calculations account for down payment, initialization costs, and discount points, as well as the payments and outstanding principal balance at the comparison period. If you specify the TAXRATE= option, the present worth of cost is based on after-tax cash flow. Otherwise, before-tax present worth of cost is calculated. You need to specify the MARR= option for present worth of cost calculations.
TAXRATE=rate TAX=rate
specifies the income tax rate in percent notation for the after-tax calculations of the true interest rate and present worth of cost for those assets that qualify for tax deduction. If you specify this option, the amount specified in the POINTS= option and the interest paid on the loan are assumed to be tax-deductible. Otherwise, it is assumed that the asset does not qualify for tax deductions, and the cash flow is not adjusted for tax savings.
TRUEINTEREST TI
calculates the true interest rate (effective interest rate based on the cash flow of all payments, initialization costs, discount points, and the outstanding principal balance at the comparison period) for all the specified loans through each comparison period. If you specify the TAXRATE= option, the true interest rate is based on after-tax cash flow. Otherwise, the before-tax true interest rate is calculated.
Output Options
NOCOMPRINT NOCP
suppresses the printing of the loan comparison report. The NOCOMPRINT option is usually used when an OUTCOMP= data set is created to store loan comparison information.
OUTCOMP=SAS-data-set
writes the loan comparison report to an output data set.
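The analysis and output options of the COMPARE statement can be combined so that the comparison is written to a data set instead of being printed. The following sketch uses hypothetical loans, rates, and a hypothetical data set name (WORK.COMP).

```sas
proc loan start=2010:1 amount=150000 life=360 noprint;
   fixed rate=7   label='Fixed 7%';
   fixed rate=6.5 points=3000 label='Fixed 6.5% with points';
   /* ALL requests every criterion; MARR= is needed for present worth of cost */
   compare at=(60 120) all marr=5 taxrate=30
           nocomprint outcomp=work.comp;
run;

proc print data=work.comp;
run;
```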
Computational Details
These terms are used in the formulas that follow:
   $p$      periodic payment
   $a$      principal amount
   $r_a$    nominal annual rate
   $f$      compounding frequency (per year)
   $f'$     payment frequency (per year)
   $r$      periodic rate
   $r_e$    effective interest rate
   $n$      total number of payments
The periodic rate, or the simple interest applied during a payment period, is given by
\[ r = \left(1 + \frac{r_a}{f}\right)^{f/f'} - 1 \]
Note that the interest calculation is performed at each payment period rather than at the compound period. This is done by adjusting the nominal rate. See Muksian (1984) for details.
Note that when $f = f'$ (that is, when the payment and compounding frequency coincide), the preceding expression reduces to the familiar form:
\[ r = \frac{r_a}{f} \]
The periodic rate for continuous compounding can be obtained from this general expression by taking the limit as the compounding frequency $f$ goes to infinity. The resulting expression is
\[ r = \exp\left(\frac{r_a}{f'}\right) - 1 \]
The effective interest rate, or annualized percentage rate (APR), is that rate which, if compounded once per year, is equivalent to the nominal annual rate compounded $f$ times per year. Thus,
\[ (1 + r_e) = (1 + r)^{f'} = \left(1 + \frac{r_a}{f}\right)^{f} \]
or
\[ r_e = \left(1 + \frac{r_a}{f}\right)^{f} - 1 \]
For continuous compounding, the effective interest rate is given by
\[ r_e = \exp(r_a) - 1 \]
The amount is calculated as
\[ a = \frac{p}{r}\left(1 - \frac{1}{(1 + r)^{n}}\right) \]
Both the payment and amount are rounded to the nearest hundredth (cent) unless the ROUND= specification is different than the default, 2. The total number of payments $n$ is calculated as
\[ n = -\frac{\ln\left(1 - \frac{a r}{p}\right)}{\ln(1 + r)} \]
The total number of payments is rounded up to the nearest integer. The nominal annual rate is calculated using the bisection method, with $a$ as the objective and $r$ starting in the interval between $8 \times 10^{-6}$ and 0.1 with an initial midpoint 0.01 and successive midpoints bisecting.
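The periodic rate, effective rate, amount, and number-of-payments relations can be checked by hand in a DATA step. The following is a minimal sketch with hypothetical inputs; it evaluates the formulas directly and is not a description of how PROC LOAN is implemented. Rates are entered as fractions here, whereas PROC LOAN options use percent notation.

```sas
data _null_;
   /* Hypothetical inputs */
   a  = 100000;      /* principal amount */
   ra = 0.075;       /* nominal annual rate, as a fraction */
   f  = 12;          /* compounding frequency per year */
   fp = 12;          /* payment frequency per year (f') */
   n  = 360;         /* total number of payments */

   r  = (1 + ra/f)**(f/fp) - 1;   /* periodic rate */
   re = (1 + ra/f)**f - 1;        /* effective annual rate */
   rc = exp(ra/fp) - 1;           /* periodic rate, continuous compounding */

   /* Payment obtained by solving the amount formula for p, rounded to cents */
   p  = round(a * r / (1 - (1 + r)**(-n)), 0.01);

   /* Number of payments implied by a larger payment of 800, rounded up */
   n800 = ceil(-log(1 - a*r/800) / log(1 + r));

   put r= re= rc= p= n800=;
run;
```

The values written to the log can be compared with the RATE, EFFRATE, PAYMENT, and LIFE values that PROC LOAN reports for the same inputs.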
the equivalent amount for each loan at that point in time. Any loan that has a START= specification different from the one in the PROC LOAN statement is not processed in the loan comparison. The loan comparison report for each comparison period contains for each loan the loan label, outstanding balance, and any of the following measures if requested in the COMPARE statement: periodic payment (BREAKPAYMENT option), total interest paid to date (BREAKINTEREST option), present worth of cost (PWOFCOST option), and true interest rate (TRUEINTEREST option). The best loan is selected on the basis of present worth of cost or true interest rate. If both PWOFCOST and TRUEINTEREST options are specified, present worth of cost is the basis for the selection of the best loan. You can use the OUTCOMP= option in the COMPARE statement to write the loan comparison report to a data set. The NOCOMPRINT option suppresses the printing of a loan comparison report.
   LABEL, loan label
   PAYMENT, periodic payment
   AMOUNT, loan principal
   DOWNPAY, down payment. DOWNPAY is included in the OUTSUM= data set only when you specify a down payment.
   INITIAL, loan initialization costs. INITIAL is included in the OUTSUM= data set only when you specify initialization costs.
   POINTS, discount points. POINTS is included in the OUTSUM= data set only when you specify discount points.
   TOTAL, total payment
   INTEREST, total interest paid
   RATE, nominal annual interest rate
   EFFRATE, effective interest rate
   INTERVAL, payment interval
   COMPOUND, compounding interval
   LIFE, loan life (that is, the number of payment intervals)
   NCOMPND, number of compounding intervals
   COMPUTE, computed loan parameter: life, amount, payment, or rate
If you specified the START= option either in the PROC LOAN statement or for the individual loan, the OUTSUM= data set also contains the following variables:
   BEGIN, start date
   END, loan termination date
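The OUTSUM= variables can be inspected with PROC PRINT. The following sketch uses hypothetical loans and a hypothetical data set name (WORK.LOANSUM) and selects only variables that are always present.

```sas
proc loan start=2010:1 amount=100000 noprint outsum=work.loansum;
   fixed rate=7.5 life=360 label='30 year';
   fixed rate=6.5 life=180 label='15 year';
run;

/* One row per loan; print a subset of the summary variables */
proc print data=work.loansum;
   var label payment amount total interest rate effrate life;
run;
```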
Printed Output
The output from PROC LOAN consists of the loan summary table, loan amortization schedule, and loan comparison report.
Table 16.2

ODS Table Name    Description                       Option

ODS Tables Created by the PROC LOAN, FIXED, ARM, BALLOON, and BUYDOWN Statements
Repayment         loan repayment schedule           SCHEDULE

ODS Tables Created by the FIXED, ARM, BALLOON, and BUYDOWN Statements
LoanSummary       loan summary                      default
RateList          rates and payments                default
PrepayList        prepayments and periods           PREPAYMENTS=

ODS Tables Created by the BALLOON Statement
BalloonList       balloon payments and periods      default

ODS Tables Created by the COMPARE Statement
Comparison        loan comparison report            default
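The ODS table names in Table 16.2 can be used with the standard ODS SELECT and ODS OUTPUT statements. The following sketch assumes two hypothetical loans and a hypothetical output data set name (WORK.COMP_ODS).

```sas
/* Display only the loan summary and comparison tables */
ods select LoanSummary Comparison;
/* Also capture the comparison report as a data set */
ods output Comparison=work.comp_ods;

proc loan start=2010:1 amount=100000;
   fixed rate=7.5 life=360 label='30 year';
   fixed rate=6.5 life=180 label='15 year';
   compare;
run;
```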
proc loan start=1992:1 nosummaryprint amount=100000 life=360;
   fixed rate=8.25 label='8.25% - no discount points';
   fixed rate=8 points=1000 label='8% - 1 discount point';
   compare at=(48 54 60) all taxrate=33 marr=4;
run;
Output 16.1.1 shows the loan comparison reports as of January 1996 (48th period), July 1996 (54th period), and January 1997 (60th period).
Output 16.1.1 Loan Comparison Reports for Discount Point Breakeven
The LOAN Procedure
Loan Comparison Report
Analysis through JAN1996

Ending Outstanding    Present Worth of Cost    Interest Paid    True Rate
          96388.09                105546.17         32449.05         5.67
          96219.32                105604.05         31439.80         5.69

NOTE: "8.25% - no discount points" is the best alternative based on present worth of cost analysis through JAN1996.

Loan Comparison Report
Analysis through JUL1996

True Rate
     5.67
     5.67

NOTE: "8% - 1 discount point" is the best alternative based on present worth of cost analysis through JUL1996.

Loan Comparison Report
Analysis through JAN1997

Ending Outstanding    Present Worth of Cost    Interest Paid    True Rate
          95283.74                106768.07         40359.94         5.67
          95070.21                106689.80         39095.81         5.66

NOTE: "8% - 1 discount point" is the best alternative based on present worth of cost analysis through JAN1997.
Notice that the breakeven points for present worth of cost and true rate both occur in July 1996. This indicates that if you intend to keep the loan for 4.5 years or more, it is better to pay the discount points for the lower rate. If your objective is to minimize the interest paid or the periodic payment, the "8% - 1 discount point" loan is the preferred choice.
Payment 796.20
The monthly payment on the original loan is $796.20. The ending outstanding principal balance as of February is $71,028.75. At this point, you might want to refinance your loan with another 15-year loan. The alternate loan has a 6.5% nominal annual rate. The initialization costs are $1,419.00. Use the following statements to compare your alternatives:
proc loan start=1998:2 amount=71028.75;
   fixed rate=9 payment=796.20
         label='Keep the original loan' noprint;
   fixed life=180 rate=6.5 init=1419
         label='Refinance at 6.5%' noprint;
   compare at=(15 16) taxrate=33 marr=4 all;
run;
The comparison reports of May 1999 and June 1999 in Output 16.2.2 illustrate the breakeven between the two alternatives. If you intend to keep the loan through June 1999 or longer, your initialization costs for the refinancing are justified. The periodic payment of the refinanced loan is $618.74.
Loan Comparison Report
Analysis through MAY1999

True Rate
     6.20
     6.23

NOTE: "Keep the original loan" is the best alternative based on present worth of cost analysis through MAY1999.

Loan Comparison Report
Analysis through JUN1999

Ending Outstanding    Present Worth of Cost    Interest Paid    True Rate
          66567.37                 72844.52          8277.82         6.20
          67128.73                 72766.42          5999.82         6.12

NOTE: "Refinance at 6.5%" is the best alternative based on present worth of cost analysis through JUN1999.
Rates and Payments for No prepayments

Date       Nominal Rate    Effective Rate    Payment
DEC1992         8.2500%           8.5692%    1803.04
Date DEC1992
Payment 2303.04
Loan Comparison Report
Analysis through DEC2002

True Rate
     5.67
     5.67
NOTE: "With Prepayments" is the best alternative based on present worth of cost analysis through DEC2002.
Output 16.3.1 through Output 16.3.3 illustrate the Loan Summary Reports and the Loan Comparison report. Notice that with prepayments you pay off the loan in slightly more than 15 years. Also, the total payments and total interest are considerably lower with the prepayments. If you can afford the prepayments of $500 each month, another alternative you should consider is using a 15-year loan, which is generally offered at a lower nominal interest rate.
   proc loan start=1998:12 noprint outsum=loans
             amount=150000 life=360;
      fixed rate=7.5 life=180 prepayment=500
            label='BANK1, Fixed Rate';
      arm rate=5.5 estimatedcase=(12=7.5 18=8)
          label='BANK1, Adjustable Rate';
      buydown rate=7 interval=semimonth init=15000
              bdrates=(3=9 10=10) label='BANK2, Buydown';
      arm rate=5.75 worstcase caps=(0.5 2.5)
          adjustfreq=6 label='BANK3, Adjustable Rate'
          prepayments=(12=2000 36=5000);
      balloon rate=7.5 life=480
              points=1100 balloonpayment=(15=2000 48=2000)
              label='BANK4, with Balloon Payment';
      compare at=(120 360) all marr=7 tax=33 outcomp=comp;
   run;

   proc print data=loans;
   run;

   proc print data=comp;
   run;
Output 16.4.1 and Output 16.4.2 illustrate the contents of the output data sets.
   EFFRATE     START     INTERVAL      END
   0.077633    DEC1998   MONTHLY       FEB2008
   0.056408    DEC1998   MONTHLY       DEC2028
   0.072399    DEC1998   SEMIMONTHLY   DEC2013
   0.059040    DEC1998   MONTHLY       DEC2028
   0.077633    DEC1998   MONTHLY       DEC2038
PWOFCOST 137741.07 130397.88 161810.00 131955.90 130242.56 137741.07 121980.94 161536.44 124700.22 117294.50
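The EFFRATE values are consistent with compounding the nominal annual rate over the number of payment periods per year. The following Python sketch (not SAS) reproduces them under that assumption; the helper name `effective_rate` is hypothetical.

```python
def effective_rate(nominal_pct, periods_per_year):
    """Effective annual rate implied by a nominal rate compounded each period."""
    r = nominal_pct / 100.0 / periods_per_year
    return (1.0 + r) ** periods_per_year - 1.0

# Nominal rates and payment intervals taken from the PROC LOAN statements above.
fixed_bank1 = effective_rate(7.5, 12)    # monthly
arm_bank1 = effective_rate(5.5, 12)      # monthly
buydown_bank2 = effective_rate(7.0, 24)  # semimonthly
arm_bank3 = effective_rate(5.75, 12)     # monthly
print(round(fixed_bank1, 6), round(buydown_bank2, 6))
```

The same formula gives 8.5692% for the 8.25% nominal rate shown in the earlier loan summary report.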
   proc loan start=2002:1 noprint;
      fixed amount=160000 dp=20000 rate=7.75 life=360
            label='Piggyback: Primary loan' out=loan1;
      compare at=(60 120 180) pwofcost taxrate=30 marr=12
              breakpay breakint outcomp=cloan1;
   run;

   title2 'LOAN: Piggyback: Secondary (Home Equity) Loan';
   proc loan start=2002:1 noprint;
      fixed amount=20000 rate=8.25 life=180
            label='Piggyback: Secondary (Home Equity) Loan' out=loan2;
      compare at=(60 120 180) pwofcost taxrate=30 marr=12
              breakpay breakint outcomp=cloan2;
   run;

   /*-- stack the two comparison data sets --*/
   data cloan12;
      set cloan1 cloan2;
   run;

   proc timeseries data=cloan12 out=totcomp;
      id date interval=year5.25 acc=total notsorted;
      var payment interest pwofcost balance;
   run;

   /*-- LOAN: Piggyback loan --*/
   title;
   proc print data=totcomp;
      format date monyy7.;
   run;

   data comploans;
      set comploans;
      drop type label;
   run;

   /*-- LOAN: Conventional Loan --*/
   title;
   proc print data=comploans;
   run;
The loan comparisons in Output 16.5.1 and Output 16.5.2 illustrate the after-tax comparison of the loans. The after-tax present value of cost for the piggyback loan is lower than that of the 20% down conventional fixed rate loan.
Chapter 17
Example 17.3: Correlated Choice Modeling
Example 17.4: Testing for Homoscedasticity of the Utility Function
Example 17.5: Choice of Time for Work Trips: Nested Logit Analysis
Example 17.6: Hausman's Specification and Likelihood Ratio Tests
Acknowledgments: MDC Procedure
References
where $V_{ij}$ is a deterministic utility function, assumed to be linear in the explanatory variables, and $\epsilon_{ij}$ is an unobserved random variable that captures the factors that affect utility that are not included in $V_{ij}$. Different assumptions on the distribution of the errors, $\epsilon_{ij}$, give rise to different classes of models. The features of discrete choice models available in the MDC procedure are summarized in Table 17.1.
Table 17.1

   Model Type            Distribution of $\epsilon_{ij}$
   Conditional Logit     IEV, independent and identical
   HEV                   HEV, independent and nonidentical
   Nested Logit          GEV, correlated and identical
   Mixed Logit           IEV, independent and identical
   Multinomial Probit    MVN, correlated and nonidentical
IEV stands for type I extreme-value (or Gumbel) distribution with the probability density function and the cumulative distribution function of the random error given by $f(\epsilon_{ij}) = \exp(-\epsilon_{ij}) \exp(-\exp(-\epsilon_{ij}))$ and $F(\epsilon_{ij}) = \exp(-\exp(-\epsilon_{ij}))$; HEV stands for heteroscedastic extreme-value distribution with the probability density function and the cumulative distribution function of the random error given by $f(\epsilon_{ij}) = \frac{1}{\theta_j} \exp(-\frac{\epsilon_{ij}}{\theta_j}) \exp[-\exp(-\frac{\epsilon_{ij}}{\theta_j})]$ and $F(\epsilon_{ij}) = \exp[-\exp(-\frac{\epsilon_{ij}}{\theta_j})]$, where $\theta_j$ is a scale parameter for the random component of the $j$th alternative; GEV stands for generalized extreme-value distribution; MVN represents multivariate normal distribution; and $\xi_{ij}$ is an error component. See the section "Mixed Logit Model" on page 907 for more information about $\xi_{ij}$.
$j = 1, \ldots, 3$

where the cumulative distribution function of the stochastic component is type I extreme value, $F(\epsilon_{ij}) = \exp(-\exp(-\epsilon_{ij}))$. You can estimate this conditional logit model with the following statements:
   proc mdc;
      model decision = x1 x2 / type=clogit
            choice=(mode 1 2 3);
      id pid;
   run;
Note that the MDC procedure, unlike other regression procedures, does not include the intercept term automatically. The dependent variable decision takes the value 1 when a specific alternative is chosen; otherwise it takes the value 0. Each individual is allowed to choose one and only one of the possible alternatives. In other words, the variable decision takes the value 1 one time only for each individual. If each individual has three elements (1, 2, and 3) in the choice set, the NCHOICE=3 option can be specified instead of CHOICE=(mode 1 2 3).

Consider the following trinomial data from Daganzo (1979). The original data (origdata) contain travel time (ttime1-ttime3) and choice (choice) variables. ttime1-ttime3 are the travel times for three different modes of transportation, and choice indicates which one of the three modes is chosen. The choice variable must have integer values.
   data origdata;
      input ttime1 ttime2 ttime3 choice @@;
      datalines;
   16.481  16.196  23.89   2   15.123  11.373  14.182  2
   19.469   8.822  20.819  2   18.847  15.649  21.28   2
   12.578  10.671  18.335  2   11.513  20.582  27.838  1
      ... more lines ...
A new data set (newdata) is created since PROC MDC requires that each individual decision maker has one case for each alternative in his choice set. Note that the ID statement is required for all MDC models. In the following example, there are two public transportation modes, 1 and 2, and one private transportation mode, 3, and all individuals share the same choice set. The first nine observations of the raw data set are shown in Figure 17.1.
Figure 17.1 Initial Choice Data

   Obs    ttime1    ttime2    ttime3    choice
     1    16.481    16.196    23.890         2
     2    15.123    11.373    14.182         2
     3    19.469     8.822    20.819         2
     4    18.847    15.649    21.280         2
     5    12.578    10.671    18.335         2
     6    11.513    20.582    27.838         1
     7    10.651    15.537    17.418         1
     8     8.359    15.675    21.050         1
     9    11.679    12.668    23.104         1
The following statements transform the data according to MDC procedure requirements.
   data newdata(keep=pid decision mode ttime);
      set origdata;
      array tvec{3} ttime1 - ttime3;
      retain pid 0;
      pid + 1;
      do i = 1 to 3;
         mode = i;
         ttime = tvec{i};
         decision = ( choice = i );
         output;
      end;
   run;
The first nine observations of the transformed data set are shown in Figure 17.2.
Figure 17.2 Transformed Modal Choice Data

   Obs    pid    mode     ttime    decision
     1      1       1    16.481           0
     2      1       2    16.196           1
     3      1       3    23.890           0
     4      2       1    15.123           0
     5      2       2    11.373           1
     6      2       3    14.182           0
     7      3       1    19.469           0
     8      3       2     8.822           1
     9      3       3    20.819           0
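The reshaping from one row per individual to one row per alternative can also be illustrated outside SAS. The following Python sketch mirrors the logic of the DATA step for the first three individuals; the function name `to_long` and the tuple layout are illustrative assumptions.

```python
def to_long(rows):
    """Expand (ttime1, ttime2, ttime3, choice) rows to one row per alternative.

    Returns tuples (pid, mode, ttime, decision), where decision is 1 for the
    chosen mode and 0 otherwise.
    """
    long_rows = []
    for pid, (t1, t2, t3, choice) in enumerate(rows, start=1):
        for mode, ttime in enumerate((t1, t2, t3), start=1):
            decision = 1 if choice == mode else 0
            long_rows.append((pid, mode, ttime, decision))
    return long_rows

# First three observations of the raw data (Figure 17.1).
orig = [
    (16.481, 16.196, 23.890, 2),
    (15.123, 11.373, 14.182, 2),
    (19.469, 8.822, 20.819, 2),
]
long_data = to_long(orig)
for row in long_data:
    print(row)
```

Each individual contributes exactly three rows, and exactly one of them has decision equal to 1, as in Figure 17.2.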
The decision variable, decision, must have one nonzero value for each decision maker corresponding to the actual choice. When the RANK option is specified, the decision variable must contain rank data. For more details, see the section "MODEL Statement" on page 892. The following SAS statements estimate the conditional logit model by using maximum likelihood:
   proc mdc data=newdata;
      model decision = ttime / type=clogit nchoice=3
            optmethod=qn covest=hess;
      id pid;
   run;
The MDC procedure enables different individuals to face different choice sets. When all individuals have the same choice set, the NCHOICE= option can be used instead of the CHOICE= option. However, the NCHOICE= option is not allowed when a nested logit model is estimated. When the NCHOICE=number option is specified, the choices are generated as 1, ..., number. For more flexible alternatives (e.g., 1 3 6 8), you need to use the CHOICE= option. The choice variable must have integer values. The OPTMETHOD=QN option specifies the quasi-Newton optimization technique. The covariance matrix of the parameter estimates is obtained from the Hessian matrix since COVEST=HESS is specified. You can also specify COVEST=OP or COVEST=QML. See the section "MODEL Statement" on page 892 for more details.
The MDC procedure produces a summary of model estimation displayed in Figure 17.3. Since there are multiple observations for each individual, the Number of Cases (150), that is, the total number of choices faced by all individuals, is larger than the number of individuals, Number of Observations (50). Figure 17.4 shows the frequency distribution of the three choice alternatives. In this example, mode 2 is most frequently chosen.
Figure 17.3 Estimation Summary Table

The MDC Procedure
Conditional Logit Estimates

Model Fit Summary

   Dependent Variable                      decision
   Number of Observations                        50
   Number of Cases                              150
   Log Likelihood                         -33.32132
   Log Likelihood Null (LogL(0))          -54.93061
   Maximum Absolute Gradient             2.97024E-6
   Number of Iterations                           6
   Optimization Method            Dual Quasi-Newton
   AIC                                     68.64265
   Schwarz Criterion                       70.55467
The MDC procedure computes nine goodness-of-fit measures for the discrete choice model. Seven of them are pseudo-$R^2$ measures based on the null hypothesis that all coefficients except for an intercept term are zero (Figure 17.5). McFadden's likelihood ratio index (LRI) is the smallest in value. For more details see the section "Model Fit and Goodness-of-Fit Statistics" on page 915.
N = # of observations, K = # of regressors
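The summary values in Figure 17.3 can be related to one another with the usual textbook definitions. The following Python sketch assumes AIC = -2 lnL + 2K, Schwarz criterion = -2 lnL + K ln(N), and a null log likelihood that assigns equal probability to each of the three alternatives; these are assumptions that happen to match the reported figures, not a statement of how PROC MDC computes them internally. The function name `fit_summary` is hypothetical.

```python
import math

def fit_summary(loglik, n_obs, n_params, n_choices):
    """Return (AIC, Schwarz criterion, null log likelihood, McFadden LRI)."""
    aic = -2.0 * loglik + 2.0 * n_params
    sbc = -2.0 * loglik + n_params * math.log(n_obs)
    # Null model: every alternative is equally likely for every individual.
    loglik_null = n_obs * math.log(1.0 / n_choices)
    lri = 1.0 - loglik / loglik_null
    return aic, sbc, loglik_null, lri

# Conditional logit fit: LogL = -33.32132, N = 50, K = 1, three alternatives.
aic, sbc, ll0, lri = fit_summary(-33.32132, 50, 1, 3)
print(round(aic, 4), round(sbc, 4), round(ll0, 4))
```

The computed AIC, Schwarz criterion, and null log likelihood agree with Figure 17.3 up to rounding of the reported log likelihood.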
   Parameter    DF    Estimate    t Value
   ttime         1     -0.3572      -4.60
The predicted choice probabilities are produced using the OUTPUT statement:
output out=probdata pred=p;
The parameter estimates can be used to forecast the choice probability of individuals that are not in the input data set. To do so, you need to append to the input data set extra observations whose values of the dependent variable decision are missing, since these extra observations are not supposed to be used in the estimation stage. The identification variable pid must have values that are not used in the existing observations. The output data set, probdata, contains a new variable, p, in addition to input variables in the data set, extdata. The following statements forecast the choice probability of individuals that are not in the input data set.
   data extra;
      input pid mode decision ttime;
      datalines;
   51 1 .  5.0
   51 2 . 15.0
   51 3 . 14.0
   ;

   data extdata;
      set newdata extra;
   run;

   proc mdc data=extdata;
      model decision = ttime / type=clogit covest=hess nchoice=3;
      id pid;
      output out=probdata pred=p;
   run;

   proc print data=probdata( where=( pid >= 49 ) );
      var mode decision p ttime;
      id pid;
   run;
The last nine observations from the forecast data set (probdata ) are displayed in Figure 17.7. It is expected that the decision maker will choose mode 1 based on predicted probabilities for all modes.
Figure 17.7 Out-of-Sample Mode Choice Forecast

   pid    mode    decision          p     ttime
    49       1           0    0.46393    11.852
    49       2           1    0.41753    12.147
    49       3           0    0.11853    15.672
    50       1           0    0.06936    15.557
    50       2           1    0.92437     8.307
    50       3           0    0.00627    22.286
    51       1           .    0.93611     5.000
    51       2           .    0.02630    15.000
    51       3           .    0.03759    14.000
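The forecast probabilities for pid 51 follow directly from the conditional logit formula, in which each probability is the exponentiated utility divided by the sum over alternatives. The following Python sketch uses the reported (rounded) ttime estimate of -0.3572; the function name `clogit_probs` is hypothetical.

```python
import math

def clogit_probs(beta, ttimes):
    """Conditional logit choice probabilities for utilities beta * ttime."""
    utils = [beta * t for t in ttimes]
    top = max(utils)  # subtract the maximum for numerical stability
    weights = [math.exp(u - top) for u in utils]
    total = sum(weights)
    return [w / total for w in weights]

# Travel times of the appended individual (pid 51) for modes 1, 2, and 3.
probs = clogit_probs(-0.3572, [5.0, 15.0, 14.0])
print([round(p, 4) for p in probs])  # approximately [0.9361, 0.0263, 0.0376]
```

Small differences from Figure 17.7 in the last digits come from using the rounded coefficient.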
$j = 1, \ldots, 3$

Suppose the set of all alternatives indexed by $j$ is partitioned into $K$ nests, $B_1, \ldots, B_K$. The nested logit model is obtained by assuming that the error term in the utility function has the GEV cumulative distribution function

$$\exp\left( -\sum_{k=1}^{K} \left( \sum_{j \in B_k} \exp\{ -\epsilon_{ij} / \lambda_k \} \right)^{\lambda_k} \right)$$

where $\lambda_k$ is a measure of a degree of independence among the alternatives in nest $k$. When $\lambda_k = 1$ for all $k$, the model reduces to the standard logit model.
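The reduction to the standard logit model can be illustrated numerically. The following Python sketch uses the textbook nested logit choice probability associated with this GEV distribution, $P_j = e^{V_j/\lambda_k} (\sum_{l \in B_k} e^{V_l/\lambda_k})^{\lambda_k - 1} / \sum_m (\sum_{l \in B_m} e^{V_l/\lambda_m})^{\lambda_m}$. This form is an assumption for illustration; the way PROC MDC parameterizes inclusive values may differ, and the function name and input values are hypothetical.

```python
import math

def nested_logit_probs(v, nests, lam):
    """Textbook nested logit probabilities.

    v: dict mapping alternative -> deterministic utility
    nests: list of lists of alternatives (a partition of the choice set)
    lam: list with one dissimilarity parameter per nest
    """
    nest_sums = [sum(math.exp(v[j] / l) for j in nest)
                 for nest, l in zip(nests, lam)]
    denom = sum(s ** l for s, l in zip(nest_sums, lam))
    probs = {}
    for nest, l, s in zip(nests, lam, nest_sums):
        for j in nest:
            probs[j] = math.exp(v[j] / l) * s ** (l - 1.0) / denom
    return probs

# Illustrative utilities; modes 1 and 2 share a nest, mode 3 is alone.
v = {1: -1.0, 2: -1.5, 3: -2.0}
nested = nested_logit_probs(v, [[1, 2], [3]], [0.6, 1.0])
plain = nested_logit_probs(v, [[1, 2], [3]], [1.0, 1.0])
logit_total = sum(math.exp(u) for u in v.values())
print(nested, plain)
```

With both parameters equal to 1, the probabilities coincide with the ordinary logit probabilities $e^{V_j} / \sum_l e^{V_l}$.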
Since the public transportation modes, 1 and 2, tend to be correlated, these two choices can be grouped together. The decision tree displayed in Figure 17.8 is constructed.
Figure 17.8 Decision Tree for Model Choice
The two-level decision tree is specified in the NEST statement. The NCHOICE= option is not allowed for nested logit estimation. Instead, the CHOICE= option needs to be specified, as in the following statements:
   /*-- nested logit estimation --*/
   proc mdc data=newdata;
      model decision = ttime / type=nlogit choice=(mode 1 2 3)
            covest=hess;
      id pid;
      utility u(1,) = ttime;
      nest level(1) = (1 2 @ 1, 3 @ 2),
           level(2) = (1 2 @ 1);
   run;
In Figure 17.9, estimates of the inclusive value parameters, INC_L2G1C1 and INC_L2G1C2, are indicative of a nested model structure. See the section Nested Logit on page 910 and the section Decision Tree and Nested Logit on page 912 for more details on inclusive values.
The nested logit model is estimated with the restriction INC_L2G1C1 =INC_L2G1C2 by specifying the SAMESCALE option, as in the following statements. The estimation result is displayed in Figure 17.10.
   /*-- nlogit with samescale option --*/
   proc mdc data=newdata;
      model decision = ttime / type=nlogit choice=(mode 1 2 3)
            samescale covest=hess;
      id pid;
      utility u(1,) = ttime;
      nest level(1) = (1 2 @ 1, 3 @ 2),
           level(2) = (1 2 @ 1);
   run;
The nested logit model is equivalent to the conditional logit model if INC_L2G1C1 = INC_L2G1C2 = 1. You can verify this relationship by estimating a constrained nested logit model. (See the section RESTRICT Statement on page 901 for details on imposing linear restrictions on parameter estimates.) The parameter estimates and the active linear constraints for the following constrained nested logit model are displayed in Figure 17.11.
   /*-- constrained nested logit estimation --*/
   proc mdc data=newdata;
      model decision = ttime / type=nlogit choice=(mode 1 2 3)
            covest=hess;
      id pid;
      utility u(1,) = ttime;
      nest level(1) = (1 2 @ 1, 3 @ 2),
           level(2) = (1 2 @ 1);
      restrict INC_L2G1C1 = 1, INC_L2G1C2 = 1;
   run;
Parameter Estimates

   Parameter     DF    t Value    Approx Pr > |t|    Parameter Label
   ttime_L1       1      -4.60
   INC_L2G1C1     0
   INC_L2G1C2     0
   Restrict1      1      -0.26            0.7993*    Linear EC [ 1 ]
   Restrict2      1       0.37            0.7186*    Linear EC [ 2 ]

   * Probability computed using beta distribution.

Linearly Independent Active Linear Constraints

   1    0 = -1.0000 + 1.0000 * INC_L2G1C1
   2    0 = -1.0000 + 1.0000 * INC_L2G1C2
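$j = 1, 2, 3$

$$\begin{bmatrix} \epsilon_{i1} \\ \epsilon_{i2} \\ \epsilon_{i3} \end{bmatrix} \sim N\left( 0, \begin{bmatrix} 1 & \rho_{21} & 0 \\ \rho_{21} & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \right)$$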
The correlation coefficient ($\rho_{21}$) between $U_{i1}$ and $U_{i2}$ represents common neglected attributes of public transportation modes, 1 and 2. The following SAS statements estimate this trinomial probit model:
   /*-- homoscedastic mprobit --*/
   proc mdc data=newdata;
      model decision = ttime / type=mprobit nchoice=3
            unitvariance=(1 2 3) covest=hess;
      id pid;
   run;
The UNITVARIANCE=(1 2 3) option specifies that the random component of utility for each of these choices has unit variance. If the UNITVARIANCE= option is specified, it needs to include at least two choices. The results of this constrained multinomial probit model estimation are displayed in Figure 17.12 and Figure 17.13. The test for ttime = 0 is rejected at the 1% significance level.
Figure 17.12 Constrained Probit Estimation Summary

The MDC Procedure
Multinomial Probit Estimates

Model Fit Summary

   Dependent Variable                       decision
   Number of Observations                         50
   Number of Cases                               150
   Log Likelihood                          -33.88604
   Log Likelihood Null (LogL(0))           -54.93061
   Maximum Absolute Gradient               0.0002380
   Number of Iterations                            8
   Optimization Method             Dual Quasi-Newton
   AIC                                      71.77209
   Schwarz Criterion                        75.59613
   Number of Simulations                         100
   Starting Point of Halton Sequence              11
For the HEV model, the cumulative distribution function of the random error is

$$F(\epsilon_{ij}) = \exp(-\exp(-\epsilon_{ij} / \theta_j))$$

Note that the variance of $\epsilon_{ij}$ is $\frac{1}{6} \pi^2 \theta_j^2$. Therefore, the error variance is proportional to the square of the scale parameter $\theta_j$. For model identification, at least one of the scale parameters must be normalized to 1. The following SAS statements estimate an HEV model under a unit scale restriction for mode 1 ($\theta_1 = 1$). The results of computation are presented in Figure 17.14 and Figure 17.15.
   /*-- hev with gauss-laguerre method --*/
   proc mdc data=newdata;
      model decision = ttime / type=hev nchoice=3
            hev=(unitscale=1, integrate=laguerre) covest=hess;
      id pid;
   run;
The parameters SCALE2 and SCALE3 in the output correspond to the estimates of the scale parameters $\theta_2$ and $\theta_3$, respectively. Note that the estimate of the HEV model is not always stable since computation of the log-likelihood function requires numerical integration. Bhat (1995) proposed the Gauss-Laguerre method. In general, the log-likelihood function value of HEV should be larger than that of conditional logit since HEV models include the conditional logit as a special case, but in this example the reverse is true (-33.414 for the HEV model, which is less than -33.321 for the conditional logit model). (See Figure 17.14 and Figure 17.3.) This indicates that the Gauss-Laguerre approximation to the true probability is too coarse. You can see how well the Gauss-Laguerre method works by specifying a unit scale restriction for all modes, as in the following statements, since the HEV model with the unit variance for all modes reduces to the conditional logit model:
   /*-- hev with gauss-laguerre and unit scale --*/
   proc mdc data=newdata;
      model decision = ttime / type=hev nchoice=3
Figure 17.16 shows that the ttime coefficient is not close to that of the conditional logit model.
Figure 17.16 HEV Estimates with All Unit Scale Parameters

The MDC Procedure
Heteroscedastic Extreme Value Model Estimates

Parameter Estimates

   Parameter    DF    Estimate    Standard Error    t Value    Approx Pr > |t|
   ttime         1     -0.2926            0.0438      -6.68             <.0001
There is another option of specifying the integration method. The INTEGRATE=HARDY option uses the adaptive Romberg-type integration method. The adaptive integration produces much more accurate probability and log-likelihood function values, but often it is not practical to use this method of analyzing the HEV model since it requires excessive CPU time. The following SAS statements produce the HEV estimates by using the adaptive Romberg-type integration method. The results are displayed in Figure 17.17 and Figure 17.18.
   /*-- hev with adaptive integration --*/
   proc mdc data=newdata;
      model decision = ttime / type=hev nchoice=3
            hev=(unitscale=1, integrate=hardy) covest=hess;
      id pid;
   run;
With the INTEGRATE=HARDY option, the log-likelihood function value of the HEV model, -33.026, is greater than that of the conditional logit model, -33.321. (See Figure 17.17 and Figure 17.3.) When you impose unit scale restrictions on all choices, as in the following statements, the HEV model gives the same estimates as the conditional logit model. (See Figure 17.19 and Figure 17.6.)
   /*-- hev with adaptive integration and unit scale --*/
   proc mdc data=newdata;
      model decision = ttime / type=hev nchoice=3
            hev=(unitscale=1 2 3, integrate=hardy) covest=hess;
      id pid;
   run;
   Parameter    DF    Estimate    t Value
   ttime         1     -0.3572      -4.60
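The equivalence between the unit-scale HEV model and the conditional logit model can be checked at the level of choice probabilities. The following Python sketch evaluates the HEV probability integral $P_i = \int \prod_{j \ne i} F((V_i - V_j + \theta_i w)/\theta_j) f(w)\,dw$, with $F$ and $f$ the standard type I extreme-value distribution and density functions, by a plain midpoint rule. This is neither the Gauss-Laguerre nor the adaptive Romberg-type method used by PROC MDC; the integration limits, grid size, and function names are illustrative assumptions.

```python
import math

def gumbel_cdf(x):
    return math.exp(-math.exp(-x))

def gumbel_pdf(x):
    return math.exp(-x) * math.exp(-math.exp(-x))

def hev_prob(i, v, theta, lo=-10.0, hi=30.0, n=20000):
    """HEV probability of alternative i by midpoint-rule integration."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        w = lo + (k + 0.5) * h
        term = gumbel_pdf(w)
        for j in range(len(v)):
            if j != i:
                term *= gumbel_cdf((v[i] - v[j] + theta[i] * w) / theta[j])
        total += term * h
    return total

# Utilities of pid 51 from the forecast example, using the rounded estimate.
v = [-0.3572 * t for t in (5.0, 15.0, 14.0)]
unit_scale = [hev_prob(i, v, [1.0, 1.0, 1.0]) for i in range(3)]
mixed_scale = [hev_prob(i, v, [1.0, 2.0, 0.5]) for i in range(3)]
logit_total = sum(math.exp(u) for u in v)
logit = [math.exp(u) / logit_total for u in v]
print(unit_scale, logit)
```

With all scale parameters equal to 1, the integrated probabilities match the logit probabilities; with unequal scales they still sum to 1 but differ from the logit values.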
For comparison, we estimate a heteroscedastic multinomial probit model by imposing a zero restriction on the correlation parameter, $\rho_{31} = 0$. The MDC procedure requires normalization of at least two of the error variances in the multinomial probit model. Also, for identification, the correlation parameters associated with a unit normalized variance are restricted to be zero. When the UNITVARIANCE= option is specified, the zero restriction on correlation coefficients applies to the last choice of the list. In the following statements, the variances of the first and second choices are normalized. The UNITVARIANCE=(1 2) option imposes additional restrictions that $\rho_{32} = \rho_{21} = 0$. The default for the UNITVARIANCE= option is the last two choices (which would have been equivalent to UNITVARIANCE=(2 3) for this example). The result is presented in Figure 17.20. The utility function can be defined as

$$U_{ij} = V_{ij} + \epsilon_{ij}$$

where

$$\epsilon_i \sim N\left( 0, \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \sigma_3^2 \end{bmatrix} \right)$$
   /*-- mprobit estimation --*/
   proc mdc data=newdata;
      model decision = ttime / type=mprobit nchoice=3
            unitvariance=(1 2) covest=hess;
      id pid;
      restrict RHO_31 = 0;
   run;
Note that in the output the estimates of standard errors and correlations are denoted by STD_i and RHO_ij, respectively. In this particular case the first two variances (STD_1, STD_2) are normalized to one, and corresponding correlations (RHO_21, RHO_32) are set to zero, so they are not listed among parameter estimates.
You can estimate the model with normally distributed random coefficients of ttime with the following SAS statements:
   /*-- mixed logit estimation --*/
   proc mdc data=newdata type=mixedlogit;
      model decision = ttime / nchoice=3
            mixed=(normalparm=ttime);
      id pid;
   run;
Let $\beta^m$ and $\beta^s$ be mean and scale parameters, respectively, for the random coefficient, $\beta$. The relevant utility function is

$$U_{ij} = \mathrm{ttime}_{ij}\,\beta + \epsilon_{ij}$$

where $\beta = \beta^m + \beta^s \eta$; $\beta^m$ and $\beta^s$ are fixed mean and scale parameters, respectively. The stochastic component, $\eta$, is assumed to be standard normal since the NORMALPARM= option is given. Alternatively, the UNIFORMPARM= or LOGNORMALPARM= option can be specified. The LOGNORMALPARM= option is useful when nonnegative parameters are being estimated. The NORMALPARM=, UNIFORMPARM=, and LOGNORMALPARM= variables must be included on the right-hand side of the MODEL statement. See the section "Mixed Logit Model" on page 907 for more details.

To estimate a mixed logit model by using the transportation mode choice data, the MDC procedure requires the MIXED= option for random components. Results of the mixed logit estimation are displayed in Figure 17.21.
Figure 17.21 Mixed Logit Model Parameter Estimates

The MDC Procedure
Mixed Multinomial Logit Estimates

Parameter Estimates

   DF    Standard Error    Approx Pr > |t|
    1            0.2184             0.0144
    1            0.1911             0.1368
Note that the parameter ttime_M corresponds to the constant mean parameter $\beta^m$ and the parameter ttime_S corresponds to the constant scale parameter $\beta^s$ of the random coefficient $\beta$.
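The role of the mean and scale parameters can be illustrated by simulating mixed logit choice probabilities: draw the standard normal component, form the random coefficient, compute logit probabilities, and average over draws. The following Python sketch uses pseudo-random draws and purely hypothetical parameter values (the estimates of ttime_M and ttime_S are not reproduced here); it is not the simulation scheme of PROC MDC, which by default uses Halton sequences.

```python
import math
import random

def softmax(utils):
    top = max(utils)
    weights = [math.exp(u - top) for u in utils]
    total = sum(weights)
    return [w / total for w in weights]

def mixed_logit_probs(beta_m, beta_s, ttimes, n_draws=100, seed=12345):
    """Average logit probabilities over draws of beta = beta_m + beta_s * eta."""
    rng = random.Random(seed)
    avg = [0.0] * len(ttimes)
    for _ in range(n_draws):
        beta = beta_m + beta_s * rng.gauss(0.0, 1.0)  # eta ~ N(0, 1)
        p = softmax([beta * t for t in ttimes])
        avg = [a + x / n_draws for a, x in zip(avg, p)]
    return avg

# Hypothetical values for illustration only.
ttimes = [5.0, 15.0, 14.0]
simulated = mixed_logit_probs(-0.5, 0.3, ttimes)
degenerate = mixed_logit_probs(-0.5, 0.0, ttimes)
print(simulated)
```

When the scale parameter is zero the coefficient is no longer random, and the simulated probabilities collapse to the conditional logit probabilities.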
PROC MDC options ;
   MDCDATA options ;
   BOUNDS bound1 < , bound2 . . . > ;
   BY variables ;
   ID variable ;
   MODEL dependent variables = regressors / options ;
   NEST LEVEL(value) = ((values)@(value),. . . , (values)@(value)) ;
   NLOPTIONS options ;
   OUTPUT options ;
   RESTRICT restriction1 < , restriction2 . . . > ;
   UTILITY U() = variables, . . . , U() = variables ;
Functional Summary
The statements and options used with the MDC procedure are summarized in the following table:
Table 17.2 MDC Functional Summary
   Description                                                                        Statement    Option

   Data Set Options
   formats the data for use by PROC MDC                                               MDCDATA
   specify the input data set                                                         MDC          DATA=
   write parameter estimates to an output data set                                    MDC          OUTEST=
   include covariances in the OUTEST= data set                                        MDC          COVOUT
   write linear predictors and predicted probabilities to an output data set          OUTPUT       OUT=

   Declaring the Role of Variables
   specify the ID variable                                                            ID
   specify BY-group processing variables                                              BY

   Printing Control Options
   request all printing options                                                       MODEL        ALL
   display correlation matrix of the estimates                                        MODEL        CORRB
   display covariance matrix of the estimates                                         MODEL        COVB
   display detailed information about optimization iterations                         MODEL        ITPRINT
   suppress all displayed output                                                      MODEL        NOPRINT

   Model Estimation Options
   specify the choice variables                                                       MODEL        CHOICE=()
   specify the convergence criterion                                                  MODEL        CONVERGE=
   specify the type of covariance matrix                                              MODEL        COVEST=
   specify the starting point of the Halton sequence                                  MODEL        HALTONSTART=
   specify options specific to the HEV model                                          MODEL        HEV=()
   set the initial values of parameters used by the iterative optimization algorithm  MODEL        INITIAL=()
   specify the maximum number of iterations                                           MODEL        MAXITER=
   specify the options specific to mixed logit                                        MODEL        MIXED=()
   specify the number of choices for each person                                      MODEL        NCHOICE=
   specify the number of simulations                                                  MODEL        NSIMUL=
   specify the optimization technique                                                 MODEL        OPTMETHOD=
   specify the type of random number generators                                       MODEL        RANDNUM=
   specify that initial values are generated using random numbers                     MODEL        RANDINIT
   specify the rank dependent variable                                                MODEL        RANK
   specify optimization restart options                                               MODEL        RESTART=()
   specify a restriction on inclusive parameters                                      MODEL        SAMESCALE
   specify a seed for pseudo-random number generation                                 MODEL        SEED=
   specify a stated preference data restriction on inclusive parameters               MODEL        SPSCALE
   specify the type of the model                                                      MODEL        TYPE=
   specify normalization restrictions on multinomial probit error variances           MODEL        UNITVARIANCE=()

   Controlling the Optimization Process
   specify upper and lower bounds for the parameter estimates                         BOUNDS
   specify linear restrictions on the parameter estimates                             RESTRICT
   specify nonlinear optimization options                                             NLOPTIONS

   NESTED Logit Related Options
   specify the tree structure                                                         NEST         LEVEL()=
   specify the type of utility function                                               UTILITY      U()=

   Output Control Options
   output predicted probabilities                                                     OUTPUT       P=
   output estimated linear predictor                                                  OUTPUT       XBETA=
DATA=SAS-data-set

specifies the input SAS data set. If the DATA= option is not specified, PROC MDC uses the most recently created SAS data set.
OUTEST=SAS-data-set
names the SAS data set that the parameter estimates are written to. See OUTEST= Data Set later in this chapter for information about the contents of this data set.
COVOUT
writes the covariance matrix for the parameter estimates to the OUTEST= data set. This option is valid only if the OUTEST= option is specified.

In addition, any of the following MODEL statement options can be specified in the PROC MDC statement, which is equivalent to specifying the option for the MODEL statement: ALL, CONVERGE=, CORRB, COVB, COVEST=, HALTONSTART=, ITPRINT, MAXITER=, NOPRINT, NSIMUL=, OPTMETHOD=, RANDINIT, RANK, RESTART=, SAMESCALE, SEED=, SPSCALE, TYPE=, and UNITVARIANCE=.
MDCDATA Statement
MDCDATA options < / OUT= SAS-data-set > ;
The MDCDATA statement prepares data for use by PROC MDC when the choice-specific information is stored in multiple variables (see Output 17.1).
VARLIST (name1 = (var1 var2 ...) name2 = (var1 var2 . . . ) ...)
creates name variables from a multiple variable list of choice alternatives in parentheses. The choice-specific dummy variables are created for the first set of multiple variables. At least one set of multiple variables must be specified.
SELECT=( variable )
creates a 0-1 variable indicating the choice made for each individual.
OUT=SAS-data-set
BOUNDS Statement
BOUNDS bound1 < , bound2 . . . > ;
The BOUNDS statement imposes simple boundary constraints on the parameter estimates. BOUNDS statement constraints refer to the parameters estimated by the MDC procedure. You can specify any number of BOUNDS statements. Each bound is composed of parameters, constants, and inequality operators:
item operator item < operator item < operator item . . . > >
Each item is a constant, parameter, or list of parameters. Parameters associated with a regressor variable are referred to by the name of the corresponding regressor variable. Each operator is <, >, <=, or >=.

You can use both the BOUNDS statement and the RESTRICT statement to impose boundary constraints; however, the BOUNDS statement provides a simpler syntax for specifying these kinds of constraints. See the section "RESTRICT Statement" on page 901 as well.

Lagrange multipliers are reported for all the active boundary constraints. In the displayed output, the Lagrange multiplier estimates are identified with the names Restrict1, Restrict2, and so forth. The probability of the Lagrange multipliers is computed using a beta distribution (LaMotte 1994). Nonactive, or nonbinding, bounds have no effect on the estimation results and are not noted in the output.

The following BOUNDS statement constrains the estimates of the coefficient of ttime to be negative and the coefficients of x1 through x10 to be between zero and one. This example illustrates the use of parameter lists to specify boundary constraints.
bounds ttime < 0, 0 < x1-x10 < 1;
BY Statement
BY variables ;
A BY statement can be used with PROC MDC to obtain separate analyses on observations in groups defined by the BY variables.
ID Statement
ID variable ;
The ID statement must be used with PROC MDC to specify the identification variable that controls multiple choice-specific cases. The MDC procedure requires only one ID statement even with multiple MODEL statements.
MODEL Statement
MODEL dependent = regressors < / options > ;
The MODEL statement specifies the dependent variable and independent regressor variables for the regression model. When the nested logit model is estimated, regressors in the UTILITY statement are used for estimation. The following options can be used in the MODEL statement after a slash (/).

CHOICE=( variables )
CHOICE=( variable numbers )

specifies the variables that contain possible choices for each individual. Choice variables must have integer values. Multiple choice variables are allowed only for nested logit models. If all possible alternatives are written with the variable name, the MDC procedure checks all values of the choice variable. The CHOICE=(X 1 2 3) specification implies that the value of X should be 1, 2, or 3. On the other hand, the CHOICE=(X) specification considers all distinctive nonmissing values of X as elements of the choice set.
CONVERGE=number
specifies the convergence criterion. The CONVERGE= option is the same as the ABSGCONV= option in the NLOPTIONS statement. The ABSGCONV= option in the NLOPTIONS statement overrides the CONVERGE= option. The default value is 1E-5.
HALTONSTART=number
specifies the starting point of the Halton sequence. The specified number must be a positive integer. The default is HALTONSTART=11.
HEV=( option-list )
specifies options that are used to estimate the HEV model. The HEV model with a unit scale for the alternative 1 is estimated using the following SAS statement:
model y = x1 x2 x3 / hev=(unitscale=1);
The following options can be used in the HEV=() option. These options are listed within parentheses and separated by commas.

INTORDER=number
   specifies the number of summation terms for Gaussian quadrature integration. The default is INTORDER=40. The maximum order is limited to 45. This option applies only to the INTEGRATE=LAGUERRE method.

UNITSCALE=number-list
   specifies restrictions on scale parameters of stochastic utility components.

INTEGRATE=LAGUERRE | HARDY
   specifies the integration method. The INTEGRATE=HARDY option specifies an adaptive integration method, while the INTEGRATE=LAGUERRE option specifies the Gauss-Laguerre approximation method. The default is INTEGRATE=LAGUERRE.
MIXED=( option-list )
specifies options that are used for mixed logit estimation. The mixed logit model with normally distributed random parameters is specified as follows:
model y = x1 x2 x3 / mixed=(normalparm=x1);
The following options can be used in the MIXED=() option. The options are listed within parentheses and separated by commas.

LOGNORMALPARM=variables
   specifies the variables whose random coefficients are lognormally distributed. LOGNORMALPARM= variables must be included on the right-hand side of the MODEL statement.

NORMALEC=variables
   specifies the error component variables whose coefficients have a normal distribution $N(0, \sigma^2)$.

NORMALPARM=variables
   specifies the variables whose random coefficients are normally distributed. NORMALPARM= variables must be included on the right-hand side of the MODEL statement.

UNIFORMEC=variables
   specifies the error component variables whose coefficients have a uniform distribution $U(-\sqrt{3}\sigma, \sqrt{3}\sigma)$.

UNIFORMPARM=variables
   specifies the variables whose random coefficients are uniformly distributed. UNIFORMPARM= variables must be included on the right-hand side of the MODEL statement.
NCHOICE=number
specifies the number of choices for multinomial choice models when all individuals have the same choice set. When individuals have different numbers of choices, the NCHOICE= option is not allowed, and the CHOICE= option should be used. The NCHOICE= and CHOICE= options must not be used simultaneously, and the NCHOICE= option cannot be used for nested logit models.
NSIMUL=number
specifies the number of simulations when the mixed logit or multinomial probit model is estimated. The default is NSIMUL=100. In general, you need a smaller number of simulations with RANDNUM=HALTON than with RANDNUM=PSEUDO.
RANDNUM=value
specifies the type of the random number generator used for simulation. RANDNUM=HALTON is the default. The following option values are allowed:

PSEUDO
   specifies pseudo-random number generation.

HALTON
   specifies Halton sequence generation.
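For readers unfamiliar with Halton sequences, the following Python sketch generates them with the radical-inverse construction: the digits of the index in a prime base are mirrored around the radix point. How PROC MDC maps HALTONSTART= and NSIMUL= onto the sequence is not described here, so starting at index 11 and taking five points is only an illustrative assumption, and the function name `halton` is hypothetical.

```python
def halton(index, base):
    """Radical inverse of a positive integer index in the given base."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result

# First few base-2 points: 1/2, 1/4, 3/4, 1/8, ...
first_points = [halton(i, 2) for i in range(1, 5)]

# Five base-2 points starting at index 11 (assumed meaning of a start index).
draws = [halton(i, 2) for i in range(11, 16)]
print(first_points, draws)
```

Because successive points fill the unit interval evenly rather than at random, fewer of them are typically needed for a given simulation accuracy, which is the point made in the NSIMUL= description.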
RANDINIT
RANDINIT=number

specifies that initial parameter values be perturbed by uniform pseudo-random numbers for numerical optimization of the objective function. The default is $U(-1, 1)$. When the RANDINIT=r option is specified, $U(-r, r)$ pseudo-random numbers are generated. The value r should be positive. With a RANDINIT or RANDINIT= option, there are pure random searches for a given number of trials (1000 for conditional or nested logit, and 500 for other models) to get a maximum (or minimum) value of the objective function. For example, when there is a parameter estimate with an initial value of 1, the RANDINIT option will add a generated random number u to the initial value and compute an objective function value by using 1 + u. This option is helpful in finding the initial value automatically if there is no guidance in setting the initial estimate.
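The idea of a pure random search around an initial value can be sketched as follows. This Python example perturbs the starting point by $U(-r, r)$ draws for a fixed number of trials and keeps the candidate with the largest objective value. Keeping the unperturbed start as a candidate, the toy objective, and all names are assumptions for illustration and do not describe the internals of PROC MDC.

```python
import random

def random_init_search(objective, start, r=1.0, trials=1000, seed=1):
    """Return the best perturbed starting point and its objective value."""
    rng = random.Random(seed)
    best_x, best_val = list(start), objective(start)  # assumed: start is kept
    for _ in range(trials):
        candidate = [x + rng.uniform(-r, r) for x in start]
        value = objective(candidate)
        if value > best_val:
            best_x, best_val = candidate, value
    return best_x, best_val

def toy_loglik(params):
    # Hypothetical concave objective with its maximum at 1.4.
    return -(params[0] - 1.4) ** 2

x0, f0 = random_init_search(toy_loglik, [1.0])
print(x0, f0)
```

The selected point would then serve as the starting value for the iterative optimizer.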
RANK
specifies that the dependent variable contain ranks. The numbers must be positive integers starting from 1. When the dependent variable has value 1, the corresponding alternative is chosen. This option is provided only as a convenience to the user: the extra information contained in the ranks is not currently used for estimation purposes.
RESTART=( option-list )
specifies options that are used for reiteration of the optimization problem. When the ADDRANDOM option is specified, the initial value of reiteration is computed using random grid searches around the initial solution. For example, the following statement uses the ADDVALUE= option:
model y = x1 x2 / type=clogit restart=(addvalue=(.01 .01));
The preceding SAS statement reestimates a conditional logit model by adding ADDVALUE= values. If the ADDVALUE= option contains missing values, the RESTART= option uses the corresponding estimate from the initial stage. If no ADDVALUE= value is specified for an estimate, a default value equal to (|estimate| * 1e-3) is added to the corresponding estimate from the initial stage. If both the ADDVALUE= and ADDRANDOM(=) options are specified, ADDVALUE= is ignored.
The following options can be used in the RESTART=() option. The options are listed within parentheses.
ADDMAXIT=number
specifies the maximum number of iterations for the second stage of the estimation. The default is ADDMAXIT=100.
ADDRANDOM
ADDRANDOM=value
specifies random added values to the estimates from the initial stage. With the ADDRANDOM option, $U(-1, 1)$ random numbers are created and added to the estimates obtained in the initial stage. When the ADDRANDOM=r option is specified, $U(-r, r)$ random numbers are generated. The restart initial value is determined based on the given number of random searches (1000 for conditional or nested logit, and 500 for other models).
ADDVALUE=( value-list )
specifies values added to the estimates from the initial stage. A missing value in the list is considered as a zero value for the corresponding estimate. When the ADDVALUE= option is not specified, default values equal to (|estimate| * 1e-3) are added.
SAMESCALE
specifies that the parameters of the inclusive values be the same within a group at each level when the nested logit is estimated.
SEED=number
specifies an initial seed for pseudo-random number generation. The SEED= value must be less than $2^{31} - 1$. If the SEED= value is negative or zero, the time of day from the computer's clock is used to obtain the initial seed. The default is SEED=0.
SPSCALE
specifies that the parameters of the inclusive values be the same for any choice with only one nested choice within a group, for each level in a nested logit model. This option is useful in analyzing stated preference data.
TYPE=value
specifies the type of model to be analyzed. The supported model types are as follows:
CONDITIONLOGIT | CLOGIT | CL   specifies a conditional logit model.
HEV   specifies a heteroscedastic extreme-value model.
MIXEDLOGIT | MXL   specifies a mixed logit model.
MULTINOMPROBIT | MPROBIT | MP   specifies a multinomial probit model.
NESTEDLOGIT | NLOGIT | NL   specifies a nested logit model.
UNITVARIANCE=( number-list )
specifies normalization restrictions on error variances of multinomial probit for the choices whose numbers are given in the list. If the UNITVARIANCE= option is specified, it must include at least two choices. Also, for identification, additional zero restrictions are placed on the correlation coefficients for the last choice in the list.
COVEST=value
The COVEST= option specifies the type of covariance matrix. Possible values are OP, HESSIAN, and QML.
OP   specifies the covariance from the outer product matrix.
HESSIAN   specifies the covariance from the Hessian matrix.
QML   specifies the covariance from the outer product and Hessian matrices.
When COVEST=OP is specified, the outer product matrix is used to compute the covariance matrix of the parameter estimates. The COVEST=HESSIAN option produces the covariance matrix by using the inverse Hessian matrix. The quasi-maximum likelihood estimates are computed with COVEST=QML. The default is COVEST=HESSIAN when the Newton-Raphson method is used. COVEST=OP is the default when the OPTMETHOD=QN option is specified.
Printing Options
ALL
ITPRINT
displays the initial parameter estimates, convergence criteria, and constraints of the optimization. At each iteration, the objective function value, the maximum absolute gradient element, the step size, and the slope of the search direction are printed as well. The objective function is the full negative log-likelihood function for the maximum likelihood method. When the ITPRINT option is specified in the presence of the NLOPTIONS statement, all printing options in the NLOPTIONS statement are ignored.
NOPRINT
INITIAL=( initial-values )
specifies initial values for some or all of the parameter estimates. The values specified are assigned to model parameters in the same order in which the parameter estimates are displayed in the MDC procedure output. When you use the INITIAL= option, the initial values in the INITIAL= option must satisfy the restrictions specified for the parameter estimates. If they do not, the initial values you specify are adjusted to satisfy the restrictions.
MAXITER=number
sets the maximum number of iterations allowed. The MAXITER= option overrides the MAXITER= option in the NLOPTIONS statement. The default is MAXITER=100.
OPTMETHOD=value
The OPTMETHOD= option specifies the optimization technique when the estimation method uses nonlinear optimization.
QN   specifies the quasi-Newton method.
NR   specifies the Newton-Raphson method.
TR   specifies the trust region method.
The OPTMETHOD=NR option is the same as the TECHNIQUE=NEWRAP option in the NLOPTIONS statement. For the conditional and nested logit models the default is OPTMETHOD=NR. For other models the default is OPTMETHOD=QN.
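The estimation control options can be set together in the MODEL statement. The following sketch overrides the Newton-Raphson default of a conditional logit model with the quasi-Newton method, raises the iteration limit, and requests the Hessian-based covariance matrix explicitly. The data set and variable names are illustrative assumptions.

```sas
/* Sketch only: NEWDATA, DECISION, TTIME, and PID are illustrative names. */
proc mdc data=newdata;
   model decision = ttime / type=clogit
                            nchoice=3
                            optmethod=qn      /* quasi-Newton instead of NR      */
                            maxiter=200       /* default is MAXITER=100          */
                            covest=hessian;   /* otherwise OP would apply for QN */
   id pid;
run;
```

Stating COVEST= explicitly matters here because the default covariance type changes from HESSIAN to OP when OPTMETHOD=QN is specified.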
NEST Statement
NEST LEVEL( level-number )= ( choices@choice, . . . ) ;
The NEST statement is used when one choice variable contains all possible alternatives and the TYPE=NLOGIT option is specified. The decision tree is constructed based on the NEST statement. When the choice set is specified using multiple CHOICE= variables in the MODEL statement, the NEST statement is ignored. Consider the following eight choices that are nested in a three-level tree structure.

Level 1   1  2  3  4  5  6  7  8
Level 2   1  1  1  2  2  2  3  3
Level 3   1  1  1  1  1  1  2  2
Top       1  1  1  1  1  1  1  1
You can use the following NEST statement to specify the tree structure displayed in Figure 17.22:
nest level(1) = (1 2 3 @ 1, 4 5 6 @ 2, 7 8 @ 3),
     level(2) = (1 2 @ 1, 3 @ 2),
     level(3) = (1 2 @ 1);
Note that the decision tree is constructed based on the sequence of first-level choice set specification. Therefore, specifying another order at Level 1 builds a different tree. The following NEST statement builds the tree displayed in Figure 17.23:
nest level(1) = (4 5 6 @ 2, 1 2 3 @ 1, 7 8 @ 3),
     level(2) = (1 2 @ 1, 3 @ 2),
     level(3) = (1 2 @ 1);
However, the NEST statement with a different sequence of choice specification at higher levels builds the same tree as displayed in Figure 17.22 if the sequence at the first level is the same.
Since the MDC procedure contains multiple cases for each individual, it is important to keep the data sequence in the proper order. Consider the four-choice multinomial model with one explanatory variable cost:
pid   choice   y   cost
 1      1      1    10
 1      2      0    25
 1      3      0    20
 1      4      0    30
 2      1      0    15
 2      2      0    22
 2      3      1    16
 2      4      0    25
The order of data needs to correspond to the value of choice. Therefore, the following data set is equivalent to the preceding data:
pid   choice   y   cost
 1      2      0    25
 1      3      0    20
 1      1      1    10
 1      4      0    30
 2      3      1    16
 2      4      0    25
 2      1      0    15
 2      2      0    22
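The layout described above can be sketched as a DATA step followed by a conditional logit fit. The data set name one is an assumption taken from other examples in this chapter, and with only two individuals the data illustrate the required layout rather than a meaningful estimation problem.

```sas
/* Sketch only: one row per individual (pid) and alternative (choice). */
/* y = 1 marks the chosen alternative; cost is the explanatory variable. */
data one;
   input pid choice y cost;
   datalines;
1 1 1 10
1 2 0 25
1 3 0 20
1 4 0 30
2 1 0 15
2 2 0 22
2 3 1 16
2 4 0 25
;

proc mdc data=one;
   model y = cost / type=clogit choice=(choice);
   id pid;
run;
```

Because the CHOICE= variable identifies each alternative, reordering the rows within an individual, as in the second listing, describes the same data.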
Another model is estimated if you specify the decision tree as in Figure 17.25. The different nested tree structure is specified in the following SAS statements:
proc mdc data=one type=nlogit;
   model y = cost / choice=(choice);
   id pid;
   utility u(1,) = cost;
   nest level(1) = (1 @ 1, 2 3 4 @ 2),
        level(2) = (1 2 @ 1);
run;
NLOPTIONS Statement
NLOPTIONS options ;
PROC MDC uses the nonlinear optimization (NLO) subsystem to perform nonlinear optimization tasks. The NLOPTIONS statement specifies nonlinear optimization options. The NLOPTIONS statement must follow the MODEL statement. For a list of all the options of the NLOPTIONS statement, see Chapter 6, "Nonlinear Optimization Methods."
OUTPUT Statement
OUTPUT options ;
The MDC procedure supports the OUTPUT statement. The OUTPUT statement creates a new SAS data set that contains all the variables in the input data set and, optionally, the estimated linear predictors (XBETA) and predicted probabilities (P). The input data set must be sorted by the choice variable(s) within each ID.
OUT=SAS-data-set
P=variable name
requests the predicted probabilities by naming the variable that contains the predicted probabilities in the output data set.
XBETA=variable name
names the variable that contains the linear predictor ($x'\beta$) values. However, the XBETA= option is not supported in the nested logit model.
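A minimal sketch of the OUTPUT statement follows. It assumes a conditional logit model, because XBETA= is not supported for nested logit, and it assumes that the input data set is already sorted by the choice variable within each ID. The output data set name (probdata) and the new variable names (prob, xb) are hypothetical.

```sas
/* Sketch only: PROBDATA, PROB, and XB are illustrative names. */
proc mdc data=newdata;
   model decision = ttime / type=clogit nchoice=3;
   id pid;
   output out=probdata    /* copy of the input data plus the new variables */
          p=prob          /* predicted probabilities                       */
          xbeta=xb;       /* linear predictor values                       */
run;
```

The resulting data set would contain every variable of newdata together with prob and xb.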
RESTRICT Statement
RESTRICT restriction1 < , restriction2 . . . > ;
The RESTRICT statement is used to impose linear restrictions on the parameter estimates. You can specify any number of RESTRICT statements. Each restriction is written as an expression, followed by an equality operator (=) or an inequality operator (<, >, <=, >=), followed by a second expression:
expression operator expression
The operator can be =, <, >, <=, or >=. Restriction expressions can be composed of parameters; multiplication (*), summation (+), and subtraction (-) operators; and constants. Parameters named in restriction expressions must be among the parameters estimated by the model. Parameters associated with a regressor variable are referred to by the name of the corresponding regressor variable. The restriction expressions must be a linear function of the parameters. Lagrange multipliers are reported for all the active linear constraints. In the displayed output, the Lagrange multiplier estimates are identified with the names Restrict1, Restrict2, and so forth. The probability of the Lagrange multipliers is computed using a beta distribution (LaMotte 1994). The following are examples of using the RESTRICT statement:
proc mdc data=one;
   model y = x1-x10 / type=clogit choice=(mode 1 2 3);
   id pid;
   restrict x1*2 <= x2 + x3;
run;

proc mdc data=newdata;
   model decision = ttime / type=mprobit nchoice=3
                            unitvariance=(1 2) covest=hess;
   id pid;
   restrict RHO_31 = 0, STD_3<=1;
run;
UTILITY Statement
UTILITY U(level < , choices > )= variables ;
The UTILITY statement can be used in estimating a nested logit model. The U()= option can have two arguments. The first argument contains level information, while the second argument is related to choice information. The second argument can be omitted for the first level when all the choices at the first level share the same variables and the same parameters. However, for any level above the first, the second argument must be provided. The UTILITY statement specifies a utility function while the NEST statement constructs the decision tree. Consider a two-level nested logit model that has one explanatory variable at level 1. This model can be specified as follows:
proc mdc data=one type=nlogit;
   model y = cost / choice=(choice);
   id pid;
   utility u(1,2 3 4) = cost;
   nest level(1) = (1 @ 1, 2 3 4 @ 2),
        level(2) = (1 2 @ 1);
run;
You also can specify the following statement, since all the variables at the first level share the same explanatory variable, cost:
utility u(1,) = cost;
The variable, cost, should be listed in the MODEL statement. When the additional explanatory variable, dummy, is included at level 2, another U()= option needs to be specified. Note that the U()= option must specify choices within any level above the first. Thus, it is specified as U(2, 1 2) in the following statements:
proc mdc data=one type=nlogit;
   model y = cost dummy / choice=(choice);
   id pid;
   utility u(1,) = cost,
           u(2,1 2) = dummy;
   nest level(1) = (1 @ 1, 2 3 4 @ 2),
        level(2) = (1 2 @ 1);
run;
The random utility that individual $i$ derives from alternative $j$ is written as

$$U_{ij} = V_{ij} + \epsilon_{ij}$$

where the subscript $i$ is an index for the individual, the subscript $j$ is an index for the alternative, $V_{ij}$ is a nonstochastic utility function, and $\epsilon_{ij}$ is a random component, or error, which captures unobserved characteristics of alternatives and/or individuals. In multinomial discrete choice models, the utility function is assumed to be linear, so that $V_{ij} = x_{ij}'\beta$.

In the conditional logit model, each $\epsilon_{ij}$ for all $j \in C_i$ is distributed independently and identically (iid) with the type I extreme-value distribution, $\exp(-\exp(-\epsilon_{ij}))$, also known as the Gumbel distribution. The iid assumption on the random components of the utilities of the different alternatives can be relaxed to overcome the well-known and restrictive Independence from Irrelevant Alternatives (IIA) property of the conditional logit model. This allows for more flexible substitution patterns among alternatives than the one imposed by the conditional logit model. See the section "Independence from Irrelevant Alternatives (IIA)" on page 905.

The nested logit model is derived by allowing the random components to be identical but nonindependent. Instead of independent type I extreme-value errors, the errors are assumed to have a generalized extreme-value distribution. This model generalizes the conditional logit model to allow for particular patterns of correlation in unobserved utility (McFadden 1978).

Another generalization of the conditional logit model, the heteroscedastic extreme-value (HEV) model, is obtained by allowing independent but nonidentical errors distributed with a type I extreme-value distribution (Bhat 1995). It permits different variances on the random components of utility across the alternatives.

Mixed logit models are also generalizations of the conditional logit model that can represent very general patterns of substitution among alternatives. See the section "Mixed Logit Model" on page 907 for details.

The multinomial probit (MNP) model is derived when the errors, $(\epsilon_{i1}, \epsilon_{i2}, \ldots, \epsilon_{iJ})$, have a multivariate normal (MVN) distribution. Thus, this model accommodates a very general error structure. The multinomial probit model requires burdensome computation compared to a family of multinomial choice models derived from the Gumbel distributed utility function, since it involves multidimensional integration (with dimension $J - 1$) in the estimation process. In addition, the multinomial probit model requires more parameters than other multinomial choice models. As a result, conditional and nested logit models are used more frequently, even though they are derived from a utility function whose random component is more restrictively defined than that of the multinomial probit model.

The event of a choice being made, $\{y_i = j\}$, can be expressed using a random utility function as follows:

$$U_{ij} \ge \max_{k \in C_i,\, k \ne j} U_{ik}$$

where $C_i$ is the choice set of individual $i$. Individual $i$ chooses alternative $j$ if and only if it provides a level of utility that is greater than or equal to that of any other alternative in his choice set. Then, the probability that individual $i$ chooses alternative $j$ (from among the $n_i$ choices in his choice set $C_i$) is

$$P_i(j) = P_{ij} = P\left[ x_{ij}'\beta + \epsilon_{ij} \ge \max_{k \in C_i} \left( x_{ik}'\beta + \epsilon_{ik} \right) \right]$$
When the explanatory variables contain only individual characteristics, the multinomial logit model is defined as

$$P(y_i = j) = P_{ij} = \frac{\exp(x_i'\beta_j)}{\sum_{k=0}^{J} \exp(x_i'\beta_k)} \quad \text{for } j = 0, \ldots, J$$

where $y_i$ is a random variable that indicates the choice made, $x_i$ is a vector of characteristics specific to the $i$th individual, and $\beta_j$ is a vector of coefficients specific to the $j$th alternative. Thus, this model involves choice-specific coefficients and only individual-specific regressors. For model identification, one often assumes that $\beta_0 = 0$. The multinomial logit model reduces to the binary logit model if $J = 1$.

The ratio of the choice probabilities for alternatives $j$ and $l$, or the odds ratio of alternatives $j$ and $l$, is

$$\frac{P_{ij}}{P_{il}} = \frac{\exp(x_i'\beta_j) / \sum_{k=0}^{J} \exp(x_i'\beta_k)}{\exp(x_i'\beta_l) / \sum_{k=0}^{J} \exp(x_i'\beta_k)} = \exp\left[ x_i'(\beta_j - \beta_l) \right]$$

Note that the odds ratio of alternatives $j$ and $l$ does not depend on any alternatives other than $j$ and $l$. For more on this, see the section "Independence from Irrelevant Alternatives (IIA)" on page 905.

The log-likelihood function of the multinomial logit model is

$$L = \sum_{i=1}^{N} \sum_{j=0}^{J} d_{ij} \ln P(y_i = j)$$

where

$$d_{ij} = \begin{cases} 1 & \text{if individual } i \text{ chooses alternative } j \\ 0 & \text{otherwise} \end{cases}$$
This type of multinomial choice modeling has a couple of weaknesses: it has too many parameters (the number of individual characteristics times $J$) and it is difficult to interpret. The multinomial logit model can be used to predict the choice probabilities, among a given set of $J + 1$ alternatives, of an individual with known vector of characteristics $x_i$.

The parameters of the multinomial logit model can be estimated with the TYPE=CLOGIT option in the MODEL statement; however, this requires modification of the conditional logit model to allow individual-specific effects.

The conditional logit model, sometimes also called the multinomial logit model, is similarly defined when choice-specific data are available. Using properties of the type I extreme-value (or Gumbel) distribution, the probability that individual $i$ chooses alternative $j$ from among the choices in his choice set $C_i$ is

$$P(y_i = j) = P_{ij} = P\left[ x_{ij}'\beta + \epsilon_{ij} \ge \max_{k \in C_i} \left( x_{ik}'\beta + \epsilon_{ik} \right) \right] = \frac{\exp(x_{ij}'\beta)}{\sum_{k \in C_i} \exp(x_{ik}'\beta)}$$
where $x_{ij}$ is a vector of attributes specific to the $j$th alternative as perceived by the $i$th individual. It is assumed that there are $n_i$ choices in each individual's choice set, $C_i$.

The log-likelihood function of the conditional logit model is

$$L = \sum_{i=1}^{N} \sum_{j \in C_i} d_{ij} \ln P(y_i = j)$$

The conditional logit model can be used to predict the probability that an individual will choose a previously unavailable alternative, given knowledge of $\beta$ and the vector $x_{ij}$ of choice-specific characteristics.
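As a small worked illustration of the conditional logit probability, suppose (hypothetically) that an individual faces three alternatives whose linear utilities are $x_{i1}'\beta = 1$, $x_{i2}'\beta = 0$, and $x_{i3}'\beta = -1$. These values are chosen only to make the arithmetic easy to follow.

```latex
\begin{align*}
\sum_{k \in C_i} \exp(x_{ik}'\beta) &= e^{1} + e^{0} + e^{-1}
   \approx 2.7183 + 1 + 0.3679 = 4.0862 \\
P_{i1} &= 2.7183 / 4.0862 \approx 0.6652 \\
P_{i2} &= 1 / 4.0862 \approx 0.2447 \\
P_{i3} &= 0.3679 / 4.0862 \approx 0.0900
\end{align*}
```

The three probabilities sum to 1 up to rounding, and only differences in utility matter: adding the same constant to all three utilities leaves the probabilities unchanged.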
The IIA property is restrictive from the point of view of choice behavior. Models that display the IIA property predict that a change in the attributes of one alternative changes the probabilities of the other alternatives proportionately such that the ratios of probabilities remain constant. Thus, cross elasticities due to a change in the attributes of an alternative $j$ are equal for all alternatives $k \ne j$. This particular substitution pattern might be too restrictive in some choice settings. The IIA property of the conditional logit model follows from the assumption that the random components of utility are identically and independently distributed. The other models in PROC MDC, namely, nested logit, HEV, mixed logit, and multinomial probit, relax the IIA property in different ways. For an example of Hausman's specification test of the IIA assumption, see "Example 17.6: Hausman's Specification and Likelihood Ratio Tests" on page 936.
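The proportional substitution implied by IIA can be seen in a small hypothetical calculation. Assume a conditional logit model with three alternatives whose linear utilities are 1, 0, and -1, and then remove the third alternative from the choice set.

```latex
\begin{align*}
\text{Three alternatives:}\quad
  & P_{i1} \approx 0.6652,\quad P_{i2} \approx 0.2447,\quad P_{i3} \approx 0.0900,
  & \frac{P_{i1}}{P_{i2}} &= e^{1-0} \approx 2.7183 \\
\text{Alternative 3 removed:}\quad
  & P_{i1} = \frac{e}{e+1} \approx 0.7311,\quad P_{i2} = \frac{1}{e+1} \approx 0.2689,
  & \frac{P_{i1}}{P_{i2}} &= e \approx 2.7183
\end{align*}
```

Both remaining probabilities are scaled by the same factor, $1/(1 - 0.0900) \approx 1.0989$, so their ratio is unchanged. A model that relaxes IIA would allow the removed alternative's share to be redistributed unevenly.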
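In the heteroscedastic extreme-value (HEV) model, the probability that individual $i$ chooses alternative $j$ is

$$P_i(j) = \int_{-\infty}^{\infty} \prod_{k \in C_i,\, k \ne j} \Gamma\left[ \frac{x_{ij}'\beta - x_{ik}'\beta + \theta_j w}{\theta_k} \right] \gamma(w)\, dw$$

where the choice set $C_i$ has $n_i$ elements and

$$\Gamma(x) = \exp(-\exp(-x))$$
$$\gamma(x) = \exp(-x)\,\Gamma(x)$$

are the cumulative distribution function and probability density function of the type I extreme-value distribution. The variance of the error term for the $j$th alternative is $\frac{1}{6}\pi^2\theta_j^2$. If the scale parameters, $\theta_j$, of the random components of utility of all alternatives are equal, then this choice probability is the same as that of the conditional logit model. The log-likelihood function of the HEV model can be written as

$$L = \sum_{i=1}^{N} \sum_{j \in C_i} d_{ij} \ln[P_i(j)]$$

where

$$d_{ij} = \begin{cases} 1 & \text{if individual } i \text{ chooses alternative } j \\ 0 & \text{otherwise} \end{cases}$$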
Since the log-likelihood function contains an improper integral function, it is computationally difficult to get a stable estimate. With the transformation $u = \exp(-w)$, the probability can be written

$$P_i(j) = \int_0^{\infty} \prod_{k \in C_i,\, k \ne j} \Gamma\left[ \frac{x_{ij}'\beta - x_{ik}'\beta - \theta_j \ln(u)}{\theta_k} \right] \exp(-u)\, du = \int_0^{\infty} G_{ij}(u) \exp(-u)\, du$$

Using the Gauss-Laguerre weight function, $W(u) = \exp(-u)$, the integration of the log-likelihood function can be replaced with a summation as follows:

$$\int_0^{\infty} G_{ij}(u) \exp(-u)\, du = \sum_{k=1}^{K} w_k\, G_{ij}(x_k)$$
Weights ($w_k$) and abscissas ($x_k$) are tabulated by Abramowitz and Stegun (1970).
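To make the quadrature step concrete, the following sketch applies the two-point ($K = 2$) Gauss-Laguerre rule to a simple test integrand, $G(u) = u^2$, whose exact integral against $\exp(-u)$ is 2. The test integrand is an assumption chosen for illustration; it is not the HEV integrand $G_{ij}(u)$.

```latex
\begin{align*}
x_1 &= 2 - \sqrt{2} \approx 0.5858, & w_1 &= \frac{2 + \sqrt{2}}{4} \approx 0.8536 \\
x_2 &= 2 + \sqrt{2} \approx 3.4142, & w_2 &= \frac{2 - \sqrt{2}}{4} \approx 0.1464 \\[4pt]
\sum_{k=1}^{2} w_k\, x_k^2 &\approx 0.8536 \times 0.3431 + 0.1464 \times 11.6569
   \approx 0.2929 + 1.7071 = 2.0000
   & \int_0^{\infty} u^2 e^{-u}\, du &= 2
\end{align*}
```

A $K$-point rule is exact for polynomial integrands up to degree $2K - 1$. The HEV integrand is not a polynomial, so in practice a larger $K$ is needed for an accurate approximation.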
The mixed logit model is based on the utility function

$$U_{ij} = x_{ij}'\beta + \xi_{ij} + \epsilon_{ij}$$

where $x_{ij}$ is a vector of observed variables relating to individual $i$ and alternative $j$, $\beta$ is a vector of parameters, $\xi_{ij}$ is an error component that can be correlated among alternatives and heteroscedastic for each individual, and $\epsilon_{ij}$ is a random term with zero mean that is independently and identically distributed over alternatives and individuals. The conditional logit model is derived if you assume $\epsilon_{ij}$ has an iid Gumbel distribution and $V(\xi_{ij}) = 0$.

The mixed logit model assumes a general distribution for $\xi_{ij}$ and an iid Gumbel distribution for $\epsilon_{ij}$. Denote the density function of the error component $\xi_{ij}$ as $f(\xi_{ij} \mid \gamma)$, where $\gamma$ is a parameter vector of the distribution of $\xi_{ij}$. The choice probability of alternative $j$ for individual $i$ is written as

$$P_i(j) = \int Q_i(j \mid \xi_{ij})\, f(\xi_{ij} \mid \gamma)\, d\xi_{ij}$$

where the conditional choice probability for a given value of $\xi_{ij}$ is the logit

$$Q_i(j \mid \xi_{ij}) = \frac{\exp(x_{ij}'\beta + \xi_{ij})}{\sum_{k \in C_i} \exp(x_{ik}'\beta + \xi_{ik})}$$
Since $\xi_{ij}$ is not given, the unconditional choice probability, $P_i(j)$, is the integral of the conditional choice probability, $Q_i(j \mid \xi_{ij})$, over the distribution of $\xi_{ij}$. This model is called mixed logit since the choice probability is a mixture of logits with $f(\xi_{ij} \mid \gamma)$ as the mixing distribution.
In general, the mixed logit model does not have an exact likelihood function since the probability $P_i(j)$ does not always have a closed-form solution. Therefore, a simulation method is used for computing the approximate probability:

$$\tilde{P}_i(j) = \frac{1}{S} \sum_{s=1}^{S} \tilde{Q}_i(j \mid \xi_{ij}^{s})$$

where $S$ is the number of simulation replications and $\tilde{P}_i(j)$ is a simulated probability. The simulated log-likelihood function is computed as

$$\tilde{L} = \sum_{i=1}^{N} \sum_{j=1}^{n_i} d_{ij} \ln(\tilde{P}_i(j))$$

where

$$d_{ij} = \begin{cases} 1 & \text{if individual } i \text{ chooses alternative } j \\ 0 & \text{otherwise} \end{cases}$$
For simulation purposes, assume that the error component has a specific structure,

$$\xi_{ij} = z_{ij}'\mu + w_{ij}'\beta^{*}$$

where $z_{ij}$ is a vector of observed data and $\mu$ is a random vector with zero mean and density function $\psi(\mu \mid \gamma)$. The observed data vector ($z_{ij}$) of the error component can contain some or all elements of $x_{ij}$. The component $z_{ij}'\mu$ induces heteroscedasticity and correlation across unobserved utility components of the alternatives. This allows flexible substitution patterns among the alternatives. The $k$th element of vector $\mu$ is distributed as

$$\mu_k \sim (0, \sigma_k^2)$$

Therefore, $\mu_k$ can be specified as

$$\mu_k = \sigma_k\, \epsilon_{\mu}$$

where

$$\epsilon_{\mu} \sim N(0, 1) \quad \text{or} \quad \epsilon_{\mu} \sim U(-\sqrt{3}, \sqrt{3})$$
In addition, $\beta^{*}$ is a vector of random parameters (or random coefficients). Random coefficients allow heterogeneity across individuals in their sensitivity to observed exogenous variables. The observed data vector, $w_{ij}$, is a subset of $x_{ij}$. Three types of distributions for the random coefficients are supported. Denote the $m$th element of $\beta^{*}$ as $\beta_m^{*}$.

Normally distributed coefficient with the mean $b_m$ and spread $s_m$ being estimated:

$$\beta_m^{*} = b_m + s_m \epsilon_{\beta} \quad \text{and} \quad \epsilon_{\beta} \sim N(0, 1)$$

Uniformly distributed coefficient with the mean $b_m$ and spread $s_m$ being estimated. A uniform distribution with mean $b$ and spread $s$ is $U(b - s, b + s)$:

$$\beta_m^{*} = b_m + s_m \epsilon_{\beta} \quad \text{and} \quad \epsilon_{\beta} \sim U(-1, 1)$$

Lognormally distributed coefficient:

$$\beta_m^{*} = \exp(b_m + s_m \epsilon_{\beta}) \quad \text{and} \quad \epsilon_{\beta} \sim N(0, 1)$$

where $b_m$ and $s_m$ are parameters that are estimated.

The estimate of spread for normally, uniformly, and lognormally distributed coefficients can be negative. The absolute value of the estimated spread can be interpreted as an estimate of standard deviation for normally distributed coefficients. A detailed description of mixed logit models can be found, for example, in Brownstone and Train (1999).
Multinomial Probit
The multinomial probit model allows the random components of the utility of the different alternatives to be nonindependent and nonidentical. Thus, it does not have the IIA property. The increase in the flexibility of the error structure comes at the expense of introducing several additional parameters in the covariance matrix of the errors. Consider the random utility function

$$U_{ij} = x_{ij}'\beta + \epsilon_{ij}$$

where the joint distribution of $(\epsilon_{i1}, \epsilon_{i2}, \ldots, \epsilon_{iJ})$ is multivariate normal:

$$\begin{bmatrix} \epsilon_{i1} \\ \epsilon_{i2} \\ \vdots \\ \epsilon_{iJ} \end{bmatrix} \sim N(0, \Sigma), \qquad \Sigma = \left[ \sigma_{jk} \right]_{j,k=1,\ldots,J}$$

The dimension of the error covariance matrix is determined by the number of alternatives $J$. Given $(x_{i1}, x_{i2}, \ldots, x_{iJ})$, the $j$th alternative is chosen if and only if $U_{ij} \ge U_{ik}$ for all $k \ne j$. Thus, the probability that the $j$th alternative is chosen is

$$P(y_i = j) = P_{ij} = P\left[ \epsilon_{i1} - \epsilon_{ij} < (x_{ij} - x_{i1})'\beta, \; \ldots, \; \epsilon_{iJ} - \epsilon_{ij} < (x_{ij} - x_{iJ})'\beta \right]$$
where $y_i$ is a random variable that indicates the choice made. This is a cumulative probability from a $(J - 1)$-variate normal distribution. Since evaluation of this probability involves multidimensional integration, it is practical to use a simulation method to estimate the model. Many studies have shown that the simulators proposed by Geweke (1989), Hajivassiliou (1993), and Keane (1994) (GHK) perform well. For example, Hajivassiliou, McFadden, and Ruud (1996) compare 13 simulators using 11 different simulation methods and conclude that the GHK simulation method is the most reliable. To compute the probability of the multivariate normal distribution, the recursive simulation method is used. Refer to Hajivassiliou (1993) for more details on GHK simulators.

The log-likelihood function for the multinomial probit model can be written as

$$L = \sum_{i=1}^{N} \sum_{j=1}^{J} d_{ij} \ln P(y_i = j)$$

where

$$d_{ij} = \begin{cases} 1 & \text{if individual } i \text{ chooses alternative } j \\ 0 & \text{otherwise} \end{cases}$$
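For identification of the multinomial probit model, two of the diagonal elements of $\Sigma$ are normalized to 1, and it is assumed that for one of the choices whose error variance is normalized to 1 (say, $k$), it is also true that $\sigma_{jk} = \sigma_{kj} = 0$ for $j = 1, \ldots, J$ and $j \ne k$. Thus, a model with $J$ alternatives has at most $J(J-1)/2 - 1$ covariance parameters after normalization.

Let $D$ and $R$ be defined as

$$D = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_J \end{bmatrix}, \qquad R = \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1J} \\ \rho_{21} & 1 & \cdots & \rho_{2J} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{J1} & \rho_{J2} & \cdots & 1 \end{bmatrix}$$

where $\sigma_j^2 = \sigma_{jj}$ and $\rho_{jk} = \sigma_{jk} / \sqrt{\sigma_{jj}\sigma_{kk}}$. Then, for identification, $\sigma_{J-1} = \sigma_J = 1$ and $\rho_{kJ} = \rho_{Jk} = 0$, for all $k \ne J$, can be imposed, and the error covariance matrix is $\Sigma = DRD$.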
In the standard MDC output, the parameter estimates STD_j and RHO_jk correspond to $\sigma_j$ and $\rho_{jk}$.
In principle, the multinomial probit model is fully identified with the preceding normalizations. However, in practice, convergence in applications of the model with more than three alternatives often requires additional restrictions on the elements of $\Sigma$. It must also be noted that the unrestricted structure of the error covariance matrix makes it impossible to forecast demand for a new alternative without knowledge of the new $(J + 1)$ by $(J + 1)$ error covariance matrix.
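As a worked illustration of the normalization, consider $J = 3$ alternatives under the restrictions $\sigma_2 = \sigma_3 = 1$ and $\rho_{13} = \rho_{23} = 0$. This is a sketch of the default-style normalization described above; a UNITVARIANCE= list that names other choices would move the unit variances accordingly.

```latex
\[
D = \begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},
\quad
R = \begin{bmatrix} 1 & \rho_{12} & 0 \\ \rho_{12} & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},
\quad
\Sigma = DRD =
\begin{bmatrix}
\sigma_1^2 & \rho_{12}\sigma_1 & 0 \\
\rho_{12}\sigma_1 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
\]
```

Two covariance parameters remain free, $\sigma_1$ and $\rho_{12}$, which agrees with the bound $J(J-1)/2 - 1 = 3 \cdot 2 / 2 - 1 = 2$.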
Nested Logit
The nested logit model (McFadden 1978, 1981) allows partial relaxation of the assumption of independence of the stochastic components of utility of alternatives. In some choice situations, the IIA property holds for some pairs of alternatives but not all. In these situations, the nested logit model can be used if the set of alternatives faced by an individual can be partitioned into subsets such that the IIA property holds within subsets but not across subsets.

In the nested logit model, the joint distribution of the errors is generalized extreme value (GEV). This is a generalization of the type I extreme-value distribution that gives rise to the conditional logit model. Note that all $\epsilon_{ij}$ within each subset are correlated with each other. Refer to McFadden (1978, 1981) for details.

Nested logit models can be described analytically following the notation of McFadden (1981). Assume that there are $L$ levels, with 1 representing the lowest and $L$ representing the highest level of the tree. The index of a node at level $h$ in the tree is a pair $(j_h, \pi_h)$, where $\pi_h = (j_{h+1}, \ldots, j_L)$ is the index of the adjacent node at level $h + 1$. Thus, the primitive alternatives, at level 1 in the tree, are indexed by vectors $(j_1, \ldots, j_L)$, while the alternative nodes at level $L$ are indexed by integers $j_L$. The choice set $C_{\pi_h}$ is the set of primitive alternatives (at level 1) that belong to branches below the node $\pi_h$. The notation $C_{\pi_h}$ is also used to denote a set of indices $j_h$ such that $(j_h, \pi_h)$ is a node immediately below $\pi_h$. Note that $C_{\pi_0}$ is a set with a single element, while $C_{\pi_L}$ represents a choice set containing all possible alternatives.

As an example, consider the circled node at level 1 in Figure 17.26. Since it stems from node 11, $\pi_h = 11$, and since it is the second node stemming from 11, $j_h = 2$, so that its index is $\pi_{h-1} = \pi_0 = (j_h, \pi_h) = 211$. Similarly, $C_{11} = \{111, 211, 311\}$ contains all the possible choices below 11.

Note that while this notation is useful for writing closed-form solutions for probabilities, the MDC procedure allows a more flexible definition of indices. See the NEST statement in the section "Syntax" for more details on how to describe decision trees within the MDC procedure.
Figure 17.26 Node Indices for a Three-Level Tree
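Let $x^{(h)}_{i;j_h\pi_h}$ denote the vector of observed variables for individual $i$ common to the alternatives below node $j_h\pi_h$. The probability of choice at level $h$ has a closed-form solution and is written

$$P_i(j_h \mid \pi_h) = \frac{\exp\left[ x^{(h)\prime}_{i;j_h\pi_h}\beta^{(h)} + \sum_{k \in C_{i;j_h\pi_h}} I_{k,j_h\pi_h}\, \theta_{k,j_h\pi_h} \right]}{\sum_{j \in C_{i;\pi_h}} \exp\left[ x^{(h)\prime}_{i;j\pi_h}\beta^{(h)} + \sum_{k \in C_{i;j\pi_h}} I_{k,j\pi_h}\, \theta_{k,j\pi_h} \right]}, \quad h = 2, \ldots, L$$

where $I_{\pi_h}$ is the inclusive value (at level $h + 1$) of the branch below node $\pi_h$ and is defined recursively as follows:

$$I_{\pi_h} = \ln\left\{ \sum_{j \in C_{i;\pi_h}} \exp\left[ x^{(h)\prime}_{i;j\pi_h}\beta^{(h)} + \sum_{k \in C_{i;j\pi_h}} I_{k,j\pi_h}\, \theta_{k,j\pi_h} \right] \right\}$$

$$0 \le \theta_{k,\pi_1} \le \cdots \le \theta_{k,\pi_{L-1}}$$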
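The inclusive value $I_{\pi_h}$ denotes the average utility that the individual can expect from the branch below $\pi_h$. The dissimilarity parameters or inclusive value parameters ($\theta_{k,j\pi_h}$) are the coefficients of the inclusive values and have values between 0 and 1 if nested logit is the correct model specification. When they all take value 1, the nested logit model is equivalent to the conditional logit model.

At decision level 1, there is no inclusive value; that is, $I_{\pi_0} = 0$. Therefore, the conditional probability is

$$P_i(j_1 \mid \pi_1) = \frac{\exp\left[ x^{(1)\prime}_{i;j_1\pi_1}\beta^{(1)} \right]}{\sum_{j \in C_{i;\pi_1}} \exp\left[ x^{(1)\prime}_{i;j\pi_1}\beta^{(1)} \right]}$$

The log-likelihood function at level $h$ can then be written

$$L^{(h)} = \sum_{i=1}^{N} \sum_{\pi_{h'} \in C_{i,\pi_{h+1}}} \sum_{j \in C_{i,\pi_{h'}}} y_{i,j\pi_{h'}} \ln P(C_{i,j\pi_{h'}} \mid C_{i,\pi_{h'}})$$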
where $y_{i,j\pi_{h'}}$ is an indicator variable that has the value of 1 for the selected choice. The full log-likelihood function of the nested logit model is obtained by adding the conditional log-likelihood functions at each level:

$$L = \sum_{h=1}^{L} L^{(h)}$$
Note that the log-likelihood functions are computed from conditional probabilities when h < L. The nested logit model is estimated using the full information maximum likelihood method.
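The role of the inclusive value and the dissimilarity parameter can be seen in a small hypothetical two-level tree: alternatives 1 and 2 lie below node 1, alternative 3 lies alone below node 2, every level 1 utility is zero, and there are no level 2 explanatory variables. The simplified subscripts ($I_1$, $\theta_1$, and so on, indexed by level 2 node) are an assumption made for readability.

```latex
\begin{align*}
I_1 &= \ln\left(e^{0} + e^{0}\right) = \ln 2 \approx 0.6931, \qquad
I_2 = \ln\left(e^{0}\right) = 0 \\
P(\text{node } 1) &= \frac{\exp(\theta_1 I_1)}{\exp(\theta_1 I_1) + \exp(\theta_2 I_2)}
   = \frac{2^{\theta_1}}{2^{\theta_1} + 1} \\
\theta_1 = 1: \quad & P(\text{node } 1) = 2/3, \quad
   P(1) = P(2) = \tfrac{1}{2} \cdot \tfrac{2}{3} = \tfrac{1}{3}, \quad P(3) = \tfrac{1}{3} \\
\theta_1 = 0.5: \quad & P(\text{node } 1) = \frac{\sqrt{2}}{\sqrt{2} + 1} \approx 0.5858, \quad
   P(1) = P(2) \approx 0.2929, \quad P(3) \approx 0.4142
\end{align*}
```

With $\theta_1 = 1$ the probabilities coincide with those of a conditional logit model with three equally attractive alternatives. With $\theta_1 = 0.5$ the two nested alternatives are treated as closer substitutes, so the nest as a whole attracts a smaller share than two independent alternatives would.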
A nested logit model with two levels can be specified using the following SAS statements:
proc mdc data=one type=nlogit;
   model decision = x1 x2 x3 x4 x5 /
         choice=(upmode 1 2, mode 1 2 3 4 5);
   id pid;
   utility u(1, 3 4 5 @ 2) = x1 x2,
           u(1, 1 2 @ 1) = x3 x4,
           u(2, 1 2) = x5;
run;
All model variables, x1 through x5, are specified in the UTILITY statement. It is required that entries denoted as # have values for model estimation and prediction. The values of the level 2 utility variable x5 should be the same for all the primitive (level 1) alternatives below node 1 at level 2 and, similarly, for all the primitive alternatives below node 2 at level 2. In other words, x5 should have the same value for primitive alternatives 1 and 2 and, similarly, it should have the same value for primitive alternatives 3, 4, and 5. More generally, the values of any level 2 or higher utility function variables should be constant across primitive alternatives under each node for which the utility function applies. Since PROC MDC expects this to be the case, it uses the values of x5 only for the primitive alternatives 1 and 3, ignoring the values for the primitive alternatives 2, 4, and 5. Thus, PROC MDC uses the values of the utility function variable only for the primitive alternatives that come first under each node for which the utility function applies. This behavior applies to any utility function variables that are specified above the first level. The choice variable for level 2 (upmode) should be placed before the first-level choice variable (mode) when the CHOICE= option is given. Alternatively, the NEST statement can be used to specify the decision tree. The following SAS statements fit the same nested logit model:
proc mdc data=a type=nlogit;
   model decision = x1 x2 x3 x4 x5 /
         choice=(mode 1 2 3 4 5);
   id pid;
   utility u(1, 3 4 5 @ 2) = x1 x2,
           u(1, 1 2 @ 1) = x3 x4,
           u(2, 1 2) = x5;
   nest level(1) = (1 2 @ 1, 3 4 5 @ 2),
        level(2) = (1 2 @ 1);
run;
The U(1, 3 4 5 @ 2)= option specifies three choices, 3, 4, and 5, at level 1 of the decision tree. They are connected to the upper branch 2. The specified variables (x1 and x2) are used to model this utility function. The bottom level of the decision tree is level 1. All variables in the UTILITY statement must be included in the MODEL statement. When all choices at the first level share the same variables, you can omit the second argument of the U()= option for that level. However, U(1, ) = x1 x2 is not equivalent to the following statements:
u(1, 3 4 5 @ 2) = x1 x2;
u(1, 1 2 @ 1) = x1 x2;
The CHOICE= variables need to be specified from the top to the bottom level. To forecast demand for new products, stated preference data are widely used. Stated preference data are attractive for market researchers since attribute variations can be controlled. Hensher (1993) explores the advantage of combining revealed preference (market data) and stated preference data. The scale factor ($V_{rp}/V_{sp}$) can be estimated using the nested logit model with the decision tree structure displayed in Figure 17.28.
Figure 17.28 Decision Tree for Revealed and Stated Preference Data
The SPSCALE option specifies that parameters of inclusive values for nodes 2, 3, and 4 at level 2 be the same. When you specify the SAMESCALE option, the MDC procedure imposes the same coefficient of inclusive values for choices 1–4.
\[ R^2_M = 1 - \frac{\ln L}{\ln L_0} \]

where L is the maximum of the likelihood function and L0 is the maximum of the likelihood function when all coefficients, except for an intercept term, are zero. McFadden's likelihood ratio index is bounded by 0 and 1. Estrella (1998) proposes the following requirements for a goodness-of-fit measure to be desirable in discrete choice modeling:

   - The measure must take values in [0, 1], where 0 represents no fit and 1 corresponds to perfect fit.
   - The measure should be directly related to the valid test statistic for the significance of all slope coefficients.
   - The derivative of the measure with respect to the test statistic should comply with corresponding derivatives in a linear regression.

Estrella's measure is written as

\[ R^2_{E1} = 1 - \left( \frac{\ln L}{\ln L_0} \right)^{-(2/N) \ln L_0} \]

Estrella suggests an alternative measure,

\[ R^2_{E2} = 1 - \left[ \frac{\ln L - K}{\ln L_0} \right]^{-(2/N) \ln L_0} \]

where ln L0 is computed with null parameter values, N is the number of observations used, and K represents the number of estimated parameters. Other goodness-of-fit measures are summarized as follows:
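\[ R^2_{CU1} = 1 - \left( \frac{L_0}{L} \right)^{2/N} \]

\[ R^2_{CU2} = \frac{1 - (L_0 / L)^{2/N}}{1 - L_0^{2/N}} \]

\[ R^2_{A} = \frac{2 (\ln L - \ln L_0)}{2 (\ln L - \ln L_0) + N} \]

\[ R^2_{VZ} = R^2_{A} \, \frac{2 \ln L_0 - N}{2 \ln L_0} \]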
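As a rough illustration, McFadden's likelihood ratio index and Estrella's two measures can be evaluated by hand in a DATA step. The sketch below plugs in the log-likelihood values printed in the fit summary of Output 17.5.5 (shown later in this chapter). The value k = 10 is an assumption taken from the number of rows in the corresponding parameter estimates table, and LogL(0) is printed there only in rounded form, so the results should be read as approximate.

```sas
data _null_;
   /* values copied from Output 17.5.5; LogL(0) is rounded there */
   lnL  = -994.39402;
   lnL0 = -1310;
   n    = 527;     /* number of observations */
   k    = 10;      /* assumed number of estimated parameters */

   /* McFadden's likelihood ratio index */
   r2_m  = 1 - lnL / lnL0;

   /* Estrella's measures; the exponent is -(2/N) ln L0 */
   expo  = -(2 / n) * lnL0;
   r2_e1 = 1 - (lnL / lnL0) ** expo;
   r2_e2 = 1 - ((lnL - k) / lnL0) ** expo;

   put r2_m= r2_e1= r2_e2=;
run;
```

The same quantities are reported by PROC MDC in its goodness-of-fit table, so a hand calculation like this is mainly useful as a check on how the formulas are applied.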
The AIC and SBC are computed as follows:

\[ \mathrm{AIC} = -2 \ln(L) + 2k \]

\[ \mathrm{SBC} = -2 \ln(L) + \ln(n)\, k \]

where ln(L) is the log-likelihood value for the model, k is the number of parameters estimated, and n is the number of observations (that is, the number of respondents).
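A short worked calculation may make the two criteria concrete. It uses the log-likelihood value and the number of observations printed in Output 17.4.1 (shown later in this chapter), and it assumes k = 3 estimated parameters, a value chosen here because it reproduces the reported criteria and not because the source states it.

```latex
\begin{aligned}
\mathrm{AIC} &= -2(-33.41383) + 2(3) = 66.82766 + 6 = 72.82766 \\
\mathrm{SBC} &= -2(-33.41383) + 3\ln(50) = 66.82766 + 11.73607 = 78.56373
\end{aligned}
```

Both results agree with the AIC and Schwarz Criterion values in Output 17.4.1 up to rounding in the last digit.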
You need to specify all possible choices in the CHOICE= option since the OUTEST= option is specified:

proc mdc data=a outest=e;
   model y = x / type=hev choice=(alter 1 2 3);
run;
When the NCHOICE= option is given, no additional information about possible choices is required. Therefore, the following are correct SAS statements:
The nested logit model does not produce the OUTEST= data set unless the NEST statement is specified. Each parameter contains the estimate for the corresponding parameter in the corresponding model. In addition, the OUTEST= data set contains the following variables:

_DEPVAR_    the name of the dependent variable
_METHOD_    the estimation method
_MODEL_     the label of the MODEL statement if one is given, or blank otherwise
_STATUS_    a character variable indicating whether the optimization process reached convergence or failed to converge; 0 indicates that the convergence was reached, 1 indicates that the maximum number of iterations allowed was exceeded, 2 indicates a failure to improve the function value, and 3 indicates a failure to converge because objective function or its derivatives could not be evaluated, or improved, or linear constraints were dependent, or the algorithm failed to return to feasible region, or the number of iterations was greater than prespecified.
_NAME_      the name of the row of the covariance matrix for the parameter estimate, if the COVOUT option is specified, or blank otherwise
_LIKLHD_    the log-likelihood value
_STDERR_    standard error of the parameter estimate, if the COVOUT option is specified
_TYPE_      PARMS for observations containing parameter estimates, or COV for observations containing covariance matrix elements

The OUTEST= data set contains one observation for the MODEL statement giving the parameter estimates for that model. If the COVOUT option is specified, the OUTEST= data set includes additional observations for the MODEL statement giving the rows of the covariance matrix of parameter estimates. For covariance observations, the value of the _TYPE_ variable is COV, and the _NAME_ variable identifies the parameter associated with that row of the covariance matrix.
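The layout described above can be inspected directly. The following is a minimal sketch that reuses the conditional logit model and the smdata1 data set from Example 17.1 later in this chapter; the output data set name est is an arbitrary choice for illustration.

```sas
/* write estimates and covariance rows to the data set EST */
proc mdc data=smdata1 outest=est covout;
   model decision = choice2 gpa_2 tuce_2 psi_2 /
            type=clogit nchoice=2;
   id id;
run;

/* the single observation holding the parameter estimates */
proc print data=est;
   where _TYPE_ = 'PARMS';
run;

/* one observation per row of the covariance matrix */
proc print data=est;
   where _TYPE_ = 'COV';
run;
```

Splitting the listing by _TYPE_ keeps the estimates row apart from the covariance rows, which is also how the %IIA macro in Example 17.5 reads the data set.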
Table 17.3

ODS Table Name               Description                                Option

ODS Tables Created by the MODEL Statement
ResponseProfile              Response Profile                           default
ClassLevels                  Class Levels                               default
FitSummary                   Summary of Nonlinear Estimation            default
GoodnessOfFit                Pseudo-R2 Measures                         default
ConvergenceStatus            Convergence Status                         default
ParameterEstimates           Parameter Estimates                        default
CovB                         Covariance of Parameter Estimates          COVB
CorrB                        Correlation of Parameter Estimates         CORRB
LinCon                       Linear Constraints                         ITPRINT
InputOptions                 Input Options                              ITPRINT
ProblemDescription           Problem Description                        ITPRINT
IterStart                    Optimization Start                         ITPRINT
IterHist                     Iteration History                          ITPRINT
IterStop                     Optimization Results                       ITPRINT
ConvergenceStatus            Convergence Status                         ITPRINT
ParameterEstimatesResults    Resulting Parameters                       ITPRINT
LinConSol                    Linear Constraints Evaluated at Solution   ITPRINT

ODS Tables Created by the TEST Statement
TestResults                  Test Results                               default
data smdata;
   input gpa tuce psi grade;
datalines;
2.66 20 0 0
2.89 22 0 0
3.28 24 0 0
... more lines ...

data smdata1;
   set smdata;
   retain id 0;
   id + 1;
   /*-- first choice --*/
   choice1 = 1;
   choice2 = 0;
   decision = (grade = 0);
   gpa_2 = 0;
   tuce_2 = 0;
   psi_2 = 0;
   output;
   /*-- second choice --*/
   choice1 = 0;
   choice2 = 1;
   decision = (grade = 1);
   gpa_2 = gpa;
   tuce_2 = tuce;
   psi_2 = psi;
   output;
run;
The first 10 observations are displayed in Output 17.1.1. The variables related to grade=0 are omitted since these are not used for binary choice model estimation.

Output 17.1.1 Converted Binary Data

id   decision   choice2   gpa_2   tuce_2   psi_2
 1       1         0       0.00      0       0
 1       0         1       2.66     20       0
 2       1         0       0.00      0       0
 2       0         1       2.89     22       0
 3       1         0       0.00      0       0
 3       0         1       3.28     24       0
 4       1         0       0.00      0       0
 4       0         1       2.92     12       0
 5       0         0       0.00      0       0
 5       1         1       4.00     21       0
Consider the choice probability of the conditional logit model for binary choice:

\[ P_i(j) = \frac{\exp(x_{ij}'\beta)}{\sum_{k=1}^{2} \exp(x_{ik}'\beta)}, \quad j = 1, 2 \]

The choice probability of the binary logit model is computed based on normalization. The preceding conditional logit model can be converted as

\[ P_i(1) = 1 - P_i(2) = \frac{1}{1 + \exp((x_{i2} - x_{i1})'\beta)} \]
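The normalization step can be spelled out. The following short derivation divides the numerator and denominator of the conditional logit probability for alternative 1 by its own numerator; it uses only the symbols already defined above.

```latex
\begin{aligned}
P_i(1) &= \frac{\exp(x_{i1}'\beta)}{\exp(x_{i1}'\beta) + \exp(x_{i2}'\beta)}
        = \frac{1}{1 + \exp\bigl((x_{i2} - x_{i1})'\beta\bigr)} \\
P_i(2) &= 1 - P_i(1)
        = \frac{\exp\bigl((x_{i2} - x_{i1})'\beta\bigr)}{1 + \exp\bigl((x_{i2} - x_{i1})'\beta\bigr)}
\end{aligned}
```

Only the difference between the two alternatives' characteristics enters either probability, which is why the characteristics of the first choice can be set to zero without loss of generality.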
Therefore, you can interpret the binary choice data as the difference between the first and second choice characteristics. In the following statements, it is assumed that $x_{i1} = 0$. The binary logit model is estimated and displayed in Output 17.1.2.
/*-- Conditional Logit --*/
proc mdc data=smdata1;
   model decision = choice2 gpa_2 tuce_2 psi_2 /
            type=clogit nchoice=2 covest=hess;
   id id;
run;
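\[ P_i(j) = P[\epsilon_{i1} - \epsilon_{ij} < (x_{ij} - x_{i1})'\beta, \ldots, \epsilon_{iJ} - \epsilon_{ij} < (x_{ij} - x_{iJ})'\beta] \]

The probabilities of choice of the two alternatives can be written as

\[ P_i(1) = P[\epsilon_{i2} - \epsilon_{i1} < (x_{i1} - x_{i2})'\beta] \]

\[ P_i(2) = P[\epsilon_{i1} - \epsilon_{i2} < (x_{i2} - x_{i1})'\beta] \]

where

\[ \begin{bmatrix} \epsilon_{i1} \\ \epsilon_{i2} \end{bmatrix} \sim N\left( 0, \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} \right) \]

This model is estimated and displayed in Output 17.1.3. You do not get the same estimates as that of the usual binary probit model. The probabilities of choice in the binary probit model are

\[ P_i(2) = P[\epsilon_i < x_i'\beta] \]

\[ P_i(1) = 1 - P[\epsilon_i < x_i'\beta] \]

where $\epsilon_i \sim N(0, 1)$. However, the multinomial probit model has the error variance $\mathrm{Var}(\epsilon_{i2} - \epsilon_{i1}) = \sigma_1^2 + \sigma_2^2$ if $\epsilon_{i1}$ and $\epsilon_{i2}$ are independent ($\sigma_{12} = 0$). In the following statements, unit variance restrictions are imposed on choices 1 and 2 ($\sigma_1^2 = \sigma_2^2 = 1$). Therefore, the usual binary probit estimates (and standard errors) can be obtained by multiplying the multinomial probit estimates (and standard errors) in Output 17.1.3 by $1/\sqrt{2}$.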
/*-- Multinomial Probit --*/
proc mdc data=smdata1;
   model decision = choice2 gpa_2 tuce_2 psi_2 /
            type=mprobit nchoice=2 covest=hess
            unitvariance=(1 2);
   id id;
run;
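The rescaling between the two sets of probit estimates follows from standardizing the error difference. The sketch below assumes the first alternative's characteristics are zero, as in the converted data, and uses the labels $\beta_{MNP}$ and $\beta_{BP}$, which are introduced here only to tell the multinomial and binary probit coefficients apart.

```latex
\begin{aligned}
\mathrm{Var}(\epsilon_{i1} - \epsilon_{i2}) &= \sigma_1^2 + \sigma_2^2 = 2
  \quad\Longrightarrow\quad
  \frac{\epsilon_{i1} - \epsilon_{i2}}{\sqrt{2}} \sim N(0, 1) \\
P_i(2) &= P\bigl[\epsilon_{i1} - \epsilon_{i2} < x_{i2}'\beta_{MNP}\bigr]
        = \Phi\!\left( \frac{x_{i2}'\beta_{MNP}}{\sqrt{2}} \right) \\
\beta_{BP} &= \frac{\beta_{MNP}}{\sqrt{2}}
\end{aligned}
```

Because the standard errors are scaled by the same constant, the t ratios are unchanged by the conversion.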
data travel;
   length mode $ 8;
   input auto transit mode $;
datalines;
52.9 4.4 Transit
4.1 28.5 Transit
4.1 86.9 Auto
56.2 31.6 Transit
51.8 20.2 Transit
0.2 91.2 Auto
27.6 79.7 Auto
89.9 2.2 Transit
41.5 24.5 Transit
95.0 43.5 Transit
... more lines ...
The travel time is stored in two variables, auto and transit. In addition, the chosen alternatives are stored in a character variable, mode. The choice variable, mode, is converted to a numeric variable, decision, since the MDC procedure supports only numeric variables. The following statements convert the original data set, travel, and estimate the binary logit model. The first 10 observations of a relevant subset of the new data set and the parameter estimates are displayed in Output 17.2.1 and Output 17.2.2, respectively.
data new;
   set travel;
   retain id 0;
   id+1;
   /*-- create auto variable --*/
   decision = (upcase(mode) = 'AUTO');
   ttime = auto;
   autodum = 1;
   trandum = 0;
   output;
   /*-- create transit variable --*/
   decision = (upcase(mode) = 'TRANSIT');
   ttime = transit;
   autodum = 0;
   trandum = 1;
   output;
run;

proc print data=new(obs=10);
   var decision autodum trandum ttime;
   id id;
run;
In order to handle more general cases, you can use the MDCDATA statement. Choice-specific dummy variables are generated and multiple observations for each individual are created. The following example converts the original data set travel by using the MDCDATA statement and performs conditional logit analysis. Interleaved data are output into the new data set new3. This data set has twice as many observations as the original travel data set.
proc mdc data=travel;
   mdcdata varlist( x1 = (auto transit) ) select=mode
           id=id alt=alternative decvar=Decision / out=new3;
   model decision = auto x1 / nchoice=2
The first nine observations of the modified data set are shown in Output 17.2.3. The result of the preceding program is listed in Output 17.2.4.

Output 17.2.3 Transformed Model Choice Data

Obs   MODE      AUTO   TRANSIT    X1    ID   ALTERNATIVE   DECISION
  1   TRANSIT     1       0      52.9    1        1            0
  2   TRANSIT     0       1       4.4    1        2            1
  3   TRANSIT     1       0       4.1    2        1            0
  4   TRANSIT     0       1      28.5    2        2            1
  5   AUTO        1       0       4.1    3        1            1
  6   AUTO        0       1      86.9    3        2            0
  7   TRANSIT     1       0      56.2    4        1            0
  8   TRANSIT     0       1      31.6    4        2            1
  9   TRANSIT     1       0      51.8    5        1            0
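\[ U_{ij} = V_{ij} + \epsilon_{ij}, \quad j = 1, 2, 3 \]

where

\[ \epsilon_{ij} \sim N\left( 0, \begin{bmatrix} 2 & 0.6 & 0 \\ 0.6 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \right) \]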
/*-- generate simulated series --*/
%let ndim = 3;
%let nobs = 1000;

data trichoice;
   array error{&ndim} e1-e3;
   array vtemp{&ndim} _temporary_;
   array lm{6} _temporary_ (1.4142136 0.4242641 0.9055385
                            0 0 1);
   retain nseed 345678 useed 223344;

   do id = 1 to &nobs;
      index = 0;
      /* generate independent normal variate */
      do i = 1 to &ndim;
         /* index of diagonal element */
         vtemp{i} = rannor(nseed);
      end;

      /* get multivariate normal variate */
      index = 0;
      do i = 1 to &ndim;
         error{i} = 0;
         do j = 1 to i;
            error{i} = error{i} + lm{index+j}*vtemp{j};
         end;
         index = index + i;
      end;

      x1 = 1.0 + 2.0 * ranuni(useed);
      x2 = 1.2 + 2.0 * ranuni(useed);
      x3 = 1.5 + 1.2 * ranuni(useed);

      util1 = 2.0 * x1 + e1;
      util2 = 2.0 * x2 + e2;
      util3 = 2.0 * x3 + e3;

      do i = 1 to &ndim;
         vtemp{i} = 0;
      end;

      if ( util1 > util2 & util1 > util3 ) then
         vtemp{1} = 1;
      else if ( util2 > util1 & util2 > util3 ) then
         vtemp{2} = 1;
      else if ( util3 > util1 & util3 > util2 ) then
         vtemp{3} = 1;
      else continue;

      /*-- first choice --*/
      x = x1;
      mode = 1;
      decision = vtemp{1};
      output;
      /*-- second choice --*/
      x = x2;
      mode = 2;
      decision = vtemp{2};
      output;
      /*-- third choice --*/
      x = x3;
      mode = 3;
      decision = vtemp{3};
      output;
   end;
run;
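The six values stored in the lm array of the preceding DATA step can be read as the lower triangle, stored row by row, of a matrix that multiplies the independent standard normal draws. The following check, in which the symbols $L$ and $\Sigma$ are labels introduced here for illustration, shows that this factor implies an error covariance with variances 2, 1, and 1 and a covariance of 0.6 between the first two errors.

```latex
L = \begin{bmatrix}
1.4142136 & 0 & 0 \\
0.4242641 & 0.9055385 & 0 \\
0 & 0 & 1
\end{bmatrix},
\qquad
\Sigma = L L' \approx \begin{bmatrix}
2 & 0.6 & 0 \\
0.6 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
```

The entries follow from $1.4142136^2 \approx 2$, $1.4142136 \times 0.4242641 \approx 0.6$, and $0.4242641^2 + 0.9055385^2 \approx 0.18 + 0.82 = 1$.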
First, the multinomial probit model is estimated (see the following program). Results show that the standard deviation, correlation, and slope estimates are close to the parameter values. Note that $\rho_{12} = \frac{\sigma_{12}}{\sqrt{(\sigma_1^2)(\sigma_2^2)}} = \frac{0.6}{\sqrt{(2)(1)}} = 0.42$, $\sigma_1 = \sqrt{2} = 1.41$, $\sigma_2 = \sqrt{1} = 1$, and the parameter value for the variable x is 2.0.
The nested model is also estimated based on a two-level decision tree with the following program (see Output 17.3.2). The estimated result (see Output 17.3.3) shows that the data support the nested tree model since the estimates of the inclusive value parameters are significant and are less than 1.
/*-- Two-Level Nested Logit --*/
proc mdc data=trichoice;
   model decision = x / type=nlogit choice=(mode 1 2 3)
            covest=op optmethod=qn;
   id id;
   utility u(1,) = x;
   nest level(1) = (1 2 @ 1, 3 @ 2),
        level(2) = (1 2 @ 1);
run;
Daganzo's trinomial choice data and displays the HEV parameter estimates in Figure 17.15. The inverted scale estimates for mode 2 and mode 3 suggest that the conditional logit model (which imposes equal variances on random components of utility of all alternatives) might be misleading. The HEV estimation summary from that analysis is repeated in Output 17.4.1.

Output 17.4.1 HEV Estimation Summary ($\theta_1 = 1$)

Model Fit Summary

Dependent Variable             decision
Number of Observations         50
Number of Cases                150
Log Likelihood                 -33.41383
Maximum Absolute Gradient      0.0000218
Number of Iterations           11
Optimization Method            Dual Quasi-Newton
AIC                            72.82765
Schwarz Criterion              78.56372

You can estimate the HEV model with unit scale restrictions on all three alternatives ($\theta_1 = \theta_2 = \theta_3 = 1$) with the following statements. Output 17.4.2 displays the estimation summary.
/*-- HEV Estimation --*/
proc mdc data=newdata;
   model decision = ttime / type=hev nchoice=3
            hev=(unitscale=1 2 3, integrate=laguerre)
            covest=hess;
   id pid;
run;
The test for scale equivalence (SCALE2=SCALE3=1) is performed using a likelihood ratio test statistic. The following SAS statements compute the test statistic (1.4276) and its p-value (0.4898) from the log-likelihood values in Output 17.4.1 and Output 17.4.2:
data _null_;
   /*-- test for H0: scale2 = scale3 = 1 --*/
   /* ln L (restricted, Output 17.4.2)   = -34.1276 */
   /* ln L (unrestricted, Output 17.4.1) = -33.4138 */
   stat = -2 * ( - 34.1276 + 33.4138 );
   df = 2;
   p_value = 1 - probchi(stat, df);
   put stat= p_value=;
run;
The test statistic fails to reject the null hypothesis of equal scale parameters, which implies that the random utility function is homoscedastic. A multinomial probit model also allows heteroscedasticity of the random components of utility for different alternatives. Consider the following utility function:

\[ U_{ij} = V_{ij} + \epsilon_{ij} \]

where

\[ \epsilon_{ij} \sim N\left( 0, \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \sigma_3^2 \end{bmatrix} \right) \]
This multinomial probit model is estimated by using the following program. The estimation summary is displayed in Output 17.4.3.
/*-- Heteroscedastic Multinomial Probit --*/
proc mdc data=newdata;
   model decision = ttime / type=mprobit nchoice=3
            unitvariance=(1 2) covest=hess;
   id pid;
   restrict RHO_31 = 0;
run;

Next, the multinomial probit model with unit variances ($\sigma_1 = \sigma_2 = \sigma_3 = 1$) is estimated in the following statements. The estimation summary is displayed in Output 17.4.4.

/*-- Homoscedastic Multinomial Probit --*/
proc mdc data=newdata;
   model decision = ttime / type=mprobit nchoice=3
            unitvariance=(1 2 3) covest=hess;
   id pid;
   restrict RHO_21 = 0;
run;
The test for homoscedasticity ($\sigma_3 = 1$) under $\sigma_1 = \sigma_2 = 1$ shows that the error variance is not heteroscedastic since the test statistic (1.313) is less than $\chi^2_{0.05,1} = 3.84$. The marginal probability or p-value computed in the following program from the PROBCHI function is 0.2519.

data _null_;
   /*-- test for H0: sigma3 = 1 --*/
   /* ln L(max) = -33.8860 */
   /* ln L(0) = -34.5425 */
   stat = -2 * ( -34.5425 + 33.8860 );
   df = 1;
   p_value = 1 - probchi(stat, df);
   put stat= p_value=;
run;
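The critical value quoted for this test does not have to be looked up in a table. As a small sketch, the CINV function returns the chi-square quantile, so the comparison between the statistic and the 5% critical value can be made in the same kind of DATA step; the variable names crit and reject are chosen here for illustration.

```sas
data _null_;
   /* same statistic as in the preceding program */
   stat = -2 * ( -34.5425 + 33.8860 );
   df = 1;
   /* upper 5% critical value of the chi-square distribution */
   crit = cinv(0.95, df);
   /* 1 if H0: sigma3 = 1 is rejected at the 5% level, else 0 */
   reject = (stat > crit);
   put stat= crit= reject=;
run;
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are equivalent ways of reaching the same decision.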
Example 17.5: Choice of Time for Work Trips: Nested Logit Analysis
This example uses sample data of 527 automobile commuters in the San Francisco Bay Area to demonstrate the use of the nested logit model. Brownstone and Small (1989) analyzed a two-level nested logit model displayed in Output 17.5.1. The probability of choosing j at level 2 is written as

\[ P_i(j) = \frac{\exp(\tau_j I_j)}{\sum_{j'=1}^{3} \exp(\tau_{j'} I_{j'})} \]

where $I_j$ is the inclusive value of nest $j$, $I_j = \ln \sum_{k'} \exp(x_{ik'}'\beta)$, with the sum taken over the alternatives $k'$ in nest $j$.

The full information maximum likelihood (FIML) method maximizes the following log-likelihood function:

\[ L = \sum_{i=1}^{N} \sum_{j=1}^{J} \cdots \]
Sample data of 527 automobile commuters in the San Francisco Bay Area have been analyzed by Small (1982) and Brownstone and Small (1989). The regular time of arrival is recorded as between 42.5 minutes early and 17.5 minutes late, and indexed by 12 alternatives, using five-minute interval groups. Refer to Small (1982) for more details on these data. The following statements estimate the two-level nested logit model:

/*-- Two-level Nested Logit --*/
proc mdc data=small maxit=200 outest=a;
   model decision = r15 r10 ttime ttime_cp sde sde_cp
                    sdl sdlx d2l /
            type=nlogit choice=(alt);
   id id;
   utility u(1, ) = r15 r10 ttime ttime_cp sde sde_cp
                    sdl sdlx d2l;
   nest level(1) = (1 2 3 4 5 6 7 8 @ 1, 9 @ 2, 10 11 12 @ 3),
        level(2) = (1 2 3 @ 1);
run;

The estimation summary, discrete response profile, and the FIML estimates are displayed in Output 17.5.2 through Output 17.5.4.
Parameter      DF   Estimate   t Value
r15_L1          1     1.1034      9.04
r10_L1          1     0.3931      3.29
ttime_L1        1    -0.0465     -1.98
ttime_cp_L1     1    -0.0498     -1.63
sde_L1          1    -0.6618     -7.95
sde_cp_L1       1     0.0519      0.41
sdl_L1          1    -2.1006     -4.15
sdlx_L1         1    -3.5240     -2.30
d2l_L1          1    -1.0941     -3.34
INC_L2G1C1      1     0.6762      2.46
INC_L2G1C2      1     1.0906      3.53
INC_L2G1C3      1     0.7622      4.62
Brownstone and Small (1989) also estimate the two-level nested logit model with equal scale parameter constraints, $\tau_1 = \tau_2 = \tau_3$. Replication of their model estimation is listed in the following program, and the results are displayed in Output 17.5.5 and Output 17.5.6. The parameter estimates and standard errors are almost identical to those in Brownstone and Small (1989, p. 69).
/*-- Nested Logit with Equal Dissimilarity Parameters --*/
proc mdc data=small maxit=200 outest=a;
   model decision = r15 r10 ttime ttime_cp sde sde_cp
                    sdl sdlx d2l / samescale
            type=nlogit choice=(alt);
   id id;
   utility u(1, ) = r15 r10 ttime ttime_cp sde sde_cp
                    sdl sdlx d2l;
   nest level(1) = (1 2 3 4 5 6 7 8 @ 1, 9 @ 2, 10 11 12 @ 3),
        level(2) = (1 2 3 @ 1);
run;
Output 17.5.5 Nested Logit Estimation Summary with Equal Dissimilarity Parameters
The MDC Procedure
Nested Logit Estimates

Model Fit Summary

Dependent Variable               decision
Number of Observations           527
Number of Cases                  6324
Log Likelihood                   -994.39402
Log Likelihood Null (LogL(0))    -1310
Maximum Absolute Gradient        2.97172E-6
Number of Iterations             16
Optimization Method              Newton-Raphson
AIC                              2009
Schwarz Criterion                2051

Parameter      DF   Estimate   t Value
r15_L1          1     1.1345     10.39
r10_L1          1     0.4194      3.88
ttime_L1        1    -0.1626     -2.67
ttime_cp_L1     1     0.1285      1.51
sde_L1          1    -0.7548    -11.28
sde_cp_L1       1     0.2292      2.34
sdl_L1          1    -2.0719     -4.26
sdlx_L1         1    -2.8216     -2.25
d2l_L1          1    -1.3164     -3.79
INC_L2G1        1     0.8059      4.73
However, the test statistic for $H_0: \tau_1 = \tau_2 = \tau_3$ rejects the null hypothesis at the 5% significance level since $-2(\ln L(0) - \ln L) = 7.15 > \chi^2_{0.05,2} = 5.99$. The p-value is computed in the following program and is equal to 0.0280.

data _null_;
   /*-- test for H0: tau1 = tau2 = tau3 --*/
   /* ln L(max) = -990.8191 */
   /* ln L(0) = -994.3940 */
   stat = -2 * ( -994.3940 + 990.8191 );
   df = 2;
   p_value = 1 - probchi(stat, df);
   put stat= p_value=;
run;
\[ \chi^2 = (\hat{\beta}_U - \hat{\beta}_R)' \, [\hat{V}_R - \hat{V}_U]^{-1} \, (\hat{\beta}_U - \hat{\beta}_R) \]

where $\hat{\beta}_R$ and $\hat{V}_R$ represent parameter estimates and variance-covariance matrix from the model where the ninth alternative was omitted, and $\hat{\beta}_U$ and $\hat{V}_U$ represent parameter estimates and variance-covariance matrix from the full model. The following macro can be used to perform the IIA test for the ninth alternative.
/*---------------------------------------------------------------
* name: %IIA
* note: This macro tests the IIA hypothesis using Hausman's
*       specification test. Inputs into the macro are as follows:
*   indata:   input data set
*   varlist:  list of RHS variables
*   nchoice:  number of choices for each individual
*   choice:   list of choices
*   nvar:     number of RHS variables in varlist (degrees of
*             freedom of the test)
*   nIIA:     number of choice alternatives used to test IIA
*   IIA:      choice alternatives used to test IIA
*   id:       ID variable
*   decision: 0-1 LHS variable representing nchoice choices
* purpose: Hausman's specification test
*--------------------------------------------------------------*/
%macro IIA(indata=, varlist=, nchoice=, choice= , nvar= , IIA= ,
           nIIA=, id= , decision=);
   %let n=%eval(&nchoice-&nIIA);

   /* full (unrestricted) model */
   proc mdc data=&indata outest=cov covout ;
      model &decision = &varlist /
               type=clogit nchoice=&nchoice;
      id &id;
   run;

   /* flag individuals who chose one of the tested alternatives */
   data two;
      set &indata;
      if &choice in &IIA and &decision=1 then output;
   run;

   data two;
      set two;
      keep &id ind;
      ind=1;
   run;

   /* drop those individuals and the tested alternatives */
   data merged;
      merge &indata two;
      by &id;
      if ind=1 or &choice in &IIA then delete;
   run;

   /* restricted model on the reduced choice set */
   proc mdc data=merged outest=cov2 covout ;
      model &decision = &varlist /
               type=clogit nchoice=&n;
      id &id;
   run;

   proc IML;
      use cov var{_TYPE_ &varlist};
      read first into BetaU;
      read all into CovVarU where(_TYPE_='COV');
      close cov;

      use cov2 var{_TYPE_ &varlist};
      read first into BetaR;
      read all into CovVarR where(_TYPE_='COV');
      close cov2;

      tmp = BetaU-BetaR;
      /* quadratic form: row vector, generalized inverse, transpose */
      ChiSq=tmp*ginv(CovVarR-CovVarU)*tmp`;
      if ChiSq<0 then ChiSq=0;
      Prob=1-Probchi(ChiSq, &nvar);
      Print "Hausman Test for IIA for Variable &IIA";
      Print ChiSq Prob;
   run; quit;
%mend IIA;
The following statement invokes the %IIA macro to test IIA for commuters arriving on time:
%IIA( indata=small,
      varlist=r15 r10 ttime ttime_cp sde sde_cp sdl sdlx d2l,
      nchoice=12,
      choice=alt,
      nvar=9,
      nIIA=1,
      IIA=(9),
      id=id,
      decision=decision );
The obtained $\chi^2$ of 7.9 and the p-value of 0.54 indicate that the IIA hypothesis is not rejected for commuters arriving on time (alternative 9). Had IIA been rejected, the following nested logit model, which reserves a subcategory for alternative 9, might be more appropriate (Output 17.5.1):
proc mdc data=small maxit=200 outest=a;
   model decision = r15 r10 ttime ttime_cp sde sde_cp
                    sdl sdlx d2l /
            type=nlogit choice=(alt);
   id id;
   utility u(1, ) = r15 r10 ttime ttime_cp sde sde_cp
                    sdl sdlx d2l;
   nest level(1) = (1 2 3 4 5 6 7 8 @ 1, 9 @ 2, 10 11 12 @ 3),
        level(2) = (1 2 3 @ 1);
run;
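The p-value reported for the Hausman test on alternative 9 can be recomputed outside the macro. The sketch below takes the chi-square value of 7.9 quoted in the text and the 9 degrees of freedom implied by nvar=9 in the macro call; because 7.9 is a rounded figure, the result should only be expected to match the reported p-value approximately. The names crit and reject are illustrative.

```sas
data _null_;
   chisq = 7.9;     /* Hausman statistic as quoted in the text */
   df = 9;          /* nvar=9 in the %IIA call */
   /* same p-value formula the macro uses */
   p_value = 1 - probchi(chisq, df);
   /* 5% critical value for comparison */
   crit = cinv(0.95, df);
   /* 1 if IIA is rejected at the 5% level, else 0 */
   reject = (chisq > crit);
   put p_value= crit= reject=;
run;
```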
Similarly, IIA could be tested for commuters arriving approximately on time (alternatives 8, 9, and 10), as follows:

%IIA( indata=small,
      varlist=r15 r10 ttime ttime_cp sde sde_cp sdl sdlx d2l,
      nchoice=12,
      choice=alt,
      nvar=9,
      nIIA=3,
      IIA=(8 9 10),
      id=id,
      decision=decision );

Based on this test, independence of irrelevant alternatives is rejected for this subgroup ($\chi^2 = 10.3$), and it is concluded that these three alternatives are not independent from the other nine alternatives. This finding provides a partial justification for another nested logit model, with commuters arriving approximately on time being in one subcategory.
Second, run the restricted model with inclusive value parameters constrained to one. This model, with a restriction on inclusive value parameters, is equal to the standard multinomial logit.
/*-- Restricted Model with Inclusive Value Parameters Constrained to One --*/
proc mdc data=small maxit=200 outest=a;
   model decision = r15 r10 ttime ttime_cp sde sde_cp
                    sdl sdlx d2l /
            type=nlogit choice=(alt);
   id id;
   utility u(1, ) = r15 r10 ttime ttime_cp sde sde_cp
                    sdl sdlx d2l;
   nest level(1) = (1 2 3 4 5 6 7 8 @ 1, 9 @ 2, 10 11 12 @ 3),
        level(2) = (1 2 3 @ 1);
   restrict INC_L2G1C1=1, INC_L2G1C2=1, INC_L2G1C3=1;
run;
\[ LR = 2 \, (\widehat{\mathrm{LogL}}_U - \widehat{\mathrm{LogL}}_R) \]

where $\widehat{\mathrm{LogL}}_U$ represents the log of unrestricted likelihood and $\widehat{\mathrm{LogL}}_R$ is the log of restricted likelihood at the optimized solution. Unrestricted and restricted log-likelihood values can be found in the Model Fit Summary table (see Output 17.5.5). Calculating the LR test statistic, we conclude that nested logit is a more appropriate model. The LR test can be used to test other types of restrictions in the nested logit setting as long as one model can be nested within another.
References
Abramowitz, M. and Stegun, A. (1970), Handbook of Mathematical Functions, New York: Dover Press.
Amemiya, T. (1981), "Qualitative Response Models: A Survey," Journal of Economic Literature, 19, 483–536.
Amemiya, T. (1985), Advanced Econometrics, Cambridge: Harvard University Press.
Ben-Akiva, M. and Lerman, S. R. (1985), Discrete Choice Analysis, Cambridge: MIT Press.
Bhat, C. R. (1995), "A Heteroscedastic Extreme Value Model of Intercity Travel Mode Choice," Transportation Research, 29, 471–483.
Brownstone, D. and Small, K. A. (1989), "Efficient Estimation of Nested Logit Models," Journal of Business and Economic Statistics, 7, 67–74.
Brownstone, D. and Train, K. (1999), "Forecasting New Product Penetration with Flexible Substitution Patterns," Journal of Econometrics, 89, 109–129.
Daganzo, C. (1979), Multinomial Probit: The Theory and Its Application to Demand Forecasting, New York: Academic Press.
Daganzo, C. and Kusnic, M. (1993), "Two Properties of the Nested Logit Model," Transportation Science, 27, 395–400.
Estrella, A. (1998), "A New Measure of Fit for Equations with Dichotomous Dependent Variables," Journal of Business and Economic Statistics, 16, 198–205.
Geweke, J. (1989), "Bayesian Inference in Econometric Models Using Monte Carlo Integration," Econometrica, 57, 1317–1340.
Geweke, J., Keane, M., and Runkle, D. (1994), "Alternative Computational Approaches to Inference in the Multinomial Probit Model," Review of Economics and Statistics, 76, 609–632.
Godfrey, L. G. (1988), Misspecification Tests in Econometrics, Cambridge: Cambridge University Press.
Greene, W. H. (1997), Econometric Analysis, Upper Saddle River, NJ: Prentice Hall.
Hajivassiliou, V. A. (1993), "Simulation Estimation Methods for Limited Dependent Variable Models," in Handbook of Statistics, vol. 11, ed. G. S. Maddala, C. R. Rao, and H. D. Vinod, New York: Elsevier Science Publishing.
Hajivassiliou, V., McFadden, D., and Ruud, P. (1996), "Simulation of Multivariate Normal Rectangle Probabilities and Their Derivatives: Theoretical and Computational Results," Journal of Econometrics, 72, 85–134.
Hausman, J. and McFadden, D. (1984), "Specification Tests for the Multinomial Logit Model," Econometrica, 52(5), 1219–40.
Hensher, D. A. (1986), "Sequential and Full Information Maximum Likelihood Estimation of a Nested Logit Model," Review of Economics and Statistics, 68, 657–667.
Hensher, D. A. (1993), "Using Stated Response Choice Data to Enrich Revealed Preference Discrete Choice Models," Marketing Letters, 4, 139–151.
Keane, M. P. (1994), "A Computationally Practical Simulation Estimator for Panel Data," Econometrica, 62, 95–116.
Keane, M. P. (1997), "Current Issues in Discrete Choice Modeling," Marketing Letters, 8, 307–322.
LaMotte, L. R. (1994), "A Note on the Role of Independence in t Statistics Constructed from Linear Statistics in Regression Models," The American Statistician, 48, 238–240.
Luce, R. D. (1959), Individual Choice Behavior: A Theoretical Analysis, New York: Wiley.
Maddala, G. S. (1983), Limited Dependent and Qualitative Variables in Econometrics, Cambridge: Cambridge University Press.
McFadden, D. (1974), "Conditional Logit Analysis of Qualitative Choice Behavior," in Frontiers in Econometrics, ed. P. Zarembka, New York: Academic Press.
McFadden, D. (1978), "Modelling the Choice of Residential Location," in Spatial Interaction Theory and Planning Models, ed. A. Karlqvist, L. Lundqvist, F. Snickars, and J. Weibull, Amsterdam: North Holland.
McFadden, D. (1981), "Econometric Models of Probabilistic Choice," in Structural Analysis of Discrete Data with Econometric Applications, ed. C. F. Manski and D. McFadden, Cambridge: MIT Press.
McFadden, D. and Ruud, P. A. (1994), "Estimation by Simulation," Review of Economics and Statistics, 76, 591–608.
McFadden, D., Talvitie, A. P., and Associates (1977), The Urban Travel Demand Forecasting Project: Phase I Final Report Series, vol. 5, The Institute of Transportation Studies, University of California, Berkeley.
Powers, D. A. and Xie, Y. (2000), Statistical Methods for Categorical Data Analysis, San Diego: Academic Press.
Schmidt, P. and Strauss, R. (1975), "The Prediction of Occupation Using Multiple Logit Models," International Economic Review, 16, 471–486.
Small, K. (1982), "The Scheduling of Consumer Activities: Work Trips," American Economic Review, 72, 467–479.
Spector, L. and Mazzeo, M. (1980), "Probit Analysis and Economic Education," Journal of Economic Education, 11, 37–44.
Swait, J. and Bernardino, A. (2000), "Distinguishing Taste Variation from Error Structure in Discrete Choice Data," Transportation Research Part B, 34, 1–15.
Theil, H. (1969), "A Multinomial Extension of the Linear Logit Model," International Economic Review, 10, 251–259.
Train, K. E., Ben-Akiva, M., and Atherton, T. (1989), "Consumption Patterns and Self-Selecting Tariffs," Review of Economics and Statistics, 71, 62–73.
Chapter 18
   Estimation Methods
   Properties of the Estimates
   Minimization Methods
   Convergence Criteria
   Troubleshooting Convergence Problems
   Iteration History
   Computer Resource Requirements
   Testing for Normality
   Heteroscedasticity
   Testing for Autocorrelation
   Transformation of Error Terms
   Error Covariance Structure Specification
   Ordinary Differential Equations
   Restrictions and Bounds on Parameters
   Tests on Parameters
   Hausman Specification Test
   Chow Tests
   Profile Likelihood Confidence Intervals
   Choice of Instruments
   Autoregressive Moving-Average Error Processes
   Distributed Lag Models and the %PDL Macro
   Input Data Sets
   Output Data Sets
   ODS Table Names
   ODS Graphics
Details: Simulation by the MODEL Procedure
   Solution Modes
   Multivariate t Distribution Simulation
   Alternate Distribution Simulation
   Mixtures of Distributions - Copulas
   Solution Mode Output
   Goal Seeking: Solving for Right-Hand-Side Variables
   Numerical Solution Methods
   Numerical Integration
   Limitations
   SOLVE Data Sets
Programming Language Overview: MODEL Procedure
   Variables in the Model Program
   Equation Translations
   Derivatives
   Mathematical Functions
   Functions across Time
   Language Differences
   Storing Programs in Model Files
   Diagnostics and Debugging
   Analyzing the Structure of Large Models
Examples: MODEL Procedure
   Example 18.1: OLS Single Nonlinear Equation
   Example 18.2: A Consumer Demand Model
   Example 18.3: Vector AR(1) Estimation
   Example 18.4: MA(1) Estimation
   Example 18.5: Polynomial Distributed Lags by Using %PDL
   Example 18.6: General Form Equations
   Example 18.7: Spring and Damper Continuous System
   Example 18.8: Nonlinear FIML Estimation
   Example 18.9: Circuit Estimation
   Example 18.10: Systems of Differential Equations
   Example 18.11: Monte Carlo Simulation
   Example 18.12: Cauchy Distribution Estimation
   Example 18.13: Switching Regression Example
   Example 18.14: Simulating from a Mixture of Distributions
   Example 18.15: Simulated Method of Moments - Simple Linear Regression
   Example 18.16: Simulated Method of Moments - AR(1) Process
   Example 18.17: Simulated Method of Moments - Stochastic Volatility Model
   Example 18.18: Duration Data Model with Unobserved Heterogeneity
   Example 18.19: EMM Estimation of a Stochastic Volatility Model
   Example 18.20: Illustration of ODS Graphics
References
the following methods for parameter estimation:
   - ordinary least squares (OLS)
   - two-stage least squares (2SLS)
   - seemingly unrelated regression (SUR) and iterative SUR (ITSUR)
   - three-stage least squares (3SLS) and iterative 3SLS (IT3SLS)
   - generalized method of moments (GMM)
   - simulated method of moments (SMM)
   - full information maximum likelihood (FIML)
   - general log-likelihood maximization
simulation and forecasting capabilities
Monte Carlo simulation
goal-seeking solutions

ODS Graphics is now available with the MODEL procedure. For more information, see the section ODS Graphics on page 1115.

A system of equations can be nonlinear in the parameters, nonlinear in the observed variables, or nonlinear in both the parameters and the variables. Nonlinear in the parameters means that the mathematical relationship between the variables and parameters is not required to have a linear form. (A linear model is a special case of a nonlinear model.) A general nonlinear system of equations can be written as

$$
\begin{aligned}
q_1(y_{1,t}, y_{2,t}, \ldots, y_{g,t}, x_{1,t}, x_{2,t}, \ldots, x_{m,t}, \theta_1, \theta_2, \ldots, \theta_p) &= \epsilon_{1,t} \\
q_2(y_{1,t}, y_{2,t}, \ldots, y_{g,t}, x_{1,t}, x_{2,t}, \ldots, x_{m,t}, \theta_1, \theta_2, \ldots, \theta_p) &= \epsilon_{2,t} \\
&\;\;\vdots \\
q_g(y_{1,t}, y_{2,t}, \ldots, y_{g,t}, x_{1,t}, x_{2,t}, \ldots, x_{m,t}, \theta_1, \theta_2, \ldots, \theta_p) &= \epsilon_{g,t}
\end{aligned}
$$

where $y_{i,t}$ is an endogenous variable, $x_{i,t}$ is an exogenous variable, $\theta_i$ is a parameter, and $\epsilon$ is the unknown error. The subscript $t$ represents time or some index to the data.

In econometrics literature, the observed variables are either endogenous (dependent) variables or exogenous (independent) variables. This system can be written more succinctly in vector form as

$$ q(y_t, x_t, \theta) = \epsilon_t $$

This system of equations is in general form because the error term is by itself on one side of the equality. Systems can also be written in normalized form by placing the endogenous variable on one side of the equality, with each equation defining a predicted value for a unique endogenous variable. A normalized form equation system can be written in vector notation as

$$ y_t = f(y_t, x_t, \theta) + \epsilon_t $$
PROC MODEL handles equations written in both forms.

Econometric models often explain the current values of the endogenous variables as functions of past values of exogenous and endogenous variables. These past values are referred to as lagged values, and the variable $x_{t-i}$ is called lag $i$ of the variable $x_t$. Using lagged variables, you can create a dynamic, or time-dependent, model. In the preceding model systems, the lagged exogenous and endogenous variables are included as part of the exogenous variables.

If the data are time series, so that $t$ indexes time (see Chapter 3, Working with Time Series Data, for more information on time series), it is possible that $\epsilon_t$ depends on $\epsilon_{t-i}$ or, more generally, that the $\epsilon_t$'s are not identically and independently distributed. If the errors of a model system are autocorrelated, the standard error of the estimates of the parameters of the system will be inflated.

Sometimes the $\epsilon_i$'s are not identically distributed because the variance of $\epsilon$ is not constant. This is known as heteroscedasticity. Heteroscedasticity in an estimated model can also inflate the standard error of the estimates of the parameters. Using a weighted estimation can sometimes eliminate this problem. Alternatively, a variance model such as GARCH or EGARCH can be estimated to correct for heteroscedasticity. If the proper weighting scheme and the form of the error model are difficult to determine, generalized method of moments (GMM) estimation can be used to determine parameter estimates with very few assumptions about the form of the error process.

Other problems can also arise when estimating systems of equations. Consider the following system of equations, which is nonlinear in its parameters and cannot be estimated with linear regression:

$$
\begin{aligned}
y_{1,t} &= \theta_1 + (\theta_2 + \theta_3 \theta_4^{\,t})^{-1} + \theta_5\, y_{2,t} + \epsilon_{1,t} \\
y_{2,t} &= \theta_6 + (\theta_7 + \theta_8 \theta_9^{\,t})^{-1} + \theta_{10}\, y_{1,t} + \epsilon_{2,t}
\end{aligned}
$$
This system of equations represents a rudimentary predator-prey process with $y_1$ as the prey and $y_2$ as the predator (the second term in both equations is a logistic curve). The two equations must be estimated simultaneously because of the cross-dependency of the $y$'s. This cross-dependency makes $\epsilon_1$ and $\epsilon_2$ violate the assumption of independence. Nonlinear ordinary least squares estimation of these equations produces biased and inconsistent parameter estimates. This is called simultaneous equation bias.

One method to remove simultaneous equation bias, in the linear case, is to replace the endogenous variables on the right-hand side of the equations with predicted values that are uncorrelated with the error terms. These predicted values can be obtained through a preliminary, or first-stage, instrumental variable regression. Instrumental variables, which are uncorrelated with the error term, are used as regressors to model the predicted values. The parameter estimates are obtained by a second regression by using the predicted values of the regressors. This process is called two-stage least squares.

In the nonlinear case, nonlinear ordinary least squares estimation is performed iteratively by using a linearization of the model with respect to the parameters. The instrumental solution to simultaneous equation bias in the nonlinear case is the same as the linear case, except the linearization of the model with respect to the parameters is predicted by the instrumental regression. Nonlinear two-stage least squares is one of several instrumental variables methods available in the MODEL procedure to handle simultaneous equation bias.

When you have a system of several regression equations, the random errors of the equations can be correlated. In this case, the large-sample efficiency of the estimation can be improved by using a joint generalized least squares method that takes the cross-equation correlations into account. If the equations are not simultaneous (no dependent regressors), then seemingly unrelated regression (SUR) can be used. The SUR method requires an estimate of the cross-equation error covariance matrix, $\Sigma$. The usual approach is to first fit the equations by using OLS, compute an estimate $\hat{\Sigma}$ from the OLS residuals, and then perform the SUR estimation based on $\hat{\Sigma}$. The MODEL procedure estimates $\Sigma$ by default, or you can supply your own estimate of $\Sigma$.

If the equation system is simultaneous, you can combine the 2SLS and SUR methods to take into account both simultaneous equation bias and cross-equation correlation of the errors. This is called three-stage least squares or 3SLS.

A different approach to the simultaneous equation bias problem is the full information maximum likelihood (FIML) estimation method. FIML does not require instrumental variables, but it assumes that the equation errors have a multivariate normal distribution. 2SLS and 3SLS estimation do not assume a particular distribution for the errors.

Other nonnormal error distribution models can be estimated as well. The centered t distribution with estimated degrees of freedom and nonconstant variance is an additional built-in likelihood function. If the distribution of the equation errors is not normal or t but known, then the log likelihood can be specified by using the ERRORMODEL statement.

Once a nonlinear model has been estimated, it can be used to obtain forecasts. If the model is linear in the variables you want to forecast, a simple linear solve can generate the forecasts. If the system is nonlinear, an iterative procedure must be used. The preceding example system is linear in its endogenous variables. The MODEL procedure's SOLVE statement is used to forecast nonlinear models.
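For the linear case, the two-stage least squares and SUR estimators described in this section can be written compactly. The following is only a sketch in standard textbook notation, not notation taken from this chapter: $X$ (regressors), $Z$ (instruments), $y$ (dependent variable), $\beta$ (coefficients), and $n$ (number of observations) are symbols assumed here for illustration.

$$
\begin{aligned}
\text{first stage:}\quad & \hat{X} = Z (Z'Z)^{-1} Z' X \\
\text{second stage:}\quad & \hat{\beta}_{2SLS} = (\hat{X}'\hat{X})^{-1} \hat{X}' y \\
\text{SUR (stacked system):}\quad & \hat{\beta}_{SUR} = \left( X' (\hat{\Sigma}^{-1} \otimes I_n) X \right)^{-1} X' (\hat{\Sigma}^{-1} \otimes I_n)\, y
\end{aligned}
$$

In the SUR line, $X$ and $y$ denote the regressors and dependent variables of all equations stacked together, and $\hat{\Sigma}$ is the cross-equation error covariance estimate computed from the OLS residuals. Loosely speaking, 3SLS applies the same weighting to the first-stage predicted regressors.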
One of the main purposes of creating models is to obtain an understanding of the relationship among the variables. There are usually only a few variables in a model you can control (for example, the amount of money spent on advertising). Often you want to determine how to change the variables under your control to obtain some target goal. This process is called goal seeking. PROC MODEL allows you to solve for any subset of the variables in a system of equations given values for the remaining variables.

The nonlinearity of a model creates two problems with the forecasts: the forecast errors are not normally distributed with zero mean, and no formula exists to calculate the forecast confidence intervals. PROC MODEL provides Monte Carlo techniques, which, when used with the covariance of the parameters and error covariance matrix, can produce approximate error bounds on the forecasts. The following distributions on the errors are supported for multivariate Monte Carlo simulation:
   - Cauchy
   - chi-squared
   - empirical
   - F
   - Poisson
   - t
   - uniform

A transformation technique is used to create a covariance matrix for generating the correct innovations in a Monte Carlo simulation.
An Example
The SASHELP library contains the data set CITIMON, which contains the variable LHUR, the monthly unemployment figures, and the variable IP, the monthly industrial production index. You suspect that the unemployment rates are inversely proportional to the industrial production index. Assume that these variables are related by the following nonlinear equation:

$$ lhur = \frac{1}{a \cdot ip + b} + c + \epsilon $$

In this equation $a$, $b$, and $c$ are unknown coefficients and $\epsilon$ is an unobserved random error. The following statements illustrate how to use PROC MODEL to estimate values for a, b, and c from the data in SASHELP.CITIMON.
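A minimal sketch of such statements, consistent with the equation above and with the output described in the rest of this section (one programming statement, dependent variable LHUR, parameters A, B, and C):

```sas
proc model data=sashelp.citimon;
   /* model equation written as a SAS assignment statement */
   lhur = 1/(a * ip + b) + c;
   /* estimate a, b, and c; OLS is the method reported in the output */
   fit lhur;
run;
```

No PARMS statement is needed here because A, B, and C do not appear in the input data set.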
Notice that the model equation is written as a SAS assignment statement. The variable LHUR is assumed to be the dependent variable because it is named in the FIT statement and is on the left-hand side of the assignment. PROC MODEL determines that LHUR and IP are observed variables because they are in the input data set. A, B, and C are treated as unknown parameters to be estimated from the data because they are not in the input data set. If the data set contained a variable named A, B, or C, you would need to explicitly declare the parameters with a PARMS statement. In response to the FIT statement, PROC MODEL estimates values for A, B, and C by using nonlinear least squares and prints the results. The first part of the output is a Model Summary table, shown in Figure 18.1.
Figure 18.1 Model Summary Report
The MODEL Procedure

Model Summary
   Model Variables         1
   Parameters              3
   Equations               1
   Number of Statements    1

   Model Variables   LHUR
   Parameters        a b c
   Equations         LHUR
This table details the size of the model, including the number of programming statements that define the model, and lists the dependent variables (LHUR in this case), the unknown parameters (A, B, and C), and the model equations. In this case the equation is named for the dependent variable, LHUR. PROC MODEL then prints a summary of the estimation problem, as shown in Figure 18.2.
Figure 18.2 Estimation Problem Report
The Equation to Estimate is LHUR = F(a, b, c(1))
The notation used in the summary of the estimation problem indicates that LHUR is a function of A, B, and C, which are to be estimated by fitting the function to the data. If the partial derivative of the equation with respect to a parameter is a simple variable or constant, the derivative is shown in parentheses after the parameter name. In this case, the derivative with respect to the intercept C is 1. The derivatives with respect to A and B are complex expressions and so are not shown. Next, PROC MODEL prints an estimation summary as shown in Figure 18.3.
Figure 18.3 Estimation Summary Report
The MODEL Procedure
OLS Estimation Summary

Data Set Options
   DATA=   SASHELP.CITIMON

Final Convergence Criteria
   R                  0.000737
   PPC(b)             0.003943
   RPC(b)             0.00968
   Object             4.784E-6
   Trace(S)           0.533325
   Objective Value    0.522214
The estimation summary provides information on the iterative process used to compute the estimates. The heading OLS Estimation Summary indicates that the nonlinear ordinary least squares (OLS) estimation method is used. This table indicates that all three parameters were estimated successfully by using 144 nonmissing observations from the data set SASHELP.CITIMON. Calculating the estimates required 10 iterations of the GAUSS method. Various measures of how well the iterative process converged are also shown. For example, the RPC(B) value 0.00968 means that on the final iteration the largest relative change in any estimate was for parameter B, which changed by 0.968 percent. See the section Convergence Criteria on page 1028 for details. PROC MODEL then prints the estimation results. The first part of this table is the summary of residual errors, shown in Figure 18.4.
Equation    SSE        MSE       R-Square
LHUR        75.1989    0.5333    0.7472
This table lists the sum of squared errors (SSE), the mean squared error (MSE), the root mean squared error (root MSE), and the $R^2$ and adjusted $R^2$ statistics. The $R^2$ value of 0.7472 means that the estimated model explains approximately 75 percent more of the variability in LHUR than a mean model explains. Following the summary of residual errors is the parameter estimates table, shown in Figure 18.5.
Figure 18.5 Parameter Estimates
Nonlinear OLS Parameter Estimates

Parameter    Approx Std Err    Approx Pr > |t|
a            0.00343           0.0094
b            0.2617            0.0309
c            0.7297            <.0001
Because the model is nonlinear, the standard error of the estimate, the t value, and its significance level are only approximate. These values are computed using asymptotic formulas that are correct for large sample sizes but only approximately correct for smaller samples. Thus, you should use caution in interpreting these statistics for nonlinear models, especially for small sample sizes. For linear models, these results are exact and are the same as standard linear regression. The last part of the output produced by the FIT statement is shown in Figure 18.6.
Figure 18.6 System Summary Statistics
Number of Observations
   Used       144
   Missing    1

Statistics for System
   Objective      0.5222
   Objective*N    75.1989
This table lists the objective value for the estimation of the nonlinear system. Since there is only a single equation in this case, the objective value is the same as the residual MSE for LHUR except that the objective value does not include a degrees-of-freedom correction. This can be seen in the fact that Objective*N equals the residual SSE, 75.1989. N is 144, the number of observations used.
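The relationship between the two statistics can be checked from the reported figures. The sketch below assumes that the degrees-of-freedom correction subtracts the three estimated parameters (A, B, and C) from the 144 observations used, which is consistent with the Trace(S) value 0.533325 in Figure 18.3.

$$
\begin{aligned}
\text{Objective} &= \frac{SSE}{N} = \frac{75.1989}{144} \approx 0.5222 \\
\text{MSE} &= \frac{SSE}{N - p} = \frac{75.1989}{144 - 3} = \frac{75.1989}{141} \approx 0.5333
\end{aligned}
$$

Small differences in the last digit arise because the printed SSE is itself rounded.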
The estimation method selected is added after the slash (/) on the FIT statement. The INSTRUMENTS statement follows the FIT statement and in this case selects all the exogenous variables as instruments with the _EXOG_ keyword. The parameters B2 and C2 in the instruments list request that the derivatives with respect to B2 and C2 be additional instruments. Full information maximum likelihood (FIML) can also be used to avoid simultaneous equation bias. FIML is computationally more expensive than an instrumental variables method and assumes that the errors are normally distributed. On the other hand, FIML does not require the specification of instruments. FIML is selected with the FIML option on the FIT statement. The preceding example is estimated with FIML by using the following statements:
   proc model data=test2;
      exogenous x1 x2;
      parms a1 a2 b2 2.5 c2 55 d1;
      y1 = a1 * y2 + b2 * x1 * x1 + d1;
      y2 = a2 * y1 + b2 * x2 * x2 + c2 / x2 + d1;
      fit y1 y2 / fiml;
   run;
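For comparison, the instrumental variables version described before the FIML statements might look like the following. This is a sketch under assumptions: it reuses the same model and the TEST2 data set, selects two-stage least squares with the 2SLS option, and builds the instruments list from the description above (the derivatives with respect to B2 and C2 plus all exogenous variables).

```sas
proc model data=test2;
   exogenous x1 x2;
   parms a1 a2 b2 2.5 c2 55 d1;
   y1 = a1 * y2 + b2 * x1 * x1 + d1;
   y2 = a2 * y1 + b2 * x2 * x2 + c2 / x2 + d1;
   /* estimation method follows the slash on the FIT statement */
   fit y1 y2 / 2sls;
   /* derivatives w.r.t. b2 and c2, plus all exogenous variables */
   instruments b2 c2 _exog_;
run;
```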
The use of the EQ. prefix tells PROC MODEL that the variable is an error term and that it should not expect actual values for the variable ONE in the input data set.
Assume the quantity of interest is the amount of energy consumed in the U.S., the price is the price of gasoline, and the income variable is the consumer debt. When the market is at equilibrium, these equations determine the market price and the equilibrium quantity. These equations are written in general form as

$$
\begin{aligned}
\epsilon_1 &= quantity - (a_1 + a_2\, price) \\
\epsilon_2 &= quantity - (b_1 + b_2\, price + b_3\, income)
\end{aligned}
$$
Note that the endogenous variables quantity and price depend on two error terms so that OLS should not be used. The following example uses three-stage least squares estimation. Data for this model is obtained from the SASHELP.CITIMON data set.
   title1 'Supply-Demand Model using General-form Equations';
   proc model data=sashelp.citimon;
      endogenous eegp eec;
      exogenous exvus cciutc;
      parameters a1 a2 b1 b2 b3;
      label eegp   = 'Gasoline Retail Price'
            eec    = 'Energy Consumption'
            cciutc = 'Consumer Debt';

      /* -------- Supply equation ------------- */
      eq.supply = eec - (a1 + a2 * eegp);

      /* -------- Demand equation ------------- */
      eq.demand = eec - (b1 + b2 * eegp + b3 * cciutc);

      /* -------- Instrumental variables -------*/
      lageegp  = lag(eegp);
      lag2eegp = lag2(eegp);

      /* -------- Estimate parameters --------- */
      fit supply demand / n3sls fsrsq;
      instruments _EXOG_ lageegp lag2eegp;
   run;
The FIT statement specifies the two equations to estimate and the method of estimation, N3SLS. Note that 3SLS is an alias for N3SLS. The option FSRSQ is selected to get a report of the first-stage $R^2$ to determine the acceptability of the selected instruments. Since three-stage least squares is an instrumental variables method, instruments are specified with the INSTRUMENTS statement. The instruments selected are all the exogenous variables, selected with the _EXOG_ option, and two lags of the variable EEGP: LAGEEGP and LAG2EEGP. The data set CITIMON has four observations that generate missing values because values for EEGP, EEC, or CCIUTC are missing. This is revealed in the Observations Processed output shown in Figure 18.7. Missing values are also generated when the equations cannot be computed for a given observation. Missing observations are not used in the estimation.
The lags used to create the instruments also reduce the number of observations used. In this case, the first two observations were used to fill the lags of EEGP. The data set has a total of 145 observations, of which four generated missing values and two were used to fill lags, which left 139 observations for the estimation. In the estimation summary, in Figure 18.8, the total degrees of freedom for the model and error is 139.
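The observation count and the degrees of freedom reported in Figure 18.8 can be reconciled with simple arithmetic. The equation labels below assume that the rows of the figure follow the order of the FIT statement (supply with two parameters, demand with three).

$$
\begin{aligned}
145 - 4 - 2 &= 139 && \text{(observations used)} \\
2 + 137 &= 139 && \text{(supply: DF Model + DF Error)} \\
3 + 136 &= 139 && \text{(demand: DF Model + DF Error)}
\end{aligned}
$$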
Figure 18.8 Supply-Demand Parameter Estimates
Supply-Demand Model using General-form Equations
The MODEL Procedure

Nonlinear 3SLS Summary of Residual Errors

DF Model    DF Error    R-Square    Adj R-Sq
2           137
3           136

Nonlinear 3SLS Parameter Estimates

Parameter    1st Stage R-Square
a1           1.0000
a2           0.9617
b1           1.0000
b2           0.9617
b3           1.0000
One disadvantage of specifying equations in general form is that there are no actual values associated with the equation, so the $R^2$ statistic cannot be computed.
As a second example, suppose you want to compute points of intersection between the square root function and hyperbolas of the form $a + b/x$. That is, you want to solve the system:

$$
\begin{aligned}
\text{(square root)}\quad y &= \sqrt{x} \\
\text{(hyperbola)}\quad y &= a + \frac{b}{x}
\end{aligned}
$$

The following statements read parameters for several hyperbolas in the input data set TEST and solve the nonlinear equations. The SOLVEPRINT option in the SOLVE statement prints the solution values. The ID statement is used to include the values of A and B in the output of the SOLVEPRINT option.
   title1 'Solving a Simultaneous System';
   data test;
      input a b @@;
   datalines;
   0 1   1 1   1 2
   ;
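A sketch of the PROC MODEL step that would follow the DATA step is shown here. It is assembled from the surrounding description rather than copied from a listing: the equation names SQRT and HYPERBOLA come from the model summary in Figure 18.9, and the SOLVEPRINT option and ID statement are the ones named in the text above.

```sas
proc model data=test;
   /* general-form equations: each residual is zero at a solution */
   eq.sqrt      = sqrt(x) - y;
   eq.hyperbola = a + b / x - y;
   /* solve for x and y at each observation and print the solutions */
   solve x y / solveprint;
   /* carry the hyperbola parameters into the printed output */
   id a b;
run;
```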
The printed output produced by this example consists of a model summary report, a listing of the solution values for each observation, and a solution summary report. The model summary for this example is shown in Figure 18.9.
Figure 18.9 Model Summary Report
Solving a Simultaneous System
The MODEL Procedure

Model Summary
   Model Variables         2
   ID Variables            2
   Equations               2
   Number of Statements    2

   Model Variables   x y
   Equations         sqrt hyperbola
Solution Values
   x   1.000000
   y   1.000000

Observation 2
   a   1.0000      Iterations   5
   b   1.0000      CC           0.000000      eq.hyperbola   0.000000
For each observation, a heading line is printed that lists the values of the ID variables for the observation and information about the iterative process used to compute the solution. The number of iterations required and the convergence measure (labeled CC) are printed. This convergence measure indicates the maximum error by which solution values fail to satisfy the equations. When this error is small enough (as determined by the CONVERGE= option), the iterations terminate. The equation with the largest error is indicated. For example, for observation 3 the HYPERBOLA equation has an error of $4.42 \times 10^{-13}$ while the error of the SQRT equation is even smaller. Following the heading line for the observation, the solution values are printed. The last part of the SOLVE statement output is the solution summary report shown in Figure 18.11. This report summarizes the solution method used (Newton's method by default), the iteration history, and the observations processed.
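One of the printed solutions can be verified by hand. Eliminating $y$ from the two equations reduces the system to a single equation in $x$; for the first input pair in the TEST data set, $a = 0$ and $b = 1$, the solution is immediate:

$$
\sqrt{x} = a + \frac{b}{x}
\;\Longrightarrow\;
x^{3/2} - a\,x - b = 0,
\qquad
a = 0,\; b = 1:\quad x^{3/2} = 1 \;\Longrightarrow\; x = 1,\; y = \sqrt{1} = 1 .
$$

For the other two pairs the reduced equation has no comparably simple closed form, which is why an iterative method such as Newton's method is used.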
Figure 18.11 Solution Summary Report
Solving a Simultaneous System
The MODEL Procedure
Simultaneous Simulation

Data Set Options
   DATA=   TEST

Solution Summary
   Variables Solved      2
   Implicit Equations    2
   Solution Method       NEWTON
   CONVERGE=             1E-8
   Maximum CC            9.176E-9
   Maximum Iterations    17
   Total Iterations      26
   Average Iterations    8.666667
Rate of Yen/$ Rate of Gm/$ US from Japan in 1984 $ US from WG in 1984 $ in Inflation Rates US-JP in Inflation Rates US-WG;
   /* Fit the EXCHANGE data */
   fit rate_jp rate_wg / sur outest=xch_est outcov outs=s;

   /* Solve using the WHATIF data set */
   solve rate_jp rate_wg / data=whatif estdata=xch_est sdata=s
         random=100 seed=123 out=monte forecast;
   id yr;
   range yr=1986;
   run;
Data for the EXCHANGE data set was obtained from the Department of Commerce and the yearly Economic Report of the President.

First, the parameters are estimated using SUR selected by the SUR option in the FIT statement. The OUTEST= option is used to create the XCH_EST data set, which contains the estimates of the parameters. The OUTCOV option adds the covariance matrix of the parameters to the XCH_EST data set. The OUTS= option is used to save the covariance of the equation error in the data set S.

Next, Monte Carlo simulation is requested by using the RANDOM= option in the SOLVE statement. The data set WHATIF is used to drive the forecasts. The ESTDATA= option reads in the XCH_EST data set, which contains the parameter estimates and covariance matrix. Because the parameter covariance matrix is included, perturbations of the parameters are performed. The SDATA= option causes the Monte Carlo simulation to use the equation error covariance in the S data set to perturb the equation errors. The SEED= option selects the number 123 as a seed value for the random number generator. The output of the Monte Carlo simulation is written to the data set MONTE selected by the OUT= option.

To generate a confidence interval plot for the forecast, use PROC UNIVARIATE to generate percentile bounds and use PROC SGPLOT to plot the graph. The following SAS statements produce the graph in Figure 18.12.
   proc sort data=monte;
      by yr;
   run;

   proc univariate data=monte noprint;
      by yr;
      var rate_jp rate_wg;
      output out=bounds mean=mean p5=p5 p95=p95;
   run;

   title "Monte Carlo Generated Confidence Intervals on a Forecast";
   proc sgplot data=bounds noautolegend;
      series x=yr y=mean / markers;
      series x=yr y=p5 / markers;
      series x=yr y=p95 / markers;
   run;
   DO ;
   DO variable = expression < TO expression > < BY expression > < , expression TO expression < BY expression > . . . > < WHILE expression > < UNTIL expression > ;
   END ;
   DROP variable . . . ;
   ENDOGENOUS variable < initial-values > . . . ;
   ERRORMODEL equation-name distribution < CDF=( CDF(options) ) > ;
   ESTIMATE item1 < , item2 . . . > < ,/ options > ;
   EXOGENOUS variable < initial values > . . . ;
   FIT equations < PARMS=( parameter values . . . ) > < START=( parameter values . . . ) > < DROP=( parameters ) > < / options > ;
   FORMAT variable-list < format > < DEFAULT= default-format > ;
   GOTO statement-label ;
   ID variable-list ;
   IF expression ;
   IF expression THEN programming-statement1 ; < ELSE programming-statement2 > ;
   variable = expression ;
   variable + expression ;
   INCLUDE model-file . . . ;
   INSTRUMENTS < instruments > < _EXOG_ > < EXCLUDE=( parameters ) > < / options > ;
   KEEP variable . . . ;
   LABEL variable =label . . . ;
   LENGTH variable-list < $ > length . . . < DEFAULT=length > ;
   LINK statement-label ;
   OUTVARS variable . . . ;
   PARAMETERS variable1 < value1 > < variable2 < value2 . . . > > ;
   PUT print-item . . . < @ > < @@ > ;
   RANGE variable < = first > < TO last > ;
   RENAME old-name1 = new-name1 < . . . old-name2 = new-name2 > ;
   RESET options ;
   RESTRICT restriction1 < , restriction2 . . . > ;
   RETAIN variable-list1 value1 < variable-list2 value2 . . . > ;
   RETURN ;
   SOLVE variable-list < SATISFY=(equations) > < / options > ;
   SUBSTR ( variable, index, length ) = expression ;
   SELECT < ( expression ) > ;
   OTHERWISE programming-statement ;
   STOP ;
   TEST < "name" > test1 < , test2 . . . > < ,/ options > ;
   VAR variable < initial-values > . . . ;
   WEIGHT variable ;
   WHEN ( expression ) programming-statement ;
Functional Summary
The statements and options in the MODEL procedure are summarized in the following table.
Description | Statement | Option

Data Set Options
specify the input data set for the variables | FIT, SOLVE | DATA=
specify the input data set for parameters | FIT, SOLVE | ESTDATA=
specify the method for handling missing values | FIT | MISSING=
specify the input data set for parameters | MODEL | PARMSDATA=
request that the procedure produce graphics via the Output Delivery System | MODEL | PLOTS=
specify the output data set for residual, predicted, or actual values | FIT | OUT=
specify the output data set for solution mode results | SOLVE | OUT=
write the actual values to OUT= data set | FIT | OUTACTUAL
select all output options | FIT | OUTALL
write the covariance matrix of the estimates | FIT | OUTCOV
write the parameter estimates to a data set | FIT | OUTEST=
write the parameter estimates to a data set | MODEL | OUTPARMS=
write the observations used to start the lags | SOLVE | OUTLAGS
write the predicted values to the OUT= data set | FIT | OUTPREDICT
write the residual values to the OUT= data set | FIT | OUTRESID
write the covariance matrix of the equation errors to a data set | FIT | OUTS=
write the S matrix used in the objective function definition to a data set | FIT | OUTSUSED=
write the estimate of the variance matrix of the moment generating function | FIT | OUTV=
read the covariance matrix of the equation errors | FIT, SOLVE | SDATA=
read the covariance matrix for GMM and ITGMM | FIT | VDATA=
specify the name of the time variable | FIT, SOLVE, MODEL | TIME=
select the estimation type to read | FIT, SOLVE | TYPE=

General ESTIMATE Statement Options
specify the name of the data set in which the estimate of the functions of the parameters are to be written | ESTIMATE | OUTEST=
write the covariance matrix of the functions of the parameters to the OUTEST= data set | |
print the covariance matrix of the functions of the parameters | |
print the correlation matrix of the functions of the parameters | |

Printing Options for FIT Tasks
print the modified Breusch-Pagan test for heteroscedasticity | FIT | BREUSCH
print the Chow test for structural breaks | FIT | CHOW=
print collinearity diagnostics | FIT | COLLIN
print the correlation matrices | FIT | CORR
print the correlation matrix of the parameters | FIT | CORRB
print the correlation matrix of the residuals | FIT | CORRS
print the covariance matrices | FIT | COV
print the covariance matrix of the parameters | FIT | COVB
print the covariance matrix of the residuals | FIT | COVS
print Durbin-Watson d statistics | FIT | DW
print first-stage R2 statistics | FIT | FSRSQ
print Godfrey's tests for autocorrelated residuals for each equation | FIT | GODFREY
print Hausman's specification test | FIT | HAUSMAN
print tests of normality of the model residuals | FIT | NORMAL
print the predictive Chow test for structural breaks | FIT | PCHOW=
specify all the printing options | FIT | PRINTALL
print White's test for heteroscedasticity | FIT | WHITE

Options to Control FIT Iteration Output
print the inverse of the crossproducts Jacobian matrix | |
print a summary iteration listing | |
print a detailed iteration listing | |
print the crossproduct Jacobian matrix | |
specify all the iteration printing-control options | |

Options to Control the Minimization Process
specify the convergence criteria | FIT | CONVERGE=
select the Hessian approximation used for FIML | FIT | HESSIAN=
specify the local truncation error bound for the integration | FIT, SOLVE, MODEL |
specify the maximum number of iterations allowed | FIT |
specify the maximum number of subiterations allowed | FIT |
select the iterative minimization method to use | FIT |
specify the smallest allowed time step to be used in the integration | FIT, SOLVE, MODEL |
modify the iterations for estimation methods that iterate the S matrix or the V matrix | FIT |
specify the smallest pivot value | MODEL, FIT, SOLVE |
specify the number of minimization iterations to perform at each grid point | FIT |
specify a weight variable | WEIGHT |

Options to Read and Write Model Files
read a model from one or more input model files
suppress the default output of the model file
specify the name of an output model file
delete the current model

Options to List or Analyze the Structure of the Model
print a dependency structure of a model
print a graph of the dependency structure of a model
print the model program and variable lists
print the derivative tables and compiled model program code
print a dependency list
print a table of derivatives
print a cross-reference of the variables

General Printing Control Options
expand parts of the printed output
print a message for each statement as it is executed
select the maximum number of execution errors that can be printed
select the number of decimal places shown in the printed output
suppress the normal printed output
FIT, SOLVE FIT, SOLVE FIT, SOLVE FIT, SOLVE FIT, SOLVE
specify all the noniteration printing options | |
print the result of each operation as it is executed | |
request a comprehensive memory usage summary | |
turn off the NOPRINT option | |

Statements that Declare Variables
associate a name with a list of variables and constants | ARRAY |
declare a variable to have a fixed value | CONTROL |
declare a variable to be a dependent or endogenous variable | ENDOGENOUS |
declare a variable to be an independent or exogenous variable | EXOGENOUS |
specify identifying variables | ID |
assign a label to a variable | LABEL |
select additional variables to be output | OUTVARS |
declare a variable to be a parameter | PARAMETERS |
force a variable to hold its value from a previous observation | RETAIN |
declare a model variable | VAR |
declare an instrumental variable | INSTRUMENTS |
omit the default intercept term in the instruments list | INSTRUMENTS | NOINT

General FIT Statement Options
omit parameters from the estimation | |
associate a variable with an initial value as a parameter or a constant | |
bypass OLS to get initial parameter estimates for GMM, ITGMM, or FIML | |
bypass 2SLS to get initial parameter estimates for GMM, ITGMM, or FIML | |
specify the parameters to estimate | |
request confidence intervals on estimated parameters | |
select a grid search | |

Options to Control the Estimation Method Used
specify nonlinear ordinary least squares | FIT | OLS
specify iterated nonlinear ordinary least squares | FIT | ITOLS
specify seemingly unrelated regression | FIT | SUR
specify iterated seemingly unrelated regression | FIT | ITSUR
specify two-stage least squares | FIT | 2SLS
specify iterated two-stage least squares | FIT | IT2SLS
specify three-stage least squares | FIT | 3SLS
specify iterated three-stage least squares | FIT | IT3SLS
specify full information maximum likelihood | FIT | FIML
specify simulated method of moments | FIT | NDRAW
specify number of draws for the V matrix | FIT | NDRAWV
specify number of initial observations for SMM | FIT | NPREOBS
select the variance-covariance estimator used for FIML | FIT | COVBEST=
specify generalized method of moments | FIT | GMM
specify the kernel for GMM and ITGMM | FIT | KERNEL=
specify iterated generalized method of moments | FIT | ITGMM
specify the type of generalized inverse used for the covariance matrix | FIT | GINV=
specify the denominator for computing variances and covariances | FIT | VARDEF=
specify adding the variance adjustment for SMM | FIT | ADJSMMV
specify variance correction for heteroscedasticity | FIT | HCCME=
specify GMM variance under arbitrary weighting matrix | FIT | GENGMMV
specify GMM variance under optimal weighting matrix | FIT | NOGENGMMV

Solution Mode Options
select a subset of the model equations | |
solve only for missing variables | |
solve for all solution variables | |

Solution Mode Options: Lag Processing
use solved values in the lag functions | |
use actual values in the lag functions | |
produce successive forecasts to a fixed forecast horizon | |
select the observation to start dynamic solutions | |
Description
Statement
Option
Solution Mode Options: Numerical Methods specify the maximum number of iterations allowed specify the maximum number of subiterations allowed specify the convergence criteria compute a simultaneous solution using a Jacobi-like iteration compute a simultaneous solution using a Gauss-Seidel-like iteration compute a simultaneous solution using Newtons method compute a nonsimultaneous solution Monte Carlo Simulation Options specify quasi-random number generator specify pseudo-random number generator repeat the solution multiple times initialize the pseudo-random number generator specify copula options Solution Mode Printing Options print between data points integration values for the DERT. variables and the auxiliary variables print the solution approximation and equation errors print the solution values and residuals at each observation print various summary statistics print tables of Theil inequality coefcients specify all printing control options General TEST Statement Options specify that a Wald test be computed specify that a Lagrange multiplier test be computed specify that a likelihood ratio test be computed request all three types of tests specify the name of an output SAS data set that contains the test results
INTGPRINT
Description
Statement
Option
Miscellaneous Statements specify the range of observations to be used subset the data set with BY variables
RANGE BY
The following options can be specified in the PROC MODEL statement. All of the nonassignment options (the options that do not accept a value after an equal sign) can have NO prefixed to the option name in the RESET statement to turn the option off. The default case is not explicitly indicated in the discussion that follows. Thus, for example, the option DETAILS is documented in the following, but NODETAILS is not documented since it is the default. Also, the NOSTORE option is documented because STORE is the default.
DATA=SAS-data-set
names the input data set. Variables in the model program are looked up in the DATA= data set and, if found, their attributes (type, length, label, format) are set to be the same as those in the input data set (if not previously defined otherwise). The values for the variables in the program are read from the input data set when the model is estimated or simulated by FIT and SOLVE statements.
OUTPARMS=SAS-data-set
writes the parameter estimates to a SAS data set. See the section Output Data Sets on page 1110 for details.
PARMSDATA=SAS-data-set
names the SAS data set that contains the parameter estimates. In PROC MODEL, you have several options to specify starting values for the parameters to be estimated. When more than one option is specified, the options are implemented in the following order of precedence (from highest to lowest): the START= option, the PARMS statement initialization value, the ESTDATA= option, and the PARMSDATA= option. If no options are specified for the starting value, the default value of 0.0001 is used. See the section Input Data Sets on page 1104 for details.
PLOTS=global-plot-options | plot-request
requests that the MODEL procedure produce statistical graphics via the Output Delivery System, provided that the ODS GRAPHICS statement has been specified. For general information about ODS Graphics, see Chapter 21, Statistical Graphics Using ODS (SAS/STAT User's Guide). The global-plot-options apply to all relevant plots generated by the MODEL procedure. The global-plot-options supported by the MODEL procedure follow.
UNPACKPANEL                          breaks a graphic that is otherwise paneled into individual component plots.
RESIDUAL | RES                       plots the residuals.
STUDENTRESIDUAL
RESIDUALHISTOGRAM | RESIDHISTOGRAM
NONE                                 suppresses all plots.
MODEL=model-name
reads the model from one or more input model files created by previous PROC MODEL executions. Model files are written by the OUTMODEL= option.
NOSTORE
suppresses the default output of the model file. This option is applicable only when FIT or SOLVE statements are not used, the MODEL= option is not used, and when a model is specified.
OUTMODEL=model-name
specifies the name of an output model file to which the model is to be written. Starting with SAS 9.2, model files are stored as XML-based SAS data sets instead of being stored as members of a SAS catalog as in earlier releases. This makes MODEL files more readily extendable in the future and enables Java-based applications to read the MODEL files directly. To change this behavior, use the SAS global-CMPMODEL-options. You can choose the format in which the output model file is stored and read by using the CMPMODEL=global-CMPMODEL-options in an OPTIONS statement as follows.
OPTIONS CMPMODEL=global-CMPMODEL-options;
The global CMPMODEL options are:
CATALOG   specifies that model files be written and read from SAS catalogs only.
XML       specifies that model files be written and read from XML data sets only.
BOTH      specifies that model files be written to both XML and CATALOG formats. When BOTH is specified, model files are read from the data set first and read from the SAS catalog only if the data set is not found. This is the default option.
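The following statements are a minimal sketch of how the CMPMODEL= system option, the OUTMODEL= option, and the MODEL= option can be combined. The data set name HIST, the variables Y and X, the parameters A and B, and the model name LINMOD are hypothetical and are used only for illustration.

```sas
/* store and read model files as XML data sets only */
options cmpmodel=xml;

/* hypothetical model: write it to the model file LINMOD without estimating it */
proc model outmodel=linmod;
   parms a b;
   y = a + b*x;
run;
quit;

/* later step: read the stored model and fit it to the hypothetical data set HIST */
proc model model=linmod data=hist;
   fit y;
run;
quit;
```

Because the second step reads the model from the model file, the model equations do not need to be repeated there.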
BLOCK
prints an analysis of the structure of the model given by the assignments to model variables appearing in the model program. This analysis includes a classification of model variables into endogenous (dependent) and exogenous (independent) groups based on the presence of the variable on the left-hand side of an assignment statement. The endogenous variables are grouped into simultaneously determined blocks. The dependency structure of the simultaneous blocks and exogenous variables is also printed. The BLOCK option cannot analyze dependencies implied by general form equations.
GRAPH
prints the graph of the dependency structure of the model. The GRAPH option also invokes the BLOCK option and produces a graphical display of the information listed by the BLOCK option.
LIST
prints the model program and variable lists, including the statements added by PROC MODEL and macros.
LISTALL
selects the LIST, LISTDEP, LISTDER, and LISTCODE options.
LISTCODE
prints the derivative tables and compiled model program code. LISTCODE is a debugging feature and is not normally needed.
LISTDEP
prints a report that lists for each variable in the model program the variables that depend on it and that it depends on. These lists are given separately for current-period values and for lagged values of the variables.
The information displayed is the same as that used to construct the BLOCK report but differs in that the information is listed for all variables (including parameters, control variables, and program variables), not just for the model variables. Classification into endogenous and exogenous groups and analysis of simultaneous structure is not done by the LISTDEP report.
LISTDER
prints a table of derivatives for FIT and SOLVE tasks. (The LISTDER option is applicable only for the default NEWTON method for SOLVE tasks.) The derivatives table shows each nonzero derivative computed for the problem. The derivative listed can be a constant, a variable in the model program, or a special derivative variable created to hold the result of the derivative expression. This option is turned on by the LISTCODE and PRINTALL options.
XREF
prints a cross-reference of the variables in the model program that shows where each variable was referenced or given a value. The XREF option is normally used in conjunction with the LIST option. A more detailed description is given in the section Diagnostics and Debugging on page 1168.
DETAILS
specifies the detailed printout. Parts of the printed output are expanded when the DETAILS option is specified. If ODS GRAPHICS ON is specified, the following additional graphs of the residuals are produced: ACF, PACF, IACF, white noise, and QQ plot versus the normal.
FLOW
prints a message for each statement in the model program as it is executed. This debugging option is needed very rarely and produces voluminous output.
MAXERRORS=n
specifies the maximum number of execution errors that can be printed. The default is MAXERRORS=50.
NDEC=n
specifies the precision of the format that PROC MODEL uses when printing various numbers. The default is NDEC=3, which means that PROC MODEL attempts to print values by using the D format but ensures that at least three significant digits are shown. If the NDEC= value is greater than nine, the BEST. format is used. The smallest value allowed is NDEC=2. The NDEC= option affects the format of most, but not all, of the floating point numbers that PROC MODEL can print. For some values (such as parameter estimates), a precision limit one or two digits greater than the NDEC= value is used. This option does not apply to the precision of the variables in the output data set.
NOPRINT
suppresses the normal printed output but does not suppress error listings. Using any other print option turns the NOPRINT option off. The PRINT option can be used with the RESET statement to turn off NOPRINT.
PRINTALL
turns on all the printing-control options. The options set by PRINTALL are DETAILS; the model information options LIST, LISTDEP, LISTDER, XREF, BLOCK, and GRAPH; the FIT task printing options FSRSQ, COVB, CORRB, COVS, CORRS, DW, and COLLIN; and the SOLVE task printing options STATS, THEIL, SOLVEPRINT, and ITPRINT.
TRACE
prints the result of each operation in each statement in the model program as it is executed, in addition to the information printed by the FLOW option. This debugging option is needed very rarely and produces voluminous output.
MEMORYUSE
prints a report of the memory required for the various parts of the analysis.
makes two-stage least squares the default parameter estimation method for FIT statements that do not specify an estimation method.
BOUNDS Statement
BOUNDS bound1 < , bound2 . . . > ;
The BOUNDS statement imposes simple boundary constraints on the parameter estimates. BOUNDS statement constraints refer to the parameters estimated by the associated FIT statement (that is, to either the preceding FIT statement or, in the absence of a preceding FIT statement, to the following FIT statement). You can specify any number of BOUNDS statements. Each bound is composed of parameters and constants and inequality operators:
item operator item < operator item < operator item . . . > >
Each item is a constant, the name of an estimated parameter, or a list of parameter names. Each operator is <, >, <=, or >=. You can use both the BOUNDS statement and the RESTRICT statement to impose boundary constraints; however, the BOUNDS statement provides a simpler syntax for specifying these kinds of constraints. See the section RESTRICT Statement on page 999 for more information about the computational details of estimation with inequality restrictions.
Lagrange multipliers are reported for all the active boundary constraints. In the printed output and in the OUTEST= data set, the Lagrange multiplier estimates are identified with the names BOUND0, BOUND1, and so forth. The probabilities of the Lagrange multipliers are computed using a beta distribution (LaMotte 1994). To give the constraints more descriptive names, use the RESTRICT statement instead of the BOUNDS statement.
The following BOUNDS statement constrains the estimates of the parameters A and B and the ten parameters P1 through P10 to be between zero and one. This example illustrates the use of parameter lists to specify boundary constraints.
bounds 0 < a b p1-p10 < 1;
The following statements are an example of the use of the BOUNDS statement and they produce the output shown in Figure 18.13:
title 'Holzman Function (1969), Himmelblau No. 21, N=3';
data zero;
   do i = 1 to 99;
      output;
   end;
run;

proc model data=zero;
   parms x1= 100 x2= 12.5 x3= 3;
   bounds .1 <= x1 <= 100,
          0 <= x2 <= 25.6,
          0 <= x3 <= 5;

   t = 2 / 3;
   u = 25 + (-50 * log(0.01 * i )) ** t;
   v = (u - x2) ** x3;
   w = exp(-v / x1);
   eq.foo = -.01 * i + w;

   fit foo / method=marquardt;
run;
[Figure 18.13: table of parameter estimates for x1, x2, and x3]
BY Statement
BY variables ;
A BY statement is used with the FIT statement to obtain separate estimates for observations in groups defined by the BY variables. Note that if an output model file is written, using the OUTMODEL= option, the parameter values stored are those from the last BY group processed. To save parameter estimates for each BY group, use the OUTEST= option in the FIT statement.
A BY statement is used with the SOLVE statement to obtain solutions for observations in groups defined by the BY variables. If the BY variables are identical in the DATA= data set and the ESTDATA= data set, then the two data sets are synchronized and the calculations are performed by using the data and parameters for each BY group. This holds for BY variables in the SDATA= data set as well. If the BY variables do not match, BY group processing is abandoned in either the ESTDATA= data set or the SDATA= data set, whichever has the missing BY value. If the DATA= data set does not contain BY variables and the ESTDATA= data set or the SDATA= data set does, then BY processing is performed for the ESTDATA= data set and the SDATA= data set by reusing the data in the DATA= data set for each BY group.
If both FIT and SOLVE tasks require BY group processing, then two separate BY statements are needed. If parameters for each BY group in the OUTEST= data set obtained from the FIT task are to be used for the corresponding BY group for the SOLVE task, then one of the two BY statements must appear after the SOLVE statement.
The following linear regression example illustrates the use of BY group processing. Both the data sets A and D, to be used for fitting and solving, respectively, have three groups.
/*------ data set for fit task ------*/
data a ;
   do group = 1 to 3 ;
      do i = 1 to 100 ;
         x = normal(1);
         y = 2 + 3*x + rannor(3535) ;
         output ;
      end ;
   end ;
run ;

/*------ data set for solve task ------*/
data d ;
   do group = 1 to 3 ;
      x = normal(1) ;
      output ;
   end ;
run ;

/*------ 2 BY statements, one of them appears after SOLVE statement ------*/
proc model data = a ;
   by group ;
   y = a0 + a1*x ;
   fit y / outest = b1 ;
   solve y / data = d estdata = b1 out = c1 ;
   by group ;
run;

proc print data = b1 ; run;
proc print data = c1 ; run;
Each of the parameter estimates obtained from the BY group processing in the FIT statement shown in Figure 18.14 is used in the corresponding BY group variables in the SOLVE statement. The output dataset is shown in Figure 18.15.
Figure 18.14 Listing of OUTEST= Data Set Created in the FIT Statement with Two BY Statements
Obs  group  _NAME_  _TYPE_  _STATUS_     _NUSED_     a0        a1
 1     1            OLS     0 Converged    100     2.00338   3.00298
 2     2            OLS     0 Converged    100     2.05091   3.08808
 3     3            OLS     0 Converged    100     2.15528   3.04290
Figure 18.15 Listing of OUT= Data Set Created in the SOLVE Statement with Two BY Statements
Obs  group  _TYPE_   _MODE_    _ERRORS_      y          x
 1     1    PREDICT  SIMULATE      0      7.42322    1.80482
 2     2    PREDICT  SIMULATE      0      1.80413   -0.07992
 3     3    PREDICT  SIMULATE      0      3.36202    0.39658
If only one BY statement is used and it appears before the SOLVE statement, then parameters for the last BY group in the OUTEST= data set are used for all BY groups for the SOLVE task.
/*------ 1 BY statement that appears before SOLVE statement ------*/
proc model data = a ;
   by group ;
   y = a0 + a1*x ;
   fit y / outest = b2 ;
   solve y / data = d estdata = b2 out = c2 ;
run;

proc print data = b2 ; run;
proc print data = c2 ; run;
The estimates of the parameters are shown in Figure 18.16, and the output data set of the SOLVE statement is shown in Figure 18.17. Because the parameter estimates of the last BY group are used for every observation, the predicted value for the last BY group is the same in data sets C1 and C2, while the predicted values for the other groups do not match.
Figure 18.16 Listing of OUTEST= Data Set Created in the FIT Statement with One BY Statement That Appears before the SOLVE Statement
Obs  group  _NAME_  _TYPE_  _STATUS_     _NUSED_     a0        a1
 1     1            OLS     0 Converged    100     2.00338   3.00298
 2     2            OLS     0 Converged    100     2.05091   3.08808
 3     3            OLS     0 Converged    100     2.15528   3.04290
Figure 18.17 Listing of OUT= Data Set Created in the SOLVE Statement with One BY Statement That Appears before the SOLVE Statement
Obs  _TYPE_   _MODE_    _ERRORS_      y          x
 1   PREDICT  SIMULATE      0      7.64717    1.80482
 2   PREDICT  SIMULATE      0      1.91211   -0.07992
 3   PREDICT  SIMULATE      0      3.36202    0.39658
If only one BY statement is used and it appears after the SOLVE statement, then BY group processing does not apply to the FIT task. In this case, the OUTEST= data set does not contain the BY variable, and the single set of parameter estimates obtained from the FIT task is used for all BY groups during the SOLVE task.
/*------ 1 BY statement that appears after SOLVE statement ------*/
proc model data = a ;
   y = a0 + a1*x ;
   fit y / outest = b3 ;
   solve y / data = d estdata = b3 out = c3 ;
   by group ;
run;

proc print data = b3 ; run;
proc print data = c3 ; run;
The output data sets B3 and C3 are listed in Figure 18.18 and Figure 18.19, respectively.
Figure 18.18 Listing of OUTEST= Data Set Created in the FIT Statement with One BY Statement That Appears after the SOLVE Statement
Obs  _NAME_  _TYPE_  _STATUS_     _NUSED_     a0        a1
 1           OLS     0 Converged    300     2.06624   3.04219
Figure 18.19 Listing of OUT= Data Set Created in the First SOLVE Statement with One BY Statement That Appears after the SOLVE Statement
Obs  group  _TYPE_   _MODE_    _ERRORS_      y          x
 1     1    PREDICT  SIMULATE      0      7.55686    1.80482
 2     2    PREDICT  SIMULATE      0      1.82312   -0.07992
 3     3    PREDICT  SIMULATE      0      3.27270    0.39658
CONTROL Statement
CONTROL variable < value > . . . ;
The CONTROL statement declares control variables and specifies their values. A control variable is like a parameter except that it has a fixed value and is not estimated from the data. You can use control variables for constants in model equations that you might want to change in different solution cases. You can use control variables to vary the program logic. Unlike the retained variables, these values are fixed across iterations.
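The following statements are a minimal sketch of a control variable used as a constant that changes between solution cases. The data sets HIST and FUTURE, the variables Y and X, and the control variable SCALE are hypothetical. The sketch assumes that the procedure is still active when the second CONTROL statement is submitted, so that the parameter estimates from the FIT task are still available.

```sas
/* HIST, FUTURE, Y, X, and SCALE are hypothetical names */
proc model data=hist;
   parms a b;
   control scale 1;                 /* fixed value; not estimated by FIT */
   y = a + b * scale * x;
   fit y;
   solve y / data=future out=base;  /* base case: SCALE = 1 */
run;

   control scale 1.1;               /* change the constant for a second case */
   solve y / data=future out=high;
run;
quit;
```

Because SCALE is a control variable rather than a parameter, the FIT task leaves it unchanged and only A and B are estimated.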
ENDOGENOUS Statement
ENDOGENOUS variable < initial-values > . . . ;
The ENDOGENOUS statement declares model variables and identifies them as endogenous. You can declare model variables with an ENDOGENOUS statement instead of with a VAR statement to help document the model or to indicate the default solution variables. The variables declared endogenous are solved when a SOLVE statement does not indicate which variables to solve. Valid abbreviations for the ENDOGENOUS statement are ENDOG and ENDO. The DEPENDENT statement is equivalent to the ENDOGENOUS statement and is provided for the convenience of non-econometric practitioners. The ENDOGENOUS statement optionally provides initial values for lagged dependent variables. See the section Lag Logic on page 1161 for more information.
ERRORMODEL Statement
ERRORMODEL equation-name ~ distribution < CDF= CDF(options) > ;
The ERRORMODEL statement is the mechanism for specifying the distribution of the residuals. You must specify the dependent/endogenous variables or general-form model name, a tilde (~), and then a distribution with its parameters. The following options are used in the ERRORMODEL statement:
specifies the Cauchy distribution. This option is supported only for simulation. The arguments correspond to the arguments of the SAS CDF function which computes the cumulative distribution function (ignoring the random variable argument).
CHISQUARED ( df < , nc > )
specifies the $\chi^2$ distribution. This option is supported only for simulation. The arguments correspond to the arguments of the SAS CDF function (ignoring the random variable argument).
GENERAL( Likelihood < , parm1, parm2, ... parmn > )
specifies the negative of a general log-likelihood function that you construct by using SAS programming statements. The procedure minimizes the negative log-likelihood function specified. parm1, parm2, ... parmn are optional parameters for this distribution and are used for documentation purposes only.
F( ndf, ddf < , nc > )
specifies the F distribution. This option is supported only for simulation. The arguments correspond to the arguments of the SAS CDF function (ignoring the random variable argument).
NORMAL( v1 v2 ... vn )
specifies a multivariate normal (Gaussian) distribution with mean 0 and variances v1 through vn.
POISSON( mean )
specifies the Poisson distribution. This option is supported only for simulation. The arguments correspond to the arguments of the SAS CDF function (ignoring the random variable argument).
T( v1 v2 ... vn, df )
specifies a multivariate t distribution with noncentrality 0, variances v1 through vn, and common degrees of freedom df.
specifies the uniform distribution. This option is supported only for simulation. The arguments correspond to the arguments of the SAS CDF function (ignoring the random variable argument).
specifies the univariate distribution that is used for simulation so that the estimation can be done for one set of distributional assumptions and the simulation for another. The CDF can be any of the distributions from the previous section with the exception of the general likelihood. In addition, you can specify the empirical distribution of the residuals.
EMPIRICAL= ( < TAILS=(options) > )
specifies how to handle the tails in computing the inverse CDF from an empirical distribution, where tail-options are:
NORMAL       specifies the normal distribution to extrapolate the tails.
T( df )      specifies the t distribution to extrapolate the tails.
PERCENT= p   specifies the percentage of the observations to use in constructing each tail. The default for the PERCENT= option is 10.
A normal distribution or a t distribution is used to extrapolate the tails to infinity. The variance for the tail distribution is obtained from the data so that the empirical CDF is continuous.
ESTIMATE Statement
ESTIMATE item < , item . . . > < ,/ options > ;
The ESTIMATE statement computes estimates of functions of the parameters. The ESTIMATE statement refers to the parameters estimated by the associated FIT statement (that is, to either the preceding FIT statement or, in the absence of a preceding FIT statement, to the following FIT statement). You can use any number of ESTIMATE statements.
Let $h(\theta)$ denote the function of parameters that needs to be estimated. Let $\hat{\theta}$ denote the unconstrained estimate of the parameter of interest, $\theta$. Let $\hat{V}$ be the estimate of the covariance matrix of $\hat{\theta}$. Denote
$$A(\theta) = \left. \frac{\partial h(\theta)}{\partial \theta} \right|_{\hat{\theta}}$$
Then the standard error of the parameter function estimate is computed by obtaining the square root of $A(\hat{\theta}) \hat{V} A'(\hat{\theta})$. The expression under the square root is the same as the variance needed for a Wald type test statistic with null hypothesis $h(\theta) = 0$.
If the expression of the function in the ESTIMATE statement includes a variable, then the value used in computing the function estimate is the last observation of the variable in the DATA= data set.
If you specify options on the ESTIMATE statement, a comma is required before the / character that separates the test expressions from the options, since the / character can also be used within test expressions to indicate division. Each item is written as an optional name followed by an expression,
< "name" > expression
where "name" is a string used to identify the estimate in the printed output and in the OUTEST= data set. Expressions can be composed of parameter names, arithmetic operators, functions, and constants. Comparison operators (such as = or <) and logical operators (such as &) cannot be used in ESTIMATE statement expressions. Parameters named in ESTIMATE expressions must be among the parameters estimated by the associated FIT statement. You can use the following options in the ESTIMATE statement:
OUTEST=
specifies the name of the data set in which the estimates of the functions of the parameters are to be written. The format for this data set is identical to the OUTEST= data set for the FIT statement. If you specify a name in the ESTIMATE statement, that name is used as the parameter name for the estimate in the OUTEST= data set. If no name is provided and the expression is just a symbol, the symbol name is used; otherwise, the string _Estimate # is used, where # is the variable number in the OUTEST= data set.
OUTCOV
writes the covariance matrix of the functions of the parameters to the OUTEST= data set in addition to the parameter estimates.
COVB
prints the covariance matrix of the functions of the parameters.
CORRB
prints the correlation matrix of the functions of the parameters.
The following statements are an example of the use of the ESTIMATE statement in a segmented model and produce the output shown in Figure 18.20:
data a;
   input y x @@;
   datalines;
.46 1  .47 2  .57 3  .61 4  .62 5  .68 6  .69 7  .78 8
.70 9  .74 10 .77 11 .78 12 .74 13 .80 13 .80 15 .78 16
;

title 'Segmented Model -- Quadratic with Plateau';
proc model data=a;
   x0 = -.5 * b / c;
   if x < x0 then
      y = a + b*x + c*x*x;
   else
      y = a + b*x0 + c*x0*x0;
   fit y start=( a .45 b .5 c -.0025 );
   estimate 'Join point' x0,
            'plateau' a + b*x0 + c*x0**2;
run;
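As a worked illustration of the standard error computation for the join point in the segmented model example, the derivation below applies the formula $A(\hat{\theta}) \hat{V} A'(\hat{\theta})$ to $h(\theta) = x_0 = -b/(2c)$. The ordering $\theta = (a, b, c)'$ and the notation $\hat{V}_{bb}$, $\hat{V}_{bc}$, $\hat{V}_{cc}$ for elements of the estimated covariance matrix are assumptions made here for illustration; they are not part of the procedure's output labels.

```latex
h(\theta) = -\frac{b}{2c}, \qquad
A(\theta) = \frac{\partial h}{\partial \theta'}
          = \begin{pmatrix} 0 & -\dfrac{1}{2c} & \dfrac{b}{2c^{2}} \end{pmatrix}

A(\hat{\theta})\,\hat{V}\,A'(\hat{\theta})
  = \frac{\hat{V}_{bb}}{4\hat{c}^{2}}
  - \frac{\hat{b}\,\hat{V}_{bc}}{2\hat{c}^{3}}
  + \frac{\hat{b}^{2}\,\hat{V}_{cc}}{4\hat{c}^{4}}

\mathrm{se}(\hat{x}_0) = \sqrt{A(\hat{\theta})\,\hat{V}\,A'(\hat{\theta})}
```

The parameter $a$ does not appear in $x_0$, so its variance and covariances drop out of the standard error of the join point.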
EXOGENOUS Statement
EXOGENOUS variable < initial-values > . . . ;
The EXOGENOUS statement declares model variables and identifies them as exogenous. You can declare model variables with an EXOGENOUS statement instead of with a VAR statement to help document the model or to indicate the default instrumental variables. The variables declared exogenous are used as instruments when an instrumental variables estimation method is requested (such as N2SLS or N3SLS) and an INSTRUMENTS statement is not used. Valid abbreviations for the EXOGENOUS statement are EXOG and EXO. The INDEPENDENT statement is equivalent to the EXOGENOUS statement and is provided for the convenience of non-econometric practitioners. The EXOGENOUS statement optionally provides initial values for lagged exogenous variables. See the section Lag Logic on page 1161 for more information.
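The following statements are a minimal sketch of how ENDOGENOUS and EXOGENOUS declarations can supply the default instruments for an instrumental variables method. The data set MARKET, the variables Q, P, INCOME, and COST, and the equation names DEMAND and SUPPLY are hypothetical.

```sas
/* MARKET and all variable, parameter, and equation names are hypothetical */
proc model data=market;
   endogenous q p;
   exogenous  income cost;
   parms a0 a1 a2 b0 b1 b2;

   /* general-form equations for demand and supply */
   eq.demand = a0 + a1*p + a2*income - q;
   eq.supply = b0 + b1*p + b2*cost   - q;

   /* no INSTRUMENTS statement is given, so the variables declared
      exogenous (INCOME and COST) are used as the instruments */
   fit demand supply / n2sls;
run;
quit;
```

Adding an INSTRUMENTS statement to this step would replace the default instrument list taken from the EXOGENOUS declaration.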
FIT Statement
FIT < equations > < PARMS=( parameter < values > . . . ) > < START=( parameter values . . . ) > < DROP=( parameter . . . ) > < INITIAL=( variable < = parameter | constant > . . . ) > < / options > ;
The FIT statement estimates model parameters by fitting the model equations to input data and optionally selects the equations to be fit. If the list of equations is omitted, all model equations that contain parameters are fitted. The following options can be used in the FIT statement.
DROP= ( parameters . . . )
specifies that the named parameters not be estimated. All the parameters in the equations fit are estimated except those listed in the DROP= option. The dropped parameters retain their previous values and are not changed by the estimation.
INITIAL= ( variable = < parameter | constant > . . . )
associates a variable with an initial value as a parameter or a constant. This option applies only to ordinary differential equations. See the section Ordinary Differential Equations on page 1066 for more information.
PARMS= ( parameters [values] . . . )
selects a subset of the parameters for estimation. When the PARMS= option is used, only the named parameters are estimated. Any parameters not specified in the PARMS= list retain their previous values and are not changed by the estimation. In PROC MODEL, you have several options to specify starting values for the parameters to be estimated. When more than one option is specified, the options are implemented in the following order of precedence (from highest to lowest): the START= option, the PARMS statement initialization value, the ESTDATA= option, and the PARMSDATA= option. If no options are specified for the starting value, the default value of 0.0001 is used.
PRL= WALD | LR | BOTH
requests confidence intervals on estimated parameters. By default, the PRL option produces 95% likelihood ratio confidence limits. The coverage of the confidence interval is controlled by the ALPHA= option in the FIT statement.
START= ( parameter values . . . )
supplies starting values for the parameter estimates. In PROC MODEL, you have several options to specify starting values for the parameters to be estimated. When more than one option is specified, the options are implemented in the following order of precedence (from highest to lowest): the START= option, the PARMS statement initialization value, the ESTDATA= option, and the PARMSDATA= option. If no options are specified for the starting value, the default value of 0.0001 is used. If the START= option specifies more than one starting value for one or more parameters, a grid search is performed over all combinations of the values, and the best combination is used to start the iterations. For more information, see the STARTITER= option.
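The following statements are a minimal sketch of the precedence rules and the grid search described above. The data set MYDATA, the variables Y and X, and the parameters A, B, and C are hypothetical.

```sas
/* MYDATA, Y, X, and the parameter names are hypothetical */
proc model data=mydata;
   parms a 1 b 2 c;               /* PARMS statement initial values for A and B */
   y = a + b * exp(c * x);

   /* START= takes precedence over the PARMS value for A.
      A has 3 candidate values and C has 2, so 3 x 2 = 6 combinations
      are compared in the grid search; B is not listed in START=, so
      it starts at 2, the PARMS statement value */
   fit y start=(a 0.5 1 1.5 c -0.1 -0.01);
run;
quit;
```

If C were omitted from both the PARMS statement initialization and the START= option, and no ESTDATA= or PARMSDATA= data set were given, it would start at the default value of 0.0001.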
ADJSMMV
specifies adding the variance adjustment from simulating the moments to the variance-covariance matrix of the parameter estimators. By default, no adjustment is made.
COVBEST=GLS | CROSS | FDA
specifies the variance-covariance estimator used for FIML. COVBEST=GLS selects the generalized least squares estimator. COVBEST=CROSS selects the crossproducts estimator. COVBEST=FDA selects the inverse of the finite difference approximation to the Hessian. The default is COVBEST=CROSS.
DYNAMIC
specifies dynamic estimation of ordinary differential equations. See the section Ordinary Differential Equations on page 1066 for more details.
FIML
specifies full information maximum likelihood estimation.
GINV=G2 | G4
specifies the type of generalized inverse to be used when computing the covariance matrix. G4 selects the Moore-Penrose generalized inverse. The default is GINV=G2. Rather than deleting linearly related rows and columns of the covariance matrix, the Moore-Penrose generalized inverse averages the variance effects between collinear rows. When the option GINV=G4 is used, the Moore-Penrose generalized inverse is used to calculate standard errors and the covariance matrix of the parameters as well as the change vector for the optimization problem. For singular systems, a normal G2 inverse is used to determine the singular rows so that the parameters can be marked in the parameter estimates table. A G2 inverse is calculated by satisfying the first two properties of the Moore-Penrose generalized inverse; that is, $AA^{+}A = A$ and $A^{+}AA^{+} = A^{+}$. Whether or not you use a G4 inverse, if the covariance matrix is singular, the parameter estimates are not unique. Refer to Noble and Daniel (1977, pp. 337-340) for more details about generalized inverses.
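For reference, the four defining properties of the Moore-Penrose generalized inverse $A^{+}$ of a real matrix $A$ are listed below. This is the standard textbook statement rather than text taken from this guide. A G2 inverse is required to satisfy only the first two properties, whereas the G4 (Moore-Penrose) inverse satisfies all four.

```latex
\begin{aligned}
&(1)\quad A A^{+} A = A \\
&(2)\quad A^{+} A A^{+} = A^{+} \\
&(3)\quad (A A^{+})' = A A^{+} \\
&(4)\quad (A^{+} A)' = A^{+} A
\end{aligned}
```

Properties (3) and (4) are the symmetry conditions that make the Moore-Penrose inverse unique; a G2 inverse is in general not unique.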
GENGMMV
specifies the GMM variance under an arbitrary weighting matrix. See the section Estimation Methods on page 1007 for more details. This is the default method for GMM estimation.
GMM
specifies generalized method of moments estimation.
HCCME= 0 | 1 | 2 | 3 | NO
specifies the type of heteroscedasticity-consistent covariance matrix estimator to use for OLS, 2SLS, 3SLS, SUR, and the iterated versions of these estimation methods. The number corresponds to the type of covariance matrix estimator to use as
$HC_0: \hat{\epsilon}_t^2$
$HC_1: \frac{n}{n - df}\, \hat{\epsilon}_t^2$
$HC_2: \hat{\epsilon}_t^2 / (1 - \hat{h}_t)$
$HC_3: \hat{\epsilon}_t^2 / (1 - \hat{h}_t)^2$
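The following statements are a minimal sketch of requesting one of these estimators for an OLS fit. The data set MYDATA, the variables Y and X, and the parameters B0 and B1 are hypothetical.

```sas
/* MYDATA, Y, X, B0, and B1 are hypothetical names */
proc model data=mydata;
   parms b0 b1;
   y = b0 + b1*x;

   /* HCCME=1 requests the HC1 form, which rescales the squared
      residuals of HC0 by n/(n - df) */
   fit y / ols hccme=1;
run;
quit;
```

The parameter estimates are the same as in an ordinary OLS fit; only the estimated covariance matrix of the parameters, and therefore the reported standard errors, changes with the HCCME= value.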
ITOLS
specifies iterated ordinary least squares estimation. This is the same as OLS unless there are cross-equation parameter restrictions.
ITSUR
specifies iterated seemingly unrelated regression estimation.
IT2SLS
specifies iterated two-stage least squares estimation. This is the same as 2SLS unless there are cross-equation parameter restrictions.
IT3SLS
specifies iterated three-stage least squares estimation.
KERNEL=( PARZEN | BART | QS < , c, e > )
specifies the kernel to be used for GMM and ITGMM. PARZEN selects the Parzen kernel, BART selects the Bartlett kernel, and QS selects the quadratic spectral kernel. $e \ge 0$ and $c \ge 0$ are used to compute the bandwidth parameter. The default is KERNEL=(PARZEN, 1, 0.2). See the section Estimation Methods on page 1007 for more details.
N2SLS | 2SLS
species nonlinear two-stage least squares estimation. This is the default when an INSTRUMENTS statement is used.
N3SLS | 3SLS
specifies nonlinear three-stage least squares estimation.
NDRAW < =number of draws >
requests the simulation method for estimation. H is the number of draws. If the number of draws is not specified, the default H is set to 10.
NOOLS NO2SLS
specifies bypassing OLS or 2SLS to get initial parameter estimates for GMM, ITGMM, or FIML. This is important for certain models that are poorly defined in OLS or 2SLS, or if good initial parameter values are already provided. Note that for GMM, the V matrix is created by using the initial values specified and this might not be consistently estimated.
NO3SLS
specifies not to use 3SLS automatically for FIML initial parameter starting values.
NOGENGMMV
specifies not to use the GMM variance under an arbitrary weighting matrix. The GMM variance under the optimal weighting matrix is used instead. See the section "Estimation Methods" on page 1007 for more details.
NPREOBS =number of obs to initialize
specifies the initial number of observations to run the simulation before the simulated values are compared to observed variables. This option is most useful in cases where the program statements involve lag operations. Use this option to avoid the effect of the starting point on the simulation.
NVDRAW =number of draws for V matrix
specifies $H'$, the number of draws for the V matrix. If this option is not specified, the default $H'$ is set to 20.
OLS
specifies the denominator to be used in computing variances and covariances, MSE, root MSE measures, and so on. VARDEF=N specifies that the number of nonmissing observations be used. VARDEF=WGT specifies that the sum of the weights be used. VARDEF=DF specifies that the number of nonmissing observations minus the model degrees of freedom (number of parameters) be used. VARDEF=WDF specifies that the sum of the weights minus the model degrees of freedom be used. The default is VARDEF=DF. VARDEF=N is the only option used for FIML estimation regardless of any option specified.
specifies the input data set. Values for the variables in the program are read from this data set. If the DATA= option is not specified on the FIT statement, the data set specified by the DATA= option on the PROC MODEL statement is used.
ESTDATA=SAS-data-set
specifies a data set whose first observation provides initial values for some or all of the parameters.
MISSING=PAIRWISE | DELETE
The option MISSING=PAIRWISE specifies that missing values are tracked on an equation-by-equation basis. The MISSING=DELETE option specifies that the entire observation is omitted from the analysis when any equation has a missing predicted or actual value for the equation. The default is MISSING=DELETE.
OUT=SAS-data-set
names the SAS data set to contain the residuals, predicted values, or actual values from each estimation. The residual values written to the OUT= data set are defined as actual - predicted, which is the negative of RESID.variable as defined in the section "Equation Translations" on page 1155. Only the residuals are output by default.
OUTACTUAL
writes the actual values of the endogenous variables of the estimation to the OUT= data set. This option is applicable only if the OUT= option is specified.
OUTALL
writes the covariance matrix of the estimates to the OUTEST= data set in addition to the parameter estimates. The OUTCOV option is applicable only if the OUTEST= option is also specified.
OUTEST=SAS-data-set
names the SAS data set to contain the parameter estimates and optionally the covariance of the estimates.
OUTLAGS
writes the observations used to start the lags to the OUT= data set. This option is applicable only if the OUT= option is specified.
OUTPREDICT
writes the predicted values to the OUT= data set. This option is applicable only if the OUT= option is specified.
OUTRESID
writes the residual values computed from the parameter estimates to the OUT= data set. The OUTRESID option is the default if neither OUTPREDICT nor OUTACTUAL is specified. This option is applicable only if the OUT= option is specified. If the h.var equation is specified, the residual values written to the OUT= data set are the normalized residuals, defined as actual - predicted, divided by the square root of the h.var value. If the WEIGHT statement is used, the residual values are calculated as actual - predicted multiplied by the square root of the WEIGHT variable.
OUTS=SAS-data-set
names the SAS data set to contain the estimated covariance matrix of the equation errors. This is the covariance of the residuals computed from the parameter estimates.
OUTSN=SAS-data-set
names the SAS data set to contain the estimated normalized covariance matrix of the equation errors. This is valid for multivariate t distribution estimation.
OUTSUSED=SAS-data-set
names the SAS data set to contain the S matrix used in the objective function definition. The OUTSUSED= data set is the same as the OUTS= data set for the methods that iterate the S matrix.
OUTUNWGTRESID
writes the unweighted residual values computed from the parameter estimates to the OUT= data set. These are residuals computed as actual - predicted with no accounting for the WEIGHT statement, the _WEIGHT_ variable, or any variance expressions. This option is applicable only if the OUT= option is specified.
OUTV=SAS-data-set
names the SAS data set to contain the estimate of the variance matrix for GMM and ITGMM.
SDATA=SAS-data-set
specifies a data set that provides the covariance matrix of the equation errors. The matrix read from the SDATA= data set is used for the equation covariance matrix (S matrix) in the estimation. (The SDATA= S matrix is used to provide only the initial estimate of S for the methods that iterate the S matrix.)
TIME=name
specifies the name of the time variable. This variable must be in the data set.
TYPE=name
specifies the estimation type to read from the SDATA= and ESTDATA= data sets. The name specified in the TYPE= option is compared to the _TYPE_ variable in the ESTDATA= and SDATA= data sets to select observations to use in constructing the covariance matrices. When the TYPE= option is omitted, the last estimation type in the data set is used. Valid values are the estimation methods used in PROC MODEL.
VDATA=SAS-data-set
specifies a data set that contains a variance matrix for GMM and ITGMM estimation. See the section "Output Data Sets" on page 1110 for details.
specifies the modified Breusch-Pagan test, where variable-list is a list of variables used to model the error variance.
prints the Chow test for break points or structural changes in a model. The argument is the number of observations in the first sample or a parenthesized list of first sample sizes. If the size of one of the two groups into which the sample is partitioned is less than the number of parameters, then a predictive Chow test is automatically used. See the section "Chow Tests" on page 1081 for details.
COLLIN
prints collinearity diagnostics for the Jacobian crossproducts matrix (XPX) after the parameters have converged. Collinearity diagnostics are also automatically printed if the estimation fails to converge.
CORR
prints the correlation matrices of the residuals and parameters. Using CORR is the same as using both CORRB and CORRS.
CORRB
prints the covariance matrices of the residuals and parameters. Specifying COV is the same as specifying both COVB and COVS.
COVB
prints Durbin-Watson d statistics, which measure autocorrelation of the residuals. When the residual series is interrupted by missing observations, the Durbin-Watson statistic calculated is $d'$ as suggested by Savin and White (1978). This is the usual Durbin-Watson computed by ignoring the gaps. Savin and White show that it has the same null distribution as the DW with no gaps in the series and can be used to test for autocorrelation using the standard tables. The Durbin-Watson statistic is not valid for models that contain lagged endogenous variables. You can use the DW= option to request higher-order Durbin-Watson statistics. Since the ordinary Durbin-Watson statistic tests only for first-order autocorrelation, the Durbin-Watson statistics for higher-order autocorrelation are called generalized Durbin-Watson statistics.
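As a reminder of what these statistics measure, the usual textbook form of the order-$j$ Durbin-Watson statistic is sketched below. The symbols $\hat{e}_t$ (residual at observation $t$) and $n$ (number of observations) are notation introduced here for illustration; the expression assumes a residual series with no gaps and is not meant to describe how the procedure handles missing observations.

```latex
d_j = \frac{\sum_{t=j+1}^{n} \left( \hat{e}_t - \hat{e}_{t-j} \right)^2}
           {\sum_{t=1}^{n} \hat{e}_t^{\,2}},
\qquad j = 1, 2, \ldots
```

Setting $j = 1$ gives the ordinary Durbin-Watson d statistic; larger values of $j$ give the generalized statistics.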
DWPROB
prints the significance level (p-values) for the Durbin-Watson tests. Since the Durbin-Watson p-values are computationally expensive, they are not reported by default. In the Durbin-Watson test, the null hypothesis is that there is autocorrelation at a specific lag.
See the section "Generalized Durbin-Watson Tests" in Chapter 8, "The AUTOREG Procedure," for limitations of the statistic.
FSRSQ
prints the first-stage $R^2$ statistics for instrumental estimation methods. These $R^2$ statistics measure the proportion of the variance retained when the Jacobian columns associated with the parameters are projected through the instruments space.
GODFREY GODFREY=n
performs Godfrey's tests for autocorrelated residuals for each equation, where n is the maximum autoregressive order, and specifies that Godfrey's tests be computed for lags 1 through n. The default number of lags is one.
HAUSMAN
prints the predictive Chow test for break points or structural changes in a model. The argument is the number of observations in the first sample or a parenthesized list of first sample sizes. See the section "Chow Tests" on page 1081 for details.
PRINTALL
specifies the printing options COLLIN, CORRB, CORRS, COVB, COVS, DETAILS, DW, and FSRSQ.
WHITE
specifies all iteration printing-control options (I, ITDETAILS, ITPRINT, and XPX). ITALL also prints the crossproducts matrix (labeled CROSS), the parameter change vector, and the estimate of the cross-equation covariance of residuals matrix at each iteration.
ITDETAILS
prints a detailed iteration listing. This includes the ITPRINT information and additional statistics.
ITPRINT
prints the parameter estimates, objective function value, and convergence criteria at each iteration.
XPX
specifies the convergence criteria. The convergence measure must be less than value1 before convergence is assumed. value2 is the convergence criterion for the S and V matrices for S and V iterated methods. value2 defaults to value1. See the section "Convergence Criteria" on page 1028 for details. The default value is CONVERGE=0.001.
HESSIAN=CROSS | GLS | FDA
specifies the Hessian approximation used for FIML. HESSIAN=CROSS selects the crossproducts approximation to the Hessian, HESSIAN=GLS selects the generalized least squares approximation to the Hessian, and HESSIAN=FDA selects the finite difference approximation to the Hessian. HESSIAN=GLS is the default.
LTEBOUND=n
specifies the local truncation error bound for the integration. This option is ignored if no ordinary differential equations (ODEs) are specified.
EPSILON =value
specifies the tolerance value used to transform strict inequalities into inequalities when restrictions on parameters are imposed. By default, EPSILON=1E-8. See the section "Restrictions and Bounds on Parameters" on page 1076 for details.
MAXITER=n
specifies the maximum number of subiterations allowed for an iteration. For the GAUSS method, the MAXSUBITER= option limits the number of step halvings. For the MARQUARDT method, the MAXSUBITER= option limits the number of times $\lambda$ can be increased. The default is MAXSUBITER=30. See the section "Minimization Methods" on page 1027 for details.
METHOD=GAUSS | MARQUARDT
specifies the iterative minimization method to use. METHOD=GAUSS specifies the Gauss-Newton method, and METHOD=MARQUARDT specifies the Marquardt-Levenberg method. The default is METHOD=GAUSS. If the default GAUSS method fails to converge, the procedure switches to the MARQUARDT method. See the section "Minimization Methods" on page 1027 for details.
MINTIMESTEP=n
specifies the smallest allowed time step to be used in the integration. This option is ignored if no ODEs are specified.
NESTIT
changes the way the iterations are performed for estimation methods that iterate the estimate of the equation covariance (S matrix). The NESTIT option is relevant only for the methods that iterate the estimate of the covariance matrix (ITGMM, ITOLS, ITSUR, IT2SLS, and IT3SLS). See the section Details on the Covariance of Equation Errors on page 1026 for an explanation of NESTIT.
SINGULAR=value
specifies the number of minimization iterations to perform at each grid point. The default is STARTITER=0, which implies that no minimization is performed at the grid points. See the section "Using the STARTITER Option" on page 1035 for more details.
Other Options
Other options that can be used on the FIT statement include the following that list and analyze the model: BLOCK, GRAPH, LIST, LISTCODE, LISTDEP, LISTDER, and XREF. The following printing control options are also available: DETAILS, FLOW, INTGPRINT, MAXERRORS=, NOPRINT, PRINTALL, and TRACE. For complete descriptions of these options, see the discussion of the PROC MODEL statement options earlier in this chapter.
ID Statement
ID variables ;
The ID statement specifies variables to identify observations in error messages or other listings and in the OUT= data set. The ID variables are normally SAS date or datetime variables. If more than one ID variable is used, the first variable is used to identify the observations; the remaining variables are added to the OUT= data set.
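A minimal sketch of the ID statement in use is shown here. The data set SALES and the variables DATE, Y, and X are hypothetical names chosen for illustration, and the one-equation model is only a placeholder.

```sas
/* Hypothetical input: WORK.SALES with variables DATE, Y, and X */
proc model data=sales;
   parms a b;
   y = a + b * x;
   id date;                          /* DATE identifies each observation   */
   fit y / out=resid_out outresid;   /* ID variable is carried into OUT=   */
run;
quit;
```

With this arrangement, the residuals in RESID_OUT can be matched back to calendar dates without a separate merge step.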
INCLUDE Statement
INCLUDE model-names . . . ;
The INCLUDE statement reads model files and inserts their contents into the current model. However, instead of replacing the current model as the RESET MODEL= option does, the contents of included model files are inserted into the model program at the position where the INCLUDE statement appears.
INSTRUMENTS Statement
INSTRUMENTS variables < _EXOG_ > ; INSTRUMENTS < variables-list > < _EXOG_ > < EXCLUDE =( parameters ) > < / options > ; INSTRUMENTS (equation, variables) (equation, variables) . . . ;
The INSTRUMENTS statement specifies the instrumental variables to be used in the N2SLS, N3SLS, IT2SLS, IT3SLS, GMM, and ITGMM estimation methods. There are three ways of specifying the INSTRUMENTS statement.

The first form of the INSTRUMENTS statement is declared before a FIT statement and defines the default instruments list. The items specified as instruments can be variables or the special keyword _EXOG_. The keyword _EXOG_ indicates that all the model variables declared EXOGENOUS are to be added to the instruments list. If a single INSTRUMENTS statement of the first form is declared before multiple FIT statements, then it serves as the default instruments list for each of the FIT statements. However, if any of these FIT statements is followed by a separate INSTRUMENTS statement, then the latter takes precedence over the default list. Hence, in the case of multiple FIT statements, the INSTRUMENTS statement for a particular FIT statement is written below the FIT statement if instruments other than the default are required. For a single FIT statement, you can declare the INSTRUMENTS statement of the first form either preceding or following the FIT statement.

The second form of the INSTRUMENTS statement is used only after the FIT statement and before the next RUN statement. The items specified as instruments for the second form can be variables, names of parameters to be estimated, or the special keyword _EXOG_. If you specify the name of a parameter in the instruments list, the partial derivatives of the equations with respect to the parameter (that is, the columns of the Jacobian matrix associated with the parameter) are used as instruments. The parameter itself is not used as an instrument. These partial derivatives should not depend on any of the parameters to be estimated. Only the names of parameters to be estimated can be specified.

Note that only an INSTRUMENTS statement of the first form declared before multiple FIT statements serves as the default instruments list. Hence, in the cases of multiple as well as single FIT statements, you can declare the second form of the INSTRUMENTS statement only following the FIT statements. If a FIT statement is preceded in error by an INSTRUMENTS statement of the second form and is not followed by any INSTRUMENTS statement, then the default list is used. This default list is given by the INSTRUMENTS statement of the first form as explained above. If such a list is not declared, all the model variables declared EXOGENOUS comprise the default.

A third form of the INSTRUMENTS statement is used to specify instruments for each equation. No explicit intercept is added, parameters cannot be specified to represent instruments, and the _EXOG_ keyword is not allowed. Equations not explicitly assigned instruments use all the instruments specified for the other equations as well as instruments not assigned to specific equations. In the following statements, z1, z2, and z3 are instruments used with equation y1, and z2, z3, and z4 are instruments used with equation y2.
   proc model data=data_sim;
      exogenous x1 x2;
      parms a b c d e f;
      y1 = a*x1**2 + b*x2**2 + c*x1*x2;
      y2 = d*x1**2 + e*x2**2 + f*x1*x2**2;
      fit y1 y2 / 3sls;
      instruments (y1, z1 z2 z3) (y2, z2 z3 z4);
   run;
EXCLUDE=(parameters)
specifies that the derivatives of the equations with respect to all of the parameters to be estimated (except the parameters listed in the EXCLUDE list) be used as instruments, in addition to the other instruments specified. If you use the EXCLUDE= option, you should be sure that the derivatives with respect to the nonexcluded parameters in the estimation are independent of the endogenous variables and not functions of the parameters estimated. The following options can be specified on the INSTRUMENTS statement following a slash (/):
NOINTERCEPT NOINT
excludes the constant of 1.0 (intercept) from the instruments list. An intercept is included as an instrument while using the first or second form of the INSTRUMENTS statement unless NOINTERCEPT is specified. When a FIT statement specifies an instrumental variables estimation method and no INSTRUMENTS statement accompanies the FIT statement, the default instruments are used. If no default instruments list has been specified, all the model variables declared EXOGENOUS are used as instruments. See the section "Choice of Instruments" on page 1084 for more details.
INTONLY
specifies that only the intercept be used as an instrument. This option is used for GMM estimation where the moments have been specified explicitly.
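The precedence rules for the first form can be illustrated with a short sketch. The data set DEMAND and the variables Q, P, INCOME, COST, and WEATHER are hypothetical, and the two-equation model is only a placeholder; the placement of the statements follows the description given for the first form of the INSTRUMENTS statement.

```sas
/* Hypothetical input: WORK.DEMAND with Q, P, INCOME, COST, WEATHER */
proc model data=demand;
   endogenous q p;
   exogenous income cost;
   parms a0 a1 a2 b0 b1 b2;
   q = a0 + a1 * p + a2 * income;     /* demand equation                  */
   p = b0 + b1 * q + b2 * cost;       /* supply equation, solved for p    */
   instruments _exog_;                /* default list for every FIT       */
   fit q p / 2sls;                    /* uses the default list            */
   fit q p / 3sls;
   instruments income cost weather;   /* follows this FIT, so it replaces */
                                      /* the default list for 3SLS only   */
run;
quit;
```

Here the 2SLS fit uses INCOME and COST (through _EXOG_) plus the intercept, while the 3SLS fit additionally uses WEATHER.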
LABEL Statement
LABEL variable=label . . . ;
The LABEL statement specifies a label of up to 255 characters for parameters and other variables used in the model program. Labels are used to identify parts of the printout of FIT and SOLVE tasks. The labels are displayed in the output if the LINESIZE= option is large enough.
MOMENT Statement
MOMENT variables = moment specification ;
In many scenarios, endogenous variables are observed from data. From the models, you can simulate these endogenous variables based on a fixed set of parameters. The goal of simulated method of moments (SMM) is to find a set of parameters such that the moments of the simulated data match the moments of the observed variables. If there are many moments to match, the code might be tedious. The following MOMENT statement provides a way to generate some commonly used moments automatically. Multiple MOMENT statements can be used.

variables can be one or more endogenous variables. moment specification can have the following four types:

( number list ) specifies that the endogenous variable is raised to the power specified by each number in number list. For example,
moment y = (2 3);
ABS( number list ) specifies that the absolute value of the endogenous variable is raised to the power specified by each number in number list. For example,
moment y = ABS(3);
LAGnum ( number list ) specifies that the endogenous variable is multiplied by the num th lag of the endogenous variable, and this product is raised to the power specified by each number in number list. For example,
moment y = LAG4(3);
ABS_LAGnum ( number list ) specifies that the endogenous variable is multiplied by the num th lag of the endogenous variable, and the absolute value of this product is raised to the power specified by each number in number list. For example,
moment y = ABS_LAG4(3);
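Read literally, the four example specifications above correspond to the following transformed quantities of the endogenous variable. This is only a restatement of the descriptions in symbols; $y_t$ denotes the endogenous variable at observation $t$, a notation introduced here for illustration, and the display does not attempt to show the full moment conditions that the procedure constructs.

```latex
\begin{align*}
\texttt{moment y = (2 3);}       &\quad\longrightarrow\quad y_t^{2},\; y_t^{3} \\
\texttt{moment y = ABS(3);}      &\quad\longrightarrow\quad |y_t|^{3} \\
\texttt{moment y = LAG4(3);}     &\quad\longrightarrow\quad (y_t\, y_{t-4})^{3} \\
\texttt{moment y = ABS\_LAG4(3);} &\quad\longrightarrow\quad |y_t\, y_{t-4}|^{3}
\end{align*}
```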
The following PROC MODEL statements use the MOMENT statement to generate 24 moments and fit these moments using SMM.
   proc model data=_tmpdata list;
      parms a b .5 s 1;
      instrument _exog_ / intonly;
      u = rannor( 10091 );
      z = rannor( 97631 );
      lsigmasq = xlag(sigmasq,exp(a));
      lnsigmasq = a + b * log(lsigmasq) + s * u;
      sigmasq = exp( lnsigmasq );
      y = sqrt(sigmasq) * z;
      moment y = (2 4) abs(1 3) abs_lag1(1 2) abs_lag2(1 2);
      moment y = abs_lag3(1 2) abs_lag4(1 2) abs_lag5(1 2) abs_lag6(1 2)
                 abs_lag7(1 2) abs_lag8(1 2) abs_lag9(1 2) abs_lag10(1 2);
      fit y / gmm npreobs=20 ndraw=10;
      bound s > 0, 1>b>0;
   run;
OUTVARS Statement
OUTVARS variables ;
The OUTVARS statement specifies additional variables defined in the model program to be output to the OUT= data sets. The OUTVARS statement is not needed unless the variables to be added to the output data set are not referred to by the model, or unless you want to include parameters or other special variables in the OUT= data set. The OUTVARS statement includes additional variables, whereas the KEEP statement excludes variables.
PARAMETERS Statement
PARAMETERS variable < value > < variable < value > > . . . ;
The PARAMETERS statement declares the parameters of a model and optionally sets their initial values. Valid abbreviations are PARMS and PARM. Each parameter has a single value associated with it, which is the same for all observations. Lagging is not relevant for parameters. If a value is not specified in the PARMS statement (or by the PARMS= option of a FIT statement), the value defaults to 0.0001 for FIT tasks and to a missing value for SOLVE tasks.
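A short sketch of mixing parameters with and without starting values follows. The data set MYDATA and the variables Y, X1, and X2 are hypothetical; the style of listing a value after some names and not others mirrors the PARMS statements used in the examples of this chapter.

```sas
/* Hypothetical input: WORK.MYDATA with variables Y, X1, and X2 */
proc model data=mydata;
   parms a 0.5 b c 2;        /* a starts at 0.5 and c at 2; b has no value */
                             /* listed and takes the default start value   */
   y = a + b * x1 + c * x2;
   fit y;
run;
quit;
```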
Programming Statements
To define the model, you can use most of the programming statements that are allowed in the SAS DATA step. See the SAS Language Reference: Dictionary for more information.
RANGE Statement
RANGE variable < = first > < TO last > ;
The RANGE statement specifies the range of observations to be read from the DATA= data set. For FIT tasks, the RANGE statement controls the period of fit for the estimation. For SOLVE tasks, the RANGE statement controls the simulation period or forecast horizon. The RANGE variable must be a numeric variable in the DATA= data set that identifies the observations, and the data set must be sorted by the RANGE variable. The first observation in the range is identified by first, and the last observation is identified by last.

PROC MODEL uses the first l observations prior to first to initialize the lags, where l is the maximum number of lags needed to evaluate any of the equations to be fit or solved, or the maximum number of lags needed to compute any of the instruments when an instrumental variables estimation method is used. There should be at least l observations in the data set before first. If last is not specified, all the nonmissing observations starting with first are used. If first is omitted, the first l observations are used to initialize the lags, and the rest of the data, until last, is used. If a RANGE statement is used but both first and last are omitted, the RANGE statement variable is used to report the range of observations processed.

The RANGE variable should be nonmissing for all observations. Observations that contain missing RANGE values are deleted. The following are examples of RANGE statements:
   range year = 1971 to 1988;            /* yearly data                                 */
   range date = '1feb73'd to '1nov82'd;  /* monthly data                                */
   range time = 60.5;                    /* time in years                               */
   range year to 1977;                   /* use all years through 1977                  */
   range date;                           /* use values of date to report period-of-fit  */
If no RANGE statements follow multiple FIT statements and a single RANGE statement is declared before all the FIT statements, estimation in each of the multiple FIT statements is based on the data specified in the single RANGE statement. A single RANGE statement following multiple FIT statements affects only the fit immediately preceding it. If the FIT statement is both followed by and preceded by RANGE statements, the following RANGE statement takes precedence over the preceding RANGE statement. In the case where a range of data is to be used for a particular SOLVE task, the RANGE statement should be specified following the SOLVE statement in the case of either single or multiple SOLVE statements.
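The placement rules for RANGE statements around multiple FIT statements can be sketched as follows. The data set MYDATA and the variables YEAR, Y, and X are hypothetical, and the comments restate the precedence described above rather than output observed from a run.

```sas
/* Hypothetical input: WORK.MYDATA sorted by YEAR, with variables Y and X */
proc model data=mydata;
   parms a b;
   y = a + b * x;
   range year = 1971 to 1988;   /* declared before the FITs: the default   */
   fit y / ols;                 /* estimated over 1971 through 1988        */
   fit y / ols;
   range year = 1980 to 1988;   /* follows the second FIT, so it takes     */
                                /* precedence for that fit only            */
run;
quit;
```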
RESET Statement
RESET options ;
All of the options of the PROC MODEL statement can be reset by the RESET statement. In addition, the RESET statement supports one additional option:
PURGE
deletes the current model so that a new model can be defined. When the MODEL= option is used in the RESET statement, the current model is deleted before the new model is read.
RESTRICT Statement
RESTRICT restriction1 < , restriction2 . . . > ;
The RESTRICT statement is used to impose linear and nonlinear restrictions on the parameter estimates. RESTRICT statements refer to the parameters estimated by the associated FIT statement (that is, to either the preceding FIT statement or, in the absence of a preceding FIT statement, to the following FIT statement). You can specify any number of RESTRICT statements. Each restriction is written as an optional name, followed by an expression, followed by an equality operator (=) or an inequality operator (<, >, <=, >=), followed by a second expression:
< "name" > expression operator expression
The optional "name" is a string used to identify the restriction in the printed output and in the OUTEST= data set. The operator can be =, <, >, <=, or >=. The operator and second expression are optional. Restriction expressions can be composed of parameter names, arithmetic operators, functions, and constants. Comparison operators (such as = or <) and logical operators (such as &) cannot be used in RESTRICT statement expressions. Parameters named in restriction expressions must be among the parameters estimated by the associated FIT statement. Expressions can refer to variables defined in the program. The restriction expressions can be linear or nonlinear functions of the parameters. The following is an example of the use of the RESTRICT statement:
   proc model data=one;
      endogenous y1 y2;
      exogenous x1 x2;
      parms a b c;
      restrict b*(b+c) <= a;
      eq.one = -y1/c + a/x2 + b * x1**2 + c * x2**2;
      eq.two = -y2 * y1 + b * x2**2 - c/(2 * x1);
      fit one two / fiml;
   run;
SOLVE Statement
SOLVE variables < SATISFY= equations > < /options > ;
The SOLVE statement specifies that the model be simulated or forecast for input data values and, optionally, selects the variables to be solved. If the list of variables is omitted, all of the model variables declared ENDOGENOUS are solved. If no model variables are declared ENDOGENOUS, then all model variables are solved. The following specification can be used in the SOLVE statement:
SATISFY=equation SATISFY=( equations )
specifies a subset of the model equations that the solution values are to satisfy. If the SATISFY= option is not used, the solution is computed to satisfy all the model equations. Note that the number of equations must equal the number of variables solved.
names the input data set. The model is solved for each observation read from the DATA= data set. If the DATA= option is not specified on the SOLVE statement, the data set specified by the DATA= option in the PROC MODEL statement is used.
ESTDATA=SAS-data-set
names a data set whose first observation provides values for some or all of the parameters and whose additional observations (if any) give the covariance matrix of the parameter estimates.
The covariance matrix read from the ESTDATA= data set is used to generate multivariate normal pseudo-random shocks to the model parameters when the RANDOM= option requests Monte Carlo simulation.
OUT=SAS-data-set
outputs the predicted (solution) values, residual values, actual values, or equation errors from the solution to a data set. The residual values are the actual - predicted values, which is the negative of RESID.variable as defined in the section "Equation Translations" on page 1155. Only the solution values are output by default.
OUTACTUAL
outputs the actual values of the solved variables read from the input data set to the OUT= data set. This option is applicable only if the OUT= option is specified.
OUTALL
writes the equation errors to the OUT= data set. These values are normally very close to zero when a simultaneous solution is computed; they can be used to double-check the accuracy of the solution process. This option is applicable only if the OUT= option is specified.
OUTLAGS
writes the observations used to start the lags to the OUT= data set. This option is applicable only if the OUT= option is specified.
OUTPREDICT
writes the solution values to the OUT= data set. This option is relevant only if the OUT= option is specified. The OUTPREDICT option is the default unless one of the other output options is used.
OUTRESID
writes the residual values to the OUT= data set. These are computed as the actual - predicted values and are not the same as the RESID.variable values. This option is applicable only if the OUT= option is specified.
PARMSDATA=SAS-data-set
specifies a data set that contains the parameter estimates. See the section "Input Data Sets" on page 1104 for more details.
RESIDDATA=SAS-data-set
specifies a data set that contains the residuals that are to be used in the empirical distribution. This data set can be created using the OUT= option on the FIT statement.
SDATA=SAS-data-set
specifies a data set that provides the covariance matrix of the equation errors. The covariance matrix read from the SDATA= data set is used to generate multivariate normal pseudo-random shocks to the equations when the RANDOM= option requests Monte Carlo simulation.
TYPE=name
specifies the estimation type. The name specified in the TYPE= option is compared to the _TYPE_ variable in the ESTDATA= and SDATA= data sets to select observations to use in constructing the covariance matrices. When TYPE= is omitted, the last estimation type in the data set is used.
specifies a dynamic solution. In the dynamic solution mode, solved values are used by the lagging functions. DYNAMIC is the default.
NAHEAD=n
specifies a simulation of n-period-ahead dynamic forecasting. The NAHEAD= option is used to simulate the process of using the model to produce successive forecasts to a fixed forecast horizon, in which each forecast uses the historical data available at the time the forecast is made. Note that NAHEAD=1 produces a static (one-step-ahead) solution. NAHEAD=2 produces a solution that uses one-step-ahead solutions for the first lag (LAG1 functions return static predicted values) and actual values for longer lags. NAHEAD=3 produces a solution that uses NAHEAD=2 solutions for the first lags, NAHEAD=1 solutions for the second lags, and actual values for longer lags. In general, NAHEAD=n solutions use NAHEAD=n-1 solutions for LAG1, NAHEAD=n-2 solutions for LAG2, and so forth.
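The recursion can be made concrete with a hypothetical first-order model $y_t = a + b\,y_{t-1}$, which is not taken from this chapter and is used only for illustration. Writing $\hat{y}_t^{(n)}$ for the NAHEAD=n solution at observation $t$ (a notation introduced here), and treating the actual value as the zero-step case, the description above amounts to the following sketch.

```latex
\begin{align*}
\hat{y}_t^{(0)} &= y_t \quad \text{(actual value)} \\
\hat{y}_t^{(1)} &= a + b\, y_{t-1} \\
\hat{y}_t^{(2)} &= a + b\, \hat{y}_{t-1}^{(1)} = a + b\,(a + b\, y_{t-2}) \\
\hat{y}_t^{(n)} &= a + b\, \hat{y}_{t-1}^{(n-1)}
\end{align*}
```

Each additional step ahead pushes the last actual value used one observation further into the past.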
START=s
specifies static solutions until the sth observation and then changes to dynamic solutions. If the START=s option is specified, the first observation in the range in which LAGn delivers solved predicted values is s+n, while LAGn returns actual values for earlier observations.
STATIC
specifies a static solution. In static solution mode, actual values of the solved variables from the input data set are used by the lagging functions.
specifies that the actual value of a solved variable is used as the solution value (instead of the predicted value from the model equations) whenever nonmissing data are available in the input data set. That is, in FORECAST mode, PROC MODEL solves only for those variables that are missing in the input data set.
SIMULATE
specifies that PROC MODEL always solves for all solution variables as a function of the input values of the other variables, even when actual data for some of the solution variables are available in the input data set. SIMULATE is the default.
computes a simultaneous solution by using Newton's method. When the NEWTON option is selected, the analytic derivatives of the equation errors with respect to the solution variables are computed, and memory-efficient sparse matrix techniques are used for factoring the Jacobian matrix. The NEWTON option can be used to solve both normalized-form and general-form equations and can compute goal-seeking solutions. NEWTON is the default.
SEIDEL
specifies a single-equation (nonsimultaneous) solution. The model is executed once to compute predicted values for the variables from the actual values of the other endogenous variables. The SINGLE option can be used only for normalized-form equations and cannot be used for goal-seeking solutions. For more information on these options, see the section "Solution Modes" on page 1117.
specifies the copula to be used in the simulation. The normal (Gaussian) copula is the default. The copula applies to the covariance of the equation errors.
PSEUDO=DEFAULT | TWISTER
specifies which pseudo-random number generator is to be used in generating draws for Monte Carlo simulation. The two pseudo-random number generators supported by the MODEL procedure are a default congruential generator, which has period $2^{31} - 1$, and the Mersenne-Twister pseudo-random number generator, which has an extraordinarily long period of $2^{19937} - 1$.
QUASI=NONE|SOBOL|FAURE
specifies a pseudo- or quasi-random number generator. Two quasi-random number generators are supported by the MODEL procedure: the Sobol sequence (QUASI=SOBOL) and the Faure sequence (QUASI=FAURE). The default is QUASI=NONE, which is the pseudo-random number generator.
RANDOM=n
repeats the solution n times for each BY group, with different random perturbations of the equation errors if the SDATA= option is used; with different random perturbations of the parameters if the ESTDATA= option is used and the ESTDATA= data set contains a parameter covariance matrix; and with different values returned from the random number generator functions, if any are used in the model program. If RANDOM=0, the random number generator functions always return zero. See the section "Monte Carlo Simulation" on page 1120 for details. The default is RANDOM=0.
SEED=n
specifies an integer to use as the seed in generating pseudo-random numbers to shock the parameters and equations when the ESTDATA= or the SDATA= options are specified. If n is negative or zero, the time of day from the computer's clock is used as the seed. The SEED= option is relevant only if the RANDOM= option is used. The default is SEED=0.
WISHART=df
specifies that a Wishart distribution with degrees of freedom df be used in place of the normal error covariance matrix. This option is used to model the variance of the error covariance matrix when Monte Carlo simulation is selected.
CONVERGE=value
specifies the convergence criterion for the simultaneous solution. Convergence of the solution is judged by comparing the CONVERGE= value to the maximum over the equations of
\[ \frac{|\epsilon_i|}{|y_i| + 10^{-6}} \]
if they are computable, otherwise
\[ |\epsilon_i| \]
where $\epsilon_i$ represents the equation error and $y_i$ represents the solution variable that corresponds to the ith equation for normalized-form equations. The default is CONVERGE=1E-8.
MAXITER=n
specifies the maximum number of iterations allowed for computing the simultaneous solution for any observation. The default is MAXITER=50.
MAXSUBITER=n
specifies the maximum number of damping subiterations that are performed in solving a nonlinear system when using the NEWTON solution method. Damping is disabled by setting MAXSUBITER=0. The default is MAXSUBITER=10.
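The interaction of the convergence measure and the iteration limit can be illustrated outside SAS. The following Python sketch is not PROC MODEL code: the function names, the single-equation setting, and the example equation y = 1 + 1/y are assumptions chosen for illustration, and damping subiterations (MAXSUBITER=) are not modeled.

```python
def convergence_measure(errors, solutions):
    """Return the maximum over equations of |eps_i| / (|y_i| + 1e-6)."""
    return max(abs(e) / (abs(y) + 1e-6) for e, y in zip(errors, solutions))


def solve_newton(f, dfdy, y0, converge=1e-8, maxiter=50):
    """Solve one normalized-form equation y = f(y) by Newton's method.

    The equation error is eps = y - f(y). Iteration stops when the scaled
    error falls below `converge`; at most `maxiter` Newton steps are taken.
    Returns the solution and the number of steps used.
    """
    y = y0
    for iteration in range(maxiter):
        eps = y - f(y)
        if convergence_measure([eps], [y]) < converge:
            return y, iteration
        # Newton step on g(y) = y - f(y), whose derivative is 1 - f'(y).
        y -= eps / (1.0 - dfdy(y))
    raise RuntimeError("no convergence within maxiter iterations")


# Hypothetical example: y = 1 + 1/y, whose positive root is the golden ratio.
solution, steps = solve_newton(lambda y: 1.0 + 1.0 / y,
                               lambda y: -1.0 / (y * y),
                               y0=1.0)
```

Because the error is scaled by the size of the solution variable, the same CONVERGE= value gives a comparable relative accuracy for variables of very different magnitudes.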
Printing Options
INTGPRINT
prints between data points integration values for the DERT. variables and the auxiliary variables. If you specify the DETAILS option, the integrated derivative variables are printed as well.
ITPRINT
prints the solution approximation and equation errors at each iteration for each observation. This option can produce voluminous output.
PRINTALL
specifies the printing control options DETAILS, ITPRINT, SOLVEPRINT, STATS, and THEIL.
SOLVEPRINT
THEIL
prints tables of Theil inequality coefficients and Theil relative change forecast error measures for the solution values. See the section "Summary Statistics" on page 1135 for more information.
Other Options
Other options that can be used on the SOLVE statement include the following that list and analyze the model: BLOCK, GRAPH, LIST, LISTCODE, LISTDEP, LISTDER, and XREF. The LTEBOUND= and MINTIMESTEP= options can be used to control the integration process. The following printing-control options are also available: DETAILS, FLOW, MAXERRORS=, NOPRINT, and TRACE. For complete descriptions of these options, see the PROC MODEL and FIT statement options described earlier in this chapter.
TEST Statement
TEST < "name" > test1 < , test2 . . . > < ,/ options > ;
The TEST statement performs tests of nonlinear hypotheses on the model parameters. The TEST statement applies to the parameters estimated by the associated FIT statement (that is, either the preceding FIT statement or, in the absence of a preceding FIT statement, the following FIT statement). You can specify any number of TEST statements. If you specify options on the TEST statement, a comma is required before the / character that separates the test expressions from the options, because the / character can also be used within test expressions to indicate division. Labels for TEST and ESTIMATE statements can be up to 256 characters long. A longer label is truncated to 256 characters, and a note is printed to the log. Each test is written as an expression optionally followed by an equal sign (=) and a second expression:
expression < = expression >
Test expressions can be composed of parameter names, arithmetic operators, functions, and constants. Comparison operators (such as =) and logical operators (such as &) cannot be used in TEST statement expressions. Parameters named in test expressions must be among the parameters estimated by the associated FIT statement. If you specify only one expression in a test, that expression is tested against zero. For example, the following two TEST statements are equivalent:
   test a + b;
   test a + b = 0;
When you specify multiple tests in the same TEST statement, a joint test is performed. For example, the following TEST statement tests the joint hypothesis that both A and B are equal to zero.
test a, b;
To perform separate tests rather than a joint test, use separate TEST statements. For example, the following TEST statements test the two separate hypotheses that A is equal to zero and that B is equal to zero.
   test a;
   test b;
OUT=
specifies the name of an output SAS data set that contains the test results. The format of the OUT= data set produced by the TEST statement is similar to that of the OUTEST= data set produced by the FIT statement.
VAR Statement
VAR variables < initial-values > . . . ;
The VAR statement declares model variables and optionally provides initial values for the lags of the variables. See the section Lag Logic on page 1161 for more information.
WEIGHT Statement
WEIGHT variable ;
The WEIGHT statement specifies a variable to supply weighting values to use for each observation in estimating parameters. If the weight of an observation is nonpositive, that observation is not used for the estimation. The variable must be a numeric variable in the input data set. An alternative weighting method is to use an assignment statement to give values to the special variable _WEIGHT_. The _WEIGHT_ variable must not depend on the parameters being estimated. If both weighting specifications are given, the weights are multiplied together.
Estimation Methods
Consider the general nonlinear model:
\[ \epsilon_t = q(y_t, x_t, \theta) \]
\[ z_t = Z(x_t) \]
where $q \in R^g$ is a real vector valued function of $y_t \in R^g$, $x_t \in R^l$, $\theta \in R^p$, where g is the number of equations, l is the number of exogenous variables (lagged endogenous variables are considered exogenous here), p is the number of parameters, and t ranges from 1 to n. $z_t \in R^k$ is a vector of instruments. $\epsilon_t$ is an unobservable disturbance vector with the following properties:
\[ E(\epsilon_t) = 0 \]
\[ E(\epsilon_t \epsilon_t') = \Sigma \]
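All of the methods implemented in PROC MODEL aim to minimize an objective function. The following table summarizes the objective functions that define the estimators and the corresponding estimator of the covariance of the parameter estimates for each method.
Table 18.2 Summary of PROC MODEL Estimation Methods

| Method | Instruments | Objective Function | Covariance of $\theta$ |
|--------|-------------|--------------------|------------------------|
| OLS    | No  | $r'r/n$ | $(X'(\mathrm{diag}(S)^{-1} \otimes I)X)^{-1}$ |
| ITOLS  | No  | $r'(\mathrm{diag}(S)^{-1} \otimes I)r/n$ | $(X'(\mathrm{diag}(S)^{-1} \otimes I)X)^{-1}$ |
| SUR    | No  | $r'(S_{OLS}^{-1} \otimes I)r/n$ | $(X'(S^{-1} \otimes I)X)^{-1}$ |
| ITSUR  | No  | $r'(S^{-1} \otimes I)r/n$ | $(X'(S^{-1} \otimes I)X)^{-1}$ |
| N2SLS  | Yes | $r'(I \otimes W)r/n$ | $(X'(\mathrm{diag}(S)^{-1} \otimes W)X)^{-1}$ |
| IT2SLS | Yes | $r'(\mathrm{diag}(S)^{-1} \otimes W)r/n$ | $(X'(\mathrm{diag}(S)^{-1} \otimes W)X)^{-1}$ |
| N3SLS  | Yes | $r'(S_{N2SLS}^{-1} \otimes W)r/n$ | $(X'(S^{-1} \otimes W)X)^{-1}$ |
| IT3SLS | Yes | $r'(S^{-1} \otimes W)r/n$ | $(X'(S^{-1} \otimes W)X)^{-1}$ |
| GMM    | Yes | $[n m_n(\theta)]' \hat{V}_{N2SLS}^{-1} [n m_n(\theta)]/n$ | $[(YX)' \hat{V}^{-1} (YX)]^{-1}$ |
| ITGMM  | Yes | $[n m_n(\theta)]' \hat{V}^{-1} [n m_n(\theta)]/n$ | $[(YX)' \hat{V}^{-1} (YX)]^{-1}$ |
| FIML   | No  | $\mathrm{constant} + \frac{n}{2}\ln(\det(S)) - \sum_{t=1}^{n} \ln \lvert J_t \rvert$ | $[\hat{Z}'(S^{-1} \otimes I)\hat{Z}]^{-1}$ |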
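The column labeled Instruments identifies the estimation methods that require instruments. The variables used in this table and the remainder of this chapter are defined as follows:
$n$ is the number of nonmissing observations.
$g$ is the number of equations.
$k$ is the number of instrumental variables used.
$r = (r_1', r_2', \ldots, r_g')'$ is the $ng \times 1$ vector of residuals for the g equations stacked together.
$r_i = (q_i(y_1, x_1, \theta), q_i(y_2, x_2, \theta), \ldots, q_i(y_n, x_n, \theta))'$ is the $n \times 1$ column vector of residuals for the ith equation.
$S$ is a $g \times g$ matrix that estimates $\Sigma$, the covariances of the errors across equations (referred to as the S matrix).
$X$ is an $ng \times p$ matrix of partial derivatives of the residual with respect to the parameters.
$W$ is an $n \times n$ matrix, $Z(Z'Z)^{-1}Z'$.
$Z$ is an $n \times k$ matrix of instruments.
$Y$ is a $gk \times ng$ matrix of instruments, $Y = I_g \otimes Z'$.
$\hat{Z} = (\hat{Z}_1, \hat{Z}_2, \ldots, \hat{Z}_p)$ is an $ng \times p$ matrix. $\hat{Z}_i$ is a column vector obtained from stacking the columns of
\[ U \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\partial q(y_t, x_t, \theta)'}{\partial y_t} \right)^{-1} \frac{\partial^2 q(y_t, x_t, \theta)'}{\partial y_t \, \partial \theta_i} - Q_i \]
$U$ is an $n \times g$ matrix of residuals.
$Q$ is the $n \times g$ matrix $\{q(y_1, x_1, \theta), q(y_2, x_2, \theta), \ldots, q(y_n, x_n, \theta)\}$.
$Q_i$ is an $n \times g$ matrix $\partial Q / \partial \theta_i$.
$I$ is an $n \times n$ identity matrix.
$J_t$ is $\partial q(y_t, x_t, \theta) / \partial y_t'$, which is a $g \times g$ Jacobian matrix.
$m_n$ is the first moment of the crossproduct $q(y_t, x_t, \theta) \otimes z_t$, $m_n = \frac{1}{n} \sum_{t=1}^{n} q(y_t, x_t, \theta) \otimes z_t$.
$z_t$ is a k column vector of instruments for observation t. $z_t'$ is also the tth row of Z.
$\hat{V}$ is the $gk \times gk$ matrix that represents the variance of the moment functions.
constant is the constant $\frac{ng}{2}(1 + \ln(2\pi))$.
$\otimes$ is the notation for a Kronecker product.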
All vectors are column vectors unless otherwise noted. Other estimates of the covariance matrix for FIML are also available.
In the first equation, $y_2$ is a dependent, or endogenous, variable. As shown by the second equation, $y_2$ is a function of $y_1$, which by the first equation is a function of $\epsilon_1$, and therefore $y_2$ depends on $\epsilon_1$. Likewise, $y_1$ depends on $\epsilon_2$ and is a dependent regressor in the second equation. This is an example of a simultaneous equation system; $y_1$ and $y_2$ are a function of all the variables in the system. Using the ordinary least squares (OLS) estimation method to estimate these equations produces biased estimates. One solution to this problem is to replace $y_1$ and $y_2$ on the right-hand side of the equations with predicted values, thus changing the regression problem to the following:
\[ y_1 = a_1 + b_1 \hat{y}_2 + c_1 x_1 + \epsilon_1 \]
\[ y_2 = a_2 + b_2 \hat{y}_1 + c_2 x_2 + \epsilon_2 \]
This method requires estimating the predicted values $\hat{y}_1$ and $\hat{y}_2$ through a preliminary, or first stage, instrumental regression. An instrumental regression is a regression of the dependent regressors on a set of instrumental variables, which can be any independent variables useful for predicting the dependent regressors. In this example, the equations are linear and the exogenous variables for the whole system are known. Thus, the best choice for instruments (of the variables in the model) are the variables $x_1$ and $x_2$. This method is known as two-stage least squares or 2SLS, or more generally as the instrumental variables method. The 2SLS method for linear models is discussed in Pindyck (1981, p. 191-192). For nonlinear models this situation is more complex, but the idea is the same. In nonlinear 2SLS, the derivatives of the model with respect to the parameters are replaced with predicted values. See the section "Choice of Instruments" on page 1084 for further discussion of the use of instrumental variables in nonlinear regression. To perform nonlinear 2SLS estimation with PROC MODEL, specify the instrumental variables with an INSTRUMENTS statement and specify the 2SLS or N2SLS option in the FIT statement. The following statements show how to estimate the first equation in the preceding example with PROC MODEL:
   proc model data=in;
      y1 = a1 + b1 * y2 + c1 * x1;
      fit y1 / 2sls;
      instruments x1 x2;
   run;
The 2SLS or instrumental variables estimator can be computed by using a first-stage regression on the instrumental variables as described previously. However, PROC MODEL actually uses the equivalent but computationally more appropriate technique of projecting the regression problem into the linear space defined by the instruments. Thus, PROC MODEL does not produce any first-stage results when you use 2SLS. If you specify the FSRSQ option in the FIT statement, PROC MODEL prints the first-stage $R^2$ statistic for each parameter estimate.
Formally, the $\hat{\theta}$ that minimizes
\[ \hat{S}_n = \frac{1}{n} \left( \sum_{t=1}^{n} (q(y_t, x_t, \theta) \otimes z_t) \right)' \left( \sum_{t=1}^{n} I \otimes z_t z_t' \right)^{-1} \left( \sum_{t=1}^{n} (q(y_t, x_t, \theta) \otimes z_t) \right) \]
is the N2SLS estimator of the parameters. The estimate of $\Sigma$ at the final iteration is used in the covariance of the parameters given in Table 18.2. See Amemiya (1985, p. 250) for details on the properties of nonlinear two-stage least squares.
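The equivalence between the explicit two-stage computation and the instrumental variables estimator can be checked numerically in the simplest linear case. The following Python sketch is an illustration only, not PROC MODEL's algorithm: it assumes a single dependent regressor x, a single instrument z, and made-up data, and all function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def two_stage_least_squares(y, x, z):
    """2SLS for y = a + b*x + error with one instrument z."""
    # First stage: regress the dependent regressor on the instrument.
    a1, b1 = ols_line(z, x)
    x_hat = [a1 + b1 * zi for zi in z]
    # Second stage: regress y on the first-stage predicted values.
    return ols_line(x_hat, y)


def iv_slope(y, x, z):
    """Instrumental variables slope computed directly: cov(z, y) / cov(z, x)."""
    mz, mx, my = mean(z), mean(x), mean(y)
    szy = sum((a - mz) * (b - my) for a, b in zip(z, y))
    szx = sum((a - mz) * (b - mx) for a, b in zip(z, x))
    return szy / szx


# Made-up illustrative data.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3]
y = [2.1, 3.2, 6.5, 7.4, 9.6, 12.5]
intercept, slope = two_stage_least_squares(y, x, z)
```

The two-stage slope agrees with the direct ratio up to rounding, which is why no separate first-stage output is needed to obtain the estimate.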
errors. The large-sample efficiency of an estimation can be improved if these cross-equation correlations are taken into account. SUR is also known as joint generalized least squares or Zellner regression. Formally, the $\hat{\theta}$ that minimizes
\[ \hat{S}_n = \frac{1}{n} \sum_{t=1}^{n} q(y_t, x_t, \theta)' \, \hat{\Sigma}^{-1} q(y_t, x_t, \theta) \]
is the SUR estimator of the parameters. The SUR method requires an estimate of the cross-equation covariance matrix, $\Sigma$. PROC MODEL first performs an OLS estimation, computes an estimate, $\hat{\Sigma}$, from the OLS residuals, and then performs the SUR estimation based on $\hat{\Sigma}$. The OLS results are not printed unless you specify the OLS option in addition to the SUR option. You can specify the $\hat{\Sigma}$ to use for SUR by storing the matrix in a SAS data set and naming that data set in the SDATA= option. You can also feed the $\hat{\Sigma}$ computed from the SUR residuals back into the SUR estimation process by specifying the ITSUR option. You can print the estimated covariance matrix $\hat{\Sigma}$ by using the COVS option in the FIT statement. The SUR method requires estimation of the $\Sigma$ matrix, and this increases the sampling variability of the estimator for small sample sizes. The efficiency gain that SUR has over OLS is a large sample property, and you must have a reasonable amount of data to realize this gain. For a more detailed discussion of SUR, see Pindyck and Rubinfeld (1981, p. 331-333).
\[ \cdots \left( \sum_{t=1}^{n} (q(y_t, x_t, \theta) \otimes z_t) \right) \]
is the 3SLS estimator of the parameters. For more details on 3SLS, see Gallant (1987, p. 435). Residuals from the 2SLS method are used to estimate the $\Sigma$ matrix required for 3SLS. The results of the preliminary 2SLS step are not printed unless the 2SLS option is also specified. To use the three-stage least squares method, specify an INSTRUMENTS statement and use the 3SLS or N3SLS option in either the PROC MODEL statement or a FIT statement.
\[ \epsilon_t = q(y_t, x_t, \theta) \]
\[ z_t = Z(x_t) \]
In general, the following orthogonality condition is desired:
\[ E(\epsilon_t \otimes z_t) = 0 \]
This condition states that the expected crossproducts of the unobservable disturbances, $\epsilon_t$, and functions of the observable variables are set to 0. The first moment of the crossproducts is
\[ m_n = \frac{1}{n} \sum_{t=1}^{n} m(y_t, x_t, \theta) \]
\[ m(y_t, x_t, \theta) = q(y_t, x_t, \theta) \otimes z_t \]
where $m(y_t, x_t, \theta) \in R^{gk}$. The case where $gk > p$ is considered here, where p is the number of parameters.
Estimate the true parameter vector $\theta^0$ by the value of $\hat{\theta}$ that minimizes
\[ S(\theta, V) = [n m_n(\theta)]' V^{-1} [n m_n(\theta)] / n \]
where
\[ V = \mathrm{Cov}\left( [n m_n(\theta^0)], [n m_n(\theta^0)]' \right) \]
The parameter vector that minimizes this objective function is the GMM estimator. GMM estimation is requested in the FIT statement with the GMM option. The variance of the moment functions, V, can be expressed as
\[ V = E\left[ \left( \sum_{t=1}^{n} \epsilon_t \otimes z_t \right) \left( \sum_{s=1}^{n} \epsilon_s \otimes z_s \right)' \right] = \sum_{t=1}^{n} \sum_{s=1}^{n} E\left[ (\epsilon_t \otimes z_t)(\epsilon_s \otimes z_s)' \right] = n S_n^0 \]
Note that $\hat{S}_n$ is a $gk \times gk$ matrix. Because $\mathrm{Var}(\hat{S}_n)$ does not decrease with increasing n, you consider estimators of $S_n^0$ of the form:
\[ \hat{S}_n(l(n)) = \sum_{\tau = -n+1}^{n-1} w\!\left( \frac{\tau}{l(n)} \right) D \hat{S}_{n,\tau} D \]
where $l(n)$ is a scalar function that computes the bandwidth parameter, $w(\cdot)$ is a scalar valued kernel, and the diagonal matrix D is used for a small sample degrees of freedom correction (Gallant 1987). The initial $\theta^{\#}$ used for the estimation of $\hat{S}_n$ is obtained from a 2SLS estimation of the system. The degrees of freedom correction is handled by the VARDEF= option as it is for the S matrix estimation. The following kernels are supported by PROC MODEL. They are listed with their default bandwidth functions.
Bartlett: KERNEL=BART
\[ w(x) = \begin{cases} 1 - |x| & |x| \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad l(n) = \frac{1}{2} n^{1/3} \]
Parzen: KERNEL=PARZEN
\[ w(x) = \begin{cases} 1 - 6|x|^2 + 6|x|^3 & 0 \le |x| \le \frac{1}{2} \\ 2(1 - |x|)^3 & \frac{1}{2} \le |x| \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad l(n) = n^{1/5} \]
Quadratic spectral: KERNEL=QS
\[ w(x) = \frac{25}{12 \pi^2 x^2} \left( \frac{\sin(6 \pi x / 5)}{6 \pi x / 5} - \cos(6 \pi x / 5) \right) \qquad l(n) = \frac{1}{2} n^{1/5} \]
Details of the properties of these and other kernels are given in Andrews (1991). Kernels are selected with the KERNEL= option; KERNEL=PARZEN is the default. The general form of the KERNEL= option is
KERNEL=( PARZEN | QS | BART, c, e )
where $e \ge 0$ and $c \ge 0$ are used to compute the bandwidth parameter as
\[ l(n) = c n^e \]
The bias of the standard error estimates increases for large bandwidth parameters. A warning message is produced for bandwidth parameters greater than $n^{1/3}$. For a discussion of the computation of the optimal $l(n)$, refer to Andrews (1991). The Newey-West kernel (Newey and West 1987) corresponds to the Bartlett kernel with bandwidth parameter $l(n) = L + 1$. That is, if the lag length for the Newey-West kernel is L, then the corresponding MODEL procedure syntax is KERNEL=(bart, L+1, 0). Andrews and Monahan (1992) show that using prewhitening in combination with GMM can improve confidence interval coverage and reduce over rejection of t statistics at the cost of inflating the variance and MSE of the estimator. Prewhitening can be performed by using the %AR macros.
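The kernel weights and the bandwidth rule can be evaluated directly to see how the KERNEL= arguments translate into lag weights. The following Python sketch restates the formulas above; it is not SAS code, and the function and dictionary names are assumptions made for illustration.

```python
import math


def bartlett(x):
    """Bartlett kernel: 1 - |x| for |x| <= 1, otherwise 0."""
    ax = abs(x)
    return 1.0 - ax if ax <= 1.0 else 0.0


def parzen(x):
    """Parzen kernel, defined piecewise on [0, 1/2] and [1/2, 1]."""
    ax = abs(x)
    if ax <= 0.5:
        return 1.0 - 6.0 * ax ** 2 + 6.0 * ax ** 3
    if ax <= 1.0:
        return 2.0 * (1.0 - ax) ** 3
    return 0.0


def quadratic_spectral(x):
    """Quadratic spectral kernel; the value at x = 0 is the limit, 1."""
    if x == 0.0:
        return 1.0
    a = 6.0 * math.pi * x / 5.0
    return 25.0 / (12.0 * math.pi ** 2 * x ** 2) * (math.sin(a) / a - math.cos(a))


def bandwidth(n, c, e):
    """Bandwidth parameter l(n) = c * n**e."""
    return c * n ** e


# Default (c, e) pairs implied by the default bandwidth functions listed above.
DEFAULT_BANDWIDTH = {"BART": (0.5, 1.0 / 3.0), "PARZEN": (1.0, 0.2), "QS": (0.5, 0.2)}


def newey_west_weights(lag_length):
    """Bartlett weights for lags 0..L with bandwidth L + 1 (c = L + 1, e = 0)."""
    l_n = bandwidth(1, lag_length + 1, 0)  # n is irrelevant when e = 0
    return [bartlett(tau / l_n) for tau in range(lag_length + 1)]
```

For example, `newey_west_weights(3)` gives the linearly declining weights 1, 0.75, 0.5, and 0.25 for lags 0 through 3, which is the weighting that KERNEL=(bart, 4, 0) selects.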
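For the special case that the errors are not serially correlated, that is,
\[ E\left[ (e_t \otimes z_t)(e_s \otimes z_s)' \right] = 0 \qquad t \ne s \]
the estimate for $S_n^0$ reduces to
\[ \hat{S}_n = \frac{1}{n} \sum_{t=1}^{n} [q(y_t, x_t, \theta) \otimes z_t][q(y_t, x_t, \theta) \otimes z_t]' \]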
The option KERNEL=(kernel,0,) is used to select this type of estimation when using GMM.
The covariance of GMM estimators, given a general weighting matrix $V_G^{-1}$, is
\[ [(YX)' V_G^{-1} (YX)]^{-1} (YX)' V_G^{-1} \hat{V} V_G^{-1} (YX) [(YX)' V_G^{-1} (YX)]^{-1} \]
By default or when GENGMMV is specified, this is the covariance of GMM estimators. If the weighting matrix is the same as $\hat{V}$, then the covariance of GMM estimators becomes
\[ [(YX)' \hat{V}^{-1} (YX)]^{-1} \]
Let r be the number of unique instruments times the number of equations. The value r represents the number of orthogonality conditions imposed by the GMM method. Under the assumptions of the GMM method, $r - p$ linearly independent combinations of the orthogonality conditions should be close to zero. The GMM estimates are computed by setting these combinations to zero. When r exceeds the number of parameters to be estimated, the OBJECTIVE*N, reported at the end of the estimation, is an asymptotically valid statistic to test the null hypothesis that the over-identifying restrictions of the model are valid. The OBJECTIVE*N is distributed as a chi-square with $r - p$ degrees of freedom (Hansen 1982, p. 1049).
Estimation Details
\[ \epsilon_t = q(y_t, x_t, \theta) \]
where $q \in R^g$ is a real vector valued function of $y_t \in R^g$, $x_t \in R^l$, $\theta \in R^p$; g is the number of equations; l is the number of exogenous variables (lagged endogenous variables are considered exogenous here); p is the number of parameters; and t ranges from 1 to n. $\epsilon_t$ is an unobservable disturbance vector with the following properties:
\[ E(\epsilon_t) = 0 \]
\[ E(\epsilon_t \epsilon_t') = \Sigma \]
In many cases, it is not possible to write $q(y_t, x_t, \theta)$ in a closed form. Instead q is expressed as an integral of a function f; that is,
\[ q(y_t, x_t, \theta) = \int f(y_t, x_t, \theta, u_t) \, dP(u) \]
where $f \in R^g$ is a real vector valued function of $y_t \in R^g$, $x_t \in R^l$, $\theta \in R^p$, and $u_t \in R^m$, and m is the number of stochastic variables with a known distribution $P(u)$. Since the distribution of u is completely known, it is possible to simulate artificial draws from this distribution. Using such independent draws $u_{ht}$, $h = 1, \ldots, H$, and the strong law of large numbers, q can be approximated by
\[ \frac{1}{H} \sum_{h=1}^{H} f(y_t, x_t, \theta, u_{ht}) \]
Generalized method of moments (GMM) is widely used to obtain efficient estimates for general model systems. When the moment conditions are not readily available in closed forms but can be approximated by simulation, simulated generalized method of moments (SGMM) can be used. The SGMM estimators have the nice property of being asymptotically consistent and normally distributed even if the number of draws H is fixed (see McFadden 1989; Pakes and Pollard 1989).
\[ \epsilon_t = q(y_t, x_t, \theta) \]
\[ z_t = Z(x_t) \]
where $z_t \in R^k$ is a vector of k instruments and $\epsilon_t$ is an unobservable disturbance vector that can be serially correlated and nonstationary. In the case of no instrumental variables, $z_t$ is 1. $q(y_t, x_t, \theta)$ is the vector of moment conditions, and it is approximated by simulation. In general, theory suggests the following orthogonality condition
\[ E(\epsilon_t \otimes z_t) = 0 \]
which states that the expected crossproducts of the unobservable disturbances, $\epsilon_t$, and functions of the observable variables are set to 0. The sample means of the crossproducts are
\[ m_n = \frac{1}{n} \sum_{t=1}^{n} m(y_t, x_t, \theta) \]
\[ m(y_t, x_t, \theta) = q(y_t, x_t, \theta) \otimes z_t \]
where $m(y_t, x_t, \theta) \in R^{gk}$. The case where $gk > p$, where p is the number of parameters, is considered here. An estimate of the true parameter vector $\theta^0$ is the value of $\hat{\theta}$ that minimizes
\[ S(\theta, V) = [n m_n(\theta)]' V^{-1} [n m_n(\theta)] / n \]
where
\[ V = \mathrm{Cov}\left( m(\theta^0), m(\theta^0)' \right) \]
The steps for SGMM are as follows:
1. Start with a positive definite $\hat{V}$ matrix. This $\hat{V}$ matrix can be estimated from a consistent estimator of $\theta$. If $\hat{\theta}$ is a consistent estimator, then $u_t$ for $t = 1, \ldots, n$ can be simulated $H'$ number of times. A consistent estimator of V is obtained as
\[ \hat{V} = \frac{1}{n} \sum_{t=1}^{n} \left[ \frac{1}{H'} \sum_{h=1}^{H'} f(y_t, x_t, \hat{\theta}, u_{ht}) \otimes z_t \right] \left[ \frac{1}{H'} \sum_{h=1}^{H'} f(y_t, x_t, \hat{\theta}, u_{ht}) \otimes z_t \right]' \]
$H'$ must be large so that this is a consistent estimator of V.
2. Simulate H number of $u_t$ for $t = 1, \ldots, n$. As shown by Gourieroux and Monfort (1993), the number of simulations H does not need to be very large. For $H = 10$, the SGMM estimator achieves 90% of the efficiency of the corresponding GMM estimator. Find the $\hat{\theta}$ that minimizes the quadratic product of the moment conditions again with the weight matrix being $\hat{V}^{-1}$:
\[ \min_{\theta} \; [n m_n(\theta)]' \hat{V}^{-1} [n m_n(\theta)] / n \]
3. The covariance matrix of the parameter estimates is given as
\[ \Sigma_1^{-1} D \hat{V}^{-1} V(\hat{\theta}) \hat{V}^{-1} D' \Sigma_1^{-1} + \frac{1}{H} \Sigma_1^{-1} D \hat{V}^{-1} E[z \otimes \mathrm{Var}(f|x) \otimes z] \hat{V}^{-1} D' \Sigma_1^{-1} \]
where $\Sigma_1 = D \hat{V}^{-1} D'$, D is the matrix of partial derivatives of the residuals with respect to the parameters, $V(\hat{\theta})$ is the covariance of moments from estimated parameters $\hat{\theta}$, and $\mathrm{Var}(f|x)$ is the covariance of moments for each observation from simulation. The first term is the variance-covariance matrix of the exact GMM estimator, and the second term accounts for the variation contributed by simulating the moments.
Implementation in PROC MODEL
In PROC MODEL, if the user specifies the GMM and NDRAW options in the FIT statement, PROC MODEL first fits the model by using N2SLS and computes $\hat{V}$ by using the estimates from N2SLS and $H'$ simulation. If NO2SLS is specified in the FIT statement, $\hat{V}$ is read from the VDATA= data set. If the user does not provide a $\hat{V}$ matrix, the initial starting value of $\theta$ is used as the estimator for computing the $\hat{V}$ matrix in step 1. If the ITGMM option is specified instead of GMM, then PROC MODEL iterates from step 1 to step 3 until the V matrix converges. The consistency of the parameter estimates is not affected by the variance correction shown in the second term in step 3. The correction on the variance of parameter estimates is not computed by default. To add the adjustment, use the ADJSMMV option in the FIT statement. This correction is of the order of $1/H$ and is small even for moderate H. The following example illustrates how to use SMM to estimate a simple regression model. Suppose the model is
\[ y = a + bx + u, \qquad u \sim \mathrm{iid}\; N(0, s^2) \]
First, consider the problem in a GMM context. The first two moments of y are easily derived:
\[ E(y) = a + bx \]
\[ E(y^2) = (a + bx)^2 + s^2 \]
Rewrite the moment conditions in the form similar to the discussion above:
\[ \epsilon_{1t} = y_t - (a + b x_t) \]
\[ \epsilon_{2t} = y_t^2 - (a + b x_t)^2 - s^2 \]
Then you can estimate this model by using GMM with the following statements:
   proc model data=a;
      parms a b s;
      instrument x;
      eq.m1 = y-(a+b*x);
      eq.m2 = y*y - (a+b*x)**2 - s*s;
      bound s > 0;
      fit m1 m2 / gmm;
   run;
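Now suppose you do not have the closed form for the moment conditions. Instead you can simulate the moment conditions by generating H number of simulated samples based on the parameters. Then the simulated moment conditions are
\[ \epsilon_{1t} = \frac{1}{H} \sum_{h=1}^{H} \{ y_t - (a + b x_t + s u_{t,h}) \} \]
\[ \epsilon_{2t} = \frac{1}{H} \sum_{h=1}^{H} \{ y_t^2 - (a + b x_t + s u_{t,h})^2 \} \]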
This model can be estimated by using SGMM with the following statements:
   proc model data=_tmpdata;
      parms a b s;
      instrument x;
      ysim = (a+b*x) + s * rannor( 98711 );
      eq.m1 = y-ysim;
      eq.m2 = y*y - ysim*ysim;
      bound s > 0;
      fit m1 m2 / gmm ndraw=10;
   run;
You can use the following MOMENT statement instead of specifying the two moment equations above:
moment ysim=(1, 2);
In cases where you require a large number of moment equations, using the MOMENT statement to specify them is more efficient. Note that the NDRAW= option tells PROC MODEL that this is a simulation-based estimation. Thus, the random number function RANNOR returns random numbers in the estimation process. During the simulation, 10 draws of m1 and m2 are generated for each observation, and the averages enter the objective functions just as the equations specified previously.
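The averaging over draws can be made concrete with a small computation. The following Python sketch is an illustration of the simulated moment conditions for this example, not a description of PROC MODEL internals: the function name is hypothetical, Python's random.Random generator stands in for RANNOR, and how PROC MODEL manages its random number stream across iterations is not modeled.

```python
import random


def simulated_moments(y, x, a, b, s, ndraw=10, seed=98711):
    """Average ndraw simulated values of m1 and m2 for each observation.

    m1 = y - ysim and m2 = y*y - ysim*ysim, where
    ysim = (a + b*x) + s*u and u is a standard normal draw.
    """
    rng = random.Random(seed)  # assumption: stands in for RANNOR
    m1, m2 = [], []
    for yt, xt in zip(y, x):
        sum1 = 0.0
        sum2 = 0.0
        for _ in range(ndraw):
            ysim = (a + b * xt) + s * rng.gauss(0.0, 1.0)
            sum1 += yt - ysim
            sum2 += yt * yt - ysim * ysim
        m1.append(sum1 / ndraw)
        m2.append(sum2 / ndraw)
    return m1, m2


# Made-up data; with s = 0 the simulated moments reduce to the
# closed-form moments y - (a + b*x) and y*y - (a + b*x)**2.
y_obs = [1.0, 2.5, 3.9]
x_obs = [0.0, 1.0, 2.0]
m1, m2 = simulated_moments(y_obs, x_obs, a=1.0, b=1.5, s=0.0)
```

With a positive s, each call with the same seed reproduces the same draws, so the averaged moments are a deterministic function of the parameters for a fixed seed.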
The simulation method can be used not only with GMM and ITGMM, but also with OLS, ITOLS, SUR, ITSUR, N2SLS, IT2SLS, N3SLS, and IT3SLS. These simulation-based methods are similar to the corresponding methods in PROC MODEL; the only difference is that the objective functions include the average of the H simulations.
Compared to the instrumental variables methods (2SLS and 3SLS), the FIML method has these advantages and disadvantages:
- FIML does not require instrumental variables.
- FIML requires that the model include the full equation system, with as many equations as there are endogenous variables. With 2SLS or 3SLS, you can estimate some of the equations without specifying the complete system.
- FIML assumes that the equations errors have a multivariate normal distribution. If the errors are not normally distributed, the FIML method might produce poor results. 2SLS and 3SLS do not assume a specific distribution for the errors.
- The FIML method is computationally expensive.
The full information maximum likelihood estimators of $\theta$ and $\sigma$ are the $\hat{\theta}$ and $\hat{\sigma}$ that minimize the negative log-likelihood function:
\[ l_n(\theta, \sigma) = \frac{ng}{2} \ln(2\pi) - \sum_{t=1}^{n} \ln \left| \frac{\partial q(y_t, x_t, \theta)}{\partial y_t'} \right| + \frac{n}{2} \ln |\Sigma(\sigma)| + \frac{1}{2} \mathrm{tr}\left( \Sigma(\sigma)^{-1} \sum_{t=1}^{n} q(y_t, x_t, \theta) q'(y_t, x_t, \theta) \right) \]
The option FIML requests full information maximum likelihood estimation. If the errors are distributed normally, FIML produces efficient estimators of the parameters. If instrumental variables are not provided, the starting values for the estimation are obtained from a SUR estimation. If instrumental variables are provided, then the starting values are obtained from a 3SLS estimation. The log-likelihood value and the $l_2$ norm of the gradient of the negative log-likelihood function are shown in the estimation summary.
FIML Details
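To compute the minimum of $l_n(\theta, \sigma)$, this function is concentrated using the relation
\[ \Sigma(\theta) = \frac{1}{n} \sum_{t=1}^{n} q(y_t, x_t, \theta) q'(y_t, x_t, \theta) \]
This results in the concentrated negative log-likelihood function:
\[ l_n(\theta) = \frac{ng}{2} (1 + \ln(2\pi)) - \sum_{t=1}^{n} \ln \left| \frac{\partial q(y_t, x_t, \theta)}{\partial y_t'} \right| + \frac{n}{2} \ln |\Sigma(\theta)| \]
The contribution of observation t to the gradient of this function with respect to $\theta_i$ is
\[ \nabla_i(t) = -\mathrm{tr}\left( \left( \frac{\partial q(y_t, x_t, \theta)}{\partial y_t'} \right)^{-1} \frac{\partial^2 q(y_t, x_t, \theta)}{\partial y_t' \, \partial \theta_i} \right) + \frac{1}{2} \mathrm{tr}\left( \Sigma(\theta)^{-1} \frac{\partial \Sigma(\theta)}{\partial \theta_i} \left[ I - \Sigma(\theta)^{-1} q(y_t, x_t, \theta) q(y_t, x_t, \theta)' \right] \right) + q'(y_t, x_t, \theta) \Sigma(\theta)^{-1} \frac{\partial q(y_t, x_t, \theta)}{\partial \theta_i} \]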
The estimator of the variance-covariance of $\hat{\theta}$ (COVB) for FIML can be selected with the COVBEST= option with the following arguments:
CROSS selects the crossproducts estimator of the covariance matrix (Gallant 1987, p. 473):
\[ C = \left( \frac{1}{n} \sum_{t=1}^{n} \nabla(t) \nabla'(t) \right)^{-1} \]
where $\nabla(t) = [\nabla_1(t), \nabla_2(t), \ldots, \nabla_p(t)]'$. This is the default.
GLS selects the generalized least squares estimator of the covariance matrix. This is computed as (Dagenais 1978)
\[ C = [\hat{Z}' (\Sigma(\theta)^{-1} \otimes I) \hat{Z}]^{-1} \]
where $\hat{Z} = (\hat{Z}_1, \hat{Z}_2, \ldots, \hat{Z}_p)$ is $ng \times p$ and each $\hat{Z}_i$ column vector is obtained from stacking the columns of
\[ U \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\partial q(y_t, x_t, \theta)'}{\partial y_t} \right)^{-1} \frac{\partial^2 q(y_t, x_t, \theta)'}{\partial y_t \, \partial \theta_i} - Q_i \]
U is an $n \times g$ matrix of residuals and $Q_i$ is an $n \times g$ matrix $\partial Q / \partial \theta_i$.
FDA selects the inverse of concentrated likelihood Hessian as an estimator of the covariance matrix. The Hessian is computed numerically, so for a large problem this is computationally expensive.
The HESSIAN= option controls which approximation to the Hessian is used in the minimization procedure. Alternate approximations are used to improve convergence and execution time. The choices are as follows:
CROSS   The crossproducts approximation is used.
GLS     The generalized least squares approximation is used (default).
FDA     The Hessian is computed numerically by finite differences.
HESSIAN=GLS has better convergence properties in general, but COVBEST=CROSS produces the most pessimistic standard error bounds. When the HESSIAN= option is used, the default estimator of the variance-covariance of $\hat{\theta}$ is the inverse of the Hessian selected.
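The density of the multivariate t distribution for the equation errors is
\[ \frac{\Gamma\left( \frac{df + m}{2} \right)}{(\pi \, df)^{m/2} \, \Gamma\left( \frac{df}{2} \right) |\Sigma(\sigma)|^{1/2}} \left( 1 + \frac{q'(y_t, x_t, \theta) \Sigma(\sigma)^{-1} q(y_t, x_t, \theta)}{df} \right)^{-\frac{df + m}{2}} \]
where m is the number of equations and df is the degrees of freedom. The maximum likelihood estimators of $\theta$ and $\sigma$ are the $\hat{\theta}$ and $\hat{\sigma}$ that minimize the negative log-likelihood function:
\[ l_n(\theta, \sigma) = -\sum_{t=1}^{n} \ln \left( \frac{\Gamma\left( \frac{df + m}{2} \right)}{(\pi \, df)^{m/2} \, \Gamma\left( \frac{df}{2} \right)} \left( 1 + \frac{q_t' \Sigma^{-1} q_t}{df} \right)^{-\frac{df + m}{2}} \right) + \frac{n}{2} \ln(|\Sigma|) - \sum_{t=1}^{n} \ln \left| \frac{\partial q_t}{\partial y_t'} \right| \]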
The ERRORMODEL statement is used to request the t distribution maximum likelihood estimation. An OLS estimation is done to obtain initial parameter estimates and MSE.var estimates. Use NOOLS to turn off this initial estimation. If the errors are distributed normally, t distribution estimation produces results similar to FIML. The multivariate model has a single shared degrees-of-freedom parameter, which is estimated. The degrees-of-freedom parameter can also be set to a xed value. The log-likelihood value and the l2 norm of the gradient of the negative log-likelihood function are shown in the estimation summary.
t Distribution Details
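Since a variance term is explicitly specified by using the ERRORMODEL statement, $\Sigma(\theta)$ is estimated as a correlation matrix and $q(y_t, x_t, \theta)$ is normalized by the variance. The gradient of the negative log-likelihood function with respect to the degrees of freedom is
\[ \frac{\partial l_n}{\partial df} = \frac{nm}{2 \, df} - \frac{n}{2} \frac{\Gamma'\left( \frac{df + m}{2} \right)}{\Gamma\left( \frac{df + m}{2} \right)} + \frac{n}{2} \frac{\Gamma'\left( \frac{df}{2} \right)}{\Gamma\left( \frac{df}{2} \right)} + \sum_{t=1}^{n} \left[ 0.5 \log\left( 1 + \frac{q' \Sigma^{-1} q}{df} \right) - \frac{0.5 (df + m)}{1 + \frac{q' \Sigma^{-1} q}{df}} \, \frac{q' \Sigma^{-1} q}{df^2} \right] \]
The gradient of the negative log-likelihood function with respect to the parameters is
\[ \frac{\partial l_n}{\partial \theta_i} = \sum_{t=1}^{n} \frac{0.5 (df + m)}{1 + q' \Sigma^{-1} q / df} \left[ \frac{2 q' \Sigma^{-1} \frac{\partial q}{\partial \theta_i} - q' \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_i} \Sigma^{-1} q}{df} \right] + \frac{n}{2} \mathrm{trace}\left( \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_i} \right) \]
where
\[ \frac{\partial \Sigma(\theta)}{\partial \theta_i} = \frac{2}{n} \sum_{t=1}^{n} q(y_t, x_t, \theta) \frac{\partial q(y_t, x_t, \theta)'}{\partial \theta_i} \]
and
\[ q(y_t, x_t, \theta) = \frac{\epsilon(\theta)}{\sqrt{h(\theta)}} \in R^{m \times n} \]
The estimator of the variance-covariance of $\hat{\theta}$ (COVB) for the t distribution is the inverse of the likelihood Hessian. The gradient is computed analytically, and the Hessian is computed numerically.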
      solve y / data=t(where=(date='1aug98'd))
                residdata=r sdata=s
                random=200 seed=6789 out=monte;
   run;

   proc kde data=monte;
      univar y / plots=density;
   run;

   ods graphics off;
For simulation, if the CDF for the model is not built in to the procedure, you can use the CDF=EMPIRICAL() option. This uses the sorted residual data to create an empirical CDF. For computing the inverse CDF, the program needs to know how to handle the tails. For continuous data, the tail distribution is generally poorly determined. To counter this, the PERCENT= option specifies the percentage of the observations to use in constructing each tail. The default for the PERCENT= option is 10. A normal distribution or a t distribution is used to extrapolate the tails to infinity. The standard errors for this extrapolation are obtained from the data so that the empirical CDF is continuous.
Missing Values
An observation is excluded from the estimation if any variable used for FIT tasks is missing, if the weight for the observation is not greater than 0 when weights are used, or if a DELETE statement is executed by the model program. Variables used for FIT tasks include the equation errors for each equation, the instruments, if any, and the derivatives of the equation errors with respect to the parameters estimated. Note that variables can become missing as a result of computational errors or calculations with missing values. The number of usable observations can change when different parameter values are used; some parameter values can be invalid and cause execution errors for some observations. PROC MODEL keeps track of the number of usable and missing observations at each pass through the data, and if the number of missing observations counted during a pass exceeds the number that was obtained using the previous parameter vector, the pass is terminated and the new parameter vector is considered infeasible. PROC MODEL never takes a step that produces more missing observations than the current estimate does. The values used to compute the Durbin-Watson, $R^2$, and other statistics of fit are from the observations used in calculating the objective function and do not include any observation for which any needed variable was missing (residuals, derivatives, and instruments).
Nested Iterations
By default, for S-iterated methods, the S matrix is held constant until the parameters converge once. Then the S matrix is reestimated. One iteration of the parameter estimation algorithm is performed, and the S matrix is again reestimated. This latter process is repeated until convergence of both the parameters and the S matrix. Since the objective of the minimization depends on the S matrix, this has the effect of chasing a moving target. When the NESTIT option is specified, iterations are performed to convergence for the structural parameters with a fixed S matrix. The S matrix is then reestimated, the parameter iterations are repeated to convergence, and so on until both the parameters and the S matrix converge. This has the effect of fixing the objective function for the inner parameter iterations. It is more reliable, but usually more expensive, to nest the iterations.
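As a minimal sketch of how nesting might be requested, the following statements fit a two-equation system by iterated SUR, first with the default scheme and then with NESTIT. The data set SIM, its variables, and the model equations are hypothetical and serve only to make the example self-contained.

```sas
/* Hypothetical data: two equations whose errors share a common shock */
data sim;
   call streaminit(123);
   do i = 1 to 100;
      x1 = rand('uniform') * 10;
      x2 = rand('uniform') * 10;
      u  = rand('normal');                  /* shared shock correlates the errors */
      y1 = 1 + 2.0 * x1 + u + rand('normal');
      y2 = 3 - 0.5 * x2 + u + rand('normal');
      output;
   end;
   drop i u;
run;

proc model data=sim;
   parms a1 b1 a2 b2;
   y1 = a1 + b1 * x1;
   y2 = a2 + b2 * x2;
   /* Default scheme: S is reestimated between single parameter iterations */
   fit y1 y2 / itsur itprint;
   /* Nested scheme: parameters are iterated to convergence for each fixed S */
   fit y1 y2 / itsur nestit itprint;
run;
quit;
```

Because the second FIT statement starts from the estimates left by the first, the two iteration histories are not a like-for-like timing comparison; the sketch only shows where the option goes.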
R-Square Statistic
For unrestricted linear models with an intercept successfully estimated by OLS, $R^2$ is always between 0 and 1. However, nonlinear models do not necessarily encompass the dependent mean as a special case and can produce negative $R^2$ statistics. Negative $R^2$ statistics can also be produced even for linear models when an estimation method other than OLS is used and no intercept term is in the model. $R^2$ is defined for normalized equations as
$$R^2 = 1 - \frac{SSE}{SSA - \bar{y}^2 \times n}$$
where SSA is the sum of the squares of the actual y's and $\bar{y}$ are the actual means. $R^2$ cannot be computed for models in general form because of the need for an actual Y.
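The following worked arithmetic, using made-up numbers rather than any output from this chapter, illustrates how the definition can yield a negative value. Assume n = 3 actual values y = (1, 2, 3) and a poorly fitting model whose SSE is 3.

```latex
\begin{aligned}
SSA &= 1^2 + 2^2 + 3^2 = 14, \qquad \bar{y} = 2, \qquad \bar{y}^2 \times n = 12, \\
R^2 &= 1 - \frac{SSE}{SSA - \bar{y}^2 \times n} = 1 - \frac{3}{14 - 12} = -0.5 .
\end{aligned}
```

The denominator is the corrected total sum of squares, so any model whose SSE exceeds it produces a negative statistic.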
Minimization Methods
PROC MODEL currently supports two methods for minimizing the objective function. These methods are described in the following sections.
GAUSS
The Gauss-Newton parameter-change vector for a system with g equations, n nonmissing observations, and p unknown parameters is
$$\Delta = (X'X)^{-1} X'r$$
where $\Delta$ is the change vector, X is the stacked $ng \times p$ Jacobian matrix of partial derivatives of the residuals with respect to the parameters, and r is an $ng \times 1$ vector of the stacked residuals. The components of X and r are weighted by the $S^{-1}$ matrix. When instrumental methods are used, X and r are the projections of the Jacobian matrix and residuals vector in the instruments space and not the Jacobian and residuals themselves. In the preceding formula, S and W are suppressed. If instrumental variables are used, then the change vector becomes:
$$\Delta = (X'(S^{-1} \otimes W)X)^{-1} X'(S^{-1} \otimes W)r$$
This vector is computed at the end of each iteration. The objective function is then computed at the changed parameter values at the start of the next iteration. If the objective function is not improved by the change, the $\Delta$ vector is reduced by one-half and the objective function is reevaluated. The change vector will be halved up to MAXSUBITER= times until the objective function is improved. If the objective function cannot be improved after MAXSUBITER= times, the procedure switches to the MARQUARDT method described in the next section to further improve the objective function. For FIML, the $X'X$ matrix is substituted with one of three choices for approximations to the Hessian. See the section Full Information Maximum Likelihood Estimation (FIML) on page 1019 in this chapter.
MARQUARDT
The Marquardt-Levenberg parameter change vector is
$$\Delta = (X'X + \lambda\,\mathrm{diag}(X'X))^{-1} X'r$$
where $\Delta$ is the change vector, and X and r are the same as for the Gauss-Newton method, described in the preceding section. Before the iterations start, $\lambda$ is set to a small value (1E-6). At each iteration, the objective function is evaluated at the parameters changed by $\Delta$. If the objective function is not improved, $\lambda$ is increased to $10\lambda$ and the step is tried again. $\lambda$ can be increased up to MAXSUBITER= times to a maximum of 1E15 (whichever comes first) until the objective function is improved. For the start of the next iteration, $\lambda$ is reduced to max($\lambda$/10, 1E-10).
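As a hedged illustration of selecting between the two methods, the following sketch fits the same power model once with METHOD=GAUSS and once with METHOD=MARQUARDT, limiting subiterations with MAXSUBITER=. The data set CURVE, its generating values, and the starting values are hypothetical.

```sas
/* Hypothetical data for a power-transformation model */
data curve;
   call streaminit(42);
   do x = 1 to 20;
      y = 8 + 4 * x ** 0.3 + 0.5 * rand('normal');
      output;
   end;
run;

/* Gauss-Newton: the step is halved up to MAXSUBITER= times */
proc model data=curve;
   parms a 8 b 4 c 0.5;
   y = a + b * x ** c;
   fit y / method=gauss maxsubiter=20 itdetails;
run;
quit;

/* Marquardt: lambda is increased up to MAXSUBITER= times */
proc model data=curve;
   parms a 8 b 4 c 0.5;
   y = a + b * x ** c;
   fit y / method=marquardt maxsubiter=20 itdetails;
run;
quit;
```

Two separate PROC MODEL steps are used so that both fits start from the same PARMS values; the ITDETAILS listing then shows the Stepsize or Lambda value used at each iteration.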
Convergence Criteria
There are a number of measures that could be used as convergence or stopping criteria. PROC MODEL computes five convergence measures labeled R, S, PPC, RPC, and OBJECT. When an estimation technique that iterates estimates of $\Sigma$ is used (that is, IT3SLS), two convergence criteria are used. The termination values can be specified with the CONVERGE=(p,s) option in the FIT statement. If the second value, s, is not specified, it defaults to p. The criterion labeled S (described later in the section) controls the convergence of the S matrix. When S is less than s, the S matrix has converged. The criterion labeled R is compared to the value of p to test convergence of the parameters. The R convergence measure cannot be computed accurately in the special case of singular residuals (when all the residuals are close to 0) or in the case of a 0 objective value. When either the trace of the S matrix computed from the current residuals (trace(S)) or the objective value is less than the value of the SINGULAR= option, convergence is assumed.
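The two termination values might be supplied as in the following sketch, which uses an S-iterated method so that both criteria apply. The data set SYS2 and the model are hypothetical, and the particular values 1E-4 and 1E-3 are arbitrary choices for illustration.

```sas
/* Hypothetical two-equation data with correlated errors */
data sys2;
   call streaminit(2024);
   do i = 1 to 80;
      x1 = rand('uniform') * 5;
      x2 = rand('uniform') * 5;
      u  = rand('normal');
      y1 = 2 + 1.5 * x1 + u + rand('normal');
      y2 = 1 + 0.8 * x2 + u + rand('normal');
      output;
   end;
   drop i u;
run;

proc model data=sys2;
   parms a1 b1 a2 b2;
   y1 = a1 + b1 * x1;
   y2 = a2 + b2 * x2;
   /* p = 1E-4 is compared with R; s = 1E-3 is compared with S */
   fit y1 y2 / itsur converge=(1e-4, 1e-3) maxiter=200;
run;
quit;
```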
The various convergence measures are explained in the following:

R is the primary convergence measure for the parameters. It measures the degree to which the residuals are orthogonal to the Jacobian columns, and it approaches 0 as the gradient of the objective function becomes small. R is defined as the square root of
$$\frac{r'(S^{-1} \otimes W)X\,(X'(S^{-1} \otimes W)X)^{-1}X'(S^{-1} \otimes W)r}{r'(S^{-1} \otimes W)r}$$
where X is the Jacobian matrix and r is the residuals vector. R is similar to the relative offset orthogonality convergence criterion proposed by Bates and Watts (1981).

In the univariate case, the R measure has several equivalent interpretations:
- the cosine of the angle between the residuals vector and the column space of the Jacobian matrix. When this cosine is 0, the residuals are orthogonal to the partial derivatives of the predicted values with respect to the parameters, and the gradient of the objective function is 0.
- the square root of the $R^2$ for the current linear pseudo-model in the residuals
- a norm of the gradient of the objective function, where the norming matrix is proportional to the current estimate of the covariance of the parameter estimates. Thus, using R, convergence is judged when the gradient becomes small in this norm.
- the prospective relative change in the objective function value expected from the next GAUSS step, assuming that the current linearization of the model is a good local approximation

In the multivariate case, R is somewhat more complicated but is designed to go to 0 as the gradient of the objective becomes small and can still be given the previous interpretations for the aggregation of the equations weighted by $S^{-1}$.

PPC is the prospective parameter change measure. PPC measures the maximum relative change in the parameters implied by the parameter-change vector computed for the next iteration. At the kth iteration, PPC is the maximum over the parameters
$$\frac{|\theta_i^{k+1} - \theta_i^{k}|}{|\theta_i^{k} + 1.0\mathrm{e}{-6}|}$$
where $\theta_i^{k}$ is the current value of the ith parameter and $\theta_i^{k+1}$ is the prospective value of this parameter after adding the change vector computed for the next iteration. The parameter with the maximum prospective relative change is printed with the value of PPC, unless the PPC is nearly 0.

RPC is the retrospective parameter change measure. RPC measures the maximum relative change in the parameters from the previous iteration. At the kth iteration, RPC is the maximum over i of
$$\frac{|\theta_i^{k} - \theta_i^{k-1}|}{|\theta_i^{k-1} + 1.0\mathrm{e}{-6}|}$$
where $\theta_i^{k}$ is the current value of the ith parameter and $\theta_i^{k-1}$ is the previous value of this parameter. The name of the parameter with the maximum retrospective relative change is printed with the value of RPC, unless the RPC is nearly 0.

OBJECT measures the relative change in the objective function value between iterations:
$$\frac{|O^{k} - O^{k-1}|}{|O^{k-1} + 1.0\mathrm{e}{-6}|}$$
where $O^{k-1}$ is the value of the objective function ($O^{k}$) from the previous iteration.

S measures the relative change in the S matrix. S is computed as the maximum over i, j of
$$\frac{|S_{ij}^{k} - S_{ij}^{k-1}|}{|S_{ij}^{k-1} + 1.0\mathrm{e}{-6}|}$$
where $S^{k-1}$ is the previous S matrix. The S measure is relevant only for estimation methods that iterate the S matrix.

An example of the convergence criteria output is shown in Figure 18.23.
Figure 18.23 Convergence Criteria Output
Final Convergence Criteria
R                  0.000737
PPC(b)             0.003943
RPC(b)             0.00968
Object             4.784E-6
Trace(S)           0.533325
Objective Value    0.522214
The Trace(S) is the trace (the sum of the diagonal elements) of the S matrix computed from the current residuals. This row is labeled MSE if there is only one equation.
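As a worked illustration of the relative-change measures, consider the OBJECT measure for two successive objective values, 0.3464 followed by 0.3282 (the values that appear at OLS iterations 0 and 1 in the ITPRINT listing of Figure 18.30), and the PPC measure for a hypothetical two-parameter step. The parameter values in the second computation are invented for the example.

```latex
\begin{aligned}
\text{OBJECT} &= \frac{|0.3282 - 0.3464|}{|0.3464 + 1.0\mathrm{e}{-6}|} \approx 0.0525, \\[4pt]
\theta^{k} &= (2.0,\ 0.5), \qquad \theta^{k+1} = (2.1,\ 0.5005), \\
\text{PPC} &= \max\left( \frac{0.1}{2.000001},\ \frac{0.0005}{0.500001} \right) \approx \max(0.05,\ 0.001) = 0.05 .
\end{aligned}
```

In the hypothetical step, the first parameter has the larger prospective relative change, so it is the one that would be named alongside the PPC value.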
estimates always converge in one iteration. The methods that iterate the S matrix must iterate further for the S matrix to converge. Nonlinear models might not necessarily converge. Convergence can be expected only with fully identified parameters, adequate data, and starting values sufficiently close to solution estimates. Convergence and the rate of convergence might depend primarily on the choice of starting values for the estimates. This does not mean that a great deal of effort should be invested in choosing starting values. First, try the default values. If the estimation fails with these starting values, examine the model and data and rerun the estimation using reasonable starting values. It is usually not necessary that the starting values be very good, just that they not be very bad; choose values that seem plausible for the model and data.
The following statements specify the model and give descriptive labels to the model parameters. Then the FIT statement attempts to estimate a, b, and c by using the default starting value 0.0001.
proc model data=test;
   y = a + b * x ** c;
   label a = "Intercept"
         b = "Coefficient of Transformed X"
         c = "Power Transformation Parameter";
   fit y;
run;
PROC MODEL prints model summary and estimation problem summary reports and then prints the output shown in Figure 18.24.
By using the default starting values, PROC MODEL is unable to take even the first step in iterating to the solution. The change in the parameters that the Gauss-Newton method computes is very extreme and makes the objective values worse instead of better. Even when this step is shortened by a factor of a million, the objective function is still worse, and PROC MODEL is unable to estimate the model parameters. The problem is caused by the starting value of C. Using the default starting value C=0.0001, the first iteration attempts to compute better values of A and B by what is, in effect, a linear regression of Y on the 10,000th root of X, which is almost the same as the constant 1. Thus the matrix that is inverted to compute the changes is nearly singular and affects the accuracy of the computed parameter changes. This is also illustrated by the next part of the output, which displays collinearity diagnostics for the crossproducts matrix of the partial derivatives with respect to the parameters, shown in Figure 18.25.
Figure 18.25 Collinearity Diagnostics
Collinearity Diagnostics

          Condition    -----Proportion of Variation-----
Number    Number       a          b          c
1         1.0000       0.0000     0.0000     0.0000
2         1.9529       0.0000     0.0000     0.0000
3         1187758      1.0000     1.0000     1.0000
This output shows that the matrix is singular and that the partial derivatives of the residual with respect to A, B, and C are collinear at the point (0.0001, 0.0001, 0.0001) in the parameter space. See the section Linear Dependencies on page 1041 for a full explanation of the collinearity diagnostics.
The MODEL procedure next prints the note shown in Figure 18.26, which suggests that you try different starting values.
Figure 18.26 Estimation Failure Note
NOTE: The parameter estimation is abandoned. Check your model and data. If the model is correct and the input data are appropriate, try rerunning the parameter estimation using different starting values for the parameter estimates. PROC MODEL continues as if the parameter estimates had converged.
PROC MODEL then produces the usual printout of results for the nonconverged parameter values. The estimation summary is shown in Figure 18.27. The heading includes the reminder (Not Converged).
Figure 18.27 Nonconverged Estimation Summary
The MODEL Procedure
OLS Estimation Summary (Not Converged)

Data Set Options
DATA=    TEST

Minimization Summary
Parameters Estimated       3
Method                     Gauss
Iterations                 100
Subiterations              239
Average Subiterations      2.39

Final Convergence Criteria
R                  0.962666
PPC(b)             548.2193
RPC(b)             540.3066
Object             2.654E-6
Trace(S)           4.667946
Objective Value    3.967754
Equation y
Nonlinear OLS Parameter Estimates (Not Converged)

Parameter    Approx Std Err    Approx Pr > |t|
a            263349            0.9996
b            263349            0.9996
c            4.4373            0.9996
Note that the $R^2$ statistic is negative. An $R^2 < 0$ results when the residual mean squared error for the model is larger than the variance of the dependent variable. Negative $R^2$ statistics might be produced when either the parameter estimates fail to converge correctly, as in this case, or when the correctly estimated model fits the data very poorly.
proc model data=test;
   y = a + b * x ** c;
   label a = "Intercept"
         b = "Coefficient of Transformed X"
         c = "Power Transformation Parameter";
   fit y start=(c=5);
run;

Using these starting values, the estimates converge in 16 iterations. The results are shown in Figure 18.29. Note that since the START= option explicitly declares parameters, the parameter C is placed first in the table.
Figure 18.29 Converged Results
The MODEL Procedure

Nonlinear OLS Summary of Residual Errors

Equation    DF Model    DF Error    SSE       MSE       R-Square    Adj R-Sq
y           3           17          5.7359    0.3374    0.8062      0.7834

Nonlinear OLS Parameter Estimates

Parameter    Approx Std Err    Approx Pr > |t|
c            0.2892            0.2738
a            3.3775            0.0238
b            3.4858            0.3287
With better starting values, the estimates converge in only eight iterations. Counting the iteration required to compute the starting values for A and B, this is seven fewer than the 16 iterations required without the STARTITER option. The iteration history listing is shown in Figure 18.30.
Figure 18.30 ITPRINT Listing
The MODEL Procedure
OLS Estimation

        Iteration    N Obs    R         Objective    N Subit    a           b          c
GRID    0            20       0.9989    162.9        0          0.00010     0.00010    1.00000
GRID    1            20       0.0000    0.3464       0          10.96530    0.77007    1.00000

Iteration    N Obs    R         Objective    N Subit    a
0            20       0.3873    0.3464       0          10.96530
1            20       0.3339    0.3282       2          10.75993
2            20       0.3244    0.3233       1          10.46894
3            20       0.3151    0.3197       1          10.11707
4            20       0.2764    0.3110       1          9.74691
5            20       0.2379    0.3040       0          9.06175
6            20       0.0612    0.2879       0          8.51825
7            20       0.0022    0.2868       0          8.39485
8            20       0.0001    0.2868       0          8.38467
The results produced in this case are almost the same as the results shown in Figure 18.29, except that the PARMS statement causes the parameter estimates table to be ordered A, B, C instead of C, A, B. They are not exactly the same because the different starting values caused the iterations to converge at a slightly different place. This effect is controlled by changing the convergence criterion with the CONVERGE= option. By default, the STARTITER option performs one iteration to find starting values for the parameters that are not given values. In this case, the model is linear in A and B, so only one iteration is needed. If A or B were nonlinear, you could specify more than one starting-value iteration by specifying a number for the STARTITER= option.

For example, the following statements try five different starting values for C: 1, 0.7, 0.5, 0.3, and 0. For each value of C, values for A and B are estimated. The combination of A, B, and C values that produce the smallest residual mean square is then used to start the iterative process.

proc model data=test;
   parms a b c;
   y = a + b * x ** c;
   label a = "Intercept"
         b = "Coefficient of Transformed X"
         c = "Power Transformation Parameter";
   fit y start=(c=1 .7 .5 .3 0) / startiter itprint;
run;
The iteration history listing is shown in Figure 18.31. Using the best starting values found by the grid search, the OLS estimation only requires two iterations. However, since the grid search required nine iterations, the total iterations in this case is 11.
Figure 18.31 ITPRINT Listing
The MODEL Procedure
OLS Estimation

        Iteration    N Obs    R         Objective    N Subit    a           b          c
GRID    0            20       0.9989    162.9        0          0.00010     0.00010    1.00000
GRID    1            20       0.0000    0.3464       0          10.96530    0.77007    1.00000
GRID    0            20       0.7587    0.7242       0          10.96530    0.77007    0.70000
GRID    1            20       0.0000    0.3073       0          10.41027    1.36141    0.70000
GRID    0            20       0.7079    0.5843       0          10.41027    1.36141    0.50000
GRID    1            20       0.0000    0.2915       0          9.69319     2.13103    0.50000
GRID    0            20       0.7747    0.7175       0          9.69319     2.13103    0.30000
GRID    1            20       0.0000    0.2869       0          8.04397     3.85767    0.30000
GRID    0            20       0.5518    2.1277       0          8.04397     3.85767    0.00000
GRID    1            20       0.0000    1.4799       0          8.04397     4.66255    0.00000

Iteration    N Obs    R         Objective    N Subit
0            20       0.0189    0.2869       0
1            20       0.0158    0.2869       0
2            20       0.0006    0.2868       0
Because no initial values for A or B were provided in the PARAMETERS statement or were read in with a PARMSDATA= or ESTDATA= option, A and B were given the default value of 0.0001 for the first iteration. At the second grid point, C=0.7, the values of A and B obtained from the previous iterations are used for the initial iteration. If initial values are provided for parameters, the parameters start at those initial values at each grid point.
where t is time in years. The model is estimated by using decennial census data of the U.S. population in millions. If this simple but highly nonlinear model is estimated by using the default starting values, the estimation fails to converge. To find reasonable starting values, first consider the meaning of a and c. Taking the limit as time increases, a is the limiting or maximum possible population. So, as a starting value for a, several times the most recent population known can be used, for example, one billion (1000 million). Dividing the time derivative by the function to find the growth rate and taking the limit as t moves into the past, you can determine that c is the initial growth rate. You can examine the data and compute an estimate of the growth rate for the first few decades, or you can pick a number that sounds like a plausible population growth rate figure, such as 2%. To find a starting value for b, let t equal the base year used, 1790, which causes c to drop out of the formula for that year, and then solve for the value of b that is consistent with the known population in 1790 and with the starting value of a. This yields $b = \ln(a/3.9 - 1)$ or about 5.5, where a is 1000 and 3.9 is roughly the population for 1790 given in the data. The estimates converge using these starting values.
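The reasoning above might be put into practice as in the following sketch. It assumes the logistic form pop = a / (1 + exp(b - c*(year - 1790))), which is consistent with the expression for b given above but is an assumption here. The data set name USPOP, the variable names, and the census figures (approximate published totals in millions, entered for illustration) are likewise assumptions; substitute your own data.

```sas
/* Approximate decennial census totals in millions (illustrative) */
data uspop;
   input year pop @@;
   datalines;
1790 3.929  1800 5.308  1810 7.239  1820 9.638  1830 12.866
1840 17.069 1850 23.191 1860 31.443 1870 39.818 1880 50.155
1890 62.947 1900 75.994 1910 91.972 1920 105.710 1930 122.775
1940 131.669 1950 151.325 1960 179.323 1970 203.211
;
run;

proc model data=uspop;
   /* Starting values: a = 1000 (ceiling), b = ln(1000/3.9 - 1), c = 2% growth */
   parms a 1000 b 5.5 c 0.02;
   label a = "Maximum Population"
         b = "Location Parameter"
         c = "Initial Growth Rate";
   pop = a / (1 + exp(b - c * (year - 1790)));
   fit pop / itprint;
run;
quit;
```

Supplying the values in the PARMS statement is one of several ways to set them; the START= option of the FIT statement shown earlier in this section is another.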
Convergence Problems
When estimating nonlinear models, you might encounter some of the following convergence problems.
Unable to Improve
The optimization algorithm might be unable to find a step that improves the objective function. If this happens in the Gauss-Newton method, the step size is halved to find a change vector for which the objective improves. In the Marquardt method, $\lambda$ is increased to find a change vector for which the objective improves. If, after MAXSUBITER= step-size halvings or increases in $\lambda$, the change vector still does not produce a better objective value, the iterations are stopped and an error message is printed. Failure of the algorithm to improve the objective value can be caused by a CONVERGE= value that is too small. Look at the convergence measures reported at the point of failure. If the estimates appear to be approximately converged, you can accept the NOT CONVERGED results reported, or you can try rerunning the FIT task with a larger CONVERGE= value. If the procedure fails to converge because it is unable to find a change vector that improves the objective value, check your model and data to ensure that all parameters are identified and data values are reasonably scaled. Then, rerun the model with different starting values. Also, consider using the Marquardt method if the Gauss-Newton method fails; the Gauss-Newton method can get into trouble if the Jacobian matrix is nearly singular or ill-conditioned. Keep in mind that a nonlinear model may be well-identified and well-conditioned for parameter values close to the solution values but unidentified or numerically ill-conditioned for other parameter values. The choice of starting values can make a big difference.
Nonconvergence
The estimates might diverge into areas where the program overflows or the estimates might go into areas where function values are illegal or too badly scaled for accurate calculation. The estimation might also take steps that are too small or that make only marginal improvement in the objective function and thus fail to converge within the iteration limit. When the estimates fail to converge, collinearity diagnostics for the Jacobian crossproducts matrix are printed if there are 20 or fewer parameters estimated. See the section Linear Dependencies on page 1041 for an explanation of these diagnostics.

If convergence is obtained, the resulting estimates approximate only a minimum point of the objective function. The statistical validity of the results is based on the exact minimization of the objective function, and for nonlinear models the quality of the results depends on the accuracy of the approximation of the minimum. This is controlled by the convergence criterion used. There are many nonlinear functions for which the objective function is quite flat in a large region around the minimum point so that many quite different parameter vectors might satisfy a weak convergence criterion. By using different starting values, different convergence criteria, or different minimization methods, you can produce very different estimates for such models. You can guard against this by running the estimation with different starting values and different convergence criteria and checking that the estimates produced are essentially the same. If they are not, use a smaller CONVERGE= value.
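One way to carry out such a check is sketched below: the same model is fit from two sets of starting values with a loose and a tight CONVERGE= value, and the estimates are saved with OUTEST= for side-by-side inspection. The data set CURVE, the starting values, and the output data set names are hypothetical.

```sas
/* Hypothetical data for a power-transformation model */
data curve;
   call streaminit(42);
   do x = 1 to 20;
      y = 8 + 4 * x ** 0.3 + 0.5 * rand('normal');
      output;
   end;
run;

proc model data=curve;
   parms a 8 b 4 c 0.5;
   y = a + b * x ** c;
   fit y / converge=1e-3 outest=est_loose;   /* weaker criterion */
run;
quit;

proc model data=curve;
   parms a 5 b 8 c 0.2;                      /* different starting values */
   y = a + b * x ** c;
   fit y / converge=1e-8 outest=est_tight;   /* stricter criterion */
run;
quit;

/* Stack the two sets of estimates and print them together */
data compare;
   set est_loose est_tight;
run;

proc print data=compare;
run;
```

If the two rows of estimates differ by more than is acceptable for your purpose, that is a sign the weaker criterion stopped in a flat region of the objective function.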
Local Minimum
You might have converged to a local minimum rather than a global one. This problem is difficult to detect because the procedure appears to have succeeded. You can guard against this by running the estimation with different starting values or with a different minimization technique. The START= option can be used to automatically perform a grid search to aid in the search for a global minimum.
Discontinuities
The computational methods assume that the model is a continuous and smooth function of the parameters. If this is not the case, the methods might not work. If the model equations or their derivatives contain discontinuities, the estimation usually succeeds, provided that the final parameter estimates lie in a continuous interval and that the iterations do not produce parameter values at points of discontinuity or parameter values that try to cross asymptotes.
One common case of discontinuities causing estimation failure is that of an asymptotic discontinuity between the nal estimates and the initial values. For example, consider the following model, which is basically linear but is written with one parameter in reciprocal form:
y = a + b * x1 + x2 / c;
By placing the parameter C in the denominator, a singularity is introduced into the parameter space at C=0. This is not necessarily a problem, but if the correct estimate of C is negative while the starting value is positive (or vice versa), the asymptotic discontinuity at 0 will lie between the estimate and the starting value. This means that the iterations have to pass through the singularity to get to the correct estimates. The situation is shown in Figure 18.32.
Figure 18.32 Asymptotic Discontinuity
Because of the incorrect sign of the starting value, the C estimate goes off towards positive infinity in a vain effort to get past the asymptote and onto the correct arm of the hyperbola. As the computer is required to work with ever closer approximations to infinity, the numerical calculations break down and an "objective function was not improved" convergence failure message is printed. At this point, the iterations terminate with an extremely large positive value for C. When the sign of the starting value for C is changed, the estimates converge quickly to the correct values.
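The following sketch shows one way to start C on the intended side of the asymptote. The data set RECIP is hypothetical and is generated with a negative true value of C, so the starting value of C is set to a negative number in the PARMS statement.

```sas
/* Hypothetical data generated with C = -4 */
data recip;
   call streaminit(7);
   do i = 1 to 50;
      x1 = rand('uniform') * 5;
      x2 = rand('uniform') * 5;
      y  = 2 + 3 * x1 + x2 / (-4) + 0.2 * rand('normal');
      output;
   end;
   drop i;
run;

proc model data=recip;
   /* A and B keep the default start; C starts at -1, on the same
      side of the singularity at 0 as the expected estimate */
   parms a b c -1;
   y = a + b * x1 + x2 / c;
   fit y;
run;
quit;
```

An alternative design, when the model allows it, is to estimate a coefficient d on x2 directly (y = a + b * x1 + d * x2) and recover C as 1/d afterward, which removes the singularity from the parameter space.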
Linear Dependencies
In some cases, the Jacobian matrix might not be of full rank; parameters might not be fully identified for the current parameter values with the current data. When linear dependencies occur among the derivatives of the model, some parameters appear with a standard error of 0 and with the word BIASED printed in place of the t statistic. When this happens, collinearity diagnostics for the Jacobian crossproducts matrix are printed if the DETAILS option is specified and there are twenty or fewer parameters estimated. Collinearity diagnostics are also printed out automatically when a minimization method fails, or when the COLLIN option is specified. For each parameter, the proportion of the variance of the estimate accounted for by each principal component is printed. The principal components are constructed from the eigenvalues and eigenvectors of the correlation matrix (scaled covariance matrix). When collinearity exists, a principal component is associated with a proportion of the variance of more than one parameter. The numbers reported are proportions so they remain between 0 and 1. If two or more parameters have large proportion values associated with the same principal component, then two problems can occur: the computation of the parameter estimates is slow or nonconvergent; and the parameter estimates have inflated variances (Belsley, Kuh, and Welsch 1980, pp. 105-117). For example, the following cubic model is fit to a quadratic data set:

proc model data=test3;
   exogenous x1;
   parms b1 a1 c1;
   y1 = a1 * x1 + b1 * x1 * x1 + c1 * x1 * x1 * x1;
   fit y1 / collin;
run;
Notice that the proportions associated with the smallest eigenvalue are almost 1. For this model, removing any of the parameters decreases the variances of the remaining parameters. In many models, the collinearity might not be clear cut. Collinearity is not necessarily something you remove. A model might need to be reformulated to remove the redundant parameterization, or the limitations on the estimability of the model can be accepted. The GINV=G4 option can be helpful to avoid problems with convergence for models containing collinearities.
Collinearity diagnostics are also useful when an estimation does not converge. The diagnostics provide insight into the numerical problems and can suggest which parameters need better starting values. These diagnostics are based on the approach of Belsley, Kuh, and Welsch (1980).
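The two remedies mentioned above, accepting the collinearity with GINV=G4 or reformulating the model, might look like the following sketch. The data set QUAD is hypothetical and is generated from a quadratic relationship so that the cubic term is redundant.

```sas
/* Hypothetical quadratic data */
data quad;
   call streaminit(11);
   do i = 1 to 30;
      x1 = i / 10;
      y1 = 2 * x1 + 0.5 * x1 * x1 + 0.1 * rand('normal');
      output;
   end;
   drop i;
run;

/* Keep the cubic term, print diagnostics, and request the G4 inverse */
proc model data=quad;
   parms a1 b1 c1;
   y1 = a1 * x1 + b1 * x1 * x1 + c1 * x1 * x1 * x1;
   fit y1 / collin ginv=g4;
run;
quit;

/* Reformulated model without the redundant cubic term */
proc model data=quad;
   parms a1 b1;
   y1 = a1 * x1 + b1 * x1 * x1;
   fit y1 / collin;
run;
quit;
```

Comparing the Proportion of Variation columns from the two runs is one way to judge whether the reformulation removed the dependency.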
Iteration History
The options ITPRINT, ITDETAILS, XPX, I, and ITALL specify a detailed listing of each iteration of the minimization process.
ITPRINT Option
The ITPRINT information is selected whenever any iteration information is requested. The following information is displayed for each iteration:
N            is the number of usable observations.
Objective    is the corrected objective function value.
Trace(S)     is the trace of the S matrix.
subit        is the number of subiterations required to find a $\lambda$ or a damping factor that reduces the objective function.
R            is the R convergence measure.
The estimates for the parameters at each iteration are also printed.
ITDETAILS Option
The additional values printed for the ITDETAILS option are:
Theta        is the angle in degrees between $\Delta$, the parameter change vector, and the negative gradient of the objective function.
Phi          is the directional derivative of the objective function in the $\Delta$ direction scaled by the objective function.
Stepsize     is the value of the damping factor used to reduce $\Delta$ if the Gauss-Newton method is used.
Lambda       is the value of $\lambda$ if the Marquardt method is used.
Rank(XPX)    is the rank of the $X'X$ matrix (output if the projected Jacobian crossproducts matrix is singular).

The definitions of PPC and R are explained in the section Convergence Criteria on page 1028. When the values of PPC are large, the parameter associated with the criterion is displayed in parentheses after the value.
XPX Inverse for System
At OLS Iteration 0

            a1          d1           b2          d2          Residual
a1          0.000001    0.000028     0.0000      0.00        2
d1          0.000028    0.001527     0.0000      0.00        -9
a2          0.000000    0.000000     -0.0025     -0.08       6
b2          0.000000    0.000000     0.0825      0.95        172
d2          0.000000    0.000000     0.9476      15746.71    11931
Residual    1.952150    -8.546875    171.6234    11930.89    10819902
The first matrix, labeled Cross Products, for OLS estimation is
$$\begin{bmatrix} X'X & X'r \\ r'X & r'r \end{bmatrix}$$
The column labeled Residual in the output is the vector $X'r$, which is the gradient of the objective function. The diagonal scalar value $r'r$ is the objective function uncorrected for degrees of freedom. The second matrix, labeled XPX Inverse, is created through a sweep operation on the augmented $X'X$ matrix to get:
$$\begin{bmatrix} (X'X)^{-1} & (X'X)^{-1}X'r \\ (X'r)'(X'X)^{-1} & r'r - (X'r)'(X'X)^{-1}X'r \end{bmatrix}$$
Note that the residual column is the change vector used to update the parameter estimates at each iteration. The corner scalar element is used to compute the R convergence criteria.
ITALL Option
The ITALL option, in addition to causing the output of all of the preceding options, outputs the S matrix, the inverse of the S matrix, the CROSS matrix, and the swept CROSS matrix. An example of a portion of the CROSS matrix for the preceding example is shown in Figure 18.35.
Figure 18.35 ITALL Option Crossproducts Matrix Output
The MODEL Procedure
OLS Estimation

Crossproducts Matrix At OLS Iteration 0

                1           @PRED.y1/@d1    @PRED.y2/@a2
1               50.00       -239.16         1275.0
@PRED.y1/@a1    6409.08     -33818.35       187766.1
@PRED.y1/@d1    -239.16     1276.45         -7253.0
@PRED.y2/@a2    1275.00     -7253.00        42925.0
@PRED.y2/@b2    50.00       -239.19         1275.2
@PRED.y2/@d2    0.00        -0.03           0.2
RESID.y1        14699.97    -76928.14       420582.9
RESID.y2        16052.76    -85083.68       470686.3

Crossproducts Matrix At OLS Iteration 0

                @PRED.y2/@b2    RESID.y1    RESID.y2
1               50.00           14700       16053
@PRED.y1/@a1    6409.88         3879959     4065028
@PRED.y1/@d1    -239.19         -76928      -85084
@PRED.y2/@a2    1275.15         420583      470686
@PRED.y2/@b2    50.01           14702       16055
@PRED.y2/@d2    0.00            2           2
RESID.y1        14701.77        11827102    12234106
RESID.y2        16055.07        12234106    12749042
where OBS=25 selects the first 25 observations in A. The second FIT statement produces the final estimates using the full data set and starting values from the first run.
You should try estimating the model in pieces to save time only if there are more than 14 parameters; the preceding example takes more time, not less, and the difference in memory required is trivial.
plus lower-order terms, where m is the number of unique nonzero derivatives of each residual with respect to each parameter, g is the number of equations, k is the number of instruments, and p is the number of parameters. This formula is for the memory required for 3SLS. If you are using OLS, a reasonable estimate of the memory required for large problems (greater than 100 parameters) is to divide the value obtained from the formula in half. Consider the following model program.
proc model data=test2 details;
   exogenous x1 x2;
   parms b1 100 a1 a2 b2 2.5 c2 55;
   y1 = a1 * y2 + b1 * x1 * x1;
   y2 = a2 * y1 + b2 * x2 * x2 + c2 / x2;
   fit y1 y2 / n3sls memoryuse;
   inst b1 b2 c2 x1;
run;
The DETAILS option prints the storage requirements information shown in Figure 18.36.
Figure 18.36 Storage Requirements Information
The MODEL Procedure

Storage Requirements for this Problem
Order of XPX Matrix               6
Order of S Matrix                 2
Order of Cross Matrix             13
Total Nonzero Derivatives         5
Distinct Variable Derivatives     5
Size of Cross matrix              728

The matrix $X'X$ augmented by the residual vector is called the XPX matrix in the output, and it has the size $m + 1$. The order of the S matrix, 2 for this example, is the value of g. The CROSS matrix is made up of the k unique instruments, a constant column that represents the intercept terms, followed by the m unique Jacobian variables plus a constant column that represents the parameters with constant derivatives, followed by the g residuals.

Note that the CROSS matrix is symmetric, so only the diagonal and the upper triangular part of the matrix are stored. For examples of the CROSS and XPX matrices see the section Iteration History on page 1042.
symbols          memory used to store information about variables in the model
strings          memory used to store the variable names and labels
lists            space used to hold lists of variables
arrays           memory used by ARRAY statements
statements       memory used for the list of programming statements in the model
opcodes          memory used to store the code compiled to evaluate the expression in the model program
parsing          memory used in parsing the SAS statements
executable       the compiled model program size
block option     memory used by the BLOCK option
cross ref.       memory used by the XREF option
flow analysis    memory used to compute the interdependencies of the variables
derivatives      memory used to compute and store the analytical derivatives
data vector      memory used for the program data vector
cross matrix     memory used for one or more copies of the CROSS matrix
X'X matrix       memory used for one or more copies of the X'X matrix
S matrix         memory used for the covariance matrix
GMM memory       additional memory used for the GMM and ITGMM methods
Jacobian         memory used for the Jacobian matrix for SOLVE and FIML
work vectors     memory used for miscellaneous work vectors
overhead         other miscellaneous memory
$$b_{1,d} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ (X_i - \hat{\mu})' S^{-1} (X_j - \hat{\mu}) \right]^3$$
where $S$ is the sample covariance matrix of $X$. For weighted regression, both $S$ and $(X_i - \hat{\mu})$ are computed by using the weights supplied by the WEIGHT statement or the _WEIGHT_ variable. Mardia showed that under the null hypothesis $\frac{n}{6} b_{1,d}$ is asymptotically distributed as $\chi^2(d(d+1)(d+2)/6)$. For small samples, Mardia's skewness test statistic is calculated with a small sample correction formula, given by $\frac{nk}{6} b_{1,d}$, where the correction factor $k$ is given by
$$k = \frac{(d+1)(n+1)(n+3)}{n\,((n+1)(d+1) - 6)}$$
Mardia's skewness test statistic in PROC MODEL uses this small sample corrected formula.
The kurtosis statistic is
$$b_{2,d} = \frac{1}{n} \sum_{i=1}^{n} \left[ (X_i - \hat{\mu})' S^{-1} (X_i - \hat{\mu}) \right]^2$$
Mardia showed that under the null hypothesis, $b_{2,d}$ is asymptotically normally distributed with mean $d(d+2)$ and variance $8d(d+2)/n$.
The Henze-Zirkler test is based on a nonnegative functional $D(\cdot,\cdot)$ that measures the distance between two distribution functions and has the property that $D(N_d(0, I_d), Q) = 0$ if and only if $Q = N_d(0, I_d)$, where $N_d(\mu, \Sigma_d)$ is a d-dimensional normal distribution. The distance measure $D(\cdot,\cdot)$ can be written as
$$D_\beta(P, Q) = \int_{R^d} |\hat{P}(t) - \hat{Q}(t)|^2 \, \varphi_\beta(t) \, dt$$
where $\hat{P}(t)$ and $\hat{Q}(t)$ are the Fourier transforms of $P$ and $Q$, and $\varphi_\beta(t)$ is a weight or a kernel function. The density of the normal distribution $N_d(0, \beta^2 I_d)$ is used as $\varphi_\beta(t)$:
$$\varphi_\beta(t) = (2\pi\beta^2)^{-d/2} \exp\left( \frac{-|t|^2}{2\beta^2} \right), \quad t \in R^d$$
where $|t| = (t't)^{0.5}$. The parameter $\beta$ depends on $n$ as
$$\beta_d(n) = \frac{1}{\sqrt{2}} \left( \frac{2d+1}{4} \right)^{1/(d+4)} n^{1/(d+4)}$$
The test statistic computed is called $T_\beta(d)$ and is approximately distributed as a lognormal. The lognormal distribution is used to compute the null hypothesis probability.
$$T_\beta(d) = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} \exp\left( -\frac{\beta^2}{2} |Y_j - Y_k|^2 \right) - 2(1+\beta^2)^{-d/2} \frac{1}{n} \sum_{j=1}^{n} \exp\left( -\frac{\beta^2}{2(1+\beta^2)} |Y_j|^2 \right) + (1 + 2\beta^2)^{-d/2}$$
where
$$|Y_j - Y_k|^2 = (X_j - X_k)' S^{-1} (X_j - X_k)$$
$$|Y_j|^2 = (X_j - \bar{X})' S^{-1} (X_j - \bar{X})$$
Monte Carlo simulations suggest that $T_\beta(d)$ has good power against distributions with heavy tails. The Shapiro-Wilk W test is computed only when the number of observations (n) is less than 2000, while computation of the Kolmogorov-Smirnov test statistic requires at least 2000 observations. The following is an example of the output produced by the NORMAL option.
proc model data=test2;
   y1 = a1 * x2 * x2 - exp( d1*x1);
   y2 = a2 * x1 * x1 + b2 * exp( d2*x2);
   fit y1 y2 / normal ;
run;
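The skewness and kurtosis statistics described above can be checked by hand in the univariate case. The following Python sketch is only an illustration and is not how PROC MODEL computes the tests: it assumes d = 1, no weights, and a sample covariance with divisor n (the text does not state the divisor), and the function names are hypothetical.

```python
def mardia_univariate(x):
    """Mardia skewness b1 and kurtosis b2 for d = 1 (assumes covariance divisor n)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n   # assumption: divisor n, not n - 1
    # For d = 1 the quadratic form (Xi - mean)' S^-1 (Xj - mean) is a scalar product.
    b1 = sum(((xi - mean) * (xj - mean) / var) ** 3 for xi in x for xj in x) / n ** 2
    b2 = sum(((xi - mean) ** 2 / var) ** 2 for xi in x) / n
    return b1, b2


def skewness_correction(n, d):
    """Small-sample correction factor k as given in the text."""
    return (d + 1) * (n + 1) * (n + 3) / (n * ((n + 1) * (d + 1) - 6))


x = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical, perfectly symmetric sample
b1, b2 = mardia_univariate(x)
k = skewness_correction(len(x), 1)
skew_stat = len(x) * k * b1 / 6        # compared with chi-square, d(d+1)(d+2)/6 df
```

For a symmetric sample such as this one, the skewness statistic is zero, so the corrected test statistic is zero as well.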
Heteroscedasticity
One of the key assumptions of regression is that the variance of the errors is constant across observations. If the errors have constant variance, the errors are called homoscedastic. Typically, residuals are plotted to assess this assumption. Standard estimation methods are inefficient when the errors are heteroscedastic, that is, when they have nonconstant variance.
Heteroscedasticity Tests
The MODEL procedure provides two tests for heteroscedasticity of the errors: White's test and the modified Breusch-Pagan test. Both White's test and the Breusch-Pagan test are based on the residuals of the fitted model. For systems of equations, these tests are computed separately for the residuals of each equation. The residuals of an estimation are used to investigate the heteroscedasticity of the true disturbances. The WHITE option tests the null hypothesis
$$H_0: \sigma_i^2 = \sigma^2 \quad \text{for all } i$$
White's test is general because it makes no assumptions about the form of the heteroscedasticity (White 1980). Because of its generality, White's test might identify specification errors other than heteroscedasticity (Thursby 1982). Thus, White's test might be significant when the errors are homoscedastic but the model is misspecified in other ways.
White's test is equivalent to obtaining the error sum of squares for the regression of squared residuals on a constant and all the unique variables in $J \otimes J$, where the matrix $J$ is composed of the partial derivatives of the equation residual with respect to the estimated parameters. White's test statistic $W$ is computed as follows:
$$W = nR^2$$
where $R^2$ is the coefficient of determination (the squared multiple correlation) obtained from the above regression. The statistic is asymptotically distributed as chi-squared with $P - 1$ degrees of freedom, where $P$ is the number of regressors in the regression, including the constant, and $n$ is the total number of observations. In the example given below, the regressors are constant, income, income*income, income*income*income, and income*income*income*income. income*income occurs twice and one is dropped. Hence, $P = 5$ with degrees of freedom $P - 1 = 4$.
Note that White's test in the MODEL procedure is different from White's test in the REG procedure requested by the SPEC option. The SPEC option produces the test from Theorem 2 on page 823 of White (1980). The WHITE option, on the other hand, produces the statistic discussed in Greene (1993).
The null hypothesis for the modified Breusch-Pagan test is homoscedasticity. The alternate hypothesis is that the error variance varies with a set of regressors, which are listed in the BREUSCH= option. Define the matrix $Z$ to be composed of the values of the variables listed in the BREUSCH= option, such that $z_{i,j}$ is the value of the jth variable in the BREUSCH= option for the ith observation. The null hypothesis of the Breusch-Pagan test is
$$\sigma_i^2 = \sigma^2 (\alpha_0 + \alpha' z_i)$$
$$H_0: \alpha = 0$$
where $\sigma_i^2$ is the error variance for the ith observation and $\alpha_0$ and $\alpha$ are regression coefficients. The test statistic for the Breusch-Pagan test is
$$bp = \frac{1}{v} (u - \bar{u}\, i)' Z (Z'Z)^{-1} Z' (u - \bar{u}\, i)$$
where $u = (e_1^2, e_2^2, \ldots, e_n^2)$, $i$ is an $n \times 1$ vector of ones, and
$$v = \frac{1}{n} \sum_{i=1}^{n} \left( e_i^2 - \frac{e'e}{n} \right)^2$$
This is a modified version of the Breusch-Pagan test, which is less sensitive to the assumption of normality than the original test (Greene 1993, p. 395). The statements in the following example produce the output in Figure 18.39:
proc model data=schools;
   parms const inc inc2;
   exp = const + inc * income + inc2 * income * income;
   incsq = income * income;
   fit exp / white breusch=(1 income incsq);
run;
Weighted Regression
The WEIGHT statement can be used to correct for the heteroscedasticity. Consider the following model, which has a heteroscedastic error term:
$$y_t = 250 (e^{-0.2t} - e^{-0.8t}) + \sqrt{9/t}\, \epsilon_t$$
The data for this model are generated with the following SAS statements.
data test;
   do t=1 to 25;
      y = 250 * (exp( -0.2 * t ) - exp( -0.8 * t ))
          + sqrt( 9 / t ) * rannor(1);
      output;
   end;
run;
If this model is estimated with OLS, as shown in the following statements, the estimates shown in Figure 18.40 are obtained for the parameters.
proc model data=test;
   parms b1 0.1 b2 0.9;
   y = 250 * ( exp( -b1 * t ) - exp( -b2 * t ) );
   fit y;
run;
If both sides of the model equation are multiplied by $\sqrt{t}$, the model has a homoscedastic error term. This multiplication or weighting is done through the WEIGHT statement. The WEIGHT statement variable operates on the squared residuals as
$$\epsilon_t' \epsilon_t = \mathit{weight} \times q_t' q_t$$
so that the WEIGHT statement variable represents the square of the model multiplier. The following PROC MODEL statements correct the heteroscedasticity with a WEIGHT statement:
proc model data=test;
   parms b1 0.1 b2 0.9;
   y = 250 * ( exp( -b1 * t ) - exp( -b2 * t ) );
   fit y;
   weight t;
run;
Note that the WEIGHT statement follows the FIT statement. The weighted estimates are shown in Figure 18.41.
Figure 18.41 Weighted OLS Estimates
Nonlinear OLS Parameter Estimates

Parameter    Approx Std Err    Approx Pr > |t|
b1           0.00101           <.0001
b2           0.00853           <.0001
The weighted OLS estimates are identical to the output produced by the following PROC MODEL example:
proc model data=test;
   parms b1 0.1 b2 0.9;
   y = 250 * ( exp( -b1 * t ) - exp( -b2 * t ) );
   _weight_ = t;
   fit y;
run;
If the WEIGHT statement is used in conjunction with the _WEIGHT_ variable, the two values are multiplied together to obtain the weight used. The WEIGHT statement and the _WEIGHT_ variable operate on all the residuals in a system of equations. If a subset of the equations needs to be weighted, the residuals for each equation can be modified through the RESID. variable for each equation. The following example demonstrates the use of the RESID. variable to make a homoscedastic error term:
proc model data=test;
   parms b1 0.1 b2 0.9;
   y = 250 * ( exp( -b1 * t ) - exp( -b2 * t ) );
   resid.y = resid.y * sqrt(t);
   fit y;
run;
These statements produce estimates of the parameters and standard errors that are identical to the weighted OLS estimates. The reassignment of the RESID.Y variable must be done after Y is assigned; otherwise it would have no effect. Also, note that the residual (RESID.Y) is multiplied by $\sqrt{t}$. Here the multiplier is acting on the residual before it is squared.
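The relationship between the two weighting approaches can be verified with a few numbers. The Python sketch below is an illustration only, with hypothetical residual values: it compares a sum of squares in which the weight t multiplies each squared residual (as the WEIGHT statement does) with one in which each residual is multiplied by sqrt(t) before squaring (as the RESID.Y reassignment does).

```python
import math

t_values = [1, 2, 3, 4]
q = [0.9, -0.4, 0.3, -0.2]     # hypothetical unweighted residuals q_t

# WEIGHT t: the weight multiplies each squared residual.
sse_weight = sum(t * r * r for t, r in zip(t_values, q))

# resid.y = resid.y * sqrt(t): the multiplier acts on the residual before squaring.
sse_resid = sum((r * math.sqrt(t)) ** 2 for t, r in zip(t_values, q))
```

The two objective values agree up to rounding, which is why the weight variable represents the square of the model multiplier.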
GMM Estimation
If the form of the heteroscedasticity is unknown, generalized method of moments estimation (GMM) can be used. The following PROC MODEL statements use GMM to estimate the example model used in the preceding section:
proc model data=test;
   parms b1 0.1 b2 0.9;
   y = 250 * ( exp( -b1 * t ) - exp( -b2 * t ) );
   fit y / gmm;
   instruments b1 b2;
run;
GMM is an instrumental method, so instrument variables must be provided. GMM estimation generates estimates for the parameters shown in Figure 18.42.
.X0 X/
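When the variance of the errors of a classical linear model
$$Y = X\beta + \epsilon$$
is not constant across observations (heteroscedastic), so that $\sigma_i^2 \ne \sigma_j^2$ for some $i \ne j$, the OLS estimator
$$\hat{\beta}_{OLS} = (X'X)^{-1} X'Y$$
is unbiased but it is inefficient. Models that take into account the changing variance can make more efficient use of the data. When the variances, $\sigma_t^2$, are known, generalized least squares (GLS) can be used and the estimator is
$$\hat{\beta}_{GLS} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1}Y$$
where
$$\Omega = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_T^2 \end{bmatrix}$$
This estimator cannot be computed when the variances, $\sigma_t^2$, are unknown.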
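To solve this problem White (1980) proposed a heteroscedastic consistent-covariance matrix estimator (HCCME)
$$\hat{\Sigma} = (X'X)^{-1} X' \hat{\Omega} X (X'X)^{-1}$$
where $\hat{\Omega}$ is the diagonal matrix of squared OLS residuals $\epsilon_t^2$, with
$$\epsilon_t = Y_t - X_t \hat{\beta}_{OLS}$$
This estimator is considered somewhat unreliable in finite samples. Therefore, Davidson and MacKinnon (1993) propose three different modifications to estimating $\hat{\Omega}$. The first solution is to simply multiply $\epsilon_t^2$ by $\frac{n}{n - df}$, where $n$ is the number of observations and $df$ is the number of explanatory variables, so that
$$\hat{\Omega}_1 = \mathrm{diag}\left( \frac{n}{n-df}\epsilon_1^2, \; \frac{n}{n-df}\epsilon_2^2, \; \ldots, \; \frac{n}{n-df}\epsilon_n^2 \right)$$
The second solution is to define
$$\hat{\Omega}_2 = \mathrm{diag}\left( \frac{\epsilon_1^2}{1-\hat{h}_1}, \; \frac{\epsilon_2^2}{1-\hat{h}_2}, \; \ldots, \; \frac{\epsilon_n^2}{1-\hat{h}_n} \right)$$
where $\hat{h}_t = X_t (X'X)^{-1} X_t'$. The third solution is to define
$$\hat{\Omega}_3 = \mathrm{diag}\left( \frac{\epsilon_1^2}{(1-\hat{h}_1)^2}, \; \frac{\epsilon_2^2}{(1-\hat{h}_2)^2}, \; \ldots, \; \frac{\epsilon_n^2}{(1-\hat{h}_n)^2} \right)$$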
MacKinnon and White (1985) investigated these three modified HCCMEs, including the original HCCME, based on finite-sample performance of pseudo-t statistics. The original HCCME performed the worst. The first modification performed better. The second modification performed even better than the first, and the third modification performed the best. They concluded that the original HCCME should never be used in finite sample estimation, and that the second and third modifications should be used over the first modification if the diagonals of $\hat{\Omega}$ are available.
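As a rough illustration of how the original HCCME and its three modifications differ, the following Python sketch scales the OLS residuals of a one-regressor model with an intercept in the four ways usually labeled HC0 through HC3; squaring the scaled residuals gives the diagonal of the corresponding estimate of Omega. The data and the function name are hypothetical, and the closed-form leverage used here holds only for this simple model.

```python
def hc_scaled_residuals(x, y):
    """Return OLS residuals scaled as in HC0-HC3 for the model y = b0 + b1*x."""
    n = len(x)
    df = 2                                    # intercept and slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    # Leverage h_t = X_t (X'X)^-1 X_t' for a model with an intercept and one regressor.
    lev = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return {
        "HC0": resid,
        "HC1": [e * (n / (n - df)) ** 0.5 for e in resid],
        "HC2": [e / (1.0 - h) ** 0.5 for e, h in zip(resid, lev)],
        "HC3": [e / (1.0 - h) for e, h in zip(resid, lev)],
        "leverage": lev,
    }


x = [1.0, 2.0, 3.0, 4.0, 6.0]     # hypothetical regressor values
y = [1.1, 1.9, 3.2, 3.8, 6.3]     # hypothetical responses
psi = hc_scaled_residuals(x, y)
```

Because each leverage lies between 0 and 1, the HC3 scaling inflates a residual at least as much as HC2, and high-leverage observations are inflated the most.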
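Extending the discussion to systems of $g$ equations, the HCCME for SUR estimation is
$$(\tilde{X}'\tilde{X})^{-1} \tilde{X}' \hat{\Omega} \tilde{X} (\tilde{X}'\tilde{X})^{-1}$$
where $\tilde{X}$ is an $ng \times k$ matrix with the first $g$ rows representing the first observation, the next $g$ rows representing the second observation, and so on. $\hat{\Omega}$ is now an $ng \times ng$ block diagonal matrix with typical $g \times g$ block
$$\hat{\Omega}_i = \begin{bmatrix} \psi_{1,i}\psi_{1,i} & \psi_{1,i}\psi_{2,i} & \cdots & \psi_{1,i}\psi_{g,i} \\ \psi_{2,i}\psi_{1,i} & \psi_{2,i}\psi_{2,i} & \cdots & \psi_{2,i}\psi_{g,i} \\ \vdots & \vdots & \ddots & \vdots \\ \psi_{g,i}\psi_{1,i} & \psi_{g,i}\psi_{2,i} & \cdots & \psi_{g,i}\psi_{g,i} \end{bmatrix}$$
where
$$\psi_{j,i} = \epsilon_{j,i} \quad (HC_0)$$
or
$$\psi_{j,i} = \sqrt{\frac{n}{n - df}}\, \epsilon_{j,i} \quad (HC_1)$$
or
$$\psi_{j,i} = \epsilon_{j,i} / \sqrt{1 - \hat{h}_i} \quad (HC_2)$$
or
$$\psi_{j,i} = \epsilon_{j,i} / (1 - \hat{h}_i) \quad (HC_3)$$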
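For two- and three-stage least squares, the HCCME for a $g$ equation system is
$$\mathrm{Cov}\, F(\hat{\Omega})\, \mathrm{Cov}$$
where
$$\mathrm{Cov} = \left( \frac{1}{n} X' (I \otimes Z (Z'Z)^{-1} Z') X \right)^{-1}$$
and
$$F(\hat{\Omega}) = \frac{1}{n} \sum_{i=1}^{g} \sum_{j=1}^{g} X_i' Z (Z'Z)^{-1} Z' \hat{\Omega}_{ij} Z (Z'Z)^{-1} Z' X_j$$
where $X_j$ is an $n \times p$ matrix with the jth equation's regressors in the appropriate columns and zeros everywhere else.
$$\hat{\Omega}_{ij} = \begin{bmatrix} \psi_{i,1}\psi_{j,1} & 0 & \cdots & 0 \\ 0 & \psi_{i,2}\psi_{j,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \psi_{i,n}\psi_{j,n} \end{bmatrix}$$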
For 2SLS, $\hat{\Omega}_{ij} = 0$ when $i \ne j$. The $\epsilon_t$ used in $\hat{\Omega}$ is computed by using the parameter estimates obtained from the instrumental variables estimation. The leverage value for the ith equation used in the HCCME=2 and HCCME=3 methods is computed conditional on the first stage as
$$h_{ti} = Z_t (Z'Z)^{-1} X_i \left( X' (I \otimes Z (Z'Z)^{-1} Z') X \right)^{-1} X_i' Z (Z'Z)^{-1} Z_t'$$
for 2SLS and
$$h_{ti} = Z_t (Z'Z)^{-1} X_i \left( X' (S^{-1} \otimes Z (Z'Z)^{-1} Z') X \right)^{-1} X_i' Z (Z'Z)^{-1} Z_t' / S_{ii}$$
for 3SLS.
The three variations of the test reported by the GODFREY=3 option are designed to have power against different alternative hypotheses. Thus, if the residuals in fact have only first-order autocorrelation, the lag 1 test has the most power for rejecting the null hypothesis of uncorrelated residuals. If the residuals have second- but not higher-order autocorrelation, the lag 2 test might be more likely to reject; the same is true for third-order autocorrelation and the lag 3 test. The null hypothesis of Godfrey's tests is that the equation residuals are white noise. However, if the equation includes an autoregressive error model of order p (AR(p)), then the lag i test, when considered in terms of the structural error, is for the null hypothesis that the structural errors are from an AR(p) process versus the alternative hypothesis that the errors are from an AR(p+i) process. The alternative ARMA(p,i) process is locally equivalent to the alternative AR(p+i) process with respect to the null model AR(p). Thus, the GODFREY= option results are also a test of AR(p) errors against the alternative hypothesis of ARMA(p,i) errors. See Godfrey (1978a and 1978b) for more detailed information.
where $a = \log(\alpha)$ and $b = \beta$. Equation (2) is called the log form of the equation in contrast to equation (1), which is called the level form of the equation. Using the SYSLIN procedure, you can estimate equation (2) by specifying
proc syslin data=in;
   model logy=logx;
run;
where LOGY and LOGX are the logs of Y and X computed in a preceding DATA step. The resulting values for INTERCEPT and LOGX correspond to a and b in equation (2).
Using the MODEL procedure, you can try to estimate the parameters in the level form (and avoid the DATA step) by specifying
proc model data=in;
   parms alpha beta;
   y = alpha * x ** beta;
   fit y;
run;
where ALPHA and BETA are the parameters in equation (1). Unfortunately, at least one of the preceding is wrong; an ambiguity results because equations (1) and (2) contain no explicit error term. The SYSLIN and MODEL procedures both deal with additive errors; the residual used (the estimate of the error term in the equation) is the difference between the predicted and actual values (of LOGY for PROC SYSLIN and of Y for PROC MODEL in this example). If you perform the regressions discussed previously, PROC SYSLIN estimates equation (3) while PROC MODEL estimates equation (4).
$$\ln y = a + b \ln(x) + \epsilon \quad (3)$$
$$y = \alpha x^{\beta} + \xi \quad (4)$$
These are different statistical models. Equation (3) is the log form of equation (5)
$$y = \alpha x^{\beta} \mu \quad (5)$$
where $\mu = e^{\epsilon}$. Equation (4), on the other hand, cannot be linearized because the error term $\xi$ (different from $\mu$) is additive in the level form.
You must decide whether your model is equation (4) or (5). If the model is equation (4), you should use PROC MODEL. If you linearize equation (1) without considering the error term and apply SYSLIN to MODEL LOGY=LOGX, the results will be wrong. On the other hand, if your model is equation (5) (in practice it usually is), and you want to use PROC MODEL to estimate the parameters in the level form, you must do something to account for the multiplicative error.
PROC MODEL estimates parameters by minimizing an objective function. The objective function is computed using either the RESID.-prefixed equation variable or the EQ.-prefixed equation variable. You must make sure that these prefixed equation variables are assigned an appropriate error term. If the model has additive errors that satisfy the assumptions, nothing needs to be done. In the case of equation (5), the error is nonadditive and the equation is in normalized form, so you must alter the value of RESID.Y. The following assigns a valid estimate of $\mu$ to RESID.Y:
   y = alpha * x ** beta;
   resid.y = actual.y / pred.y;
However, $\mu = e^{\epsilon}$, and therefore $\mu$ cannot have a mean of zero, and you cannot consistently estimate $\alpha$ and $\beta$ by minimizing the sum of squares of an estimate of $\mu$. Instead, you use $\epsilon = \ln \mu$.
proc model data=in;
   parms alpha beta;
   y = alpha * x ** beta;
   resid.y = log( actual.y / pred.y );
   fit y;
run;
Both examples produce estimates of $\alpha$ and $\beta$ of the level form that match the estimates of $a$ and $b$ of the log form. That is, ALPHA=exp(INTERCEPT) and BETA=LOGX, where INTERCEPT and LOGX are the PROC SYSLIN parameter estimates from the MODEL LOGY=LOGX. The standard error reported for ALPHA is different from that for the INTERCEPT in the log form.
The preceding example is not intended to suggest that loglinear models should be estimated in level form but, rather, to make the following points:
• Nonlinear transformations of equations involve the error term of the equation, and this should be taken into account when transforming models.
• The RESID.-prefixed and the EQ.-prefixed equation variables for models estimated by the MODEL procedure must represent additive errors with zero means.
• You can use assignments to RESID.-prefixed and EQ.-prefixed equation variables to transform error terms.
• Some models do not have additive errors or zero means, and many such models can be estimated using the MODEL procedure. The preceding approach applies not only to multiplicative models but to any model that can be manipulated to isolate the error term.
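The equivalence between the transformed level form and the log form can be sketched numerically. The Python code below is an illustration only and does not call SAS: it fits the log form by ordinary least squares, maps the estimates to ALPHA=exp(INTERCEPT) and BETA=LOGX, and confirms that the residual log(actual/pred) of the level form equals the residual of the log form. The data values and the function name are hypothetical.

```python
import math


def fit_loglinear(x, y):
    """OLS of log(y) on log(x); returns (intercept a, slope b)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    return my - b * mx, b


x = [1.0, 2.0, 4.0, 8.0]
mu = [1.1, 0.9, 1.05, 0.95]                 # hypothetical multiplicative errors
y = [2.0 * xi ** 1.5 * m for xi, m in zip(x, mu)]

a, b = fit_loglinear(x, y)
alpha, beta = math.exp(a), b

# resid.y = log(actual.y / pred.y), evaluated at the level-form parameters
level_resid = [math.log(yi / (alpha * xi ** beta)) for xi, yi in zip(x, y)]
# residual of the log form: log(y) - a - b*log(x)
log_resid = [math.log(yi) - a - b * math.log(xi) for xi, yi in zip(x, y)]
```

Since the two residual vectors coincide for any parameter values related by a = log(alpha) and b = beta, minimizing the sum of squares of one is the same problem as minimizing the other.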
For transformed models in which the biasing factor is known, you can use programming statements to correct for the bias in the predicted values as estimates of the endogenous means. In the preceding log-linear case, the predicted values are biased by the factor $\exp(\sigma^2/2)$. You can produce approximately unbiased predicted values in this case by writing the model as
proc model data=in;
   parms alpha beta;
   y=alpha * x ** beta;
   resid.y = log( actual.y / pred.y );
   fit y;
run;
See Miller (1984) for a discussion of bias factors for predicted values of transformed models. Note that models with transformed errors are not appropriate for Monte Carlo simulation that uses the SDATA= option. PROC MODEL computes the OUTS= matrix from the transformed RESID.-prefixed equation variables, while it uses the SDATA= matrix to generate multivariate normal errors, which are added to the predicted values. This method of computing errors is inconsistent when the equation variables have been transformed.
Ht
ht where
t
N.0; /.
For models that are homoscedastic, $h_t = 1$. If you have a model that is heteroscedastic with known form, you can improve the efficiency of the estimates by performing a weighted regression. The weight variable, using this notation, would be $1/\sqrt{h_t}$.
If the errors for a model are heteroscedastic and the functional form of the variance is known, the model for the variance can be estimated along with the regression function. To specify a functional form for the variance, assign the function to an H.var variable where var is the equation variable. For example, if you want to estimate the scale parameter for the variance of a simple regression model
$$y = a \times x + b$$
Consider the same model with the following functional form for the variance:
$$h_t = \sigma^2 \times x^{2\alpha}$$
There are three ways to model the variance in the MODEL procedure: feasible generalized least squares, generalized method of moments, and full information maximum likelihood.
Feasible GLS
A simple approach to estimating a variance function is to estimate the mean parameters by using some auxiliary method, such as OLS, and then use the residuals of that estimation to estimate the parameters of the variance function. This scheme is called feasible GLS. It is possible to use the residuals from an auxiliary method for the purpose of estimating the variance parameters because in many cases the residuals consistently estimate the error terms. For all estimation methods except GMM and FIML, using the H.var syntax specifies that feasible GLS is used in the estimation. For feasible GLS, the mean function is estimated by the usual method. The variance function is then estimated using the pseudo-likelihood (PL) function of the generated residuals. The objective function for the PL estimation is
$$p_n(\sigma, \theta) = \sum_{i=1}^{n} \left( \frac{(y_i - f(x_i, \hat{\beta}))^2}{\sigma^2 h(z_i, \theta)} + \log[\sigma^2 h(z_i, \theta)] \right)$$
Once the variance function has been estimated, the mean function is reestimated by using the variance function as weights. If an S-iterated method is selected, this process is repeated until convergence (iterated feasible GLS).
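A minimal Python sketch of a pseudo-likelihood objective of this kind is shown below, assuming the standard form in which each observation contributes its squared residual divided by the variance plus the log of the variance. The residuals, the variance function values, and the function name are hypothetical, and the sketch is not the PROC MODEL implementation.

```python
import math


def pl_objective(resid, h, sigma2):
    """Sum over observations of r_i^2 / (sigma2 * h_i) + log(sigma2 * h_i)."""
    return sum(r * r / (sigma2 * hi) + math.log(sigma2 * hi)
               for r, hi in zip(resid, h))


resid = [0.5, -1.0, 1.5, -0.5]      # hypothetical first-stage residuals
h = [1.0, 1.0, 1.0, 1.0]            # homoscedastic special case, h = 1
mse = sum(r * r for r in resid) / len(resid)

best = pl_objective(resid, h, mse)
lower = pl_objective(resid, h, 0.8 * mse)
higher = pl_objective(resid, h, 1.2 * mse)
```

In the homoscedastic special case the objective is minimized when the scale parameter equals the mean squared residual, which gives a quick sanity check on the formula.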
Note that feasible GLS does not yield consistent estimates when one of the following is true:
• The variance is unbounded.
• There is too much serial dependence in the errors (the dependence does not fade with time).
• There is a combination of serial dependence and lag dependent variables.
The first two cases are unusual, but the third is much more common. Whether iterated feasible GLS avoids consistency problems with the last case is an unanswered research question. For more information see Davidson and MacKinnon (1993, pp. 298-301) or Gallant (1987, pp. 124-125) and Amemiya (1985, pp. 202-203).
One limitation is that parameters cannot be shared between the mean equation and the variance equation. This implies that certain GARCH models, cross-equation restrictions of parameters, or testing of combinations of parameters in the mean and variance component are not allowed.
$$E(\epsilon_t) = 0$$
To add the second moment conditions to the estimation, add the equation
$$E(\epsilon_t \epsilon_t - h_t) = 0$$
for the linear example above, you can write
This is a popular way to estimate continuous-time interest rate processes (see Chan et al. 1992). The H.var syntax automatically generates this system of equations. To further take advantage of the information obtained about the variance, the moment equations can be modified to
$$E(\epsilon_t / \sqrt{h_t}) = 0$$
$$E(\epsilon_t \epsilon_t - h_t) = 0$$
proc model data=s;
   y = a * x + b;
   eq.two = resid.y**2 - sigma**2;
   resid.y = resid.y / sigma;
   fit y two/ gmm;
   instruments x;
run;
Note that, if the error model is misspecified in this form of the GMM model, the parameter estimates might be inconsistent.
ln . / D
@q.yt ; xt ; / ln @yt
t D1 i D1
where $g$ is the number of equations in the system. The HESSIAN=GLS option is not available for FIML estimation that involves variance functions. The matrix used when HESSIAN=CROSS is specified is a crossproducts matrix that has been enhanced by the dual quasi-Newton approximation.
Examples
You can specify a GARCH(1,1) model as follows:
proc model data=modloc.usd_jpy;
   /* Mean model --------*/
   jpyret = intercept ;
   /* Variance model ----------------*/
   h.jpyret = arch0 + arch1 * xlag( resid.jpyret ** 2, mse.jpyret )
            + garch1 * xlag( h.jpyret, mse.jpyret ) ;
   bounds arch0 arch1 garch1 >= 0;
   fit jpyret / method=marquardt fiml;
run;
Note that the BOUNDS statement is used to ensure that the parameters are nonnegative, a requirement for GARCH models. EGARCH models are used because there are no restrictions on the parameters. You can specify an EGARCH(1,1) model as follows:
proc model data=sasuser.usd_dem ;
   /* Mean model ----------*/
   demret = intercept ;
   /* Variance model ----------------*/
   if ( _OBS_ = 1 ) then
      h.demret = exp( earch0 + egarch1 * log(mse.demret) );
   else
      h.demret = exp( earch0 + earch1 * zlag( g )
                      + egarch1 * log(zlag(h.demret)) );
   g = - theta * nresid.demret + abs( nresid.demret ) - sqrt(2/3.1415);
   fit demret / method=marquardt fiml maxiter=100 converge=1.0e-6;
run;
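The GARCH(1,1) variance equation shown earlier is a simple recursion, which the following Python sketch traces by hand. It is an illustration under stated assumptions: the residuals and parameter values are hypothetical, and the substitution of the mean squared error for a missing lagged value mimics the role of the second argument in the XLAG calls of the GARCH example.

```python
def garch11_variance(resid, arch0, arch1, garch1, mse):
    """h_t = arch0 + arch1*e_{t-1}^2 + garch1*h_{t-1}; mse replaces missing lags."""
    h = []
    prev_e2 = None    # lagged squared residual, missing at the first observation
    prev_h = None     # lagged conditional variance, missing at the first observation
    for e in resid:
        lag_e2 = mse if prev_e2 is None else prev_e2
        lag_h = mse if prev_h is None else prev_h
        h_t = arch0 + arch1 * lag_e2 + garch1 * lag_h
        h.append(h_t)
        prev_e2, prev_h = e * e, h_t
    return h


# hypothetical residuals and parameter values
h = garch11_variance([1.0, -2.0, 0.5], arch0=0.1, arch1=0.2, garch1=0.7, mse=1.5)
```

With nonnegative parameters, as enforced by the BOUNDS statement, every term in the recursion is nonnegative, so the conditional variance stays positive whenever arch0 or the starting value is positive.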
0:1y C 2:5z 2 z
$$y'' = b y' + c y$$
$$y_0 = 0, \quad y_0' = 1$$
$$y' = z$$
$$z' = b y' + c y$$
The preceding statements produce the output shown in Figure 18.44. These statements produce additional output, which is not shown.
Figure 18.44 Simulation Results for Differential System
The MODEL Procedure
Simultaneous Simulation

Observation   1    Missing      2    CC   -1.000000
                   Iterations   0
The differential variables are distinguished by the derivative with respect to time (DERT.) prefix. Once you define the DERT. variable, you can use it on the right-hand side of another equation. The differential equations must be expressed in normal form; implicit differential equations are not allowed, and other terms on the left-hand side are not allowed. The TIME variable is the implied "with respect to" variable for all DERT. variables. The TIME variable is also the only variable that must be in the input data set. You can provide initial values for the differential equations in the data set, in the declaration statement (as in the previous example), or in statements in the program. Using the previous example, you can specify the initial values as
proc model data=t ;
   dependent y z ;
   parm b -2 c -4;
   if ( time=0 ) then
      do;
         y=0;
         z=1;
      end;
   else
      do;
         dert.y = z;
         dert.z = b * dert.y + c * y;
      end;
   solve y z / dynamic solveprint;
run;
differential variables, you must include Y and Z in the input data set in order to do a static simulation of the model. If the simulation is dynamic, the initial values for the differential equations are obtained from the data set, if they are available. If the variable is not in the data set, you can specify the initial value in a declaration statement. If you do not specify an initial value, the value of 0.0 is used. A dynamic solution is obtained by solving one initial value problem for all the data. A graph of a simple dynamic simulation is shown in Figure 18.45. If the time variable for the current observation is less than the time variable for the previous observation, the integration is restarted from this point. This allows for multiple samples in one data file.
Figure 18.45 Dynamic Solution
In a static solution, n-1 initial value problems are solved using the first n-1 data values as initial values. The equations are integrated using the ith data value as an initial value to the (i+1)th data value. Figure 18.46 displays a static simulation of noisy data from a simple differential equation. The static solution does not propagate errors in initial values as the dynamic solution does.
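The difference between the dynamic and static solutions can be sketched in a few lines of Python. This is an illustration only: it assumes the simple equation dy/dt = -k*y, whose exact one-step solution is y0*exp(-k*dt), so no numerical integrator is needed, and the data values and the function name are hypothetical.

```python
import math


def simulate(times, data, k, mode):
    """Solve dy/dt = -k*y between data points using the exact one-step solution."""
    pred = [data[0]]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        # dynamic: continue from the previous prediction (one initial value problem)
        # static: restart from the previous data value (n-1 initial value problems)
        start = pred[i - 1] if mode == "dynamic" else data[i - 1]
        pred.append(start * math.exp(-k * dt))
    return pred


times = [0.0, 1.0, 2.0, 3.0]
data = [10.0, 6.5, 3.4, 2.3]          # hypothetical noisy observations
dynamic = simulate(times, data, 0.5, "dynamic")
static = simulate(times, data, 0.5, "static")
```

The dynamic path depends only on the first data value, while each static prediction depends only on the observation immediately before it, which is why an error in the initial value propagates through the dynamic solution but not the static one.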
For estimation, the DYNAMIC and STATIC options in the FIT statement perform the same functions as they do in the SOLVE statement. Components of differential systems that have missing values or are not in the data set are simulated dynamically. For example, often in multiple compartment kinetic models, only one compartment is monitored. The differential equations that describe the unmonitored compartments are simulated dynamically. For estimation, it is important to have accurate initial values for ODEs that are not in the data set. If an accurate initial value is not known, the initial value can be made an unknown parameter and estimated. This allows for errors in the initial values but increases the number of parameters to estimate by the number of equations.
The differential equation that models this process is
$$\frac{d\,conc}{dt} = k_u - k_e\, conc$$
$$conc_0 = 0$$
The analytical solution to the model is
$$conc = (k_u / k_e)(1 - \exp(-k_e t))$$
The data for the model are
data fish;
   input day conc;
   datalines;
0.0 0.0
1.0 0.15
2.0 0.2
3.0 0.26
4.0 0.32
6.0 0.33
;
To perform a dynamic estimation of the differential equation, add the DYNAMIC option to the FIT statement.
proc model data=fish;
   parm ku .3 ke .3;
   dert.conc = ku - ke * conc;
   fit conc / time = day dynamic;
run;
The equation DERT.CONC is integrated from conc(0) = 0. The results from this estimation are shown in Figure 18.49.
Figure 18.49 Dynamic Estimation Results for Fish Model
The MODEL Procedure

Nonlinear OLS Parameter Estimates

Parameter    Approx Std Err    Approx Pr > |t|
ku           0.0170            0.0006
ke           0.0731            0.0030
To perform a dynamic estimation of the differential equation and estimate the initial value, use the following statements:
proc model data=fish;
   parm ku .3 ke .3 conc0 0;
   dert.conc = ku - ke * conc;
   fit conc initial=(conc = conc0) / time = day dynamic;
run;
The INITIAL= option in the FIT statement is used to associate the initial value of a differential equation with a parameter. The results from this estimation are shown in Figure 18.50.
Figure 18.50 Dynamic Estimation with Initial Value for Fish Model
The MODEL Procedure

Nonlinear OLS Parameter Estimates

Parameter    Approx Std Err    Approx Pr > |t|
ku           0.0230            0.0057
ke           0.0943            0.0165
conc0        0.0174            0.8414
Finally, to estimate the fish model by using the analytical solution, use the following statements:
proc model data=fish;
   parm ku .3 ke .3;
   conc = (ku/ ke)*( 1 -exp(-ke * day));
   fit conc;
run;
A comparison of the results among the four estimations reveals that the two dynamic estimations and the analytical estimation give nearly identical results (identical to the default precision). The two dynamic estimations are identical because the estimated initial value (0.00013071) is very close to the initial value used in the first dynamic estimation (0). Note also that the static model did not require an initial guess for the parameter values. Static estimation, in general, is more forgiving of bad initial values. The form of the estimation that is preferred depends mostly on the model and data. If a very accurate initial value is known, then a dynamic estimation makes sense. If, additionally, the model can be written analytically, then the analytical estimation is computationally simpler. If only an approximate initial value is known and not modeled as an unknown parameter, the static estimation is less sensitive to errors in the initial value. The form of the error in the model is also an important factor in choosing the form of the estimation.
If the error term is additive and independent of previous error, then the dynamic mode is appropriate. If, on the other hand, the errors are cumulative, a static estimation is more appropriate. See the section Monte Carlo Simulation on page 1120 for an example.
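The agreement between the dynamic and analytical forms of the fish model can be illustrated outside SAS. The Python sketch below integrates dconc/dt = ku - ke*conc from conc(0) = 0 with a classical fourth-order Runge-Kutta scheme and compares the result with the closed-form solution. The integrator is chosen here for illustration and is not claimed to be the method PROC MODEL uses; the parameter values are the starting values from the PARM statement, not fitted estimates, and the function names are hypothetical.

```python
import math


def integrate_conc(ku, ke, t_end, steps=600):
    """Integrate dconc/dt = ku - ke*conc from conc(0) = 0 with classical RK4."""
    def f(c):
        return ku - ke * c

    h = t_end / steps
    c = 0.0
    for _ in range(steps):
        k1 = f(c)
        k2 = f(c + 0.5 * h * k1)
        k3 = f(c + 0.5 * h * k2)
        k4 = f(c + h * k3)
        c += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return c


def analytic_conc(ku, ke, t):
    """Closed-form solution (ku/ke) * (1 - exp(-ke*t))."""
    return (ku / ke) * (1.0 - math.exp(-ke * t))


ku, ke = 0.3, 0.3       # starting values from the PARM statement, not estimates
numeric = integrate_conc(ku, ke, 6.0)
exact = analytic_conc(ku, ke, 6.0)
```

When the initial value is known exactly, the integrated and analytical predictions coincide up to integration error, which is consistent with the nearly identical estimates reported for the dynamic and analytical fits.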
Auxiliary Equations
Auxiliary equations can be used with differential equations. These are equations that need to be satisfied with the differential equations at each point between each data value. They are automatically added to the system, so you do not need to specify them in the SOLVE or FIT statement. Consider the following example. The Michaelis-Menten equations describe the kinetics of an enzyme-catalyzed reaction. The enzyme is E, and S is called the substrate. The enzyme first reacts with the substrate to form the enzyme-substrate complex ES, which then breaks down in a second step to form enzyme and products P. The reaction rates are described by the following system of differential equations:
$$\frac{d[ES]}{dt} = k_1 ([E] - [ES])[S] - k_2 [ES] - k_3 [ES]$$
$$\frac{d[S]}{dt} = -k_1 ([E] - [ES])[S] + k_2 [ES]$$
$$[E] = [E]_{tot} - [ES]$$
The first equation describes the rate of formation of ES from E + S. The rate of formation of ES from E + P is very small and can be ignored. The enzyme is in either the complexed or the uncomplexed form. So if the total ($[E]_{tot}$) concentration of enzyme and the amount bound to the substrate is known, E can be obtained by conservation. In this example, the conservation equation is an auxiliary equation and is coupled with the differential equations for integration.
Time Variable
You must provide a time variable in the data set. The name of the time variable defaults to TIME. You can use other variables as the time variable by specifying the TIME= option in the FIT or SOLVE statement. The time intervals need not be evenly spaced. If the time variable for the current observation is less than the time variable for the previous observation, the integration is restarted.
data t2;
   y=0; time=0; output;
   y=2; time=1; output;
   y=3; time=2; output;
run;
The problem is to find values for X that satisfy the differential equation and the data in the data set. Problems of this kind are sometimes referred to as goal-seeking problems because they require you to search for values of X that satisfy the goal of Y. This problem is solved with the following statements:
proc model data=t2;
   independent x 0;
   dependent y;
   parm a 5;
   dert.y = a * x;
   solve x / out=goaldata;
run;

proc print data=goaldata;
run;
The output from the PROC PRINT statement is shown in Figure 18.52.
Figure 18.52 Dynamic Solution
Obs    _TYPE_     _MODE_      _ERRORS_    x           y          time
1      PREDICT    SIMULATE    0            0.00000    0.00000    0
2      PREDICT    SIMULATE    0            0.80000    2.00000    1
3      PREDICT    SIMULATE    0           -0.40000    3.00000    2
Note that an initial value of 0 is provided for the X variable because it is undetermined at TIME = 0. In the preceding goal-seeking example, X is treated as a linear function between each set of data points (see Figure 18.53).
$$\int_{t_0}^{t_f} a\,x(t)\,dt = a\,\frac{1}{t_f - t_0}\left( t\,(t_f x_0 - t_0 x_f) + \frac{1}{2}\,t^2 (x_f - x_0) \right)\Bigg|_{t_0}^{t_f}$$

So x = 0.8 for this observation. Goal seeking for the TIME variable is not allowed.
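The goal-seeking arithmetic can be checked outside SAS. The following is a minimal Python sketch, not PROC MODEL code: it assumes, as the text states, that X is linear between observations, so the integral of a*x over one interval equals a * (tf - t0) * (x0 + xf) / 2. The function name `goal_seek_x` is hypothetical.

```python
def goal_seek_x(times, ys, a, x0=0.0):
    """Solve for x at each observation so that dy/dt = a*x reproduces y.

    Assumes x is linear between observations, so over one interval
    y_f - y_0 = a * (t_f - t_0) * (x_0 + x_f) / 2.
    """
    xs = [x0]  # initial value supplied for x at the first observation
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        dy = ys[i] - ys[i - 1]
        # Rearranged trapezoid relation: x_f = 2*dy/(a*dt) - x_0
        xs.append(2.0 * dy / (a * dt) - xs[-1])
    return xs


# Data set t2 and the parameter value a = 5 from the example above
xs = goal_seek_x(times=[0, 1, 2], ys=[0, 2, 3], a=5.0)
print([round(x, 5) for x in xs])
```

With the data set t2 and a = 5, the sketch returns 0.8 and -0.4 for observations 2 and 3, the values of x shown in Figure 18.52.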
Equality restrictions can be written as a vector function:

$$h(\theta) = 0$$

Inequality restrictions are either active or inactive. When an inequality restriction is active, it is treated as an equality restriction. All inactive inequality restrictions can be written as a vector function:

$$F(\theta) \ge 0$$

Strict inequalities, such as $f(\theta) > 0$, are transformed into inequalities as $f(\theta)(1 - \epsilon) - \epsilon \ge 0$, where the tolerance $\epsilon$ is controlled by the EPSILON= option in the FIT statement and defaults to $10^{-8}$. The $i$th inequality restriction becomes active if $F_i < 0$ and remains active until its Lagrange multiplier becomes negative. Lagrange multipliers are computed for all the nonredundant equality restrictions and all the active inequality restrictions.

For the following, assume the vector $h(\theta)$ contains all the current active restrictions. The constraint matrix $A$ is

$$A(\hat\theta) = \frac{\partial h(\hat\theta)}{\partial \hat\theta}$$

The covariance matrix for the restricted parameter estimates is computed as

$$Z(Z'HZ)^{-1}Z'$$

where $H$ is the Hessian or an approximation to the Hessian of the objective function ($X'(\mathrm{diag}(S)^{-1} \otimes I)X$ for OLS), and $Z$ is the last $(np - nc)$ columns of $Q$. $Q$ is from an LQ factorization of the constraint matrix, $nc$ is the number of active constraints, and $np$ is the number of parameters. See Gill, Murray, and Wright (1981) for more details on LQ factorization. The covariance column in Table 18.2 summarizes the Hessian approximation used for each estimation method.

The covariance matrix for the Lagrange multipliers is computed as

$$(AH^{-1}A')^{-1}$$

The p-value reported for a restriction is computed from a beta distribution rather than a t distribution because the numerator and the denominator of the t ratio for an estimated Lagrange multiplier are not independent.

The Lagrange multipliers for the active restrictions are printed with the parameter estimates. The Lagrange multiplier estimates are computed using the relationship

$$A'\lambda = g$$

where the dimensions of the constraint matrix $A$ are the number of constraints by the number of parameters, $\lambda$ is the vector of Lagrange multipliers, and $g$ is the gradient of the objective function at the final estimates.

The final gradient includes the effects of the estimated $S$ matrix. For example, for OLS the final gradient would be:

$$g = X'(\mathrm{diag}(S)^{-1} \otimes I)\,r$$

where $r$ is the residual vector. Note that when nonlinear restrictions are imposed, the convergence measure R might have values greater than one for some iterations.
Tests on Parameters
In general, the hypothesis tested can be written as

$$H_0 : h(\theta) = 0$$

where $h(\theta)$ is a vector-valued function of the parameters $\theta$ given by the $r$ expressions specified on the TEST statement.

Let $\hat V$ be the estimate of the covariance matrix of $\hat\theta$. Let $\hat\theta$ be the unconstrained estimate of $\theta$ and $\tilde\theta$ be the constrained estimate of $\theta$ such that $h(\tilde\theta) = 0$. Let

$$A(\theta) = \partial h(\theta)/\partial\theta \,\big|_{\hat\theta}$$

Let $r$ be the dimension of $h(\theta)$ and $n$ be the number of observations. Using this notation, the test statistics for the three kinds of tests are computed as follows.

The Wald test statistic is defined as

$$W = h'(\hat\theta)\left[ A(\hat\theta)\,\hat V\,A'(\hat\theta) \right]^{-1} h(\hat\theta)$$

The Wald test is not invariant to reparameterization of the model (Gregory and Veall 1985; Gallant 1987, p. 219). For more information about the theoretical properties of the Wald test, see Phillips and Park (1988).

The Lagrange multiplier test statistic is

$$R = \lambda' A(\tilde\theta)\,\tilde V\,A'(\tilde\theta)\,\lambda$$

where $\lambda$ is the vector of Lagrange multipliers from the computation of the restricted estimate $\tilde\theta$.

The Lagrange multiplier test statistic is equivalent to Rao's efficient score test statistic:

$$R = (\partial L(\tilde\theta)/\partial\theta)'\,\tilde V\,(\partial L(\tilde\theta)/\partial\theta)$$

where $L$ is the log-likelihood function for the estimation method used. For SUR, 3SLS, GMM, and iterated versions of these methods, the likelihood function is computed as

$$L = \mathrm{Objective} \times \mathrm{Nobs}/2$$

For OLS and 2SLS, the Lagrange multiplier test statistic is computed as:

$$R = \left[ (\partial \hat S(\tilde\theta)/\partial\theta)'\,\tilde V\,(\partial \hat S(\tilde\theta)/\partial\theta) \right] / \hat S(\tilde\theta)$$

where $\hat S(\tilde\theta)$ is the corresponding objective function value at the constrained estimate.

The likelihood ratio test statistic is

$$T = 2\left( L(\hat\theta) - L(\tilde\theta) \right)$$

where $\tilde\theta$ represents the constrained estimate of $\theta$ and $L$ is the concentrated log-likelihood value.

For OLS and 2SLS, the likelihood ratio test statistic is computed as:

$$T = (n - nparms) \times \left( \hat S(\tilde\theta) - \hat S(\hat\theta) \right) / \hat S(\hat\theta)$$

This test statistic is an approximation from

$$T = n \times \log\left( 1 + \frac{rF}{n - nparms} \right)$$

when the value of $rF/(n - nparms)$ is small (Greene 2004, p. 421).

The likelihood ratio test is not appropriate for models with nonstationary serially correlated errors (Gallant 1987, p. 139). The likelihood ratio test should not be used for dynamic systems, for systems with lagged dependent variables, or with the FIML estimation method unless certain conditions are met (see Gallant 1987, p. 479).

For each kind of test, under the null hypothesis the test statistic is asymptotically distributed as a $\chi^2$ random variable with $r$ degrees of freedom, where $r$ is the number of expressions in the TEST statement. The p-values reported for the tests are computed from the $\chi^2(r)$ distribution and are only asymptotically valid. When both RESTRICT and TEST statements are used in a PROC MODEL step, test statistics are computed by taking into account the constraints imposed by the RESTRICT statement.

Monte Carlo simulations suggest that the asymptotic distribution of the Wald test is a poorer approximation to its small sample distribution than the other two tests. However, the Wald test has the least computational cost, since it does not require computation of the constrained estimate $\tilde\theta$.

The following is an example of using the TEST statement to perform a likelihood ratio test for a compound hypothesis.
test a*exp(-k) = 1-k, d = 0 ,/ lr;
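To make the Wald statistic concrete, the following Python sketch (not PROC MODEL code) evaluates W for the first expression of the TEST statement above, a*exp(-k) = 1-k, treated as a single restriction. The parameter estimates and the covariance matrix are made-up illustrative numbers, and `wald_single_restriction` is a hypothetical helper. With one restriction the bracketed matrix in W is a scalar, so no matrix inversion is needed, and the chi-square(1) p-value can be written with the complementary error function.

```python
import math


def wald_single_restriction(h, grad, V):
    """Wald statistic W = h^2 / (A V A') for one restriction h(theta) = 0.

    grad is the row vector A = dh/dtheta at the unconstrained estimate and
    V is the estimated covariance matrix of the parameter estimates.
    Returns (W, p-value), the p-value taken from chi-square with 1 df.
    """
    n = len(grad)
    avat = sum(grad[i] * V[i][j] * grad[j] for i in range(n) for j in range(n))
    w = h * h / avat
    # For 1 df, P(chi2 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2))
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value


# Hypothetical unconstrained estimates and covariance (illustration only)
a_hat, k_hat = 0.9, 0.3
V = [[0.04, 0.005],
     [0.005, 0.01]]

# Restriction h(a, k) = a*exp(-k) - (1 - k) and its gradient (dh/da, dh/dk)
h = a_hat * math.exp(-k_hat) - (1.0 - k_hat)
grad = [math.exp(-k_hat), 1.0 - a_hat * math.exp(-k_hat)]

w, p = wald_single_restriction(h, grad, V)
print(w, p)
```

Writing the same restriction in an algebraically equivalent form changes both h and its gradient, and in general also W, which is the lack of invariance to reparameterization noted above.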
It is important to keep in mind that although individual t tests for each parameter are printed by default into the parameter estimates table, they are only asymptotically valid for nonlinear models. You should be cautious in drawing any inferences from these t tests for small samples.
where $\hat V_1$ and $\hat V_0$ represent consistent estimates of the asymptotic covariance matrices of $\hat\theta_1$ and $\hat\theta_0$ respectively, and

$$q = \hat\theta_1 - \hat\theta_0$$

The m-statistic is then distributed $\chi^2$ with $k$ degrees of freedom, where $k$ is the rank of the matrix $(\hat V_1 - \hat V_0)$. A generalized inverse is used, as recommended by Hausman and Taylor (1982).

In the MODEL procedure, Hausman's m-statistic can be used to determine if it is necessary to use an instrumental variables method rather than a more efficient OLS estimation. Hausman's m-statistic can also be used to compare 2SLS with 3SLS for a class of estimators for which 3SLS is asymptotically efficient (similarly for OLS and SUR).

Hausman's m-statistic can also be used, in principle, to test the null hypothesis of normality when comparing 3SLS to FIML. Because of the poor performance of this form of the test, it is not offered in the MODEL procedure. See Fair (1984, pp. 246–247) for a discussion of why Hausman's test fails for common econometric models.

To perform a Hausman's specification test, specify the HAUSMAN option in the FIT statement. The selected estimation methods are compared using Hausman's m-statistic.

In the following example, Hausman's test is used to check the presence of measurement error. Under $H_0$ of no measurement error, OLS is efficient, while under $H_1$, 2SLS is consistent. In the following code, OLS and 2SLS are used to estimate the model, and Hausman's test is requested.
   proc model data=one out=fiml2;
      endogenous y1 y2;
      y1 = py2 * y2 + px1 * x1 + interc;
      y2 = py1 * y1 + pz1 * z1 + d2;
      fit y1 y2 / ols 2sls hausman;
      instruments x1 z1;
   run;
The output specified by the HAUSMAN option produces the results shown in Figure 18.54.

Figure 18.54 Hausman's Specification Test Results

   The MODEL Procedure

   Hausman's Specification Test Results

   Consistent under H1   DF   Statistic   Pr > ChiSq
   2SLS                  6    13.86       0.0313

Figure 18.54 indicates that 2SLS is preferred over OLS at the 5% level of significance. In this case, the null hypothesis of no measurement error is rejected. Hence, the instrumental variable estimator is required for this example due to the presence of measurement error.
Chow Tests
The Chow test is used to test for break points or structural changes in a model. The problem is posed as a partitioning of the data into two parts of size $n_1$ and $n_2$. The null hypothesis to be tested is

$$H_0 : \beta_1 = \beta_2 = \beta$$

where $\beta_1$ is estimated by using the first part of the data and $\beta_2$ is estimated by using the second part.

The test is performed as follows (see Davidson and MacKinnon 1993, p. 380).

1. The $p$ parameters of the model are estimated.

2. A second linear regression is performed on the residuals, $\hat u$, from the nonlinear estimation in step one.

   $$\hat u = \hat X b + \text{residuals}$$

   where $\hat X$ is Jacobian columns that are evaluated at the parameter estimates. If the estimation is an instrumental variables estimation with matrix of instruments $W$, then the following regression is performed:

   $$\hat u = P_W \hat X b + \text{residuals}$$

   where $P_W$ is the projection matrix.

3. The restricted SSE (RSSE) from this regression is obtained. An SSE for each subsample is then obtained by using the same linear regression.

4. The F statistic is then

   $$f = \frac{(RSSE - SSE_1 - SSE_2)/p}{(SSE_1 + SSE_2)/(n - 2p)}$$

   This test has $p$ and $n - 2p$ degrees of freedom.

Chow's test is not applicable if $\min(n_1, n_2) < p$, since one of the two subsamples does not contain enough data to estimate $\beta$. In this instance, the predictive Chow test can be used. The predictive Chow test is defined as

$$f = \frac{(RSSE - SSE_1) \times (n_1 - p)}{SSE_1 \times n_2}$$

where $n_1 > p$. This test can be derived from the Chow test by noting that $SSE_2 = 0$ when $n_2 \le p$ and by adjusting the degrees of freedom appropriately.

You can select the Chow test and the predictive Chow test by specifying the CHOW=arg and the PCHOW=arg options in the FIT statement, where arg is either the number of observations in the first sample or a parenthesized list of first sample sizes. If the size of one of the two groups into which the sample is partitioned is less than the number of parameters, then a predictive Chow test is automatically used. These test statistics are not produced for GMM and FIML estimations.

The following is an example of the use of the Chow test.
   data exp;
      x=0;
      do time=1 to 100;
         if time=50 then x=1;
         y = 35 * exp( 0.01 * time ) + rannor( 123 ) + x * 5;
         output;
      end;
   run;

   proc model data=exp;
      parm zo 35 b;
      dert.z = b * z;
      y=z;
      fit y init=(z=zo) / chow=(40 50 60) pchow=90;
   run;
The data set introduces an artificial structural change into the model (the structural change affects the intercept parameter). The output from the requested Chow tests is shown in Figure 18.55.

Figure 18.55 Chow's Test Results
   The MODEL Procedure

   Structural Change Test

   Break Point   Num DF   Den DF
   40            2        96
   50            2        96
   60            2        96
   90            11       87
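The two F statistics in the Chow Tests section are simple functions of the sums of squared errors. The following Python sketch (not PROC MODEL code) evaluates both formulas; the function names and the SSE values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def chow_f(rsse, sse1, sse2, n, p):
    """Chow F statistic with its p and n - 2p degrees of freedom."""
    num_df = p
    den_df = n - 2 * p
    f = ((rsse - sse1 - sse2) / num_df) / ((sse1 + sse2) / den_df)
    return f, num_df, den_df


def predictive_chow_f(rsse, sse1, n1, n2, p):
    """Predictive Chow F statistic, used when a subsample is too small."""
    if n1 <= p:
        raise ValueError("the predictive Chow test requires n1 > p")
    return ((rsse - sse1) * (n1 - p)) / (sse1 * n2)


# Illustrative (made-up) sums of squared errors for n = 100 and p = 2
f, num_df, den_df = chow_f(rsse=120.0, sse1=50.0, sse2=46.0, n=100, p=2)
print(f, num_df, den_df)

f_pred = predictive_chow_f(rsse=120.0, sse1=88.0, n1=90, n2=10, p=2)
print(f_pred)
```

For n = 100 and p = 2 the Chow test has 2 and 96 degrees of freedom, which agrees with the Num DF and Den DF columns for break points 40, 50, and 60 in Figure 18.55.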
where $z_p$ is the $100p$th percentile of the standard normal distribution, $\hat\theta$ is the maximum likelihood estimate of $\theta$, and $\hat\sigma$ is the standard error estimate of $\hat\theta$.

A likelihood-ratio-based confidence interval is derived from the $\chi^2$ distribution of the generalized likelihood ratio test. The approximate $1 - \alpha$ confidence interval for a parameter $\theta$ is

$$\theta : 2\left[ l(\hat\theta) - l(\theta) \right] \le q_{1,1-\alpha} = 2l^*$$

where $q_{1,1-\alpha}$ is the $(1 - \alpha)$ quantile of the $\chi^2$ with one degree of freedom, and $l(\theta)$ is the log likelihood as a function of one parameter. The endpoints of a confidence interval are the zeros of the function $l(\theta) - l^*$. Computing a likelihood-ratio-based confidence interval is an iterative process. This process must be performed twice for each parameter, so the computational cost is considerable. Using a modified form of the algorithm recommended by Venzon and Moolgavkar (1988), you can determine that the cost of each endpoint computation is approximately the cost of estimating the original system.

To request confidence intervals on estimated parameters, specify the PRL= option in the FIT statement. By default, the PRL option produces 95% likelihood ratio confidence limits. The coverage of the confidence interval is controlled by the ALPHA= option in the FIT statement.

The following is an example of the use of the confidence interval options.
   data exp;
      do time = 1 to 20;
         y = 35 * exp( 0.01 * time ) + 5*rannor( 123 );
         output;
      end;
   run;

   proc model data=exp;
      parm zo 35 b;
      dert.z = b * z;
      y=z;
      fit y init=(z=zo) / prl=both;
      test zo = 40.475437 ,/ lr;
   run;
The output from the requested confidence intervals and the TEST statement is shown in Figure 18.56.

Figure 18.56 Confidence Interval Estimation
   The MODEL Procedure

   Nonlinear OLS Parameter Estimates

   Parameter   Approx Std Err   Approx Pr > |t|
   zo          1.9471           <.0001
   b           0.00464          0.1780

   Test Results

   Test    Type   Statistic   Pr > ChiSq   Label
   Test0   L.R.   3.81        0.0509       zo = 40.475437

   Parameter Likelihood Ratio 95% Confidence Intervals

   Parameter   Value     Lower
   zo          36.5893   32.8381
   b           0.00650   -0.00264
Note that the likelihood ratio test reported the probability that zo = 40.47543 is 5%, but zo = 40.47543 is the upper bound of a 95% confidence interval. To understand this conundrum, note that the TEST statement is using the likelihood ratio statistic to test the null hypothesis $H_0 : zo = 40.47543$ with the alternate that $H_a : zo \ne 40.47543$. The upper confidence interval can be viewed as a test with the null hypothesis $H_0 : zo \le 40.47543$.
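The idea that the endpoints of a likelihood-ratio-based interval are roots of a likelihood equation can be illustrated on a toy problem. The following Python sketch is an assumption-laden illustration, not the modified Venzon and Moolgavkar algorithm that PROC MODEL uses: it takes a normal sample with known unit variance, where the log likelihood in the mean is available in closed form, and finds each endpoint by plain bisection. All names and data values are hypothetical.

```python
import math

CHI2_1_95 = 3.841458820694124  # 0.95 quantile of chi-square with 1 df


def loglik(theta, data, sigma=1.0):
    """Normal log likelihood in the mean theta, up to an additive constant."""
    return -sum((x - theta) ** 2 for x in data) / (2.0 * sigma ** 2)


def lr_endpoint(data, theta_hat, direction, q=CHI2_1_95, tol=1e-10):
    """Find theta on one side of theta_hat where 2[l(theta_hat) - l(theta)] = q."""
    def g(theta):
        return 2.0 * (loglik(theta_hat, data) - loglik(theta, data)) - q

    inside = theta_hat                 # g < 0 here (inside the interval)
    outside = theta_hat + direction    # candidate point outside the interval
    while g(outside) < 0.0:            # widen the bracket until g changes sign
        outside = theta_hat + 2.0 * (outside - theta_hat)
    while abs(outside - inside) > tol:  # bisection on the sign of g
        mid = 0.5 * (inside + outside)
        if g(mid) < 0.0:
            inside = mid
        else:
            outside = mid
    return 0.5 * (inside + outside)


data = [1.2, 0.7, 1.9, 1.4, 0.8]       # made-up sample
theta_hat = sum(data) / len(data)      # maximum likelihood estimate of the mean
lower = lr_endpoint(data, theta_hat, -1.0)
upper = lr_endpoint(data, theta_hat, +1.0)
print(lower, theta_hat, upper)
```

In this toy case the interval coincides with the usual normal-theory interval, the estimate plus or minus 1.96 standard errors. A likelihood ratio test of the hypothesis that the parameter equals either endpoint has a p-value of 0.05 by construction, which is the situation described for zo in Figure 18.56.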
Choice of Instruments
Several of the estimation methods supported by PROC MODEL are instrumental variables methods. There is no standard method for choosing instruments for nonlinear regression. Few econometric textbooks discuss the selection of instruments for nonlinear models. See Bowden and Turkington (1984, pp. 180–182) for more information.

The purpose of the instrumental projection is to purge the regressors of their correlation with the residual. For nonlinear systems, the regressors are the partials of the residuals with respect to the parameters.

Possible instrumental variables include the following:

• any variable in the model that is independent of the errors
• lags of variables in the system
• derivatives with respect to the parameters, if the derivatives are independent of the errors
• low-degree polynomials in the exogenous variables
• variables from the data set or functions of variables from the data set

Selected instruments must not have any of the following characteristics:

• depend on any variable endogenous with respect to the equations estimated
• depend on any of the parameters estimated
• be lags of endogenous variables if there is serial correlation of the errors

If the preceding rules are satisfied and there are enough observations to support the number of instruments used, the results should be consistent and the efficiency loss held to a minimum.

You need at least as many instruments as the maximum number of parameters in any equation, or some of the parameters cannot be estimated. Note that "number of instruments" means linearly independent instruments. If you add an instrument that is a linear combination of other instruments, it has no effect and does not increase the effective number of instruments.

You can, however, use too many instruments. In order to get the benefit of instrumental variables, you must have more observations than instruments. Thus, there is a trade-off; the instrumental variables technique completely eliminates the simultaneous equation bias only in large samples. In finite samples, the larger the excess of observations over instruments, the more the bias is reduced. Adding more instruments might improve the efficiency, but after some point efficiency declines as the excess of observations over instruments becomes smaller and the bias grows.

The instruments used in an estimation are printed out at the beginning of the estimation. For example, the following statements produce the instruments list shown in Figure 18.57.
   proc model data=test2;
      exogenous x1 x2;
      parms b1 a1 a2 b2 2.5 c2 55;
      y1 = a1 * y2 + b1 * exp(x1);
      y2 = a2 * y1 + b2 * x2 * x2 + c2 / x2;
      fit y1 y2 / n2sls;
      inst b1 b2 c2 x1;
   run;
This states that an intercept term, the exogenous variable X1, and the partial derivatives of the equations with respect to B1, B2, and C2, were used as instruments for the estimation.
Examples
Suppose that Y1 and Y2 are endogenous variables, that X1 and X2 are exogenous variables, and that A, B, C, D, E, F, and G are parameters. Consider the following model:
The INSTRUMENTS statement produces X1, X2, LAG(Y1), and an intercept as instruments. In order to estimate the Y1 equation by itself, it is necessary to include X2 explicitly in the instruments since F, in this case, is not included in the following estimation:
   y1 = a + b * x1 + c * y2 + d * lag(y1);
   y2 = e + f * x2 + g * y1;
   fit y1;
   instruments x2 exclude=(c);
This produces the same instruments as before. You can list the parameter associated with the lagged variable as an instrument instead of using the EXCLUDE= option. Thus, the following is equivalent to the previous example:
   y1 = a + b * x1 + c * y2 + d * lag(y1);
   y2 = e + f * x2 + g * y1;
   fit y1;
   instruments x1 x2 d;
For an example of declaring instruments when estimating a model involving identities, consider Klein's Model I:

   proc model data=klien;
      endogenous c p w i x wsum k y;
      exogenous wp g t year;
      parms c0-c3 i0-i3 w0-w3;
      a: c = c0 + c1 * p + c2 * ... ;
      b: i = i0 + i1 * p + i2 * ... ;
      c: w = w0 + w1 * x + w2 * ... ;
      x = c + i + g;
      y = c + i + g-t;
      p = x-w-t;
      k = lag(k) + i;
      wsum = w + wp;
   run;
The three equations to estimate are identified by the labels A, B, and C. The parameters associated with the predetermined terms are C2, I2, I3, W2, and W3 (and the intercepts, which are automatically added to the instruments). In addition, the system includes five identities that contain the predetermined variables G, T, LAG(K), and WP. Thus, the INSTRUMENTS statement can be written as

   lagk = lag(k);
   instruments c2 i2 i3 w2 w3 g t wp lagk;

where LAGK is a program variable used to hold LAG(K). However, this is more complicated than it needs to be. Except for LAG(K), all the predetermined terms in the identities are exogenous variables, and LAG(K) is already included as the coefficient of I3. There are also more parameters for predetermined terms than for endogenous terms, so you might prefer to use the EXCLUDE= option. Thus, you can specify the same instruments list with the simpler statement
instruments _exog_ exclude=(c1 c3 i1 w1);
To illustrate the use of polynomial terms as instrumental variables, consider the following model:
y1 = a + b * exp( c * x1 ) + d * log( x2 ) + e * exp( f * y2 );
The parameters are A, B, C, D, E, and F, and the right-hand-side variables are X1, X2, and Y2. Assume that X1 and X2 are exogenous (independent of the error), while Y2 is endogenous. The equation for Y2 is not specified, but assume that it includes the variables X1, X3, and Y1, with X3 exogenous, so the exogenous variables of the full system are X1, X2, and X3. Using as instruments quadratic terms in the exogenous variables, the model is specified to PROC MODEL as follows:

   proc model;
      parms a b c d e f;
      y1 = a + b * exp( c * x1 ) + d * log( x2 ) + e * exp( f * y2 );
      instruments inst1-inst9;
      inst1 = x1;
      inst2 = x2;
      inst3 = x3;
      inst4 = x1 * x1;
      inst5 = x1 * x2;
      inst6 = x1 * x3;
      inst7 = x2 * x2;
      inst8 = x2 * x3;
      inst9 = x3 * x3;
      fit y1 / 2sls;
   run;

It is not clear what degree polynomial should be used. There is no way to know how good the approximation is for any degree chosen, although the first-stage $R^2$ values might help the assessment.
First-Stage R-Squares
When the FSRSQ option is used on the FIT statement, the MODEL procedure prints a column of first-stage $R^2$ (FSRSQ) statistics along with the parameter estimates. The FSRSQ measures the fraction of the variation of the derivative column associated with the parameter that remains after projection through the instruments.

Ideally, the FSRSQ should be very close to 1.00 for exogenous derivatives. If the FSRSQ is small for an endogenous derivative, it is unclear whether this reflects a poor choice of instruments or a large influence of the errors in the endogenous right-hand-side variables. When the FSRSQ for one or more parameters is small, the standard errors of the parameter estimates are likely to be large.

Note that you can make all the FSRSQs larger (or 1.00) by including more instruments, but doing so incurs the disadvantage discussed previously: as the excess of observations over instruments shrinks, efficiency declines and the bias grows. The FSRSQ statistics reported are unadjusted $R^2$ values and do not include a degrees-of-freedom correction.
Autoregressive Errors
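A model with first-order autoregressive errors, AR(1), has the form

$$y_t = f(x_t, \theta) + \mu_t$$

$$\mu_t = \phi\,\mu_{t-1} + \epsilon_t$$

while an AR(2) error process has the form

$$\mu_t = \phi_1\mu_{t-1} + \phi_2\mu_{t-2} + \epsilon_t$$

and so forth for higher-order processes. Note that the $\epsilon_t$ and $\mu_t$ have an expected value of 0.

An example of a model with an AR(2) component is

$$y = \alpha + \beta x_1 + \mu_t$$

$$\mu_t = \phi_1\mu_{t-1} + \phi_2\mu_{t-2} + \epsilon_t$$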
Moving-Average Models
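A model with first-order moving-average errors, MA(1), has the form

$$y_t = f(x_t) + \mu_t$$

$$\mu_t = \epsilon_t - \theta_1\epsilon_{t-1}$$

where $\epsilon_t$ is identically and independently distributed with mean zero. An MA(2) error process has the form

$$\mu_t = \epsilon_t - \theta_1\epsilon_{t-1} - \theta_2\epsilon_{t-2}$$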
and so forth for higher-order processes. For example, you can write a simple linear regression model with MA(2) moving-average errors as
   proc model data=inma2;
      parms a b ma1 ma2;
      y = a + b * x + ma1 * zlag1( resid.y ) +
          ma2 * zlag2( resid.y );
      fit;
   run;
where MA1 and MA2 are the moving-average parameters. Note that RESID.Y is automatically defined by PROC MODEL as

   pred.y = a + b * x + ma1 * zlag1( resid.y ) +
            ma2 * zlag2( resid.y );
   resid.y = pred.y - actual.y;

Note that RESID.Y is the negative of $\epsilon_t$.
The ZLAG function must be used for MA models to truncate the recursion of the lags. This ensures that the lagged errors start at zero in the lag-priming phase and do not propagate missing values when lag-priming period variables are missing, and it ensures that the future errors are zero rather than missing during simulation or forecasting. For details about the lag functions, see the section "Lag Logic" on page 1161.

This model written using the %MA macro is as follows:

   proc model data=inma2;
      parms a b;
      y = a + b * x;
      %ma(y, 2);
      fit;
   run;
$$\mu_t = \phi_1\mu_{t-1} + \ldots + \phi_p\mu_{t-p} + \epsilon_t - \theta_1\epsilon_{t-1} - \ldots - \theta_q\epsilon_{t-q}$$
where ARi and MAj represent the autoregressive and moving-average parameters for the various lags. You can use any names you want for these variables, and there are many equivalent ways that the specification could be written.

Vector ARMA processes can also be estimated with PROC MODEL. For example, a two-variable AR(1) process for the errors of the two endogenous variables Y1 and Y2 can be specified as follows:
   y1hat = ... compute structural predicted value here ... ;
   y1 = ...     /* ar part y1,y1 */
        ...     /* ar part y1,y2 */

   y2hat = ... compute structural predicted value here ... ;
   y2 = ...     /* ar part y2,y2 */
        ...     /* ar part y2,y1 */
statement to estimate the ARMA parameters only, using the structural parameter values from the first run. Since the values of the structural parameters are likely to be close to their final estimates, the ARMA parameter estimates might now converge. Finally, use another FIT statement to produce simultaneous estimates of all the parameters. Since the initial values of the parameters are now likely to be quite close to their final joint estimates, the estimates should converge quickly if the model is appropriate for the data.
AR Initial Conditions
The initial lags of the error terms of AR(p) models can be modeled in different ways. The autoregressive error startup methods supported by SAS/ETS procedures are the following:

   CLS   conditional least squares (ARIMA and MODEL procedures)
   ULS   unconditional least squares (AUTOREG, ARIMA, and MODEL procedures)
   ML    maximum likelihood (AUTOREG, ARIMA, and MODEL procedures)
   YW    Yule-Walker (AUTOREG procedure only)
   HL    Hildreth-Lu, which deletes the first p observations (MODEL procedure only)
See Chapter 8, The AUTOREG Procedure, for an explanation and discussion of the merits of various AR(p) startup methods. The CLS, ULS, ML, and HL initializations can be performed by PROC MODEL. For AR(1) errors, these initializations can be produced as shown in Table 18.3. These methods are equivalent in large samples.
Table 18.3 Initializations Performed by PROC MODEL: AR(1) ERRORS

   Method                        Formula

   conditional least squares     Y=YHAT+AR1*ZLAG1(Y-YHAT);

   unconditional least squares   Y=YHAT+AR1*ZLAG1(Y-YHAT);
                                 IF _OBS_=1 THEN
                                    RESID.Y=SQRT(1-AR1**2)*RESID.Y;

   maximum likelihood            Y=YHAT+AR1*ZLAG1(Y-YHAT);
                                 W=(1-AR1**2)**(-1/(2*_NUSED_));
                                 IF _OBS_=1 THEN W=W*SQRT(1-AR1**2);
                                 RESID.Y=W*RESID.Y;

   Hildreth-Lu                   Y=YHAT+AR1*LAG1(Y-YHAT);
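The four AR(1) initializations differ only in how the first observation and the overall scale are treated. The following Python sketch mirrors the formulas in Table 18.3 outside SAS so that the differences can be seen on a small series. It rests on several assumptions: `ar1_residuals` is a hypothetical name, the input `u` holds the structural residuals y - yhat, `_NUSED_` is taken to be the number of observations, and residuals are written as actual minus predicted (the opposite sign from RESID.Y, which does not change any sum of squares).

```python
import math


def ar1_residuals(u, ar1, method="CLS"):
    """Residuals of an AR(1) error model under different startup methods.

    u      structural residuals y - yhat, one per observation
    ar1    autoregressive parameter
    method "CLS", "ULS", "ML", or "HL"
    """
    n = len(u)
    # ZLAG1 returns 0 before the start of the data, so the first residual is u[0]
    e = [u[0]] + [u[t] - ar1 * u[t - 1] for t in range(1, n)]
    if method == "CLS":
        return e
    if method == "HL":
        # LAG1 is missing for the first observation, which is therefore dropped
        return e[1:]
    scale = math.sqrt(1.0 - ar1 ** 2)
    if method == "ULS":
        # Only the first residual is rescaled
        return [scale * e[0]] + e[1:]
    if method == "ML":
        # Every residual is weighted; the first also gets the ULS rescaling
        w = (1.0 - ar1 ** 2) ** (-1.0 / (2.0 * n))
        return [w * scale * e[0]] + [w * x for x in e[1:]]
    raise ValueError("unknown method: " + method)


u = [1.0, 2.0, 3.0]  # made-up structural residuals
for m in ("CLS", "ULS", "ML", "HL"):
    print(m, ar1_residuals(u, 0.5, m))
```

Only the first residual (and, for ML, a common weight that tends to 1 as the number of observations grows) distinguishes the methods, which is consistent with the statement that they are equivalent in large samples.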
MA Initial Conditions
The initial lags of the error terms of MA(q) models can also be modeled in different ways. The following moving-average error start-up paradigms are supported by the ARIMA and MODEL procedures:

   ULS   unconditional least squares
   CLS   conditional least squares
   ML    maximum likelihood

The conditional least squares method of estimating moving-average error terms is not optimal because it ignores the start-up problem. This reduces the efficiency of the estimates, although they remain unbiased. The initial lagged residuals, extending before the start of the data, are assumed to be 0, their unconditional expected value. This introduces a difference between these residuals and the generalized least squares residuals for the moving-average covariance, which, unlike the autoregressive model, persists through the data set. Usually this difference converges quickly to 0, but for nearly noninvertible moving-average processes the convergence is quite slow. To minimize this problem, you should have plenty of data, and the moving-average parameter estimates should be well within the invertible range.

This problem can be corrected at the expense of writing a more complex program. Unconditional least squares estimates for the MA(1) process can be produced by specifying the model as follows:
   yhat = ... compute structural predicted value here ... ;
   if _obs_ = 1 then do;
      h = sqrt( 1 + ma1 ** 2 );
      y = yhat;
      resid.y = ( y - yhat ) / h;
   end;
   else do;
      g = ma1 / zlag1( h );
      h = sqrt( 1 + ma1 ** 2 - g ** 2 );
      y = yhat + g * zlag1( resid.y );
      resid.y = ( ( y - yhat) - g * zlag1( resid.y ) ) / h;
   end;
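The recursion in G and H above can be read as a row-by-row Cholesky factorization of the MA(1) error covariance, which has 1 + ma1**2 on the diagonal and ma1 on the first off-diagonals (unit innovation variance). The following Python sketch reproduces the recursion outside SAS under that reading; `ma1_uls_residuals` is a hypothetical name, and `u` stands for the structural residuals y - yhat.

```python
import math


def ma1_uls_residuals(u, ma1):
    """Whiten MA(1) errors with the G/H recursion of the ULS program.

    Returns the transformed residuals together with the g and h sequences,
    which form the subdiagonal and diagonal of the Cholesky factor.
    """
    e, gs, hs = [], [], []
    for t, ut in enumerate(u):
        if t == 0:
            g = 0.0
            h = math.sqrt(1.0 + ma1 ** 2)
            et = ut / h
        else:
            g = ma1 / hs[-1]                      # uses h from the previous observation
            h = math.sqrt(1.0 + ma1 ** 2 - g ** 2)
            et = (ut - g * e[-1]) / h
        gs.append(g)
        hs.append(h)
        e.append(et)
    return e, gs, hs


u = [0.5, -1.0, 0.3, 0.8]  # made-up structural residuals
e, gs, hs = ma1_uls_residuals(u, ma1=0.6)
print(e)
```

Each pair satisfies h**2 + g**2 = 1 + ma1**2 and g times the previous h equals ma1, which are exactly the conditions for the factor to reproduce the MA(1) covariance. As the observation index grows, h approaches 1 and g approaches ma1 when the process is invertible, so the correction matters mostly near the start of the data.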
Moving-average errors can be difficult to estimate. You should consider using an AR(p) approximation to the moving-average process. A moving-average process can usually be well-approximated by an autoregressive process if the data have not been smoothed or differenced.
Univariate Autoregression
To model the error term of an equation as an autoregressive process, use the following statement after the equation:
%ar( varname, nlags )
For example, suppose that Y is a linear function of X1, X2, and an AR(2) error. You would write this model as follows:
   proc model data=in;
      parms a b c;
      y = a + b * x1 + c * x2;
      %ar( y, 2 )
      fit y / list;
   run;
The calls to %AR must come after all of the equations that the process applies to. The preceding macro invocation, %AR(y,2), produces the statements shown in the LIST output in Figure 18.58.
Figure 18.58 LIST Option Output for an AR(2) Model
   The MODEL Procedure

   Listing of Compiled Program Code

   Stmt   Line:Col   Statement as Parsed
   1      1940:4     PRED.y = a + b * x1 + c * x2;
   1      1940:4     RESID.y = PRED.y - ACTUAL.y;
   1      1940:4     ERROR.y = PRED.y - y;
   2      1941:14    _PRED__y = PRED.y;
   3      1941:15    _OLD_PRED.y = PRED.y + y_l1 * ZLAG1( y - _PRED__y )
                     + y_l2 * ZLAG2( y - _PRED__y );
   3                 PRED.y = _OLD_PRED.y;
   3                 RESID.y = PRED.y - ACTUAL.y;
   3                 ERROR.y = PRED.y - y;
The _PRED__ prefixed variables are temporary program variables used so that the lags of the residuals are the correct residuals and not the ones redefined by this equation. Note that this is equivalent to the statements explicitly written in the section "General Form for ARMA Models" on page 1090.
You can also restrict the autoregressive parameters to zero at selected lags. For example, if you wanted autoregressive parameters at lags 1, 12, and 13, you can use the following statements:
   proc model data=in;
      parms a b c;
      y = a + b * x1 + c * x2;
      %ar( y, 13, , 1 12 13 )
      fit y / list;
   run;
There are variations on the conditional least squares method, depending on whether observations at the start of the series are used to warm up the AR process. By default, the %AR conditional least squares method uses all the observations and assumes zeros for the initial lags of autoregressive terms. By using the M= option, you can request that %AR use the unconditional least squares (ULS) or maximum-likelihood (ML) method instead. For example,
   proc model data=in;
      y = a + b * x1 + c * x2;
      %ar( y, 2, m=uls )
      fit y;
   run;

A discussion of these methods is provided in the section "AR Initial Conditions" on page 1091.

By using the M=CLSn option, you can request that the first n observations be used to compute estimates of the initial autoregressive lags. In this case, the analysis starts with observation n+1. For example:

   proc model data=in;
      y = a + b * x1 + c * x2;
      %ar( y, 2, m=cls2 )
      fit y;
   run;
You can use the %AR macro to apply an autoregressive model to the endogenous variable, instead of to the error term, by using the TYPE=V option. For example, if you want to add the five past lags of Y to the equation in the previous example, you could use %AR to generate the parameters and lags by using the following statements:

   proc model data=in;
      parms a b c;
      y = a + b * x1 + c * x2;
      %ar( y, 5, type=v )
      fit y / list;
   run;
This model predicts Y as a linear combination of X1, X2, an intercept, and the values of Y in the most recent ve periods.
The process_name value is any name that you supply for %AR to use in making names for the autoregressive parameters. You can use the %AR macro to model several different AR processes for different sets of equations by using different process names for each set. The process name ensures that the variable names used are unique. Use a short process_name value for the process if parameter estimates are to be written to an output data set. The %AR macro tries to construct parameter names less than or equal to eight characters, but this is limited by the length of process_name, which is used as a prefix for the AR parameter names.

The variable_list value is the list of endogenous variables for the equations.

For example, suppose that errors for equations Y1, Y2, and Y3 are generated by a second-order vector autoregressive process. You can use the following statements:

   proc model data=in;
      y1 = ... equation for y1 ...;
      y2 = ... equation for y2 ...;
      y3 = ... equation for y3 ...;
      %ar( name, 2, y1 y2 y3 )
      fit y1 y2 y3;
   run;
which generate the following for Y1 and similar code for Y2 and Y3:
   y1 = pred.y1 + name1_1_1*zlag1(y1-name_y1) +
        name1_1_2*zlag1(y2-name_y2) +
        name1_1_3*zlag1(y3-name_y3) +
        name2_1_1*zlag2(y1-name_y1) +
        name2_1_2*zlag2(y2-name_y2) +
        name2_1_3*zlag2(y3-name_y3) ;
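The parameter names in the generated statement follow a visible pattern: the process name, then the lag, then the equation index, then the index of the variable whose lagged residual is used. The following Python sketch enumerates names under that pattern. The pattern is inferred from the statement shown for Y1 and is an assumption about the other equations; `var_param_names` is a hypothetical helper, not part of SAS.

```python
def var_param_names(prefix, nlag, n_eq, lags=None):
    """List vector AR parameter names of the form <prefix><lag>_<eq>_<var>.

    lags optionally restricts the process to selected lags, all of which
    must be between 1 and nlag; by default every lag 1..nlag is used.
    """
    lags = list(range(1, nlag + 1)) if lags is None else list(lags)
    if any(lag < 1 or lag > nlag for lag in lags):
        raise ValueError("each lag must be between 1 and nlag")
    return [
        f"{prefix}{lag}_{eq}_{var}"
        for lag in lags
        for eq in range(1, n_eq + 1)
        for var in range(1, n_eq + 1)
    ]


names = var_param_names("name", nlag=2, n_eq=3)
print(len(names))
print([n for n in names if n.split("_")[1] == "1"])  # parameters in the Y1 equation
```

For three equations and two lags the list has 3 x 3 x 2 = 18 entries, and the six names for the first equation match those in the statement generated for Y1.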
Only the conditional least squares (M=CLS or M=CLSn) method can be used for vector processes.

You can also use the same form with restrictions that the coefficient matrix be 0 at selected lags. For example, the following statements apply a third-order vector process to the equation errors with all the coefficients at lag 2 restricted to 0 and with the coefficients at lags 1 and 3 unrestricted:

   proc model data=in;
      y1 = ... equation for y1 ...;
      y2 = ... equation for y2 ...;
      y3 = ... equation for y3 ...;
      %ar( name, 3, y1 y2 y3, 1 3 )
      fit y1 y2 y3;
   run;
You can model the three series Y1–Y3 as a vector autoregressive process in the variables instead of in the errors by using the TYPE=V option. If you want to model Y1–Y3 as a function of past values of Y1–Y3 and some exogenous variables or constants, you can use %AR to generate the statements for the lag terms. Write an equation for each variable for the nonautoregressive part of the model, and then call %AR with the TYPE=V option. For example,

   proc model data=in;
      parms a1-a3 b1-b3;
      y1 = a1 + b1 * x;
      y2 = a2 + b2 * x;
      y3 = a3 + b3 * x;
The nonautoregressive part of the model can be a function of exogenous variables, or it can be intercept parameters. If there are no exogenous components to the vector autoregression model, including no intercepts, then assign zero to each of the variables. There must be an assignment to each of the variables before %AR is called.
   proc model data=in;
      y1=0;
      y2=0;
      y3=0;
      %ar( name, 2, y1 y2 y3, type=v )
      fit y1 y2 y3;
   run;

This example models the vector Y = (Y1 Y2 Y3)′ as a linear function only of its value in the previous two periods and a white noise error vector. The model has 18 = (3 × 3 + 3 × 3) parameters.
where

name
   specifies a prefix for %AR to use in constructing names of variables needed to define the AR process. If the endolist is not specified, the endogenous list defaults to name, which must be the name of the equation to which the AR error process is to be applied. The name value cannot exceed 32 characters.

nlag
   is the order of the AR process.

endolist
   specifies the list of equations to which the AR process is to be applied. If more than one name is given, an unrestricted vector process is created with the structural residuals of all the equations included as regressors in each of the equations. If not specified, endolist defaults to name.

laglist
   specifies the list of lags at which the AR terms are to be added. The coefficients of the terms at lags not listed are set to 0. All of the listed lags must be less than or equal to nlag, and there must be no duplicates. If not specified, the laglist defaults to all lags 1 through nlag.

M=method
   specifies the estimation method to implement. Valid values of M= are CLS (conditional least squares estimates), ULS (unconditional least squares estimates), and ML (maximum likelihood estimates). M=CLS is the default. Only M=CLS is allowed when more than one equation is specified. The ULS and ML methods are not supported for vector AR models by %AR.

TYPE=V
   specifies that the AR process is to be applied to the endogenous variables themselves instead of to the structural residuals of the equations.
This model states that the errors for Y1 depend on the errors of both Y1 and Y2 (but not Y3) at both lags 1 and 2, and that the errors for Y2 and Y3 depend on the previous errors for all three variables, but only at lag 1.
where

name
   specifies a prefix for %AR to use in constructing names of variables needed to define the vector AR process.

nlag
   specifies the order of the AR process.

endolist
   specifies the list of equations to which the AR process is to be applied.

DEFER
   specifies that %AR is not to generate the AR process but is to wait for further information specified in later %AR calls for the same name value.
where

name
   is the same as in the first call.

eqlist
   specifies the list of equations to which the specifications in this %AR call are to be applied. Only names specified in the endolist value of the first call for the name value can appear in the list of equations in eqlist.

varlist
   specifies the list of equations whose lagged structural residuals are to be included as regressors in the equations in eqlist. Only names in the endolist of the first call for the name value can appear in varlist. If not specified, varlist defaults to endolist.

laglist
   specifies the list of lags at which the AR terms are to be added. The coefficients of the terms at lags not listed are set to 0. All of the listed lags must be less than or equal to the value of nlag, and there must be no duplicates. If not specified, laglist defaults to all lags 1 through nlag.
The following PROC MODEL statements estimate the parameters of this model by using a maximum likelihood error structure:
title 'Maximum Likelihood ARMA(1, (1 3))';
proc model data=madat2;
   y=0;
   %ar( y, 1, , M=ml )
   %ma( y, 3, , 1 3, M=ml ) /* %MA always after %AR */
   fit y;
run;
title;
The estimates of the parameters produced by this run are shown in Figure 18.61.
Figure 18.61 Estimates from an ARMA(1, (1 3)) Process
Maximum Likelihood ARMA(1, (1 3))

The MODEL Procedure

Nonlinear OLS Summary of Residual Errors

Equation   DF Model   DF Error   R-Square   Adj R-Sq
y                 3        197    -0.0067    -0.0169
RESID.y                    197

Nonlinear OLS Parameter Estimates

Label                      Approx Std Err   Approx Pr > |t|
AR(y) y lag1 parameter             0.1187            0.3973
MA(y) y lag1 parameter             0.0939            0.0408
MA(y) y lag3 parameter             0.0601            <.0001
where

name
   specifies a prefix for %MA to use in constructing names of variables needed to define the MA process and is the default endolist.

nlag
   is the order of the MA process.

endolist
   specifies the equations to which the MA process is to be applied. If more than one name is given, CLS estimation is used for the vector process.

laglist
   specifies the lags at which the MA terms are to be added. All of the listed lags must be less than or equal to nlag, and there must be no duplicates. If not specified, the laglist defaults to all lags 1 through nlag.

M=method
   specifies the estimation method to implement. Valid values of M= are CLS (conditional least squares estimates), ULS (unconditional least squares estimates), and ML (maximum likelihood estimates). M=CLS is the default. Only M=CLS is allowed when more than one equation is specified in the endolist.
where

name
   specifies a prefix for %MA to use in constructing names of variables needed to define the vector MA process.

nlag
   specifies the order of the MA process.

endolist
   specifies the list of equations to which the MA process is to be applied.

DEFER
   specifies that %MA is not to generate the MA process but is to wait for further information specified in later %MA calls for the same name value.
where

name
   is the same as in the first call.

eqlist
   specifies the list of equations to which the specifications in this %MA call are to be applied.

varlist
   specifies the list of equations whose lagged structural residuals are to be included as regressors in the equations in eqlist.

laglist
   specifies the list of lags at which the MA terms are to be added.
\[ \cdots + b_2 x_{t-2} + b_3 x_{t-3} + \cdots + b_n x_{t-n} \]
Models of this sort can introduce a great many parameters for the lags, and there may not be enough data to compute accurate independent estimates for them all. Often, the number of parameters is reduced by assuming that the lag coefficients follow some pattern. One common assumption is that the lag coefficients follow a polynomial in the lag length

\[ b_i = \sum_{j=0}^{d} \alpha_j (i)^j \]
where d is the degree of the polynomial used. Models of this kind are called Almon lag models, polynomial distributed lag models, or PDLs for short. For example, Figure 18.62 shows the lag distribution that can be modeled with a low-order polynomial. Endpoint restrictions can be imposed on a PDL to require that the lag coefficients be 0 at the 0th lag, or at the final lag, or at both.
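To make the polynomial restriction concrete, the following Python sketch computes the lag coefficients from a small set of polynomial parameters and applies them to a series. It is an illustration under stated assumptions: the function names are hypothetical, the polynomial is written in plain powers of the lag index as in the formula above, and the parameter values are invented. The parameterization that a particular procedure uses internally may differ.

```python
def almon_lag_coefficients(alpha, nlags):
    """Return b_0..b_nlags where b_i = sum_j alpha[j] * i**j."""
    return [sum(a * i ** j for j, a in enumerate(alpha))
            for i in range(nlags + 1)]


def distributed_lag(x, coefs, t):
    """Return sum_i coefs[i] * x[t - i] for one time index t."""
    return sum(b * x[t - i] for i, b in enumerate(coefs))


# A degree-2 polynomial over lags 0 through 5: three parameters
# determine all six lag coefficients.
alpha = [1.0, 0.5, -0.1]
b = almon_lag_coefficients(alpha, 5)
print(b)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(distributed_lag(x, b, 6))
```

Here six lag coefficients are governed by only three free parameters, which is the reduction the text describes.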
Figure 18.62 Polynomial Distributed Lags
For linear single-equation models, SAS/ETS software includes the PDLREG procedure for estimating PDL models. See Chapter 20, The PDLREG Procedure, for a more detailed discussion of
polynomial distributed lags and an explanation of endpoint restrictions. Polynomial and other distributed lag models can be estimated and simulated or forecast with PROC MODEL. For polynomial distributed lags, the %PDL macro can generate the needed programming statements automatically.
where pdlname is a name (up to 32 characters) that you give to identify the PDL, nlags is the lag length, and degree is the degree of the polynomial for the distribution. The R=code is optional for endpoint restrictions. The value of code can be FIRST (for upper), LAST (for lower), or BOTH (for both upper and lower endpoints). See Chapter 20, The PDLREG Procedure, for a discussion of endpoint restrictions. The option OUTEST=dataset creates a data set that contains the estimates of the parameters and their covariance matrix. The later calls to apply the PDL have the general form
%PDL( pdlname, expression )
where pdlname is the name of the PDL and expression is the variable or expression to which the PDL is to be applied. The pdlname given must be the same as the name used to declare the PDL. The following statements produce the output in Figure 18.63:
proc model data=in list;
   parms int pz;
   %pdl(xpdl,5,2);
   y = int + pz * z + %pdl(xpdl,x);
   %ar(y,2,M=ULS);
   id i;
   fit y / out=model1 outresid converge=1e-6;
run;
[Figure 18.63: PROC MODEL output for the PDL example. Only the Label column survives; it lists six entries of the form "PDL(XPDL,5,2) coefficient for ...", one for each lag coefficient.]
This second example models two variables, Y1 and Y2, and uses two PDLs:
proc model data=in;
   parms int1 int2;
   %pdl( logxpdl, 5, 3 )
   %pdl( zpdl, 6, 4 )
   y1 = int1 + %pdl( logxpdl, log(x) ) + %pdl( zpdl, z );
   y2 = int2 + %pdl( zpdl, z );
   fit y1 y2;
run;
A (5,3) PDL of the log of X is used in the equation for Y1. A (6,4) PDL of Z is used in the equations for both Y1 and Y2. Since the same ZPDL is used in both equations, the lag coefficients for Z are the same for the Y1 and Y2 equations, and the polynomial parameters for ZPDL are shared by the two equations. See Example 18.5 for a complete example and comparison with PDLREG.
   y1 = a1 * y2 + b1 * x1 * x1 + d1;
   y2 = a2 * y1 + b2 * x2 * x2 + c2 / x2 + d1;
   fit y1 y2 / 3sls gmm kernel=(qs,1,0.2) outest=gmmest;
   fit y1 y2 / fiml type=gmm estdata=gmmest;
run;

proc print data=gmmest;
run;
Obs  _NAME_  _TYPE_  _STATUS_     _NUSED_  a1           a2        b1        b2       c2       d1
  1          3SLS    0 Converged       50  -.002229607  -1.25002  0.025827  1.99609  49.8119  -0.44533
  2          GMM     0 Converged       50  -.001772196  -1.02345  0.014025  1.99726  49.8648  -0.87573
MISSING=PAIRWISE | DELETE
When missing values are encountered for any one of the equations in a system of equations, the default action is to drop that observation for all of the equations. The new MISSING=PAIRWISE option in the FIT statement provides a different method of handling missing values that avoids losing data for nonmissing equations for the observation. This is especially useful for SUR estimation on equations with unequal numbers of observations.

The option MISSING=PAIRWISE specifies that missing values are tracked on an equation-by-equation basis. The MISSING=DELETE option specifies that the entire observation is omitted from the analysis when any equation has a missing predicted or actual value for the equation. The default is MISSING=DELETE.

When you specify the MISSING=PAIRWISE option, the S matrix is computed as

\[ S = D (R' R) D \]

where D is a diagonal matrix that depends on the VARDEF= option, the matrix R is $(r_1, \ldots, r_g)$, and $r_i$ is the vector of residuals for the ith equation with $r_{ij}$ replaced with zero when $r_{ij}$ is missing.

For MISSING=PAIRWISE, the calculation of the diagonal element $d_{i,i}$ of D is based on $n_i$, the number of nonmissing observations for the ith equation, instead of on n. Similarly, for VARDEF=WGT or WDF, the calculation is based on the sum of the weights for the nonmissing observations for the ith equation instead of on the sum of the weights for all observations. See the description of the VARDEF= option for the definition of D.
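The pairwise computation can be sketched numerically as follows. This Python example is an illustration only, not PROC MODEL's implementation. It assumes a divisor in the style of VARDEF=N, so that each diagonal element of D is taken to be 1/sqrt(n_i); that choice, the function name `pairwise_s_matrix`, and the residual values are all assumptions made for the example.

```python
import math


def pairwise_s_matrix(residuals):
    """Compute S = D (R'R) D with missing residuals (None) replaced by zero.

    residuals : one list per equation, all of equal length; None marks missing.
    Assumption: d_ii = 1 / sqrt(n_i), with n_i the nonmissing count for equation i.
    """
    g = len(residuals)
    n_obs = len(residuals[0])
    filled = [[0.0 if r is None else r for r in eq] for eq in residuals]
    counts = [sum(r is not None for r in eq) for eq in residuals]
    d = [1.0 / math.sqrt(n) for n in counts]
    s = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(g):
            cross = sum(filled[i][t] * filled[j][t] for t in range(n_obs))
            s[i][j] = d[i] * cross * d[j]
    return s


# Two equations, four observations, one missing residual in each equation.
resid = [[1.0, -1.0, 2.0, None],
         [0.5, None, -0.5, 1.0]]
s = pairwise_s_matrix(resid)
print(s)
```

Each equation keeps its three nonmissing residuals, whereas deleting every observation with any missing value would leave only two observations for both equations.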
The degrees-of-freedom correction for a shared parameter is computed by using the average number of observations used in its estimation. The MISSING=PAIRWISE option is not valid for the GMM and FIML estimation methods. For the instrumental variables estimation methods (2SLS, 3SLS), when an instrument is missing for an observation, that observation is dropped for all equations, regardless of the MISSING= option.
The model file NEWMOD then contains the new model and its estimated parameters plus the old models with their original parameter values.
You can create an input SDATA= data set by using the DATA step. PROC MODEL expects to find a character variable _NAME_ in the SDATA= data set as well as variables for the equations in the estimation or solution. For each observation with a _NAME_ value that matches the name of an equation, PROC MODEL fills the corresponding row of the S matrix with the values of the variables in the data set that are named after equations. If a row or column is omitted from the data set, a 1 is placed on the diagonal for the row or column. Missing values are ignored, and since the S matrix is symmetric, you can include only a triangular part of the S matrix in the SDATA= data set with the omitted part indicated by missing values. If the SDATA= data set contains multiple observations with the same _NAME_, the last values supplied for the _NAME_ are used. The structure of the expected data set is further described in the section OUTS= Data Set on page 1112.

Use the TYPE= option in the PROC MODEL or FIT statement to specify the type of estimation method used to produce the S matrix you want to input. The following SAS statements are used to generate an S matrix from a GMM and a 3SLS estimation and to store that estimate in the data set GMMS:
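The fill rules for an SDATA= data set can be sketched as a small routine. The following Python code is an illustration of the rules as stated in the text, not the procedure's actual code. The function name `build_s_matrix` and the dictionary layout are hypothetical. Two details are assumptions: an off-diagonal element that is never supplied is set to 0, and a later observation overrides an earlier one only where it supplies a nonmissing value.

```python
def build_s_matrix(equations, rows):
    """Assemble a symmetric S matrix from SDATA-style rows.

    equations : ordered equation names
    rows      : dicts with a "_NAME_" key and equation-named values (None = missing)
    """
    index = {name: k for k, name in enumerate(equations)}
    g = len(equations)
    s = [[None] * g for _ in range(g)]
    for row in rows:                       # later rows overwrite earlier ones
        i = index.get(row.get("_NAME_"))
        if i is None:
            continue                       # _NAME_ does not match an equation
        for name, j in index.items():
            value = row.get(name)
            if value is not None:          # missing values are ignored
                s[i][j] = value
    for i in range(g):
        for j in range(g):
            if s[i][j] is None:
                if s[j][i] is not None:    # take the value from the other triangle
                    s[i][j] = s[j][i]
                else:                      # omitted: 1 on the diagonal, 0 elsewhere
                    s[i][j] = 1.0 if i == j else 0.0
    return s


rows = [
    {"_NAME_": "y1", "y1": 2.0, "y2": None},
    {"_NAME_": "y2", "y1": 0.5, "y2": 3.0},
    {"_NAME_": "y1", "y1": 4.0},           # duplicate _NAME_: last value is used
]
s_matrix = build_s_matrix(["y1", "y2", "y3"], rows)
print(s_matrix)
```

In this example only the lower triangle is supplied for y1 and y2, and y3 is omitted entirely, so its row and column receive a 1 on the diagonal.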
proc model data=gmm2 ;
   exogenous x1 x2;
   parms a1 a2 b1 2.5 b2 c2 55 d1;
   inst b1 b2 c2 x1 x2;
   y1 = a1 * y2 + b1 * x1 * x1 + d1;
   y2 = a2 * y1 + b2 * x2 * x2 + c2 / x2 + d1;
   fit y1 y2 / 3sls gmm kernel=(qs,1,0.2) outest=gmmest outs=gmms;
run;

proc print data=gmms;
run;
column of the V matrix is associated with an equation and an instrument. The position of each element in the V matrix can then be indicated by an equation name and an instrument name for the row of the element and an equation name and an instrument name for the column. Each observation in the VDATA= data set is an element in the V matrix. The row and column of the element are indicated by four variables (EQ_ROW, INST_ROW, EQ_COL, and INST_COL) that contain the equation name or instrument name. The variable name for an element is VALUE. Missing values are set to 0. Because the variance matrix is symmetric, only a triangular part of the matrix needs to be input. The following SAS statements are used to generate a V matrix estimation from GMM and to store that estimate in the data set GMMV:
proc model data=gmm2;
   exogenous x1 x2;
   parms a1 a2 b2 b1 2.5 c2 55 d1;
   inst b1 b2 c2 x1 x2;
   y1 = a1 * y2 + b1 * x1 * x1 + d1;
   y2 = a2 * y1 + b2 * x2 * x2 + c2 / x2 + d1;
   fit y1 y2 / gmm outv=gmmv;
run;

proc print data=gmmv(obs=15);
run;
The data set GMM2 was generated by the example in the preceding ESTDATA= section. The V matrix stored in GMMV is selected for use in an additional GMM estimation by the following FIT statement:
fit y1 y2 / gmm vdata=gmmv; run;
A partial listing of the GMMV data set is shown in Figure 18.66. There are a total of 78 observations in this data set. The V matrix is 12 by 12 for this example.
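The observation count follows from storing only a triangular part of a symmetric matrix: a 12 by 12 matrix has 12 × 13 / 2 = 78 distinct elements. The Python sketch below illustrates that count and the row and column addressing by equation and instrument name. It is not the procedure's code; the function names and the instrument names z1 through z6 are hypothetical placeholders, chosen only because two equations and a 12 by 12 matrix imply six instruments per equation.

```python
def triangular_count(n):
    """Number of elements in one triangle (with diagonal) of an n-by-n matrix."""
    return n * (n + 1) // 2


def build_v_matrix(keys, records):
    """Assemble a symmetric V matrix from VDATA-style records.

    keys    : ordered (equation, instrument) pairs that label rows and columns
    records : tuples (EQ_ROW, INST_ROW, EQ_COL, INST_COL, VALUE); None = missing
    """
    index = {key: k for k, key in enumerate(keys)}
    n = len(keys)
    v = [[0.0] * n for _ in range(n)]      # missing elements stay at 0
    for eq_row, inst_row, eq_col, inst_col, value in records:
        if value is None:
            continue
        i = index[(eq_row, inst_row)]
        j = index[(eq_col, inst_col)]
        v[i][j] = value
        v[j][i] = value                    # fill the symmetric element
    return v


equations = ["y1", "y2"]
instruments = ["z1", "z2", "z3", "z4", "z5", "z6"]   # placeholder names
keys = [(eq, inst) for eq in equations for inst in instruments]
records = [
    ("y1", "z1", "y1", "z1", 2.0),
    ("y2", "z1", "y1", "z1", 0.5),
    ("y2", "z6", "y2", "z6", None),
]
v = build_v_matrix(keys, records)
print(len(keys), triangular_count(len(keys)))
```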
• the model variables. The dependent variables for the normalized form equations in the estimation contain residuals, actuals, or predicted values, depending on the _TYPE_ variable, whereas the model variables that are not associated with estimated equations always contain actual values from the input data set.

• any other variables named in the OUTVARS statement. These can be program variables computed by the model program, CONTROL variables, parameters, or special variables in the model program.

The following SAS statements are used to generate and print an OUT= data set:
proc model data=gmm2;
   exogenous x1 x2;
   parms a1 a2 b2 b1 2.5 c2 55 d1;
   inst b1 b2 c2 x1 x2;
   y1 = a1 * y2 + b1 * x1 * x1 + d1;
   y2 = a2 * y1 + b2 * x2 * x2 + c2 / x2 + d1;
   fit y1 y2 / 3sls gmm out=resid outall ;
run;

proc print data=resid(obs=20);
run;
The data set GMM2 was generated by the example in the preceding ESTDATA= section. A partial listing of the RESID data set is shown in Figure 18.67.
Figure 18.67 The OUT= Data Set
Obs  _ESTYPE_  _TYPE_    _WEIGHT_  x1       x2        y1        y2
  1  3SLS      ACTUAL           1  1.00000  -1.7339   -3.05812  -23.071
  2  3SLS      PREDICT          1  1.00000  -1.7339   -0.36806  -19.351
  3  3SLS      RESIDUAL         1  1.00000  -1.7339   -2.69006   -3.720
  4  3SLS      ACTUAL           1  1.41421  -5.3046    0.59405   43.866
  5  3SLS      PREDICT          1  1.41421  -5.3046   -0.49148   45.588
  6  3SLS      RESIDUAL         1  1.41421  -5.3046    1.08553   -1.722
  7  3SLS      ACTUAL           1  1.73205  -5.2826    3.17651   51.563
  8  3SLS      PREDICT          1  1.73205  -5.2826   -0.48281   41.857
  9  3SLS      RESIDUAL         1  1.73205  -5.2826    3.65933    9.707
 10  3SLS      ACTUAL           1  2.00000  -0.6878    3.66208  -70.011
 11  3SLS      PREDICT          1  2.00000  -0.6878   -0.18592  -76.502
 12  3SLS      RESIDUAL         1  2.00000  -0.6878    3.84800    6.491
 13  3SLS      ACTUAL           1  2.23607  -7.0797    0.29210   99.177
 14  3SLS      PREDICT          1  2.23607  -7.0797   -0.53732   92.201
 15  3SLS      RESIDUAL         1  2.23607  -7.0797    0.82942    6.976
 16  3SLS      ACTUAL           1  2.44949  14.5284    1.86898  423.634
 17  3SLS      PREDICT          1  2.44949  14.5284   -1.23490  421.969
 18  3SLS      RESIDUAL         1  2.44949  14.5284    3.10388    1.665
 19  3SLS      ACTUAL           1  2.64575  -0.6968   -1.03003  -72.214
 20  3SLS      PREDICT          1  2.64575  -0.6968   -0.10353  -69.680
• variables with the names of the equations in the estimation

Each observation contains a row of the covariance matrix. The data set is suitable for use with the SDATA= option in a subsequent FIT or SOLVE statement. (See the section Tests on Parameters on page 1078 in this chapter for an example of the SDATA= option.)
Table 18.4

ODS Table Name         Description                                   Option

ODS Tables Created by the FIT Statement

AugGMMCovariance       Crossproducts matrix                          GMM ITALL
ChowTest               Structural change test                        CHOW=
CollinDiagnostics      Collinearity diagnostics                      COLLIN
ConfInterval           Profile likelihood confidence intervals       PRL=
ConvCrit               Convergence criteria for estimation           default
ConvergenceStatus      Convergence status                            default
CorrB                  Correlations of parameters                    COVB/CORRB
CorrResiduals          Correlations of residuals                     CORRS/COVS
CovB                   Covariance of parameters                      COVB/CORRB
CovResiduals           Covariance of residuals                       CORRS/COVS
Crossproducts          Crossproducts matrix                          ITALL/ITPRINT
DatasetOptions         Data sets used                                default
DetResidCov            Determinant of the residuals                  DETAILS
DWTest                 Durbin Watson test                            DW=
Equations              Listing of equations to estimate              default
EstSummaryMiss         Model summary statistics for PAIRWISE         MISSING=
EstSummaryStats        Objective, objective * N                      default
GMMCovariance          Crossproducts matrix                          GMM DETAILS
Godfrey                Godfrey's serial correlation test             GF=
HausmanTest            Hausman's test table                          HAUSMAN
HeteroTest             Heteroscedasticity test tables                BREUSCH/PAGEN
InvXPXMat              X'X inverse for system                        I
IterInfo               Iteration printing                            ITALL/ITPRINT
LagLength              Model lag length                              default
MinSummary             Number of parameters, estimation kind         default
ModSummary             Listing of all categorized variables          default
ModVars                Listing of model variables and parameters     default
NormalityTest          Normality test table                          NORMAL
ObsSummary             Identifies observations with errors           default
ObsUsed                Observations read, used, and missing          default
ParameterEstimates     Parameter estimates                           default
ParmChange             Parameter change vector
ResidSummary           Summary of the SSE, MSE for the equations     default
SizeInfo               Storage requirement for estimation            DETAILS
TermEstimates          Nonlinear OLS and ITOLS Estimates             OLS/ITOLS
TestResults            Test statement table
WgtVar                 The name of the weight variable
XPXMat                 X'X for system                                XPX

ODS Tables Created by the SOLVE Statement

DatasetOptions         Data sets used                                default
DescriptiveStatistics  Descriptive statistics                        STATS
FitStatistics          Fit statistics for simulation                 STATS
LagLength              Model lag length                              default
ModSummary             Listing of all categorized variables          default
ObsSummary             Simulation trace output                       SOLVEPRINT
ObsUsed                Observations read, used, and missing          default
SimulationSummary      Number of variables solved for                default
SolutionVarList        Solution variable lists                       default
                       Theil relative change error statistics
                       Theil forecast error statistics
                       Iteration Error vector
                       Iteration residual values
                       Iteration predicted values
                       Iteration solved for variable values

ODS Tables Created by the FIT and SOLVE Statements

AdjacencyMatrix        Adjacency graph                               GRAPH
BlockAnalysis          Block analysis                                BLOCK
BlockStructure         Block structure                               BLOCK
CodeDependency         Variable cross reference                      LISTDEP
CodeList               Listing of programs statements                LISTCODE
CrossReference         Cross-reference listing for program
DepStructure           Dependency structure of the system            BLOCK
FirstDerivatives       First derivative table                        LISTDER
InterIntg              Integration iteration output                  INTGPRINT
MemUsage               Memory usage statistics                       MEMORYUSE
ParmReadIn             Parameter estimates read in                   ESTDATA=
ProgList               Listing of compiled program code
RangeInfo              RANGE statement specification
SortAdjacencyMatrix    Sorted adjacency graph                        GRAPH
TransitiveClosure      Transitive closure graph                      GRAPH
The AugGMMCovariance table is the V matrix augmented with the moment vector at iteration zero, produced when the ITALL option is used with the GMM option. If the V matrix to be used in GMM is read in by the VDATA= option, then AugGMMCovariance would be the same matrix augmented with the moment vectors. The GMMCovariance ODS output is produced only when you read in a covariance matrix to be used in the GMM method. This table is produced by using the DETAILS option with the GMM option.
ODS Graphics
This section describes the use of ODS for creating graphics with the MODEL procedure.
ODS Graph Name         Plot Description

ACFPlot                Autocorrelation of residuals
ActualByPredicted      Predicted versus actual plot
CooksD                 Cook's D plot
IACFPlot               Inverse autocorrelation of residuals
QQPlot                 Q-Q plot of residuals
PACFPlot               Partial autocorrelation of residuals
ResidualHistogram      Histogram of the residuals
StudentResidualPlot    Studentized residual plot
Solution Modes
The following solution modes are commonly used:

• The dynamic simultaneous forecast mode is used for forecasting with the model. Collect the historical data on the model variables, the future assumptions of the exogenous variables, and any prior information on the future endogenous values, and combine them in a SAS data set. Use the FORECAST option in the SOLVE statement.

• The dynamic simultaneous simulation mode is often called ex post simulation, historical simulation, or ex post forecasting. Use the DYNAMIC option. This mode is the default.

• The static simultaneous simulation mode can be used to examine the within-period performance of the model without the complications of previous period errors. Use the STATIC option.

• The NAHEAD=n dynamic simultaneous simulation mode can be used to see how well n-period-ahead forecasting would have performed over the historical period. Use the NAHEAD=n option.

The different solution modes are explained in detail in the following sections.
n-Period-Ahead Forecasting
Suppose you want to regularly forecast 12 months ahead and produce a new forecast each month as more data becomes available. n-period-ahead forecasting allows you to test how well you would have done over time if you had been using your model to forecast one year ahead.
To see how well a model predicts n time periods in the future, perform an n-period-ahead forecast on real data and compare the forecast values with the actual values. n-period-ahead forecasting refers to using dynamic values for the lagged endogenous variables only for lags 1 through n-1. For example, one-period-ahead forecasting, specified by the NAHEAD=1 option in the SOLVE statement, is the same as if a static solution had been requested. Specifying NAHEAD=2 produces a solution that uses dynamic values for lag one and static (actual) values for longer lags. The following example is a two-year-ahead dynamic simulation. The output is shown in Figure 18.68.
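The distinction between dynamic and actual lagged values can be illustrated with a toy model. The following Python sketch is not SAS code and does not reproduce the SOLVE statement; it assumes a hypothetical single-equation model in which each value depends only on its own first lag, and the function name `nahead_solve` is invented for the example. Each n-period-ahead value starts from the actual value n periods back and then feeds its own solved values forward.

```python
def nahead_solve(actual, step, nahead):
    """Return {t: forecast of period t made from actual data through t - nahead}.

    actual : observed series
    step   : function mapping the lag-1 value to the current value
    nahead : forecast horizon n; n = 1 reproduces a static solution
    """
    results = {}
    for t in range(nahead, len(actual)):
        value = actual[t - nahead]         # last actual value that may be used
        for _ in range(nahead):
            value = step(value)            # later steps use solved (dynamic) lags
        results[t] = value
    return results


def double(previous):
    """Hypothetical model: y_t = 2 * y_{t-1}."""
    return 2.0 * previous


actual = [1.0, 2.5, 4.0, 9.0]
static = nahead_solve(actual, double, 1)
two_ahead = nahead_solve(actual, double, 2)
print(static)
print(two_ahead)
```

With a horizon of 1 every lag is an actual value, which is why that case coincides with a static solution. With a horizon of 2 the lag-1 value is itself a solved value.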
data yearly;
   input year x1 x2 x3 y1 y2 y3;
   datalines;
84 4  9  0  7  4  5
85 5  6  1  1 27  4
86 3  8  2  5  8  2
87 2 10  3  0 10 10
88 4  7  6 20 60 40
89 5  4  8 40 40 40
90 3  2 10 50 60 60
91 2  5 11 40 50 60
;
run;
proc model data=yearly outmodel=yearlyModel;
   endogenous y1 y2 y3;
   exogenous x1 x2 x3;
   y1 = 2 + 3*x1 - 2*x2 + 4*x3;
   y2 = 4 + lag2( y3 ) + 2*y1 + x1;
   y3 = lag3( y1 ) + y2 - x2;
   solve y1 y2 y3 / nahead=2 out=c;
run;

proc print data=c;
run;
Observations Processed

   Read       20
   Lagged     12
   Solved      8
   First       5
   Last        8

Variables Solved For    y1 y2 y3
The preceding two-year-ahead simulation can be emulated without using the NAHEAD= option by the following PROC MODEL statements:
proc model data=yearly model=yearlyModel;
   range year = 87 to 88;
   solve y1 y2 y3 / dynamic solveprint;
run;
   range year = 88 to 89;
   solve y1 y2 y3 / dynamic solveprint;
run;
   range year = 89 to 90;
   solve y1 y2 y3 / dynamic solveprint;
run;
   range year = 90 to 91;
   solve y1 y2 y3 / dynamic solveprint;
run;
The totals shown under Observations Processed in Figure 18.68 are equal to the sum of the four individual runs.
The RANDOM= option is used to request Monte Carlo (or stochastic) simulations to generate confidence intervals for errors that arise from the first two sources. The Monte Carlo simulations can be performed with the parameter vector, the additive error vector, or both represented as random variables. The SEED= option is used to control the random number generator for the simulations. SEED=0 forces the random number generator to use the system clock as its seed value.

In Monte Carlo simulations, repeated simulations are performed on the model for random perturbations of the parameters and the additive error term. The random perturbations follow a multivariate normal distribution with expected value of 0 and covariance described by a covariance matrix of the parameter estimates in the case of the parameters, or a covariance matrix of the equation residuals in the case of the additive error term. PROC MODEL can generate both covariance matrices or you can provide them.

The ESTDATA= option specifies a data set that contains an estimate of the covariance matrix of the parameter estimates to use for computing perturbations of the parameters. The ESTDATA= data set is usually created by the FIT statement with the OUTEST= and OUTCOV options. When the ESTDATA= option is specified, the matrix read from the ESTDATA= data set is used to compute vectors of random shocks or perturbations for the parameters. These random perturbations are computed at the start of each repetition of the solution and added to the parameter values. The perturbed parameters are fixed throughout the solution range. If the covariance matrix of the parameter estimates is not provided, the parameters are not perturbed.

The SDATA= option specifies a data set that contains the covariance matrix of the residuals to use for computing perturbations of the equations. The SDATA= data set is usually created by the FIT statement with the OUTS= option. When SDATA= is specified, the matrix read from the SDATA= data set is used to compute vectors of random shocks or perturbations for the equations. These random perturbations are computed at each observation. The simultaneous solution satisfies the model equations plus the random shocks. That is, the solution is not a perturbation of a simultaneous solution of the structural equations; rather, it is a simultaneous solution of the stochastic equations by using the simulated errors. If the SDATA= option is not specified, the random shocks are not used.

The different random solutions are identified by the _REP_ variable in the OUT= data set. An unperturbed solution with _REP_=0 is also computed when the RANDOM= option is used. RANDOM=n produces n+1 solution observations for each input observation in the solution range. If the RANDOM= option is not specified, the SDATA= and ESTDATA= options are ignored, and no Monte Carlo simulation is performed.

PROC MODEL does not have an automatic way of modeling the exogenous variables as random variables for Monte Carlo simulation. If the exogenous variables have been forecast, the error bounds for these variables should be included in the error bounds generated for the endogenous variables. If the models for the exogenous variables are included in PROC MODEL, then the error bounds created from a Monte Carlo simulation contain the uncertainty due to the exogenous variables.

Alternatively, if the distribution of the exogenous variables is known, the built-in random number generator functions can be used to perturb these variables appropriately for the Monte Carlo simulation. For example, if you know the forecast of an exogenous variable, X, has a standard error of 5.2 and the error is normally distributed, then the following statements can be used to generate random values for X:
During a Monte Carlo simulation, the random number generator functions produce one value at each observation. It is important to use a different seed value for all the random number generator functions in the model program; otherwise, the perturbations will be correlated. For the unperturbed solution, _REP_=0, the random number generator functions return 0. PROC UNIVARIATE can be used to create confidence intervals for the simulation (see the Monte Carlo simulation example in the section Getting Started: MODEL Procedure on page 949).
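The structure of such a simulation can be sketched outside SAS. The following Python example is an illustration under stated assumptions, not PROC MODEL's algorithm: the two-equation linear system, its coefficients, the covariance matrix, and all function names are invented. It shows three points from the text: shocks are drawn with a given residual covariance, each perturbed solution solves the equations with the shocks included, and repetition 0 is the unperturbed solution, so n repetitions yield n+1 solutions per observation.

```python
import math
import random


def cholesky(s):
    """Lower-triangular factor L with L L' = s, for a small positive definite matrix."""
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def solve_system(e1, e2):
    """Solve y1 = 1 + 0.5*y2 + e1 and y2 = 2 + 0.25*y1 + e2 simultaneously."""
    y1 = (2.0 + e1 + 0.5 * e2) / 0.875
    y2 = 2.0 + 0.25 * y1 + e2
    return y1, y2


def monte_carlo(s, n_reps, n_obs, seed):
    """Return rows (rep, obs, y1, y2); rep 0 is the unperturbed solution."""
    rng = random.Random(seed)
    chol = cholesky(s)
    rows = []
    for rep in range(n_reps + 1):
        for obs in range(n_obs):
            if rep == 0:
                e = [0.0, 0.0]
            else:
                z = [rng.gauss(0.0, 1.0) for _ in range(2)]
                e = [sum(chol[i][k] * z[k] for k in range(2)) for i in range(2)]
            rows.append((rep, obs) + solve_system(e[0], e[1]))
    return rows


rows = monte_carlo([[1.0, 0.7], [0.7, 1.0]], n_reps=5, n_obs=3, seed=12345)
print(len(rows))
```

Percentiles of the perturbed solutions at each observation would then give simulation-based error bounds.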
The preceding DATA step generates the data set to request a five-days-ahead forecast. The following statements estimate and forecast the three forward-rate models of the following form.

\[ rate_t = rate_{t-1} + \mu \, rate_{t-1} + \nu \]

\[ \nu = \sigma \, rate_{t-1} \, \epsilon \]

\[ \epsilon \sim N(0, 1) \]
title "Daily Multivariate Geometric Brownian Motion Model "
      "of D-Mark/USDollar Forward Rates";

proc model data=xfrate;
   parms df 15;   /* Give initial value to df */

   demusd1m = lag(demusd1m) + mu1m * lag(demusd1m);
   var_demusd1m = sigma1m ** 2 * lag(demusd1m **2);
   demusd3m = lag(demusd3m) + mu3m * lag(demusd3m);
   var_demusd3m = sigma3m ** 2 * lag(demusd3m ** 2);
   demusd6m = lag(demusd6m) + mu6m * lag(demusd6m);
   var_demusd6m = sigma6m ** 2 * lag(demusd6m ** 2);

   /* Specify the error distribution */
   errormodel demusd1m demusd3m demusd6m
      ~ t( var_demusd1m var_demusd3m var_demusd6m, df );

   /* output normalized S matrix */
   fit demusd1m demusd3m demusd6m / outsn=s;
run;

   /* forecast five days in advance */
   solve demusd1m demusd3m demusd6m /
         data=five sdata=s random=1500 out=monte;
   id date;
run;

/* select out the last date ---*/
data monte;
   set monte;
   if date = '10dec95'd then output;
run;

title "Distribution of demusd1m Five Days Ahead";
proc univariate data=monte noprint;
   var demusd1m;
   histogram demusd1m / normal(noprint color=red)
                        kernel(noprint color=blue) cfill=ligr;
run;
The Monte Carlo simulation specified in the preceding example draws from a multivariate t distribution with constant degrees of freedom and forecasted variance, and it computes future states of DEMUSD1M, DEMUSD3M, and DEMUSD6M. The OUTSN= option in the FIT statement is used to specify the data set for the normalized matrix. That is, the matrix is created by crossing the normally distributed residuals. The normally distributed residuals are created from the t distributed residuals by using the normal inverse CDF and the t CDF. This matrix is a correlation matrix. The distribution of DEMUSD1M on the fifth day is shown in Figure 18.70. The two curves overlaid on the graph are a kernel density estimation and a normal distribution fit to the results.
The first DATA step generates a data set that contains a covariance matrix for the ARRIVING and LEAVING variables. The covariance is

\[ \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix} \]

The following statements create the number of waiting clients data:
proc model data=calls;
   arriving = 0;
   errormodel arriving ~ poisson( 10 );
   leaving = 4;
   errormodel leaving ~ poisson( 11 );
   waiting = arriving - leaving;
   if waiting < 0 then waiting=0;
   outvars waiting;
   solve arriving leaving / random=500 sdata=s out=sim;
run;

title "Distribution of Clients Waiting";
proc univariate data=sim noprint;
   var waiting ;
   histogram waiting / cfill=ligr;
run;
Mixtures of Distributions: Copulas
The theory of copulas is what enables the MODEL procedure to combine and simulate multivariate distributions with different marginals. This section provides a brief overview of copulas.

Modeling a system of variables accurately is a difficult task. The underlying, ideal distributional assumptions for each variable usually differ from one another. An individual variable might be best modeled as a t distribution or as a Poisson process. The correlations among the variables are very important to estimate as well. A joint estimation of a set of variables would make it possible to estimate a correlation structure but would restrict the modeling to a single, simple multivariate distribution (for example, the normal). Even with a simple multivariate distribution, the joint estimation would be computationally difficult and would have to deal with issues of missing data.

By using the MODEL procedure ERRORMODEL statement, you can combine and simulate from models of different distributions. The covariance matrix for the combined model is constructed by using the copula induced by the multivariate normal distribution. A copula is a function that couples joint distributions to their marginal distributions.

By default, the copula used in the MODEL procedure is based on the multivariate normal. This particular multivariate normal has zero mean and covariance matrix R. The user provides R, which can be created by using the following steps:

1. Each model is estimated separately and its residuals are saved.
2. The residuals for each model are converted to a normal distribution by using their CDFs, $F_i(\cdot)$, using the relationship $\Phi^{-1}(F(\epsilon_{it}))$.
3. These normal residuals are crossed to create a covariance matrix R.

If the model of interest can be estimated jointly, such as multivariate T, then the OUTSN= option can be used to generate the correct covariance matrix.

A draw from this mixture of distributions is created by using the following steps, which are performed automatically by the MODEL procedure:

1. Independent $N(0, 1)$ variables are generated.
2. These variables are transformed to a correlated set by using the covariance matrix R.
3. These correlated normals are transformed to a uniform by using $\Phi(\cdot)$.
4. $F^{-1}(\cdot)$ is applied to each uniform to obtain a value from the marginal distribution of the corresponding variable.
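The four draw steps can be illustrated outside SAS with a short Python sketch. It is not the MODEL procedure's implementation: it assumes only two variables, uses a correlation matrix with a hypothetical correlation `r` in place of R, and uses two hypothetical marginals (an exponential and a uniform) chosen only because their inverse CDFs are simple.

```python
import math
import random
from statistics import NormalDist


def exponential_inv_cdf(u, rate=1.0):
    """Inverse CDF of a hypothetical exponential marginal."""
    return -math.log(1.0 - u) / rate


def uniform_inv_cdf(u, low=0.0, high=10.0):
    """Inverse CDF of a hypothetical uniform(low, high) marginal."""
    return low + (high - low) * u


def gaussian_copula_draw(r, inv_cdfs, rng):
    """One bivariate draw: normals -> correlated normals -> uniforms -> marginals."""
    std_normal = NormalDist()
    # Step 1: independent N(0, 1) variables.
    z = [rng.gauss(0.0, 1.0) for _ in range(2)]
    # Step 2: correlate them with a Cholesky factor of [[1, r], [r, 1]].
    a = [[1.0, 0.0], [r, math.sqrt(1.0 - r * r)]]
    x = [sum(a[i][k] * z[k] for k in range(2)) for i in range(2)]
    # Step 3: map each correlated normal to a uniform with the normal CDF.
    # The clamp keeps u strictly inside (0, 1) for the inverse CDFs.
    u = [min(max(std_normal.cdf(xi), 1e-12), 1.0 - 1e-12) for xi in x]
    # Step 4: apply each marginal's inverse CDF.
    return [inv(ui) for inv, ui in zip(inv_cdfs, u)]


rng = random.Random(2008)
sample = [gaussian_copula_draw(0.8, [exponential_inv_cdf, uniform_inv_cdf], rng)
          for _ in range(5)]
for row in sample:
    print(row)
```

Each row pairs an exponential value with a uniform value; the dependence between them comes entirely from the correlated normals in step 2.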
Alternate Copulas
The Gaussian, t, and the normal mixture copula are available in the MODEL procedure. These copulas support asymmetric parameters and can use alternate estimation methods for creating the base covariance matrix. The normal (Gaussian) copula is the default. A draw from a Gaussian copula is obtained from
$$x = Az$$
where $z \in \mathbb{R}^d$ is a vector of independent random normal$(0, 1)$ draws and $A \in \mathbb{R}^{d \times d}$ is the square root of the covariance matrix, R.

For the normal mixture and t copula, a draw is created as
$$x = w\gamma + \sqrt{w}\,Az$$
where w is a scalar random variable and $\gamma \in \mathbb{R}^d$ is a vector of asymmetry parameters. $\gamma$ is specified in the SDATA= data set. If $W \sim$ inverse gamma$(df/2, df/2)$, then x is multivariate t, or skewed t if $\gamma$ is provided. When NORMALMIX is specified, w is distributed as a step function with each of the n positive variances, $v_1 \ldots v_n$, having probability $p_1 \ldots p_n$.

The covariance matrix $R = A'A$ is specified with the SDATA= option. The vector of asymmetry parameters, $\gamma$, defaults to zero or is specified in the SDATA= data set with _TYPE_=ASYM. The ASYM option specifies that the nonzero asymmetry vector, $\gamma$, is to be used.

The actual draw for an individual variable, $y_i$, depends on the marginal distribution of the variable, $\tilde{F}$, and the chosen copula F as
$$y_i = \tilde{F}_i^{-1}(F(x_i))$$
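The mixture draw for the t copula can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure's code: the function name, the matrix `a`, the vector `gamma`, and `df` are all hypothetical inputs, and the inverse gamma variate is obtained as the reciprocal of a gamma variate with shape df/2 and scale 2/df.

```python
import math
import random


def skewed_t_draw(a, gamma, df, rng):
    """One draw of x = w*gamma + sqrt(w)*A*z with W ~ inverse gamma(df/2, df/2)."""
    d = len(gamma)
    # Independent standard normal draws.
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # 1/W ~ Gamma(shape=df/2, scale=2/df), so W ~ inverse gamma(df/2, df/2).
    w = 1.0 / rng.gammavariate(df / 2.0, 2.0 / df)
    # Correlate the normals with the supplied matrix A.
    az = [sum(a[i][k] * z[k] for k in range(d)) for i in range(d)]
    # Shift by the asymmetry vector and scale by sqrt(w).
    return [w * gamma[i] + math.sqrt(w) * az[i] for i in range(d)]


rng = random.Random(92)
identity = [[1.0, 0.0], [0.0, 1.0]]
print(skewed_t_draw(identity, [0.0, 0.0], 5.0, rng))   # symmetric t draw
print(skewed_t_draw(identity, [1.0, 0.0], 5.0, rng))   # skewed in the first variable
```

With `gamma` set to zero the draw is symmetric; a nonzero entry shifts that variable by an amount that grows with w, which is what produces the skew.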
To produce a panel plot of this joint distribution, use the following SAS/GRAPH statements.
   ods graphics on / height=800 width=800;
   proc template;
      define statgraph myplot.panel;
         BeginGraph;
            entrytitle halign=left halign=center
               textattrs=GRAPHTITLETEXT
               "t Copula with a Range of Asymmetry";
            layout datapanel classvars=(asym) / rows=3 columns=3
                   order=rowmajor height=1024 width=1420
                   rowaxisopts=(griddisplay=on label=' ')
                   columnaxisopts=(griddisplay=on label=' ');
               layout prototype;
                  scatterplot x=z y=y ;
               endlayout;
            endlayout;
         EndGraph;
      end;
   run;

   proc sgrender data=sim template=myplot.panel;
   run;
Also, 11 in base 3 is
$$11_{10} = 1 \cdot 3^2 + 0 \cdot 3^1 + 2 \cdot 3^0 = 102_3$$
When the powers of 3 are inverted, so that the digit multiplying $3^k$ is applied to $3^{-(k+1)}$,
$$\phi(11) = 1 \cdot \frac{1}{27} + 0 \cdot \frac{1}{9} + 2 \cdot \frac{1}{3} = \frac{19}{27}$$
The first 10 numbers in this sequence, $\phi(0) \ldots \phi(9)$, are
$$0, \frac{1}{3}, \frac{2}{3}, \frac{1}{9}, \frac{4}{9}, \frac{7}{9}, \frac{2}{9}, \frac{5}{9}, \frac{8}{9}, \frac{1}{27}$$
As the sequence proceeds, it fills in the gaps in a uniform fashion.

Several authors have expanded this idea to many dimensions. Two versions supported by the MODEL procedure are the Sobol sequence (QUASI=SOBOL) and the Faure sequence (QUASI=FAURE). The Sobol sequence is based on binary numbers and is generally computationally faster than the Faure sequence. The Faure sequence uses the dimensionality of the problem to determine the number base to use to generate the sequence. The Faure sequence has better distributional properties than the Sobol sequence for dimensions greater than 8.

As an example of the difference between a pseudo-random number and a quasi-random number, consider simulating a bivariate normal with 100 draws.
Figure 18.73 Kernel Density of a Bivariate Normal produced by 100 Faure-Random Draws
Figure 18.74 Kernel Density of a Bivariate Normal produced by 100 Pseudo-Random Draws
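The base-3 digit-reversal construction described earlier in this section can be checked with a few lines of Python. The function name `radical_inverse` is a label chosen for this sketch, and the code covers only the one-dimensional sequence, not the Sobol or Faure generators selected by the QUASI= option. Exact fractions are used so that the values can be compared directly with those in the text.

```python
from fractions import Fraction


def radical_inverse(n, base=3):
    """Write n in the given base and mirror its digits about the radix point."""
    result = Fraction(0)
    place = Fraction(1, base)          # weight of the first digit after the point
    while n > 0:
        n, digit = divmod(n, base)     # peel off the least significant digit
        result += digit * place
        place /= base
    return result


# The first ten terms of the base-3 sequence, starting from n = 0.
print([str(radical_inverse(n)) for n in range(10)])
```

Because successive integers differ first in their low-order digits, consecutive terms land in different thirds of the unit interval, which is why the sequence fills gaps evenly.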
The first page of output produced by the SOLVE step is shown in Figure 18.75. This is the summary description of the model. The error message states that the simulation was aborted at observation 144 because of missing input values.
The second page of output, shown in Figure 18.76, gives more information on the failed observation.
Figure 18.76 Solve Step Error Message

The MODEL Procedure
Dynamic Single-Equation Forecast

ERROR: Solution values are missing because of missing input values for observation 144 at NEWTON iteration 0.
NOTE: Additional information on the values of the variables at this observation, which may be helpful in determining the cause of the failure of the solution process, is printed below.

Observation   144     Iteration   0     CC   -1.000000
                      Missing     1

Iteration Errors - Missing.

--- Listing of Program Data Vector ---
_N_:        144     ACTUAL.LHUR:          .     ERROR.LHUR:          .
IP:           .     LHUR:           7.10000     PRED.LHUR:           .
a:      0.01071     b:             -0.47885     c:             0.92930
From the program data vector, you can see the variable IP is missing for observation 144. LHUR could not be computed, so the simulation aborted. The solution summary table is shown in Figure 18.77.
Solution Summary

Variables Solved               1
Forecast Lag Length            1
Solution Method           NEWTON
CONVERGE=                   1E-8
Maximum CC                     0
Maximum Iterations             1
Total Iterations             143
Average Iterations             1

Observations Processed

Read                         145
Lagged                         1
Solved                       143
First                          2
Last                         145
Failed                         1

Variables Solved For        LHUR
This solution summary table includes the names of the input data set and the output data set followed by a description of the model. The table also indicates that the solution method defaulted to Newton's method. The remaining output is defined as follows.

Maximum CC            is the maximum convergence value accepted by the Newton procedure. This number is always less than the value for the CONVERGE= option.
Maximum Iterations    is the maximum number of Newton iterations performed at each observation and each replication of Monte Carlo simulations.
Total Iterations      is the sum of the number of iterations required for each observation and each Monte Carlo simulation.
Average Iterations    is the average number of Newton iterations required to solve the system at each step.
Solved                is the number of observations used times the number of random replications selected plus one, for Monte Carlo simulations. The one additional simulation is the original unperturbed solution. For simulations that do not involve Monte Carlo, this number is the number of observations used.
Summary Statistics
The STATS and THEIL options are used to select goodness-of-fit statistics. Actual values must be provided in the input data set for these statistics to be printed. When the RANDOM= option is specified, the statistics do not include the unperturbed (_REP_=0) solution.
The following statements show the addition of the STATS and THEIL options to the model in the previous section:
   proc model data=sashelp.citimon;
      parameters a 0.010708 b -0.478849 c 0.929304;
      lhur = 1/(a * ip) + b + c * lag(lhur);
      solve lhur / out=sim dynamic stats theil;
      range date to '01nov91'd;
   run;
The STATS output in Figure 18.78 and the THEIL output in Figure 18.79 are generated.
Figure 18.78 STATS Output

The MODEL Procedure
Dynamic Single-Equation Simulation
Solution Range DATE = FEB1980 To NOV1991

Descriptive Statistics

                           Actual                Predicted
Variable   N Obs     N     Mean      Std Dev     Mean      Std Dev
LHUR         142   142     7.0887    1.4509      7.2473    1.1465

Statistics of fit

                   Mean     Mean %    Mean Abs    Mean Abs      RMS       RMS %
Variable     N     Error    Error     Error       % Error       Error     Error
LHUR       142     0.1585   3.5289    0.6937      10.0001       0.7854    11.2452

Statistics of fit

Variable   R-Square   Label
LHUR         0.7049   UNEMPLOYMENT RATE: ALL WORKERS, 16 YEARS
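The number of observations (Nobs), the number of observations with both predicted and actual values nonmissing (N), and the mean and standard deviation of the actual and predicted values of the determined variables are printed first. The next set of columns in the output are defined as follows:

Mean Error          $\frac{1}{N} \sum_{j=1}^{N} (\hat{y}_j - y_j)$
Mean % Error        $\frac{100}{N} \sum_{j=1}^{N} (\hat{y}_j - y_j)/y_j$
Mean Abs Error      $\frac{1}{N} \sum_{j=1}^{N} |\hat{y}_j - y_j|$
Mean Abs % Error    $\frac{100}{N} \sum_{j=1}^{N} |(\hat{y}_j - y_j)/y_j|$
RMS Error           $\sqrt{\frac{1}{N} \sum_{j=1}^{N} (\hat{y}_j - y_j)^2}$
RMS % Error         $100 \sqrt{\frac{1}{N} \sum_{j=1}^{N} ((\hat{y}_j - y_j)/y_j)^2}$
R-square            $1 - SSE/CSSA$
SSE                 $\sum_{j=1}^{N} (\hat{y}_j - y_j)^2$
SSA                 $\sum_{j=1}^{N} (y_j)^2$
CSSA                $SSA - \frac{1}{N} \left( \sum_{j=1}^{N} y_j \right)^2$
$y$                 actual value
$\hat{y}$           predicted value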
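The goodness-of-fit measures printed by the STATS option can be reproduced by hand for a small example. The Python function below is a sketch only: the name `fit_statistics` and the use of `None` for a missing value are conventions of this example, the error is taken as predicted minus actual, and CSSA is computed as the mean-corrected sum of squares of the actual values.

```python
import math


def fit_statistics(actual, predicted):
    """Goodness-of-fit measures over observations where both values are present."""
    pairs = [(y, p) for y, p in zip(actual, predicted)
             if y is not None and p is not None]
    n = len(pairs)
    err = [p - y for y, p in pairs]             # predicted minus actual
    rel = [(p - y) / y for y, p in pairs]       # relative error (actual must be nonzero)
    sse = sum(e * e for e in err)
    ssa = sum(y * y for y, _ in pairs)
    cssa = ssa - sum(y for y, _ in pairs) ** 2 / n
    return {
        "N": n,
        "Mean Error": sum(err) / n,
        "Mean % Error": 100.0 * sum(rel) / n,
        "Mean Abs Error": sum(abs(e) for e in err) / n,
        "Mean Abs % Error": 100.0 * sum(abs(r) for r in rel) / n,
        "RMS Error": math.sqrt(sse / n),
        "RMS % Error": 100.0 * math.sqrt(sum(r * r for r in rel) / n),
        "R-Square": 1.0 - sse / cssa,
    }


stats = fit_statistics([1.0, 2.0, 4.0, None], [2.0, 2.0, 3.0, 5.0])
for name, value in stats.items():
    print(name, value)
```

The fourth observation is dropped because its actual value is missing, which mirrors the distinction between N Obs and N in the printed output.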
When the RANDOM= option is specified, the statistics do not include the unperturbed (_REP_=0) solution.

The THEIL option specifies that Theil forecast error statistics be computed for the actual and predicted values and for the relative changes from lagged values. Mathematically, the quantities are
$$\hat{yc} = (\hat{y} - \mathrm{lag}(y)) / \mathrm{lag}(y)$$
$$yc = (y - \mathrm{lag}(y)) / \mathrm{lag}(y)$$
where $\hat{yc}$ is the relative change for the predicted value and $yc$ is the relative change for the actual value.
Variable     N      MSE
LHUR       142   0.6168

Theil Relative Change Forecast Error Statistics

                 Relative Change      MSE Decomposition Proportions
                             Corr     Bias    Reg     Dist    Var     Covar
Variable     N      MSE      (R)      (UM)    (UR)    (UD)    (US)    (UC)
LHUR       142   0.0126    -0.08      0.09    0.85    0.06    0.43    0.47

Theil Relative Change Forecast Error Statistics

             Inequality Coef
Variable         U1         U
LHUR         4.1226    0.8348
The columns have the following meaning:

Corr (R)      is the correlation coefficient, $\rho$, between the actual and predicted values.
              $$\rho = \frac{\mathrm{cov}(y, \hat{y})}{\sigma_a \sigma_p}$$
              where $\sigma_p$ and $\sigma_a$ are the standard deviations of the predicted and actual values.

Bias (UM)     is an indication of systematic error and measures the extent to which the average values of the actual and predicted deviate from each other.
              $$\frac{(E(y) - E(\hat{y}))^2}{\frac{1}{N} \sum_{t=1}^{N} (y_t - \hat{y}_t)^2}$$

Reg (UR)      is defined as $(\sigma_p - \rho \, \sigma_a)^2 / MSE$.

Dist (UD)     is defined as $(1 - \rho^2) \sigma_a \sigma_a / MSE$ and represents the variance of the residuals obtained by regressing $yc$ on $\hat{yc}$.

Var (US)      is the variance proportion. US indicates the ability of the model to replicate the degree of variability in the endogenous variable.
              $$US = \frac{(\sigma_p - \sigma_a)^2}{MSE}$$

Covar (UC)    represents the remaining error after deviations from average values and average variabilities have been accounted for.
              $$UC = \frac{2 (1 - \rho) \sigma_p \sigma_a}{MSE}$$

U1            is a statistic that measures the accuracy of a forecast, defined as follows:
              $$U1 = \frac{\sqrt{MSE}}{\sqrt{\frac{1}{N} \sum_{t=1}^{N} (y_t)^2}}$$

U             is the Theil's inequality coefficient, defined as follows:
              $$U = \frac{\sqrt{MSE}}{\sqrt{\frac{1}{N} \sum_{t=1}^{N} (y_t)^2} + \sqrt{\frac{1}{N} \sum_{t=1}^{N} (\hat{y}_t)^2}}$$

MSE           is the mean square error. In the case of the relative change Theil statistics, the MSE is computed as follows:
              $$MSE = \frac{1}{N} \sum_{t=1}^{N} (\hat{yc}_t - yc_t)^2$$
More information about these statistics can be found in the references Maddala (1977, 344-347) and Pindyck and Rubinfeld (1981, 364-365).
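A useful check on the Theil decomposition is that the proportions sum to one in two ways: UM + UR + UD = 1 and UM + US + UC = 1. The Python sketch below computes the statistics for a pair of series under the assumption that means, variances, and the covariance are population moments (divisor N); the function name and the sample data are invented for the illustration.

```python
import math


def theil_statistics(actual, predicted):
    """Theil forecast error statistics using population (divisor N) moments."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    sd_a = math.sqrt(sum((y - mean_a) ** 2 for y in actual) / n)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    cov = sum((y - mean_a) * (p - mean_p) for y, p in zip(actual, predicted)) / n
    rho = cov / (sd_a * sd_p)
    mse = sum((p - y) ** 2 for y, p in zip(actual, predicted)) / n
    rms_a = math.sqrt(sum(y * y for y in actual) / n)
    rms_p = math.sqrt(sum(p * p for p in predicted) / n)
    return {
        "MSE": mse,
        "R": rho,
        "UM": (mean_a - mean_p) ** 2 / mse,        # bias proportion
        "UR": (sd_p - rho * sd_a) ** 2 / mse,      # regression proportion
        "UD": (1.0 - rho ** 2) * sd_a ** 2 / mse,  # disturbance proportion
        "US": (sd_p - sd_a) ** 2 / mse,            # variance proportion
        "UC": 2.0 * (1.0 - rho) * sd_p * sd_a / mse,  # covariance proportion
        "U1": math.sqrt(mse) / rms_a,
        "U": math.sqrt(mse) / (rms_a + rms_p),
    }


t = theil_statistics([1.0, 2.0, 4.0, 3.0], [2.0, 2.0, 3.0, 5.0])
print(t["UM"] + t["UR"] + t["UD"], t["UM"] + t["US"] + t["UC"])
```

To obtain the relative change version, pass the series of relative changes from lagged values instead of the levels.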
   data demand;
      do t=1 to 40;
         price = (rannor(10) + 5) * 10;
         income = 8000 * t ** (1/8);
         demand = 7200 - 1054 * price ** (2/3) +
                  7 * income + 100 * rannor(1);
         output;
      end;
   run;

   data goal;
      demand = 85000;
      income = 12686;
   run;
The goal is to find the price the company would have to charge to meet a sales target of 85,000 units. To do this, a data set is created with a DEMAND variable set to 85000 and with an INCOME variable set to 12686, the last income value. The desired price is then determined by using the following PROC MODEL statements:
   proc model data=demand outmodel=demandModel;
      demand = a1 - a2 * price ** (2/3) + a3 * income;
      fit demand / outest=demest;
      solve price / estdata=demest data=goal solveprint;
   run;
The SOLVEPRINT option prints the solution values, number of iterations, and final residuals at each observation. The SOLVEPRINT output from this solve is shown in Figure 18.80.

Figure 18.80 Goal Seeking, SOLVEPRINT Output

The MODEL Procedure
Single-Equation Simulation

Observation   1     Iterations   6     CC             0.000000
                                       ERROR.demand   0.000000
The output indicates that it took six Newton iterations to determine the PRICE of 33.5902, which makes the DEMAND value within 16E-11 of the goal of 85,000 units. Consider a more ambitious goal of 100,000 units. The output shown in Figure 18.81 indicates that the sales target of 100,000 units is not attainable according to this model.
   data goal;
      demand = 100000;
      income = 12686;
   run;

   proc model model=demandModel;
      solve price / estdata=demest data=goal solveprint;
   run;
Observation   1     Iteration   1     CC   -1.000000
                    Missing     1

--- Listing of Program Data Vector ---
_N_:               12     ACTUAL.demand:        100000     ERROR.demand:              .
PRED.demand:        .     a1:              7126.437997     a2:              1040.841492
a3:          6.992694     demand:               100000     income:                12686
price:      -0.000172     @PRED.demand/@pri:         .
The program data vector with the error note indicates that even after 10 subiterations, the norm of the residuals could not be reduced. The sales target of 100,000 units is unattainable with the given model. You might need to reformulate your model or collect more data to more accurately reflect the market response.
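Both goal-seeking results can be checked by hand with a small Newton iteration on the fitted demand equation. The Python sketch below is an illustration, not the SOLVE statement's algorithm: the function name, starting price, and tolerance are choices made here, and the parameter values are the rounded estimates a1, a2, and a3 that appear in the program data vector listing above.

```python
A1, A2, A3 = 7126.437997, 1040.841492, 6.992694   # estimates from the listing
INCOME = 12686.0


def solve_price(target, income, a1, a2, a3, price=1.0, tol=1e-6, max_iter=50):
    """Newton iteration for the price that makes predicted demand hit the target.

    Returns (price, iterations), or (None, iterations) when no positive
    price can satisfy the equation.
    """
    for iteration in range(max_iter):
        resid = a1 - a2 * price ** (2.0 / 3.0) + a3 * income - target
        if abs(resid) < tol:
            return price, iteration
        slope = -(2.0 / 3.0) * a2 * price ** (-1.0 / 3.0)   # d(resid)/d(price)
        price -= resid / slope
        if price <= 0.0:
            # price ** (2/3) is only meaningful for positive prices here.
            return None, iteration + 1
    return None, max_iter


print(solve_price(85000.0, INCOME, A1, A2, A3))
print(solve_price(100000.0, INCOME, A1, A2, A3))
```

For the 100,000 unit target the income term alone cannot reach the goal, so the residual is negative for every positive price and the iteration is pushed to a nonpositive price, which is the same qualitative failure the listing shows with PRICE near zero.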
Single-Equation Solution
For normalized form equation systems, the solution either can simultaneously satisfy all the equations or can be computed for each equation separately, by using the actual values of the solution variables in the current period to compute each predicted value. By default, PROC MODEL computes a simultaneous solution. The SINGLE option in the SOLVE statement selects single-equation solutions. Single-equation simulations are often made to produce residuals (which estimate the random terms of the stochastic equations) rather than the predicted values themselves. If the input data and range are the same as that used for parameter estimation, a static single-equation simulation reproduces the residuals of the estimation.
Newton's Method

The NEWTON option in the SOLVE statement requests Newton's method to simultaneously solve the equations for each observation. Newton's method is the default solution method. Newton's method is an iterative scheme that uses the derivatives of the equations with respect to the solution variables, J, to compute a change vector as
$$\Delta y^i = -J^{-1} q(y^i, x, \theta)$$
PROC MODEL builds and solves J by using efficient sparse matrix techniques. The solution variables $y^i$ at the ith iteration are then updated as
$$y^{i+1} = y^i + d \times \Delta y^i$$
where d is a damping factor between 0 and 1 chosen iteratively so that
$$\| q(y^{i+1}, x, \theta) \| < \| q(y^i, x, \theta) \|$$
The number of subiterations allowed for finding a suitable d is controlled by the MAXSUBITER= option. The number of iterations of Newton's method allowed for each observation is controlled by the MAXITER= option. See Ortega and Rheinboldt (1970) for more details.
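The damped Newton update can be sketched for a two-equation system in Python. This is an illustration under assumptions, not PROC MODEL's sparse implementation: the change vector is taken as minus the inverse Jacobian times the residuals, the damping factor is halved at each subiteration, the norm is the maximum absolute residual, and the example system and the names `max_iter` and `max_subiter` (standing in for MAXITER= and MAXSUBITER=) are chosen for this sketch.

```python
def damped_newton(q, jac, y, tol=1e-10, max_iter=50, max_subiter=10):
    """Solve q(y) = 0 for a 2-vector y with Newton steps and step halving."""
    for iteration in range(max_iter):
        r = q(y)
        norm = max(abs(v) for v in r)
        if norm < tol:
            return y, iteration
        (a, b), (c, e) = jac(y)
        det = a * e - b * c
        # Change vector: minus the inverse Jacobian times the residuals.
        dy = [-(e * r[0] - b * r[1]) / det, -(-c * r[0] + a * r[1]) / det]
        d = 1.0
        for _ in range(max_subiter):
            trial = [y[0] + d * dy[0], y[1] + d * dy[1]]
            if max(abs(v) for v in q(trial)) < norm:
                break                      # residual norm was reduced
            d /= 2.0                       # otherwise damp the step
        else:
            raise RuntimeError("residual norm could not be reduced")
        y = trial
    raise RuntimeError("iteration limit reached")


def q_example(y):
    # Hypothetical system: y0 + y1 = 3 and y0^2 + y1^2 = 9.
    return [y[0] + y[1] - 3.0, y[0] ** 2 + y[1] ** 2 - 9.0]


def jac_example(y):
    return [[1.0, 1.0], [2.0 * y[0], 2.0 * y[1]]]


solution, iterations = damped_newton(q_example, jac_example, [1.0, 5.0])
print(solution, iterations)
```

When no damping factor reduces the residual norm within the allowed subiterations, the sketch raises an error, which parallels the "norm of the residuals could not be reduced" situation in the goal-seeking example.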
Jacobi Method
The JACOBI option in the SOLVE statement selects a matrix-free alternative to Newton's method. This method is the traditional nonlinear Jacobi method found in the literature. The Jacobi method as implemented in PROC MODEL substitutes predicted values for the endogenous variables and iterates until a fixed point is reached. The necessary derivatives are then computed only for the diagonal elements of the Jacobian, J. If the normalized form equation is
$$y = f(y, x, \theta)$$
the Jacobi iteration has the form
$$y^{i+1} = f(y^i, x, \theta)$$
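The fixed-point form of the Jacobi iteration can be written in a few lines of Python. The sketch shows only the textbook iteration, with every equation evaluated from the previous iterate; the two-equation system, the exogenous value `x`, and the stopping rule are assumptions of the example.

```python
import math


def jacobi_solve(equations, y0, tol=1e-12, max_iter=500):
    """Iterate y_new = f(y_old) with all equations using the previous iterate."""
    y = list(y0)
    for iteration in range(1, max_iter + 1):
        new_y = [f(y) for f in equations]      # no value is updated mid-pass
        if max(abs(a - b) for a, b in zip(new_y, y)) < tol:
            return new_y, iteration
        y = new_y
    raise RuntimeError("Jacobi iteration did not converge")


x = 0.25   # hypothetical exogenous value
system = [
    lambda y: 0.5 * math.cos(y[1]) + x,   # y1 = 0.5*cos(y2) + x
    lambda y: 0.5 * math.sin(y[0]),       # y2 = 0.5*sin(y1)
]
solution, passes = jacobi_solve(system, [0.0, 0.0])
print(solution, passes)
```

Because the list of new values is built before any old value is overwritten, reordering the equations changes nothing, which is the order invariance discussed later in this section.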
Seidel Method
The Seidel method is an order-dependent alternative to the Jacobi method. The Seidel method is selected by the SEIDEL option in the SOLVE statement. The Seidel method is like the Jacobi method except that in the Seidel method the model is further edited to substitute the predicted values into the solution variables immediately after they are computed. Seidel thus differs from the other methods in that the values of the solution variables are not fixed within an iteration. With the other methods, the order of the equations in the model program makes no difference, but the Seidel method might work much differently when the equations are specified in a different sequence. Note that this fixed point method is the traditional nonlinear Seidel method found in the literature. The iteration has the form
$$y_j^{i+1} = f(\hat{y}^i, x, \theta)$$
where $y_j^{i+1}$ is the jth equation variable at the ith iteration and
$$\hat{y}^i = (y_1^{i+1}, y_2^{i+1}, y_3^{i+1}, \ldots, y_{j-1}^{i+1}, y_j^i, y_{j+1}^i, \ldots, y_g^i)'$$
If the model is recursive, and if the equations are in recursive order, the Seidel method converges at once. If the model is block-recursive, the Seidel method might converge faster if the equations are grouped by block and the blocks are placed in block-recursive order. The BLOCK option can be used to determine the block-recursive form.
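The effect of equation order on the Seidel method can be seen with a small recursive model. The Python sketch below is illustrative only: the three equations, the starting values of zero, and the rule that convergence is declared when a full sweep changes nothing are assumptions of the example.

```python
def seidel_solve(equations, start, tol=1e-12, max_iter=100):
    """Sweep through the equations, storing each new value immediately."""
    y = dict(start)
    for sweep in range(1, max_iter + 1):
        change = 0.0
        for name, f in equations:
            new_value = f(y)
            change = max(change, abs(new_value - y[name]))
            y[name] = new_value            # later equations see this value at once
        if change < tol:
            return y, sweep
    raise RuntimeError("Seidel iteration did not converge")


x = 1.5   # hypothetical exogenous value
recursive_order = [
    ("y1", lambda y: 2.0 * x),
    ("y2", lambda y: y["y1"] + 1.0),
    ("y3", lambda y: y["y1"] * y["y2"]),
]
start = {"y1": 0.0, "y2": 0.0, "y3": 0.0}

sol_a, sweeps_a = seidel_solve(recursive_order, start)
sol_b, sweeps_b = seidel_solve(list(reversed(recursive_order)), start)
print(sol_a, sweeps_a)
print(sol_b, sweeps_b)
```

In recursive order the first sweep already produces the solution and the second sweep only confirms it; with the equations reversed, each sweep can resolve just one more level of the recursion, so more sweeps are needed to reach the same values.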
The second case is a system of equations that contains one or more EQ.var equations. In this case, a heuristic algorithm is used to make the assignment of a unique solution variable to each general form equation. Use the DETAILS option in the SOLVE statement to print a listing of the assigned variables. Once the assignment is made, the new y approximation is computed as
$$y^{i+1} = y^i - \frac{f(x, y^i) - y^i}{\partial (f(x, y^i) - y^i) / \partial y}$$
If k is the number of general form equations, then k derivatives are required. The convergence properties of the Jacobi and Seidel solution methods remain significantly poorer than the default Newton's method.
Comparison of Methods
Newton's method is the default and should work better than the others for most small- to medium-sized models. The Seidel method is always faster than the Jacobi for recursive models with equations in recursive order. For very large models and some highly nonlinear smaller models, the Jacobi or Seidel methods can sometimes be faster. Newton's method uses more memory than the Jacobi or Seidel methods.

Both Newton's method and the Jacobi method are order-invariant in the sense that the order in which equations are specified in the model program has no effect on the operation of the iterative solution process. In order-invariant methods, the values of the solution variables are fixed for the entire execution of the model program. Assignments to model variables are automatically changed to assignments to corresponding equation variables. Only after the model program has completed execution are the results used to compute the new solution values for the next iteration.
Troubleshooting Problems
In solving a simultaneous nonlinear dynamic model you might encounter some of the following problems.
Missing Values
For SOLVE tasks, there can be no missing parameter values. Missing right-hand-side variables result in missing left-hand-side variables for that observation.
Unstable Solutions
A solution might exist but be unstable. An unstable system can cause the Jacobi and Seidel methods to diverge.
Explosive Dynamic Systems
A model might have well-behaved solutions at each observation but be dynamically unstable. The solution might oscillate wildly or grow rapidly with time.
Propagation of Errors
During the solution process, solution variables can take on values that cause computational errors. For example, a solution variable that appears in a LOG function might be positive at the solution but might be given a negative value during one of the iterations. When computational errors occur, missing values are generated and propagated, and the solution process might collapse.
Convergence Problems
The following items can cause convergence problems:

- There are illegal function values (for example, $\sqrt{-1}$).
- There are local minima in the model equation.
- No solution exists.
- Multiple solutions exist.
- Initial values are too far from the solution.
- The CONVERGE= value is too small.

When PROC MODEL fails to find a solution to the system, the current iteration information and the program data vector are printed. The simulation halts if actual values are not available for the simulation to proceed. Consider the following program, which produces the output shown in Figure 18.82:

   data test1;
      do t=1 to 50;
         x1 = sqrt(t);
         y = .;
         output;
      end;

   proc model data=test1;
      exogenous x1;
      control a1 -1 b1 -29 c1 -4;
      y = a1 * sqrt(y) + b1 * x1 * x1 + c1 * lag(x1);
      solve y / out=sim forecast dynamic;
   run;
Observation   1     Iteration   1     CC   -1.000000
                    Missing     1

(Program data vector values: .   -1   1.41421)
NOTE: Check for missing input data or uninitialized lags. (Note that the LAG and DIF functions return missing values for the initial lag starting observations. This is a change from the 1982 and earlier versions of SAS/ETS which returned zero for uninitialized lags.) NOTE: Simulation aborted.
At the first observation, a solution to the following equation is attempted:
$$y = -\sqrt{y} - 62$$
There is no solution to this problem. The iterative solution process got as close as it could to making Y negative while still being able to evaluate the model. This problem can be avoided in this case by altering the equation.

In other models, the problem of missing values can be avoided by either altering the data set to provide better starting values for the solution variables or by altering the equations.

You should be aware that, in general, a nonlinear system can have any number of solutions and the solution found might not be the one that you want. When multiple solutions exist, the solution that is found is usually determined by the starting values for the iterations. If the value from the input data set for a solution variable is missing, the starting value for it is taken from the solution of the last period (if nonmissing) or else the solution estimate is started at 0.
Iteration Output
The iteration output, produced by the ITPRINT option, is useful in determining the cause of a convergence problem. The ITPRINT option forces the printing of the solution approximation and equation errors at each iteration for each observation. A portion of the ITPRINT output from the following statements is shown in Figure 18.83.
   proc model data=test1;
      exogenous x1;
      control a1 -1 b1 -29 c1 -4;
      y = a1 * sqrt(abs(y)) + b1 * x1 * x1 + c1 * lag(x1);
      solve y / out=sim forecast dynamic itprint;
   run;
For each iteration, the equation with the largest error is listed in parentheses after the Newton convergence criteria measure. From this output you can determine which equation or equations in the system are not converging well.
Predicted Values
y   0.0001000

Iteration Errors
y   -62.01010

Observation   1     Iteration   1     CC        50.902771
                                      ERROR.y   -61.88684

Predicted Values
y   -1.215784

Iteration Errors
y   -61.88684

Observation   1     Iteration   2     CC        0.364806
                                      ERROR.y   41.752112
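The first values in this ITPRINT output can be retraced with an undamped Newton iteration on the single equation with ABS added, using the constant -62 from the first solved observation and a starting value of 0.0001. The Python sketch below is a reconstruction for illustration; the function name and the convergence tolerance are assumptions, and PROC MODEL's own iteration also applies damping and its own convergence measure.

```python
import math


def newton_abs_model(y=0.0001, tol=1e-10, max_iter=50):
    """Newton iteration on error(y) = -sqrt(|y|) - 62 - y, recording each iterate."""
    path = [y]
    for _ in range(max_iter):
        error = -math.sqrt(abs(y)) - 62.0 - y      # predicted value minus current y
        if abs(error) < tol:
            return y, path
        if y == 0.0:
            raise ZeroDivisionError("derivative of sqrt(|y|) is undefined at 0")
        slope = -math.copysign(1.0, y) / (2.0 * math.sqrt(abs(y))) - 1.0
        y -= error / slope
        path.append(y)
    raise RuntimeError("iteration limit reached")


solution, path = newton_abs_model()
for value in path:
    print(value)
```

The second iterate matches the predicted value of about -1.2158 shown above, and the iteration settles on a negative solution, which exists only because ABS keeps the square root defined for negative Y.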
Numerical Integration
The differential equation system is numerically integrated to obtain a solution for the derivative variables at each data point. The integration is performed by evaluating the provided model at
multiple points between each data point. The integration method used is a variable order, variable step-size backward difference scheme; for more detailed information, see Aiken (1985) and Byrne and Hindmarsh (1975). The step size or time step is chosen to satisfy a local truncation error requirement. The term truncation error comes from the fact that the integration scheme uses a truncated series expansion of the integrated function to do the integration. Because the series is truncated, the integration scheme is within the truncation error of the true value.

To further improve the accuracy of the integration, the total integration time is broken up into small intervals (time steps or step sizes), and the integration scheme is applied to those intervals. The integration at each time step uses the values computed at the previous time step so that the truncation error tends to accumulate. It is usually not possible to estimate the global error with much precision. The best that can be done is to monitor and to control the local truncation error, which is the truncation error committed at each time step relative to
$$d = \max_{0 \le t \le T} ( \| y(t) \|_\infty , 1 )$$
where $y(t)$ is the integrated variable. Furthermore, the $y(t)$'s are dynamically scaled to within two orders of magnitude of one to keep the error monitoring well behaved.

The local truncation error requirement defaults to 1.0E-9. You can specify the LTEBOUND= option to modify that requirement. The LTEBOUND= option is a relative measure of accuracy, so a value smaller than 1.0E-10 is usually not practical. A larger bound increases the speed of the simulation and estimation but decreases the accuracy of the results. If the LTEBOUND= option is set too small, the integrator is not able to take time steps small enough to satisfy the local truncation error requirement and still have enough machine precision to compute the results. Since the integrations are scaled to within 1.0E-2 of one, the simulated values should be correct to at least seven decimal places.

There is a default minimum time step of 1.0E-14. This minimum time step is controlled by the MINTIMESTEP= option and the machine epsilon. If the minimum time step is smaller than the machine epsilon times the final time value, the minimum time step is increased automatically.

For the points between each observation in the data set, the values for nonintegrated variables in the data set are obtained from a linear interpolation from the two closest points. Lagged variables can be used with integrations, but their values are discrete and are not interpolated between points. Lagging, therefore, can be used to input step functions into the integration.

The derivatives necessary for estimation (the gradient with respect to the parameters) and goal seeking (the Jacobian) are computed by numerically integrating analytical derivatives. The accuracy of the derivatives is controlled by the same integration techniques mentioned previously.
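The idea of choosing the time step to satisfy a local truncation error bound can be illustrated with a much simpler integrator than the one PROC MODEL uses. The Python sketch below swaps in a classical fourth-order Runge-Kutta step with step doubling in place of the variable-order backward difference scheme: each step is computed once with step h and again as two steps of h/2, and the difference, taken relative to a running value of max(|y|, 1), serves as the local error estimate. The names `lte_bound` and `min_step` only echo the LTEBOUND= and MINTIMESTEP= options.

```python
import math


def rk4_step(f, t, y, h):
    """One classical Runge-Kutta step of size h for the scalar ODE y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2.0, y + h / 2.0 * k1)
    k3 = f(t + h / 2.0, y + h / 2.0 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(f, t0, y0, t_end, lte_bound=1e-9, min_step=1e-14):
    """Integrate from t0 to t_end, adapting the step to a local error bound."""
    t, y = t0, y0
    h = t_end - t0
    scale = max(abs(y0), 1.0)
    steps = 0
    while t_end - t > 1e-12 * max(1.0, abs(t_end)):
        h = min(h, t_end - t)
        full = rk4_step(f, t, y, h)                       # one step of size h
        half = rk4_step(f, t, y, h / 2.0)
        fine = rk4_step(f, t + h / 2.0, half, h / 2.0)    # two steps of size h/2
        d = max(scale, abs(fine))
        if abs(fine - full) / d <= lte_bound or h <= min_step:
            t, y, scale = t + h, fine, d                  # accept the step
            steps += 1
            h *= 2.0                                      # try a larger step next
        else:
            h /= 2.0                                      # reject and retry smaller
    return y, steps


value, steps = integrate(lambda t, y: -y, 0.0, 1.0, 1.0)
print(value, math.exp(-1.0), steps)
```

Rerunning with a larger `lte_bound` takes fewer steps and gives a less accurate result, which is the speed and accuracy trade-off described above.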
Limitations
There are limitations to the types of differential equations that can be solved or estimated. One type is an explosive differential equation (finite escape velocity) for which the following differential equation is an example:
$$y' = ay, \quad a > 0$$
If this differential equation is integrated too far in time, y exceeds the maximum value allowed on the computer, and the integration terminates.

Likewise, differential systems that are singular cannot be solved or estimated in general. For example, consider the following differential system:
$$x' = -y' + 2x + 4y + \exp(t)$$
$$y' = -x' + y + \exp(4t)$$
This system has an analytical solution, but an accurate numerical solution is very difficult to obtain. The reason is that $y'$ and $x'$ cannot be isolated on the left-hand side of the equation. If the equation is modified slightly to
$$x' = -y' + 2x + 4y + \exp(t)$$
$$y' = x' + y + \exp(4t)$$
the system is nonsingular, but the integration process could still fail or be extremely slow. If the MODEL procedure encounters either system, a warning message is issued.

This system can be rewritten as the following recursive system, which can be estimated and simulated successfully with the MODEL procedure:
$$x' = x + 1.5y + 0.5\exp(t) - 0.5\exp(4t)$$
$$y' = x' + y + \exp(4t)$$
Petzold (1982) mentions a class of differential algebraic equations that, when integrated numerically, could produce incorrect or misleading results. An example of such a system is
$$y_2'(t) = y_1(t) + g_1(t)$$
$$0 = y_2(t) + g_2(t)$$
The analytical solution to this system depends on g and its derivatives at the current time only and not on its initial value or past history. You should avoid systems of this and other similar forms mentioned in Petzold (1982).
Typically, the SDATA= data set is created by the OUTS= option in a previous FIT statement. (The OUTS= data set from a FIT statement can be read back in by a SOLVE statement in the same PROC MODEL step.)

You can create an input SDATA= data set by using the DATA step. PROC MODEL expects to find a character variable _NAME_ in the SDATA= data set as well as variables for the equations in the estimation or solution. For each observation with a _NAME_ value that matches the name of an equation, PROC MODEL fills the corresponding row of the S matrix with the values of the equation-named variables found in the data set. If a row or column is omitted from the data set, an identity matrix row or column is assumed. Missing values are ignored. Since the S matrix is symmetric, you can include only a triangular part of the S matrix in the SDATA= data set with the omitted part indicated by missing values. If the SDATA= data set contains multiple observations with the same _NAME_, the last values supplied for the _NAME_ variable are used. The section "OUTS= Data Set" on page 1112 contains more details on the format of this data set.

Use the TYPE= option to specify the type of estimation method used to produce the S matrix you want to input.
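The rules for reading an SDATA= data set can be pictured with a small Python sketch. It is a reading of the rules stated above, not PROC MODEL's code: each row is a (_NAME_, values) pair, `None` stands for a missing value, omitted rows and columns default to the identity, a supplied value is mirrored to the symmetric position, and later rows with the same name override earlier ones. The function name and the data layout are invented for the illustration.

```python
def build_s_matrix(equations, rows):
    """Assemble a symmetric S matrix from possibly partial, named rows."""
    index = {name: i for i, name in enumerate(equations)}
    n = len(equations)
    # Start from the identity so omitted rows and columns keep identity entries.
    s = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for name, values in rows:              # later rows override earlier ones
        if name not in index:
            continue                       # row does not match any equation
        i = index[name]
        for column, value in values.items():
            if value is None or column not in index:
                continue                   # missing values are ignored
            j = index[column]
            s[i][j] = value
            s[j][i] = value                # S is symmetric
    return s


rows = [
    ("y1", {"y1": 2.0, "y2": None}),       # upper part left missing
    ("y2", {"y1": 0.5, "y2": 3.0}),        # lower triangle supplied
]
for row in build_s_matrix(["y1", "y2", "y3"], rows):
    print(row)
```

Here the equation y3 has no row in the input, so its row and column stay as in the identity matrix, and the single covariance supplied in the lower triangle appears in both off-diagonal positions.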
Model Variables
Model variables are declared by VAR, ENDOGENOUS, or EXOGENOUS statements, or by FIT and SOLVE statements. The model variables are the variables that the model is intended to explain or predict. PROC MODEL enables you to use expressions on the left-hand side of the equal sign to define model equations. For example, a log-linear model for Y can be written as
log( y ) = a + b * x;
Previously, only a variable name was allowed on the left-hand side of the equal sign. The text on the left-hand side of the equation serves as the equation name used to identify the equation in printed output, in the OUT= data sets, and in FIT or SOLVE statements. To refer to equations specified by using left-hand side expressions (in the FIT statement, for example), place the left-hand side expression in quotes. For example, the following statements fit a log-linear model to the dependent variable Y:

   proc model data=in;
      log( y ) = a + b * x;
      fit "log(y)";
   run;

The estimation and simulation are performed by transforming the models into general form equations. No actual or predicted value is available for general form equations, so no $R^2$ or adjusted $R^2$ is computed.
Equation Variables
An equation variable is one of several special variables used by PROC MODEL to control the evaluation of model equations. An equation variable name consists of one of the prefixes EQ, RESID, ERROR, PRED, or ACTUAL, followed by a period and the name of a model equation. Equation variable names can appear in parts of the PROC MODEL printed output, and they can be used in the model program. For example, RESID-prefixed variables can be used in LAG functions to define equations with moving-average error terms. See the section "Autoregressive Moving-Average Error Processes" on page 1088 for details. The meaning of these prefixes is detailed in the section "Equation Translations" on page 1155.
Parameters
Parameters are variables that have the same value for each observation. Parameters can be given values or can be estimated by fitting the model to data. During the SOLVE stage, parameters are treated as constants. If no estimation is performed, the SOLVE stage uses the initial value provided in the ESTDATA= data set, the MODEL= file, or in the PARAMETERS statement, as the value of the parameter. The PARAMETERS statement declares the parameters of the model. Parameters are not lagged, and they cannot be changed by the model program.
Control Variables
Control variables supply constant values to the model program that can be used to control the model in various ways. The CONTROL statement declares control variables and specifies their values. A control variable is like a parameter except that it has a fixed value and is not estimated from the data. Control variables are not reinitialized before each pass through the data and can thus be used to retain values between passes. You can use control variables to vary the program logic. Control variables are not affected by lagging functions. For example, if you have two versions of an equation for a variable Y, you could put both versions in the model and, by using a CONTROL statement to select one of them, produce two different solutions to explore the effect the choice of equation has on the model, as shown in the following statements:
   select (case);
      when (1) y =
      when (2) y =
   end;

   control case 1;
   solve / out=case1;
   run;

   control case 2;
   solve / out=case2;
   run;
Internal Variables
You can use several internal variables in the model program to communicate with the procedure. For example, if you want PROC MODEL to list the values of all the variables when more than 10 iterations are performed and the procedure is past the 20th observation, you can write
if _obs_ > 20 then if _iter_ > 10 then _list_ = 1;
Internal variables are not affected by lagging functions, and they cannot be changed by the model program except as noted. The following internal variables are available. The variables are all numeric except where noted.

_ERRORS_    is a flag that is set to 0 at the start of program execution and is set to a nonzero value whenever an error occurs. The program can also set the _ERRORS_ variable.

_ITER_      is the iteration number. For FIT tasks, the value of _ITER_ is negative for preliminary grid-search passes. The iterative phase of the estimation starts with iteration 0. After the estimates have converged, a final pass is made to collect statistics with _ITER_ set to a missing value. Note that at least one pass, and perhaps several subiteration passes as well, is made for each iteration. For SOLVE tasks, _ITER_ counts the iterations used to compute the simultaneous solution of the system.

_LAG_       is the number of dynamic lags that contribute to the solution at the current observation. _LAG_ is always 0 for FIT tasks and for STATIC solutions. _LAG_ is set to a missing value during the lag starting phase.

_LIST_      is a list flag that is set to 0 at the start of program execution. The program can set _LIST_ to a nonzero value to request a listing of the values of all the variables in the program after the program has finished executing.

_METHOD_    is the solution method in use for SOLVE tasks. _METHOD_ is set to a blank value for FIT tasks. _METHOD_ is a character-valued variable. Values are NEWTON, JACOBI, SEIDEL, or ONEPASS.

_MODE_      takes the value ESTIMATE for FIT tasks and the value SIMULATE or FORECAST for SOLVE tasks. _MODE_ is a character-valued variable.

_NMISS_     is the number of missing or otherwise unusable observations during the model estimation. For FIT tasks, _NMISS_ is initially set to 0; at the start of each iteration, _NMISS_ is set to the number of unusable observations for the previous iteration. For SOLVE tasks, _NMISS_ is set to a missing value.

_NUSED_     is the number of nonmissing observations used in the estimation. For FIT tasks, PROC MODEL initially sets _NUSED_ to the number of parameters; at the start of each iteration, _NUSED_ is reset to the number of observations used in the previous iteration. For SOLVE tasks, _NUSED_ is set to a missing value.

_OBS_       counts the observations being processed. _OBS_ is negative or 0 for observations in the lag starting phase.

_REP_       is the replication number for Monte Carlo simulation when the RANDOM= option is specified in the SOLVE statement. _REP_ is 0 when the RANDOM= option is not used and for FIT tasks. When _REP_=0, the random-number generator functions always return 0.

_WEIGHT_    is the weight of the observation. For FIT tasks, _WEIGHT_ provides a weight for the observation in the estimation. _WEIGHT_ is initialized to 1.0 at the start of execution for FIT tasks. For SOLVE tasks, _WEIGHT_ is ignored.
Program Variables
Variables not in any of the other classes are called program variables. Program variables are used to hold intermediate results of calculations. Program variables are reinitialized to missing values before each observation is processed. Program variables can be lagged. The RETAIN statement can be used to give program variables initial values and enable them to keep their values between observations.
Character Variables
PROC MODEL supports both numeric and character variables. Character variables are not involved in the model specification but can be used to label observations, to write debugging messages, or for documentation purposes. All variables are numeric unless they are one of the following:
- character variables in a DATA= SAS data set
- program variables assigned a character value
- variables declared to be character by a LENGTH or ATTRIB statement
Equation Translations
Equations written in normalized form are always automatically converted to general form equations. For example, when a normalized form equation such as
   y = a + b*x;
is encountered, it is translated into the equations
   PRED.y = a + b*x;
   RESID.y = PRED.y - ACTUAL.y;
   ERROR.y = PRED.y - y;
If the same system is expressed as the following general form equation, then this equation is used unchanged.
   EQ.y = y - (a + b*x);
This makes it easy to solve for arbitrary variables and to modify the error terms for autoregressive or moving average models. Use the LIST option to see how this transformation is performed. For example, the following statements produce the listing shown in Figure 18.84.
   proc model data=line list;
      y = a1 + b1*x1 + c1*x2;
      fit y;
   run;
[Figure 18.84: LIST output for this program. The listing itself is not recoverable from the source.]
PRED.Y is the predicted value of Y, and ACTUAL.Y is the value of Y in the data set. The predicted value minus the actual value, RESID.Y, is then the error term, ε, for the original Y equation. Note that the residuals obtained from the OUTRESID option in the OUT= data set for both the FIT and SOLVE statements are defined as actual - predicted, the negative of RESID.Y. See the section "Syntax: MODEL Procedure" on page 962 for details. ACTUAL.Y and Y have the same value for parameter estimation. For SOLVE tasks, ACTUAL.Y is still the value of Y in the data set, but Y becomes the solved value, that is, the value that satisfies PRED.Y - Y = 0.
The following are the equation variable definitions.
EQ. The value of an EQ.-prefixed equation variable (normally used to define a general form equation) represents the failure of the equation to hold. When the EQ.name variable is 0, the name equation is satisfied.
RESID. The RESID.name variables represent the stochastic parts of the equations and are used to define the objective function for the estimation process. A RESID.-prefixed equation variable is like an EQ.-prefixed variable but makes it possible to use or transform the stochastic part of the equation. The RESID. equation is used in place of the ERROR. equation for model solutions if it has been reassigned or used in the equation.
ERROR. An ERROR.name variable is like an EQ.-prefixed variable, except that it is used only for model solution and does not affect parameter estimation.
PRED. For a normalized form equation (specified by assignment to a model variable), the PRED.name equation variable holds the predicted value, where name is the name of both the model variable and the corresponding equation. (PRED.-prefixed variables are not created for general form equations.)
ACTUAL. For a normalized form equation (specified by assignment to a model variable), the ACTUAL.name equation variable holds the value of the name model variable read from the input data set.
DERT. The DERT.name variable defines a differential equation. Once defined, it might be used on the right-hand side of another equation.
H. The H.name variable specifies the functional form for the variance of the named equation.
GMM_H. This is created for H.vars and is the moment equation for the variance for GMM. This variable is used only for GMM.
   GMM_H.name = RESID.name**2 - H.name;
MSE.
The MSE.y variable contains the value of the mean squared error for y at each iteration. An MSE. variable is created for each dependent/endogenous variable in the model. These variables can be used to specify the missing lagged values in the estimation and simulation of GARCH type models.
   demret = intercept ;
   h.demret = arch0 +
              arch1 * xlag( resid.demret ** 2, mse.demret ) +
              garch1 * xlag( h.demret, mse.demret ) ;
NRESID.
This is created for H.vars and is the normalized residual of the variable <name >. The formula is
NRESID.name = RESID.name/ sqrt(H.name);
The three equation variable prefixes, RESID., ERROR., and EQ., allow for control over the objective function for the FIT, the SOLVE, or both the FIT and the SOLVE stages. For FIT tasks, PROC MODEL looks first for a RESID.name variable for each equation. If defined, the RESID.-prefixed equation variable is used to define the objective function for the parameter estimation process. Otherwise, PROC MODEL looks for an EQ.-prefixed variable for the equation and uses it instead. For SOLVE tasks, PROC MODEL looks first for an ERROR.name variable for each equation. If defined, the ERROR.-prefixed equation variable is used for the solution process. Otherwise, PROC MODEL looks for an EQ.-prefixed variable for the equation and uses it instead. To solve the simultaneous equation system, PROC MODEL computes values of the solution variables (the model variables being solved for) that make all of the ERROR.name and EQ.name variables close to 0.
Derivatives
Nonlinear modeling techniques require the calculation of derivatives of certain variables with respect to other variables. The MODEL procedure includes an analytic differentiator that determines the model derivatives and generates program code to compute these derivatives. When parameters are estimated, the MODEL procedure takes the derivatives of the equation with respect to the parameters. When the model is solved, Newton's method requires the derivatives of the equations with respect to the variables solved for.
PROC MODEL uses exact mathematical formulas for derivatives of non-user-defined functions. For other functions, numerical derivatives are computed and used. The differentiator differentiates the entire model program, including the conditional logic and flow of control statements. Delayed definitions, as when the LAG of a program variable is referred to before the variable is assigned a value, are also differentiated correctly. The differentiator includes optimization features that produce efficient code for the calculation of derivatives. However, when flow of control statements such as GOTO statements are used, the optimization process is impeded, and less efficient code for derivatives might be produced. Optimization is also reduced by conditional statements, iterative DO loops, and multiple assignments to the same variable. The table of derivatives is printed with the LISTDER option. The code generated for the computation of the derivatives is printed with the LISTCODE option.
Derivative Variables
When the differentiator needs to generate code to evaluate the expression for the derivative of a variable, the result is stored in a special derivative variable. Derivative variables are not created when the derivative expression reduces to a previously computed result, a variable, or a constant. The names of derivative variables, which might sometimes appear in the printed output, have the form @obj/@wrt, where obj is the variable whose derivative is being taken and wrt is the variable that the differentiation is with respect to. For example, the derivative variable for the derivative of Y with respect to X is named @Y/@X. The derivative variables can be accessed or used as part of the model program by using the GETDER() function.
GETDER(x, a)      the derivative of x with respect to a.
GETDER(x, a, b)   the second derivative of x with respect to a and b.
The main purpose of the GETDER() function is to surface the derivatives so that they can be stored in a data set for further processing. Only derivatives that are implied by the problem are available to the GETDER() function. When derivatives are requested that have not already been created, a missing value is returned. The derivative of the GETDER() function is always zero, so the results of the GETDER() function should not be used in any of the equations in the FIT or the SOLVE statement. The following example adds the gradient of the PRED.y value with respect to the parameters to the OUT= data set.
   proc model data=line ;
      y = a1 + b1**2 *x1 + c1*x2;
      Dy_a1 = getder(PRED.y,a1);
      Dy_b1 = getder(PRED.y,b1);
      Dy_c1 = getder(PRED.y,c1);
      outvars Dy_a1 Dy_b1 Dy_c1;
      fit y / out=grad;
   run;
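For orientation, the gradient that the preceding GETDER() calls request can be worked out by hand from the equation in the example. The following is only a hand derivation of what the three derivatives should equal analytically; it is not output from PROC MODEL.

```latex
\begin{aligned}
\mathrm{PRED.y} &= a_1 + b_1^{2}\,x_1 + c_1\,x_2 \\
\frac{\partial\,\mathrm{PRED.y}}{\partial a_1} &= 1 \\
\frac{\partial\,\mathrm{PRED.y}}{\partial b_1} &= 2\,b_1\,x_1 \\
\frac{\partial\,\mathrm{PRED.y}}{\partial c_1} &= x_2
\end{aligned}
```

Only the derivative with respect to b1 depends on a parameter value, because b1 is the only parameter that enters the equation nonlinearly.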
Mathematical Functions
The following is a brief summary of SAS functions that are useful for defining models. Additional functions and details are in SAS Language: Reference. Information about creating new functions can be found in SAS/BASE Software: Procedure Reference, Chapter 18, "The FCMP Procedure."
ABS(x)     the absolute value of x
ARCOS(x)   the arccosine in radians of x; x should be between -1 and 1.
ARSIN(x)   the arcsine in radians of x; x should be between -1 and 1.
ATAN(x)    the arctangent in radians of x
COS(x)     the cosine of x; x is in radians.
COSH(x)    the hyperbolic cosine of x
EXP(x)     e^x
LOG(x)     the natural logarithm of x
LOG10(x)   the log base ten of x
LOG2(x)    the log base two of x
SIN(x)     the sine of x; x is in radians.
SINH(x)    the hyperbolic sine of x
SQRT(x)    the square root of x
TAN(x)     the tangent of x; x is in radians and is not an odd multiple of π/2.
TANH(x)    the hyperbolic tangent of x
Random-Number Functions
The MODEL procedure provides several functions for generating random numbers for Monte Carlo simulation. These functions use the same generators as the corresponding SAS DATA step functions. The following random number functions are supported: RANBIN, RANCAU, RAND, RANEXP, RANGAM, RANNOR, RANPOI, RANTBL, RANTRI, and RANUNI. For more information, refer to SAS Language: Reference. Each reference to a random number function sets up a separate pseudo-random sequence. Note that this means that two calls to the same random function with the same seed produce identical results. This is different from the behavior of the random number functions used in the SAS DATA step. For example, the following statements produce identical values for X and Y, but Z is from an independent pseudo-random sequence:
   x=rannor(123);
   y=rannor(123);
   z=rannor(567);
   q=rand('BETA', 1, 12 );
For FIT tasks, all random number functions always return 0. For SOLVE tasks, when Monte Carlo simulation is requested, a random number function computes a new random number on the first iteration for an observation (if it is executed on that iteration) and returns that same value for all later iterations of that observation. When Monte Carlo simulation is not requested, random number functions always return 0.
ZLAGn( <i,> x )   returns the ith lag of x, where n is the maximum lag, with missing lags replaced with zero
XLAGn( x, y )     returns the nth lag of x if x is nonmissing, or y if x is missing
ZDIFn( x )        is the difference with lag length truncated and missing values converted to zero
MOVAVGn( x )      is the moving average; x is the variable or expression to compute the moving average of
If X_t denotes the observation at time point t, then, to ensure compatibility with the number n of observations used to calculate the moving average MOVAVGn, the following definition is used:
$$\mathrm{MOVAVG}n(X_t) = \frac{X_t + X_{t-1} + X_{t-2} + \cdots + X_{t-n+1}}{n}$$
The moving average calculation for SAS 9.1 and earlier releases is as follows:
$$\mathrm{MOVAVG}n(X_t) = \frac{X_t + X_{t-1} + X_{t-2} + \cdots + X_{t-n}}{n+1}$$
Missing values of x are omitted in computing the average. If you do not specify n, the number of periods is assumed to be one. For example, LAG(X) is the same as LAG1(X). No more than four digits can be used with a lagging function; that is, LAG9999 is the greatest LAG function, ZDIF9999 is the greatest ZDIF function, and so on. The LAG functions get values from previous observations and make them available to the program. For example, LAG(X) returns the value of the variable X as it was computed in the execution of the program for the preceding observation. The expression LAG2(X+2*Y) returns the value of the expression X+2*Y, computed by using the values of the variables X and Y that were computed by the execution of the program for the observation two periods ago.
The DIF functions return the difference between the current value of a variable or expression and the value of its LAG. For example, DIF2(X) is a short way of writing X-LAG2(X), and DIF15(SQRT(2*Z)) is a short way of writing SQRT(2*Z)-LAG15(SQRT(2*Z)). The ZLAG and ZDIF functions are like the LAG and DIF functions, but they are not counted in the determination of the program lag length, and they replace missing values with 0s. The ZLAG function returns the lagged value if the lagged value is nonmissing, or 0 if the lagged value is missing. The ZDIF function returns the differenced value if the differenced value is nonmissing, or 0 if the differenced value is missing. The ZLAG function is especially useful for models with ARMA error processes. See the next section for details.
Lag Logic
The LAG and DIF lagging functions in the MODEL procedure are different from the queuing functions with the same names in the DATA step. Lags are determined by the final values that are set for the program variables by the execution of the model program for the observation. This can have upsetting consequences for programs that take lags of program variables that are given different values at various places in the program, as shown in the following statements:
   temp = x + w;
   t    = lag( temp );
   temp = q - r;
   s    = lag( temp );
The expression LAG(TEMP) always refers to LAG(Q-R), never to LAG(X+W), since Q-R is the final value assigned to the variable TEMP by the model program. If LAG(X+W) is wanted for T, it should be computed as T=LAG(X+W) and not T=LAG(TEMP), as in the preceding example. Care should also be exercised in using the DIF functions with program variables that might be reassigned later in the program. For example, the program
   temp = x ;
   s    = dif( temp );
   temp = 3 * y;
computes values for S equivalent to
   s = x - lag( 3 * y );
Note that in the preceding examples, TEMP is a program variable, not a model variable. If it were a model variable, the assignments to it would be changed to assignments to a corresponding equation variable. Note that whereas LAG1(LAG1(X)) is the same as LAG2(X), DIF1(DIF1(X)) is not the same as DIF2(X). The DIF2 function is the difference between the current period value at the point in the program where the function is executed and the final value at the end of execution two periods ago; DIF2 is not the second difference. In contrast, DIF1(DIF1(X)) is equal to DIF1(X)-LAG1(DIF1(X)), which equals X-2*LAG1(X)+LAG2(X), which is the second difference of X.
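The distinction between DIF2 and the second difference can be checked by writing both out with time subscripts. The following derivation is an illustration only; the numeric values are made up for the example.

```latex
\begin{aligned}
\mathrm{DIF1}(\mathrm{DIF1}(X_t)) &= (X_t - X_{t-1}) - (X_{t-1} - X_{t-2})
                                   = X_t - 2X_{t-1} + X_{t-2} \\
\mathrm{DIF2}(X_t) &= X_t - X_{t-2} \\[4pt]
\text{With } X_{t-2}=1,\; X_{t-1}=4,\; X_t=9:\quad
\mathrm{DIF1}(\mathrm{DIF1}(X_t)) &= 9 - 8 + 1 = 2,
\qquad \mathrm{DIF2}(X_t) = 9 - 1 = 8
\end{aligned}
```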
More information about the differences between PROC MODEL and the DATA step LAG and DIF functions is found in Chapter 3, Working with Time Series Data.
Lag Lengths
The lag length of the model program is the number of lags needed for any relevant equation. The program lag length controls the number of observations used to initialize the lags. PROC MODEL keeps track of the use of lags in the model program and automatically determines the lag length of each equation and of the model as a whole. PROC MODEL sets the program lag length to the maximum number of lags needed to compute any equation to be estimated, solved, or needed to compute any instrument variable used. In determining the lag length, the ZLAG and ZDIF functions are treated as always having a lag length of 0. For example, if Y is computed as
y = lag2( x + zdif3( temp ) );
then Y has a lag length of 2: the LAG2 function contributes two lags, and the ZDIF3 function contributes none, regardless of how TEMP is defined. This is so that ARMA errors can be specified without causing the loss of additional observations to the lag starting phase and so that recursive lag specifications, such as moving-average error terms, can be used. Recursive lags are not permitted unless the ZLAG or ZDIF functions are used to truncate the lag length. For example, the following statement produces an error message:
t = a + b * lag( t );
The program variable T depends recursively on its own lag, and the lag length of T is therefore undefined. In the following equation, RESID.Y depends on the predicted value for the Y equation, but the predicted value for the Y equation depends on the LAG of RESID.Y; thus, the predicted value for the Y equation depends recursively on its own lag.
y = yhat + ma * lag( resid.y );
The lag length is infinite, and PROC MODEL prints an error message and stops. Since this kind of specification is allowed, the recursion must be truncated at some point. The ZLAG and ZDIF functions do this. The following equation is valid and results in a lag length for the Y equation equal to the lag length of YHAT:
   y = yhat + ma * zlag( resid.y );
Initially, the lags of RESID.Y are missing, and the ZLAG function replaces the missing residuals with 0s, their unconditional expected values. The ZLAG0 function can be used to zero out the lag length of an expression. ZLAG0(x ) returns the current period value of the expression x, if nonmissing, or else returns 0, and prevents the lag length of x from contributing to the lag length of the current statement.
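As a hedged illustration of the ZLAG truncation described above, the following sketch fits a regression with a first-order moving-average error term. The data set name REGDATA, the variables Y and X, and the parameter names B0, B1, and MA1 are assumptions made for this example; they do not come from the manual.

```sas
/* Sketch only: assumes a data set WORK.REGDATA with numeric Y and X. */
proc model data=regdata;
   parms b0 b1 ma1;

   /* structural part of the equation; YHAT is a program variable */
   yhat = b0 + b1 * x;

   /* MA(1) error term: ZLAG returns 0 where the lagged residual */
   /* is missing and does not add to the lag length of Y         */
   y = yhat + ma1 * zlag( resid.y );

   fit y;
run;
quit;
```

Replacing ZLAG with LAG in this sketch would make the Y equation depend recursively on its own lag, which is the error case described earlier in this section.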
Initializing Lags
At the start of each pass through the data set or BY group, the lag variables are set to missing values and an initialization is performed to fill the lags. During this phase, observations are read from the data set, and the model variables are given values from the data. If necessary, the model is executed to assign values to program variables that are used in lagging functions. The results for variables used in lag functions are saved. These observations are not included in the estimation or solution. If, during the execution of the program for the lag starting phase, a lag function refers to lags that are missing, the lag function returns missing. Execution errors that occur while starting the lags are not reported unless requested. The modeling system automatically determines whether the program needs to be executed during the lag starting phase. If L is the maximum lag length of any equation being fit or solved, then the first L observations are used to prime the lags. If a BY statement is used, the first L observations in the BY group are used to prime the lags. If a RANGE statement is used, the L observations prior to the first observation requested in the RANGE statement are used to prime the lags. Therefore, there should be at least L observations in the data set. Initial values for the lags of model variables can also be supplied in VAR, ENDOGENOUS, and EXOGENOUS statements. This feature provides initial lags of solution variables for dynamic solution when initial values for the solution variable are not available in the input data set. For example, the statement
var x 2 3 y 4 5 z 1;
feeds the initial lags exactly like these values in an input data set:
   Lag   X   Y   Z
    2    3   5   .
    1    2   4   1
If initial values for lags are available in the input data set and initial lag values are also given in a declaration statement, the values in the VAR, ENDOGENOUS, or EXOGENOUS statements take priority. The RANGE statement is used to control the range of observations in the input data set that are processed by PROC MODEL. In the following statement, 01jan1924 specifies the starting period of the range, and 01dec1943 specifies the ending period:
The observations in the data set immediately prior to the start of the range are used to initialize the lags.
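A RANGE statement that uses the two periods named above might look like the following sketch. The data set name HISTORY, the range variable DATE, and the variables X and Y are hypothetical names chosen for illustration; only the two date values come from the text.

```sas
/* Sketch only: HISTORY, DATE, X, and Y are hypothetical names. */
proc model data=history;
   parms b0 b1;

   /* process only the observations whose DATE value is in this range */
   range date = '01jan1924'd to '01dec1943'd;

   /* the equation uses one lag, so the observation just before */
   /* the start of the range is used to prime LAG(X)            */
   y = b0 + b1 * lag( x );

   fit y;
run;
quit;
```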
Language Differences
For the most part, PROC MODEL programming statements work the same as they do in the DATA step as documented in SAS Language: Reference. However, there are several differences that should be noted.
DO Statement Differences
The DO statement in PROC MODEL does not allow a character index variable. Thus, the following DO statement is not valid in PROC MODEL, although it is supported in the DATA step:
   do i = 'A', 'B', 'C'; /* invalid PROC MODEL code */
IF Statement Differences
The IF statement in PROC MODEL does not allow a character-valued condition. For example, the following IF statement is not supported by PROC MODEL:
   if 'this' then statement;
Comparisons of character values are supported in IF statements, so the following IF statement is acceptable:
   if 'this' < 'that' then statement;
PROC MODEL allows for embedded conditionals in expressions. For example the following two statements are equivalent:
   flag = if time = 1 or time = 2 then conc+30/5 + dose*time
          else if time > 5 then (0=1)
          else (patient * flag);

   if time = 1 or time = 2 then flag = conc+30/5 + dose*time;
   else if time > 5 then flag = (0=1);
   else flag = patient*flag;
Note that the ELSE operator involves only the first object or token after it, so the following assignments are not equivalent:
   total = if sum > 0 then sum else sum + reserve;
   total = if sum > 0 then sum else (sum + reserve);
The first assignment makes TOTAL always equal to SUM plus RESERVE.
Subscripted array names must be enclosed in parentheses. For example, the following statement prints the ith element of the array A:
put (a i);
The PROC MODEL PUT statement supports the print item _PDV_ to print a formatted listing of all the variables in the program. For example, the following statement prints a much more readable listing of the variables than does the _ALL_ print item:
put _pdv_;
To print all the elements of the array A, use the following statement:
put a;
To print all the elements of A with each value labeled by the name of the element variable, use the following statement:
put a=;
The ARRAY statement is used to associate a name with a list of variables and constants. The array name can then be used with subscripts in the model program to refer to the items in the list. In PROC MODEL, the ARRAY statement does not support all the features of the DATA step ARRAY statement. Implicit indexing cannot be used; all array references must have explicit subscript expressions. Only exact array dimensions are allowed; lower-bound specifications are not supported. A maximum of six dimensions is allowed. On the other hand, the ARRAY statement supported by PROC MODEL does allow both variables and constants to be used as array elements. You cannot make assignments to constant array elements. Both the dimension specification and the list of elements are optional, but at least one must be supplied. When the list of elements is not given or fewer elements than the size of the array are listed, array variables are created by suffixing element numbers to the array name to complete the element list. The following are valid PROC MODEL array statements:
   array x[120];             /* array X of length 120              */
   array q[2,2];             /* Two dimensional array Q            */
   array b[4] va vb vc vd;   /* B[2] = VB, B[4] = VD               */
   array x x1-x30;           /* array X of length 30, X[7] = X7    */
   array a[5] (1 2 3 4 5);   /* array A initialized to 1,2,3,4,5   */
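The following sketch shows one way an array with an explicit element list and explicit subscripts might be used inside a model program. The data set REGDATA, the variables Y and X1 through X3, the array name XS, and the program variables XSUM and I are all assumptions made for this example.

```sas
/* Sketch only: REGDATA, Y, and X1-X3 are hypothetical names. */
proc model data=regdata;
   parms b0 b1;

   array xs[3] x1 x2 x3;      /* explicit element list */

   xsum = 0;                  /* program variable, set for each observation */
   do i = 1 to 3;
      xsum = xsum + xs[i];    /* every array reference needs a subscript */
   end;

   y = b0 + b1 * xsum;
   fit y;
run;
quit;
```

Because iterative DO loops reduce the optimization of the derivative code, a short sum such as this one could equally be written out as x1 + x2 + x3; the loop form is shown only to illustrate subscripting.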
RETAIN Statement
RETAIN variables initial-values ;
The RETAIN statement causes a program variable to hold its value from a previous observation until the variable is reassigned. The RETAIN statement can be used to initialize program variables. The RETAIN statement does not work for model variables, parameters, or control variables because the values of these variables are under the control of PROC MODEL and not programming statements. Use the PARMS and CONTROL statements to initialize parameters and control variables. Use the VAR, ENDOGENOUS, or EXOGENOUS statement to initialize model variables.
The CMPMODEL= option defaults to BOTH in SAS 9.2 and is intended for transitional use. If CMPMODEL=BOTH, the MODEL procedure writes both formats; when loading model files, PROC MODEL attempts to load the XML version first and the CATALOG version second (if the XML version is not found). If CMPMODEL=XML, the MODEL procedure reads and writes only the XML format. If CMPMODEL=CATALOG, only the catalog format is used. The MODEL= option is used to read in a model. A list of model files can be specified in the MODEL= option, and a range of names with numeric suffixes can be given, as in MODEL=(MODEL1-MODEL10). When more than one model file is given, the list must be placed in parentheses, as in MODEL=(A B C); parentheses are not needed for a single name. If more than one model file is specified, the files are combined in the order listed in the MODEL= option. The MODEL procedure continues to read and write catalog MODEL files, and model files created by previous releases of SAS/ETS continue to work, so you should experience no direct impact from this change. When the MODEL= option is specified in the PROC MODEL statement and model definition statements are also given later in the PROC MODEL step, the model files are read in first, in the order listed, and the model program specified in the PROC MODEL step is appended after the model program read from the MODEL= files. The class assigned to a variable, when multiple model files are used, is the last declaration of that variable. For example, if Y1 was declared endogenous in the model file M1 and exogenous in the model file M2, the following statement causes Y1 to be declared exogenous.
proc model model=(m1 m2);
The INCLUDE statement can be used to append model code to the current model code. In contrast, when the MODEL= option is used in the RESET statement, the current model is deleted before the new model is read. By default, no model file is output if the PROC MODEL step performs any FIT or SOLVE tasks, or if the MODEL= option or the NOSTORE option is used. However, to ensure compatibility with previous versions of SAS/ETS software, when the PROC MODEL step does nothing but compile the model program, no input model file is read, and the NOSTORE option is not used, a model file is written. This model file is the default input file for a later PROC SYSLIN or PROC SIMLIN step. The default output model filename in this case is WORK.MODELS._MODEL_.MODEL. If FIT statements are used to estimate model parameters, the parameter estimates written to the output model file are the estimates from the last estimation performed for each parameter.
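A minimal sketch of storing a model in one step and reading it back in a later step follows. The model file name M1, the data set REGDATA, and the variables and parameters are hypothetical. The OUTMODEL= option of the PROC MODEL statement, which names the output model file, is not described in this section and is used here as an assumption about how the file would be written.

```sas
/* Sketch only: M1, REGDATA, Y, X, B0, and B1 are hypothetical names. */

/* Step 1: compile the model program and store it.  */
/* No FIT or SOLVE task is performed in this step.  */
proc model outmodel=m1;
   endogenous y;
   exogenous  x;
   parms b0 b1;
   y = b0 + b1 * x;
run;
quit;

/* Step 2: read the stored model back in and estimate it. */
proc model data=regdata model=m1;
   fit y;
run;
quit;
```

Any model statements added in the second step would be appended after the program read from M1, as described above.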
/*--- Diagnostics and Debugging ---*/

*---------Fitting a Segmented Model using MODEL----*
|    |                                             |
| y  |    quadratic           plateau              |
|    |  y=a+b*x+c*x*x           y=p                |
|    |                    .....................    |
|    |                  . :                        |
|    |                .   :                        |
|    |              .     :                        |
|    |            .       :                        |
|    |          .         :                        |
|    +-----------------------------------------X   |
|                         x0                       |
|                                                  |
| continuity restriction: p=a+b*x0+c*x0**2         |
| smoothness restriction: 0=b+2*c*x0 so x0=-b/(2*c)|
*--------------------------------------------------*;

title 'QUADRATIC MODEL WITH PLATEAU';
data a;
   input y x @@;
   datalines;
.46 1 .47 2 .57 3 .61 4 .62 5 .68 6 .69 7 .78 8 .70 9
.74 10 .77 11 .78 12 .74 13 .80 13 .80 15 .78 16
;

proc model data=a list xref listcode;
   parms a 0.45 b 0.5 c -0.0025;

   x0 = -.5*b / c;      /* join point */

   if x < x0 then       /* Quadratic part of model */
      y = a + b*x + c*x*x;
   else                 /* Plateau part of model */
      y = a + b*x0 + c*x0*x0;

   fit y;
run;
Program Listing
The LIST option produces a listing of the model program. The statements are printed one per line with the original line number and column position of the statement. The program listing from the example program is shown in Figure 18.85.
[Figure 18.85: LIST output for the segmented model. The listing itself is not recoverable from the source.]
The LIST option also shows the model translations that PROC MODEL performs. LIST output is useful for understanding the code generated by the %AR and the %MA macros.
Cross-Reference
The XREF option produces a cross-reference listing of the variables in the model program. The XREF listing is usually used in conjunction with the LIST option. The XREF listing does not include derivative (@-prefixed) variables. The XREF listing does not include generated assignments to equation variables, PRED.-, RESID.-, and ERROR.-prefixed variables, unless the DETAILS option is used. The cross-reference from the example program is shown in Figure 18.86.
Figure 18.86 XREF Output for Segmented Model
QUADRATIC MODEL WITH PLATEAU
The MODEL Procedure
Cross Reference Listing For Program

Symbol-----------  Kind  Type  References (statement)/(line):(col)
a                  Var   Num   Used: 3/51043:13 5/51045:13
b                  Var   Num   Used: 1/51041:12 3/51043:16 5/51045:16
c                  Var   Num   Used: 1/51041:15 3/51043:22 5/51045:23
x0                 Var   Num   Assigned: 1/51041:15
                               Used: 2/51042:11 5/51045:16 5/51045:23 5/51045:26
x                  Var   Num   Used: 2/51042:11 3/51043:16 3/51043:22 3/51043:24
PRED.y             Var   Num   Assigned: 3/51043:19 5/51045:20
Compiler Listing
The LISTCODE option lists the model code and derivatives tables produced by the compiler. This listing is useful only for debugging and should not normally be needed. LISTCODE prints the operator and operands of each operation generated by the compiler for each model program statement. Many of the operands are temporary variables generated by the compiler and given names such as #temp1. When derivatives are taken, the code listing includes the operations generated for the derivatives calculations. The derivatives tables are also listed. A LISTCODE option prints the transformed equations from the example shown in Figure 18.87 and Figure 18.88.
Figure 18.87 LISTCODE Output for Segmented Model - Statements as Parsed

Derivatives
WRT-Variable   Object-Variable   Derivative-Variable
a              RESID.y           @RESID.y/@a
b              RESID.y           @RESID.y/@b
c              RESID.y           @RESID.y/@c

Listing of Compiled Program Code
Stmt  Line:Col  Statement as Parsed
1     3712:4    x0 = (-0.5 * b) / c;
1     3712:4    @x0/@b = -0.5 / c;
1     3712:4    @x0/@c = - x0 / c;
2     3713:4    if x < x0 then
3     3714:7    PRED.y = a + b * x + c * x * x;
3     3714:7    @PRED.y/@a = 1;
3     3714:7    @PRED.y/@b = x;
3     3714:7    @PRED.y/@c = x * x;
3     3714:7    RESID.y = PRED.y - ACTUAL.y;
3     3714:7    @RESID.y/@a = @PRED.y/@a;
3     3714:7    @RESID.y/@b = @PRED.y/@b;
3     3714:7    @RESID.y/@c = @PRED.y/@c;
3     3714:7    ERROR.y = PRED.y - y;
4     3715:4    else
5     3716:7    PRED.y = a + b * x0 + c * x0 * x0;
5     3716:7    @PRED.y/@a = 1;
5     3716:7    @PRED.y/@b = x0 + b * @x0/@b + (c * @x0/@b * x0 + c * x0 * @x0/@b);
5     3716:7    @PRED.y/@c = b * @x0/@c + ((x0 + c * @x0/@c) * x0 + c * x0 * @x0/@c);
5     3716:7    RESID.y = PRED.y - ACTUAL.y;
5     3716:7    @RESID.y/@a = @PRED.y/@a;
5     3716:7    @RESID.y/@b = @PRED.y/@b;
5     3716:7    @RESID.y/@c = @PRED.y/@c;
5     3716:7    ERROR.y = PRED.y - y;
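The derivative statements that the differentiator generated for the join point can be checked by hand. The following derivation is an illustration added for the reader, not part of the procedure output; it uses only the statement x0 = (-0.5 * b) / c and the plateau equation from the listing.

```latex
\begin{aligned}
x_0 &= \frac{-0.5\,b}{c} \\
\frac{\partial x_0}{\partial b} &= \frac{-0.5}{c} \\
\frac{\partial x_0}{\partial c} &= \frac{0.5\,b}{c^{2}} = -\frac{x_0}{c} \\[4pt]
\text{Plateau: } \mathrm{PRED.y} &= a + b\,x_0 + c\,x_0^{2} \\
\frac{\partial\,\mathrm{PRED.y}}{\partial b}
   &= x_0 + \left(b + 2\,c\,x_0\right)\frac{\partial x_0}{\partial b}
\end{aligned}
```

The first two results match the generated statements for @x0/@b and @x0/@c. Because the smoothness restriction gives b + 2*c*x0 = 0, the last expression reduces analytically to x0, although the generated code evaluates the full expression.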
1 Stmt ASSIGN   line 1069 column 4. (1) arg=x0 argsave=x0
                Source Text: x0 = -.5*b / c;
    Oper *      at 1069:12 (30,0,2).   * : _temp1 <- -0.5 b
    Oper /      at 1069:15 (31,0,2).   / : x0 <- _temp1 c
    Oper eeocf  at 1069:15 (18,0,1).   eeocf : _DER_ <- _DER_
    Oper /      at 1069:15 (31,0,2).   / : @x0/@b <- -0.5 c
    Oper -      at 1069:15 (24,0,1).   - : @1dt1_2 <- x0
    Oper /      at 1069:15 (31,0,2).   / : @x0/@c <- @1dt1_2 c

2 Stmt IF       line 1070 column 4. (2) arg=_temp1 argsave=_temp1
                ref.st=ASSIGN stmt number 5 at 1073:7
                Source Text: if x < x0 then
    Oper <      at 1070:11 (36,0,2).   < : _temp1 <- x x0

3 Stmt ASSIGN   line 1071 column 7. (1) arg=PRED.y argsave=y
                Source Text: /* Quadratic part of model */ y = a + b*x + c*x*x;
    Oper *      at 1071:16 (30,0,2).   * : _temp1 <- b x
    Oper +      at 1071:13 (32,0,2).   + : _temp2 <- a _temp1
    Oper *      at 1071:22 (30,0,2).   * : _temp3 <- c x
    Oper *      at 1071:24 (30,0,2).   * : _temp4 <- _temp3 x
    Oper +      at 1071:19 (32,0,2).   + : PRED.y <- _temp2 _temp4
    Oper eeocf  at 1071:19 (18,0,1).   eeocf : _DER_ <- _DER_
    Oper =      at 1071:19 (1,0,1).    = : @PRED.y/@a <- 1
    Oper =      at 1071:19 (1,0,1).    = : @PRED.y/@b <- x
    Oper *      at 1071:24 (30,0,2).   * : @1dt1_1 <- x x
    Oper =      at 1071:19 (1,0,1).    = : @PRED.y/@c <- @1dt1_1

3 Stmt Assign   line 1071 column 7. (1) arg=RESID.y argsave=y
    Oper -      at 1071:7 (33,0,2).    - : RESID.y <- PRED.y ACTUAL.y
    Oper eeocf  at 1071:7 (18,0,1).    eeocf : _DER_ <- _DER_
    Oper =      at 1071:7 (1,0,1).     = : @RESID.y/@a <- @PRED.y/@a
    Oper =      at 1071:7 (1,0,1).     = : @RESID.y/@b <- @PRED.y/@b
    Oper =      at 1071:7 (1,0,1).     = : @RESID.y/@c <- @PRED.y/@c

3 Stmt Assign   line 1071 column 7. (1) arg=ERROR.y argsave=y

5 Stmt ASSIGN   Source Text: /* Plateau part of model */ y = a + b*x0 + c*x0*x0;
    Oper *      * : _temp1 <- b x0
    Oper +      + : _temp2 <- a _temp1
    Oper *      * : _temp3 <- c x0
    Oper *      * : _temp4 <- _temp3 x0
    Oper +      + : PRED.y <- _temp2 _temp4
    Oper eeocf  eeocf : _DER_ <- _DER_
    Oper =      = : @PRED.y/@a <- 1
    Oper *      * : @1dt1_1 <- b @x0/@b
    Oper +      + : @1dt1_2 <- x0 @1dt1_1
    Oper *      * : @1dt1_3 <- c @x0/@b
    Oper *      * : @1dt1_4 <- @1dt1_3 x0
    Oper *      * : @1dt1_5 <- _temp3 @x0/@b
    Oper +      + : @1dt1_6 <- @1dt1_4 @1dt1_5
    Oper +      + : @PRED.y/@b <- @1dt1_2 @1dt1_6
    Oper *      * : @1dt1_8 <- b @x0/@c
    Oper *      * : @1dt1_9 <- c @x0/@c
    Oper +      + : @1dt1_10 <- x0 @1dt1_9
    Oper *      * : @1dt1_11 <- @1dt1_10 x0
    Oper *      * : @1dt1_12 <- _temp3 @x0/@c
    Oper +      + : @1dt1_13 <- @1dt1_11 @1dt1_12
    Oper +      + : @PRED.y/@c <- @1dt1_8 @1dt1_13

5 Stmt Assign   line 1073 column 7. (1) arg=RESID.y argsave=y
    Oper -      at 1073:7 (33,0,2).    - : RESID.y <- PRED.y ACTUAL.y
    Oper eeocf  at 1073:7 (18,0,1).    eeocf : _DER_ <- _DER_
    Oper =      at 1073:7 (1,0,1).     = : @RESID.y/@a <- @PRED.y/@a
    Oper =      at 1073:7 (1,0,1).     = : @RESID.y/@b <- @PRED.y/@b
    Oper =      at 1073:7 (1,0,1).     = : @RESID.y/@c <- @PRED.y/@c

5 Stmt Assign   line 1073 column 7. (1) arg=ERROR.y argsave=y
    Oper -      at 1073:7 (33,0,2).
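The derivative expressions in the compiled listing can be spot-checked outside SAS. The following Python sketch is an illustration only and is not PROC MODEL output; the function names and the test values of a, b, and c are hypothetical. It evaluates the plateau-branch derivatives exactly as they are listed and compares them with central finite differences.

```python
def x0(b, c):
    # Join point of the quadratic and the plateau: x0 = (-0.5 * b) / c
    return -0.5 * b / c

def pred_plateau(a, b, c):
    # Plateau branch of the listing: PRED.y = a + b*x0 + c*x0*x0
    x = x0(b, c)
    return a + b * x + c * x * x

def listed_derivatives(a, b, c):
    # Derivative expressions copied term by term from the compiled listing
    x = x0(b, c)
    dx0_db = -0.5 / c          # @x0/@b
    dx0_dc = -x / c            # @x0/@c
    d_a = 1.0
    d_b = x + b * dx0_db + (c * dx0_db * x + c * x * dx0_db)
    d_c = b * dx0_dc + ((x + c * dx0_dc) * x + c * x * dx0_dc)
    return d_a, d_b, d_c

def finite_difference(f, args, i, h=1e-6):
    # Central difference with respect to argument number i
    up, dn = list(args), list(args)
    up[i] += h
    dn[i] -= h
    return (f(*up) - f(*dn)) / (2.0 * h)

params = (0.5, 2.0, -0.25)     # hypothetical values of a, b, c
analytic = listed_derivatives(*params)
numeric = tuple(finite_difference(pred_plateau, params, i) for i in range(3))
print(analytic, numeric)
```

Because b + 2*c*x0 = 0 at the join point, the listed expressions for the b and c derivatives reduce to x0 and x0*x0, which is what the finite differences return.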
Dependency List
The LISTDEP option produces a dependency list for each variable in the model program. For each variable, a list of variables that depend on it and a list of variables it depends on is given. The dependency list produced by the example program is shown in Figure 18.89.
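[Figure 18.89: dependency list produced by the LISTDEP option. The listing itself is not reproduced; the surviving entries name the variables wsum, k, y, wp, g, t, year, c0, c1, c2, c3, i0, i1, i2, PRED.i, RESID.i, ERROR.i, ACTUAL.i, PRED.w, PRED.k, and PRED.wsum.]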
BLOCK Listing
The BLOCK option prints an analysis of the program variables based on the assignments in the model program. The output produced by the example is shown in Figure 18.90.
Figure 18.90 The BLOCK Output for Klein's Model

The MODEL Procedure
Model Structure Analysis
(Based on Assignments to Endogenous Model Variables)

Exogenous Variables     wp g t year
Endogenous Variables    c p w i x wsum k y

Dependency Structure of the System
Block 1    Depends On    All_Exogenous
k          Depends On    Block 1  All_Exogenous
y          Depends On    Block 1  All_Exogenous
One use for the block output is to put a model in recursive form. Simulations of the model can be done with the SEIDEL method, which is efficient if the model is recursive and if the equations are in recursive order. By examining the block output, you can determine how to reorder the model equations for the most efficient simulation.
Adjacency Graph
The GRAPH option displays the same information as the BLOCK option with the addition of an adjacency graph. An X in a column in an adjacency graph indicates that the variable associated with the row depends on the variable associated with the column. The output produced by the example is shown in Figure 18.91. The first and last graphs are straightforward. The middle graph represents the dependencies of the nonexogenous variables after transitive closure has been performed (that is, A depends on B, and B depends on C, so A depends on C). The preceding transitive closure matrix indicates that K and Y do not directly or indirectly depend on each other.
[Figure 18.91: GRAPH option output. Three adjacency matrices with rows and columns for the variables c, p, w, i, x, wsum, k, y, wp, g, t, and year (* = exogenous variable); the middle one is titled "Transitive Closure Matrix of Sorted System" and places c, p, w, i, x, and wsum in Block 1. The matrix cells are not reproduced.]
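The transitive closure step described above (A depends on B, and B depends on C, so A depends on C) can be sketched in a few lines of Python. This is an illustration of the idea using Warshall's algorithm, not the procedure's own implementation; the small dependency graph and the function name are hypothetical.

```python
def transitive_closure(depends_on):
    """Return, for every variable, all variables it depends on directly or indirectly."""
    names = sorted(depends_on)
    reach = {v: set(depends_on[v]) for v in names}
    # Warshall's algorithm: allow each variable k in turn as an intermediate step
    for k in names:
        for i in names:
            if k in reach[i]:
                reach[i] |= reach[k]
    return reach

# Hypothetical direct dependencies (row depends on the listed columns)
direct = {
    "A": {"B"},
    "B": {"C"},
    "C": set(),
    "K": set(),
    "Y": set(),
}
closure = transitive_closure(direct)
print(closure["A"])
```

In this toy graph, K and Y have empty closure sets, which is the same kind of reading as the statement that K and Y do not depend on each other directly or indirectly.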
proc model data=uspop;
   label a = 'Maximum Population'
         b = 'Location Parameter'
         c = 'Initial Growth Rate';
   pop = a / ( 1 + exp( b - c * (year-1790) ) );
   fit pop start=(a 1000 b 5.5 c .02) / out=resid outresid;
run;
Minimization Summary
Parameters Estimated            3
Method                      Gauss
Iterations                      7
Subiterations                   6
Average Subiterations    0.857143

Equation    SSE      MSE        R-Square
pop         345.6    19.2020    0.9972

Nonlinear OLS Parameter Estimates
Parameter    Approx Std Err    Approx Pr > |t|
a            30.0404           <.0001
b            0.0695            <.0001
c            0.00107           <.0001
The adjusted R² value indicates the model fits the data well. There are only 21 observations and the model is nonlinear, so significance tests on the parameters are only approximate. The significance tests and associated approximate probabilities indicate that all the parameters are significantly different from 0. The FIT statement included the options OUT=RESID and OUTRESID so that the residuals from the estimation are saved to the data set RESID. The residuals are plotted to check for heteroscedasticity by using PROC SGPLOT as follows.
title2 "Residuals Plot";
proc sgplot data=resid;
   refline 0;
   scatter x=year y=pop / markerattrs=(symbol=circlefilled);
   xaxis values=(1780 to 2000 by 20);
run;
The residuals do not appear to be independent, and the model could be modified to explain the remaining nonrandom errors.
   fit share1 share2
       start=( a1 -.14 a2 -.45 b11 .03 b12 .47 b22 .98
               b31 .20 b32 1.11 b33 .71 ) / outsused=smatrix sur;
run;
A portion of the printed output produced by this example is shown in Output 18.2.1 through Output 18.2.3.
Output 18.2.1 Translog Demand Model Summary
Consumer Demand--Translog Functional Form
Asymmetric Model
The MODEL Procedure

Model Summary
Model Variables            2
Endogenous                 2
Parameters                11
Equations                  2
Number of Statements       8

Model Variables      share1 share2
Parameters(Value)    a1(-0.14) a2(-0.45) b11(0.03) b12(0.47) b13 b21 b22(0.98) b23 b31(0.2) b32(1.11) b33(0.71)
Equations            share1 share2

The 2 Equations to Estimate
share1 = F(a1, b11, b12, b13, b21, b22, b23, b31, b32, b33)
share2 = F(a2, b11, b12, b13, b21, b22, b23, b31, b32, b33)

Final Convergence Criteria
R                   0.00016
PPC(b11)            0.00116
RPC(b11)            0.012106
Object              2.921E-6
Trace(S)            0.000078
Objective Value     1.749312

Parameter    Estimate     t Value
a1           -0.14881     -66.08
a2           -0.45776     -154.29
b11          0.048382     0.97
b12          0.43655      8.70
b13          0.248588     4.82
b21          0.586326     2.81
b22          0.759776     2.96
b23          1.303821     5.60
b31          0.297808     1.98
b32          0.961551     5.89
b33          0.8291       5.33
The model is then estimated under the restriction of symmetry ($b_{ij} = b_{ji}$), as shown in the following statements:
title2 "Symmetric Model";
proc model data=tlog1;
   var share1 share2 p1 p2 p3;
   parms a1 a2 b11 b12 b22 b31 b32 b33;
   bm1 = b11 + b12 + b31;
   bm2 = b12 + b22 + b32;
   bm3 = b31 + b32 + b33;
   lp1 = log(p1);
   lp2 = log(p2);
   lp3 = log(p3);
   share1 = ( a1 + b11 * lp1 + b12 * lp2 + b31 * lp3 ) /
            ( -1 + bm1 * lp1 + bm2 * lp2 + bm3 * lp3 );
   share2 = ( a2 + b12 * lp1 + b22 * lp2 + b32 * lp3 ) /
            ( -1 + bm1 * lp1 + bm2 * lp2 + bm3 * lp3 );
   fit share1 share2
       start=( a1 -.14 a2 -.45 b11 .03 b12 .47 b22 .98
               b31 .20 b32 1.11 b33 .71 ) / sdata=smatrix sur;
run;
A portion of the printed output produced for the symmetry restricted model is shown in Output 18.2.4 and Output 18.2.5.
Nonlinear SUR Parameter Estimates
Approx Std Err    Approx Pr > |t|
0.00135           <.0001
0.00167           <.0001
0.00741           0.0004
0.0115            <.0001
0.0177            <.0001
0.00614           <.0001
0.0127            <.0001
0.0168            <.0001
Hypothesis testing requires that the S matrix from the unrestricted model be imposed on the restricted model, as explained in the section "Tests on Parameters" on page 1078. The S matrix saved in the data set SMATRIX is requested by the SDATA= option. A chi-square test is used to see if the hypothesis of symmetry is accepted or rejected. $(O_c - O_u)$ has a chi-square distribution asymptotically, where $O_c$ is the constrained OBJECTIVE*N and $O_u$ is the unconstrained OBJECTIVE*N. The degrees of freedom is equal to the difference in the number of free parameters in the two models. In this example, $O_u$ is 76.9697 and $O_c$ is 78.4097, resulting in a difference of 1.44 with 3 degrees of freedom. You can obtain the probability value by using the following statements:
data _null_;
   /* probchi( reduced-full, n-restrictions )*/
   p = 1-probchi( 1.44, 3 );
   put p=;
run;
The output from this DATA step run is p=0.6961858724. With this p-value you cannot reject the hypothesis of symmetry. This test is asymptotically valid.
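The same probability can be checked outside SAS. The following Python sketch is an illustration under stated assumptions: it uses the closed-form upper-tail probability of a chi-square distribution with 3 degrees of freedom rather than a library routine, and the function name chi2_sf_3df is hypothetical.

```python
import math

def chi2_sf_3df(x):
    # Upper-tail probability P(X > x) for a chi-square variable with 3 degrees
    # of freedom: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

o_u = 76.9697        # unconstrained OBJECTIVE*N
o_c = 78.4097        # constrained OBJECTIVE*N
stat = o_c - o_u     # test statistic, about 1.44
p = chi2_sf_3df(stat)
print(round(stat, 4), p)
```

The value of p agrees with the DATA step result, so the symmetry restriction is not rejected at conventional significance levels.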
title1 "Example of Vector AR(1) Error Process Using Grunfeld's Model";
/* Note: GE stands for General Electric
         WH stands for Westinghouse    */
proc model outmodel=grunmod;
   var gei whi gef gec whf whc;
   parms ge_int ge_f ge_c wh_int wh_f wh_c;
   label ge_int = 'GE Intercept'
         ge_f   = 'GE Lagged Share Value Coef'
         ge_c   = 'GE Lagged Capital Stock Coef'
         wh_int = 'WH Intercept'
         wh_f   = 'WH Lagged Share Value Coef'
         wh_c   = 'WH Lagged Capital Stock Coef';
   gei = ge_int + ge_f * gef + ge_c * gec;
   whi = wh_int + wh_f * whf + wh_c * whc;
run;
The preceding PROC MODEL step defines the structural model and stores it in the model file named GRUNMOD. The following PROC MODEL step reads in the model, adds the vector autoregressive terms using %AR, and requests SUR estimation by using the FIT statement.
title2 "With Unrestricted Vector AR(1) Error Process";
proc model data=grunfeld model=grunmod;
   %ar( ar, 1, gei whi )
   fit gei whi / sur;
run;
The final PROC MODEL step estimates the restricted model, as shown in the following statements:
title2 "With restricted AR(1) Error Process";
proc model data=grunfeld model=grunmod;
   %ar( gei, 1 )
   %ar( whi, 1)
   fit gei whi / sur;
run;
gei whi gef gec whf whc ge_int ge_f ge_c wh_int wh_f wh_c ar_l1_1_1(0) ar_l1_1_2(0) ar_l1_2_1(0) ar_l1_2_2(0) gei whi The 2 Equations to Estimate
gei = whi =
F(ge_int, ge_f, ge_c, wh_int, wh_f, wh_c, ar_l1_1_1, ar_l1_1_2) F(ge_int, ge_f, ge_c, wh_int, wh_f, wh_c, ar_l1_2_1, ar_l1_2_2)
Final Convergence Criteria R PPC(wh_int) RPC(wh_int) Object Trace(S) Objective Value 0.000609 0.002798 0.005411 6.243E-7 720.2454 1.374476
Parameter    Estimate    t Value    Label
ge_int       -42.2858    -1.39      GE Intercept
ge_f         0.049894    3.27       GE Lagged Share Value Coef
ge_c         0.123946    2.70       GE Lagged Capital Stock Coef
wh_int       -4.68931    -0.52      WH Intercept
wh_f         0.068979    3.80       WH Lagged Share Value Coef
wh_c         0.019308    0.26       WH Lagged Capital Stock Coef
ar_l1_1_1    0.990902    2.53       AR(ar) gei: LAG1 parameter for gei
ar_l1_1_2    -1.56252    -1.44      AR(ar) gei: LAG1 parameter for whi
ar_l1_2_1    0.244161    1.37       AR(ar) whi: LAG1 parameter for gei
ar_l1_2_2    -0.23864    -0.48      AR(ar) whi: LAG1 parameter for whi
gei whi gef gec whf whc ge_int ge_f ge_c wh_int wh_f wh_c gei_l1(0) whi_l1(0) gei whi
Nonlinear SUR Parameter Estimates Approx Std Err 29.7227 0.0149 0.0423 9.2765 0.0154 0.0805 0.2149 0.2424 Approx Pr > |t| 0.3259 0.0099 0.0124 0.7416 0.0029 0.6410 0.0393 0.0784
Label GE Intercept GE Lagged Share Value Coef GE Lagged Capital Stock Coef WH Intercept WH Lagged Share Value Coef WH Lagged Capital Stock Coef AR(gei) gei lag1 parameter AR(whi) whi lag1 parameter
   gei = ge_int + ge_f * gef + ge_c * gec;
   whi = wh_int + wh_f * whf + wh_c * whc;
run;

title1 "Example of MA(1) Error Process Using Grunfeld's Model";
title2 "MA(1) Error Process Using Unconditional Least Squares";
proc model data=grunfeld model=grunmod;
   %ma(gei,1, m=uls);
   %ma(whi,1, m=uls);
   fit whi gei start=( gei_m1 0.8 -0.8) / startiter=2;
run;
Nonlinear OLS Parameter Estimates Approx Std Err 32.0908 0.0150 0.0352 9.5448 0.0172 0.0708 0.1614 0.2368 Approx Pr > |t| 0.4153 0.0217 0.0013 0.7048 0.0115 0.3559 <.0001 0.0060
Label GE Intercept GE Lagged Share Value Coef GE Lagged Capital Stock Coef WH Intercept WH Lagged Share Value Coef WH Lagged Capital Stock Coef MA(gei) gei lag1 parameter MA(whi) whi lag1 parameter
The estimation summary from the following PROC ARIMA statements is shown in Output 18.4.2.
title2 "PROC ARIMA Using Unconditional Least Squares";
proc arima data=grunfeld;
   identify var=whi cross=(whf whc ) noprint;
   estimate q=1 input=(whf whc) method=uls maxiter=40;
run;
Lag 0 1 0 0
Shift 0 0 0 0
Constant Estimate Variance Estimate Std Error Estimate AIC SBC Number of Residuals
The model stored in Example 18.3 is read in by using the MODEL= option and the moving-average terms are added using the %MA macro. The MA(1) model using maximum likelihood is estimated by using the following statements:
title2 "MA(1) Error Process Using Maximum Likelihood";
proc model data=grunfeld model=grunmod;
   %ma(gei,1, m=ml);
   %ma(whi,1, m=ml);
   fit whi gei;
run;
PROC ARIMA does not estimate systems, so only one equation is evaluated.
The estimation results are shown in Output 18.4.3 and Output 18.4.4. The small differences in the parameter values between PROC MODEL and PROC ARIMA can be eliminated by tightening the convergence criteria for both procedures.
Output 18.4.3 PROC MODEL Results by Using ML Estimation
Example of MA(1) Error Process Using Grunfelds Model MA(1) Error Process Using Maximum Likelihood The MODEL Procedure Nonlinear OLS Summary of Residual Errors DF Model 4 4 DF Error 16 16 16 16 Adj R-Sq 0.6821 0.6361
Nonlinear OLS Parameter Estimates Approx Std Err 34.2933 0.0161 0.0380 9.5638 0.0174 0.0729 0.1942 0.2540 Approx Pr > |t| 0.4765 0.0351 0.0023 0.7620 0.0106 0.3749 0.0009 0.0148
Label GE Intercept GE Lagged Share Value Coef GE Lagged Capital Stock Coef WH Intercept WH Lagged Share Value Coef WH Lagged Capital Stock Coef MA(gei) gei lag1 parameter MA(whi) whi lag1 parameter
Lag 0 1 0 0
Shift 0 0 0 0
Constant Estimate Variance Estimate Std Error Estimate AIC SBC Number of Residuals
$$y_t = 10 + \sum_{z=0}^{6} f(z)\, x_{t-z} + \epsilon$$
$$f(z) = -5z^2 + 1.5z$$
The LIST option prints the model statements added by the %PDL macro. The following statements generate simulated data as shown:
/*--------------------------------------------------------------*/
/* Generate Simulated Data for a Linear Model with a PDL on X   */
/*    y = 10 + x(6,2) + e                                       */
/*    pdl(x) = -5.*(lg)**2 + 1.5*(lg) + 0.                      */
/*--------------------------------------------------------------*/
data pdl;
   pdl2=-5.; pdl1=1.5; pdl0=0;
   array zz(i) z0-z6;
   do i=1 to 7;
      z=i-1;
      zz=pdl2*z**2 + pdl1*z + pdl0;
   end;
   do n=-11 to 30;
      x =10*ranuni(1234567)-5;
      pdl=z0*x + z1*xl1 + z2*xl2 + z3*xl3 + z4*xl4 + z5*xl5 + z6*xl6;
      e =10*rannor(123);
      y =10+pdl+e;
      if n>=1 then output;
      xl6=xl5; xl5=xl4; xl4=xl3; xl3=xl2; xl2=xl1; xl1=x;
   end;
run;

title1 "Polynomial Distributed Lag Example";
title3 "Estimation of PDL(6,4) Model-- No Endpoint Restrictions";
proc model data=pdl;
   parms int;                   /* declare the intercept parameter */
   %pdl( xpdl, 6, 4 )           /* declare the lag distribution */
   y = int + %pdl( xpdl, x );   /* define the model equation */
   fit y / list;                /* estimate the parameters */
run;
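The lag weights that the DATA step stores in z0 through z6 follow directly from the generating polynomial. The following Python sketch is an illustration outside SAS that evaluates the same polynomial; the variable name weights and the helper pdl_term are hypothetical.

```python
# Coefficients of the generating polynomial, as set in the DATA step
pdl2, pdl1, pdl0 = -5.0, 1.5, 0.0

# Lag weights for lags 0 through 6 (z0 ... z6 in the DATA step)
weights = [pdl2 * z ** 2 + pdl1 * z + pdl0 for z in range(7)]

def pdl_term(x_lags):
    # Distributed-lag contribution: z0*x + z1*xl1 + ... + z6*xl6,
    # where x_lags holds the current and six lagged values of x
    return sum(w * x for w, x in zip(weights, x_lags))

print(weights)
```

The weights are 0, -3.5, -17, -40.5, -74, -117.5, and -171, so the simulated lag distribution grows in magnitude with the lag.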
The LIST output for the model without endpoint restrictions is shown in Output 18.5.1. The first seven statements in the generated program are the polynomial expressions for lag parameters XPDL_L0 through XPDL_L6. The estimated parameters are INT, XPDL_0, XPDL_1, XPDL_2, XPDL_3, and XPDL_4.
[Output 18.5.1: LIST option output for the PDL model; the program listing is not reproduced.]
The FIT results for the model without endpoint restrictions are shown in Output 18.5.2.
Equation y
SSE 2070.8
MSE 115.0
R-Square 0.9998
Nonlinear OLS Parameter Estimates Approx Std Err 2.3238 0.7587 2.0936 1.6215 0.4253 0.0353 Approx Pr > |t| 0.0006 0.9127 0.7244 0.0186 0.6195 0.6528
Label
PDL(XPDL,6,4) parameter for PDL(XPDL,6,4) parameter for PDL(XPDL,6,4) parameter for PDL(XPDL,6,4) parameter for PDL(XPDL,6,4) parameter for
Portions of the output produced by the following PDL model with endpoints of the model restricted to zero are presented in Output 18.5.3.
title3 "Estimation of PDL(6,4) Model-- Both Endpoint Restrictions";
proc model data=pdl ;
   parms int;                    /* declare the intercept parameter */
   %pdl( xpdl, 6, 4, r=both )    /* declare the lag distribution */
   y = int + %pdl( xpdl, x );    /* define the model equation */
   fit y /list;                  /* estimate the parameters */
run;
Equation y
SSE 449868
MSE 22493.4
R-Square 0.9596
Nonlinear OLS Parameter Estimates Approx Std Err 32.4032 5.4361 1.7602 0.1471 Approx Pr > |t| 0.6038 0.0189 <.0001 <.0001
Label
PDL(XPDL,6,4) parameter for (L)**2 PDL(XPDL,6,4) parameter for (L)**3 PDL(XPDL,6,4) parameter for (L)**4
Note that XPDL_0 and XPDL_1 are not shown in the estimate summary. They were used to satisfy the endpoint restrictions analytically by the generated %PDL macro code. Their values can be determined by back substitution. To estimate the PDL model with one or more of the polynomial terms dropped, specify the largest degree of the polynomial desired with the %PDL macro and use the DROP= option in the FIT statement to remove the unwanted terms. The dropped parameters should be set to 0. The following PROC MODEL statements demonstrate estimation with a PDL of degree 2 without the 0th order term.
title3 "Estimation of PDL(6,2) Model-- With XPDL_0 Dropped";
proc model data=pdl list;
   parms int;                   /* declare the intercept parameter */
   %pdl( xpdl, 6, 2 )           /* declare the lag distribution */
   y = int + %pdl( xpdl, x );   /* define the model equation */
   xpdl_0 =0;
   fit y drop=xpdl_0;           /* estimate the parameters */
run;
Equation y
SSE 2114.1
MSE 100.7
R-Square 0.9998
Nonlinear OLS Parameter Estimates Approx Std Err 2.1685 0.3159 0.0656 Approx Pr > |t| 0.0003 <.0001 <.0001
Label
Three data sets are used in this example. The first data set, HISTORY, is used to estimate the parameters of the model. The ASSUME data set is used to produce a forecast of PRICE and QUANTITY. Notice that the ASSUME data set does not need to contain the variables PRICE and QUANTITY. The HISTORY data set is shown as follows:
data history;
   input year income unitcost price quantity;
   datalines;
1976 2221.87 3.31220 0.17903 266.714
1977 2254.77 3.61647 0.06757 276.049
The third data set, GOAL, used in a forecast of PRICE and UNITCOST as a function of INCOME and QUANTITY is as follows:
data goal; input year income quantity; datalines; 1986 2571.87 371.4 1987 2721.08 416.5 1988 3327.05 597.3 1989 3885.85 764.1 1990 3650.98 694.3 ;
The following statements fit the model to the HISTORY data set and solve the fitted model for the ASSUME data set.
proc model model=model outmodel=model;
   /* estimate the model parameters */
   fit supply demand / data=history outest=est n2sls;
   instruments income unitcost year;
run;
   /* produce forecasts for income and unitcost assumptions */
   solve price quantity / data=assume out=pq;
run;

title2 "Parameter Estimates for the System";
proc print data=est;
run;

title2 "Price Quantity Solution";
proc print data=pq;
run;
The model summary of the supply and demand model is shown as follows:
Output 18.6.1 Model Summary
General Form Equations for Supply-Demand Model The MODEL Procedure Model Summary Model Variables Parameters Equations Number of Statements Model Variables Parameters Equations 4 6 2 3
The 2 Equations to Estimate supply = demand = Instruments F(s0(1), s1(price), s2(unitcost)) F(d0(1), d1(price), d2(income)) 1 income unitcost year
The estimation results are shown in Output 18.6.2 and the OUTEST= data set is shown in Output 18.6.3. The output data set produced by the SOLVE statement is shown in Output 18.6.4.
Output 18.6.2 Output from the FIT Statement
General Form Equations for Supply-Demand Model The MODEL Procedure Nonlinear 2SLS Summary of Residual Errors DF Model 3 3 DF Error 7 7 Adj R-Sq
R-Square
Nonlinear 2SLS Parameter Estimates Approx Std Err 4.1841 0.5673 0.00187 4.1780 1.5977 1.1217 Approx Pr > |t| <.0001 0.2466 <.0001 <.0001 <.0001 <.0001
Parameter d0 d1 d2 s0 s1 s2
Output 18.6.3 Listing of OUTEST= Data Set Created in the FIT Statement
[Output 18.6.3: listing of the OUTEST= data set, with columns Obs, _NAME_, _TYPE_, _STATUS_, _NUSED_, d0, d1, d2, s0, s1, and s2; the values are not reproduced.]
Output 18.6.4 Listing of OUT= Data Set Created in the First SOLVE Statement
General Form Equations for Supply-Demand Model
Price Quantity Solution

Obs  _TYPE_   _MODE_    _ERRORS_  price    quantity  income   unitcost  year
1    PREDICT  SIMULATE  0         1.20473  371.552   2571.87  2.31220   1986
2    PREDICT  SIMULATE  0         1.18666  382.642   2609.12  2.45633   1987
3    PREDICT  SIMULATE  0         1.20154  391.788   2639.77  2.51647   1988
4    PREDICT  SIMULATE  0         1.68089  400.478   2667.77  1.65617   1989
5    PREDICT  SIMULATE  0         2.06214  411.896   2705.16  1.01601   1990
The GOAL data set is used in a forecast of PRICE and UNITCOST as a function of INCOME and QUANTITY.
data goal; input year income quantity; datalines; 1986 2571.87 371.4 1987 2721.08 416.5 1988 3327.05 597.3 1989 3885.85 764.1 1990 3650.98 694.3 ;
The following statements produce the goal-seeking solutions for PRICE and UNITCOST by using the GOAL data set.
title2 "Price Unitcost Solution"; /* produce goal-seeking solutions for income and quantity assumptions*/ proc model model=model; solve price unitcost / data=goal out=pc; run;
The output data set produced by the final SOLVE statement is shown in Output 18.6.5.
Output 18.6.5 Listing of OUT= Data Set Created in the Second SOLVE Statement
General Form Equations for Supply-Demand Model
Price Unitcost Solution

Obs  _TYPE_   _MODE_    _ERRORS_  price    quantity  income   unitcost  year
1    PREDICT  SIMULATE  0         0.99284  371.4     2571.87  2.72857   1986
2    PREDICT  SIMULATE  0         1.86594  416.5     2721.08  1.44798   1987
3    PREDICT  SIMULATE  0         2.12230  597.3     3327.05  2.71130   1988
4    PREDICT  SIMULATE  0         2.46166  764.1     3885.85  3.67395   1989
5    PREDICT  SIMULATE  0         2.74831  694.3     3650.98  2.42576   1990
A mass is hung from a spring with spring constant K. The motion is slowed by a damper with damper constant C. The damping force is proportional to the velocity, while the spring force is proportional to the displacement. This is actually a continuous system; however, the behavior can be approximated by a discrete time model. We approximate the differential equation
$$\frac{\partial\, disp}{\partial\, time} = velocity$$
with the difference equation
$$\frac{\Delta\, disp}{\Delta\, time} = velocity$$
This is rewritten as
$$\frac{disp - \mathrm{LAG}(disp)}{dt} = velocity$$
where dt is the time step used. In PROC MODEL, this is expressed with the program statement
disp = lag(disp) + vel * dt;
or
dert.disp = vel;
The first statement is simply a computing formula for Euler's approximation for the integral
$$disp = \int velocity \; dt$$
If the time step is small enough with respect to the changes in the system, the approximation is good. Although PROC MODEL does not have the variable step-size and error-monitoring features of simulators designed for continuous systems, the procedure is a good tool to use for less challenging continuous models. The second form instructs the MODEL procedure to do the integration for you.

This model is unusual because there are no exogenous variables, and endogenous data are not needed. Although you still need a SAS data set to count the simulation periods, no actual data are brought in. Since the variables DISP and VEL are lagged, initial values specified in the VAR statement determine the starting state of the system. The mass, time step, spring constant, and damper constant are declared and initialized by a CONTROL statement as shown in the following statements:
title1 "Simulation of Spring-Mass-Damper System";

/*- Data to drive the simulation time periods ---*/
data one;
   do n=1 to 100;
      output;
   end;
run;

proc model data=one outmodel=spring;
   var     force -200 disp 10 vel 0 accel -20 time 0;
   control mass 9.2 c 1.5 dt .1 k 20;
   force = -k * disp -c * vel;
   disp  = lag(disp) + vel * dt;
   vel   = lag(vel) + accel * dt;
   accel = force / mass;
   time  = lag(time) + dt;
run;
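The difference equations above form a simultaneous system in each period, because DISP uses the current VEL and VEL uses the current ACCEL. The following Python sketch is an illustration outside SAS, not the procedure's solution method: it solves the same linear system in closed form at each step instead of iterating. The function name simulate is hypothetical; the starting values and constants are those of the VAR and CONTROL statements.

```python
def simulate(disp0=10.0, vel0=0.0, mass=9.2, c=1.5, k=20.0, dt=0.1, steps=100):
    """Step the spring-mass-damper difference equations.

    Each step satisfies, for the current period,
        force = -k*disp - c*vel
        accel = force / mass
        vel   = lag(vel)  + accel*dt
        disp  = lag(disp) + vel*dt
    Substituting the last equation into the others gives a closed form for vel.
    """
    disp, vel = [disp0], [vel0]
    denom = 1.0 + dt * c / mass + dt * dt * k / mass
    for _ in range(steps):
        v_new = (vel[-1] - dt * k * disp[-1] / mass) / denom
        d_new = disp[-1] + v_new * dt
        vel.append(v_new)
        disp.append(d_new)
    return disp, vel

disp_a, vel_a = simulate()                 # base case
disp_b, vel_b = simulate(disp0=20.0)       # twice the initial displacement
print(disp_a[1], vel_a[1])
```

Because the system is linear, doubling the initial displacement doubles the whole path and leaves the timing of the oscillation unchanged.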
The displacement scale is zeroed at the point where the force of gravity is offset, so the acceleration of the gravity constant is omitted from the force equation. The control variables C and K represent the damper and the spring constants, respectively. The model is simulated three times, and the simulation results are written to output data sets. The first run uses the original initial conditions specified in the VAR statement. In the second run, the initial displacement is doubled; the results show that the period of the motion is unaffected by the amplitude. In the third run, the DERT. syntax is used to do the integration. Notice that the path of the displacement is close to the old path, indicating that the original time step is short enough to yield an accurate solution. These simulations are performed by the following statements:
proc model data=one model=spring;
   title2 "Simulation of the model for the base case";
   control run 1;
   solve / out=a;
run;
   title2 "Simulation of the model with twice the initial displacement";
   control run 2;
   var disp 20;
   solve / out=b;
run;

data two;
   do time = 0 to 10 by .2; output; end;
run;

title2 "Simulation of the model using the dert. syntax";
proc model data=two;
   var     force -200 disp 10 vel 0 accel -20 time 0;
   control mass 9.2 c 1.5 dt .1 k 20;
   control run 3 ;
   force = -k * disp -c * vel;
   dert.disp = vel ;
   dert.vel  = accel;
   accel = force / mass;
   solve / out=c;
   id time ;
run;
The output SAS data sets that contain the solution results are merged and the displacement time paths for the three simulations are plotted. The three runs are identified on the plot as 1, 2, and 3. The following statements produce Output 18.7.1 through Output 18.7.5.
data p;
   set a b c;
run;

title2 "Overlay Plot of All Three Simulations";
proc sgplot data=p;
   series x=time y=disp / group=run lineattrs=(pattern=1);
   xaxis values=(0 to 10 by 1);
   yaxis values=(-20 to 20 by 10);
run;
force(-200) disp(10) vel(0) accel(-20) time(0) mass(9.2) c(1.5) dt(0.1) k(20) run(1) force disp vel accel time
Observations Processed Read Lagged Solved First Last Variables Solved For 100 1 99 2 100
Solution Summary Variables Solved Simulation Lag Length Solution Method CONVERGE= Maximum CC Maximum Iterations Total Iterations Average Iterations 5 1 NEWTON 1E-8 2.64E-14 1 99 1
Observations Processed Read Solved Variables Solved For ODEs Auxiliary Equations 51 51 force disp vel accel dert.disp dert.vel force accel
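$$g_1 = c_1 \, 10^{c_2 z_4} \left( c_5 z_1^{-c_4} + (1 - c_5)\, z_2^{-c_4} \right)^{-c_3 / c_4} - z_3 = 0$$
$$g_2 = \left[ \frac{c_5}{1 - c_5} \right] \left( \frac{z_1}{z_2} \right)^{-1 - c_4} - z_5 = 0$$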
where $z_1$ is capital input, $z_2$ is labor input, $z_3$ is real output, $z_4$ is time in years with 1929 as year zero, and $z_5$ is the ratio of price of capital services to wage scale. The $c_i$'s are the unknown parameters. $z_1$ and $z_2$ are considered endogenous variables. A FIML estimation is performed by using the following statements:
data bodkin;
   input z1 z2 z3 z4 z5;
   datalines;
1.33135 0.64629 0.4026 -20 0.24447
1.39235 0.66302 0.4084 -19 0.23454
   ... more lines ...

title1 "Nonlinear FIML Estimation";
proc model data=bodkin;
   parms c1-c5;
   endogenous z1 z2;
   exogenous z3 z4 z5;
   eq.g1 = c1 * 10 **(c2 * z4) * (c5*z1**(-c4)+
           (1-c5)*z2**(-c4))**(-c3/c4) - z3;
   eq.g2 = (c5/(1-c5))*(z1/z2)**(-1-c4) -z5;
   fit g1 g2 / fiml ;
run;
When FIML estimation is selected, the log likelihood of the system is output as the objective value. The results of the estimation are shown in Output 18.8.1.
Output 18.8.1 FIML Estimation Results for U.S. Production Data
Nonlinear FIML Estimation The MODEL Procedure Nonlinear FIML Summary of Residual Errors DF Model 4 1 DF Error 37 40 Adj R-Sq
Equation g1 g2
R-Square
Nonlinear FIML Parameter Estimates Approx Std Err 0.0218 0.000673 0.1148 0.2699 0.0596 Approx Pr > |t| <.0001 <.0001 <.0001 0.0873 <.0001
Parameter c1 c2 c3 c4 c5
The theory of electric circuits is governed by Kirchhoff's laws: the sum of the currents flowing to a node is zero, and the net voltage drop around a closed loop is zero. In addition to Kirchhoff's laws, there are relationships between the current I through each element and the voltage drop V across the elements. For the circuit in Figure 18.93, the relationships are
$$C \frac{dV}{dt} = I$$
for the capacitor and
$$V = \left( R_1 + R_2 (1 - \exp(-V)) \right) I$$
for the nonlinear resistor. The following differential equation describes the current at node 2 as a function of time and voltage for this circuit:
$$C \frac{dV_2}{dt} - \frac{V_1 - V_2}{R_1 + R_2 (1 - \exp(-V))} = 0$$
This equation can be written in the form
$$\frac{dV_2}{dt} = \frac{V_1 - V_2}{\left( R_1 + R_2 (1 - \exp(-V)) \right) C}$$
where $V = V_1 - V_2$, as in the program statements that follow. Consider the following data.

data circ;
   input v2 v1 time@@;
   datalines;
-0.00007 0.0 0.0000000001 0.00912 0.5 0.0000000002
 0.03091 1.0 0.0000000003 0.06419 1.5 0.0000000004
 0.11019 2.0 0.0000000005 0.16398 2.5 0.0000000006
 0.23048 3.0 0.0000000007 0.30529 3.5 0.0000000008
 0.39394 4.0 0.0000000009 0.49121 4.5 0.0000000010
 0.59476 5.0 0.0000000011 0.70285 5.0 0.0000000012
 0.81315 5.0 0.0000000013 0.90929 5.0 0.0000000014
 1.01412 5.0 0.0000000015 1.11386 5.0 0.0000000016
 1.21106 5.0 0.0000000017 1.30237 5.0 0.0000000018
 1.40461 5.0 0.0000000019 1.48624 5.0 0.0000000020
 1.57894 5.0 0.0000000021 1.66471 5.0 0.0000000022
;
You can estimate the parameters in the preceding equation by using the following SAS statements:
title1 "Circuit Model Estimation Example";
proc model data=circ mintimestep=1.0e-23;
   parm R2 2000 R1 4000 C 5.0e-13;
   dert.v2 = (v1-v2)/((r1 + r2*(1-exp( -(v1-v2)))) * C);
   fit v2;
run;
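To see what the DERT.V2 statement evaluates, the right-hand side can be computed outside SAS. The following Python sketch is an illustration only; the function name dv2_dt is hypothetical, and the default arguments are the starting values from the PARM statement, not estimates.

```python
import math

def dv2_dt(v1, v2, r1=4000.0, r2=2000.0, c=5.0e-13):
    # Right-hand side of the circuit equation:
    # (v1 - v2) / ((r1 + r2*(1 - exp(-(v1 - v2)))) * c)
    v = v1 - v2
    return v / ((r1 + r2 * (1.0 - math.exp(-v))) * c)

# Rate of change of v2 at one of the data points (v1 = 5.0, v2 = 0.59476)
rate = dv2_dt(5.0, 0.59476)
print(rate)
```

The rate is zero when V1 equals V2 and positive when V1 exceeds V2, which matches the rising V2 values in the data once V1 is held at 5.0.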
Parameter R2 R1 C
t Value <-----<-----<------
NOTE: The model was singular. Some estimates are marked Biased.
In this case, the model equation is such that there is linear dependency that causes biased results and inflated variances. The Jacobian matrix is singular or nearly singular, but eliminating one of the parameters is not a solution in this case.
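In Figure 18.94, E=enzyme, D=probe, and I=inhibitor. The differential equations that describe this reaction scheme are as follows:
$$\frac{dD}{dt} = k_{1r}\,ED - k_{1f}\,E \cdot D$$
$$\frac{dED}{dt} = k_{1f}\,E \cdot D - k_{1r}\,ED$$
$$\frac{dE}{dt} = k_{1r}\,ED - k_{1f}\,E \cdot D + k_{2r}\,EI - k_{2f}\,E \cdot I$$
$$\frac{dEI}{dt} = k_{2f}\,E \cdot I - k_{2r}\,EI$$
$$\frac{dI}{dt} = k_{2r}\,EI - k_{2f}\,E \cdot I$$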
For this system, the initial values for the concentrations are derived from equilibrium considerations (as a function of parameters) or are provided as known values.
The experiment used to collect the data was carried out in two ways: preincubation (type='disassoc') and no preincubation (type='assoc'). The data also contain repeated measurements. The data contain values for fluorescence F, which is a function of concentration. Since there are no direct data for the concentrations, all the differential equations are simulated dynamically. The SAS statements used to fit this model are as follows:
title1 "Systems of Differential Equations Example";
proc sort data=fit;
   by type time;
run;

%let k1f = 6.85e6 ;
%let k1r = 3.43e-4 ;
%let k2f = 1.8e7 ;
%let k2r = 2.1e-2 ;

%let qf = 2.1e8 ;
%let qb = 4.0e9 ;

proc model data=fit;
   parameters qf  = 2.1e8
              qb  = 4.0e9
              k2f = 1.8e5
              k2r = 2.1e-3
              l   = 0;

   k1f = 6.85e6;
   k1r = 3.43e-4;

   /* Initial values for concentrations */
   control dt 5.0e-7
           et 5.0e-8
           it 8.05e-6;

   /* Association initial values --------------*/
   if type = 'assoc' and time=0 then
      do;
         ed = 0;
         /* solve quadratic equation ----------*/
         a = 1;
         b = -(&it+&et+(k2r/k2f));
         c = &it*&et;
         ei = (-b-(((b**2)-(4*a*c))**.5))/(2*a);
         d = &dt-ed;
         i = &it-ei;
         e = &et-ed-ei;
      end;

   /* Disassociation initial values ----------*/
   if type = 'disassoc' and time=0 then
      do;
         ei = 0;
         a = 1;
         b = -(&dt+&et+(&k1r/&k1f));
         c = &dt*&et;
         ed = (-b-(((b**2)-(4*a*c))**.5))/(2*a);
         d = &dt-ed;
         i = &it-ei;
         e = &et-ed-ei;
      end;

   if time ne 0 then
      do;
         dert.d  = k1r* ed - k1f *e *d;
         dert.ei = k2f* e *i - k2r * ei;
         dert.i  = k2r * ei - k2f* e *i;
      end;

   /* L - offset between curves */
   if type = 'disassoc' then
      F = (qf*(d-ed)) + (qb*ed) -L;
   else
      F = (qf*(d-ed)) + (qb*ed);

   fit F / method=marquardt;
run;
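The initial-value blocks in the program solve a quadratic equation for the equilibrium concentration of a complex and take the smaller root. The following Python sketch is an illustration outside SAS of that step for the association case; the function name equilibrium_complex is hypothetical, and the rate constants are the starting values from the PARAMETERS statement, not estimates.

```python
import math

def equilibrium_complex(total_a, total_b, k_ratio):
    # Smaller root of x**2 - (total_a + total_b + k_ratio)*x + total_a*total_b = 0,
    # written the same way as in the program: (-b - sqrt(b**2 - 4ac)) / (2a)
    a = 1.0
    b = -(total_a + total_b + k_ratio)
    c = total_a * total_b
    return (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

it, et = 8.05e-6, 5.0e-8      # total inhibitor and total enzyme (CONTROL statement)
k2f, k2r = 1.8e5, 2.1e-3      # starting values of the rate constants

ei = equilibrium_complex(it, et, k2r / k2f)   # equilibrium EI concentration
i_free = it - ei                              # free inhibitor
e_free = et - ei                              # free enzyme (ED = 0 in this case)
print(ei, i_free, e_free)
```

Taking the smaller root keeps the complex concentration below both totals, so the free concentrations computed from it stay nonnegative.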
This estimation requires the repeated simulation of a system of 41 differential equations (5 base differential equations and 36 differential equations to compute the partials with respect to the parameters). The results of the estimation are shown in Output 18.10.1.
Output 18.10.1 Kinetics Estimation
Systems of Differential Equations Example The MODEL Procedure Nonlinear OLS Summary of Residual Errors DF Model 5 DF Error 797 Adj R-Sq 0.9980
Equation f
SSE 2525.0
MSE 3.1681
R-Square 0.9980
The first data set contains errors that are strictly additive and independent. The data for this estimation are generated by the following DATA step:
data drive1;
   a = 0.5;
   do iter=1 to 100;
      do time = 0 to 50;
         y = 1 - exp(-a*time) + 0.1 *rannor(123);
         output;
      end;
   end;
run;
The second data set contains errors that are cumulative in form.
data drive2;
   a = 0.5;
   yp = 1.0 + 0.01 *rannor(123);
   do iter=1 to 100;
      do time = 0 to 50;
         y = 1 - exp(-a)*(1 - yp);
         yp = y + 0.01 *rannor(123);
         output;
      end;
   end;
run;
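The recursion in the second DATA step is the exact one-period solution of the differential equation dy/dt = a - a*y that is estimated below, so without noise it reproduces the closed form used in the first DATA step. The following Python sketch is an illustration outside SAS; it omits the random errors and starts from y = 0 (both are assumptions made for the comparison), and the function name exact_step is hypothetical.

```python
import math

def exact_step(y_prev, a, h=1.0):
    # Exact solution of dy/dt = a - a*y over a step of length h,
    # starting from y_prev: y = 1 - exp(-a*h) * (1 - y_prev)
    return 1.0 - math.exp(-a * h) * (1.0 - y_prev)

a = 0.5
y = 0.0
path = [y]
for t in range(1, 11):
    y = exact_step(y, a)
    path.append(y)

# Closed form used in the first DATA step (without the additive error)
closed = [1.0 - math.exp(-a * t) for t in range(11)]
print(path[:3], closed[:3])
```

In the second DATA step the error is added to the lagged value before the next step, so each error is carried forward through the recursion, which is why those errors are described as cumulative.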
The following statements perform the 100 static estimations for each data set:
title1 "Monte Carlo Simulation of ODE";
proc model data=drive1 noprint;
   parm a 0.5;
   dert.y = a - a * y;
   fit y / outest=est;
   by iter;
run;
Similar statements are used to produce 100 dynamic estimations with a fixed and an unknown initial value. The first value in the data set is used to simulate an error in the initial value. The following PROC UNIVARIATE statements process the estimations:
proc univariate data=est noprint;
   var a;
   output out=monte mean=mean p5=p5 p95=p95;
run;

proc print data=monte;
run;
                          Additive Error                   Cumulative Error
Estimation Type      mean      p95       p5          mean       p95        p5
static            0.77885  1.03524  0.54733       0.57863   1.16112   0.31334
dynamic fixed     0.48785  0.63273  0.37644     3.8546E24   8.88E10  -51.9249
dynamic unknown   0.48518  0.62452  0.36754     641704.51   1940.42  -25.6054
For this example model, it is evident that the static estimation is the least sensitive to misspecification.
   format date monyy.;
   call streaminit(156789);
   do t=0 to 20 by 0.1;
      date=intnx('month','01jun90'd,(t*10)-1);
      x=rand('normal');
      e=rand('cauchy') + 10 ;
      y=exp(4*x)+e;
      output;
   end;
run;
$\mathrm{Cauchy}(nc)$

That is, the residuals of the model are distributed as a Cauchy distribution with noncentrality parameter $nc$. The log likelihood for the Cauchy distribution is

\[
like = -\log\left(1 + (x - nc)^2 \, \pi\right)
\]
The following SAS statements specify the model and the log-likelihood function.
title1 'Cauchy Distribution';

proc model data=c ;
   dependent y;
   parm a -2 nc 4;
   y=exp(-a*x);

   /* Likelihood function for the residuals */
   obj = log(1+(-resid.y-nc)**2 * 3.1415926);

   errormodel y ~ general(obj) cdf=cauchy(nc);

   fit y / outsn=s1 method=marquardt;
   solve y / sdata=s1 data=c(obs=1) random=1000
             seed=256789 out=out1;
run;

title 'Distribution of Y';

proc sgplot data=out1;
   histogram y;
run;
The FIT statement uses the OUTSN= option to output the $\Sigma$ matrix for residuals from the normal distribution. The $\Sigma$ matrix is $1 \times 1$ and has value 1.0 because it is a correlation matrix. The OUTS= matrix is the scalar 2989.0. Because the distribution is univariate (no covariances), the OUTS= option would produce the same simulation results. The simulation is performed by using the SOLVE statement.
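\[
y_i = \sum_{j=1}^{k} \beta_{1j} x_{ji} + u_{1i} = x_i' \beta_1 + u_{1i}
\]
\[
y_i = \sum_{j=1}^{k} \beta_{2j} x_{ji} + u_{2i} = x_i' \beta_2 + u_{2i}
\]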
where $x_{hi}$ and $x_{hj}$ are the $i$th and $j$th observations, respectively, on $x_h$. The errors, $u_{1i}$ and $u_{2i}$, are assumed to be distributed normally and independently with mean zero and constant variance. The variance for the first regime is $\sigma_1^2$, and the variance for the second regime is $\sigma_2^2$. If $\sigma_1^2 \neq \sigma_2^2$ and $\beta_1 \neq \beta_2$, the regression system given previously is thought to be switching between the two regimes.

The problem is to estimate $\beta_1$, $\beta_2$, $\sigma_1$, and $\sigma_2$ without knowing a priori which of the $n$ values of the dependent variable, $y$, was generated by which regime. If it is known a priori which observations belong to which regime, a simple Chow test can be used to test $\sigma_1^2 = \sigma_2^2$ and $\beta_1 = \beta_2$.

Using Goldfeld and Quandt's D-method for switching regression, you can solve this problem. Assume that observations exist on some exogenous variables $z_{1i}, z_{2i}, \ldots, z_{pi}$, where $z$ determines whether the $i$th observation is generated from one equation or the other. The equations are given as follows:

\[
y_i = x_i' \beta_1 + u_{1i} \quad \text{if} \quad \sum_{j=1}^{p} \pi_j z_{ji} \le 0
\]
\[
y_i = x_i' \beta_2 + u_{2i} \quad \text{if} \quad \sum_{j=1}^{p} \pi_j z_{ji} > 0
\]

where $\pi_j$ are unknown coefficients to be estimated. Define $d(z_i)$ as a continuous approximation to a step function. Replacing the unit step function with a continuous approximation by using the cumulative normal integral enables a more practical method that produces consistent estimates.

\[
d(z_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\sum_j \pi_j z_{ji}} \exp\left( -\frac{1}{2} \frac{\xi^2}{\sigma^2} \right) d\xi
\]

$D$ is the $n$ dimensional diagonal matrix consisting of $d(z_i)$:

\[
D = \begin{bmatrix}
d(z_1) & 0 & 0 & 0 \\
0 & d(z_2) & 0 & 0 \\
0 & 0 & \ddots & 0 \\
0 & 0 & 0 & d(z_n)
\end{bmatrix}
\]
The parameters to estimate are now the $k$ $\beta_1$'s, the $k$ $\beta_2$'s, $\sigma_1^2$, $\sigma_2^2$, $p$ $\pi$'s, and the $\sigma$ introduced in the $d(z_i)$ equation. The $\sigma$ can be considered as given a priori, or it can be estimated, in which case, the estimated magnitude provides an estimate of the success in discriminating between the two regimes (Goldfeld and Quandt 1976). Given the preceding equations, the model can be written as:
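\[
Y = (I - D) X \beta_1 + D X \beta_2 + W
\]

where $W = (I - D) U_1 + D U_2$, and $W$ is a vector of unobservable and heteroscedastic error terms. The covariance matrix of $W$ is denoted by $\Omega$, where $\Omega = (I - D)^2 \sigma_1^2 + D^2 \sigma_2^2$. The maximum likelihood parameter estimates maximize the following log-likelihood function.

\[
\log L = -\frac{n}{2} \log 2\pi - \frac{1}{2} \log |\Omega|
- \frac{1}{2} \left[ Y - (I - D) X \beta_1 - D X \beta_2 \right]' \Omega^{-1} \left[ Y - (I - D) X \beta_1 - D X \beta_2 \right]
\]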
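Because the covariance matrix of W is diagonal, the log likelihood separates into one term per observation. The following Python sketch evaluates the negative of that per-observation term, in the same form as the logL expression in the PROC MODEL statements later in this example. The function names norm_cdf and switch_neg_loglik are illustrative assumptions, and the scale of the switching function is assumed to be absorbed into z, as it is when the example multiplies the rate change by the parameter p.

```python
import math

def norm_cdf(x):
    """Standard normal CDF, the continuous approximation to the unit step."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def switch_neg_loglik(resid, z, sig1, sig2):
    """Negative log likelihood of one observation in the two-regime model.

    resid: y minus the composite prediction (1 - d)*y1 + d*y2
    z:     value of the switching index for this observation
    """
    d = norm_cdf(z)
    omega = (sig1 ** 2) * (1.0 - d) ** 2 + (sig2 ** 2) * d ** 2
    return 0.5 * (math.log(2.0 * math.pi) + math.log(omega) + resid ** 2 / omega)

# Far below zero the weight d is 0 and only the regime-1 variance matters;
# far above zero the weight d is 1 and only the regime-2 variance matters.
low = switch_neg_loglik(resid=1.0, z=-40.0, sig1=2.0, sig2=5.0)
high = switch_neg_loglik(resid=1.0, z=40.0, sig1=2.0, sig2=5.0)
```

Minimizing the sum of these terms over all observations is equivalent to maximizing the log likelihood.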
As an example, you now can use this switching regression likelihood to develop a model of housing starts as a function of changes in mortgage interest rates. The data for this example are from the U.S. Census Bureau and cover the period from January 1973 to March 1999. The hypothesis is that there are different coefficients on your model based on whether the interest rates are going up or down. So the model for $z_i$ is

\[
z_i = p \, (rate_i - rate_{i-1})
\]

where $rate_i$ is the mortgage interest rate at time $i$ and $p$ is a scale parameter to be estimated. The regression model is

\[
starts_i = intercept_1 + ar1 \cdot starts_{i-1} + djf_1 \cdot decjanfeb \quad \text{if} \quad z_i < 0
\]
\[
starts_i = intercept_2 + ar2 \cdot starts_{i-1} + djf_2 \cdot decjanfeb \quad \text{if} \quad z_i \ge 0
\]
where startsi is the number of housing starts at month i and decjanfeb is a dummy variable that indicates that the current month is one of December, January, or February. This model is written by using the following SAS statements:
title1 'Switching Regression Example';

proc model data=switch;
   parms sig1=10 sig2=10 int1 b11 b13 int2 b21 b23 p;
   bounds 0.0001 < sig1 sig2;

   decjanfeb = ( month(date) = 12 | month(date) <= 2 );

   a = p*dif(rate);       /* Upper bound of integral */
   d = probnorm(a);       /* Normal CDF as an approx of switch */

                          /* Regime 1 */
   y1 = int1 + zlag(starts)*b11 + decjanfeb *b13 ;
                          /* Regime 2 */
   y2 = int2 + zlag(starts)*b21 + decjanfeb *b23 ;
                          /* Composite regression equation */
   starts = (1 - d)*y1 + d*y2;

                          /* Resulting log-likelihood function */
   logL = (1/2)*( (log(2*3.1415)) +
          log( (sig1**2)*((1-d)**2)+(sig2**2)*(d**2) )
          + (resid.starts*( 1/( (sig1**2)*((1-d)**2)+
          (sig2**2)*(d**2) ) )*resid.starts) ) ;

   errormodel starts ~ general(logL);

   fit starts / method=marquardt converge=1.0e-5;

   /* Test for significant differences in the parms */
   test int1 = int2 ,/ lm;
   test b11 = b21 ,/ lm;
   test b13 = b23 ,/ lm;
   test sig1 = sig2 ,/ lm;
run;
Four TEST statements are added to test the hypothesis that the parameters are the same in both regimes. The parameter estimates and ANOVA table from this run are shown in Output 18.13.1.
Output 18.13.1 Parameter Estimates from the Switching Regression
Switching Regression Example

The MODEL Procedure

Nonlinear Liklhood Summary of Residual Errors

Equation  DF Model  DF Error      SSE    MSE  R-Square  Adj R-Sq
starts           9       304  85878.0  282.5    0.7806    0.7748

Nonlinear Liklhood Parameter Estimates

Parameter   Estimate  Approx Std Err  t Value  Approx Pr > |t|
sig1        15.47484          0.9476    16.33           <.0001
sig2        19.77808          1.2710    15.56           <.0001
int1        32.82221          5.9083     5.56           <.0001
b11          0.73952          0.0444    16.64           <.0001
b13         -15.4556          3.1912    -4.84           <.0001
int2        42.73348          6.8159     6.27           <.0001
b21         0.734117          0.0478    15.37           <.0001
b23         -22.5184          4.2985    -5.24           <.0001
p           25.94712          8.5205     3.05           0.0025
The test results shown in Output 18.13.2 suggest that the variance parameters of the housing starts, SIG1 and SIG2, are significantly different in the two regimes. The tests also show a significant difference in the AR term on the housing starts.
data cauchy;
   format date monyy.;
   PI = 3.1415926;
   do date='1jun2001'd to '1nov2002'd;
      cauchy = -4 + tan((ranuni(1234) - 0.5) * PI);
      output;
   end;
run;
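The DATA step above draws Cauchy variates by the inverse-CDF method: a uniform draw u is mapped through the Cauchy quantile function, location + tan((u - 0.5)*pi). The following Python sketch shows the same transformation together with the matching CDF. The function names are illustrative assumptions, and the Python random stream is not the same as the SAS RANUNI stream, so the draws themselves differ.

```python
import math
import random

def cauchy_quantile(u, loc=0.0):
    """Inverse CDF of a Cauchy distribution with unit scale and location loc."""
    return loc + math.tan((u - 0.5) * math.pi)

def cauchy_cdf(x, loc=0.0):
    """CDF of a Cauchy distribution with unit scale and location loc."""
    return 0.5 + math.atan(x - loc) / math.pi

rng = random.Random(1234)  # arbitrary seed for this sketch
draws = [cauchy_quantile(rng.random(), loc=-4.0) for _ in range(5)]
```

Applying cauchy_cdf to a value produced by cauchy_quantile returns the original uniform draw, which is the property that the later normalization step relies on.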
Since the multivariate joint likelihood is unknown, the models must be estimated separately. The residuals for each model are saved by using the OUT= option. Also, each model is saved by using the OUTMODEL= option. The ID statement is used to provide a variable in the residual data set to merge by. The XLAG function is used to model the GARCH(1,1) process. The XLAG function returns the lag of the first argument if it is nonmissing; otherwise, it returns the second argument.
title1 't-distributed Errors Example';

proc model data=t outmod=tModel;
   parms df 10 vt 4;
   t = a * date + c;
   errormodel t ~ t( vt, df );
   fit t / out=tresid;
   id date;
run;

title1 'GARCH-distributed Errors Example';

proc model data=normal outmodel=normalModel;
   normal = b0 ;
   h.normal = arch0 + arch1 * xlag(resid.normal **2 , mse.normal)
              + GARCH1 * xlag(h.normal, mse.normal);
   fit normal /fiml out=nresid;
   id date;
run;

title1 'Cauchy-distributed Errors Example';

proc model data=cauchy outmod=cauchyModel;
   parms nc = 1;
   /* nc is noncentrality parm to Cauchy dist */
   cauchy = nc;
   obj = log(1+resid.cauchy**2 * 3.1415926);
   errormodel cauchy ~ general(obj) cdf=cauchy(nc);
   fit cauchy / out=cresid;
   id date;
run;
The simulation requires a covariance matrix created from normal residuals. The following DATA step statements apply the CDFs of the t and Cauchy distributions, followed by the inverse normal CDF (the PROBIT function), to convert the residuals to the normal distribution. The CORR procedure is used to create a correlation matrix that uses the converted residuals.
/* Merge and normalize the 3 residual data sets */
data c;
   merge tresid nresid cresid;
   by date;
   t = probit(cdf("T", t/sqrt(0.2789), 16.58 ));
   cauchy = probit(cdf("CAUCHY", cauchy, -4.0623));
run;

proc corr data=c out=s;
   var t normal cauchy;
run;
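The normalization in the DATA step above is a two-stage map: the CDF of the residual's own distribution turns it into a probability, and the inverse standard normal CDF turns that probability into a normal score. The following Python sketch shows the Cauchy case by using only the standard library. The function names and the clamping constant eps are illustrative assumptions; the clamp guards against probabilities that round to exactly 0 or 1 for extreme residuals and is not part of the SAS statements.

```python
import math
from statistics import NormalDist

def cauchy_cdf(x, loc):
    """CDF of a Cauchy distribution with unit scale and location loc."""
    return 0.5 + math.atan(x - loc) / math.pi

def cauchy_to_normal(x, loc, eps=1e-12):
    """Map a Cauchy residual to a standard normal score."""
    p = cauchy_cdf(x, loc)
    p = min(max(p, eps), 1.0 - eps)   # keep p strictly inside (0, 1)
    return NormalDist().inv_cdf(p)

loc = -4.0623                         # location value used in the DATA step
z_center = cauchy_to_normal(loc, loc)         # residual at the location
z_upper = cauchy_to_normal(loc + 1.0, loc)    # upper quartile of the Cauchy
```

A residual at the Cauchy location maps to 0, and the Cauchy upper quartile maps to the normal upper quartile, so ranks are preserved while the heavy tails are pulled in.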
Now the models can be simulated together by using the MODEL procedure SOLVE statement. The data set created by the CORR procedure is used as the correlation matrix.
title1 'Simulating Equations with Different Error Distributions';

/* Create one observation driver data set */
data sim;
   merge t normal cauchy;
   by date;
data sim;
   set sim(firstobs = 519 );

proc model data=sim model=( tModel normalModel cauchyModel );
   errormodel t ~ t( vt, df );
   errormodel cauchy ~ cauchy(nc);
   solve t cauchy normal / random=2000 seed=1962 out=monte
                           sdata=s(where=(_type_="CORR"));
run;
An estimation of the joint density of the t and Cauchy distribution is created by using the KDE procedure. Bounds are placed on the Cauchy dimension because of its fat tail behavior. The joint PDF is shown in Output 18.14.1.
title "T and Cauchy Distribution";

proc kde data=monte;
   univar t / out=t_dens;
   univar cauchy / out=cauchy_dens;
   bivar t cauchy / out=density plots=all;
run;
Output 18.14.2 Bivariate Density of t and Cauchy, Kernel Density for t and Cauchy
Output 18.14.3 Bivariate Density of t and Cauchy, Distribution and Kernel Density for t and Cauchy
Output 18.14.5 Bivariate Density of t and Cauchy, Kernel Density for t and Cauchy
Output 18.14.6 Bivariate Density of t and Cauchy, Distribution and Kernel Density for t and Cauchy
In the following SAS statements, ysim is simulated, and the first moment and the second moment of ysim are compared with those of the observed endogenous variable y.
title "Simple regression model";

data regdata;
   do i=1 to 500;
      x = rannor( 1013 );
      Y = 2 + 1.5 * x + 1.5 * rannor( 9871 );
      output;
   end;
run;

proc model data=regdata;
   parms a b s;
   instrument x;
   ysim = (a+b*x) + s * rannor( 8003 );
   y = ysim;
   eq.ysq = y*y - ysim*ysim;
   fit y ysq / gmm ndraw;
   bound s > 0;
run;
Model Variables   Y
Parameters        a b s
Equations         ysq Y

Nonlinear GMM Parameter Estimates

Parameter  Approx Std Err  Approx Pr > |t|
a                  0.0657           <.0001
b                  0.0565           <.0001
s                  0.0498           <.0001
\[
y_t = a + b x_t + u_t
\]
\[
u_t = \alpha u_{t-1} + \epsilon_t
\]
\[
\epsilon_t \sim \text{iid } N(0, s^2)
\]
In the following SAS statements, ysim is simulated by using this model, and the endogenous variable y is set to be equal to ysim. The MOMENT statement creates two more moments for the estimation. One is the second moment and the other is the first-order autocovariance. The NPREOBS=10 option instructs PROC MODEL to run the simulation 10 times before ysim is compared to the first observation of y. Because the initial zlag(u) is zero, the first ysim is a + b*x + s*rannor(8003). Without the NPREOBS option, this ysim is matched with the first observation of y. With NPREOBS, this ysim, along with the next 9 ysim, is thrown away, and the moment match starts with the eleventh ysim with the first observation of y. This way, the initial values do not exert a large influence on the simulated endogenous variables.
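The effect of discarding pre-observations can be illustrated outside of PROC MODEL. The following Python sketch simulates the AR(1) error process, runs it for a number of discarded steps, and only then starts producing simulated values that line up with the observations. The function simulate_ar1 and its arguments are illustrative assumptions, not the procedure's internal algorithm; in particular, the sketch advances only the error process during the discarded steps.

```python
import random

def simulate_ar1(x, a, b, s, alpha, npreobs=0, seed=8003):
    """Simulate ysim = a + b*x + u with u = alpha*lag(u) + s*N(0,1).

    npreobs steps of the error process are run and discarded first,
    so the zero starting value of u has less influence on ysim.
    """
    rng = random.Random(seed)
    u = 0.0                               # plays the role of the initial zlag(u)
    for _ in range(npreobs):              # discarded pre-observations
        u = alpha * u + s * rng.gauss(0.0, 1.0)
    ysim = []
    for xt in x:
        u = alpha * u + s * rng.gauss(0.0, 1.0)
        ysim.append(a + b * xt + u)
    return ysim

x = [0.5, -1.0, 2.0]
cold = simulate_ar1(x, a=2.0, b=1.5, s=1.5, alpha=0.6)               # no burn-in
warm = simulate_ar1(x, a=2.0, b=1.5, s=1.5, alpha=0.6, npreobs=10)   # with burn-in
```

Without discarded steps, the first simulated value contains only one shock; with them, it already carries the accumulated effect of earlier shocks, as a draw from the stationary process would.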
%let nobs=500;

data ardata;
   lu =0;
   do i=-10 to &nobs;
      x = rannor( 1011 );
      e = rannor( 1011 );
      u = .6 * lu + 1.5 * e;
      Y = 2 + 1.5 * x + u;
      lu = u;
      if i > 0 then output;
   end;
run;

title1 'Simulated Method of Moments for AR(1) Process';

proc model data=ardata ;
   parms a b s 1 alpha .5;
   instrument x;

   u = alpha * zlag(u) + s * rannor( 8003 );
   ysim = a + b * x + u;
   y = ysim;
   eq.ysq = y*y - ysim*ysim;
   eq.ylagy = y * lag(y) - ysim * lag( ysim );

   fit y ysq ylagy / gmm npreobs=10 ndraw=10;
   bound s > 0, 1 > alpha > 0;
run;
The 3 Equations to Estimate

Y     = F(a(1), b(x), s, alpha)
ysq   = F(a, b, s, alpha)
ylagy = F(a, b, s, alpha)

Instruments  1 x
Nonlinear GMM Parameter Estimates

Parameter  Approx Std Err  Approx Pr > |t|
a                  0.1038           <.0001
b                  0.0698           <.0001
s                  0.0984           <.0001
alpha              0.0809           <.0001
\[
y_t = \sigma_t z_t
\]
\[
\log(\sigma_t^2) = a + b \log(\sigma_{t-1}^2) + s u_t
\]
\[
(z_t, u_t) \sim \text{iid } N(0, I_2)
\]
This model is widely used in modeling the return process of stock prices and foreign exchange rates. This is called the stochastic volatility model because the volatility is stochastic as the random variable $u_t$ appears in the volatility equation. The following SAS statements use three moments: absolute value, the second-order moment, and absolute value of the first-order autoregressive moment.
Note the ADJSMMV option in the FIT statement to request the SMM covariance adjustment for the parameter estimates. Although these moments have closed-form solutions as shown by Andersen and Sorensen (1996), the simulation approach significantly simplifies the moment conditions.
%let nobs=1000;

data _tmpdata;
   a = -0.736; b=0.9; s=0.363;
   ll=sqrt( exp(a/(1-b)));
   do i=-10 to &nobs;
      u = rannor( 101 );
      z = rannor( 98761 );
      lnssq = a+b*log(ll**2) +s*u;
      st = sqrt(exp(lnssq));
      ll = st;
      y = st * z;
      if i > 0 then output;
   end;
run;

title1 'Simulated Method of Moments for Stochastic Volatility Model';

proc model data=_tmpdata ;
   parms a b .5 s 1;
   instrument _exog_ / intonly;

   u = rannor( 8801 );
   z = rannor( 9701 );
   lsigmasq = xlag(sigmasq,exp(a));
   lnsigmasq = a + b * log(lsigmasq) + s * u;
   sigmasq = exp( lnsigmasq );
   ysim = sqrt(sigmasq) * z;

   eq.m1 = abs(y) - abs(ysim);
   eq.m2 = y**2 - ysim**2;
   eq.m5 = abs(y*lag(y))-abs(ysim*lag(ysim));

   fit m1 m2 m5 / gmm npreobs=10 ndraw=10;
   bound s > 0, 1 > b > 0;
run;
Nonlinear GMM Parameter Estimates

Parameter  Approx Std Err  Approx Pr > |t|
a                  1.0357           0.0316
b                  0.1417           <.0001
s                  0.1491           <.0001
title1 'SMM for Duration Model with Unobserved Heterogeneity';
%let nobs=1000;

data durationdata;
   b=0.9; s=0.5;
   do i=1 to &nobs;
      u = rannor( 1011 );
      v = ranuni( 1011 );
      x = 2 * ranuni( 1011 );
      y = -exp(-b * x + s * u) * log(v);
      output;
   end;
run;

proc model data=durationdata;
   parms b .5 s 1;
   instrument x;

   u = rannor( 1011 );
   v = ranuni( 1011 );
   y = -exp(-b * x + s * u) * log(v);

   moment y = (2 3 4);

   fit y / gmm ndraw=10 ; * maxiter=500;
   bound s > 0, b > 0;
run;
The 4 Equations to Estimate

_moment_3 = F(b, s)
_moment_2 = F(b, s)
_moment_1 = F(b, s)
y         = F(b, s)

Instruments  1 x
Nonlinear GMM Parameter Estimates

Parameter  Approx Std Err  Approx Pr > |t|
b                  0.0331           <.0001
s                  0.0608           <.0001
\[
y_t = \sigma_t z_t
\]
\[
\sigma_t^2 = \omega + \alpha y_{t-1}^2 + \beta \sigma_{t-1}^2
\]

The pseudo parameter vector $\eta$ is given by $(\omega, \alpha, \beta)$. One advantage of such a class of models is that the conditional density of $y_t$ is Gaussian; that is,

\[
f(y_t \mid Y_{t-1}, \eta) \propto \frac{1}{\sigma_t} \exp\left( -\frac{y_t^2}{2 \sigma_t^2} \right)
\]

Therefore the score vector can easily be computed analytically.

The AUTOREG procedure provides the ML estimates, $\hat{\eta}_n$. The output is stored in the garchout data set, while the estimates are stored in the garchest data set.
title1 'Efficient Method of Moments for Stochastic Volatility Model';

/* estimate GARCH(1,1) model */
proc autoreg data=svdata(keep=y)
             outest=garchest
             noprint covout;
   model y = / noint garch=(q=1,p=1);
   output out=garchout cev=gsigmasq r=resid;
run;
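If the pseudo model is close enough to the structural model, in a suitable sense, Gallant and Long (1997) showed that a consistent estimator of the asymptotic covariance matrix of the sample pseudo-score vector can be obtained from the formula

\[
\hat{V}_n = \frac{1}{n} \sum_{t=1}^{n} s_f(Y_t, \hat{\eta}_n) \, s_f(Y_t, \hat{\eta}_n)'
\]

where $s_f(Y_t, \hat{\eta}_n) = (\partial / \partial \eta) \log f(y_t \mid Y_{t-1}, \hat{\eta}_n)$ denotes the score function of the auxiliary model computed at the ML estimates.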
The ML estimates of the GARCH(1,1) model are used in the following SAS statements to compute the variance-covariance matrix $\hat{V}_n$.
/* compute the V matrix */
data vvalues;
   set garchout(keep=y gsigmasq resid);

   /* compute scores of GARCH model */
   score_1 = (-1 + y**2/gsigmasq)/ gsigmasq;
   score_2 = (-1 + y**2/gsigmasq)*lag(gsigmasq) / gsigmasq;
   score_3 = (-1 + y**2/gsigmasq)*lag(y**2) / gsigmasq;

   array score{*} score_1-score_3;
   array v_t{*} v_t_1-v_t_6;
   array v{*} v_1-v_6;

   /* compute external product of score vector */
   do i=1 to 3;
      do j=i to 3;
         v_t{j*(j-1)/2 + i} = score{i}*score{j};
      end;
   end;

   /* average them over t */
   do s=1 to 6;
      v{s}+ v_t{s}/&nobs;
   end;
run;
The $\hat{V}$ matrix must be formatted to be used with the VDATA= option of the MODEL procedure. See the section "VDATA= Input data set" on page 1108 for more information about the VDATA= data set.
/* Create a VDATA dataset acceptable to PROC MODEL */

/* Transpose the last obs in the dataset */
proc transpose data=vvalues(firstobs=&nobs keep=v_1-v_6)
               out=tempv;
run;

/* Add eq and inst labels */
data vhat;
   set tempv(drop=_name_);
   value = col1;
   drop col1;
   input _type_ $ eq_row $ eq_col $ inst_row $ inst_col $; *$;
   datalines;
gmm m1 m1 1 1 /* intcpt is the only inst we use */
gmm m1 m2 1 1
gmm m2 m2 1 1
gmm m1 m3 1 1
gmm m2 m3 1 1
gmm m3 m3 1 1
;
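The six values v_1 through v_6 hold the upper triangle of the symmetric 3 x 3 matrix, stored in the order given by the index expression j*(j-1)/2 + i, and the rows of the VDATA= data set must follow the same order. The following Python sketch reproduces the packing and the averaged outer product on a small made-up set of score vectors. The function names and the sample scores are illustrative assumptions.

```python
def packed_index(i, j):
    """1-based position of element (i, j), i <= j, in the packed upper triangle."""
    return j * (j - 1) // 2 + i

def average_outer_product(scores):
    """Packed upper triangle of (1/n) * sum over t of s_t s_t'."""
    n = len(scores)
    k = len(scores[0])
    v = [0.0] * (k * (k + 1) // 2)
    for s in scores:
        for j in range(1, k + 1):
            for i in range(1, j + 1):
                v[packed_index(i, j) - 1] += s[i - 1] * s[j - 1] / n
    return v

# Order in which (row, column) pairs are visited for a 3 x 3 matrix
order = [(i, j) for j in range(1, 4) for i in range(1, j + 1)]

# Two made-up score vectors, only to exercise the arithmetic
v_packed = average_outer_product([[1.0, 2.0, 3.0], [1.0, 0.0, -1.0]])
```

The visiting order is (1,1), (1,2), (2,2), (1,3), (2,3), (3,3), which corresponds to the m1 m1, m1 m2, m2 m2, m1 m3, m2 m3, m3 m3 rows in the DATALINES block.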
The last step of the EMM procedure is to estimate $\theta$ by using SMM, where the moment conditions are given by the scores of the auxiliary model. Given a fixed value of the parameter vector $\theta$ and an arbitrarily large $T$, one can simulate a series $\{\hat{y}_1(\theta), \hat{y}_2(\theta), \ldots, \hat{y}_T(\theta)\}$ from the structural model. The EMM estimator is the value $\hat{\theta}_n$ that minimizes the quantity

\[
m_T(\theta, \hat{\eta}_n)' \, \hat{V}_n^{-1} \, m_T(\theta, \hat{\eta}_n)
\]

where

\[
m_T(\theta, \hat{\eta}_n) = \frac{1}{T} \sum_{k=1}^{T} s_f(\hat{Y}_k(\theta), \hat{\eta}_n)
\]

is the sample moment condition evaluated at the fixed estimated pseudo parameter $\hat{\eta}_n$. Note that the target function depends on the parameter $\theta$ only through the simulated series $\hat{y}_k$.

The following statements generate a data set that contains $T = 20{,}000$ replicates of the estimated pseudo parameter $\hat{\eta}_n$ and that is then input to the MODEL procedure. The EMM estimates are found by using the SMM option of the FIT statement. The $\hat{V}_n$ matrix computed above serves as weighting matrix by using the VDATA= option, and the scores of the GARCH(1,1) auxiliary model evaluated at the ML estimates are the moment conditions in the GMM step.

Since the number of structural parameters to estimate (3) is equal to the number of moment equations (3) times the number of instruments (1), the model is exactly identified and the objective function has value zero at the minimum. For simplicity, the starting values are set to the true values of the parameters.
/* USE SMM TO FIND EMM ESTIMATES */

/* Generate dataset of length T */
data emm;
   set garchest(obs=1 keep = _ah_0 _ah_1 _gh_1 _mse_);
   do i=1 to 20000;
      output;
   end;
   drop i;
run;

title2 'EMM estimates';

/* Find the EMM estimates */
proc model data=emm maxiter=1000;
   parms a -0.736 b 0.9 s 0.363;
   instrument _exog_ / intonly;

   /* Describe the structural model */
   u = rannor( 8801 );
   z = rannor( 9701 );
   lsigmasq = xlag(sigmasq,exp(a));
   lnsigmasq = a + b * log(lsigmasq) + s * u;
   sigmasq = exp( lnsigmasq );
   ysim = sqrt(sigmasq) * z;

   /* Volatility of the GARCH model */
   gsigmasq = _ah_0 + _gh_1*xlag(gsigmasq, _mse_)
              + _ah_1*xlag(ysim**2, _mse_);

   /* Use scores of the GARCH model as moment conditions */
   eq.m1 = (-1 + ysim**2/gsigmasq)/ gsigmasq;
   eq.m2 = (-1 + ysim**2/gsigmasq)*xlag(gsigmasq, _mse_) / gsigmasq;
   eq.m3 = (-1 + ysim**2/gsigmasq)*xlag(ysim**2, _mse_) / gsigmasq;

   /* Fit scores using SMM and estimated Vhat */
   fit m1 m2 m3 / gmm npreobs=10 ndraw=1  /* smm options */
                  vdata=vhat              /* use estimated Vhat */
                  kernel=(bart,0,)        /* turn smoothing off */;
   bounds s > 0, 1>b>0;
run;
Nonlinear GMM Parameter Estimates

Parameter  Approx Std Err  Approx Pr > |t|
a                  0.0160           <.0001
b                 0.00217           <.0001
s                 0.00674           <.0001
You can also obtain the plots in the diagnostics panel as separate graphs by specifying the PLOTS(UNPACK) option. These plots are displayed in Output 18.20.3 through Output 18.20.10.
title1 'Unpacked Graphical Output from PROC MODEL';

proc model data=sashelp.citimon plots(unpack);
   lhur = 1/(a * ip + b) + c;
   fit lhur;
   id date;
run;
References
Aiken, R.C., ed. (1985), Stiff Computation, New York: Oxford University Press.
Amemiya, T. (1974), "The Nonlinear Two-Stage Least-Squares Estimator," Journal of Econometrics, 2, 105-110.
Amemiya, T. (1977), "The Maximum Likelihood Estimator and the Nonlinear Three-Stage Least Squares Estimator in the General Nonlinear Simultaneous Equation Model," Econometrica, 45 (4), 955-968.
Amemiya, T. (1985), Advanced Econometrics, Cambridge, MA: Harvard University Press.
Andersen, T.G., Chung, H-J., and Sorensen, B.E. (1999), "Efficient Method of Moments Estimation of a Stochastic Volatility Model: A Monte Carlo Study," Journal of Econometrics, 91, 61-87.
Andersen, T.G. and Sorensen, B.E. (1996), "GMM Estimation of a Stochastic Volatility Model: A Monte Carlo Study," Journal of Business and Economic Statistics, 14, 328-352.
Andrews, D.W.K. (1991), "Heteroscedasticity and Autocorrelation Consistent Covariance Matrix Estimation," Econometrica, 59 (3), 817-858.
Andrews, D.W.K. and Monahan, J.C. (1992), "Improved Heteroscedasticity and Autocorrelation Consistent Covariance Matrix Estimator," Econometrica, 60 (4), 953-966.
Bansal, R., Gallant, A.R., Hussey, R., and Tauchen, G.E. (1993), "Computational Aspects of Nonparametric Simulation Estimation," Belsey, D.A., ed., Computational Techniques for Econometrics and Economic Analysis, Boston, MA: Kluwer Academic Publishers, 3-22.
Bansal, R., Gallant, A.R., Hussey, R., and Tauchen, G.E. (1995), "Nonparametric Estimation of Structural Models for High-Frequency Currency Market Data," Journal of Econometrics, 66, 251-287.
Bard, Yonathan (1974), Nonlinear Parameter Estimation, New York: Academic Press.
Bates, D.M. and Watts, D.G. (1981), "A Relative Offset Orthogonality Convergence Criterion for Nonlinear Least Squares," Technometrics, 23 (2), 179-183.
Belsley, D.A., Kuh, E., and Welsch, R.E. (1980), Regression Diagnostics, New York: John Wiley & Sons.
Binkley, J.K. and Nelson, G. (1984), "Impact of Alternative Degrees of Freedom Corrections in Two and Three Stage Least Squares," Journal of Econometrics, 24 (3), 223-233.
Bowden, R.J. and Turkington, D.A. (1984), Instrumental Variables, Cambridge: Cambridge University Press.
Bratley, P., Fox, B.L., and Niederreiter, H. (1992), "Implementation and Tests of Low-Discrepancy Sequences," ACM Transactions on Modeling and Computer Simulation, 2 (3), 195-213.
Breusch, T.S. and Pagan, A.R. (1979), "A Simple Test for Heteroscedasticity and Random Coefficient Variation," Econometrica, 47 (5), 1287-1294.
Breusch, T.S. and Pagan, A.R. (1980), "The Lagrange Multiplier Test and Its Applications to Model Specification in Econometrics," Review of Economic Studies, 47, 239-253.
Byrne, G.D. and Hindmarsh, A.C. (1975), "A Polyalgorithm for the Numerical Solution of ODEs," ACM TOMS, 1 (1), 71-96.
Calzolari, G. and Panattoni, L. (1988), "Alternative Estimators of FIML Covariance Matrix: A Monte Carlo Study," Econometrica, 56 (3), 701-714.
Chan, K.C., Karolyi, G.A., Longstaff, F.A., and Sanders, A.B. (1992), "An Empirical Comparison of Alternate Models of the Short-Term Interest Rate," The Journal of Finance, 47 (3), 1209-1227.
Christensen, L.R., Jorgenson, D.W., and Lau, L.J. (1975), "Transcendental Logarithmic Utility Functions," American Economic Review, 65, 367-383.
Dagenais, M.G. (1978), "The Computation of FIML Estimates as Iterative Generalized Least Squares Estimates in Linear and Nonlinear Simultaneous Equation Models," Econometrica, 46, 6, 1351-1362.
Davidian, M. and Giltinan, D.M. (1995), Nonlinear Models for Repeated Measurement Data, London: Chapman & Hall.
Davidson, R. and MacKinnon, J.G. (1993), Estimation and Inference in Econometrics, New York: Oxford University Press.
Duffie, D. and Singleton, K.J. (1993), "Simulated Moments Estimation of Markov Models of Asset Prices," Econometrica, 61, 929-952.
Fair, R.C. (1984), Specification, Estimation, and Analysis of Macroeconometric Models, Cambridge: Harvard University Press.
Ferson, Wayne E. and Foerster, Stephen R. (1993), "Finite Sample Properties of the Generalized Method of Moments in Tests of Conditional Asset Pricing Models," Working Paper No. 77, University of Washington.
Fox, B.L. (1986), "Algorithm 647: Implementation and Relative Efficiency of Quasirandom Sequence Generators," ACM Transactions on Mathematical Software, 12 (4), 362-376.
Gallant, A.R. (1977), "Three-Stage Least Squares Estimation for a System of Simultaneous, Nonlinear, Implicit Equations," Journal of Econometrics, 5, 71-88.
Gallant, A.R. (1987), Nonlinear Statistical Models, New York: John Wiley and Sons.
Gallant, A.R. and Holly, A. (1980), "Statistical Inference in an Implicit, Nonlinear, Simultaneous Equation Model in the Context of Maximum Likelihood Estimation," Econometrica, 48 (3), 697-720.
Gallant, A.R. and Jorgenson, D.W. (1979), "Statistical Inference for a System of Simultaneous, Nonlinear, Implicit Equations in the Context of Instrumental Variables Estimation," Journal of Econometrics, 11, 275-302.
Gallant, A.R. and Long, J. (1997), "Estimating Stochastic Differential Equations Efficiently by Minimum Chi-squared," Biometrika, 84, 125-141.
Gallant, A.R. and Tauchen, G.E. (2001), "Efficient Method of Moments," Working Paper, [http://www.econ.duke.edu/ get/wpapers/ee.pdf] accessed 12 September 2001.
Gill, P.E., Murray, W., and Wright, M.H. (1981), Practical Optimization, New York: Academic Press.
Godfrey, L.G. (1978a), "Testing Against General Autoregressive and Moving Average Error Models when the Regressors Include Lagged Dependent Variables," Econometrica, 46, 1293-1301.
Godfrey, L.G. (1978b), "Testing for Higher Order Serial Correlation in Regression Equations when the Regressors Include Lagged Dependent Variables," Econometrica, 46, 1303-1310.
Goldfeld, S.M. and Quandt, R.E. (1972), Nonlinear Methods in Econometrics, Amsterdam: North-Holland Publishing Company.
Goldfeld, S.M. and Quandt, R.E. (1973), "A Markov Model for Switching Regressions," Journal of Econometrics, 3-16.
Goldfeld, S.M. and Quandt, R.E. (1973), "The Estimation of Structural Shifts by Switching Regressions," Annals of Economic and Social Measurement, 2/4.
Goldfeld, S.M. and Quandt, R.E. (1976), Studies in Nonlinear Estimation, Cambridge, MA: Ballinger Publishing Company.
Goodnight, J.H. (1979), "A Tutorial on the SWEEP Operator," The American Statistician, 33, 149-158.
Gourieroux, C. and Monfort, A. (1993), "Simulation Based Inference: A Survey with Special Reference to Panel Data Models," Journal of Econometrics, 59, 5-33.
Greene, William H. (1993), Econometric Analysis, New York: Macmillan Publishing Company.
Greene, William H. (2004), Econometric Analysis, New York: Macmillan Publishing Company.
Gregory, A.W. and Veall, M.R. (1985), "On Formulating Wald Tests for Nonlinear Restrictions," Econometrica, 53, 1465-1468.
Grunfeld, Y. and Griliches, "Is Aggregation Necessarily Bad?" Review of Economics and Statistics, February 1960, 113-134.
Hansen, L.P. (1982), "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 50 (4), 1029-1054.
Hansen, L.P. (1985), "A Method for Calculating Bounds on the Asymptotic Covariance Matrices of Generalized Method of Moments Estimators," Journal of Econometrics, 30, 203-238.
Hatanaka, M. (1978), "On the Efficient Estimation Methods for the Macro-Economic Models Nonlinear in Variables," Journal of Econometrics, 8, 323-356.
Hausman, J.A. (1978), "Specification Tests in Econometrics," Econometrica, 46 (6), 1251-1271.
Hausman, J.A. and Taylor, W.E. (1982), "A Generalized Specification Test," Economics Letters, 8, 239-245.
Henze, N. and Zirkler, B. (1990), "A Class of Invariant Consistent Tests for Multivariate Normality," Communications in Statistics-Theory and Methods, 19 (10), 3595-3617.
Johnston, J. (1984), Econometric Methods, Third Edition, New York: McGraw-Hill.
Jorgenson, D.W. and Laffont, J. (1974), "Efficient Estimation of Nonlinear Simultaneous Equations with Additive Disturbances," Annals of Social and Economic Measurement, 3, 615-640.
Joy, C., Boyle, P.P., and Tan, K.S. (1996), "Quasi-Monte Carlo Methods in Numerical Finance," Management Science, 42 (6), 926-938.
LaMotte, L.R. (1994), "A Note on the Role of Independence in t Statistics Constructed From Linear Statistics in Regression Models," The American Statistician, 48 (3), 238-239.
Lee, B. and Ingram, B. (1991), "Simulation Estimation of Time Series Models," Journal of Econometrics, 47, 197-205.
MacKinnon, J.G. and White, H. (1985), "Some Heteroskedasticity Consistent Covariance Matrix Estimators with Improved Finite Sample Properties," Journal of Econometrics, 29, 305-325.
Maddala, G.S. (1977), Econometrics, New York: McGraw-Hill.
Mardia, K.V. (1974), "Applications of Some Measures of Multivariate Skewness and Kurtosis in Testing Normality and Robustness Studies," The Indian Journal of Statistics, 36 (B) pt. 2, 115-128.
Mardia, K.V. (1970), "Measures of Multivariate Skewness and Kurtosis with Applications," Biometrika, 57 (3), 519-530.
Matis, J.H., Miller, T.H., and Allen, D.M. (1991), Metal Ecotoxicology Concepts and Applications, ed. M.C. Newman and A.W. McIntosh, Chelsea, MI: Lewis Publishers.
Matsumoto, M. and Nishimura, T. (1998), "Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator," ACM Transactions on Modeling and Computer Simulation, 8, 3-30.
McFadden, D. (1989), "A Method of Simulated Moments for Estimation of Discrete Response Models without Numerical Integration," Econometrica, 57, 995-1026.
McNeil, A.J., Frey, R., and Embrechts, P. (2005), Quantitative Risk Management: Concepts, Techniques and Tools, Princeton Series in Finance, Princeton University Press.
Messer, K. and White, H. (1994), "A Note on Computing the Heteroskedasticity Consistent Covariance Matrix Using Instrumental Variable Techniques," Oxford Bulletin of Economics and Statistics, 46, 181-184.
Mikhail, W.M. (1975), "A Comparative Monte Carlo Study of the Properties of Economic Estimators," Journal of the American Statistical Association, 70, 94-104.
Miller, D.M. (1984), "Reducing Transformation Bias in Curve Fitting," The American Statistician, 38 (2), 124-126.
Morelock, M.M., Pargellis, C.A., Graham, E.T., Lamarre, D., and Jung, G. (1995), "Time-Resolved Ligand Exchange Reactions: Kinetic Models for Competitive Inhibitors with Recombinant Human Renin," Journal of Medical Chemistry, 38, 1751-1761.
Nelsen, Roger B. (1999), Introduction to Copulas, New York: Springer-Verlag.
Newey, W.K. and West, D.W. (1987), "A Simple, Positive Semi-Definite, Heteroscedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703-708.
Noble, B. and Daniel, J.W. (1977), Applied Linear Algebra, Englewood Cliffs, NJ: Prentice-Hall.
Ortega, J.M. and Rheinbolt, W.C. (1970), Iterative Solution of Nonlinear Equations in Several Variables, Burlington, MA: Academic Press.
Pakes, A. and Pollard, D. (1989), "Simulation and the Asymptotics of Optimization Estimators," Econometrica, 57, 1027-1057.
Parzen, E. (1957), "On Consistent Estimates of the Spectrum of a Stationary Time Series," Annals of Mathematical Statistics, 28, 329-348.
Pearlman, J.G. (1980), "An Algorithm for Exact Likelihood of a High-Order Autoregressive-Moving Average Process," Biometrika, 67 (1), 232-233.
Petzold, L.R. (1982), "Differential/Algebraic Equations Are Not ODEs," SIAM Journal on Scientific and Statistical Computing, 3, 367-384.
Phillips, C.B. and Park, J.Y. (1988), "On Formulating Wald Tests of Nonlinear Restrictions," Econometrica, 56, 1065-1083.
Pindyck, R.S. and Rubinfeld, D.L. (1981), Econometric Models and Economic Forecasts, Second Edition, New York: McGraw-Hill.
Savin, N.E. and White, K.J. (1978), "Testing for Autocorrelation with Missing Observations," Econometrica, 46, 59-67.
Sobol, I.M. (1994), A Primer for the Monte Carlo Method, Boca Raton, FL: CRC Press.
Srivastava, V. and Giles, D.E.A. (1987), Seemingly Unrelated Regression Equation Models, New York: Marcel Dekker.
Theil, H. (1971), Principles of Econometrics, New York: John Wiley & Sons.
Thursby, J. (1982), "Misspecification, Heteroscedasticity, and the Chow and Goldfeld-Quandt Test," Review of Economics and Statistics, 64, 314-321.
Venzon, D.J. and Moolgavkar, S.H. (1988), "A Method for Computing Profile-Likelihood Based Confidence Intervals," Applied Statistics, 37, 87-94.
White, Halbert (1980), "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica, 48 (4), 817-838.
Wu, D.M. (July 1973), "Alternative Tests of Independence Between Stochastic Regressors and Disturbances," Econometrica, 41 (4), 733-750.
1260
Chapter 19
      Dynamic Panel Estimator  1304
      Specification Testing For Dynamic Panel  1310
      Instrument Choice  1311
      Linear Hypothesis Testing  1312
      Heteroscedasticity-Corrected Covariance Matrices  1313
      R-Square  1316
      Specification Tests  1317
      Troubleshooting  1318
      ODS Graphics  1320
      The OUTPUT OUT= Data Set  1320
      The OUTEST= Data Set  1321
      The OUTTRANS= Data Set  1322
      Printed Output  1323
      ODS Table Names  1323
   Example: PANEL Procedure  1324
      Example 19.1: Analyzing Demand for Liquid Assets  1324
      Example 19.2: The Airline Cost Data: Fixtwo Model  1329
      ODS Graphics Plots  1333
      Example 19.3: The Airline Cost Data: Further Analysis  1335
      Example 19.4: The Airline Cost Data: Random-Effects Models  1337
      Example 19.5: Using the FLATDATA Statement  1340
      Example 19.6: The Cigarette Sales Data: Dynamic Panel Estimation with GMM  1343
   References: PANEL Procedure  1345
autoregressive models
moving-average models
If the specification is dependent only on the cross section to which the observation belongs, such a model is referred to as a one-way model. A specification that depends on both the cross section and the time period to which the observation belongs is called a two-way model.
Apart from the possible one-way or two-way nature of the effect, the other dimension of difference between the possible specifications is the nature of the cross-sectional or time-series effect. The models are referred to as fixed-effects models if the effects are nonrandom and as random-effects models otherwise.
If the effects are fixed, the models are essentially regression models with dummy variables that correspond to the specified effects. For fixed-effects models, ordinary least squares (OLS) estimation is the best linear unbiased estimator.
Random-effects models use a two-stage approach. In the first stage, variance components are calculated by using methods described by Fuller and Battese, Wansbeek and Kapteyn, Wallace and Hussain, or Nerlove. In the second stage, variance components are used to standardize the data, and ordinary least squares (OLS) regression is performed.
There are two types of models in the PANEL procedure that accommodate an autoregressive structure. The Parks method is used to estimate a first-order autoregressive model with contemporaneous correlation. The dynamic panel estimator is used to estimate an autoregressive model with a lagged dependent variable.
The Da Silva method is used to estimate a mixed variance-component moving-average error process. The regression parameters are estimated by using a two-step generalized least squares (GLS)-type estimator.
The new PANEL procedure enhances the features that were implemented in the TSCSREG procedure. The following list shows the most important additions.
   New estimation methods include between estimators, pooled estimators, and dynamic panel estimators that use the generalized method of moments (GMM). The variance components for random-effects models can be calculated for both balanced and unbalanced panels by using the methods described by Fuller and Battese, Wansbeek and Kapteyn, Wallace and Hussain, or Nerlove.
   The CLASS statement creates classification variables that are used in the analysis.
   The TEST statement includes new options for Wald, Lagrange multiplier, and likelihood ratio tests.
   The new RESTRICT statement specifies linear restrictions on the parameters.
   The FLATDATA statement enables the data to be in a compressed form.
   Several methods that produce heteroscedasticity-consistent covariance matrices (HCCME) are added because the presence of heteroscedasticity can result in inefficient and biased estimates of the variance covariance matrix in the OLS framework.
   The LAG statement makes the creation of lagged values easy. Typically, it is difficult to create lagged variables in the panel setting: if lagged variables are created in a DATA step, several programming steps that include loops are often needed. Lagging can generate a large number of missing values, depending on the lag order. The LAG statement leaves these values missing; they can be replaced with zeros, the overall mean, the time mean, or the cross section mean by using the ZLAG, XLAG, SLAG, and CLAG statements, respectively.
   ODS Graphics plots can now be produced by the PANEL procedure. The new plots include residual, predicted, and actual value plots, Q-Q plots, histograms, and profile plots.
   The OUTPUT statement enables you to output data and estimates that can be used in other analyses.
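For comparison, the following is a minimal DATA step sketch of a first-order lag that does not cross a cross-section boundary. It is an illustration only: the data set A, the ID variables i and t, and the output names A_MANUAL and X1_LAG1 are assumed names, not part of the procedure. Higher lag orders need extra bookkeeping per cross section, which is the work that the LAG statement is meant to save.

```sas
/* Sketch only: A, i, t, and X1 are assumed names */
proc sort data=A;
   by i t;
run;

data A_manual;
   set A;
   by i;
   /* LAG must execute on every observation so that its queue stays in step */
   X1_lag1 = lag(X1);
   /* The first period of each cross section has no predecessor */
   if first.i then X1_lag1 = .;
run;
```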
The next step is to invoke the PANEL procedure and specify the cross section and time series variables in an ID statement. The following statements show the correct syntax:

   proc panel data=a;
      id state date;
      model y = x1 x2;
   run;
Alternatively, PROC PANEL has the capability to read flat data. Say that you are using the data set A, which has observations on states. Specifically, the data are composed of observations on Y, X1, and X2. Unlike the previous case, the data are not recorded with a PROC PANEL structure. Instead, you have all of a state's information on a single row. You have variables to denote the name of the state (say state). The time observations for the Y variable are recorded horizontally, so the variable Y_1 is the first period's time observation, and Y_10 is the tenth period's observation for some state. The same holds for the other variables. You have variables X1_1 to X1_10, X2_1 to X2_10, and X3_1 to X3_10 for others. With such data, PROC PANEL could be called by using the following syntax:

   proc panel data=a;
      flatdata indid = state base = (Y X1 X2) tsname = t;
      id state t;
      model Y = X1 X2;
   run;
See "FLATDATA Statement" on page 1272 and Example 19.5 for more information about the use of the FLATDATA statement.
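The reshaping that FLATDATA performs can be pictured with an ordinary DATA step. The following sketch is not how PROC PANEL implements the conversion; it assumes exactly ten time periods, the flat variable names described above, and a hypothetical output data set name A_LONG.

```sas
/* Sketch only: converts one row per state into one row per state and period */
data a_long;
   set a;
   array ys{10}  Y_1  - Y_10;
   array x1s{10} X1_1 - X1_10;
   array x2s{10} X2_1 - X2_10;
   do t = 1 to 10;
      Y  = ys{t};
      X1 = x1s{t};
      X2 = x2s{t};
      output;          /* one observation per time period */
   end;
   keep state t Y X1 X2;
run;

/* PROC PANEL expects the data sorted by cross section and time */
proc sort data=a_long;
   by state t;
run;
```

The resulting data set has the same layout as the data used in the first example, so it could be analyzed with an ID statement that names state and t.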
The major advantage of using PROC PANEL is that you can incorporate a model for the structure of the random errors. It is important to consider what kind of error structure model is appropriate for your data and to specify the corresponding option in the MODEL statement. The error structure options supported by the PANEL procedure are FIXONE, FIXONETIME, FIXTWO, RANONE, RANTWO, PARKS, DASILVA, GMM, and ITGMM (iterated GMM). See the section "Details: PANEL Procedure" on page 1282 for more information about these methods and the error structures they assume. The following statements fit a Fuller-Battese one-way random-effects model.

   proc panel data=a;
      id state date;
      model y = x1 x2 / ranone vcomp=fb;
   run;

You can specify more than one error structure option in the MODEL statement; the analysis is repeated using each specified method. You can use any number of MODEL statements to estimate different regression models or estimate the same model by using different options. See Example 19.1 for more information.
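As a brief sketch of both possibilities, the following statements request two error structures in one MODEL statement and add a second, labeled MODEL statement. The data set and variable names follow the earlier examples; the label TWOWAY is an arbitrary name chosen for illustration.

```sas
proc panel data=a;
   id state date;
   /* Two error structure options: the analysis is repeated for each */
   model y = x1 x2 / fixone ranone;
   /* A second MODEL statement with its own label and specification */
   twoway: model y = x1 / fixtwo;
run;
```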
In order to aid in model specification within this class of models, the procedure provides two specification test statistics. The first is an F statistic that tests the null hypothesis that the fixed-effects parameters are all zero. The second is a Hausman m statistic that provides information about the appropriateness of the random-effects specification. The m statistic is based on the idea that, under the null hypothesis of no correlation between the effects variables and the regressors, OLS and GLS are consistent, but OLS is inefficient. Hence, a test can be based on the result that the covariance of an efficient estimator with its difference from an inefficient estimator is zero. Rejection of the null hypothesis might suggest that the fixed-effects model is more appropriate.
The procedure also provides the Buse R-square measure. This number is interpreted as a measure of the proportion of the transformed sum of squares of the dependent variable that is attributable to the influence of the independent variables. In the case of OLS estimation, the Buse R-square measure is equivalent to the usual R-square measure.
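The zero-covariance result can be turned into a statistic as follows. This is the usual textbook form, given as background rather than as a statement of the exact computation in PROC PANEL. The symbols are assumptions of this sketch: $\hat\beta_1$ is the consistent but inefficient estimator, $\hat\beta_0$ is the estimator that is efficient under the null hypothesis, $\hat S_1$ and $\hat S_0$ are their estimated covariance matrices, and $k$ is the number of coefficients compared.

```latex
\hat{q} = \hat\beta_1 - \hat\beta_0
% Because Cov(\hat\beta_0, \hat q) = 0 under the null hypothesis,
% Var(\hat\beta_1) = Var(\hat\beta_0) + Var(\hat q), so
\operatorname{Var}(\hat q) = \operatorname{Var}(\hat\beta_1) - \operatorname{Var}(\hat\beta_0)
% which gives the quadratic form
m = \hat{q}^{\,\prime} \left( \hat S_1 - \hat S_0 \right)^{-1} \hat{q}
\;\overset{a}{\sim}\; \chi^2_k \quad \text{under the null hypothesis}
```

A large value of m relative to the chi-square distribution is evidence against the random-effects specification.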
Unbalanced Data
In the case of fixed-effects models, random-effects models, between estimators, and dynamic panel estimators, the PANEL procedure can process data with different numbers of time series observations across different cross sections. The Parks and Da Silva methods cannot be used with unbalanced data. The missing time series observations are recognized by the absence of time series ID variable values in some of the cross sections in the input data set. Moreover, if an observation with a particular time series ID value and cross-sectional ID value is present in the input data set, but one or more of the model variables are missing, that time series point is treated as missing for that cross section.
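Before choosing the Parks or Da Silva method, it can be useful to check whether the panel is balanced. The following sketch counts rows per cross section; it assumes the data set A and the ID variable STATE from the earlier examples, and OBS_PER_CS is a hypothetical output name. It counts rows only, so observations that are present but have missing model variables are not detected by this check.

```sas
/* Count time series observations per cross section */
proc freq data=a noprint;
   tables state / out=obs_per_cs;   /* OUT= data set contains COUNT */
run;

/* Equal minimum and maximum counts indicate equal series lengths */
proc means data=obs_per_cs min max;
   var count;
run;
```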
Introductory Example
The following statements use the cost function data from Greene (1990) to estimate the variance components model. The variable PRODUCTION is the log of output in millions of kilowatt-hours, and COST is the log of cost in millions of dollars. Refer to Greene (1990) for details.

   data greene;
      input firm year production cost @@;
   datalines;
   1 1955   5.36598  1.14867  1 1960
   1 1965   6.37673  1.52257  1 1970
   2 1955   6.54535  1.35041  2 1960
   2 1965   7.40245  2.09519  2 1970
      ... more lines ...

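You decide to fit the following model to the data:

$$ C_{it} = \text{Intercept} + \beta P_{it} + v_i + e_t + \epsilon_{it}, \qquad i = 1, \ldots, N; \quad t = 1, \ldots, T $$

where $C_{it}$ and $P_{it}$ represent the cost and production, and $v_i$, $e_t$, and $\epsilon_{it}$ are the cross-sectional, time series, and error variance components.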
If you assume that the time and cross-sectional effects are random, you are left with four possible estimators for the variance components. You choose Fuller-Battese. The following statements fit this model.

   proc sort data=greene;
      by firm year;
   run;

   proc panel data=greene;
      model cost = production / rantwo vcomp = fb;
      id firm year;
   run;

The PANEL procedure output is shown in Figure 19.1. A model description is printed first, which reports the estimation method used and the number of cross sections and time periods. The variance components estimates are printed next. Finally, the table of regression parameter estimates shows the estimates, standard errors, and t tests.
Figure 19.1 The Variance Components Estimates

   The PANEL Procedure
   Fuller and Battese Variance Components (RanTwo)

   Dependent Variable: cost

   Model Description
   Estimation Method               RanTwo
   Number of Cross Sections             6
   Time Series Length                   4

   Fit Statistics
   SSE          0.3481    DFE              22
   MSE          0.0158    Root MSE     0.1258
   R-Square     0.8136

   Variance Component Estimates
   Variance Component for Cross Sections    0.046907
   Variance Component for Time Series        0.00906
   Variance Component for Error             0.008749

   Hausman Test for Random Effects
   DF    m Value    Pr > m
    1      26.46    <.0001

   Parameter Estimates
   DF
    1
    1
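Because four variance component estimators are available, it can be informative to see how sensitive the results are to that choice. The following sketch refits the same two-way random-effects model with each VCOMP= value; it assumes the GREENE data set has already been sorted by FIRM and YEAR, and the model labels are arbitrary names used only to tell the outputs apart.

```sas
proc panel data=greene;
   id firm year;
   fb: model cost = production / rantwo vcomp=fb;   /* Fuller and Battese */
   wk: model cost = production / rantwo vcomp=wk;   /* Wansbeek and Kapteyn */
   wh: model cost = production / rantwo vcomp=wh;   /* Wallace and Hussain */
   nl: model cost = production / rantwo vcomp=nl;   /* Nerlove */
run;
```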
Functional Summary
The statements and options used with the PANEL procedure are summarized in the following table.
Description                                                                   Statement     Option

Data Set Options
includes correlations in the OUTEST= data set                                 PANEL         CORROUT
includes covariances in the OUTEST= data set                                  PANEL         COVOUT
specifies the input data set                                                  PANEL         DATA=
specifies variables to keep but not transform                                 FLATDATA      KEEP=
specifies the output data set for the CLASS statement                         CLASS         OUT=
specifies the output data set                                                 FLATDATA      OUT=
specifies the name of an output SAS data set                                  OUTPUT        OUT=
writes parameter estimates to an output data set                              PANEL         OUTEST=
writes the transformed series to an output data set                           PANEL         OUTTRANS=
requests that the procedure produce graphics via the Output Delivery System   PANEL         PLOTS=

Declaring the Role of Variables
specifies BY-group processing                                                 BY
specifies the cross section and time ID variables                             ID
declares instrumental variables                                               INSTRUMENTS

Lag Generation
specifies output data set for lags                                            LAG           OUT=
specifies output data set for lags                                            CLAG          OUT=
specifies output data set for lags                                            SLAG          OUT=
specifies output data set for lags                                            XLAG          OUT=
specifies output data set for lags                                            ZLAG          OUT=

Printing Control Options
prints correlations of the estimates                                          MODEL         CORRB
prints covariances of the estimates                                           MODEL         COVB
suppresses printed output                                                     MODEL         NOPRINT
performs tests of linear hypotheses                                           TEST

Model Estimation Options
requests the Breusch-Pagan test for one-way random effects                    MODEL         BP
requests the Breusch-Pagan test for two-way random effects                    MODEL         BP2
specifies the between groups model                                            MODEL         BTWNG
specifies the between time periods model                                      MODEL         BTWNT
specifies the Da Silva method                                                 MODEL         DASILVA
specifies the one-way fixed-effects model                                     MODEL         FIXONE
specifies the one-way fixed-effects model with respect to time                MODEL         FIXONETIME
specifies the two-way fixed-effects model                                     MODEL         FIXTWO
specifies the Moore-Penrose generalized inverse                               MODEL         GINV=G4
specifies the dynamic panel estimator model                                   MODEL         GMM
requests the HCCME estimator for the variance covariance matrix               MODEL         HCCME=
specifies order of the moving-average error process for Da Silva method       MODEL         M=
suppresses the intercept term                                                 MODEL         NOINT
specifies the Parks method                                                    MODEL         PARKS
prints matrix for Parks method                                                MODEL         PHI
specifies the pooled model                                                    MODEL         POOLED
specifies the one-way random-effects model                                    MODEL         RANONE
specifies the two-way random-effects model                                    MODEL         RANTWO
prints autocorrelation coefficients for Parks method                          MODEL         RHO
controls check for singularity                                                MODEL         SINGULAR=
specifies the method for the variance components estimator                    MODEL         VCOMP=
specifies linear equality restrictions on the parameters                      RESTRICT
DATA=SAS-data-set
names the input data set. The input data set must be sorted by cross section and by time period within cross section. If you omit the DATA= option, the most recently created SAS data set is used.
OUTEST=SAS-data-set
names an output data set to contain the parameter estimates. When the OUTEST= option is not specified, the OUTEST= data set is not created. See the section "The OUTEST= Data Set" on page 1321 for details about the structure of the OUTEST= data set.
OUTTRANS=SAS-data-set
names an output data set to contain the transformed series for further analysis and computation of models with time observations greater than two. See the section "The OUTTRANS= Data Set" on page 1322 for details about the structure of the OUTTRANS= data set.
OUTCOV COVOUT
writes the covariance matrix of the parameter estimates to the OUTEST= data set. See the section "The OUTEST= Data Set" on page 1321 for details.
OUTCORR CORROUT
writes the correlation matrix of the parameter estimates to the OUTEST= data set. See the section "The OUTEST= Data Set" on page 1321 for details.
PLOTS=
requests that the PANEL procedure produce statistical graphics via the Output Delivery System, provided that the ODS GRAPHICS statement has been specified. For general information about ODS Graphics, see Chapter 21, "Statistical Graphics Using ODS" (SAS/STAT User's Guide). The global plot options apply to all relevant plots generated by the PANEL procedure. The global plot options supported by the PANEL procedure follow.
For more details, see the section "ODS Graphics" on page 1320.
In addition, any of the following MODEL statement options can be specified in the PROC PANEL statement: CORRB, COVB, FIXONE, FIXONETIME, FIXTWO, BTWNG, BTWNT, POOLED, RANONE, RANTWO, FULLER, PARKS, DASILVA, NOINT, NOPRINT, M=, PHI, RHO, VCOMP=, and SINGULAR=. When specified in the PROC PANEL statement, these options are equivalent to specifying the options for every MODEL statement. See the section "MODEL Statement" on page 1276 for a complete description of each of these options.
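The following sketch combines several of the PROC PANEL statement options described above. It assumes the data set A and variables from the earlier examples; EST is a hypothetical name for the OUTEST= data set. Because FIXONE is given in the PROC PANEL statement, it applies to both MODEL statements.

```sas
proc panel data=a outest=est covout fixone;
   id state date;
   model y = x1 x2;
   model y = x1;
run;

/* Inspect the parameter estimates and their covariance matrix */
proc print data=est;
run;
```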
BY Statement
BY variables ;
A BY statement can be used with PROC PANEL to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the input data set must be sorted by the BY variables as well as by cross section and time period within the BY groups. The following statements show an example:

   proc sort data=a;
      by byvar1 byvar2 csid tsid;
   run;

   proc panel data=a;
      by byvar1 byvar2;
      id csid tsid;
      ...
   run;

CLASS Statement
CLASS variables < / out= SAS-data-set > ;
The CLASS statement names the classification variables to be used in the analysis. Classification variables can be either character or numeric. In PROC PANEL, the CLASS statement enables you to output class variables to a data set that contains a copy of the original data.
FLATDATA Statement
FLATDATA options / out= SAS-data-set ;
BASE=(basename, basename, . . . , basename)
specifies the variables that are to be transformed into a proper PROC PANEL format. All variables to be transformed must be named according to the convention: basename_timeperiod. You supply just the basename, and the procedure extracts the appropriate variables to transform. If some year's data are missing for a variable, then PROC PANEL detects this and fills in with missing values.
INDID=variable
names the variable in the input data set that uniquely identifies each individual. The INDID variable can be a character or numeric variable.
KEEP=(variable, variable, . . . , variable)
specifies the variables that are to be copied without any transformation. These variables remain constant with respect to time when the data are converted to PROC PANEL format. This is an optional item.
TSNAME=name
specifies a name for the generated time identifier. The name must satisfy the requirements for the name of a SAS variable. The name can be quoted, but it must not be the name of a variable in the input data set.
The following options can be specified on the FLATDATA statement after the slash (/):
OUT=SAS-data-set
saves the converted flat data set to a PROC PANEL formatted data set.
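The following sketch puts the FLATDATA options together. It assumes the flat data set A described earlier; REGION is a hypothetical time-invariant variable that is carried along unchanged, and A_LONG is a hypothetical name for the converted data set.

```sas
proc panel data=a;
   /* REGION and A_LONG are assumed names used only for illustration */
   flatdata indid=state base=(Y X1 X2) keep=(region) tsname=t / out=a_long;
   id state t;
   model Y = X1 X2 / fixone;
run;
```

Saving the converted data with OUT= makes it possible to inspect the reshaped layout or to reuse it in later steps without repeating the conversion.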
ID Statement
ID cross-section-id time-series-id ;
The ID statement is used to specify variables in the input data set that identify the cross section and time period for each observation. When an ID statement is used, the PANEL procedure verifies that the input data set is sorted by the cross section ID variable and by the time series ID variable within each cross section. The PANEL procedure also verifies that the time series ID values are the same for all cross sections. To make sure the input data set is correctly sorted, use PROC SORT to sort the input data set with a BY statement with the variables listed exactly as they are listed in the ID statement, as shown in the following statements:

   proc sort data=a;
      by csid tsid;
   run;

   proc panel data=a;
      id csid tsid;
      ... etc. ...
   run;

INSTRUMENTS Statement
INSTRUMENTS options ;
The INSTRUMENTS statement denotes which variables are used in the moment condition equations of the dynamic panel estimator. The following options can be used with the INSTRUMENTS statement.
CONSTANT
DEPVAR
CORRELATED=(variable, variable, . . . , variable)
specifies a list of variables correlated with the error term. These variables are not used in forming moment conditions from level equations.
EXOGENOUS=(variable, variable, . . . , variable)
specifies a list of variables that are not correlated with the error term.
PREDETERMINED=(variable, variable, . . . , variable)
specifies a list of variables whose future realizations can be correlated with the error term but whose present and past realizations are not.
If a variable listed in the EXOGENOUS list is not included in the CORRELATED list, then it is considered to be uncorrelated to the error term. For example, in the following statements, the exogenous instruments are Z1, Z2, and X1. Z1 is an instrument that is correlated to the individual fixed effects.

   proc panel data=a;
      inst exogenous=(Z1 Z2 X1) correlated = (Z1) constant depvar;
      model Y = X1 X2 X3 / gmm;
   run;

For a detailed discussion of the model setup and the use of the INSTRUMENTS statement, see "Dynamic Panel Estimator" on page 1304. Note that for each MODEL statement, one INSTRUMENTS statement is required. In other words, if there are two models to be estimated by using GMM within one PANEL procedure, then there should be two INSTRUMENTS statements. For example,

   proc panel data=test;
      inst depvar pred=(x1 x2) exog=(x3 x4 x5) correlated=(x3 x4 x5);
      model y = y_1 x1 x2 / gmm maxband=6 nolevels ginv=g4 artest=5;
      inst pred=(x2 x4) exog=(x3 x5) correlated=(x3 x4);
      model y = y_1 x2 / gmm maxband=6 nolevels ginv=g4 artest=5;
      id cs ts;
   run;

LAG Statement
Generally, creating lags of variables in a panel setting is a tedious process in which you must write many DATA step statements. The PANEL procedure now enables you to generate lags of any series without jumping across any individual's boundary. The LAG statement is a data set generation tool. Using the data created by a LAG statement requires a subsequent PROC PANEL call. There can be more than one LAG statement in each call to PROC PANEL.
The LAG statement tends to generate many missing values in the data. This can be problematic, because the number of usable observations diminishes with the lag length. Therefore, PROC PANEL offers alternatives to the LAG statement.
The OUT= option must be specified on the LAG statement. The output data set includes all variables in the input set, plus the lags that are denoted with the convention originalvar_lag. The following statements can be used instead of LAG with otherwise identical syntax:
CLAG
replaces missing values with the cross section mean for that variable in that cross section. Note that missing values are replaced only if they are in the generated (lagged) series. Missing values in the original variables are not changed.
SLAG
replaces missing values with the time mean for that variable in that time period. Note that missing values are replaced only if they are in the generated (lagged) series. Missing values in the original variables are not changed.
XLAG
replaces missing values with the overall mean for that variable. Note that missing values are replaced only if they are in the generated (lagged) series. Missing values in the original variables are not changed.
ZLAG
replaces missing values with zero for that variable. Note that missing values are replaced only if they are in the generated (lagged) series. Missing values in the original variables are not changed.
Assume that data set A has been sorted by cross section and time period within cross section (or FLATDATA has been specified), and the variables are Y, X1, X2, and X3. You want lags 1 and 3 of the X1 variable; lags 3, 6, and 9 of X2; and lag 2 of X3. The following PROC PANEL statements generate a data set with those characteristics.

   proc panel data=A;
      id i t;
      lag X1(1 3) X2(3 6 9) X3(2) / out=A_lag;
   run;

If you want zeros instead of missing values, then you specify the following:

   proc panel data=A;
      id i t;
      zlag X1(1 3) X2(3 6 9) X3(2) / out=A_zlag;
   run;

Similarly, you can specify XLAG to replace with overall means, SLAG to replace with time means, and CLAG to replace with cross section means.
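Because the lag statements only generate a data set, a second PROC PANEL call is needed to estimate a model that uses the lags. The following sketch shows the two steps. It assumes that the lag-1 series of X1 is named X1_1 under the originalvar_lag convention; the PROC CONTENTS step is included so that the generated names can be confirmed before they are used. The output name A_CLAG and the choice of FIXONE are illustrative only.

```sas
/* Step 1: generate the lag, filling gaps with cross section means */
proc panel data=A;
   id i t;
   clag X1(1) / out=A_clag;
run;

/* Confirm the name of the generated variable (assumed here to be X1_1) */
proc contents data=A_clag;
run;

/* Step 2: use the generated series in a model */
proc panel data=A_clag;
   id i t;
   model Y = X1 X1_1 / fixone;
run;
```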
MODEL Statement
MODEL response = regressors < / options > ;
The MODEL statement specifies the regression model and the error structure assumed for the regression residuals. The response variable on the left side of the equal sign is regressed on the independent variables listed after the equal sign. Any number of MODEL statements can be used. For each MODEL statement, only one response variable can be specified on the left side of the equal sign.
The error structure is specified by the FULLER, PARKS, DASILVA, FIXONE, FIXONETIME, FIXTWO, RANONE, RANTWO, GMM, and ITGMM options. More than one of these options can be used, in which case the analysis is repeated for each error structure model specified.
Models can be given labels. Model labels are used in the printed output to identify the results for different models. If no label is specified, the response variable name is used as the label for the model. The model label is specified as follows:
label : MODEL . . . ;
The following options can be specified in the MODEL statement after a slash (/).
ARTEST=integer
specifies the maximum order of the test for the presence of AR effects in the residual. The acceptable range of values for this option is 1 to t - 3. See the section "Instrument Choice" on page 1311 for details.
ATOL=number
specifies the convergence criterion for iterated GMM when convergence of the method is determined by convergence in the weighting matrix. The convergence criterion must be positive. See the section "Dynamic Panel Estimator" on page 1304 for details.
BANDOPT=TRAILING | CENTERED | LEADING
specifies which observations are included in the instrument list when the MAXBAND option is employed. See the section "Dynamic Panel Estimator" on page 1304 for details.
BP
requests the Breusch-Pagan test for one-way random effects.
BTOL=number
specifies the convergence criterion for iterated GMM when convergence of the method is determined by convergence in the parameter matrix. The convergence criterion must be positive. The default is BTOL=1E-8. See the section "Dynamic Panel Estimator" on page 1304 for details.
BTWNG
specifies the between groups model.
BTWNT
specifies the between time periods model.
DASILVA
specifies that the model be estimated by using the Da Silva method, which assumes a mixed variance-component moving-average model for the error structure. See the section "Da Silva Method (Variance-Component Moving-Average Model)" on page 1302 for details.
FIXONE
specifies that a one-way fixed-effects model be estimated with the one-way model corresponding to cross-sectional effects only.
FIXONETIME
specifies that a one-way fixed-effects model be estimated with the one-way model corresponding to time effects only.
FIXTWO
specifies the two-way fixed-effects model.
FULLER
specifies that the model be estimated by using the Fuller-Battese method, which assumes a variance components model for the error structure. See the section "Fuller and Battese's Method" on page 1292 later in this chapter for details.
GINV= G2 | G4
specifies what type of generalized inverse to use. The default is a G2 inverse. The G4 inverse is generally more desirable, except that it is more numerically intensive.
GMM
specifies that the model be estimated by using the dynamic panel estimator method, which allows for autoregressive processes. This is the single-step model. Note that with this option one INSTRUMENTS statement is required for each MODEL statement. See the section "Dynamic Panel Estimator" on page 1304 for details.
HCCME=number
specifies the type of HCCME variance-covariance matrix requested. The expected range of values is NO or 0 to 4. The default is zero, no correction. See the section "Heteroscedasticity-Corrected Covariance Matrices" on page 1313 for details.
ITGMM
specifies that the model be estimated by using the dynamic panel estimator method, but that PROC PANEL keep updating the weighting matrix until either the parameter vector converges or the weighting matrix converges. See the section "Dynamic Panel Estimator" on page 1304 for details.
ITPRINT
prints the iteration history of the parameter estimates and the transformed sum of squared errors.
M=number
specifies the order of the moving-average process in the Da Silva method. The value of the M= option must be less than T - 1. The default is M=1.
MAXBAND=integer
specifies the maximum number of time periods (per instrumental variable) that are allowed into the moment condition. The acceptable range of values for this option is 1 to t - 1. The default value is MAXBAND=1 or MAXBAND=2, depending on the BANDOPT option. See the section "Dynamic Panel Estimator" on page 1304 for details.
MAXITER=integer
specifies the maximum number of iterations allowed for the iterated GMM option. The default value is MAXITER=200. See the section "Dynamic Panel Estimator" on page 1304 for details.
NODIFFS
specifies that the dynamic panel model be estimated without moment conditions from the difference equations. See the section "Dynamic Panel Estimator" on page 1304 for details.
NOESTIM
limits the estimation of a FIXONE, FIXONETIME, or RANONE model to the generation of the transformed series. This option is intended for use with an OUTTRANS= data set.
NOINT
suppresses the intercept term.
NOLEVELS
specifies that the dynamic panel model be estimated without moment conditions from the level equations. See the section "Dynamic Panel Estimator" on page 1304 for details.
NOPRINT
suppresses printed output.
PARKS
specifies that the model be estimated by using the Parks method, which assumes a first-order autoregressive model for the error structure. See the section "Parks Method (Autoregressive Model)" on page 1300 for details.
PHI
prints the matrix of estimated covariances of the observations for the Parks method. The PHI option is relevant only when the PARKS option is used. See the section "Parks Method (Autoregressive Model)" on page 1300 for details.
POOLED
specifies the pooled model.
RANONE
specifies the one-way random-effects model.
ROBUST
specifies that a robust weighting matrix be used in the calculation of the variance-covariance matrix of the single-step estimator. See the section "Dynamic Panel Estimator" on page 1304 for details.
SINGULAR=number
specifies a singularity criterion for the inversion of the matrix. The default depends on the precision of the computer system.
TIME
specifies that the model be estimated by using the dynamic panel estimator method, but that PROC PANEL include time dummy variables to model any time effects present in the data. See the section "Dynamic Panel Estimator" on page 1304 for details.
TWOSTEP
specifies that the model be estimated by using the dynamic panel estimator method, but that two steps be used in the estimation. An initial first step is used to form an estimator for the weighting matrix used in the second step. See the section "Dynamic Panel Estimator" on page 1304 for details.
VCOMP=FB | NL | WH | WK
specifies the type of variance component estimate to use. The default is VCOMP=FB for balanced data and VCOMP=WK for unbalanced data. See the section "The One-Way Random-Effects Model" on page 1292 for details.
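As an illustration of how the iterated GMM options fit together, the following sketch requests ITGMM with explicit iteration controls. The data set TEST and the variables follow the INSTRUMENTS statement example earlier in this chapter; the particular values of MAXITER=, ATOL=, and BTOL= are arbitrary choices for illustration, not recommendations.

```sas
proc panel data=test;
   id cs ts;
   /* One INSTRUMENTS statement is required for each GMM-type MODEL statement */
   inst depvar pred=(x1) exog=(x2);
   /* Iterate until the weighting matrix (ATOL=) or the parameters (BTOL=) converge */
   model y = y_1 x1 x2 / itgmm maxiter=500 atol=1e-8 btol=1e-8 itprint;
run;
```

ITPRINT is included so that the iteration history can be checked if the procedure stops at the MAXITER= limit.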
OUTPUT Statement
OUTPUT OUT=SAS-data-set < = options . . . > ;
The OUTPUT statement creates an output SAS data set as specified by the following options:
OUT=SAS-data-set
names the output SAS data set to contain the predicted and transformed values. If the OUT= option is not specied, the new data set is named according to the DATAn convention.
PREDICTED=name P=name
RESIDUAL=name R=name
writes the residuals from the predicted values based on both the structural and time series parts of the model to the output data set.
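A minimal sketch of the OUTPUT statement follows. It assumes the data set A and variables from the earlier examples; A_OUT, YHAT, and RESID are hypothetical names chosen for the output data set and the new variables.

```sas
proc panel data=a;
   id state date;
   model y = x1 x2 / ranone;
   /* Save predicted values and residuals for use in other analyses */
   output out=a_out predicted=yhat residual=resid;
run;

proc print data=a_out(obs=10);
run;
```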
RESTRICT Statement
RESTRICT "string" equation < ,equation2. . . > ;
The RESTRICT statement specifies linear equality restrictions on the parameters in the previous MODEL statement. There can be as many unique restrictions as the number of parameters in the preceding MODEL statement. Multiple RESTRICT statements are understood as joint restrictions on a model's parameters. Restrictions on the intercept are obtained by the use of the keyword INTERCEPT. Currently, only linear equality restrictions are permitted in PROC PANEL. Tests and restriction expressions can be composed only of algebraic operations that involve the addition symbol (+), subtraction symbol (-), and multiplication symbol (*).
The RESTRICT statement accepts labels that are produced in the printed output. A RESTRICT statement can be labeled in two ways. A RESTRICT statement can be preceded by a label followed by a colon. This is illustrated in rest1 in the example below. Alternatively, the keyword RESTRICT can be followed by a quoted string. The following statements illustrate the use of the RESTRICT statement:

   proc panel;
      model y = x1 x2 x3;
      restrict x1 = 0, x2 * .5 + 2 * x3 = 0;
      rest1: restrict x2 = 0, x3 = 0;
      restrict "rest2" intercept=1;
   run;

Note that a RESTRICT statement cannot include a division sign in its formulation.
TEST Statement
TEST "string" equation < ,equation2. . . >< / options > ;
The TEST statement performs Wald, Lagrange multiplier, and likelihood ratio tests of linear hypotheses about the regression parameters in the preceding MODEL statement. Each equation specifies a linear hypothesis to be tested. All hypotheses in one TEST statement are tested jointly. Variable names in the equations must correspond to regressors in the preceding MODEL statement, and each name represents the coefficient of the corresponding regressor. The keyword INTERCEPT refers to the coefficient of the intercept.
The following options can be specified on the TEST statement after the slash (/):
ALL
species the likelihood ratio test. The following statements illustrate the use of the TEST statement:
proc panel; id csid tsid; model y = x1 x2 x3; test x1 = 0, x2 * .5 + 2 * x3 = 0; test_int: test intercept = 0, x3 = 0; run;
The rst test investigates the joint hypothesis that 1 D 0 and :52 C 23 D 0 Currently, only linear equality restrictions and tests are permitted in PROC PANEL. Tests and restriction expressions can be composed only of algebraic operations that involve the addition symbol (+), subtraction symbol (), and multiplication symbol (*). The TEST statement accepts labels that are produced in the printed output. The TEST statement can be labeled in two ways. A TEST statement can be preceded by a label followed by a colon. Alternatively, the keyword TEST can be followed by a quoted string. If both are presented, PROC PANEL uses the quoted string. In the event no label is present, PROC PANEL automatically labels the tests. If both a TEST and a RESTRICT statement are specied, the test is run with restrictions applied. Note that for the DaSilva method, only the WALD test is available.
Missing Values
Any observation in the input data set with a missing value for one or more of the regressors is ignored by PROC PANEL and is not used in the model fit. If there are observations in the input data set with missing dependent variable values but with nonmissing regressors, PROC PANEL can compute predicted values and store them in an output data set by using the OUTPUT statement. Note that the presence of such observations with missing dependent variable values does not affect the model fit because these observations are excluded from the calculation. If either some regressors or the dependent variable values are missing, the model is estimated as unbalanced, where the number of time series observations across different cross sections does not have to be equal. The Parks and Da Silva methods cannot be used with unbalanced data.
Computational Resources
The more parameters there are to be estimated, the more memory and time are required to estimate the model. The estimation method and the method used to calculate variance components also affect these resources. If the model has p parameters including the intercept, at least $p + p(p+1)/2$ numbers are held in memory. If the Arellano and Bond GMM approach is used, the amount of memory grows in proportion to the number of instruments in the INSTRUMENT statement. If the ITGMM (iterated GMM) option is selected, the computation time also depends on the convergence criteria selected and the maximum number of iterations allowed.
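As a rough illustration of the lower bound quoted above, the following Python sketch evaluates $p + p(p+1)/2$. The function name is hypothetical, and the reading of the two terms (p estimates plus the distinct elements of a symmetric p-by-p matrix) is an assumption, not something the procedure documents.

```python
def min_stored_numbers(p: int) -> int:
    """Lower bound p + p(p+1)/2 on the numbers held in memory.

    Assumed interpretation: p parameter estimates plus the p(p+1)/2
    distinct elements of a symmetric p x p matrix.
    """
    return p + p * (p + 1) // 2

# Example: an intercept and three regressors (p = 4)
example_count = min_stored_numbers(4)
```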
Restricted Estimates
A consequence of estimating a linear model with a restriction is that the error degrees of freedom increase by the number of restrictions. PROC PANEL produces the Lagrange multiplier associated with each restriction. Suppose that you are interested in a linear regression in which there are r restrictions. A linear restriction implies the following set of equations that relate the regression coefficients:
\[
\begin{aligned}
R_{1,1}\beta_1 + R_{1,2}\beta_2 + \cdots + R_{1,p}\beta_p &= q_1 \\
R_{2,1}\beta_1 + R_{2,2}\beta_2 + \cdots + R_{2,p}\beta_p &= q_2 \\
&\;\;\vdots \\
R_{r,1}\beta_1 + R_{r,2}\beta_2 + \cdots + R_{r,p}\beta_p &= q_r
\end{aligned}
\]
To economize on notation, you can represent the restriction structure in matrix notation as $R\beta = q$. The restricted estimator is given by:
\[ \beta^{*} = \beta - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(R\beta - q) \]
where $\beta$ is the unrestricted estimator. The Lagrange multipliers are given by:
\[ \lambda^{*} = \left[R(X'X)^{-1}R'\right]^{-1}(R\beta - q) \]
The standard errors of the Lagrange multipliers are calculated from the following relationship:
\[ \mathrm{Var}(\lambda^{*}) = \left[R(X'X)^{-1}R'\right]^{-1} R\,\mathrm{Var}(\beta)\,R' \left[R(X'X)^{-1}R'\right]^{-1} \]
A significant Lagrange multiplier implies that you can reject the null hypothesis that the restrictions are not binding. Note that in the special case of the fixed-effects models, the NOINT option and the RESTRICT INTERCEPT=0 statement give different estimates. This is not an error; it reflects two perspectives on the same issue. In the FIXONE case, the intercept is the last cross section's fixed effect (or the last time effect in the case of FIXONETIME). Specifying the NOINT option removes the intercept but allows the last effect in. The NOINT option simply reclassifies the effects: the dummy variables become true cross section effects. If you specify the NOINT option with the FIXTWO option, the restriction is imposed that the last time effect is zero. A RESTRICT INTERCEPT=0 statement suppresses the estimation of the last effect in the FIXONE and FIXONETIME cases, and it has similar effects on the FIXTWO estimator. In general, restricting the intercept to zero is not recommended because OLS loses its unbiased nature.
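The restricted estimator and its Lagrange multiplier can be illustrated for the simplest case of a single restriction. The Python sketch below is not PROC PANEL's implementation; the function name is hypothetical, and it assumes that $(X'X)^{-1}$ has already been computed and that there is exactly one restriction $r'\beta = q$, so the bracketed matrix reduces to a scalar.

```python
def restricted_estimate(beta, xtx_inv, r, q):
    """Impose one linear restriction r'beta = q on an unrestricted estimate.

    beta    : unrestricted coefficients (list of floats)
    xtx_inv : (X'X)^{-1} as a list of rows (assumed already computed)
    r, q    : restriction vector and right-hand side
    Returns (restricted coefficients, Lagrange multiplier).
    """
    k = len(r)
    c_r = [sum(row[j] * r[j] for j in range(k)) for row in xtx_inv]  # (X'X)^{-1} R'
    denom = sum(r[j] * c_r[j] for j in range(k))                     # R (X'X)^{-1} R'
    gap = sum(r[j] * beta[j] for j in range(k)) - q                  # R beta - q
    lam = gap / denom                                                # Lagrange multiplier
    beta_star = [b - c * lam for b, c in zip(beta, c_r)]
    return beta_star, lam

# Restriction beta_1 - beta_2 = 0 with an identity (X'X)^{-1}
beta_star, lam = restricted_estimate([1.0, 3.0],
                                     [[1.0, 0.0], [0.0, 1.0]],
                                     [1.0, -1.0], 0.0)
```

In this toy case the two coefficients are pulled to their common value, and the multiplier is nonzero because the restriction binds.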
Notation
The following notation represents the usual panel structure, with the specification of $u_{it}$ dependent on the particular model:
\[ y_{it} = \sum_{k=1}^{K} x_{itk}\beta_k + u_{it}, \qquad i = 1,\ldots,N;\; t = 1,\ldots,T_i \]
The total number of observations is $M = \sum_{i=1}^{N} T_i$. For the balanced data case, $T_i = T$ for all i. The $M \times M$ covariance matrix of $u_{it}$ is denoted by V. Let X and y be the independent and dependent variables arranged by cross section and by time within each cross section. Let $X_s$ be the X matrix without the intercept. All other notation is specific to each section.
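The specification for the one-way fixed-effects model is
\[ u_{it} = \gamma_i + \epsilon_{it} \]
where the $\gamma_i$ are nonrandom parameters to be estimated.
The matrix $Q_0$ represents the within transformation. In the one-way model, the within transformation is the conversion of the raw data to deviations from a cross section's mean. The vector $\tilde{x}_{it}$ is a row of the general matrix $X_s$, where the subscripted s implies that the constant (column of ones) is missing. Let $\tilde{X}_s = Q_0 X_s$ and $\tilde{y} = Q_0 y$. The estimator of the slope coefficients is given by
\[ \tilde{\beta}_s = (\tilde{X}_s'\tilde{X}_s)^{-1}\tilde{X}_s'\tilde{y} \]
Once the slope estimates are in hand, the estimation of an intercept or the cross-sectional fixed effects is handled as follows. First, you obtain the cross-sectional effects:
\[ \gamma_i = \bar{y}_{i\cdot} - \tilde{\beta}_s\bar{x}_{i\cdot} \qquad \text{for } i = 1,\ldots,N \]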
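The two steps just described (slope from within-group deviations, then one effect per cross section) can be sketched in Python for the special case of a single regressor. The function name and the dictionary layout of the data are hypothetical choices made for illustration; the sketch also works for unbalanced groups because each group is demeaned separately.

```python
from statistics import fmean

def within_one_way(panel):
    """One-way within estimator for a single regressor.

    panel maps a cross-section id to a list of (x, y) observations.
    Returns (slope, {id: fixed effect}).
    """
    sxx = sxy = 0.0
    means = {}
    for i, obs in panel.items():
        xbar = fmean(x for x, _ in obs)
        ybar = fmean(y for _, y in obs)
        means[i] = (xbar, ybar)
        for x, y in obs:
            # deviations from the cross section's own means
            sxx += (x - xbar) ** 2
            sxy += (x - xbar) * (y - ybar)
    beta = sxy / sxx
    gamma = {i: ybar - beta * xbar for i, (xbar, ybar) in means.items()}
    return beta, gamma

# Two cross sections with a common slope of 2 and effects 1 and 10
panel = {"a": [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)],
         "b": [(1.0, 12.0), (2.0, 14.0)]}
beta, gamma = within_one_way(panel)
```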
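If the NOINT option is specified, then the dummy variables' coefficients are set equal to the fixed effects. If an intercept is desired, then the ith dummy variable is obtained from the following expression:
\[ D_i = \gamma_i - \gamma_N \qquad \text{for } i = 1,\ldots,N-1 \]
The intercept is the Nth fixed effect, $\gamma_N$.
The within-model sum of squared errors is
\[ SSE = \sum_{i=1}^{N}\sum_{t=1}^{T_i}\left(y_{it} - \gamma_i - X_s\tilde{\beta}_s\right)^2 \]
The estimator of the error variance is the SSE divided by $M - N - (K - 1)$:
\[ \hat{\sigma}^2_\epsilon = SSE / (M - N - (K - 1)) \]
Alternatively, an equivalent way to get the error variance is
\[ \hat{\sigma}^2_\epsilon = \tilde{u}'Q_0\tilde{u} / (M - N - (K - 1)) \]
where the residuals $\tilde{u}$ are given by $\tilde{u} = (I_M - j_M j_M'/M)(y - X_s\tilde{\beta}_s)$ if there is an intercept and by $\tilde{u} = (y - X_s\tilde{\beta}_s)$ if there is not. The drawback is that the formula changes (but the results do not) with the inclusion of a constant.
The variance covariance matrix of $\tilde{\beta}_s$ is given by:
\[ \mathrm{Var}[\tilde{\beta}_s] = \hat{\sigma}^2_\epsilon(\tilde{X}_s'\tilde{X}_s)^{-1} \]
The covariance of the dummy variables and the dummy variables with the $\tilde{\beta}_s$ is dependent on whether the intercept is included in the model.
No intercept:
\[
\begin{aligned}
\mathrm{Var}(\gamma_i) = \mathrm{Var}(D_i) &= \frac{\hat{\sigma}^2_\epsilon}{T_i} + \bar{x}_{i\cdot}'\,\mathrm{Var}[\tilde{\beta}_s]\,\bar{x}_{i\cdot} \\
\mathrm{Cov}(\gamma_i,\gamma_j) = \mathrm{Cov}(D_i,D_j) &= \bar{x}_{i\cdot}'\,\mathrm{Var}[\tilde{\beta}_s]\,\bar{x}_{j\cdot} \\
\mathrm{Cov}(\gamma_i,\tilde{\beta}_s) = \mathrm{Cov}(D_i,\tilde{\beta}_s) &= -\bar{x}_{i\cdot}'\,\mathrm{Var}[\tilde{\beta}_s]
\end{aligned}
\]
Intercept:
\[
\begin{aligned}
\mathrm{Var}(D_i) &= \frac{\hat{\sigma}^2_\epsilon}{T_i} + \frac{\hat{\sigma}^2_\epsilon}{T_N} + (\bar{x}_{i\cdot} - \bar{x}_{N\cdot})'\,\mathrm{Var}[\tilde{\beta}_s]\,(\bar{x}_{i\cdot} - \bar{x}_{N\cdot}) \\
\mathrm{Cov}(D_i,D_j) &= \frac{\hat{\sigma}^2_\epsilon}{T_N} + (\bar{x}_{i\cdot} - \bar{x}_{N\cdot})'\,\mathrm{Var}[\tilde{\beta}_s]\,(\bar{x}_{j\cdot} - \bar{x}_{N\cdot}) \\
\mathrm{Var}(\mathrm{Intercept}) &= \frac{\hat{\sigma}^2_\epsilon}{T_N} + \bar{x}_{N\cdot}'\,\mathrm{Var}[\tilde{\beta}_s]\,\bar{x}_{N\cdot} \\
\mathrm{Cov}(\mathrm{Intercept},D_i) &= -\frac{\hat{\sigma}^2_\epsilon}{T_N} + \bar{x}_{N\cdot}'\,\mathrm{Var}[\tilde{\beta}_s]\,(\bar{x}_{i\cdot} - \bar{x}_{N\cdot}) \\
\mathrm{Cov}(D_i,\tilde{\beta}_s) &= -(\bar{x}_{i\cdot} - \bar{x}_{N\cdot})'\,\mathrm{Var}[\tilde{\beta}_s] \\
\mathrm{Cov}(\mathrm{Intercept},\tilde{\beta}_s) &= -\bar{x}_{N\cdot}'\,\mathrm{Var}[\tilde{\beta}_s]
\end{aligned}
\]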
Alternatively, the model option FIXONETIME estimates a one-way model where the heterogeneity comes from time effects. This option is analogous to re-sorting the data by time and then by cross section and running a FIXONE model. The advantage of using the FIXONETIME option is that sorting is avoided and the model remains labeled correctly.
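The specification for the two-way fixed-effects model is
\[ u_{it} = \gamma_i + \alpha_t + \epsilon_{it} \]
where the $\gamma_i$ and $\alpha_t$ are nonrandom parameters to be estimated.
If you do not specify the NOINT option, which suppresses the intercept, the estimates for the fixed effects are reported under the restriction that $\gamma_N = 0$ and $\alpha_T = 0$. If you specify the NOINT option to suppress the intercept, only the restriction $\alpha_T = 0$ is imposed.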
Balanced Panels
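Assume that the data are balanced (that is, all cross sections have T observations). Then you can write the following:
\[
\begin{aligned}
\tilde{y}_{it} &= y_{it} - \bar{y}_{i\cdot} - \bar{y}_{\cdot t} + \bar{y}_{\cdot\cdot} \\
\tilde{x}_{it} &= x_{it} - \bar{x}_{i\cdot} - \bar{x}_{\cdot t} + \bar{x}_{\cdot\cdot}
\end{aligned}
\]
where the symbols are as follows:
$y_{it}$ and $x_{it}$ are the dependent variable (a scalar) and the explanatory variables (a vector whose columns are the explanatory variables not including a constant), respectively
$\bar{y}_{i\cdot}$ and $\bar{x}_{i\cdot}$ are cross section means
$\bar{y}_{\cdot t}$ and $\bar{x}_{\cdot t}$ are time means
$\bar{y}_{\cdot\cdot}$ and $\bar{x}_{\cdot\cdot}$ are the overall means
The two-way fixed-effects model is simply a regression of $\tilde{y}_{it}$ on $\tilde{x}_{it}$. Therefore, the two-way $\beta$ is given by:
\[ \tilde{\beta}_s = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{y} \]
The calculations of cross section dummy variables, time dummy variables, and intercepts follow in a fashion similar to that used in the one-way model. First, you obtain the net cross-sectional and time effects. Denote the cross-sectional effects by $\gamma$ and the time effects by $\alpha$. These effects are calculated from the following relations:
\[
\begin{aligned}
\hat{\gamma}_i &= (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}) - \tilde{\beta}_s(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot}) \\
\hat{\alpha}_t &= (\bar{y}_{\cdot t} - \bar{y}_{\cdot\cdot}) - \tilde{\beta}_s(\bar{x}_{\cdot t} - \bar{x}_{\cdot\cdot})
\end{aligned}
\]
Denote the cross-sectional dummy variables and time dummy variables with the superscripts C and T. Under the NOINT option the following equations give the dummy variables:
\[
\begin{aligned}
D_i^C &= \hat{\gamma}_i + \hat{\alpha}_T \\
D_t^T &= \hat{\alpha}_t - \hat{\alpha}_T
\end{aligned}
\]
When an intercept is specified, the equations for dummy variables and intercept are:
\[
\begin{aligned}
D_i^C &= \hat{\gamma}_i - \hat{\gamma}_N \\
D_t^T &= \hat{\alpha}_t - \hat{\alpha}_T \\
\mathrm{Intercept} &= \hat{\gamma}_N + \hat{\alpha}_T
\end{aligned}
\]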
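For balanced data, the two-way within transformation subtracts the cross section mean and the time mean and adds back the overall mean. The following Python sketch shows this for a single variable stored as an N-by-T list of lists; the function name and the storage layout are assumptions made for illustration. Purely additive data (a cross section effect plus a time effect) should be transformed to zeros, which the example checks.

```python
def two_way_within(v):
    """Two-way within transformation of a balanced N x T array (list of rows).

    Returns v_it - mean_i. - mean_.t + mean_.. for every cell.
    """
    n, t = len(v), len(v[0])
    row_mean = [sum(r) / t for r in v]                               # cross section means
    col_mean = [sum(v[i][j] for i in range(n)) / n for j in range(t)]  # time means
    grand = sum(row_mean) / n                                        # overall mean
    return [[v[i][j] - row_mean[i] - col_mean[j] + grand for j in range(t)]
            for i in range(n)]

# Additive data: cross section effects 10 and 20, time effects 0, 1, 3
y = [[g + a for a in (0.0, 1.0, 3.0)] for g in (10.0, 20.0)]
y_tilde = two_way_within(y)
```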
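The sum of squared errors is:
\[ SSE = \sum_{i=1}^{N}\sum_{t=1}^{T}\left(y_{it} - \gamma_i - \alpha_t - X_s\tilde{\beta}_s\right)^2 \]
The estimator of the error variance is the SSE divided by $M - T - N + 1 - (K - 1)$.
With or without a constant, the variance covariance matrix of $\tilde{\beta}_s$ is given by:
\[ \mathrm{Var}[\tilde{\beta}_s] = \hat{\sigma}^2_\epsilon(\tilde{X}_s'\tilde{X}_s)^{-1} \]
The variances and covariances of the dummy variables are given by:
\[
\begin{aligned}
\mathrm{Var}(D_i^C) &= \hat{\sigma}^2_\epsilon\left(\frac{1}{T} + \frac{1}{N} - \frac{1}{NT}\right) + (\bar{x}_{i\cdot} + \bar{x}_{\cdot t} - \bar{x}_{\cdot\cdot})'\,\mathrm{Var}[\tilde{\beta}_s]\,(\bar{x}_{i\cdot} + \bar{x}_{\cdot t} - \bar{x}_{\cdot\cdot}) \\
\mathrm{Var}(D_t^T) &= \frac{2\hat{\sigma}^2_\epsilon}{N} + (\bar{x}_{\cdot t} - \bar{x}_{\cdot T})'\,\mathrm{Var}[\tilde{\beta}_s]\,(\bar{x}_{\cdot t} - \bar{x}_{\cdot T}) \\
\mathrm{Cov}(D_i^C, D_j^C) &= \hat{\sigma}^2_\epsilon\left(\frac{1}{N} - \frac{1}{NT}\right) + (\bar{x}_{i\cdot} + \bar{x}_{\cdot t} - \bar{x}_{\cdot\cdot})'\,\mathrm{Var}[\tilde{\beta}_s]\,(\bar{x}_{j\cdot} + \bar{x}_{\cdot t} - \bar{x}_{\cdot\cdot}) \\
\mathrm{Cov}(D_t^T, D_u^T) &= \frac{\hat{\sigma}^2_\epsilon}{N} + (\bar{x}_{\cdot t} - \bar{x}_{\cdot T})'\,\mathrm{Var}[\tilde{\beta}_s]\,(\bar{x}_{\cdot u} - \bar{x}_{\cdot T}) \\
\mathrm{Cov}(D_i^C, D_t^T) &= -\frac{\hat{\sigma}^2_\epsilon}{N} + (\bar{x}_{i\cdot} + \bar{x}_{\cdot t} - \bar{x}_{\cdot\cdot})'\,\mathrm{Var}[\tilde{\beta}_s]\,(\bar{x}_{\cdot t} - \bar{x}_{\cdot T}) \\
\mathrm{Cov}(D_i^C, \tilde{\beta}_s) &= -(\bar{x}_{i\cdot} + \bar{x}_{\cdot t} - \bar{x}_{\cdot\cdot})'\,\mathrm{Var}[\tilde{\beta}_s] \\
\mathrm{Cov}(D_t^T, \tilde{\beta}_s) &= -(\bar{x}_{\cdot t} - \bar{x}_{\cdot T})'\,\mathrm{Var}[\tilde{\beta}_s]
\end{aligned}
\]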
Unbalanced Panels
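Let $X_*$ and $y_*$ be the independent and dependent variables arranged by time and by cross section within each time period. (Note that the input data set used by the PANEL procedure must be sorted by cross section and then by time within each cross section.) Let $M_t$ be the number of cross sections observed in year t and let $\sum_t M_t = M$. Let $D_t$ be the $M_t \times N$ matrix obtained from the $N \times N$ identity matrix from which rows that correspond to cross sections not observed at time t have been omitted. Consider $Z = (Z_1, Z_2)$ where $Z_1 = (D_1', D_2', \ldots, D_T')'$ and $Z_2 = \mathrm{diag}(D_1 j_N, D_2 j_N, \ldots, D_T j_N)$. The matrix Z gives the dummy variable structure for the two-way model.
Let
\[
\begin{aligned}
\Delta_N &= Z_1'Z_1 \\
\Delta_T &= Z_2'Z_2 \\
A &= Z_2'Z_1 \\
\bar{Z} &= Z_2 - Z_1\Delta_N^{-1}A' \\
Q &= \Delta_T - A\Delta_N^{-1}A' \\
P &= (I_M - Z_1\Delta_N^{-1}Z_1') - \bar{Z}Q^{-1}\bar{Z}'
\end{aligned}
\]
The matrix P provides the within transformation. The estimate of the regression slope coefficients is given by
\[ \tilde{\beta}_s = (X_{*s}'PX_{*s})^{-1}X_{*s}'Py_* \]
where $X_{*s}$ is the $X_*$ matrix without the vector of 1s.
The estimator of the error variance is
\[ \hat{\sigma}^2_\epsilon = \tilde{u}'P\tilde{u} / (M - T - N + 1 - (K - 1)) \]
where the residuals are given by $\tilde{u} = (I_M - j_M j_M'/M)(y_* - X_{*s}\tilde{\beta}_s)$ if there is an intercept in the model and by $\tilde{u} = y_* - X_{*s}\tilde{\beta}_s$ if there is no intercept.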
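The actual implementation is quite different from the theory. The PANEL procedure transforms all series by using the P matrix:
\[ \bar{v} = Pv \]
The variable being transformed is v, which could be y or any column of X. After the data are properly transformed, OLS is run on the resulting series.
Given $\tilde{\beta}_s$, the next step is estimating the cross-sectional and time effects. Given that $\gamma$ is the column vector of cross-sectional effects and $\alpha$ is the column vector of time effects,
\[
\begin{aligned}
\tilde{\alpha} &= Q^{-1}\bar{Z}'y - Q^{-1}\bar{Z}'X_s\tilde{\beta}_s \\
\tilde{\gamma} &= (\Delta_1 + \Delta_2 - \Delta_3)y - (\Delta_1 + \Delta_2 - \Delta_3)X_s\tilde{\beta}_s
\end{aligned}
\]
where
\[
\begin{aligned}
\Delta_1 &= \Delta_N^{-1}Z_1' \\
\Delta_2 &= \Delta_N^{-1}A'Q^{-1}Z_2' \\
\Delta_3 &= \Delta_N^{-1}A'Q^{-1}A\Delta_N^{-1}Z_1'
\end{aligned}
\]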
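Given the cross-sectional and time effects, the next step is to derive the associated dummy variables. Using the NOINT option, the following equations give the dummy variables:
\[
\begin{aligned}
D_i^C &= \hat{\gamma}_i + \hat{\alpha}_T \\
D_t^T &= \hat{\alpha}_t - \hat{\alpha}_T
\end{aligned}
\]
When an intercept is desired, the equations for dummy variables and intercept are:
\[
\begin{aligned}
D_i^C &= \hat{\gamma}_i - \hat{\gamma}_N \\
D_t^T &= \hat{\alpha}_t - \hat{\alpha}_T \\
\mathrm{Intercept} &= \hat{\gamma}_N + \hat{\alpha}_T
\end{aligned}
\]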
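The calculation of the covariance matrix is as follows:
\[ \mathrm{Var}[\hat{\gamma}] = \hat{\sigma}^2_\epsilon(\Delta_N^{-1} + \Sigma_1 - \Sigma_2) + (\Delta_1 + \Delta_2 - \Delta_3)X_s\,\mathrm{Var}[\tilde{\beta}_s]\,X_s'(\Delta_1 + \Delta_2 - \Delta_3)' \]
where
\[
\begin{aligned}
\Sigma_1 &= \Delta_N^{-1}A'Q^{-1}A\Delta_N^{-1} \\
\Sigma_2 &= \Delta_N^{-1}A'Q^{-1}\Delta_T Q^{-1}A\Delta_N^{-1}
\end{aligned}
\]
\[ \mathrm{Var}[\hat{\alpha}] = \hat{\sigma}^2_\epsilon\left(Q^{-1}\bar{Z}'\bar{Z}Q^{-1}\right) + \left(Q^{-1}\bar{Z}'X_s\right)\mathrm{Var}[\tilde{\beta}_s]\left(X_s'\bar{Z}Q^{-1}\right) \]
\[ \mathrm{Cov}[\hat{\gamma},\hat{\alpha}] = \hat{\sigma}^2_\epsilon\left[\Delta_N^{-1}A'Q^{-1}\Delta_T Q^{-1} - \Delta_N^{-1}A'Q^{-1}A\Delta_N^{-1}A'Q^{-1}\right] + (\Delta_1 + \Delta_2 - \Delta_3)X_s\,\mathrm{Var}[\tilde{\beta}_s]\,X_s'\bar{Z}Q^{-1} \]
\[ \mathrm{Cov}[\hat{\gamma},\tilde{\beta}_s] = -(\Delta_1 + \Delta_2 - \Delta_3)X_s\,\mathrm{Var}[\tilde{\beta}_s] \]
\[ \mathrm{Cov}[\hat{\alpha},\tilde{\beta}_s] = -Q^{-1}\bar{Z}'X_s\,\mathrm{Var}[\tilde{\beta}_s] \]
Now you work out the variance covariance estimates for the dummy variables.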
Between Estimators
The between groups estimator is the regression of the cross section means of y on the cross section means of $\tilde{X}_s$. In other words, you fit the following regression:
\[ \bar{y}_{i\cdot} = \bar{x}_{i\cdot}\beta^{BG} + \eta_i \]
The between time periods estimator is the regression of the time means of y on the time means of $\tilde{X}_s$. In other words, you fit the following regression:
\[ \bar{y}_{\cdot t} = \bar{x}_{\cdot t}\beta^{BT} + \zeta_t \]
In either case, the error is assumed to be normally distributed with mean zero and a constant variance.
Pooled Estimator
PROC PANEL allows you to pool time series cross-sectional data and run regressions on the data. Pooling is admissible if there are no fixed effects or random effects present in the data. This feature is included to aid in analysis and comparison across model types and to give you access to HCCME standard errors and other panel diagnostics. In general, this model type should not be used with time series cross-sectional data.
The specification for the one-way random-effects model is
\[ u_{it} = \nu_i + \epsilon_{it} \]
Let $Z_0 = \mathrm{diag}(j_{T_i})$, $P_0 = \mathrm{diag}(\bar{J}_{T_i})$, and $Q_0 = \mathrm{diag}(E_{T_i})$, with $\bar{J}_{T_i} = J_{T_i}/T_i$ and $E_{T_i} = I_{T_i} - \bar{J}_{T_i}$. Define $\tilde{X}_s = Q_0X_s$ and $\tilde{y} = Q_0y$.
In the one-way model, estimation proceeds in a two-step fashion. First, you obtain estimates of the variance components $\sigma^2_\epsilon$ and $\sigma^2_\nu$. There are multiple ways to derive these estimates; PROC PANEL provides four options. All four options are valid for balanced or unbalanced panels. Once these estimates are in hand, they are used to form a weighting factor $\theta$, and estimation proceeds via OLS on partial deviations from group means. PROC PANEL gives the following options for variance component estimators.
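Fuller and Battese's Method
Let
\[
\begin{aligned}
R(\nu) &= y'Z_0(Z_0'Z_0)^{-1}Z_0'y \\
R(\beta|\nu) &= (\tilde{X}_s'\tilde{y})'(\tilde{X}_s'\tilde{X}_s)^{-1}(\tilde{X}_s'\tilde{y}) \\
R(\beta) &= (X'y)'(X'X)^{-1}X'y \\
R(\nu|\beta) &= R(\beta|\nu) + R(\nu) - R(\beta)
\end{aligned}
\]
The estimator of the error variance is given by
\[ \hat{\sigma}^2_\epsilon = (y'y - R(\beta|\nu) - R(\nu)) / (M - N - (K - 1)) \]
The estimator of the cross-sectional variance component is given by
\[ \hat{\sigma}^2_\nu = \left(R(\nu|\beta) - (N - 1)\hat{\sigma}^2_\epsilon\right) / \left(M - \mathrm{tr}\left(Z_0'X_s(X_s'X_s)^{-1}X_s'Z_0\right)\right) \]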
Note that the error variance is the variance of the residual of the within estimator. According to Baltagi and Chang (1994), the Fuller and Battese method is appropriate to apply to both balanced and unbalanced data. The Fuller and Battese method is the default for estimation of one-way random-effects models with balanced panels. However, the Fuller and Battese method does not always obtain nonnegative estimates for the cross section (or group) variance. In the case of a negative estimate, a warning is printed and the estimate is set to zero.
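Wansbeek and Kapteyn's Method
The estimation of the variance components is performed by using a quadratic unbiased estimation (QUE) method that involves quadratic forms of the centered residuals. Let
\[
\begin{aligned}
q_1 &= \tilde{u}'Q_0\tilde{u} \\
q_2 &= \tilde{u}'P_0\tilde{u}
\end{aligned}
\]
where the residuals $\tilde{u}$ are given by $\tilde{u} = (I_M - j_M j_M'/M)(y - X_s(\tilde{X}_s'\tilde{X}_s)^{-1}\tilde{X}_s'\tilde{y})$ if there is an intercept and by $\tilde{u} = y - X_s(\tilde{X}_s'\tilde{X}_s)^{-1}\tilde{X}_s'\tilde{y}$ if there is not.
Consider the expected values
\[
\begin{aligned}
E(q_1) &= (M - N - (K - 1))\sigma^2_\epsilon \\
E(q_2) &= \left(N - 1 + \mathrm{tr}\left[(X_s'Q_0X_s)^{-1}X_s'P_0X_s\right] - \mathrm{tr}\left[(X_s'Q_0X_s)^{-1}X_s'\bar{J}_MX_s\right]\right)\sigma^2_\epsilon + \left[M - \left(\sum_i T_i^2/M\right)\right]\sigma^2_\nu
\end{aligned}
\]
where $\hat{\sigma}^2_\epsilon$ and $\hat{\sigma}^2_\nu$ are obtained by equating the quadratic forms to their expected values. The estimator of the error variance is the residual variance of the within estimate. The Wansbeek and Kapteyn method can also generate negative variance component estimates.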
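Wallace and Hussain's Method
The Wallace and Hussain method is based on quadratic forms of the OLS residuals. Let
\[
\begin{aligned}
q_1 &= \hat{u}_{OLS}'Q_0\hat{u}_{OLS} \\
q_2 &= \hat{u}_{OLS}'P_0\hat{u}_{OLS}
\end{aligned}
\]
The expected values are
\[
\begin{aligned}
E(q_1) &= \delta_{11}\sigma^2_\nu + \delta_{12}\sigma^2_\epsilon \\
E(q_2) &= \delta_{21}\sigma^2_\nu + \delta_{22}\sigma^2_\epsilon
\end{aligned}
\]
where
\[
\begin{aligned}
\delta_{11} &= \mathrm{tr}\left((X'X)^{-1}X'Z_0Z_0'X\right) - \mathrm{tr}\left((X'X)^{-1}X'P_0X(X'X)^{-1}X'Z_0Z_0'X\right) \\
\delta_{12} &= M - N - K + \mathrm{tr}\left((X'X)^{-1}X'P_0X\right) \\
\delta_{21} &= M - 2\,\mathrm{tr}\left((X'X)^{-1}X'Z_0Z_0'X\right) + \mathrm{tr}\left((X'X)^{-1}X'P_0X(X'X)^{-1}X'Z_0Z_0'X\right) \\
\delta_{22} &= N - \mathrm{tr}\left((X'X)^{-1}X'P_0X\right)
\end{aligned}
\]
where tr() is the trace operator on a square matrix. Solving this system produces the variance components. This method is applicable to balanced and unbalanced panels. However, there is no guarantee of positive variance components. Any negative values are fixed equal to zero.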
Nerlove's Method
The Nerlove method for estimating variance components can be obtained by setting VCOMP = NL. The Nerlove method (see Baltagi 1995, page 17) is assured to give estimates of the variance components that are always positive. Furthermore, it is simple in contrast to the previous estimators.
If $\gamma_i$ is the ith fixed effect, Nerlove's method uses the variance of the fixed effects as the estimate of $\hat{\sigma}^2_\nu$. You have
\[ \hat{\sigma}^2_\nu = \sum_{i=1}^{N}\frac{(\gamma_i - \bar{\gamma})^2}{N - 1} \]
where $\bar{\gamma}$ is the mean fixed effect. The estimate of $\sigma^2_\epsilon$ is simply the residual sum of squares of the one-way fixed-effects regression divided by the number of observations.
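Because Nerlove's estimates are simple moments, they are easy to sketch. The Python fragment below assumes that the fixed effects and the residual sum of squares from a one-way fixed-effects fit are already available; the function and argument names are hypothetical.

```python
from statistics import variance

def nerlove_one_way(fixed_effects, sse, m):
    """Nerlove variance components for the one-way model.

    fixed_effects : estimated fixed effects, one per cross section
    sse           : residual sum of squares of the fixed-effects regression
    m             : total number of observations
    """
    sigma2_nu = variance(fixed_effects)  # sum((g - gbar)^2) / (N - 1)
    sigma2_eps = sse / m                 # divides by M, not by degrees of freedom
    return sigma2_nu, sigma2_eps

sigma2_nu, sigma2_eps = nerlove_one_way([1.0, 3.0, 5.0], sse=12.0, m=6)
```

Both quantities are sums of squares divided by positive constants, so neither can be negative.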
With the variance components in hand, from any method, the next task is to estimate the regression model of interest. For each individual, you form a weight ($\theta_i$) as follows:
\[
\begin{aligned}
\theta_i &= 1 - \sigma_\epsilon / w_i \\
w_i^2 &= T_i\sigma^2_\nu + \sigma^2_\epsilon
\end{aligned}
\]
where $T_i$ is the number of time observations in the ith cross section. Taking the $\theta_i$, you form the partial deviations,
\[
\begin{aligned}
\tilde{y}_{it} &= y_{it} - \theta_i\bar{y}_{i\cdot} \\
\tilde{x}_{it} &= x_{it} - \theta_i\bar{x}_{i\cdot}
\end{aligned}
\]
where $\bar{y}_{i\cdot}$ and $\bar{x}_{i\cdot}$ are cross section means of the dependent variable and independent variables (including the constant, if any), respectively. The random-effects $\beta$ is then the result of simple OLS on the transformed data.
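The weight and the partial deviations can be sketched in a few lines of Python. The function names are hypothetical, and the variance components are taken as given. Two limiting cases help check the formula: a zero cross-sectional variance gives a weight of 0 (pooled OLS on the raw data), and a very large one pushes the weight toward 1 (the within transformation).

```python
from math import sqrt

def theta_weight(t_i, sigma2_nu, sigma2_eps):
    """Weight theta_i = 1 - sigma_eps / w_i with w_i^2 = T_i*sigma2_nu + sigma2_eps."""
    w_i = sqrt(t_i * sigma2_nu + sigma2_eps)
    return 1.0 - sqrt(sigma2_eps) / w_i

def partial_deviations(values, theta):
    """Subtract theta times the cross section mean from each observation."""
    mean = sum(values) / len(values)
    return [v - theta * mean for v in values]

theta = theta_weight(3, sigma2_nu=1.0, sigma2_eps=1.0)   # w_i = 2
y_tilde = partial_deviations([1.0, 2.0, 3.0], theta)
```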
The specification for the two-way random-effects model is
\[ u_{it} = \nu_i + e_t + \epsilon_{it} \]
As in the one-way random-effects model, the PANEL procedure provides four options for variance component estimators. Unlike the one-way random-effects model, unbalanced panels present some special concerns.
Let $X_*$ and $y_*$ be the independent and dependent variables arranged by time and by cross section within each time period. (Note that the input data set used by the PANEL procedure must be sorted by cross section and then by time within each cross section.) Let $M_t$ be the number of cross sections observed in time t and $\sum_t M_t = M$. Let $D_t$ be the $M_t \times N$ matrix obtained from the $N \times N$ identity matrix from which rows that correspond to cross sections not observed at time t have been omitted. Consider $Z = (Z_1, Z_2)$ where $Z_1 = (D_1', D_2', \ldots, D_T')'$ and $Z_2 = \mathrm{diag}(D_1 j_N, D_2 j_N, \ldots, D_T j_N)$. The matrix Z gives the dummy variable structure for the two-way model. For notational ease, let
\[
\begin{aligned}
\Delta_N &= Z_1'Z_1, \quad \Delta_T = Z_2'Z_2, \quad A = Z_2'Z_1 \\
\bar{Z} &= Z_2 - Z_1\Delta_N^{-1}A' \\
\bar{\Delta}_1 &= I_M - Z_1\Delta_N^{-1}Z_1' \\
\bar{\Delta}_2 &= I_M - Z_2\Delta_T^{-1}Z_2' \\
Q &= \Delta_T - A\Delta_N^{-1}A' \\
P &= (I_M - Z_1\Delta_N^{-1}Z_1') - \bar{Z}Q^{-1}\bar{Z}'
\end{aligned}
\]
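Fuller and Battese's Method
The estimator of the error variance is
\[ \hat{\sigma}^2_\epsilon = \tilde{u}'P\tilde{u} / (M - T - N + 1 - (K - 1)) \]
where P is the Wansbeek and Kapteyn within estimator for unbalanced (or balanced) panels in a two-way setting. The estimator of the error variance is the same as that in the Wansbeek and Kapteyn method.
Consider the expected values
\[
\begin{aligned}
E(q_N) &= \sigma^2_\epsilon\left[M - T - K + 1\right] + \sigma^2_\nu\left[M - \mathrm{tr}\left(X_s'\bar{\Delta}_2Z_1Z_1'\bar{\Delta}_2X_s\left(X_s'\bar{\Delta}_2X_s\right)^{-1}\right)\right] \\
E(q_T) &= \sigma^2_\epsilon\left[M - N - K + 1\right] + \sigma^2_e\left[M - \mathrm{tr}\left(X_s'\bar{\Delta}_1Z_2Z_2'\bar{\Delta}_1X_s\left(X_s'\bar{\Delta}_1X_s\right)^{-1}\right)\right]
\end{aligned}
\]
Just as in the one-way case, there is always the possibility that the (estimated) variance components will be negative. In such a case, the negative components are fixed to equal zero. After substituting the group sum of the within residuals for $q_N$, the time sums of the within residuals for $q_T$, and $\hat{\sigma}^2_\epsilon$, the two equations are solved for $\hat{\sigma}^2_\nu$ and $\hat{\sigma}^2_e$.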
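Wansbeek and Kapteyn's Method
The estimator of the error variance is
\[ \hat{\sigma}^2_\epsilon = \tilde{u}'P\tilde{u} / (M - T - N + 1 - (K - 1)) \]
where the $\tilde{u}$ are given by $\tilde{u} = (I_M - j_M j_M'/M)(y_* - X_{*s}(X_{*s}'PX_{*s})^{-1}X_{*s}'Py_*)$ if there is an intercept and by $\tilde{u} = (y_* - X_{*s}(X_{*s}'PX_{*s})^{-1}X_{*s}'Py_*)$ if there is not.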
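The estimation of the variance components is performed by using a quadratic unbiased estimation (QUE) method that involves computing quadratic forms of the residuals $\tilde{u}$, equating their expected values to the realized quadratic forms, and solving for the variance components. Let
\[
\begin{aligned}
q_N &= \tilde{u}'Z_2\Delta_T^{-1}Z_2'\tilde{u} \\
q_T &= \tilde{u}'Z_1\Delta_N^{-1}Z_1'\tilde{u}
\end{aligned}
\]
The expected values are
\[
\begin{aligned}
E(q_N) &= (T + k_N - (1 + k_0))\sigma^2_\epsilon + \left(T - \frac{\lambda_1}{M}\right)\sigma^2_\nu + \left(M - \frac{\lambda_2}{M}\right)\sigma^2_e \\
E(q_T) &= (N + k_T - (1 + k_0))\sigma^2_\epsilon + \left(M - \frac{\lambda_1}{M}\right)\sigma^2_\nu + \left(N - \frac{\lambda_2}{M}\right)\sigma^2_e
\end{aligned}
\]
where
\[
\begin{aligned}
k_0 &= j_M'X_{*s}(X_{*s}'PX_{*s})^{-1}X_{*s}'j_M/M \\
k_N &= \mathrm{tr}\left((X_{*s}'PX_{*s})^{-1}X_{*s}'Z_2\Delta_T^{-1}Z_2'X_{*s}\right) \\
k_T &= \mathrm{tr}\left((X_{*s}'PX_{*s})^{-1}X_{*s}'Z_1\Delta_N^{-1}Z_1'X_{*s}\right) \\
\lambda_1 &= j_M'Z_1Z_1'j_M \\
\lambda_2 &= j_M'Z_2Z_2'j_M
\end{aligned}
\]
The quadratic unbiased estimators for $\sigma^2_\nu$ and $\sigma^2_e$ are obtained by equating the expected values to the quadratic forms and solving for the two unknowns.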
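When the NOINT option is specified, the variance component equations change slightly. In particular, the following is true (Wansbeek and Kapteyn 1989):
\[
\begin{aligned}
E(q_N) &= (T + k_N)\sigma^2_\epsilon + T\sigma^2_\nu + M\sigma^2_e \\
E(q_T) &= (N + k_T)\sigma^2_\epsilon + M\sigma^2_\nu + N\sigma^2_e
\end{aligned}
\]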
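Wallace and Hussain's Method
The Wallace and Hussain method is based on quadratic forms of the OLS residuals $\tilde{u}_{OLS}$. The expected values are
\[
\begin{aligned}
E\left(\tilde{u}_{OLS}'P\tilde{u}_{OLS}\right) &= \delta_{11}\sigma^2_\epsilon + \delta_{12}\sigma^2_\nu + \delta_{13}\sigma^2_e \\
E(q_N) = E\left(\tilde{u}_{OLS}'Z_2\Delta_T^{-1}Z_2'\tilde{u}_{OLS}\right) &= \delta_{21}\sigma^2_\epsilon + \delta_{22}\sigma^2_\nu + \delta_{23}\sigma^2_e \\
E(q_T) = E\left(\tilde{u}_{OLS}'Z_1\Delta_N^{-1}Z_1'\tilde{u}_{OLS}\right) &= \delta_{31}\sigma^2_\epsilon + \delta_{32}\sigma^2_\nu + \delta_{33}\sigma^2_e
\end{aligned}
\]
where the $\delta_{js}$ constants are defined by
\[
\begin{aligned}
\delta_{11} &= M - N - T + 1 - \mathrm{tr}\left((X'PX)(X'X)^{-1}\right) \\
\delta_{12} &= \mathrm{tr}\left((X'Z_1Z_1'X)(X'X)^{-1}(X'PX)(X'X)^{-1}\right) \\
\delta_{13} &= \mathrm{tr}\left((X'Z_2Z_2'X)(X'X)^{-1}(X'PX)(X'X)^{-1}\right) \\
\delta_{21} &= T - \mathrm{tr}\left((X'Z_2\Delta_T^{-1}Z_2'X)(X'X)^{-1}\right) \\
\delta_{22} &= T - 2\,\mathrm{tr}\left((X'Z_2\Delta_T^{-1}Z_2'Z_1Z_1'X)(X'X)^{-1}\right) + \mathrm{tr}\left((X'Z_2\Delta_T^{-1}Z_2'X)(X'X)^{-1}(X'Z_1Z_1'X)(X'X)^{-1}\right) \\
\delta_{23} &= M - 2\,\mathrm{tr}\left((X'Z_2Z_2'X)(X'X)^{-1}\right) + \mathrm{tr}\left((X'Z_2\Delta_T^{-1}Z_2'X)(X'X)^{-1}(X'Z_2Z_2'X)(X'X)^{-1}\right) \\
\delta_{31} &= N - \mathrm{tr}\left((X'Z_1\Delta_N^{-1}Z_1'X)(X'X)^{-1}\right) \\
\delta_{32} &= M - 2\,\mathrm{tr}\left((X'Z_1Z_1'X)(X'X)^{-1}\right) + \mathrm{tr}\left((X'Z_1\Delta_N^{-1}Z_1'X)(X'X)^{-1}(X'Z_1Z_1'X)(X'X)^{-1}\right) \\
\delta_{33} &= N - 2\,\mathrm{tr}\left((X'Z_1\Delta_N^{-1}Z_1'Z_2Z_2'X)(X'X)^{-1}\right) + \mathrm{tr}\left((X'Z_1\Delta_N^{-1}Z_1'X)(X'X)^{-1}(X'Z_2Z_2'X)(X'X)^{-1}\right)
\end{aligned}
\]
The PANEL procedure solves this system for the estimates $\hat{\sigma}_\epsilon$, $\hat{\sigma}_\nu$, and $\hat{\sigma}_e$. Some of the estimated variance components can be negative. Negative components are set to zero and estimation proceeds.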
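Nerlove's Method
The Nerlove method for estimating variance components can be obtained by setting VCOMP = NL. The estimator of the error variance is
\[ \hat{\sigma}^2_\epsilon = \tilde{u}'P\tilde{u}/M \]
The variance components for cross section and time effects are:
\[ \hat{\sigma}^2_\nu = \sum_{i=1}^{N}\frac{(\gamma_i - \bar{\gamma})^2}{N - 1} \]
where $\gamma_i$ are the cross section effects, and
\[ \hat{\sigma}^2_e = \sum_{t=1}^{T}\frac{(\alpha_t - \bar{\alpha})^2}{T - 1} \]
where $\alpha_t$ are the time effects.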
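With the estimates of the variance components in hand, you can proceed to the final estimation. If the panel is balanced, partial mean deviations are used:
\[
\begin{aligned}
\tilde{y}_{it} &= y_{it} - \theta_1\bar{y}_{i\cdot} - \theta_2\bar{y}_{\cdot t} + \theta_3\bar{y}_{\cdot\cdot} \\
\tilde{x}_{it} &= x_{it} - \theta_1\bar{x}_{i\cdot} - \theta_2\bar{x}_{\cdot t} + \theta_3\bar{x}_{\cdot\cdot}
\end{aligned}
\]
The $\theta$ estimates are obtained from
\[
\begin{aligned}
\theta_1 &= 1 - \frac{\sigma_\epsilon}{\sqrt{T\sigma^2_\nu + \sigma^2_\epsilon}} \\
\theta_2 &= 1 - \frac{\sigma_\epsilon}{\sqrt{N\sigma^2_e + \sigma^2_\epsilon}} \\
\theta_3 &= \theta_1 + \theta_2 + \frac{\sigma_\epsilon}{\sqrt{T\sigma^2_\nu + N\sigma^2_e + \sigma^2_\epsilon}} - 1
\end{aligned}
\]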
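The three weights for the balanced two-way case can be computed directly from the variance components. The following Python sketch uses hypothetical names and takes the components as given; when both random-effect variances are zero, all three weights are zero and the transformation leaves the data unchanged.

```python
from math import sqrt

def two_way_thetas(n, t, sigma2_nu, sigma2_e, sigma2_eps):
    """Weights for the balanced two-way partial mean deviations.

    n, t       : number of cross sections and time periods
    sigma2_nu  : cross-sectional variance component
    sigma2_e   : time variance component
    sigma2_eps : error variance
    """
    s = sqrt(sigma2_eps)
    theta1 = 1.0 - s / sqrt(t * sigma2_nu + sigma2_eps)
    theta2 = 1.0 - s / sqrt(n * sigma2_e + sigma2_eps)
    theta3 = theta1 + theta2 + s / sqrt(t * sigma2_nu + n * sigma2_e + sigma2_eps) - 1.0
    return theta1, theta2, theta3

# T*sigma2_nu = 15 and N*sigma2_e = 48 make the three square roots 4, 7, and 8
th1, th2, th3 = two_way_thetas(n=8, t=3, sigma2_nu=5.0, sigma2_e=6.0, sigma2_eps=1.0)
```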
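With these partial deviations, PROC PANEL uses OLS on the transformed series (including an intercept if so desired).
The case of an unbalanced panel is somewhat trickier. You could naively substitute the variance components in the equation below:
\[ \Omega = \sigma^2_\epsilon I_M + \sigma^2_\nu Z_1Z_1' + \sigma^2_e Z_2Z_2' \]
After inverting the expression for $\Omega$, it is possible to do GLS on the data (even if the panel is unbalanced). However, the inversion of $\Omega$ is no small matter because the dimension is at least $M(M+1)/2$.
Wansbeek and Kapteyn show that the inverse of $\Omega$ can be written as
\[ \sigma^2_\epsilon\Omega^{-1} = V - VZ_2\tilde{P}^{-1}Z_2'V \]
with the following:
\[
\begin{aligned}
V &= I_M - Z_1\tilde{\Delta}_N^{-1}Z_1' \\
\tilde{P} &= \tilde{\Delta}_T - A\tilde{\Delta}_N^{-1}A' \\
\tilde{\Delta}_N &= \Delta_N + \left(\frac{\sigma^2_\epsilon}{\sigma^2_\nu}\right)I_N \\
\tilde{\Delta}_T &= \Delta_T + \left(\frac{\sigma^2_\epsilon}{\sigma^2_e}\right)I_T
\end{aligned}
\]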
Computationally, this is a much less intensive approach. By using the inverse of the variance-covariance matrix of the error, it becomes possible to complete GLS on the unbalanced panel.
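Parks Method (Autoregressive Model)
Parks (1967) considered the first-order autoregressive model in which the random errors $u_{it}$, $i = 1, 2, \ldots, N$, and $t = 1, 2, \ldots, T$ have the structure
\[
\begin{aligned}
E(u_{it}^2) &= \sigma_{ii} && \text{(heteroscedasticity)} \\
E(u_{it}u_{jt}) &= \sigma_{ij} && \text{(contemporaneously correlated)} \\
u_{it} &= \rho_i u_{i,t-1} + \epsilon_{it} && \text{(autoregression)}
\end{aligned}
\]
where
\[
\begin{aligned}
E(\epsilon_{it}) &= 0 \\
E(u_{i,t-1}\epsilon_{jt}) &= 0 \\
E(\epsilon_{it}\epsilon_{jt}) &= \phi_{ij} \\
E(\epsilon_{it}\epsilon_{js}) &= 0 \quad (s \neq t) \\
E(u_{i0}) &= 0 \\
E(u_{i0}u_{j0}) &= \sigma_{ij} = \phi_{ij}/(1 - \rho_i\rho_j)
\end{aligned}
\]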
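The model assumed is first-order autoregressive with contemporaneous correlation between cross sections. In this model, the covariance matrix for the vector of random errors u can be expressed as
\[
E(uu') = V =
\begin{bmatrix}
\sigma_{11}P_{11} & \sigma_{12}P_{12} & \cdots & \sigma_{1N}P_{1N} \\
\sigma_{21}P_{21} & \sigma_{22}P_{22} & \cdots & \sigma_{2N}P_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{N1}P_{N1} & \sigma_{N2}P_{N2} & \cdots & \sigma_{NN}P_{NN}
\end{bmatrix}
\]
where
\[
P_{ij} =
\begin{bmatrix}
1 & \rho_j & \rho_j^2 & \cdots & \rho_j^{T-1} \\
\rho_i & 1 & \rho_j & \cdots & \rho_j^{T-2} \\
\rho_i^2 & \rho_i & 1 & \cdots & \rho_j^{T-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho_i^{T-1} & \rho_i^{T-2} & \rho_i^{T-3} & \cdots & 1
\end{bmatrix}
\]
The matrix V is estimated by a two-stage procedure, and $\beta$ is then estimated by generalized least squares. The first step in estimating V involves the use of ordinary least squares to estimate $\beta$ and obtain the fitted residuals, as follows:
\[ \hat{u} = y - X\hat{\beta}_{OLS} \]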
A consistent estimator of the first-order autoregressive parameter is then obtained in the usual manner, as follows:
\[ \hat{\rho}_i = \left(\sum_{t=2}^{T}\hat{u}_{it}\hat{u}_{i,t-1}\right) \Bigg/ \left(\sum_{t=2}^{T}\hat{u}_{i,t-1}^2\right), \qquad i = 1, 2, \ldots, N \]
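The autoregressive parameter for one cross section is a ratio of two sums over the OLS residuals. A minimal Python sketch, with a hypothetical function name and the residuals for a single cross section supplied as a list, is:

```python
def ar1_coefficient(resid):
    """First-order autoregressive estimate for one cross section.

    resid holds the OLS residuals u_1, ..., u_T in time order.
    """
    num = sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
    den = sum(resid[t - 1] ** 2 for t in range(1, len(resid)))
    return num / den

# Residuals that halve each period
rho_hat = ar1_coefficient([1.0, 0.5, 0.25, 0.125])
```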
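Finally, the autoregressive characteristic of the data is removed (asymptotically) by the usual transformation of taking weighted differences. That is, for $i = 1, 2, \ldots, N$,
\[
\begin{aligned}
y_{i1}\sqrt{1 - \hat{\rho}_i^2} &= \sum_{k=1}^{p}X_{i1k}\beta_k\sqrt{1 - \hat{\rho}_i^2} + u_{i1}\sqrt{1 - \hat{\rho}_i^2} \\
y_{it} - \hat{\rho}_i y_{i,t-1} &= \sum_{k=1}^{p}(X_{itk} - \hat{\rho}_i X_{i,t-1,k})\beta_k + u_{it} - \hat{\rho}_i u_{i,t-1}, \qquad t = 2, \ldots, T
\end{aligned}
\]
which is written
\[ y_{it}^* = \sum_{k=1}^{p}X_{itk}^*\beta_k + u_{it}^*, \qquad i = 1, 2, \ldots, N;\; t = 1, 2, \ldots, T \]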
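The weighted-difference transformation keeps the first observation by rescaling it, so the transformed series has the same length as the original. The Python sketch below applies it to one series (the dependent variable or one regressor column) for one cross section; the function name is hypothetical and the estimate of the autoregressive parameter is taken as given.

```python
from math import sqrt

def parks_transform(series, rho):
    """Weighted differences for one cross section.

    The first value is scaled by sqrt(1 - rho^2); later values become
    v_t - rho * v_{t-1}. No observation is dropped.
    """
    first = series[0] * sqrt(1.0 - rho ** 2)
    rest = [series[t] - rho * series[t - 1] for t in range(1, len(series))]
    return [first] + rest

y_star = parks_transform([1.0, 2.0, 3.0], rho=0.6)
```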
Notice that the transformed model has not lost any observations (Seely and Zyskind 1971).
The second step in estimating the covariance matrix V is applying ordinary least squares to the preceding transformed model, obtaining
\[ \hat{u}^* = y^* - X^*\beta^*_{OLS} \]
from which the consistent estimator of $\sigma_{ij}$ is calculated as follows:
\[ s_{ij} = \frac{\hat{\phi}_{ij}}{(1 - \hat{\rho}_i\hat{\rho}_j)} \]
where
\[ \hat{\phi}_{ij} = \frac{1}{(T - p)}\sum_{t=1}^{T}\hat{u}^*_{it}\hat{u}^*_{jt} \]
Estimated generalized least squares (EGLS) then proceeds in the usual manner,
\[ \hat{\beta}_P = (X'\hat{V}^{-1}X)^{-1}X'\hat{V}^{-1}y \]
where $\hat{V}$ is the derived consistent estimator of V. For computational purposes, $\hat{\beta}_P$ is obtained directly from the transformed model,
\[ \hat{\beta}_P = \left(X^{*\prime}(\hat{\Phi}^{-1} \otimes I_T)X^*\right)^{-1}X^{*\prime}(\hat{\Phi}^{-1} \otimes I_T)y^* \]
where $\hat{\Phi} = [\hat{\phi}_{ij}]_{i,j=1,\ldots,N}$.
The preceding procedure is equivalent to Zellner's two-stage methodology applied to the transformed model (Zellner 1962).
Parks demonstrates that this estimator is consistent and asymptotically normally distributed with
\[ \mathrm{Var}(\hat{\beta}_P) = (X'V^{-1}X)^{-1} \]
Standard Corrections
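For the PARKS option, the first-order autocorrelation coefficient must be estimated for each cross section. Let $\rho$ be the $N \times 1$ vector of true parameters and $R = (r_1, \ldots, r_N)'$ be the corresponding vector of estimates. Then, to ensure that only range-preserving estimates are used in PROC PANEL, the following modification for R is made:
\[
r_i =
\begin{cases}
r_i & \text{if } |r_i| < 1 \\
\max(.95, rmax) & \text{if } r_i \geq 1 \\
\min(-.95, rmin) & \text{if } r_i \leq -1
\end{cases}
\]
where
\[
rmax =
\begin{cases}
0 & \text{if } r_i < 0 \text{ or } r_i \geq 1 \;\; \forall i \\
\max_j\left[r_j : 0 \leq r_j < 1\right] & \text{otherwise}
\end{cases}
\]
and
\[
rmin =
\begin{cases}
0 & \text{if } r_i > 0 \text{ or } r_i \leq -1 \;\; \forall i \\
\min_j\left[r_j : -1 < r_j \leq 0\right] & \text{otherwise}
\end{cases}
\]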
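The range-preserving correction can be sketched as a small Python function. The name and the `bound` argument are hypothetical. One point is an assumption rather than something the text settles: for the negative side, the sketch takes rmin to be the most negative admissible estimate, mirroring the way rmax is the largest admissible positive estimate.

```python
def range_preserve(r, bound=0.95):
    """Replace autocorrelation estimates outside (-1, 1) by admissible values.

    Assumption: rmin is the smallest estimate in (-1, 0], mirroring rmax,
    which is the largest estimate in [0, 1). Both default to 0 when no
    estimate falls in the corresponding interval.
    """
    pos = [x for x in r if 0 <= x < 1]
    neg = [x for x in r if -1 < x <= 0]
    rmax = max(pos) if pos else 0.0
    rmin = min(neg) if neg else 0.0
    out = []
    for x in r:
        if abs(x) < 1:
            out.append(x)                    # already admissible
        elif x >= 1:
            out.append(max(bound, rmax))
        else:
            out.append(min(-bound, rmin))
    return out

corrected = range_preserve([0.5, 1.2, -1.0, 0.97])
```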
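Da Silva Method (Variance-Component Moving Average Model)
The Da Silva method assumes that the observed value of the dependent variable at the tth time point on the ith cross-sectional unit can be expressed as
\[ y_{it} = x_{it}'\beta + a_i + b_t + e_{it}, \qquad i = 1, \ldots, N;\; t = 1, \ldots, T \]
Because the observations are arranged first by cross sections and then by time periods within cross sections, these equations can be written in matrix notation as
\[ y = X\beta + u \]
where
\[
\begin{aligned}
u &= (a \otimes 1_T) + (1_N \otimes b) + e \\
y &= (y_{11}, \ldots, y_{1T}, y_{21}, \ldots, y_{NT})' \\
X &= (x_{11}, \ldots, x_{1T}, x_{21}, \ldots, x_{NT})' \\
a &= (a_1 \ldots a_N)' \\
b &= (b_1 \ldots b_T)' \\
e &= (e_{11}, \ldots, e_{1T}, e_{21}, \ldots, e_{NT})'
\end{aligned}
\]
Here $1_N$ is an $N \times 1$ vector with all elements equal to 1, and $\otimes$ denotes the Kronecker product.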
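The following conditions are assumed:
1. $x_{it}$ is a sequence of nonstochastic, known $p \times 1$ vectors in $\Re^p$ whose elements are uniformly bounded in $\Re^p$. The matrix X has a full column rank p.
2. $\beta$ is a $p \times 1$ constant vector of unknown parameters.
3. a is a vector of uncorrelated random variables such that $E(a_i) = 0$ and $\mathrm{var}(a_i) = \sigma^2_a$, $\sigma^2_a > 0$, $i = 1, \ldots, N$.
4. b is a vector of uncorrelated random variables such that $E(b_t) = 0$ and $\mathrm{var}(b_t) = \sigma^2_b$ where $\sigma^2_b > 0$ and $t = 1, \ldots, T$.
5. $e_i = (e_{i1}, \ldots, e_{iT})'$ is a sample of a realization of a finite moving-average time series of order $m < T - 1$ for each i; hence,
\[ e_{it} = \alpha_0\epsilon_t + \alpha_1\epsilon_{t-1} + \ldots + \alpha_m\epsilon_{t-m}, \qquad t = 1, \ldots, T;\; i = 1, \ldots, N \]
where $\alpha_0, \alpha_1, \ldots, \alpha_m$ are unknown constants such that $\alpha_0 \neq 0$ and $\alpha_m \neq 0$, and $\{\epsilon_j\}_{j=-\infty}^{\infty}$ is a white noise process, that is, a sequence of uncorrelated random variables with $E(\epsilon_t) = 0$, $E(\epsilon_t^2) = \sigma^2_\epsilon$, and $\sigma^2_\epsilon > 0$.
6. The sets of random variables $\{a_i\}_{i=1}^{N}$, $\{b_t\}_{t=1}^{T}$, and $\{e_{it}\}_{t=1}^{T}$ for $i = 1, \ldots, N$ are mutually uncorrelated.
7. The random terms have normal distributions $a_i \sim N(0, \sigma^2_a)$, $b_t \sim N(0, \sigma^2_b)$, and $\epsilon_{t-k} \sim N(0, \sigma^2_\epsilon)$, for $i = 1, \ldots, N$; $t = 1, \ldots, T$; and $k = 1, \ldots, m$.
Under assumptions 1 through 6, $E(y) = X\beta$ and
\[ \mathrm{var}(y) = \sigma^2_a(I_N \otimes J_T) + \sigma^2_b(J_N \otimes I_T) + (I_N \otimes \Psi_T) \]
where $\Psi_T$ is a $T \times T$ matrix with elements $\psi_{ts}$ as follows:
\[
\mathrm{Cov}(e_{it}e_{is}) =
\begin{cases}
\psi(|t - s|) & \text{if } |t - s| \leq m \\
0 & \text{if } |t - s| > m
\end{cases}
\]
where $\psi(k) = \sigma^2_\epsilon\sum_{j=0}^{m-k}\alpha_j\alpha_{j+k}$ for $k = |t - s|$. For the definition of $I_N$, $I_T$, $J_N$, and $J_T$, see the section "Fuller and Battese's Method" on page 1292.
The covariance matrix, denoted by V, can be written in the form
\[ V = \sigma^2_a(I_N \otimes J_T) + \sigma^2_b(J_N \otimes I_T) + \sum_{k=0}^{m}\psi(k)(I_N \otimes \Psi_T^{(k)}) \]
where $\Psi_T^{(0)} = I_T$, and, for $k = 1, \ldots, m$, $\Psi_T^{(k)}$ is a band matrix whose kth off-diagonal elements are 1s and all other elements are 0s.
Thus, the covariance matrix of the vector of observations y has the form
\[ \mathrm{Var}(y) = \sum_{k=1}^{m+3}\nu_kV_k \]
where
\[
\begin{aligned}
\nu_1 &= \sigma^2_a \\
\nu_2 &= \sigma^2_b \\
\nu_k &= \psi(k - 3), \qquad k = 3, \ldots, m + 3 \\
V_1 &= I_N \otimes J_T \\
V_2 &= J_N \otimes I_T \\
V_k &= I_N \otimes \Psi_T^{(k-3)}, \qquad k = 3, \ldots, m + 3
\end{aligned}
\]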
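Two building blocks of this covariance structure are easy to make concrete: the autocovariance function of the moving-average error and the band matrices that place it on the off-diagonals. The Python sketch below uses hypothetical function names and takes the moving-average coefficients and the white noise variance as given.

```python
def psi(k, alpha, sigma2_eps):
    """Autocovariance at lag k of an MA(m) process with coefficients alpha[0..m]."""
    m = len(alpha) - 1
    if k > m:
        return 0.0
    return sigma2_eps * sum(alpha[j] * alpha[j + k] for j in range(m - k + 1))

def band_matrix(t, k):
    """T x T matrix with 1s on the kth off-diagonals (both sides) and 0s elsewhere.

    For k = 0 this is the identity matrix.
    """
    return [[1 if abs(r - c) == k else 0 for c in range(t)] for r in range(t)]

# MA(1) example with alpha = (1, 0.5) and white noise variance 2
psi0 = psi(0, [1.0, 0.5], 2.0)
psi1 = psi(1, [1.0, 0.5], 2.0)
psi2 = psi(2, [1.0, 0.5], 2.0)
band1 = band_matrix(3, 1)
```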
The estimator of $\beta$ is a two-step GLS-type estimator, that is, GLS with the unknown covariance matrix replaced by a suitable estimator of V. It is obtained by substituting Seely estimates for the scalar multiples $\nu_k$, $k = 1, 2, \ldots, m + 3$.
Seely (1969) presents a general theory of unbiased estimation when the choice of estimators is restricted to finite dimensional vector spaces, with a special emphasis on quadratic estimation of functions of the form $\sum_{i=1}^{n}\delta_i\nu_i$.
The parameters $\nu_i$ ($i = 1, \ldots, n$) are associated with a linear model $E(y) = X\beta$ with covariance matrix $\sum_{i=1}^{n}\nu_iV_i$ where $V_i$ ($i = 1, \ldots, n$) are real symmetric matrices. The method is also discussed by Seely (1970a, 1970b) and Seely and Zyskind (1971). Seely and Soong (1971) consider the MINQUE principle, using an approach along the lines of Seely (1969).
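Consider the case of the following general model:
\[ y_{it} = \sum_{l=1}^{maxlag}\phi_l y_{i(t-l)} + \sum_{k=1}^{K}\beta_k x_{itk} + \gamma_i + \alpha_t + \epsilon_{it} \]
The x variables can include ones that are correlated or uncorrelated to the individual effects, predetermined, or strictly exogenous. The $\gamma_i$ and $\alpha_t$ are cross-sectional and time series fixed effects, respectively. Arellano and Bond (1991) show that it is possible to define conditions that should result in a consistent estimator.
Consider the simple case of an autoregression in a panel setting (with only individual effects):
\[ y_{it} = \phi y_{i(t-1)} + \gamma_i + \epsilon_{it} \]
Differencing the preceding relationship results in:
\[ \Delta y_{it} = \phi\Delta y_{i(t-1)} + \nu_{it} \]
where $\nu_{it} = \epsilon_{it} - \epsilon_{i(t-1)}$.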
Obviously, y is not exogenous. However, Arellano and Bond (1991) show that it is still useful as an instrument, if properly lagged.
For $t = 2$ (assuming the first observation corresponds to time period 1) you have,
\[ \Delta y_{i2} = \phi\Delta y_{i1} + \nu_{i2} \]
Using $y_{i1}$ as an instrument is not a good idea since $\mathrm{Cov}(\epsilon_{i1}, \nu_{i2}) \neq 0$. Therefore, since it is not possible to form a moment restriction, you discard this observation.
For $t = 3$ you have,
\[ \Delta y_{i3} = \phi\Delta y_{i2} + \nu_{i3} \]
Clearly, you have every reason to suspect that $\mathrm{Cov}(\epsilon_{i1}, \nu_{i3}) = 0$. This condition forms one moment restriction.
For $t = 4$, both $\mathrm{Cov}(\epsilon_{i1}, \nu_{i4}) = 0$ and $\mathrm{Cov}(\epsilon_{i2}, \nu_{i4}) = 0$ must hold.
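Proceeding in that fashion, you have the following matrix of instruments,
\[
Z_i =
\begin{pmatrix}
y_{i1} & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & \cdots & 0 \\
0 & y_{i1} & y_{i2} & 0 & 0 & 0 & \cdots & 0 & \cdots & 0 \\
0 & 0 & 0 & y_{i1} & y_{i2} & y_{i3} & \cdots & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & & \vdots \\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & y_{i1} & \cdots & y_{i(T-2)}
\end{pmatrix}
\]
Using the instrument matrix, you form the weighting matrix $A_N$ as
\[ A_N = \left(\frac{1}{N}\sum_{i}^{N}Z_i'H_iZ_i\right)^{-1} \]
The initial weighting matrix is
\[
H_i =
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0 & 0 \\
-1 & 2 & -1 & \cdots & 0 & 0 \\
0 & -1 & 2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 2 & -1 \\
0 & 0 & 0 & \cdots & -1 & 2
\end{pmatrix}
\]
Note that the maximum size of the $H_i$ matrix is $T - 2$. The origins of the initial weighting matrix are the expected error covariances. Notice that on the diagonals,
\[ E(\nu_{it}\nu_{it}) = E\left(\epsilon_{it}^2 - 2\epsilon_{it}\epsilon_{i(t-1)} + \epsilon_{i(t-1)}^2\right) = 2\sigma^2_\epsilon \]
and off diagonals,
\[ E(\nu_{it}\nu_{i(t-1)}) = E\left(\epsilon_{it}\epsilon_{i(t-1)} - \epsilon_{it}\epsilon_{i(t-2)} - \epsilon_{i(t-1)}\epsilon_{i(t-1)} + \epsilon_{i(t-1)}\epsilon_{i(t-2)}\right) = -\sigma^2_\epsilon \]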
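The block pattern of the instrument matrix and the tridiagonal form of the initial weighting matrix can be made concrete with a short Python sketch. The function names are hypothetical, the series is assumed to be fully observed for one cross section, and only the purely autoregressive case (lagged levels of y as instruments) is covered.

```python
def ab_instruments(y):
    """Instrument rows for the differenced equations t = 3, ..., T.

    y holds the levels y_1, ..., y_T for one cross section. The row for
    period t uses y_1, ..., y_{t-2}, placed in its own block of columns.
    """
    big_t = len(y)
    ncols = (big_t - 2) * (big_t - 1) // 2
    rows, start = [], 0
    for t in range(3, big_t + 1):
        row = [0.0] * ncols
        for s in range(t - 2):
            row[start + s] = y[s]
        start += t - 2          # next row's block begins after this one
        rows.append(row)
    return rows

def ab_h_matrix(n):
    """Tridiagonal matrix with 2 on the diagonal and -1 next to it."""
    return [[2 if r == c else -1 if abs(r - c) == 1 else 0 for c in range(n)]
            for r in range(n)]

z_i = ab_instruments([1.0, 2.0, 3.0, 4.0, 5.0])   # T = 5 gives 3 rows, 6 columns
h_i = ab_h_matrix(len(z_i))
```

The number of rows of the instrument matrix equals the dimension of the weighting matrix, which is $T - 2$ for a complete series.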
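If you let the vector of lagged differences (in the series $y_{it}$) be denoted as $\Delta y_i^-$ and the dependent variable as $\Delta y_i$, then the optimal GMM estimator is
\[ \phi = \left[\left(\sum_i\Delta y_i^{-\prime}Z_i\right)A_N\left(\sum_i Z_i'\Delta y_i^-\right)\right]^{-1}\left(\sum_i\Delta y_i^{-\prime}Z_i\right)A_N\left(\sum_i Z_i'\Delta y_i\right) \]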
Using the estimate, $\hat{\phi}$, you can obtain estimates of the errors, $\hat{\epsilon}$, or the differences, $\hat{\nu}$. From the errors, the variance is calculated as,
\[ \sigma^2 = \frac{\hat{\epsilon}'\hat{\epsilon}}{M - 1} \]
where $M = \sum_{i=1}^{N}T_i$ is the total number of observations.
Furthermore, you can calculate the variance of the parameter as,
\[ \sigma^2\left[\left(\sum_i\Delta y_i^{-\prime}Z_i\right)A_N\left(\sum_i Z_i'\Delta y_i^-\right)\right]^{-1} \]
Alternatively, you can view the initial estimate of $\phi$ as a first step and use it to improve the estimate of the weight matrix, $A_N$.
Instead of imposing the structure of the weighting, you form the $H_i$ matrix through the following:
\[ H_i = \hat{\nu}_i\hat{\nu}_i' \]
You then complete the calculation as previously shown. The PROC PANEL option TWOSTEP specifies this estimation.
The case of multiple right-hand-side variables illustrates more clearly the power of Arellano and Bond (1991) and Arellano and Bover (1995).
Considering the general case you have:
\[ y_{it} = \sum_{l=1}^{maxlag}\phi_l y_{i(t-l)} + \beta X_i + \gamma_i + \alpha_t + \epsilon_{it} \]
It is clear that lags of the dependent variable are both not exogenous and correlated to the fixed effects. However, the independent variables can fall into one of several categories. An independent variable can be correlated and exogenous, uncorrelated and exogenous, correlated and predetermined, or uncorrelated and predetermined. The category in which an independent variable is found influences when or whether it becomes a suitable instrument. Note, however, that neither PROC PANEL nor Arellano and Bond require that a regressor be an instrument or that an instrument be a regressor.
First, consider the question of exogenous or endogenous. An exogenous variable is not correlated with the error term in the model at all. Therefore, all observations (on the exogenous variable) become valid instruments at all time periods. If the model has only one instrument and it happens to be exogenous, then the optimal instrument matrix looks like,
\[
Z_i =
\begin{pmatrix}
x_{i1} \cdots x_{iT} & 0 & 0 & \cdots & 0 \\
0 & x_{i1} \cdots x_{iT} & 0 & \cdots & 0 \\
0 & 0 & x_{i1} \cdots x_{iT} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & x_{i1} \cdots x_{iT}
\end{pmatrix}
\]
The situation for the predetermined variables becomes a little more difficult. A predetermined variable is one whose future realizations can be correlated to current shocks in the dependent variable. With such an understanding, it is admissible to allow all current and lagged realizations as instruments. In other words you have,
\[
Z_i =
\begin{pmatrix}
x_{i1} & 0 & 0 & \cdots & 0 \\
0 & x_{i1}\; x_{i2} & 0 & \cdots & 0 \\
0 & 0 & x_{i1} \cdots x_{i3} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & x_{i1} \cdots x_{i(T-2)}
\end{pmatrix}
\]
When the data contain a mix of endogenous, exogenous, and predetermined variables, the instrument matrix is formed by combining the three. The third observation would have one observation on the dependent variable as an instrument, three observations on the predetermined variables as instruments, and all observations on the exogenous variables. There is yet another set of moment restrictions that can be employed. An uncorrelated variable means that the variables level is not affected by the individual specic effect. You write the general model presented above as:
maxlag
yit D
X
lD1
l yi.t l/
C K k xitk C t C kD1
it
where
it
it .
Since the variables are uncorrelated with $\gamma$ and uncorrelated with the error, you can perform a system estimation with the difference and level equations. That is, the uncorrelated variables imply moment restrictions on the level equation. If you denote the new instrument matrix with the full complement of instruments available by an asterisk ($*$), and both $x^p$ and $x^e$ are uncorrelated, then you have:

\[
Z_i^{*} = \begin{pmatrix}
Z_i & 0 & 0 & \cdots & 0 \\
0 & x^{p}_{i1} & x^{e}_{i1} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & x^{p}_{iT} \; x^{e}_{iT}
\end{pmatrix}
\]
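The formation of the initial weighting matrix becomes somewhat problematic. If you denote the new weighting matrix with an asterisk ($*$), then you can write the following:

\[
A_N^{*} = \left( N^{-1} \sum_{i} Z_i^{*\prime} H_i^{*} Z_i^{*} \right)^{-1}
\]

where

\[
H_i^{*} = \begin{pmatrix}
H_i & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}
\]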
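To finish, you write out the two equations (or two stages) that are estimated:

\[
\Delta y_{it} = \beta^{*} \Delta S_i + \alpha_t - \alpha_{t-1} + \Delta\epsilon_{it}
\]
\[
y_{it} = \beta^{*} S_i + \gamma_i + \alpha_t + \epsilon_{it}
\]

where $S_i$ is the matrix of all explanatory variables: lagged endogenous, exogenous, and predetermined. Let $y_{it}^{*}$ and $S^{*}$ be given by

\[
y_{it}^{*} = \begin{pmatrix} \Delta y_{it} \\ y_{it} \end{pmatrix},
\qquad
S^{*} = \begin{pmatrix} \Delta S_i \\ S_i \end{pmatrix}
\]

The estimator is then

\[
\beta^{*} = \left[ \left( \sum_i S_i^{*\prime} Z_i^{*} \right) A_N^{*} \left( \sum_i Z_i^{*\prime} S_i^{*} \right) \right]^{-1}
\left( \sum_i S_i^{*\prime} Z_i^{*} \right) A_N^{*} \left( \sum_i Z_i^{*\prime} y_i^{*} \right)
\]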
If the TWOSTEP or ITGMM option is not requested, estimation terminates here. If it terminates, you can obtain the following information. The variance of the error term comes from the second stage equation, that is,

\[
\hat{\sigma}^2 = \frac{\hat{\epsilon}' \hat{\epsilon}}{M - p}
\]

where $p$ is the number of regressors. The variance covariance matrix can be obtained from

\[
\left[ \left( \sum_i S_i^{*\prime} Z_i^{*} \right) A_N^{*} \left( \sum_i Z_i^{*\prime} S_i^{*} \right) \right]^{-1} \hat{\sigma}^2
\]
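Alternatively, a robust estimate of the variance covariance matrix can be obtained by specifying the ROBUST option. Without further reestimation of the model, the $H_i$ matrix is recalculated from the outer products of the residuals as follows:

\[
H_{i,2} = \begin{pmatrix}
\Delta\hat{\epsilon}_i \, \Delta\hat{\epsilon}_i' & 0 \\
0 & \hat{\epsilon}_i \, \hat{\epsilon}_i'
\end{pmatrix}
\]

And the weighting matrix becomes

\[
A_{N,2} = \left( N^{-1} \sum_i Z_i^{*\prime} H_{i,2} Z_i^{*} \right)^{-1}
\]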
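Using the information above, you construct the robust variance covariance matrix from the following. Let $G$ denote a temporary matrix:

\[
G = \left[ \left( \sum_i S_i^{*\prime} Z_i^{*} \right) A_N^{*} \left( \sum_i Z_i^{*\prime} S_i^{*} \right) \right]^{-1}
\]

The robust variance covariance estimate is then

\[
G \left( \sum_i S_i^{*\prime} Z_i^{*} \right) A_N^{*} \, A_{N,2}^{-1} \, A_N^{*} \left( \sum_i Z_i^{*\prime} S_i^{*} \right) G
\]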
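Alternatively, the new weighting matrix can be used to form an updated estimate of the regression parameters. This results when the TWOSTEP option is requested. In short,

\[
\beta^{*} = \left[ \left( \sum_i S_i^{*\prime} Z_i^{*} \right) A_{N,2} \left( \sum_i Z_i^{*\prime} S_i^{*} \right) \right]^{-1}
\left( \sum_i S_i^{*\prime} Z_i^{*} \right) A_{N,2} \left( \sum_i Z_i^{*\prime} y_i^{*} \right)
\]

with variance covariance matrix

\[
\left[ \left( \sum_i S_i^{*\prime} Z_i^{*} \right) A_{N,2} \left( \sum_i Z_i^{*\prime} S_i^{*} \right) \right]^{-1}
\]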
As a final note, it is possible to iterate more than twice by specifying the ITGMM option. Such a multiple iteration should result in a more stable estimate of the variance covariance matrix. PROC PANEL allows two convergence criteria. Convergence can occur in the parameter estimates or in the weighting matrices. Iterate until

\[
\max_{i,j \le \dim(A_{N,k})} \frac{\left| A_{N,k+1}(i,j) - A_{N,k}(i,j) \right|}{\left| A_{N,k}(i,j) \right|} \le \mathrm{ATOL}
\]

or

\[
\max_{i \le \dim(\beta_k)} \frac{\left| \beta_{k+1}(i) - \beta_k(i) \right|}{\left| \beta_k(i) \right|} \le \mathrm{BTOL}
\]

where ATOL is the tolerance for convergence in the weighting matrix and BTOL is the tolerance for convergence in the parameter estimate matrix. The default convergence criterion is BTOL = 1E−8 for PROC PANEL.
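As a rough illustration of how the iterated estimator and its tolerances might be requested, consider the sketch below. The data set `mypanel` and the variables `cs`, `ts`, `y`, `x1`, and `x2` are hypothetical. The GMM keyword, the INSTRUMENTS statement, and the spelling of ATOL= and BTOL= as MODEL options are assumptions that should be checked against the PROC PANEL syntax reference; only ITGMM, ATOL, and BTOL are named in the text above.

```sas
/* Sketch only: mypanel, cs, ts, y, x1, x2 are hypothetical names. */
/* The GMM keyword and the INSTRUMENTS statement are assumed       */
/* syntax; verify them in the PROC PANEL syntax reference.         */
proc panel data=mypanel;
   id cs ts;
   instruments depvar;               /* lagged dependent variable as instrument (assumed) */
   model y = x1 x2 / gmm itgmm       /* iterate beyond the second step                    */
                     atol=1e-6       /* tolerance on the weighting matrix                 */
                     btol=1e-6;      /* tolerance on the parameter estimates              */
run;
```

Loosening the tolerances as shown would normally shorten the iteration at the cost of a less tightly converged estimate.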
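where $\hat{\epsilon}_i^{*}$ is the stacked error term (of the differenced equation and level equation). When the robust weighting matrix is used, the test statistic is computed as

\[
\left( \sum_i \hat{\epsilon}_i^{*\prime} Z_i^{*} \right) A_{N,2} \left( \sum_i Z_i^{*\prime} \hat{\epsilon}_i^{*} \right)
\]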
This definition of the Sargan test is used for all iterated estimations. The Sargan test is distributed as a $\chi^2$ with degrees of freedom equal to the number of moment conditions minus the number of parameters.

In addition to the Sargan test, PROC PANEL tests for autocorrelation in the residuals. These tests are distributed as standard normal. PROC PANEL tests the hypothesis that the autocorrelation of the $l$th lag is significant. Define $\omega_l$ as the lag of the differenced error, with zero padding for the missing values generated. Symbolically,

\[
\omega_l = \begin{pmatrix}
0 \\ \vdots \\ 0 \\ \Delta\epsilon_{1} \\ \vdots \\ \Delta\epsilon_{T-2-l}
\end{pmatrix}
\]
Note that the choice of $H_i$ is dependent on the stage of estimation. If the estimation is first stage, then you would use the matrix with twos along the main diagonal and minus ones along the primary subdiagonals. In a robust estimation or multi-step estimation, this matrix would be formed from the outer product of the residuals (from the previous step). Define the constant $k_2$ as

\[
k_2(l) = -2 \left( \sum_i \omega_{l,i}' S_i \right) G \left( \sum_i S_i' Z_i \right) A_{N,k} \left( \sum_i Z_i' H_i \, \omega_{l,i} \right)
\]

and the constant $k_3$ as

\[
k_3(l) = \left( \sum_i \omega_{l,i}' S_i \right) \hat{V} \left( \sum_i S_i' \omega_{l,i} \right)
\]

where $\hat{V}$ is the estimated variance covariance matrix of the parameter estimates. Using the four quantities, the test for autoregressive structure in the differenced residual is

\[
m(l) = \frac{k_0(l)}{\sqrt{k_1(l) + k_2(l) + k_3(l)}}
\]

The $m$ statistic is distributed as a normal random variable with mean zero and standard deviation of one.
Instrument Choice
Arellano and Bond's technique is a very useful method for dealing with any autoregressive characteristics in the data. However, there is one caveat to consider. Too many instruments bias the estimator toward the within estimate. Furthermore, many instruments make this technique not scalable. The weighting matrix becomes very large, so every operation that involves it becomes more computationally intensive. The PANEL procedure enables you to specify a bandwidth for instrument selection. For example, specifying MAXBAND=10 means that at most there will be ten time observations for each variable entering as an instrument. The default is to follow the Arellano-Bond methodology.
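A bandwidth restriction of the kind just described might be requested as in the following sketch. The data set and variable names are hypothetical, and the GMM keyword and INSTRUMENTS statement are assumed syntax rather than something stated in this section; MAXBAND= and TWOSTEP are the options the text itself names.

```sas
/* Sketch only: mypanel, cs, ts, y, x1, x2 are hypothetical names. */
/* GMM and INSTRUMENTS are assumed syntax; check the reference.    */
proc panel data=mypanel;
   id cs ts;
   instruments depvar;
   model y = x1 x2 / gmm twostep
                     maxband=10;     /* at most ten time observations per instrument variable */
run;
```

Capping the bandwidth keeps the instrument and weighting matrices small, which is the scalability concern raised above.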
In specifying a maximum bandwidth, you can also specify the selection of the time observations. There are three possibilities: leading, trailing (default), and centered. The exact consequence of choosing any of those possibilities depends on the variable type (correlated, exogenous, or predetermined) and the time period of the current observation.

If the MAXBAND option is specified, then the following is true under any selection criterion (let $t$ be the time subscript for the current observation). The first observation for the endogenous variable (as instrument) is $\max(t - \mathrm{MAXBAND}, 2)$ and the last instrument is $t - 2$. The first observation for a predetermined variable is $\max(t - \mathrm{MAXBAND}, 2)$ and the last is $t - 1$. The first and last observations for an exogenous variable are given in the list below.

Trailing: If $t < \mathrm{MAXBAND}$, then the first instrument is for the first time period and the last observation is MAXBAND. Otherwise, if $t \ge \mathrm{MAXBAND}$, then the first observation is $t - \mathrm{MAXBAND}$, while the last instrument to enter is $t$.

Centered: If $t \le \mathrm{MAXBAND}/2$, then the first observation is the first time period and the last observation is MAXBAND. If $t > T - \mathrm{MAXBAND}/2$, then the first instrument included is $T - \mathrm{MAXBAND} + 1$ and the last observation is $T$. If $\mathrm{MAXBAND}/2 < t \le T - \mathrm{MAXBAND}/2$, then the first included instrument is $t - \mathrm{MAXBAND}/2 + 1$ and the last observation is $t + \mathrm{MAXBAND}/2$. If the MAXBAND value is an odd number, the procedure decrements it by one.

Leading: If $t > T - \mathrm{MAXBAND}$, then the first instrument corresponds to time period $T - \mathrm{MAXBAND}$, while the last observation is $T$. Otherwise, if $t \le T - \mathrm{MAXBAND}$, then the first observation is $t$ and the last observation is $t + \mathrm{MAXBAND}$.

The PANEL procedure enables you to include dummy variables to deal with the presence of time effects not captured by including the lagged dependent variable. The dummy variables directly affect the level equations. However, this implies that the difference of the dummy variables for time periods $t$ and $t - 1$ enters the difference equation. The first usable observation occurs at $t = 3$. If the level equation is not used in the estimation, then there is no way to identify the dummy variables. Selecting the TIME option gives the same result as that which would be obtained by creating dummy variables in the data set and using those in the regression.

The PANEL procedure gives you several options when it comes to missing values and unbalanced panels. By default, any time period for which there are missing values is skipped. The corresponding rows and columns of the $H$ matrices are zeroed, and the calculation is continued. Alternatively, you can elect to replace missing values and missing observations with zeros (ZERO), the overall mean of the series (OAM), the cross-sectional mean (CSM), or the time series mean (TSM).
.R
r/
However, it is also possible to write the F statistic as

\[
F = \frac{(\hat{u}_{*}' \hat{u}_{*} - \hat{u}' \hat{u}) / J}{\hat{u}' \hat{u} / (M - K)}
\]

where

$\hat{u}_{*}$ is the residual vector from the restricted regression
$\hat{u}$ is the residual vector from the unrestricted regression
$J$ is the number of restrictions
$(M - K)$ are the degrees of freedom
The Wald, likelihood ratio (LR), and Lagrange multiplier (LM) tests are all related to the F test. You use this relationship of the F test to the likelihood ratio and Lagrange multiplier tests. The Wald test is calculated from its definition. The Wald test statistic is:

\[
W = (R\hat{\beta} - r)' \left[ R V R' \right]^{-1} (R\hat{\beta} - r)
\]
The advantage of calculating Wald in this manner is that it enables you to substitute a heteroscedasticity-corrected covariance matrix for the matrix $V$. PROC PANEL makes such a substitution if you request the HCCME option in the MODEL statement. The likelihood ratio is:

\[
LR = M \ln\left[ 1 + \frac{1}{M - K} \, JF \right]
\]

The Lagrange multiplier test statistic is:

\[
LM = M \left[ \frac{JF}{M - K + JF} \right]
\]

Note that only the Wald is changed when the HCCME option is selected. The LR and LM tests are unchanged. The distribution of these test statistics is the $\chi^2$ with degrees of freedom equal to the number of restrictions imposed ($J$). The three tests are asymptotically equivalent, but they have differing small sample properties. Greene (2000, p. 392) and Davidson and MacKinnon (1993, pg. 456-458) discuss the small sample properties of these statistics.
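The relationship between the F statistic and the LR and LM statistics, $LR = M \ln(1 + JF/(M-K))$ and $LM = M \, JF/(M-K+JF)$, can be evaluated directly in a DATA step. The following is a minimal sketch: the input values for M, K, J, and F are hypothetical and are not taken from any output in this chapter.

```sas
/* Sketch: LR and LM statistics computed from an F statistic.     */
/* M, K, J, and F below are hypothetical illustration values.     */
data lr_lm;
   M = 90;      /* number of observations  */
   K = 4;       /* number of parameters    */
   J = 2;       /* number of restrictions  */
   F = 3.5;     /* F statistic             */

   LR = M * log(1 + J*F / (M - K));      /* likelihood ratio statistic    */
   LM = M * (J*F / (M - K + J*F));       /* Lagrange multiplier statistic */

   /* Both statistics are compared with a chi-square with J degrees of freedom */
   p_LR = 1 - probchi(LR, J);
   p_LM = 1 - probchi(LM, J);
run;

proc print data=lr_lm noobs;
run;
```

Because both statistics are monotone functions of F, they lead to the same ordering of evidence, although their p-values differ in small samples.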
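covariance matrix. Consider the simple linear model (this discussion parallels the discussion in Davidson and MacKinnon, 1993, pg. 548-562):

\[
y = X\beta + \epsilon
\]

The assumptions that make the linear regression best linear unbiased estimator (BLUE) are $E(\epsilon) = 0$ and $E(\epsilon\epsilon') = \Omega$, where $\Omega$ has the simple structure $\sigma^2 I$. Heteroscedasticity results in a general covariance structure, so that it is not possible to simplify $\Omega$. The result is the following:

\[
\tilde{\beta} = (X'X)^{-1} X'y = (X'X)^{-1} X'(X\beta + \epsilon) = \beta + (X'X)^{-1} X'\epsilon
\]

As long as the following is true, then you are assured that the OLS estimate is consistent and unbiased:

\[
\operatorname{plim}_{n \to \infty} \frac{1}{n} X'\epsilon = 0
\]

If the regressors are nonrandom, then it is possible to write the variance of the estimated $\beta$ as the following:

\[
\mathrm{Var}(\tilde{\beta}) = (X'X)^{-1} X' \Omega X (X'X)^{-1}
\]

The effect of structure in the variance covariance matrix can be ameliorated by using generalized least squares (GLS), provided that $\Omega^{-1}$ can be calculated. Using $\Omega^{-1}$, you premultiply both sides of the regression equation,

\[
L^{-1} y = L^{-1} X\beta + L^{-1}\epsilon
\]

where $L$ is a matrix such that $LL' = \Omega$. The resulting GLS estimator is

\[
\hat{\beta} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} y
= (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} (X\beta + \epsilon)
= \beta + (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} \epsilon
\]

with variance

\[
\mathrm{Var}(\hat{\beta}) = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} \, \Omega \, \Omega^{-1} X (X'\Omega^{-1}X)^{-1} = (X'\Omega^{-1}X)^{-1}
\]

The difference in variance between the OLS estimator and the GLS estimator can be written as

\[
(X'X)^{-1} X' \Omega X (X'X)^{-1} - (X'\Omega^{-1}X)^{-1}
\]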
By the Gauss-Markov theorem, the difference matrix must be positive definite under most circumstances (zero if OLS and GLS are the same, when the usual classical regression assumptions are met). Thus, OLS is not efficient under a general error structure. It is crucial to realize that OLS does not produce biased results. It would suffice if you had a method for estimating a consistent covariance matrix and you used the OLS $\beta$. Estimation of the $\Omega$ matrix is by no means simple. The matrix is square and has $M^2$ elements, so unless some sort of structure is assumed, it becomes an impossible problem to solve. However, the heteroscedasticity can have quite a general structure. White (1980) shows that it is not necessary to have a consistent estimate of $\Omega$. On the contrary, it suffices to get an estimate of the middle expression. That is, you need an estimate of:

\[
\Lambda = X' \Omega X
\]
This matrix, $\Lambda$, is easier to estimate because its dimension is $K$. PROC PANEL provides the following classical HCCME estimators. The matrix $\Lambda$ is approximated by:

HCCME=N0: This is the simple OLS estimator, so $\Lambda$ is approximated by $\sigma^2 X'X$. If you do not request the HCCME= option, then PROC PANEL defaults to this estimator.

HCCME=0:
\[
\frac{1}{M} \sum_{i=1}^{M} \hat{\epsilon}_i^2 \, x_i x_i'
\]

HCCME=1: The HCCME=0 estimator multiplied by the degrees-of-freedom correction $M/(M-K)$.

HCCME=2:
\[
\frac{1}{M} \sum_{i=1}^{M} \frac{\hat{\epsilon}_i^2}{1 - \hat{h}_i} \, x_i x_i'
\]
The $\hat{h}_i$ term is the $i$th diagonal element of the so-called hat matrix. The expression for $\hat{h}_i$ is $X_i (X'X)^{-1} X_i'$. The hat matrix attempts to adjust the estimates for the presence of influence or leverage points.

HCCME=3:
\[
\frac{1}{M} \sum_{i=1}^{M} \frac{\hat{\epsilon}_i^2}{(1 - \hat{h}_i)^2} \, x_i x_i'
\]

HCCME=4: This is the Arellano (1987) version of the White (1980) HCCME for panel data. Arellano's insight is that in a panel there are $N$ covariance matrices, each corresponding to a cross section. After forming the White (1980) HCCME for each cross section, you need only take the average of those $N$ estimators to obtain the Arellano estimator. The details of the estimation follow. First, you arrange the data such that the first cross section occupies the first $T_i$ observations. You treat the panels as separate regressions with the form:

\[
y_i = \alpha_i \mathbf{i} + X_{is} \tilde{\beta} + \epsilon_i
\]

The parameter estimates $\tilde{\beta}$ and $\alpha_i$ are the result of the LSDV or within estimator. $\mathbf{i}$ is a vector of ones of length $T_i$. The estimate of the $i$th cross section's $X'\Omega X$ matrix (where the $s$ subscript indicates that no constant column has been suppressed, to avoid confusion) is $X_i' \Omega X_i$. The estimate for the whole sample is:

\[
X_s' \Omega X_s = \sum_{i=1}^{N} X_i' \Omega X_i
\]
The Arellano standard error is in fact a White-Newey-West estimator with constant and equal weight on each component. It should be noted that, in the between estimators, selecting HCCME=4 returns the HCCME=0 result since there is no other variable to group by. In their discussion, Davidson and MacKinnon (1993, pg. 554) argue that HCCME=1 should always be preferred to HCCME=0. While generally HCCME=3 is preferred to 2 and 2 is preferred to 1, the calculation of HCCME=1 is as simple as the calculation of HCCME=0. Therefore, it is clear that HCCME=1 is preferred when the calculation of the hat matrix is too tedious. All HCCME estimators have well-defined asymptotic properties. The small sample properties are not well known, and care must be exercised when sample sizes are small. The HCCME estimator of $\mathrm{Var}(\beta)$ is used to derive the covariance matrices for the fixed effects and the Lagrange multiplier standard errors. Robust estimates of the variance covariance matrix for $\beta$ imply robust variance covariance matrices for all other parameters.
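As a small illustration, the HCCME= option is placed on the MODEL statement alongside the estimation method. The sketch below uses hypothetical data set and variable names (`mypanel`, `cs`, `ts`, `y`, `x1`, `x2`) and compares two of the estimators described above.

```sas
/* Sketch only: mypanel, cs, ts, y, x1, x2 are hypothetical names. */
proc panel data=mypanel;
   id cs ts;
   /* One-way fixed effects with the hat-matrix-adjusted estimator */
   model y = x1 x2 / fixone hccme=3;
   /* Same model with the Arellano (1987) panel version            */
   model y = x1 x2 / fixone hccme=4;
run;
```

Only the standard errors (and the Wald tests that depend on them) should differ between the two MODEL statements; the parameter estimates are the same.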
R-Square
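The conventional R-square measure is inappropriate for all models that the PANEL procedure estimates by using GLS since a number outside the [0,1] range might be produced. Hence, a generalization of the R-square measure is reported. The following goodness-of-fit measure (Buse 1973) is reported:

\[
R^2 = 1 - \frac{\hat{u}' \hat{V}^{-1} \hat{u}}{y' D' \hat{V}^{-1} D y}
\]

where $\hat{u}$ are the residuals of the transformed model, $\hat{u} = y - X (X' \hat{V}^{-1} X)^{-1} X' \hat{V}^{-1} y$, and

\[
D = I_M - j_M j_M' \left( \frac{\hat{V}^{-1}}{j_M' \hat{V}^{-1} j_M} \right)
\]

where $j_M$ is an $M \times 1$ vector of ones. This is a measure of the proportion of the transformed sum of squares of the dependent variable that is attributable to the influence of the independent variables.

If there is no intercept in the model, the corresponding measure (Theil 1961) is

\[
R^2 = 1 - \frac{\hat{u}' \hat{V}^{-1} \hat{u}}{y' \hat{V}^{-1} y}
\]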
However, the fixed-effects models are somewhat different. In the case of a fixed-effects model, the choice of including or excluding an intercept becomes merely a choice of classification. Suppressing the intercept in the FIXONE or FIXONETIME case merely changes the name of the intercept to a fixed effect. It makes no sense to redefine the R-square measure since nothing material changes in the model. Similarly, for the FIXTWO model there is no reason to change R-square. In the case of the FIXONE, FIXONETIME, and FIXTWO models, the R-square is defined as the Theil (1961) R-square (detailed above). This makes intuitive sense since you are regressing a transformed (demeaned) series on transformed regressors, excluding a constant. In other words, you are looking at one minus the sum of squared errors divided by the sum of squares of the (transformed) dependent variable. In the case of OLS estimation, both of the R-square formulas given here reduce to the usual R-square formula.
Specification Tests
The PANEL procedure outputs the results of one specification test for fixed effects and two specification tests for random effects.

For fixed effects, let $\beta_f$ be the $n$ dimensional vector of fixed-effects parameters. The specification test reported is the conventional F statistic for the hypothesis $\beta_f = 0$. The F statistic with $n, M - K$ degrees of freedom is computed as

\[
\hat{\beta}_f' \, \hat{S}_f^{-1} \, \hat{\beta}_f / n
\]

where $\hat{S}_f$ is the estimated covariance matrix of the fixed-effects parameters.

Hausman's (1978) specification test or $m$ statistic can be used to test hypotheses in terms of bias or inconsistency of an estimator. This test was also proposed by Wu (1973) and further extended in Hausman and Taylor (1982). Hausman's $m$ statistic is as follows.

Consider two estimators, $\hat{\beta}_a$ and $\hat{\beta}_b$, which under the null hypothesis are both consistent, but only $\hat{\beta}_a$ is asymptotically efficient. Under the alternative hypothesis, only $\hat{\beta}_b$ is consistent. The $m$ statistic is

\[
m = (\hat{\beta}_b - \hat{\beta}_a)' (\hat{S}_b - \hat{S}_a)^{-1} (\hat{\beta}_b - \hat{\beta}_a)
\]

where $\hat{S}_b$ and $\hat{S}_a$ are consistent estimates of the asymptotic covariance matrices of $\hat{\beta}_b$ and $\hat{\beta}_a$. Then $m$ is distributed $\chi^2$ with $k$ degrees of freedom, where $k$ is the dimension of $\hat{\beta}_a$ and $\hat{\beta}_b$.

In the random-effects specification, the null hypothesis of no correlation between effects and regressors implies that the OLS estimates of the slope parameters are consistent and inefficient but the GLS estimates of the slope parameters are consistent and efficient. This facilitates a Hausman specification test.
Breusch and Pagan (1980) lay out a Lagrange multiplier test for random effects based on the simple OLS (pooled) estimator. If $\hat{u}_{it}$ is the $it$th residual from the OLS regression, then the Breusch-Pagan (BP) test for one-way random effects is

\[
BP = \frac{NT}{2(T-1)} \left[ \frac{\sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \hat{u}_{it} \right]^2}{\sum_{i=1}^{N} \sum_{t=1}^{T} \hat{u}_{it}^2} - 1 \right]^2
\]

The BP test generalizes to the case of a two-way random-effects model (Greene 2000, page 589). Specifically,

\[
BP2 = \frac{NT}{2(T-1)} \left[ \frac{\sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \hat{u}_{it} \right]^2}{\sum_{i=1}^{N} \sum_{t=1}^{T} \hat{u}_{it}^2} - 1 \right]^2
+ \frac{NT}{2(N-1)} \left[ \frac{\sum_{t=1}^{T} \left[ \sum_{i=1}^{N} \hat{u}_{it} \right]^2}{\sum_{i=1}^{N} \sum_{t=1}^{T} \hat{u}_{it}^2} - 1 \right]^2
\]

is distributed as a $\chi^2$ statistic with two degrees of freedom. Since the BP2 test generalizes (nests the BP test) the test for random effects, the absence of random effects (nonrejection of the null of no random effects) in the BP2 is a fairly clear indication that there will probably not be any one-way effects either. In both cases (BP and BP2), the residuals are obtained from a pooled regression. There is very little extra cost in selecting both the BP and BP2 test. Notice that in the case of just groupwise heteroscedasticity, the BP2 test approaches BP. In the case of time-based heteroscedasticity, the BP2 test reduces to a BP test of time effects. In the case of unbalanced panels, neither the BP nor BP2 statistics are valid.

Finally, you should be aware that the BP option generates different results depending on whether the estimation is FIXONE or FIXONET. Specifically, under the FIXONE estimation technique, the BP tests for cross-sectional random effects. Under the FIXONET estimation, the BP tests for time random effects.

While the Hausman statistic is automatically generated, you request Breusch-Pagan via the BP or BP2 option (see Baltagi 1995 for details).
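Since the Breusch-Pagan tests are requested through the BP and BP2 options, a request for both might look like the following sketch. The data set and variable names are hypothetical, and the placement of BP and BP2 as MODEL statement options alongside FIXONE is an assumption based on the description above.

```sas
/* Sketch only: mypanel, cs, ts, y, x1, x2 are hypothetical names. */
/* BP and BP2 are assumed to be MODEL statement options.           */
proc panel data=mypanel;
   id cs ts;
   model y = x1 x2 / fixone bp bp2;   /* one-way and two-way Breusch-Pagan tests */
run;
```

Requesting both tests together reflects the remark above that there is very little extra cost in doing so. Neither statistic is valid for an unbalanced panel.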
Troubleshooting
Some guidelines need to be followed when you use PROC PANEL for analysis. For each cross section, PROC PANEL requires at least two time series observations with nonmissing values for all model variables. There should be at least two cross sections for each time point in the data. If these two conditions are not met, then an error message is printed in the log stating that there is only one cross section or time series observation and further computations will be terminated. You have to give adequate data for an estimation method to produce results, and you should check the log for any data-related errors.

If the number of cross sections is greater than the number of time series observations per cross section, PROC PANEL, when using the PARKS method, produces an error message stating that the phi matrix is singular. This is analogous to seemingly unrelated regression with fewer observations than equations in the model. To avoid the problem, reduce the number of cross sections.

Your data set could have multiple observations for each time ID within a particular cross section. However, PROC PANEL is applicable only in cases where you have only a single observation for each time ID within each cross section. If the data contain multiple observations for a time ID within a cross section, then even after you have sorted the data, an error message specifying that the data has not been sorted in ascending sequence with respect to time series ID appears in the log. The data set shown in Figure 19.2 illustrates the correct representation.
Figure 19.2 Single Observation for Each Time Series
Obs    firm    year    production       cost
  1       1    1955       5.36598    1.14867
  2       1    1960       6.03787    1.45185
  3       1    1965       6.37673    1.52257
  4       1    1970       6.93245    1.76627
  5       2    1955       6.54535    1.35041
  6       2    1960       6.69827    1.71109
  7       2    1965       7.40245    2.09519
  8       2    1970       7.82644    2.39480
In this case, you can observe that there are no multiple observations with respect to a given time series ID within a cross section. This is the correct representation of a data set where PROC PANEL is applicable. If for firm 1 you have two observations for year=1955, then PROC PANEL produces the following error message:

The data set is not sorted in ascending sequence with respect to time series ID. The current time period has year=1955 and the previous time period has year=1955 in cross section firm=1.

A data set similar to the previous example with multiple observations for YEAR=1955 is shown in Figure 19.3; this data set results in an error message due to multiple observations while using PROC PANEL.
In order to use PROC PANEL, you need to aggregate the data so that you have unique time ID values within each cross section. One possible way to do this is to run a PROC MEANS on the input data set and compute the mean of all the variables by FIRM and YEAR, and then use the output data set.
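The aggregation step just described can be sketched as follows. The input data set name `a` and the output name `a_unique` are hypothetical; the variables FIRM, YEAR, PRODUCTION, and COST follow Figure 19.2, and the final model is purely illustrative.

```sas
/* Sketch: collapse duplicate (firm, year) rows to their means.   */
/* Data set names a and a_unique are hypothetical.                */
proc means data=a noprint nway;
   class firm year;                          /* one output row per firm-year pair */
   var production cost;
   output out=a_unique(drop=_type_ _freq_)
          mean=;                             /* keep the original variable names  */
run;

/* The aggregated data now have a unique time ID within each cross section */
proc panel data=a_unique;
   id firm year;
   model cost = production / fixone;         /* illustrative model only */
run;
```

With the CLASS statement and the NWAY option, the output data set comes back ordered by FIRM and YEAR, so no separate sort should be needed before PROC PANEL.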
ODS Graphics
This section describes the use of ODS for creating graphics with the PANEL procedure. Both the graphical results and the syntax for specifying them are subject to change in a future release. To request these graphs, you must specify the ODS GRAPHICS statement. The table below lists the graph names, the plot descriptions, and the options used.
Table 19.2 ODS Graphics Produced by PROC PANEL
ODS Graph Name      Plot Description                          Plots=Option
ResidualPlot        Plot of the residuals                     Residual, Resid
FitPlot             Predicted versus actual plot              Fitplot
QQPlot              Plot of the quantiles of the residuals    QQ
ResidSurfacePlot    Surface plot of the residuals             Residsurface
PredSurfacePlot     Surface plot of the predicted values      Predsurface
ActSurfacePlot      Surface plot of the actual values         Actsurface
ResidStackPlot      Stack plot of the residuals               Residstack, Resstack
ResidHistogram      Plot of the histogram of residuals        Residualhistogram, Residhistogram
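A single graph from Table 19.2 might be requested as in the sketch below. The data set and variable names are hypothetical, and the PLOTS(ONLY)= form is modeled on the PLOTS(UNPACK ONLY)= usage that appears in the examples of this chapter, so treat the exact spelling as an assumption.

```sas
/* Sketch only: mypanel, cs, ts, y, x1, x2 are hypothetical names. */
ods graphics on;

proc panel data=mypanel;
   id cs ts;
   model y = x1 x2 / fixone plots(only) = qq;   /* Q-Q plot of the residuals only */
run;

ods graphics off;
```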
is a character variable that contains the label for the MODEL statement if a label is specified.
is a character variable that identifies the estimation method.
is the number of the model estimated.
contains the value of the dependent variable.
contains the weighting variable.
is the value of the cross section ID.
is the value of the time period in the dynamic model.
are the values of regressor variables specified in the MODEL statement.
if PRED= name1 and/or RESIDUAL= name2 options are specified, then name1 and name2 are the columns of predicted values of the dependent variable and residuals of the regression, respectively.
_NAME_
_VARCS_      is the variance component estimate due to cross sections. The _VARCS_ variable is included in the OUTEST= data set when either the FULLER or DASILVA option is specified.

_VARTS_      is the variance component estimate due to time series. The _VARTS_ variable is included in the OUTEST= data set when either the FULLER or DASILVA option is specified.

_VARERR_     is the variance component estimate due to error. The _VARERR_ variable is included in the OUTEST= data set when the FULLER option is specified.

_A_1         is the first-order autoregressive parameter estimate. The _A_1 variable is included in the OUTEST= data set when the PARKS option is specified. The values of _A_1 are cross-sectional parameters, meaning that they are estimated for each cross section separately. The _A_1 variable has a value only for _TYPE_=CSPARMS observations. The cross section to which the estimate belongs is indicated by the _CSID_ variable.

INTERCEP     is the intercept parameter estimate. (INTERCEP is missing for models when the NOINT option is specified.)

regressors   are the regressor variables specified in the MODEL statement. The regressor variables in the OUTEST= data set contain the corresponding parameter estimates for the model identified by _MODEL_ for _TYPE_=PARMS observations, and the corresponding covariance or correlation matrix elements for _TYPE_=COVB and _TYPE_=CORRB observations. The response variable contains the value −1 for the _TYPE_=PARMS observation for its model.
First, z2 is copied over. Then y, x1, x2, and x3 are replaced with their mean deviates (from cross sectional). Furthermore, two new variables are created.

_MODELL_     is the model's label (if it exists).

_TRTYPE_     is the model's transformation type. In the FIXONE case, this is _FIXONE_ or _FIXONET_. If the RANONE model is selected, the _TRTYPE_ variable is either _Ran1FB_, _Ran1WK_, _Ran1WH_, or _Ran1NL_, depending on the variance component estimators chosen.
Printed Output
For each MODEL statement, the printed output from PROC PANEL includes the following:

- a model description, which gives the estimation method used, the model statement label if specified, the number of cross sections and the number of observations in each cross section, and the order of moving average error process for the DASILVA option. For fixed-effects model analysis, an F test for the absence of fixed effects is produced, and for random-effects model analysis, a Hausman test is used for the appropriateness of the random-effects specification.
- the estimates of the underlying error structure parameters
- the regression parameter estimates and analysis. For each regressor, this includes the name of the regressor, the degrees of freedom, the parameter estimate, the standard error of the estimate, a t statistic for testing whether the estimate is significantly different from 0, and the significance probability of the t statistic.

Optionally, PROC PANEL prints the following:

- the covariance and correlation of the resulting regression parameter estimates for each model and assumed error structure
- the $\hat{\Phi}$ matrix that is the estimated contemporaneous covariance matrix for the PARKS option
Table 19.3 (continued)

ODS Table Name              Description

ODS Tables Created by the MODEL Statement
ModelDescription            Model description
FitStatistics               Fit statistics
FixedEffectsTest            F test for no fixed effects
ParameterEstimates          Parameter estimates
                            Covariance of parameter estimates
                            Correlations of parameter estimates
                            Variance component estimates
RandomEffectsTest           Hausman test for random effects
AR1Estimates                First-order autoregressive parameter estimates
TestResults                 Tests of linear restrictions
BreuschPaganTest            Breusch-Pagan one-way test
BreuschPaganTest2           Breusch-Pagan two-way test
Sargan                      Sargan's test for overidentification
ARTest                      Autoregression test for the residuals
IterHist                    Iteration history
ConvergenceStatus           Convergence status of iterated GMM estimator
EstimatedPhiMatrix          Estimated phi matrix
EstimatedAutocovariances    Estimates of autocovariances
         t  = 'Per Capita Time Deposits'
         s  = 'Per Capita S & L Association Shares'
         y  = 'Permanent Per Capita Personal Income'
         rd = 'Service Charge on Demand Deposits'
         rt = 'Interest on Time Deposits'
         rs = 'Interest on S & L Association Shares';
   datalines;
CA 1949 6.2785 6.1924 4.4998 7.2056 -1.0700 0.1080
CA 1950 6.4019 6.2106 4.6821 7.2889 -1.0106 0.1501
CA 1951 6.5058 6.2729 4.8598 7.3827 -1.0024 0.4008
CA 1952 6.4785 6.2729 5.0039 7.4000 -0.9970 0.4492
CA 1953 6.4118 6.2538 5.1761 7.4200 -0.8916 0.4662

   ... more lines ...
As shown in the following statements, the SORT procedure is used to sort the data into the required time series cross-sectional format; then PROC PANEL analyzes the data.
   proc sort data=a;
      by state year;
   run;

   proc panel data=a;
      model d = y rd rt rs / fuller parks dasilva m=7;
      model t = y rd rt rs / parks;
      model s = y rd rt rs / parks;
      id state year;
   run;
The income elasticities for liquid assets are greater than 1 except for the demand deposit income elasticity (0.692757) estimated by the Da Silva method. In Output 19.1.1, Output 19.1.2, and Output 19.1.3, the coefficient estimates (−0.29094, −0.43591, and −0.27736) of demand deposits (RD) imply that demand deposits increase significantly as the service charge is reduced. The price elasticities (0.227152 and 0.408066) for time deposits (RT) and S&L association shares (RS) have the expected sign. Thus an increase in the interest rate on time deposits or S&L shares will increase the demand for the corresponding liquid asset.

Demand deposits and S&L shares appear to be substitutes (see Output 19.1.2, Output 19.1.3, and Output 19.1.5). Time deposits are also substitutes for S&L shares in the time deposit demand equation (see Output 19.1.4), while these liquid assets are independent of each other in Output 19.1.5 (insignificant coefficient estimate of RT, 0.02705). Demand deposits and time deposits appear to be weak complements in Output 19.1.3 and Output 19.1.4, while the cross elasticities between demand deposits and time deposits are not significant in Output 19.1.2 and Output 19.1.5.
Variance Component Estimates

Variance Component for Cross Sections    0.03427
Variance Component for Time Series       0.00026
Variance Component for Error             0.00111

Hausman Test for Random Effects

DF    m Value    Pr > m
 4       5.51    0.2385
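The reported significance level of the Hausman statistic can be checked against the chi-square distribution with a short DATA step. This is only a sketch: it takes the m value and degrees of freedom from the table above and uses the PROBCHI function. Because the printed m value is rounded, the result should be close to, but not necessarily identical with, the reported 0.2385.

```sas
/* Sketch: upper-tail chi-square probability for the Hausman m statistic */
data hausman_check;
   m  = 5.51;                  /* m value from the output above */
   df = 4;                     /* degrees of freedom            */
   p  = 1 - probchi(m, df);    /* Pr > m                        */
run;

proc print data=hausman_check noobs;
run;
```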
Variable     DF    Label
Intercept     1    Intercept
y             1    Permanent Per Capita Personal Income
rd            1    Service Charge on Demand Deposits
rt            1    Interest on Time Deposits
rs            1    Interest on S & L Association Shares
Output 19.1.5 Demand for Savings and Loan Shares, Parks Method
The PANEL Procedure
Parks Method Estimation

Dependent Variable: s Per Capita S & L Association Shares

Model Description
Estimation Method           Parks
Number of Cross Sections        7
Time Series Length             11

Fit Statistics
SSE         39.2550    DFE             72
MSE          0.5452    Root MSE    0.7384
R-Square     0.9017

Variable     DF    Label
Intercept     1    Intercept
y             1    Permanent Per Capita Personal Income
rd            1    Service Charge on Demand Deposits
rt            1    Interest on Time Deposits
rs            1    Interest on S & L Association Shares
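\[
\cdots + (\alpha_i - \alpha_N) + (\gamma_t - \gamma_T) + \epsilon_{it}
\]

where the $\alpha$ are the pure cross-sectional effects and $\gamma$ are the time effects. The actual model speculated is highly nonlinear in the original variables. It would look like the following:

\[
TC_{it} = \exp\left( \alpha_i + \gamma_t + \beta_3 LF_{it} + \epsilon_{it} \right) Q_{it}^{\beta_1} PF_{it}^{\beta_2}
\]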
   data airline;
      input Obs I T C Q PF LF;
      label obs = "Observation number";
      label I = "Firm Number (CSID)";
      label T = "Time period (TSID)";
      label Q = "Output in revenue passenger miles (index)";
      label C = "Total cost, in thousands";
      label PF = "Fuel price";
      label LF = "Load Factor (utilization index)";
   datalines;

   ... more lines ...
   data airline;
      set airline;
      lC = log(C);
      lQ = log(Q);
      lPF = log(PF);
      label lC = "Log transformation of costs";
      label lQ = "Log transformation of quantity";
      label lPF= "Log transformation of price of fuel";
   run;
First, you see the model's description in Output 19.2.1. The model is a two-way fixed-effects model. There are six cross sections and fifteen time observations.

Output 19.2.1 The Airline Cost Data: Model Description

The PANEL Procedure
Fixed Two Way Estimates

Dependent Variable: lC Log transformation of costs

Model Description
Estimation Method           FixTwo
Number of Cross Sections         6
Time Series Length              15
The R-square and degrees of freedom can be seen in Output 19.2.2. On the whole, you see a large R-square, so there is a reasonable fit. The degrees of freedom of the estimate are 90 minus 14 time dummy variables minus 5 cross section dummy variables and 4 regressors, which leaves 67.

The F test for fixed effects is shown in Output 19.2.3. Testing the hypothesis that there are no fixed effects, you easily reject the null of poolability. There are group effects, or time effects, or both. The test is highly significant. OLS would not give reasonable results.
Output 19.2.3 The Airline Cost Data: Test for Fixed Effects

F Test for No Fixed Effects

Num DF    Den DF    F Value    Pr > F
    19        67      23.10    <.0001
Looking at the parameters, you see a more complicated pattern. Most of the cross-sectional effects are highly significant (with the exception of CS2). This means that the cross sections are significantly different from the sixth cross section. Many of the time effects show significance, but this is not uniform. It looks like the significance might be driven by a large 15th period effect, since the first six time effects are negative and of similar magnitude. The time dummy variables taper off in size and lose significance from time period 12 onward. There are many causes to which you could attribute this decay of time effects. The time period of the data spans the OPEC oil embargoes and the dissolution of the Civil Aeronautics Board (CAB). These two forces are two possible reasons to observe the decay and parameter instability. As for the regression parameters, you see that quantity affects cost positively, and the price of fuel has a positive effect, but load factors negatively affect the costs of the airlines in this sample. The somewhat disturbing result is that the fuel cost is not significant. If the time effects are proxies for the effect of the oil embargoes, then an insignificant fuel cost parameter would make some sense. If the dummy variables proxy for the dissolution of the CAB, then the effect of load factors is also not being precisely estimated.
Variable    DF   Estimate   t Value   Pr > |t|   Label
CS1          1   0.174237      2.02     0.0470   Cross Sectional Effect 1
CS2          1   0.111412      1.43     0.1576   Cross Sectional Effect 2
CS3          1   -0.14354     -2.77     0.0073   Cross Sectional Effect 3
CS4          1    0.18019      5.61     <.0001   Cross Sectional Effect 4
CS5          1   -0.04671     -2.08     0.0415   Cross Sectional Effect 5
TS1          1   -0.69286     -2.05     0.0442   Time Series Effect 1
TS2          1   -0.63816     -1.92     0.0589   Time Series Effect 2
TS3          1   -0.59554     -1.81     0.0751   Time Series Effect 3
TS4          1   -0.54192     -1.70     0.0939   Time Series Effect 4
TS5          1   -0.47288     -2.04     0.0454   Time Series Effect 5
TS6          1   -0.42705     -2.27     0.0267   Time Series Effect 6
TS7          1   -0.39586     -2.28     0.0255   Time Series Effect 7
TS8          1   -0.33972     -2.26     0.0269   Time Series Effect 8
TS9          1    -0.2718     -2.02     0.0478   Time Series Effect 9
TS10         1   -0.22734     -2.98     0.0040   Time Series Effect 10
TS11         1    -0.1118     -3.50     0.0008   Time Series Effect 11
TS12         1   -0.03366     -0.78     0.4354   Time Series Effect 12
TS13         1   -0.01775     -0.49     0.6261   Time Series Effect 13
TS14         1   -0.01865     -0.61     0.5430   Time Series Effect 14
Intercept    1   12.93834      5.83     <.0001   Intercept
lQ           1   0.817264     25.66     <.0001   Log transformation of quantity
lPF          1   0.168732      1.03     0.3057   Log transformation of price of fuel
LF           1   -0.88267     -3.37     0.0012   Load Factor (utilization index)
The preceding statements result in plots shown in Output 19.2.5 and Output 19.2.6.
Output 19.2.5 Diagnostic Panel 1
The UNPACK and ONLY options produce individual detail images of paneled plots. The graph shown in Output 19.2.7 shows a detail plot of residuals by cross section. The packed version always puts all cross sections on one plot while the unpacked one shows the cross sections in groups of ten to avoid loss of detail.
proc panel data=airline;
   id i t;
   model lC = lQ lPF LF / fixtwo plots(unpack only) = residsurface;
run;
The preceding statements result in Output 19.3.1. The fit seems to have deteriorated somewhat. The SSE rises from 0.1768 to 0.2926.
You still reject poolability based on the F test in Output 19.3.2 at all accepted levels of significance.
Output 19.3.2 The Airline Cost Data: Test for Fixed Effects
F Test for No Fixed Effects
Num DF   Den DF   F Value   Pr > F
     5       81     57.74   <.0001
The parameters change somewhat dramatically, as shown in Output 19.3.3. The effect of fuel costs comes in very strong and significant. The load factor coefficient increases in magnitude, although not as dramatically. This suggests that the fixed time effects might be proxies for both the oil shocks and deregulation.
Output 19.3.3 The Airline Cost Data: Parameter Estimates
Parameter Estimates
Variable    DF   Estimate   Standard Error   t Value   Pr > |t|   Label
CS1          1   -0.08708           0.0842     -1.03     0.3041   Cross Sectional Effect 1
CS2          1   -0.12832           0.0757     -1.69     0.0940   Cross Sectional Effect 2
CS3          1   -0.29599           0.0500     -5.92     <.0001   Cross Sectional Effect 3
CS4          1   0.097487           0.0330      2.95     0.0041   Cross Sectional Effect 4
CS5          1   -0.06301           0.0239     -2.64     0.0100   Cross Sectional Effect 5
Intercept    1    9.79304           0.2636     37.15     <.0001   Intercept
lQ           1   0.919293           0.0299     30.76     <.0001   Log transformation of quantity
lPF          1   0.417492           0.0152     27.47     <.0001   Log transformation of price of fuel
LF           1   -1.07044           0.2017     -5.31     <.0001   Load Factor (utilization index)
data table;
   set estimates;
   /* round the variance components and parameter estimates for display */
   VarCS  = round(_VARCS_,.00001);
   VarTS  = round(_VARTS_,.00001);
   VarErr = round(_VARERR_,.00001);
   Int    = round(Intercept,.0001);
   lQ2    = round(lQ,.0001);
   lPF2   = round(lPF,.0001);
   lF2    = round(lF,.0001);
   /* from the ninth row on, blank out the cross section and time series components */
   if _n_ >= 9 then do;
      VarCS = . ;
      VarTS = . ;
   end;
   keep _MODEL_ _METHOD_ VarCS VarTS VarErr Int lQ2 lPF2 lF2;
run;
The parameter estimates and variance components for both models are reported in Output 19.4.1 and Output 19.4.2.
title "Parameter Estimates";
proc print data=table label noobs;
   label _MODEL_  = "Model"
         _METHOD_ = "Method"
         Int      = "Intercept"
         lQ2      = "lQ"
         lPF2     = "lPF"
         lF2      = "lF";
   var _METHOD_ _MODEL_ Int lQ2 lPF2 lF2;
run;

title "Variance Component Estimates";
proc print data=table label noobs;
   label _MODEL_  = "Model"
         _METHOD_ = "Method"
         VarCS    = "Variance Component for Cross Sections"
         VarTS    = "Variance Component for Time Series"
         VarErr   = "Variance Component for Error";
   var _METHOD_ _MODEL_ VarCS VarTS VarErr;
run;
title ;
Method      Model      Variance Component for Error
_Ran1FB_    RANONE     0.00361
_Ran1WK_    RANONEWK   0.00361
_Ran1WH_    RANONEWH   0.00328
_Ran1NL_    RANONENL   0.00325
_Ran2FB_    RANTWO     0.00264
_Ran2WK_    RANTWOWK   0.00264
_Ran2WH_    RANTWOWH   0.00250
_Ran2NL_    RANTWONL   0.00196
_POOLED_    POOLED     0.01553
_BTWGRP_    BTWNG      0.01584
_BTWTME_    BTWNT      0.00051
In the random-effects model, individual constant terms are viewed as randomly distributed across cross-sectional units and not as parametric shifts of the regression function, as in the fixed-effects model. This is appropriate when the sampled cross-sectional units are drawn from a large population. Clearly, in this example, the six airlines are a sample of all the airlines in the industry and not an exhaustive, or nearly exhaustive, list.
There are four ways of computing the variance components in the one-way random-effects model. The method by Fuller and Battese (1974) (FB) uses a fitting-of-constants method to estimate them. The Wansbeek and Kapteyn (1989) (WK) method uses the true disturbances, while the Wallace and Hussain (WH) method uses ordinary least squares residuals. Looking at the estimates of the variance components for cross section and error in Output 19.4.2, you see that equal variance components for error are computed for both FB and WK, while WH and NL are nearly equal. All four techniques produce different variance components for cross sections. These estimates are then used to estimate the values of the parameters in Output 19.4.1. All the parameters appear to have similar and equally plausible estimates. Both the index for output in revenue passenger miles (lQ) and fuel price (lPF) have small, positive effects on total costs, which you would expect. The load factor (LF) has a somewhat larger and negative effect on total costs, suggesting that as utilization increases, costs decrease.
In the two-way random-effects model, as in the one-way model, the variance components for error produced by the FB and WK methods are equal. However, in this case, the WH and NL methods produce variance estimates that are dissimilar. The estimates of the variance component for cross sections are all different, but in a close range. The same cannot be said for the variance component for time series.
As varied as each of the variance estimates may be, they produce parameter estimates that are similar and plausible. As with the one-way effects model, the index for output (lQ) and fuel price (lPF) are small and positive. The load factor (LF) estimates are all negative and, with the exception of the estimate produced by the WH method, somewhat smaller than the estimates produced in the one-way model. During the time the data were collected, the Civil Aeronautics Board dissolved, so it is possible that the dummy variables are proxies for this dissolution. This would lead to the decay of time effects and an imprecise estimation of the effects of the load factors, even though the estimates are statistically significant.
The pooled estimates give you something to compare the random-effects estimates against. You see that signs and magnitudes of output and fuel price are similar but that the magnitude of the load factor coefficient is somewhat larger under pooling. Since the model appears to have both cross-sectional and time series effects, the pooled model should not be used.
Finally, you examine the between groups estimators. For the between groups estimate, you are looking at each airline's data averaged across time. You see in Output 19.4.1 that the between groups parameter estimates are radically different from all other parameter estimates. This could indicate that the time component is not being appropriately handled with this technique. For the between times estimate, you are looking at the average across all airlines in each time period. In this case, the parameter estimates are of the same sign and closer in magnitude to the previously computed estimates. Both the output and load factor effects appear to have more bearing on total costs.
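The comparison above relies on an ESTIMATES data set that holds one row per model. The following is a minimal sketch of statements that could produce such a data set. It assumes the AIRLINE data set and the variable names lC, lQ, lPF, and LF from the earlier examples, and it assumes that the model labels match the Model column shown with the variance components. It is an illustration, not the listing from this guide.

```sas
/* Sketch only: data set name, variable names, and labels are assumptions */
proc panel data=airline outest=estimates;
   id i t;
   /* one-way random effects, four variance component methods */
   RANONE:   model lC = lQ lPF LF / ranone vcomp=fb;
   RANONEWK: model lC = lQ lPF LF / ranone vcomp=wk;
   RANONEWH: model lC = lQ lPF LF / ranone vcomp=wh;
   RANONENL: model lC = lQ lPF LF / ranone vcomp=nl;
   /* two-way random effects, same four methods */
   RANTWO:   model lC = lQ lPF LF / rantwo vcomp=fb;
   RANTWOWK: model lC = lQ lPF LF / rantwo vcomp=wk;
   RANTWOWH: model lC = lQ lPF LF / rantwo vcomp=wh;
   RANTWONL: model lC = lQ lPF LF / rantwo vcomp=nl;
   /* pooled and between estimators for comparison */
   POOLED:   model lC = lQ lPF LF / pooled;
   BTWNG:    model lC = lQ lPF LF / btwng;
   BTWNT:    model lC = lQ lPF LF / btwnt;
run;
```

Fitting all models in one PROC PANEL step keeps the estimates in a single OUTEST= data set, which is convenient for the side-by-side PROC PRINT tables.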
Since the PANEL procedure cannot work directly with the data in compressed form, the FLATDATA statement can be used to transform the data. The OUT= option can be used to output transformed data to a data set.
proc panel data=flattest;
   flatdata indid=i tsname="t" transform=(X Y)
            keep=( cs num seed ) / out=flat_out;
   id i t;
   model y = x / fixone noint;
run;
The first six observations of the uncompressed data set and the results for the fitted one-way fixed-effects model are shown in Output 19.5.2 and Output 19.5.3.
Output 19.5.2 Uncompressed Data Set
Obs   I   t          X          Y   CS         NUM
  1   1   1    0.40268    2.30418   CS1   -1.56058
  2   1   2    0.91951    2.11850   CS1   -1.56058
  3   1   3    0.69482    2.66009   CS1   -1.56058
  4   1   4   -2.28899   -4.94104   CS1   -1.56058
  5   1   5   -1.32762   -0.83053   CS1   -1.56058
  6   1   6    1.92348    5.01359   CS1   -1.56058
Variable   DF   Estimate   t Value   Pr > |t|   Label
CS1         1   0.945589      2.06     0.0416   Cross Sectional Effect 1
CS2         1   2.475449      5.40     <.0001   Cross Sectional Effect 2
CS3         1   3.250337      7.10     <.0001   Cross Sectional Effect 3
CS4         1   3.712149      8.04     <.0001   Cross Sectional Effect 4
CS5         1   5.023584     10.78     <.0001   Cross Sectional Effect 5
CS6         1   6.791074     14.43     <.0001   Cross Sectional Effect 6
CS7         1    6.11374     13.15     <.0001   Cross Sectional Effect 7
CS8         1   8.733843     19.07     <.0001   Cross Sectional Effect 8
CS9         1   8.916685     19.44     <.0001   Cross Sectional Effect 9
CS10        1   8.913916     19.32     <.0001   Cross Sectional Effect 10
CS11        1   10.82881     23.64     <.0001   Cross Sectional Effect 11
CS12        1   11.40867     24.79     <.0001   Cross Sectional Effect 12
CS13        1    12.8865     28.10     <.0001   Cross Sectional Effect 13
CS14        1   13.37819     29.21     <.0001   Cross Sectional Effect 14
CS15        1   14.72619     32.16     <.0001   Cross Sectional Effect 15
CS16        1   15.58813     34.04     <.0001   Cross Sectional Effect 16
CS17        1   17.77983     38.83     <.0001   Cross Sectional Effect 17
CS18        1    17.9909     38.96     <.0001   Cross Sectional Effect 18
CS19        1   18.87283     41.18     <.0001   Cross Sectional Effect 19
CS20        1   19.40034     42.37     <.0001   Cross Sectional Effect 20
X           1   2.010753     16.52     <.0001
Example 19.6: The Cigarette Sales Data: Dynamic Panel Estimation with GMM
In this example, a dynamic panel demand model for cigarette sales is estimated. It illustrates the application of the method described in the section "Dynamic Panel Estimator" on page 1304. The data are a panel from 46 American states over the period 1963–92. See Baltagi and Levin (1992) and Baltagi (1995) for a description of the data. All variables were transformed by taking the natural logarithm. The data set CIGAR is shown in the following statements.
data cigar;
   input state year price pop pop_16 cpi ndi sales pimin;
   label
      state  = 'State abbreviation'
      year   = 'YEAR'
      price  = 'Price per pack of cigarettes'
      pop    = 'Population'
      pop_16 = 'Population above the age of 16'
      cpi    = 'Consumer price index with (1983=100)'
      ndi    = 'Per capita disposable income'
      sales  = 'Cigarette sales in packs per capita'
      pimin  = 'Minimum price in adjoining states per pack of cigarettes';
   datalines;
1 63 28.6 3383 2236.5 30.6 1558.3045298 93.9 26.1
1 64 29.8 3431 2276.7 31.0 1684.0732025 95.4 27.5
1 65 29.8 3486 2327.5 31.5 1809.8418752 98.5 28.9
1 66 31.5 3524 2369.7 32.4 1915.1603572 96.4 29.5
1 67 31.6 3533 2393.7 33.4 2023.5463678 95.5 29.6
1 68 35.6 3522 2405.2 34.8 2202.4855362 88.4 32
1 69 36.6 3531 2411.9 36.7 2377.3346665 90.1 32.8
1 70 39.6 3444 2394.6 38.8 2591.0391591 89.8 34.3
1 71 42.7 3481 2443.5 40.5 2785.3159706 95.4 35.8
... more lines ...
The following statements sort the data by STATE and YEAR variables.
proc sort data=cigar;
   by state year;
run;
Next, logarithms of the variables required for regression estimation are calculated, as shown in the following statements:
data cigar;
   set cigar;
   lsales = log(sales);
   lprice = log(price);
   lndi   = log(ndi);
   lpimin = log(pimin);
   label lprice = 'Log price per pack of cigarettes';
   label lndi   = 'Log per capita disposable income';
   label lsales = 'Log cigarette sales in packs per capita';
   label lpimin = 'Log minimum price in adjoining states per pack of cigarettes';
run;
The following statements create the CIGAR_LAG data set, which contains the lagged dependent variable for each cross section.
proc panel data=cigar;
   id state year;
   clag lsales(1) / out=cigar_lag;
run;

data cigar_lag;
   set cigar_lag;
   label lsales_1 = 'Lagged log cigarette sales in packs per capita';
run;
Finally, the model is estimated by a two-step GMM method. Five lags (MAXBAND=5) of the dependent variable are used as instruments. The NOLEVELS option is specified to avoid use of level equations, as shown in the following statements:
proc panel data=cigar_lag;
   inst depvar;
   model lsales = lsales_1 lprice lndi lpimin
         / gmm nolevels twostep maxband=5;
   id state year;
run;
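For orientation, the MODEL statement above corresponds to a dynamic demand equation of roughly the following form. The symbols $\phi$, $\beta_k$, $\eta_i$, and $\varepsilon_{it}$ are notation assumed here (state effect and idiosyncratic error), and the intercept is omitted for brevity. With NOLEVELS, estimation is based on the first-differenced equation, in which the state effect $\eta_i$ drops out.

```latex
\begin{aligned}
\mathit{lsales}_{it} &= \phi\,\mathit{lsales}_{i,t-1}
   + \beta_1\,\mathit{lprice}_{it}
   + \beta_2\,\mathit{lndi}_{it}
   + \beta_3\,\mathit{lpimin}_{it}
   + \eta_i + \varepsilon_{it}, \\
\Delta\mathit{lsales}_{it} &= \phi\,\Delta\mathit{lsales}_{i,t-1}
   + \beta_1\,\Delta\mathit{lprice}_{it}
   + \beta_2\,\Delta\mathit{lndi}_{it}
   + \beta_3\,\Delta\mathit{lpimin}_{it}
   + \Delta\varepsilon_{it}.
\end{aligned}
```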
If the theory suggests that there are other valid instruments, the PREDETERMINED, EXOGENOUS, and CORRELATED options can also be used.
Davidson, R. and MacKinnon, J. G. (1993), Estimation and Inference in Econometrics, New York: Oxford University Press.
Da Silva, J. G. C. (1975), "The Analysis of Cross-Sectional Time Series Data," Ph.D. dissertation, Department of Statistics, North Carolina State University.
Davis, Peter (2002), "Estimating Multi-Way Error Components Models with Unbalanced Data Structures," Journal of Econometrics, 106:1, 67-95.
Feige, E. L. (1964), The Demand for Liquid Assets: A Temporal Cross-Section Analysis, Englewood Cliffs: Prentice-Hall.
Feige, E. L. and Swamy, P. A. V. (1974), "A Random Coefficient Model of the Demand for Liquid Assets," Journal of Money, Credit, and Banking, 6, 241-252.
Fuller, W. A. and Battese, G. E. (1974), "Estimation of Linear Models with Crossed-Error Structure," Journal of Econometrics, 2, 67-78.
Greene, W. H. (1990), Econometric Analysis, First Edition, New York: Macmillan Publishing Company.
Greene, W. H. (2000), Econometric Analysis, Fourth Edition, New York: Macmillan Publishing Company.
Hausman, J. A. (1978), "Specification Tests in Econometrics," Econometrica, 46, 1251-1271.
Hausman, J. A. and Taylor, W. E. (1982), "A Generalized Specification Test," Economics Letters, 8, 239-245.
Hsiao, C. (1986), Analysis of Panel Data, Cambridge: Cambridge University Press.
Judge, G. G., Griffiths, W. E., Hill, R. C., Lutkepohl, H., and Lee, T. C. (1985), The Theory and Practice of Econometrics, Second Edition, New York: John Wiley & Sons.
Kmenta, J. (1971), Elements of Econometrics, Ann Arbor: The University of Michigan Press.
Lamotte, L. R. (1994), "A Note on the Role of Independence in t Statistics Constructed from Linear Statistics in Regression Models," The American Statistician, 48:3, 238-240.
Maddala, G. S. (1977), Econometrics, New York: McGraw-Hill Co.
Parks, R. W. (1967), "Efficient Estimation of a System of Regression Equations When Disturbances Are Both Serially and Contemporaneously Correlated," Journal of the American Statistical Association, 62, 500-509.
SAS Institute Inc. (1979), SAS Technical Report S-106, PANEL: A SAS Procedure for the Analysis of Time-Series Cross-Section Data, Cary, NC: SAS Institute Inc.
Searle, S. R. (1971), "Topics in Variance Component Estimation," Biometrics, 26, 1-76.
Seely, J. (1969), "Estimation in Finite-Dimensional Vector Spaces with Application to the Mixed Linear Model," Ph.D. dissertation, Department of Statistics, Iowa State University.
Seely, J. (1970a), "Linear Spaces and Unbiased Estimation," Annals of Mathematical Statistics, 41, 1725-1734.
Seely, J. (1970b), "Linear Spaces and Unbiased Estimation - Application to the Mixed Linear Model," Annals of Mathematical Statistics, 41, 1735-1748.
Seely, J. and Soong, S. (1971), "A Note on MINQUE's and Quadratic Estimability," Corvallis, Oregon: Oregon State University.
Seely, J. and Zyskind, G. (1971), "Linear Spaces and Minimum Variance Unbiased Estimation," Annals of Mathematical Statistics, 42, 691-703.
Theil, H. (1961), Economic Forecasts and Policy, Second Edition, Amsterdam: North-Holland, 435-437.
Wansbeek, T., and Kapteyn, Arie (1989), "Estimation of the Error-Components Model with Incomplete Panels," Journal of Econometrics, 41, 341-361.
White, H. (1980), "A Heteroscedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroscedasticity," Econometrica, 48, 817-838.
Wu, D. M. (1973), "Alternative Tests of Independence between Stochastic Regressors and Disturbances," Econometrica, 41(4), 733-750.
Zellner, A. (1962), "An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for Aggregation Bias," Journal of the American Statistical Association, 57, 348-368.
Chapter 20
covariates.) For example, the two-regressor model with a distributed lag effect for one regressor is written

$$y_t = \alpha + \sum_{i=0}^{p} b_i x_{t-i} + \gamma z_t + u_t$$

Here, $x_t$ is the regressor with a distributed lag effect, $z_t$ is a simple covariate, and $u_t$ is an error term. The distribution of the lagged effects is modeled by Almon lag polynomials. The coefficients $b_i$ of the lagged values of the regressor are assumed to lie on a polynomial curve. That is,

$$b_i = \alpha_0^* + \sum_{j=1}^{d} \alpha_j^* i^j$$

where $d$ ($\le p$) is the degree of the polynomial. For numerically efficient estimation, the PDLREG procedure uses orthogonal polynomials. The preceding equation can be transformed into orthogonal polynomials:

$$b_i = \alpha_0 + \sum_{j=1}^{d} \alpha_j f_j(i)$$

where $f_j(i)$ is a polynomial of degree $j$ in the lag length $i$, and $\alpha_j$ is a coefficient estimated from the data.
The PDLREG procedure supports endpoint restrictions for the polynomial. That is, you can constrain the estimated polynomial lag distribution curve so that $b_{-1} = 0$ or $b_{p+1} = 0$, or both. You can also impose linear restrictions on the parameter estimates for the covariates. You can specify a minimum degree and a maximum degree for the lag distribution polynomial, and the procedure fits polynomials for all degrees in the specified range. (However, if distributed lags are specified for more than one regressor, you can specify a range of degrees for only one of them.)
The PDLREG procedure can also test for autocorrelated residuals and perform autocorrelated error correction by using the autoregressive error model. You can specify any order autoregressive error model and can specify several different estimation methods for the autoregressive model, including exact maximum likelihood. The PDLREG procedure computes generalized Durbin-Watson statistics to test for autocorrelated residuals. For models with lagged dependent variables, the procedure can produce Durbin h and Durbin t statistics. You can request significance level p-values for the Durbin-Watson, Durbin h, and Durbin t statistics. See Chapter 8, "The AUTOREG Procedure," for details about these statistics.
The PDLREG procedure assumes that the input observations form a time series. Thus, the PDLREG procedure should be used only for ordered and equally spaced time series data.
The notation X(4,2) specifies that the model includes X and 4 lags of X, with the coefficients of X and its lags constrained to follow a second-degree (quadratic) polynomial. Thus, the regression model specified by this MODEL statement is

$$y_t = a + b_0 x_t + b_1 x_{t-1} + b_2 x_{t-2} + b_3 x_{t-3} + b_4 x_{t-4} + c z_t + u_t$$

$$b_i = \alpha_0 + \alpha_1 f_1(i) + \alpha_2 f_2(i)$$

where $f_1(i)$ is a polynomial of degree 1 in $i$ and $f_2(i)$ is a polynomial of degree 2 in $i$.
Lag distribution specifications are enclosed in parentheses and follow the name of the regressor variable. The general form of the lag distribution specification is
regressor-name ( length, degree, minimum-degree, end-constraint )
where
length           is the length of the lag distribution; that is, the number of lags of the regressor to use.
degree           is the degree of the distribution polynomial.
minimum-degree   is an optional minimum degree for the distribution polynomial.
end-constraint   is an optional endpoint restriction specification, which can have the value FIRST, LAST, or BOTH.
If the minimum-degree option is specified, the PDLREG procedure estimates models for all degrees between minimum-degree and degree.
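To make the polynomial constraint concrete, consider a specification of length 4 and degree 2, such as X(4,2). Written in raw powers of the lag index rather than in the orthogonal basis that the procedure uses internally, the five lag coefficients are determined by only three free parameters. The starred symbols $\alpha_0^*, \alpha_1^*, \alpha_2^*$ are notation assumed here for the raw polynomial coefficients.

```latex
\begin{aligned}
b_i &= \alpha_0^* + \alpha_1^* i + \alpha_2^* i^2, \qquad i = 0, 1, \dots, 4, \\
b_0 &= \alpha_0^*, \\
b_1 &= \alpha_0^* + \alpha_1^* + \alpha_2^*, \\
b_2 &= \alpha_0^* + 2\alpha_1^* + 4\alpha_2^*, \\
b_3 &= \alpha_0^* + 3\alpha_1^* + 9\alpha_2^*, \\
b_4 &= \alpha_0^* + 4\alpha_1^* + 16\alpha_2^*.
\end{aligned}
```

Each endpoint restriction adds one further linear condition on these parameters, so fewer parameters remain free.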
Introductory Example
The following statements generate simulated data for variables Y and X. Y depends on the first three lags of X, with coefficients .25, .5, and .25. Thus, the effect of changes of X on Y takes effect 25% after one period, 75% after two periods, and 100% after three periods.
data test;
   xl1 = 0; xl2 = 0; xl3 = 0;
   do t = -3 to 100;
      x = ranuni(1234);
      y = 10 + .25 * xl1 + .5 * xl2 + .25 * xl3
             + .1 * rannor(1234);
      if t > 0 then output;
      xl3 = xl2; xl2 = xl1; xl1 = x;
   end;
run;
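Written as an equation, the DATA step above generates the following process. The symbol $\varepsilon_t$ is notation assumed here for the standard normal draw from RANNOR, and $x_t$ is the uniform draw from RANUNI. The cumulative sums of the lag coefficients give the 25%, 75%, and 100% figures quoted in the text.

```latex
\begin{aligned}
y_t &= 10 + 0.25\,x_{t-1} + 0.5\,x_{t-2} + 0.25\,x_{t-3} + 0.1\,\varepsilon_t, \\
\text{after one period:} \quad & 0.25, \\
\text{after two periods:} \quad & 0.25 + 0.5 = 0.75, \\
\text{after three periods:} \quad & 0.25 + 0.5 + 0.25 = 1.00.
\end{aligned}
```

The true coefficients at lag 0 and lag 4 are zero in this process.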
The following statements use the PDLREG procedure to regress Y on a distributed lag of X. The length of the lag distribution is 4, and the degree of the distribution polynomial is specified as 3.
proc pdlreg data=test;
   model y = x( 4, 3 );
run;
The PDLREG procedure first prints a table of statistics for the residuals of the model, as shown in Figure 20.1. See Chapter 8, "The AUTOREG Procedure," for an explanation of these statistics.
Figure 20.1 Residual Statistics
The PDLREG Procedure
Dependent Variable y

Ordinary Least Squares Estimates
SSE              0.86604442    DFE                      91
MSE                 0.00952    Root MSE            0.09755
SBC              -156.72612    AIC              -169.54786
MAE              0.07761107    AICC             -168.88119
MAPE             0.73971576    Regress R-Square     0.7711
Durbin-Watson        1.9920    Total R-Square       0.7711
The PDLREG procedure next prints a table of parameter estimates, standard errors, and t tests, as shown in Figure 20.2.
Figure 20.2 Parameter Estimates
Variable    DF   Standard Error   Approx Pr > |t|
Intercept    1           0.0431            <.0001
X**0         1           0.0378            <.0001
X**1         1           0.0336            0.7377
X**2         1           0.0322            <.0001
X**3         1           0.0392            0.4007
The table in Figure 20.2 shows the model intercept and the estimated parameters of the lag distribution polynomial. The parameter labeled X**0 is the constant term, $\alpha_0$, of the distribution polynomial. X**1 is the linear coefficient, $\alpha_1$; X**2 is the quadratic coefficient, $\alpha_2$; and X**3 is the cubic coefficient, $\alpha_3$.
The parameter estimates for the distribution polynomial are not of interest in themselves. Since the PDLREG procedure does not print the orthogonal polynomial basis that it constructs to represent the distribution polynomial, these coefficient values cannot be interpreted. However, because these estimates are for an orthogonal basis, you can use these results to test the degree of the polynomial. For example, this table shows that the X**3 estimate is not significant; the p-value for its t ratio is 0.4007, while the X**2 estimate is highly significant (p < .0001). This indicates that a second-degree polynomial might be more appropriate for this data set.
The PDLREG procedure next prints the lag distribution coefficients and a graphical display of these coefficients, as shown in Figure 20.3.
Figure 20.3 Coefficients and Graph of Estimated Lag Distribution
Estimate of Lag Distribution
Variable   Standard Error   Approx Pr > |t|
X(0)               0.0360            0.2677
X(1)               0.0307            <.0001
X(2)               0.0239            <.0001
X(3)               0.0315            <.0001
X(4)               0.0365            0.8929
[Graph of the estimated lag distribution; axis range -0.04 to 0.4167]
The lag distribution coefficients are the coefficients of the lagged values of X in the regression model. These coefficients lie on the polynomial curve defined by the parameters shown in Figure 20.2. Note that the estimated values for X(1), X(2), and X(3) are highly significant, while X(0) and X(4) are not significantly different from 0. These estimates are reasonably close to the true values used to generate the simulated data. The graphical display of the lag distribution coefficients plots the estimated lag distribution polynomial reported in Figure 20.2. The roughly quadratic shape of this plot is another indication that a third-degree distribution curve is not needed for this data set.
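A natural follow-up, sketched here under the assumption that the TEST data set from the introductory example is still available, is to refit the model with a second-degree polynomial. The second step uses the minimum-degree argument so that the procedure fits both degree 2 and degree 3 in one run. These statements are an illustration and are not part of the original example.

```sas
/* Sketch: refit with a quadratic lag distribution */
proc pdlreg data=test;
   model y = x( 4, 2 );
run;

/* Sketch: fit all degrees from minimum-degree 2 up to degree 3 */
proc pdlreg data=test;
   model y = x( 4, 3, 2 );
run;
```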
Functional Summary
The statements and options used with the PDLREG procedure are summarized in the following table.
Table 20.1 PDLREG Functional Summary
Description                                        Statement   Option

Data Set Options
specify the input data set                         PDLREG      DATA=
write predicted values to an output data set       OUTPUT      OUT=

BY-Group Processing
specify BY-group processing                        BY

Printing Control Options
request all print options                          MODEL       ALL
print correlations of the estimates                MODEL       CORRB
print covariances of the estimates                 MODEL       COVB
print DW statistics up to order j                  MODEL       DW=j
print the marginal probability of DW statistics    MODEL       DWPROB
print inverse of the crossproducts matrix          MODEL       I
print details at each iteration step               MODEL       ITPRINT
print Durbin t statistic                           MODEL       LAGDEP
print Durbin h statistic                           MODEL       LAGDEP=
suppress printed output                            MODEL       NOPRINT
print partial autocorrelations                     MODEL       PARTIAL
print standardized parameter estimates             MODEL       STB
print crossproducts matrix                         MODEL       XPX

Model Estimation Options
specify order of autoregressive process            MODEL       NLAG=
suppress intercept parameter                       MODEL       NOINT
specify convergence criterion                      MODEL       CONVERGE=
specify maximum number of iterations               MODEL       MAXITER=
specify estimation method                          MODEL       METHOD=
Table 20.1 continued

Description                                                           Statement   Option

Output Control Options
specify confidence limit size                                         OUTPUT      ALPHACLI=
specify confidence limit size for structural predicted values         OUTPUT      ALPHACLM=
output transformed intercept variable                                 OUTPUT      CONSTANT=
output lower confidence limit for predicted values                    OUTPUT      LCL=
output lower confidence limit for structural predicted values         OUTPUT      LCLM=
output predicted values                                               OUTPUT      P=
output predicted values of the structural part                        OUTPUT      PM=
output residuals from the predicted values                            OUTPUT      R=
output residuals from the structural predicted values                 OUTPUT      RM=
output transformed variables                                          OUTPUT      TRANSFORM=
output upper confidence limit for the predicted values                OUTPUT      UCL=
output upper confidence limit for the structural predicted values     OUTPUT      UCLM=
specifies the name of the SAS data set containing the input data. If you do not specify the DATA= option, the most recently created SAS data set is used. In addition, you can place any of the following MODEL statement options in the PROC PDLREG statement, which is equivalent to specifying the option for every MODEL statement: ALL, CONVERGE=, CORRB, COVB, DW=, DWPROB, ITPRINT, MAXITER=, METHOD=, NOINT, NOPRINT, and PARTIAL.
BY Statement
BY variables ;
A BY statement can be used with PROC PDLREG to obtain separate analyses on observations in groups defined by the BY variables.
MODEL Statement
MODEL dependent = effects / options ;
The MODEL statement specifies the regression model. The keyword MODEL is followed by the dependent variable name, an equal sign, and a list of independent effects. Only one MODEL statement is allowed. Every variable in the model must be a numeric variable in the input data set. Specify an independent effect with a variable name optionally followed by a polynomial lag distribution specification.
The term in parentheses following the variable name specifies a polynomial distributed lag (PDL) for the variable. The PDL specification is as follows:
length           specifies the number of lags of the variable to include in the lag distribution.
degree           specifies the maximum degree of the distribution polynomial. If not specified, the degree defaults to the lag length.
minimum-degree   specifies the minimum degree of the polynomial. By default minimum-degree is the same as degree.
constraint       specifies endpoint restrictions on the polynomial. The value of constraint can be FIRST, LAST, or BOTH. If a value is not specified, there are no endpoint restrictions.
If you do not specify the degree or minimum-degree parameter, but you do specify endpoint restrictions, you must use commas to show which parameter, degree or minimum-degree, is left out.
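The following sketch illustrates the two forms of the specification described above. It assumes a data set named TEST with variables Y and X, as in the introductory example; the particular lengths and degrees are chosen only for illustration.

```sas
/* Sketch: all four arguments given.                          */
/* Length 4, degree 3, minimum degree 2, both endpoints set.  */
proc pdlreg data=test;
   model y = x( 4, 3, 2, both );
run;

/* Sketch: minimum-degree left out, so an empty position      */
/* (two commas) marks the omitted argument before LAST.       */
proc pdlreg data=test;
   model y = x( 4, 2, , last );
run;
```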
ALL
prints all the matrices computed during the analysis of the model.
CORRB
prints the correlations of the estimates.
DW= j
prints the generalized Durbin-Watson statistics up to the order of j. The default is DW=1. When you specify the LAGDEP or LAGDEP=name option, the Durbin-Watson statistic is not printed unless you specify the DW= option.
DWPROB
prints the marginal probability of the Durbin-Watson statistics.
CONVERGE= value
sets the convergence criterion. If the maximum absolute value of the change in the autoregressive parameter estimates between iterations is less than this amount, then convergence is assumed. The default is CONVERGE=.001.
I
prints $(X'X)^{-1}$, the inverse of the crossproducts matrix for the model; or, if restrictions are specified, it prints $(X'X)^{-1}$ adjusted for the restrictions.
ITPRINT
prints details at each iteration step.
LAGDEP
LAGDV
prints the t statistic for testing residual autocorrelation when regressors contain lagged dependent variables.
LAGDEP= name
LAGDV= name
prints the Durbin h statistic for testing the presence of first-order autocorrelation when regressors contain the lagged dependent variable whose name is specified as LAGDEP=name. When the h statistic cannot be computed, the asymptotically equivalent t statistic is given.
MAXITER= number
specifies the maximum number of iterations.
METHOD= value
specifies the type of estimates for the autoregressive component. The values of the METHOD= option are as follows:
METHOD=ML     specifies the maximum likelihood method.
METHOD=ULS    specifies unconditional least squares.
METHOD=YW     specifies the Yule-Walker method.
METHOD=ITYW   specifies iterative Yule-Walker estimates.
The default is METHOD=ML if you specified the LAGDEP or LAGDEP= option; otherwise, METHOD=YW is the default.
NLAG= m
NLAG= ( number-list )
specifies the order of the autoregressive process or the subset of autoregressive lags to be fit. If you do not specify the NLAG= option, PROC PDLREG does not fit an autoregressive model.
NOINT
suppresses the intercept parameter.
STB
prints standardized parameter estimates. Sometimes known as a standard partial regression coefficient, a standardized parameter estimate is a parameter estimate multiplied by the standard deviation of the associated regressor and divided by the standard deviation of the regressed variable.
XPX
prints the crossproducts matrix, $X'X$, used for the model. $X$ refers to the transformed matrix of regressors for the regression.
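As an illustration of how several of these MODEL statement options combine, the following sketch fits a quadratic lag distribution with a first-order autoregressive error estimated by maximum likelihood, and requests Durbin-Watson statistics up to order 2 with their probabilities, plus standardized estimates. The data set TEST and the variables Y and X are assumed from the introductory example; the choice of options is for illustration only.

```sas
/* Sketch: options follow the slash in the MODEL statement */
proc pdlreg data=test;
   model y = x( 4, 2 ) / nlag=1 method=ml dw=2 dwprob stb;
run;
```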
OUTPUT Statement
OUTPUT OUT= SAS-data-set keyword= option . . . ;
The OUTPUT statement creates an output SAS data set with variables as specified by the following keyword options. See the section "Predicted Values" in Chapter 8, "The AUTOREG Procedure," for a description of the associated computations for these options.
ALPHACLI= number
sets the confidence limit size for the estimates of future values of the current realization of the response time series to number, where number is less than one and greater than zero. The resulting confidence interval has 1-number confidence. The default value for number is 0.05, corresponding to a 95% confidence interval.
ALPHACLM= number
sets the confidence limit size for the estimates of the structural or regression part of the model to number, where number is less than one and greater than zero. The resulting confidence interval has 1-number confidence. The default value for number is 0.05, corresponding to a 95% confidence interval.
OUT= SAS-data-set
The following specifications are of the form KEYWORD=names, where KEYWORD= specifies the statistic to include in the output data set and names gives names to the variables that contain the statistics.
CONSTANT= variable
writes the transformed intercept variable to the output data set.
LCL= name
requests that the lower confidence limit for the predicted value (specified in the PREDICTED= option) be added to the output data set under the name given.
LCLM= name
requests that the lower confidence limit for the structural predicted value (specified in the PREDICTEDM= option) be added to the output data set under the name given.
PREDICTED= name P= name
stores the predicted values in the output data set under the name given.
PREDICTEDM= name PM= name
stores the structural predicted values in the output data set under the name given. These values are formed from only the structural part of the model.
RESIDUAL= name R= name
stores the residuals from the predicted values based on both the structural and time series parts of the model in the output data set under the name given.
RESIDUALM= name
RM= name
stores the residuals from the structural predicted values in the output data set under the name given.
TRANSFORM= variables
requests that the specified variables from the input data set be transformed by the autoregressive model and put in the output data set. If you need to reproduce the data suitable for reestimation, you must also transform an intercept variable. To do this, transform a variable that only takes the value 1 or use the CONSTANT= option.
UCL= name
stores the upper confidence limit for the predicted value (specified in the PREDICTED= option) in the output data set under the name given.
UCLM= name
stores the upper confidence limit for the structural predicted value (specified in the PREDICTEDM= option) in the output data set under the name given.
For example, the SAS statements
proc pdlreg data=a;
   model y=x1 x2;
   output out=b p=yhat r=resid;
run;
create an output data set named B. In addition to the input data set variables, the data set B contains the variable YHAT, whose values are predicted values of the dependent variable Y, and RESID, whose values are the residual values of Y.
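The confidence limit keywords can be added in the same way. The following sketch requests 90% limits for the predicted values by setting ALPHACLI=0.10. The data set TEST and the output names PRED, YHAT, RESID, LOWER, and UPPER are assumptions made for this illustration.

```sas
/* Sketch: predicted values, residuals, and 90% confidence limits */
proc pdlreg data=test;
   model y = x( 4, 2 );
   output out=pred p=yhat r=resid
          lcl=lower ucl=upper alphacli=0.10;
run;

/* Inspect the first few rows of the output data set */
proc print data=pred(obs=5);
run;
```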
RESTRICT Statement
RESTRICT equation , . . . , equation ;
The RESTRICT statement places restrictions on the parameter estimates for covariates in the preceding MODEL statement. A parameter produced by a distributed lag cannot be restricted with the RESTRICT statement. Each restriction is written as a linear equation. If you specify more than one restriction in a RESTRICT statement, the restrictions are separated by commas. You can refer to parameters by the name of the corresponding regressor variable. Each name used in the equation must be a regressor in the preceding MODEL statement. Use the keyword INTERCEPT to refer to the intercept parameter in the model.
RESTRICT statements can be given labels. You can use labels to distinguish results for different restrictions in the printed output. Labels are specified as follows:
label : RESTRICT . . .
The following is an example of the use of the RESTRICT statement, in which the coefficients of the regressors X1 and X2 are required to sum to 1:
   proc pdlreg data=a;
      model y = x1 x2;
      restrict x1 + x2 = 1;
   run;
Parameter names can be multiplied by constants. When no equal sign appears, the linear combination is set equal to 0. Note that the parameters associated with the variables are restricted, not the variables themselves. Here are some examples of valid RESTRICT statements:
   restrict x1 + x2 = 1;
   restrict x1 + x2 - 1;
   restrict 2 * x1 = x2 + x3 , intercept + x4 = 0;
   restrict x1 = x2 = x3 = 1;
   restrict 2 * x1 - x2;
Restricted parameter estimates are computed by introducing a Lagrangian parameter $\lambda$ for each restriction (Pringle and Rayner 1971). The estimates of these Lagrangian parameters are printed in the parameter estimates table. If a restriction cannot be applied, its parameter value and degrees of freedom are listed as 0. The Lagrangian parameter, $\lambda$, measures the sensitivity of the SSE to the restriction. If the restriction is changed by a small amount $\epsilon$, the SSE is changed by $2\lambda\epsilon$. The t ratio tests the significance of the restrictions. If $\lambda$ is zero, the restricted estimates are the same as the unrestricted ones.
You can specify any number of restrictions in a RESTRICT statement, and you can use any number of RESTRICT statements. The estimates are computed subject to all restrictions specified. However, restrictions should be consistent and not redundant.
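The sensitivity interpretation of the Lagrangian parameter can be checked numerically. The following Python sketch is not PROC PDLREG code; it fits a two-regressor model without an intercept under the restriction b1 + b2 = r by direct substitution, using made-up data. The function name, the data, and the sign convention chosen for the multiplier (so that the SSE changes by 2 * lambda * eps) are all assumptions made for illustration.

```python
# Restricted least squares for y = b1*x1 + b2*x2 + e subject to b1 + b2 = r.
# Hypothetical data and names; a sketch, not the procedure's own algorithm.

def restricted_fit(x1, x2, y, r):
    """Fit under b1 + b2 = r by substituting b2 = r - b1."""
    z = [a - b for a, b in zip(x1, x2)]        # regressor after substitution
    t = [yi - r * b for yi, b in zip(y, x2)]   # response after substitution
    b1 = sum(zi * ti for zi, ti in zip(z, t)) / sum(zi * zi for zi in z)
    b2 = r - b1
    resid = [yi - b1 * a - b2 * b for yi, a, b in zip(y, x1, x2)]
    sse = sum(e * e for e in resid)
    # Multiplier, with the sign chosen so that d(SSE)/dr = 2 * lam (assumption).
    lam = -sum(a * e for a, e in zip(x1, resid))
    return b1, b2, sse, lam

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [2.1, 2.9, 5.2, 6.8, 8.1, 9.9]

eps = 1e-4
b1, b2, sse, lam = restricted_fit(x1, x2, y, 1.0)
_, _, sse_shifted, _ = restricted_fit(x1, x2, y, 1.0 + eps)

print("b1 + b2          =", b1 + b2)
print("change in SSE    =", sse_shifted - sse)
print("2 * lambda * eps =", 2 * lam * eps)
```

The two printed changes agree up to a term of order eps squared, which is the first-order statement made in the text.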
Missing Values
The PDLREG procedure skips any observations at the beginning of the data set that have missing values. The procedure uses all observations with nonmissing values for all the independent and dependent variables such that the lag distribution has sufficient nonmissing lagged independent variables.
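The finite distributed lag model is written as
$$ y_t = \alpha + \sum_{i=0}^{p} \beta_i x_{t-i} + \epsilon_t $$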
When the lag length (p) is long, severe multicollinearity can occur. Use the Almon or polynomial distributed lag model to avoid this problem, since the relatively low-degree d ($\le p$) polynomials can capture the true lag distribution. The lag coefficient can be written in the Almon polynomial lag
$$ \beta_i = \alpha_0^{*} + \sum_{j=1}^{d} \alpha_j^{*} i^{j} $$
Emerson (1968) proposed an efficient method of constructing orthogonal polynomials from the preceding polynomial equation as
$$ \beta_i = \alpha_0 + \sum_{j=1}^{d} \alpha_j f_j(i) $$
where $f_j(i)$ is a polynomial of degree j in the lag length i. The polynomials $f_j(i)$ are chosen so that they are orthogonal:
$$ \sum_{i=1}^{n} w_i f_j(i) f_k(i) = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases} $$
where $w_i$ is the weighting factor, and $n = p + 1$. PROC PDLREG uses the equal weights ($w_i = 1$) for all i. To construct the orthogonal polynomials, the following recursive relation is used:
$$ f_j(i) = (A_j i + B_j) f_{j-1}(i) - C_j f_{j-2}(i), \qquad j = 1, \ldots, d $$
The constants $A_j$, $B_j$, and $C_j$ are
$$ A_j = \left\{ \sum_{i=1}^{n} w_i i^2 f_{j-1}^2(i) - \left( \sum_{i=1}^{n} w_i i f_{j-1}^2(i) \right)^2 - \left( \sum_{i=1}^{n} w_i i f_{j-1}(i) f_{j-2}(i) \right)^2 \right\}^{-1/2} $$
$$ B_j = -A_j \sum_{i=1}^{n} w_i i f_{j-1}^2(i) $$
$$ C_j = A_j \sum_{i=1}^{n} w_i i f_{j-1}(i) f_{j-2}(i) $$
where $f_{-1}(i) = 0$ and $f_0(i) = 1 / \sqrt{\sum_{i=1}^{n} w_i}$.
PROC PDLREG estimates the orthogonal polynomial coefficients, $\alpha_0, \ldots, \alpha_d$, to compute the coefficient estimate of each independent variable (X) with distributed lags. For example, if an independent variable is specified as X(9,3), a third-degree polynomial is used to specify the distributed lag coefficients. The third-degree polynomial is fit as a constant term, a linear term, a quadratic term, and a cubic term. The four terms are constructed to be orthogonal. In the output produced by the PDLREG procedure for this case, parameter estimates with names X**0, X**1, X**2, and X**3 correspond to $\hat{\alpha}_0$, $\hat{\alpha}_1$, $\hat{\alpha}_2$, and $\hat{\alpha}_3$, respectively. A test using the t statistic and the approximate p-value (Approx Pr > |t|) associated with X**3 can determine whether a second-degree polynomial rather than a third-degree polynomial is appropriate. The estimates of the 10 lag coefficients associated with the specification X(9,3) are labeled X(0), X(1), X(2), X(3), X(4), X(5), X(6), X(7), X(8), and X(9).
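The recursion for the orthogonal polynomials can be sketched in a few lines of Python. This is an illustration under the equal-weights case ($w_i = 1$) with the index running over i = 1, ..., n and n = p + 1, as stated above; it is not the procedure's internal code, and the function name is hypothetical. The example uses the X(9,3) specification (p = 9, d = 3) and reports how far the Gram matrix of the constructed polynomials is from the identity.

```python
import math

def orthogonal_polynomials(p, d):
    """Return [f_0, ..., f_d]; each f_j is the list of values f_j(i), i = 1..p+1."""
    n = p + 1
    idx = list(range(1, n + 1))
    f_prev2 = [0.0] * n                      # f_{-1}(i) = 0
    f_prev1 = [1.0 / math.sqrt(n)] * n       # f_0(i) = 1 / sqrt(sum of w_i), w_i = 1
    polys = [f_prev1]
    for _ in range(1, d + 1):
        s2 = sum(i * i * f * f for i, f in zip(idx, f_prev1))
        s1 = sum(i * f * f for i, f in zip(idx, f_prev1))
        s0 = sum(i * f * g for i, f, g in zip(idx, f_prev1, f_prev2))
        a = (s2 - s1 ** 2 - s0 ** 2) ** -0.5  # A_j
        b = -a * s1                           # B_j
        c = a * s0                            # C_j
        f_new = [(a * i + b) * f - c * g
                 for i, f, g in zip(idx, f_prev1, f_prev2)]
        polys.append(f_new)
        f_prev2, f_prev1 = f_prev1, f_new
    return polys

polys = orthogonal_polynomials(p=9, d=3)

# Gram matrix: entry (j, k) is the sum over i of f_j(i) * f_k(i).
gram = [[sum(u * v for u, v in zip(fj, fk)) for fk in polys] for fj in polys]
max_err = max(abs(gram[j][k] - (1.0 if j == k else 0.0))
              for j in range(len(polys)) for k in range(len(polys)))
print("largest deviation from orthonormality:", max_err)
```

A lag coefficient is then recovered as a linear combination of the four polynomial values at that lag, weighted by the estimated orthogonal polynomial coefficients.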
Printed Output
The PDLREG procedure prints the following items:
1. the name of the dependent variable
2. the ordinary least squares (OLS) estimates
3. the estimates of autocorrelations and of the autocovariance, and if line size permits, a graph of the autocorrelation at each lag. The autocorrelation for lag 0 is 1. These items are printed if you specify the NLAG= option.
4. the partial autocorrelations if the PARTIAL and NLAG= options are specified. The first partial autocorrelation is the autocorrelation for lag 1.
5. the preliminary mean square error, which results from solving the Yule-Walker equations if you specify the NLAG= option
6. the estimates of the autoregressive parameters, their standard errors, and the ratios of estimates to standard errors (t) if you specify the NLAG= option
7. the statistics of fit for the final model if you specify the NLAG= option. These include the error sum of squares (SSE), the degrees of freedom for error (DFE), the mean square error (MSE), the root mean square error (Root MSE), the mean absolute error (MAE), the mean absolute percentage error (MAPE), the Schwarz information criterion (SBC), Akaike's information criterion (AIC), Akaike's information criterion corrected (AICC), the regression $R^2$ (Regress R-Square), the total $R^2$ (Total R-Square), and the Durbin-Watson statistic (Durbin-Watson). See Chapter 8, The AUTOREG Procedure, for details of the regression $R^2$ and the total $R^2$.
8. the parameter estimates for the structural model (B), a standard error estimate, the ratio of estimate to standard error (t), and an approximation to the significance probability for the parameter being 0 (Approx Pr > |t|)
9. a plot of the lag distribution (estimate of lag distribution)
10. the covariance matrix of the parameter estimates if the COVB option is specified
ODS Table Name               Description                                            Option

ODS Tables Created by the MODEL Statement
ARParameterEstimates         Estimates of autoregressive parameters                 NLAG=
CholeskyFactor               Cholesky root of gamma
Coefficients                 Coefficients for first NLAG observations               NLAG=
ConvergenceStatus            Convergence status table                               default
CorrB                        Correlation of parameter estimates                     CORRB
CorrGraph                    Estimates of autocorrelations                          NLAG=
CovB                         Covariance of parameter estimates                      COVB
DependenceEquations          Linear dependence equation
Dependent                    Dependent variable                                     default
DWTest                       Durbin-Watson statistics                               DW=
DWTestProb                   Durbin-Watson statistics and p-values                  DW= DWPROB
ExpAutocorr                  Expected autocorrelations                              NLAG=
FitSummary                   Summary of regression                                  default
GammaInverse                 Gamma inverse
IterHistory                  Iteration history                                      ITPRINT
LagDist                      Lag distribution                                       ALL
ParameterEstimates           Parameter estimates                                    default
ParameterEstimatesGivenAR    Parameter estimates assuming AR parameters are given   NLAG=
PartialAutoCorr              Partial autocorrelation                                PARTIAL
PreMSE                       Preliminary MSE                                        NLAG=
XPXIMatrix                   $(X'X)^{-1}$ matrix                                    XPX
XPXMatrix                    $X'X$ matrix                                           XPX
YWIterSSE                    Yule-Walker iteration sum of squared error             METHOD=ITYW

ODS Tables Created by the RESTRICT Statement
Restrict                     Restriction table                                      default
The printed output produced by the PDLREG procedure is shown in Output 20.1.1. The small Durbin-Watson statistic indicates autoregressive errors.
Output 20.1.1 Printed Output Produced by PROC PDLREG
[Output listing: National Industrial Conference Board Data, Quarterly Series - 1952Q1 to 1967Q4. The PDLREG Procedure. Dependent Variable ce. Parameter estimates and Estimate of Lag Distribution.]
The following statements use the REG procedure to fit the same polynomial distributed lag model. A DATA step computes lagged values of the regressor X, and RESTRICT statements are used to impose the polynomial lag distribution. Refer to Judge et al. (1985, pp. 357-359) for the restricted least squares estimation of the Almon distributed lag model.
   q3 ca ca_1 ca_2 ca_3 ca_4 ca_5;
   5*ca_1 - 10*ca_2 + 10*ca_3 - 5*ca_4 + ca_5;
   3*ca_1 + 2*ca_2 + 2*ca_3 - 3*ca_4 + ca_5;
   7*ca_1 + 4*ca_2 - 4*ca_3 - 7*ca_4 + 5*ca_5;
Analysis of Variance
Source             DF    F Value    Pr > F
Model               6     473.58    <.0001
Error              48
Corrected Total    54

R-Square 0.9834    Adj R-Sq 0.9813

Parameter Estimates
Variable     DF    t Value    Pr > |t|
Intercept     1       2.87      0.0061
q1            1      -0.17      0.8635
q2            1      -0.35      0.7277
q3            1      -0.51      0.6137
ca            1       2.49      0.0165
ca_1          1       9.56      <.0001
ca_2          1       5.00      <.0001
ca_3          1       6.24      <.0001
ca_4          1      17.69      <.0001
ca_5          1       6.60      <.0001
RESTRICT     -1       0.05      0.9614*
RESTRICT     -1       0.42      0.6772*
RESTRICT     -1       0.56      0.5814*
$$ \cdots + \sum_{i=0}^{5} c_i y_{t-i} + \sum_{i=0}^{2} d_i r_{t-i} + \sum_{i=0}^{3} f_i p_{t-i} + u_t $$
   format date yyqc6.;
   label m    = 'Real Money Stock (M1)'
         lagm = 'Lagged Real Money Stock'
         y    = 'Real GNP'
         r    = 'Commercial Paper Rate'
         p    = 'Inflation Rate';
   datalines;
   ... more lines ...
The regression model is written for the PDLREG procedure with a MODEL statement. The LAGDEP= option is specified to test for serial correlation in the disturbances, since the regressors contain the lagged dependent variable LAGM.
   title 'Money Demand Estimation using Distributed Lag Model';
   title2 'Quarterly Data - 1968Q2 to 1983Q4';
   proc pdlreg data=a;
      model m = lagm y(5,3) r(2, , ,first) p(3,2) / lagdep=lagm;
   run;
Variable     DF      Estimate    t Value
Intercept     1       -0.1407      -0.54
lagm          1        0.9875      23.21
y**0          1        0.0132       2.91
y**1          1       -0.0704      -1.33
y**2          1        0.1261       1.60
y**3          1       -0.4089      -3.23
r**0          1     -0.000186      -0.55
r**1          1      0.002200       2.84
r**2          1      0.000788       3.16
p**0          1       -0.6602      -5.83
p**1          1        0.4036       1.74
p**2          1       -1.0064      -4.40
Restriction    DF    L Value    t Value
r(-1)          -1     0.0164       2.26
[Estimate of Lag Distribution plots]
References
Balke, N. S. and Gordon, R. J. (1986), Historical Data, in R. J. Gordon, ed., The American Business Cycle, 781-850, Chicago: The University of Chicago Press.
Emerson, P. L. (1968), Numerical Construction of Orthogonal Polynomials from a General Recurrence Formula, Biometrics, 24, 695-701.
Gallant, A. R. and Goebel, J. J. (1976), Nonlinear Regression with Autoregressive Errors, Journal of the American Statistical Association, 71, 961-967.
Harvey, A. C. (1981), The Econometric Analysis of Time Series, New York: John Wiley & Sons.
Johnston, J. (1972), Econometric Methods, Second Edition, New York: McGraw-Hill.
Judge, G. G., Griffiths, W. E., Hill, R. C., Lutkepohl, H., and Lee, T. C. (1985), The Theory and Practice of Econometrics, Second Edition, New York: John Wiley & Sons.
Park, R. E. and Mitchell, B. M. (1980), Estimating the Autocorrelated Error Model with Trended Data, Journal of Econometrics, 13, 185-201.
Pringle, R. M. and Rayner, A. A. (1971), Generalized Inverse Matrices with Applications to Statistics, New York: Hafner Publishing.
Chapter 21
Example 21.3: Bivariate Probit Analysis
Example 21.4: Sample Selection Model
Example 21.5: Sample Selection Model with Truncation and Censoring
Example 21.6: Types of Tobit Models
Example 21.7: Stochastic Frontier Models
References
within a certain range and values outside this range are not available, the QLIM procedure offers a class of models adjusting for truncation. In some cases, the dependent variable is continuous only in a certain range and all values outside this range are reported as being on its boundary. For example, if it is not possible to observe negative values, the value of the dependent variable is reported as equal to zero. Because the data are censored, OLS results are inconsistent, and it cannot be guaranteed that the predicted values from the model fall in the appropriate region.
Most of the models in the QLIM procedure can be extended to accommodate bivariate and multivariate scenarios. The assumption that one variable is observed only if another variable takes on certain values leads to the introduction of sample selection models. If the dependent variables are mutually exclusive and observed only for certain ranges of the selection variable, the sample selection can be extended to include cases of switching regression.
Stochastic frontier production and cost models allow for random shocks of the production or cost. They include a systematic positive component in the error term adjusting for technological or cost inefficiency.
The QLIM procedure uses maximum likelihood methods. Initial starting values for the nonlinear optimizations are typically calculated by OLS.
The response variable, y, is numeric and has discrete values. PROC QLIM enables the user to specify the type of endogenous variables in the ENDOGENOUS statement. The binary probit model can also be specified as follows:
model y = x1 / discrete;
When multiple endogenous variables are specified in the QLIM procedure, these equations are estimated as a system. Multiple endogenous variables can be specified with one MODEL statement in the QLIM procedure when these models have the same exogenous variables:
model y1 y2 = x1 x2 / discrete;
   proc qlim data=a;
      model y1 = x1 x2;
      model y2 = x1 x2;
      endogenous y1 y2 ~ discrete;
   run;
The standard tobit model is estimated by specifying the endogenous variable to be truncated or censored. The limits of the dependent variable can be specified with the CENSORED or TRUNCATED option in the ENDOGENOUS or MODEL statement when the data are limited by specific values or variables. For example, the two-limit censored model requires two variables that contain the lower (bottom) and upper (top) bounds:
   proc qlim data=a;
      model y = x1 x2 x3;
      endogenous y ~ censored(lb=bottom ub=top);
   run;
The bounds can be numbers if they are fixed for all observations in the data set. For example, the standard tobit model can be specified as follows:
   proc qlim data=a;
      model y = x1 x2 x3;
      endogenous y ~ censored(lb=0);
   run;
Results of this analysis are shown in the following four figures. In the first table, shown in Figure 21.1, PROC QLIM provides frequency information about each choice. In this example, 428 women participate in the labor force (inlf=1).
The second table is the estimation summary table shown in Figure 21.2. Included are the number of dependent variables, names of dependent variables, the number of observations, the log-likelihood function value, the maximum absolute gradient, the number of iterations, AIC, and Schwarz criterion.
Figure 21.2 Fit Summary Table of Binary Probit
Model Fit Summary
Number of Endogenous Variables              1
Endogenous Variable                      inlf
Number of Observations                    753
Log Likelihood                     -401.30219
Maximum Absolute Gradient           0.0000669
Number of Iterations                       15
Optimization Method              Quasi-Newton
AIC                                 818.60439
Schwarz Criterion                   855.59691
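The two information criteria in Figure 21.2 can be reproduced from the log likelihood and the number of observations. The Python sketch below uses the usual definitions, AIC = -2 log L + 2k and Schwarz criterion = -2 log L + k log N, and assumes k = 8 estimated parameters (an intercept plus seven regressors); the parameter count is an assumption inferred from the example, not a value printed in the summary table.

```python
import math

log_likelihood = -401.30219   # Log Likelihood, from Figure 21.2
n_obs = 753                   # Number of Observations, from Figure 21.2
n_params = 8                  # assumed: intercept plus seven regressors

aic = -2.0 * log_likelihood + 2.0 * n_params
sbc = -2.0 * log_likelihood + n_params * math.log(n_obs)

print("AIC               =", round(aic, 5))
print("Schwarz Criterion =", round(sbc, 5))
```

Both values agree with the figure to within rounding of the reported log likelihood.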
Goodness-of-fit measures are displayed in Figure 21.3. All measures except McKelvey-Zavoina's definition are based on the log-likelihood function value. The likelihood ratio test statistic has a chi-square distribution under the null hypothesis that all slope coefficients are zero. In this example, the likelihood ratio statistic is used to test the hypothesis that kidslt6 = kidsge6 = age = educ = exper = expersq = nwifeinc = 0.
N = # of observations, K = # of regressors
Finally, the parameter estimates and standard errors are shown in Figure 21.4.
Figure 21.4 Parameter Estimates of Binary Probit
Parameter Estimates
DF    Standard Error    Approx Pr > |t|
 1          0.508590             0.5954
 1          0.004840             0.0130
 1          0.025255             <.0001
 1          0.018720             <.0001
 1          0.000600             0.0017
 1          0.008477             <.0001
 1          0.118519             <.0001
 1          0.043477             0.4076
When the error term has a logistic distribution, the binary logit model is estimated. To specify a logistic distribution, add the D=LOGIT option as follows:
   /*-- Binary Logit --*/
   proc qlim data=mroz;
      model inlf = nwifeinc educ exper expersq age kidslt6 kidsge6 / discrete(d=logit);
   run;
The heteroscedastic logit model can be estimated using the HETERO statement. If the variance of the logit model is a function of the family income level excluding wife's income (nwifeinc), the variance can be specified as
$$ \mathrm{Var}(\epsilon_i) = \sigma^2 \exp(\gamma \, \mathit{nwifeinc}_i) $$
where $\sigma^2$ is normalized to 1 because the dependent variable is discrete. The following SAS statements estimate the heteroscedastic logit model:
   /*-- Binary Logit with Heteroscedasticity --*/
   proc qlim data=mroz;
      model inlf = nwifeinc educ exper expersq age kidslt6 kidsge6 / discrete(d=logit);
      hetero inlf ~ nwifeinc / noconst;
   run;
The parameter estimate, $\gamma$, of the heteroscedasticity variable is listed as _H.nwifeinc; see Figure 21.6.
Parameter       DF      Estimate    t Value
Intercept        1      0.510445       0.52
nwifeinc         1     -0.026778      -2.21
educ             1      0.255547       4.14
exper            1      0.234105       5.02
expersq          1     -0.003613      -2.92
age              1     -0.100878      -4.69
kidslt6          1     -1.645206      -5.29
kidsge6          1      0.066941       0.78
_H.nwifeinc      1      0.013280       0.98
At least one MODEL statement is required. If more than one MODEL statement is used, the QLIM procedure estimates a system of models. Main effects and higher-order terms can be specified in the MODEL statement, as in the GLM procedure and PROBIT procedure in SAS/STAT. If a CLASS statement is used, it must precede the MODEL statement.
Functional Summary
The statements and options used with the QLIM procedure are summarized in the following table.
Table 21.1 QLIM Functional Summary
Description                                                   Statement     Option

Data Set Options
specify the input data set
write parameter estimates to an output data set
write predictions to an output data set

Declaring the Role of Variables
specify BY-group processing                                   BY
specify classification variables                              CLASS
specify a weight variable                                     WEIGHT

Printing Control Options
request all printing options
print correlation matrix of the estimates
print covariance matrix of the estimates
print a summary iteration listing
suppress the normal printed output

Options to Control the Optimization Process
specify the optimization method                               QLIM
specify the optimization options                              NLOPTIONS

Model Estimation Options
specify options specific to Box-Cox transformation
suppress the intercept parameter
specify a seed for pseudo-random number generation
specify number of draws for Monte Carlo integration
specify method to calculate parameter covariance

Endogenous Variable Options
specify discrete variable
specify censored variable
specify truncated variable
specify variable selection condition
specify stochastic frontier variable                          ENDOGENOUS    FRONTIER()

Heteroscedasticity Model Options
specify the function for heteroscedasticity models
square the function for heteroscedasticity models
specify no constant for heteroscedasticity models

Output Control Options
output predicted values                                       OUTPUT        PREDICTED
output structured part                                        OUTPUT        XBETA
output residuals                                              OUTPUT        RESIDUAL
output error standard deviation                               OUTPUT        ERRSTD
output marginal effects                                       OUTPUT        MARGINAL
output probability for the current response                   OUTPUT        PROB
output probability for all responses                          OUTPUT        PROBALL
output expected value                                         OUTPUT        EXPECTED
output conditional expected value                             OUTPUT        CONDITIONAL
output inverse Mills ratio                                    OUTPUT        MILLS
include covariances in the OUTEST= data set                   QLIM          COVOUT
include correlations in the OUTEST= data set                  QLIM          CORROUT

Test Request Options
requests Wald, Lagrange multiplier, and likelihood ratio tests              ALL
requests the Wald test                                                      WALD
requests the Lagrange multiplier test                                       LM
requests the likelihood ratio test                                          LR
DATA=SAS-data-set
specifies the input SAS data set. If the DATA= option is not specified, PROC QLIM uses the most recently created SAS data set.
COVOUT
writes the covariance matrix for the parameter estimates to the OUTEST= data set. This option is valid only if the OUTEST= option is specified.
CORROUT
writes the correlation matrix for the parameter estimates to the OUTEST= data set. This option is valid only if the OUTEST= option is specified.
Printing Options
NOPRINT
suppresses the normal printed output but does not suppress error listings. If the NOPRINT option is set, all other print options are turned off.
PRINTALL
turns on all the printing-control options. The options set by PRINTALL are COVB and CORRB.
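CORRB
prints the correlation matrix of the parameter estimates.
COVB
prints the covariance matrix of the parameter estimates.
ITPRINT
prints the initial parameter estimates, convergence criteria, and all constraints of the optimization. At each iteration, the objective function value, step size, maximum gradient, and slope of the search direction are printed as well.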
specifies the method to calculate the covariance matrix of parameter estimates. The supported covariance types are as follows:
OP         specifies the covariance from the outer product matrix.
HESSIAN    specifies the covariance from the inverse Hessian matrix.
QML        specifies the covariance from the outer product and Hessian matrices (the quasi-maximum likelihood estimates).
NDRAW=value
specifies the number of draws for Monte Carlo integration.
METHOD=value
specifies the optimization method. If this option is specified, it overwrites the TECH= option in the NLOPTIONS statement. Valid values are as follows:
CONGRA     performs a conjugate-gradient optimization
DBLDOG     performs a version of double dogleg optimization
NMSIMP     performs a Nelder-Mead simplex optimization
NEWRAP     performs a Newton-Raphson optimization combining a line-search algorithm with ridging
NRRIDG     performs a Newton-Raphson optimization with ridging
QUANEW     performs a quasi-Newton optimization
TRUREG     performs a trust region optimization
BOUNDS Statement
BOUNDS bound1 < , bound2 . . . > ;
The BOUNDS statement imposes simple boundary constraints on the parameter estimates. BOUNDS statement constraints refer to the parameters estimated by the QLIM procedure. Any number of BOUNDS statements can be specified. Each bound is composed of parameters, constants, and inequality operators. Parameters associated with regressor variables are referred to by the names of the corresponding regressor variables:
item operator item < operator item < operator item . . . > >
Each item is a constant, the name of a parameter, or a list of parameter names. See the section Naming of Parameters on page 1414 for more details on how parameters are named in the QLIM procedure. Each operator is <, >, <=, or >=.
Both the BOUNDS statement and the RESTRICT statement can be used to impose boundary constraints; however, the BOUNDS statement provides a simpler syntax for specifying these kinds of constraints. See the RESTRICT Statement on page 1393 for more information. The following BOUNDS statement constrains the estimates of the parameters associated with the variable ttime and the variables x1 through x10 to be between zero and one. This example illustrates the use of parameter lists to specify boundary constraints.
bounds 0 < ttime x1-x10 < 1;
The following BOUNDS statement constrains the estimates of the correlation (_RHO) and sigma (_SIGMA) in the bivariate model:
bounds _rho >= 0, _sigma.y1 > 1, _sigma.y2 < 5;
BY Statement
BY variables ;
A BY statement can be used with PROC QLIM to obtain separate analyses on observations in groups defined by the BY variables.
CLASS Statement
CLASS variables ;
The CLASS statement names the classification variables to be used in the analysis. Classification variables can be either character or numeric. Class levels are determined from the formatted values of the CLASS variables. Thus, you can use formats to group values into levels. See the discussion of the FORMAT procedure in SAS Language Reference: Dictionary for details. If the CLASS statement is used, it must appear before any of the MODEL statements.
ENDOGENOUS Statement
ENDOGENOUS variables options ;
The ENDOGENOUS statement specifies the type of the dependent variables that appear on the left-hand side of the equation. Currently, no right-hand side endogeneity is handled in PROC QLIM. All variables appearing on the right-hand side of the equation are treated as exogenous.
DISCRETE < (discrete-options) >
specifies that the endogenous variables in this statement are discrete. Valid discrete-options are as follows:
ORDER=DATA | FORMATTED | FREQ | INTERNAL
specifies the sorting order for the levels of the discrete variables specified in the ENDOGENOUS statement. This ordering determines which parameters in the model correspond to each level in the data. The following table shows how PROC QLIM interprets values of the ORDER= option.
Value of ORDER=    Levels Sorted By
DATA               order of appearance in the input data set
FORMATTED          formatted value
FREQ               descending frequency count; levels with the most observations come first in the order
INTERNAL           unformatted value
By default, ORDER=FORMATTED. For the values FORMATTED and INTERNAL, the sort order is machine dependent. For more information about sorting order, see the chapter on the SORT procedure in the Base SAS Procedures Guide.
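The four ORDER= values can be mimicked outside SAS to see which level comes first under each rule. The following Python sketch is only an analogy: the function name, the sample values, and the label mapping that stands in for a SAS format are hypothetical, and ties under FREQ are broken by order of appearance, which is an assumption rather than documented behavior.

```python
from collections import Counter

def order_levels(values, order="FORMATTED", fmt=str):
    """Return the distinct levels of `values` in the requested order."""
    if order == "DATA":                      # order of appearance
        seen = []
        for v in values:
            if v not in seen:
                seen.append(v)
        return seen
    if order == "FREQ":                      # descending frequency count
        counts = Counter(values)
        return [v for v, _ in sorted(counts.items(), key=lambda kv: -kv[1])]
    if order == "INTERNAL":                  # unformatted value
        return sorted(set(values))
    if order == "FORMATTED":                 # formatted value
        return sorted(set(values), key=fmt)
    raise ValueError("unknown ORDER= value: %r" % order)

values = [2, 3, 3, 1, 3, 1]                  # hypothetical responses
labels = {1: "B", 2: "A", 3: "C"}            # stand-in for a SAS format

for order in ("DATA", "FORMATTED", "FREQ", "INTERNAL"):
    print(order, order_levels(values, order, fmt=labels.get))
```

Because the ordering decides which parameter belongs to which level, changing ORDER= changes the interpretation of the estimates even though the fitted model is equivalent.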
DISTRIBUTION=distribution-type DIST=distribution-type D=distribution-type
specifies the cumulative distribution function used to model the response probabilities. Valid values for distribution-type are as follows:
NORMAL      the normal distribution for the probit model
LOGISTIC    the logistic distribution for the logit model
By default, DISTRIBUTION=NORMAL. If a multivariate model is specified, the logistic distribution is not allowed; only the normal distribution is supported.
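The difference between the two distribution types is only the cumulative distribution function applied to the linear index. The Python sketch below computes the probability of a positive response for a binary model under each choice; the function names and the index values are hypothetical, and the code illustrates the definitions rather than the procedure's estimation.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_cdf(x):
    """Standard logistic cumulative distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def response_probability(xbeta, distribution="NORMAL"):
    """P(y = 1) for a binary model with linear index xbeta."""
    if distribution == "NORMAL":       # probit
        return normal_cdf(xbeta)
    if distribution == "LOGISTIC":     # logit
        return logistic_cdf(xbeta)
    raise ValueError("unknown distribution: %r" % distribution)

for xbeta in (-1.0, 0.0, 1.0):         # hypothetical index values
    print(xbeta,
          round(response_probability(xbeta, "NORMAL"), 4),
          round(response_probability(xbeta, "LOGISTIC"), 4))
```

The logistic distribution has heavier tails and a larger variance than the standard normal, so logit coefficients are typically larger in magnitude than probit coefficients for the same data.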
CENSORED < (censored-options) >
specifies that the endogenous variables in this statement be censored. Valid censored-options are as follows:
LB=value or variable LOWERBOUND=value or variable
specifies the lower bound of the censored variables. If value is missing or the value in variable is missing, no lower bound is set. By default, no lower bound is set.
UB=value or variable UPPERBOUND=value or variable
specifies the upper bound of the censored variables. If value is missing or the value in variable is missing, no upper bound is set. By default, no upper bound is set.
TRUNCATED < (truncated-options) >
specifies that the endogenous variables in this statement be truncated. Valid truncated-options are as follows:
LB=value or variable LOWERBOUND=value or variable
specifies the lower bound of the truncated variables. If value is missing or the value in variable is missing, no lower bound is set. By default, no lower bound is set.
UB=value or variable UPPERBOUND=value or variable
specifies the upper bound of the truncated variables. If value is missing or the value in variable is missing, no upper bound is set. By default, no upper bound is set.
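The distinction between censoring and truncation is easy to lose in the option syntax, so a small Python sketch may help. It applies the two rules to a handful of made-up latent values; the function names and data are hypothetical, a bound of None plays the role of a missing bound (no bound is set), and treating values exactly on a bound as observed under truncation is an assumption.

```python
def censor(y_star, lb=None, ub=None):
    """Censoring: a value outside the bounds is reported as the bound itself."""
    y = y_star
    if lb is not None and y < lb:
        y = lb
    if ub is not None and y > ub:
        y = ub
    return y

def truncate(sample, lb=None, ub=None):
    """Truncation: values outside the bounds are not observed at all."""
    kept = []
    for y in sample:
        if lb is not None and y < lb:
            continue
        if ub is not None and y > ub:
            continue
        kept.append(y)
    return kept

latent = [-1.5, 0.3, 2.0, 12.0]                     # hypothetical latent values
censored = [censor(y, lb=0.0) for y in latent]      # like censored(lb=0)
truncated = truncate(latent, lb=0.0, ub=10.0)       # like truncated(lb=0 ub=10)

print("censored :", censored)
print("truncated:", truncated)
```

Under censoring the sample size is unchanged and the out-of-range observations pile up on the boundary; under truncation they disappear from the sample, which is why the two cases need different likelihoods.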
FRONTIER < (frontier-options) >
specifies that the endogenous variable in this statement follow a production or cost frontier. Valid frontier-options are as follows:
TYPE=
HALF TRUNCATED
PRODUCTION
specifies that the model estimated be a production function.
COST
specifies that the model estimated be a cost function. If neither the PRODUCTION nor the COST option is specified, a production function is estimated by default.
Selection Options
SELECT (select-option )
specifies selection criteria for the sample selection model. Select-option specifies the condition for the endogenous variable to be selected. It is written as a variable name, followed by an equality operator (=) or an inequality operator (<, >, <=, >=), followed by a number:
variable operator number
The variable is the endogenous variable that the selection is based on. The operator can be =, <, >, <= , or >=. Multiple select-options can be combined with the logic operators: AND, OR. The following example illustrates the use of the SELECT option:
   endogenous y1 ~ select(z=0);
   endogenous y2 ~ select(z=1 or z=2);
The SELECT option can be used together with the DISCRETE, CENSORED, or TRUNCATED option. For example:
   endogenous y1 ~ select(z=0) discrete;
   endogenous y2 ~ select(z=1) censored (lb=0);
   endogenous y3 ~ select(z=1 or z=2) truncated (ub=10);
For more details about selection models with censoring or truncation, see the section Selection Models on page 1408.
HETERO Statement
HETERO dependent variables ~ exogenous variables < / options > ;
The HETERO statement specifies variables that are related to the heteroscedasticity of the residuals and the way these variables are used to model the error variance. The heteroscedastic regression model supported by PROC QLIM is
$$ y_i = x_i'\beta + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_i^2) $$
See the section Heteroscedasticity on page 1405 for more details on the specification of functional forms.
LINK=value
The functional form can be specified using the LINK= option. The following option values are allowed:
EXP specifies the exponential link function
$$ \sigma_i^2 = \sigma^2 (1 + \exp(z_i'\gamma)) $$
LINEAR specifies the linear link function
$$ \sigma_i^2 = \sigma^2 (1 + z_i'\gamma) $$
When the LINK= option is not specified, the exponential link function is specified by default.
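NOCONST
specifies that there is no constant in the linear or exponential heteroscedasticity model:
$$ \sigma_i^2 = \sigma^2 (z_i'\gamma) $$
$$ \sigma_i^2 = \sigma^2 \exp(z_i'\gamma) $$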
SQUARE
estimates the model by using the square of the linear heteroscedasticity function. For example, you can specify the following heteroscedasticity function:
$$ \sigma_i^2 = \sigma^2 (1 + (z_i'\gamma)^2) $$
The option SQUARE does not apply to the exponential heteroscedasticity function because the square of an exponential function of $z_i'\gamma$ is the same as the exponential of $2 z_i'\gamma$. Hence the only difference is that all $\gamma$ estimates are divided by two.
INIT Statement
INIT initvalue1 < , initvalue2 . . . > ;
The INIT statement is used to set initial values for parameters in the optimization. Any number of INIT statements can be specified. Each initvalue is written as a parameter or parameter list, followed by an optional equality operator (=), followed by a number:
parameter <=> number
MODEL Statement
MODEL dependent = regressors < / options > ;
The MODEL statement specifies the dependent variable and independent regressor variables for the regression model. The following options can be used in the MODEL statement after a slash (/).
LIMIT1=value
specifies the restriction of the threshold value of the first category when the ordinal probit or logit model is estimated. LIMIT1=ZERO is the default option. When LIMIT1=VARYING is specified, the threshold value is estimated.
NOINT
suppresses the intercept parameter.
BOXCOX (option-list)
specifies options that are used for Box-Cox regression or regressor transformation. For example, the Box-Cox regression is specified as
model y = x1 x2 / boxcox(y=lambda,x1 x2)
$$ y_i^{(\lambda_1)} = \beta_0 + \beta_1 x_{1i}^{(\lambda_2)} + \beta_2 x_{2i}^{(\lambda_2)} + \epsilon_i $$
The option-list takes the form variable-list < = varname >, with items separated by commas. The variable-list specifies that the list of variables have the same Box-Cox transformation; varname specifies the name of this Box-Cox coefficient. If varname is not specified, the coefficient is called _Lambdai, where i increments sequentially.
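The transformation itself is not written out in this section, so the following Python sketch assumes the usual Box-Cox definition: (x^lambda - 1) / lambda for a nonzero coefficient and log(x) in the limit as the coefficient goes to zero. The function name and the sample values are hypothetical. Variables that share one variable-list are transformed with the same coefficient, as in the example specification where x1 and x2 share one coefficient and y has its own.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive value x with coefficient lam."""
    if x <= 0:
        raise ValueError("the Box-Cox transform requires x > 0")
    if lam == 0:
        return math.log(x)              # limiting case as lam -> 0
    return (x ** lam - 1.0) / lam

lam_y, lam_x = 0.5, 0.0                 # hypothetical coefficients
y, x1, x2 = 4.0, 2.0, 3.0               # hypothetical observation

print("transformed y :", box_cox(y, lam_y))
print("transformed x1:", box_cox(x1, lam_x))
print("transformed x2:", box_cox(x2, lam_x))
```

A coefficient of 1 leaves the variable linear up to a shift, and a coefficient of 0 gives the log transform, so the estimated coefficient indicates where the data sit between the linear and log-linear forms.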
NLOPTIONS Statement
NLOPTIONS < options > ;
PROC QLIM uses the nonlinear optimization (NLO) subsystem to perform nonlinear optimization tasks. For a list of all the options of the NLOPTIONS statement, see Chapter 6, Nonlinear Optimization Methods.
OUTPUT Statement
OUTPUT < OUT=SAS-data-set > < output-options > ;
The OUTPUT statement creates a new SAS data set containing all variables in the input data set and, optionally, the estimates of $x'\beta$, predicted value, residual, marginal effects, probability, standard deviation of the error, expected value, conditional expected value, and inverse Mills ratio. When the response values are missing for the observation, all output estimates except residual are still computed as long as none of the explanatory variables is missing. This enables you to compute these statistics for prediction. You can specify only one OUTPUT statement. Details on the specifications in the OUTPUT statement are as follows:
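CONDITIONAL
output estimates of the conditional expected value.
ERRSTD
output estimates of $\sigma_j$, the standard deviation of the error term.
EXPECTED
output estimates of the expected value.
MARGINAL
output marginal effects.
MILLS
output estimates of inverse Mills ratios of censored or truncated continuous, binary discrete, and selection endogenous variables.
OUT=SAS-data-set
names the output data set.
PREDICTED
output estimates of predicted values.
PROB
output estimates of probability of discrete endogenous variables taking the current observed responses.
PROBALL
output estimates of probability of discrete endogenous variables for all possible responses.
RESIDUAL
output estimates of residuals.
XBETA
output estimates of $x'\beta$.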
RESTRICT Statement
RESTRICT restriction1 < , restriction2 . . . > ;
The RESTRICT statement is used to impose linear restrictions on the parameter estimates. Any number of RESTRICT statements can be specified, but the number of restrictions imposed is limited by the number of regressors. Each restriction is written as an expression, followed by an equality operator (=) or an inequality operator (<, >, <=, >=), followed by a second expression:
expression operator expression
The operator can be =, <, >, <= , or >=. The operator and second expression are optional. Restriction expressions can be composed of parameter names, multiplication ( ), addition (C) and substitution ( ) operators, and constants. Parameters named in restriction expressions must be among the parameters estimated by the model. Parameters associated with a regressor variable are referred to by the name of the corresponding regressor variable. The restriction expressions must be a linear function of the parameters. The following is an example of the use of the RESTRICT statement:
proc qlim data=one; model y = x1-x10 / discrete; restrict x1*2 <= x2 + x3; run;
The RESTRICT statement can also be used to impose cross-equation restrictions in multivariate models. The following RESTRICT statement imposes an equality restriction on coefcients of x1 in equation y1 and x1 in equation y2:
proc qlim data=one; model y1 = x1-x10; model y2 = x1-x4; endogenous y1 y2 ~ discrete; restrict y1.x1=y2.x1; run;
TEST Statement
<label:> TEST <string:> equation [,equation. . . ] / options ;
The TEST statement performs Wald, Lagrange multiplier, and likelihood ratio tests of linear hypotheses about the regression parameters in the preceding MODEL statement. Each equation specifies a linear hypothesis to be tested. All hypotheses in one TEST statement are tested jointly. Variable names in the equations must correspond to regressors in the preceding MODEL statement, and each name represents the coefficient of the corresponding regressor. The keyword INTERCEPT refers to the coefficient of the intercept. The following options can be specified in the TEST statement after the slash (/):
ALL
requests Wald, Lagrange multiplier, and likelihood ratio tests.
WALD
requests the Wald test.
LM
requests the Lagrange multiplier test.
LR
requests the likelihood ratio test.
The following illustrates the use of the TEST statement:
proc qlim;
   model y = x1 x2 x3;
   test x1 = 0, x2 * .5 + 2 * x3 = 0;
   _int: test intercept = 0, x3 = 0;
run;
The first test investigates the joint hypothesis that $\beta_1 = 0$ and $0.5\beta_2 + 2\beta_3 = 0$. Only linear equality restrictions and tests are permitted in PROC QLIM. Test expressions can be composed only of algebraic operations involving the addition symbol (+), subtraction symbol (-), and multiplication symbol (*). The TEST statement accepts labels that are reproduced in the printed output. A TEST statement can be labeled in two ways: it can be preceded by a label followed by a colon, or the keyword TEST can be followed by a quoted string. If both are present, PROC QLIM uses the label preceding the colon. If no label is present, PROC QLIM automatically labels the tests.
WEIGHT Statement
WEIGHT variable ;
The WEIGHT statement specifies a variable to supply weighting values to use for each observation in estimating parameters. The log likelihood for each observation is multiplied by the corresponding weight variable value. If the weight of an observation is nonpositive, that observation is not used in the estimation.
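A minimal sketch of the WEIGHT statement in use follows. The data set survey, the binary response y, and the weight variable w are hypothetical names chosen for illustration.

```sas
/* Hypothetical data set SURVEY: y is a binary response, w a weight */
proc qlim data=survey;
   model y = x1 x2 / discrete;
   weight w;   /* observations with w <= 0 are not used in estimation */
run;
```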
where the value of the latent dependent variable, $y_i^*$, is observed only as follows:
\[ y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \]
The disturbance, $\epsilon_i$, of the probit model has standard normal distribution with the distribution function (CDF)
\[ \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp(-t^2/2)\, dt \]
The disturbance of the logit model has standard logistic distribution with the CDF
\[ \Lambda(x) = \frac{\exp(x)}{1 + \exp(x)} = \frac{1}{1 + \exp(-x)} \]
The binary discrete choice model has the following probability that the event $\{y_i = 1\}$ occurs:
\[ P(y_i = 1) = F(x_i'\beta) = \begin{cases} \Phi(x_i'\beta) & \text{(probit)} \\ \Lambda(x_i'\beta) & \text{(logit)} \end{cases} \]
The log-likelihood function is
\[ \ell = \sum_{i=1}^{N} \left\{ y_i \log[F(x_i'\beta)] + (1 - y_i) \log[1 - F(x_i'\beta)] \right\} \]
where the CDF $F(x)$ is defined as $\Phi(x)$ for the probit model while $F(x) = \Lambda(x)$ for logit. The first-order derivative of the logit model is
\[ \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{N} \left( y_i - \Lambda(x_i'\beta) \right) x_i \]
Note that the logit maximum likelihood estimates are $\pi/\sqrt{3}$ times greater than probit maximum likelihood estimates, since the probit parameter estimates, $\beta$, are standardized, and the error term with logistic distribution has a variance of $\pi^2/3$.
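The two binary choice models above can be requested as in the following sketch, which fits a probit and a logit to the same data. The data set one and the variables y, x1, and x2 are hypothetical; the distribution is selected with the D= suboption of DISCRETE.

```sas
/* Hypothetical data set ONE with binary response y */
proc qlim data=one;
   model y = x1 x2 / discrete(d=normal);     /* probit */
run;

proc qlim data=one;
   model y = x1 x2 / discrete(d=logistic);   /* logit */
run;
```

When comparing the two sets of estimates, the scale factor $\pi/\sqrt{3} \approx 1.81$ gives a rough idea of how much larger the logit coefficients are expected to be.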
Ordinal Probit/Logit
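When the dependent variable is observed in sequence with M categories, binary discrete choice modeling is not appropriate for data analysis. McKelvey and Zavoina (1975) proposed the ordinal (or ordered) probit model. Consider the following regression equation:
\[ y_i^* = x_i'\beta + \epsilon_i \]
where error disturbances, $\epsilon_i$, have the distribution function $F$. The unobserved continuous random variable, $y_i^*$, is identified as M categories. Suppose there are $M + 1$ real numbers, $\mu_0, \ldots, \mu_M$, where $\mu_0 = -\infty$, $\mu_1 = 0$, $\mu_M = \infty$, and $\mu_0 \le \mu_1 \le \cdots \le \mu_M$. Define
\[ R_{i,j} = \mu_j - x_i'\beta \]
The probability that the unobserved dependent variable is contained in the jth category can be written as
\[ P[\mu_{j-1} < y_i^* \le \mu_j] = F(R_{i,j}) - F(R_{i,j-1}) \]
The log-likelihood function is
\[ \ell = \sum_{i=1}^{N} \sum_{j=1}^{M} d_{ij} \ln\left[ F(R_{i,j}) - F(R_{i,j-1}) \right] \]
where $d_{ij} = 1$ if $\mu_{j-1} < y_i^* \le \mu_j$ and $d_{ij} = 0$ otherwise. The first derivatives are written as
\[ \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{N} \sum_{j=1}^{M} d_{ij} \left[ \frac{f(R_{i,j-1}) - f(R_{i,j})}{F(R_{i,j}) - F(R_{i,j-1})} \right] x_i \]
\[ \frac{\partial \ell}{\partial \mu_k} = \sum_{i=1}^{N} \sum_{j=1}^{M} d_{ij} \left[ \frac{\delta_{j,k} f(R_{i,j}) - \delta_{j-1,k} f(R_{i,j-1})}{F(R_{i,j}) - F(R_{i,j-1})} \right] \]
where $f(x) = dF(x)/dx$ and $\delta_{j,k} = 1$ if $j = k$ and 0 otherwise. When the ordinal probit is estimated, it is assumed that $F(R_{i,j}) = \Phi(R_{i,j})$. The ordinal logit model is estimated if $F(R_{i,j}) = \Lambda(R_{i,j})$. The first threshold parameter, $\mu_1$, is estimated when the LIMIT1=VARYING option is specified. By default (LIMIT1=ZERO), $\mu_1 = 0$, so that $M - 2$ threshold parameters ($\mu_2, \ldots, \mu_{M-1}$) are estimated.
The ordered probit models are analyzed by Aitchison and Silvey (1957), and Cox (1970) discussed ordered response data by using the logit model. They defined the probability that $y_i^*$ belongs to the jth category as
\[ P[\mu_{j-1} < y_i^* \le \mu_j] = F(\mu_j + x_i'\theta) - F(\mu_{j-1} + x_i'\theta) \]
where $\mu_0 = -\infty$ and $\mu_M = \infty$. Therefore, the ordered response model analyzed by Aitchison and Silvey can be estimated if the LIMIT1=VARYING option is specified. Note that $\theta = -\beta$.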
Goodness-of-Fit Measures
The goodness-of-fit measures discussed in this section apply only to discrete dependent variable models. McFadden (1974) suggested a likelihood ratio index that is analogous to the $R^2$ in the linear regression model:
\[ R_M^2 = 1 - \frac{\ln L}{\ln L_0} \]
where $L$ is the value of the maximum likelihood function and $L_0$ is a likelihood function when regression coefficients except an intercept term are zero. It can be shown that $\ln L_0$ can be written as
\[ \ln L_0 = \sum_{j=1}^{M} N_j \ln\left( \frac{N_j}{N} \right) \]
where $N_j$ is the number of responses in category $j$. Estrella (1998) proposes the following requirements for a goodness-of-fit measure to be desirable in discrete choice modeling:
The measure must take values in $[0, 1]$, where 0 represents no fit and 1 corresponds to perfect fit.
The measure should be directly related to the valid test statistic for significance of all slope coefficients.
The derivative of the measure with respect to the test statistic should comply with corresponding derivatives in a linear regression.
Estrella's (1998) measure is written
\[ R_{E1}^2 = 1 - \left( \frac{\ln L}{\ln L_0} \right)^{-\frac{2}{N} \ln L_0} \]
and the adjusted version of the measure is
\[ R_{E2}^2 = 1 - \left[ \frac{\ln L - K}{\ln L_0} \right]^{-\frac{2}{N} \ln L_0} \]
where $\ln L_0$ is computed with null slope parameter values, $N$ is the number of observations used, and $K$ represents the number of estimated parameters. Other goodness-of-fit measures are summarized as follows:
\[ R_{CU1}^2 = 1 - \left( \frac{L_0}{L} \right)^{\frac{2}{N}} \quad \text{(Cragg-Uhler 1)} \]
\[ R_{CU2}^2 = \frac{1 - (L_0/L)^{\frac{2}{N}}}{1 - L_0^{\frac{2}{N}}} \quad \text{(Cragg-Uhler 2)} \]
\[ R_A^2 = \frac{2(\ln L - \ln L_0)}{2(\ln L - \ln L_0) + N} \quad \text{(Aldrich-Nelson)} \]
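\[ y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases} \]
where $\epsilon_i \sim \text{iid } N(0, \sigma^2)$. The log-likelihood function of the standard censored regression model is
\[ \ell = \sum_{i \in \{y_i = 0\}} \ln\left[ 1 - \Phi(x_i'\beta/\sigma) \right] + \sum_{i \in \{y_i > 0\}} \ln\left[ \phi\left( \frac{y_i - x_i'\beta}{\sigma} \right) / \sigma \right] \]
where $\Phi(\cdot)$ is the cumulative density function of the standard normal distribution and $\phi(\cdot)$ is the probability density function of the standard normal distribution. The tobit model can be generalized to handle observation-by-observation censoring. The censored model on both of the lower and upper limits can be defined as
\[ y_i = \begin{cases} R_i & \text{if } y_i^* \ge R_i \\ y_i^* & \text{if } L_i < y_i^* < R_i \\ L_i & \text{if } y_i^* \le L_i \end{cases} \]
The log-likelihood function of the two-limit censored model is
\[ \ell = \sum_{i \in \{L_i < y_i < R_i\}} \ln\left[ \phi\left( \frac{y_i - x_i'\beta}{\sigma} \right) / \sigma \right] + \sum_{i \in \{y_i = R_i\}} \ln\left[ \Phi\left( -\frac{R_i - x_i'\beta}{\sigma} \right) \right] + \sum_{i \in \{y_i = L_i\}} \ln\left[ \Phi\left( \frac{L_i - x_i'\beta}{\sigma} \right) \right] \]
Log-likelihood functions of the lower- or upper-limit censored model are easily derived from the two-limit censored model. The log-likelihood function of the lower-limit censored model is
\[ \ell = \sum_{i \in \{y_i > L_i\}} \ln\left[ \phi\left( \frac{y_i - x_i'\beta}{\sigma} \right) / \sigma \right] + \sum_{i \in \{y_i = L_i\}} \ln\left[ \Phi\left( \frac{L_i - x_i'\beta}{\sigma} \right) \right] \]
Similarly, the log-likelihood function of the upper-limit censored model is
\[ \ell = \sum_{i \in \{y_i < R_i\}} \ln\left[ \phi\left( \frac{y_i - x_i'\beta}{\sigma} \right) / \sigma \right] + \sum_{i \in \{y_i = R_i\}} \ln\left[ \Phi\left( -\frac{R_i - x_i'\beta}{\sigma} \right) \right] \]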
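Type 2 Tobit
The Type 2 Tobit model is defined as follows:
\[ y_{1i}^* = x_{1i}'\beta_1 + u_{1i}, \qquad y_{2i}^* = x_{2i}'\beta_2 + u_{2i} \]
\[ y_{1i} = \begin{cases} 1 & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{2i} = \begin{cases} y_{2i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \]
where $(u_{1i}, u_{2i})' \sim N(0, \Sigma)$. The likelihood function is described as $P(y_1 < 0)P(y_1 > 0, y_2)$.
Type 3 Tobit
The Type 3 Tobit model is different from the Type 2 Tobit in that $y_{1i}^*$ of the Type 3 Tobit is observed when $y_{1i}^* > 0$.
\[ y_{1i}^* = x_{1i}'\beta_1 + u_{1i}, \qquad y_{2i}^* = x_{2i}'\beta_2 + u_{2i} \]
\[ y_{1i} = \begin{cases} y_{1i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{2i} = \begin{cases} y_{2i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \]
where $(u_{1i}, u_{2i})' \sim \text{iid } N(0, \Sigma)$. The likelihood function is characterized as $P(y_1 < 0)P(y_1, y_2)$.
Type 4 Tobit
The Type 4 Tobit model consists of three equations:
\[ y_{1i}^* = x_{1i}'\beta_1 + u_{1i}, \qquad y_{2i}^* = x_{2i}'\beta_2 + u_{2i}, \qquad y_{3i}^* = x_{3i}'\beta_3 + u_{3i} \]
\[ y_{1i} = \begin{cases} y_{1i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{2i} = \begin{cases} y_{2i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{3i} = \begin{cases} y_{3i}^* & \text{if } y_{1i}^* \le 0 \\ 0 & \text{if } y_{1i}^* > 0 \end{cases} \]
where $(u_{1i}, u_{2i}, u_{3i})' \sim \text{iid } N(0, \Sigma)$. The likelihood function of the Type 4 Tobit model is characterized as $P(y_1 < 0, y_3)P(y_1, y_2)$.
Type 5 Tobit
The Type 5 Tobit model is defined as follows:
\[ y_{1i}^* = x_{1i}'\beta_1 + u_{1i}, \qquad y_{2i}^* = x_{2i}'\beta_2 + u_{2i}, \qquad y_{3i}^* = x_{3i}'\beta_3 + u_{3i} \]
\[ y_{1i} = \begin{cases} 1 & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{2i} = \begin{cases} y_{2i}^* & \text{if } y_{1i}^* > 0 \\ 0 & \text{if } y_{1i}^* \le 0 \end{cases} \qquad y_{3i} = \begin{cases} y_{3i}^* & \text{if } y_{1i}^* \le 0 \\ 0 & \text{if } y_{1i}^* > 0 \end{cases} \]
where $(u_{1i}, u_{2i}, u_{3i})'$ are from an iid trivariate normal distribution. The likelihood function of the Type 5 Tobit model is characterized as $P(y_1 < 0, y_3)P(y_1 > 0, y_2)$. Code examples for these models can be found in Example 21.6: Types of Tobit Models on page 1426.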
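The two-limit truncation model is defined as
\[ y_i = y_i^* \quad \text{if } L_i < y_i^* < R_i \]
The log-likelihood function of the two-limit truncated regression model is
\[ \ell = \sum_{i=1}^{N} \left\{ \ln\left[ \phi\left( \frac{y_i - x_i'\beta}{\sigma} \right) / \sigma \right] - \ln\left[ \Phi\left( \frac{R_i - x_i'\beta}{\sigma} \right) - \Phi\left( \frac{L_i - x_i'\beta}{\sigma} \right) \right] \right\} \]
The log-likelihood functions of the lower- and upper-limit truncation models are
\[ \ell = \sum_{i=1}^{N} \left\{ \ln\left[ \phi\left( \frac{y_i - x_i'\beta}{\sigma} \right) / \sigma \right] - \ln\left[ 1 - \Phi\left( \frac{L_i - x_i'\beta}{\sigma} \right) \right] \right\} \quad \text{(lower)} \]
\[ \ell = \sum_{i=1}^{N} \left\{ \ln\left[ \phi\left( \frac{y_i - x_i'\beta}{\sigma} \right) / \sigma \right] - \ln\left[ \Phi\left( \frac{R_i - x_i'\beta}{\sigma} \right) \right] \right\} \quad \text{(upper)} \]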
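where $\epsilon_i = v_i - u_i$. The $v_i$ term represents the stochastic error component and $u_i$ is the nonnegative, technology inefficiency error component. The $v_i$ error component is assumed to be distributed iid normal and independently from $u_i$. If $u_i > 0$, the error term, $\epsilon_i$, is negatively skewed and represents technology inefficiency. If $u_i < 0$, the error term $\epsilon_i$ is positively skewed and represents cost inefficiency. PROC QLIM models the $u_i$ error component as a half normal, exponential, or truncated normal distribution.
For the normal-half normal model, the joint density of $u$ and $v$ is
\[ f(u, v) = \frac{2}{2\pi\sigma_u\sigma_v} \exp\left\{ -\frac{u^2}{2\sigma_u^2} - \frac{v^2}{2\sigma_v^2} \right\} \]
Substituting $v = \epsilon + u$ gives the joint density of $u$ and $\epsilon$:
\[ f(u, \epsilon) = \frac{2}{2\pi\sigma_u\sigma_v} \exp\left\{ -\frac{u^2}{2\sigma_u^2} - \frac{(\epsilon + u)^2}{2\sigma_v^2} \right\} \]
Integrating $u$ out to obtain the marginal density function of $\epsilon$ results in the following form:
\[ f(\epsilon) = \int_0^{\infty} f(u, \epsilon)\, du = \frac{2}{\sqrt{2\pi}\sigma} \left[ 1 - \Phi\left( \frac{\epsilon\lambda}{\sigma} \right) \right] \exp\left\{ -\frac{\epsilon^2}{2\sigma^2} \right\} = \frac{2}{\sigma} \phi\left( \frac{\epsilon}{\sigma} \right) \Phi\left( -\frac{\epsilon\lambda}{\sigma} \right) \]
where $\lambda = \sigma_u/\sigma_v$ and $\sigma = \sqrt{\sigma_u^2 + \sigma_v^2}$.
The log-likelihood function for the production model with N producers is written as
\[ \ln L = \text{constant} - N \ln\sigma + \sum_i \ln\Phi\left( -\frac{\epsilon_i\lambda}{\sigma} \right) - \frac{1}{2\sigma^2} \sum_i \epsilon_i^2 \]
For the normal-exponential model, the joint density of $u$ and $v$ is
\[ f(u, v) = \frac{1}{\sqrt{2\pi}\sigma_u\sigma_v} \exp\left\{ -\frac{u}{\sigma_u} - \frac{v^2}{2\sigma_v^2} \right\} \]
The marginal density function of $\epsilon$ for the production function is
\[ f(\epsilon) = \int_0^{\infty} f(u, \epsilon)\, du = \frac{1}{\sigma_u} \Phi\left( -\frac{\epsilon}{\sigma_v} - \frac{\sigma_v}{\sigma_u} \right) \exp\left\{ \frac{\epsilon}{\sigma_u} + \frac{\sigma_v^2}{2\sigma_u^2} \right\} \]
and the marginal density function for the cost function is equal to
\[ f(\epsilon) = \frac{1}{\sigma_u} \Phi\left( \frac{\epsilon}{\sigma_v} - \frac{\sigma_v}{\sigma_u} \right) \exp\left\{ -\frac{\epsilon}{\sigma_u} + \frac{\sigma_v^2}{2\sigma_u^2} \right\} \]
The log-likelihood function for the normal-exponential production model with N producers is
\[ \ln L = \text{constant} - N \ln\sigma_u + N \frac{\sigma_v^2}{2\sigma_u^2} + \sum_i \frac{\epsilon_i}{\sigma_u} + \sum_i \ln\Phi\left( -\frac{\epsilon_i}{\sigma_v} - \frac{\sigma_v}{\sigma_u} \right) \]
For the normal-truncated normal model, the joint density of $u$ and $v$ is
\[ f(u, v) = \frac{1}{2\pi\sigma_u\sigma_v\Phi(\mu/\sigma_u)} \exp\left\{ -\frac{(u - \mu)^2}{2\sigma_u^2} - \frac{v^2}{2\sigma_v^2} \right\} \]
and the marginal density function of $\epsilon$ for the production function is
\[ f(\epsilon) = \frac{1}{\sigma} \phi\left( \frac{\epsilon + \mu}{\sigma} \right) \Phi\left( \frac{\mu}{\sigma\lambda} - \frac{\epsilon\lambda}{\sigma} \right) \left[ \Phi\left( \frac{\mu}{\sigma_u} \right) \right]^{-1} \]
The log-likelihood function for the normal-truncated normal production model with N producers is
\[ \ln L = \text{constant} - N \ln\sigma - N \ln\Phi\left( \frac{\mu}{\sigma_u} \right) + \sum_i \ln\Phi\left( \frac{\mu}{\sigma\lambda} - \frac{\epsilon_i\lambda}{\sigma} \right) - \frac{1}{2} \sum_i \left( \frac{\epsilon_i + \mu}{\sigma} \right)^2 \]
For more detail on normal-half normal, normal-exponential, and normal-truncated models, see Kumbhakar and Knox Lovell (2000) and Coelli, Prasada Rao, and Battese (1998).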
\[ \sigma_i^2 = f(z_i'\gamma) \]
The following table shows various functional forms of heteroscedasticity and the corresponding options to request each model.
No.  Model                                                                      Options
1    $f(z_i'\gamma) = \sigma^2 (1 + \exp(z_i'\gamma))$                          link=EXP (default)
2    $f(z_i'\gamma) = \sigma^2 \exp(z_i'\gamma)$                                link=EXP noconst
3    $f(z_i'\gamma) = \sigma^2 (1 + \sum_{l=1}^{L} \gamma_l z_{li})$            link=LINEAR
4    $f(z_i'\gamma) = \sigma^2 (1 + (\sum_{l=1}^{L} \gamma_l z_{li})^2)$        link=LINEAR square
5    $f(z_i'\gamma) = \sigma^2 (\sum_{l=1}^{L} \gamma_l z_{li})$                link=LINEAR noconst
6    $f(z_i'\gamma) = \sigma^2 ((\sum_{l=1}^{L} \gamma_l z_{li})^2)$            link=LINEAR square noconst
For discrete choice models, $\sigma^2$ is normalized ($\sigma^2 = 1$) since this parameter is not identified. Note that in models 3 and 5, the variances of some observations can be negative. Although the QLIM procedure assigns a large penalty to move the optimization away from such a region, the optimization might fail to improve the objective function value and become locked in the region. Signs of such an outcome include extremely small likelihood values or missing standard errors in the estimates. In models 2 and 6, variances are guaranteed to be greater than or equal to zero, but the variances of some observations can be very close to zero. In these scenarios, standard errors may be missing. Models 1 and 4 do not have such problems. Variances in these models are always positive and never close to zero. The heteroscedastic regression model is estimated using the following log-likelihood function:
\[ \ell = -\frac{N}{2} \ln(2\pi) - \sum_{i=1}^{N} \frac{1}{2} \ln(\sigma_i^2) - \frac{1}{2} \sum_{i=1}^{N} \left( \frac{e_i}{\sigma_i} \right)^2 \]
where $e_i = y_i - x_i'\beta$.
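As a sketch of how one of these variance forms is requested, the following step specifies model 4 from the table through the HETERO statement. The data set one and the variables y, x1, x2, z1, and z2 are hypothetical.

```sas
/* Hypothetical data set ONE; z1 and z2 drive the error variance */
proc qlim data=one;
   model y = x1 x2;
   /* model 4: sigma_i^2 = sigma^2 * (1 + (gamma1*z1 + gamma2*z2)^2) */
   hetero y ~ z1 z2 / link=linear square;
run;
```

Model 4 is used here because, as the text notes, its variances are always positive and never close to zero.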
Box-Cox Modeling
The Box-Cox transformation on $x$ is defined as
\[ x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \ne 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases} \]
The Box-Cox regression model with heteroscedasticity is written as
\[ y_i^{(\lambda_0)} = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ki}^{(\lambda_k)} + \epsilon_i = \mu_i + \epsilon_i \]
where $\epsilon_i \sim N(0, \sigma_i^2)$ and transformed variables must be positive. In practice, too many transformation parameters cause numerical problems in model fitting. It is common to have the same Box-Cox transformation performed on all the variables, that is, $\lambda_0 = \lambda_1 = \cdots = \lambda_K$. The magnitude of transformed variables must stay in a tolerable range if the corresponding transformation parameters satisfy $|\lambda| > 1$. The log-likelihood function of the Box-Cox regression model is written as
\[ \ell = -\frac{N}{2} \ln(2\pi) - \sum_{i=1}^{N} \ln(\sigma_i) - \frac{1}{2} \sum_{i=1}^{N} \frac{e_i^2}{\sigma_i^2} + (\lambda_0 - 1) \sum_{i=1}^{N} \ln(y_i) \]
where $e_i = y_i^{(\lambda_0)} - \mu_i$.
When the dependent variable is discrete, censored, or truncated, the Box-Cox transformation can be applied only to explanatory variables.
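A minimal sketch of a Box-Cox specification follows, using the variable-list < = varname > form of the BOXCOX option. The data set one, the variables y, x1, and x2, and the coefficient names lambda0 and lambda1 are hypothetical; all transformed variables are assumed to be strictly positive.

```sas
/* Hypothetical data set ONE; y, x1, x2 must all be positive */
proc qlim data=one;
   /* y gets its own coefficient (lambda0); x1 and x2 share lambda1 */
   model y = x1 x2 / boxcox(y = lambda0, x1 x2 = lambda1);
run;
```

Sharing one coefficient across the regressors keeps the number of transformation parameters small, which the text recommends for numerical stability.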
where the disturbances, $\epsilon_{1i}$ and $\epsilon_{2i}$, have joint normal distribution with zero mean, standard deviations $\sigma_1$ and $\sigma_2$, and correlation of $\rho$. $y_1^*$ and $y_2^*$ are latent variables. The dependent variables $y_1$ and $y_2$ are observed if the latent variables $y_1^*$ and $y_2^*$ fall in certain ranges:
\[ y_1 = y_{1i} \text{ if } y_{1i}^* \in D_1(y_{1i}), \qquad y_2 = y_{2i} \text{ if } y_{2i}^* \in D_2(y_{2i}) \]
$D$ is a transformation from $(y_{1i}^*, y_{2i}^*)$ to $(y_{1i}, y_{2i})$. For example, if $y_1$ and $y_2$ are censored variables with lower bound 0, then
\[ y_1 = y_{1i} \text{ if } y_{1i}^* > 0, \qquad y_1 = 0 \text{ if } y_{1i}^* \le 0 \]
\[ y_2 = y_{2i} \text{ if } y_{2i}^* > 0, \qquad y_2 = 0 \text{ if } y_{2i}^* \le 0 \]
There are three cases for the log likelihood of $(y_{1i}, y_{2i})$. The first case is that $y_{1i} = y_{1i}^*$ and $y_{2i} = y_{2i}^*$. That is, this observation is mapped to one point in the space of latent variables. The log likelihood is computed from a bivariate normal density,
\[ \ell_i = \ln\phi_2\left( \frac{y_1 - x_1'\beta_1}{\sigma_1}, \frac{y_2 - x_2'\beta_2}{\sigma_2}, \rho \right) - \ln\sigma_1 - \ln\sigma_2 \]
where $\phi_2(u, v, \rho)$ is the density function for standardized bivariate normal distribution with correlation $\rho$,
\[ \phi_2(u, v, \rho) = \frac{\exp\left\{ -(1/2)(u^2 + v^2 - 2\rho uv)/(1 - \rho^2) \right\}}{2\pi (1 - \rho^2)^{1/2}} \]
The second case is that one observed dependent variable is mapped to a point of its latent variable and the other dependent variable is mapped to a segment in the space of its latent variable. For example, in the bivariate censored model specified, if observed $y_1 > 0$ and $y_2 = 0$, then $y_1^* = y_1$ and $y_2^* \in (-\infty, 0]$. In general, the log likelihood for one observation can be written as follows (the subscript $i$ is dropped for simplicity): If one set is a single point and the other set is a range, without loss of generality, let $D_1(y_1) = \{y_1\}$ and $D_2(y_2) = [L_2, R_2]$,
\[ \ell_i = \ln\phi\left( \frac{y_1 - x_1'\beta_1}{\sigma_1} \right) - \ln\sigma_1 + \ln\left[ \Phi\left( \frac{\frac{R_2 - x_2'\beta_2}{\sigma_2} - \rho\frac{y_1 - x_1'\beta_1}{\sigma_1}}{\sqrt{1 - \rho^2}} \right) - \Phi\left( \frac{\frac{L_2 - x_2'\beta_2}{\sigma_2} - \rho\frac{y_1 - x_1'\beta_1}{\sigma_1}}{\sqrt{1 - \rho^2}} \right) \right] \]
where $\phi$ and $\Phi$ are the density function and the cumulative probability function for standardized univariate normal distribution. The third case is that both dependent variables are mapped to segments in the space of latent variables. For example, in the bivariate censored model specified, if observed $y_1 = 0$ and $y_2 = 0$, then $y_1^* \in (-\infty, 0]$ and $y_2^* \in (-\infty, 0]$. In general, if $D_1(y_1) = [L_1, R_1]$ and $D_2(y_2) = [L_2, R_2]$, the log likelihood is
\[ \ell_i = \ln \int_{\frac{L_1 - x_1'\beta_1}{\sigma_1}}^{\frac{R_1 - x_1'\beta_1}{\sigma_1}} \int_{\frac{L_2 - x_2'\beta_2}{\sigma_2}}^{\frac{R_2 - x_2'\beta_2}{\sigma_2}} \phi_2(u, v, \rho)\, dv\, du \]
Selection Models
In sample selection models, one or several dependent variables are observed when another variable takes certain values. For example, the standard Heckman selection model can be defined as
\[ z_i^* = w_i'\gamma + u_i \]
\[ z_i = \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases} \]
\[ y_i = x_i'\beta + \epsilon_i \quad \text{if } z_i = 1 \]
where $u_i$ and $\epsilon_i$ are jointly normal with zero mean, standard deviations of 1 and $\sigma$, and correlation of $\rho$. $z$ is the variable that the selection is based on, and $y$ is observed when $z$ has a value of 1. Least squares regression using the observed data of $y$ produces inconsistent estimates of $\beta$. The maximum likelihood method is used to estimate selection models. It is also possible to estimate these models by using Heckman's method, which is more computationally efficient. But it can be shown that the resulting estimates, although consistent, are not asymptotically efficient under the normality assumption. Moreover, this method often violates the constraint on the correlation coefficient, $|\rho| \le 1$. The log-likelihood function of the Heckman selection model is written as
\[ \ell = \sum_{i \in \{z_i = 0\}} \ln\left[ 1 - \Phi(w_i'\gamma) \right] + \sum_{i \in \{z_i = 1\}} \left\{ \ln\phi\left( \frac{y_i - x_i'\beta}{\sigma} \right) - \ln\sigma + \ln\Phi\left( \frac{w_i'\gamma + \rho\frac{y_i - x_i'\beta}{\sigma}}{\sqrt{1 - \rho^2}} \right) \right\} \]
Only one variable is allowed for the selection to be based on, but the selection may lead to several variables. For example, in the following switching regression model,
\[ z_i^* = w_i'\gamma + u_i \]
\[ z_i = \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases} \]
\[ y_{1i} = x_{1i}'\beta_1 + \epsilon_{1i} \quad \text{if } z_i = 0 \]
\[ y_{2i} = x_{2i}'\beta_2 + \epsilon_{2i} \quad \text{if } z_i = 1 \]
$z$ is the variable that the selection is based on. If $z = 0$, then $y_1$ is observed. If $z = 1$, then $y_2$ is observed. Because it is never the case that $y_1$ and $y_2$ are observed at the same time, the correlation between $y_1$ and $y_2$ cannot be estimated. Only the correlation between $z$ and $y_1$ and the correlation between $z$ and $y_2$ can be estimated. This estimation uses the maximum likelihood method. A brief example of the code for this model can be found in Example 21.4: Sample Selection Model on page 1422.
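The Heckman selection model can include censoring or truncation. For a brief example of the code for these models see Example 21.5: Sample Selection Model with Truncation and Censoring on page 1423. The following example shows a variable $y_i$ that is censored from below at zero.
\[ z_i^* = w_i'\gamma + u_i \]
\[ z_i = \begin{cases} 1 & \text{if } z_i^* > 0 \\ 0 & \text{if } z_i^* \le 0 \end{cases} \]
\[ y_i^* = x_i'\beta + \epsilon_i \quad \text{if } z_i = 1 \]
\[ y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases} \]
In this case, the log-likelihood function of the Heckman selection model needs to be modified to include the censored region.
\[ \ell = \sum_{\{i | z_i = 0\}} \ln\left[ 1 - \Phi(w_i'\gamma) \right] + \sum_{\{i | z_i = 1, y_i = y_i^*\}} \left\{ \ln\left[ \phi\left( \frac{y_i - x_i'\beta}{\sigma} \right) \right] - \ln\sigma + \ln\left[ \Phi\left( \frac{w_i'\gamma + \rho\frac{y_i - x_i'\beta}{\sigma}}{\sqrt{1 - \rho^2}} \right) \right] \right\} + \sum_{\{i | z_i = 1, y_i = 0\}} \ln \int_{-\infty}^{-\frac{x_i'\beta}{\sigma}} \int_{-w_i'\gamma}^{\infty} \phi_2(u, v, \rho)\, du\, dv \]
In case $y_i$ is truncated from below at zero instead of censored, the likelihood function can be written as
\[ \ell = \sum_{\{i | z_i = 0\}} \ln\left[ 1 - \Phi(w_i'\gamma) \right] + \sum_{\{i | z_i = 1\}} \left\{ \ln\left[ \phi\left( \frac{y_i - x_i'\beta}{\sigma} \right) \right] - \ln\sigma + \ln\left[ \Phi\left( \frac{w_i'\gamma + \rho\frac{y_i - x_i'\beta}{\sigma}}{\sqrt{1 - \rho^2}} \right) \right] - \ln\Phi(x_i'\beta/\sigma) \right\} \]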
\[ y_{ji}^* = x_{ji}'\beta_j + \epsilon_{ji}, \qquad j = 1, \ldots, m \]
where $m$ is the number of models to be estimated. The vector $\epsilon$ has multivariate normal distribution with mean 0 and variance-covariance matrix $\Sigma$. Similar to bivariate models, the likelihood may involve computing multivariate normal integrations. This is done using Monte Carlo integration. (See Genz (1992) and Hajivassiliou and McFadden (1998).) When the number of equations, $N$, increases in a system, the number of parameters increases at the rate of $N^2$ because of the correlation matrix. When the number of parameters is large, sometimes the optimization converges but some of the standard deviations are missing. This usually means that the model is over-parameterized. The default method for computing the covariance is to use the inverse Hessian matrix. The Hessian is computed by finite differences, and in over-parameterized cases, the inverse cannot be computed. It is recommended that you reduce the number of parameters in such cases. Sometimes using the outer product covariance matrix (COVEST=OP option) may also help.
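The COVEST=OP suggestion can be sketched as follows for a two-equation discrete model. The data set a and the variables y1, y2, x1, and x2 are hypothetical and mirror the bivariate probit setup used elsewhere in this chapter.

```sas
/* Hypothetical data set A; request the outer product covariance estimate */
proc qlim data=a covest=op;
   model y1 = x1 x2;
   model y2 = x1 x2;
   endogenous y1 y2 ~ discrete;
run;
```

If standard errors are still missing with COVEST=OP, reducing the number of parameters remains the recommended remedy.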
Tests on Parameters
In general, the hypothesis tested can be written as
\[ H_0: h(\theta) = 0 \]
where $h(\theta)$ is an r-by-1 vector-valued function of the parameters $\theta$ given by the $r$ expressions specified in the TEST statement. Let $\hat{V}$ be the estimate of the covariance matrix of $\hat\theta$. Let $\hat\theta$ be the unconstrained estimate of $\theta$ and $\tilde\theta$ be the constrained estimate of $\theta$ such that $h(\tilde\theta) = 0$. Let
\[ A(\theta) = \partial h(\theta)/\partial\theta \,|_{\hat\theta} \]
Using this notation, the test statistics for the three kinds of tests are computed as follows. The Wald test statistic is defined as
\[ W = h'(\hat\theta) \left[ A(\hat\theta) \hat{V} A'(\hat\theta) \right]^{-1} h(\hat\theta) \]
The Wald test is not invariant to reparameterization of the model (Gregory 1985; Gallant 1987, p. 219). For more information about the theoretical properties of the Wald test, see Phillips and Park (1988). The Lagrange multiplier test statistic is
\[ LM = \lambda' A(\tilde\theta) \tilde{V} A'(\tilde\theta) \lambda \]
where $\lambda$ is the vector of Lagrange multipliers from the computation of the restricted estimate $\tilde\theta$. The likelihood ratio test statistic is
\[ LR = 2\left( L(\hat\theta) - L(\tilde\theta) \right) \]
where $\tilde\theta$ represents the constrained estimate of $\theta$ and $L$ is the concentrated log-likelihood value. For each kind of test, under the null hypothesis the test statistic is asymptotically distributed as a $\chi^2$ random variable with $r$ degrees of freedom, where $r$ is the number of expressions in the TEST statement. The p-values reported for the tests are computed from the $\chi^2(r)$ distribution and are only asymptotically valid. Monte Carlo simulations suggest that the asymptotic distribution of the Wald test is a poorer approximation to its small sample distribution than that of the other two tests. However, the Wald test has the lowest computational cost, since it does not require computation of the constrained estimate $\tilde\theta$. The following is an example of using the TEST statement to perform a likelihood ratio test:
proc qlim;
   model y = x1 x2 x3;
   test x1 = 0, x2 * .5 + 2 * x3 = 0 / lr;
run;
Marginal Effects
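Marginal effect is defined as a contribution of one control variable to the response variable. For the binary choice model with two response categories, $\mu_0 = -\infty$, $\mu_1 = 0$, $\mu_2 = \infty$; and ordinal response model with M response categories, $\mu_0, \ldots, \mu_M$, define
\[ R_{i,j} = \mu_j - x_i'\beta \]
The probability that the unobserved dependent variable is contained in the jth category can be written as
\[ P[\mu_{j-1} < y_i^* \le \mu_j] = F(R_{i,j}) - F(R_{i,j-1}) \]
The marginal effect of changes in the regressors on the probability of $y_i = j$ is then
\[ \frac{\partial \text{Prob}[y_i = j]}{\partial x} = \left[ f(\mu_{j-1} - x_i'\beta) - f(\mu_j - x_i'\beta) \right] \beta \]
where $f(x) = dF(x)/dx$. In particular,
\[ f(x) = \frac{dF(x)}{dx} = \begin{cases} \dfrac{1}{\sqrt{2\pi}} e^{-x^2/2} & \text{(probit)} \\ \dfrac{e^{-x}}{[1 + e^{-x}]^2} & \text{(logit)} \end{cases} \]
The marginal effects in the truncated regression model are
\[ \frac{\partial E[y_i | L_i < y_i^* < R_i]}{\partial x} = \beta \left[ 1 - \frac{(\phi(a_i) - \phi(b_i))^2}{(\Phi(b_i) - \Phi(a_i))^2} + \frac{a_i\phi(a_i) - b_i\phi(b_i)}{\Phi(b_i) - \Phi(a_i)} \right] \]
where $a_i = \dfrac{L_i - x_i'\beta}{\sigma_i}$ and $b_i = \dfrac{R_i - x_i'\beta}{\sigma_i}$.
The marginal effects in the censored regression model are
\[ \frac{\partial E[y | x_i]}{\partial x} = \beta \times \text{Prob}[L_i < y_i^* < R_i] \]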
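for a binary discrete variable. The expected value is the unconditional expectation of the dependent variable. With $a_i$ and $b_i$ the standardized lower and upper boundaries and $\lambda = \dfrac{\phi(a_i) - \phi(b_i)}{\Phi(b_i) - \Phi(a_i)}$ the inverse Mills ratio, the expected value for a censored variable is
\[ E[y_i] = \Phi(a_i) L_i + (x_i'\beta + \lambda\sigma_i)(\Phi(b_i) - \Phi(a_i)) + (1 - \Phi(b_i)) R_i \]
For a left-censored variable ($R_i = \infty$), this formula is
\[ E[y_i] = \Phi(a_i) L_i + (x_i'\beta + \lambda\sigma_i)(1 - \Phi(a_i)) \]
where $\lambda = \dfrac{\phi(a_i)}{1 - \Phi(a_i)}$. For a right-censored variable ($L_i = -\infty$), this formula is
\[ E[y_i] = (x_i'\beta + \lambda\sigma_i)\Phi(b_i) + (1 - \Phi(b_i)) R_i \]
where $\lambda = -\dfrac{\phi(b_i)}{\Phi(b_i)}$. For a noncensored variable, this formula is
\[ E[y_i] = x_i'\beta \]
The conditional expected value is the expectation given that the variable is inside the boundaries:
\[ E[y_i | L_i < y_i < R_i] = x_i'\beta + \lambda\sigma_i \]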
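As a worked check of the left-censored case, assume the standard tobit setting with lower bound $L_i = 0$ and no upper bound, so that $a_i = -x_i'\beta/\sigma_i$ and $\lambda = \phi(a_i)/(1 - \Phi(a_i))$. The choice $L_i = 0$ is an illustrative assumption. Using the symmetry of the normal distribution, $1 - \Phi(a_i) = \Phi(x_i'\beta/\sigma_i)$ and $\phi(a_i) = \phi(x_i'\beta/\sigma_i)$, the expected value simplifies as follows:

```latex
\begin{aligned}
E[y_i] &= \Phi(a_i)\cdot 0 + \left(x_i'\beta + \lambda\sigma_i\right)\left(1-\Phi(a_i)\right) \\
       &= x_i'\beta\,\left(1-\Phi(a_i)\right) + \sigma_i\,\phi(a_i) \\
       &= x_i'\beta\;\Phi\!\left(\frac{x_i'\beta}{\sigma_i}\right)
          + \sigma_i\;\phi\!\left(\frac{x_i'\beta}{\sigma_i}\right)
\end{aligned}
```

This is the familiar unconditional mean of the tobit model, and dividing by $\Phi(x_i'\beta/\sigma_i)$ recovers the conditional expected value $x_i'\beta + \lambda\sigma_i$.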
Probability
Probability applies only to discrete responses. It is the marginal probability that the discrete response is taking the value of the observation. If the PROBALL option is specified, then the probability for all of the possible responses of the discrete variables is computed.
Each parameter contains the estimate for the corresponding parameter in the corresponding model. In addition, the OUTEST= data set contains the following variables:
_NAME_      the name of the independent variable
_TYPE_      type of observation. PARM indicates the row of coefficients; STD indicates the row of standard deviations of the corresponding coefficients.
_STATUS_    convergence status for optimization
The rest of the columns correspond to the explanatory variables. The OUTEST= data set contains one observation for the MODEL statement, giving the parameter estimates for that model. If the COVOUT option is specified, the OUTEST= data set includes additional observations for the MODEL statement, giving the rows of the covariance matrix of parameter estimates. For covariance observations, the value of the _TYPE_ variable is COV, and the _NAME_ variable identifies the parameter associated with that row of the covariance matrix. If the CORROUT option is specified, the OUTEST= data set includes additional observations for the MODEL statement, giving the rows of the correlation matrix of parameter estimates. For correlation observations, the value of the _TYPE_ variable is CORR, and the _NAME_ variable identifies the parameter associated with that row of the correlation matrix.
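A short sketch of writing and inspecting an OUTEST= data set follows. The data set names one and est and the variables y, x1, and x2 are hypothetical.

```sas
/* Hypothetical data set ONE; save estimates and their covariance matrix */
proc qlim data=one outest=est covout;
   model y = x1 x2 / censored(lb=0);
run;

/* The _TYPE_ column distinguishes PARM, STD, and COV rows */
proc print data=est;
run;
```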
Naming
Naming of Parameters
When there is only one equation in the estimation, parameters are named in the same way as in other SAS procedures such as REG, PROBIT, and so on. The constant in the regression equation is called Intercept. The coefficients on independent variables are named by the independent variables. The standard deviation of the errors is called _Sigma. If there are Box-Cox transformations, the coefficients are named _Lambdai, where i increments from 1, or as specified by the user. The limits for the discrete dependent variable are named _Limiti. If the LIMIT1=VARYING option is specified, then _Limiti starts from 1. If the LIMIT1=VARYING option is not specified, then _Limit1 is set to 0 and the limit parameters start from i = 2. If the HETERO statement is included, the coefficients of the independent variables in the hetero equation are called _H.x, where x is the name of the independent variable. If the parameter name includes interaction terms, it needs to be enclosed in quotation marks followed by N. The following example restricts the parameter that includes the interaction term to be greater than zero:
proc qlim data=a;
   model y = x1|x2;
   endogenous y ~ discrete;
   restrict "x1*x2"N>0;
run;
When there are multiple equations in the estimation, the parameters in the main equation are named in the format of y.x, where y is the name of the dependent variable and x is the name of the independent variable. The standard deviation of the errors is called _Sigma.y. The correlation of the errors is called _Rho for a bivariate model. For the model with three variables it is _Rho.y1.y2, _Rho.y1.y3, _Rho.y2.y3. The construction of correlation names for multivariate models is analogous. Box-Cox parameters are called _Lambdai.y and limit variables are called _Limiti.y. Parameters in the HETERO statement are named as _H.y.x. In the OUTEST= data set, every period (.) in a variable name is changed to an underscore (_).
If you prefer to name the output variables differently, you can use the RENAME option in the data set. For example, the following statements rename the residual of y as Resid:
proc qlim data=one;
   model y = x1-x10 / censored;
   output out=outds(rename=(resid_y=resid)) residual;
run;
Table 21.2
ODS Table Name               Description                                Option
ODS Tables Created by the Model Statement
ResponseProfile              Response Profile                           default
ClassLevels                  Class Levels                               default
FitSummary                   Summary of Non-linear Estimation           default
GoodnessOfFit                Pseudo-R2 Measures                         default
ConvergenceStatus            Convergence Status                         default
ParameterEstimates           Parameter Estimates                        default
CovB                         Covariance of Parameter Estimates          COVB
CorrB                        Correlation of Parameter Estimates         CORRB
LinCon                       Linear Constraints                         ITPRINT
InputOptions                 Input Options                              ITPRINT
ProblemDescription           Problem Description                        ITPRINT
IterStart                    Optimization Start                         ITPRINT
IterHist                     Iteration History                          ITPRINT
IterStop                     Optimization Results                       ITPRINT
ConvergenceStatus            Convergence Status                         ITPRINT
ParameterEstimatesResults    Resulting Parameters                       ITPRINT
LinConSol                    Linear Constraints Evaluated at Solution   ITPRINT
ODS Tables Created by the Test Statement
TestResults                  Test Results                               default
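The ODS table names in Table 21.2 can be used to capture results as data sets. The following is a minimal sketch; the data set names one, pe, and gof and the variables y, x1, and x2 are hypothetical.

```sas
/* Route two default ODS tables to data sets PE and GOF */
ods output ParameterEstimates=pe GoodnessOfFit=gof;
proc qlim data=one;
   model y = x1 x2 / discrete;
run;
```

Tables that depend on an option, such as CovB or the ITPRINT tables, are available to ODS OUTPUT only when that option is also specified.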
The output of the QLIM procedure for ordered data modeling is shown in Output 21.1.1.
Output 21.1.1 Ordered Data Modeling
Binary Data
The QLIM Procedure
Discrete Response Profile of dvisits
Index    Value    Frequency    Percent
1        0        4141         79.79
2        1        782          15.07
3        2        174          3.35
4        3        30           0.58
5        4        24           0.46
6        5        9            0.17
7        6        12           0.23
8        7        12           0.23
9        8        6            0.12
Goodness-of-Fit Measures
Measure                   Value     Formula
Likelihood Ratio (R)      789.73    2 * (LogL - LogL0)
Upper Bound of R (U)      7065.9    - 2 * LogL0
Aldrich-Nelson            0.1321    R / (R+N)
Cragg-Uhler 1             0.1412    1 - exp(-R/N)
Cragg-Uhler 2             0.1898    (1-exp(-R/N)) / (1-exp(-U/N))
Estrella                  0.149     1 - (1-R/U)^(U/N)
Adjusted Estrella         0.1416    1 - ((LogL-K)/LogL0)^(-2/N*LogL0)
McFadden's LRI            0.1118    R / U
Veall-Zimmermann          0.2291    (R * (U+N)) / (U * (R+N))
McKelvey-Zavoina          0.2036
N = # of observations, K = # of regressors
Parameter    DF    Estimate     t Value
Intercept    1     -1.378705    -9.35
sex          1     0.131885     3.01
age          1     -0.534190    -0.65
agesq        1     0.857308     0.95
income       1     -0.062211    -0.91
levyplus     1     0.137030     2.57
freepoor     1     -0.346045    -2.67
freerepa     1     0.178382     2.40
illness      1     0.150485     9.56
actdays      1     0.100575     17.19
hscore       1     0.031862     3.46
chcond1      1     0.061601     1.26
chcond2      1     0.135321     2.00
_Limit2      1     0.938884     30.07
_Limit3      1     1.514288     30.70
_Limit4      1     1.711660     29.43
_Limit5      1     1.952860     27.12
_Limit6      1     2.087422     25.56
_Limit7      1     2.333786     22.93
_Limit8      1     2.789796     17.86
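As a worked check of the formula column, several of the reported measures can be reproduced by hand from $R = 789.73$, $U = 7065.9$, and $N = 5190$, where $N$ is taken here to be the sum of the frequencies in the response profile (an assumption, since the output does not print $N$ directly):

```latex
\begin{aligned}
\text{McFadden's LRI} &= \frac{R}{U} = \frac{789.73}{7065.9} \approx 0.1118 \\
\text{Aldrich-Nelson} &= \frac{R}{R+N} = \frac{789.73}{789.73 + 5190} \approx 0.1321 \\
\text{Cragg-Uhler 1} &= 1 - \exp\!\left(-\frac{R}{N}\right) = 1 - \exp(-0.15216) \approx 0.1412 \\
\text{Estrella} &= 1 - \left(1 - \frac{R}{U}\right)^{U/N} = 1 - (0.88823)^{1.36145} \approx 0.1490
\end{aligned}
```

The agreement with the printed values supports reading $N$ as the number of observations used in the estimation.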
By default, ordinal probit/logit models are estimated assuming that the first threshold or limit parameter ($\mu_1$) is 0. However, this parameter can also be estimated when the LIMIT1=VARYING option is specified. The probability that $y_i^*$ belongs to the jth category is defined as
\[ P[\mu_{j-1} < y_i^* < \mu_j] = F(\mu_j - x_i'\theta) - F(\mu_{j-1} - x_i'\theta) \]
where $F(\cdot)$ is the logistic or standard normal CDF, $\mu_0 = -\infty$ and $\mu_9 = \infty$. Output 21.1.2 lists ordinal probit estimates computed in the following program. Note that the intercept term is suppressed for model identification when $\mu_1$ is estimated.
/*-- Ordered Probit --*/
proc qlim data=docvisit;
   model dvisits = sex age agesq income levyplus
                   freepoor freerepa illness actdays hscore
                   chcond1 chcond2 / discrete(d=normal) limit1=varying;
run;
Parameter    DF    Estimate     t Value
sex          1     0.131885     3.01
age          1     -0.534181    -0.65
agesq        1     0.857298     0.95
income       1     -0.062211    -0.91
levyplus     1     0.137031     2.57
freepoor     1     -0.346045    -2.67
freerepa     1     0.178382     2.40
illness      1     0.150485     9.56
actdays      1     0.100575     17.19
hscore       1     0.031862     3.46
chcond1      1     0.061602     1.26
chcond2      1     0.135322     2.00
_Limit1      1     1.378706     9.35
_Limit2      1     2.317590     15.43
_Limit3      1     2.892994     18.64
_Limit4      1     3.090367     19.53
_Limit5      1     3.331566     20.31
_Limit6      1     3.466128     20.53
_Limit7      1     3.712493     20.65
_Limit8      1     4.168502     19.32
\[ y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases} \]
where $\epsilon_i \sim \text{iid } N(0, \sigma^2)$.
data subset;
   input Hours Yrs_Ed Yrs_Exp @@;
   if Hours eq 0 then Lower=.;
   else Lower=Hours;
datalines;
0 8 9 0 8 12 0 9 10 0 10 15 0 11 4 0 11 6
1000 12 1 1960 12 29 0 13 3 2100 13 36
3686 14 11 1920 14 38 0 15 14 1728 16 3
1568 16 19 1316 17 7 0 17 15
;

/*-- Tobit Model --*/
proc qlim data=subset;
   model hours = yrs_ed yrs_exp;
   endogenous hours ~ censored(lb=0);
run;
In the Parameter Estimates table there are four rows. The first three of these rows correspond to the vector estimate of the regression coefficients $\beta$. The last one is called _Sigma, which corresponds to the estimate of the error standard deviation $\sigma$.
/*-- Bivariate Probit --*/
proc qlim data=a method=qn;
   init y1.x1 2.8, y1.x2 2.1, _rho .1;
   model y1 = x1 x2;
   model y2 = x1 x2;
   endogenous y1 y2 ~ discrete;
run;
Example 21.5: Sample Selection Model with Truncation and Censoring ! 1423
Parameter          DF    Estimate     t Value
lwage.Intercept    1     -0.552716    -2.12
lwage.educ         1     0.108351     7.29
lwage.exper        1     0.042837     2.88
lwage.expersq      1     -0.000837    -2.01
_Sigma.lwage       1     0.663397     29.22
inlf.Intercept     1     0.266459     0.52
inlf.nwifeinc      1     -0.012132    -2.49
inlf.educ          1     0.131341     5.17
inlf.exper         1     0.123282     6.58
inlf.expersq       1     -0.001886    -3.14
inlf.age           1     -0.052829    -6.23
inlf.kidslt6       1     -0.867398    -7.31
inlf.kidsge6       1     0.035872     0.83
_Rho               1     0.026617     0.18
Note that the correlation estimate is insignificant. This indicates that selection bias is not a big problem in the estimation of the wage equation.
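data trunc;
   /* the opening lines through x2 are missing in the source; they are
      completed here by analogy with the CENS data step that follows */
   keep z y x1 x2;
   do i = 1 to 500;
      x1 = rannor( 19283 );
      x2 = rannor( 98721 );
      u1 = rannor( 76527 );
      u2 = rannor( 65721 );
      zl = 1 + 2 * x1 + 3 * x2 + u1;
      y = 3 + 4 * x1 - 2 * x2 + u1*.2 + u2;
      if ( zl > 0 ) then z = 1;
      else z = 0;
      if y>=0 then output;
   end;
run;

/*-- Sample Selection with Truncation --*/
proc qlim data=trunc method=qn;
   model z = x1 x2 / discrete;
   model y = x1 x2 / select(z=1) truncated(lb=0);
run;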
In the following statements the data are generated such that the selection variable is discrete and the variable Y is censored from below by zero. The results are shown in Output 21.5.2.
data cens;
   keep z y x1 x2;
   do i = 1 to 500;
      x1 = rannor( 19283 );
      x2 = rannor( 98721 );
      u1 = rannor( 76527 );
      u2 = rannor( 65721 );
      zl = 1 + 2 * x1 + 3 * x2 + u1;
      yl = 3 + 4 * x1 - 2 * x2 + u1*.2 + u2;
      if ( zl > 0 ) then z = 1;
      else z = 0;
      if ( yl > 0 ) then y = yl;
      else y = 0;
      output;
   end;
run;

/*-- Sample Selection with Censoring --*/
proc qlim data=cens method=qn;
   model z = x1 x2 / discrete;
   model y = x1 x2 / select(z=1) censored(lb=0);
run;
Type 2 Tobit
data a2;
   keep y1 y2 x1 x2;
   do i = 1 to 500;
      x1 = rannor( 19283 );
      x2 = rannor( 98721 );
      u1 = rannor( 76527 );
      u2 = rannor( 65721 );
      y1l = 1 + 2 * x1 + 3 * x2 + u1;
      y2l = 3 + 4 * x1 - 2 * x2 + u1*.2 + u2;
      if ( y1l > 0 ) then y1 = 1;
      else y1 = 0;
      if ( y1l > 0 ) then y2 = y2l;
      else y2 = 0;
      output;
   end;
run;

/*-- Type 2 Tobit --*/
proc qlim data=a2 method=qn;
   model y1 = x1 x2 / discrete;
   model y2 = x1 x2 / select(y1=1);
run;
Type 3 Tobit
data a3;
   keep y1 y2 x1 x2;
   do i = 1 to 500;
      x1 = rannor( 19283 );
      x2 = rannor( 98721 );
      u1 = rannor( 76527 );
      u2 = rannor( 65721 );
      y1l = 1 + 2 * x1 + 3 * x2 + u1;
      y2l = 3 + 4 * x1 - 2 * x2 + u1*.2 + u2;
      if ( y1l > 0 ) then y1 = y1l;
      else y1 = 0;
      if ( y1l > 0 ) then y2 = y2l;
      else y2 = 0;
      output;
   end;
run;

/*-- Type 3 Tobit --*/
proc qlim data=a3 method=qn;
   model y1 = x1 x2 / censored(lb=0);
   model y2 = x1 x2 / select(y1>0);
run;
Parameter       DF    Estimate   t Value
y2.Intercept     1    3.081206     38.46
y2.x1            1    3.998361     62.73
y2.x2            1   -2.088280    -28.66
_Sigma.y2        1    0.939799     24.07
y1.Intercept     1    0.981975     14.58
y1.x1            1    2.032675     34.24
y1.x2            1    2.976609     45.39
_Sigma.y1        1    0.969968     24.37
_Rho             1    0.226281      3.92
Type 4 Tobit
data a4;
   keep y1 y2 y3 x1 x2;
   do i = 1 to 500;
      x1 = rannor( 19283 );
      x2 = rannor( 98721 );
      u1 = rannor( 76527 );
      u2 = rannor( 65721 );
      u3 = rannor( 12019 );
      y1l = 1 + 2 * x1 + 3 * x2 + u1;
      y2l = 3 + 4 * x1 - 2 * x2 + u1*.2 + u2;
      y3l = 0 - 1 * x1 + 1 * x2 + u1*.1 - u2*.5 + u3*.5;
      if ( y1l > 0 ) then y1 = y1l;
      else y1 = 0;
      if ( y1l > 0 ) then y2 = y2l;
      else y2 = 0;
      if ( y1l <= 0 ) then y3 = y3l;
      else y3 = 0;
      output;
   end;
run;

/*-- Type 4 Tobit --*/
proc qlim data=a4 method=qn;
   model y1 = x1 x2 / censored(lb=0);
   model y2 = x1 x2 / select(y1>0);
   model y3 = x1 x2 / select(y1<=0);
run;
Parameter       DF    Estimate   t Value
y2.Intercept     1    2.894656     38.05
y2.x1            1    4.072704     64.98
y2.x2            1   -1.901163    -24.73
_Sigma.y2        1    0.981655     24.81
y3.Intercept     1    0.064594      0.36
y3.x1            1   -0.938384     -9.72
y3.x2            1    1.035798      8.41
_Sigma.y3        1    0.743124     19.43
y1.Intercept     1    0.987370     14.55
y1.x1            1    2.050408     33.71
y1.x2            1    2.982190     41.10
_Sigma.y1        1    1.032473     25.20
_Rho.y1.y2       1    0.291587      5.46
_Rho.y1.y3       1   -0.031665     -0.12
Type 5 Tobit
data a5;
   keep y1 y2 y3 x1 x2;
   do i = 1 to 500;
      x1 = rannor( 19283 );
      x2 = rannor( 98721 );
      u1 = rannor( 76527 );
      u2 = rannor( 65721 );
      u3 = rannor( 12019 );
      y1l = 1 + 2 * x1 + 3 * x2 + u1;
      y2l = 3 + 4 * x1 - 2 * x2 + u1*.2 + u2;
      y3l = 0 - 1 * x1 + 1 * x2 + u1*.1 - u2*.5 + u3*.5;
      if ( y1l > 0 ) then y1 = 1;
      else y1 = 0;
      if ( y1l > 0 ) then y2 = y2l;
      else y2 = 0;
      if ( y1l <= 0 ) then y3 = y3l;
      else y3 = 0;
      output;
   end;
run;

/*-- Type 5 Tobit --*/
proc qlim data=a5 method=qn;
   model y1 = x1 x2 / discrete;
   model y2 = x1 x2 / select(y1>0);
   model y3 = x1 x2 / select(y1<=0);
run;
Parameter       DF    Estimate   t Value
y2.Intercept     1    2.887523     30.33
y2.x1            1    4.078926     58.59
y2.x2            1   -1.898898    -21.93
_Sigma.y2        1    0.983059     24.58
y3.Intercept     1    0.071764      0.42
y3.x1            1   -0.935299    -10.07
y3.x2            1    1.039954      8.62
_Sigma.y3        1    0.743083     19.44
y1.Intercept     1    1.067578      7.48
y1.x1            1    2.068376      9.15
y1.x2            1    3.157385     10.03
_Rho.y1.y2       1    0.312369      1.76
_Rho.y1.y3       1   -0.018225     -0.08
The following statements estimate a stochastic frontier exponential production model that uses Christensen Associates data:
/*-- Stochastic Frontier Production Model --*/
proc qlim data=airlines method=congra;
   model LQ=LF LM LE LL LP;
   endogenous LQ ~ frontier (type=exponential production);
run;
Parameter Estimates

DF   Standard Error   Approx Pr > |t|
 1         0.024528            0.0005
 1         0.124178            0.3510
 1         0.078755            <.0001
 1         0.081893            <.0001
 1         0.089702            0.1283
 1         0.042776            0.0207
 1         0.010063            <.0001
 1         0.017603            0.0006
Similarly, the stochastic frontier production function can be estimated with the (type=half) or (type=truncated) option, which represent half-normal and truncated normal production or cost models.

In the next step, a stochastic frontier cost function is estimated. The data for the cost model are provided by Christensen and Greene (1976). The data describe costs and production inputs of 145 U.S. electricity producers in 1955. The model being estimated follows the nonhomogeneous version of the Cobb-Douglas cost function:

$$\log\left(\frac{\mathrm{Cost}}{\mathrm{FPrice}}\right) = \beta_0 + \beta_1 \log\left(\frac{\mathrm{KPrice}}{\mathrm{FPrice}}\right) + \beta_2 \log\left(\frac{\mathrm{LPrice}}{\mathrm{FPrice}}\right) + \beta_3 \log(\mathrm{Output}) + \beta_4 \frac{1}{2} \log(\mathrm{Output})^2 + \epsilon$$

All dollar values are normalized by fuel price. The quadratic log of the output is added to capture nonlinearities due to scale effects in cost functions. New variables, log_C_PF, log_PK_PF, log_PL_PF, log_y, and log_y_sq, are created to reflect the transformations. The following statements create the data set and transformed variables:
data electricity;
   input Firm Year Cost Output LPrice LShare KPrice KShare FPrice FShare;
   datalines;
1 1955 .0820 2.0 2.090 .3164 183.000 .4521 17.9000 .2315
2 1955 .6610 3.0 2.050 .2073 174.000 .6676 35.1000 .1251
3 1955 .9900 4.0 2.050 .2349 171.000 .5799 35.1000 .1852
4 1955 .3150 4.0 1.830 .1152 166.000 .7857 32.2000 .0990
   ... more lines ...
;

/* Data transformations */
data electricity;
   set electricity;
   label Firm="firm index"
         Year="1955 for all observations"
         Cost="Total cost"
         Output="Total output"
         LPrice="Wage rate"
         LShare="Cost share for labor"
         KPrice="Capital price index"
         KShare="Cost share for capital"
         FPrice="Fuel price"
         FShare="Cost share for fuel";
   log_C_PF=log(Cost/FPrice);
   log_PK_PF=log(KPrice/FPrice);
   log_PL_PF=log(LPrice/FPrice);
   log_y=log(Output);
   log_y_sq=log_y**2/2;
run;
The following statements estimate a stochastic frontier exponential cost model that uses Christensen and Greene (1976) data:
/*-- Stochastic Frontier Cost Model --*/
proc qlim data=electricity;
   model log_C_PF = log_PK_PF log_PL_PF log_y log_y_sq;
   endogenous log_C_PF ~ frontier (type=exponential cost);
run;
Similarly, the stochastic frontier cost model can be estimated with (type=half) or (type=truncated) options that represent half-normal and truncated normal errors. The following statements illustrate the half-normal option:
/*-- Stochastic Frontier Cost Model --*/
proc qlim data=electricity;
   model log_C_PF = log_PK_PF log_PL_PF log_y log_y_sq;
   endogenous log_C_PF ~ frontier (type=half cost);
run;
If neither the (Production) nor the (Cost) option is specified, the stochastic frontier production model is estimated by default.
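The truncated normal variant mentioned earlier follows the same pattern as the half-normal statements. The following is a minimal sketch, assuming the ELECTRICITY data set and the transformed variables created earlier in this example; only the TYPE= value differs.

```sas
/*-- Stochastic Frontier Cost Model, truncated normal (sketch) --*/
proc qlim data=electricity;
   model log_C_PF = log_PK_PF log_PL_PF log_y log_y_sq;
   /* type=truncated requests the truncated normal model;  */
   /* the COST keyword selects the cost model instead of   */
   /* the default production model                         */
   endogenous log_C_PF ~ frontier (type=truncated cost);
run;
```

Because the truncated normal model is the most general of the three, it may need more iterations to converge than the exponential or half-normal models.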
References
Abramowitz, M. and Stegun, A. (1970), Handbook of Mathematical Functions, New York: Dover Press.

Aigner, C., Lovell, C. A. K., and Schmidt, P. (1977), "Formulation and Estimation of Stochastic Frontier Production Function Models," Journal of Econometrics, 6:1 (July), 21-37.

Aitchison, J. and Silvey, S. (1957), "The Generalization of Probit Analysis to the Case of Multiple Responses," Biometrika, 44, 131-140.

Amemiya, T. (1978a), "The Estimation of a Simultaneous Equation Generalized Probit Model," Econometrica, 46, 1193-1205.

Amemiya, T. (1978b), "On a Two-Step Estimate of a Multivariate Logit Model," Journal of Econometrics, 8, 13-21.

Amemiya, T. (1981), "Qualitative Response Models: A Survey," Journal of Economic Literature, 19, 483-536.

Amemiya, T. (1984), "Tobit Models: A Survey," Journal of Econometrics, 24, 3-61.

Amemiya, T. (1985), Advanced Econometrics, Cambridge: Harvard University Press.

Ben-Akiva, M. and Lerman, S. R. (1987), Discrete Choice Analysis, Cambridge: MIT Press.

Bera, A. K., Jarque, C. M., and Lee, L.-F. (1984), "Testing the Normality Assumption in Limited Dependent Variable Models," International Economic Review, 25, 563-578.

Bloom, D. E. and Killingsworth, M. R. (1985), "Correcting for Truncation Bias Caused by a Latent Truncation Variable," Journal of Econometrics, 27, 131-135.

Box, G. E. P. and Cox, D. R. (1964), "An Analysis of Transformations," Journal of the Royal Statistical Society, Series B, 26, 211-252.

Cameron, A. C. and Trivedi, P. K. (1986), "Econometric Models Based on Count Data: Comparisons and Applications of Some Estimators," Journal of Applied Econometrics, 1, 29-53.

Cameron, A. C. and Trivedi, P. K. (1998), Regression Analysis of Count Data, Cambridge: Cambridge University Press.

Christensen, L. and Greene, W. (1976), "Economies of Scale in U.S. Electric Power Generation," Journal of Political Economy, 84, 655-676.

Coelli, T. J., Prasada Rao, D. S., and Battese, G. E. (1998), An Introduction to Efficiency and Productivity Analysis, London: Kluwer Academic Publisher.

Copley, P. A., Doucet, M. S., and Gaver, K. M. (1994), "A Simultaneous Equations Analysis of Quality Control Review Outcomes and Engagement Fees for Audits of Recipients of Federal Financial Assistance," The Accounting Review, 69, 244-256.

Cox, D. R. (1970), Analysis of Binary Data, London: Methuen.

Cox, D. R. (1972), "Regression Models and Life Tables," Journal of the Royal Statistical Society, Series B, 20, 187-220.

Cox, D. R. (1975), "Partial Likelihood," Biometrika, 62, 269-276.

Deis, D. R. and Hill, R. C. (1998), "An Application of the Bootstrap Method to the Simultaneous Equations Model of the Demand and Supply of Audit Services," Contemporary Accounting Research, 15, 83-99.

Estrella, A. (1998), "A New Measure of Fit for Equations with Dichotomous Dependent Variables," Journal of Business and Economic Statistics, 16, 198-205.

Gallant, A. R. (1987), Nonlinear Statistical Models, New York: Wiley.

Genz, A. (1992), "Numerical Computation of Multivariate Normal Probabilities," Journal of Computational and Graphical Statistics, 1, 141-150.

Godfrey, L. G. (1988), Misspecification Tests in Econometrics, Cambridge: Cambridge University Press.

Gourieroux, C., Monfort, A., Renault, E., and Trognon, A. (1987), "Generalized Residuals," Journal of Econometrics, 34, 5-32.

Greene, W. H. (1997), Econometric Analysis, Upper Saddle River, N.J.: Prentice Hall.

Gregory, A. W. and Veall, M. R. (1985), "On Formulating Wald Tests for Nonlinear Restrictions," Econometrica, 53, 1465-1468.

Hajivassiliou, V. A. (1993), "Simulation Estimation Methods for Limited Dependent Variable Models," in Handbook of Statistics, Vol. 11, ed. G. S. Maddala, C. R. Rao, and H. D. Vinod, New York: Elsevier Science Publishing.

Hajivassiliou, V. A. and McFadden, D. (1998), "The Method of Simulated Scores for the Estimation of LDV Models," Econometrica, 66, 863-896.

Heckman, J. J. (1978), "Dummy Endogenous Variables in a Simultaneous Equation System," Econometrica, 46, 931-959.

Hinkley, D. V. (1975), "On Power Transformations to Symmetry," Biometrika, 62, 101-111.

Kim, M. and Hill, R. C. (1993), "The Box-Cox Transformation-of-Variables in Regression," Empirical Economics, 18, 307-319.

King, G. (1989b), Unifying Political Methodology: The Likelihood Theory and Statistical Inference, Cambridge: Cambridge University Press.

Kumbhakar, S. C. and Knox Lovell, C. A. (2000), Stochastic Frontier Analysis, New York: Cambridge University Press.

Lee, L.-F. (1981), "Simultaneous Equations Models with Discrete and Censored Dependent Variables," in Structural Analysis of Discrete Data with Econometric Applications, ed. C. F. Manski and D. McFadden, Cambridge: MIT Press.

Long, J. S. (1997), Regression Models for Categorical and Limited Dependent Variables, Thousand Oaks, CA: Sage Publications.

McFadden, D. (1974), "Conditional Logit Analysis of Qualitative Choice Behavior," in Frontiers in Econometrics, ed. P. Zarembka, New York: Academic Press.

McFadden, D. (1981), "Econometric Models of Probabilistic Choice," in Structural Analysis of Discrete Data with Econometric Applications, ed. C. F. Manski and D. McFadden, Cambridge: MIT Press.

McKelvey, R. D. and Zavoina, W. (1975), "A Statistical Model for the Analysis of Ordinal Level Dependent Variables," Journal of Mathematical Sociology, 4, 103-120.

Meeusen, W. and van Den Broeck, J. (1977), "Efficiency Estimation from Cobb-Douglas Production Functions with Composed Error," International Economic Review, 18:2 (Jun), 435-444.

Mroz, T. A. (1987), "The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions," Econometrica, 55, 765-799.

Mroz, T. A. (1999), "Discrete Factor Approximations in Simultaneous Equation Models: Estimating the Impact of a Dummy Endogenous Variable on a Continuous Outcome," Journal of Econometrics, 92, 233-274.

Nawata, K. (1994), "Estimation of Sample Selection Bias Models by the Maximum Likelihood Estimator and Heckman's Two-Step Estimator," Economics Letters, 45, 33-40.

Parks, R. W. (1967), "Efficient Estimation of a System of Regression Equations When Disturbances Are Both Serially and Contemporaneously Correlated," Journal of the American Statistical Association, 62, 500-509.

Phillips, C. B. and Park, J. Y. (1988), "On Formulating Wald Tests of Nonlinear Restrictions," Econometrica, 56, 1065-1083.

Powers, D. A. and Xie, Y. (2000), Statistical Methods for Categorical Data Analysis, San Diego: Academic Press.

Wooldridge, J. M. (2002), Econometric Analysis of Cross Section and Panel Data, Cambridge, MA: MIT Press.
Chapter 22
ODS Graphics
Examples: SIMILARITY Procedure
   Example 22.1: Accumulating Transactional Data into Time Series Data
   Example 22.2: Similarity Analysis
   Example 22.3: Sliding Similarity Analysis
   Example 22.4: Searching for Historical Analogies
References
After optionally transforming each time series, the accumulated and transformed time series can be stored in an output data set (OUT= data set).

After optional accumulation and transformation, each of these time series is a working series, which can now be analyzed as a sequence of numeric data. Each of these sequences can be a target sequence, an input sequence, or both a target and an input sequence. Throughout the remainder of this chapter, the term original sequence applies to both the original input and target sequences. The term working sequence applies to a version of both the original input and target sequences under investigation.

Each original sequence can be normalized prior to similarity analysis. Normalizations are useful when you want to compare the shape or profile of the time series. Normalizations performed by the SIMILARITY procedure include the following:

   standard (STANDARD)
   absolute (ABSOLUTE)
   user-defined normalizations

After each original sequence is optionally normalized, each working input sequence can be scaled to the target sequence prior to similarity analysis. Scaling is useful when you want to compare the input sequence to the target sequence while discounting the variation of the target sequence. Input sequence scaling performed by the SIMILARITY procedure includes the following:

   standard (STANDARD)
   absolute (ABSOLUTE)
   user-defined scaling

Once the working input sequence is optionally scaled to the target sequence, similarity measures can be computed. Similarity measures computed by the SIMILARITY procedure include the following:

   squared deviation (SQRDEV)
   absolute deviation (ABSDEV)
   mean square deviation (MSQRDEV)
   mean absolute deviation (MABSDEV)
   user-defined similarity measures

Computing the similarity measure between two time series requires tasks for transforming time series, normalizing sequences, scaling sequences, and computing metrics or measures. The SIMILARITY procedure provides built-in routines to perform these tasks. The SIMILARITY procedure also provides a facility that allows you to extend the procedure with user-defined routines.
All results of the similarity analysis can be stored in output data sets, printed, or graphed using the Output Delivery System (ODS). The SIMILARITY procedure can process large amounts of time-stamped transactional data, time series, or sequential data. Therefore, the analysis results are useful for large-scale time series analysis, analogous time series forecasting, new product forecasting, or time series (temporal) data mining. In SAS/ETS, the EXPAND procedure can be used for frequency conversion and transformations of time series. The TIMESERIES procedure can be used for large-scale time series analysis. In SAS/STAT, the DISTANCE procedure can be used to compute various measures of distance, dissimilarity, or similarity between observations (rows) of a SAS data set.
The SIMILARITY procedure forms time series from the input time-stamped transactional data. It can provide results in output data sets or in other output formats using the Output Delivery System (ODS). The examples in this section are more fully illustrated in the section "Examples: SIMILARITY Procedure."

Time-stamped transactional data are often recorded at no fixed interval. Analysts often want to use time series analysis techniques that require fixed-time intervals. Therefore, the transactional data must be accumulated to form a fixed-interval time series.

Suppose that a bank wants to analyze the transactions associated with each of its customers over time. Further, suppose that the data set WORK.TRANSACTIONS contains three variables related to the customer transactions (CUSTOMER, DATE, and WITHDRAWALS) and one variable that contains an example of fraudulent behavior (FRAUD).
The following statements illustrate how to use the SIMILARITY procedure to accumulate time-stamped transactional data to form a daily time series based on the accumulated daily totals of each variable (WITHDRAWALS and FRAUD).
proc similarity data=transactions out=timedata;
   by customer;
   id date interval=day accumulate=total;
   input withdrawals;
   target fraud;
run;
The OUT=TIMEDATA option specifies that the resulting time series data for each customer are to be stored in the data set WORK.TIMEDATA. The INTERVAL=DAY option specifies that the transactions are to be accumulated on a daily basis. The ACCUMULATE=TOTAL option specifies that the sum of the transactions is to be accumulated.

After the transactional data are accumulated into a time series format, the time series data can be normalized so that the shape or profile is analyzed. For example, the following statements build on the previous statements and demonstrate normalization of the accumulated time series.
proc similarity data=transactions out=timedata;
   by customer;
   id date interval=day accumulate=total;
   input withdrawals / NORMALIZE=STANDARD;
   target fraud / NORMALIZE=STANDARD;
run;
The NORMALIZE=STANDARD option specifies that each accumulated time series observation is normalized by subtracting the mean and then dividing by the standard deviation of the accumulated time series. The WORK.TIMEDATA data set now contains the accumulated and normalized time series data for each customer.

After the transactional data are accumulated into a time series format and normalized to a mean of zero and a standard deviation of one, similarity analysis can be performed on the accumulated and normalized time series. For example, the following statements build on the previous statements and demonstrate similarity analysis of the accumulated and normalized time series.
proc similarity data=transactions out=timedata OUTSUM=SUMMARY;
   by customer;
   id date interval=day accumulate=total;
   input withdrawals / normalize=standard;
   target fraud / normalize=standard SIMILARITY=MABSDEV;
run;
The SIMILARITY=MABSDEV option specifies that the accumulated and normalized time series data associated with the variables WITHDRAWALS and FRAUD are to be compared by using mean absolute deviation. The OUTSUM=SUMMARY option specifies that the similarity analysis summary for each customer is to be stored in the data set WORK.SUMMARY.
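To experiment with the statements in this section, a small input data set is needed. The following sketch builds a hypothetical WORK.TRANSACTIONS data set (all values are invented for illustration) and uses the MEANS procedure to compute daily totals, as a rough cross-check of what ACCUMULATE=TOTAL with INTERVAL=DAY is described as doing. The data set name DAILY_TOTALS is an assumption of this sketch.

```sas
/* Hypothetical sample of WORK.TRANSACTIONS; values are invented */
data transactions;
   input customer $ date :date9. withdrawals fraud;
   format date date9.;
   datalines;
A 01JAN2008 100   0
A 01JAN2008  50   0
A 02JAN2008  20   0
A 04JAN2008 400 500
B 01JAN2008  10   0
B 03JAN2008  30   0
B 03JAN2008 300 500
;
run;

/* Cross-check of daily accumulation: sum each variable by       */
/* customer and day. Unlike the time series formed by PROC       */
/* SIMILARITY, this output has no rows for days on which a       */
/* customer has no transactions.                                 */
proc means data=transactions noprint nway;
   class customer date;
   var withdrawals fraud;
   output out=daily_totals(drop=_type_ _freq_) sum=;
run;

proc print data=daily_totals;
run;
```

Comparing DAILY_TOTALS with the OUT= data set from the SIMILARITY procedure should show matching totals on days that have transactions; the days without transactions appear only in the accumulated time series.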
Functional Summary
The statements and options that control the SIMILARITY procedure are summarized in the following table.
Table 22.1 SIMILARITY Functional Summary

Description                                          Statement           Option

Statements
specify BY-group processing                          BY
specify the time ID variable                         ID
specify the FCMP options                             FCMPOPT
specify input variables to analyze                   INPUT
specify target variables to analyze                  TARGET

Data Set Options
specify the input data set                           PROC SIMILARITY     DATA=
specify the time series output data set              PROC SIMILARITY     OUT=
specify measure summary output data set              PROC SIMILARITY     OUTMEASURE=
specify path output data set                         PROC SIMILARITY     OUTPATH=
specify sequence output data set                     PROC SIMILARITY     OUTSEQUENCE=
specify summary output data set                      PROC SIMILARITY     OUTSUM=

User-Defined Functions and Subroutine Options
specify FCMP quiet mode                              FCMPOPT             QUIET=
specify FCMP trace mode                              FCMPOPT             TRACE=

Accumulation and Seasonality Options
specify accumulation frequency                       ID                  INTERVAL=
specify length of seasonal cycle                     PROC SIMILARITY     SEASONALITY=
specify interval alignment                           ID                  ALIGN=
specify time ID variable values are not sorted       ID                  NOTSORTED
specify starting time ID value                       ID                  START=
specify ending time ID value                         ID                  END=
specify accumulation statistic                       ID, INPUT, TARGET   ACCUMULATE=
specify missing value interpretation                 ID, INPUT, TARGET   SETMISSING=
specify zero value interpretation                    ID, INPUT, TARGET
specify missing value trimming                       INPUT, TARGET

Time Series Transformation Options
specify simple differencing                          INPUT, TARGET       DIF=
specify seasonal differencing                        INPUT, TARGET       SDIF=
specify transformation                               INPUT, TARGET       TRANSFORM=

Input Sequence Options
specify normalization                                INPUT               NORMALIZE=
specify scaling                                      INPUT               SCALE=

Target Sequence Options
specify normalization                                TARGET              NORMALIZE=

Similarity Measure Options
specify compression limits
specify expansion limits
specify similarity measure
specify similarity measure and path
specify sequence slide

Printing and Graphical Control Options
specify time ID format                               ID                  FORMAT=
specify printed output                               PROC SIMILARITY     PRINT=
specify detailed printed output                      PROC SIMILARITY     PRINTDETAILS
specify graphical output                             PROC SIMILARITY     PLOTS=

Miscellaneous Options
specify that analysis variables are processed
   in ascending order                                PROC SIMILARITY     SORTNAMES
specify the ordering of the processing of the
   input and target variables                        PROC SIMILARITY     ORDER=
DATA=SAS-data-set
names the SAS data set that contains the time series, transactional, or sequence input data for the procedure. If the DATA= option is not specified, the most recently created SAS data set is used.
ORDER=
specifies the order in which the variables listed in the INPUT and TARGET statements are processed. This ordering affects the OUTSEQUENCE=, OUTPATH=, OUTMEASURE=, and OUTSUM= data sets, as well as the printed and graphical output. The SORTNAMES option also affects the ordering of the analysis.
OUT=SAS-data-set
names the output data set to contain the time series variables specified in the subsequent INPUT and TARGET statements. If an ID variable is specified, it is also included in the OUT= data set. The values are accumulated based on the ID statement INTERVAL= option or the ACCUMULATE= options or both. The values are transformed based on the INPUT or TARGET statement TRANSFORM=, DIF=, and/or SDIF= options, in this order. The OUT= data set is particularly useful when you want to further analyze, model, or forecast the resulting time series with other SAS/ETS procedures.
OUTMEASURE=SAS-data-set
names the output data set to contain the detailed similarity measures by time ID value. The form of the OUTMEASURE= data set is determined by the PROC SIMILARITY statement SORTNAMES and ORDER= options.
OUTPATH=SAS-data-set
names the output data set to contain the path used to compute the similarity measures for each slide and warp. The form of the OUTPATH= data set is determined by the PROC SIMILARITY statement SORTNAMES and ORDER= options.
OUTSEQUENCE=SAS-data-set
names the output data set to contain the sequences used to compute the similarity measures for each slide and warp. The form of the OUTSEQUENCE= data set is determined by the PROC SIMILARITY statement SORTNAMES and ORDER= options.
OUTSUM=SAS-data-set
names the output data set to contain the similarity measure summary. The OUTSUM= data set is particularly useful when analyzing large numbers of series and only the summary of the results is needed. The form of the OUTSUM= data set is determined by the PROC SIMILARITY statement SORTNAMES and ORDER= options.

The following values are available for the ORDER= option:

INPUT
specifies that each INPUT variable is processed and then the TARGET variables are processed. The results are stored and printed based only on the INPUT variables.

INPUTTARGET
specifies that each INPUT variable is processed and then the TARGET variables are processed. The results are stored and printed based on both the INPUT and TARGET variables. This is the default.

TARGET
specifies that each TARGET variable is processed and then the INPUT variables are processed. The results are stored and printed based only on the TARGET variables.

TARGETINPUT
specifies that each TARGET variable is processed and then the INPUT variables are processed. The results are stored and printed based on both the TARGET and INPUT variables.
PLOTS=option | (options)
specifies the graphical output desired. The options are separated by spaces. By default, the SIMILARITY procedure produces no graphical output. The following graphical options are available:

   ALL          same as PLOTS=(INPUTS TARGETS SEQUENCES NORMALIZED SCALED DISTANCES PATHS MAPS WARPS COSTS MEASURES).
   COSTS        plots time warp costs graphics.
   DISTANCES    plots similarity absolute and relative distances graphics. (OUTPATH= data set)
   INPUTS       plots input variable time series graphics. (OUT= data set)
   MAPS         plots time warp maps graphics. (OUTPATH= data set)
   MEASURES     plots similarity measure graphics. (OUTMEASURE= data set)
   NORMALIZED   plots both the input and target variable normalized sequence graphics. These plots are displayed only when the INPUT or TARGET statement NORMALIZE= option is specified.
   PATHS        plots time warp paths graphics. (OUTPATH= data set)
   SCALED       plots the input variable scaled sequence graphics. These plots are displayed only when the INPUT statement SCALE= option is specified.
   SEQUENCES    plots both the input and target variable sequence graphics. (OUTSEQUENCE= data set)
   TARGETS      plots target variable time series graphics. (OUT= data set)
   WARPS        plots time warps graphics. (OUTPATH= data set)
PRINT=option | (options)
specifies the printed output desired. The options are separated by spaces. By default, the SIMILARITY procedure produces no printed output. The following printing options are available:

   DESCSTATS    prints the descriptive statistics for the working time series.
   PATHS        prints the path statistics table.
   COSTS        prints the cost statistics table.
   WARPS        prints the warp summary table.
   SLIDES       prints the slides summary table.
   SUMMARY      prints the similarity measure summary table.
   ALL          same as PRINT=(DESCSTATS PATHS COSTS WARPS SLIDES SUMMARY).
PRINTDETAILS
specifies that output requested with the PRINT= option be printed in greater detail.
SEASONALITY=integer
specifies the length of the seasonal cycle, where integer ranges from one to 10,000. For example, SEASONALITY=3 means that every group of three time periods forms a seasonal cycle. By default, the length of the seasonal cycle is one (no seasonality) or the length implied by the INTERVAL= option specified in the ID statement. For example, INTERVAL=MONTH implies that the length of the seasonal cycle is twelve.
SORTNAMES
specifies that the variables specified in the INPUT and TARGET statements are processed in order sorted by the variable names. By default, the SIMILARITY procedure processes the variables in the order in which they are listed. The ORDER= option also affects the ordering of the analysis.
BY Statement
BY variables ;

A BY statement can be used with PROC SIMILARITY to obtain separate analyses for groups of observations defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:

   Sort the data by using the SORT procedure with a similar BY statement.

   Specify the option NOTSORTED or DESCENDING in the BY statement for the SIMILARITY procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.

   Create an index on the BY variables by using the DATASETS procedure.

For more information about the BY statement, see SAS Language Reference: Concepts. For more information about the DATASETS procedure, see the discussion in the Base SAS Procedures Guide.
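The first alternative can be sketched as follows. This is a minimal illustration that assumes the WORK.TRANSACTIONS data set and the CUSTOMER, DATE, WITHDRAWALS, and FRAUD variables from the getting-started example; it is not taken from the original examples.

```sas
/* Sort by the BY variable before BY-group processing */
proc sort data=transactions;
   by customer;
run;

/* Each customer is now analyzed as a separate BY group */
proc similarity data=transactions out=timedata;
   by customer;
   id date interval=day accumulate=total;
   input withdrawals;
   target fraud;
run;
```

If the observations are already grouped by customer but the groups are not in ascending order, the sort step can be skipped and the BY statement written as `by customer notsorted;` instead.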
FCMPOPT Statement
FCMPOPT options ;
The FCMPOPT statement specifies the options related to user-defined functions and subroutines. The following options can be used with the FCMPOPT statement:
QUIET=ON | OFF
specifies whether the nonfatal errors and warnings generated by the user-defined SAS language functions and subroutines are printed to the log. Nonfatal errors are usually associated with operations with missing values. The default is QUIET=ON.
TRACE=ON | OFF
specifies whether the tracings of the user-defined SAS language functions and subroutines are printed to the log. Tracings are the results of every operation executed. This option is generally used for debugging. The default is TRACE=OFF.
ID Statement
ID variable INTERVAL= interval options ;
The ID statement names a numeric variable that identifies observations in the input and output data sets. The ID variable's values are assumed to be SAS date, time, or datetime values. In addition, the ID statement specifies the (desired) frequency associated with the time series. The ID statement options also specify how the observations are accumulated and how the time ID values are aligned to form the time series. The options specified affect all variables listed in subsequent INPUT and TARGET statements. If an ID statement is specified, the INTERVAL= option must also be specified. The other ID statement options are optional. If an ID statement is not specified, the observation number, with respect to the BY group, is used as the time ID.

The following options can be used with the ID statement.
ACCUMULATE=option
specifies how the data set observations are accumulated within each time period. The frequency (width of each time interval) is specified by the INTERVAL= option. The ID variable contains the time ID values. Each time ID variable value corresponds to a specific time period. The accumulated values form the time series, which is used in subsequent analysis.

The ACCUMULATE= option is particularly useful when there are zero or more than one input observations coinciding with a particular time period (for example, time-stamped transactional data). The EXPAND procedure offers additional frequency conversions and transformations that can also be useful in creating a time series.

The following options determine how the observations are accumulated within each time period based on the ID variable and the frequency specified by the INTERVAL= option:
   NONE             No accumulation occurs; the ID variable values must be equally spaced with respect to the frequency. This is the default option.
   TOTAL            Observations are accumulated based on the total sum of their values.
   AVERAGE | AVG    Observations are accumulated based on the average of their values.
   MINIMUM | MIN    Observations are accumulated based on the minimum of their values.
   MEDIAN | MED     Observations are accumulated based on the median of their values.
   MAXIMUM | MAX    Observations are accumulated based on the maximum of their values.
   N                Observations are accumulated based on the number of nonmissing observations.
   NMISS            Observations are accumulated based on the number of missing observations.
   NOBS             Observations are accumulated based on the number of observations.
   FIRST            Observations are accumulated based on the first of their values.
   LAST             Observations are accumulated based on the last of their values.
   STDDEV | STD     Observations are accumulated based on the standard deviation of their values.
   CSS              Observations are accumulated based on the corrected sum of squares of their values.
   USS              Observations are accumulated based on the uncorrected sum of squares of their values.
If the ACCUMULATE= option is specified, the SETMISSING= option is useful for specifying how accumulated missing values are treated. If missing values should be interpreted as zero, then SETMISSING=0 should be used. The "Details" section describes accumulation in greater detail.
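The combination of ACCUMULATE= and SETMISSING= described above can be sketched as follows. The data set and variable names are borrowed from the getting-started example and are assumptions here; the point is only to show where the two options go in the ID statement.

```sas
proc similarity data=transactions out=timedata;
   by customer;
   /* Sum all transactions within each day; days that have no   */
   /* transactions would otherwise accumulate to missing, so    */
   /* SETMISSING=0 records them as zero activity                */
   id date interval=day accumulate=total setmissing=0;
   input withdrawals;
   target fraud;
run;
```

Setting missing values to zero is a modeling choice: it suits transactional data where no record means no activity, but it would distort a series in which a missing value means the value is unknown.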
ALIGN=option
controls the alignment of SAS dates that are used to identify output observations. The ALIGN= option accepts the following values: BEGINNING|BEG|B, MIDDLE|MID|M, and ENDING|END|E. ALIGN=BEGINNING is the default.
END=option
specifies a SAS date, datetime, or time value that represents the end of the data. If the last time ID variable value is less than the END= value, the series is extended with missing values. If the last time ID variable value is greater than the END= value, the series is truncated. For example, END="&sysdate"D uses the automatic macro variable SYSDATE to extend or truncate the series to the current date. The START= and END= options can be used to ensure that the data associated with each BY group contain the same number of observations.
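As a minimal sketch of the last point, the following statements give every BY group the same time span by fixing both ends of the series. The dates are arbitrary illustrations, and the data set and variable names are assumed from the getting-started example.

```sas
proc similarity data=transactions out=timedata;
   by customer;
   /* Every customer's series runs from 01JAN2008 through       */
   /* 31DEC2008: shorter series are extended with missing       */
   /* values and longer ones are truncated                      */
   id date interval=day accumulate=total
      start='01JAN2008'd end='31DEC2008'd;
   input withdrawals;
   target fraud;
run;
```

Replacing the END= value with `"&sysdate"d` would instead extend or truncate each series to the date on which the SAS session started.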
ID Statement ! 1453
FORMAT=format
specifies the SAS format for the time ID values. If the FORMAT= option is not specified, the default format is implied by the INTERVAL= option. For example, FORMAT=DATE9. specifies that the DATE9. SAS format is to be used. Notice that the terminating period (.) is required when specifying a SAS format.
INTERVAL= interval
specifies the frequency of the accumulated time series. For example, if the input data set consists of quarterly observations, then INTERVAL=QTR should be used. If the SEASONALITY= option is not specified, the length of the seasonal cycle is implied from the INTERVAL= option. For example, INTERVAL=QTR implies a seasonal cycle of length 4. If the ACCUMULATE= option is also specified, the INTERVAL= option determines the time periods for the accumulation of observations.
NOTSORTED
specifies that the time ID values are not in sorted order. The SIMILARITY procedure sorts the data with respect to the time ID prior to analysis if the NOTSORTED option is specified.
SETMISSING=option | number
specifies how missing values (either actual or accumulated) are interpreted in the accumulated time series. If a number is specified, missing values are set to that number. If a missing value indicates an unknown value, the SETMISSING= option should not be used. If a missing value indicates no value, then SETMISSING=0 should be used. You typically use SETMISSING=0 for transactional data, because no recorded data usually implies no activity. The following options can also be used to determine how missing values are assigned:

MISSING           Missing values are set to missing. This is the default option.
AVERAGE | AVG     Missing values are set to the accumulated average value.
MINIMUM | MIN     Missing values are set to the accumulated minimum value.
MEDIAN | MED      Missing values are set to the accumulated median value.
MAXIMUM | MAX     Missing values are set to the accumulated maximum value.
FIRST             Missing values are set to the accumulated first nonmissing value.
LAST              Missing values are set to the accumulated last nonmissing value.
PREVIOUS | PREV   Missing values are set to the previous period's accumulated nonmissing value. Missing values at the beginning of the accumulated series remain missing.
NEXT              Missing values are set to the next period's accumulated nonmissing value. Missing values at the end of the accumulated series remain missing.
START= option
specifies a SAS date, datetime, or time value that represents the beginning of the data. If the first time ID variable value is greater than the START= value, the series is prepended with missing values. If the first time ID variable value is less than the START= value, the series is truncated. The START= and END= options can be used to ensure that data associated with each BY group contains the same number of observations.
ZEROMISS=option
specifies how beginning and/or ending zero values (either actual or accumulated) are interpreted in the accumulated time series. The following options determine how beginning and/or ending zero values are assigned:

NONE    Beginning and/or ending zeros are unchanged. This is the default.
LEFT    Beginning zeros are set to missing.
RIGHT   Ending zeros are set to missing.
BOTH    Both beginning and ending zeros are set to missing.
If the accumulated series is all missing and/or zero, the series is not changed.
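The ID statement options described above can be combined as in the following sketch. It is an illustration only: the data set WORK.SALES, the BY variable REGION, the ID variable SALEDATE, the analysis variables STORE_A and STORE_B, and the date literals are all hypothetical names and values chosen for the example.

```sas
/* Hypothetical transactional data set WORK.SALES with variables   */
/* REGION, SALEDATE, STORE_A, and STORE_B (all names are assumed). */
proc sort data=work.sales;
   by region saledate;
run;

proc similarity data=work.sales outsum=work.simsummary;
   by region;
   /* Accumulate transactions into monthly totals, treat months    */
   /* with no transactions as zero activity, and force every BY    */
   /* group to span the same twelve months.                        */
   id saledate interval=month
               accumulate=total
               setmissing=0
               start='01JAN1999'd
               end='31DEC1999'd;
   input  store_a;
   target store_b / measure=absdev;
run;
```

Because START= and END= are the same for every BY group, each group contributes a series of equal length, which makes the resulting similarity measures easier to compare across groups.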
INPUT Statement
INPUT variable-list < / options > ;
The INPUT statement lists the input numeric variables in the DATA= data set whose values are to be accumulated to form the time series or represent ordered numeric sequences (when no ID statement is specified). An input data set variable can be specified in only one INPUT or TARGET statement. Any number of INPUT statements can be used. The following options can be used with an INPUT statement.
ACCUMULATE=option
specifies how the data set observations are accumulated within each time period for the variables listed in the INPUT statement. If the ACCUMULATE= option is not specified in the INPUT statement, accumulation is determined by the ACCUMULATE= option of the ID statement. If the ACCUMULATE= option is not specified in the ID statement or the INPUT statement, no accumulation is performed. See the ID statement ACCUMULATE= option for more details.
DIF=(numlist)
specifies the differencing to be applied to the accumulated time series. The list of differencing orders must be separated by spaces or commas. For example, DIF=(1,3) specifies first, then third, order differencing. Differencing is applied after time series transformation. The TRANSFORM= option is applied before the DIF= option. Simple differencing is useful when you want to detrend the time series before computing the similarity measures.
NORMALIZE=option
specifies the sequence normalization to be applied to the working input sequence. The following normalization options are provided:

NONE           No normalization is applied. This option is the default.
ABSOLUTE       Absolute normalization is applied.
STANDARD       Standard normalization is applied.
User-Defined   Normalization is computed by a user-defined subroutine created using the FCMP procedure, where User-Defined is the subroutine name.

Normalization is applied to the working input sequence, which can be a subset of the working input time series if the SLIDE=INDEX or SLIDE=SEASON option is specified.
SCALE=option
specifies the scaling of the working input sequence with respect to the working target sequence. Scaling is performed after normalization. The following scaling options are provided:

NONE           No scaling is applied. This option is the default.
ABSOLUTE       Absolute scaling is applied.
STANDARD       Standard scaling is applied.
User-Defined   Scaling is computed by a user-defined subroutine created using the FCMP procedure, where User-Defined is the subroutine name.

Scaling is applied to the working input sequence, which can be a subset of the working input time series if the SLIDE=INDEX or SLIDE=SEASON option is specified.
SDIF=(numlist)
specifies the seasonal differencing to be applied to the accumulated time series. The list of seasonal differencing orders must be separated by spaces or commas. For example, SDIF=(1,3) specifies first, then third, order seasonal differencing. Differencing is applied after time series transformation. The TRANSFORM= option is applied before the SDIF= option. Seasonal differencing is useful when you want to deseasonalize the time series before computing the similarity measures.
SETMISSING=option | number
SETMISS=option | number
specifies how missing values (either actual or accumulated) are interpreted in the accumulated time series or ordered sequence for variables listed in the INPUT statement. If the SETMISSING= option is not specified in the INPUT statement, missing values are set based on the SETMISSING= option of the ID statement. If the SETMISSING= option is not specified in the ID statement or the INPUT statement, no missing value interpretation is performed. See the ID statement SETMISSING= option for more details.
TRANSFORM= option
specifies the time series transformation to be applied to the accumulated time series. The following transformations are provided:

NONE             No transformation is applied. This option is the default.
LOG              Logarithmic transformation is applied.
SQRT             Square-root transformation is applied.
LOGISTIC         Logistic transformation is applied.
BOXCOX(number)   Box-Cox transformation with parameter number is applied, where number is a real number between -5 and 5.
User-Defined     Transformation is computed by a user-defined subroutine created using the FCMP procedure, where User-Defined is the subroutine name.

When the TRANSFORM= option is specified, the time series must be strictly positive unless a user-defined function is used.
TRIMMISSING=option
TRIMMISS=option
specifies how missing values (either actual or accumulated) are trimmed from the accumulated time series or ordered sequence for variables listed in the INPUT statement. The following trimming options are provided:

NONE    No missing value trimming is applied.
LEFT    Beginning missing values are trimmed.
RIGHT   Ending missing values are trimmed.
BOTH    Both beginning and ending missing values are trimmed. This is the default.

ZEROMISS=option
specifies how beginning and/or ending zero values (either actual or accumulated) are interpreted in the accumulated time series or ordered sequence for variables listed in the INPUT statement. If the ZEROMISS= option is not specified in the INPUT statement, beginning and/or ending zero values are set based on the ZEROMISS= option of the ID statement. If the ZEROMISS= option is not specified in the ID statement or the INPUT statement, no zero value interpretation is performed. See the ID statement ZEROMISS= option for more details.
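The following sketch shows how several INPUT statement options might be combined. The data set WORK.SALES and the variables SALEDATE, STORE_A, and STORE_B are hypothetical; the option values are arbitrary choices for illustration, not recommendations.

```sas
/* WORK.SALES, SALEDATE, STORE_A, and STORE_B are assumed names.   */
proc similarity data=work.sales outsum=work.simsummary;
   id saledate interval=month accumulate=total;

   /* For the input series: the log transformation is applied      */
   /* first, then simple and seasonal differencing, then leading   */
   /* and trailing missing values are trimmed, and finally the     */
   /* working sequence is normalized.                              */
   input store_a / transform=log
                   dif=(1)
                   sdif=(1)
                   trimmissing=both
                   normalize=standard;

   /* The same transformation and differencing are requested for   */
   /* the target so that both series are on a comparable footing.  */
   target store_b / transform=log dif=(1) sdif=(1) normalize=standard;
run;
```

Because TRANSFORM=LOG requires strictly positive values, a series that contains zeros (for example, after SETMISSING=0) would need a different transformation or a user-defined one.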
TARGET Statement
TARGET variable-list < / options > ;
The TARGET statement lists the numeric target variables in the DATA= data set whose values are to be accumulated to form the time series or represent ordered numeric sequences (when no ID statement is specified). An input data set variable can be specified in only one INPUT or TARGET statement. Any number of TARGET statements can be used. The following options can be used with a TARGET statement.
ACCUMULATE=option
specifies how the data set observations are accumulated within each time period for the variables listed in the TARGET statement. If the ACCUMULATE= option is not specified in the TARGET statement, accumulation is determined by the ACCUMULATE= option of the ID statement. If the ACCUMULATE= option is not specified in the ID statement or the TARGET statement, no accumulation is performed. See the ID statement ACCUMULATE= option for more details.
COMPRESS=option | (options)
specifies the sliding sequence (global) and warping (local) compression range of the target sequence with respect to the input sequence. Compression of the target sequence is the same as expansion of the input sequence and vice versa. The compression limits are defined based on the length of the target sequence and are imposed on the target sequence. The following compression options are provided:

GLOBALABS=integer   specifies the absolute global compression, where integer ranges from zero to 10,000. GLOBALABS=0 implies no global compression, which is the default unless the GLOBALPCT= option is specified.
GLOBALPCT=number    specifies global compression as a percentage of the length of the target sequence, where number ranges from zero to 100. GLOBALPCT=0 implies no global compression, which is the default. GLOBALPCT=100 implies maximum allowable compression.
LOCALABS=integer    specifies the absolute local compression, where integer ranges from zero to 10,000. The default is maximum allowable absolute local compression unless the LOCALPCT= option is specified.
LOCALPCT=number     specifies local compression as a percentage of the length of the target sequence, where number ranges from zero to 100. The percentage specified by the LOCALPCT= option must be less than the GLOBALPCT= option. LOCALPCT=0 implies no local compression. LOCALPCT=100 implies maximum allowable local compression. The default is LOCALPCT=100.

If the SLIDE=NONE or the SLIDE=SEASON option is specified in the TARGET statement, the global compression options are ignored. To disallow local compression, use the option COMPRESS=(LOCALPCT=0 LOCALABS=0).

If the SLIDE=INDEX option is specified, the global compression options are not ignored. To completely disallow both global and local compression, use the option COMPRESS=(GLOBALPCT=0 LOCALPCT=0) or COMPRESS=(GLOBALABS=0 LOCALABS=0). To allow only local compression, use the option COMPRESS=(GLOBALPCT=0 GLOBALABS=0). These are the default compression options.
The above options can be used in combination to specify the desired amount of global and local compression, as the following examples illustrate. Let $L_c$ denote the global compression limit and $l_c$ denote the local compression limit.

COMPRESS=(GLOBALPCT=20) allows for the global and local compression to range from zero to $L_c = \min\left(\lfloor 0.2 N_y \rfloor,\; N_y - 1\right)$.

COMPRESS=(GLOBALPCT=20 GLOBALABS=10) allows for the global and local compression to range from zero to $L_c = \min\left(\lfloor 0.2 N_y \rfloor,\; \min(N_y - 1,\; 10)\right)$.

COMPRESS=(LOCALPCT=10) allows for the local compression to range from zero to $l_c = \min\left(\lfloor 0.1 N_y \rfloor,\; N_y - 1\right)$.

COMPRESS=(LOCALPCT=20 LOCALABS=5) allows for the local compression to range from zero to $l_c = \min\left(\lfloor 0.2 N_y \rfloor,\; \min(N_y - 1,\; 5)\right)$.

COMPRESS=(GLOBALPCT=20 LOCALPCT=20) allows for the global compression to range from zero to $L_c = \min\left(\lfloor 0.2 N_y \rfloor,\; N_y - 1\right)$ and allows for the local compression to range from zero to $l_c = \min\left(\lfloor 0.2 N_y \rfloor,\; N_y - 1\right)$.

COMPRESS=(GLOBALPCT=20 GLOBALABS=10 LOCALPCT=10 LOCALABS=5) allows for the global compression to range from zero to $L_c = \min\left(\lfloor 0.2 N_y \rfloor,\; \min(N_y - 1,\; 10)\right)$ and allows for the local compression to range from zero to $l_c = \min\left(\lfloor 0.1 N_y \rfloor,\; \min(N_y - 1,\; 5)\right)$.

Suppose $T_z$ is the length of the input time series and $N_y$ is the length of the target sequence. The valid global compression limit, $L_c$, is always limited by the length of the target sequence: $0 \le L_c < N_y$.

Suppose $N_x$ is the length of the input sequence and $N_y$ is the length of the target sequence. The valid local compression limit, $l_c$, is always limited by the lengths of the input and target sequences: $\max(0,\; N_y - N_x) \le l_c < N_y$.
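As a worked illustration of how the percentage and absolute limits combine, assume a target sequence of length $N_y = 50$ (an arbitrary value, chosen so that the percentages are whole numbers) and the option COMPRESS=(GLOBALPCT=20 GLOBALABS=10 LOCALPCT=10 LOCALABS=5). A second case with $N_y = 30$ shows the percentage terms becoming the binding limits.

```latex
\begin{align*}
% Case 1: N_y = 50, absolute and percentage limits coincide
L_c &= \min\bigl(0.2 \times 50,\ \min(50 - 1,\ 10)\bigr) = \min(10,\ 10) = 10 \\
l_c &= \min\bigl(0.1 \times 50,\ \min(50 - 1,\ 5)\bigr) = \min(5,\ 5) = 5 \\[4pt]
% Case 2: N_y = 30, the percentage limits are the smaller ones
L_c &= \min\bigl(0.2 \times 30,\ \min(30 - 1,\ 10)\bigr) = \min(6,\ 10) = 6 \\
l_c &= \min\bigl(0.1 \times 30,\ \min(30 - 1,\ 5)\bigr) = \min(3,\ 5) = 3
\end{align*}
```

In both cases the resulting limits satisfy $0 \le L_c < N_y$; whether $l_c$ is valid also depends on the input sequence length $N_x$, which is not fixed in this illustration.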
DIF=(numlist)
specifies the differencing to be applied to the accumulated time series. The list of differencing orders must be separated by spaces or commas. For example, DIF=(1,3) specifies first, then third, order differencing. Differencing is applied after time series transformation. The TRANSFORM= option is applied before the DIF= option. Simple differencing is useful when you want to detrend the time series before computing the similarity measures.
EXPAND=option | (options)
specifies the sliding sequence (global) and warping (local) expansion range of the target sequence with respect to the input sequence. Expansion of the target sequence is the same as compression of the input sequence and vice versa. The expansion limits are defined based on the length of the input sequence, but are imposed on the target sequence. The following expansion options are provided:

GLOBALABS=integer   specifies the absolute global expansion, where integer ranges from zero to 10,000. GLOBALABS=0 implies no global expansion, which is the default unless the GLOBALPCT= option is specified.
GLOBALPCT=number    specifies global expansion as a percentage of the length of the target sequence, where number ranges from zero to 100. GLOBALPCT=0 implies no global expansion, which is the default unless the GLOBALABS= option is specified. GLOBALPCT=100 implies maximum allowable global expansion.
LOCALABS=integer    specifies the absolute local expansion, where integer ranges from zero to 10,000. The default is maximum allowable absolute local expansion unless the LOCALPCT= option is specified.
LOCALPCT=number     specifies local expansion as a percentage of the length of the target sequence, where number ranges from zero to 100. LOCALPCT=0 implies no local expansion. LOCALPCT=100 implies maximum allowable local expansion. The default is LOCALPCT=100.
If the SLIDE=NONE or the SLIDE=SEASON option is specified in the TARGET statement, the global expansion options are ignored. To disallow local expansion, use the option EXPAND=(LOCALPCT=0 LOCALABS=0).

If the SLIDE=INDEX option is specified, the global expansion options are not ignored. To completely disallow both global and local expansion, use the option EXPAND=(GLOBALPCT=0 LOCALPCT=0) or EXPAND=(GLOBALABS=0 LOCALABS=0). To allow only local expansion, use the option EXPAND=(GLOBALPCT=0 GLOBALABS=0). These are the default expansion options.

The above options can be used in combination to specify the desired amount of global and local expansion, as the following examples illustrate. Let $L_e$ denote the global expansion limit and $l_e$ denote the local expansion limit.

EXPAND=(GLOBALPCT=20) allows for the global and local expansion to range from zero to $L_e = \min\left(\lfloor 0.2 N_y \rfloor,\; N_y - 1\right)$.

EXPAND=(GLOBALPCT=20 GLOBALABS=10) allows for the global and local expansion to range from zero to $L_e = \min\left(\lfloor 0.2 N_y \rfloor,\; \min(N_y - 1,\; 10)\right)$.

EXPAND=(LOCALPCT=10) allows for the local expansion to range from zero to $l_e = \min\left(\lfloor 0.1 N_y \rfloor,\; N_y - 1\right)$.

EXPAND=(LOCALPCT=10 LOCALABS=5) allows for the local expansion to range from zero to $l_e = \min\left(\lfloor 0.1 N_y \rfloor,\; \min(N_y - 1,\; 5)\right)$.

EXPAND=(GLOBALPCT=20 LOCALPCT=10) allows for the global expansion to range from zero to $L_e = \min\left(\lfloor 0.2 N_y \rfloor,\; N_y - 1\right)$ and allows for the local expansion to range from zero to $l_e = \min\left(\lfloor 0.1 N_y \rfloor,\; N_y - 1\right)$.

EXPAND=(GLOBALPCT=20 GLOBALABS=10 LOCALPCT=10 LOCALABS=5) allows for the global expansion to range from zero to $L_e = \min\left(\lfloor 0.2 N_y \rfloor,\; \min(N_y - 1,\; 10)\right)$ and allows for the local expansion to range from zero to $l_e = \min\left(\lfloor 0.1 N_y \rfloor,\; \min(N_y - 1,\; 5)\right)$.

Suppose $T_z$ is the length of the input time series and $N_y$ is the length of the target sequence. The valid global expansion limit, $L_e$, is always limited by the length of the input time series: $0 \le L_e < T_z$.

Suppose $N_x$ is the length of the input sequence and $N_y$ is the length of the target sequence. The valid local expansion limit, $l_e$, is always limited by the lengths of the input and target sequences: $\max(0,\; N_x - N_y) \le l_e < N_x$.
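A TARGET statement that sets both compression and expansion limits might look like the following sketch. The data set WORK.SALES and the variables SALEDATE, STORE_A, and STORE_B are hypothetical, and the limit values simply mirror the combined examples above.

```sas
/* WORK.SALES, SALEDATE, STORE_A, and STORE_B are assumed names.   */
proc similarity data=work.sales outsum=work.simsummary;
   id saledate interval=month accumulate=total;
   input store_a;

   /* SLIDE=INDEX is specified, so the global limits are honored.  */
   /* Global limits: at most 20 percent, capped at 10.             */
   /* Local (warping) limits: at most 10 percent, capped at 5.     */
   target store_b / slide=index
                    measure=absdev
                    compress=(globalpct=20 globalabs=10
                              localpct=10  localabs=5)
                    expand=(globalpct=20 globalabs=10
                            localpct=10  localabs=5);
run;
```

With SLIDE=NONE or SLIDE=SEASON in place of SLIDE=INDEX, the GLOBALPCT= and GLOBALABS= values in this sketch would be ignored and only the local limits would apply.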
MEASURE=option
specifies the similarity measure to be computed by using the working input and target sequences. The following similarity measures are provided:

SQRDEV         squared deviation. This option is the default.
ABSDEV         absolute deviation
MSQRDEV        mean squared deviation
MSQRDEVINP     mean squared deviation relative to the length of the input sequence
MSQRDEVTAR     mean squared deviation relative to the length of the target sequence
MSQRDEVMIN     mean squared deviation relative to the minimum valid path length
MSQRDEVMAX     mean squared deviation relative to the maximum valid path length
MABSDEV        mean absolute deviation
MABSDEVINP     mean absolute deviation relative to the length of the input sequence
MABSDEVTAR     mean absolute deviation relative to the length of the target sequence
MABSDEVMIN     mean absolute deviation relative to the minimum valid path length
MABSDEVMAX     mean absolute deviation relative to the maximum valid path length
User-Defined   measure computed by a user-defined function created by using the FCMP procedure, where User-Defined is the function name
NORMALIZE=option
specifies the sequence normalization to be applied to the working target sequence. The following normalization options are provided:

NONE           No normalization is applied. This option is the default.
ABSOLUTE       Absolute normalization is applied.
STANDARD       Standard normalization is applied.
User-Defined   Normalization is computed by a user-defined subroutine created by using the FCMP procedure, where User-Defined is the subroutine name.
PATH=option
specifies the similarity measure and warping path information to be computed by using the working input and target sequences. The following similarity measures and warping paths are provided:

User-Defined   measure and path computed by a user-defined subroutine created by using the FCMP procedure, where User-Defined is the subroutine name

For computational efficiency, use the PATH= option only when you need both the similarity measure and the warping path information. If only the similarity measure is needed, use the MEASURE= option. If both the MEASURE= and PATH= options are specified in the TARGET statement, the PATH= option takes precedence.
SDIF=(numlist)
specifies the seasonal differencing to be applied to the accumulated time series. The list of seasonal differencing orders must be separated by spaces or commas. For example, SDIF=(1,3) specifies first, then third, order seasonal differencing. Differencing is applied after time series transformation. The TRANSFORM= option is applied before the SDIF= option. Seasonal differencing is useful when you want to deseasonalize the time series before computing the similarity measures.
SETMISSING=option | number
specifies how missing values (either actual or accumulated) are interpreted in the accumulated time series for variables listed in the TARGET statement. If the SETMISSING= option is not specified in the TARGET statement, missing values are set based on the SETMISSING= option of the ID statement. If the SETMISSING= option is not specified in the ID statement or the TARGET statement, no missing value interpretation is performed. See the ID statement SETMISSING= option for more details.
SLIDE=option
specifies the sliding of the target sequence with respect to the input sequence. The following slides are provided:

NONE     No sequence sliding. The input time series is compared with the target sequence directly with no sliding. This option is the default.
INDEX    Slide by time index. The input time series is compared with the target sequence by observation index.
SEASON   Slide by seasonal index. The input time series is compared with the target sequence by seasonal index.

NOTE: The SLIDE= option takes precedence over the COMPRESS= and EXPAND= options.
TRANSFORM= option
specifies the time series transformation to be applied to the accumulated time series. The following transformations are provided:

NONE             No transformation is applied. This option is the default.
LOG              Logarithmic transformation is applied.
SQRT             Square-root transformation is applied.
LOGISTIC         Logistic transformation is applied.
BOXCOX(number)   Box-Cox transformation with parameter number is applied, where number is a real number between -5 and 5.
User-Defined     Transformation is computed by a user-defined subroutine created by using the FCMP procedure, where User-Defined is the subroutine name.

When the TRANSFORM= option is specified, the time series must be strictly positive unless a user-defined function is used.
TRIMMISSING=option
TRIMMISS=option
specifies how missing values (either actual or accumulated) are trimmed from the accumulated time series or ordered sequence for variables listed in the TARGET statement. The following trimming options are provided:

NONE    No missing value trimming is applied.
LEFT    Beginning missing values are trimmed.
RIGHT   Ending missing values are trimmed.
BOTH    Both beginning and ending missing values are trimmed. This is the default.
ZEROMISS=option
specifies how beginning and/or ending zero values (either actual or accumulated) are interpreted in the accumulated time series or ordered sequence for variables listed in the TARGET statement. If the ZEROMISS= option is not specified in the TARGET statement, beginning and/or ending zero values are set based on the ZEROMISS= option of the ID statement. See the ID statement ZEROMISS= option for more details.
The accumulated time series can then be transformed to form the working time series. Transformations are useful when you want to stabilize the time series before computing the similarity measures. Simple and seasonal differencing are useful when you want to detrend or deseasonalize the time series before computing the similarity measures. Often, but not always, the TRANSFORM=, DIF=, and SDIF= options should be specified in the same way for both the target and input variables.

1. time series transformation           TRANSFORM= option
2. time series differencing             DIF= and SDIF= options
3. time series missing value trimming   TRIMMISSING= option
4. time series descriptive statistics   PRINT=DESCSTATS option

After the working series is formed, it can be treated as an ordered sequence that can be normalized or scaled. Normalizations are useful when you want to compare the shape or profile of the time series. Scaling is useful when you want to compare the input sequence to the target sequence while discounting the variation of the target sequence.

1. normalization   NORMALIZE= option
2. scaling         SCALE= option
After the working sequences are formed, similarity measures can be computed between input and target sequences.

1. sliding              SLIDE= option
2. warping              COMPRESS= and EXPAND= options
3. similarity measure   MEASURE= and PATH= options
The SLIDE= option is used to specify observation-index sliding, seasonal-index sliding, or no sliding. The COMPRESS= and EXPAND= options are used to specify the warping limits. The MEASURE= and PATH= options are used to specify how the similarity measures are computed.
Accumulation
If the ACCUMULATE= option is specified in the ID, INPUT, or TARGET statement, data set observations are accumulated within each time period. The frequency (width of each time interval) is specified by the INTERVAL= option in the ID statement. The ID variable contains the time ID values. Each time ID value corresponds to a specific time period. Accumulation is particularly useful when the input data set contains transactional data, whose observations are not spaced with respect to any particular time interval. The accumulated values form the time series, which is used in subsequent analyses. For example, suppose a data set contains the following observations:
   19MAR1999   10
   19MAR1999   30
   11MAY1999   50
   12MAY1999   20
   23MAY1999   20
If the INTERVAL=MONTH option is specified, all of the preceding observations fall within the three time periods of March 1999, April 1999, and May 1999. The observations are accumulated within each time period as follows:

If the ACCUMULATE=NONE option is specified, an error is generated because the ID variable values are not equally spaced with respect to the specified frequency (MONTH).

If the ACCUMULATE=TOTAL option is specified, the data are accumulated as follows:
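   01MAR1999   40
   01APR1999    .
   01MAY1999   90

If the ACCUMULATE=AVERAGE option is specified, the data are accumulated as follows:

   01MAR1999   20
   01APR1999    .
   01MAY1999   30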
As can be seen from the preceding examples, even though the data set observations contained no missing values, the accumulated time series can have missing values.
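The ACCUMULATE=TOTAL case above can be checked with a small sketch. It uses the TIMESERIES procedure rather than the SIMILARITY procedure, purely because PROC TIMESERIES writes the accumulated series to its OUT= data set; the assumption is that its ID statement accumulates observations in the same way. The data set names WORK.TRANS and WORK.MONTHLY and the variable names DATE and AMOUNT are hypothetical.

```sas
/* Transactional observations from the example above.              */
/* WORK.TRANS, DATE, and AMOUNT are assumed names.                 */
data work.trans;
   input date : date9. amount;
   format date date9.;
   datalines;
19MAR1999 10
19MAR1999 30
11MAY1999 50
12MAY1999 20
23MAY1999 20
;
run;

/* Accumulate the transactions into monthly totals.                */
proc timeseries data=work.trans out=work.monthly;
   id date interval=month accumulate=total;
   var amount;
run;

/* The output contains one observation per month from March        */
/* through May 1999; April has no transactions and is missing.     */
proc print data=work.monthly noobs;
run;
```

Adding SETMISSING=0 to the ID statement in this sketch would replace the missing April total with zero, which is the usual interpretation for transactional data.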
/10
and $\mathrm{ceil}(x)$ is the smallest integer greater than or equal to $x$.

Square root    is the square root transformation, $w_t = \sqrt{y_t}$

Box-Cox        is the Box-Cox transformation,
               $w_t = \begin{cases} \dfrac{y_t^{\lambda} - 1}{\lambda}, & \lambda \ne 0 \\ \ln(y_t), & \lambda = 0 \end{cases}$

User-Defined   is the transformation computed by a user-defined subroutine created by using the FCMP procedure, where User-Defined is the subroutine name.
Other time series transformations can be performed prior to invoking the SIMILARITY procedure by using the EXPAND procedure of SAS/ETS or the DATA step.
Sliding Sequences
Similarity measures can be computed between the target sequence and any contiguous subsequences of the input time series. There are three types of sequence sliding:

No Sliding
Slide by Time Index
Slide by Season Index

For more information, see Leonard et al. (2008).
Time Warping
Time warping allows for the comparison between target and input sequences of differing lengths by compressing or expanding the input sequence with respect to the target sequence while respecting the order of the sequence elements. For more information, see Leonard et al. (2008).
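For orientation, the following is the textbook unconstrained dynamic time warping recurrence. It is a sketch of the general technique, not a statement of the exact algorithm the SIMILARITY procedure uses, and it ignores the compression and expansion limits. The symbols $x_i$ (input sequence, $i = 1, \ldots, N_x$), $y_j$ (target sequence, $j = 1, \ldots, N_y$), $d$, and $D$ are notation introduced here for illustration.

```latex
\begin{align*}
% Pointwise cost: squared or absolute deviation
d(i,j) &= (x_i - y_j)^2 \quad \text{or} \quad d(i,j) = |x_i - y_j| \\[4pt]
% Boundary conditions along the first row and column
D(1,1) &= d(1,1) \\
D(i,1) &= d(i,1) + D(i-1,1), \qquad i = 2, \ldots, N_x \\
D(1,j) &= d(1,j) + D(1,j-1), \qquad j = 2, \ldots, N_y \\[4pt]
% Interior recurrence: extend the cheapest of the three admissible moves
D(i,j) &= d(i,j) + \min\bigl( D(i-1,j),\ D(i-1,j-1),\ D(i,j-1) \bigr)
\end{align*}
```

The warped similarity measure is then $D(N_x, N_y)$. The three arguments of the minimum correspond to repeating a target element, advancing both sequences together, and repeating an input element, which is how the order of the sequence elements is preserved.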
Sequence Normalization
The working (input or target) sequence can be normalized prior to further analysis. Let $q_i$ be the original sequence with mean, $\mu_q$, and standard deviation, $\sigma_q$; and let $r_i$ be the normalized sequence. The normalizations are defined as follows:

Standard   is the standard normalization, $r_i = (q_i - \mu_q) / \sigma_q$
Sequence Scaling
The working input sequence can be scaled to the working target sequence. Sequence scaling is applied after normalization. Let $y_j$ be the working target sequence with mean, $\mu_y$, and standard deviation, $\sigma_y$. Let $x_i$ be the working input sequence and let $q_i$ be the scaled sequence. The scaling is defined as follows:

Standard   is the standard scaling, $q_i = (x_i - \mu_y) / \sigma_y$
Similarity Measures
The working input sequence can be compared to the working target sequence. For more information, see Leonard et al. (2008).
responsibility of the user. Additionally, the support for standard C libraries is restricted. If given a choice, it is highly recommended that user-defined functions and subroutines be written in the SAS language using the FCMP procedure. For each of the tasks described above, the following sections describe the required subroutine/function signature and provide examples of using a user-defined routine with the SIMILARITY procedure.
where the array-name is the time series to be transformed. For example, to duplicate the functionality of the built-in TRANSFORM=LOG option in the INPUT and TARGET statements, the following SAS statements create a user-defined version of this transformation called MYTRANSFORM and store this subroutine in the catalog SASUSER.MYSIMILAR.
   proc fcmp outlib=sasuser.mysimilar.package;
      subroutine mytransform( series[*] );
         outargs series;
         length = DIM(series);
         do i = 1 to length;
            value = series[i];
            /* log is defined only for positive values */
            if value > 0 then do;
               series[i] = log( value );
            end;
            else do;
               series[i] = .;
            end;
         end;
      endsub;
   run;
This user-defined subroutine can be specified in the TRANSFORM= option of the INPUT or TARGET statement as follows:

   options cmplib = sasuser.mysimilar;
   proc similarity ...;
      ...
Sequence Normalizations
A user-defined normalization subroutine has the following signature:
SUBROUTINE <SUBROUTINE-NAME> ( <ARRAY-NAME>[*] );
where the array-name is the sequence to be normalized. For example, to duplicate the functionality of the built-in NORMALIZE=ABSOLUTE option in the INPUT and TARGET statements, the following SAS statements create a user-defined version of this normalization called MYNORMALIZE and store this subroutine in the catalog SASUSER.MYSIMILAR.
   proc fcmp outlib=sasuser.mysimilar.package;
      subroutine mynormalize( sequence[*] );
         outargs sequence;
         length = DIM(sequence);
         minimum = .;
         maximum = .;
         /* first pass: find the minimum and maximum nonmissing values */
         do i = 1 to length;
            value = sequence[i];
            if nmiss(minimum) | nmiss(maximum) then do;
               minimum = value;
               maximum = value;
            end;
            if nmiss(value) = 0 then do;
               if value < minimum then minimum = value;
               if value > maximum then maximum = value;
            end;
         end;
         /* second pass: rescale to [0,1]; a constant or all-missing */
         /* sequence has no range, so its values are set to missing  */
         do i = 1 to length;
            value = sequence[i];
            if nmiss( value ) | nmiss( minimum ) | minimum >= maximum then do;
               sequence[i] = .;
            end;
            else do;
               sequence[i] = (value - minimum) / (maximum - minimum);
            end;
         end;
      endsub;
   run;
This user-defined subroutine can be specified in the NORMALIZE= option of the INPUT or TARGET statement as follows:

   options cmplib = sasuser.mysimilar;
   proc similarity ...;
      ...
      input myinput / normalize=mynormalize;
      target mytarget / normalize=mynormalize;
      ...
   run;
Sequence Scaling
A user-defined scaling subroutine has the following signature:
SUBROUTINE <SUBROUTINE-NAME> ( <ARRAY-NAME>[*], <ARRAY-NAME>[*] );
where the first array-name is the target sequence and the second array-name is the input sequence to be scaled. For example, to duplicate the functionality of the built-in SCALE=ABSOLUTE option in the INPUT statement, the following SAS statements create a user-defined version of this scaling called MYSCALE and store this subroutine in the catalog SASUSER.MYSIMILAR.
   proc fcmp outlib=sasuser.mysimilar.package;
      subroutine myscale( target[*], input[*] );
         outargs input;
         length = DIM(target);
         minimum = .;
         maximum = .;
         /* first pass: find the minimum and maximum of the target */
         do i = 1 to length;
            value = target[i];
            if nmiss(minimum) | nmiss(maximum) then do;
               minimum = value;
               maximum = value;
            end;
            if nmiss(value) = 0 then do;
               if value < minimum then minimum = value;
               if value > maximum then maximum = value;
            end;
         end;
         /* second pass: scale the input by the target's range;    */
         /* loop over the input's own length, which can differ     */
         /* from the target's length                               */
         do i = 1 to DIM(input);
            value = input[i];
            if nmiss( value ) | nmiss( minimum ) | minimum >= maximum then do;
               input[i] = .;
            end;
            else do;
               input[i] = (value - minimum) / (maximum - minimum);
            end;
         end;
      endsub;
   run;
This user-defined subroutine can be specified in the SCALE= option of the INPUT statement as follows:

   options cmplib=sasuser.mysimilar;
   proc similarity ...;
      ...
      input myinput / scale=myscale;
      ...
   run;
Similarity Measures
A user-defined similarity measure function has the following signature:
FUNCTION <FUNCTION-NAME> ( <ARRAY-NAME>[*], <ARRAY-NAME>[*] );
where the first array-name is the target sequence and the second array-name is the input sequence. The return value of the function is the similarity measure associated with the target sequence and the input sequence. For example, to duplicate the functionality of the built-in MEASURE=ABSDEV option in the TARGET statement with no warping, the following SAS statements create a user-defined version of this measure called MYMEASURE and store this function in the catalog SASUSER.MYSIMILAR.
   proc fcmp outlib=sasuser.mysimilar.package;
      function mymeasure( target[*], input[*] );
         length = min(DIM(target), DIM(input));
         sum = 0;
         num = 0;
         /* accumulate absolute deviations over nonmissing pairs */
         do i = 1 to length;
            x = input[i];
            w = target[i];
            if nmiss(x) = 0 & nmiss(w) = 0 then do;
               d = x - w;
               sum = sum + abs(d);
            end;
         end;
         return( sum );
      endsub;
   run;
This user-dened function can be specied in the MEASURE= option of TARGET statement as follows:
   options cmplib=sasuser.mysimilar;
   proc similarity ...;
      ...
      target mytarget / measure=mymeasure;
      ...
   run;
For another example, to duplicate the functionality of the built-in MEASURE=SQRDEV and MEASURE=ABSDEV options by using the C language, the following SAS statements create user-defined C language versions of these measures called DTW_SQRDEV_C and DTW_ABSDEV_C and store these functions in the catalog SASUSER.CSIMIL.CFUNCS. DTW refers to dynamic time warping. These C language functions can then be called by SAS language functions and subroutines.
   proc proto package=sasuser.csimil.cfuncs;

      mapmiss double = 999999999;

      double dtw_sqrdev_c( double * target / iotype=input, int targetLength,
                           double * input  / iotype=input, int inputLength );

      externc dtw_sqrdev_c;
      double dtw_sqrdev_c( double * target, int targetLength,
                           double * input,  int inputLength )
      {
         int i,j;
         double x,w,d;
         /* both work rows are indexed by the target position j */
         double * prev = (double *)malloc( sizeof(double)*targetLength);
         double * curr = (double *)malloc( sizeof(double)*targetLength);
         if ( prev == 0 || curr == 0 ) {
            free( (char*) prev);
            free( (char*) curr);
            return 999999999;
         }
         /* first row: cumulative cost along the target for input[0] */
         x = input[0];
         for ( j=0; j<targetLength; j++ ) {
            w = target[j];
            d = x - w;
            d = d*d;
            if ( j == 0 ) prev[j] = d;
            else prev[j] = d + prev[j-1];
         }
         for (i=1; i<inputLength; i++ ) {
            x = input[i];
            j = 0;
            w = target[j];
            d = x - w;
            d = d*d;
            curr[j] = d + prev[j];
            for (j=1; j<targetLength; j++ ) {
               w = target[j];
               d = x - w;
               d = d*d;
               curr[j] = d + fmin( prev[j], fmin( prev[j-1], curr[j-1]));
            }
            /* the finished row becomes the previous row */
            for ( j=0; j<targetLength; j++ )
               prev[j] = curr[j];
         }
         /* prev holds the last completed row, also when inputLength is 1 */
         d = prev[targetLength-1];
         free( (char*) prev);
         free( (char*) curr);
         return( d );
      }
      externcend;

      double dtw_absdev_c( double * target / iotype=input, int targetLength,
                           double * input  / iotype=input, int inputLength );

      externc dtw_absdev_c;
      double dtw_absdev_c( double * target, int targetLength,
                           double * input,  int inputLength )
      {
         int i,j;
         double x,w,d;
         double * prev = (double *)malloc( sizeof(double)*targetLength);
         double * curr = (double *)malloc( sizeof(double)*targetLength);
         if ( prev == 0 || curr == 0 ) {
            free( (char*) prev);
            free( (char*) curr);
            return 999999999;
         }
         x = input[0];
         for ( j=0; j<targetLength; j++ ) {
            w = target[j];
            d = x - w;
            d = fabs(d);
            if ( j == 0 ) prev[j] = d;
            else prev[j] = d + prev[j-1];
         }
         for (i=1; i<inputLength; i++ ) {
            x = input[i];
            j = 0;
            w = target[j];
            d = x - w;
            d = fabs(d);
            curr[j] = d + prev[j];
            for (j=1; j<targetLength; j++ ) {
               w = target[j];
               d = x - w;
               d = fabs(d);
               curr[j] = d + fmin( prev[j], fmin( prev[j-1], curr[j-1] ));
            }
            for ( j=0; j<targetLength; j++ )
               prev[j] = curr[j];
         }
         d = prev[targetLength-1];
         free( (char*) prev);
         free( (char*) curr);
         return( d );
      }
      externcend;
   run;
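The row-by-row loops in the two C functions are meant to evaluate the standard dynamic time warping recurrence. As a sketch in notation that is not part of the original listing (the symbols $\gamma$ and $d$ are introduced here for illustration), let $x_i$ be the input values, $w_j$ the target values, and $d(i,j)$ the pointwise distance, which is $(x_i - w_j)^2$ for DTW_SQRDEV_C and $|x_i - w_j|$ for DTW_ABSDEV_C:

```latex
\begin{aligned}
\gamma(0,0) &= d(0,0) \\
\gamma(0,j) &= d(0,j) + \gamma(0,j-1), \qquad j \ge 1 \\
\gamma(i,0) &= d(i,0) + \gamma(i-1,0), \qquad i \ge 1 \\
\gamma(i,j) &= d(i,j) + \min\{\gamma(i-1,j),\ \gamma(i-1,j-1),\ \gamma(i,j-1)\}, \qquad i,j \ge 1
\end{aligned}
```

In the code, the array prev plays the role of row $i-1$ of $\gamma$ and curr plays the role of row $i$; the returned measure corresponds to $\gamma$ evaluated at the last input index and the last target index.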
The preceding SAS statements create two C language functions that can then be used in SAS language functions and subroutines. However, these functions cannot be used directly by the SIMILARITY procedure. In order to use these C language functions in the SIMILARITY procedure, two SAS language functions must be created that call these two C language functions. The following SAS statements create two user-defined SAS language versions of these measures called DTW_SQRDEV and DTW_ABSDEV and store these functions in the catalog SASUSER.MYSIMILAR.FUNCS. These SAS language functions call the previously created C language functions; the SAS language functions can then be used by the SIMILARITY procedure.
   /* INLIB= points to the data set that holds the PROC PROTO package CFUNCS */
   proc fcmp outlib=sasuser.mysimilar.funcs inlib=sasuser.csimil;
      function dtw_sqrdev( target[*], input[*] );
         dev = dtw_sqrdev_c(target,DIM(target),input,DIM(input));
         return( dev );
      endsub;
      function dtw_absdev( target[*], input[*] );
         dev = dtw_absdev_c(target,DIM(target),input,DIM(input));
         return( dev );
      endsub;
   run;
These user-defined functions can be specified in the MEASURE= option of the TARGET statement as follows:
   options cmplib=sasuser.mysimilar;
   proc similarity ...;
      ...
      target mytarget   / measure=dtw_sqrdev;
      target yourtarget / measure=dtw_absdev;
      ...
   run;
A user-defined similarity measure function that also returns the warping path takes five array arguments and has the following signature:
FUNCTION <FUNCTION-NAME> ( <ARRAY-NAME>[*], <ARRAY-NAME>[*], <ARRAY-NAME>[*], <ARRAY-NAME>[*], <ARRAY-NAME>[*] );
where the first array-name is the target sequence, the second array-name is the input sequence, the third array-name is the returned target sequence indices, the fourth array-name is the returned input sequence indices, and the fifth array-name is the returned path distances. The returned value of the function is the similarity measure. The last three returned arrays are used to compute the path and cost statistics. The returned sequence indices must represent a valid warping path, that is, integers greater than zero and less than or equal to the sequence length, recorded in ascending order. The returned path distances must be nonnegative numbers.
the BY statement. The ID statement time ID variable is also included in the data sets when the time dimension is important. If an analysis step related to an output data step fails, then the values of this step are not recorded or are set to missing in the related output data set, and appropriate error and/or warning messages are recorded in the SAS log.
The OUTMEASURE= data set is ordered by the variables _INPUT_, then _TARGET_, then _TIMEID_ when ORDER=INPUTTARGET. The OUTMEASURE= data set is ordered by the variables _TARGET_, then _INPUT_, then _TIMEID_ when ORDER=TARGETINPUT. For ORDER=INPUT, the OUTMEASURE= data set has the following form:
   _INPUT_        input variable name
   _TIMEID_       time ID values
   _INPSEQ_       input sequence values
   target-names   similarity measures associated with each TARGET statement variable name
The OUTMEASURE= data set is ordered by the variables _INPUT_, then _TIMEID_. For ORDER=TARGET, the OUTMEASURE= data set has the following form:
   _TARGET_       target variable name
   _TIMEID_       time ID values
   _TARSEQ_       target sequence values
   input-names    similarity measures associated with each INPUT statement variable name
The OUTMEASURE= data set is ordered by the variables _TARGET_, then _TIMEID_.
The OUTSUM= data set is ordered by the variables _INPUT_, then _TARGET_ when ORDER=INPUTTARGET. The OUTSUM= data set is ordered by the variables _TARGET_, then _INPUT_ when ORDER=TARGETINPUT. For ORDER=INPUT, the OUTSUM= data set has the following form:
   _INPUT_        input variable name
   _STATUS_       status flag that indicates whether the requested analyses were successful
   target-names   similarity measure summary associated with each TARGET statement variable name
The OUTSUM= data set is ordered by the variable _INPUT_. For ORDER=TARGET, the OUTSUM= data set has the following form:
   _TARGET_       target variable name
   _STATUS_       status flag that indicates whether the requested analyses were successful
   input-names    similarity measure summary associated with each INPUT statement variable name
The sorting of the OUTSEQUENCE= data set depends on the SORTNAMES and ORDER= options. The OUTSEQUENCE= data set is ordered by the variables _INPUT_, then _TARGET_, then _TIMEID_ when ORDER=INPUTTARGET or ORDER=INPUT. The OUTSEQUENCE= data set is ordered by the variables _TARGET_, then _INPUT_, then _TIMEID_ when ORDER=TARGETINPUT or ORDER=TARGET. If there are many slides or warps, this data set can be large.
The sorting of the OUTPATH= data set depends on the SORTNAMES and ORDER= options. The OUTPATH= data set is ordered by the variables _INPUT_, then _TARGET_, then _TIMEID_ when ORDER=INPUTTARGET or ORDER=INPUT. The OUTPATH= data set is ordered by the variables _TARGET_, then _INPUT_, then _TIMEID_ when ORDER=TARGETINPUT or ORDER=TARGET. If there are many slides or warps, this data set can be large.
   Differencing failure
   Unable to compute descriptive statistics
   Normalization failure
   Input contains embedded missing values
   Target contains embedded missing values
   Scaling failure
   Measure failure
   Path failure
   Slide summarization failure
Printed Output
The SIMILARITY procedure optionally produces printed output by using the Output Delivery System (ODS). By default, the procedure produces no printed output. All output is controlled by the PRINT= and PRINTDETAILS options in the PROC SIMILARITY statement. The sort, order, and form of the printed output depend on both the SORTNAMES option and the ORDER= option. If the SORTNAMES option is specified, each variable (INPUT or TARGET) is analyzed in ascending order. For ORDER=INPUTTARGET, the printed output is ordered by the INPUT statement variables (row) and then by the TARGET statement variables (row). For ORDER=TARGETINPUT, the printed output is ordered by the TARGET statement variables (row) and then by the INPUT statement variables (row). For ORDER=INPUT, the printed output is ordered by the INPUT statement variables (row) and then the TARGET statement variables (column). For ORDER=TARGET, the printed output is ordered by the TARGET statement variables (row) and then the INPUT statement variables (column). In general, if an analysis step related to printed output fails, the values of this step are not printed and appropriate error or warning messages are recorded in the SAS log. The printed output is similar to the output data sets; these similarities are described below.
PRINT=COSTS
prints the cost statistics.
PRINT=DESCSTATS
prints the descriptive statistics.
PRINT=PATHS
prints the path statistics.
PRINT=SLIDES
prints the sliding sequence summary.
PRINT=SUMMARY
prints the summary of similarity measures similar to the OUTSUM= data set.
PRINT=WARPS
prints the warp summary.
PRINTDETAILS
prints each table with greater detail.
   ODS Table Name          Description                    Option
   CostStatistics          Cost statistics                PRINT=COSTS
   DescStats               Descriptive statistics         PRINT=DESCSTATS
   PathLimits              Path limits                    PRINT=PATHS
   PathStatistics          Path statistics                PRINT=PATHS
   SlideMeasuresSummary    Summary of measure per slide   PRINT=SLIDES
   MeasuresSummary         Measures summary               PRINT=SUMMARY
   InputMeasuresSummary    Measures summary               PRINT=SUMMARY
   TargetMeasuresSummary   Measures summary               PRINT=SUMMARY
   WarpMeasuresSummary     Summary of measure per warp    PRINT=WARPS
ODS Graphics
This section describes the use of ODS for creating graphics with the SIMILARITY procedure.
To request these graphs, you must specify the ODS GRAPHICS statement. In addition, you can specify the PLOTS= option in the PROC SIMILARITY statement as described in Table 22.3.
   ODS Graph Name                  Plot Description                    Statement    PLOTS= Option
   CostsPlot                       Costs plot                          SIMILARITY   PLOTS=COSTS
   PathDistancePlot                Path distances plot                 SIMILARITY   PLOTS=DISTANCES
   PathDistanceHistogram           Path distances histogram            SIMILARITY   PLOTS=DISTANCES
   PathRelativeDistancePlot        Path relative distances plot        SIMILARITY   PLOTS=DISTANCES
   PathRelativeDistanceHistogram   Path relative distances histogram   SIMILARITY   PLOTS=DISTANCES
   SeriesPlot                      Input time series plot              SIMILARITY   PLOTS=INPUTS
   TargetSequencePlot              Target sequence plot                SIMILARITY   PLOTS=TARGETS
   SequencePlot                    Sequence plot                       SIMILARITY   PLOTS=SEQUENCES
   SimilarityPlot                  Similarity measures plot            SIMILARITY   PLOTS=MEASURES
   PathPlot                        Path plot                           SIMILARITY   PLOTS=PATHS
   PathSequencesPlot               Path sequences plot                 SIMILARITY   PLOTS=MAPS
   PathSequencesScaledPlot         Scaled path sequences map plot      SIMILARITY   PLOTS=MAPS
   WarpPlot                        Warping plot                        SIMILARITY   PLOTS=WARPS
   ScaledWarpPlot                  Scaled warping plot                 SIMILARITY   PLOTS=WARPS
Sequence Plots
The sequence plots (SequencePlot) illustrate the target and input sequences to be compared. The horizontal axis represents the (target or input) sequence index, and the vertical axis represents the (target or input) sequence values.
Path Plots
The path plot (PathPlot) and path limits plot (PathPlot) illustrate the path through the distance matrix. The horizontal axis represents the input sequence index, and the vertical axis represents the target sequence index. The dots represent the path coordinates. The upper parallel line represents the compression limit, and the lower parallel line represents the expansion limit. Vertical movements indicate compression, and horizontal movements indicate expansion of the target sequence with respect to the input sequence. These plots are useful for visualizing the amount of expansion and compression along the path.
Cost Plots
The cost plot (CostsPlot) and cost limits plot (CostsPlot) illustrate the cost of traversing the distance matrix. The horizontal axis represents the input sequence index, and the vertical axis represents the target sequence index. The colors or shading within the plot illustrate the incremental cost of traversing the distance matrix. The upper parallel line represents the compression limit, and the lower parallel line represents the expansion limit.
The monthly time series data are stored in the data set WORK.MSERIES. Each BY group associated with the BY variable STORE contains an observation for each of the 36 months associated with the years 1998, 1999, and 2000. Each observation contains the variables STORE and TIMESTAMP and each of the analysis variables in the input DATA= data set. After each set of transactions has been accumulated to form the corresponding time series, the accumulated time series can be analyzed by using various time series analysis techniques. For example, exponentially weighted moving averages can be used to smooth each series. The following statements use the EXPAND procedure to smooth the analysis variable named STOREITEM.
   proc expand data=mseries out=smoothed from=month;
      by store;
      id timestamp;
      convert storeitem=smooth / transform=(ewma 0.1);
   run;
The smoothed series is stored in the data set WORK.SMOOTHED. The variable SMOOTH contains the smoothed series. If the time ID variable TIMESTAMP contains SAS datetime values instead of SAS date values, the INTERVAL=, START=, and END= options in the SIMILARITY procedure must be changed accordingly, and the following statements could be used to accumulate the datetime transactions to a monthly interval:
   proc similarity data=work.retail out=tseries;
      by store;
      id timestamp interval=dtmonth
                   accumulate=median
                   setmiss=0
                   start='01jan1998:00:00:00'dt
                   end  ='31dec2000:00:00:00'dt;
      var _NUMERIC_;
   run;
The monthly time series data are stored in the data set WORK.TSERIES, and the time ID values use a SAS datetime representation.
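If the transactions carry datetime stamps but a date-valued time ID is preferred, one alternative is to convert the values before accumulation. The following is a minimal sketch, assuming the data set WORK.RETAIL and the variable TIMESTAMP used in the preceding statements; the output data set RETAIL_DATES and the variable DATE are hypothetical names introduced for illustration.

```sas
data retail_dates;
   set work.retail;
   /* DATEPART extracts the SAS date value from a SAS datetime value */
   date = datepart(timestamp);
   format date date9.;
run;
```

After this step, DATE could be used in the ID statement with a date interval such as INTERVAL=MONTH and date literals such as '01jan1998'd in the START= and END= options.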
    5    3    3
    6    8    6
    7    9    3
    8    3    8
    9   10    .
   10   11    .
   ;
   run;
The following statements perform similarity analysis on the example data set:
   ods graphics on;
   proc similarity data=test out=_null_ print=all plots=all;
      input x;
      target y / measure=absdev;
   run;
The DATA=TEST option specifies that the input data set WORK.TEST is to be used in the analysis. The OUT=_NULL_ option specifies that no output time series data set is to be created. The PRINT=ALL and PLOTS=ALL options specify that all ODS tables and graphs are to be produced. The INPUT statement specifies that the input variable is X. The TARGET statement specifies that the target variable is Y and that the similarity measure is computed using absolute deviation (MEASURE=ABSDEV).
Output 22.2.1 Descriptive Statistics of the Input Variable, x

   The SIMILARITY Procedure
   Time Series Descriptive Statistics

   Variable                                 x
   Number of Observations                  10
   Number of Missing Observations           2
   Minimum                                  3
   Maximum                                  8
   Mean                                  4.25
   Standard Deviation                1.908627
   Applied                  9  7

   Path Statistics
   Number                   0       6       4       2       6
   Maximum                  0       2       1       2       2
   Target Maximum Percent   0.000%  20.00%  10.00%  20.00%  20.00%

   Cost Statistics
   Number                   12         12
   Minimum                  0          0
   Input Mean               1.875000   0.282305
   Target Mean              1.500000   0.225844
   Minimum Path Mean        1.875000   0.282305
   Maximum Path Mean        0.8823529  0.1328495
The following statements repeat the above similarity analysis on the example data set with warping limits:
   ods graphics on;
   proc similarity data=test out=_null_ print=all plots=all;
      input x;
      target y / measure=absdev
                 compress=(localabs=2)
                 expand=(localabs=2);
   run;
The COMPRESS=(LOCALABS=2) option limits local absolute compression to 2. The EXPAND=(LOCALABS=2) option limits local absolute expansion to 2.
Applied 2 2
The following statements repeat the above similarity analysis on the example data set but store the results in output data sets:
   proc similarity data=test out=series
                   outsequence=sequences
                   outpath=path
                   outsum=summary;
      input x;
      target y / measure=absdev
                 compress=(localabs=2)
                 expand=(localabs=2);
   run;
The OUT=SERIES, OUTSEQUENCE=SEQUENCES, OUTPATH=PATH, and OUTSUM=SUMMARY options specify that the output time series, time sequences, path analysis, and summary data sets be created, respectively.
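One way to see what each of these data sets contains is to list its variables and print the summary. The following is a minimal sketch that assumes the preceding PROC SIMILARITY step has already run and created WORK.SEQUENCES, WORK.PATH, and WORK.SUMMARY; it uses only standard Base SAS procedures.

```sas
/* list the variables stored in the sequence and path data sets */
proc contents data=sequences;
run;

proc contents data=path;
run;

/* print the similarity measure summary */
proc print data=summary;
run;
```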
The goal of sliding similarity measures analysis is to find the slide index that corresponds to the most similar subsequence of the input series when compared to the target sequence. The following statements perform sliding similarity analysis on the example data set:
   proc similarity data=workers out=_NULL_ print=(slides summary);
      id date interval=month;
      input electric;
      target masonry / slide=index measure=msqrdev
                       expand=(localabs=3 globalabs=3)
                       compress=(localabs=3 globalabs=3);
   run;
The DATA=WORKERS option specifies that the input data set WORK.WORKERS is to be used in the analysis. The OUT=_NULL_ option specifies that no output time series data set is to be created. The PRINT=(SLIDES SUMMARY) option specifies that the ODS tables related to the sliding similarity measures and their summary are produced. The INPUT statement specifies that the input variable is ELECTRIC. The TARGET statement specifies that the target variable is MASONRY and that the similarity measure is computed using mean squared deviation (MEASURE=MSQRDEV). The SLIDE=INDEX option specifies observation index sliding. The COMPRESS=(LOCALABS=3 GLOBALABS=3) option limits local and global absolute compression to 3. The EXPAND=(LOCALABS=3 GLOBALABS=3) option limits local and global absolute expansion to 3.
   Slide Index   DATE      Slide Minimum Measure
        0        JAN1977        497.6737
        1        FEB1977        482.6777
        2        MAR1977        474.1251
        3        APR1977        490.7792
        4        MAY1977        533.0788
        5        JUN1977        605.8198
        6        JUL1977        701.7138
        7        AUG1977        646.5918
        8        SEP1977        616.3258
        9        OCT1977        510.9836
       10        NOV1977        382.1434
       11        DEC1977        340.4702
       12        JAN1978        327.0572
       13        FEB1978        322.5460
       14        MAR1978        325.2689
       15        APR1978        351.4161
       16        MAY1978        398.0490
       17        JUN1978        471.6931
       18        JUL1978        590.8089
       19        AUG1978        595.2538
       20        SEP1978        689.2233
       21        OCT1978        745.8891
       22        NOV1978        679.1907
MASONRY 322.5460
This analysis results in 23 slides based on the observation index, with the minimum measure (322.5460) occurring at slide index 13 (the 14th slide), which corresponds to the time value FEB1978. Note that the original data set SASHELP.WORKERS was modified beginning at the time value JAN1978. This similarity analysis supports the belief, based on time series cross-correlation analysis, that ELECTRIC lags MASONRY by one month, despite the lack of target data (MASONRY).
The goal of seasonal sliding similarity measures is to find the seasonal slide index that corresponds to the most similar seasonal subsequence of the input series when compared to the target sequence. The following statements repeat the above similarity analysis on the example data set with seasonal sliding:
   proc similarity data=workers out=_NULL_ print=(slides summary);
      id date interval=month;
      input electric;
      target masonry / slide=season measure=msqrdev;
   run;
Slide Index 0 12
MASONRY 641.9273
The analysis differs from the previous analysis in that the slides are performed based on the seasonal index (SLIDE=SEASON) with no warping. With a seasonality of 12, two seasonal slides are considered at slide indices 0 and 12, with the minimum measure (641.9273) occurring at slide index 12, which corresponds to the time value JAN1978. Note that the original data set SASHELP.WORKERS was modified beginning at the time value JAN1978. This similarity analysis supports the belief, based on seasonal decomposition analysis, that ELECTRIC and MASONRY have similar seasonal properties, despite the lack of target data (MASONRY).
set that contains two time series of differing lengths, where the variable HISTORY represents the historical activity and RECENT represents the more recent activity.
   data timedata;
      set sashelp.timedata;
      drop volume;
      recent  = .;
      history = volume;
      /* observations on or after 20AUG2000 form the recent series */
      if datetime >= '20AUG2000:00:00:00'DT then do;
         recent  = volume;
         history = .;
      end;
   run;
The goal of seasonal sliding similarity measures is to find the seasonal slide index that corresponds to the most similar seasonal subsequence of the input series when compared to the target sequence. The following statements perform similarity analysis on the example data set with seasonal sliding:
   proc similarity data=timedata out=_NULL_
                   outsequence=sequences
                   outsum=summary;
      id datetime interval=dtday accumulate=total
                  start='27JUL1997:00:00:00'dt
                  end='21OCT2000:11:59:59'DT;
      input history / normalize=absolute;
      target recent / slide=season
                      normalize=absolute
                      measure=mabsdev;
   run;
The DATA=TIMEDATA option specifies that the input data set WORK.TIMEDATA is to be used in the analysis. The OUT=_NULL_ option specifies that no output time series data set is to be created. The OUTSEQUENCE=SEQUENCES and OUTSUM=SUMMARY options specify the output sequences and summary data sets, respectively. The ID statement specifies that the time ID variable is DATETIME, which is to be accumulated on a daily basis (INTERVAL=DTDAY) by summing the transactions (ACCUMULATE=TOTAL). The ID statement also specifies that the data are accumulated on the weekly boundaries starting on the week of 27JUL1997 and ending on the week of 15OCT2000 (START='27JUL1997:00:00:00'DT END='21OCT2000:11:59:59'DT). The INPUT statement specifies that the input variable is HISTORY, which is to be normalized by using absolute normalization (NORMALIZE=ABSOLUTE). The TARGET statement specifies that the target variable is RECENT, which is to be normalized by using absolute normalization (NORMALIZE=ABSOLUTE), and that the similarity measure is computed by using mean absolute deviation (MEASURE=MABSDEV). The SLIDE=SEASON option specifies season index sliding. To illustrate the results of the similarity analysis, the output sequence data set must be subset by using the output summary data set.
   data _NULL_;
      set summary;
      call symput('MEASURE', left(trim(putn(recent,'BEST20.'))));
   run;

   data result;
      set sequences;
      by _SLIDE_;
      retain flag 0;
      if first._SLIDE_ then do;
         /* keep only the slide whose measure matches the summary value */
         if (&measure - 0.00001 < _SIM_ < &measure + 0.00001) then flag = 1;
         else flag = 0;
      end;
      if flag then output;
   run;
The cross series plot illustrates that the historical time series analogy most similar to the most recent time series data that started on 20AUG2000 occurred on 02AUG1998.
Output 22.4.1 Cross Series Plot of the Historical Time Series
Chapter 23. The SIMLIN Procedure
and, if input data are given, uses the reduced form equations to generate predicted values. PROC SIMLIN is especially useful when dealing with sets of structural difference equations. The SIMLIN procedure can perform simulation or forecasting of the endogenous variables. The SIMLIN procedure can be applied only to models that are:
   - linear with respect to the parameters
   - linear with respect to the variables
   - square (as many equations as endogenous variables)
   - nonsingular (the coefficients of the endogenous variables form an invertible matrix)
If the model contains lagged endogenous variables, you must also use a LAGGED statement to tell PROC SIMLIN which variables contain lagged values, which endogenous variables they are lags of, and the number of periods of lagging. For dynamic models, the TOTAL and INTERIM= options can be used in the PROC SIMLIN statement to compute and print total and impact multipliers. (See "Dynamic Multipliers" later in this section for an explanation of multipliers.) In the following example, the variables Y1LAG1, Y2LAG1, and Y2LAG2 contain lagged values of the endogenous variables Y1 and Y2. Y1LAG1 and Y2LAG1 contain values of Y1 and Y2 for the previous observation, while Y2LAG2 contains two-period lags of Y2. The LAGGED statement specifies the lagged relationships, and the TOTAL and INTERIM= options request multiplier analysis. The INTERIM=2 option prints matrices showing the impact that changes to the exogenous variables have on the endogenous variables after 1 and 2 periods.
   proc syslin data=in outest=e;
      model y1 = y2 y1lag1 y2lag2 x1;
      model y2 = y1 y2lag1 x2;
   run;

   proc simlin est=e total interim=2;
      endogenous y1 y2;
      exogenous x1 x2;
      lagged y1lag1 y1 1 y2lag1 y2 1 y2lag2 y2 2;
   run;
After the reduced form of the model is computed, the model can be simulated by specifying an input data set in the PROC SIMLIN statement and using an OUTPUT statement to write the simulation results to an output data set. The following example modifies the PROC SIMLIN step from the preceding example to simulate the model and store the results in an output data set.
   proc simlin est=e total interim=2 data=in;
      endogenous y1 y2;
      exogenous x1 x2;
      lagged y1lag1 y1 1 y2lag1 y2 1 y2lag2 y2 2;
      output out=sim predicted=y1hat y2hat
                     residual=y1resid y2resid;
   run;
Functional Summary
The statements and options controlling the SIMLIN procedure are summarized in the following table.

   Description                                                             Statement     Option

   Data Set Options
   specify input data set containing structural coefficients                PROC SIMLIN   EST=
   specify type of estimates read from EST= data set                        PROC SIMLIN   TYPE=
   write reduced form coefficients and multipliers to an output data set    PROC SIMLIN   OUTEST=
   specify the input data set for simulation                                PROC SIMLIN   DATA=
   write predicted and residual values to an output data set                OUTPUT

   Printing Control Options
   print the structural coefficients                                        PROC SIMLIN   ESTPRINT
   suppress printing of reduced form coefficients                           PROC SIMLIN   NORED
   suppress all printed output                                              PROC SIMLIN   NOPRINT

   Dynamic Multipliers
   compute interim multipliers                                              PROC SIMLIN   INTERIM=
   compute total multipliers                                                PROC SIMLIN   TOTAL

   Declaring the Role of Variables
   specify BY-group processing                                              BY
   specify the endogenous variables                                         ENDOGENOUS
   specify the exogenous variables                                          EXOGENOUS
   specify identifying variables                                            ID
   specify lagged endogenous variables                                      LAGGED

   Controlling the Simulation
   specify the starting observation for dynamic simulation                  PROC SIMLIN   START=
DATA= SAS-data-set
specifies the SAS data set containing input data for the simulation. If the DATA= option is used, the data set specified must supply values for all exogenous variables throughout the simulation. If the DATA= option is not specified, no simulation of the system is performed, and only the reduced form and multipliers are computed.
EST= SAS-data-set
specifies the input data set containing the structural coefficients of the system. If EST= is omitted, the most recently created SAS data set is used. The EST= data set is normally a "TYPE=EST" data set produced by the OUTEST= option of PROC SYSLIN. However, you can also build the EST= data set with a SAS DATA step. See "The EST= Data Set" later in this chapter for details.
ESTPRINT
prints the structural coefficients read from the EST= data set.
INTERIM= n
requests that interim multipliers be computed for interims 1 through n. If not specified, no interim multipliers are computed. This feature is available only if there are no lags greater than 1.
NOPRINT
suppresses all printed output.
NORED
suppresses the printing of the reduced form coefficients.
OUTEST= SAS-data-set
specifies an output SAS data set to contain the reduced form coefficients and multipliers, in addition to the structural coefficients read from the EST= data set. The OUTEST= data set has the same form as the EST= data set. If the OUTEST= option is not specified, the reduced form coefficients and multipliers are not written to a data set.
START= n
specifies the observation number in the DATA= data set where the dynamic simulation is to be started. By default, the dynamic simulation starts with the first observation in the DATA= data set for which all variables (including lags) are not missing.
TOTAL
requests that the total multipliers be computed. This feature is available only if there are no lags greater than 1.
TYPE= value
specifies the type of estimates to be read from the EST= data set. The TYPE= value must match the value of the _TYPE_ variable for the observations that you want to select from the EST= data set (TYPE=2SLS, for example).
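As an illustration of how the TYPE= option pairs with the estimation step, the following is a minimal sketch that estimates a two-equation system by two-stage least squares and then selects those estimates in PROC SIMLIN. The data set and variable names (IN, E, Y1, Y2, X1, X2, Y1LAG1, Y2LAG1) follow the earlier example in this chapter; the choice of instruments is an assumption made only for illustration.

```sas
/* estimate the structural equations by two-stage least squares */
proc syslin data=in 2sls outest=e;
   endogenous y1 y2;
   instruments x1 x2 y1lag1 y2lag1;
   model y1 = y2 y1lag1 x1;
   model y2 = y1 y2lag1 x2;
run;

/* read only the observations with _TYPE_ equal to 2SLS from the EST= data set */
proc simlin est=e type=2sls;
   endogenous y1 y2;
   exogenous x1 x2;
   lagged y1lag1 y1 1 y2lag1 y2 1;
run;
```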
BY Statement
BY variables ;
A BY statement can be used with PROC SIMLIN to obtain separate analyses for groups of observations defined by the BY variables. The BY statement can be applied to one or both of the EST= and the DATA= input data sets. When a BY statement is used and both an EST= and a DATA= input data set are specified, PROC SIMLIN checks to see if one or both of the data sets contain the BY variables. Thus, there are three ways of using the BY statement with PROC SIMLIN:
1. If the BY variables are found in the EST= data set only, PROC SIMLIN simulates over the entire DATA= data set once for each set of coefficients read from the BY groups in the EST= data set.
2. If the BY variables are found in the DATA= data set only, PROC SIMLIN performs separate simulations over each BY group in the DATA= data set, using the single set of coefficients in the EST= data set.
3. If the BY variables are found in both the EST= and the DATA= data sets, PROC SIMLIN performs separate simulations over each BY group in the DATA= data set using the coefficients from the corresponding BY group in the EST= data set.
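The second case can be sketched as follows. This is only an illustration: REGION is a hypothetical BY variable assumed to exist in the DATA= data set IN but not in the EST= data set E, and the remaining names follow the earlier example in this chapter.

```sas
/* BY-group processing expects the DATA= data set to be sorted by the BY variable */
proc sort data=in;
   by region;
run;

/* one simulation per REGION, all using the single set of coefficients in E */
proc simlin est=e data=in;
   by region;
   endogenous y1 y2;
   exogenous x1 x2;
   lagged y1lag1 y1 1 y2lag1 y2 1 y2lag2 y2 2;
   output out=sim predicted=y1hat y2hat;
run;
```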
ENDOGENOUS Statement
ENDOGENOUS variables ;
List the names of the endogenous (jointly dependent) variables in the ENDOGENOUS statement. The ENDOGENOUS statement can be abbreviated as ENDOG or ENDO.
EXOGENOUS Statement
EXOGENOUS variables ;
List the names of the exogenous (independent) variables in the EXOGENOUS statement. The EXOGENOUS statement can be abbreviated as EXOG or EXO.
ID Statement
ID variables ;
The ID statement can be used to restrict the variables copied from the DATA= data set to the OUT= data set. Use the ID statement to list the variables you want copied to the OUT= data set besides the exogenous, endogenous, lagged endogenous, and BY variables. If the ID statement is omitted, all the variables in the DATA= data set are copied to the OUT= data set.
LAGGED Statement
LAGGED lag-var endogenous-var number . . . ;
For each lagged endogenous variable, specify the name of the lagged variable, the name of the endogenous variable that was lagged, and the degree of the lag. Only one LAGGED statement is allowed. The following is an example of the use of the LAGGED statement:
   proc simlin est=e;
      endog y1 y2;
      lagged y1lag1 y1 1   y2lag1 y2 1   y2lag3 y2 3;
   run;
This statement specifies that the variable Y1LAG1 contains the values of the endogenous variable Y1 lagged one period; the variable Y2LAG1 refers to the values of Y2 lagged one period; and the variable Y2LAG3 refers to the values of Y2 lagged three periods.
OUTPUT Statement
OUTPUT OUT= SAS-data-set options ;
The OUTPUT statement specifies that predicted and residual values be put in an output data set. A DATA= input data set must be supplied if the OUTPUT statement is used, and only one OUTPUT statement is allowed. The following options can be used in the OUTPUT statement:
OUT= SAS-data-set
names the output SAS data set to contain the predicted values and residuals. If OUT= is not specified, the output data set is named using the DATAn convention.
PREDICTED= names
P= names
names the variables in the output data set that contain the predicted values of the simulation. These variables correspond to the endogenous variables in the order in which they are specified in the ENDOGENOUS statement. Specify up to as many names as there are endogenous variables. If you specify names in the PREDICTED= option for only some of the endogenous variables, predicted values for the remaining variables are not output. The names must not match any variable name in the input data set.
RESIDUAL= names
R= names
names the variables in the output data set that contain the residual values from the simulation. The residuals are the differences between the actual values of the endogenous variables from the DATA= data set and the predicted values from the simulation. These variables correspond to the endogenous variables in the order in which they are specified in the ENDOGENOUS statement. Specify up to as many names as there are endogenous variables. The names must not match any variable name in the input data set. The following is an example of the use of the OUTPUT statement. This example outputs predicted values for Y1 and Y2 and outputs residuals for Y1.
proc simlin est=e; endog y1 y2; output out=b predicted=y1hat y2hat residual=y1resid; run;
$$y_t = G^{-1} C y_t^L + G^{-1} B x_t$$

The reduced form matrices $\Pi_1 = G^{-1}C$ and $\Pi_2 = G^{-1}B$ are printed unless the NORED option is specified in the PROC SIMLIN statement. The structural coefficient matrices G, C, and B are printed when the ESTPRINT option is specified.
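The reduced form computation can be sketched numerically. The following Python fragment is only an illustration under stated assumptions: the 2x2 matrix G, the matrices C and B, and the helper names are hypothetical and are not taken from PROC SIMLIN or from any example in this chapter.

```python
from fractions import Fraction

def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def inverse_2x2(m):
    """Invert a 2x2 matrix; a reduced form exists only if G is nonsingular."""
    (p, q), (r, s) = m
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s / det, -q / det], [-r / det, p / det]]

# Hypothetical structural matrices: two endogenous variables,
# one lagged endogenous variable, and one exogenous variable.
G = [[Fraction(1), Fraction(-1, 2)],
     [Fraction(0), Fraction(1)]]
C = [[Fraction(1, 4)], [Fraction(1, 2)]]
B = [[Fraction(2)], [Fraction(1)]]

G_inv = inverse_2x2(G)
pi1 = mat_mul(G_inv, C)  # reduced form for lagged endogenous variables
pi2 = mat_mul(G_inv, B)  # reduced form for exogenous variables
```

Exact fractions are used so that the products can be checked by hand; a real model would use floating-point arithmetic and a general matrix inverse.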
Dynamic Multipliers
For models that have only first-order lags, the equation of the reduced form of the system can be rewritten

$$y_t = D y_{t-1} + \Pi_2 x_t$$

D is a matrix formed from the columns of $\Pi_1$ plus some columns of zeros, arranged in the order in which the variables meet the lags. The elements of $\Pi_2$ are called impact multipliers because they show the immediate effect of changes in each exogenous variable on the values of the endogenous variables. This equation can be rewritten as

$$y_t = D^2 y_{t-2} + D \Pi_2 x_{t-1} + \Pi_2 x_t$$

The matrix formed by the product $D \Pi_2$ shows the effect of the exogenous variables one lag back; the elements in this matrix are called interim multipliers and are computed and printed when the INTERIM= option is specified in the PROC SIMLIN statement. The ith period interim multipliers are formed by $D^i \Pi_2$. The series can be expanded as

$$y_t = D^{\infty} y_{t-\infty} + \sum_{i=0}^{\infty} D^i \Pi_2 x_{t-i}$$

A permanent and constant setting of a value for x has the following cumulative effect:

$$\left( \sum_{i=0}^{\infty} D^i \right) \Pi_2 x = (I - D)^{-1} \Pi_2 x$$

The elements of $(I - D)^{-1} \Pi_2$ are called the total multipliers. Assuming that the sum converges and that $(I - D)$ is invertible, PROC SIMLIN computes the total multipliers when the TOTAL option is specified in the PROC SIMLIN statement.
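The multiplier definitions above can be illustrated with a small numerical sketch. The matrices D and pi2 below are hypothetical, the helper names are invented for this illustration, and the code is restricted to a 2x2 D; it is not a description of how PROC SIMLIN performs the computation.

```python
from fractions import Fraction

def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def inverse_2x2(m):
    """Invert a 2x2 matrix."""
    (p, q), (r, s) = m
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s / det, -q / det], [-r / det, p / det]]

def interim_multipliers(D, pi2, n):
    """Return the list [D^1 * pi2, ..., D^n * pi2]."""
    result = []
    current = pi2
    for _ in range(n):
        current = mat_mul(D, current)
        result.append(current)
    return result

def total_multipliers(D, pi2):
    """Return (I - D)^(-1) * pi2 for a 2x2 matrix D."""
    i_minus_d = [[(1 if i == j else 0) - D[i][j] for j in range(2)]
                 for i in range(2)]
    return mat_mul(inverse_2x2(i_minus_d), pi2)

# Hypothetical first-order system with one exogenous variable.
D = [[Fraction(1, 2), Fraction(0)],
     [Fraction(1, 4), Fraction(1, 4)]]
pi2 = [[Fraction(1)], [Fraction(2)]]  # impact multipliers

interim = interim_multipliers(D, pi2, 2)
total = total_multipliers(D, pi2)
```

Because the eigenvalues of this D are inside the unit circle, the partial sums of the impact and interim multipliers converge to the total multipliers.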
$$y_t = a y_{t-3} + b x_t$$

This can be converted to a first-order three-equation system by introducing two additional endogenous variables, $y_{1,t}$ and $y_{2,t}$, and computing corresponding first-order lagged variables for each endogenous variable: $y_{t-1}$, $y_{1,t-1}$, and $y_{2,t-1}$. The higher order lag relations are then produced by adding identities to link the endogenous and identical lagged endogenous variables:

$$y_{1,t} = y_{t-1}$$
$$y_{2,t} = y_{1,t-1}$$
$$y_t = a y_{2,t-1} + b x_t$$
This conversion using the SYSLIN and SIMLIN procedures requires three steps:
1. Add the extra endogenous and lagged endogenous variables to the input data set using a DATA step. Note that two copies of each lagged endogenous variable are needed for each lag reduced, one to serve as an endogenous variable and one to serve as a lagged endogenous variable in the reduced system.
2. Add IDENTITY statements to the PROC SYSLIN step to equate each added endogenous variable to its lagged endogenous variable copy.
3. In the PROC SIMLIN step, declare the added endogenous variables in the ENDOGENOUS statement and define the lag relations in the LAGGED statement.

See Example 23.2 for an illustration of how to convert an equation system with higher-order lags into a larger system with only first-order lags.
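The equivalence between a third-order equation and its first-order three-equation form can be checked with a short simulation. This is a sketch only: the coefficient values, the input series, and the function names are hypothetical, and the code mimics the algebra of the conversion rather than the SYSLIN and SIMLIN steps themselves.

```python
def simulate_direct(a, b, x, y_init):
    """Simulate y_t = a*y_{t-3} + b*x_t; y_init holds the three values before the sample."""
    history = list(y_init)
    out = []
    for xt in x:
        yt = a * history[-3] + b * xt
        history.append(yt)
        out.append(yt)
    return out

def simulate_first_order(a, b, x, y_init):
    """Simulate the same equation by using only first-order lags.

    The added variables satisfy y1_t = y_{t-1} and y2_t = y1_{t-1},
    so that y_t = a*y2_{t-1} + b*x_t.
    """
    y_prev, y1_prev, y2_prev = y_init[2], y_init[1], y_init[0]
    out = []
    for xt in x:
        y = a * y2_prev + b * xt
        y1 = y_prev    # identity linking y1 to lagged y
        y2 = y1_prev   # identity linking y2 to lagged y1
        y_prev, y1_prev, y2_prev = y, y1, y2
        out.append(y)
    return out

# Hypothetical coefficients and data.
direct = simulate_direct(0.5, 2, [1, 2, 3, 4, 5], [1, 2, 3])
first_order = simulate_first_order(0.5, 2, [1, 2, 3, 4, 5], [1, 2, 3])
```

Both functions produce the same simulated path, which is why multipliers computed for the enlarged first-order system apply to the original higher-order equation.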
REDUCED
the reduced form coefficients. The endogenous variables for this group of observations contain the inverse of the endogenous coefficient matrix G. The lagged endogenous variables contain the matrix $\Pi_1 = G^{-1}C$. The exogenous variables contain the matrix $\Pi_2 = G^{-1}B$.
IMULTi
the interim multipliers, if the INTERIM= option is specified. There are gn observations for the interim multipliers, where g is the number of endogenous variables and n is the value of the INTERIM=n option. For these observations the _TYPE_ variable has the value IMULTi, where the interim number i ranges from 1 to n. The exogenous variables in groups of g observations that have a _TYPE_ value of IMULTi contain the matrix $D^i \Pi_2$ of multipliers at interim i. The endogenous and lagged endogenous variables for this group of observations are set to missing.
TOTAL
the total multipliers, if the TOTAL option is specified. The exogenous variables in this group of observations contain the matrix $(I - D)^{-1} \Pi_2$. The endogenous and lagged endogenous variables for this group of observations are set to missing.
Printed Output
Structural Form
The following items are printed as they are read from the EST= input data set. Structural zeros are printed as dots in the listing of these matrices.

1. Structural Coefficients for Endogenous Variables. This is the G matrix, with g rows and g columns.
2. Structural Coefficients for Lagged Endogenous Variables. These coefficients make up the C matrix, with g rows and l columns.
3. Structural Coefficients for Exogenous Variables. These coefficients make up the B matrix, with g rows and k columns.
Reduced Form
1. The reduced form coefficients are obtained by inverting G so that the endogenous variables can be directly expressed as functions of only lagged endogenous and exogenous variables.
2. Inverse Coefficient Matrix for Endogenous Variables. This is the inverse of the G matrix.
3. Reduced Form for Lagged Endogenous Variables. This is $\Pi_1 = G^{-1}C$, with g rows and l columns. Each value is a dynamic multiplier that shows how past values of lagged endogenous variables affect values of each of the endogenous variables.
4. Reduced Form for Exogenous Variables. This is $\Pi_2 = G^{-1}B$, with g rows and k columns. Its values are called impact multipliers because they show the immediate effect of each exogenous variable on the value of the endogenous variables.
Multipliers
Interim and total multipliers show the effect of a change in an exogenous variable over time.

1. Interim Multipliers. These are the interim multiplier matrices. They are formed by multiplying $\Pi_2$ by powers of D. The dth interim multiplier is $D^d \Pi_2$. The interim multiplier of order d shows the effects of a change in the exogenous variables after d periods. Interim multipliers are only available if the maximum lag of the endogenous variables is 1.
2. Total Multipliers. This is the matrix of total multipliers, $T = (I - D)^{-1} \Pi_2$. This matrix shows the cumulative effect of changes in the exogenous variables. Total multipliers are only available if the maximum lag is one.
Statistics of Fit
If the DATA= option is used and the DATA= data set contains endogenous variables, PROC SIMLIN prints a statistics-of-fit report for the simulation. The statistics printed include the following. (Summations are over the observations for which both $y_t$ and $\hat{y}_t$ are nonmissing.)

1. the number of nonmissing errors. (Number of observations for which both $y_t$ and $\hat{y}_t$ are nonmissing.)
2. the mean error: $\frac{1}{n} \sum (\hat{y}_t - y_t)$
3. the mean percent error: $\frac{100}{n} \sum \frac{\hat{y}_t - y_t}{y_t}$
4. the mean absolute error: $\frac{1}{n} \sum |\hat{y}_t - y_t|$
5. the mean absolute percent error: $\frac{100}{n} \sum \frac{|\hat{y}_t - y_t|}{y_t}$
6. the root mean square error: $\sqrt{\frac{1}{n} \sum (\hat{y}_t - y_t)^2}$
7. the root mean square percent error: $100 \sqrt{\frac{1}{n} \sum \left( \frac{\hat{y}_t - y_t}{y_t} \right)^2}$
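The statistics of fit can be sketched in a few lines of Python. This is an illustration under assumptions, not the procedure's implementation: the error is taken as predicted minus actual, missing values are represented by None, the function name is hypothetical, and no protection is included for actual values equal to zero.

```python
import math

def fit_statistics(actual, predicted):
    """Compute simulation fit statistics over pairs where both values are nonmissing."""
    pairs = [(y, p) for y, p in zip(actual, predicted)
             if y is not None and p is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no observations with both values nonmissing")
    errors = [p - y for y, p in pairs]        # assumed sign: predicted minus actual
    ratios = [(p - y) / y for y, p in pairs]  # fails if an actual value is zero
    return {
        "n": n,
        "mean_error": sum(errors) / n,
        "mean_pct_error": 100 * sum(ratios) / n,
        "mean_abs_error": sum(abs(e) for e in errors) / n,
        "mean_abs_pct_error": 100 * sum(abs(p - y) / y for y, p in pairs) / n,
        "rms_error": math.sqrt(sum(e * e for e in errors) / n),
        "rms_pct_error": 100 * math.sqrt(sum(r * r for r in ratios) / n),
    }

# Hypothetical data: the third and fourth pairs are dropped because of missing values.
stats = fit_statistics([10, 20, None, 40], [11, 18, 30, None])
```

Only the first two pairs enter the statistics here, which mirrors the rule that summations run over observations where both the actual and the predicted value are present.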
ODS Table Name     Description                                               Option
Endogenous         Structural Coefficients for Endogenous Variables          default
LaggedEndogenous   Structural Coefficients for Lagged Endogenous Variables   default
Exogenous          Structural Coefficients for Exogenous Variables           default
InverseCoeff       Inverse Coefficient Matrix for Endogenous Variables       default
RedFormLagEndo     Reduced Form for Lagged Endogenous Variables              default
RedFormExog        Reduced Form for Exogenous Variables                      default
InterimMult        Interim Multipliers                                       INTERIM= option
TotalMult          Total Multipliers                                         TOTAL option
FitStatistics      Fit statistics                                            default
data klein;
   input year c p w i x wp g t k wsum;
   date=mdy(1,1,year);
   format date year.;
   y    = c + i + g - t;
   yr   = year - 1931;
   klag = lag( k );
   plag = lag( p );
   xlag = lag( x );
   if year >= 1921;
   label c    ='consumption'
         p    ='profits'
         w    ='private wage bill'
         i    ='investment'
         k    ='capital stock'
         y    ='national income'
         x    ='private production'
         wsum ='total wage bill'
         wp   ='govt wage bill'
         g    ='govt demand'
         t    ='taxes'
         klag ='capital stock lagged'
         plag ='profits lagged'
         xlag ='private product lagged'
         yr   ='year-1931';
   datalines;
1920     .  12.7     .     .  44.9     .     .     .  182.8     .
   ... more lines ...
First, the model is specified and estimated using the SYSLIN procedure, and the parameter estimates are written to an OUTEST= data set. The printed output produced by the SYSLIN procedure is not shown here; see Example 26.1 in Chapter 26 for the printed output of the PROC SYSLIN step.

title1 "Simulation of Klein's Model I using SIMLIN";
proc syslin 3sls data=klein outest=a;
   instruments klag plag xlag wp g t yr;
   endogenous c p w i x wsum k y;
   consume: model    c = p plag wsum;
   invest:  model    i = p plag klag;
   labor:   model    w = x xlag yr;
   product: identity x = c + i + g;
   income:  identity y = c + i + g - t;
   profit:  identity p = x - w - t;
   stock:   identity k = klag + i;
   wage:    identity wsum = w + wp;
run;
The OUTEST= data set A created by the SYSLIN procedure contains parameter estimates to be used by the SIMLIN procedure. The OUTEST= data set is shown in Output 23.1.1.
Obs  _TYPE_    _STATUS_     _MODEL_  _DEPVAR_  _SIGMA_      klag      plag      xlag
  1  INST      0 Converged  FIRST    c         2.11403  -0.14654   0.74803   0.23007
  2  INST      0 Converged  FIRST    p         2.18298  -0.21610   0.80250   0.02200
  3  INST      0 Converged  FIRST    w         1.75427  -0.12295   0.87192   0.09533
  4  INST      0 Converged  FIRST    i         1.72376  -0.19251   0.92639  -0.11274
  5  INST      0 Converged  FIRST    x         3.77347  -0.33906   1.67442   0.11733
  6  INST      0 Converged  FIRST    wsum      1.75427  -0.12295   0.87192   0.09533
  7  INST      0 Converged  FIRST    k         1.72376   0.80749   0.92639  -0.11274
  8  INST      0 Converged  FIRST    y         3.77347  -0.33906   1.67442   0.11733
  9  3SLS      0 Converged  CONSUME  c         1.04956         .   0.16314         .
 10  3SLS      0 Converged  INVEST   i         1.60796  -0.19485   0.75572         .
 11  3SLS      0 Converged  LABOR    w         0.80149         .         .   0.18129
 12  IDENTITY  0 Converged  PRODUCT  x               .         .         .         .
 13  IDENTITY  0 Converged  INCOME   y               .         .         .         .
 14  IDENTITY  0 Converged  PROFIT   p               .         .         .         .
 15  IDENTITY  0 Converged  STOCK    k               .   1.00000         .         .
 16  IDENTITY  0 Converged  WAGE     wsum            .         .         .         .

Obs        g         t       yr   c         p   w   i         x      wsum   k   y
  1  0.20501  -0.36573  0.70109  -1         .   .   .         .         .   .   .
  2  0.43902  -0.92310  0.31941   .  -1.00000   .   .         .         .   .   .
  3  0.86622  -0.60415  0.71358   .         .  -1   .         .         .   .   .
  4  0.10023  -0.16152  0.33190   .         .   .  -1         .         .   .   .
  5  1.30524  -0.52725  1.03299   .         .   .   .  -1.00000         .   .   .
  6  0.86622  -0.60415  0.71358   .         .   .   .         .  -1.00000   .   .
  7  0.10023  -0.16152  0.33190   .         .   .   .         .         .  -1   .
  8  1.30524  -1.52725  1.03299   .         .   .   .         .         .   .  -1
  9        .         .        .  -1   0.12489   .   .         .   0.79008   .   .
 10        .         .        .   .  -0.01308   .  -1         .         .   .   .
 11        .         .  0.14967   .         .  -1   .   0.40049         .   .   .
 12  1.00000         .        .   1         .   .   1  -1.00000         .   .   .
 13  1.00000  -1.00000        .   1         .   .   1         .         .   .  -1
 14        .  -1.00000        .   .  -1.00000  -1   .   1.00000         .   .   .
 15        .         .        .   .         .   .   1         .         .  -1   .
 16        .         .        .   .         .   1   .         .  -1.00000   .   .
Using the OUTEST= data set A produced by the SYSLIN procedure, the SIMLIN procedure can now compute the reduced form and simulate the model. The following statements perform the simulation.
title1 "Simulation of Klein's Model I using SIMLIN";
proc simlin data=klein est=a type=3sls
            estprint total interim=2 outest=b;
   endogenous c p w i x wsum k y;
   exogenous wp g t yr;
   lagged klag k 1 plag p 1 xlag x 1;
   id year;
   output out=c p=chat phat what ihat xhat wsumhat khat yhat
                r=cres pres wres ires xres wsumres kres yres;
run;

The reduced form coefficients and multipliers are added to the information read from EST= data set A and written to the OUTEST= data set B. The predicted and residual values from the simulation are written to the OUT= data set C specified in the OUTPUT statement. The SIMLIN procedure first prints the structural coefficient matrices read from the EST= data set, as shown in Output 23.1.2 through Output 23.1.4.
Output 23.1.2 SIMLIN Procedure Output Endogenous Structural Coefficients

Simulation of Klein's Model I using SIMLIN
The SIMLIN Procedure
Structural Coefficients for Endogenous Variables

Variable         c         p         w         i
c           1.0000   -0.1249         .         .
i                .    0.0131         .    1.0000
w                .         .    1.0000         .
x          -1.0000         .         .   -1.0000
y          -1.0000         .         .   -1.0000
p                .    1.0000    1.0000         .
k                .         .         .   -1.0000
wsum             .         .   -1.0000         .

Structural Coefficients for Endogenous Variables

Variable         x      wsum         k         y
c                .   -0.7901         .         .
i                .         .         .         .
w          -0.4005         .         .         .
x           1.0000         .         .         .
y                .         .         .    1.0000
p          -1.0000         .         .         .
k                .         .    1.0000         .
wsum             .    1.0000         .         .
The SIMLIN procedure then prints the inverse of the endogenous variables coefficient matrix, as shown in Output 23.1.5.

Inverse Coefficient Matrix for Endogenous Variables

Variable         y         p         k      wsum
c                0    0.1959         0    1.2915
p                0    1.1087         0    0.7682
w                0    0.0726         0    0.5132
i                0   -0.0145         0   -0.0100
x                0    0.1814         0    1.2815
wsum             0    0.0726         0    1.5132
k                0   -0.0145    1.0000   -0.0100
y           1.0000    0.1814         0    1.2815
The SIMLIN procedure next prints the reduced form coefficient matrices, as shown in Output 23.1.6.

Output 23.1.6 SIMLIN Procedure Output Reduced Form Coefficients

Reduced Form for Lagged Endogenous Variables

Variable      klag      plag        xlag
c          -0.1237    0.7463      0.1986
p          -0.1895    0.8935     -0.0617
w          -0.1266    0.5969      0.2612
i          -0.1924    0.7440    0.000807
x          -0.3160    1.4903      0.1994
wsum       -0.1266    0.5969      0.2612
k           0.8076    0.7440    0.000807
y          -0.3160    1.4903      0.1994
The multiplier matrices (requested by the INTERIM=2 and TOTAL options) are printed next, as shown in Output 23.1.7 and Output 23.1.8.
Output 23.1.7 SIMLIN Procedure Output Interim Multipliers
Interim Multipliers for Interim 1

Variable          wp          g           t          yr   Intercept
c           0.829130   1.049424   -0.865262   -.0054080    43.27442
p           0.609213   0.771077   -0.982167   -.0558215    28.39545
w           0.794488   1.005578   -0.710961   0.0125018    41.45124
i           0.574572   0.727231   -0.827867   -.0379117    26.57227
x           1.403702   1.776655   -1.693129   -.0433197    69.84670
wsum        0.794488   1.005578   -0.710961   0.0125018    41.45124
k           0.564524   0.714514   -0.813366   -.0372452    54.19068
y           1.403702   1.776655   -1.693129   -.0433197    69.84670

Interim Multipliers for Interim 2

Variable          wp          g           t          yr   Intercept
c           0.663671   0.840004   -0.968727   -.0456589    28.36428
p           0.350716   0.443899   -0.618929   -.0401446    10.79216
w           0.658769   0.833799   -0.925467   -.0399178    28.33114
i           0.345813   0.437694   -0.575669   -.0344035    10.75901
x           1.009485   1.277698   -1.544396   -.0800624    39.12330
wsum        0.658769   0.833799   -0.925467   -.0399178    28.33114
k           0.910337   1.152208   -1.389035   -.0716486    64.94969
y           1.009485   1.277698   -1.544396   -.0800624    39.12330

The last part of the SIMLIN procedure output is a table of statistics of fit for the simulation, as shown in Output 23.1.9.
Output 23.1.9 SIMLIN Procedure Output Simulation Statistics
Fit Statistics

                    Mean   Mean Pct   Mean Abs    Mean Abs       RMS    RMS Pct
Variable    N      Error      Error      Error   Pct Error     Error      Error
c          21     0.1367    -0.3827     3.5011     6.69769    4.3155     8.1701
p          21     0.1422    -4.0671     2.9355    19.61400    3.4257    26.0265
w          21     0.1282    -0.8939     3.1247     8.92110    4.0930    11.4709
i          21     0.1337   105.8529     2.4983   127.13736    2.9980   252.3497
x          21     0.2704    -0.9553     5.9622    10.40057    7.1881    12.5653
wsum       21     0.1282    -0.6669     3.1247     7.88988    4.0930    10.1724
k          21    -0.1424    -0.1506     3.8879     1.90614    5.0036     2.4209
y          21     0.2704    -1.3476     5.9622    11.74177    7.1881    14.2214
The OUTEST= output data set contains all the observations read from the EST= data set, and in addition contains observations for the reduced form and multiplier matrices. The following statements produce a partial listing of the OUTEST= data set, as shown in Output 23.1.10.
proc print data=b;
   where _type_ = 'REDUCED' | _type_ = 'IMULT1';
run;
Obs  _TYPE_   _DEPVAR_  _MODEL_  _SIGMA_         c         p         w         i   x      wsum   k         y
  9  REDUCED  c                        .   1.63465   0.63465   1.09566   0.63465   0   0.19585   0   1.29151
 10  REDUCED  p                        .   0.97236   0.97236  -0.34048   0.97236   0   1.10872   0   0.76825
 11  REDUCED  w                        .   0.64957   0.64957   1.44059   0.64957   0   0.07263   0   0.51321
 12  REDUCED  i                        .  -0.01272   0.98728   0.00445  -0.01272   0  -0.01450   0  -0.01005
 13  REDUCED  x                        .   1.62194   1.62194   1.10011   1.62194   0   0.18135   0   1.28146
 14  REDUCED  wsum                     .   0.64957   0.64957   1.44059   0.64957   0   0.07263   0   1.51321
 15  REDUCED  k                        .  -0.01272   0.98728   0.00445  -0.01272   0  -0.01450   1  -0.01005
 16  REDUCED  y                        .   1.62194   1.62194   1.10011   0.62194   1   0.18135   0   1.28146
 17  IMULT1   c                        .         .         .         .         .   .         .   .         .
 18  IMULT1   p                        .         .         .         .         .   .         .   .         .
 19  IMULT1   w                        .         .         .         .         .   .         .   .         .
 20  IMULT1   i                        .         .         .         .         .   .         .   .         .
 21  IMULT1   x                        .         .         .         .         .   .         .   .         .
 22  IMULT1   wsum                     .         .         .         .         .   .         .   .         .
 23  IMULT1   k                        .         .         .         .         .   .         .   .         .
 24  IMULT1   y                        .         .         .         .         .   .         .   .         .

Obs  Intercept        wp         g         t        yr
  9    46.7273   1.29151   0.63465  -0.19585   0.16399
 10    42.7736   0.76825   0.97236  -1.10872  -0.05096
 11    31.5721   0.51321   0.64957  -0.07263   0.21562
 12    27.6184  -0.01005  -0.01272   0.01450   0.00067
 13    74.3457   1.28146   1.62194  -0.18135   0.16466
 14    31.5721   1.51321   0.64957  -0.07263   0.21562
 15    27.6184  -0.01005  -0.01272   0.01450   0.00067
 16    74.3457   1.28146   1.62194  -1.18135   0.16466
 17    43.2744   0.82913   1.04942  -0.86526  -0.00541
 18    28.3955   0.60921   0.77108  -0.98217  -0.05582
 19    41.4512   0.79449   1.00558  -0.71096   0.01250
 20    26.5723   0.57457   0.72723  -0.82787  -0.03791
 21    69.8467   1.40370   1.77666  -1.69313  -0.04332
 22    41.4512   0.79449   1.00558  -0.71096   0.01250
 23    54.1907   0.56452   0.71451  -0.81337  -0.03725
 24    69.8467   1.40370   1.77666  -1.69313  -0.04332
The actual and predicted values for the variable C are plotted in Output 23.1.11.
title2 "Plots of Simulation Results";
proc sgplot data=c;
   scatter x=year y=c;
   series x=year y=chat / markers markerattrs=(symbol=plus);
   refline 1941.5 / axis=x;
run;
The REG procedure processes the input data and writes the parameter estimates to the OUTEST= data set A.
title2 "Estimated Parameters";
proc reg data=test outest=a;
   model y=ylag3 x;
run;

title2 "Listing of OUTEST= Data Set";
proc print data=a;
run;
Output 23.2.2 shows the printed output produced by the REG procedure, and Output 23.2.3 displays the OUTEST= data set A produced.
Output 23.2.2 Estimates and Fit Information from PROC REG
Simulate Equation with Third-Order Lags
Estimated Parameters

The REG Procedure
Model: MODEL1
Dependent Variable: y

Analysis of Variance

                           Sum of        Mean
Source             DF     Squares      Square    F Value    Pr > F
Model               2   173.98377    86.99189    1691.98    <.0001
Error              27     1.38818     0.05141
Corrected Total    29   175.37196

Root MSE           0.22675    R-Square    0.9921
Dependent Mean    13.05234    Adj R-Sq    0.9915
Coeff Var          1.73721

Parameter Estimates

Variable     DF
Intercept     1
ylag3         1
x             1
The SIMLIN procedure processes the TEST data set using the estimates from PROC REG. The following statements perform the simulation and write the results to the OUT= data set OUT1.

title2 "Simulation of Equation";
proc simlin est=a data=test nored;
   endogenous y;
   exogenous x;
   lagged ylag3 y 3;
   id n;
   output out=out1 predicted=yhat residual=yresid;
run;
The printed output from the SIMLIN procedure is shown in Output 23.2.4.
Output 23.2.4 Output Produced by PROC SIMLIN
Simulate Equation with Third-Order Lags
Simulation of Equation

The SIMLIN Procedure
Fit Statistics

                    Mean   Mean Pct   Mean Abs    Mean Abs       RMS    RMS Pct
Variable    N      Error      Error      Error   Pct Error     Error      Error
y          30    -0.0233    -0.2268     0.2662     2.05684    0.3408     2.6159
The following statements plot the actual and predicted values, as shown in Output 23.2.5.
title2 "Plots of Simulation Results";
proc sgplot data=out1;

Next, the input data set TEST is modified by creating two new variables, YLAG1X and YLAG2X, that are equal to YLAG1 and YLAG2. These variables are used in the SYSLIN procedure. (The estimates produced by PROC SYSLIN are the same as before and are not shown.) A listing of the OUTEST= data set B created by PROC SYSLIN is shown in Output 23.2.6.
data test2;
   set test;
   ylag1x=ylag1;
   ylag2x=ylag2;
run;

title2 "Estimation of parameters and definition of identities";
proc syslin data=test2 outest=b;
   endogenous y ylag1x ylag2x;
   model y=ylag3 x;
   identity ylag1x=ylag1;
   identity ylag2x=ylag2;
run;

title2 "Listing of OUTEST= data set from PROC SYSLIN";
proc print data=b;
run;
Output 23.2.6 Listing of OUTEST= Data Set Created from PROC SYSLIN
Simulate Equation with Third-Order Lags
Listing of OUTEST= data set from PROC SYSLIN

Obs  _TYPE_    _STATUS_     _MODEL_  _DEPVAR_  _SIGMA_  Intercept    ylag3         x  ylag1  ylag2   y  ylag1x  ylag2x
  1  OLS       0 Converged  y        y         0.22675    0.14239  0.77121  -1.77668      .      .  -1       .       .
  2  IDENTITY  0 Converged           ylag1x          .    0.00000        .         .      1      .   .      -1       .
  3  IDENTITY  0 Converged           ylag2x          .    0.00000        .         .      .      1   .       .      -1
The SIMLIN procedure is used to compute the reduced form and multipliers. The OUTEST= data set B from PROC SYSLIN is used as the EST= data set for the SIMLIN procedure. The following statements perform the multiplier analysis.
title2 "Simulation of transformed first-order equation system";
proc simlin est=b data=test2 total interim=2;
   endogenous y ylag1x ylag2x;
   exogenous x;
   lagged ylag1 y 1 ylag2 ylag1x 1 ylag3 ylag2x 1;
   id n;
   output out=out2 predicted=yhat residual=yresid;
run;
Output 23.2.7 shows the interim 2 and total multipliers printed by the SIMLIN procedure.
Output 23.2.7 Interim 2 and Total Multipliers
Simulate Equation with Third-Order Lags
Simulation of transformed first-order equation system

The SIMLIN Procedure
Interim Multipliers for Interim 2

Variable            x    Intercept
y            0.000000    0.0000000
ylag1x       0.000000    0.0000000
ylag2x      -1.776682    0.1423865
Chapter 24
periodogram ordinates are smoothed by a moving average to produce estimated spectral and cross-spectral densities. PROC SPECTRA can also test whether or not the data are white noise.

PROC SPECTRA uses the finite Fourier transform to decompose data series into a sum of sine and cosine waves of different amplitudes and wavelengths. The Fourier transform decomposition of the series $x_t$ is

$$x_t = \frac{a_0}{2} + \sum_{k=1}^{m} \left[ a_k \cos(\omega_k t) + b_k \sin(\omega_k t) \right]$$

where

$t$          is the time subscript, $t = 1, 2, \ldots, n$
$x_t$        are the equally spaced time series data
$n$          is the number of observations in the time series
$m$          is the number of frequencies in the Fourier decomposition: $m = \frac{n}{2}$ if n is even; $m = \frac{n-1}{2}$ if n is odd
$a_0$        is the mean term: $a_0 = 2\bar{x}$
$a_k$        are the cosine coefficients
$b_k$        are the sine coefficients
$\omega_k$   are the Fourier frequencies: $\omega_k = \frac{2\pi k}{n}$
Functions of the Fourier coefficients $a_k$ and $b_k$ can be plotted against frequency or against wave length to form periodograms. The amplitude periodogram $J_k$ is defined as follows:

$$J_k = \frac{n}{2} \left( a_k^2 + b_k^2 \right)$$
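The periodogram definition can be tried out on a short series. The sketch below is an illustration under assumptions: the coefficients are computed directly as $a_k = \frac{2}{n}\sum x_t \cos(\omega_k t)$ and $b_k = \frac{2}{n}\sum x_t \sin(\omega_k t)$ with t running from 1 to n, the function names are hypothetical, and no fast Fourier transform is used.

```python
import math

def fourier_coefficients(x, k):
    """Return (a_k, b_k) for the k-th Fourier frequency, with t = 1, ..., n."""
    n = len(x)
    w = 2 * math.pi * k / n
    a = (2 / n) * sum(xt * math.cos(w * t) for t, xt in enumerate(x, start=1))
    b = (2 / n) * sum(xt * math.sin(w * t) for t, xt in enumerate(x, start=1))
    return a, b

def periodogram(x):
    """Return [J_1, ..., J_m] with J_k = (n/2) * (a_k**2 + b_k**2)."""
    n = len(x)
    m = n // 2  # n/2 if n is even, (n-1)/2 if n is odd
    ordinates = []
    for k in range(1, m + 1):
        a, b = fourier_coefficients(x, k)
        ordinates.append((n / 2) * (a * a + b * b))
    return ordinates

# Hypothetical series: a pure cosine of amplitude 3 at the second Fourier frequency.
n_obs = 8
series = [3 * math.cos(2 * math.pi * 2 * t / n_obs) for t in range(1, n_obs + 1)]
J = periodogram(series)
```

For this series all of the sum of squares is concentrated in the second ordinate, which is the analysis-of-variance reading of the periodogram.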
Several definitions of the term periodogram are used in the spectral analysis literature. The following discussion refers to the $J_k$ sequence as the periodogram.

The periodogram can be interpreted as the contribution of the kth harmonic $\omega_k$ to the total sum of squares (in an analysis of variance sense) in the decomposition of the process into two-degree-of-freedom components for each of the m frequencies. When n is even, $\sin(\omega_{n/2})$ is zero, and thus the last periodogram value is a one-degree-of-freedom component.

The periodogram is a volatile and inconsistent estimator of the spectrum. The spectral density estimate is produced by smoothing the periodogram. Smoothing reduces the variance of the estimator but introduces a bias. The weight function used for the smoothing process, W(), often called the kernel or spectral window, is specified with the WEIGHTS statement. It is related to another weight function, w(), the lag window, that is used in other methods to taper the correlogram rather than to smooth the periodogram. Many specific weighting functions have been suggested in the literature (Fuller 1976, Jenkins and Watts 1968, Priestley 1981). Table 24.3 later in this chapter gives the relevant formulas when the WEIGHTS statement is used.

Letting i represent the imaginary unit $\sqrt{-1}$, the cross-periodogram is defined as follows:

$$J_k^{xy} = \frac{n}{2} \left( a_k^x a_k^y + b_k^x b_k^y \right) + i \frac{n}{2} \left( a_k^x b_k^y - b_k^x a_k^y \right)$$
The cross-spectral density estimate is produced by smoothing the cross-periodogram in the same way as the periodograms are smoothed using the spectral window specified by the WEIGHTS statement. The SPECTRA procedure creates an output SAS data set whose variables contain values of the periodograms, cross-periodograms, estimates of spectral densities, and estimates of cross-spectral densities. The form of the output data set is described in the section OUT= Data Set on page 1552.

This PROC SPECTRA step writes the Fourier coefficients $a_k$ and $b_k$ to the variables COS_01 and SIN_01 in the output data set B. When a WEIGHTS statement is specified, the periodogram is smoothed by a weighted moving average to produce an estimate of the spectral density of the series. The following statements write a spectral density estimate for X to the variable S_01 in the output data set B.

proc spectra data=a out=b s;
   var x;
   weights 1 2 3 4 3 2 1;
run;
When the VAR statement specifies more than one variable, you can perform cross-spectral analysis by specifying the CROSS option in the PROC SPECTRA statement. The CROSS option by itself produces the cross-periodograms for all two-way combinations of the variables listed in the VAR statement. For example, the following statements write the real and imaginary parts of the cross-periodogram of X and Y to the variables RP_01_02 and IP_01_02 in the output data set B.

proc spectra data=a out=b cross;
   var x y;
run;
To produce cross-spectral density estimates, specify both the CROSS option and the S option. The cross-periodogram is smoothed using the weights specified by the WEIGHTS statement in the same way as the spectral density. The squared coherency and phase estimates of the cross-spectrum are computed when the K and PH options are used. The following example computes cross-spectral density estimates for the variables X and Y.

proc spectra data=a out=b cross s;
   var x y;
   weights 1 2 3 4 3 2 1;
run;
The real part and imaginary part of the cross-spectral density estimates are written to the variables CS_01_02 and QS_01_02, respectively.
Functional Summary
Table 24.1 summarizes the statements and options that control the SPECTRA procedure.
Table 24.1 SPECTRA Functional Summary

Description                                          Statement       Option

Statements
specify BY-group processing                          BY
specify the variables to be analyzed                 VAR
specify weights for spectral density estimates       WEIGHTS

Data Set Options
specify the input data set                           PROC SPECTRA    DATA=
specify the output data set                          PROC SPECTRA    OUT=

Output Control Options
output the amplitudes of the cross-spectrum          PROC SPECTRA    A
output the Fourier coefficients                      PROC SPECTRA    COEF
output the periodogram                               PROC SPECTRA    P
output the spectral density estimates                PROC SPECTRA    S
output cross-spectral analysis results               PROC SPECTRA    CROSS
output squared coherency of the cross-spectrum       PROC SPECTRA    K
output the phase of the cross-spectrum               PROC SPECTRA    PH

Smoothing Options
specify the Bartlett kernel                          WEIGHTS         BART
specify the Parzen kernel                            WEIGHTS         PARZEN
specify the quadratic spectral kernel                WEIGHTS         QS
specify the Tukey-Hanning kernel                     WEIGHTS
specify the truncated kernel                         WEIGHTS

Other Options
subtract the series mean                             PROC SPECTRA    ADJMEAN
specify an alternate quadrature spectrum estimate    PROC SPECTRA    ALTW
request tests for white noise                        PROC SPECTRA    WHITETEST
ADJMEAN
subtracts the series mean before performing the Fourier decomposition. This sets the first periodogram ordinate to 0 rather than 2n times the squared mean. This option is commonly used when the periodograms are to be plotted to prevent a large first periodogram ordinate from distorting the scale of the plot.
ALTW
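specifies that the quadrature spectrum estimate is computed at the boundaries in the same way as the spectral density estimate and the cospectrum estimate are computed.
COEF
outputs the Fourier cosine and sine coefficients of each series. The variables are named COS_nn and SIN_nn, where nn is an index of the original variable.
CROSS
is used with the P and S options to output cross-periodograms and cross-spectral densities when more than one variable is listed in the VAR statement.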
DATA=SAS-data-set
names the SAS data set that contains the input data. If the DATA= option is omitted, the most recently created SAS data set is used.
K
outputs the squared coherency variables (K_nn_mm) of the cross-spectrum. The K_nn_mm variables are identically 1 unless weights are given in the WEIGHTS statement and the S option is specified.
OUT=SAS-data-set
names the output data set created by PROC SPECTRA to store the results. If the OUT= option is omitted, the output data set is named by using the DATAn convention.
P
outputs the periodogram variables. The variables are named P_nn, where nn is an index of the original variable with which the periodogram variable is associated. When both the P and CROSS options are specified, the cross-periodogram variables RP_nn_mm and IP_nn_mm are also output.
PH
outputs the phase variables (PH_nn_mm) of the cross-spectrum.
S
outputs the spectral density estimates. The variables are named S_nn, where nn is an index of the original variable with which the estimate variable is associated. When both the S and CROSS options are specified, the cross-spectral variables CS_nn_mm and QS_nn_mm are also output.
WHITETEST
prints two tests of the hypothesis that the data are white noise. See the section White Noise Test on page 1551 for details. Note that the CROSS, A, K, and PH options are meaningful only if more than one variable is listed in the VAR statement.
BY Statement
BY variables ;
A BY statement can be used with PROC SPECTRA to obtain separate analyses for groups of observations defined by the BY variables.
VAR Statement
VAR variables ;
The VAR statement specifies one or more numeric variables that contain the time series to analyze. The order of the variables in the VAR statement list determines the index, nn, used to name the output variables. The VAR statement is required.
WEIGHTS Statement
WEIGHTS weight-constants | kernel-specification ;
The WEIGHTS statement specifies the relative weights used in the moving average applied to the periodogram ordinates to form the spectral density estimates. A WEIGHTS statement must be used to produce smoothed spectral density estimates. You can specify the relative weights in two ways: you can specify them explicitly as explained in the section Using Weight Constants Specification on page 1547, or you can specify them implicitly by using the kernel specification as explained in the section Using Kernel Specifications on page 1547. If the WEIGHTS statement is not used, only the periodogram is produced.
where $c \ge 0$ and $e \ge 0$ are used to compute the bandwidth parameter as

$$l(q) = c q^e$$

and q is the number of periodogram ordinates + 1:

$$q = \mathrm{floor}(n/2) + 1$$

To specify the bandwidth explicitly, set c to the desired bandwidth and e = 0. For example, a Parzen kernel can be specified using the following WEIGHTS statement:
weights parzen 0.5 0;
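The bandwidth rule can be written out as a short function. This is a sketch for illustration: the function name and the sample size of 100 are hypothetical, and only the formulas $l(q) = c q^e$ and $q = \mathrm{floor}(n/2) + 1$ are taken from the text.

```python
def bandwidth(n, c, e):
    """Return l(q) = c * q**e, where q = floor(n/2) + 1."""
    if c < 0 or e < 0:
        raise ValueError("c and e must be nonnegative")
    q = n // 2 + 1
    return c * q ** e

# With e = 0 the bandwidth equals c regardless of n, as in 'weights parzen 0.5 0;'.
explicit = bandwidth(100, 0.5, 0)

# With c = 1 and e = 1/5 the bandwidth grows slowly with the series length.
growing = bandwidth(100, 1, 1 / 5)
```

Setting e to 0 removes the dependence on the series length, which is why that choice fixes the bandwidth at c.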
Input Data
Observations in the data set analyzed by the SPECTRA procedure should form ordered, equally spaced time series. No more than 99 variables can be included in the analysis. Data are often detrended before analysis by the SPECTRA procedure. This can be done by using the residuals output by a SAS regression procedure. Optionally, the data can be centered using the ADJMEAN option in the PROC SPECTRA statement, since the zero periodogram ordinate corresponding to the mean is of little interest from the point of view of spectral analysis.
Missing Values
Missing values are excluded from the analysis by the SPECTRA procedure. If the SPECTRA procedure encounters missing values for any variable listed in the VAR statement, the procedure determines the longest contiguous span of data that has no missing values for the variables listed in the VAR statement and uses that span for the analysis.
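The selection of the longest contiguous span without missing values can be sketched as follows. The representation of observations as tuples with None for missing values, the function name, and the choice to keep the earliest span when two spans tie in length are all assumptions made for this illustration.

```python
def longest_complete_span(rows):
    """Return (start, end) of the longest run of rows with no missing values.

    The slice rows[start:end] is the selected span. Ties keep the earliest run.
    """
    best_start, best_len = 0, 0
    start = None
    for i, row in enumerate(rows):
        if all(value is not None for value in row):
            if start is None:
                start = i  # a new complete run begins here
            length = i - start + 1
            if length > best_len:
                best_start, best_len = start, length
        else:
            start = None   # a missing value ends the current run
    return best_start, best_start + best_len

# Hypothetical two-variable data set with missing values in rows 1 and 5.
data = [(1, 2), (None, 3), (4, 5), (6, 7), (8, 9), (10, None), (11, 12)]
span = longest_complete_span(data)
```

Here rows 2 through 4 form the longest complete run, so only those three observations would enter the analysis.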
Computational Method
If the number of observations n factors into prime integers that are less than or equal to 23, and the product of the square-free factors of n is less than 210, then PROC SPECTRA uses the fast Fourier transform developed by Cooley and Tukey and implemented by Singleton (1969). If n cannot be factored in this way, then PROC SPECTRA uses a Chirp-Z algorithm similar to that proposed by Monro and Branch (1976). To reduce memory requirements, when n is small, the Fourier coefficients are computed directly using the defining formulas.
Kernels
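Kernels are used to smooth the periodogram by using a weighted moving average of nearby points. A smoothed periodogram is defined by the following equation.

$$\hat{J}_i(l(q)) = \sum_{\tau=-l(q)}^{l(q)} w\left( \frac{\tau}{l(q)} \right) \tilde{J}_{i+\tau}$$

where $w(x)$ is the kernel or weight function. At the endpoints, the moving average is computed cyclically; that is,

$$\tilde{J}_{i+\tau} = \begin{cases} J_{i+\tau} & 0 \le i+\tau \le q \\ J_{-(i+\tau)} & i+\tau < 0 \\ J_{q-(i+\tau)} & i+\tau > q \end{cases}$$

The SPECTRA procedure supports the following kernels. They are listed with their default bandwidth functions.

Bartlett: KERNEL BART

$$w(x) = \begin{cases} 1 - |x| & |x| \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad l(q) = \frac{1}{2} q^{1/3}$$

Parzen: KERNEL PARZEN

$$w(x) = \begin{cases} 1 - 6|x|^2 + 6|x|^3 & 0 \le |x| \le \frac{1}{2} \\ 2(1 - |x|)^3 & \frac{1}{2} \le |x| \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad l(q) = q^{1/5}$$

Quadratic spectral: KERNEL QS

$$w(x) = \frac{25}{12 \pi^2 x^2} \left( \frac{\sin(6 \pi x / 5)}{6 \pi x / 5} - \cos(6 \pi x / 5) \right) \qquad l(q) = \frac{1}{2} q^{1/5}$$

Tukey-Hanning:

$$w(x) = \begin{cases} (1 + \cos(\pi x)) / 2 & |x| \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad l(q) = \frac{2}{3} q^{1/5}$$

Truncated:

$$w(x) = \begin{cases} 1 & |x| \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad l(q) = \frac{1}{4} q^{1/5}$$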
A summary of the default values of the bandwidth parameters, c and e, associated with the kernel smoothers in PROC SPECTRA is given in Table 24.2.

Table 24.2 Bandwidth Parameters
See Andrews (1991) for details about the properties of these kernels.
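The kernel weight functions can be sketched in Python as follows. The forms used are the standard ones from the kernel literature (for example, Andrews 1991); the Tukey-Hanning and truncated forms in particular are assumptions about what the procedure uses, all function names are hypothetical, and the helper that evaluates weights at integer offsets is only one plausible reading of the moving average.

```python
import math

def bartlett(x):
    return 1 - abs(x) if abs(x) <= 1 else 0.0

def parzen(x):
    ax = abs(x)
    if ax <= 0.5:
        return 1 - 6 * ax ** 2 + 6 * ax ** 3
    if ax <= 1:
        return 2 * (1 - ax) ** 3
    return 0.0

def quadratic_spectral(x):
    if x == 0:
        return 1.0  # limiting value of the formula as x approaches 0
    z = 6 * math.pi * x / 5
    return 25 / (12 * math.pi ** 2 * x ** 2) * (math.sin(z) / z - math.cos(z))

def tukey_hanning(x):
    return (1 + math.cos(math.pi * x)) / 2 if abs(x) <= 1 else 0.0

def truncated(x):
    return 1.0 if abs(x) <= 1 else 0.0

def kernel_weights(kernel, bandwidth):
    """Relative weights w(tau / bandwidth) for integer offsets tau within the bandwidth."""
    span = int(math.floor(bandwidth))
    return [kernel(tau / bandwidth) for tau in range(-span, span + 1)]

# Relative Bartlett weights for a hypothetical bandwidth of 2.
weights = kernel_weights(bartlett, 2)
```

All of these kernels except the quadratic spectral kernel vanish outside $|x| \le 1$, so the bandwidth controls how many neighboring periodogram ordinates receive nonzero weight.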
where $m = \frac{n}{2}$ if n is even or $m = \frac{n-1}{2}$ if n is odd. The test statistic is the maximum absolute difference of the normalized cumulative periodogram and the uniform cumulative distribution function. Approximate p-values for Bartlett's Kolmogorov-Smirnov test statistics are provided with the test statistics. Small p-values cause you to reject the null hypothesis that the series is white noise.
Transforming Frequencies
The variable FREQ in the data set created by the SPECTRA procedure ranges from 0 to $\pi$. Sometimes it is preferable to express frequencies in cycles per observation period, which is equal to $\text{FREQ}/(2\pi)$. To express frequencies in cycles per unit time (for example, in cycles per year), multiply FREQ by $d/(2\pi)$, where d is the number of observations per unit of time. For example, for monthly data, if the desired time unit is years then d is 12. The period of the cycle is $2\pi/(d \times \text{FREQ})$, which ranges from $2/d$ to infinity.
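The conversions described above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes FREQ is measured in radians per observation and d is the number of observations per unit of time.

```python
import math

def cycles_per_unit_time(freq, d=1.0):
    """Convert FREQ (radians per observation) to cycles per unit time.

    With d = 1 the result is in cycles per observation period.
    """
    return freq * d / (2.0 * math.pi)

def period_in_time_units(freq, d=1.0):
    """Length of one cycle in time units; infinite at frequency 0."""
    if freq == 0.0:
        return math.inf
    return 2.0 * math.pi / (d * freq)

if __name__ == "__main__":
    # Monthly data expressed in years: d = 12.
    print(cycles_per_unit_time(math.pi, d=12))
    print(period_in_time_units(math.pi, d=12))
```

For monthly data the highest frequency, FREQ equal to pi, corresponds to 6 cycles per year, that is, a period of 2/d = 1/6 year.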
The subscript of $W_j$ runs from $W_{-p}$ to $W_p$, so that $W_0$ is the middle weight in the list. Let $\omega_k = 2\pi k / n$, where $k = 0, 1, \ldots, \mathrm{floor}(n/2)$.
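Table 24.3 Variables Created by PROC SPECTRA

| Variable | Description |
| FREQ | frequency in radians from 0 to $\pi$ (cycles per observation is $\text{FREQ}/(2\pi)$) |
| PERIOD | period: $2\pi/\text{FREQ}$ |
| COS_nn | cosine transform of X: $a_k^x = \frac{2}{n} \sum_{t=1}^{n} X_t \cos(\omega_k (t-1))$ |
| SIN_nn | sine transform of X: $b_k^x = \frac{2}{n} \sum_{t=1}^{n} X_t \sin(\omega_k (t-1))$ |
| P_nn | periodogram of X: $J_k^x = \frac{n}{2} \left[ (a_k^x)^2 + (b_k^x)^2 \right]$ |
| S_nn | spectral density estimate of X: $F_k^x = \sum_{j=-p}^{p} W_j J_{k+j}^x$ (except across endpoints) |
| RP_nn_mm | real part of cross-periodogram of X and Y: $\mathrm{real}(J_k^{xy}) = \frac{n}{2} (a_k^x a_k^y + b_k^x b_k^y)$ |
| IP_nn_mm | imaginary part of cross-periodogram of X and Y: $\mathrm{imag}(J_k^{xy}) = \frac{n}{2} (a_k^x b_k^y - b_k^x a_k^y)$ |
| CS_nn_mm | cospectrum estimate (real part of cross-spectrum) of X and Y: $C_k^{xy} = \sum_{j=-p}^{p} W_j \, \mathrm{real}(J_{k+j}^{xy})$ (except across endpoints) |
| QS_nn_mm | quadrature spectrum estimate (imaginary part of cross-spectrum) of X and Y: $Q_k^{xy} = \sum_{j=-p}^{p} W_j \, \mathrm{imag}(J_{k+j}^{xy})$ (except across endpoints) |
| A_nn_mm | amplitude (modulus) of cross-spectrum of X and Y: $A_k^{xy} = \sqrt{(C_k^{xy})^2 + (Q_k^{xy})^2}$ |
| K_nn_mm | coherency squared of X and Y: $K_k^{xy} = (A_k^{xy})^2 / (F_k^x F_k^y)$ |
| PH_nn_mm | phase spectrum in radians of X and Y: $\Phi_k^{xy} = \arctan(Q_k^{xy} / C_k^{xy})$ |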
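The cosine transform, sine transform, and periodogram definitions in Table 24.3 can be checked with a direct (non-FFT) computation. The following Python sketch is illustrative only: the function name and the tuple layout are hypothetical, and it applies the defining formulas term by term rather than the algorithm PROC SPECTRA uses.

```python
import math

def periodogram(x):
    """Return (omega_k, a_k, b_k, J_k) for k = 0..floor(n/2).

    a_k = (2/n) * sum X_t cos(omega_k (t-1))
    b_k = (2/n) * sum X_t sin(omega_k (t-1))
    J_k = (n/2) * (a_k**2 + b_k**2)
    """
    n = len(x)
    rows = []
    for k in range(n // 2 + 1):
        w = 2.0 * math.pi * k / n
        # enumerate() starts at 0, which plays the role of (t - 1) for t = 1..n
        a = (2.0 / n) * sum(v * math.cos(w * t) for t, v in enumerate(x))
        b = (2.0 / n) * sum(v * math.sin(w * t) for t, v in enumerate(x))
        rows.append((w, a, b, (n / 2.0) * (a * a + b * b)))
    return rows

if __name__ == "__main__":
    # A pure cosine at the first Fourier frequency of an 8-point series.
    series = [math.cos(2.0 * math.pi * t / 8) for t in range(8)]
    for row in periodogram(series):
        print(row)
```

For this test series all of the periodogram mass falls at k = 1, where the cosine coefficient is 1 and the sine coefficient is 0.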
Printed Output
By default PROC SPECTRA produces no printed output. When the WHITETEST option is specified, the SPECTRA procedure prints the following statistics for each variable in the VAR statement:

1. the name of the variable
2. M-1, the number of two-degree-of-freedom periodogram ordinates used in the test
3. MAX(P(*)), the maximum periodogram ordinate
4. SUM(P(*)), the sum of the periodogram ordinates
5. Fisher's Kappa statistic
6. Bartlett's Kolmogorov-Smirnov test statistic
7. Approximate p-value for Bartlett's Kolmogorov-Smirnov test statistic

See the section White Noise Test on page 1551 for details.

Description: white noise test; Fisher's Kappa statistic; Bartlett's Kolmogorov-Smirnov statistic
The spectral analysis of the sunspot series is performed by the following statements:
   proc spectra data=sunspot out=b p s adjmean whitetest;
      var wolfer;
      weights 1 2 3 4 3 2 1;
   run;

   proc print data=b(obs=12);
   run;

The PROC SPECTRA statement specifies the P and S options to write the periodogram and spectral density estimates to the OUT= data set B. The WEIGHTS statement specifies a triangular spectral window for smoothing the periodogram to produce the spectral density estimate. The ADJMEAN option zeros the frequency 0 value and avoids the need to exclude that observation from the plots. The WHITETEST option prints tests for white noise. The Fisher's Kappa test statistic of 16.070 is larger than the 5% critical value of 7.2, so the null hypothesis that the sunspot series is white noise is rejected (see the table of critical values in Fuller (1976)). The Bartlett's Kolmogorov-Smirnov statistic is 0.6501, and its approximate p-value is < 0.0001. The small p-value associated with this test leads to the rejection of the null hypothesis that the spectrum represents white noise.
The printed output produced by PROC SPECTRA is shown in Output 24.1.2. The output data set B created by PROC SPECTRA is shown in part in Output 24.1.3.
Output 24.1.2 White Noise Test Results
Wolfer's Sunspot Data
The SPECTRA Procedure
Test for White Noise for Variable wolfer

   M-1           87
   Max(P(*))     4062267
   Sum(P(*))     21156512

Bartlett's Kolmogorov-Smirnov Statistic: Maximum absolute difference of the standardized partial sums of the periodogram and the CDF of a uniform(0,1) random variable.

   Test Statistic          0.650055
   Approximate P-Value     <.0001
The following statements plot the periodogram and spectral density estimate by the frequency and period.
   proc sgplot data=b;
      series x=freq y=p_01 / markers markerattrs=(symbol=circlefilled);
   run;

   proc sgplot data=b;
      series x=period y=p_01 / markers markerattrs=(symbol=circlefilled);
   run;

   proc sgplot data=b;
      series x=freq y=s_01 / markers markerattrs=(symbol=circlefilled);
   run;

   proc sgplot data=b;
      series x=period y=s_01 / markers markerattrs=(symbol=circlefilled);
   run;
The periodogram is plotted against the frequency in Output 24.1.4 and plotted against the period in Output 24.1.5. The spectral density estimate is plotted against the frequency in Output 24.1.6 and plotted against the period in Output 24.1.7.
Output 24.1.4 Plot of Periodogram by Frequency
Since PERIOD is the reciprocal of frequency, the plot axis for PERIOD is stretched for low frequencies and compressed at high frequencies. One way to correct for this is to use a WHERE statement to restrict the plots and exclude the low frequency components. The following statements plot the spectral density for periods less than 50.
   proc sgplot data=b;
      where period < 50;
      series x=period y=s_01 / markers markerattrs=(symbol=circlefilled);
      refline 11 / axis=x;
   run;

The spectral analysis of the sunspot series confirms a strong 11-year cycle of sunspot activity. The plot makes this clear by drawing a reference line at the 11-year period, which highlights the position of the main peak in the spectral density. Output 24.1.8 shows the plot. Contrast Output 24.1.8 with Output 24.1.7.
The PROC CONTENTS report for the output data set B is shown in Output 24.2.1.
Output 24.2.1 Contents of PROC SPECTRA OUT= Data Set
Wolfer's Sunspot Data
The CONTENTS Procedure
Alphabetic List of Variables and Attributes

    #   Variable   Type   Len   Label
   16   A_01_02    Num      8   Amplitude of x by y
    3   COS_01     Num      8   Cosine Transform of x
    5   COS_02     Num      8   Cosine Transform of y
   13   CS_01_02   Num      8   Cospectra of x by y
    1   FREQ       Num      8   Frequency from 0 to PI
   12   IP_01_02   Num      8   Imag Periodogram of x by y
   15   K_01_02    Num      8   Coherency**2 of x by y
    2   PERIOD     Num      8   Period
   17   PH_01_02   Num      8   Phase of x by y
    7   P_01       Num      8   Periodogram of x
    8   P_02       Num      8   Periodogram of y
   14   QS_01_02   Num      8   Quadrature of x by y
   11   RP_01_02   Num      8   Real Periodogram of x by y
    4   SIN_01     Num      8   Sine Transform of x
    6   SIN_02     Num      8   Sine Transform of y
    9   S_01       Num      8   Spectral Density of x
   10   S_02       Num      8   Spectral Density of y
The following statements plot the amplitude of the cross-spectrum estimate against frequency and against period for periods less than 25.
   proc sgplot data=b;
      series x=freq y=a_01_02 / markers markerattrs=(symbol=circlefilled);
      xaxis values=(0 to 4 by 1);
   run;
The plot of the amplitude of the cross-spectrum estimate against frequency is shown in Output 24.2.2.
The plot of the cross-spectrum amplitude against period for periods less than 25 observations is shown in Output 24.2.3.
   proc sgplot data=b;
      where period < 25;
      series x=period y=a_01_02 / markers markerattrs=(symbol=circlefilled);
      xaxis values=(0 to 30 by 5);
   run;
References
Anderson, T. W. (1971), The Statistical Analysis of Time Series, New York: John Wiley & Sons.

Andrews, D. W. K. (1991), Heteroscedasticity and Autocorrelation Consistent Covariance Matrix Estimation, Econometrica, 59 (3), 817–858.

Bartlett, M. S. (1966), An Introduction to Stochastic Processes, Second Edition, Cambridge: Cambridge University Press.

Brillinger, D. R. (1975), Time Series: Data Analysis and Theory, New York: Holt, Rinehart and Winston, Inc.

Davis, H. T. (1941), The Analysis of Economic Time Series, Bloomington, IN: Principia Press.

Durbin, J. (1967), Tests of Serial Independence Based on the Cumulated Periodogram, Bulletin of Int. Stat. Inst., 42, 1039–1049.

Fuller, W. A. (1976), Introduction to Statistical Time Series, New York: John Wiley & Sons.

Gentleman, W. M. and Sande, G. (1966), Fast Fourier Transforms for Fun and Profit, AFIPS Proceedings of the Fall Joint Computer Conference, 19, 563–578.

Jenkins, G. M. and Watts, D. G. (1968), Spectral Analysis and Its Applications, San Francisco: Holden-Day.

Miller, L. H. (1956), Tables of Percentage Points of Kolmogorov Statistics, Journal of American Statistical Association, 51, 111.

Monro, D. M. and Branch, J. L. (1976), Algorithm AS 117. The Chirp Discrete Fourier Transform of General Length, Applied Statistics, 26, 351–361.

Nussbaumer, H. J. (1982), Fast Fourier Transform and Convolution Algorithms, Second Edition, New York: Springer-Verlag.

Owen, D. B. (1962), Handbook of Statistical Tables, Addison Wesley.

Parzen, E. (1957), On Consistent Estimates of the Spectrum of a Stationary Time Series, Annals of Mathematical Statistics, 28, 329–348.

Priestley, M. B. (1981), Spectral Analysis and Time Series, New York: Academic Press, Inc.

Singleton, R. C. (1969), An Algorithm for Computing the Mixed Radix Fast Fourier Transform, I.E.E.E. Transactions of Audio and Electroacoustics, AU-17, 93–103.
Chapter 25
state transition equation:

$$ \mathbf{z}_{t+1} = \mathbf{F} \mathbf{z}_t + \mathbf{G} \mathbf{e}_{t+1} $$

In the state transition equation, the $s \times s$ coefficient matrix F is called the transition matrix; it determines the dynamic properties of the model. The $s \times r$ coefficient matrix G is called the input matrix; it determines the variance structure of the transition equation. For model identification, the first r rows and columns of G are set to an $r \times r$ identity matrix. The input vector $\mathbf{e}_t$ is a sequence of independent normally distributed random vectors of dimension r with mean 0 and covariance matrix $\Sigma_{ee}$. The random error $\mathbf{e}_t$ is sometimes called the innovation vector or shock vector.

In addition to the state transition equation, state space models usually include a measurement equation or observation equation that gives the observed values $\mathbf{x}_t$ as a function of the state vector $\mathbf{z}_t$. However, since PROC STATESPACE always includes the observed values $\mathbf{x}_t$ in the state vector $\mathbf{z}_t$, the measurement equation in this case merely represents the extraction of the first r components of the state vector. The measurement equation used by the STATESPACE procedure is

$$ \mathbf{x}_t = [\, \mathbf{I}_r \;\; \mathbf{0} \,] \, \mathbf{z}_t $$

where $\mathbf{I}_r$ is an $r \times r$ identity matrix. In practice, PROC STATESPACE performs the extraction of $\mathbf{x}_t$ from $\mathbf{z}_t$ without reference to an explicit measurement equation. In summary:

$\mathbf{x}_t$ is an observation vector of dimension r.
$\mathbf{z}_t$ is a state vector of dimension s, whose first r elements are $\mathbf{x}_t$ and whose last $s - r$ elements are conditional predictions of future $\mathbf{x}_t$.
F is an $s \times s$ transition matrix.
G is an $s \times r$ input matrix, with the identity matrix $\mathbf{I}_r$ forming the first r rows and columns.
$\mathbf{e}_t$ is a sequence of independent normally distributed random vectors of dimension r with mean 0 and covariance matrix $\Sigma_{ee}$.
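To make the roles of F, G, and the measurement equation concrete, the following Python sketch advances a state vector by one step and then extracts the observed components. All matrices and values are hypothetical illustrations with s = 3 and r = 2; they are not output from PROC STATESPACE, and the helper names are made up for this sketch.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transition(F, G, z, e_next):
    """One step of the state transition: z_next = F z + G e_next."""
    Fz = mat_vec(F, z)
    Ge = mat_vec(G, e_next)
    return [p + q for p, q in zip(Fz, Ge)]

def observe(z, r):
    """Measurement equation x = [I_r 0] z: keep the first r state elements."""
    return z[:r]

# Hypothetical 3 x 3 transition matrix and 3 x 2 input matrix.
# The first r = 2 rows of G form an identity matrix.
F = [[0.0, 0.0, 1.0],
     [0.3, 0.5, 0.0],
     [0.2, 0.2, 0.3]]
G = [[1.0, 0.0],
     [0.0, 1.0],
     [0.25, 0.2]]

z = [1.0, 2.0, 0.5]      # current state vector (hypothetical)
e_next = [0.1, -0.2]     # next innovation vector (hypothetical)

z_next = transition(F, G, z, e_next)
x_next = observe(z_next, 2)

if __name__ == "__main__":
    print(z_next, x_next)
```

Because the first row of this F is (0, 0, 1) and the first row of G is (1, 0), the first element of the new state equals the previous one-step prediction plus the new innovation, which is the structural identity that the identification restrictions impose.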
autoregressive models are estimated using the sample autocovariance matrices and the Yule-Walker equations. The order of the VAR model that produces the smallest Akaike information criterion is chosen as the order (number of lags into the past) to use in the canonical correlation analysis.

The elements of the state vector are then determined via a sequence of canonical correlation analyses of the sample autocovariance matrices through the selected order. This analysis computes the sample canonical correlations of the past with an increasing number of steps into the future. Variables that yield significant correlations are added to the state vector; those that yield insignificant correlations are excluded from further consideration. The importance of the correlation is judged on the basis of another information criterion proposed by Akaike. See the section Canonical Correlation Analysis Options on page 1583 for details. If you specify the state vector explicitly, these model identification steps are omitted.

After the state vector is determined, the state space model is fit to the data. The free parameters in the F, G, and $\Sigma_{ee}$ matrices are estimated by approximate maximum likelihood. By default, the F and G matrices are unrestricted, except for identifiability requirements. Optionally, conditional least squares estimates can be computed. You can impose restrictions on elements of the F and G matrices.

After the parameters are estimated, the Kalman filtering technique is used to produce forecasts from the fitted state space model. If differencing was specified, the forecasts are integrated to produce forecasts of the original input variables.

when needed and try to use PROC STATESPACE with nonstationary data, an inappropriate state space model might be selected, and the model estimation might fail to converge. The following statements identify and fit a state space model for the first differences of X and Y, and forecast X and Y 10 periods ahead:

   proc statespace data=in out=out lead=10;
      var x(1) y(1);
      id t;
   run;

The DATA= option specifies the input data set and the OUT= option specifies the output data set for the forecasts. The LEAD= option specifies forecasting 10 observations past the end of the input data. The VAR statement specifies the variables to forecast and specifies differencing. The notation X(1) Y(1) specifies that the state space model analyzes the first differences of X and Y.
Variable x y
The STATESPACE Procedure

Information Criterion for Autoregressive Models

   Lag=0      Lag=1      Lag=2      Lag=3      Lag=4      Lag=5      Lag=6      Lag=7      Lag=8
   149.697    8.387786   5.517099   12.05986   15.36952   21.79538   24.00638   29.88874   33.55708

Information Criterion for Autoregressive Models

   Lag=9      Lag=10
   41.17606   47.70222
. is between
Descriptive statistics are printed first, giving the number of nonmissing observations after differencing and the sample means and standard deviations of the differenced series. The sample means are subtracted before the series are modeled (unless the NOCENTER option is specified), and the sample means are added back when the forecasts are produced. Let $X_t$ and $Y_t$ be the observed values of X and Y, and let $x_t$ and $y_t$ be the values of X and Y after differencing and subtracting the mean difference. The series $\mathbf{x}_t$ modeled by the STATESPACE procedure is

$$ \mathbf{x}_t = \begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} (1 - B) X_t - 0.144316 \\ (1 - B) Y_t - 0.164871 \end{bmatrix} $$

where B represents the backshift operator. After the descriptive statistics, PROC STATESPACE prints the Akaike information criterion (AIC) values for the autoregressive models fit to the series. The smallest AIC value, in this case 5.517 at lag 2, determines the number of autocovariance matrices analyzed in the canonical correlation phase. A schematic representation of the autocorrelations is printed next. This indicates which elements of the autocorrelation matrices at different lags are significantly greater than or less than 0. The second page of the STATESPACE printed output is shown in Figure 25.3.
Figure 25.3 Partial Autocorrelations and VAR Model
Schematic Representation of Partial Autocorrelations

   Name/Lag    1    2    3    4    5    6    7    8    9    10
   x           ++   +.   ..   ..   ..   ..   ..   ..   ..   ..
   y           ++   ..   ..   ..   ..   ..   ..   ..   ..   ..
. is between
Yule-Walker Estimates for Minimum AIC

        --------Lag=1-------    --------Lag=2-------
               x           y           x           y
   x    0.257438    0.202237    0.170812    0.133554
   y    0.292177    0.469297    -0.00537    -0.00048

Figure 25.3 shows a schematic representation of the partial autocorrelations, similar to the autocorrelations shown in Figure 25.2. The selection of a second order autoregressive model by the AIC statistic looks reasonable in this case because the partial autocorrelations for lags greater than 2 are not significant. Next, the Yule-Walker estimates for the selected autoregressive model are printed. This output shows the coefficient matrices of the vector autoregressive model at each lag.
Figure 25.4 first prints the state vector as X[T;T] Y[T;T] X[T+1;T]. This notation indicates that the state vector is

$$ \mathbf{z}_t = \begin{bmatrix} x_{t|t} \\ y_{t|t} \\ x_{t+1|t} \end{bmatrix} $$

The notation $x_{t+1|t}$ indicates the conditional expectation or prediction of $x_{t+1}$ based on the information available at time t, and $x_{t|t}$ and $y_{t|t}$ are $x_t$ and $y_t$, respectively. The remainder of Figure 25.4 shows the preliminary estimates of the transition matrix F, the input matrix G, and the covariance matrix $\Sigma_{ee}$.
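The estimated state space model shown in Figure 25.5 is

$$ \begin{bmatrix} x_{t+1|t+1} \\ y_{t+1|t+1} \\ x_{t+2|t+1} \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 \\ 0.297 & 0.474 & -0.020 \\ 0.230 & 0.228 & 0.256 \end{bmatrix} \begin{bmatrix} x_t \\ y_t \\ x_{t+1|t} \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0.257 & 0.202 \end{bmatrix} \begin{bmatrix} e_{t+1} \\ n_{t+1} \end{bmatrix} $$

$$ \mathrm{var} \begin{bmatrix} e_{t+1} \\ n_{t+1} \end{bmatrix} = \begin{bmatrix} 0.945 & 0.101 \\ 0.101 & 1.015 \end{bmatrix} $$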
The next page of the STATESPACE output lists the estimates of the free parameters in the F and G matrices with standard errors and t statistics, as shown in Figure 25.6.
Figure 25.6 Final Parameter Estimates
Parameter Estimates Standard Error 0.129995 0.115688 0.313025 0.126226 0.112978 0.305256 0.071060 0.068593
Convergence Failures
The maximum likelihood estimates are computed by an iterative nonlinear maximization algorithm, which might not converge. If the estimates fail to converge, warning messages are printed in the output. If you encounter convergence problems, you should recheck the stationarity of the data and ensure that the specified differencing orders are correct. Attempting to fit state space models to nonstationary data is a common cause of convergence failure. You can also use the MAXIT= option to increase the number of iterations allowed, or experiment with the convergence tolerance options DETTOL= and PARMTOL=.
run;
The OUT= data set produced by PROC STATESPACE contains the VAR and ID statement variables. In addition, for each VAR statement variable, the OUT= data set contains the variables FORi, RESi, and STDi. These variables contain the predicted values, residuals, and forecast standard errors for the ith variable in the VAR statement list. In this case, X is listed first in the VAR statement, so FOR1 contains the forecasts of X, while FOR2 contains the forecasts of Y. The following statements plot the forecasts and actuals for the series.

   proc sgplot data=out noautolegend;
      where t > 150;
      series x=t y=for1 / markers
             markerattrs=(symbol=circle color=blue)
             lineattrs=(pattern=solid color=blue);
      series x=t y=for2 / markers
             markerattrs=(symbol=circle color=blue)
             lineattrs=(pattern=solid color=blue);
      series x=t y=x / markers
             markerattrs=(symbol=circle color=red)
             lineattrs=(pattern=solid color=red);
      series x=t y=y / markers
             markerattrs=(symbol=circle color=red)
             lineattrs=(pattern=solid color=red);
      refline 200.5 / axis=x;
   run;
The forecast plot is shown in Figure 25.8. The last 50 observations are also plotted to provide context, and a reference line is drawn between the historical and forecast periods.
Figure 25.8 Plot of Forecasts
You can specify the form for only some of the variables and allow PROC STATESPACE to select the form for the other variables. If only some of the variables are specified in the FORM statement, canonical correlation analysis is used to determine the number of lags included in the state vector for the remaining variables not specified by the FORM statement. If the FORM statement includes specifications for all the variables listed in the VAR statement, the state vector is completely defined and the canonical correlation analysis is not performed.

The final estimates produced by these statements are shown in Figure 25.10.

   PROC STATESPACE options ;
      BY variable . . . ;
      FORM variable value . . . ;
      ID variable ;
      INITIAL F(row,column)=value . . . G(row,column)=value . . . ;
      RESTRICT F(row,column)=value . . . G(row,column)=value . . . ;
      VAR variable (difference, difference, . . . ) . . . ;
Functional Summary
Table 25.1 summarizes the statements and options used by PROC STATESPACE.
Table 25.1 STATESPACE Functional Summary
| Description | Statement | Option |
| Input Data Set Options | | |
| specify the input data set | PROC STATESPACE | DATA= |
| prevent subtraction of sample mean | PROC STATESPACE | NOCENTER |
| specify the ID variable | ID | |
| specify the observed series and differencing | VAR | |
| Options for Autoregressive Estimates | | |
| specify the maximum order | PROC STATESPACE | ARMAX= |
| specify maximum lag for autocovariances | PROC STATESPACE | LAGMAX= |
| output only minimum AIC model | PROC STATESPACE | MINIC |
| specify the amount of detail printed | PROC STATESPACE | PRINTOUT= |
| write preliminary AR models to a data set | PROC STATESPACE | OUTAR= |
| Options for Canonical Correlation Analysis | | |
| print the sequence of canonical correlations | PROC STATESPACE | CANCORR |
| specify upper limit of dimension of state vector | PROC STATESPACE | DIMMAX= |
| specify the minimum number of lags | PROC STATESPACE | PASTMIN= |
| specify the multiplier of the degrees of freedom | PROC STATESPACE | SIGCORR= |
| Options for State Space Model Estimation | | |
| specify starting values | INITIAL | |
| print covariance matrix of parameter estimates | PROC STATESPACE | COVB |
| specify the convergence criterion | PROC STATESPACE | DETTOL= |
| specify the convergence criterion | PROC STATESPACE | PARMTOL= |
| print the details of the iterations | PROC STATESPACE | ITPRINT |
| specify an upper limit of the number of lags | PROC STATESPACE | KLAG= |
| specify maximum number of iterations allowed | PROC STATESPACE | MAXIT= |
| suppress the final estimation | PROC STATESPACE | NOEST |
| write the state space model parameter estimates to an output data set | PROC STATESPACE | OUTMODEL= |
| use conditional least squares for final estimates | PROC STATESPACE | RESIDEST |
| specify criterion for testing for singularity | PROC STATESPACE | SINGULAR= |
| Options for Forecasting | | |
| start forecasting before end of the input data | PROC STATESPACE | BACK= |
| specify the time interval between observations | PROC STATESPACE | INTERVAL= |
| specify multiple periods in the time series | PROC STATESPACE | INTPER= |
| specify how many periods to forecast | PROC STATESPACE | LEAD= |
| specify the output data set for forecasts | PROC STATESPACE | OUT= |
| print forecasts | PROC STATESPACE | PRINT |
| Options to Specify the State Space Model | | |
| specify the state vector | FORM | |
| specify the parameter values | RESTRICT | |
| BY Groups | | |
| specify BY-group processing | BY | |
| Printing | | |
| suppresses all printed output | PROC STATESPACE | NOPRINT |
Printing Options
NOPRINT
suppresses all printed output.

DATA=SAS-data-set
specifies the name of the SAS data set to be used by the procedure. If the DATA= option is omitted, the most recently created SAS data set is used.
LAGMAX=k
specifies the number of lags for which the sample autocovariance matrix is computed. The LAGMAX= option controls the number of lags printed in the schematic representation of the autocorrelations. The sample autocovariance matrix of lag i, denoted as $\mathbf{C}_i$, is computed as

$$ \mathbf{C}_i = \frac{1}{N-1} \sum_{t=1+i}^{N} \mathbf{x}_t \mathbf{x}'_{t-i} $$

where $\mathbf{x}_t$ is the differenced and centered data and N is the number of observations. (If the NOCENTER option is specified, 1 is not subtracted from N.) LAGMAX=k specifies that $\mathbf{C}_0$ through $\mathbf{C}_k$ are computed. The default is LAGMAX=10.
NOCENTER
prevents subtraction of the sample mean from the input series (after any specified differencing) before the analysis.

ARMAX=n
specifies the maximum order of the preliminary autoregressive models. The ARMAX= option controls the autoregressive orders for which information criteria are printed, and controls the number of lags printed in the schematic representation of partial autocorrelations. The default is ARMAX=10. See the section Preliminary Autoregressive Models on page 1590 for details.
MINIC
writes to the OUTAR= data set only the preliminary Yule-Walker estimates for the VAR model that produces the minimum AIC. See the section OUTAR= Data Set on page 1601 for details.
OUTAR=SAS-data-set
writes the Yule-Walker estimates of the preliminary autoregressive models to a SAS data set. See the section OUTAR= Data Set on page 1601 for details.
PRINTOUT=SHORT | LONG | NONE
determines the amount of detail printed. PRINTOUT=LONG prints the lagged covariance matrices, the partial autoregressive matrices, and estimates of the residual covariance matrices from the sequence of autoregressive models. PRINTOUT=NONE suppresses the output for the preliminary autoregressive models. The descriptive statistics and state space model estimation output are still printed when PRINTOUT=NONE is specied. PRINTOUT=SHORT is the default.
CANCORR
prints the canonical correlations and information criterion for each candidate state vector considered. See the section Canonical Correlation Analysis Options on page 1583 for details.
DIMMAX=n
specifies the upper limit to the dimension of the state vector. The DIMMAX= option can be used to limit the size of the model selected. The default is DIMMAX=10.
PASTMIN=n
specifies the minimum number of lags to include in the canonical correlation analysis. The default is PASTMIN=0. See the section Canonical Correlation Analysis Options on page 1583 for details.
SIGCORR=value
specifies the multiplier of the degrees of freedom for the penalty term in the information criterion used to select the state space form. The default is SIGCORR=2. The larger the value of the SIGCORR= option, the smaller the state vector tends to be. Hence, a large value causes a simpler model to be fit. See the section Canonical Correlation Analysis Options on page 1583 for details.

COVB
prints the inverse of the observed information matrix for the parameter estimates. This matrix is an estimate of the covariance matrix for the parameter estimates.
DETTOL=value
specifies the convergence criterion. The DETTOL= and PARMTOL= option values are used together to test for convergence of the estimation process. If, during an iteration, the relative change of the parameter estimates is less than the PARMTOL= value and the relative change of the determinant of the innovation variance matrix is less than the DETTOL= value, then iteration ceases and the current estimates are accepted. The default is DETTOL=1E-5.

ITPRINT
prints the details of the iterations.

KLAG=n
sets an upper limit for the number of lags of the sample autocovariance matrix used in computing the approximate likelihood function. If the data have a strong moving average character, a larger KLAG= value might be necessary to obtain good estimates. The default is KLAG=15. See the section Parameter Estimation on page 1596 for details.
MAXIT=n
sets an upper limit to the number of iterations in the maximum likelihood or conditional least squares estimation. The default is MAXIT=50.
NOEST
suppresses the final estimation.

OUTMODEL=SAS-data-set
writes the parameter estimates and their standard errors to a SAS data set. See the section OUTMODEL= Data Set on page 1602 for details.
PARMTOL=value
specifies the convergence criterion. The DETTOL= and PARMTOL= option values are used together to test for convergence of the estimation process. If, during an iteration, the relative change of the parameter estimates is less than the PARMTOL= value and the relative change of the determinant of the innovation variance matrix is less than the DETTOL= value, then iteration ceases and the current estimates are accepted. The default is PARMTOL=0.001.
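The joint use of the two tolerances can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the text does not say how the relative change of a vector of estimates is measured, so the largest element-wise relative change is assumed here, the default of 1E-5 for the determinant tolerance is an assumption, and the function names are hypothetical.

```python
def relative_change(new, old, eps=1e-12):
    """Relative change |new - old| / |old|, guarded against division by zero."""
    return abs(new - old) / max(abs(old), eps)

def has_converged(params_new, params_old, det_new, det_old,
                  parmtol=0.001, dettol=1e-5):
    """Return True only when BOTH tolerances are satisfied.

    Assumption: the parameter criterion uses the largest element-wise
    relative change between successive iterations.
    """
    parm_change = max(relative_change(n, o)
                      for n, o in zip(params_new, params_old))
    det_change = relative_change(det_new, det_old)
    return parm_change < parmtol and det_change < dettol

if __name__ == "__main__":
    print(has_converged([1.0, 2.0], [1.0005, 2.0], 0.9, 0.9000001))
    print(has_converged([1.0], [1.01], 0.9, 0.9))
```

The point of the sketch is that either criterion alone is not enough: small parameter changes with a still-moving determinant (or the reverse) keep the iteration going.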
RESIDEST
computes the final estimates by using conditional least squares on the raw data. This type of estimation might be more stable than the default maximum likelihood method but is usually more computationally expensive. See the section Parameter Estimation on page 1596 for details about the conditional least squares method.
SINGULAR=value
specifies the criterion for testing for singularity of a matrix. A matrix is declared singular if a scaled pivot is less than the SINGULAR= value when sweeping the matrix. The default is SINGULAR=1E-7.
Forecasting Options
BACK=n
starts forecasting n periods before the end of the input data. The BACK= option value must not be greater than the number of observations. The default is BACK=0.
INTERVAL=interval
specifies the time interval between observations. The INTERVAL= value is used in conjunction with the ID variable to check that the input data are in order and have no missing periods. The INTERVAL= option is also used to extrapolate the ID values past the end of the input data. See Chapter 4, Date Intervals, Formats, and Functions, for details about the INTERVAL= values allowed.
INTPER=n
specifies that each input observation corresponds to n time periods. For example, the options INTERVAL=MONTH and INTPER=2 specify bimonthly data and are equivalent to specifying INTERVAL=MONTH2. If the INTERVAL= option is not specified, the INTPER= option controls the increment used to generate ID values for the forecast observations. The default is INTPER=1.
LEAD=n
specifies how many forecast observations are produced. The forecasts start at the point set by the BACK= option. The default is LEAD=0, which produces no forecasts.
OUT=SAS-data-set
writes the residuals, actual values, forecasts, and forecast standard errors to a SAS data set. See the section OUT= Data Set on page 1601 for details.
PRINT
BY Statement
BY variable . . . ;
A BY statement can be used with the STATESPACE procedure to obtain separate analyses on observations in groups defined by the BY variables.
FORM Statement
FORM variable value . . . ;
The FORM statement specifies the number of times a variable is included in the state vector. Values can be specified for any variable listed in the VAR statement. If a value is specified for each variable in the VAR statement, the state vector for the state space model is entirely specified, and automatic selection of the state space model is not performed. The FORM statement forces the state vector, $\mathbf{z}_t$, to contain a specific variable a given number of times. For example, if Y is one of the variables in $\mathbf{x}_t$, then the statement

   form y 3;

forces the state vector to contain $Y_t$, $Y_{t+1|t}$, and $Y_{t+2|t}$, possibly along with other variables. The following statements illustrate the use of the FORM statement:

   proc statespace data=in;
      var x y;
      form x 3 y 2;
   run;

These statements fit a state space model with the following state vector:

$$ \mathbf{z}_t = \begin{bmatrix} x_{t|t} \\ y_{t|t} \\ x_{t+1|t} \\ y_{t+1|t} \\ x_{t+2|t} \end{bmatrix} $$
ID Statement
ID variable ;
The ID statement specifies a variable that identifies observations in the input data set. The variable specified in the ID statement is included in the OUT= data set. The values of the ID variable are extrapolated for the forecast observations based on the values of the INTERVAL= and INTPER= options.
INITIAL Statement
INITIAL F (row,column)= value . . . G(row, column)= value . . . ;
The INITIAL statement gives initial values to the specified elements of the F and G matrices. These initial values are used as starting values for the iterative estimation. Parts of the F and G matrices represent fixed structural identities. If an element specified is a fixed structural element instead of a free parameter, the corresponding initialization is ignored. The following is an example of an INITIAL statement:
initial f(3,2)=0 g(4,1)=0 g(5,1)=0;
RESTRICT Statement
RESTRICT F(row,column)= value . . . G(row,column)= value . . . ;
The RESTRICT statement restricts the specified elements of the F and G matrices to the specified values. To use the RESTRICT statement, you need to know the form of the model. Either specify the form of the model with the FORM statement, or do a preliminary run (perhaps with the NOEST option) to find the form of the model that PROC STATESPACE selects for the data. The following is an example of a RESTRICT statement:
restrict f(3,2)=0 g(4,1)=0 g(5,1)=0 ;
Parts of the F and G matrices represent fixed structural identities. If a restriction is specified for an element that is a fixed structural element instead of a free parameter, the restriction is ignored.
VAR Statement
VAR variable (difference, difference, . . . ) . . . ;
The VAR statement specifies the variables in the input data set to model and forecast. The VAR statement also specifies differencing of the input variables. The VAR statement is required. Differencing is specified by following the variable name with a list of difference periods separated by commas. See the section Stationarity and Differencing on page 1588 for more information about differencing of input variables. The order in which variables are listed in the VAR statement controls the order in which variables are included in the state vector. Usually, potential inputs should be listed before potential outputs.

For example, assuming the input data are monthly, the following VAR statement specifies modeling and forecasting of the one period and seasonal second difference of X and Y:
var x(1,12) y(1,12);
In this example, the vector time series analyzed is

$$ \mathbf{x}_t = \begin{bmatrix} (1 - B)(1 - B^{12}) X_t - \bar{x} \\ (1 - B)(1 - B^{12}) Y_t - \bar{y} \end{bmatrix} $$

where B represents the backshift operator and $\bar{x}$ and $\bar{y}$ represent the means of the differenced series. If the NOCENTER option is specified, the mean differences are not subtracted.
Missing Values
The STATESPACE procedure does not support missing values. The procedure uses the first contiguous group of observations with no missing values for any of the VAR statement variables. Observations at the beginning of the data set with missing values for any VAR statement variable are not used or included in the output data set.
In this example, the change in X from one period to the next is analyzed. When the series has a seasonal pattern, differencing at a period equal to the length of the seasonal cycle can be desirable. For example, suppose the variable X is measured quarterly and shows a seasonal cycle over the year. You can use the following statement to analyze the series of changes from the same quarter in the previous year:
var x(4);
To difference twice, add another differencing period to the list. For example, the following statement analyzes the series of second differences $(X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) = X_t - 2X_{t-1} + X_{t-2}$:
var x(1,1);
The series that is being modeled is the 1-period difference of the 4-period difference: $(X_t - X_{t-4}) - (X_{t-1} - X_{t-5}) = X_t - X_{t-1} - X_{t-4} + X_{t-5}$.
Another way to obtain stationary series is to use a regression on time to detrend the data. If the time series has a deterministic linear trend, regressing the series on time produces residuals that should be stationary. The following statements write residuals of X and Y to the variables RX and RY in the output data set DETREND.
   data a;
      set a;
      t=_n_;
   run;

   proc reg data=a;
      model x y = t;
      output out=detrend r=rx ry;
   run;
You then use PROC STATESPACE to forecast the detrended series RX and RY. A disadvantage of this method is that you need to add the trend back to the forecast series in an additional step. A more serious disadvantage of the detrending method is that it assumes a deterministic trend. In practice, most time series appear to have a stochastic rather than a deterministic trend. Differencing is a more flexible and often more appropriate method. There are several other methods to handle nonstationary time series. For more information and examples, see Brockwell and Davis (1991).
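The differencing expansions in this section can be checked numerically outside SAS. The following Python sketch is only an illustration: the function name `difference` and the quarterly series `x` are made up for the example. It applies the lag operators one after another and compares the result with the expanded expressions given in the text.

```python
def difference(series, lag):
    """Apply (1 - B^lag): series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical quarterly series: linear trend plus a fixed seasonal pattern.
season = [3.0, -1.0, 2.0, -4.0]
x = [0.5 * t + season[t % 4] for t in range(12)]

# 1-period difference of the 4-period difference.
d = difference(difference(x, 4), 1)
# Expanded form X_t - X_{t-1} - X_{t-4} + X_{t-5}, defined for t >= 5.
expanded = [x[t] - x[t - 1] - x[t - 4] + x[t - 5] for t in range(5, len(x))]

# Second difference, as requested by two 1-period differences.
second = difference(difference(x, 1), 1)
# Expanded form X_t - 2 X_{t-1} + X_{t-2}, defined for t >= 2.
second_expanded = [x[t] - 2 * x[t - 1] + x[t - 2] for t in range(2, len(x))]

print(d == expanded, second == second_expanded)
```

For this artificial series the seasonal difference removes the seasonal pattern and the further 1-period difference removes what is left of the trend, which is the behavior the differencing lists are meant to achieve.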
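The forward autoregressive form based on the past observations is written as follows:

$$\mathbf{x}_t = \sum_{i=1}^{p} \boldsymbol{\Psi}_i^p \mathbf{x}_{t-i} + \mathbf{e}_t$$

The backward autoregressive form based on the future observations is written as follows:

$$\mathbf{x}_t = \sum_{i=1}^{p} \boldsymbol{\Phi}_i^p \mathbf{x}_{t+i} + \mathbf{n}_t$$

Letting $E$ denote the expected value operator, the autocovariance sequence for the $\mathbf{x}_t$ series, $\boldsymbol{\Gamma}_i$, is

$$\boldsymbol{\Gamma}_i = E\,\mathbf{x}_t \mathbf{x}'_{t-i}$$

The Yule-Walker equations for the autoregressive model that matches the first $p$ elements of the autocovariance sequence are

$$\begin{bmatrix} \boldsymbol{\Gamma}_0 & \boldsymbol{\Gamma}_1 & \cdots & \boldsymbol{\Gamma}_{p-1} \\ \boldsymbol{\Gamma}'_1 & \boldsymbol{\Gamma}_0 & \cdots & \boldsymbol{\Gamma}_{p-2} \\ \vdots & \vdots & & \vdots \\ \boldsymbol{\Gamma}'_{p-1} & \boldsymbol{\Gamma}'_{p-2} & \cdots & \boldsymbol{\Gamma}_0 \end{bmatrix} \begin{bmatrix} \boldsymbol{\Psi}_1^p \\ \boldsymbol{\Psi}_2^p \\ \vdots \\ \boldsymbol{\Psi}_p^p \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Gamma}_1 \\ \boldsymbol{\Gamma}_2 \\ \vdots \\ \boldsymbol{\Gamma}_p \end{bmatrix}$$

and

$$\begin{bmatrix} \boldsymbol{\Gamma}_0 & \boldsymbol{\Gamma}'_1 & \cdots & \boldsymbol{\Gamma}'_{p-1} \\ \boldsymbol{\Gamma}_1 & \boldsymbol{\Gamma}_0 & \cdots & \boldsymbol{\Gamma}'_{p-2} \\ \vdots & \vdots & & \vdots \\ \boldsymbol{\Gamma}_{p-1} & \boldsymbol{\Gamma}_{p-2} & \cdots & \boldsymbol{\Gamma}_0 \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_1^p \\ \boldsymbol{\Phi}_2^p \\ \vdots \\ \boldsymbol{\Phi}_p^p \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Gamma}'_1 \\ \boldsymbol{\Gamma}'_2 \\ \vdots \\ \boldsymbol{\Gamma}'_p \end{bmatrix}$$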
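Here $\boldsymbol{\Psi}_i^p$ are the coefficient matrices for the past observation form of the vector autoregressive model, and $\boldsymbol{\Phi}_i^p$ are the coefficient matrices for the future observation form. More information about the Yule-Walker equations in the multivariate setting can be found in Whittle (1963) and Ansley and Newbold (1979).
The innovation variance matrices for the two forms can be written as follows:

$$\boldsymbol{\Sigma}_p = \boldsymbol{\Gamma}_0 - \sum_{i=1}^{p} \boldsymbol{\Psi}_i^p \boldsymbol{\Gamma}'_i$$

$$\boldsymbol{\Omega}_p = \boldsymbol{\Gamma}_0 - \sum_{i=1}^{p} \boldsymbol{\Phi}_i^p \boldsymbol{\Gamma}_i$$

The autoregressive models are fit to the data by using the preceding Yule-Walker equations with $\boldsymbol{\Gamma}_i$ replaced by the sample covariance sequence $\mathbf{C}_i$. The covariance matrices are calculated as

$$\mathbf{C}_i = \frac{1}{N-1} \sum_{t=i+1}^{N} \mathbf{x}_t \mathbf{x}'_{t-i}$$

Let $\hat{\boldsymbol{\Psi}}_p$, $\hat{\boldsymbol{\Phi}}_p$, $\hat{\boldsymbol{\Sigma}}_p$, and $\hat{\boldsymbol{\Omega}}_p$ represent the Yule-Walker estimates of $\boldsymbol{\Psi}_p$, $\boldsymbol{\Phi}_p$, $\boldsymbol{\Sigma}_p$, and $\boldsymbol{\Omega}_p$, respectively. These matrices are written to an output data set when the OUTAR= option is specified.
When the PRINTOUT=LONG option is specified, the sequence of matrices $\hat{\boldsymbol{\Sigma}}_p$ and the corresponding correlation matrices are printed. The sequence of matrices $\hat{\boldsymbol{\Sigma}}_p$ is used to compute Akaike information criteria for selection of the autoregressive order of the process.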
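To make the Yule-Walker step concrete, the following Python sketch works through the univariate special case (r = 1) with order p = 2, where every autocovariance matrix is a scalar and the transposes disappear. It is a simplified illustration, not the procedure's implementation, and the function names are invented for the example.

```python
def sample_autocovariances(x, max_lag):
    """C_i = 1/(N-1) * sum_{t=i+1}^{N} x_t x_{t-i}, for an already centered series."""
    n = len(x)
    return [sum(x[t] * x[t - i] for t in range(i, n)) / (n - 1)
            for i in range(max_lag + 1)]

def yule_walker_ar2(c):
    """Solve [[c0, c1], [c1, c0]] (psi1, psi2)' = (c1, c2)' by Cramer's rule.

    Returns (psi1, psi2, sigma2), where sigma2 = c0 - psi1*c1 - psi2*c2
    is the scalar version of the forward innovation variance.
    """
    det = c[0] * c[0] - c[1] * c[1]
    psi1 = (c[1] * c[0] - c[1] * c[2]) / det
    psi2 = (c[0] * c[2] - c[1] * c[1]) / det
    sigma2 = c[0] - psi1 * c[1] - psi2 * c[2]
    return psi1, psi2, sigma2

# Autocovariances of an exact AR(1) process with coefficient 0.6 and variance 1:
# fitting an order-2 model should return psi2 = 0.
psi1, psi2, sigma2 = yule_walker_ar2([1.0, 0.6, 0.36])
print(psi1, psi2, sigma2)
```

In practice the list passed to `yule_walker_ar2` would come from `sample_autocovariances` applied to the centered, differenced series.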
You can use the printed AIC array to compute a likelihood ratio test of the autoregressive order. The log-likelihood ratio test statistic for testing the order $p$ model against the order $p-1$ model is

$$-n \ln(|\hat{\boldsymbol{\Sigma}}_p|) + n \ln(|\hat{\boldsymbol{\Sigma}}_{p-1}|)$$

This quantity is asymptotically distributed as a $\chi^2$ with $r^2$ degrees of freedom if the series is autoregressive of order $p-1$. It can be computed from the AIC array as

$$AIC_{p-1} - AIC_p + 2r^2$$

You can evaluate the significance of these test statistics with the PROBCHI function in a SAS DATA step or with a $\chi^2$ table.
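As a worked illustration of this test, the Python sketch below takes the Lag=3 and Lag=4 information criterion values printed in Output 25.1.1 (r = 2 series) and forms the statistic and its upper-tail probability. The helper `chi2_sf_even_df` is a stand-in for PROBCHI that is written for this example and is valid only for even degrees of freedom, which is the case here since r² = 4.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this closed form requires even degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j          # (x/2)^j / j!
        total += term
    return math.exp(-half) * total

def lr_from_aic(aic_prev, aic_curr, r):
    """Likelihood ratio statistic AIC_{p-1} - AIC_p + 2 r^2."""
    return aic_prev - aic_curr + 2 * r * r

r = 2                               # number of series in the VAR statement
aic = {3: -1645.12, 4: -1651.52}    # Lag=3 and Lag=4 values from Output 25.1.1
stat = lr_from_aic(aic[3], aic[4], r)
p_value = chi2_sf_even_df(stat, r * r)
print(round(stat, 2), round(p_value, 4))
```

A small p-value here indicates that the order 4 model fits significantly better than the order 3 model, which agrees with the minimum AIC occurring at Lag=4.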
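The partial autocorrelations are from the sample partial autoregressive matrices $\hat{\boldsymbol{\Psi}}_p^p$. The standard errors used for the significance limits of the partial autocorrelations are computed from the sequence of matrices $\boldsymbol{\Sigma}_p$ and $\boldsymbol{\Omega}_p$.
Under the assumption that the observed series arises from an autoregressive process of order $p-1$, the $p$th sample partial autoregressive matrix $\hat{\boldsymbol{\Psi}}_p^p$ has an asymptotic variance matrix $\frac{1}{n} \boldsymbol{\Omega}_p^{-1} \otimes \boldsymbol{\Sigma}_p$.
The significance limits for $\hat{\boldsymbol{\Psi}}_p^p$ used in the schematic plot of the sample partial autoregressive sequence are derived by replacing $\boldsymbol{\Omega}_p$ and $\boldsymbol{\Sigma}_p$ with their sample estimators to produce the variance estimate, as follows:

$$\widehat{Var}\left(\hat{\boldsymbol{\Psi}}_p^p\right) = \left(\frac{1}{n - rp}\right) \hat{\boldsymbol{\Omega}}_p^{-1} \otimes \hat{\boldsymbol{\Sigma}}_p$$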
In the canonical correlation analysis, consider submatrices of the sample covariance matrix of $\mathbf{p}_t$ and $\mathbf{f}_t$. This covariance matrix, $\mathbf{V}$, has a block Hankel form:

$$\mathbf{V} = \begin{bmatrix} \mathbf{C}_0 & \mathbf{C}'_1 & \mathbf{C}'_2 & \cdots & \mathbf{C}'_p \\ \mathbf{C}'_1 & \mathbf{C}'_2 & \mathbf{C}'_3 & \cdots & \mathbf{C}'_{p+1} \\ \vdots & \vdots & \vdots & & \vdots \\ \mathbf{C}'_p & \mathbf{C}'_{p+1} & \mathbf{C}'_{p+2} & \cdots & \mathbf{C}'_{2p} \end{bmatrix}$$

That is, start by considering whether to add $x_{1,t+1|t}$ to the initial state vector $\mathbf{z}_t^1$.
The procedure forms the submatrix $\mathbf{V}^1$ that corresponds to $\mathbf{f}_t^1$ and computes its canonical correlations. Denote the smallest canonical correlation of $\mathbf{V}^1$ as $\rho_{min}$. If $\rho_{min}$ is significantly greater than 0, $x_{1,t+1|t}$ is added to the state vector.
If the smallest canonical correlation of $\mathbf{V}^1$ is not significantly greater than 0, then a linear combination of $\mathbf{f}_t^1$ is uncorrelated with the past, $\mathbf{p}_t$. Assuming that the determinant of $\mathbf{C}_0$ is not 0 (that is, no input series is a constant), you can take the coefficient of $x_{1,t+1|t}$ in this linear combination to be 1. Denote the coefficients of $\mathbf{z}_t^1$ in this linear combination as $\boldsymbol{\ell}$. This gives the relationship:

$$x_{1,t+1|t} = \boldsymbol{\ell}' \mathbf{x}_t$$

Therefore, the current state vector already contains all the past information useful for predicting $x_{1,t+1}$ and any greater leads of $x_{1,t}$. The variable $x_{1,t+1|t}$ is not added to the state vector, nor are any terms $x_{1,t+k|t}$ considered as possible components of the state vector. The variable $x_1$ is no longer active for state vector selection.
The process described for $x_{1,t+1|t}$ is repeated for the remaining elements of $\mathbf{f}_t$. The next candidate for inclusion in the state vector is the next component of $\mathbf{f}_t$ that corresponds to an active variable. Components of $\mathbf{f}_t$ that correspond to inactive variables that produced a zero $\rho_{min}$ in a previous step are skipped.
Denote the next candidate as $x_{l,t+k|t}$. The vector $\mathbf{f}_t^j$ is formed from the current state vector and $x_{l,t+k|t}$ as follows:

$$\mathbf{f}_t^j = (\mathbf{z}_t^{j\prime}, x_{l,t+k|t})'$$

The matrix $\mathbf{V}^j$ is formed from $\mathbf{f}_t^j$ and its canonical correlations are computed. The smallest canonical correlation of $\mathbf{V}^j$ is judged to be either greater than or equal to 0. If it is judged to be greater than 0, $x_{l,t+k|t}$ is added to the state vector. If it is judged to be 0, then a linear combination of $\mathbf{f}_t^j$ is uncorrelated with the $\mathbf{p}_t$, and the variable $x_l$ is now inactive.
The state vector selection process continues until no active variables remain.
The information criterion used to judge the smallest canonical correlation is

$$IC = -n \ln(1 - \rho_{min}^2) - \lambda\,(r(p+1) - q + 1)$$

where $q$ is the dimension of $\mathbf{f}_t^j$ at the current step, $r$ is the order of the state vector, $p$ is the order of the vector autoregressive process, and $\lambda$ is the value of the SIGCORR= option. The default is SIGCORR=2. If this information criterion is less than or equal to 0, $\rho_{min}$ is taken to be 0; otherwise, it is taken to be significantly greater than 0. (Do not confuse this information criterion with the AIC.)
Variables in $\mathbf{x}_{t+p|t}$ are not added in the model, even with positive information criterion, because of the singularity of $\mathbf{V}$. You can force the consideration of more candidate state variables by increasing the size of the $\mathbf{V}$ matrix by specifying a PASTMIN= option value larger than $p$.
Bartlett's $\chi^2$ statistic for the smallest canonical correlation is

$$\chi^2 = -(n - .5(r(p+1) - q + 1)) \ln(1 - \rho_{min}^2)$$

with $r(p+1) - q + 1$ degrees of freedom.
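The two statistics above are simple enough to evaluate by hand. The following Python sketch is an illustration only, with function names invented for the example. It uses values that appear in the gas furnace example later in this chapter: 296 observations, r = 2 series, minimum AIC order p = 4, a candidate vector of dimension q = 3, and a smallest canonical correlation of 0.804883. Treating n as the number of observations reported in the output is an assumption of this sketch.

```python
import math

def degrees_of_freedom(r, p, q):
    """r(p+1) - q + 1, shared by the information criterion and Bartlett's test."""
    return r * (p + 1) - q + 1

def information_criterion(n, rho_min, r, p, q, sigcorr=2.0):
    """-n ln(1 - rho_min^2) - lambda (r(p+1) - q + 1); sigcorr plays the role of lambda."""
    return -n * math.log(1.0 - rho_min ** 2) - sigcorr * degrees_of_freedom(r, p, q)

def bartlett_chi_square(n, rho_min, r, p, q):
    """-(n - .5 (r(p+1) - q + 1)) ln(1 - rho_min^2)."""
    df = degrees_of_freedom(r, p, q)
    return -(n - 0.5 * df) * math.log(1.0 - rho_min ** 2)

n, rho_min, r, p, q = 296, 0.804883, 2, 4, 3
df = degrees_of_freedom(r, p, q)
ic = information_criterion(n, rho_min, r, p, q)
chi2 = bartlett_chi_square(n, rho_min, r, p, q)
print(df, round(ic, 2), round(chi2, 2))
```

The information criterion is clearly positive for these inputs, so under the decision rule described above the candidate would be added to the state vector.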
Figure 25.12 shows the output of the CANCORR option for the introductory example shown in the section "Getting Started: STATESPACE Procedure" on page 1570.
   proc statespace data=in out=out lead=10 cancorr;
      var x(1) y(1);
      id t;
   run;
Figure 25.12 (canonical correlations analysis; only part of the output is reproduced here)

   x(T;T)    y(T;T)    x(T+1;T)    DF
        1         1    0.237045     4
New variables are added to the state vector if the information criteria are positive. In this example, $y_{t+1|t}$ and $x_{t+2|t}$ are not added to the state space vector because the information criteria for these models are negative.
If the information criterion is nearly 0, then you might want to investigate models that arise if the opposite decision is made regarding $\rho_{min}$. This investigation can be accomplished by using a FORM statement to specify part or all of the state vector.
Preliminary Estimates of F
When a candidate variable $x_{l,t+k|t}$ yields a zero $\rho_{min}$ and is not added to the state vector, a linear combination of $\mathbf{f}_t^j$ is uncorrelated with the $\mathbf{p}_t$. Because of the method used to construct the $\mathbf{f}_t^j$ sequence, the coefficient of $x_{l,t+k|t}$ in $\boldsymbol{\ell}$ can be taken as 1. Denote the coefficients of $\mathbf{z}_t^j$ in this linear combination as $\boldsymbol{\ell}$. This gives the relationship:

$$x_{l,t+k|t} = \boldsymbol{\ell}' \mathbf{z}_t^j$$

The vector $\boldsymbol{\ell}$ is used as a preliminary estimate of the first $r$ columns of the row of the transition matrix $\mathbf{F}$ corresponding to $x_{l,t+k-1|t}$.
Parameter Estimation
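The model is $\mathbf{z}_{t+1} = \mathbf{F}\mathbf{z}_t + \mathbf{G}\mathbf{e}_{t+1}$, where $\mathbf{e}_t$ is a sequence of independent multivariate normal innovations with mean vector 0 and variance $\boldsymbol{\Sigma}_{ee}$. The observed sequence $\mathbf{x}_t$ composes the first $r$ components of $\mathbf{z}_t$, and thus $\mathbf{x}_t = \mathbf{H}\mathbf{z}_t$, where $\mathbf{H}$ is the $r \times s$ matrix $[\mathbf{I}_r \; 0]$.
Let $\mathbf{E}$ be the $r \times n$ matrix of innovations:

$$\mathbf{E} = \begin{bmatrix} \mathbf{e}_1 & \cdots & \mathbf{e}_n \end{bmatrix}$$

If the number of observations $n$ is reasonably large, the log likelihood $L$ can be approximated up to an additive constant as follows:

$$L = -\frac{n}{2} \ln(|\boldsymbol{\Sigma}_{ee}|) - \frac{1}{2} \mathrm{trace}(\boldsymbol{\Sigma}_{ee}^{-1} \mathbf{E}\mathbf{E}')$$

The elements of $\boldsymbol{\Sigma}_{ee}$ are taken as free parameters and are estimated as follows:

$$\mathbf{S}_0 = \frac{1}{n} \mathbf{E}\mathbf{E}'$$

Replacing $\boldsymbol{\Sigma}_{ee}$ by $\mathbf{S}_0$ in the likelihood equation, the log likelihood, up to an additive constant, is

$$L = -\frac{n}{2} \ln(|\mathbf{S}_0|)$$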
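The substitution step can be spelled out in one line. With $\mathbf{E}\mathbf{E}' = n\mathbf{S}_0$, the trace term no longer depends on the parameters, which is why it is absorbed into the additive constant (here $r$ is the dimension of $\mathbf{x}_t$):

$$\mathrm{trace}(\mathbf{S}_0^{-1}\mathbf{E}\mathbf{E}') = \mathrm{trace}(\mathbf{S}_0^{-1}\, n\,\mathbf{S}_0) = n\,\mathrm{trace}(\mathbf{I}_r) = nr
\quad\Longrightarrow\quad
L = -\frac{n}{2}\ln(|\mathbf{S}_0|) - \frac{nr}{2}$$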
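Letting $B$ be the backshift operator, the formal relation between $\mathbf{x}_t$ and $\mathbf{e}_t$ is

$$\mathbf{x}_t = \mathbf{H}(\mathbf{I} - B\mathbf{F})^{-1}\mathbf{G}\mathbf{e}_t$$

$$\mathbf{e}_t = (\mathbf{H}(\mathbf{I} - B\mathbf{F})^{-1}\mathbf{G})^{-1}\mathbf{x}_t = \sum_{i=0}^{\infty} \boldsymbol{\Xi}_i \mathbf{x}_{t-i}$$

Letting $\mathbf{C}_i$ be the $i$th lagged sample covariance of $\mathbf{x}_t$ and neglecting end effects, the matrix $\mathbf{S}_0$ is

$$\mathbf{S}_0 = \sum_{i,j=0}^{\infty} \boldsymbol{\Xi}_i \mathbf{C}_{-i+j} \boldsymbol{\Xi}'_j$$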
For the computation of $\mathbf{S}_0$, the infinite sum is truncated at the value of the KLAG= option. The value of the KLAG= option should be large enough that the sequence $\boldsymbol{\Xi}_i$ is approximately 0 beyond that point.
Let $\theta$ be the vector of free parameters in the $\mathbf{F}$ and $\mathbf{G}$ matrices. The derivative of the log likelihood with respect to the parameter $\theta$ is

$$\frac{\partial L}{\partial \theta} = -\frac{n}{2}\, \mathrm{trace}\left(\mathbf{S}_0^{-1} \frac{\partial \mathbf{S}_0}{\partial \theta}\right)$$

The second derivative is

$$\frac{\partial^2 L}{\partial \theta\, \partial \theta'} = \frac{n}{2}\, \mathrm{trace}\left(\mathbf{S}_0^{-1} \frac{\partial \mathbf{S}_0}{\partial \theta'} \mathbf{S}_0^{-1} \frac{\partial \mathbf{S}_0}{\partial \theta}\right) - \frac{n}{2}\, \mathrm{trace}\left(\mathbf{S}_0^{-1} \frac{\partial^2 \mathbf{S}_0}{\partial \theta\, \partial \theta'}\right)$$

Near the maximum, the first term is unimportant and the second term can be approximated to give the following second derivative approximation:

$$\frac{\partial^2 L}{\partial \theta\, \partial \theta'} \cong -n\, \mathrm{trace}\left(\mathbf{S}_0^{-1} \frac{\partial \mathbf{E}}{\partial \theta} \frac{\partial \mathbf{E}'}{\partial \theta'}\right)$$

The first derivative matrix and this second derivative matrix approximation are computed from the sample covariance matrix $\mathbf{C}_0$ and the truncated sequence $\boldsymbol{\Xi}_i$. The approximate likelihood function is maximized by a modified Newton-Raphson algorithm that employs these derivative matrices.
The matrix $\mathbf{S}_0$ is used as the estimate of the innovation covariance matrix, $\boldsymbol{\Sigma}_{ee}$. The negative of the inverse of the second derivative matrix at the maximum is used as an approximate covariance matrix for the parameter estimates. The standard errors of the parameter estimates printed in the parameter estimates tables are taken from the diagonal of this covariance matrix. The parameter covariance matrix is printed when the COVB option is specified.
If the data are nearly nonstationary, a better estimate of $\boldsymbol{\Sigma}_{ee}$ and the other parameters can sometimes be obtained by specifying the RESIDEST option. The RESIDEST option estimates the parameters by using conditional least squares instead of maximum likelihood.
The residuals are computed using the state space equation and the sample mean values of the variables in the model as start-up values. The estimate of $\mathbf{S}_0$ is then computed using the residuals from the $i$th observation on, where $i$ is the maximum number of times any variable occurs in the state vector. A multivariate Gauss-Marquardt algorithm is used to minimize $|\mathbf{S}_0|$. See Harvey (1981a) for a further description of this method.
Forecasting
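Given estimates of $\mathbf{F}$, $\mathbf{G}$, and $\boldsymbol{\Sigma}_{ee}$, forecasts of $\mathbf{x}_t$ are computed from the conditional expectation of $\mathbf{z}_t$.
In forecasting, the parameters $\mathbf{F}$, $\mathbf{G}$, and $\boldsymbol{\Sigma}_{ee}$ are replaced with the estimates or by values specified in the RESTRICT statement. One-step-ahead forecasting is performed for the observation $\mathbf{x}_t$, where $t \le n - b$. Here $n$ is the number of observations and $b$ is the value of the BACK= option. For the observation $\mathbf{x}_t$, where $t > n - b$, $m$-step-ahead forecasting is performed for $m = t - n + b$. The forecasts are generated recursively with the initial condition $\mathbf{z}_0 = 0$.
The $m$-step-ahead forecast of $\mathbf{z}_{t+m}$ is $\mathbf{z}_{t+m|t}$, where $\mathbf{z}_{t+m|t}$ denotes the conditional expectation of $\mathbf{z}_{t+m}$ given the information available at time $t$. The $m$-step-ahead forecast of $\mathbf{x}_{t+m}$ is $\mathbf{x}_{t+m|t} = \mathbf{H}\mathbf{z}_{t+m|t}$, where the matrix $\mathbf{H} = [\mathbf{I}_r \; 0]$.
Let $\boldsymbol{\Psi}_i = \mathbf{F}^i\mathbf{G}$. Note that the last $s - r$ elements of $\mathbf{z}_t$ consist of the elements of $\mathbf{x}_{u|t}$ for $u > t$.
The state vector $\mathbf{z}_{t+m}$ can be represented as

$$\mathbf{z}_{t+m} = \mathbf{F}^m \mathbf{z}_t + \sum_{i=0}^{m-1} \boldsymbol{\Psi}_i \mathbf{e}_{t+m-i}$$

Therefore, the $m$-step-ahead forecast of $\mathbf{x}_{t+m}$ is

$$\mathbf{x}_{t+m|t} = \mathbf{H}\mathbf{z}_{t+m|t}$$

The $m$-step-ahead forecast error is

$$\mathbf{z}_{t+m} - \mathbf{z}_{t+m|t} = \sum_{i=0}^{m-1} \boldsymbol{\Psi}_i \mathbf{e}_{t+m-i}$$

Letting $\mathbf{V}_{z,0} = 0$, the variance of the $m$-step-ahead forecast error of $\mathbf{z}_{t+m}$, $\mathbf{V}_{z,m}$, can be computed recursively as follows:

$$\mathbf{V}_{z,m} = \mathbf{V}_{z,m-1} + \boldsymbol{\Psi}_{m-1} \boldsymbol{\Sigma}_{ee} \boldsymbol{\Psi}'_{m-1}$$

The variance of the $m$-step-ahead forecast error of $\mathbf{x}_{t+m}$ is the $r \times r$ left upper submatrix of $\mathbf{V}_{z,m}$; that is,

$$\mathbf{V}_{x,m} = \mathbf{H}\mathbf{V}_{z,m}\mathbf{H}'$$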
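The variance recursion is easy to trace on a small model. The Python sketch below is an illustration with invented helper names and hypothetical parameter values, written with plain lists so that it needs no external libraries. It accumulates the state forecast error variance one step at a time, using the impulse matrices F^i G, and extracts the upper left r by r block at each step.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def forecast_error_variances(F, G, sigma, r, steps):
    """Return [V_{x,1}, ..., V_{x,steps}] from V_{z,m} = V_{z,m-1} + Psi_{m-1} Sigma Psi_{m-1}'."""
    s = len(F)
    H = [[1.0 if i == j else 0.0 for j in range(s)] for i in range(r)]  # H = [I_r 0]
    Vz = [[0.0] * s for _ in range(s)]      # V_{z,0} = 0
    psi = [row[:] for row in G]             # Psi_0 = F^0 G = G
    out = []
    for _ in range(steps):
        Vz = add(Vz, matmul(matmul(psi, sigma), transpose(psi)))
        out.append(matmul(matmul(H, Vz), transpose(H)))
        psi = matmul(F, psi)                # Psi_i = F Psi_{i-1}
    return out

# Hypothetical univariate model with state vector (x_t, x_{t+1|t})'.
F = [[0.0, 1.0], [0.25, 0.5]]
G = [[1.0], [0.75]]
sigma = [[1.0]]
variances = forecast_error_variances(F, G, sigma, r=1, steps=3)
print([v[0][0] for v in variances])
```

The forecast standard errors would be the square roots of the diagonal elements of each returned matrix.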
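Unless the NOCENTER option is specified, the sample mean vector is added to the forecast. When differencing is specified, the forecasts $\mathbf{x}_{t+m|t}$ plus the sample mean vector are integrated back to produce forecasts for the original series.
Let $\mathbf{y}_t$ be the original series specified by the VAR statement, with some 0 values appended that correspond to the unobserved past observations. Let $B$ be the backshift operator, and let $\boldsymbol{\Delta}(B)$ be the $s \times s$ matrix polynomial in the backshift operator that corresponds to the differencing specified by the VAR statement. The off-diagonal elements of $\boldsymbol{\Delta}_i$ are 0. Note that $\boldsymbol{\Delta}_0 = \mathbf{I}_s$, where $\mathbf{I}_s$ is the $s \times s$ identity matrix. Then $\mathbf{z}_t = \boldsymbol{\Delta}(B)\mathbf{y}_t$.
This gives the relationship

$$\mathbf{y}_t = \boldsymbol{\Delta}^{-1}(B)\mathbf{z}_t = \sum_{i=0}^{\infty} \boldsymbol{\Lambda}_i \mathbf{z}_{t-i}$$

where $\boldsymbol{\Delta}^{-1}(B) = \sum_{i=0}^{\infty} \boldsymbol{\Lambda}_i B^i$ and $\boldsymbol{\Lambda}_0 = \mathbf{I}_s$.
The $m$-step-ahead forecast of $\mathbf{y}_{t+m}$ is

$$\mathbf{y}_{t+m|t} = \sum_{i=0}^{m-1} \boldsymbol{\Lambda}_i \mathbf{z}_{t+m-i|t} + \sum_{i=m}^{\infty} \boldsymbol{\Lambda}_i \mathbf{z}_{t+m-i}$$

The $m$-step-ahead forecast error of $\mathbf{y}_{t+m}$ is

$$\sum_{i=0}^{m-1} \boldsymbol{\Lambda}_i \left(\mathbf{z}_{t+m-i} - \mathbf{z}_{t+m-i|t}\right) = \sum_{i=0}^{m-1} \left(\sum_{u=0}^{i} \boldsymbol{\Lambda}_u \boldsymbol{\Psi}_{i-u}\right) \mathbf{e}_{t+m-i}$$

Letting $\mathbf{V}_{y,0} = 0$, the variance of the $m$-step-ahead forecast error of $\mathbf{y}_{t+m}$, $\mathbf{V}_{y,m}$, is

$$\mathbf{V}_{y,m} = \sum_{i=0}^{m-1} \left(\sum_{u=0}^{i} \boldsymbol{\Lambda}_u \boldsymbol{\Psi}_{i-u}\right) \boldsymbol{\Sigma}_{ee} \left(\sum_{u=0}^{i} \boldsymbol{\Lambda}_u \boldsymbol{\Psi}_{i-u}\right)'$$

$$= \mathbf{V}_{y,m-1} + \left(\sum_{u=0}^{m-1} \boldsymbol{\Lambda}_u \boldsymbol{\Psi}_{m-1-u}\right) \boldsymbol{\Sigma}_{ee} \left(\sum_{u=0}^{m-1} \boldsymbol{\Lambda}_u \boldsymbol{\Psi}_{m-1-u}\right)'$$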
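The effect of integrating the forecasts back can be seen in the scalar case. The Python sketch below is a simplified illustration that assumes a single series and a state vector of dimension one, so that every weight is a number rather than a matrix. The function names and the numeric inputs are hypothetical. For a single differencing lag, the inverse differencing weights equal 1 at multiples of the lag and 0 elsewhere.

```python
def inverse_difference_weights(lag, count):
    """Weights of 1 / (1 - B^lag): 1 at multiples of lag, 0 elsewhere."""
    return [1.0 if i % lag == 0 else 0.0 for i in range(count)]

def integrated_forecast_variances(psi, lam, sigma2, steps):
    """Scalar version of V_{y,m} = V_{y,m-1} + c_{m-1} sigma2 c_{m-1},
    where c_{m-1} = sum_{u=0}^{m-1} lam[u] * psi[m-1-u]."""
    out, v = [], 0.0
    for m in range(1, steps + 1):
        coef = sum(lam[u] * psi[m - 1 - u] for u in range(m))
        v += coef * sigma2 * coef
        out.append(v)
    return out

steps = 4
lam = inverse_difference_weights(1, steps)   # first differencing

# Differenced series is white noise: psi = (1, 0, 0, ...), so the
# original series is a random walk and the variance grows linearly.
random_walk = integrated_forecast_variances([1.0, 0.0, 0.0, 0.0], lam, 1.0, steps)

# Differenced series has impulse responses 0.5**i (hypothetical values).
ar_diff = integrated_forecast_variances([0.5 ** i for i in range(steps)], lam, 1.0, steps)
print(random_walk, ar_diff)
```

The first case reproduces the familiar linear growth of random walk forecast variance, which is a quick sanity check on the recursion.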
If the roots of the determinantal equation $|\boldsymbol{\Phi}(B)| = 0$ are outside the unit circle in the complex plane, the model can also be written as

$$\mathbf{x}_t = \boldsymbol{\Phi}^{-1}(B)\boldsymbol{\Theta}(B)\mathbf{e}_t = \sum_{i=0}^{\infty} \boldsymbol{\Psi}_i \mathbf{e}_{t-i}$$

The $\boldsymbol{\Psi}_i$ matrices are known as the impulse response matrices and can be computed as $\boldsymbol{\Phi}^{-1}(B)\boldsymbol{\Theta}(B)$.
You can assume $p > q$ since, if this is not initially true, you can add more terms $\boldsymbol{\Phi}_i$ that are identically 0 without changing the model.
To write this set of equations in a state space form, proceed as follows. Let $\mathbf{x}_{t+i|t}$ be the conditional expectation of $\mathbf{x}_{t+i}$ given $\mathbf{x}_w$ for $w \le t$. The following relations hold:

$$\mathbf{x}_{t+i|t} = \sum_{j=i}^{\infty} \boldsymbol{\Psi}_j \mathbf{e}_{t+i-j}$$

$$\mathbf{x}_{t+i|t+1} = \mathbf{x}_{t+i|t} + \boldsymbol{\Psi}_{i-1} \mathbf{e}_{t+1}$$

However, from equation (1) you can derive the following relationship:

$$\mathbf{x}_{t+p|t} = \boldsymbol{\Phi}_1 \mathbf{x}_{t+p-1|t} + \cdots + \boldsymbol{\Phi}_p \mathbf{x}_t \qquad (2)$$

Hence, when $i = p$, you can substitute for $\mathbf{x}_{t+p|t}$ in the right-hand side of equation (2) and close the system of equations. This substitution results in the following model in the state space form $\mathbf{z}_{t+1} = \mathbf{F}\mathbf{z}_t + \mathbf{G}\mathbf{e}_{t+1}$:

$$\begin{bmatrix} \mathbf{x}_{t+1} \\ \mathbf{x}_{t+2|t+1} \\ \vdots \\ \mathbf{x}_{t+p|t+1} \end{bmatrix} = \begin{bmatrix} 0 & \mathbf{I} & 0 & \cdots & 0 \\ 0 & 0 & \mathbf{I} & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ \boldsymbol{\Phi}_p & \boldsymbol{\Phi}_{p-1} & \cdots & \cdots & \boldsymbol{\Phi}_1 \end{bmatrix} \begin{bmatrix} \mathbf{x}_t \\ \mathbf{x}_{t+1|t} \\ \vdots \\ \mathbf{x}_{t+p-1|t} \end{bmatrix} + \begin{bmatrix} \mathbf{I} \\ \boldsymbol{\Psi}_1 \\ \vdots \\ \boldsymbol{\Psi}_{p-1} \end{bmatrix} \mathbf{e}_{t+1}$$

Note that the state vector $\mathbf{z}_t$ is composed of conditional expectations of $\mathbf{x}_t$ and the first $r$ components of $\mathbf{z}_t$ are equal to $\mathbf{x}_t$.
The state space form can be cast into an ARMA form by solving the system of difference equations for the first $r$ components.
When converting from an ARMA form to a state space form, you can generate a state vector larger than needed; that is, the state space model might not be a minimal representation. When going from a state space form to an ARMA form, you can have nontrivial common factors in the autoregressive and moving average operators that yield an ARMA model larger than necessary.
If the state space form used is not a minimal representation, some but not all components of $\mathbf{x}_{t+i|t}$ might be linearly dependent. This situation corresponds to $[\boldsymbol{\Phi}_p \; \boldsymbol{\Theta}_{p-1}]$ being of less than full rank when $\boldsymbol{\Phi}(B)$ and $\boldsymbol{\Theta}(B)$ have no common nontrivial left factors. In this case, $\mathbf{z}_t$ consists of a subset of the possible components of $[\mathbf{x}_{t+i|t}]$, $i = 1, 2, \ldots, p-1$. However, once a component of $\mathbf{x}_{t+i|t}$ (for example, the $j$th one) is linearly dependent on the previous conditional expectations, then all subsequent $j$th components of $\mathbf{x}_{t+k|t}$ for $k > i$ must also be linearly dependent. Note that in this case, equivalent but seemingly different structures can arise if the order of the components within $\mathbf{x}_t$ is changed.
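The construction of F and G from an ARMA model can be checked on a univariate example. The Python sketch below is an illustration under stated assumptions: it takes the sign convention x_t = phi_1 x_{t-1} + ... + phi_p x_{t-p} + e_t + theta_1 e_{t-1} + ... + theta_q e_{t-q} with p > q, the function names are invented, and the coefficient values are hypothetical. It builds the companion-style transition matrix and the input vector of impulse responses, and then confirms that the state space form reproduces the ARMA impulse responses.

```python
def arma_psi_weights(phi, theta, count):
    """psi_0 = 1 and psi_j = theta_j + sum_i phi_i psi_{j-i} (theta_j = 0 beyond q)."""
    psi = [1.0]
    for j in range(1, count):
        value = theta[j - 1] if j <= len(theta) else 0.0
        for i in range(1, min(j, len(phi)) + 1):
            value += phi[i - 1] * psi[j - i]
        psi.append(value)
    return psi

def arma_to_state_space(phi, theta):
    """Return (F, G) for the state vector (x_t, x_{t+1|t}, ..., x_{t+p-1|t})'."""
    p = len(phi)
    F = [[1.0 if j == i + 1 else 0.0 for j in range(p)] for i in range(p - 1)]
    F.append([phi[p - 1 - j] for j in range(p)])   # last row: phi_p, ..., phi_1
    G = arma_psi_weights(phi, theta, p)            # (1, psi_1, ..., psi_{p-1})'
    return F, G

def state_space_impulse_responses(F, G, count):
    """First component of F^i G for i = 0, ..., count-1 (H = [1 0 ... 0])."""
    out, v = [], G[:]
    for _ in range(count):
        out.append(v[0])
        v = [sum(F[i][k] * v[k] for k in range(len(v))) for i in range(len(F))]
    return out

phi, theta = [0.5, 0.3], [0.4]       # hypothetical ARMA(2,1) coefficients
F, G = arma_to_state_space(phi, theta)
from_arma = arma_psi_weights(phi, theta, 6)
from_state_space = state_space_impulse_responses(F, G, 6)
print(F, G)
```

Agreement of the two impulse response sequences is what it means for the state space form and the ARMA form to describe the same process.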
SIGBl, numeric variables that contain the estimate of the innovation covariance matrices for the backward autoregressive models. The variable SIGBl contains the $l$th column of $\hat{\boldsymbol{\Omega}}_p$ in the observations with ORDER=$p$.
FORk_l, numeric variables that contain the estimates of the autoregressive parameter matrices for the forward models. The variable FORk_l contains the $l$th column of the lag $k$ autoregressive parameter matrix $\hat{\boldsymbol{\Psi}}_k^p$ in the observations with ORDER=$p$.
BACk_l, numeric variables that contain the estimates of the autoregressive parameter matrices for the backward models. The variable BACk_l contains the $l$th column of the lag $k$ autoregressive parameter matrix $\hat{\boldsymbol{\Phi}}_k^p$ in the observations with ORDER=$p$.
The estimates for the order $p$ autoregressive model can be selected as those observations with ORDER=$p$. Within these observations, the $k,l$th element of $\boldsymbol{\Psi}_i^p$ is given by the value of the FORi_l variable in the $k$th observation. The $k,l$th element of $\boldsymbol{\Phi}_i^p$ is given by the value of the BACi_l variable in the $k$th observation. The $k,l$th element of $\boldsymbol{\Sigma}_p$ is given by SIGFl in the $k$th observation. The $k,l$th element of $\boldsymbol{\Omega}_p$ is given by SIGBl in the $k$th observation.
Table 25.2 shows an example of the OUTAR= data set, with ARMAX=3 and $\mathbf{x}_t$ of dimension 2. In Table 25.2, $(i,j)$ indicate the $i,j$th element of the matrix.

Table 25.2

Obs  ORDER  SIGB2             FOR1_1            FOR1_2            FOR2_1            FOR2_2            FOR3_1
 1     0    $\Omega_0(1,2)$   .                 .                 .                 .                 .
 2     0    $\Omega_0(2,2)$   .                 .                 .                 .                 .
 3     1    $\Omega_1(1,2)$   $\Psi_1^1(1,1)$   $\Psi_1^1(1,2)$   .                 .                 .
 4     1    $\Omega_1(2,2)$   $\Psi_1^1(2,1)$   $\Psi_1^1(2,2)$   .                 .                 .
 5     2    $\Omega_2(1,2)$   $\Psi_1^2(1,1)$   $\Psi_1^2(1,2)$   $\Psi_2^2(1,1)$   $\Psi_2^2(1,2)$   .
 6     2    $\Omega_2(2,2)$   $\Psi_1^2(2,1)$   $\Psi_1^2(2,2)$   $\Psi_2^2(2,1)$   $\Psi_2^2(2,2)$   .
 7     3    $\Omega_3(1,2)$   $\Psi_1^3(1,1)$   $\Psi_1^3(1,2)$   $\Psi_2^3(1,1)$   $\Psi_2^3(1,2)$   $\Psi_3^3(1,1)$
 8     3    $\Omega_3(2,2)$   $\Psi_1^3(2,1)$   $\Psi_1^3(2,2)$   $\Psi_2^3(2,1)$   $\Psi_2^3(2,2)$   $\Psi_3^3(2,1)$

Obs  FOR3_2            BACK1_1           BACK1_2           BACK2_1           BACK2_2           BACK3_1           BACK3_2
 1   .                 .                 .                 .                 .                 .                 .
 2   .                 .                 .                 .                 .                 .                 .
 3   .                 $\Phi_1^1(1,1)$   $\Phi_1^1(1,2)$   .                 .                 .                 .
 4   .                 $\Phi_1^1(2,1)$   $\Phi_1^1(2,2)$   .                 .                 .                 .
 5   .                 $\Phi_1^2(1,1)$   $\Phi_1^2(1,2)$   $\Phi_2^2(1,1)$   $\Phi_2^2(1,2)$   .                 .
 6   .                 $\Phi_1^2(2,1)$   $\Phi_1^2(2,2)$   $\Phi_2^2(2,1)$   $\Phi_2^2(2,2)$   .                 .
 7   $\Psi_3^3(1,2)$   $\Phi_1^3(1,1)$   $\Phi_1^3(1,2)$   $\Phi_2^3(1,1)$   $\Phi_2^3(1,2)$   $\Phi_3^3(1,1)$   $\Phi_3^3(1,2)$
 8   $\Psi_3^3(2,2)$   $\Phi_1^3(2,1)$   $\Phi_1^3(2,2)$   $\Phi_2^3(2,1)$   $\Phi_2^3(2,2)$   $\Phi_3^3(2,1)$   $\Phi_3^3(2,2)$
The estimated autoregressive parameters can be used in the IML procedure to obtain autoregressive estimates of the spectral density function or forecasts based on the autoregressive models.
the BY variables
STATEVEC, a character variable that contains the name of the component of the state vector corresponding to the observation. The STATEVEC variable has the value STD for standard deviations observations, which contain the standard errors for the estimates given in the preceding observation.
F_j, numeric variables that contain the columns of the F matrix. The variable F_j contains the $j$th column of F. The number of F_j variables is equal to the value of the DIMMAX= option. If the model is of smaller dimension, the extraneous variables are set to missing.
G_j, numeric variables that contain the columns of the G matrix. The variable G_j contains the $j$th column of G. The number of G_j variables is equal to $r$, the dimension of $\mathbf{x}_t$ given by the number of variables in the VAR statement.
SIG_j, numeric variables that contain the columns of the innovation covariance matrix. The variable SIG_j contains the $j$th column of $\boldsymbol{\Sigma}_{ee}$. There are $r$ variables SIG_j.
Table 25.3 shows an example of the OUTMODEL= data set, with $\mathbf{x}_t = (x_t, y_t)'$, $\mathbf{z}_t = (x_t, y_t, x_{t+1|t})'$, and DIMMAX=4. In Table 25.3, $F_{i,j}$ and $G_{i,j}$ are the $i,j$th elements of F and G respectively. Note that all elements for F_4 are missing because F is a $3 \times 3$ matrix.

Table 25.3 Value in the OUTMODEL= Data Set

Obs  STATEVEC   F_1             F_2             F_3             F_4  G_1             G_2             SIG_1            SIG_2
 1   X(T;T)     0               0               1               .    1               0               $\Sigma_{1,1}$   $\Sigma_{1,2}$
 2   STD        .               .               .               .    .               .               .                .
 3   Y(T;T)     $F_{2,1}$       $F_{2,2}$       $F_{2,3}$       .    0               1               $\Sigma_{2,1}$   $\Sigma_{2,2}$
 4   STD        std $F_{2,1}$   std $F_{2,2}$   std $F_{2,3}$   .    .               .               .                .
 5   X(T+1;T)   $F_{3,1}$       $F_{3,2}$       $F_{3,3}$       .    $G_{3,1}$       $G_{3,2}$       .                .
 6   STD        std $F_{3,1}$   std $F_{3,2}$   std $F_{3,3}$   .    std $G_{3,1}$   std $G_{3,2}$   .                .
Printed Output
The printed output produced by the STATESPACE procedure includes the following:
1. descriptive statistics, which include the number of observations used, the names of the variables, their means and standard deviations (Std), and the differencing operations used
2. the Akaike information criteria for the sequence of preliminary autoregressive models
3. if the PRINTOUT=LONG option is specified, the sample autocovariance matrices of the input series at various lags
4. if the PRINTOUT=LONG option is specified, the sample autocorrelation matrices of the input series
5. a schematic representation of the autocorrelation matrices, showing the significant autocorrelations
6. if the PRINTOUT=LONG option is specified, the partial autoregressive matrices. (These are $\boldsymbol{\Psi}_p^p$ as described in the section "Preliminary Autoregressive Models" on page 1590.)
7. a schematic representation of the partial autocorrelation matrices, showing the significant partial autocorrelations
8. the Yule-Walker estimates of the autoregressive parameters for the autoregressive model with the minimum AIC
9. if the PRINTOUT=LONG option is specified, the autocovariance matrices of the residuals of the minimum AIC model. This is the sequence of estimated innovation variance matrices for the solutions of the Yule-Walker equations.
10. if the PRINTOUT=LONG option is specified, the autocorrelation matrices of the residuals of the minimum AIC model
11. if the CANCORR option is specified, the canonical correlations analysis for each potential state vector considered in the state vector selection process. This includes the potential state vector, the canonical correlations, the information criterion for the smallest canonical correlation, Bartlett's $\chi^2$ statistic ("Chi Square") for the smallest canonical correlation, and the degrees of freedom of Bartlett's $\chi^2$.
12. the components of the chosen state vector
13. the preliminary estimate of the transition matrix, F, the input matrix, G, and the variance matrix for the innovations, $\boldsymbol{\Sigma}_{ee}$
14. if the ITPRINT option is specified, the iteration history of the likelihood maximization. For each iteration, this shows the iteration number, the number of step halvings, the determinant of the innovation variance matrix, the damping factor Lambda, and the values of the parameters.
15. the state vector, printed again to aid interpretation of the following listing of F and G
16. the final estimate of the transition matrix F
17. the final estimate of the input matrix G
18. the final estimate of the variance matrix for the innovations $\boldsymbol{\Sigma}_{ee}$
19. a table that lists the estimates of the free parameters in F and G and their standard errors and t statistics
20. if the COVB option is specified, the covariance matrix of the parameter estimates
21. if the COVB option is specified, the correlation matrix of the parameter estimates
22. if the PRINT option is specified, the forecasts and their standard errors
ODS Table Name       Description                                  Option
NObs                 number of observations                       default
Summary              simple summary statistics table              default
InfoCriterion        information criterion table                  default
CovLags              covariance matrices of input series          PRINTOUT=LONG
CorrLags             correlation matrices of input series         PRINTOUT=LONG
PartialAR            partial autoregressive matrices              PRINTOUT=LONG
YWEstimates          Yule-Walker estimates for minimum AIC        default
CovResiduals         covariance of residuals                      PRINTOUT=LONG
CorrResiduals        residual correlations from AR models         PRINTOUT=LONG
StateVector          state vector table                           default
CorrGraph            schematic representation of correlations     default
TransitionMatrix     transition matrix                            default
InputMatrix          input matrix                                 default
VarInnov             variance matrix for the innovation           default
CovB                 covariance of parameter estimates            COVB
CorrB                correlation of parameter estimates           COVB
CanCorr              canonical correlation analysis               CANCORR
IterHistory          iterative fitting table                      ITPRINT
ParameterEstimates   parameter estimates table                    default
Forecasts            forecasts table                              PRINT
ConvergenceStatus    convergence status table                     default
The results for the automatically selected model are shown in Output 25.1.1.
Output 25.1.1 Results for Automatically Selected Model

Gas Furnace Data
Box & Jenkins Series J
Automatically Selected Model

The STATESPACE Procedure

Number of Observations    296

Variable    Standard Error
x           1.072766
y           3.202121

Information Criterion for Autoregressive Models

Lag=0       Lag=1       Lag=2       Lag=3       Lag=4       Lag=5       Lag=6       Lag=7       Lag=8
651.3862    -1033.57    -1632.96    -1645.12    -1651.52    -1648.91    -1649.34    -1643.15    -1638.56

Lag=9       Lag=10
-1634.8     -1633.59
Yule-Walker Estimates for Minimum AIC

     ------Lag=1------     ------Lag=2------     ------Lag=3------     ------Lag=4------
            x          y          x          y          x          y          x          y
x    1.925887   -0.00124   -1.20166   0.004224   0.116918   -0.00867   0.104236   0.003268
y    0.050496   1.299793   -0.02046    -0.3277   -0.71182   -0.25701   0.195411   0.133417
Canonical Correlations Analysis (only part of the output is reproduced here)

   x(T;T)    y(T;T)    x(T+1;T)    DF
        1         1    0.804883     8
Estimate of Transition Matrix

          0          0          1          0          0
          0          0          0          1          0
   -0.84718   0.026794   1.711715   -0.05019          0
          0          0          0          0          1
   -0.19785   0.334274   -0.18174   -1.23557   1.787475

Input Matrix for Innovation

          1          0
          0          1
   1.925887   -0.00124
   0.050496   1.299793
   0.142421   1.361696

Input Matrix for Innovation

          1          0
          0          1
    1.92442   -0.00416
   0.015621   1.258495
    0.08058   1.353204

Parameter Estimates

Parameter    Estimate    Standard Error    t Value
F(3,1)       -0.86192          0.072961     -11.81
F(3,2)       0.030609          0.026167       1.17
F(3,3)       1.724235          0.061599      27.99
F(3,4)       -0.05483          0.030169      -1.82
F(5,1)       -0.34839          0.135253      -2.58
F(5,2)       0.292124          0.046299       6.31
F(5,3)       -0.09435          0.096527      -0.98
F(5,4)       -1.09823          0.109525     -10.03
F(5,5)       1.671418          0.083737      19.96
G(3,1)       1.924420          0.058162      33.09
G(3,2)       -0.00416          0.035255      -0.12
G(4,1)       0.015621          0.095771       0.16
G(4,2)       1.258495          0.055742      22.58
G(5,1)       0.080580          0.151622       0.53
G(5,2)       1.353204          0.091388      14.81
The two series are believed to have a transfer function relation with the gas rate (variable X) as the input and the CO2 concentration (variable Y) as the output. Since the parameter estimates shown in Output 25.1.1 support this kind of model, the model is reestimated with the feedback parameters restricted to 0. The following statements fit the transfer function (no feedback) model.
   title3 'Transfer Function Model';
   proc statespace data=seriesj printout=none;
The last two pages of the output are shown in Output 25.1.8.
Output 25.1.8 STATESPACE Output for Transfer Function Model

Gas Furnace Data
Box & Jenkins Series J
Transfer Function Model

The STATESPACE Procedure
Selected Statespace Form and Fitted Model

State Vector

   x(T;T)    y(T;T)    x(T+1;T)    y(T+1;T)    y(T+2;T)

Estimate of Transition Matrix

          0          0          1          0          0
          0          0          0          1          0
   -0.68882          0   1.598717          0          0
          0          0          0          0          1
   -0.35944   0.284179    -0.0963   -1.07313   1.650047

Parameter    Estimate    t Value
F(3,1)       -0.68882     -13.63
F(3,3)       1.598717      31.39
F(5,1)       -0.35944      -1.57
F(5,2)       0.284179       2.93
F(5,3)       -0.09630      -0.68
F(5,4)       -1.07313      -4.29
F(5,5)       1.650047       8.75
G(3,1)       1.923446      34.15
G(4,2)       1.260856      22.33
G(5,2)       1.346332      14.78
References
Akaike, H. (1974), "Markovian Representation of Stochastic Processes and Its Application to the Analysis of Autoregressive Moving Average Processes," Annals of the Institute of Statistical Mathematics, 26, 363–387.
Akaike, H. (1976), "Canonical Correlations Analysis of Time Series and the Use of an Information Criterion," in Advances and Case Studies in System Identification, eds. R. Mehra and D.G. Lainiotis, New York: Academic Press.
Anderson, T.W. (1971), The Statistical Analysis of Time Series, New York: John Wiley & Sons.
Ansley, C.F. and Newbold, P. (1979), "Multivariate Partial Autocorrelations," Proceedings of the Business and Economic Statistics Section, American Statistical Association, 349–353.
Box, G.E.P. and Jenkins, G. (1976), Time Series Analysis: Forecasting and Control, San Francisco: Holden-Day.
Brockwell, P.J. and Davis, R.A. (1991), Time Series: Theory and Methods, 2nd Edition, Springer-Verlag.
Hannan, E.J. (1970), Multiple Time Series, New York: John Wiley & Sons.
Hannan, E.J. (1976), "The Identification and Parameterization of ARMAX and State Space Forms," Econometrica, 44, 713–722.
Harvey, A.C. (1981a), The Econometric Analysis of Time Series, New York: John Wiley & Sons.
Harvey, A.C. (1981b), Time Series Models, New York: John Wiley & Sons.
Jones, R.H. (1974), "Identification and Autoregressive Spectrum Estimation," IEEE Transactions on Automatic Control, AC-19, 894–897.
Pham-Dinh-Tuan (1978), "On the Fitting of Multivariate Processes of the Autoregressive Moving Average Type," Biometrika, 65, 99–107.
Priestley, M.B. (1980), "System Identification, Kalman Filtering, and Stochastic Control," in Directions in Time Series, eds. D.R. Brillinger and G.C. Tiao, Institute of Mathematical Statistics.
Whittle, P. (1963), "On the Fitting of Multivariate Autoregressions and the Approximate Canonical Factorization of a Spectral Density Matrix," Biometrika, 50, 129–134.
Chapter 26
      Computational Details
      Missing Values
      OUT= Data Set
      OUTEST= Data Set
      OUTSSCP= Data Set
      Printed Output
      ODS Table Names
      ODS Graphics
   Examples: SYSLIN Procedure
      Example 26.1: Klein's Model I Estimated with LIML and 3SLS
      Example 26.2: Grunfeld's Model Estimated with SUR
      Example 26.3: Illustration of ODS Graphics
   References
Other features of the SYSLIN procedure enable you to:
   impose linear restrictions on the parameter estimates
   test linear hypotheses about the parameters
   write predicted and residual values to an output SAS data set
   write parameter estimates to an output SAS data set
   write the crossproducts matrix (SSCP) to an output SAS data set
   use raw data, correlations, covariances, or cross products as input
An Example Model
In simultaneous systems of equations, endogenous variables are determined jointly rather than sequentially. Consider the following supply and demand functions for some product:

$$Q_D = a_1 + b_1 P + c_1 Y + d_1 S + \epsilon_1 \quad \text{(demand)}$$

$$Q_S = a_2 + b_2 P + c_2 U + \epsilon_2 \quad \text{(supply)}$$

$$Q = Q_D = Q_S \quad \text{(market equilibrium)}$$

The variables in this system are as follows:
   $Q_D$          quantity demanded
   $Q_S$          quantity supplied
   $Q$            the observed quantity sold, which equates quantity supplied and quantity demanded in equilibrium
   $P$            price per unit
   $Y$            income
   $S$            price of substitutes
   $U$            unit cost
   $\epsilon_1$   the random error term for the demand equation
   $\epsilon_2$   the random error term for the supply equation
In this system, quantity demanded depends on price, income, and the price of substitutes. Consumers normally purchase more of a product when prices are lower and when income and the price of substitute goods are higher. Quantity supplied depends on price and the unit cost of production. Producers supply more when price is high and when unit cost is low. The actual price and quantity sold are determined jointly by the values that equate demand and supply. Since price and quantity are jointly endogenous variables, both structural equations are necessary to adequately describe the observed values. A critical assumption of OLS is that the regressors are uncorrelated with the residual. When current endogenous variables appear as regressors in other equations (endogenous variables depend on each other), this assumption is violated and the OLS parameter estimates are biased and inconsistent. The bias caused by the violated assumptions is called simultaneous equation bias. Neither the demand nor supply equation can be estimated consistently by OLS.
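The source of the bias can be made explicit by solving the system for price. Setting $Q_D = Q_S$ in the demand and supply equations and assuming $b_2 \neq b_1$ (an assumption added here so that the equilibrium price is well defined) gives the reduced form

$$P = \frac{(a_1 - a_2) + c_1 Y + d_1 S - c_2 U + (\epsilon_1 - \epsilon_2)}{b_2 - b_1}$$

Price therefore depends on both error terms, so $P$ is correlated with $\epsilon_1$ in the demand equation and with $\epsilon_2$ in the supply equation, which is exactly the violation of the OLS assumption described above.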
OLS Estimation
PROC SYSLIN performs OLS regression if you do not specify a method of estimation in the PROC SYSLIN statement. OLS does not use instruments, so the ENDOGENOUS and INSTRUMENTS statements can be omitted. The following statements estimate the supply and demand model shown previously:
proc syslin data=in;
   demand: model q = p y s;
   supply: model q = p u;
run;
The PROC SYSLIN output for the demand equation is shown in Figure 26.1, and the output for the supply equation is shown in Figure 26.2.
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3         9.587901      3.195967    398.31   <.0001
Error              56         0.449338      0.008024
Corrected Total    59         10.03724

Root MSE          0.08958    R-Square   0.95523
Dependent Mean    1.30095    Adj R-Sq   0.95283
Coeff Var         6.88542

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             -0.47677         0.210239   Intercept
p            1             0.123326         0.105177   Price
y            1             0.201282         0.032403   Income
s            1             0.167258         0.024091   Price of Substitutes
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2         9.033902      4.516951    256.61   <.0001
Error              57         1.003337      0.017602
Corrected Total    59         10.03724

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             -0.30389         0.471397   Intercept
p            1             1.218743         0.053914   Price
u            1             -1.07757         0.234150   Unit Cost
For each MODEL statement, the output first shows the model label and dependent variable name and label. This is followed by an analysis-of-variance table for the model, which shows the model, error, and total mean squares, and an F test for the no-regression hypothesis. Next, the procedure prints the root mean squared error, dependent variable mean and coefficient of variation, and the R-square and adjusted R-square statistics. Finally, the table of parameter estimates shows the estimated regression coefficients, standard errors, and t tests. You would expect the price coefficient in a demand equation to be negative. However, note that the OLS estimate of the price coefficient P in the demand equation (0.1233) has a positive sign. This could be caused by simultaneous equation bias.
The 2SLS option in the PROC SYSLIN statement specifies the two-stage least squares method. The ENDOGENOUS statement specifies that P is an endogenous regressor for which first-stage predicted values are substituted. You need to declare an endogenous variable in the ENDOGENOUS statement only if it is used as a regressor; thus although Q is endogenous in this model, it is not necessary to list it in the ENDOGENOUS statement. Usually, all predetermined variables that appear in the system are used as instruments. The INSTRUMENTS statement specifies that the exogenous variables Y, U, and S are used as instruments for the first-stage regression to predict P.
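The statements that this paragraph describes might look like the following. This is a sketch that assumes the same input data set IN and the same DEMAND and SUPPLY model labels used in the OLS example.

```sas
/* 2SLS sketch: P is the endogenous regressor; Y, U, and S are the instruments */
proc syslin data=in 2sls;
   endogenous p;
   instruments y u s;
   demand: model q = p y s;
   supply: model q = p u;
run;
```

Q is not listed in the ENDOGENOUS statement because it appears only as a dependent variable, never as a regressor.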
The 2SLS results are shown in Figure 26.3 and Figure 26.4. The first-stage regressions are not shown. To see the first-stage regression results, use the FIRST option in the PROC SYSLIN statement.
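As a brief sketch, the FIRST option is added to the PROC SYSLIN statement alongside the estimation method. The data set IN and the model statements are assumed to be the same as in the 2SLS run.

```sas
/* Same 2SLS model, with first-stage regression statistics also printed */
proc syslin data=in 2sls first;
   endogenous p;
   instruments y u s;
   demand: model q = p y s;
   supply: model q = p u;
run;
```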
Figure 26.3 2SLS Results for Demand Equation
The SYSLIN Procedure
Two-Stage Least Squares Estimation

Model                 DEMAND
Dependent Variable    q
Label                 Quantity

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3         9.670892      3.223631    115.58   <.0001
Error              56         1.561956      0.027892
Corrected Total    59         10.03724

Root MSE           0.16701    R-Square   0.86095
Dependent Mean     1.30095    Adj R-Sq   0.85350
Coeff Var         12.83744

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             1.901048         1.171231   Intercept
p            1             -1.11519         0.607395   Price
y            1             0.419546         0.117955   Income
s            1             0.331475         0.088472   Price of Substitutes
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2         9.646109      4.823054    253.96   <.0001
Error              57         1.082503      0.018991
Corrected Total    59         10.03724

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             -0.51878         0.490999   Intercept
p            1             1.333080         0.059271   Price
u            1             -1.14623         0.243491   Unit Cost
The 2SLS output is similar in form to the OLS output. However, the 2SLS results are based on predicted values for the endogenous regressors from the first-stage instrumental regressions. This makes the analysis-of-variance table and the R-square statistics difficult to interpret. See the sections "ANOVA Table for Instrumental Variables Methods" on page 1650 and "The R-Square Statistics" on page 1651 for details. Note that, unlike the OLS results, the 2SLS estimate for the P coefficient in the demand equation (-1.115) is negative.
For more information about these estimation methods, see the section Estimation Methods on page 1647 and consult econometrics textbooks.
account.
INSTRUMENTS and ENDOGENOUS statements are not needed for SUR, because the SUR method assumes there are no endogenous regressors. For SUR to be effective, the models must use different regressors. SUR produces the same results as OLS unless the model contains at least one regressor not used in the other equations.
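The difference in the statements needed for SUR and 3SLS can be sketched as follows, assuming the same data set IN and supply and demand model as in the earlier examples.

```sas
/* SUR: no ENDOGENOUS or INSTRUMENTS statements are needed */
proc syslin data=in sur;
   demand: model q = p y s;
   supply: model q = p u;
run;

/* 3SLS: the endogenous regressor and instruments are declared as for 2SLS */
proc syslin data=in 3sls;
   endogenous p;
   instruments y u s;
   demand: model q = p y s;
   supply: model q = p u;
run;
```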
The 3SLS output begins with a two-stage least squares regression to estimate the cross-model correlation matrix. This output is the same as the 2SLS results shown in Figure 26.3 and Figure 26.4, and is not repeated here. The next part of the 3SLS output prints the cross-model correlation matrix computed from the 2SLS residuals. This output is shown in Figure 26.5 and includes the cross-model covariances, correlations, the inverse of the correlation matrix, and the inverse covariance matrix.
Cross Model Correlation

            DEMAND     SUPPLY
DEMAND     1.00000   -0.49022
SUPPLY    -0.49022    1.00000

Cross Model Inverse Correlation

            DEMAND     SUPPLY
DEMAND     1.31634    0.64530
SUPPLY     0.64530    1.31634

Cross Model Inverse Covariance

            DEMAND     SUPPLY
DEMAND     47.1941    28.0379
SUPPLY     28.0379    69.3130
Parameter Estimates (demand equation)

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             1.980269         1.169176   Intercept
p            1             -1.17654         0.605015   Price
y            1             0.404117         0.117179   Income
s            1             0.359204         0.085077   Price of Substitutes

Parameter Estimates (supply equation)

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             -0.51878         0.490999   Intercept
p            1             1.333080         0.059271   Price
u            1             -1.14623         0.243491   Unit Cost
This output first prints the system weighted mean squared error and system weighted R-square statistics. The system weighted MSE and system weighted R-square measure the fit of the joint model obtained by stacking all the models together and performing a single regression with the stacked observations weighted by the inverse of the model error variances. See the section "The R-Square Statistics" on page 1651 for details. Next, the table of 3SLS parameter estimates for each model is printed. This output has the same form as for the other estimation methods.

Note that, in some cases, the 3SLS and 2SLS results can be the same. Such a case could arise because of the same principle that causes OLS and SUR results to be identical, unless an equation includes a regressor not used in the other equations of the system. However, the application of this principle is more complex when instrumental variables are used. When all the exogenous variables are used as instruments, linear combinations of all the exogenous variables appear in the third-stage regressions through substitution of first-stage predicted values.

In this example, 3SLS produces different (and, it is hoped, more efficient) estimates for the demand equation. However, the 3SLS and 2SLS results for the supply equation are the same. This is because the supply equation has one endogenous regressor and one exogenous regressor not used in other equations. In contrast, the demand equation has fewer endogenous regressors than exogenous regressors not used in other equations in the system.
Parameter Estimates (demand equation)

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             1.988538         1.233632   Intercept
p            1             -1.18148         0.652278   Price
y            1             0.402312         0.107270   Income
s            1             0.361345         0.103817   Price of Substitutes

Parameter Estimates (supply equation)

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             -0.52443         0.479522   Intercept
p            1             1.336083         0.057939   Price
u            1             -1.14804         0.237793   Unit Cost
run;
The first four pages of this output were as shown previously and are not repeated here. (See Figure 26.3, Figure 26.4, Figure 26.5, and Figure 26.6.) The final page of the output from this example contains the reduced form coefficients from the 3SLS structural estimates, as shown in Figure 26.8.
Figure 26.8 Reduced Form 3SLS Results
The SYSLIN Procedure
Three-Stage Least Squares Estimation

Endogenous Variables

                   p           q
DEMAND      1.176543           1
SUPPLY      -1.33308           1

Exogenous Variables

           Intercept           y           s           u
DEMAND      1.980269    0.404117    0.359204           0
SUPPLY      -0.51878           0           0    -1.14623

Inverse Endogenous Variables

              DEMAND      SUPPLY
p           0.398466    -0.39847
q           0.531187    0.468813

Reduced Form

           Intercept           y           s           u
p           0.995788    0.161027    0.143131    0.456735
q           0.808682    0.214662    0.190804    -0.53737
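The arithmetic behind the reduced form can be checked by hand from the printed matrices. In the sketch below, the symbols $G$ (the Endogenous Variables matrix), $B$ (the Exogenous Variables matrix), and $\Pi$ (the Reduced Form matrix) are labels introduced here for illustration and are not the manual's notation.

```latex
G = \begin{pmatrix} 1.176543 & 1 \\ -1.33308 & 1 \end{pmatrix}, \qquad
\det G = 1.176543 + 1.33308 = 2.509623

G^{-1} = \frac{1}{2.509623}
\begin{pmatrix} 1 & -1 \\ 1.33308 & 1.176543 \end{pmatrix}
\approx \begin{pmatrix} 0.398466 & -0.398466 \\ 0.531187 & 0.468813 \end{pmatrix}

\Pi = G^{-1} B, \qquad \text{for example} \quad
\Pi_{p,y} = 0.398466 \times 0.404117 \approx 0.161027, \quad
\Pi_{q,u} = 0.468813 \times (-1.14623) \approx -0.53737
```

The inverse agrees with the Inverse Endogenous Variables table, and the two sample products agree with the corresponding Reduced Form entries.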
provided the dependent variable uniquely labels the model.) Tests for the significance of the restrictions are printed when RESTRICT or SRESTRICT statements are used. You can label RESTRICT and SRESTRICT statements to identify the restrictions in the output. The RESTRICT statement in the following example restricts the price coefficient in the demand equation to equal 0.015. The SRESTRICT statement restricts the estimate of the income coefficient in the demand equation to be 0.01 times the estimate of the unit cost coefficient in the supply equation.
proc syslin data=in 3sls;
   endogenous p;
   instruments y u s;
   demand: model q = p y s;
   peq015: restrict p = .015;
   supply: model q = p u;
   yeq01u: srestrict demand.y = .01 * supply.u;
run;
Parameter Estimates (demand equation)

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             -0.46584         0.053307   Intercept
p            1             0.015000                0   Price
y            1             -0.00679         0.002357   Income
s            1             0.325589         0.009872   Price of Substitutes
RESTRICT    -1             50.59353         7.464988   PEQ015

Parameter Estimates (supply equation)

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             -1.31894         0.477633   Intercept
p            1             1.291718         0.059101   Price
u            1             -0.67887         0.235679   Unit Cost

Parameter Estimates (cross-model restriction)

Variable    DF   t Value
RESTRICT    -1      8.98
The standard error for P in the demand equation is 0, since the value of the P coefficient was specified by the RESTRICT statement and not estimated from the data. The Parameter Estimates table for the demand equation contains an additional row for the restriction specified by the RESTRICT statement. The parameter estimate for the restriction is the value of the Lagrange multiplier used to impose the restriction. The restriction is highly significant (t = 6.777), which means that the data are not consistent with the restriction, and the model does not fit as well with the restriction imposed. See the section "RESTRICT Statement" on page 1641 for details. Following the Parameter Estimates table for the supply equation, the results for the cross-model restrictions are printed. This shows that the restriction specified by the SRESTRICT statement is not consistent with the data (t = 8.98). See the section "SRESTRICT Statement" on page 1642 for details.
Testing Parameters
You can test linear hypotheses about the model parameters with TEST and STEST statements. The TEST statement tests hypotheses about parameters in the equation specified by the preceding MODEL statement. The STEST statement tests hypotheses that relate parameters in different models. For example, the following statements test the hypothesis that the price coefficient in the demand equation is equal to 0.015.
/*--- 3SLS Estimation with Tests ---*/
proc syslin data=in 3sls;
   endogenous p;
   instruments y u s;
   demand: model q = p y s;
   test_1: test p = .015;
   supply: model q = p u;
run;
The TEST statement results are shown in Figure 26.10. This reports an F test for the hypothesis specified by the TEST statement. In this case, the F statistic is 6.79 (3.879/.571) with 1 and 113 degrees of freedom. The p value for this F statistic is 0.0104, which indicates that the hypothesis tested is almost but not quite rejected at the 0.01 level. See the section "TEST Statement" on page 1644 for details.
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             1.980269         1.169176   Intercept
p            1             -1.17654         0.605015   Price
y            1             0.404117         0.117179   Income
s            1             0.359204         0.085077   Price of Substitutes

Test Results

Num DF   Den DF   F Value   Pr > F   Label
     1      113      6.79   0.0104   TEST_1
To test hypotheses that involve parameters in different equations, use the STEST statement. Specify the parameters in the linear hypothesis as model-label.regressor-name. (If the MODEL statement does not have a label, you can use the dependent variable name as the label for the model, provided the dependent variable uniquely labels the model.) For example, the following statements test the hypothesis that the income coefficient in the demand equation is 0.01 times the unit cost coefficient in the supply equation:
proc syslin data=in 3sls;
   endogenous p;
   instruments y u s;
   demand: model q = p y s;
   supply: model q = p u;
   stest1: stest demand.y = .01 * supply.u;
run;
The STEST statement results are shown in Figure 26.11. The form and interpretation of the STEST statement results are like the TEST statement results. In this case, the F test produces a p value less than 0.0001, and strongly rejects the hypothesis tested. See the section STEST Statement on page 1643 for details.
Parameter Estimates (demand equation)

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             1.980269         1.169176   Intercept
p            1             -1.17654         0.605015   Price
y            1             0.404117         0.117179   Income
s            1             0.359204         0.085077   Price of Substitutes

Parameter Estimates (supply equation)

Variable    DF   Parameter Estimate   Standard Error   Variable Label
Intercept    1             -0.51878         0.490999   Intercept
p            1             1.333080         0.059271   Price
u            1             -1.14623         0.243491   Unit Cost

Test Results

Num DF   Den DF   F Value   Pr > F   Label
     1      113     22.46   <.0001   STEST1
You can combine TEST and STEST statements with RESTRICT and SRESTRICT statements to perform hypothesis tests for restricted models. Of course, the validity of the TEST and STEST statement results depends on the correctness of any restrictions you impose on the estimates.
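A combined run might look like the following sketch. The restriction that the income and substitute-price coefficients are equal is purely hypothetical and is chosen only to show the statement order; the data set IN and the model are assumed to be the same as in the earlier examples.

```sas
/* Hypothetical: test P = 0 in the demand equation, conditional on Y = S */
proc syslin data=in 3sls;
   endogenous p;
   instruments y u s;
   demand: model q = p y s;
   yeqs:   restrict y = s;     /* illustrative restriction, not from the manual */
   ptest:  test p = 0;         /* test is conditional on the restriction above */
   supply: model q = p u;
run;
```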
for new variables to contain the predicted and residual values. For example, the following statements store the predicted quantity from the supply and demand equations in a data set PRED:
/*--- Saving Output Data from 3SLS Estimation ---*/
proc syslin data=in out=pred 3sls;
   endogenous p;
   instruments y u s;
   demand: model q = p y s;
   output predicted=q_demand;
   supply: model q = p u;
   output predicted=q_supply;
run;
Plotting Residuals
You can plot the residuals against the regressors by using PROC SGPLOT. For example, the following statements plot the 2SLS residuals for the demand model against price, income, and price of substitutes.
proc syslin data=in 2sls out=out;
   endogenous p;
   instruments y u s;
   demand: model q = p y s;
   output residual=residual_q;
run;

proc sgplot data=out;
   scatter x=p y=residual_q;
   refline 0 / axis=y;
run;

proc sgplot data=out;
   scatter x=y y=residual_q;
   refline 0 / axis=y;
run;

proc sgplot data=out;
   scatter x=s y=residual_q;
   refline 0 / axis=y;
run;
The plot for income is shown in Figure 26.12. The other plots are not shown.
Functional Summary
The SYSLIN procedure statements and options are summarized in the following table.
Description                                              Statement      Option

Data Set Options
specify the input data set                               PROC SYSLIN    DATA=
specify the output data set                              PROC SYSLIN    OUT=
write parameter estimates to an output data set          PROC SYSLIN    OUTEST=
write covariances to the OUTEST= data set                PROC SYSLIN    OUTCOV, OUTCOV3
write the SSCP matrix to an output data set              PROC SYSLIN    OUTSSCP=

Estimation Method Options
specify full information maximum likelihood estimation   PROC SYSLIN    FIML
specify iterative SUR estimation                         PROC SYSLIN    ITSUR
specify iterative 3SLS estimation                        PROC SYSLIN    IT3SLS
specify K-class estimation                               PROC SYSLIN    K=
specify limited information maximum likelihood           PROC SYSLIN    LIML
  estimation
specify minimum expected loss estimation                 PROC SYSLIN    MELO
specify ordinary least squares estimation                PROC SYSLIN    OLS
specify seemingly unrelated estimation                   PROC SYSLIN    SUR
specify two-stage least squares estimation               PROC SYSLIN    2SLS
specify three-stage least squares estimation             PROC SYSLIN    3SLS
specify Fuller's modification to LIML                    PROC SYSLIN    ALPHA=
specify convergence criterion                            PROC SYSLIN    CONVERGE=
specify maximum number of iterations                     PROC SYSLIN    MAXIT=
use diagonal of S instead of S                           PROC SYSLIN    SDIAG
exclude RESTRICT statements in final stage               PROC SYSLIN    NOINCLUDE
specify criterion for testing for singularity            PROC SYSLIN    SINGULAR=
specify denominator for variance estimates               PROC SYSLIN    VARDEF=

Printing Control Options
print all results                                        PROC SYSLIN    ALL
print first-stage regression statistics                  PROC SYSLIN    FIRST
print estimates and SSE at each iteration                PROC SYSLIN    ITPRINT
print the reduced form estimates                         PROC SYSLIN    REDUCED
print descriptive statistics                             PROC SYSLIN    SIMPLE
print uncorrected SSCP matrix                            PROC SYSLIN    USSCP
print correlations of the parameter estimates            MODEL          CORRB
print covariances of the parameter estimates             MODEL          COVB
print Durbin-Watson statistics                           MODEL          DW
print Basmann's test                                     MODEL          OVERID
plot residual values against regressors                  MODEL          PLOT
print standardized parameter estimates                   MODEL          STB
print unrestricted parameter estimates                   MODEL          UNREST
print the model crossproducts matrix                     MODEL          XPX
print the inverse of the crossproducts matrix            MODEL          I
suppress printed output                                  MODEL          NOPRINT
suppress all printed output                              PROC SYSLIN    NOPRINT

Model Specification
specify structural equations                             MODEL
suppress the intercept parameter                         MODEL          NOINT
specify linear relationship among variables              IDENTITY
perform weighted regression                              WEIGHT

Tests and Restrictions on Parameters
place restrictions on parameter estimates                RESTRICT
place restrictions on parameter estimates                SRESTRICT
test linear hypothesis                                   TEST
test linear hypothesis                                   STEST

Other Statements
specify BY-group processing                              BY
specify the endogenous variables                         ENDOGENOUS
specify instrumental variables                           INSTRUMENTS
write predicted and residual values to a data set        OUTPUT
name variable for predicted values                       OUTPUT         PREDICTED=
name variable for residual values                        OUTPUT         RESIDUAL=
include additional variables in X'X matrix               VAR
The following options can be used with the PROC SYSLIN statement.
DATA=SAS-data-set
specifies the input data set. If the DATA= option is omitted, the most recently created SAS data set is used. In addition to ordinary SAS data sets, PROC SYSLIN can analyze data sets of TYPE=CORR, TYPE=COV, TYPE=UCORR, TYPE=UCOV, and TYPE=SSCP. See the section "Special TYPE= Input Data Sets" on page 1647 for details.
OUT=SAS-data-set
specifies an output SAS data set for residuals and predicted values. The OUT= option is used in conjunction with the OUTPUT statement. See the section "OUT= Data Set" on page 1655 for details.
OUTEST=SAS-data-set
writes the parameter estimates to an output data set. See the section OUTEST= Data Set on page 1655 for details.
OUTCOV COVOUT
writes the covariance matrix of the parameter estimates to the OUTEST= data set in addition to the parameter estimates.
OUTCOV3 COV3OUT
writes covariance matrices for each model in a system to the OUTEST= data set when the 3SLS, SUR, or FIML option is used.
OUTSSCP=SAS-data-set
writes the sum-of-squares-and-crossproducts matrix to an output data set. See the section OUTSSCP= Data Set on page 1656 for details.
ALPHA=value
specifies Fuller's modification to the LIML estimation method. See the section "Fuller's Modification to LIML" on page 1654 for details.
CONVERGE=value
specifies the convergence criterion for the iterative estimation methods IT3SLS, ITSUR, and FIML. The default is CONVERGE=0.0001.
FIML
ITSUR
MAXITER=n
specifies the maximum number of iterations allowed for the IT3SLS, ITSUR, and FIML estimation methods. The MAXITER= option can be abbreviated as MAXIT=. The default is MAXITER=30.
MELO
NOINCLUDE
excludes the RESTRICT statements from the final stage for the 3SLS, IT3SLS, SUR, and ITSUR estimation methods.
OLS
specifies the ordinary least squares estimation method. This is the default.
SDIAG
uses the diagonal of S instead of S to do the estimation, where S is the covariance matrix of equation errors. See the section Uncorrelated Errors across Equations on page 1654 for details.
SINGULAR=value
specifies a criterion for testing singularity of the crossproducts matrix. This is a tuning parameter used to make PROC SYSLIN more or less sensitive to singularities. The value must be between 0 and 1. The default is SINGULAR=1E-8.
SUR
ALL
specifies the CORRB, COVB, DW, I, OVERID, PLOT, STB, and XPX options for every MODEL statement.
FIRST
prints first-stage regression statistics for the endogenous variables regressed on the instruments. This output includes sums of squares, estimates, variances, and standard deviations.
ITPRINT
prints parameter estimates, system-weighted residual sum of squares, and R-square at each iteration for the IT3SLS and ITSUR estimation methods. For the FIML method, the ITPRINT option prints parameter estimates, negative of log-likelihood function, and norm of gradient vector at each iteration.
NOPRINT
suppresses all printed output. Specifying NOPRINT in the PROC SYSLIN statement is equivalent to specifying NOPRINT in every MODEL statement.
REDUCED
prints the reduced form estimates. If the REDUCED option is specified, you should specify any IDENTITY statements needed to make the system square. See the section "Reduced Form Estimates" on page 1653 for details.
SIMPLE
prints descriptive statistics for the dependent variables. The statistics printed include the sum, mean, uncorrected sum of squares, variance, and standard deviation.
USSCP
prints the uncorrected sum-of-squares-and-crossproducts matrix for all variables used in the analysis, including predicted values of variables generated by the procedure.
VARDEF=DF | N | WEIGHT | WGT
specifies the denominator to use in calculating cross-equation error covariances and parameter standard errors and covariances. The default is VARDEF=DF, which corrects for model degrees of freedom. VARDEF=N specifies no degrees-of-freedom correction. VARDEF=WEIGHT specifies the sum of the observation weights. VARDEF=WGT specifies the sum of the observation weights minus the model degrees of freedom. See the section "Computation of Standard Errors" on page 1653 for details.
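Several of the PROC SYSLIN statement options described above can be combined in one statement. The following sketch assumes the supply and demand data set IN from the earlier examples; the output data set name EST and the particular CONVERGE= and MAXITER= values are illustrative choices, not recommendations.

```sas
/* Iterated 3SLS with a tighter convergence criterion, iteration history,   */
/* and parameter estimates plus their covariances written to WORK.EST       */
proc syslin data=in it3sls converge=0.00001 maxiter=50 itprint
            outest=est outcov;
   endogenous p;
   instruments y u s;
   demand: model q = p y s;
   supply: model q = p u;
run;
```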
BY Statement
BY variables ;
A BY statement can be used with PROC SYSLIN to obtain separate analyses on observations in groups defined by the BY variables.
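As a sketch of BY-group processing, suppose the input data set contained a grouping variable named REGION (a hypothetical variable that is not part of the example data). As with other SAS procedures, the input data are assumed to be sorted by the BY variable first.

```sas
/* REGION is a hypothetical grouping variable */
proc sort data=in out=in_sorted;
   by region;
run;

/* One set of 2SLS estimates is produced for each value of REGION */
proc syslin data=in_sorted 2sls;
   by region;
   endogenous p;
   instruments y u s;
   demand: model q = p y s;
   supply: model q = p u;
run;
```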
ENDOGENOUS Statement
ENDOGENOUS variables ;
The ENDOGENOUS statement declares the jointly dependent variables that are projected in the first-stage regression through the instrument variables. The ENDOGENOUS statement is not needed for the SUR, ITSUR, or OLS estimation methods. The default ENDOGENOUS list consists of all the dependent variables in the MODEL and IDENTITY statements that do not appear in the INSTRUMENTS statement.
IDENTITY Statement
IDENTITY equation ;
The IDENTITY statement specifies linear relationships among variables to write to the OUTEST= data set. It provides extra information in the OUTEST= data set but does not create or compute variables. The OUTEST= data set can be processed by the SIMLIN procedure in a later step. The IDENTITY statement is also used to compute reduced form coefficients when the REDUCED option in the PROC SYSLIN statement is specified. See the section "Reduced Form Estimates" on page 1653 for details. The equation given by the IDENTITY statement has the same form as equations in the MODEL statement. A label can be specified for an IDENTITY statement as follows:
label : IDENTITY . . . ;
INSTRUMENTS Statement
INSTRUMENTS variables ;
The INSTRUMENTS statement declares the variables used in obtaining first-stage predicted values. All the instruments specified are used in each first-stage regression. The INSTRUMENTS statement is required for the 2SLS, 3SLS, IT3SLS, LIML, MELO, and K-class estimation methods. The INSTRUMENTS statement is not needed for the SUR, ITSUR, OLS, or FIML estimation methods.
MODEL Statement
MODEL response = regressors / options ;
The MODEL statement regresses the response variable on the left side of the equal sign against the regressors listed on the right side. Models can be given labels. Model labels are used in the printed output to identify the results for different models. Model labels are also used in SRESTRICT and STEST statements to refer to parameters in different models. If no label is specified, the response variable name is used as the label for the model. The model label is specified as follows:
label : MODEL . . . ;
The following options can be used in the MODEL statement after a slash (/).
ALL
specifies the CORRB, COVB, DW, I, OVERID, PLOT, STB, and XPX options.
ALPHA=value
specifies the parameter for Fuller's modification to the LIML estimation method. See the section "Fuller's Modification to LIML" on page 1654 for details.
CORRB
DW
prints Durbin-Watson statistics and autocorrelation coefficients for the residuals. If there are missing values, d' is calculated according to Savin and White (1978). Use the DW option only if the data set to be analyzed is an ordinary SAS data set with time series observations sorted in time order. The Durbin-Watson test is not valid for models with lagged dependent regressors.
I
prints the inverse of the crossproducts matrix for the model, $(X'X)^{-1}$. If restrictions are specified, the crossproducts matrix printed is adjusted for the restrictions. See the section "Computational Details" on page 1651 for details.
K=value
OVERID
prints Basmann's (1960) test for overidentifying restrictions. See the section "Overidentification Restrictions" on page 1654 for details.
PLOT
plots residual values against regressors. A plot of the residuals for each regressor is printed.
STB
prints standardized parameter estimates. Sometimes known as a standard partial regression coefficient, a standardized parameter estimate is a parameter estimate multiplied by the standard deviation of the associated regressor and divided by the standard deviation of the response variable.
UNREST
prints parameter estimates computed before restrictions are applied. The UNREST option is valid only if a RESTRICT statement is specified.
XPX
prints the model crossproducts matrix, $X'X$. See the section "Computational Details" on page 1651 for details.
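MODEL statement options follow a slash and apply only to that MODEL statement. The sketch below assumes the supply and demand data set IN from the earlier examples and, for the DW option, that its observations are in time order; the particular choice of options is illustrative.

```sas
proc syslin data=in 2sls;
   endogenous p;
   instruments y u s;
   /* Durbin-Watson statistics and standardized estimates for demand */
   demand: model q = p y s / dw stb;
   /* Basmann's test: supply excludes two instruments (Y and S),     */
   /* so it has an overidentifying restriction to test               */
   supply: model q = p u / overid;
run;
```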
OUTPUT Statement
OUTPUT < PREDICTED=variable > < RESIDUAL=variable > ;
The OUTPUT statement writes predicted values and residuals from the preceding model to the data set specified by the OUT= option in the PROC SYSLIN statement. An OUTPUT statement must come after the MODEL statement to which it applies. The OUT= option must be specified in the PROC SYSLIN statement. The following options can be specified in the OUTPUT statement:
PREDICTED=variable
names a new variable to contain the predicted values for the response variable. The PREDICTED= option can be abbreviated as PREDICT=, PRED=, or P=.
RESIDUAL=variable
names a new variable to contain the residual values for the response variable. The RESIDUAL= option can be abbreviated as RESID= or R=. For example, the following statements create an output data set named B. In addition to the variables in the input data set, the data set B contains the variable YHAT, with values that are predicted values of the response variable Y, and YRESID, with values that are the residual values of Y.
proc syslin data=a out=b;
   model y = x1 x2;
   output p=yhat r=yresid;
run;
For example, the following statements create an output data set named PRED. In addition to the variables in the input data set, the data set PRED contains the variables Q_DEMAND and Q_SUPPLY, with values that are predicted values of the response variable Q for the demand and supply equations respectively, and R_DEMAND and R_SUPPLY, with values that are the residual values of the demand and supply equations respectively.
proc syslin data=in out=pred;
   demand: model q = p y s;
   output p=q_demand r=r_demand;
   supply: model q = p u;
   output p=q_supply r=r_supply;
run;
See the section OUT= Data Set on page 1655 for details.
RESTRICT Statement
RESTRICT equation , . . . , equation ;
The RESTRICT statement places restrictions on the parameter estimates for the preceding MODEL statement. Any number of RESTRICT statements can follow a MODEL statement. Each restriction is written as a linear equation. If more than one restriction is specified in a single RESTRICT statement, the restrictions are separated by commas. Parameters are referred to by the name of the corresponding regressor variable. Each name used in the equation must be a regressor in the preceding MODEL statement. The keyword INTERCEPT is used to refer to the intercept parameter in the model. RESTRICT statements can be given labels. The labels are used in the printed output to distinguish results for different restrictions. Labels are specified as follows:
label : RESTRICT . . . ;
The following is an example of the use of the RESTRICT statement, in which the coefficients of the regressors X1 and X2 are required to sum to 1.
proc syslin data=a;
   model y = x1 x2;
   restrict x1 + x2 = 1;
run;
Variable names can be multiplied by constants. When no equal sign appears, the linear combination is set equal to 0. Note that the parameters associated with the variables are restricted, not the variables themselves. Here are some examples of valid RESTRICT statements:
restrict x1 + x2 = 1;
restrict x1 + x2 - 1;
restrict 2 * x1 = x2 + x3 , intercept + x4 = 0;
restrict x1 = x2 = x3 = 1;
restrict 2 * x1 - x2;
Restricted parameter estimates are computed by introducing a Lagrangian parameter $\lambda$ for each restriction (Pringle and Rayner 1971). The estimates of these Lagrangian parameters are printed in the Parameter Estimates table. If a restriction cannot be applied, its parameter value and degrees of freedom are listed as 0. The Lagrangian parameter $\lambda$ measures the sensitivity of the sum of squared errors (SSE) to the restriction. If the restriction is changed by a small amount $\epsilon$, the SSE is changed by $2\lambda\epsilon$. The t ratio tests the significance of the restrictions. If $\lambda$ is zero, the restricted estimates are the same as the unrestricted estimates.
Any number of restrictions can be specified on a RESTRICT statement, and any number of RESTRICT statements can be used. The estimates are computed subject to all restrictions specified. However, restrictions should be consistent and not redundant. NOTE: The RESTRICT statement is not supported for the FIML estimation method.
SRESTRICT Statement
SRESTRICT equation , . . . , equation ;
The SRESTRICT statement imposes linear restrictions that involve parameters in two or more MODEL statements. The SRESTRICT statement is like the RESTRICT statement but is used to impose restrictions across equations, whereas the RESTRICT statement applies only to parameters in the immediately preceding MODEL statement. Each restriction is written as a linear equation. Parameters are referred to as label.variable, where label is the model label and variable is the name of the regressor to which the parameter is attached. (If the MODEL statement does not have a label, you can use the dependent variable name as the label for the model, provided the dependent variable uniquely labels the model.) Each variable name used must be a regressor in the indicated MODEL statement. The keyword INTERCEPT is used to refer to intercept parameters. SRESTRICT statements can be given labels. The labels are used in the printed output to distinguish results for different restrictions. Labels are specified as follows:
label : SRESTRICT . . . ;
The following is an example of the use of the SRESTRICT statement, in which the coefficient for the regressor X2 is constrained to be the same in both models.
proc syslin data=a 3sls;
   endogenous y1 y2;
   instruments x1 x2;
   model y1 = y2 x1 x2;
   model y2 = y1 x2;
   srestrict y1.x2 = y2.x2;
run;
When no equal sign is used, the linear combination is set equal to 0. Thus, the restriction in the preceding example can also be specified as
srestrict y1.x2 - y2.x2;
Any number of restrictions can be specified on an SRESTRICT statement, and any number of SRESTRICT statements can be used. The estimates are computed subject to all restrictions specified. However, restrictions should be consistent and not redundant.
When a system restriction is requested for a single equation estimation method (such as OLS or 2SLS), PROC SYSLIN produces the restricted estimates by actually using a corresponding system method. For example, when SRESTRICT is specified along with OLS, PROC SYSLIN produces the restricted OLS estimates via a two-step process equivalent to using SUR estimation with the SDIAG option. First, the unrestricted OLS results are produced. Then, the GLS (SUR) estimation with the system restriction is performed, using the diagonal of the covariance matrix of the residuals. When SRESTRICT is specified along with 2SLS, PROC SYSLIN produces the restricted 2SLS estimates via a multistep process equivalent to using 3SLS estimation with the SDIAG option. First, the unrestricted 2SLS results are produced. Then, the GLS (3SLS) estimation with the system restriction is performed, using the diagonal of the covariance matrix of the residuals.

The results of the SRESTRICT statements are printed after the parameter estimates for all the models in the system. The format of the SRESTRICT statement output is the same as the Parameter Estimates table. In this output the parameter estimate is the Lagrangian parameter, $\lambda$, used to impose the restriction. The Lagrangian parameter $\lambda$ measures the sensitivity of the system sum of squared errors to the restriction. The system SSE is the system MSE shown in the printed output multiplied by the degrees of freedom. If the restriction is changed by a small amount $\epsilon$, the system SSE is changed by $2\lambda\epsilon$. The t ratio tests the significance of the restriction. If $\lambda$ is zero, the restricted estimates are the same as the unrestricted estimates.
The model degrees of freedom are not adjusted for the cross-model restrictions imposed by SRESTRICT statements.

NOTE: The SRESTRICT statement is not supported for the LIML and the FIML estimation methods.
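The following is a minimal sketch of a cross-model restriction in a complete PROC SYSLIN step. The data set WORK.A, its variables, and the model labels A and B are hypothetical and are used only for illustration.

```sas
/* Hypothetical data set WORK.A with variables y1, y2, x1, x2, x3 */
proc syslin data=a sur;
   a: model y1 = x1 x2;
   b: model y2 = x2 x3;
   /* Force the X2 coefficient to be the same in models A and B */
   srestrict a.x2 = b.x2;
run;
```

Because SUR is a system method, the restriction is imposed directly in the joint estimation; with OLS or 2SLS the SDIAG-equivalent process described above would be used instead.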
STEST Statement
STEST equation , . . . , equation / options ;
The STEST statement performs an F test for the joint hypotheses specified in the statement. The hypothesis is represented in matrix notation as

$$L\beta = c$$

and the F test is computed as

$$\frac{(Lb - c)'\,\bigl(L(X'X)^{-1}L'\bigr)^{-1}\,(Lb - c)}{m\hat{\sigma}^2}$$

where $b$ is the estimate of $\beta$, $m$ is the number of restrictions, and $\hat{\sigma}^2$ is the system weighted mean squared error. See the section "Computational Details" for information about the matrix $X'X$.

Each hypothesis to be tested is written as a linear equation. Parameters are referred to as label.variable, where label is the model label and variable is the name of the regressor to which the parameter is attached. (If the MODEL statement does not have a label, you can use the dependent variable name as the label for the model, provided the dependent variable uniquely labels the model.) Each variable name used must be a regressor in the indicated MODEL statement. The keyword INTERCEPT is used to refer to intercept parameters.

STEST statements can be given labels. The label is used in the printed output to distinguish different tests. Any number of STEST statements can be specified. Labels are specified as follows:
label : STEST . . . ;
The test performed is exact only for ordinary least squares, given the OLS assumptions of the linear model. For other estimation methods, the F test is based on large sample theory and is only approximate in finite samples.

If RESTRICT or SRESTRICT statements are used, the tests computed by the STEST statement are conditional on the restrictions specified. The validity of the tests can be compromised if incorrect restrictions are imposed on the estimates.

The following are examples of STEST statements:
   stest a.x1 + b.x2 = 1;
   stest 2 * b.x2 = c.x3 + c.x4 ,
         a.intercept + b.x2 = 0;
   stest a.x1 = c.x2 = b.x3 = 1;
   stest 2 * a.x1 - b.x2 = 0;
The PRINT option can be specified in the STEST statement after a slash (/):
PRINT
prints intermediate calculations for the hypothesis tests.

NOTE: The STEST statement is not supported for the FIML estimation method.
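As an illustration of how a labeled STEST statement and the PRINT option fit into a complete step, the following sketch tests two cross-equation hypotheses jointly. The data set WORK.A, the variables, and the labels A, B, and CROSSEQ are hypothetical.

```sas
/* Hypothetical two-equation system estimated by 3SLS */
proc syslin data=a 3sls;
   endogenous y1 y2;
   instruments x1 x2 x3;
   a: model y1 = y2 x1 x2;
   b: model y2 = y1 x3;
   /* Joint test across models; PRINT shows intermediate calculations */
   crosseq: stest a.x1 = b.x3,
                  a.intercept = b.intercept / print;
run;
```

The two equations separated by a comma form one joint hypothesis with two restrictions; writing them as two separate STEST statements would instead produce two single-restriction tests.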
TEST Statement
TEST equation , . . . , equation / options ;
The TEST statement performs F tests of linear hypotheses about the parameters in the preceding MODEL statement. Each equation specifies a linear hypothesis to be tested. If more than one equation is specified, the equations are separated by commas.

Variable names must correspond to regressors in the preceding MODEL statement, and each name represents the coefficient of the corresponding regressor. The keyword INTERCEPT is used to refer to the model intercept.

TEST statements can be given labels. The label is used in the printed output to distinguish different tests. Any number of TEST statements can be specified. Labels are specified as follows:
label : TEST . . . ;
The following is an example of the use of the TEST statement, which tests the hypothesis that the coefficients of X1 and X2 are the same:

   proc syslin data=a;
      model y = x1 x2;
      test x1 = x2;
   run;

The following statements perform F tests for the hypothesis that the coefficients of X1 and X2 are equal, for the hypothesis that the sum of the X1 and X2 coefficients is twice the intercept, and for the joint hypothesis.

   proc syslin data=a;
      model y = x1 x2;
      x1eqx2: test x1 = x2;
      sumeq2i: test x1 + x2 = 2 * intercept;
      joint: test x1 = x2, x1 + x2 = 2 * intercept;
   run;
The TEST statement performs an F test for the joint hypotheses specified. The hypothesis is represented in matrix notation as follows:

$$L\beta = c$$

The F test is computed as

$$\frac{(Lb - c)'\,\bigl(L(X'X)^{-}L'\bigr)^{-1}\,(Lb - c)}{m\hat{\sigma}^2}$$

where $b$ is the estimate of $\beta$, $m$ is the number of restrictions, and $\hat{\sigma}^2$ is the model mean squared error. See the section "Computational Details" for information about the matrix $X'X$.

The test performed is exact only for ordinary least squares, given the OLS assumptions of the linear model. For other estimation methods, the F test is based on large sample theory and is only approximate in finite samples.

If RESTRICT or SRESTRICT statements are used, the tests computed by the TEST statement are conditional on the restrictions specified. The validity of the tests can be compromised if incorrect restrictions are imposed on the estimates.

The PRINT option can be specified in the TEST statement after a slash (/):
PRINT
prints intermediate calculations for the hypothesis tests.

NOTE: The TEST statement is not supported for the FIML estimation method.
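As a worked illustration of the matrix notation, consider the joint hypothesis from the TEST example above (x1 = x2 together with x1 + x2 = 2 * intercept). Assuming, for illustration only, that the coefficient vector is ordered as (intercept, X1, X2), the hypothesis can be written as follows:

```latex
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix},
\qquad
L = \begin{bmatrix} 0 & 1 & -1 \\ -2 & 1 & 1 \end{bmatrix},
\qquad
c = \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\qquad
m = 2
```

The first row of $L$ encodes $\beta_1 - \beta_2 = 0$ and the second row encodes $-2\beta_0 + \beta_1 + \beta_2 = 0$. Each single-equation TEST statement in the example corresponds to one of these rows taken alone, with $m = 1$.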
VAR Statement
VAR variables ;
The VAR statement is used to include variables in the crossproducts matrix that are not specified in any MODEL statement. This statement is rarely used with PROC SYSLIN and is used only with the OUTSSCP= option in the PROC SYSLIN statement.
WEIGHT Statement
WEIGHT variable ;
The WEIGHT statement is used to perform weighted regression. The WEIGHT statement names a variable in the input data set whose values are relative weights for a weighted least squares fit. If the weight value is proportional to the reciprocal of the variance for each observation, the weighted estimates are the best linear unbiased estimates (BLUE).
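The following sketch shows one way the reciprocal-variance weighting described above might be set up. It assumes, purely for illustration, that the error variance is proportional to a hypothetical variable SIZE in a hypothetical data set WORK.A.

```sas
/* Assumption: error variance is proportional to SIZE, so weight by 1/SIZE */
data a_w;
   set a;
   if size > 0 then w = 1 / size;
run;

proc syslin data=a_w;
   weight w;
   model y = x1 x2;
run;
```

Only the relative sizes of the weights matter for the coefficient estimates, so any constant multiple of 1/SIZE gives the same fit.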
Estimation Methods
A brief description of the methods used by the SYSLIN procedure follows. For more information about these methods, see the references at the end of this chapter.

There are two fundamental methods of estimation for simultaneous equations: least squares and maximum likelihood. There are two approaches within each of these categories: single equation methods (also referred to as limited information methods) and system methods (also referred to as full information methods). System methods take into account cross-equation correlations of the disturbances in estimating parameters, while single equation methods do not.

OLS, 2SLS, MELO, K-class, SUR, ITSUR, 3SLS, and IT3SLS use the least squares method; LIML and FIML use the maximum likelihood method.

OLS, 2SLS, MELO, K-class, and LIML are single equation methods. The system methods are SUR, ITSUR, 3SLS, IT3SLS, and FIML.
MELO is a Bayesian K-class estimator. It yields estimates that can be expressed as a matrix-weighted average of the OLS and 2SLS estimates. MELO estimators have finite second moments and hence finite risk. Other frequently used K-class estimators might not have finite moments under some commonly encountered circumstances, and hence there can be infinite risk relative to quadratic and other loss functions.

One way of comparing K-class estimators is to note that when k = 1, the correlation between the regressor and the residual is completely corrected for. In all other cases, it is only partially corrected for.

See "Computational Details" for more details about K-class estimators.

NOTE: The RESTRICT, SRESTRICT, TEST, and STEST statements are not supported when the FIML method is used.
This definition is consistent with the F test of the null hypothesis that the true coefficients of all regressors are zero. However, this $R^2$ might not be a good measure of the goodness of fit of the model.
$$R^2 = \frac{Y'WR(X'X)^{-1}R'WY}{Y'WY}$$

In this equation, the matrix $X'X$ is $R'WR$ and $W$ is the projection matrix of the instruments:

$$W = S^{-1} \otimes Z(Z'Z)^{-1}Z'$$

The matrix $Z$ is the instrument set, $R$ is the regressor set, and $S$ is the estimated cross-model covariance matrix.

The system weighted MSE, printed for the 3SLS, IT3SLS, SUR, ITSUR, and FIML methods, is computed as follows:

$$\mathit{MSE} = \frac{1}{\mathit{tdf}}\left(Y'WY - Y'WR(X'X)^{-1}R'WY\right)$$

In this equation, tdf is the sum of the error degrees of freedom for the equations in the system.
Computational Details
This section discusses various computational details.
$$y_i = Y_i\beta_i + X_i\gamma_i + u$$

where

   $y_i$       is the vector of observations on the dependent variable
   $Y_i$       is the matrix of observations on the endogenous variables included in the equation
   $\beta_i$   is the vector of parameters associated with $Y_i$
   $X_i$       is the matrix of observations on the predetermined variables included in the equation
   $\gamma_i$  is the vector of parameters associated with $X_i$
   $u$         is a vector of errors

Let $\hat{V}_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ is the projection of $Y_i$ onto the space spanned by the instruments matrix $Z$. Let

$$\delta_i = \begin{bmatrix} \beta_i \\ \gamma_i \end{bmatrix}$$

be the vector of parameters associated with both the endogenous and exogenous variables.

The K-class of estimators (Theil 1971) is defined by

$$\hat{\delta}_{i,k} = \begin{bmatrix} Y_i'Y_i - k\hat{V}_i'\hat{V}_i & Y_i'X_i \\ X_i'Y_i & X_i'X_i \end{bmatrix}^{-1} \begin{bmatrix} (Y_i - k\hat{V}_i)'y_i \\ X_i'y_i \end{bmatrix}$$

where $k$ is a user-defined value.

Let $R_i = [\,Y_i \;\; X_i\,]$ and $\hat{R}_i = [\,\hat{Y}_i \;\; X_i\,]$. The 2SLS estimator is defined as

$$\hat{\delta}_{i,2SLS} = \bigl[\hat{R}_i'\hat{R}_i\bigr]^{-1}\hat{R}_i'y_i$$

Let $y$ and $\delta$ be the vectors obtained by stacking the vectors of dependent variables and parameters for all $G$ equations, and let $R$ and $\hat{R}$ be the block diagonal matrices formed by $R_i$ and $\hat{R}_i$, respectively.

The SUR and ITSUR estimators are defined as

$$\hat{\delta}_{(IT)SUR} = \bigl[R'(\hat{\Sigma}^{-1} \otimes I)R\bigr]^{-1} R'(\hat{\Sigma}^{-1} \otimes I)\,y$$

while the 3SLS and IT3SLS estimators are defined as

$$\hat{\delta}_{(IT)3SLS} = \bigl[\hat{R}'(\hat{\Sigma}^{-1} \otimes I)\hat{R}\bigr]^{-1} \hat{R}'(\hat{\Sigma}^{-1} \otimes I)\,y$$

where $I$ is the identity matrix, and $\hat{\Sigma}$ is an estimator of the cross-equation correlation matrix. For 3SLS, $\hat{\Sigma}$ is obtained from the 2SLS estimation, while for SUR it is derived from the OLS estimation. For IT3SLS and ITSUR, it is obtained iteratively from the previous estimation step, until convergence.
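To connect the K-class definition with the OLS and 2SLS estimators, the following short derivation may help. It is a sketch that assumes the columns of $X_i$ lie in the space spanned by the instruments $Z$ (so that projecting $X_i$ onto $Z$ leaves it unchanged); the symbol $P_Z$ is introduced here only for illustration.

```latex
\begin{aligned}
P_Z &= Z(Z'Z)^{-1}Z', \qquad
\hat{Y}_i = P_Z Y_i, \qquad
\hat{V}_i = (I - P_Z)\,Y_i \\[4pt]
k = 0:&\quad
\hat{\delta}_{i,0} = \bigl[R_i'R_i\bigr]^{-1} R_i'y_i
\quad \text{(OLS)} \\[4pt]
k = 1:&\quad
Y_i'Y_i - \hat{V}_i'\hat{V}_i = Y_i'P_Z Y_i = \hat{Y}_i'\hat{Y}_i, \qquad
(Y_i - \hat{V}_i)'y_i = \hat{Y}_i'y_i, \\
&\quad
Y_i'X_i = \hat{Y}_i'X_i + \hat{V}_i'X_i = \hat{Y}_i'X_i
\quad \text{since } X_i'(I - P_Z) = 0 \\[4pt]
\Rightarrow\;&\quad
\hat{\delta}_{i,1} = \bigl[\hat{R}_i'\hat{R}_i\bigr]^{-1}\hat{R}_i'y_i
= \hat{\delta}_{i,2SLS}
\end{aligned}
```

Under this assumption, k = 0 reproduces OLS and k = 1 reproduces 2SLS, which is consistent with the earlier remark that the correlation between regressor and residual is fully corrected for only when k = 1.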
The VARDEF= option does not affect the model mean squared error, root mean squared error, or $R^2$ statistics. These statistics are always based on the error degrees of freedom, regardless of the VARDEF= option. The VARDEF= option also does not affect the dependent variable coefficient of variation (CV).
Overidentification Restrictions

The OVERID option in the MODEL statement can be used to test for overidentifying restrictions on parameters of each equation. The null hypothesis is that the predetermined variables that do not appear in any equation have zero coefficients. The alternative hypothesis is that at least one of the assumed zero coefficients is nonzero. The test is approximate and rejects the null hypothesis too frequently for small sample sizes.

The formula for the test is given as follows. Let $y_i = \beta_i Y_i + \gamma_i Z_i + e_i$ be the $i$th equation. $Y_i$ are the endogenous variables that appear as regressors in the $i$th equation, and $Z_i$ are the instrumental variables that appear as regressors in the $i$th equation. Let $N_i$ be the number of variables in $Y_i$ and $Z_i$.

Let $v_i = y_i - Y_i\hat{\beta}_i$. Let $Z$ represent all instrumental variables, $T$ be the total number of observations, and $K$ be the total number of instrumental variables. Define $\hat{l}$ as follows:

$$\hat{l} = \frac{v_i'\bigl(I - Z_i(Z_i'Z_i)^{-1}Z_i'\bigr)v_i}{v_i'\bigl(I - Z(Z'Z)^{-1}Z'\bigr)v_i}$$
Missing Values
Observations that have a missing value for any variable in the analysis are excluded from the computations.
_STATUS_
_MODEL_
_DEPVAR_
_NAME_
_SIGMA_
INTERCEPT
regressors
The parameter estimates are stored under the names of the regressor variables. The intercept parameters are stored in the variable INTERCEPT. The dependent variable of the model is given a coefficient of -1. Variables that are not in a model have missing values for the OUTEST= observations for that model.

Some estimation methods require computation of preliminary estimates. All estimates computed are output to the OUTEST= data set. For each BY group and each estimation, the OUTEST= data set contains one observation for each MODEL or IDENTITY statement. Results for different estimations are identified by the _TYPE_ variable.

For example, consider the following statements:
   proc syslin data=a outest=est 3sls;
      by b;
      endogenous y1 y2;
      instruments x1-x4;
      model y1 = y2 x1 x2;
      model y2 = y1 x3 x4;
      identity x1 = x3 + x4;
   run;
The 3SLS method requires both a preliminary 2SLS stage and preliminary first-stage regressions for the endogenous variables. The OUTEST= data set thus contains three different kinds of estimates. The observations for the first-stage regression estimates have the _TYPE_ value INST. The observations for the 2SLS estimates have the _TYPE_ value 2SLS. The observations for the final 3SLS estimates have the _TYPE_ value 3SLS.

Since there are two endogenous variables in this example, there are two first-stage regressions and two _TYPE_=INST observations in the OUTEST= data set. Since there are two MODEL statements, there are two OUTEST= observations with _TYPE_=2SLS and two observations with _TYPE_=3SLS. In addition, the OUTEST= data set contains an observation with the _TYPE_ value IDENTITY that contains the coefficients specified by the IDENTITY statement. All these observations are repeated for each BY group in the input data set defined by the values of the BY variable B.

When the COVOUT option is specified, the estimated covariance matrix for the parameter estimates is included in the OUTEST= data set. Each observation for parameter estimates is followed by observations that contain the rows of the parameter covariance matrix for that model. The row of the covariance matrix is identified by the variable _NAME_. For observations that contain parameter estimates, _NAME_ is blank. For covariance observations, _NAME_ contains the regressor name for the row of the covariance matrix and the regressor variables contain the covariances.

See Example 26.1 for an example of the OUTEST= data set.
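Because the OUTEST= data set mixes first-stage, 2SLS, 3SLS, and (with COVOUT) covariance rows, it is often convenient to subset it before printing. The following sketch adds the COVOUT option to the preceding example and then prints only the final 3SLS parameter estimates; the data set WORK.A and its variables are the hypothetical ones used above, and the input data are assumed to be sorted by B.

```sas
proc syslin data=a outest=est covout 3sls;
   by b;
   endogenous y1 y2;
   instruments x1-x4;
   model y1 = y2 x1 x2;
   model y2 = y1 x3 x4;
run;

/* Keep the final 3SLS estimates; _NAME_ is blank on estimate rows */
proc print data=est;
   where _type_ = '3SLS' and _name_ = ' ';
run;
```

Dropping the `_name_ = ' '` condition would also list the covariance rows that follow each estimate row.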
statements. Observations are identified by the variable _NAME_.

The OUTSSCP= data set can be useful when a large number of observations are to be explored in many different SYSLIN runs. The sum-of-squares-and-crossproducts matrix can be saved with the OUTSSCP= option and used as the DATA= data set on subsequent SYSLIN runs. This is much less expensive computationally because PROC SYSLIN never reads the original data again. In the step that creates the OUTSSCP= data set, include in the VAR statement all the variables you expect to use.
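The workflow described above might look like the following sketch. The data set names WORK.A and WORK.SSCP and the variables are hypothetical; the second step uses only variables that were placed in the crossproducts matrix by the first step.

```sas
/* Step 1: read the raw data once and save the crossproducts matrix.   */
/* X3 and X4 are not in this model, so the VAR statement adds them.    */
proc syslin data=a outsscp=sscp;
   var x3 x4;
   model y1 = x1 x2;
run;

/* Step 2: fit a different model from the saved crossproducts only */
proc syslin data=sscp;
   model y1 = x1 x3 x4;
run;
```

Any variable omitted from both the MODEL and VAR statements in the first step is not available in later runs that read the OUTSSCP= data set.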
Printed Output
The printed output produced by the SYSLIN procedure is as follows:

1. If the SIMPLE option is used, a table of descriptive statistics is printed that shows the sum, mean, sum of squares, variance, and standard deviation for all the variables used in the models.
2. If the FIRST option is specified and an instrumental variables method is used, first-stage regression results are printed. The results show the regression of each endogenous variable on the variables in the INSTRUMENTS list.
3. The results of the second-stage regression are printed for each model. (See the following section "Printed Output for Each Model" for details.)
4. If a systems method like 3SLS, SUR, or FIML is used, the cross-equation error covariance matrix is printed. This matrix is shown four ways: the covariance matrix itself, the correlation matrix form, the inverse of the correlation matrix, and the inverse of the covariance matrix.
5. If a systems method like 3SLS, SUR, or FIML is used, the system weighted mean squared error and system weighted $R^2$ statistics are printed. The system weighted MSE and $R^2$ measure the fit of the joint model obtained by stacking all the models together and performing a single regression with the stacked observations weighted by the inverse of the model error variances.
6. If a systems method like 3SLS, SUR, or FIML is used, the final results are printed for each model.
7. If the REDUCED option is used, the reduced form coefficients are printed. These consist of the structural coefficient matrix for the endogenous variables, the structural coefficient matrix for the exogenous variables, the inverse of the endogenous coefficient matrix, and the reduced form coefficient matrix. The reduced form coefficient matrix is the product of the inverse of the endogenous coefficient matrix and the exogenous structural coefficient matrix.
The printed output produced for each model is described in the following.

The analysis-of-variance table includes the following:

• the model degrees of freedom, sum of squares, and mean square
• the error degrees of freedom, sum of squares, and mean square. The error mean square is computed by dividing the error sum of squares by the error degrees of freedom and is not affected by the VARDEF= option.
• the corrected total degrees of freedom and total sum of squares. Note that for instrumental variables methods, the model and error sums of squares do not add to the total sum of squares.
• the F ratio, labeled F Value, and its significance, labeled PROB>F, for the test of the hypothesis that all the nonintercept parameters are 0
• the root mean squared error. This is the square root of the error mean square.
• the dependent variable mean
• the coefficient of variation (CV) of the dependent variable
• the $R^2$ statistic. This $R^2$ is computed consistently with the calculation of the F statistic. It is valid for hypothesis tests but might not be a good measure of fit for models estimated by instrumental variables methods.
• the $R^2$ statistic adjusted for model degrees of freedom, labeled Adj R-SQ

The Parameter Estimates table includes the following:

• estimates of parameters for regressors in the model and the Lagrangian parameter for each restriction specified
• a degrees of freedom column labeled DF. Estimated model parameters have 1 degree of freedom. Restrictions have a DF of -1. Regressors or restrictions dropped from the model due to collinearity have a DF of 0.
• the standard errors of the parameter estimates
• the t statistics, which are the parameter estimates divided by the standard errors
• the significance of the t tests for the hypothesis that the true parameter is 0, labeled Pr > |t|. As previously noted, the significance tests are strictly valid in finite samples only for OLS estimates but are asymptotically valid for the other methods.
• the standardized regression coefficients, if the STB option is specified. This is the parameter estimate multiplied by the ratio of the standard deviation of the regressor to the standard deviation of the dependent variable.
• the labels of the regressor variables or restriction labels
In addition to the analysis-of-variance table and the Parameter Estimates table, the results printed for each model can include the following:

• If TEST statements are specified, the test results are printed.
• If the DW option is specified, the Durbin-Watson statistic and first-order autocorrelation coefficient are printed.
• If the OVERID option is specified, the results of Basmann's test for overidentifying restrictions are printed.
• If the PLOT option is used, plots of residual against each regressor are printed.
• If the COVB or CORRB options are specified, the results for each model also include the covariance or correlation matrix of the parameter estimates. For systems methods like 3SLS and FIML, the COVB and CORRB output is printed for the whole system after the output for the last model, instead of separately for each model.

The third-stage output for 3SLS, SUR, IT3SLS, ITSUR, and FIML does not include the analysis-of-variance table. When a systems method is used, the second-stage output does not include the optional output, except for the COVB and CORRB matrices.
Table 26.2

ODS Table Name       Description                                  Option
ANOVA                Summary of the SSE, MSE for the equations    default
AugXPXMat            Model crossproducts                          XPX or USSCP
AutoCorrStat         Autocorrelation statistics                   DW
ConvergenceStatus    Convergence status                           default
CorrB                Correlations of parameters                   CORRB
CorrResiduals        Correlations of residuals
CovB                 Covariance of parameters                     COVB
CovResiduals         Covariance of residuals
EndoMat              Endogenous variables                         REDUCED
ExogMat              Exogenous variables                          REDUCED
FitStatistics        Statistics of fit                            default
InvCorrResiduals     Inverse correlations of residuals
InvCovResiduals      Inverse covariance of residuals
InvEndoMat           Inverse endogenous variables                 REDUCED
InvXPX               X'X inverse for system                       I
IterHistory          Iteration printing                           ITPRINT
MissingValues        Missing values generated by the program      default
ModelVars            Name and label for the model                 default
ParameterEstimates   Parameter estimates                          default
RedMat               Reduced form                                 REDUCED
SimpleStatistics     Descriptive statistics                       SIMPLE
SSCP                 Model crossproducts                          XPX or USSCP
TestResults          Test for overidentifying restrictions
Weight               Weighted model statistics
ODS Graphics
This section describes the use of ODS for creating graphics with the SYSLIN procedure.
Plot Description
Predicted versus actual plot
Q-Q plot of residuals
Histogram of the residuals
Residual plot
The following statements estimate the Klein model using the limited information maximum likelihood method. In addition, the parameter estimates are written to a SAS data set with the OUTEST= option.
   proc syslin data=klein outest=b liml;
      endogenous c p w i x wsum k y;
      instruments klag plag xlag wp g t yr;
      consume: model c = p plag wsum;
      invest:  model i = p plag klag;
      labor:   model w = x xlag yr;
   run;

   proc print data=b;
   run;
The PROC SYSLIN estimates are shown in Output 26.1.1 through Output 26.1.3.
Output 26.1.1 LIML Estimates for Consumption

The SYSLIN Procedure
Limited-Information Maximum Likelihood Estimation

Model                 CONSUME
Dependent Variable    c
Label                 Consumption

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3         854.3541      284.7847    118.42   <.0001
Error              17         40.88419      2.404952
Corrected Total    20         941.4295

Root MSE           1.55079    R-Square   0.95433
Dependent Mean    53.99524    Adj R-Sq   0.94627
Coeff Var          2.87209

Parameter Estimates

Variable Label     DF   Parameter Estimate   Standard Error
Intercept           1             17.14765         2.045374
Profits             1             -0.22251         0.224230
Profits Lagged      1             0.396027         0.192943
Total Wage Bill     1             0.822559         0.061549
Example 26.1: Kleins Model I Estimated with LIML and 3SLS ! 1663
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3         210.3790      70.12634     34.06   <.0001
Error              17         34.99649      2.058617
Corrected Total    20         252.3267

Root MSE            1.43479    R-Square   0.85738
Dependent Mean      1.26667    Adj R-Sq   0.83221
Coeff Var         113.27274

Parameter Estimates

Variable Label          DF   Parameter Estimate   Standard Error
Intercept                1             22.59083         9.498146
Profits                  1             0.075185         0.224712
Profits Lagged           1             0.680386         0.209145
Capital Stock Lagged     1             -0.16826         0.045345
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3         696.1485      232.0495    393.62   <.0001
Error              17         10.02192      0.589525
Corrected Total    20         794.9095

Root MSE           0.76781    R-Square   0.98581
Dependent Mean    36.36190    Adj R-Sq   0.98330
Coeff Var          2.11156
The OUTEST= data set is shown in part in Output 26.1.4. Note that the data set contains the parameter estimates and root mean squared errors, _SIGMA_, for the first-stage instrumental regressions as well as the parameter estimates and $\sigma$ for the LIML estimates for the three structural equations.
Output 26.1.4 The OUTEST= Data Set

Obs   _TYPE_   _DEPVAR_   _SIGMA_   Intercept    c          p    w    i         x      xlag
  1   LIML     c          1.55079     17.1477   -1   -0.22251    .    .         .         .
  2   LIML     i          1.43479     22.5908    .    0.07518    .   -1         .         .
  3   LIML     w          0.76781      1.5262    .          .   -1    .   0.43394   0.15132
The following statements estimate the model using the 3SLS method. The reduced form estimates are produced by the REDUCED option; IDENTITY statements are used to make the model complete.
   proc syslin data=klein 3sls reduced;
      endogenous c p w i x wsum k y;
      instruments klag plag xlag wp g t yr;
      consume: model c = p plag wsum;
      invest:  model i = p plag klag;
      labor:   model w = x xlag yr;
      product: identity x = c + i + g;
      income:  identity y = c + i + g - t;
      profit:  identity p = y - w;
      stock:   identity k = klag + i;
      wage:    identity wsum = w + wp;
   run;
The preliminary 2SLS results and estimated cross-model covariance matrix are not shown. The 3SLS estimates are shown in Output 26.1.5 through Output 26.1.7. The reduced form estimates are shown in Output 26.1.8 through Output 26.1.11.
Model                 CONSUME
Dependent Variable    c
Label                 Consumption

Parameter Estimates

Variable Label     DF   Parameter Estimate   Standard Error
Intercept           1             16.44079         1.449925
Profits             1             0.124890         0.120179
Profits Lagged      1             0.163144         0.111631
Total Wage Bill     1             0.790081         0.042166
Parameter Estimates

Variable Label          DF   Parameter Estimate   Standard Error
Intercept                1             28.17785         7.550853
Profits                  1             -0.01308         0.179938
Profits Lagged           1             0.755724         0.169976
Capital Stock Lagged     1             -0.19485         0.036156
Inverse Endogenous Variables

          INCOME     PROFIT    STOCK       WAGE
c       0.195852   0.195852        0   1.291509
p       1.108721   1.108721        0   0.768246
w       0.072629   0.072629        0   0.513215
i        -0.0145    -0.0145        0   -0.01005
x       0.181351   0.181351        0   1.281461
wsum    0.072629   0.072629        0   1.513215
k        -0.0145    -0.0145        1   -0.01005
y       1.181351   0.181351        0   1.281461
      ge_i = 'Gross Investment, GE'
      ge_c = 'Capital Stock Lagged, GE'
      ge_f = 'Value of Outstanding Shares Lagged, GE'
      wh_i = 'Gross Investment, WH'
      wh_c = 'Capital Stock Lagged, WH'
      wh_f = 'Value of Outstanding Shares Lagged, WH';
The following statements compute the SUR estimates for the Grunfeld model.
   proc syslin data=grunfeld sur;
      ge:      model ge_i = ge_f ge_c;
      westing: model wh_i = wh_f wh_c;
   run;
The PROC SYSLIN output is shown in Output 26.2.1 through Output 26.2.5.
Output 26.2.1 PROC SYSLIN Output for SUR
The SYSLIN Procedure
Ordinary Least Squares Estimation

Model                 GE
Dependent Variable    ge_i
Label                 Gross Investment, GE

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2         31632.03      15816.02     20.34   <.0001
Error              17         13216.59      777.4463
Corrected Total    19         44848.62

Root MSE           27.88272    R-Square   0.70531
Dependent Mean    102.29000    Adj R-Sq   0.67064
Coeff Var          27.25850

Parameter Estimates

Variable Label                            DF   Parameter Estimate   Standard Error
Intercept                                  1             -9.95631         31.37425
Value of Outstanding Shares Lagged, GE     1             0.026551         0.015566
Capital Stock Lagged, GE                   1             0.151694         0.025704
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2         5165.553      2582.776     24.76   <.0001
Error              17         1773.234      104.3079
Corrected Total    19         6938.787

Root MSE          10.21312    R-Square   0.74445
Dependent Mean    42.89150    Adj R-Sq   0.71438
Coeff Var         23.81153

Parameter Estimates

Variable Label                            DF   Parameter Estimate   Standard Error
Intercept                                  1             -0.50939         8.015289
Value of Outstanding Shares Lagged, WH     1             0.052894         0.015707
Capital Stock Lagged, WH                   1             0.092406         0.056099
Cross Model Inverse Covariance

                 GE     WESTING
GE         0.002745    -.005463
WESTING    -.005463    0.020458
Parameter Estimates

Variable Label                            DF   Parameter Estimate   Standard Error
Intercept                                  1             -27.7193         29.32122
Value of Outstanding Shares Lagged, GE     1             0.038310         0.014415
Capital Stock Lagged, GE                   1             0.139036         0.024986
         wp   = 'Govt Wage Bill'
         g    = 'Govt Demand'
         t    = 'Taxes'
         klag = 'Capital Stock Lagged'
         plag = 'Profits Lagged'
         xlag = 'Private Product Lagged'
         yr   = 'YEAR-1931';
   datalines;
   1920 . 12.7 . . 44.9 .
   1921 41.9 12.4 25.5 -0.2 45.6 2.7
   1922 45.0 16.9 29.3 1.9 50.1 2.9
   1923 49.2 18.4 34.1 5.2 57.2 2.9
   1924 50.6 19.4 33.9 3.0 57.1 3.1
   1925 52.6 20.1 35.4 5.1 61.0 3.2
   1926 55.1 19.6 37.4 5.6 64.0 3.3
   1927 56.2 19.8 37.9 4.2 64.4 3.6
   1928 57.3 21.1 39.2 3.0 64.5 3.7
   1929 57.8 21.7 41.3 5.1 67.0 4.0
   ... more lines ...
182.8 182.6 184.5 189.7 192.7 197.8 203.4 207.6 210.6 215.7
   ods graphics on;

   proc syslin data=klein outest=b liml plots(unpack only)=residual;
      endogenous c p w i x wsum k y;
      instruments klag plag xlag wp g t yr;
      consume: model c = p plag wsum;
      invest:  model i = p plag klag;
      labor:   model w = x xlag yr;
   run;
References
Basmann, R.L. (1960), "On Finite Sample Distributions of Generalized Classical Linear Identifiability Test Statistics," Journal of the American Statistical Association, 55, 650-659.

Fuller, W.A. (1977), "Some Properties of a Modification of the Limited Information Estimator," Econometrica, 45, 939-952.

Hausman, J.A. (1975), "An Instrumental Variable Approach to Full Information Estimators for Linear and Certain Nonlinear Econometric Models," Econometrica, 43, 727-738.

Johnston, J. (1984), Econometric Methods, Third Edition, New York: McGraw-Hill.

Judge, George G., W. E. Griffiths, R. Carter Hill, Helmut Lutkepohl, and Tsoung-Chao Lee (1985), The Theory and Practice of Econometrics, Second Edition, New York: John Wiley & Sons.

Maddala, G.S. (1977), Econometrics, New York: McGraw-Hill.

Park, S.B. (1982), "Some Sampling Properties of Minimum Expected Loss (MELO) Estimators of Structural Coefficients," Journal of Econometrics, 18, 295-311.

Pindyck, R.S. and Rubinfeld, D.L. (1981), Econometric Models and Economic Forecasts, Second Edition, New York: McGraw-Hill.

Pringle, R.M. and Rayner, A.A. (1971), Generalized Inverse Matrices with Applications to Statistics, New York: Hafner Publishing Company.

Rao, P. (1974), "Specification Bias in Seemingly Unrelated Regressions," in Essays in Honor of Tinbergen, Volume 2, New York: International Arts and Sciences Press.

Savin, N.E. and White, K.J. (1978), "Testing for Autocorrelation with Missing Observations," Econometrica, 46, 59-66.

Theil, H. (1971), Principles of Econometrics, New York: John Wiley & Sons.

Zellner, A. (1962), "An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests for Aggregation Bias," Journal of the American Statistical Association, 57, 348-368.

Zellner, A. (1978), "Estimation of Functions of Population Means and Regression Coefficients: A Minimum Expected Loss (MELO) Approach," Journal of Econometrics, 8, 127-158.

Zellner, A. and Park, S. (1979), "Minimum Expected Loss (MELO) Estimators for Functions of Parameters and Structural Coefficients of Econometric Models," Journal of the American Statistical Association, 74, 185-193.
Chapter 27
Examples: TIMESERIES Procedure
   Example 27.1: Accumulating Transactional Data into Time Series Data
   Example 27.2: Trend and Seasonal Analysis
   Example 27.3: Illustration of ODS Graphics
References
when using this procedure with SAS/STAT procedures or SAS Enterprise Miner. For example, clustering time-stamped transactional data can be achieved by using the results of this procedure with the clustering procedures of SAS/STAT and the nodes of SAS Enterprise Miner. The EXPAND procedure can be used for the frequency conversion and transformations of time series output from this procedure.
The TIMESERIES procedure forms time series from the input time-stamped transactional data. It can provide results in output data sets or in other output formats by using the Output Delivery System (ODS).

Time-stamped transactional data are often recorded at no fixed interval. Analysts often want to use time series analysis techniques that require fixed-time intervals. Therefore, the transactional data must be accumulated to form a fixed-interval time series.

Suppose that a bank wants to analyze the transactions associated with each of its customers over time. Further, suppose that the data set WORK.TRANSACTIONS contains four variables that are related to these transactions: CUSTOMER, DATE, WITHDRAWALS, and DEPOSITS. The following examples illustrate possible ways to analyze these transactions by using the TIMESERIES procedure.

To accumulate the time-stamped transactional data to form a daily time series based on the accumulated daily totals of each type of transaction (WITHDRAWALS and DEPOSITS), the following TIMESERIES procedure statements can be used:
   proc timeseries data=transactions out=timeseries;
      by customer;
      id date interval=day accumulate=total;
      var withdrawals deposits;
   run;
The OUT=TIMESERIES option specifies that the resulting time series data for each customer is to be stored in the data set WORK.TIMESERIES. The INTERVAL=DAY option specifies that the transactions are to be accumulated on a daily basis. The ACCUMULATE=TOTAL option specifies that the sum of the transactions is to be calculated.

After the transactional data is accumulated into a time series format, many of the procedures provided with SAS/ETS software can be used to analyze the resulting time series data. For example, the ARIMA procedure can be used to model and forecast each customer's withdrawal data by using an ARIMA(0,1,1)(0,1,1)s model (where the number of seasons is s=7 days in a week) with the following statements:
   proc arima data=timeseries;
      identify var=withdrawals(1,7) noprint;
      estimate q=(1)(7) outest=estimates noprint;
      forecast id=date interval=day out=forecasts;
   quit;
The OUTEST=ESTIMATES data set contains the parameter estimates of the model specified. The OUT=FORECASTS data set contains forecasts based on the model specified. See the SAS/ETS ARIMA procedure for more detail.

A single set of transactions can be very large and must be summarized in order to analyze them effectively. Analysts often want to examine transactional data for trends and seasonal variation. To analyze transactional data for trends and seasonality, statistics must be computed for each time period and season of concern. For each observation, the time period and season must be determined and the data must be analyzed based on this determination.

The following statements illustrate how to use the TIMESERIES procedure to perform trend and seasonal analysis of time-stamped transactional data.
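For reference, the ARIMA(0,1,1)(0,1,1)s model fit by the preceding PROC ARIMA statements can be written out as follows. This is a sketch in backshift notation; the symbols $y_t$, $B$, $\mu$, $\theta_1$, $\theta_7$, and $a_t$ are introduced here for illustration, and the constant $\mu$ is shown on the assumption that no option suppressing it is specified in the ESTIMATE statement.

```latex
(1 - B)(1 - B^{7})\, y_t = \mu + (1 - \theta_1 B)(1 - \theta_7 B^{7})\, a_t
```

Here $y_t$ is the daily WITHDRAWALS series, $B$ is the backshift operator ($B y_t = y_{t-1}$), and $a_t$ is the white noise error. The differencing list (1,7) in the IDENTIFY statement supplies the two factors on the left, and Q=(1)(7) supplies the two moving-average factors on the right.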
proc timeseries data=transactions out=out
                outseason=season outtrend=trend;
   by customer;
   id date interval=day accumulate=total;
   var withdrawals deposits;
run;
Since the INTERVAL=DAY option is specified, the length of the seasonal cycle is seven (7), where the first season is Sunday and the last season is Saturday. The output data set specified by the OUTSEASON=SEASON option contains the seasonal statistics for each day of the week by each customer. The output data set specified by the OUTTREND=TREND option contains the trend statistics for each day of the calendar by each customer.
Often it is desired to seasonally decompose a time series into seasonal, trend, cycle, and irregular components or to seasonally adjust a time series. These techniques describe how the changing seasons influence the time series.
The following statements illustrate how to use the TIMESERIES procedure to perform seasonal adjustment/decomposition analysis of time-stamped transactional data.
proc timeseries data=transactions out=out
                outdecomp=decompose;
   by customer;
   id date interval=day accumulate=total;
   var withdrawals deposits;
run;
The output data set specified by the OUTDECOMP=DECOMPOSE option contains the decomposed/adjusted time series for each customer.
A single time series can be very large. Often, a time series must be summarized with respect to time lags in order to be efficiently analyzed using time domain techniques. These techniques help describe how a current observation is related to the past observations with respect to the time (season) lag.
The following statements illustrate how to use the TIMESERIES procedure to perform time domain analysis of time-stamped transactional data.
proc timeseries data=transactions out=out
                outcorr=timedomain;
   by customer;
   id date interval=day accumulate=total;
   var withdrawals deposits;
run;
The output data set specified by the OUTCORR=TIMEDOMAIN option contains the time domain statistics, such as sample autocorrelations and partial autocorrelations, by each customer.
By default, the TIMESERIES procedure produces no printed output.
Functional Summary
Table 27.1 summarizes the statements and options that control the TIMESERIES procedure.
Table 27.1 TIMESERIES Functional Summary
Description                                                   Statement           Option

Statements
specify BY-group processing                                   BY
specify variables to analyze                                  VAR
specify cross variables to analyze                            CROSSVAR
specify the time ID variable                                  ID
specify correlation options                                   CORR
specify cross-correlation options                             CROSSCORR
specify decomposition options                                 DECOMP
specify seasonal statistics options                           SEASON
specify trend statistics options                              TREND

Data Set Options
specify the input data set                                    PROC TIMESERIES     DATA=
specify the output data set                                   PROC TIMESERIES     OUT=
specify correlations output data set                          PROC TIMESERIES     OUTCORR=
specify cross-correlations output data set                    PROC TIMESERIES     OUTCROSSCORR=
specify decomposition output data set                         PROC TIMESERIES     OUTDECOMP=
specify seasonal statistics output data set                   PROC TIMESERIES     OUTSEASON=
specify summary statistics output data set                    PROC TIMESERIES     OUTSUM=
specify trend statistics output data set                      PROC TIMESERIES     OUTTREND=

Accumulation and Seasonality Options
specify accumulation frequency                                ID                  INTERVAL=
specify length of seasonal cycle                              PROC TIMESERIES     SEASONALITY=
specify interval alignment                                    ID                  ALIGN=
specify that time ID variable values not be sorted            ID                  NOTSORTED
specify starting time ID value                                ID                  START=
specify ending time ID value                                  ID                  END=
specify accumulation statistic                                ID, VAR, CROSSVAR   ACCUMULATE=
specify missing value interpretation                          ID, VAR, CROSSVAR   SETMISSING=

Time-Stamped Data Seasonal Statistics Options
specify the form of the output data set                       SEASON              TRANSPOSE=

Time-Stamped Data Trend Statistics Options
specify the form of the output data set                       TREND               TRANSPOSE=
specify the number of time periods to be stored               TREND               NPERIODS=

Time Series Transformation Options
specify simple differencing                                   VAR, CROSSVAR       DIF=
specify seasonal differencing                                 VAR, CROSSVAR       SDIF=
specify transformation                                        VAR, CROSSVAR       TRANSFORM=

Time Series Correlation Options
specify the list of lags                                      CORR                LAGS=
specify the number of lags                                    CORR                NLAG=
specify the number of parameters                              CORR                NPARMS=
specify the form of the output data set                       CORR                TRANSPOSE=

Time Series Cross-Correlation Options
specify the list of lags                                      CROSSCORR           LAGS=
specify the number of lags                                    CROSSCORR           NLAG=
specify the form of the output data set                       CROSSCORR           TRANSPOSE=

Time Series Decomposition Options
specify mode of decomposition                                 DECOMP              MODE=
specify the Hodrick-Prescott filter parameter                 DECOMP              LAMBDA=
specify the number of time periods to be stored               DECOMP              NPERIODS=
specify the form of the output data set                       DECOMP              TRANSPOSE=

Printing Control Options
specify time ID format                                        ID                  FORMAT=
specify printed output                                        PROC TIMESERIES     PRINT=
specify detailed printed output                               PROC TIMESERIES     PRINTDETAILS
specify univariate graphical output                           PROC TIMESERIES     PLOTS=
specify cross-variable graphical output                       PROC TIMESERIES     CROSSPLOTS=

Miscellaneous Options
specify that analysis variables be processed in sorted order  PROC TIMESERIES     SORTNAMES
limit error and warning messages                              PROC TIMESERIES     MAXERROR=

ODS Graphics Options
specify the cross-variable graphical output                   PROC TIMESERIES     CROSSPLOT=
specify the variable graphical output                         PROC TIMESERIES     PLOT=
DATA= SAS-data-set
names the SAS data set that contains the input data for the procedure to create the time series. If the DATA= option is not specified, the most recently created SAS data set is used.
CROSSPLOTS= option | ( options )
specifies the cross-variable graphical output desired. By default, the TIMESERIES procedure produces no graphical output. The following plotting options are available:
SERIES      plots the time series (OUT= data set).
CCF         plots the cross-correlation functions (OUTCROSSCORR= data set).
ALL         same as CROSSPLOTS=(SERIES CCF).
For example, CROSSPLOTS=SERIES plots the two time series. The CROSSPLOTS= option produces graphical output for these results by using the Output Delivery System (ODS). The CROSSPLOTS= option produces results similar to the data sets listed in parentheses next to the preceding options.
MAXERROR= number
limits the number of warning and error messages that are produced during the execution of the procedure to the specified value. The default is MAXERROR=50. This option is particularly useful in BY-group processing, where it can be used to suppress recurring messages.
OUT= SAS-data-set
names the output data set to contain the time series variables specified in the subsequent VAR and CROSSVAR statements. If BY variables are specified, they are also included in the OUT= data set. If an ID variable is specified, it is also included in the OUT= data set. The values are accumulated based on the ID statement INTERVAL= or the ACCUMULATE= option or both. The OUT= data set is particularly useful when you want to further analyze, model, or forecast the resulting time series with other SAS/ETS procedures.
OUTCORR= SAS-data-set
names the output data set to contain the univariate time domain statistics.
OUTCROSSCORR= SAS-data-set
names the output data set to contain the cross-correlation statistics.
OUTDECOMP= SAS-data-set
names the output data set to contain the decomposed and/or seasonally adjusted time series.
OUTSEASON= SAS-data-set
names the output data set to contain the seasonal statistics. The statistics are computed for each season as specified by the ID statement INTERVAL= option or the PROC TIMESERIES statement SEASONALITY= option. The OUTSEASON= data set is particularly useful when analyzing transactional data for seasonal variations.
OUTSUM= SAS-data-set
names the output data set to contain the descriptive statistics. The descriptive statistics are based on the accumulated time series when the ACCUMULATE= and/or SETMISSING= options are specified in the ID or VAR statements. The OUTSUM= data set is particularly useful when analyzing large numbers of series and a summary of the results is needed.
OUTTREND= SAS-data-set
names the output data set to contain the trend statistics. The statistics are computed for each time period as specified by the ID statement INTERVAL= option. The OUTTREND= data set is particularly useful when analyzing transactional data for trends.
PLOTS= option | ( options )
specifies the univariate graphical output desired. By default, the TIMESERIES procedure produces no graphical output. The following plotting options are available:
SERIES      plots the time series (OUT= data set).
RESIDUAL    plots the residual time series (OUT= data set).
CYCLES      plots the seasonal cycles (OUT= data set).
CORR        plots the correlation panel (OUTCORR= data set).
ACF         plots the autocorrelation function (OUTCORR= data set).
PACF        plots the partial autocorrelation function (OUTCORR= data set).
IACF        plots the inverse autocorrelation function (OUTCORR= data set).
WN          plots the white noise probabilities (OUTCORR= data set).
DECOMP      plots the seasonal adjustment panel (OUTDECOMP= data set).
TCS         plots the trend-cycle-seasonal component (OUTDECOMP= data set).
TCC         plots the trend-cycle component (OUTDECOMP= data set).
SIC         plots the seasonal-irregular component (OUTDECOMP= data set).
SC          plots the seasonal component (OUTDECOMP= data set).
SA          plots the seasonally adjusted component (OUTDECOMP= data set).
PCSA        plots the percent change in the seasonally adjusted component (OUTDECOMP= data set).
IC          plots the irregular component (OUTDECOMP= data set).
TC          plots the trend component (OUTDECOMP= data set).
CC          plots the cycle component (OUTDECOMP= data set).
ALL         same as PLOTS=(SERIES ACF PACF IACF WN).
For example, PLOTS=SERIES plots the time series. The PLOTS= option produces graphical output for these results by using the Output Delivery System (ODS). The PLOTS= option produces results similar to the data sets listed in parentheses next to the preceding options.
PRINT= option | ( options )
specifies the printed output desired. By default, the TIMESERIES procedure produces no printed output. The following printing options are available:
DECOMP      prints the seasonal decomposition/adjustment table (OUTDECOMP= data set).
SEASONS     prints the seasonal statistics table (OUTSEASON= data set).
DESCSTATS   prints the descriptive statistics for the accumulated time series (OUTSUM= data set).
SUMMARY     prints the descriptive statistics table for all time series (OUTSUM= data set).
TRENDS      prints the trend statistics table (OUTTREND= data set).
For example, PRINT=SEASONS prints the seasonal statistics. The PRINT= option produces printed output for these results by using the Output Delivery System (ODS). The PRINT= option produces results similar to the data sets listed in parentheses next to the preceding options.
PRINTDETAILS
specifies that output requested with the PRINT= option be printed in greater detail.
SEASONALITY= number
specifies the length of the seasonal cycle. For example, SEASONALITY=3 means that every group of three time periods forms a seasonal cycle. By default, the length of the seasonal cycle is one (no seasonality) or the length implied by the INTERVAL= option specified in the ID statement. For example, INTERVAL=MONTH implies that the length of the seasonal cycle is 12.
SORTNAMES
specifies that the variables specified in the VAR and CROSSVAR statements be processed in sorted order by the variable names. This option allows the output data sets to be presorted by the variable names.
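As a minimal sketch of how several PROC TIMESERIES statement options can be combined, the following statements reuse the WORK.TRANSACTIONS data set from the bank example. The output data set names (SERIES and SUMSTATS) and the MAXERROR= value are illustrative assumptions, and the ODS GRAPHICS statements are assumed to be required in your session for the PLOTS= output to be rendered.

```sas
/* Sketch only: SERIES and SUMSTATS are hypothetical data set names */
ods graphics on;

proc timeseries data=transactions
                out=series
                outsum=sumstats
                seasonality=7              /* same cycle length that INTERVAL=DAY implies */
                print=(descstats seasons)  /* printed tables through ODS */
                plots=(series acf)         /* graphical output through ODS */
                maxerror=10;               /* limit repeated BY-group messages */
   by customer;
   id date interval=day accumulate=total;
   var withdrawals deposits;
run;

ods graphics off;
```

Because SEASONALITY=7 matches the default implied by INTERVAL=DAY, it is written out here only to show where the option goes.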
BY Statement
A BY statement can be used with PROC TIMESERIES to obtain separate analyses for groups of observations defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:
   - Sort the data by using the SORT procedure with a similar BY statement.
   - Specify the option NOTSORTED or DESCENDING in the BY statement for the TIMESERIES procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
   - Create an index on the BY variables by using the DATASETS procedure.
For more information about the BY statement, see SAS Language Reference: Concepts. For more information about the DATASETS procedure, see the discussion in the Base SAS Procedures Guide.
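The first alternative can be sketched as follows, again assuming the WORK.TRANSACTIONS data set from the bank example. The data set name SORTED is a hypothetical name chosen for illustration.

```sas
/* Sketch only: SORTED is a hypothetical data set name */
proc sort data=transactions out=sorted;
   by customer date;   /* BY variable first, then the time ID */
run;

proc timeseries data=sorted out=timeseries;
   by customer;        /* input is now in ascending CUSTOMER order */
   id date interval=day accumulate=total;
   var withdrawals deposits;
run;
```

Sorting by DATE within CUSTOMER is not strictly required by the BY statement, but it keeps each group's time ID values in order.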
CORR Statement
CORR statistics < / options > ;
A CORR statement can be used with the TIMESERIES procedure to specify options related to time domain analysis of the accumulated time series. Only one CORR statement is allowed. The following time domain statistics are available:
LAG         time lag
N           number of variance products
ACOV        autocovariances
ACF         autocorrelations
ACFSTD      autocorrelation standard errors
ACF2STD     two standard errors beyond autocorrelations
ACFNORM     normalized autocorrelations
ACFPROB     autocorrelation probabilities
ACFLPROB    autocorrelation log probabilities
PACF        partial autocorrelations
PACFSTD     partial autocorrelation standard errors
PACF2STD    two standard errors beyond partial autocorrelations
PACFNORM    partial normalized autocorrelations
PACFPROB    partial autocorrelation probabilities
PACFLPROB   partial autocorrelation log probabilities
IACF        inverse autocorrelations
IACFSTD     inverse autocorrelation standard errors
IACF2STD    two standard errors beyond inverse autocorrelations
IACFNORM    normalized inverse autocorrelations
IACFPROB    inverse autocorrelation probabilities
IACFLPROB   inverse autocorrelation log probabilities
WN          white noise test statistics
WNPROB      white noise test probabilities
WNLPROB     white noise test log probabilities
The following options can be specified in the CORR statement following the slash (/):
NLAG= number
specifies the number of lags to be stored in the OUTCORR= data set or to be plotted. The default is 24 or three times the length of the seasonal cycle, whichever is smaller. The LAGS= option takes precedence over the NLAG= option.
LAGS= (numlist)
specifies the list of lags to be stored in the OUTCORR= data set or to be plotted. The list of lags must be separated by spaces or commas. For example, LAGS=(1,3) specifies the first then third lag.
NPARMS= number
specifies the number of parameters used in the model that created the residual time series. The number of parameters determines the degrees of freedom associated with the Ljung-Box statistics. The default is NPARMS=0. This option is useful when analyzing the residuals of a time series model with the number of parameters specified by the NPARMS=number option.
TRANSPOSE= NO|YES
TRANSPOSE=YES specifies that the OUTCORR= data set be recorded with the lags as the column names instead of with the correlation statistics as the column names. The TRANSPOSE=NO option is useful for graphing the correlation results with SAS/GRAPH procedures. The TRANSPOSE=YES option is useful for analyzing the correlation results with other SAS procedures such as the CLUSTER procedure of SAS/STAT or with SAS Enterprise Miner. The default is TRANSPOSE=NO.
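A CORR statement that uses these options might look like the following sketch. It reuses the bank example; the choice of statistics and NLAG=14 (two weekly cycles) is an illustrative assumption, not a recommendation.

```sas
/* Sketch only: the statistics list and NLAG= value are illustrative */
proc timeseries data=transactions outcorr=timedomain;
   by customer;
   id date interval=day accumulate=total;
   var withdrawals deposits;
   corr lag n acf acfstd pacf / nlag=14 transpose=no;
run;
```

With TRANSPOSE=NO, each requested statistic is expected to appear as a column of WORK.TIMEDOMAIN, with one row per lag for each variable and BY group.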
CROSSCORR Statement
CROSSCORR statistics < / options > ;
A CROSSCORR statement can be used with the TIMESERIES procedure to specify options that are related to cross-correlation analysis of the accumulated time series. Only one CROSSCORR statement is allowed. The following time domain statistics are available:
LAG         time lag
N           number of variance products
CCOV        cross covariances
CCF         cross-correlations
CCFSTD      cross-correlation standard errors
CCF2STD     two standard errors beyond cross-correlation
CCFNORM     normalized cross-correlations
CCFPROB     cross-correlation probabilities
CCFLPROB    cross-correlation log probabilities
The following options can be specified in the CROSSCORR statement following the slash (/):
NLAG= number
specifies the number of lags to be stored in the OUTCROSSCORR= data set or to be plotted. The default is 24 or three times the length of the seasonal cycle, whichever is smaller. The LAGS= option takes precedence over the NLAG= option.
LAGS=( numlist )
specifies a list of lags to be stored in the OUTCROSSCORR= data set or to be plotted. The list of lags must be separated by spaces or commas. For example, LAGS=(1,3) specifies the first then third lag.
TRANSPOSE= NO|YES
TRANSPOSE=YES specifies that the OUTCROSSCORR= data set be recorded with the lags as the column names instead of with the cross-correlation statistics as the column names. The TRANSPOSE=NO option is useful for graphing the cross-correlation results with SAS/GRAPH procedures. The TRANSPOSE=YES option is useful for analyzing the cross-correlation results with other procedures such as the CLUSTER procedure of SAS/STAT or with SAS Enterprise Miner. The default is TRANSPOSE=NO.
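The following sketch shows one way a CROSSCORR statement might be paired with VAR and CROSSVAR statements, using the bank example. The data set name CROSSSTATS is hypothetical, NLAG=7 is an arbitrary choice, and CCF is assumed to be the keyword for the cross-correlation statistic (it matches the CCF plotting option).

```sas
/* Sketch only: CROSSSTATS is a hypothetical data set name;
   CCF as a statistic keyword is an assumption */
proc timeseries data=transactions outcrosscorr=crossstats;
   by customer;
   id date interval=day accumulate=total;
   var withdrawals;      /* each variable appears in only one    */
   crossvar deposits;    /* VAR or CROSSVAR statement            */
   crosscorr lag n ccf / nlag=7;
run;
```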
DECOMP Statement
DECOMP components < / options > ;
A DECOMP statement can be used with the TIMESERIES procedure to specify options related to classical seasonal decomposition of the time series data. Only one DECOMP statement is allowed. The options specified affect all variables listed in the VAR statements. Decomposition can be performed only when the length of the seasonal cycle specified by the PROC TIMESERIES statement SEASONALITY= option or implied by the ID statement INTERVAL= option is greater than one.
The following seasonal decomposition components are available:
ORIG | ORIGINAL            original series
TCC | TRENDCYCLE           trend-cycle component
SIC | SEASONIRREGULAR      seasonal-irregular component
SC | SEASONAL              seasonal component
SCSTD                      seasonal component standard errors
TCS | TRENDCYCLESEASON     trend-cycle-seasonal component
IC | IRREGULAR             irregular component
SA | ADJUSTED              seasonally adjusted series
PCSA                       percent change seasonally adjusted series
TC                         trend component
CC | CYCLE                 cycle component
The following options can be specified in the DECOMP statement following the slash (/):
MODE= option
specifies the type of decomposition to be used to decompose the time series. The following values can be specified for the MODE= option:
ADD | ADDITIVE               additive decomposition
MULT | MULTIPLICATIVE        multiplicative decomposition
LOGADD | LOGADDITIVE         log-additive decomposition
PSEUDOADD | PSEUDOADDITIVE   pseudo-additive decomposition
Multiplicative and log-additive decomposition require strictly positive time series. If the accumulated time series contains nonpositive values and the MODE=MULT or MODE=LOGADD option is specified, an error results. Pseudo-additive decomposition requires a nonnegative-valued time series. If the accumulated time series contains negative values and the MODE=PSEUDOADD option is specified, an error results. The MODE=MULTORADD option specifies that multiplicative decomposition be used when the accumulated time series contains only positive values, that pseudo-additive decomposition be used when the accumulated time series contains only nonnegative values, and that additive decomposition be used otherwise. The default is MODE=MULTORADD.
LAMBDA= number
specifies the Hodrick-Prescott filter parameter for trend-cycle decomposition. The default is LAMBDA=1600. Filtering applies when the trend component or the cycle component is requested. If filtering is not specified, this option is ignored.
NPERIODS= number
specifies the number of time periods to be stored in the OUTDECOMP= data set when the TRANSPOSE=YES option is specified. If the TRANSPOSE=NO option is specified, the NPERIODS= option is ignored. If the NPERIODS= option is positive, the first or beginning time periods are recorded. If the NPERIODS= option is negative, the last or ending time periods are recorded. The NPERIODS= option specifies the number of OUTDECOMP= data set variables to contain the seasonal decomposition and is therefore limited to the maximum allowable number of SAS variables. If the number of time periods exceeds this limit, a warning is printed in the log and the number of periods stored is reduced to the limit. If the NPERIODS= option is not specified, all of the periods specified between the ID statement START= and END= options are stored. If either the START= or the END= option is not specified, the default magnitude is the seasonality specified by the PROC TIMESERIES statement SEASONALITY= option or implied by the ID statement INTERVAL= option. If only the START= option is specified, the default sign is positive. If only the END= option is specified, the default sign is negative.
TRANSPOSE= NO | YES
TRANSPOSE=YES specifies that the OUTDECOMP= data set be recorded with the time periods as the column names instead of with the statistics as the column names. The first and last time periods stored in the OUTDECOMP= data set correspond to the period of the ID statement START= option and END= option, respectively. If only the ID statement END= option is specified, the last time ID value of each accumulated time series corresponds to the last time period column. If only the ID statement START= option is specified, the first time ID value of each accumulated time series corresponds to the first time period column. If neither the START= option nor the END= option is specified with the ID statement, the first time ID value of each accumulated time series corresponds to the first time period column. The TRANSPOSE=NO option is useful for analyzing or displaying the decomposition results with SAS/GRAPH procedures. The TRANSPOSE=YES option is useful for analyzing the decomposition results with other SAS procedures or SAS Enterprise Miner. The default is TRANSPOSE=NO.
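A DECOMP statement with explicit components and options might be written as in the following sketch, which reuses the bank example. MODE=ADD is chosen here only because additive decomposition places no sign restriction on the accumulated values; the component list and SETMISSING=0 are likewise illustrative assumptions.

```sas
/* Sketch only: component list and option values are illustrative */
proc timeseries data=transactions outdecomp=decompose;
   by customer;
   id date interval=day accumulate=total setmissing=0;
   var withdrawals deposits;
   /* TC and CC are requested, so the LAMBDA= filter parameter applies */
   decomp orig tcc sc sa tc cc / mode=add lambda=1600;
run;
```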
ID Statement
ID variable INTERVAL=interval < options > ;
The ID statement names a numeric variable that identifies observations in the input and output data sets. The ID variable's values are assumed to be SAS date, time, or datetime values. In addition, the ID statement specifies the (desired) frequency associated with the time series. The ID statement options also specify how the observations are accumulated and how the time ID values are aligned to form the time series. The information specified affects all variables listed in subsequent VAR statements. If the ID statement is specified, the INTERVAL= option must also be used. If an ID statement is not specified, the observation number, with respect to the BY group, is used as the time ID. The following options can be used with the ID statement:
ACCUMULATE= option
specifies how the data set observations are to be accumulated within each time period. The frequency (width of each time interval) is specified by the INTERVAL= option. The ID variable contains the time ID values. Each time ID variable value corresponds to a specific time period. The accumulated values form the time series, which is used in subsequent analysis. The ACCUMULATE= option is useful when there are zero or more than one input observations that coincide with a particular time period (for example, time-stamped transactional data). The EXPAND procedure offers additional frequency conversions and transformations that can also be useful in creating a time series. The following options determine how the observations are accumulated within each time period based on the ID variable and the frequency specified by the INTERVAL= option:
NONE             No accumulation occurs; the ID variable values must be equally spaced with respect to the frequency. This is the default option.
TOTAL            Observations are accumulated based on the total sum of their values.
AVERAGE | AVG    Observations are accumulated based on the average of their values.
MINIMUM | MIN    Observations are accumulated based on the minimum of their values.
MEDIAN | MED     Observations are accumulated based on the median of their values.
MAXIMUM | MAX    Observations are accumulated based on the maximum of their values.
N                Observations are accumulated based on the number of nonmissing observations.
NMISS            Observations are accumulated based on the number of missing observations.
NOBS             Observations are accumulated based on the number of observations.
FIRST            Observations are accumulated based on the first of their values.
LAST             Observations are accumulated based on the last of their values.
STDDEV | STD     Observations are accumulated based on the standard deviation of their values.
CSS              Observations are accumulated based on the corrected sum of squares of their values.
USS              Observations are accumulated based on the uncorrected sum of squares of their values.
If the ACCUMULATE= option is specified, the SETMISSING= option is useful for specifying how accumulated missing values are to be treated. If missing values should be interpreted as zero, then SETMISSING=0 should be used. The section "Details: TIMESERIES Procedure" describes accumulation in greater detail.
ALIGN= option
controls the alignment of SAS dates used to identify output observations. The ALIGN= option accepts the following values: BEGINNING|BEG|B, MIDDLE|MID|M, and ENDING|END|E. BEGINNING is the default.
END= option
specifies a SAS date, datetime, or time value that represents the end of the data. If the last time ID variable value is less than the END= value, the series is extended with missing values. If the last time ID variable value is greater than the END= value, the series is truncated. For example, END="&sysdate"D uses the automatic macro variable SYSDATE to extend or truncate the series to the current date. The START= and END= options can be used to ensure that data associated with each BY group contains the same number of observations.
FORMAT= format
specifies the SAS format for the time ID values. If the FORMAT= option is not specified, the default format is implied from the INTERVAL= option.
INTERVAL= interval
specifies the frequency of the accumulated time series. For example, if the input data set consists of quarterly observations, then INTERVAL=QTR should be used. If the PROC TIMESERIES statement SEASONALITY= option is not specified, the length of the seasonal cycle is implied from the INTERVAL= option. For example, INTERVAL=QTR implies a seasonal cycle of length 4. If the ACCUMULATE= option is also specified, the INTERVAL= option determines the time periods for the accumulation of observations. The INTERVAL= option is required and must be the first option specified in the ID statement.
NOTSORTED
specifies that the time ID values are not in sorted order. The TIMESERIES procedure sorts the data with respect to the time ID prior to analysis.
SETMISSING= option | number
specifies how missing values (either actual or accumulated) are to be interpreted in the accumulated time series. If a number is specified, missing values are set to the number. If a missing value indicates an unknown value, this option should not be used. If a missing value indicates no value, SETMISSING=0 should be used. You would typically use SETMISSING=0 for transactional data because no recorded data usually implies no activity. The following options can also be used to determine how missing values are assigned:
MISSING          Missing values are set to missing. This is the default option.
AVERAGE | AVG    Missing values are set to the accumulated average value.
MINIMUM | MIN    Missing values are set to the accumulated minimum value.
MEDIAN | MED     Missing values are set to the accumulated median value.
MAXIMUM | MAX    Missing values are set to the accumulated maximum value.
FIRST            Missing values are set to the accumulated first nonmissing value.
LAST             Missing values are set to the accumulated last nonmissing value.
PREVIOUS | PREV  Missing values are set to the previous period's accumulated nonmissing value. Missing values at the beginning of the accumulated series remain missing.
NEXT             Missing values are set to the next period's accumulated nonmissing value. Missing values at the end of the accumulated series remain missing.
START= option
specifies a SAS date, datetime, or time value that represents the beginning of the data. If the first time ID variable value is greater than the START= value, the series is prepended with missing values. If the first time ID variable value is less than the START= value, the series is truncated. The START= and END= options can be used to ensure that data associated with each BY group contains the same number of observations.
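The ID statement options can be combined as in the following sketch, which reuses the bank example. The START= and END= dates are hypothetical values chosen for illustration; with them, every customer's series is expected to cover the same span of days, and days without transactions are set to zero.

```sas
/* Sketch only: the START= and END= dates are hypothetical */
proc timeseries data=transactions out=timeseries;
   by customer;
   id date interval=day          /* INTERVAL= must come first        */
           accumulate=total      /* daily totals                     */
           setmissing=0          /* no transactions means no activity */
           align=beginning
           start='01jan2008'd
           end='31dec2008'd;
   var withdrawals deposits;
run;
```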
SEASON Statement
SEASON statistics < / options > ;
A SEASON statement can be used with the TIMESERIES procedure to specify options that are related to seasonal analysis of the time-stamped transactional data. Only one SEASON statement is allowed. The options specified affect all variables specified in the VAR statements. Seasonal analysis can be performed only when the length of the seasonal cycle specified by the PROC TIMESERIES statement SEASONALITY= option or implied by the ID statement INTERVAL= option is greater than one. The following seasonal statistics are available:
NOBS        number of observations
N           number of nonmissing observations
NMISS       number of missing observations
MINIMUM     minimum value
MAXIMUM     maximum value
RANGE       range value
SUM         summation value
MEAN        mean value
STDDEV      standard deviation
CSS         corrected sum of squares
USS         uncorrected sum of squares
MEDIAN      median value
The following option can be specified in the SEASON statement following the slash (/):
TRANSPOSE= NO | YES
TRANSPOSE=YES specifies that the OUTSEASON= data set be recorded with the seasonal indices as the column names instead of with the statistics as the column names. The TRANSPOSE=NO option is useful for graphing the seasonal analysis results with SAS/GRAPH procedures. The TRANSPOSE=YES option is useful for analyzing the seasonal analysis results using other SAS procedures or SAS Enterprise Miner. The default is TRANSPOSE=NO.
TREND Statement
TREND statistics < / options > ;
A TREND statement can be used with the TIMESERIES procedure to specify options related to trend analysis of the time-stamped transactional data. Only one TREND statement is allowed. The options specified affect all variables specified in the VAR statements. The following trend statistics are available:
NOBS        number of observations
N           number of nonmissing observations
NMISS       number of missing observations
MINIMUM     minimum value
MAXIMUM     maximum value
RANGE       range value
SUM         summation value
MEAN        mean value
STDDEV      standard deviation
CSS         corrected sum of squares
USS         uncorrected sum of squares
MEDIAN      median value
The following options can be specified in the TREND statement following the slash (/):
NPERIODS= number
specifies the number of time periods to be stored in the OUTTREND= data set when the TRANSPOSE=YES option is specified. If the TRANSPOSE=NO option is specified, the NPERIODS= option is ignored. The NPERIODS= option specifies the number of OUTTREND= data set variables to contain the trend statistics and is therefore limited to the maximum allowable number of SAS variables. If the NPERIODS= option is not specified, all of the periods specified between the ID statement START= and END= options are stored. If either the START= or the END= option is not specified, the default is the seasonality specified by the PROC TIMESERIES statement SEASONALITY= option or implied by the ID statement INTERVAL= option. If the seasonality is zero, the default is NPERIODS=5.
TRANSPOSE= NO | YES
TRANSPOSE=YES specifies that the OUTTREND= data set be recorded with the time periods as the column names instead of with the statistics as the column names. The first and last time periods stored in the OUTTREND= data set correspond to the period of the ID statement START= and END= options, respectively. If only the ID statement END= option is specified, the last time ID value of each accumulated time series corresponds to the last time period column. If only the ID statement START= option is specified, the first time ID value of each accumulated time series corresponds to the first time period column. If neither the START= option nor the END= option is specified with the ID statement, the first time ID value of each accumulated time series corresponds to the first time period column. The TRANSPOSE=NO option is useful for analyzing or displaying the trend analysis results with SAS/GRAPH procedures. The TRANSPOSE=YES option is useful for analyzing the trend analysis results with other SAS procedures or SAS Enterprise Miner. The default is TRANSPOSE=NO.
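The SEASON and TREND statements can appear together, as in the following sketch based on the bank example. The statistics requested and the NPERIODS=30 value are illustrative assumptions.

```sas
/* Sketch only: statistics and NPERIODS= value are illustrative */
proc timeseries data=transactions
                outseason=season outtrend=trend;
   by customer;
   id date interval=day accumulate=total;
   var withdrawals deposits;
   season sum mean / transpose=yes;             /* columns are seasonal indices */
   trend  sum mean / transpose=yes nperiods=30; /* columns are time periods     */
run;
```

Leaving out TRANSPOSE=YES would instead store each statistic as its own column, in which case NPERIODS= is ignored.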
VAR and CROSSVAR Statements
The VAR and CROSSVAR statements list the numeric variables in the DATA= data set whose values are to be accumulated to form the time series. An input data set variable can be specified in only one VAR or CROSSVAR statement. Any number of VAR and CROSSVAR statements can be used. The following options can be used with the VAR and CROSSVAR statements:
ACCUMULATE= option
species how the data set observations are to be accumulated within each time period for the variables listed in the VAR or CROSSVAR statement. If the ACCUMULATE= option is not specied in the VAR or CROSSVAR statement, accumulation is determined by the ACCUMULATE= option of the ID statement. See the ID statement ACCUMULATE= option for more details.
DIF=( numlist )
species the differencing to be applied to the accumulated time series. The list of differencing orders must be separated by spaces or commas. For example, DIF=(1,3) species rst then third order differencing. Differencing is applied after time series transformation. The TRANSFORM= option is applied before the DIF= option.
SDIF=( numlist )
specifies the seasonal differencing to be applied to the accumulated time series. The list of seasonal differencing orders must be separated by spaces or commas. For example, SDIF=(1,3) specifies first then third order seasonal differencing. Differencing is applied after time series transformation; that is, the TRANSFORM= option is applied before the SDIF= option.
SETMISS= option | number
SETMISSING= option | number
specifies how missing values (either actual or accumulated) are to be interpreted in the accumulated time series for variables listed in the VAR or CROSSVAR statement. If the SETMISSING= option is not specified in the VAR or CROSSVAR statement, missing values are set based on the SETMISSING= option of the ID statement. See the ID statement SETMISSING= option for more details.
TRANSFORM= option
specifies the time series transformation to be applied to the accumulated time series. The following transformations are provided:

NONE         No transformation is applied. This option is the default.
LOG          logarithmic transformation
SQRT         square-root transformation
LOGISTIC     logistic transformation
BOXCOX(n)    Box-Cox transformation with parameter number, where the number is between -5 and 5

When the TRANSFORM= option is specified, the time series must be strictly positive.
3. time series transformation      TRANSFORM= option
4. time series differencing        DIF= and SDIF= option
5. descriptive statistics          OUTSUM= option, PRINT=DESCSTATS
6. seasonal decomposition          DECOMP statement, OUTDECOMP= option
7. correlation analysis            CORR statement, OUTCORR= option
8. cross-correlation analysis      CROSSCORR statement, OUTCROSSCORR= option
Accumulation
If the ACCUMULATE= option in the ID, VAR, or CROSSVAR statement is specified, data set observations are accumulated within each time period. The frequency (width of each time interval) is specified by the ID statement INTERVAL= option. The ID variable contains the time ID values. Each time ID value corresponds to a specific time period. Accumulation is useful when the input data set contains transactional data, whose observations are not spaced with respect to any particular time interval. The accumulated values form the time series, which is used in subsequent analyses. For example, suppose a data set contains the following observations:

   19MAR1999    10
   19MAR1999    30
   11MAY1999    50
   12MAY1999    20
   23MAY1999    20

If INTERVAL=MONTH is specified, all of the above observations fall within a three-month period of time between March 1999 and May 1999. The observations are accumulated within each time period as follows:

If the ACCUMULATE=NONE option is specified, an error is generated because the ID variable values are not equally spaced with respect to the specified frequency (MONTH).

If the ACCUMULATE=TOTAL option is specified, the resulting time series is:

   01MAR1999    40
   01APR1999     .
   01MAY1999    90
As can be seen from the above examples, even though the data set observations contain no missing values, the accumulated time series can have missing values.
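The ACCUMULATE=TOTAL result above can be reproduced outside SAS with a short Python sketch. This is only an illustration of the accumulation rule, not the procedure's implementation; the function name `accumulate_total_by_month` and the use of `None` for a missing value are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date

# Transactional observations from the example above: (time ID, value).
observations = [
    (date(1999, 3, 19), 10),
    (date(1999, 3, 19), 30),
    (date(1999, 5, 11), 50),
    (date(1999, 5, 12), 20),
    (date(1999, 5, 23), 20),
]

def accumulate_total_by_month(obs):
    """Sum values within each calendar month.

    Months between the first and last observed month that contain no
    observations are kept in the series with a missing value (None).
    """
    totals = defaultdict(float)
    for day, value in obs:
        totals[(day.year, day.month)] += value
    first, last = min(totals), max(totals)
    series = []
    year, month = first
    while (year, month) <= last:
        # dict.get does not trigger the defaultdict factory, so empty
        # months come back as None rather than 0.0.
        series.append((date(year, month, 1), totals.get((year, month))))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return series

series = accumulate_total_by_month(observations)
for period, total in series:
    print(period.isoformat(), "." if total is None else total)
```

April has no transactions, so its accumulated value is missing even though no input observation is missing.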
be interpreted as no value, that is, zero. In the former case, the SETMISSING= option can be used to specify how missing values are treated. The SETMISSING=0 option should be used when missing observations are to be treated as zero values. In other cases, missing values should be interpreted as global values, such as minimum or maximum values of the accumulated series. The accumulated and interpreted time series is used in subsequent analyses.
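\[ w_t = \ln\left( \frac{c\,y_t}{1 - c\,y_t} \right) \]

where the scaling factor $c$ is

\[ c = (1 - 10^{-6})\,10^{-\mathrm{ceil}(\log_{10}(\max(y_t)))} \]

and $\mathrm{ceil}(x)$ is the smallest integer greater than or equal to $x$.

Square root is the square root transformation.

\[ w_t = \sqrt{y_t} \]

Box Cox is the Box-Cox transformation.

\[ w_t = \begin{cases} \dfrac{y_t^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(y_t), & \lambda = 0 \end{cases} \]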
More complex time series transformations can be performed by using the EXPAND procedure of SAS/ETS.
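As a small numerical illustration of the Box-Cox transformation, $w_t = (y_t^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $w_t = \ln(y_t)$ for $\lambda = 0$, the following Python sketch applies it to a strictly positive series. The function name `box_cox` is hypothetical and the code is not taken from SAS; the range check mirrors the documented BOXCOX(n) limits.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive series y with parameter lam."""
    if not -5 <= lam <= 5:
        raise ValueError("the Box-Cox parameter must be between -5 and 5")
    if any(value <= 0 for value in y):
        raise ValueError("the series must be strictly positive")
    if lam == 0:
        # Limiting case of the power transform as lam approaches 0.
        return [math.log(value) for value in y]
    return [(value ** lam - 1) / lam for value in y]

print(box_cox([1.0, 4.0, 9.0], 0.5))   # square-root-like scale
print(box_cox([1.0, 4.0, 9.0], 0))     # natural log
```

With lam = 1 the transform only shifts the series by one, and lam = 0 reproduces the logarithmic transformation.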
Additionally, assuming yt is strictly positive, the VAR and CROSSVAR statement TRANSFORM=, DIF=, and SDIF= options can be combined.
Descriptive Statistics
Descriptive statistics can be computed from the working series by specifying the OUTSUM= option or PRINT=DESCSTATS.
Seasonal Decomposition
Seasonal decomposition/analysis can be performed on the working series by specifying the OUTDECOMP= option, the PRINT=DECOMP option, or one of the PLOTS= options associated with decomposition in the PROC TIMESERIES statement. The DECOMP statement enables you to specify options related to decomposition. The TIMESERIES procedure uses classical decomposition. More complex seasonal decomposition/adjustment analysis can be performed by using the X11 or the X12 procedure of SAS/ETS.

The DECOMP statement MODE= option determines the mode of the seasonal adjustment decomposition to be performed. There are four modes: multiplicative (MODE=MULT), additive (MODE=ADD), pseudo-additive (MODE=PSEUDOADD), and log-additive (MODE=LOGADD) decomposition. The default is MODE=MULTORADD, which specifies MODE=MULT for series that are strictly positive, MODE=PSEUDOADD for series that are nonnegative, and MODE=ADD for series that are not nonnegative. When MODE=LOGADD is specified, the components are exponentiated to the original metric.

The DECOMP statement LAMBDA= option specifies the Hodrick-Prescott filter parameter (Hodrick and Prescott 1980). The default is LAMBDA=1600. The Hodrick-Prescott filter is used to decompose the trend-cycle component into the trend component and cycle component in an additive fashion. A smaller parameter assigns less significance to the cycle; that is, LAMBDA=0 implies no cycle component.

The notation and keywords associated with seasonal decomposition/adjustment analysis are defined in Table 27.2.
Table 27.2

Component                         Keyword    MODE= Option   Formula
original series                   ORIGINAL   MULT           $O_t = TC_t S_t I_t$
                                             ADD            $O_t = TC_t + S_t + I_t$
                                             LOGADD         $\log(O_t) = TC_t + S_t + I_t$
                                             PSEUDOADD      $O_t = TC_t (S_t + I_t - 1)$
trend-cycle component             TCC        MULT           centered moving average of $O_t$
                                             ADD            centered moving average of $O_t$
                                             LOGADD         centered moving average of $\log(O_t)$
                                             PSEUDOADD      centered moving average of $O_t$
seasonal-irregular component      SIC        MULT           $SI_t = S_t I_t = O_t / TC_t$
                                             ADD            $SI_t = S_t + I_t = O_t - TC_t$
                                             LOGADD         $SI_t = S_t + I_t = \log(O_t) - TC_t$
                                             PSEUDOADD      $SI_t = S_t + I_t - 1 = O_t / TC_t$
seasonal component                SC         MULT           seasonal averages of $SI_t$
                                             ADD            seasonal averages of $SI_t$
                                             LOGADD         seasonal averages of $SI_t$
                                             PSEUDOADD      seasonal averages of $SI_t$
irregular component               IC         MULT           $I_t = SI_t / S_t$
                                             ADD            $I_t = SI_t - S_t$
                                             LOGADD         $I_t = SI_t - S_t$
                                             PSEUDOADD      $I_t = SI_t - S_t + 1$
trend-cycle-seasonal component    TCS        MULT           $TCS_t = TC_t S_t = O_t / I_t$
                                             ADD            $TCS_t = TC_t + S_t = O_t - I_t$
                                             LOGADD         $TCS_t = TC_t + S_t = O_t - I_t$
                                             PSEUDOADD      $TCS_t = TC_t S_t$
trend component                   TC         MULT           $T_t = TC_t - C_t$
                                             ADD            $T_t = TC_t - C_t$
                                             LOGADD         $T_t = TC_t - C_t$
                                             PSEUDOADD      $T_t = TC_t - C_t$
cycle component                   CC         MULT           $C_t = TC_t - T_t$
                                             ADD            $C_t = TC_t - T_t$
                                             LOGADD         $C_t = TC_t - T_t$
                                             PSEUDOADD      $C_t = TC_t - T_t$
seasonally adjusted series        SA         MULT           $SA_t = O_t / S_t = TC_t I_t$
                                             ADD            $SA_t = O_t - S_t = TC_t + I_t$
                                             LOGADD         $SA_t = O_t / \exp(S_t) = \exp(TC_t + I_t)$
                                             PSEUDOADD      $SA_t = TC_t I_t$
The trend-cycle component is computed from the s-period centered moving average as follows:

\[ TC_t = \sum_{k=-\lfloor s/2 \rfloor}^{\lfloor s/2 \rfloor} y_{t+k} / s \]

The seasonal component is obtained by averaging the seasonal-irregular component for each season.

\[ S_{k+js} = \sum_{t = k \bmod s} \frac{SI_t}{T/s} \]

where $0 \le j \le T/s$ and $1 \le k \le s$. The seasonal components are normalized to sum to one (multiplicative) or zero (additive).
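The two formulas can be traced by hand on a small additive example. The Python sketch below is an illustration under stated assumptions, not the TIMESERIES implementation: it uses an odd seasonal period (so the centered window contains exactly s terms), zero-based time and season indices, and seasonal averages taken over the periods where the moving average is available. The function names are hypothetical.

```python
def centered_moving_average(y, s):
    """TC_t = sum of y[t-k .. t+k] / s with k = s // 2 (s assumed odd).

    End points without a full window are left missing (None).
    """
    half = s // 2
    tc = [None] * len(y)
    for t in range(half, len(y) - half):
        tc[t] = sum(y[t - half:t + half + 1]) / s
    return tc

def additive_seasonal_component(y, s):
    """Return (trend-cycle, seasonal factors) for MODE=ADD style formulas."""
    tc = centered_moving_average(y, s)
    # Seasonal-irregular component: SI_t = O_t - TC_t.
    si = [None if c is None else o - c for o, c in zip(y, tc)]
    means = []
    for k in range(s):
        values = [v for t, v in enumerate(si) if v is not None and t % s == k]
        means.append(sum(values) / len(values))
    # Normalize so that the additive seasonal factors sum to zero.
    shift = sum(means) / s
    return tc, [m - shift for m in means]

# Linear trend (t + 2) plus a repeating seasonal pattern (-1, -1, 2).
y = [1, 2, 6, 4, 5, 9, 7, 8, 12]
trend_cycle, seasonal = additive_seasonal_component(y, 3)
print(trend_cycle)
print(seasonal)
```

Because the example series is an exact trend plus a repeating pattern, the moving average recovers the trend and the seasonal averages recover the pattern.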
Correlation Analysis
Correlation analysis can be performed on the working series by specifying the OUTCORR= option or one of the PLOTS= options that are associated with correlation. The CORR statement enables you to specify options that are related to correlation analysis.
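Autocovariance Statistics

LAGS        $h \in \{0, \ldots, H\}$
N           $N_h$ is the number of observed products at lag $h$, ignoring missing values
ACOV        $\hat{\gamma}(h) = \frac{1}{T} \sum_{t=h+1}^{T} (y_t - \bar{y})(y_{t-h} - \bar{y})$
ACOV        $\hat{\gamma}(h) = \frac{1}{N_h} \sum_{t=h+1}^{T} (y_t - \bar{y})(y_{t-h} - \bar{y})$ when embedded missing values are present

Autocorrelation Statistics

ACF         $\hat{\rho}(h) = \hat{\gamma}(h) / \hat{\gamma}(0)$
ACFSTD      $Std(\hat{\rho}(h)) = \sqrt{\frac{1}{T}\left(1 + 2 \sum_{j=1}^{h-1} \hat{\rho}(j)^2\right)}$
ACFNORM     $Norm(\hat{\rho}(h)) = \hat{\rho}(h) / Std(\hat{\rho}(h))$
ACFPROB     $Prob(\hat{\rho}(h)) = 2\left(1 - \Phi(|Norm(\hat{\rho}(h))|)\right)$
ACFLPROB    $LogProb(\hat{\rho}(h)) = -\log_{10}(Prob(\hat{\rho}(h)))$
ACF2STD     $Flag(\hat{\rho}(h)) = \begin{cases} 1 & \hat{\rho}(h) > 2\,Std(\hat{\rho}(h)) \\ 0 & -2\,Std(\hat{\rho}(h)) < \hat{\rho}(h) < 2\,Std(\hat{\rho}(h)) \\ -1 & \hat{\rho}(h) < -2\,Std(\hat{\rho}(h)) \end{cases}$

PACFLPROB   $LogProb(\hat{\varphi}(h)) = -\log_{10}(Prob(\hat{\varphi}(h)))$
PACF2STD    $Flag(\hat{\varphi}(h)) = \begin{cases} 1 & \hat{\varphi}(h) > 2\,Std(\hat{\varphi}(h)) \\ 0 & -2\,Std(\hat{\varphi}(h)) < \hat{\varphi}(h) < 2\,Std(\hat{\varphi}(h)) \\ -1 & \hat{\varphi}(h) < -2\,Std(\hat{\varphi}(h)) \end{cases}$

IACFLPROB   $LogProb(\hat{\theta}(h)) = -\log_{10}(Prob(\hat{\theta}(h)))$
IACF2STD    $Flag(\hat{\theta}(h)) = \begin{cases} 1 & \hat{\theta}(h) > 2\,Std(\hat{\theta}(h)) \\ 0 & -2\,Std(\hat{\theta}(h)) < \hat{\theta}(h) < 2\,Std(\hat{\theta}(h)) \\ -1 & \hat{\theta}(h) < -2\,Std(\hat{\theta}(h)) \end{cases}$

WNLPROB     $LogProb(Q(h)) = -\log_{10}(Prob(Q(h)))$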
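To make the autocorrelation statistics concrete, the following Python sketch computes ACF, ACFSTD, ACFNORM, ACFPROB, ACFLPROB, and ACF2STD for a series without missing values, using the divisor-T autocovariance and the standard error $\sqrt{(1 + 2\sum_{j=1}^{h-1}\hat{\rho}(j)^2)/T}$. It is a sketch of the formulas only; the function names and the dictionary layout are assumptions made for this example.

```python
import math

def autocovariance(y, h):
    """Sample autocovariance at lag h with divisor T (no missing values)."""
    T = len(y)
    mean = sum(y) / T
    return sum((y[t] - mean) * (y[t - h] - mean) for t in range(h, T)) / T

def acf_table(y, max_lag):
    """One row of autocorrelation statistics per lag 1..max_lag."""
    T = len(y)
    gamma0 = autocovariance(y, 0)
    rho = [autocovariance(y, h) / gamma0 for h in range(max_lag + 1)]
    rows = []
    for h in range(1, max_lag + 1):
        # Standard error uses the autocorrelations at lags 1 .. h-1.
        std = math.sqrt((1 + 2 * sum(r * r for r in rho[1:h])) / T)
        norm = rho[h] / std
        # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
        prob = math.erfc(abs(norm) / math.sqrt(2))
        logprob = -math.log10(prob) if prob > 0 else math.inf
        flag = 1 if rho[h] > 2 * std else -1 if rho[h] < -2 * std else 0
        rows.append({"LAG": h, "ACF": rho[h], "ACFSTD": std, "ACFNORM": norm,
                     "ACFPROB": prob, "ACFLPROB": logprob, "ACF2STD": flag})
    return rows

rows = acf_table([1, 2, 3, 4, 5], 2)
for row in rows:
    print(row)
```

With only five observations the standard errors are large, so neither lag is flagged as lying beyond two standard errors.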
Cross-Correlation Analysis
Cross-correlation analysis can be performed on the working series by specifying the OUTCROSSCORR= option or one of the CROSSPLOTS= options that are associated with cross-correlation. The CROSSCORR statement enables you to specify options that are related to cross-correlation analysis.
Cross-Correlation Statistics
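LAGS        $h \in \{0, \ldots, H\}$
N           $N_h$ is the number of observed products at lag $h$, ignoring missing values
CCOV        $\hat{\gamma}_{x,y}(h) = \frac{1}{T} \sum_{t=h+1}^{T} (x_t - \bar{x})(y_{t-h} - \bar{y})$
CCOV        $\hat{\gamma}_{x,y}(h) = \frac{1}{N_h} \sum_{t=h+1}^{T} (x_t - \bar{x})(y_{t-h} - \bar{y})$ when embedded missing values are present
CCF         $\hat{\rho}_{x,y}(h) = \hat{\gamma}_{x,y}(h) / \sqrt{\hat{\gamma}_x(0)\,\hat{\gamma}_y(0)}$
CCFSTD      $Std(\hat{\rho}_{x,y}(h)) = 1 / \sqrt{N_0}$
CCFNORM     $Norm(\hat{\rho}_{x,y}(h)) = \hat{\rho}_{x,y}(h) / Std(\hat{\rho}_{x,y}(h))$
CCFPROB     $Prob(\hat{\rho}_{x,y}(h)) = 2\left(1 - \Phi(|Norm(\hat{\rho}_{x,y}(h))|)\right)$
CCFLPROB    $LogProb(\hat{\rho}_{x,y}(h)) = -\log_{10}(Prob(\hat{\rho}_{x,y}(h)))$
CCF2STD     $Flag(\hat{\rho}_{x,y}(h)) = \begin{cases} 1 & \hat{\rho}_{x,y}(h) > 2\,Std(\hat{\rho}_{x,y}(h)) \\ 0 & -2\,Std(\hat{\rho}_{x,y}(h)) < \hat{\rho}_{x,y}(h) < 2\,Std(\hat{\rho}_{x,y}(h)) \\ -1 & \hat{\rho}_{x,y}(h) < -2\,Std(\hat{\rho}_{x,y}(h)) \end{cases}$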
ACFSTD       autocorrelation standard errors
ACF2STD      two standard errors beyond autocorrelation
ACFNORM      normalized autocorrelations
ACFPROB      autocorrelation probabilities
ACFLPROB     autocorrelation log probabilities
PACF         partial autocorrelations
PACFSTD      partial autocorrelation standard errors
PACF2STD     two standard errors beyond partial autocorrelation
PACFNORM     partial normalized autocorrelations
PACFPROB     partial autocorrelation probabilities
PACFLPROB    partial autocorrelation log probabilities
IACF         inverse autocorrelations
IACFSTD      inverse autocorrelation standard errors
IACF2STD     two standard errors beyond inverse autocorrelation
IACFNORM     normalized inverse autocorrelations
IACFPROB     inverse autocorrelation probabilities
IACFLPROB    inverse autocorrelation log probabilities
WN           white noise test statistics
WNPROB       white noise test probabilities
WNLPROB      white noise test log probabilities
The preceding correlation statistics are computed for each specified time lag. When the CORR statement TRANSPOSE=YES option is specified, the variable values are related to correlation statistics specified in the CORR statement and the variable names are related to the NLAG= or LAGS= options.

_NAME_       variable name
_STAT_       correlation statistic name
_LABEL_      correlation statistic label
LAGh         correlation statistics for lag h

When the CROSSCORR statement TRANSPOSE=NO option is omitted or specified explicitly, the variable names are related to cross-correlation statistics specified in the CROSSCORR statement options and the variable values are related to the NLAG= or LAGS= option.

_NAME_       variable name
_CROSS_      cross variable name
LAG          time lag
N            number of variance products
CCOV         cross-covariances
CCF          cross-correlations
CCFSTD       cross-correlation standard errors
CCF2STD      two standard errors beyond cross-correlation
CCFNORM      normalized cross-correlations
CCFPROB      cross-correlation probabilities
CCFLPROB     cross-correlation log probabilities

The preceding cross-correlation statistics are computed for each specified time lag. When the CROSSCORR statement TRANSPOSE=YES option is specified, the variable values are related to cross-correlation statistics specified in the CROSSCORR statement and the variable names are related to the NLAG= or LAGS= options.

_NAME_       variable name
_CROSS_      cross variable name
_STAT_       cross-correlation statistic name
_LABEL_      cross-correlation statistic label
LAGh         cross-correlation statistics for lag h
time ID values
seasonal index
original series values
trend-cycle component
seasonal-irregular component
seasonal component
seasonal component standard errors
trend-cycle-seasonal component
irregular component
seasonally adjusted series
percent change seasonally adjusted series
trend component
cycle component
The preceding decomposition components are computed for each time period. When the DECOMP statement TRANSPOSE=YES option is specified, the variable values are related to decomposition/adjustments specified in the DECOMP statement and the variable names are related to the ID statement INTERVAL= option, the PROC TIMESERIES statement SEASONALITY= option, and the DECOMP statement NPERIODS= option.

_NAME_       variable name
_MODE_       mode of decomposition name
_COMP_       decomposition component name
_LABEL_      decomposition component label
PERIODt      decomposition component value for time period t
_SEASON_     seasonal index
NOBS         number of observations
N            number of nonmissing observations
NMISS        number of missing observations
MINIMUM      minimum value
MAXIMUM      maximum value
RANGE        range value
SUM          summation value
MEAN         mean value
STDDEV       standard deviation
CSS          corrected sum of squares
USS          uncorrected sum of squares
MEDIAN       median value
The preceding statistics are computed for each season. When the SEASON statement TRANSPOSE=YES option is specified, the variable values are related to seasonal statistics specified in the SEASON statement and the variable names are related to the ID statement INTERVAL= option or the PROC TIMESERIES statement SEASONALITY= option.

_NAME_       variable name
_STAT_       season statistic name
_LABEL_      season statistic label
SEASONs      season statistic value for season s
number of missing observations minimum value maximum value average value standard deviation
The OUTSUM= data set contains the descriptive statistics of the (accumulated) time series.
The preceding statistics are computed for each time period. When the TREND statement TRANSPOSE=YES option is specified, the variable values are related to trend statistics specified in the TREND statement and the variable names are related to the ID statement INTERVAL= option, the PROC TIMESERIES statement SEASONALITY= option, and the TREND statement NPERIODS= option.
variable name
trend statistic name
trend statistic label
trend statistic value for time period t
Printed Output
The TIMESERIES procedure optionally produces printed output by using the Output Delivery System (ODS). By default, the procedure produces no printed output. All output is controlled by the PRINT= and PRINTDETAILS options associated with the PROC TIMESERIES statement. In general, if an analysis step related to printed output fails, the values of this step are not printed and appropriate error and/or warning messages are recorded in the log. The printed output is similar to the output data set, and these similarities are described below. The printed output produced by different printing option values is described as follows:

PRINT=DECOMP       prints the seasonal decomposition similar to the OUTDECOMP= data set.
PRINT=DESCSTATS    prints a table of descriptive statistics for each variable.
PRINT=SEASONS      prints the seasonal statistics similar to the OUTSEASON= data set.
PRINT=SUMMARY      prints the summary statistics similar to the OUTSUM= data set.
PRINT=TRENDS       prints the trend statistics similar to the OUTTREND= data set.
PRINTDETAILS       prints each table with greater detail.

If PRINT=SEASONS and the PRINTDETAILS options are both specified, all seasonal statistics are printed.
ODS Table Name           Description
SeasonalDecomposition    seasonal decomposition
DescStats                descriptive statistics
GlobalStatistics         global statistics
SeasonStatistics         season statistics
StatisticsSummary        statistics summary
TrendStatistics          trend statistics
GlobalStatistics         global statistics
ODS Graph Name          Plot Description
ACFPlot                 autocorrelation function
ACFNORMPlot             normalized autocorrelation function
CCFNORMPlot             normalized cross-correlation function
CCFPlot                 cross-correlation function
CorrelationPlots        correlation graphics panel
CrossSeriesPlot         cross series plot
CycleComponentPlot      cycle component
DecompositionPlots      decomposition graphics panel

Table 27.4 (continued)

ODS Graph Name                    Plot Description                               Statement   Option
…                                 inverse autocorrelation function               PLOTS       IACF
…                                 normalized inverse autocorrelation function    PLOTS       IACF
IrregularComponentPlot            irregular component                            PLOTS       IC
PACFPlot                          partial autocorrelation function               PLOTS       PACF
PACFNORMPlot                      standardized partial autocorrelation function  PLOTS       PACF
PercentChangeAdjustedplot         percent-change seasonally adjusted             PLOTS       PCSA
ResidualPlot                      residual time series plot                      PLOTS       RESIDUAL
SeasonallyAdjustedPlot            seasonally adjusted                            PLOTS       SA
SeasonalComponentPlot             seasonal component                             PLOTS       SC
SeasonalCyclePlot                 seasonal cycles plot                           PLOTS       CYCLES
SeasonalIrregularComponentPlot    seasonal-irregular component                   PLOTS       SIC
SeriesPlot                        time series plot                               PLOTS       SERIES
TrendComponentPlot                trend component                                PLOTS       TC
TrendCycleComponentPlot           trend-cycle component                          PLOTS       TCC
TrendCycleSeasonalPlot            trend-cycle-seasonal component                 PLOTS       TCS
WhiteNoiseLogProbabilityPlot      white noise log probability                    PLOTS       WN
WhiteNoiseProbabilityPlot         white noise probability                        PLOTS       WN
The monthly time series data are stored in the data set WORK.MSERIES. Each BY group associated with the BY variable STORE contains an observation for each of the 36 months associated with the years 1998, 1999, and 2000. Each observation contains the variables STORE and TIMESTAMP and each of the analysis variables in the input data set.

After each set of transactions has been accumulated to form corresponding time series, the accumulated time series can be analyzed by using various time series analysis techniques. For example, exponentially weighted moving averages can be used to smooth each series. The following statements use the EXPAND procedure to smooth the analysis variable named STOREITEM.

   proc expand data=mseries out=smoothed from=month;
      by store;
      id date;
      convert storeitem=smooth / transform=(ewma 0.1);
   run;
The smoothed series are stored in the data set WORK.SMOOTHED. The variable SMOOTH contains the smoothed series. If the time ID variable TIMESTAMP contains SAS datetime values instead of SAS date values, the INTERVAL=, START=, and END= options must be changed accordingly and the following statements could be used:
   proc timeseries data=retail out=tseries;
      by store;
      id timestamp interval=dtmonth
                   accumulate=median
                   setmiss=0
                   start='01jan1998:00:00:00'dt
                   end='31dec2000:00:00:00'dt;
      var _numeric_;
   run;

The monthly time series data are stored in the data set WORK.TSERIES, and the time ID values use a SAS datetime representation.
The time series is stored in the data set WORK.SERIES, the trend statistics are stored in the data set WORK.TREND, and the seasonal statistics are stored in the data set WORK.SEASON. Additionally, the seasonal statistics are printed (PRINT=SEASONS) and the results of the seasonal analysis are shown in Output 27.2.1.
Output 27.2.1 Seasonal Statistics Table
The TIMESERIES Procedure

Season Statistics for Variable AIR

Season Index     N    Standard Deviation
           1    36              95.65189
           2    36             117.61839
           3    36             143.97935
           4    36             101.34732
Using the trend statistics stored in the WORK.TREND data set, the following statements plot various trend statistics associated with each time period over time.
   title1 "Trend Statistics";
   proc sgplot data=trend;
      series x=date y=max  / lineattrs=(pattern=solid);
      series x=date y=mean / lineattrs=(pattern=solid);
      series x=date y=min  / lineattrs=(pattern=solid);
      yaxis display=(nolabel);
      format date year4.;
   run;
Using the trend statistics stored in the WORK.TREND data set, the following statements chart the sum of the transactions associated with each time period for the second season over time.
   title1 "Trend Statistics for 2nd Season";
   proc sgplot data=trend;
      where _season_ = 2;
      vbar date / freq=sum;
      format date year4.;
      yaxis label='Sum';
   run;
Using the trend statistics stored in the WORK.TREND data set, the following statements plot the mean of the transactions associated with each time period by each year over time.
   data trend;
      set trend;
      year = year(date);
   run;

   title1 "Trend Statistics by Year";
   proc sgplot data=trend;
      series x=_season_ y=mean / group=year lineattrs=(pattern=solid);
      xaxis values=(1 to 4 by 1);
   run;
Using the season statistics stored in the WORK.SEASON data set, the following statements plot various season statistics for each season.
   title1 "Seasonal Statistics";
   proc sgplot data=season;
      series x=_season_ y=max  / lineattrs=(pattern=solid);
      series x=_season_ y=mean / lineattrs=(pattern=solid);
      series x=_season_ y=min  / lineattrs=(pattern=solid);
      yaxis display=(nolabel);
      xaxis values=(1 to 4 by 1);
   run;
References
Greene, W.H. (1999), Econometric Analysis, Fourth Edition, New York: Macmillan.

Hodrick, R. and Prescott, E. (1980), "Post-War U.S. Business Cycles: An Empirical Investigation," Discussion Paper 451, Carnegie Mellon University.

Makridakis, S. and Wheelwright, S.C. (1978), Interactive Forecasting: Univariate and Multivariate Methods, Second Edition, San Francisco: Holden-Day, 198-201.

Pyle, D. (1999), Data Preparation for Data Mining, San Francisco: Morgan Kaufmann Publishers, Inc.

Stoffer, D.S. and Toloi, C.M.C. (1992), "A Note on the Ljung-Box-Pierce Portmanteau Statistic with Missing Data," Statistics and Probability Letters 13, 391-396.

Wheelwright, S.C. and Makridakis, S. (1973), Forecasting Methods for Management, Third Edition, New York: Wiley-Interscience, 123-133.
Chapter 28
The next step is to invoke the TSCSREG procedure and specify the cross section and time series variables in an ID statement. List the variables in the ID statement exactly as they are listed in the BY statement.
   proc tscsreg data=a;
      id state date;
Alternatively, you can omit the ID statement and use the CS= and TS= options on the PROC TSCSREG statement to specify the number of cross sections in the data set and the number of time series observations in each cross section.
Unbalanced Data
In the case of fixed-effects and random-effects models, the TSCSREG procedure is capable of processing data with different numbers of time series observations across different cross sections. You must specify the ID statement to estimate models that use unbalanced data. The missing time series observations are recognized by the absence of time series ID variable values in some of the cross sections in the input data set. Moreover, if an observation with a particular time series ID value and cross-sectional ID value is present in the input data set, but one or more of the model variables are missing, that time series point is treated as missing for that cross section.

The MODEL statement in PROC TSCSREG is specified like the MODEL statement in other SAS regression procedures: the dependent variable is listed first, followed by an equal sign, followed by the list of regressor variables.

The reason for using PROC TSCSREG instead of other SAS regression procedures is that you can incorporate a model for the structure of the random errors. It is important to consider what kind of error structure model is appropriate for your data and to specify the corresponding option in the MODEL statement. The error structure options supported by the TSCSREG procedure are FIXONE, FIXTWO, RANONE, RANTWO, FULLER, PARKS, and DASILVA. See Details: The TSCSREG Procedure on page 1738 for more information about these methods and the error structures they assume.

By default, the two-way random-effects error model structure is used, and the Fuller-Battese and Wansbeek-Kapteyn methods are used for the estimation of variance components in balanced data and unbalanced data, respectively. Thus, the preceding example is the same as specifying the RANTWO option, as shown in the following statements:

   proc tscsreg data=a;
      id state date;
      model y = x1 x2 / rantwo;
   run;
You can specify more than one error structure option in the MODEL statement; the analysis is repeated using each method specified. You can use any number of MODEL statements to estimate different regression models or estimate the same model by using different options.

In order to aid in model specification within this class of models, the procedure provides two specification test statistics. The first is an F statistic that tests the null hypothesis that the fixed-effects parameters are all zero. The second is a Hausman m-statistic that provides information about the appropriateness of the random-effects specification. It is based on the idea that, under the null hypothesis of no correlation between the effects variables and the regressors, OLS and GLS are consistent, but OLS is inefficient. Hence, a test can be based on the result that the covariance of an efficient estimator with its difference from an inefficient estimator is zero. Rejection of the null hypothesis might suggest that the fixed-effects model is more appropriate.

The procedure also provides the Buse R-square measure, which is the most appropriate goodness-of-fit measure for models estimated by using GLS. This number is interpreted as a measure of the proportion of the transformed sum of squares of the dependent variable that is attributable to the influence of the independent variables. In the case of OLS estimation, the Buse R-square measure is equivalent to the usual R-square measure.
Estimation Techniques
If the effects are fixed, the models are essentially regression models with dummy variables that correspond to the specified effects. For fixed-effects models, ordinary least squares (OLS) estimation is equivalent to best linear unbiased estimation.

The output from TSCSREG is identical to what one would obtain from creating dummy variables to represent the cross-sectional and time (fixed) effects. The output is presented in this manner to facilitate comparisons to the least squares dummy variables estimator (LSDV). As such, the inclusion of an intercept term implies that one dummy variable must be dropped.

The actual estimation of the fixed-effects models is not LSDV. LSDV is much too cumbersome to implement. Instead, TSCSREG operates in a two-step fashion. In the first step, the following occurs:

One-way fixed-effects model: In the one-way fixed-effects model, the data are transformed by removing the cross-sectional means from the dependent and independent variables. The following is true:

\[ \tilde{y}_{it} = y_{it} - \bar{y}_{i\cdot} \]
\[ \tilde{x}_{it} = x_{it} - \bar{x}_{i\cdot} \]
Two-way fixed-effects model: In the two-way fixed-effects model, the data are transformed by removing the cross-sectional and time means and adding back the overall means:

\[ \tilde{y}_{it} = y_{it} - \bar{y}_{i\cdot} - \bar{y}_{\cdot t} + \bar{\bar{y}} \]
\[ \tilde{x}_{it} = x_{it} - \bar{x}_{i\cdot} - \bar{x}_{\cdot t} + \bar{\bar{x}} \]

where the symbols are as follows:

$y_{it}$ and $x_{it}$ are the dependent variable (a scalar) and the explanatory variables (a vector whose columns are the explanatory variables not including a constant), respectively

$\bar{y}_{i\cdot}$ and $\bar{x}_{i\cdot}$ are cross section means

$\bar{y}_{\cdot t}$ and $\bar{x}_{\cdot t}$ are time means

$\bar{\bar{y}}$ and $\bar{\bar{x}}$ are the overall means

The second step consists of running OLS on the properly demeaned series, provided that the data are balanced. The unbalanced case is slightly more difficult, because the structure of the missing data must be retained. For this case, PROC TSCSREG uses a slight specialization on Wansbeek and Kapteyn.
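The first-step transformation for the balanced two-way case (subtract the cross section mean and the time mean, then add back the overall mean) can be sketched in a few lines of Python. This is an illustration of the demeaning formulas for balanced data only, not PROC TSCSREG's code; the function name `two_way_within` and the dictionary-based panel layout are assumptions made for this example.

```python
def two_way_within(values):
    """Two-way within transformation for a balanced panel.

    values maps (cross_section_id, time_id) -> observed value.
    Returns a dict with the same keys holding the demeaned values.
    """
    ids = sorted({i for i, _ in values})
    times = sorted({t for _, t in values})
    if len(values) != len(ids) * len(times):
        raise ValueError("this sketch handles balanced panels only")
    cs_mean = {i: sum(values[i, t] for t in times) / len(times) for i in ids}
    ts_mean = {t: sum(values[i, t] for i in ids) / len(ids) for t in times}
    overall = sum(values.values()) / len(values)
    return {
        (i, t): v - cs_mean[i] - ts_mean[t] + overall
        for (i, t), v in values.items()
    }

# y_it = (cross section effect) + (time effect): the transformation removes both.
panel = {
    ("A", 1): 1.0, ("A", 2): 3.0, ("A", 3): 5.0,
    ("B", 1): 5.0, ("B", 2): 7.0, ("B", 3): 9.0,
}
demeaned = two_way_within(panel)
print(demeaned)
```

Because the example values are purely additive in a cross section effect and a time effect, every demeaned value is zero, which is why OLS on the transformed data is unaffected by such effects.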
The other alternative is to assume that the effects are random. In the one-way case, $E(\nu_i) = 0$, $E(\nu_i^2) = \sigma_\nu^2$, and $E(\nu_i \nu_j) = 0$ for $i \neq j$, and $\nu_i$ is uncorrelated with $\epsilon_{it}$ for all $i$ and $t$. In the two-way case, in addition to all of the preceding, $E(e_t) = 0$, $E(e_t^2) = \sigma_e^2$, and $E(e_t e_s) = 0$ for $t \neq s$, and the $e_t$ are uncorrelated with the $\nu_i$ and the $\epsilon_{it}$ for all $i$ and $t$. Thus, the model is a variance components model, with the variance components $\sigma_\nu^2$, $\sigma_e^2$, and $\sigma_\epsilon^2$ to be estimated. A crucial implication of such a specification is that the effects are independent of the regressors. For random-effects models, the estimation method is an estimated generalized least squares (EGLS) procedure that involves estimating the variance components in the first stage and using the estimated variance covariance matrix thus obtained to apply generalized least squares (GLS) to the data.
Introductory Example
The following example uses the cost function data from Greene (1990) to estimate the variance components model. The variable OUTPUT is the log of output in millions of kilowatt-hours, and COST is the log of cost in millions of dollars. Refer to Greene (1990) for details.
title1; data greene; input firm df1 = firm df2 = firm df3 = firm df4 = firm df5 = firm d60 = year d65 = year d70 = year datalines;
Usually you cannot explicitly specify all the explanatory variables that affect the dependent variable. The omitted or unobservable variables are summarized in the error disturbances. The TSCSREG procedure used with the RANTWO option specifies the two-way random-effects error model. Because the data are balanced, the variance components are estimated by the Fuller-Battese method, and the parameters are efficiently estimated by using the GLS method. The variance components model used by the Fuller-Battese method is

\[ y_{it} = \sum_{k=1}^{K} X_{itk}\,\beta_k + v_i + e_t + \epsilon_{it} \qquad i = 1, \ldots, N;\; t = 1, \ldots, T \]
   proc tscsreg data=greene;
      model cost = output / rantwo;
      id firm year;
   run;

The TSCSREG procedure output is shown in Figure 28.1. A model description is printed first; it reports the estimation method used and the number of cross sections and time periods. The variance components estimates are printed next. Finally, the table of regression parameter estimates shows the estimates, standard errors, and t tests.
Figure 28.1 The Variance Components Estimates
The TSCSREG Procedure
Fuller and Battese Variance Components (RanTwo)

Dependent Variable: cost

Model Description
Estimation Method              RanTwo
Number of Cross Sections            6
Time Series Length                  4

Fit Statistics
SSE         0.3481    DFE             22
MSE         0.0158    Root MSE    0.1258
R-Square    0.8136

Variance Component Estimates
Variance Component for Cross Sections    0.046907
Variance Component for Time Series        0.00906
Variance Component for Error             0.008749

Hausman Test for Random Effects
DF    m Value    Pr > m
 1      26.46    <.0001
DF 1 1
Functional Summary
The statements and options used with the TSCSREG procedure are summarized in the following table.
Table 28.1 Functional Summary
Description                                                          Statement   Option

Data Set Options
specify the input data set
write parameter estimates to an output data set
include correlations in the OUTEST= data set
include covariances in the OUTEST= data set
specify number of time series observations
specify number of cross sections

Declaring the Role of Variables
specify BY-group processing                                          BY
specify the cross section and time ID variables                      ID

Printing Control Options
print correlations of the estimates
print covariances of the estimates
suppress printed output
perform tests of linear hypotheses

Model Estimation Options
specify the one-way fixed-effects model
specify the two-way fixed-effects model
specify the one-way random-effects model
specify the two-way random-effects model
specify Fuller-Battese method
specify PARKS
specify Da Silva method
specify order of the moving-average error process for Da Silva method
print Φ matrix for Parks method
print autocorrelation coefficients for Parks method
suppress the intercept term
control check for singularity
names the input data set. The input data set must be sorted by cross section and by time period within cross section. If you omit the DATA= option, the most recently created SAS data set is used.
TS=number
specifies the number of observations in the time series for each cross section. The TS= option value must be greater than 1. The TS= option is required unless an ID statement is used. Note that the number of observations for each time series must be the same for each cross section and must cover the same time period.
CS=number
specifies the number of cross sections. The CS= option value must be greater than 1. The CS= option is required unless an ID statement is used.
OUTEST=SAS-data-set
names an output data set to contain the parameter estimates. When the OUTEST= option is not specified, the OUTEST= data set is not created.
OUTCOV COVOUT
writes the covariance matrix of the parameter estimates to the OUTEST= data set.
OUTCORR CORROUT
writes the correlation matrix of the parameter estimates to the OUTEST= data set.

In addition, any of the following MODEL statement options can be specified in the PROC TSCSREG statement: CORRB, COVB, FIXONE, FIXTWO, RANONE, RANTWO, FULLER, PARKS, DASILVA, NOINT, NOPRINT, M=, PHI, RHO, and SINGULAR=. When specified in the PROC TSCSREG statement, these options are equivalent to specifying the options for every MODEL statement.
BY Statement
BY variables ;
A BY statement can be used with PROC TSCSREG to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the input data set must be sorted by the BY variables as well as by cross section and time period within the BY groups. When both an ID statement and a BY statement are specified, the input data set must be sorted first with respect to BY variables and then with respect to the cross section and time series ID variables. For example,

   proc sort data=a;
      by byvar1 byvar2 csid tsid;
   run;

   proc tscsreg data=a;
      by byvar1 byvar2;
      id csid tsid;
      ...
   run;

When both a BY statement and an ID statement are used, the data set might have a different number of cross sections or a different number of time periods in each BY group. If no ID statement is used, the CS=N and TS=T options must be specified and each BY group must contain N × T observations.
ID Statement
ID cross-section-id-variable time-series-id-variable ;
The ID statement is used to specify variables in the input data set that identify the cross section and time period for each observation. When an ID statement is used, the TSCSREG procedure verifies that the input data set is sorted by the cross section ID variable and by the time series ID variable within each cross section. The TSCSREG procedure also verifies that the time series ID values are the same for all cross sections. To make sure the input data set is correctly sorted, use PROC SORT with a BY statement with the variables listed exactly as they are listed in the ID statement to sort the input data set. For example,

   proc sort data=a;
      by csid tsid;
   run;

   proc tscsreg data=a;
      id csid tsid;
      ... etc. ...
   run;

If the ID statement is not used, the TS= and CS= options must be specified on the PROC TSCSREG statement. Note that the input data must be sorted by time within cross section, regardless of whether the cross section structure is given by an ID statement or by the options TS= and CS=. If an ID statement is specified, the time series length T is set to the minimum number of observations for any cross section, and only the first T observations in each cross section are used. If both the ID statement and the TS= and CS= options are specified, the TS= and CS= options are ignored.
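The sorting and series-length rules for the ID statement can be illustrated outside SAS. The following Python sketch is only an illustration under stated assumptions: the function name `panel_layout` and the sample rows are hypothetical, and the check covers only the sort order and the minimum-length rule, not the additional requirement that the time ID values match across cross sections.

```python
from collections import Counter


def panel_layout(rows):
    """Return (number of cross sections, T) for (cross_section_id, time_id) rows.

    T is the minimum number of observations in any cross section, which is
    the series length the text says is used when an ID statement is given.
    Raises ValueError if the rows are not sorted by cross section and by
    time within cross section.
    """
    if not rows:
        raise ValueError("no observations")
    if any(a > b for a, b in zip(rows, rows[1:])):
        raise ValueError("rows must be sorted by cross section, then by time")
    counts = Counter(cs for cs, _ in rows)
    return len(counts), min(counts.values())


# Hypothetical data: cross section A has 3 periods, B has only 2.
rows = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2)]
print(panel_layout(rows))
```

In this hypothetical data set only the first two observations of cross section A would be used, because the shortest cross section has two observations.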
MODEL Statement
MODEL response = regressors / options ;
The MODEL statement specifies the regression model and the error structure assumed for the regression residuals. The response variable on the left side of the equal sign is regressed on the independent variables listed after the equal sign. Any number of MODEL statements can be used. For each MODEL statement, only one response variable can be specified on the left side of the equal sign.

The error structure is specified by the FIXONE, FIXTWO, RANONE, RANTWO, FULLER, PARKS, and DASILVA options. More than one of these options can be used, in which case the analysis is repeated for each error structure model specified.

Models can be given labels up to 32 characters in length. Model labels are used in the printed output to identify the results for different models. If no label is specified, the response variable name is used as the label for the model. The model label is specified as follows:

   label: MODEL response = regressors / options ;

The following options can be specified on the MODEL statement after a slash (/).
CORRB
CORR
prints the correlations of the parameter estimates.

FIXONE
specifies that a one-way fixed-effects model be estimated, with the one-way model corresponding to group effects only.

FIXTWO
specifies that a two-way fixed-effects model be estimated.

FULLER
specifies that the model be estimated by using the Fuller-Battese method, which assumes a variance components model for the error structure.

PARKS
specifies that the model be estimated by using the Parks method, which assumes a first-order autoregressive model for the error structure.

DASILVA
specifies that the model be estimated by using the Da Silva method, which assumes a mixed variance-component moving-average model for the error structure.

M=number
specifies the order of the moving-average process in the Da Silva method. The M= value must be less than T - 1. The default is M=1.

PHI
prints the $\Phi$ matrix of estimated covariances of the observations for the Parks method. The PHI option is relevant only when the PARKS option is used.

RHO
prints the estimated first-order autocorrelation coefficients for the Parks method.

SINGULAR=number
specifies a singularity criterion for the inversion of the matrix. The default depends on the precision of the computer system.
TEST Statement
TEST equation < , equation . . . > < / options > ;
The TEST statement performs F tests of linear hypotheses about the regression parameters in the preceding MODEL statement. Each equation specifies a linear hypothesis to be tested. All hypotheses in one TEST statement are tested jointly. Variable names in the equations must correspond to regressors in the preceding MODEL statement, and each name represents the coefficient of the corresponding regressor. The keyword INTERCEPT refers to the coefficient of the intercept. The following statements illustrate the use of the TEST statement:

   proc tscsreg;
      model y = x1 x2 x3;
      test x1 = 0, x2 * .5 + 2 * x3 = 0;
      test_int: test intercept = 0, x3 = 0;
   run;
Table 28.2 continued

ODS Table Name       Description                                       Option

ODS Tables Created by the MODEL Statement
ModelDescription     Model description
FitStatistics        Fit statistics
FixedEffectsTest     F test for no fixed effects
ParameterEstimates
CovB
                                                                       CORRB
                                                                       FULLER, DASILVA, M=, RANONE, RANTWO
RandomEffectsTest                                                      FULLER, DASILVA, M=, RANONE, RANTWO
                     First order autoregressive parameter estimates   PARKS, RHO
                     Estimated phi matrix                              PARKS
                     Estimates of autocovariances                      DASILVA, M=
Chapter 29
ODS Graph Names
OUTFOR= Data Set
OUTEST= Data Set
Statistics of Fit
Examples: UCM Procedure
   Example 29.1: The Airline Series Revisited
   Example 29.2: Variable Star Data
   Example 29.3: Modeling Long Seasonal Patterns
   Example 29.4: Modeling Time-Varying Regression Effects Using the RANDOMREG Statement
   Example 29.5: Trend Removal Using the Hodrick-Prescott Filter
   Example 29.6: Using Splines to Incorporate Nonlinear Effects
   Example 29.7: Detection of Level Shift
   Example 29.8: ARIMA Modeling (Experimental)
References
analysis using the UCMs. Once a suitable UCM is found for the series under consideration, it can be used for a variety of purposes. For example, it can be used for the following:

- forecasting the values of the response series and the component series in the model
- obtaining a model-based seasonal decomposition of the series
- obtaining a denoised version and interpolating the missing values of the response series in the historical period
- obtaining the full sample or smoothed estimates of the component series in the model
The following statements produce a time series plot of the series by using the TIMESERIES procedure (see Chapter 27, The TIMESERIES Procedure). The trend and seasonal features of the series are apparent in the plot in Figure 29.1.
   proc timeseries data=seriesG plot=series;
      id date interval=month;
      var logair;
   run;
In this example this series is modeled using an unobserved component model called the basic structural model (BSM). The BSM models a time series as a sum of three stochastic components: a trend component $\mu_t$, a seasonal component $\gamma_t$, and random error $\epsilon_t$. Formally, a BSM for a response series $y_t$ can be described as

$$y_t = \mu_t + \gamma_t + \epsilon_t$$
Each of the stochastic components in the model is modeled separately. The random error $\epsilon_t$, also called the irregular component, is modeled simply as a sequence of independent, identically distributed (i.i.d.) zero-mean Gaussian random variables. The trend and the seasonal components can be modeled in a few different ways. The model for trend used here is called a locally linear time trend. This trend model can be written as follows:

$$\mu_t = \mu_{t-1} + \beta_{t-1} + \eta_t, \qquad \eta_t \sim \text{i.i.d. } N(0, \sigma_\eta^2)$$
$$\beta_t = \beta_{t-1} + \xi_t, \qquad \xi_t \sim \text{i.i.d. } N(0, \sigma_\xi^2)$$
These equations specify a trend where the level $\mu_t$ as well as the slope $\beta_t$ is allowed to vary over time. This variation in slope and level is governed by the variances of the disturbance terms $\eta_t$ and $\xi_t$ in their respective equations. Some interesting special cases of this model arise when you manipulate these disturbance variances. For example, if the variance of $\xi_t$ is zero, the slope will be constant (equal to $\beta_0$); if the variance of $\eta_t$ is also zero, $\mu_t$ will be a deterministic trend given by the line $\mu_0 + \beta_0 t$.

The seasonal model used in this example is called a trigonometric seasonal. The stochastic equations governing a trigonometric seasonal are explained later (see the section "Modeling Seasons" on page 1783). However, it is interesting to note here that this seasonal model reduces to the familiar regression with deterministic seasonal dummies if the variance of the disturbance terms in its equations is equal to zero. The following statements specify a BSM with these three components:
   proc ucm data=seriesG;
      id date interval=month;
      model logair;
      irregular;
      level;
      slope;
      season length=12 type=trig print=smooth;
      estimate;
      forecast lead=24 print=decomp;
   run;
The PROC UCM statement signifies the start of the UCM procedure, and the input data set, seriesG, containing the dependent series is specified there. The optional ID statement is used to specify a date, datetime, or time identification variable, date in this example, to label the observations. The INTERVAL=MONTH option in the ID statement indicates that the measurements were collected on a monthly basis. The model specification begins with the MODEL statement, where the response series is specified (logair in this case). After this the components in the model are specified using separate statements that enable you to control their individual properties. The irregular component $\epsilon_t$ is specified using the IRREGULAR statement and the trend component $\mu_t$ is specified using the LEVEL and SLOPE statements. The seasonal component $\gamma_t$ is specified using the SEASON statement. The specifics of the seasonal characteristics, such as the season length and its stochastic evolution properties, are specified using the options in the SEASON statement. The seasonal component used in this example has a season length of 12, corresponding to the monthly seasonality, and is of the trigonometric type. Different types of seasonals are explained later (see the section "Modeling Seasons" on page 1783).

The parameters of this model are the variances of the disturbance terms in the evolution equations of $\mu_t$, $\beta_t$, and $\gamma_t$ and the variance of the irregular component $\epsilon_t$. These parameters are estimated by maximizing the likelihood of the data. The ESTIMATE statement options can be used to specify the span of data used in parameter estimation and to display and save the results of the estimation step and the model diagnostics. You can use the estimated model to obtain the forecasts of the series as well as the components. The options in the individual component statements can be used to display the component forecasts; for example, the PRINT=SMOOTH option in the SEASON statement requests the displaying of smoothed forecasts of the seasonal component $\gamma_t$. The series forecasts and forecasts of the sum of components can be requested using the FORECAST statement. The option PRINT=DECOMP in the FORECAST statement requests the printing of the smoothed trend $\mu_t$ and the trend plus seasonal component ($\mu_t + \gamma_t$). The parameter estimates for this model are displayed in Figure 29.2.

The estimates suggest that except for the slope component, the disturbance variances of all the components are significant; that is, all these components are stochastic. The slope component, however, appears to be deterministic because its error variance is quite insignificant. It might then be useful to check if the slope component can be dropped from the model; that is, if $\beta_0 = 0$. This can be checked by examining the significance analysis table of the components given in Figure 29.3.
Figure 29.3 Component Significance Analysis for the Logair Series

Significance Analysis of Components (Based on the Final State)

Component     DF    Chi-Square    Pr > ChiSq
Irregular      1          0.08        0.7747
Level          1        117867        <.0001
Slope          1         43.78        <.0001
Season        11        507.75        <.0001
This table provides the significance of the components in the model at the end of the estimation span. If a component is deterministic, this analysis is equivalent to checking whether the corresponding regression effect is significant. However, if a component is stochastic, then this analysis pertains only to the portion of the series near the end of the estimation span. In this example the slope appears quite significant and should be retained in the model, possibly as a deterministic component. Note that, on the basis of this table, the irregular component's contribution appears insignificant toward the end of the estimation span; however, since it is a stochastic component, it cannot be dropped from the model on the basis of this analysis alone. The slope component can be made deterministic by holding the value of its error variance fixed at zero. This is done by modifying the SLOPE statement as follows:
slope variance=0 noest;
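The effect of fixing the slope disturbance variance at zero can be seen by simulating the locally linear trend recursion directly. The following Python sketch is not part of PROC UCM; the function name `simulate_trend` and all numeric values are hypothetical, and the recursion assumes the standard form in which the level is updated with the previous slope plus a level disturbance and the slope follows a random walk.

```python
import random


def simulate_trend(n, mu0, beta0, sd_eta, sd_xi, seed=0):
    """Simulate n values of a locally linear trend, starting at mu0.

    sd_eta is the standard deviation of the level disturbance and sd_xi
    the standard deviation of the slope disturbance. With sd_xi == 0 the
    slope stays at beta0; with both equal to zero the level is the
    straight line mu0 + beta0 * t.
    """
    rng = random.Random(seed)
    mu, beta = mu0, beta0
    levels = [mu]
    for _ in range(n - 1):
        mu = mu + beta + rng.gauss(0.0, sd_eta)   # level uses the previous slope
        beta = beta + rng.gauss(0.0, sd_xi)       # slope is a random walk
        levels.append(mu)
    return levels


# Both variances zero: a deterministic line with intercept 1.0 and slope 0.5.
print(simulate_trend(5, 1.0, 0.5, 0.0, 0.0))
# Slope variance zero only: a random walk with constant drift 0.5.
print(simulate_trend(5, 1.0, 0.5, 0.1, 0.0))
```

The second call corresponds in spirit to a stochastic level with a deterministic slope, which is the structure requested by `slope variance=0 noest;`.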
After a tentative model is fit, its adequacy can be checked by examining different goodness-of-fit measures and other diagnostic tests and plots that are based on the model residuals. Once the model appears satisfactory, it can be used for forecasting. An interesting feature of the UCM procedure is that, apart from the series forecasts, you can request the forecasts of the individual components in the model. The plots of component forecasts can be useful in understanding their contributions to the series. In order to obtain the plots, you need to turn ODS Graphics on by using the ODS GRAPHICS ON; statement. The following statements illustrate some of these features:
   ods graphics on;

   proc ucm data=seriesG;
      id date interval=month;
      model logair;
      irregular;
      level plot=smooth;
      slope variance=0 noest;
      season length=12 type=trig plot=smooth;
      estimate;
      forecast lead=24 plot=decomp;
   run;
The table given in Figure 29.4 shows the goodness-of-fit statistics that are computed by using the one-step-ahead prediction errors (see the section "Statistics of Fit" on page 1819). These measures indicate a good agreement between the model and the data. Additional diagnostic measures are also printed by default but are not shown here.
Figure 29.4 Fit Statistics for the Logair Series

The UCM Procedure

Fit Statistics Based on Residuals

Mean Squared Error                  0.00147
Root Mean Squared Error             0.03830
Mean Absolute Percentage Error      0.54132
Maximum Percent Error               2.19097
R-Square                            0.99061
Adjusted R-Square                   0.99046
Random Walk R-Square                0.87288
Amemiya's Adjusted R-Square         0.99017

Number of non-missing residuals used for computing the fit statistics = 131
The first plot, shown in Figure 29.5, is produced by the PLOT=SMOOTH option in the LEVEL statement; it shows the smoothed level of the series.
The second plot (Figure 29.6), produced by the PLOT=SMOOTH option in the SEASON statement, shows the smoothed seasonal component by itself.
The plot of the sum of the trend and seasonal component, produced by the PLOT=DECOMP option in the FORECAST statement, is shown in Figure 29.7. You can see that, at least visually, the model seems to fit the data well. In all these decomposition plots the component estimates are extrapolated for two years in the future based on the LEAD=24 option specified in the FORECAST statement.
   PROC UCM < options > ;
      AUTOREG < options > ;
      BLOCKSEASON options ;
      BY variables ;
      CYCLE < options > ;
      DEPLAG options ;
      ESTIMATE < options > ;
      FORECAST < options > ;
      ID variable options ;
      IRREGULAR < options > ;
      LEVEL < options > ;
      MODEL dependent variable < = regressors > ;
      NLOPTIONS options ;
      OUTLIER options ;
      RANDOMREG regressors < / options > ;
      SEASON options ;
      SLOPE < options > ;
      SPLINEREG regressor < options > ;
      SPLINESEASON options ;
The PROC UCM and MODEL statements are required. In addition, the model must contain at least one component with nonzero disturbance variance.
Functional Summary
The statements and options controlling the UCM procedure are summarized in the following table. Most commonly needed scenarios are listed; see the individual statements for additional details. You can use the PRINT= and PLOT= options in the individual component statements for printing and plotting the corresponding component forecasts.
Table 29.1 Functional Summary

Description                                                              Statement

Data Set Options
specify the input data set
write parameter estimates to an output data set
write series and component forecasts to an output data set

Model Specification
specify the dependent variable and simple predictors
specify predictors with time-varying coefficients
specify a nonlinear predictor
specify the irregular component                                          IRREGULAR
specify the random walk trend                                            LEVEL
specify the locally linear trend                                         LEVEL and SLOPE
specify a cycle component                                                CYCLE
specify a dummy seasonal component                                       SEASON
specify a trigonometric seasonal component                               SEASON
drop some harmonics from a trigonometric seasonal component              SEASON
specify a list of harmonics to keep in a trigonometric
   seasonal component                                                    SEASON
specify a spline-season component                                        SPLINESEASON
specify a block-season component                                         BLOCKSEASON
specify an autoreg component                                             AUTOREG
specify the lags of the dependent variable                               DEPLAG

Controlling the Likelihood Optimization Process
request optimization of the profile likelihood                           ESTIMATE
request optimization of the usual likelihood                             ESTIMATE
specify the optimization technique                                       NLOPTIONS
limit the number of iterations                                           NLOPTIONS

Outlier Detection
turn on the search for additive outliers
turn on the search for level shifts
specify the significance level for outlier tests
limit the number of outliers
limit the number of outliers to a percentage of the series length

Controlling the Series Span
exclude some initial observations from analysis during the parameter estimation
exclude some observations at the end from analysis during the parameter estimation
exclude some initial observations from analysis during forecasting
exclude some observations at the end from analysis during forecasting

Graphical Residual Analysis
get a panel of plots consisting of residual autocorrelation plots and residual normality plots
get the residual CUSUM plot
get the residual cumulative sum of squares plot
get a plot of p-values for the portmanteau white noise test
get a time series plot of residuals with overlaid LOESS smoother

Series Decomposition and Forecasting
specify the number of periods to forecast in the future
specify the significance level of the forecast confidence interval
request printing of smoothed series decomposition
request printing of one-step-ahead and multistep-ahead forecasts
request plotting of smoothed series decomposition
request plotting of one-step-ahead and multistep-ahead forecasts

BY Groups
specify BY group processing                                              BY

Global Printing and Plotting Options
turn off all the printing for the procedure                              PROC UCM
turn on all the printing options for the procedure                       PROC UCM
turn off all the plotting for the procedure                              PROC UCM
turn on all the plotting options for the procedure                       PROC UCM
turn on a variety of plotting options for the procedure                  PROC UCM

ID
specify a variable that provides the time index for the series values   ID
The PROC UCM statement is required. The following options can be used in the PROC UCM statement:
DATA=SAS-data-set
specifies the name of the SAS data set containing the time series. If the DATA= option is not specified in the PROC UCM statement, the most recently created SAS data set is used.
NOPRINT
turns off all the printing for the procedure. The subsequent print options in the procedure are ignored.
PLOTS< (global-plot-options) > < = plot-request < (options) > >
PLOTS< (global-plot-options) > < = (plot-request < (options) > < ... plot-request < (options) > >) >

controls the plots produced with ODS Graphics. When you specify only one plot request, you can omit the parentheses around the plot request. Here are some examples:

   plots=none
   plots=all
   plots=residuals(acf loess)
   plots(noclm)=(smooth(decomp) residual(panel loess))

You must enable ODS Graphics before requesting plots, as shown in the following example. For general information about ODS Graphics, see Chapter 21, "Statistical Graphics Using ODS" (SAS/STAT User's Guide).

   ods graphics on;

   proc ucm;
      model y = x;
      irregular;
      level;
   run;

   proc ucm plots=all;
      model y = x;
      irregular;
      level;
   run;

The first PROC UCM step does not specify the PLOTS= option, so the default plot that displays the series forecasts in the forecast region is produced. The PLOTS=ALL option in the second PROC UCM step produces all the plots that are appropriate for the specified model. In addition to the PLOTS= option in the PROC UCM statement, you can request plots by using the PLOT= option in other statements of the UCM procedure. This way of requesting plots provides finer control over the plot production. If you have enabled ODS Graphics but do not specify any specific plot request, then PROC UCM produces the plot of series forecasts in the forecast horizon by default.
Global Plot Options: The global-plot-options apply to all relevant plots generated by the UCM procedure. The following global-plot-option is supported:
NOCLM
suppresses the confidence limits in all the component and forecast plots.

Specific Plot Options: The following list describes the specific plots and their options:
ALL
produces all the plots appropriate for the particular analysis.

FILTER ( < filter-plot-options > )
produces time series plots of the filtered component estimates. The following filter-plot-options are available:

ALL
produces all the filtered component estimate plots appropriate for the particular analysis.

LEVEL
produces a time series plot of the filtered level component estimate, provided the model contains the level component.

SLOPE
produces a time series plot of the filtered slope component estimate, provided the model contains the slope component.

CYCLE
produces time series plots of the filtered cycle component estimates for all cycle components in the model, if there are any.

SEASON
produces time series plots of the filtered season component estimates for all seasonal components in the model, if there are any.

DECOMP
produces time series plots of the filtered estimates of the series decomposition.
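RESIDUAL ( < residual-plot-options > )
produces the residual plots. The following residual-plot-options are available:

ALL
produces all the residual diagnostics plots appropriate for the particular analysis.

ACF
produces the autocorrelation plot of residuals.

LOESS
produces a scatter plot of residuals against time, which has an overlaid loess fit.

PACF
produces the partial-autocorrelation plot of residuals.

PANEL
produces a summary panel of the residual diagnostics consisting of the following:
- histogram of residuals
- normal quantile plot of residuals
- the residual-autocorrelation-plot
- the residual-partial-autocorrelation-plot

QQ
produces the normal quantile plot of residuals.

WN
produces the plot of Ljung-Box white-noise test p-values at different lags (in log scale).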
SMOOTH ( < smooth-plot-options >)
produces time series plots of the smoothed component estimates. The following smooth-plot-options are available:
ALL
produces all the smoothed component estimate plots appropriate for the particular analysis.
LEVEL
produces a time series plot of the smoothed level component estimate, provided the model contains the level component.
SLOPE
produces a time series plot of the smoothed slope component estimate, provided the model contains the slope component.
CYCLE
produces time series plots of the smoothed cycle component estimates for all cycle components in the model, if there are any.
SEASON
produces time series plots of the smoothed season component estimates for all season components in the model, if there are any.
DECOMP
produces time series plots of the smoothed estimates of the series decomposition.
PRINTALL
turns on all the printing options for the procedure. The subsequent NOPRINT options in the procedure are ignored.
AUTOREG Statement
AUTOREG < options > ;
The AUTOREG statement specifies an autoregressive component in the model. An autoregressive component is a special case of cycle that corresponds to the frequency of zero or $\pi$. It is modeled separately for easier interpretation. A stochastic equation for an autoregressive component $r_t$ can be written as follows:

$$r_t = \rho\, r_{t-1} + \nu_t, \qquad \nu_t \sim \text{i.i.d. } N(0, \sigma_\nu^2)$$

The damping factor $\rho$ can take any value in the interval [-1, 1), including -1 but excluding 1. If $\rho = 1$, the autoregressive component cannot be distinguished from the random walk level component. If $\rho = -1$, the autoregressive component corresponds to a seasonal component with a season length of 2, or a nonstationary cycle with period 2. If $|\rho| < 1$, then the autoregressive component is stationary.

The following example illustrates the AUTOREG statement. This statement includes an autoregressive component in the model. The damping factor $\rho$ and the disturbance variance $\sigma_\nu^2$ are estimated from the data.
autoreg;
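The role of the damping factor can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the function name `ar_path` is hypothetical, and the disturbances are set to zero so that the effect of the damping factor alone is visible.

```python
def ar_path(rho, r0, shocks):
    """Iterate r_t = rho * r_{t-1} + nu_t, starting from r0."""
    path = [r0]
    for nu in shocks:
        path.append(rho * path[-1] + nu)
    return path


# Damping factor -1 with no disturbances: the component alternates in sign,
# which is a cycle of period 2 (equivalently, a season of length 2).
print(ar_path(-1.0, 1.0, [0.0] * 4))   # [1.0, -1.0, 1.0, -1.0, 1.0]

# Damping factor 0.5: the component decays toward zero (stationary case).
print(ar_path(0.5, 1.0, [0.0] * 3))    # [1.0, 0.5, 0.25, 0.125]
```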
RHO=value
specifies an initial value for the damping factor $\rho$ during the parameter estimation process. The value of $\rho$ must be in the interval [-1, 1), including -1 but excluding 1.
VARIANCE=value
specifies an initial value for the disturbance variance $\sigma_\nu^2$ during the parameter estimation process. Any nonnegative value, including zero, is an acceptable starting value.
BLOCKSEASON Statement
BLOCKSEASON NBLOCKS = integer BLOCKSIZE = integer < options > ;
The BLOCKSEASON or BLOCKSEASONAL statement is used to specify a seasonal component $\gamma_t$ that has a special block structure. The seasonal $\gamma_t$ is called a block seasonal of block size $m$ and number of blocks $k$ if its season length, $s$, can be factored as $s = m \times k$ and its seasonal effects have a block form; that is, the first $m$ seasonal effects are all equal to some number $\tau_1$, the next $m$ effects are all equal to some number $\tau_2$, and so on. This type of seasonal structure can be appropriate in some cases; for example, consider a series that is recorded on an hourly basis. Further assume that, in this particular case, the hour-of-the-day effect and the day-of-the-week effect are additive. In this situation the hour-of-the-week seasonality, having a season length of 168, can be modeled as a sum of two components. The hour-of-the-day effect is modeled using a simple seasonal of season length 24, while the day-of-the-week effect is modeled as a block seasonal component that has the days of the week as blocks. This day-of-the-week block seasonal component has seven blocks, each of size 24.

A block seasonal specification requires, at the minimum, the block size $m$ and the number of blocks in the seasonal $k$. These are specified using the BLOCKSIZE= and NBLOCKS= options, respectively. In addition, you might need to specify the position of the first observation of the series by using the OFFSET= option if it is not at the beginning of one of the blocks. In the example just considered, this corresponds to a situation where the first series measurement is not at the start of the day. Suppose that the first measurement of the series corresponds to the hour between 6:00 and 7:00 a.m., which is the seventh hour within that day or at the seventh position within that block. This is specified as OFFSET=7.

The other options in this statement are very similar to the options in the SEASON statement; for example, a block seasonal can also be of one of the two types, DUMMY and TRIG. There can be more than one block seasonal component in the model, each specified using a separate BLOCKSEASON statement. No two block seasonals in the model can have the same NBLOCKS= and BLOCKSIZE= specifications. The following example illustrates the use of the BLOCKSEASON statement to specify the additive, hour-of-the-week seasonal model:
   season length=24 type=trig;
   blockseason nblocks=7 blocksize=24;
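The block structure and the meaning of OFFSET= can be made concrete by mapping an observation number to its block and its position within the block. The following Python sketch is an illustration only: the function name `block_position` is hypothetical, and it assumes that the first observation lies in the first block at the position given by the offset.

```python
def block_position(t, nblocks, blocksize, offset=1):
    """Map the 1-based observation number t to (block, position), both 1-based.

    offset is the 1-based position of the first observation within its
    block. Assumption for this sketch: the first observation belongs to
    block 1, and the pattern repeats every nblocks * blocksize observations.
    """
    if not 1 <= offset <= blocksize:
        raise ValueError("offset must be between 1 and the block size")
    season_length = nblocks * blocksize
    # Zero-based position counted from the start of the first block.
    idx = ((t - 1) + (offset - 1)) % season_length
    return idx // blocksize + 1, idx % blocksize + 1


# Hourly data, days as blocks, first measurement in the seventh hour of a day.
print(block_position(1, 7, 24, offset=7))     # (1, 7)
print(block_position(19, 7, 24, offset=7))    # (2, 1): first hour of the next day
print(block_position(163, 7, 24, offset=7))   # (1, 1): the week starts over
```

All observations that map to the same block share one day-of-the-week effect, while the hour-of-the-day effect is carried by the separate seasonal of length 24.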
BLOCKSIZE=integer
specifies the block size, $m$. This is a required option in this statement. The block size can be any integer larger than or equal to two. Typical examples of block sizes are 24, corresponding to the hours of the day when a day is being used as a block in hourly data, or 60, corresponding to the minutes in an hour when an hour is being used as a block in data recorded by minutes.
NBLOCKS=integer
specifies the number of blocks, $k$. This is a required option in this statement. The number of blocks can be any integer greater than or equal to two.
NOEST
fixes the value of the disturbance variance parameter to the value specified in the VARIANCE= option.
OFFSET=integer
specifies the position of the first measurement within the block, if the first measurement is not at the start of a block. The OFFSET= value must be between one and the block size. The default value is one. The first measurement refers to the start of the estimation span and the forecast span. If these spans differ, their starting measurements must be separated by an integer multiple of the block size.
PLOT=FILTER
PLOT=SMOOTH
PLOT=F_ANNUAL
PLOT=S_ANNUAL
PLOT=( < plot request > . . . < plot request > )
requests plots of the season component. When you specify only one plot request, you can omit the parentheses around the plot request. You can use the FILTER and SMOOTH options to plot the filtered and smoothed estimates of the season component $\gamma_t$. You can use the F_ANNUAL and S_ANNUAL options to get the plots of "annual" variation in the filtered and smoothed estimates of $\gamma_t$. The annual plots are useful to see the change in the contribution of a particular month over the span of years. Here "month" and "year" are generic terms that change appropriately with the interval type being used to label the observations and the season length. For example, for monthly data with a season length of 12, the usual meaning applies, while for daily data with a season length of 7, the days of the week serve as months and the weeks serve as years. The first period in each block is plotted over the years.
PRINT=FILTER
PRINT=SMOOTH
PRINT=( < FILTER > < SMOOTH > )
requests the printing of the filtered or smoothed estimate of the block seasonal component $\gamma_t$.
TYPE=DUMMY | TRIG
specifies the type of the block seasonal component. The default type is DUMMY.
VARIANCE=value
specifies an initial value for the disturbance variance, $\sigma_\omega^2$, in the $\gamma_t$ equation at the start of the parameter estimation process. Any nonnegative value, including zero, is an acceptable starting value.
BY Statement
BY variables ;
A BY statement can be used in the UCM procedure to process a data set in groups of observations defined by the BY variables. The model specified using the MODEL and other component statements is applied to all the groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. The variables are one or more variables in the input data set.
CYCLE Statement
CYCLE < options > ;
The CYCLE statement is used to specify a cycle component, $\psi_t$, in the model. The stochastic equation governing a cycle component of period $p$ and damping factor $\rho$ is as follows:

$$\begin{bmatrix} \psi_t \\ \psi_t^* \end{bmatrix} = \rho \begin{bmatrix} \cos\lambda & \sin\lambda \\ -\sin\lambda & \cos\lambda \end{bmatrix} \begin{bmatrix} \psi_{t-1} \\ \psi_{t-1}^* \end{bmatrix} + \begin{bmatrix} \nu_t \\ \nu_t^* \end{bmatrix}$$

where $\nu_t$ and $\nu_t^*$ are independent, zero-mean, Gaussian disturbances with variance $\sigma_\nu^2$ and $\lambda = 2\pi/p$ is the angular frequency of the cycle. Any $p$ strictly greater than two is an admissible value for the period, and the damping factor $\rho$ can be any value in the interval (0, 1], including one but excluding zero. The cycles with frequency zero and $\pi$, which correspond to the periods equal to infinity and two, respectively, can be specified using the AUTOREG statement. The values of $\rho$ less than one give rise to a stationary cycle, while $\rho = 1$ gives rise to a nonstationary cycle. As a default, values of $\rho$, $p$, and $\sigma_\nu^2$ are estimated from the data. However, if necessary, you can fix the values of some or all of these parameters.

There can be multiple cycles in a model, each specified using a separate CYCLE statement. The examples that follow illustrate the use of the CYCLE statement.
The following statements request including two cycles in the model. The parameters of each of these cycles are estimated from the data.
cycle; cycle;
The following statement requests inclusion of a nonstationary cycle in the model. The cycle period $p$ and the disturbance variance $\sigma_\nu^2$ are estimated from the data.
cycle rho=1 noest=rho;
In the following statement a nonstationary cycle with a fixed period of 12 is specified. Moreover, a starting value is supplied for $\sigma_\nu^2$.
cycle period=12 rho=1 variance=4 noest=(rho period);
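The cycle recursion described above can be illustrated numerically. The following Python sketch is not part of PROC UCM; the function names `cycle_step` and `amplitude` are hypothetical, and the disturbances are set to zero so that only the effect of the damping factor and the period is visible. Under these assumptions, each step rotates the state by the angular frequency and shrinks its amplitude by the damping factor.

```python
import math

def cycle_step(psi, psi_star, rho, period, nu=0.0, nu_star=0.0):
    """Advance the cycle state (psi_t, psi*_t) by one time step.

    Hypothetical helper: rho is the damping factor, period is p, and
    nu / nu_star are the two disturbances (zero by default).
    """
    lam = 2.0 * math.pi / period              # angular frequency, 2*pi/p
    c, s = math.cos(lam), math.sin(lam)
    new_psi = rho * (c * psi + s * psi_star) + nu
    new_psi_star = rho * (-s * psi + c * psi_star) + nu_star
    return new_psi, new_psi_star

def amplitude(psi, psi_star):
    """Euclidean length of the cycle state vector."""
    return math.hypot(psi, psi_star)

# Damped (stationary) cycle: the amplitude shrinks by rho at every step.
state = (1.0, 0.0)
for _ in range(12):
    state = cycle_step(*state, rho=0.9, period=12)
damped_amplitude = amplitude(*state)

# Undamped (nonstationary) cycle with rho = 1: after one full period of
# 12 steps the noise-free state returns to its starting point.
state = (1.0, 0.0)
for _ in range(12):
    state = cycle_step(*state, rho=1.0, period=12)
undamped_state = state
```

With zero disturbances, the damped amplitude after 12 steps equals 0.9 raised to the power 12, while the undamped state returns to its starting value up to rounding error.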
NOEST=PERIOD NOEST=RHO NOEST=VARIANCE NOEST=( < RHO > < PERIOD > < VARIANCE > )
fixes the values of the component parameters to those specified in the RHO=, PERIOD=, and VARIANCE= options. This option enables you to fix any combination of parameter values.
PERIOD=value
specifies an initial value for the cycle period during the parameter estimation process. The period value must be strictly greater than 2.
PLOT=FILTER PLOT=SMOOTH PLOT=( < FILTER > < SMOOTH > )
requests plotting of the filtered or smoothed estimate of the cycle component $\psi_t$.
RHO=value
specifies an initial value for the damping factor $\rho$ in this component during the parameter estimation process. Any value in the interval (0, 1], that is, including one but excluding zero, is an acceptable initial value for the damping factor.
VARIANCE=value
specifies an initial value for the disturbance variance parameter, $\sigma_\nu^2$, to be used during the parameter estimation process. Any nonnegative value, including zero, is an acceptable starting value.
DEPLAG Statement
DEPLAG LAGS = order < PHI = value . . . > < NOEST > ;
The DEPLAG statement is used to specify the lags of the dependent variable to be included as predictors in the model. The following examples illustrate the use of the DEPLAG statement. If the dependent series is denoted by $y_t$, the following statement specifies the inclusion of $\phi_1 y_{t-1} + \phi_2 y_{t-2}$ in the model. The parameters $\phi_1$ and $\phi_2$ are estimated from the data.
deplag lags=2;
The following statement requests including $\phi_1 y_{t-1} + \phi_2 y_{t-4} - \phi_1 \phi_2 y_{t-5}$ in the model. The values of $\phi_1$ and $\phi_2$ are fixed at 0.8 and -1.2.
deplag lags=(1)(4) phi=0.8 -1.2 noest;
The dependent lag parameters are not constrained to lie in any particular region. In particular, this implies that a UCM that contains only an irregular component and dependent lags, resulting in a traditional autoregressive model, is not constrained to be a stationary model. In the DEPLAG statement, if an initial value is supplied for any one of the parameters, the initial values must also be supplied for all other parameters.
LAGS=order LAGS=(lag, . . . , lag ) . . . (lag, . . . , lag )
is a required option in this statement. LAGS=($l_1$, $l_2$, . . . , $l_k$) defines a model with specified lags of the dependent variable included as predictors. LAGS=order is equivalent to LAGS=(1, 2, . . . , order). A concatenation of parenthesized lists specifies a factored model. For example, LAGS=(1)(12) specifies that the lag values 1, 12, and 13, corresponding to the following polynomial in the backward shift operator, be included in the model:

$$(1 - \phi_{1,1} B)(1 - \phi_{2,1} B^{12})$$

Note that, in this case, the coefficient of the thirteenth lag is constrained to be the product of the coefficients of the first and twelfth lags.
NOEST
fixes the values of the parameters to those specified in the PHI= option.
PHI=value . . .
lists starting values for the coefficients of the lagged dependent variable. The order of the values listed corresponds with the order of the lags specified in the LAGS= option.
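The constraint implied by a factored model such as LAGS=(1)(12) can be checked by expanding the product of the two lag polynomials. The following Python sketch is an illustration only; the helper names `multiply_lag_polynomials` and `factored_lag_coefficients` and the coefficient values 0.5 and 0.3 are hypothetical and do not come from the procedure.

```python
def multiply_lag_polynomials(a, b):
    """Multiply two polynomials in B, each given as a {power: coefficient} dict."""
    product = {}
    for power_a, coef_a in a.items():
        for power_b, coef_b in b.items():
            power = power_a + power_b
            product[power] = product.get(power, 0.0) + coef_a * coef_b
    return product

def factored_lag_coefficients(phi_1, phi_2, lag_1=1, lag_2=12):
    """Expand (1 - phi_1 B^lag_1)(1 - phi_2 B^lag_2).

    Returns the implied coefficients on the lagged dependent variable,
    keyed by lag, after the lag terms are moved to the right-hand side.
    """
    poly = multiply_lag_polynomials({0: 1.0, lag_1: -phi_1},
                                    {0: 1.0, lag_2: -phi_2})
    # Moving a term to the right-hand side of the equation flips its sign.
    return {lag: -coef for lag, coef in poly.items() if lag != 0}

# Hypothetical coefficient values chosen only for illustration.
coefs = factored_lag_coefficients(0.5, 0.3)
```

In this sketch the expansion contains lags 1, 12, and 13 only, and the lag-13 term is determined entirely by the other two coefficients: in the polynomial it equals their product, so on the right-hand side of the equation it appears with a negative sign.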
ESTIMATE Statement
ESTIMATE < options > ;
The ESTIMATE statement is an optional statement used to control the overall model-fitting environment. Using this statement, you can control the span of observations used to fit the model by using the SKIPFIRST= and BACK= options. This can be useful in model diagnostics. You can request a variety of goodness-of-fit statistics and other model diagnostic information, including different residual diagnostic plots. Note that the ESTIMATE statement is not used to control the nonlinear optimization process itself. That is done using the NLOPTIONS statement, where you can control the number of iterations, choose between the different optimization techniques, and so on. You can save the estimated parameters and other related information in a data set by using the OUTEST= option. You can request the optimization of the profile likelihood, the likelihood obtained by concentrating out a disturbance variance, for parameter estimation by using the PROFILE option. The following example illustrates the use of this statement:
estimate skipfirst=12 back=24;
This statement requests that the initial 12 measurements and the last 24 measurements be excluded during the model-fitting process. The actual observation span used to fit the model is decided as follows: Suppose that $n_0$ and $n_1$ are the observation numbers of the first and the last nonmissing values of the response variable, respectively. As a result of SKIPFIRST=12 and BACK=24, the measurements between observation numbers $n_0 + 12$ and $n_1 - 24$ form the estimation span. Of course, the model fitting might not take place if there are insufficient data in the resulting span. The model fitting does not take place if there are regressors in the model that have missing values in the estimation span.
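The span arithmetic described above can be written out as a small Python sketch. This is an illustration of the stated rule, not code from the procedure; the function name `estimation_span` is hypothetical, and returning `None` for an empty span is an assumption of this sketch (the procedure itself simply might not fit the model when data are insufficient).

```python
def estimation_span(n0, n1, skipfirst=0, back=0):
    """Return (first, last) observation numbers of the estimation span.

    n0 and n1 are the observation numbers of the first and last nonmissing
    response values. Returns None when no observations remain, which is a
    convention of this sketch only.
    """
    first = n0 + skipfirst
    last = n1 - back
    if first > last:
        return None
    return first, last

# A series with nonmissing responses in observations 1 through 144,
# fitted with SKIPFIRST=12 and BACK=24.
span = estimation_span(n0=1, n1=144, skipfirst=12, back=24)

# A short series for which the same options leave nothing to fit.
empty_span = estimation_span(n0=1, n1=30, skipfirst=12, back=24)
```

For the 144-observation series the sketch gives a span from observation 13 to observation 120, which matches $n_0 + 12$ and $n_1 - 24$.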
BACK=integer SKIPLAST=integer
indicates that some ending part of the data needs to be ignored during the parameter estimation. This can be useful when you want to study the forecasting performance of a model on the observed data. BACK=10 results in skipping the last 10 measurements of the response series during the parameter estimation. The default is BACK=0.
EXTRADIFFUSE=k
enables continuation of the diffuse filtering iterations for k additional iterations beyond the first instance where the initialization of the diffuse state would have otherwise taken place. If the specified k is larger than the sample size, the diffuse iterations continue until the end of the sample. Note that one-step-ahead residuals are produced only after the diffuse state is initialized. Delaying the initialization leads to a reduction in the number of one-step-ahead residuals available for computing the residual diagnostic measures. This option is useful when you want to ignore the first few one-step-ahead residuals that often have large variance.
NOPROFILE
requests that the usual likelihood be optimized for parameter estimation. For more information, see the section "Parameter Estimation by Profile Likelihood Optimization" on page 1798.
OUTEST=SAS-data-set
specifies an output data set for the estimated parameters.
In the ESTIMATE statement, the PLOT= option is used to obtain different residual diagnostic plots. The different possibilities are as follows:
PLOT=ACF PLOT=MODEL PLOT=LOESS PLOT=HISTOGRAM PLOT=PACF PLOT=PANEL PLOT=QQ PLOT=RESIDUAL PLOT=WN PLOT=( < plot request > . . . < plot request > )
requests different residual diagnostic plots. The different options are as follows:
ACF
produces the residual autocorrelation plot.
HISTOGRAM
produces the histogram of residuals.
LOESS
produces a scatter plot of residuals against time, which has an overlaid loess fit.
MODEL
produces a plot of the one-step-ahead forecasts in the estimation span.
PACF
produces the residual partial-autocorrelation plot.
PANEL
produces a summary panel of the residual diagnostics consisting of the histogram of residuals, the normal quantile plot of residuals, the residual autocorrelation plot, and the residual partial-autocorrelation plot.
QQ
produces a normal quantile plot of residuals.
RESIDUAL
produces a needle plot of residuals.
WN
produces a plot of p-values, in log-scale, at different lags for the Ljung-Box portmanteau white noise test statistics.
PRINT=NONE
suppresses all the printed output related to the model fitting, such as the parameter estimates, the goodness-of-fit statistics, and so on.
PROFILE
requests that the profile likelihood, obtained by concentrating out one of the disturbance variances from the likelihood, be optimized for parameter estimation. By default, the profile likelihood is not optimized if any of the disturbance variance parameters is held fixed to a nonzero value. For more information, see the section "Parameter Estimation by Profile Likelihood Optimization" on page 1798.
SKIPFIRST=integer
indicates that some early part of the data needs to be ignored during the parameter estimation. This can be useful if there is a reason to believe that the model being estimated is not appropriate for this portion of the data. SKIPFIRST=10 results in skipping the first 10 measurements of the response series during the parameter estimation. The default is SKIPFIRST=0.
FORECAST Statement
FORECAST < options > ;
The FORECAST statement is an optional statement that is used to specify the overall forecasting environment for the specified model. It can be used to specify the span of observations, the historical period, to use to compute the forecasts of the future observations. This is done using the SKIPFIRST= and BACK= options. The number of periods to forecast beyond the historical period and the significance level of the forecast confidence interval are specified using the LEAD= and ALPHA= options. You can request one-step-ahead series and component forecasts by using the PRINT= option. You can save the series forecasts, and the model-based decomposition of the series, in a data set by using the OUTFOR= option. The following example illustrates the use of this statement:
forecast skipfirst=12 back=24 lead=30;
This statement requests that the initial 12 and the last 24 response values be excluded during the forecast computations. The forecast horizon, specified using the LEAD= option, is 30 periods; that is, multistep forecasting begins at the end of the historical period and continues for 30 periods. The actual observation span used to compute the multistep forecasting is decided as follows: Suppose that $n_0$ and $n_1$ are the observation numbers of the first and the last nonmissing values of the response variable, respectively. As a result of SKIPFIRST=12 and BACK=24, the historical period, or the forecast span, begins at $n_0 + 12$ and ends at $n_1 - 24$. Multistep forecasts are produced for the next 30 periods, that is, for the observation numbers $n_1 - 23$ to $n_1 + 6$. Of course, the forecast computations can fail if the model has regressor variables that have missing values in the forecast span. If the regressors contain missing values in the forecast horizon, that is, between the observations $n_1 - 23$ and $n_1 + 6$, the forecast horizon is reduced accordingly.
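The forecast horizon described above splits into a part that overlaps the held-out observations and a part that lies beyond the data. The following Python sketch makes that split explicit; the function name `forecast_horizon` and the returned dictionary keys are hypothetical, and the sketch ignores the shortening of the horizon caused by missing regressor values.

```python
def forecast_horizon(n1, back=0, lead=12):
    """Describe the multistep forecast horizon.

    n1 is the observation number of the last nonmissing response value.
    The historical period ends at n1 - back, and forecasts are produced
    for the following `lead` periods.
    """
    historical_end = n1 - back
    first = historical_end + 1
    last = historical_end + lead
    return {
        "first": first,
        "last": last,
        # Forecast periods for which held-out actual values exist.
        "holdout_periods": min(back, lead),
        # Forecast periods that lie beyond the end of the data.
        "future_periods": max(lead - back, 0),
    }

# BACK=24 and LEAD=30 on a series whose last response is observation 144.
horizon = forecast_horizon(n1=144, back=24, lead=30)
```

For this series the horizon runs from observation 121 to observation 150, that is, from $n_1 - 23$ to $n_1 + 6$: 24 of the 30 forecasts can be compared with held-out values, and 6 extend past the end of the data.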
ALPHA=value
specifies the significance level of the forecast confidence intervals; for example, ALPHA=0.05, which is the default, results in a 95% confidence interval.
BACK=integer SKIPLAST=integer
specifies the holdout sample for the evaluation of the forecasting performance of the model. For example, BACK=10 results in treating the last 10 observed values of the response series as unobserved. A post-sample-prediction-analysis table is produced for comparing the predicted values with the actual values in the holdout period. The default is BACK=0.
EXTRADIFFUSE=k
enables continuation of the diffuse filtering iterations for k additional iterations beyond the first instance where the initialization of the diffuse state would have otherwise taken place. If the specified k is larger than the sample size, the diffuse iterations continue until the end of the sample. Note that one-step-ahead forecasts are produced only after the diffuse state is initialized. Delaying the initialization leads to a reduction in the number of one-step-ahead forecasts. This option is useful when you want to ignore the first few one-step-ahead forecasts that often have large variance.
LEAD=integer
specifies the number of periods to forecast beyond the historical period defined by the SKIPFIRST= and BACK= options; for example, LEAD=10 results in the forecasting of 10 future values of the response series. The default is LEAD=12.
OUTFOR=SAS-data-set
specifies an output data set for the forecasts. The output data set contains the ID variable (if specified), the response and predictor series, the one-step-ahead and out-of-sample response series forecasts, the forecast confidence intervals, the smoothed values of the response series, and the smoothed forecasts produced as a result of the model-based decomposition of the series.
PLOT=DECOMP PLOT=DECOMPVAR PLOT=FDECOMP PLOT=FDECOMPVAR PLOT=FORECASTS PLOT=TREND PLOT=( < plot request > . . . < plot request > )
requests forecast and model decomposition plots. The FORECASTS option provides the plot of the series forecasts, the TREND and DECOMP options provide the plots of the smoothed trend and other decompositions, the DECOMPVAR option can be used to plot the variance of these components, and the FDECOMP and FDECOMPVAR options provide the same plots for the filtered decomposition estimates and their variances.
PRINT=DECOMP PRINT=FDECOMP PRINT=FORECASTS PRINT=NONE PRINT=( < print request > . . . < print request > )
controls the printing of the series forecasts and the printing of smoothed model decomposition estimates. By default, the series forecasts are printed only for the forecast horizon specified by the LEAD= option; that is, the one-step-ahead predicted values are not printed. You can request forecasts for the entire forecast span by specifying the PRINT=FORECASTS option. Using PRINT=DECOMP, you can get smoothed estimates of the following effects: trend, trend plus regression, trend plus regression plus cycle, and sum of all components except the irregular. If some of these effects are absent in the model, then they are ignored. Similarly, you can get filtered estimates of these effects by using PRINT=FDECOMP. You can use PRINT=NONE to suppress the printing of all the forecast output.
SKIPFIRST=integer
indicates that some early part of the data needs to be ignored during the forecasting calculations. This can be useful if there is a reason to believe that the model being used for forecasting is not appropriate for this portion of the data. SKIPFIRST=10 results in skipping the first 10 measurements of the response series during the forecast calculations. The default is SKIPFIRST=0.
ID Statement
ID variable INTERVAL=value < ALIGN=value > ;
The ID statement names a numeric variable that identifies observations in the input and output data sets. The ID variable's values are assumed to be SAS date, time, or datetime values. In addition, the ID statement specifies the frequency associated with the time series. The ID statement options also specify how the observations are aligned to form the time series. If the ID statement is specified, the INTERVAL= option must also be specified. If the ID statement is not specified, the observation number, with respect to the BY group, is used as the time ID. The values of the ID variable are extrapolated for the forecast observations based on the values of the INTERVAL= option.
ALIGN=value
controls the alignment of SAS dates used to identify output observations. The ALIGN= option has the following possible values: BEGINNING | BEG | B, MIDDLE | MID | M, and ENDING | END | E. The default is BEGINNING. The ALIGN= option is used to align the ID variable with the beginning, middle, or end of the time ID interval specified by the INTERVAL= option.
INTERVAL=value
specifies the time interval between observations. This option is required in the ID statement. INTERVAL=value is used in conjunction with the ID variable to check that the input data are in order and have no gaps. The INTERVAL= option is also used to extrapolate the ID values past the end of the input data. For a complete discussion of the intervals supported, see Chapter 4, "Date Intervals, Formats, and Functions."
IRREGULAR Statement
IRREGULAR < options > ;
The IRREGULAR statement is used to include an irregular component in the model. There can be at most one IRREGULAR statement in the model specification. The irregular component corresponds to the overall random error, $\epsilon_t$, in the model. By default the irregular component is modeled as white noise, that is, as a sequence of independent, identically distributed, zero-mean, Gaussian random variables. However, as an experimental feature in this release of the UCM procedure, you can also model it as an autoregressive moving-average (ARMA) process. The options for specifying an ARMA model for the irregular component are given in a separate subsection: "ARMA Specification (Experimental)" on page 1769. The options in this statement enable you to specify the value of $\sigma_\epsilon^2$ and to output the forecasts of $\epsilon_t$. As a default, $\sigma_\epsilon^2$ is estimated using the data. Two examples of the IRREGULAR statement are given next. In the first example the statement is in its simplest form, resulting in the inclusion of an irregular component that is white noise with unknown variance:
irregular;
The following statement provides a starting value for $\sigma_\epsilon^2$, to be used in the nonlinear parameter estimation process. It also requests the printing of smoothed predictions of $\epsilon_t$. The smoothed irregulars are useful in model diagnostics.
irregular variance=4 print=smooth;
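NOEST
fixes the value of $\sigma_\epsilon^2$ to the value specified in the VARIANCE= option.
VARIANCE=value
specifies an initial value for $\sigma_\epsilon^2$ during the parameter estimation process. Any nonnegative value, including zero, is an acceptable starting value.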
ARMA Specification (Experimental)
A random sequence $\epsilon_t$ follows a zero-mean ARMA(p,q)$\times$(P,Q)$_s$ model if it satisfies the following difference equation, specified in terms of polynomials in the backward shift operator $B$ (so that $B\epsilon_t = \epsilon_{t-1}$):

$$\phi(B)\,\Phi(B^s)\,\epsilon_t = \theta(B)\,\Theta(B^s)\,a_t$$

where $a_t$ is a white noise sequence and $s$ is the season length. The polynomials $\phi$, $\Phi$, $\theta$, and $\Theta$ are of orders $p$, $P$, $q$, and $Q$, respectively, which can be any nonnegative integers. The season length $s$ must be a positive integer. For example, $\epsilon_t$ satisfies an ARMA(1,1) model (that is, $p = 1$, $q = 1$, $P = 0$, and $Q = 0$) if

$$\epsilon_t = \phi_1 \epsilon_{t-1} + a_t - \theta_1 a_{t-1}$$

for some coefficients $\phi_1$ and $\theta_1$ and a white noise sequence $a_t$. Similarly, $\epsilon_t$ satisfies an ARMA(1,1)$\times$(1,1)$_{12}$ model if

$$\epsilon_t = \phi_1 \epsilon_{t-1} + \Phi_1 \epsilon_{t-12} - \phi_1 \Phi_1 \epsilon_{t-13} + a_t - \theta_1 a_{t-1} - \Theta_1 a_{t-12} + \theta_1 \Theta_1 a_{t-13}$$

for some coefficients $\phi_1$, $\Phi_1$, $\theta_1$, and $\Theta_1$ and a white noise sequence $a_t$.

The ARMA process is stationary and invertible if the defining polynomials $\phi$, $\Phi$, $\theta$, and $\Theta$ have all their roots outside the unit circle, that is, the absolute values of the roots are strictly larger than 1.0. It is assumed that the ARMA model specified for the irregular component is stationary and invertible; that is, the coefficients of the polynomials $\phi$, $\Phi$, $\theta$, and $\Theta$ are constrained so that the stationarity and invertibility conditions are satisfied. The unknown coefficients of these polynomials become part of the model parameter vector that is estimated using the data.

The notation for a closely related class of models, autoregressive integrated moving-average (ARIMA) models, is also given here. A random sequence $y_t$ is said to follow an ARIMA(p,d,q)$\times$(P,D,Q)$_s$ model if, for some nonnegative integers $d$ and $D$, the differenced series $\epsilon_t = (1 - B)^d (1 - B^s)^D y_t$ follows an ARMA(p,q)$\times$(P,Q)$_s$ model. The integers $d$ and $D$ are called nonseasonal and seasonal differencing orders, respectively. You can specify ARIMA models by using the DEPLAG statement for specifying the differencing orders and by using the IRREGULAR statement for the ARMA specification. See Example 29.8 for an example of ARIMA(0,1,1)$\times$(0,1,1)$_{12}$ model specification. Brockwell and Davis (1991) can be consulted for additional information about ARIMA models.

You can use options of the IRREGULAR statement to specify the desired ARMA model and to request printed and graphical output. Several examples of the IRREGULAR statement are given next. The following statement specifies an irregular component that is modeled as an ARMA(1,1) process. It also requests plotting its smoothed estimate.
irregular p=1 q=1 plot=smooth;
The following statement specifies an ARMA(1,1)$\times$(1,1)$_{12}$ model. It also fixes the coefficient of the first-order seasonal moving-average polynomial to 0.1. The other coefficients and the white noise variance are estimated using the data.
irregular p=1 sp=1 q=1 sq=1 s=12 sma=0.1 noest=(sma);
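The stationarity and invertibility conditions stated above require every root of a defining polynomial to lie strictly outside the unit circle. The following Python sketch checks that condition for polynomials of order at most two, written in the form $1 - c_1 z - c_2 z^2$. It is a hand-rolled illustration limited to low orders; the function name `roots_outside_unit_circle` is hypothetical, and the sketch says nothing about how PROC UCM enforces the constraint internally.

```python
import cmath

def roots_outside_unit_circle(coefs):
    """Check whether all roots of 1 - c1*z - c2*z**2 exceed 1 in modulus.

    coefs holds at most two coefficients [c1, c2]; higher orders are
    outside the scope of this sketch.
    """
    c = list(coefs)
    while c and c[-1] == 0:
        c.pop()                       # drop trailing zero coefficients
    if not c:
        return True                   # the constant polynomial 1 has no roots
    if len(c) == 1:
        roots = [1.0 / c[0]]          # 1 - c1*z = 0  gives  z = 1/c1
    elif len(c) == 2:
        c1, c2 = c
        # 1 - c1*z - c2*z**2 = 0 is equivalent to c2*z**2 + c1*z - 1 = 0.
        disc = cmath.sqrt(c1 * c1 + 4.0 * c2)
        roots = [(-c1 + disc) / (2.0 * c2), (-c1 - disc) / (2.0 * c2)]
    else:
        raise ValueError("this sketch handles orders 1 and 2 only")
    return all(abs(root) > 1.0 for root in roots)

# The fixed seasonal moving-average coefficient 0.1 from the example
# gives an invertible first-order polynomial.
sma_ok = roots_outside_unit_circle([0.1])

# A first-order coefficient of exactly 1 puts the root on the unit circle.
unit_root_ok = roots_outside_unit_circle([1.0])
```

A first-order polynomial passes the check exactly when its coefficient is smaller than one in absolute value, which is why the value 0.1 is acceptable and the value 1.0 is not.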
AR=$\phi_1\ \phi_2 \ldots \phi_p$
lists the starting values of the coefficients of the nonseasonal autoregressive polynomial:

$$\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$$

where the order $p$ is specified in the P= option. The coefficients $\phi_i$ must define a stationary autoregressive polynomial.
MA=$\theta_1\ \theta_2 \ldots \theta_q$
lists the starting values of the coefficients of the nonseasonal moving-average polynomial:

$$\theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q$$

where the order $q$ is specified in the Q= option. The coefficients $\theta_i$ must define an invertible moving-average polynomial.
NOEST=(<VARIANCE> <AR> <SAR> <MA> <SMA>)
fixes the values of the ARMA parameters and the value of the white noise variance to those specified in the AR=, SAR=, MA=, SMA=, or VARIANCE= options.
P=integer
specifies the order of the nonseasonal autoregressive polynomial. The order can be any nonnegative integer; the default value is 0. In practice the order is a small integer such as 1, 2, or 3.
Q=integer
specifies the order of the nonseasonal moving-average polynomial. The order can be any nonnegative integer; the default value is 0. In practice the order is a small integer such as 1, 2, or 3.
S=integer
specifies the season length used during the specification of the seasonal autoregressive or seasonal moving-average polynomial. The season length can be any positive integer; for example, S=4 might be an appropriate value for a quarterly series. The default value is S=1.
SAR=$\Phi_1\ \Phi_2 \ldots \Phi_P$
lists the starting values of the coefficients of the seasonal autoregressive polynomial:

$$\Phi(B^s) = 1 - \Phi_1 B^s - \cdots - \Phi_P B^{sP}$$

where the order $P$ is specified in the SP= option and the season length $s$ is specified in the S= option. The coefficients $\Phi_i$ must define a stationary autoregressive polynomial.
SMA=$\Theta_1\ \Theta_2 \ldots \Theta_Q$
lists the starting values of the coefficients of the seasonal moving-average polynomial:

$$\Theta(B^s) = 1 - \Theta_1 B^s - \cdots - \Theta_Q B^{sQ}$$

where the order $Q$ is specified in the SQ= option and the season length $s$ is specified in the S= option. The coefficients $\Theta_i$ must define an invertible moving-average polynomial.
SP=integer
specifies the order of the seasonal autoregressive polynomial. The order can be any nonnegative integer; the default value is 0. In practice the order is a small integer such as 1 or 2.
SQ=integer
specifies the order of the seasonal moving-average polynomial. The order can be any nonnegative integer; the default value is 0. In practice the order is a small integer such as 1 or 2.
LEVEL Statement
LEVEL < options > ;
The LEVEL statement is used to include a level component in the model. The level component, either by itself or together with a slope component (see the SLOPE statement), forms the trend component, $\mu_t$, of the model. If the slope component is absent, the resulting trend is a random walk (RW) specified by the following equation:

$$\mu_t = \mu_{t-1} + \eta_t, \qquad \eta_t \sim \text{i.i.d. } N(0, \sigma_\eta^2)$$

If the slope component is present, signified by the presence of a SLOPE statement, a locally linear trend (LLT) is obtained. The equations of LLT are as follows:

$$\mu_t = \mu_{t-1} + \beta_{t-1} + \eta_t, \qquad \eta_t \sim \text{i.i.d. } N(0, \sigma_\eta^2)$$
$$\beta_t = \beta_{t-1} + \xi_t, \qquad \xi_t \sim \text{i.i.d. } N(0, \sigma_\xi^2)$$

In either case, the options in the LEVEL statement are used to specify the value of $\sigma_\eta^2$ and to request forecasts of $\mu_t$. The SLOPE statement is used for similar purposes in the case of slope $\beta_t$. The following examples illustrate the use of the LEVEL statement. Assuming that a SLOPE statement is not added subsequently, a simple random walk trend is specified by the following statement:
level;
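The following statements specify a locally linear trend with the value of $\sigma_\eta^2$ fixed at 4. They also request printing of filtered values of $\mu_t$. The value of $\sigma_\xi^2$, the disturbance variance in the slope equation, is estimated from the data.
level variance=4 noest print=filter; slope;
CHECKBREAK
turns on the search for level shifts (LS) in the response series; see the OUTLIER statement.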
MODEL Statement
MODEL dependent < = regressors > ;
The MODEL statement specifies the response variable and, optionally, the predictor or regressor variables for the UCM model. This is a required statement in the UCM procedure. The predictors specified in the MODEL statement are assumed to have a linear and time-invariant relationship with the response. The predictors that have time-varying regression coefficients are specified separately in the RANDOMREG statement. Similarly, the predictors that have a nonlinear effect on the response variable are specified separately in the SPLINEREG statement. Only one MODEL statement can be specified.
NLOPTIONS Statement
NLOPTIONS < options > ;
PROC UCM uses the nonlinear optimization (NLO) subsystem to perform the nonlinear optimization of the likelihood function during the estimation of model parameters. You can use the NLOPTIONS statement to control different aspects of this optimization process. For most problems the
default settings of the optimization process are adequate. However, in some cases it might be useful to change the optimization technique or to change the maximum number of iterations. This can be done by using the TECH= and MAXITER= options in the NLOPTIONS statement as follows:
nloptions tech=dbldog maxiter=200;
This sets the maximum number of iterations to 200 and changes the optimization technique to DBLDOG rather than the default technique, TRUREG, used in PROC UCM. A discussion of the full range of options that can be used with the NLOPTIONS statement is given in Chapter 6, Nonlinear Optimization Methods. In PROC UCM all these options are available except the options related to the printing of the optimization history. In this version of PROC UCM all the printed output from the NLO subsystem is suppressed.
OUTLIER Statement
OUTLIER < options > ;
The OUTLIER statement enables you to control the reporting of the additive outliers (AO) and level shifts (LS) in the response series. The AOs are searched by default. You can turn on the search for LSs by using the CHECKBREAK option in the LEVEL statement.
ALPHA=significance-level
specifies the significance level for reporting the outliers. The default is 0.05.
MAXNUM=number
limits the number of outliers to search for.
MAXPCT=number
is similar to the MAXNUM= option. In the MAXPCT= option you can limit the number of outliers to search for according to a percentage of the series length. The default is MAXPCT=1. When both of these options are specified, the minimum of the two search numbers is used.
PRINT=SHORT | DETAIL
enables you to control the printed output of the outlier search. The PRINT=SHORT option, which is the default, produces an outlier summary table containing the most significant outliers, either AO or LS, discovered in the outlier search. The PRINT=DETAIL option produces, in addition to the outlier summary table, separate tables containing the AO and LS structural break chi-square statistics computed at each time point in the estimation span.
RANDOMREG Statement
RANDOMREG regressors < / options > ;
The RANDOMREG statement is used to specify regressors with time-varying regression coefficients. Each regression coefficient, say $\beta_t$, is assumed to evolve as a random walk:

$$\beta_t = \beta_{t-1} + \xi_t, \qquad \xi_t \sim \text{i.i.d. } N(0, \sigma^2)$$

Of course, if the random walk disturbance variance $\sigma^2$ is zero, then the regression coefficient is not time varying, and it reduces to the standard regression setting. There can be multiple RANDOMREG statements, and each statement can contain one or more regressors. The regressors in a given RANDOMREG statement form a group that is assumed to share the same disturbance variance parameter. The random walks associated with different regressors are assumed to be independent. For an example of using this statement, see Example 29.4. See the section "Reporting Parameter Estimates for Random Regressors" on page 1794 for additional information about the way parameter estimates are reported for this type of regressors.
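NOEST
fixes the value of $\sigma^2$ to the value specified in the VARIANCE= option.
PRINT=FILTER PRINT=SMOOTH PRINT=( < print request > . . . < print request > )
requests printing of the filtered or smoothed estimate of the time-varying regression coefficient.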
VARIANCE=value
specifies an initial value for $\sigma^2$ during the parameter estimation process. Any nonnegative value, including zero, is an acceptable starting value.
SEASON Statement
SEASON LENGTH = integer < options > ;
The SEASON or SEASONAL statement is used to specify a seasonal component, $\gamma_t$, in the model. A seasonal component can be one of two types, DUMMY or TRIG. A DUMMY seasonal with season length $s$ satisfies the following stochastic equation:

$$\sum_{i=0}^{s-1} \gamma_{t-i} = \omega_t, \qquad \omega_t \sim \text{i.i.d. } N(0, \sigma_\omega^2)$$

The equations for a TRIG (short for trigonometric) seasonal component are as follows:

$$\gamma_t = \sum_{j=1}^{[s/2]} \gamma_{j,t}$$

where $[s/2]$ equals $s/2$ if $s$ is even and $(s-1)/2$ if it is odd. The sinusoids, also called harmonics, $\gamma_{j,t}$ have frequencies $\lambda_j = 2\pi j/s$ and are specified by the matrix equation

$$\begin{bmatrix} \gamma_{j,t} \\ \gamma_{j,t}^* \end{bmatrix} = \begin{bmatrix} \cos\lambda_j & \sin\lambda_j \\ -\sin\lambda_j & \cos\lambda_j \end{bmatrix} \begin{bmatrix} \gamma_{j,t-1} \\ \gamma_{j,t-1}^* \end{bmatrix} + \begin{bmatrix} \omega_{j,t} \\ \omega_{j,t}^* \end{bmatrix}$$

where the disturbances $\omega_{j,t}$ and $\omega_{j,t}^*$ are assumed to be independent and, for fixed $j$, $\omega_{j,t}$ and $\omega_{j,t}^* \sim N(0, \sigma_\omega^2)$. If $s$ is even, then the equation for $\gamma_{s/2,t}^*$ is not needed and $\gamma_{s/2,t}$ is given by

$$\gamma_{s/2,t} = -\gamma_{s/2,t-1} + \omega_{s/2,t}$$

In the TRIG seasonal case, the option KEEPH= or DROPH= can be used to obtain subset trigonometric seasonals that contain only a subset of the full set of harmonics $\gamma_{j,t}$, $j = 1, 2, \ldots, [s/2]$. This is particularly useful when the season length $s$ is large and the seasonal pattern is relatively smooth. Note that whether the seasonal type is DUMMY or TRIG, there is only one parameter, the disturbance variance $\sigma_\omega^2$, in the seasonal model.

There can be more than one seasonal component in the model, necessarily with different season lengths if the seasons are full. You can have multiple subset season components with the same season length, if you need to use separate disturbance variances for different sets of harmonics. Each seasonal component is specified using a separate SEASON statement. A model with multiple seasonal components can easily become quite complex and might need a large amount of data and computing resources for its estimation and forecasting.

The examples that follow illustrate the use of the SEASON statement. The following statement specifies a DUMMY type (default) seasonal component with a season length of four, corresponding to the quarterly seasonality. The disturbance variance $\sigma_\omega^2$ is estimated from the data.
season length=4;
The following statement specifies a trigonometric seasonal with monthly seasonality. It also provides a starting value for $\sigma_\omega^2$.
season length=12 type=trig variance=4;
DROPHARMONICS|DROPH=number-list | n TO m BY p
enables you to drop some harmonics $\gamma_{j,t}$ from the full set of harmonics used to obtain a trigonometric seasonal. The drop list can include any integer between 1 and $[s/2]$, $s$ being the season length. For example, the following specification results in a trigonometric seasonal with a season length of 12 that consists of only the first four harmonics $\gamma_{j,t}$, $j = 1, 2, 3, 4$:
season length=12 type=trig DROPH=5 6;
The last two high frequency harmonics are dropped. The DROPH= option cannot be used with the KEEPH= option.
KEEPHARMONICS|KEEPH=number-list | n TO m BY p
enables you to keep only the harmonics $\gamma_{j,t}$ listed in the option to obtain a trigonometric seasonal. The keep list can include any integer between 1 and $[s/2]$, $s$ being the season length. For example, the following specification results in a trigonometric seasonal with a season length of 12 that consists of all six harmonics $\gamma_{j,t}$, $j = 1, \ldots, 6$:
season length=12 type=trig KEEPH=1 to 3; season length=12 type=trig KEEPH=4 to 6;
However, these six harmonics are grouped into two groups, each having its own disturbance variance parameter. The DROPH= option cannot be used with the KEEPH= option.
LENGTH=integer
specifies the season length, $s$. This is a required option in this statement. The season length can be any integer greater than or equal to 2. Typical examples of season lengths are 12, corresponding to the monthly seasonality, or 4, corresponding to the quarterly seasonality.
NOEST
fixes the value of the disturbance variance parameter to the value specified in the VARIANCE= option.
PLOT=FILTER PLOT=SMOOTH PLOT=F_ANNUAL PLOT=S_ANNUAL PLOT=( <plot request> . . . <plot request> )
requests plots of the season component. When you specify only one plot request, you can omit the parentheses around the plot request. You can use the FILTER and SMOOTH options to plot the filtered and smoothed estimates of the season component $\gamma_t$. You can use the F_ANNUAL and S_ANNUAL options to get the plots of "annual" variation in the filtered and smoothed estimates of $\gamma_t$. The annual plots are useful to see the change in the contribution of a particular month over the span of years. Here "month" and "year" are generic terms that change appropriately with the interval type being used to label the observations and the season length. For example, for monthly data with a season length of 12, the usual meaning applies, while for daily data with a season length of 7, the days of the week serve as months and the weeks serve as years.
PRINT=HARMONICS
requests printing of the summary of harmonics present in the seasonal component. This option is valid only for the trigonometric seasonal component.
PRINT=FILTER PRINT=SMOOTH PRINT=( < print request > . . . < print request > )
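requests printing of the filtered or smoothed estimate of the seasonal component $\gamma_t$.
TYPE=DUMMY | TRIG
specifies the type of the seasonal component. The default type is DUMMY.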
VARIANCE=value
specifies an initial value for the disturbance variance, $\sigma_\omega^2$, in the $\gamma_t$ equation at the start of the parameter estimation process. Any nonnegative value, including zero, is an acceptable starting value.
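The effect of the DROPH= and KEEPH= options on a trigonometric seasonal can be sketched by listing the harmonics that remain and their frequencies $2\pi j/s$. The following Python function is an illustration under stated assumptions; the name `retained_harmonics` and its error handling are hypothetical and are not part of PROC UCM.

```python
import math

def retained_harmonics(season_length, drop=None, keep=None):
    """Return {j: frequency} for the harmonics kept in a trigonometric seasonal.

    drop and keep mimic the DROPH= and KEEPH= lists; as in the SEASON
    statement, they cannot be used together.
    """
    if drop is not None and keep is not None:
        raise ValueError("drop and keep cannot be used together")
    n_harmonics = season_length // 2      # s/2 if s is even, (s-1)/2 if odd
    all_js = list(range(1, n_harmonics + 1))
    requested = list(keep) if keep is not None else all_js
    dropped = set(drop) if drop is not None else set()
    for j in requested + sorted(dropped):
        if j not in all_js:
            raise ValueError(f"harmonic {j} is outside 1..{n_harmonics}")
    return {j: 2.0 * math.pi * j / season_length
            for j in requested if j not in dropped}

# Season length 12 with the two highest-frequency harmonics dropped.
monthly_subset = retained_harmonics(12, drop=[5, 6])

# Two complementary subsets, as in the KEEPH= example with two statements.
low_group = retained_harmonics(12, keep=range(1, 4))
high_group = retained_harmonics(12, keep=range(4, 7))
```

In this sketch, dropping harmonics 5 and 6 leaves harmonics 1 through 4, and the two keep lists together cover all six harmonics of a monthly seasonal while remaining separate groups.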
SLOPE Statement
SLOPE < options > ;
The SLOPE statement is used to include a slope component in the model. The slope component cannot be used without the level component (see the LEVEL statement). The level and slope specifications jointly define the trend component of the model. A SLOPE statement without the accompanying LEVEL statement is ignored. The equations of the trend, defined jointly by the level $\mu_t$ and slope $\beta_t$, are as follows:
$$
\begin{aligned}
\mu_t &= \mu_{t-1} + \beta_{t-1} + \eta_t, & \eta_t &\sim \text{i.i.d. } N(0, \sigma_\eta^2) \\
\beta_t &= \beta_{t-1} + \xi_t, & \xi_t &\sim \text{i.i.d. } N(0, \sigma_\xi^2)
\end{aligned}
$$
The SLOPE statement is used to specify the value of the disturbance variance, $\sigma_\xi^2$, in the slope equation, and to request forecasts of $\beta_t$. The following examples illustrate this statement:
level; slope;
The preceding statements fit a model with a locally linear trend. The disturbance variances $\sigma_\eta^2$ and $\sigma_\xi^2$ are estimated from the data. You can request a locally linear trend with fixed slope by using the following statements:
level; slope variance=0 noest;
NOEST
fixes the value of the disturbance variance, $\sigma_\xi^2$, to the value specified in the VARIANCE= option.
VARIANCE=value
specifies an initial value for the disturbance variance, $\sigma_\xi^2$, in the $\beta_t$ equation at the start of the parameter estimation process. Any nonnegative value, including zero, is an acceptable starting value.
SPLINEREG Statement
SPLINEREG regressor < options > ;
The SPLINEREG statement is used to specify a regressor that has a nonlinear relationship with the dependent series that can be approximated by a given B-spline. If the specified spline has degree d and is based on n internal knots, then it can be written as a linear combination of (n + d + 1) regressors that are derived from the original regressor. The span of these (n + d + 1) derived regressors includes the constant; therefore, to avoid multicollinearity with the level component, one of these regressors is dropped. Specifying the SPLINEREG statement is equivalent to specifying a RANDOMREG statement with these derived regressors. There can be multiple SPLINEREG statements. You must specify at least one interior knot, by using either the NKNOTS= option or the KNOTS= option. For additional information about splines, see Chapter 90, "The TRANSREG Procedure" (SAS/STAT User's Guide). For an example of using this statement, see Example 29.6. See the section "Reporting Parameter Estimates for Random Regressors" on page 1794 for additional information about the way parameter estimates are reported for this type of regressor.
DEGREE=integer
specifies the degree of the spline. It can be any integer greater than or equal to zero. The default value is 3. The polynomial degree should be a small integer, usually 0, 1, 2, or 3. Larger values are rarely useful. If you have any doubt as to what degree to specify, use the default.
KNOTS=number-list | n TO m BY p
specifies the interior knots or break points. The values in the knot list must be nondecreasing and must lie between the minimum and the maximum of the spline regressor values in the input data set. The first time you specify a value in the knot list, it indicates a discontinuity in the nth (from DEGREE=n) derivative of the transformation function at the value of the knot. The second mention of a value indicates a discontinuity in the (n − 1)th derivative of the transformation function at the value of the knot. Knots can be repeated any number of times for decreasing smoothness at the break points, but the values in the knot list can never decrease. You cannot use the KNOTS= option with the NKNOTS= option. You should keep the number of knots small.
NKNOTS=m
creates m knots, the first at the 100/(m + 1) percentile, the second at the 200/(m + 1) percentile, and so on. Knots are always placed at data values; there is no interpolation. For example, if NKNOTS=3, knots are placed at the 25th percentile, the median, and the 75th percentile. The value specified for the NKNOTS= option must be ≥ 1. You cannot use the NKNOTS= option with the KNOTS= option.
NOTE: Specifying knots by using the NKNOTS= option can result in different sets of knots in the estimation and forecast stages if the distributions of regressor values in the estimation and forecast spans differ. The estimation span is based on the BACK= and SKIPFIRST= options in the ESTIMATE statement, and the forecast span is based on the BACK= and SKIPFIRST= options in the FORECAST statement.
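The percentile rule for NKNOTS= and the caution in the preceding note can be illustrated with a small Python sketch. The function names are hypothetical, and the nearest-rank percentile rule is an assumption made for illustration; the procedure's exact percentile definition is not stated here.

```python
import math

def nknots_percent_levels(m):
    """Percent levels 100*k/(m+1), k = 1..m, described for NKNOTS=m."""
    if m < 1:
        raise ValueError("NKNOTS= must be at least 1")
    return [100.0 * k / (m + 1) for k in range(1, m + 1)]

def knots_at_data_values(values, m):
    """Pick knots from the data values themselves (no interpolation).

    Assumption: a nearest-rank percentile rule; the procedure may differ.
    """
    ordered = sorted(values)
    n = len(ordered)
    knots = []
    for p in nknots_percent_levels(m):
        rank = max(1, math.ceil(p / 100.0 * n))  # 1-based nearest rank
        knots.append(ordered[rank - 1])
    return knots

# Different spans of regressor values lead to different knots.
estimation_span = list(range(1, 9))    # 1..8
forecast_span = list(range(1, 13))     # 1..12
print(knots_at_data_values(estimation_span, 3))
print(knots_at_data_values(forecast_span, 3))
```

Because the knots depend on the distribution of the regressor values, supplying an explicit KNOTS= list is one way to keep the knots identical across spans.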
NOEST
fixes the value of the regression coefficient random walk disturbance variance to the value specified in the VARIANCE= option.
PLOT=FILTER
PLOT=SMOOTH
PLOT=( <FILTER> <SMOOTH> )
VARIANCE=value
specifies an initial value for the regression coefficient random walk disturbance variance during the parameter estimation process. Any nonnegative value, including zero, is an acceptable starting value.
SPLINESEASON Statement
SPLINESEASON LENGTH=integer KNOTS=integer1 integer2 ... < options > ;
The SPLINESEASON statement is used to specify a seasonal pattern that is to be approximated by a given B-spline. If the specified spline has degree d and is based on n internal knots, then it can be written as a linear combination of (n + d) regressors that are derived from the seasonal dummy regressors. The SPLINESEASON specification is equivalent to specifying a RANDOMREG specification with these derived regressors. Such an approximation is useful only if the season length is relatively large, at least larger than (n + d). For additional information about splines, see Chapter 90, "The TRANSREG Procedure" (SAS/STAT User's Guide). For an example of using this statement, see Example 29.3.
DEGREE=integer
specifies the degree of the spline. It can be any integer greater than or equal to zero. The default value is 3.
KNOTS=integer1 integer2 ...
lists the internal knots. This list of values must be a nondecreasing sequence of integers within the range of 2 to (s − 1), where s is the season length specified in the LENGTH= option. This is a required option in this statement.
LENGTH=integer
specifies the season length, s. This is a required option in this statement. The length can be any integer greater than or equal to three.
NOEST
fixes the value of the regression coefficient random walk disturbance variance to the value specified in the VARIANCE= option.
OFFSET=integer
specifies the position of the first measurement within the season, if the first measurement is not at the start of the season. The OFFSET= value must be between one and the season length. The default value is one. The first measurement refers to the start of the estimation span and the forecast span. If these spans differ, their starting measurements must be separated by an integer multiple of the season length.
PLOT=FILTER
PLOT=SMOOTH
PLOT=( <FILTER> <SMOOTH> )
requests plots of the season component. When you specify only one plot request, you can omit the parentheses around the plot request. You can use the FILTER and SMOOTH options to plot the filtered and smoothed estimates of the season component.
PRINT=FILTER
PRINT=SMOOTH
PRINT=( <FILTER> <SMOOTH> )
requests the printing of the filtered or smoothed estimate of the spline season component.
RKNOTS=(knot, ..., knot) ... (knot, ..., knot) (Experimental)
specifies a grouping of knots such that the knots within the same group have identical seasonal values. The knots specified in this option must already be present in the list specified by the KNOTS= option. The knot groups must be non-overlapping and without any repeated knots.
VARIANCE=value
specifies an initial value for the regression coefficient random walk disturbance variance during the parameter estimation process. Any nonnegative value, including zero, is an acceptable starting value.
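$$
y_t = \mu_t + \gamma_t + \psi_t + \sum_{j=1}^{m} \beta_j x_{jt} + \epsilon_t, \qquad \epsilon_t \sim \text{i.i.d. } N(0, \sigma_\epsilon^2)
$$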
The terms $\mu_t$, $\gamma_t$, and $\psi_t$ represent the trend, seasonal, and cyclical components, respectively. In fact the model can contain multiple seasons and cycles, and the seasons can be of different types. For simplicity of discussion the preceding model contains only one of each of these components. The regression term, $\sum_{j=1}^{m} \beta_j x_{jt}$, includes the contribution of regression variables with fixed regression coefficients. A model can also contain regression variables that have time-varying regression coefficients or that have a nonlinear relationship with the dependent series (see "Incorporating Predictors of Different Kinds" on page 1793). The disturbance term $\epsilon_t$, also called the irregular component, is usually assumed to be Gaussian white noise. In some cases it is useful to model the irregular component as a stationary ARMA process. See the section "Modeling the Irregular Component" on page 1785 for additional information.
By controlling the presence or absence of various terms and by choosing the proper flavor of the included terms, the UCMs can generate a rich variety of time series patterns. A UCM can be applied to variables after transforming them by transforms such as log and difference.
The components $\mu_t$, $\gamma_t$, and $\psi_t$ model structurally different aspects of the time series. For example, the trend $\mu_t$ models the natural tendency of the series in the absence of any other perturbing effects such as seasonality, cyclical components, and the effects of exogenous variables, while the seasonal component $\gamma_t$ models the correction to the level due to the seasonal effects. These components are assumed to be statistically independent of each other and independent of the irregular component. All of the component models can be thought of as stochastic generalizations of the relevant deterministic patterns in time. This way the deterministic cases emerge as special cases of the stochastic models. The different models available for these unobserved components are discussed next.
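$$
\mu_t = \mu_{t-1} + \eta_t, \qquad \eta_t \sim \text{i.i.d. } N(0, \sigma_\eta^2)
$$
Note that if $\sigma_\eta^2 = 0$, then the model becomes $\mu_t = \text{constant}$. In the LLT model the trend is locally linear, consisting of both the level and slope. The LLT model is
$$
\begin{aligned}
\mu_t &= \mu_{t-1} + \beta_{t-1} + \eta_t, & \eta_t &\sim \text{i.i.d. } N(0, \sigma_\eta^2) \\
\beta_t &= \beta_{t-1} + \xi_t, & \xi_t &\sim \text{i.i.d. } N(0, \sigma_\xi^2)
\end{aligned}
$$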
The disturbances $\eta_t$ and $\xi_t$ are assumed to be independent. There are some interesting special cases of this model obtained by setting one or both of the disturbance variances $\sigma_\eta^2$ and $\sigma_\xi^2$ equal to zero. If $\sigma_\xi^2$ is set equal to zero, then you get a linear trend model with fixed slope. If $\sigma_\eta^2$ is set to zero, then the resulting model usually has a smoother trend. If both the variances are set to zero, then the resulting model is the deterministic linear time trend: $\mu_t = \mu_0 + \beta_0 t$. You can incorporate these trend patterns in your model by using the LEVEL and SLOPE statements.
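The special cases of the LLT model can be checked numerically. The following Python sketch simulates the level and slope recursions; the function name, the seed handling, and the starting values are hypothetical choices made for illustration, not part of the procedure.

```python
import random

def simulate_llt(n, mu0, beta0, var_eta, var_xi, seed=0):
    """Simulate mu_0..mu_n from the locally linear trend recursions."""
    rng = random.Random(seed)
    mu, beta = mu0, beta0
    path = [mu]
    for _ in range(n):
        eta = rng.gauss(0.0, var_eta ** 0.5)   # level disturbance
        xi = rng.gauss(0.0, var_xi ** 0.5)     # slope disturbance
        mu = mu + beta + eta                   # uses the previous slope
        beta = beta + xi
        path.append(mu)
    return path

# Both variances zero: the deterministic line mu_t = mu_0 + beta_0 * t.
print(simulate_llt(5, 1.0, 2.0, 0.0, 0.0))
# Zero slope variance only: a random walk with fixed drift beta_0.
print(simulate_llt(5, 1.0, 2.0, 0.5, 0.0))
```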
Modeling a Cycle
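A deterministic cycle $\psi_t$ with frequency $\lambda$ can be written as
$$
\psi_t = \alpha \cos(\lambda t) + \beta \sin(\lambda t)
$$
If the argument $t$ is measured on a continuous scale, then $\psi_t$ is a periodic function with period $2\pi/\lambda$, amplitude $\gamma = (\alpha^2 + \beta^2)^{1/2}$, and phase $\phi = \tan^{-1}(\beta/\alpha)$. Equivalently, the cycle can be written in terms of the amplitude and phase as
$$
\psi_t = \gamma \cos(\lambda t - \phi)
$$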
Note that when $\psi_t$ is measured only at the integer values, it is not exactly periodic, unless $\lambda = (2\pi j)/k$ for some integers $j$ and $k$. The cycles in their pure form are not used very often in practice. However, they are very useful as building blocks for more complex periodic patterns. It is well known that a periodic pattern of any complexity can be written as a sum of pure cycles of different frequencies and amplitudes. In time series situations it is useful to generalize this simple cyclical pattern to a stochastic cycle that has a fixed period but time-varying amplitude and phase. The stochastic cycle considered here is motivated by the following recursive formula for computing $\psi_t$:
$$
\begin{bmatrix} \psi_t \\ \psi_t^* \end{bmatrix}
=
\begin{bmatrix} \cos\lambda & \sin\lambda \\ -\sin\lambda & \cos\lambda \end{bmatrix}
\begin{bmatrix} \psi_{t-1} \\ \psi_{t-1}^* \end{bmatrix}
$$
starting with $\psi_0 = \alpha$ and $\psi_0^* = \beta$. Note that $\psi_t$ and $\psi_t^*$ satisfy the relation
$$
\psi_t^2 + \psi_t^{*2} = \alpha^2 + \beta^2 \quad \text{for all } t
$$
A stochastic generalization of the cycle $\psi_t$ can be obtained by adding random noise to this recursion and by introducing a damping factor, $\rho$, for additional modeling flexibility. This model can be described as follows:
$$
\begin{bmatrix} \psi_t \\ \psi_t^* \end{bmatrix}
= \rho
\begin{bmatrix} \cos\lambda & \sin\lambda \\ -\sin\lambda & \cos\lambda \end{bmatrix}
\begin{bmatrix} \psi_{t-1} \\ \psi_{t-1}^* \end{bmatrix}
+
\begin{bmatrix} \nu_t \\ \nu_t^* \end{bmatrix}
$$
where $0 \le \rho \le 1$, and the disturbances $\nu_t$ and $\nu_t^*$ are independent $N(0, \sigma_\nu^2)$ variables. The resulting stochastic cycle has a fixed period but time-varying amplitude and phase. The stationarity properties of the random sequence $\psi_t$ depend on the damping factor $\rho$. If $\rho < 1$, $\psi_t$ has a stationary distribution with mean zero and variance $\sigma_\nu^2/(1 - \rho^2)$. If $\rho = 1$, $\psi_t$ is nonstationary.
You can incorporate a cycle in a UCM by specifying a CYCLE statement. You can include multiple cycles in the model by using separate CYCLE statements for each included cycle.
As mentioned before, the cycles are very useful as building blocks for constructing more complex periodic patterns. Periodic patterns of almost any complexity can be created by superimposing cycles of different periods and amplitudes. In particular, the seasonal patterns, general periodic patterns with integer periods, can be constructed as sums of cycles. This important topic of modeling the seasonal components is considered next.
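The cycle recursion can be verified with a short Python sketch. With $\rho = 1$ and no noise, it should reproduce the closed form $\alpha\cos(\lambda t) + \beta\sin(\lambda t)$ and keep $\psi_t^2 + \psi_t^{*2}$ constant. The function names and numeric values are hypothetical, chosen only for illustration.

```python
import math

def cycle_step(psi, psi_star, lam, rho=1.0, nu=0.0, nu_star=0.0):
    """One step of the (damped, noisy) cycle recursion."""
    c, s = math.cos(lam), math.sin(lam)
    new_psi = rho * (c * psi + s * psi_star) + nu
    new_star = rho * (-s * psi + c * psi_star) + nu_star
    return new_psi, new_star

def deterministic_cycle(alpha, beta, lam, n):
    """Return [(psi_t, psi*_t)] for t = 0..n with rho = 1 and no noise."""
    psi, star = alpha, beta
    out = [(psi, star)]
    for _ in range(n):
        psi, star = cycle_step(psi, star, lam)
        out.append((psi, star))
    return out

alpha, beta, lam = 1.5, -0.5, 2.0 * math.pi / 7.0
for t, (psi, star) in enumerate(deterministic_cycle(alpha, beta, lam, 3)):
    closed_form = alpha * math.cos(lam * t) + beta * math.sin(lam * t)
    print(t, round(psi, 6), round(closed_form, 6), round(psi**2 + star**2, 6))
```

Passing a value of `rho` below one together with nonzero `nu` and `nu_star` draws gives the stochastic cycle described above.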
Modeling Seasons
The seasonal fluctuations are a common source of variation in time series data. These fluctuations arise because of the regular changes in seasons or some other periodic events. The seasonal effects are regarded as corrections to the general trend of the series due to the seasonal variations, and these effects sum to zero when summed over the full season cycle. Therefore the seasonal component $\gamma_t$ is modeled as a stochastic periodic pattern of an integer period $s$ such that the sum $\sum_{i=0}^{s-1} \gamma_{t-i}$ is always zero in the mean. The period $s$ is called the season length. Two different models for the seasonal component are considered here. The first model is called the dummy variable form of the seasonal component. It is described by the equation
$$
\sum_{i=0}^{s-1} \gamma_{t-i} = \omega_t, \qquad \omega_t \sim \text{i.i.d. } N(0, \sigma_\omega^2)
$$
The other model is called the trigonometric form of the seasonal component. In this case $\gamma_t$ is modeled as a sum of cycles of different frequencies. This model is given as follows:
$$
\gamma_t = \sum_{j=1}^{[s/2]} \gamma_{j,t}
$$
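where $[s/2]$ equals $s/2$ if $s$ is even and $(s-1)/2$ if it is odd. The cycles $\gamma_{j,t}$ have frequencies $\lambda_j = 2\pi j/s$ and are specified by the matrix equation
$$
\begin{bmatrix} \gamma_{j,t} \\ \gamma_{j,t}^* \end{bmatrix}
=
\begin{bmatrix} \cos\lambda_j & \sin\lambda_j \\ -\sin\lambda_j & \cos\lambda_j \end{bmatrix}
\begin{bmatrix} \gamma_{j,t-1} \\ \gamma_{j,t-1}^* \end{bmatrix}
+
\begin{bmatrix} \omega_{j,t} \\ \omega_{j,t}^* \end{bmatrix}
$$
where the disturbances $\omega_{j,t}$ and $\omega_{j,t}^*$ are assumed to be independent and, for fixed $j$, $\omega_{j,t}$ and $\omega_{j,t}^* \sim N(0, \sigma_\omega^2)$. If $s$ is even, then the equation for $\gamma_{s/2,t}^*$ is not needed and $\gamma_{s/2,t}$ is given by
$$
\gamma_{s/2,t} = -\gamma_{s/2,t-1} + \omega_{s/2,t}
$$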
The cycles $\gamma_{j,t}$ are called harmonics. If the seasonal component is deterministic, the decomposition of the seasonal effects into these harmonics is identical to its Fourier decomposition. In this case the sum of squares of the seasonal factors equals the sum of squares of the amplitudes of these harmonics. In many practical situations, the contribution of the high-frequency harmonics is negligible and can be ignored, giving rise to a simpler description of the seasonal. In the case of stochastic seasonals, the situation might not be so transparent; however, similar considerations still apply. Note that if the disturbance variance $\sigma_\omega^2 = 0$, then both the dummy and the trigonometric forms of seasonal components reduce to constant seasonal effects. That is, the seasonal component reduces to a deterministic function that is completely determined by its first $s - 1$ values.
In the UCM procedure you can specify a seasonal component in a variety of ways, the SEASON statement being the simplest of these. The dummy and the trigonometric seasonal components discussed so far can be considered as saturated seasonal components that put no restrictions on the $s - 1$ seasonal values. In some cases a more parsimonious representation of the seasonal might be more appropriate. This is particularly useful for seasonal components with large season lengths. In the UCM procedure you can obtain parsimonious representations of the seasonal components in one of the following ways:
• Use a subset trigonometric seasonal component obtained by deleting a few of the $[s/2]$ harmonics used in its sum. For example, a slightly smoother seasonal component of length 12, corresponding to the monthly seasonality, can be obtained by deleting the highest-frequency harmonic of period 2. That is, such a seasonal component will be a sum of five stochastic cycles that have periods 12, 6, 4, 3, and 2.4. You can specify such subset seasonal components by using the KEEPH= or DROPH= option in the SEASON statement.
• Approximate the seasonal pattern by a suitable spline approximation. You can do this by using the SPLINESEASON statement.
• A block-seasonal pattern is a seasonal pattern where the pattern is divided into a few blocks of equal length such that the season values within a block are the same, for example, a monthly seasonal pattern that has only four different values, one for each quarter. In some situations a long seasonal pattern can be approximated by the sum of a block season and a simple season, the length of the simple season being equal to the block length of the block season. You can obtain such an approximation by using a combination of BLOCKSEASON and SEASON statements.
• Consider a seasonal component of a large season length as a sum of two or more seasonal components that are each of much smaller season lengths. This can be done by specifying more than one SEASON statement.
Note that the preceding techniques of obtaining parsimonious seasonal components can also enable you to specify seasonal components that are more general than the simple saturated seasonal components. For example, you can specify a saturated trigonometric seasonal component that has some of its harmonics evolving according to one disturbance variance parameter while the others evolve with another disturbance variance parameter.
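The periods quoted for the monthly harmonics follow directly from the frequencies $\lambda_j = 2\pi j/s$, since the period of harmonic $j$ is $2\pi/\lambda_j = s/j$. A minimal Python sketch (the function names are hypothetical) lists them and shows the effect of dropping the highest-frequency harmonic:

```python
def harmonic_periods(s):
    """Periods s/j of the [s/2] harmonics of a season of length s."""
    return [s / j for j in range(1, s // 2 + 1)]

def drop_harmonics(s, dropped):
    """Periods that remain after removing the harmonics numbered in `dropped`."""
    return [s / j for j in range(1, s // 2 + 1) if j not in dropped]

print(harmonic_periods(12))        # all six monthly harmonics
print(drop_harmonics(12, {6}))     # without the period-2 harmonic
```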
Modeling an Autoregression
An autoregression of order one can be thought of as a special case of a cycle when the frequency $\lambda$ is either 0 or $\pi$. Modeling this special case separately helps interpretation and parameter estimation. The autoregression component $r_t$ is modeled as follows:
$$
r_t = \rho\, r_{t-1} + \nu_t, \qquad \nu_t \sim \text{i.i.d. } N(0, \sigma_\nu^2)
$$
where $-1 \le \rho < 1$. An autoregression can also provide an alternative to the IRREGULAR component when the model errors show some autocorrelation. You can incorporate an autoregression in your model by using the AUTOREG statement.
Model Specification
A UCM is specified by describing the components in the model. For example, consider the model
$$
y_t = \mu_t + \gamma_t + \epsilon_t
$$
consisting of the irregular, level, slope, and seasonal components. This model is called the basic structural model (BSM) by Harvey (1989). The syntax for a BSM with monthly seasonality of trigonometric type is as follows:
model y; irregular; level; slope; season length=12 type=trig;
Similarly, the following syntax specifies a BSM with a response variable y, a regressor x, and dummy-type monthly seasonality:
model y = x; irregular; level; slope variance=0 noest; season length=12 type=dummy;
Moreover, the disturbance variance of the slope component is restricted to zero, giving rise to a local linear trend with fixed slope. A model can contain multiple cycle and seasonal components. In such cases the model syntax contains a separate statement for each of these multiple cycle or seasonal components; for example, the syntax for a model containing irregular and level components along with two cycle components could be as follows:
model y = x; irregular; level; cycle; cycle;
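$$
\begin{aligned}
y_t &= Z_t \alpha_t \\
\alpha_{t+1} &= T_t \alpha_t + \zeta_{t+1}, & \zeta_t &\sim N(0, Q_t) \\
\alpha_1 &\sim N(0, P)
\end{aligned}
$$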
The first equation, called the observation equation, relates the response series $y_t$ to a state vector $\alpha_t$ that is usually unobserved. The second equation, called the state equation, describes the evolution of the state vector in time. The system matrices $Z_t$ and $T_t$ are of appropriate dimensions and are known, except possibly for some unknown elements that become part of the parameter vector of the model. The noise series $\zeta_t$ consists of independent, zero-mean, Gaussian vectors with covariance matrices $Q_t$. For most of the UCMs considered here, the system matrices $Z_t$ and $T_t$, and the noise covariances $Q_t$, are time invariant; that is, they do not depend on time. In a few cases, however, some or all of them can depend on time. The initial state vector $\alpha_1$ is assumed to be independent of the noise series, and its covariance matrix $P$ can be partially diffuse. A random vector has a partially diffuse covariance matrix if it can be partitioned such that one part of the vector has a properly defined probability distribution, while the covariance matrix of the other part is infinite; that is, you have no prior information about this part of the vector. The covariance of the initial state $\alpha_1$ is assumed to have the following form:
$$
P = P_* + \kappa P_\infty
$$
where $P_*$ and $P_\infty$ are nonnegative definite, symmetric matrices and $\kappa$ is a constant that is assumed to be close to $\infty$. In the case of UCMs considered here, $P_\infty$ is always a diagonal matrix that consists of zeros and ones, and, if a particular diagonal element of $P_\infty$ is one, then the corresponding row and column in $P_*$ are zero.
The state space formulation of a UCM has many computational advantages. In this formulation there are convenient algorithms for estimating and forecasting the unobserved states $\{\alpha_t\}$ by using the observed series $\{y_t\}$. These algorithms also yield the in-sample and out-of-sample forecasts and the likelihood of $\{y_t\}$. The state space representation of a UCM does not need to be unique.
In the representation used here, the unobserved components in the UCM often appear as elements of the state vector. This makes the elements of the state interpretable and, more important, the sample estimates and forecasts of these unobserved components are easily obtained. For additional information about the computational aspects of the state space modeling, see Durbin and Koopman (2001). Next, some notation is developed to describe the essential quantities computed during the analysis of the state space models.
Let $\{y_t, t = 1, \ldots, n\}$ be the observed sample from a series that satisfies a state space model. Next, for $1 \le t \le n$, let the one-step-ahead forecasts of the series, the states, and their variances be defined as follows, using the usual notation to denote the conditional expectation and conditional variance:
$$
\begin{aligned}
\hat{\alpha}_t &= E(\alpha_t \mid y_1, y_2, \ldots, y_{t-1}) \\
\Gamma_t &= \mathrm{Var}(\alpha_t \mid y_1, y_2, \ldots, y_{t-1}) \\
\hat{y}_t &= E(y_t \mid y_1, y_2, \ldots, y_{t-1}) \\
F_t &= \mathrm{Var}(y_t \mid y_1, y_2, \ldots, y_{t-1})
\end{aligned}
$$
These are also called the filtered estimates of the series and the states. Similarly, for $t \ge 1$, let the following denote the full-sample estimates of the series and the state values at time $t$:
$$
\begin{aligned}
\tilde{\alpha}_t &= E(\alpha_t \mid y_1, y_2, \ldots, y_n) \\
\Delta_t &= \mathrm{Var}(\alpha_t \mid y_1, y_2, \ldots, y_n) \\
\tilde{y}_t &= E(y_t \mid y_1, y_2, \ldots, y_n) \\
G_t &= \mathrm{Var}(y_t \mid y_1, y_2, \ldots, y_n)
\end{aligned}
$$
If the time $t$ is in the historical period, that is, if $1 \le t \le n$, then the full-sample estimates are called the smoothed estimates, and if $t$ lies in the future then they are called out-of-sample forecasts. Note that if $1 \le t \le n$, then $\tilde{y}_t = y_t$ and $G_t = 0$, unless $y_t$ is missing. All the filtered and smoothed estimates ($\hat{\alpha}_t$, $\tilde{\alpha}_t$, ..., $G_t$, and so on) are computed by using the Kalman filtering and smoothing (KFS) algorithm, which is an iterative process. If the initial state is diffuse, as is often the case for the UCMs, its treatment requires modification of the traditional KFS, which is called the diffuse KFS (DKFS). The details of DKFS implemented in the UCM procedure can be found in de Jong and Chu-Chun-Lin (2003). Additional information on the state space models can be found in Durbin and Koopman (2001). The likelihood formulas described in this section are taken from the latter reference.
In the case of diffuse initial condition, the effect of the improper prior distribution of $\alpha_1$ manifests itself in the first few filtering iterations. During these initial filtering iterations the distribution of the filtered quantities remains diffuse; that is, during these iterations the one-step-ahead series and state forecast variances $F_t$ and $\Gamma_t$ have the following form:
$$
\begin{aligned}
F_t &= F_{*t} + \kappa F_{\infty t} \\
\Gamma_t &= \Gamma_{*t} + \kappa \Gamma_{\infty t}
\end{aligned}
$$
The actual number of iterations, say $I$, affected by this improper prior depends on the nature of the vectors $Z_t$, the number of nonzero diagonal elements of $P_\infty$, and the pattern of missing values in the dependent series. After $I$ iterations, $\Gamma_{\infty t}$ and $F_{\infty t}$ become zero and the one-step-ahead series and state forecasts have proper distributions. These first $I$ iterations constitute the initialization phase of the DKFS algorithm. The post-initialization phase of the DKFS and the traditional KFS is the same. In the state space modeling literature the pre-initialization and post-initialization phases are sometimes called pre-collapse and post-collapse phases of the diffuse Kalman filtering. In certain missing value patterns it is possible for $I$ to exceed the sample size; that is, the sample information can be insufficient to create a proper prior for the filtering process. In these cases, parameter estimation and forecasting are done on the basis of this improper prior, and some or all of the series and component forecasts can have infinite variances (or zero precision). The forecasts that have infinite variance are set to missing. The same situation can occur if the specified model contains components that are essentially multicollinear. In these situations no residual analysis is possible; in particular, no residuals-based goodness-of-fit statistics are produced.
The log likelihood of the sample ($L_\infty$), which takes account of this diffuse initialization step, is computed by using the one-step-ahead series forecasts as follows:
$$
L_\infty(y_1, \ldots, y_n) = -\frac{(n-d)}{2} \log 2\pi - \frac{1}{2} \sum_{t=1}^{I} w_t - \frac{1}{2} \sum_{t=I+1}^{n} \left( \log F_t + \frac{\nu_t^2}{F_t} \right)
$$
where $d$ is the number of diffuse elements in the initial state $\alpha_1$, $\nu_t = y_t - Z_t \hat{\alpha}_t$ are the one-step-ahead residuals, and
$$
w_t =
\begin{cases}
\log F_{\infty t} & \text{if } F_{\infty t} > 0 \\
\log F_{*t} + \dfrac{\nu_t^2}{F_{*t}} & \text{if } F_{\infty t} = 0
\end{cases}
$$
If $y_t$ is missing at some time $t$, then the corresponding summand in the log likelihood expression is deleted, and the constant term is adjusted suitably. Moreover, if the initialization step does not complete, that is, if $I$ exceeds the sample size, then the value of $d$ is reduced to the number of diffuse states that are successfully initialized.
The portion of the log likelihood that corresponds to the post-initialization period is called the nondiffuse log likelihood ($L_0$). The nondiffuse log likelihood is given by
$$
L_0(y_1, \ldots, y_n) = -\frac{1}{2} \sum_{t=I+1}^{n} \left( \log F_t + \frac{\nu_t^2}{F_t} \right)
$$
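To make the two likelihoods concrete, the following Python sketch evaluates them for the simplest UCM, a random walk level plus irregular, with no missing values. It uses the standard textbook result for this model that the initialization phase lasts one iteration ($I = 1$, $d = 1$) with $F_{\infty 1} = 1$, so $w_1 = 0$. The function name and the scalar formulation are assumptions made for illustration; this is not the DKFS implementation used by the procedure.

```python
import math

def local_level_loglik(y, var_eps, var_eta):
    """Return (L_inf, L_0) for y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t."""
    n = len(y)
    d = 1                      # one diffuse state element (the level)
    w_sum = 0.0                # w_1 = log F_inf,1 = log 1 = 0
    # After the first observation the level has a proper distribution:
    a = y[0]                   # one-step-ahead forecast of the level
    p = var_eps + var_eta      # variance of that forecast
    quad = 0.0
    for t in range(1, n):      # post-initialization phase, t = I+1..n
        f = p + var_eps        # F_t
        nu = y[t] - a          # one-step-ahead residual
        quad += math.log(f) + nu * nu / f
        k = p / f              # Kalman gain
        a = a + k * nu
        p = p * (1.0 - k) + var_eta
    l0 = -0.5 * quad
    l_inf = -0.5 * (n - d) * math.log(2.0 * math.pi) - 0.5 * w_sum + l0
    return l_inf, l0

print(local_level_loglik([1.0, 2.0, 1.5, 2.5], 1.0, 1.0))
```

In this example the diffuse part does not depend on the variance parameters, so maximizing either likelihood gives the same estimates.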
In the case of UCMs considered in PROC UCM, it often happens that the diffuse part of the likelihood, $\sum_{t=1}^{I} w_t$, does not depend on the model parameters, and in these cases the maximization of nondiffuse and diffuse likelihoods is equivalent. However, in some cases, such as when the model consists of dependent lags, the diffuse part does depend on the model parameters. In these cases the maximization of the diffuse and nondiffuse likelihood can produce different parameter estimates.
In some situations it is convenient to reparameterize the nondiffuse initial state covariance $P_*$ as $\sigma^2 P_*$ and the state noise covariance $Q_t$ as $\sigma^2 Q_t$ for some common scalar parameter $\sigma^2$. In this case the preceding log-likelihood expression, up to a constant, can be written as
$$
L_\infty(y_1, \ldots, y_n) = -\frac{1}{2} \sum_{t=1}^{I} w_t - \frac{1}{2} \sum_{t=I+1}^{n} \log F_t - \frac{1}{2\sigma^2} \sum_{t=I+1}^{n} \frac{\nu_t^2}{F_t} - \frac{(n-d)}{2} \log \sigma^2
$$
The maximum likelihood estimate of $\sigma^2$ can be shown to be
$$
\hat{\sigma}^2 = \frac{1}{(n-d)} \sum_{t=I+1}^{n} \frac{\nu_t^2}{F_t}
$$
When this expression of $\sigma^2$ is substituted back into the likelihood formula, an expression called the profile likelihood ($L_{profile}$) of the data is obtained:
$$
-2 L_{profile}(y_1, \ldots, y_n) = \sum_{t=1}^{I} w_t + \sum_{t=I+1}^{n} \log F_t + (n-d) \log\left( \sum_{t=I+1}^{n} \frac{\nu_t^2}{F_t} \right)
$$
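The scale estimate and the profile likelihood are simple functions of the post-initialization residuals and variances. A minimal Python sketch, assuming those quantities are already available from a filtering pass (the function name and argument layout are hypothetical):

```python
import math

def profile_likelihood(nus, fs, n_minus_d, w_sum=0.0):
    """Return (sigma2_hat, -2 * L_profile).

    nus, fs   : residuals nu_t and variances F_t for t = I+1..n
    n_minus_d : sample size minus the number of diffuse elements
    w_sum     : the diffuse part, sum of w_t for t = 1..I
    """
    s = sum(nu * nu / f for nu, f in zip(nus, fs))
    sigma2_hat = s / n_minus_d
    minus2_lprof = w_sum + sum(math.log(f) for f in fs) + n_minus_d * math.log(s)
    return sigma2_hat, minus2_lprof

print(profile_likelihood([1.0, 2.0], [1.0, 4.0], n_minus_d=2))
```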
In some situations the parameter estimation is done by optimizing the profile likelihood (see the section "Parameter Estimation by Profile Likelihood Optimization" on page 1798 and the PROFILE option in the ESTIMATE statement).
In the remainder of this section the state space formulation of UCMs is further explained by using some particular UCMs as examples. The examples show that the state space formulation of the UCMs depends on the components in the model in a simple fashion; for example, the system matrix $T$ is usually a block diagonal matrix with blocks that correspond to the components in the model. The only exception to this pattern is the UCMs that consist of the lags of the dependent variable. This case is considered at the end of the section.
In what follows, $\mathrm{Diag}[a, b, \ldots]$ denotes a diagonal matrix with diagonal entries $[a, b, \ldots]$, and the transpose of a matrix $T$ is denoted as $T'$.
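$$
\begin{aligned}
y_t &= \mu_t + \epsilon_t \\
\mu_t &= \mu_{t-1} + \beta_{t-1} + \eta_t \\
\beta_t &= \beta_{t-1} + \xi_t
\end{aligned}
$$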
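Here $y_t$ is the response series and $\epsilon_t$, $\eta_t$, and $\xi_t$ are independent, zero-mean Gaussian disturbance sequences with variances $\sigma_\epsilon^2$, $\sigma_\eta^2$, and $\sigma_\xi^2$, respectively. This model can be formulated as a state space model where the state vector $\alpha_t = [\,\epsilon_t \ \mu_t \ \beta_t\,]'$ and the state noise $\zeta_t = [\,\epsilon_t \ \eta_t \ \xi_t\,]'$. Note that the elements of the state vector are precisely the unobserved components in the model. The system matrices $T$ and $Z$ and the noise covariance $Q$ corresponding to this choice of state and state noise vectors can be seen to be time invariant and are given by
$$
Z = [\,1 \ 1 \ 0\,], \quad
T = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}
\quad \text{and} \quad
Q = \mathrm{Diag}\left[ \sigma_\epsilon^2, \sigma_\eta^2, \sigma_\xi^2 \right]
$$
The distribution of the initial state vector $\alpha_1$ is diffuse, with $P_* = \mathrm{Diag}\left[ \sigma_\epsilon^2, 0, 0 \right]$ and $P_\infty = \mathrm{Diag}[0, 1, 1]$. The parameter vector $\theta$ consists of all the disturbance variances, that is, $\theta = (\sigma_\epsilon^2, \sigma_\eta^2, \sigma_\xi^2)$.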
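A few lines of Python show how these matrices reproduce the trend equations: applying $T$ to the state advances the level by the slope and resets the irregular, and $Z$ picks out irregular plus level. The helper names and the numeric state are hypothetical.

```python
Z = [1.0, 1.0, 0.0]
T = [[0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0]]

def mat_vec(m, v):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def llt_step(state, noise=(0.0, 0.0, 0.0)):
    """alpha_{t+1} = T alpha_t + zeta_{t+1} for state [eps, mu, beta]."""
    return [x + e for x, e in zip(mat_vec(T, state), noise)]

def observe(state):
    """y_t = Z alpha_t."""
    return sum(z * a for z, a in zip(Z, state))

state = [0.5, 10.0, 2.0]        # irregular, level, slope
print(observe(state))           # irregular + level
print(llt_step(state))          # level advanced by the slope
```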
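A short season length, season length = 4 (quarterly seasonality), is considered here. The pattern for longer season lengths such as 12 (monthly) and 52 (weekly) is easy to see. Let us first consider the dummy form of seasonality. In this case the state and state noise vectors are $\alpha_t = [\,\epsilon_t \ \mu_t \ \beta_t \ \gamma_{1,t} \ \gamma_{2,t} \ \gamma_{3,t}\,]'$ and $\zeta_t = [\,\epsilon_t \ \eta_t \ \xi_t \ \omega_t \ 0 \ 0\,]'$, respectively. The first three elements of the state vector are the irregular, level, and slope components, respectively. The remaining elements, $\gamma_{i,t}$, are lagged versions of the seasonal component $\gamma_t$. $\gamma_{1,t}$ corresponds to lag zero, that is, the same as $\gamma_t$, $\gamma_{2,t}$ to lag 1 and $\gamma_{3,t}$ to lag 2. The system matrices are
$$
Z = [\,1 \ 1 \ 0 \ 1 \ 0 \ 0\,], \quad
T = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & -1 & -1 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}
$$
and $Q = \mathrm{Diag}\left[ \sigma_\epsilon^2, \sigma_\eta^2, \sigma_\xi^2, \sigma_\omega^2, 0, 0 \right]$. The distribution of the initial state vector $\alpha_1$ is diffuse, with $P_* = \mathrm{Diag}\left[ \sigma_\epsilon^2, 0, 0, 0, 0, 0 \right]$ and $P_\infty = \mathrm{Diag}[0, 1, 1, 1, 1, 1]$.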
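The seasonal block of $T$ (rows and columns 4 to 6) can be exercised on its own. With zero disturbance it cycles through four seasonal effects that sum to zero, as the following Python sketch shows; the function name and starting values are hypothetical.

```python
def season_step(g, omega=0.0):
    """Apply the 3x3 seasonal block [[-1,-1,-1],[1,0,0],[0,1,0]] to g.

    g = [gamma_t, gamma_{t-1}, gamma_{t-2}] for a season of length 4.
    """
    g1, g2, g3 = g
    return [-(g1 + g2 + g3) + omega, g1, g2]

state = [3, -1, -2]
for _ in range(4):
    state = season_step(state)
    print(state)
```

After four steps the state returns to its starting value, and any four consecutive values of the first element sum to zero.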
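In the case of the trigonometric type of seasonality, $\alpha_t = [\,\epsilon_t \ \mu_t \ \beta_t \ \gamma_{1,t} \ \gamma_{1,t}^* \ \gamma_{2,t}\,]'$ and $\zeta_t = [\,\epsilon_t \ \eta_t \ \xi_t \ \omega_{1,t} \ \omega_{1,t}^* \ \omega_{2,t}\,]'$. The disturbance sequences, $\omega_{j,t}$, $1 \le j \le 2$, and $\omega_{1,t}^*$, are independent, zero-mean, Gaussian sequences with variance $\sigma_\omega^2$. The system matrices are
$$
Z = [\,1 \ 1 \ 0 \ 1 \ 0 \ 1\,], \quad
T = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & \cos\lambda_1 & \sin\lambda_1 & 0 \\
0 & 0 & 0 & -\sin\lambda_1 & \cos\lambda_1 & 0 \\
0 & 0 & 0 & 0 & 0 & \cos\lambda_2
\end{bmatrix}
$$
and $Q = \mathrm{Diag}\left[ \sigma_\epsilon^2, \sigma_\eta^2, \sigma_\xi^2, \sigma_\omega^2, \sigma_\omega^2, \sigma_\omega^2 \right]$. Here $\lambda_j = (2\pi j)/4$. The distribution of the initial state vector $\alpha_1$ is diffuse, with $P_* = \mathrm{Diag}\left[ \sigma_\epsilon^2, 0, 0, 0, 0, 0 \right]$ and $P_\infty = \mathrm{Diag}[0, 1, 1, 1, 1, 1]$. The parameter vector in both the cases is $\theta = (\sigma_\epsilon^2, \sigma_\eta^2, \sigma_\xi^2, \sigma_\omega^2)$.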
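to lag $2m$. All the system matrices are time invariant, except the matrix $T$. They can be seen to be $Z = [\,1 \ 1 \ 0 \ 1 \ 0 \ 0\,]$, $Q = \mathrm{Diag}\left[ \sigma_\epsilon^2, \sigma_\eta^2, \sigma_\xi^2, \sigma_\omega^2, 0, 0 \right]$, and
$$
T_t = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & -1 & -1 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}
$$
when $t$ is a multiple of the block size $m$, and
$$
T_t = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
$$
otherwise. Note that when $t$ is not a multiple of $m$, the portion of the $T_t$ matrix corresponding to the seasonal is identity. The distribution of the initial state vector $\alpha_1$ is diffuse, with $P_* = \mathrm{Diag}\left[ \sigma_\epsilon^2, 0, 0, 0, 0, 0 \right]$ and $P_\infty = \mathrm{Diag}[0, 1, 1, 1, 1, 1]$.
Similarly in the case of the trigonometric form of seasonality, $\alpha_t = [\,\epsilon_t \ \mu_t \ \beta_t \ \gamma_{1,t} \ \gamma_{1,t}^* \ \gamma_{2,t}\,]'$ and $\zeta_t = [\,\epsilon_t \ \eta_t \ \xi_t \ \omega_{1,t} \ \omega_{1,t}^* \ \omega_{2,t}\,]'$. The disturbance sequences, $\omega_{j,t}$, $1 \le j \le 2$, and $\omega_{1,t}^*$, are independent, zero-mean, Gaussian sequences with variance $\sigma_\omega^2$. $Z = [\,1 \ 1 \ 0 \ 1 \ 0 \ 1\,]$, $Q = \mathrm{Diag}\left[ \sigma_\epsilon^2, \sigma_\eta^2, \sigma_\xi^2, \sigma_\omega^2, \sigma_\omega^2, \sigma_\omega^2 \right]$, and
$$
T_t = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & \cos\lambda_1 & \sin\lambda_1 & 0 \\
0 & 0 & 0 & -\sin\lambda_1 & \cos\lambda_1 & 0 \\
0 & 0 & 0 & 0 & 0 & \cos\lambda_2
\end{bmatrix}
$$
when $t$ is a multiple of the block size $m$, and
$$
T_t = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
$$
otherwise. As before, when $t$ is not a multiple of $m$, the portion of the $T_t$ matrix corresponding to the seasonal is identity. Here $\lambda_j = (2\pi j)/4$. The distribution of the initial state vector $\alpha_1$ is diffuse, with $P_* = \mathrm{Diag}\left[ \sigma_\epsilon^2, 0, 0, 0, 0, 0 \right]$ and $P_\infty = \mathrm{Diag}[0, 1, 1, 1, 1, 1]$. The parameter vector in both the cases is $\theta = (\sigma_\epsilon^2, \sigma_\eta^2, \sigma_\xi^2, \sigma_\omega^2)$.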
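The time-varying $T_t$ can be mimicked in a few lines of Python. The sketch below tracks only the dummy-form seasonal block with zero disturbance and shows that the seasonal value stays constant within each block of length $m$. The function name and values are hypothetical, and the timing convention (state at time $t$ is updated by $T_t$ to give the state at $t+1$) follows the state equation given earlier.

```python
def block_season_path(g, m, n):
    """Seasonal values gamma_1..gamma_n for a block season with 4 blocks.

    g : starting seasonal state [lag 0, lag m, lag 2m]
    m : block size
    """
    state = list(g)
    path = []
    for t in range(1, n + 1):
        path.append(state[0])              # seasonal value at time t
        if t % m == 0:
            # T_t uses the dummy-seasonal block at block boundaries
            state = [-sum(state), state[0], state[1]]
        # otherwise the seasonal block of T_t is the identity
    return path

print(block_season_path([3, -1, -2], m=2, n=8))
```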
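$$
\begin{bmatrix} \psi_t \\ \psi_t^* \end{bmatrix}
= \rho
\begin{bmatrix} \cos\lambda & \sin\lambda \\ -\sin\lambda & \cos\lambda \end{bmatrix}
\begin{bmatrix} \psi_{t-1} \\ \psi_{t-1}^* \end{bmatrix}
+
\begin{bmatrix} \nu_t \\ \nu_t^* \end{bmatrix}
$$
where $\nu_t$ and $\nu_t^*$ are independent, zero-mean, Gaussian disturbances with variance $\sigma_\nu^2$. In what follows, a state space form for a model consisting of such a stochastic cycle and an irregular component is given. The state vector $\alpha_t = [\,\epsilon_t \ \psi_t \ \psi_t^*\,]'$, and the state noise vector $\zeta_t = [\,\epsilon_t \ \nu_t \ \nu_t^*\,]'$. The system matrices are
$$
Z = [\,1 \ 1 \ 0\,], \quad
T = \begin{bmatrix}
0 & 0 & 0 \\
0 & \rho\cos\lambda & \rho\sin\lambda \\
0 & -\rho\sin\lambda & \rho\cos\lambda
\end{bmatrix}, \quad
Q = \mathrm{Diag}\left[ \sigma_\epsilon^2, \sigma_\nu^2, \sigma_\nu^2 \right]
$$
The distribution of the initial state vector $\alpha_1$ is proper, with $P_* = \mathrm{Diag}\left[ \sigma_\epsilon^2, \sigma_\psi^2, \sigma_\psi^2 \right]$, where $\sigma_\psi^2 = \sigma_\nu^2 (1 - \rho^2)^{-1}$. The parameter vector $\theta = (\sigma_\epsilon^2, \rho, \lambda, \sigma_\nu^2)$.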
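An autoregression $r_t$ can be considered as a special case of a cycle with frequency $\lambda$ equal to $0$ or $\pi$. In this case the equation for $\psi^*_t$ is not needed. Therefore, for a UCM consisting of an autoregressive component and an irregular component, the state space model simplifies to the following form. The state vector is $\alpha_t = [\epsilon_t \;\; r_t]'$, and the state noise vector is $\zeta_t = [\epsilon_t \;\; \nu_t]'$. The system matrices are

\[
Z = [1 \;\; 1], \qquad
T = \begin{bmatrix} 0 & 0 \\ 0 & \rho \end{bmatrix}, \qquad
Q = \mathrm{Diag}[\sigma_\epsilon^2, \sigma_\nu^2]
\]

The distribution of the initial state vector $\alpha_1$ is proper, with $P_* = \mathrm{Diag}[\sigma_\epsilon^2, \sigma_r^2]$, where $\sigma_r^2 = \sigma_\nu^2 (1 - \rho^2)^{-1}$. The parameter vector is $\theta = (\sigma_\epsilon^2, \rho, \sigma_\nu^2)$.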
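The form of the initial variance of the autoregressive component follows from stationarity. As a short check, assume $|\rho| < 1$ and write the autoregression as $r_t = \rho\, r_{t-1} + \nu_t$, with $\nu_t$ independent of $r_{t-1}$ and $\mathrm{Var}(\nu_t) = \sigma_\nu^2$ (this notation is assumed here to mirror the cycle case). Requiring the variance to be the same at times $t-1$ and $t$ gives:

```latex
\begin{aligned}
\sigma_r^2 = \operatorname{Var}(r_t)
  &= \rho^2 \operatorname{Var}(r_{t-1}) + \operatorname{Var}(\nu_t)
   = \rho^2 \sigma_r^2 + \sigma_\nu^2 \\
\Longrightarrow \quad \sigma_r^2 (1 - \rho^2) &= \sigma_\nu^2
\quad \Longrightarrow \quad
\sigma_r^2 = \sigma_\nu^2 (1 - \rho^2)^{-1}
\end{aligned}
```

The same argument applied to each element of the damped cycle recursion yields the corresponding initial variance of the cycle.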
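$x$, $u_1$, $u_2$, and $v$. Let us assume that $x$ is a simple regressor specified in the MODEL statement, $u_1$ and $u_2$ are random regressors with time-varying regression coefficients that are specified in the same RANDOMREG statement, and $v$ is a nonlinear regressor specified in a SPLINEREG statement. Let us further assume that the spline associated with $v$ has degree one and is based on two internal knots. As explained in the section "SPLINEREG Statement" on page 1778, using $v$ is equivalent to using $(nknots + degree) = (2 + 1) = 3$ derived (random) regressors: say, $s_1, s_2, s_3$. In all there are $(1 + 2 + 3) = 6$ regressors, the first one being a simple regressor and the others being time-varying coefficient regressors. The time-varying regressors are in two groups, the first consisting of $u_1$ and $u_2$ and the other consisting of $s_1$, $s_2$, and $s_3$. The dynamics of this model are as follows:

\[
\begin{aligned}
y_t &= \mu_t + \beta x_t + \kappa_{1t} u_{1t} + \kappa_{2t} u_{2t} + \sum_{i=1}^{3} \gamma_{it} s_{it} + \epsilon_t \\
\mu_t &= \mu_{t-1} + \eta_t \\
\kappa_{1t} &= \kappa_{1(t-1)} + \xi_{1t} \\
\kappa_{2t} &= \kappa_{2(t-1)} + \xi_{2t} \\
\gamma_{1t} &= \gamma_{1(t-1)} + \zeta_{1t} \\
\gamma_{2t} &= \gamma_{2(t-1)} + \zeta_{2t} \\
\gamma_{3t} &= \gamma_{3(t-1)} + \zeta_{3t}
\end{aligned}
\]

All the disturbances $\epsilon_t, \eta_t, \xi_{1t}, \xi_{2t}, \zeta_{1t}, \zeta_{2t}$, and $\zeta_{3t}$ are independent, zero-mean, Gaussian variables, where $\xi_{1t}, \xi_{2t}$ share a common variance parameter $\sigma_\xi^2$ and $\zeta_{1t}, \zeta_{2t}, \zeta_{3t}$ share a common variance $\sigma_\zeta^2$. These dynamics can be captured in the state space form by taking state $\alpha_t = [\epsilon_t \;\; \mu_t \;\; \beta \;\; \kappa_{1t} \;\; \kappa_{2t} \;\; \gamma_{1t} \;\; \gamma_{2t} \;\; \gamma_{3t}]'$, state disturbance $\zeta_t = [\epsilon_t \;\; \eta_t \;\; 0 \;\; \xi_{1t} \;\; \xi_{2t} \;\; \zeta_{1t} \;\; \zeta_{2t} \;\; \zeta_{3t}]'$, and system matrices

\[
\begin{aligned}
Z_t &= [1 \;\; 1 \;\; x_t \;\; u_{1t} \;\; u_{2t} \;\; s_{1t} \;\; s_{2t} \;\; s_{3t}] \\
T &= \mathrm{Diag}[0, 1, 1, 1, 1, 1, 1, 1] \\
Q &= \mathrm{Diag}[\sigma_\epsilon^2, \sigma_\eta^2, 0, \sigma_\xi^2, \sigma_\xi^2, \sigma_\zeta^2, \sigma_\zeta^2, \sigma_\zeta^2]
\end{aligned}
\]

Note that the regression coefficients are elements of the state vector and that the system vector $Z_t$ is not time invariant. The distribution of the initial state vector $\alpha_1$ is diffuse, with $P_* = \mathrm{Diag}[\sigma_\epsilon^2, 0, 0, 0, 0, 0, 0, 0]$ and $P_\infty = \mathrm{Diag}[0, 1, 1, 1, 1, 1, 1, 1]$. The parameters of this model are the disturbance variances, $\sigma_\epsilon^2$, $\sigma_\eta^2$, $\sigma_\xi^2$, and $\sigma_\zeta^2$, which are estimated by maximizing the likelihood. The regression coefficients, time-invariant $\beta$ and time-varying $\kappa_{1t}, \kappa_{2t}, \gamma_{1t}, \gamma_{2t}$, and $\gamma_{3t}$, are implicitly estimated during the state estimation (smoothing).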
Reporting Parameter Estimates for Random Regressors
If the random walk disturbance variance associated with a random regressor is held fixed at zero, then its coefficient is no longer time-varying. In the UCM procedure the random regressor parameter estimates are reported differently in that case. The following points explain how the parameter estimates are reported in the parameter estimates table and in the OUTEST= data set.

- If the random walk disturbance variance associated with a random regressor is not held fixed, then its estimate is reported in the parameter estimates table and in the OUTEST= data set.
- If more than one random regressor is specified in a RANDOMREG statement, then the first regressor in the list is used as a representative of the list when the corresponding common variance parameter estimate is reported.
- If the random walk disturbance variance is held fixed at zero, then the parameter estimates table and the OUTEST= data set contain the corresponding regression parameter estimate rather than the variance parameter estimate.
- Similar considerations apply in the case of the derived random regressors associated with a spline-regressor.
\[
\begin{aligned}
y_t &= \sum_{i=1}^{k} \phi_i y_{t-i} + \mu_t + \beta_1 x_{1t} + \beta_2 x_{2t} + \epsilon_t \\
\mu_t &= \mu_{t-1} + \eta_t
\end{aligned}
\]

The state space form of this augmented model can be described in terms of the state space form of a model that has random walk trend with two simple time-invariant regressors. A superscript dagger ($\dagger$) has been added to distinguish the augmented model state space entities from the corresponding entities of the state space form of the random walk with predictors model. With this notation, the state vector of the augmented model is $\alpha^\dagger_t = [\alpha'_t \;\; y_t \;\; y_{t-1} \;\; \ldots \;\; y_{t-k+1}]'$ and the new state noise vector is $\zeta^\dagger_t = [\zeta'_t \;\; u_t \;\; 0 \;\; \ldots \;\; 0]'$, where $u_t$ is the matrix product $Z_t \zeta_t$. Note that the length of the new state vector is $k + \mathrm{length}(\alpha_t) = k + 4$. The new system matrices, in block form, are

\[
Z^\dagger_t = [0 \;\; 0 \;\; 0 \;\; 0 \;\; 1 \;\; \ldots \;\; 0], \qquad
T^\dagger_t = \begin{bmatrix}
T_t & 0 & \ldots & 0 \\
Z_{t+1} T_t & \phi_1 & \ldots & \phi_k \\
0 & I_{k-1,k-1} & & 0
\end{bmatrix}
\]

where $I_{k-1,k-1}$ is the $(k-1)$-dimensional identity matrix, and

\[
Q^\dagger_t = \begin{bmatrix}
Q_t & Q_t Z'_t & 0 \\
Z_t Q_t & Z_t Q_t Z'_t & 0 \\
0 & 0 & 0
\end{bmatrix}
\]

Note that the $T$ and $Q$ matrices of the random walk with predictors model are time invariant, and in the expressions above their time indices are kept because they illustrate the pattern for more general models. The initial state vector is diffuse, with

\[
P^\dagger_* = \begin{bmatrix} P_* & 0 \\ 0 & 0 \end{bmatrix}, \qquad
P^\dagger_\infty = \begin{bmatrix} P_\infty & 0 \\ 0 & I_{k,k} \end{bmatrix}
\]

The parameters of this model are the disturbance variances $\sigma_\epsilon^2$ and $\sigma_\eta^2$, the lag coefficients $\phi_1, \phi_2, \ldots, \phi_k$, and the regression coefficients $\beta_1$ and $\beta_2$. As before, the regression coefficients are estimated during the state smoothing, and the other parameters are estimated by maximizing the likelihood.
Outlier Detection
In time series analysis it is often useful to detect changes over time in the characteristics of the response series. In the UCM procedure you can search for two types of changes, additive outliers (AO) and level shifts (LS). An additive outlier is an unusual value in the series, the cause of which might be a data recording error or a temporary shock to the series generation process. A level shift represents a permanent shift, either up or down, in the level of the series. You can control different aspects of the outlier search, such as the significance level of the reported outliers, by choosing different options in the OUTLIER statement. The search for AOs is done by default, whereas the CHECKBREAK option in the LEVEL statement must be used to turn on the search for LSs.

The outlier detection process implemented in the UCM procedure is based on de Jong and Penzer (1998). In this approach the fitted model is taken to be the null model, and the series values and level shifts that are not adequately accounted for by the null model are flagged as outliers. The unusualness of a response series value at a particular time point $t_0$, with respect to the fitted model, can be judged by estimating its value based on the rest of the data (that is, the series obtained by deleting the series value at $t_0$) and comparing the estimated value to the observed value. If the difference between the estimated and observed values is statistically significant, then such a value can be regarded as an AO. Note that this difference between the estimated and observed values is also the regression coefficient of a dummy regressor that takes the value 1.0 at $t_0$ and is 0.0 elsewhere, assuming such a regressor is added to the null model. In this way the series value at $t_0$ is regarded as an AO if the regression coefficient of this dummy regressor is significant. Similarly, you can say that a level shift has occurred at a time point $t_0$ if the regression coefficient of a regressor, which is 0.0 before $t_0$ and 1.0 at $t_0$ and thereafter, is statistically significant. De Jong and Penzer (1998) provide an efficient way to compute such AO and LS regression coefficients and their standard errors at all time points in the series. The outlier summary table, which is produced by default, simply lists the most statistically significant candidates among these.
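The equivalence between the leave-one-out comparison and the dummy-regressor coefficient can be checked numerically in the simplest possible setting. The following Python sketch is only an illustration and is not the de Jong and Penzer (1998) algorithm used by PROC UCM: it assumes a constant-level null model fitted by ordinary least squares, and the series values and the helper name `ao_dummy_coefficient` are made up for the example.

```python
def ao_dummy_coefficient(y, t0):
    """OLS fit of y_t = mu + delta * d_t + e_t, where d_t is 1 at t0 and 0 elsewhere.

    Hypothetical helper for illustration; assumes a constant-level null model.
    Returns (mu_hat, delta_hat).
    """
    n = len(y)
    d = [1.0 if t == t0 else 0.0 for t in range(n)]
    # Normal equations for the two regressors (intercept, dummy):
    #   [ n      sum(d)   ] [mu   ]   [ sum(y)   ]
    #   [ sum(d) sum(d*d) ] [delta] = [ sum(d*y) ]
    a11, a12, a22 = float(n), sum(d), sum(di * di for di in d)
    b1 = sum(y)
    b2 = sum(di * yi for di, yi in zip(d, y))
    det = a11 * a22 - a12 * a12
    mu_hat = (b1 * a22 - a12 * b2) / det
    delta_hat = (a11 * b2 - a12 * b1) / det
    return mu_hat, delta_hat


# Made-up series with an unusual value at index 4.
series = [10.2, 9.8, 10.1, 9.9, 14.0, 10.0, 10.3, 9.7]
t0 = 4

mu_hat, delta_hat = ao_dummy_coefficient(series, t0)

# Leave-one-out estimate of the value at t0 under the same null model:
# the mean of all the other observations.
rest = [v for i, v in enumerate(series) if i != t0]
loo_estimate = sum(rest) / len(rest)

# The dummy coefficient equals (observed value) - (leave-one-out estimate).
print(delta_hat, series[t0] - loo_estimate)
```

In this toy model the two printed numbers agree up to rounding error. Judging whether the coefficient is significant additionally requires its standard error, which the sketch does not compute.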
Missing Values
Embedded missing values in the dependent variable usually cause no problems in UCM modeling. However, no embedded missing values are allowed in the predictor variables. Certain patterns of missing values in the dependent variable can lead to failure of the initialization step of the diffuse Kalman filtering for some models. For example, if in a monthly series all values are missing for a certain month (say, May), then a BSM with monthly seasonality leads to such a situation. However, in this case the initialization step can complete successfully for a nonseasonal model such as the local linear model.
Parameter Estimation
The parameter vector in a UCM consists of the variances of the disturbance terms of the unobserved components, the damping coefficients and frequencies in the cycles, the damping coefficient in the autoregression, the lag coefficients of the dependent lags, and the regression coefficients in the regression terms. The regression coefficients are always part of the state vector and are estimated by state smoothing. The remaining parameters are estimated by maximizing either the full diffuse likelihood or the nondiffuse likelihood. The decision to use the full diffuse likelihood or the nondiffuse likelihood depends on the presence or absence of the dependent lag coefficients in the parameter vector. If the parameter vector does not contain any dependent lag coefficients, then the full diffuse likelihood is used. If, on the other hand, the parameter vector does contain some dependent lag coefficients, then the parameters are estimated by maximizing the nondiffuse likelihood, because the optimization of the full diffuse likelihood is often unstable in that case. In this sense, when the parameter vector contains dependent lag coefficients, the parameter estimates are not true maximum likelihood estimates.

The optimization of the likelihood, either full or nondiffuse, is carried out by using one of several nonlinear optimization algorithms. You can control many aspects of the optimization process by using the NLOPTIONS statement and by providing the starting values of the parameters while specifying the corresponding components. However, in most cases the default settings work quite well. The optimization process is not guaranteed to converge to a maximum likelihood estimate. In most cases the difficulties in parameter estimation are associated with the specification of a model that is not appropriate for the series being modeled.

Note that when the parameter estimation is done by optimizing the profile likelihood, the interpretation of the variance parameters that are held fixed at nonzero values changes. In the presence of the PROFILE option, the disturbance variances that are held at a fixed value by using the NOEST option in their respective component statements are interpreted as being restricted to be that fixed multiple of the profiled variance rather than being fixed at that nominal value. That is, implicitly, the parameter estimation is done under the restriction of holding the disturbance variance ratio fixed at a given value rather than the disturbance variance itself. See Example 29.5 for an example of this type of restriction to obtain a UC model that is equivalent to the famous Hodrick-Prescott filter.
t values
The t values reported in the table of parameter estimates are approximations whose accuracy depends on the validity of the model, the nature of the model, and the length of the observed series. The distributional properties of the maximum likelihood estimates of general unobserved components models have not been explored fully; therefore the probability values that correspond to a t distribution should be interpreted carefully, as they can be misleading. This is particularly true if the parameters in question are close to the boundary of the parameter space. The two sources by Harvey (1989, 2001) are good references for information about this topic. For some parameters, such as the cycle period, the reported t values are uninformative because comparison of the estimated parameter with zero is never needed. In such cases the t values and the corresponding probability values should be ignored.
Computational Issues
Convergence Problems
As explained in the section "Parameter Estimation" on page 1797, the model parameters are estimated by nonlinear optimization of the likelihood. This process is not guaranteed to succeed. For some data sets, the optimization algorithm can fail to converge. Nonconvergence can result from a number of causes, including flat or ridged likelihood surfaces and ill-conditioned data. It is also possible for the algorithm to converge to a point that is not the global optimum of the likelihood. If you experience convergence problems, the following points might be helpful:

- Data that are extremely large or extremely small can adversely affect results because of the internal tolerances used during the filtering steps of the likelihood calculation. Rescaling the data can improve stability.
- Examine your model for redundancies in the included components and regressors. If some of the included components or regressors are nearly collinear to each other, then the optimization process can become unstable.
- Experimenting with different options offered by the NLOPTIONS statement can help.
- Lack of convergence can indicate model misspecification or a violation of the normality assumption.
Displayed Output
The default printed output produced by the UCM procedure is described in the following list:

- brief information about the input data set, including the data set name and label, and the name of the ID variable specified in the ID statement
- summary statistics for the data in the estimation and forecast spans, including the names of the variables in the model, their categorization as dependent or predictor, the index of the beginning and ending observations in the spans, the total number of observations and the number of missing observations, the smallest and largest measurements, and the mean and standard deviation
- information about the model parameters at the start of the model-fitting stage, including the fixed parameters in the model and the initial estimates of the free parameters in the model
- convergence status of the likelihood optimization process if any parameter estimation is done
- estimates of the free parameters at the end of the model-fitting stage, including the parameter estimates, their approximate standard errors, t statistics, and the approximate p-values
- the likelihood-based goodness-of-fit statistics, including the full likelihood, the portion of the likelihood corresponding to the diffuse initialization, the sum of squares of residuals normalized by their standard errors, and the information criteria: AIC, AICC, HQIC, BIC, and CAIC
- the fit statistics that are based on the raw residuals (observed minus predicted), including the mean squared error (MSE), the root mean squared error (RMSE), the mean absolute percentage error (MAPE), the maximum percentage error (MAXPE), the R square, the adjusted R square, the random walk R square, and Amemiya's R square
- the significance analysis of the components included in the model that is based on the estimation span
- brief information about the components included in the model
- additive outliers in the series, if any are detected
- the multistep series forecasts
- post-sample-prediction analysis table that compares the multistep forecasts with the observed series values, if the BACK= option is used in the FORECAST statement
Statistical Graphics
This section provides information about the basic ODS statistical graphics produced by the UCM procedure. To request graphics with PROC UCM, you must first enable ODS Graphics by specifying the ODS GRAPHICS ON; statement. See Chapter 21, "Statistical Graphics Using ODS" (SAS/STAT User's Guide), for more information.

You can obtain most plots relevant to the specified model by using the global PLOTS= option in the PROC UCM statement. The plot of series forecasts in the forecast horizon is produced by default. You can further control the production of individual plots by using the PLOT= options in the different statements. The main types of plots available are as follows:

- Time series plots of the component estimates, either filtered or smoothed, can be requested by using the PLOT= option in the respective component statements. For example, the use of the PLOT=SMOOTH option in a CYCLE statement produces a plot of the smoothed estimate of that cycle.
- Residual plots for model diagnostics can be obtained by using the PLOT= option in the ESTIMATE statement.
- Plots of series forecasts and model decompositions can be obtained by using the PLOT= option in the FORECAST statement.

The following example is a simple illustration of the available plot options.
The following statements specify a UCM that includes a cycle component and a random walk trend component:
   ods graphics on;

   proc ucm data=sunspot;
      id year interval=year;
      model wolfer;
      irregular;
      level;
      cycle plot=(filter smooth);
      estimate back=24 plot=(loess panel cusum wn);
   run;
The following subsections explain the graphics produced by the above statements.
Component Plots
The plots in Figure 29.8 and Figure 29.9, produced by specifying PLOT=(FILTER SMOOTH) in the CYCLE statement, show the filtered and smoothed estimates, respectively, of the cycle component in the model. Note that the smoothed estimate appears smoother than the filtered estimate. This is always true because the filtered estimate of a component at time $t$ is based on the observations prior to time $t$; that is, it uses measurements from the first observation up to the $(t-1)$th observation. On the other hand, the corresponding smoothed estimate uses all the available observations; that is, all the measurements from the first observation to the last. This makes the smoothed estimate of the component more precise than the filtered estimate for the time points within the historical period. In the forecast horizon, both filtered and smoothed estimates are identical, being based on the same set of observations.
Figure 29.8 Sunspots Series: Filtered Cycle
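The difference in precision between the two kinds of estimates can be illustrated outside PROC UCM with a small Python sketch. It assumes a local level model (a random walk level plus an irregular) rather than the cycle model of this example, uses a large initial variance as a stand-in for the exact diffuse initialization, and runs on made-up data; the function name `local_level_filter_smoother` and the variance values are illustrative only.

```python
def local_level_filter_smoother(y, var_eps, var_eta, a1=0.0, p1=1e7):
    """Estimates of the level in y_t = mu_t + eps_t, mu_{t+1} = mu_t + eta_t.

    Returns the one-step-ahead estimates (based on y_1..y_{t-1}, the sense of
    "filtered" used above) and the smoothed estimates (based on all of y),
    each with its variance. The large p1 only approximates a diffuse prior.
    """
    n = len(y)
    pred_mean, pred_var = [0.0] * n, [0.0] * n   # based on y_1..y_{t-1}
    upd_mean, upd_var = [0.0] * n, [0.0] * n     # based on y_1..y_t
    a, p = a1, p1
    for t in range(n):
        pred_mean[t], pred_var[t] = a, p
        gain = p / (p + var_eps)
        upd_mean[t] = a + gain * (y[t] - a)
        upd_var[t] = p * var_eps / (p + var_eps)
        # Predict the next state: the level follows a random walk.
        a, p = upd_mean[t], upd_var[t] + var_eta

    # Backward fixed-interval smoothing pass, which brings in later observations.
    sm_mean, sm_var = upd_mean[:], upd_var[:]
    for t in range(n - 2, -1, -1):
        j = upd_var[t] / pred_var[t + 1]
        sm_mean[t] = upd_mean[t] + j * (sm_mean[t + 1] - pred_mean[t + 1])
        sm_var[t] = upd_var[t] + j * j * (sm_var[t + 1] - pred_var[t + 1])
    return pred_mean, pred_var, sm_mean, sm_var


# Made-up observations and variances, for illustration only.
y_obs = [4.4, 4.0, 3.5, 3.8, 4.2, 4.9, 5.3, 5.1]
pred_mean, pred_var, sm_mean, sm_var = local_level_filter_smoother(
    y_obs, var_eps=0.5, var_eta=0.1
)

for t, (pv, sv) in enumerate(zip(pred_var, sm_var), start=1):
    print(t, round(pv, 4), round(sv, 4))
```

At every time point in the historical period the smoothed variance is no larger than the one-step-ahead variance, which is the sense in which the smoothed estimate is more precise.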
Residual Diagnostics
If the fitted model is appropriate for the given data, then the corresponding one-step-ahead residuals should be approximately white (that is, uncorrelated) and approximately normal. Moreover, the residuals should not display any discernible pattern. You can detect departures from these conditions graphically. Different residual diagnostic plots can be requested by using the PLOT= option in the ESTIMATE statement. The normality can be checked by examining the histogram and the normal quantile plot of residuals. The whiteness can be checked by examining the ACF and PACF plots that show the sample autocorrelation and sample partial-autocorrelation at different lags. The diagnostic panel shown in Figure 29.10, produced by specifying PLOT=PANEL, contains these four plots.

The residual histogram and Q-Q plot show no serious violation of normality. The histogram appears reasonably symmetric and follows the overlaid normal density curve reasonably closely. Similarly, in the Q-Q plot the residuals follow the reference line fairly closely. The ACF and PACF plots also do not exhibit any violation of the whiteness assumption; the correlations at all nonzero lags seem to be insignificant.

The residual whiteness can also be formally tested by using the Ljung-Box portmanteau test. The plot in Figure 29.11, produced by specifying PLOT=WN, shows the p-values of the Ljung-Box test statistics at different lags. In these plots the p-values for the first few lags, equal to the number of estimated parameters in the model, are not shown because they are always missing. This portion of the plot is shaded blue to indicate this fact. In the case of this model, five parameters are estimated, so the p-values for the first five lags are not shown. The p-values are displayed on a log scale in such a way that higher bars imply more extreme test statistics. In this plot some early p-values appear extreme. However, these p-values are based on large sample theory, which suggests that these statistics should be examined for lags larger than the square root of the sample size. In this example it means that the p-values for the first $\sqrt{154} \approx 12$ lags can be ignored. With this consideration, the plot shows no violation of whiteness since the p-values after the 12th lag do not appear extreme.
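The quantities behind the white noise plot can be sketched in a few lines of Python. The sketch below uses the standard Ljung-Box statistic, $Q(h) = n(n+2)\sum_{k=1}^{h} r_k^2/(n-k)$, where $r_k$ is the lag-$k$ sample autocorrelation; the function names are illustrative, the toy residual series is made up, and the conversion of $Q(h)$ to a p-value (a chi-square tail probability) is omitted. It is not a description of how PROC UCM computes the plot.

```python
import math


def sample_autocorrelations(resid, max_lag):
    """Sample autocorrelations r_1..r_max_lag of a residual series."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [r - mean for r in resid]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


def ljung_box_q(resid, max_lag):
    """Ljung-Box statistics Q(1)..Q(max_lag)."""
    n = len(resid)
    r = sample_autocorrelations(resid, max_lag)
    q, total = [], 0.0
    for k, rk in enumerate(r, start=1):
        total += rk * rk / (n - k)
        q.append(n * (n + 2) * total)
    return q


# Rule of thumb from the text: set aside lags up to about sqrt(sample size).
n_obs = 154
lags_to_ignore = math.isqrt(n_obs)      # floor of the square root of 154
print(lags_to_ignore)

# Toy residuals with strong negative lag-1 autocorrelation.
toy_resid = [1.0, -1.0] * 5
print(ljung_box_q(toy_resid, 3))
```

Because each term added to the sum is nonnegative, $Q(h)$ never decreases with $h$; what changes with the lag is the reference distribution against which it is compared.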
The plot in Figure 29.12, produced by specifying PLOT=LOESS, shows the residuals plotted against time with an overlaid LOESS curve. This plot is useful for checking whether any discernible pattern remains in the residuals. Here again, no significant pattern appears to be present.

The plot in Figure 29.13, produced by specifying PLOT=CUSUM, shows the cumulative residuals plotted against time. This plot is useful for checking structural breaks. Here, there appears to be no evidence of a structural break since the cumulative residuals remain within the confidence band throughout the sample period. Similarly, you can request a plot of the squared cumulative residuals by specifying PLOT=CUSUMSQ.

Brockwell and Davis (1991) can be consulted for additional information about diagnosing residuals. For more information about CUSUM and CUSUMSQ plots, you can consult Harvey (1989).

You can use the PLOT= option in the FORECAST statement to obtain the series forecast plot and the series decomposition plots. The series decomposition plots show the result of successively adding different components in the model, starting with the trend component. The IRREGULAR component is left out of this process. The following two plots, produced by specifying PLOT=DECOMP, show the results of successive component addition for this example. The first plot, shown in Figure 29.14, shows the smoothed trend component, and the second plot, shown in Figure 29.15, shows the sum of the smoothed trend and cycle.
Table 29.2

ODS Table Name | Description | Statement | Option

Tables Summarizing the Estimation and Forecast Spans
EstimationSpan | Estimation span summary information | | default
ForecastSpan | Forecast span summary information | | default

Tables Related to Model Parameters
ConvergenceStatus | Convergence status of the estimation process | | default
 | Fixed parameters in the model | | default
 | Initial estimates of the free parameters | | default
 | Final estimates of the free parameters | | default

Tables Related to Model Information and Diagnostics
BlockSeasonDescription | Information about the block seasonals in the model | | default
ComponentSignificance | Significance analysis of the components in the model | | default
CycleDescription | Information about the cycles in the model | | default
FitStatistics | Fit statistics based on the one-step-ahead predictions | | default
FitSummary | Likelihood-based fit statistics | | default
OutlierSummary | Summary table of the detected outliers | | default
SeasonDescription | Information about the seasonals in the model | | default
SeasonHarmonics | Summary of harmonics in a trigonometric seasonal component | |
SplineSeasonDescription | Information about the spline-seasonals in the model | | default
TrendInformation | Summary information of the level and slope components | | default

Tables Related to Filtered Component Estimates
FilteredAutoReg | Filtered estimate of an autoreg component | AUTOREG | PRINT=FILTER
FilteredBlockSeason | Filtered estimate of a block seasonal component | BLOCKSEASON | PRINT=FILTER
FilteredCycle | Filtered estimate of a cycle component | CYCLE | PRINT=FILTER
FilteredIrregular | Filtered estimate of the irregular component | IRREGULAR | PRINT=FILTER
FilteredLevel | Filtered estimate of the level component | LEVEL | PRINT=FILTER
FilteredRandomReg | Filtered estimate of the time-varying random-regression coefficient | RANDOMREG | PRINT=FILTER
FilteredSeason | Filtered estimate of a seasonal component | SEASON | PRINT=FILTER
FilteredSlope | Filtered estimate of the slope component | SLOPE | PRINT=FILTER
FilteredSplineReg | Filtered estimate of the time-varying spline-regression coefficient | SPLINEREG | PRINT=FILTER
FilteredSplineSeason | Filtered estimate of a spline-seasonal component | SPLINESEASON | PRINT=FILTER

Tables Related to Smoothed Component Estimates
SmoothedAutoReg | Smoothed estimate of an autoreg component | AUTOREG | PRINT=SMOOTH
SmoothedBlockSeason | Smoothed estimate of a block seasonal component | BLOCKSEASON | PRINT=SMOOTH
SmoothedCycle | Smoothed estimate of the cycle component | CYCLE | PRINT=SMOOTH
SmoothedIrregular | Smoothed estimate of the irregular component | IRREGULAR | PRINT=SMOOTH
SmoothedLevel | Smoothed estimate of the level component | LEVEL | PRINT=SMOOTH
SmoothedRandomReg | Smoothed estimate of the time-varying random-regression coefficient | RANDOMREG | PRINT=SMOOTH
SmoothedSeason | Smoothed estimate of a seasonal component | SEASON | PRINT=SMOOTH
SmoothedSlope | Smoothed estimate of the slope component | SLOPE | PRINT=SMOOTH
SmoothedSplineReg | Smoothed estimate of the time-varying spline-regression coefficient | SPLINEREG | PRINT=SMOOTH
SmoothedSplineSeason | Smoothed estimate of a spline-seasonal component | SPLINESEASON | PRINT=SMOOTH

Tables Related to Series Decomposition and Forecasting
FilteredAllExceptIrreg | Filtered estimate of sum of all components except the irregular component | FORECAST | PRINT=FDECOMP
FilteredTrend | Filtered estimate of trend | FORECAST | PRINT=FDECOMP
FilteredTrendReg | Filtered estimate of trend plus regression | FORECAST | PRINT=FDECOMP
FilteredTrendRegCyc | Filtered estimate of trend plus regression plus cycles and autoreg | FORECAST | PRINT=FDECOMP
 | Dependent series forecasts | FORECAST | default
 | Forecasting performance in the holdout period | FORECAST | BACK=
SmoothedAllExceptIrreg | Smoothed estimate of sum of all components except the irregular component | FORECAST |
SmoothedTrend | Smoothed estimate of trend | FORECAST |
SmoothedTrendReg | Smoothed estimate of trend plus regression | FORECAST |
SmoothedTrendRegCyc | Smoothed estimate of trend plus regression plus cycles and autoreg | FORECAST |
NOTE: The tables are related to a single series within a BY group. In the case of models that contain multiple cycles, seasonal components, or block seasonal components, the corresponding component estimate tables are sequentially numbered. For example, if a model contains two cycles and a seasonal component and the PRINT=SMOOTH option is used for each of them, the ODS tables containing the smoothed estimates will be named SmoothedCycle1, SmoothedCycle2, and SmoothedSeason. Note that the seasonal table is not numbered because there is only one seasonal component.
Table 29.3

ODS Graph Name | Description | Statement | Option

Plots Related to Residual Analysis
ErrorACFPlot | Prediction error autocorrelation plot | ESTIMATE |
ErrorPACFPlot | Prediction error partial-autocorrelation plot | ESTIMATE |
 | Prediction error histogram | ESTIMATE |
 | Prediction error normal quantile plot | ESTIMATE |
 | Plot of prediction errors | ESTIMATE |
 | Plot of p-values at different lags for the Ljung-Box portmanteau white noise test statistics | ESTIMATE | PLOT=WN
 | Plot of cumulative residuals | ESTIMATE | PLOT=CUSUM
 | Plot of cumulative squared residuals | ESTIMATE | PLOT=CUSUMSQ
 | Plot of one-step-ahead forecasts in the estimation span | ESTIMATE |
PanelResidualPlot | Panel of residual diagnostic plots | ESTIMATE | PLOT=PANEL
ResidualLoessPlot | Time series plot of residuals with superimposed LOESS smoother | ESTIMATE | PLOT=LOESS

Plots Related to Filtered Component Estimates
FilteredAutoregPlot | Plot of filtered autoreg component | AUTOREG | PLOT=FILTER
FilteredBlockSeasonPlot | Plot of filtered block season component | BLOCKSEASON | PLOT=FILTER
FilteredCyclePlot | Plot of filtered cycle component | CYCLE | PLOT=FILTER
FilteredIrregularPlot | Plot of filtered irregular component | IRREGULAR | PLOT=FILTER
FilteredLevelPlot | Plot of filtered level component | LEVEL | PLOT=FILTER
FilteredRandomRegPlot | Plot of filtered time-varying regression coefficient | RANDOMREG | PLOT=FILTER
FilteredSeasonPlot | Plot of filtered season component | SEASON | PLOT=FILTER
FilteredSlopePlot | Plot of filtered slope component | SLOPE | PLOT=FILTER
FilteredSplineRegPlot | Plot of filtered time-varying regression coefficient | SPLINEREG | PLOT=FILTER
FilteredSplineSeasonPlot | Plot of filtered spline-season component | SPLINESEASON | PLOT=FILTER
AnnualSeasonPlot | Plot of annual variation in the filtered season component | SEASON | PLOT=F_ANNUAL

Plots Related to Smoothed Component Estimates
SmoothedAutoregPlot | Plot of smoothed autoreg component | AUTOREG | PLOT=SMOOTH
SmoothedBlockSeasonPlot | Plot of smoothed block season component | BLOCKSEASON | PLOT=SMOOTH
SmoothedCyclePlot | Plot of smoothed cycle component | CYCLE | PLOT=SMOOTH
SmoothedIrregularPlot | Plot of smoothed irregular component | IRREGULAR | PLOT=SMOOTH
SmoothedLevelPlot | Plot of smoothed level component | LEVEL | PLOT=SMOOTH
SmoothedRandomRegPlot | Plot of smoothed time-varying regression coefficient | RANDOMREG | PLOT=SMOOTH
SmoothedSeasonPlot | Plot of smoothed season component | SEASON | PLOT=SMOOTH
SmoothedSlopePlot | Plot of smoothed slope component | SLOPE | PLOT=SMOOTH
SmoothedSplineRegPlot | Plot of smoothed time-varying regression coefficient | SPLINEREG | PLOT=SMOOTH
SmoothedSplineSeasonPlot | Plot of smoothed spline-season component | SPLINESEASON | PLOT=SMOOTH
AnnualSeasonPlot | Plot of annual variation in the smoothed season component | SEASON | PLOT=S_ANNUAL

Plots Related to Series Decomposition and Forecasting
ForecastsOnlyPlot | Series forecasts beyond the historical period | FORECAST | DEFAULT
 | One-step-ahead as well as multistep-ahead forecasts | FORECAST | PLOT=FORECASTS
FilteredAllExceptIrregPlot | Plot of sum of all filtered components except the irregular component | FORECAST | PLOT=FDECOMP
FilteredTrendPlot | Plot of filtered trend | FORECAST | PLOT=FDECOMP
FilteredTrendRegCycPlot | Plot of sum of filtered trend, cycles, and regression effects | FORECAST | PLOT=FDECOMP
FilteredTrendRegPlot | Plot of filtered trend plus regression effects | FORECAST | PLOT=FDECOMP
SmoothedAllExceptIrregPlot | Plot of sum of all smoothed components except the irregular component | FORECAST | PLOT=DECOMP
SmoothedTrendPlot | Plot of smoothed trend | FORECAST | PLOT=DECOMP
SmoothedTrendRegPlot | Plot of smoothed trend plus regression effects | FORECAST | PLOT=DECOMP
SmoothedTrendRegCycPlot | Plot of sum of smoothed trend, cycles, and regression effects | FORECAST | PLOT=DECOMP
FilteredAllExceptIrregVarPlot | Plot of standard error of sum of all filtered components except the irregular | FORECAST | PLOT=FDECOMPVAR
FilteredTrendVarPlot | Plot of standard error of filtered trend | FORECAST | PLOT=FDECOMPVAR
FilteredTrendRegVarPlot | Plot of standard error of filtered trend plus regression effects | FORECAST | PLOT=FDECOMPVAR
FilteredTrendRegCycVarPlot | Plot of standard error of filtered trend, cycles, and regression effects | FORECAST | PLOT=FDECOMPVAR
SmoothedAllExceptIrregVarPlot | Plot of standard error of sum of all smoothed components except the irregular | FORECAST | PLOT=DECOMPVAR
SmoothedTrendVarPlot | Plot of standard error of smoothed trend | FORECAST | PLOT=DECOMPVAR
SmoothedTrendRegVarPlot | Plot of standard error of smoothed trend plus regression effects | FORECAST | PLOT=DECOMPVAR
SmoothedTrendRegCycVarPlot | Plot of standard error of smoothed trend, cycles, and regression effects | FORECAST | PLOT=DECOMPVAR
- F_AUTOREG, VF_AUTOREG, S_AUTOREG, and VS_AUTOREG, numerical variables containing the filtered and smoothed values of the autoreg component and the respective variances. These variables are present only if the model has an autoreg component.
- F_CYCLE, VF_CYCLE, S_CYCLE, and VS_CYCLE, numerical variables containing the filtered and smoothed values of the cycle component and the respective variances. If there are multiple cycles in the model, these variables are sequentially numbered as F_CYCLE1, F_CYCLE2, etc. These variables are present only if the model has at least one cycle component.
- F_SEASON, VF_SEASON, S_SEASON, and VS_SEASON, numerical variables containing the filtered and smoothed values of the season component and the respective variances. If there are multiple seasons in the model, these variables are sequentially numbered as F_SEASON1, F_SEASON2, etc. These variables are present only if the model has at least one season component.
- F_BLKSEAS, VF_BLKSEAS, S_BLKSEAS, and VS_BLKSEAS, numerical variables containing the filtered and smoothed values of the blockseason component and the respective variances. If there are multiple block seasons in the model, these variables are sequentially numbered as F_BLKSEAS1, F_BLKSEAS2, etc.
- F_SPLSEAS, VF_SPLSEAS, S_SPLSEAS, and VS_SPLSEAS, numerical variables containing the filtered and smoothed values of the splineseason component and the respective variances. If there are multiple spline seasons in the model, these variables are sequentially numbered as F_SPLSEAS1, F_SPLSEAS2, etc. These variables are present only if the model has at least one splineseason component.
- Filtered and smoothed estimates, and their variances, of the time-varying regression coefficients of the variables specified in the RANDOMREG and SPLINEREG statements. A variable is not included if its coefficient is time-invariant, that is, if the associated disturbance variance is zero.
- S_TREG and VS_TREG, numerical variables containing the smoothed values of level plus regression component and their variances. These variables are present only if the model has at least one predictor variable or has dependent lags.
- S_TREGCYC and VS_TREGCYC, numerical variables containing the smoothed values of level plus regression plus cycle component and their variances. These variables are present only if the model has at least one cycle or an autoreg component.
- S_NOIRREG and VS_NOIRREG, numerical variables containing the smoothed values of the sum of all components except the irregular component and their variances. These variables are present only if the model has at least one seasonal or block seasonal component.
- the BY variables
- COMPONENT, a character variable containing the name of the component corresponding to the parameter being described
- PARAMETER, a character variable containing the parameter name
- TYPE, a character variable indicating whether the parameter value was fixed by the user or estimated
- _STATUS_, a character variable indicating whether the parameter estimation process converged or failed or there was an error of some other kind
- ESTIMATE, a numerical variable containing the parameter estimate
- STD, a numerical variable containing the standard error of the parameter estimate. This has a missing value if the parameter value is fixed.
- TVALUE, a numerical variable containing the t-statistic. This has a missing value if the parameter value is fixed.
- PVALUE, a numerical variable containing the p-value. This has a missing value if the parameter value is fixed.
Statistics of Fit
This section explains the goodness-of-fit statistics reported to measure how well the specified model fits the data. First, the statistics of fit that are computed using the prediction errors, $y_t - \hat{y}_t$, are considered. In these formulas, $n$ is the number of nonmissing prediction errors and $k$ is the number of fitted parameters in the model. Moreover, the sum of squared errors is $SSE = \sum (y_t - \hat{y}_t)^2$, and the total sum of squares for the series corrected for the mean is $SST = \sum (y_t - \bar{y})^2$, where $\bar{y}$ is the series mean and the sums are over all the nonmissing prediction errors.

Mean Squared Error
The mean squared prediction error, $MSE = \frac{1}{n} SSE$

Root Mean Squared Error
The root mean square error, $RMSE = \sqrt{MSE}$

Mean Absolute Percent Error
The mean absolute percent prediction error, $MAPE = \frac{100}{n} \sum_{t=1}^{n} |(y_t - \hat{y}_t)/y_t|$. The summation ignores observations where $y_t = 0$.

R-square
The R-square statistic, $R^2 = 1 - SSE/SST$. If the model fits the series badly, the model error sum of squares, SSE, might be larger than SST and the R-square statistic will be negative.

Adjusted R-square
The adjusted R-square statistic, $1 - \left(\frac{n-1}{n-k}\right)(1 - R^2)$

Amemiya's Adjusted R-square
Amemiya's adjusted R-square, $1 - \left(\frac{n+k}{n-k}\right)(1 - R^2)$

Random Walk R-square
The random walk R-square statistic (Harvey's R-square statistic that uses the random walk model for comparison), $1 - \left(\frac{n-1}{n}\right) SSE/RWSSE$, where $RWSSE = \sum_{t=2}^{n} (y_t - y_{t-1} - \mu)^2$ and $\mu = \frac{1}{n-1} \sum_{t=2}^{n} (y_t - y_{t-1})$

Maximum Percent Error
The largest percent prediction error, $100 \max((y_t - \hat{y}_t)/y_t)$. In this computation the observations where $y_t = 0$ are ignored.
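The first few statistics can be reproduced by hand from a set of actual and predicted values. The following DATA step is a minimal sketch under stated assumptions: the data set PRED and its variables Y and YHAT are hypothetical toy inputs (not output of PROC UCM), and MAPE is divided by the full count n, as in the formula above.

```sas
/* Hypothetical toy data: actual values (y) and predictions (yhat) */
data pred;
   input y yhat @@;
   datalines;
10 9  12 13  14 15  15 14  11 12
;
run;

/* Accumulate the sums needed for SSE, SST, and MAPE in one pass */
data fitstats(keep=n sse sst mse rmse mape rsquare);
   set pred end=last;
   if nmiss(y, yhat) = 0 then do;
      n     + 1;                        /* nonmissing prediction errors   */
      sse   + (y - yhat)**2;            /* sum of squared errors          */
      sumy  + y;
      sumy2 + y**2;
      if y ne 0 then sape + abs((y - yhat) / y);  /* skip y = 0 terms     */
   end;
   if last then do;
      sst     = sumy2 - sumy**2 / n;    /* sum of squares about the mean  */
      mse     = sse / n;
      rmse    = sqrt(mse);
      mape    = 100 * sape / n;
      rsquare = 1 - sse / sst;
      output;
   end;
run;

proc print data=fitstats noobs;
run;
```

Every prediction error in the toy data is plus or minus 1, so SSE equals the number of observations and both MSE and RMSE equal 1.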
The likelihood-based fit statistics are reported separately (see the section "The UCMs as State Space Models" on page 1787). They include the full log likelihood ($L_\infty$), the diffuse part of the log likelihood, the normalized residual sum of squares, and several information criteria: AIC, AICC, HQIC, BIC, and CAIC. Let $q$ denote the number of estimated parameters, $n$ be the number of nonmissing measurements in the estimation span, and $d$ be the number of diffuse elements in the initial state vector that are successfully initialized during the Kalman filtering process. Moreover, let $n^* = (n - d)$. The reported information criteria, all in smaller-is-better form, are described in Table 29.4:
Table 29.4 Information Criteria

Criterion   Reference
AIC         Akaike (1974)
AICC        Hurvich and Tsai (1989); Burnham and Anderson (1998)
HQIC        Hannan and Quinn (1979)
BIC         Schwarz (1978)
CAIC        Bozdogan (1987)
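For orientation, the criteria can be written out in their usual textbook forms. The expressions below are a sketch based on the cited references, not a transcription of the table; they assume that $L$ denotes the full log likelihood, $q$ the number of estimated parameters, and $n^* = n - d$ the effective sample size.

```latex
\begin{align*}
\mathrm{AIC}  &= -2L + 2q \\
\mathrm{AICC} &= -2L + \frac{2q\,n^*}{n^* - q - 1} \\
\mathrm{HQIC} &= -2L + 2q \log\log(n^*) \\
\mathrm{BIC}  &= -2L + q \log(n^*) \\
\mathrm{CAIC} &= -2L + q\,(\log(n^*) + 1)
\end{align*}
```

All five share the term $-2L$ and differ only in the penalty on $q$, which is why they can rank competing models differently when $n^*$ is small.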
some of the observed data are withheld during parameter estimation and forecast computations. The observations in the last two years, years 1959 and 1960, are not used in parameter estimation, while the observations in the last year, year 1960, are not used in the forecasting computations. This is done using the BACK= option in the ESTIMATE and FORECAST statements. In addition, a panel of residual diagnostic plots is obtained using the PLOT=PANEL option in the ESTIMATE statement.
data seriesG;
   set sashelp.air;
   logair = log(air);
run;

proc ucm data = seriesG;
   id date interval = month;
   model logair;
   irregular;
   level;
   slope var = 0 noest;
   season length = 12 type=trig;
   estimate back=24 plot=panel;
   forecast back=12 lead=24 print=forecasts;
run;
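The effect of the two BACK= values can be checked with simple date arithmetic. The following sketch assumes that the last observation in sashelp.air is December 1960 (the value is hard-coded here rather than read from the data) and computes the last date used by each statement.

```sas
data _null_;
   last_obs = '01dec1960'd;                   /* assumed end of the series    */
   est_end  = intnx('month', last_obs, -24);  /* BACK=24 in ESTIMATE          */
   fcst_end = intnx('month', last_obs, -12);  /* BACK=12 in FORECAST          */
   put est_end= monyy7. fcst_end= monyy7.;
run;
```

The estimation span should end in December 1958, which agrees with the Last value reported in Output 29.1.1, and the forecasting computations should use data through December 1959.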
The following tables display the summary of data used in estimation and forecasting (Output 29.1.1 and Output 29.1.2). These tables provide simple summary statistics for the estimation and forecast spans; they include useful information such as the beginning and ending dates of the span, the number of nonmissing values, etc.
Output 29.1.1 Observation Span Used in Parameter Estimation (partial output)
Variable   Type        First     Last      Nobs   Mean
logair     Dependent   JAN1949   DEC1958   120    5.43035
The following tables display the fixed parameters in the model, the preliminary estimates of the free parameters, and the final estimates of the free parameters (Output 29.1.3, Output 29.1.4, and Output 29.1.5).
Two types of goodness-of-fit statistics are reported after a model is fit to the series (see Output 29.1.6 and Output 29.1.7). The first type is the likelihood-based goodness-of-fit statistics, which include the full likelihood of the data, the diffuse portion of the likelihood (see the section "Details: UCM Procedure" on page 1781), and the information criteria. The second type of statistics is based on the raw residuals, residual = observed - predicted. If the model is nonstationary, then one-step-ahead predictions are not available for some initial observations, and the number of values used in computing these fit statistics will be different from those used in computing the likelihood-based test statistics.
Number of non-missing residuals used for computing the fit statistics = 107
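The count of 107 can be reconciled with the 120 observations in the estimation span. The arithmetic below is a sketch that assumes the number of initial observations without one-step-ahead predictions equals the number of diffuse state elements, and that this model has 13 of them (one for the level, one for the slope, and 11 for a trigonometric season of length 12); that breakdown is an assumption, not something stated in the output.

```latex
\[
n^* = n - d = 120 - (1 + 1 + 11) = 107
\]
```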
The diagnostic plots based on the one-step-ahead residuals are shown in Output 29.1.8. The residual histogram and the Q-Q plot show no reasons to question the approximate normality of the residual distribution. The remaining plots check for the whiteness of the residuals. The sample correlation plots, the autocorrelation function (ACF) and the partial autocorrelation function (PACF), also do not show any significant violations of the whiteness of the residuals. Therefore, on the whole, the model seems to fit the data well.
Output 29.1.8 Residual Diagnostics for the Airline Series Using a BSM
The forecasts are given in Output 29.1.9. To save space, the upper and lower confidence limit columns are dropped from the output, and only the rows corresponding to the year 1960 are shown. Recall that the actual measurements in the years 1959 and 1960 were withheld during the parameter estimation, and the ones in 1960 were not used in the forecast computations.
Output 29.1.10 shows the forecast plot. The forecasts in the year 1960 show that the model predictions were quite good.
Output 29.1.10 Forecast Plot of the Airline Series Using a BSM
32 0 33 5 24 15
31 0 34 3 26 13
28 2 33 3 27 12
25 4 32 3 28 11
22 8 30 4 29 11
18 11 27 5 28 10
The following statements use the TIMESERIES procedure to get a time series plot of the series (see Output 29.2.1).
proc timeseries data=star plot=series;
   var magnitude;
run;
The plot clearly shows the cyclic nature of the series. Bloomfield shows that the series is very well explained by a model that includes two deterministic cycles that have periods 29.0003 and 24.0001 days, a constant term, and a simple error term. He also mentions the difficulty involved in estimating the periods from the data (see Bloomfield 2000, Chapter 3). In his case the cycle periods are estimated by least squares, and the sum of squares surface has multiple local optima and ridges. The following statements show how to use the UCM procedure to fit this two-cycle model to the series. The constant term in the model is specified by holding the variance parameter of the level component to zero.
proc ucm data=star;
   model magnitude;
   irregular;
   level var=0 noest;
   cycle;
   cycle;
   estimate;
run;
The final parameter estimates and the goodness-of-fit statistics are shown in Output 29.2.2 and Output 29.2.3, respectively. The model fit appears to be good.
Output 29.2.2 Two-Cycle Model: Parameter Estimates
The UCM Procedure
Final Estimates of the Free Parameters

Parameter        Approx Std Error   Approx Pr > |t|
Error Variance   0.0053845          <.0001
Damping Factor   1.81175E-7         <.0001
Period           0.0022709          <.0001
Error Variance   5.27213E-6         0.0944
Damping Factor   2.11939E-7         <.0001
Period           0.0019128          <.0001
Error Variance   3.56374E-6         0.1330
Number of non-missing residuals used for computing the fit statistics = 599
Note that the estimated periods are the same as in Bloomfield's model, the damping factors are nearly equal to 1.0, and the disturbance variances are very close to zero, implying persistent deterministic cycles. In fact, this model is identical to Bloomfield's model.
112 56 101
Initial exploration of the series clearly indicates that the series does not show any significant trend, and that time of day and day of the week have a significant influence on the number of calls received. These considerations suggest a simple random walk trend model along with a seasonal component of season length 28, the total number of shifts in a week. The following statements specify this model. Note the PRINT=HARMONICS option in the SEASON statement, which produces a table that lists the full set of harmonics contributing to the seasonal along with the significance of their contribution. This table will be useful later in choosing a subset trigonometric model. The BACK=28 and the LEAD=28 specifications in the FORECAST statement create a holdout region of 28 observations. The sum of absolute prediction errors (SAE) in this holdout region is used to
proc ucm data=callCenter;
   id datetime interval=dthour6;
   model calls;
   irregular;
   level;
   season length=28 type=trig print=(harmonics);
   estimate back=28;
   forecast back=28 lead=28;
run;
The forecasting performance of this model in the holdout region is shown in Output 29.3.1. The sum of absolute prediction errors is SAE = 516.22, which appears in the last row of the holdout analysis table.
Output 29.3.1 Predictions in the Holdout Region: Baseline Model
Obs   datetime     Actual   Forecast   Error    SAE
525   24APR00:00   12       -4.004     16.004   16.004
526   24APR00:06   136      110.825    25.175   41.179
527   24APR00:12   295      262.820    32.180   73.360
528   24APR00:18   172      145.127    26.873   100.232
529   25APR00:00   20       2.188      17.812   118.044
530   25APR00:06   127      105.442    21.558   139.602
531   25APR00:12   236      217.043    18.957   158.559
532   25APR00:18   125      114.313    10.687   169.246
533   26APR00:00   16       2.855      13.145   182.391
534   26APR00:06   108      95.202     12.798   195.189
535   26APR00:12   207      194.184    12.816   208.005
536   26APR00:18   112      97.687     14.313   222.317
537   27APR00:00   15       1.270      13.730   236.047
538   27APR00:06   98       85.875     12.125   248.172
539   27APR00:12   200      184.891    15.109   263.281
540   27APR00:18   113      93.113     19.887   283.168
541   28APR00:00   15       -1.120     16.120   299.288
542   28APR00:06   104      84.983     19.017   318.305
543   28APR00:12   205      177.940    27.060   345.365
544   28APR00:18   89       64.292     24.708   370.073
545   29APR00:00   12       -6.020     18.020   388.093
546   29APR00:06   68       46.286     21.714   409.807
547   29APR00:12   116      100.339    15.661   425.468
548   29APR00:18   54       34.700     19.300   444.768
549   30APR00:00   10       -6.209     16.209   460.978
550   30APR00:06   30       12.167     17.833   478.811
551   30APR00:12   66       49.524     16.476   495.287
552   30APR00:18   61       40.071     20.929   516.216
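The SAE column is a running total of the absolute values in the Error column. The following DATA step is a small sketch of that accumulation, using the first four holdout rows typed in by hand; the data set name HOLDOUT and its variable names are hypothetical.

```sas
/* First four holdout observations: actual and forecast values */
data holdout;
   input actual forecast;
   error = actual - forecast;   /* prediction error                 */
   sae + abs(error);            /* running sum of absolute errors   */
   datalines;
12 -4.004
136 110.825
295 262.820
172 145.127
;
run;

proc print data=holdout noobs;
run;
```

Because the table entries are rounded to three decimals, a running total rebuilt this way can differ from the printed SAE in the last digit.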
Now that a baseline model is created, the exploration for alternate models can begin. The review of the harmonic table in Output 29.3.2 shows that all but the last three harmonics are significant, and deleting any of them to form a subset trigonometric seasonal component will lead to a poorer model.
The last three harmonics, 12th, 13th, and 14th, with periods of 2.333, 2.154, and 2.0, respectively, do appear to be possible choices for deletion. Note that the disturbance variance of the seasonal component is not clearly insignificant (see Output 29.3.3); therefore the seasonal component is stochastic and the preceding logic, which is based on the final state estimate, provides only a rough guideline.
Output 29.3.2 Harmonic Analysis of the Season: Initial Model
The UCM Procedure
Harmonic Analysis of Trigonometric Seasons (Based on the Final State)

Name     Season Length   Harmonic   Period     Chi-Square   DF   Pr > ChiSq
Season   28              1          28.00000   234.19       2    <.0001
Season   28              2          14.00000   264.19       2    <.0001
Season   28              3          9.33333    95.65        2    <.0001
Season   28              4          7.00000    105.64       2    <.0001
Season   28              5          5.60000    146.74       2    <.0001
Season   28              6          4.66667    121.93       2    <.0001
Season   28              7          4.00000    4299.12      2    <.0001
Season   28              8          3.50000    150.79       2    <.0001
Season   28              9          3.11111    89.68        2    <.0001
Season   28              10         2.80000    8.95         2    0.0114
Season   28              11         2.54545    6.14         2    0.0464
Season   28              12         2.33333    2.20         2    0.3325
Season   28              13         2.15385    3.40         2    0.1828
Season   28              14         2.00000    2.33         1    0.1272
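The Period and DF columns of the harmonic table follow a simple pattern: harmonic j of a season of length s has period s/j, and every harmonic has two degrees of freedom except the last one when s is even. The following DATA step is a sketch that regenerates those two columns for s = 28; the data set and variable names are hypothetical.

```sas
data harmonics;
   s = 28;                           /* season length                     */
   do j = 1 to floor(s / 2);         /* harmonics 1 through 14            */
      period = s / j;                /* period of the jth harmonic        */
      if 2 * j = s then df = 1;      /* last harmonic of an even season   */
      else df = 2;
      output;
   end;
run;

proc print data=harmonics noobs;
   format period 8.5;
run;
```

The periods for harmonics 12, 13, and 14 come out as 2.33333, 2.15385, and 2.00000, matching the values discussed as candidates for deletion.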
The following statements fit a subset trigonometric model formed by dropping the last three harmonics by specifying the DROPH= option in the SEASON statement:
proc ucm data=callCenter;
   id datetime interval=dthour6;
   model calls;
   irregular;
   level;
   season length=28 type=trig droph=12 13 14;
   estimate back=28;
   forecast back=28 lead=28;
run;
The last row of the holdout region prediction analysis table for the preceding model is shown in Output 29.3.4. It shows that the subset trigonometric model has better prediction performance in the holdout region than the full trigonometric model: its SAE = 471.53, compared to SAE = 516.22 for the full model.
Output 29.3.4 SAE for the Subset Trigonometric Model
Obs   datetime     Actual   Forecast   Error    SAE
552   30APR00:18   61       40.836     20.164   471.534
The following statements illustrate a spline approximation to this seasonal component. In the spline specification the knot placement is quite important, and usually some experimentation is needed. In the following model the knots are placed at the beginning and the middle of each day. Note that the knots at the beginning and end of the season, 1 and 28 in this case, should not be listed in the knot list because knots are always placed there anyway.
proc ucm data=callCenter;
   id datetime interval=dthour6;
   model calls;
   irregular;
   level;
   splineseason length=28
      knots=3 5 7 9 11 13 15 17 19 21 23 25 27
      degree=3;
   estimate back=28;
   forecast back=28 lead=28;
run;
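The knot list in the SPLINESEASON statement can be derived from the shift layout rather than typed by hand. The following sketch assumes four six-hour shifts per day, so that day d starts at position 4(d-1)+1 and its middle falls two positions later; the variable names are hypothetical.

```sas
data _null_;
   length knots $200;
   do day = 1 to 7;
      start = 4 * (day - 1) + 1;     /* first shift of the day             */
      mid   = start + 2;             /* middle of the day                  */
      /* Position 1 is skipped: the season boundaries get knots anyway */
      if start > 1 then knots = catx(' ', knots, start);
      knots = catx(' ', knots, mid);
   end;
   put knots=;
run;
```

The resulting string lists the odd positions from 3 through 27, which is the list used in the KNOTS= option.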
The spline season model takes about half the time to fit that the baseline model takes. The last row of the holdout region prediction analysis table for this model is shown in Output 29.3.5, which shows that the spline season model performs even better than the previous two models in the holdout region: its SAE = 313.79, compared to SAE = 471.53 for the previous model.
Output 29.3.5 SAE for the Spline Season Model
Obs   datetime     Actual   Forecast   Error    SAE
552   30APR00:18   61       23.350     37.650   313.792
The following statements illustrate yet another way to approximate a long seasonal component. Here a combination of BLOCKSEASON and SEASON statements results in a seasonal component that is a sum of two seasonal patterns: one seasonal pattern is simply a regular season with season length 4 that captures the within-day seasonal pattern, and the other seasonal pattern is a block seasonal pattern that remains constant during the day but varies from day to day within a week.
Note the use of the NLOPTIONS statement to change the optimization technique during the parameter estimation to DBLDOG, which in this case performs better than the default technique, TRUREG.
proc ucm data=callCenter;
   id datetime interval=dthour6;
   model calls;
   irregular;
   level;
   season length=4 type=trig;
   blockseason nblocks=7 blocksize=4 type=trig;
   estimate back=28;
   forecast back=28 lead=28;
   nloptions tech=dbldog;
run;
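The way the two seasonal patterns divide the 28 positions in a week can be made concrete by indexing each position. The following DATA step is a sketch under the assumption that the week starts at position 1 and that each day consists of four consecutive shifts; the data set and variable names are hypothetical.

```sas
data shifts;
   do t = 1 to 28;                        /* position within the week       */
      shift = mod(t - 1, 4) + 1;          /* within-day index, 1..4         */
      day   = floor((t - 1) / 4) + 1;     /* block (day) index, 1..7        */
      output;
   end;
run;

proc print data=shifts noobs;
run;
```

In this layout the SEASON component with LENGTH=4 varies with the shift index, while the BLOCKSEASON component with NBLOCKS=7 and BLOCKSIZE=4 varies with the day index and stays constant across the four shifts of a day.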
This model also takes about half the time to fit that the baseline model takes. The last row of the holdout region prediction analysis table for this model is shown in Output 29.3.6, which shows that the block season model does slightly better than the baseline model but not as well as the other two models: its SAE = 508.52, compared to SAE = 516.22 for the baseline model.
Output 29.3.6 SAE for the Block Season Model
Obs   datetime     Actual   Forecast   Error    SAE
552   30APR00:18   61       39.339     21.661   508.522
This example showed a few different ways to model a long seasonal pattern. It showed that parsimonious models for long seasonal patterns can be useful, and in some cases even more effective than the full model. Moreover, for very long seasonal patterns the high memory requirements and long computing times might make full models impractical.
Example 29.4: Modeling Time-Varying Regression Effects Using the RANDOMREG Statement
In April 1979 the Albuquerque Police Department began a special enforcement program aimed at reducing the number of DWI (driving while intoxicated) accidents. The program was administered by a squad of police officers, who used breath alcohol testing (BAT) devices and a van that houses a BAT device (Batmobile). These data were collected by the Division of Governmental Research of the University of New Mexico, under a contract with the National Highway Traffic Safety Administration of the U.S. Department of Transportation, to evaluate the Batmobile program. The first 29 observations are for a control period, and the next 23 observations are for the experimental (Batmobile) period. The data, freely available at http://lib.stat.cmu.edu/DASL/Datafiles/batdat.html, consist of two variables: ACC, which represents injuries and fatalities from Wednesday to Saturday nighttime accidents, and FUEL, which represents fuel consumption (millions of gallons) in Albuquerque. The variables are measured quarterly starting from the first quarter of 1972 up to the last quarter of 1984, covering the span of 13 years. The following DATA step statements create the input data set.
data bat;
   input ACC FUEL @@;
   batProgram = 0;
   if _n_ > 29 then batProgram = 1;
   date = INTNX( 'qtr', '1jan1972'd, _n_ - 1 );
   format date qtr8.;
datalines;
192 32.592 238 37.250 232 40.032 246 35.852
185 38.226 274 38.711 266 43.139 196 40.434
170 35.898 234 37.111 272 38.944 234 37.717
210 37.861 280 42.524 246 43.965 248 41.976
269 42.918 326 49.789 342 48.454 257 45.056
280 49.385 290 42.524 356 51.224 295 48.562
279 48.167 330 51.362 354 54.646 331 53.398
291 50.584 377 51.320 327 50.810 301 46.272
269 48.664 314 48.122 318 47.483 288 44.732
242 46.143 268 44.129 327 46.258 253 48.230
215 46.459 263 50.686 319 49.681 263 51.029
206 47.236 286 51.717 323 51.824 306 49.380
230 47.961 304 46.039 311 55.683 292 52.263
;
There are a number of ways to study these data and the question of the effectiveness of the BAT program. One possibility is to study the before-after difference in the injuries and fatalities per million gallons of fuel consumed, by regressing ACC on the FUEL and the dummy variable BATPROGRAM, which is zero before the program began and one while the program is in place. However, it is possible that the effect of the Batmobiles might well be cumulative, because as awareness of the program becomes dispersed, its effectiveness as a deterrent to driving while intoxicated increases. This suggests that the regression coefficient of the BATPROGRAM variable might be time varying. The following program fits a model that incorporates these considerations. A seasonal component is included in the model since it is easy to see that the data show strong quarterly seasonality.
proc ucm data=bat;
   model acc = fuel;
   id date interval=qtr;
   irregular;
   level var=0 noest;
   randomreg batProgram / plot=smooth;
   season length=4 var=0 noest plot=smooth;
   estimate plot=(panel residual);
   forecast plot=forecasts lead=0;
run;
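Written as equations, the specified model has roughly the following form. This is a sketch, not a transcription from the procedure documentation: the symbols $\mu$, $\beta$, $\gamma_t$, $s_t$, $\epsilon_t$, and $\xi_t$ are introduced here for illustration, and the random walk evolution of the RANDOMREG coefficient is stated as an assumption about how that statement models time variation.

```latex
\begin{align*}
\mathrm{ACC}_t &= \mu + \beta\,\mathrm{FUEL}_t + \gamma_t\,\mathrm{BATPROGRAM}_t + s_t + \epsilon_t \\
\gamma_t &= \gamma_{t-1} + \xi_t, \qquad \xi_t \sim N(0, \sigma_\xi^2)
\end{align*}
```

Here $\mu$ is a constant level (VAR=0 NOEST), $s_t$ is a fixed quarterly seasonal pattern (VAR=0 NOEST), and $\epsilon_t$ is the irregular component. If $\sigma_\xi^2$ were zero, $\gamma_t$ would reduce to an ordinary fixed regression coefficient.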
The model seems to fit the data adequately. No data are withheld for model validation because the series is relatively short. The plot of the time-varying coefficient of BATPROGRAM is shown in Output 29.4.1. As expected, it shows that the effectiveness of the program increases as awareness of the program becomes dispersed. The effectiveness eventually seems to level off. The residual diagnostic plots are shown in Output 29.4.2 and Output 29.4.3, the forecast plot is in Output 29.4.4, the goodness-of-fit statistics are in Output 29.4.5, and the parameter estimates are in Output 29.4.6.
Output 29.4.1 Time-Varying Regression Coefficient of BATPROGRAM
Output 29.4.3 Residual Diagnostics for the Time-Varying Regression Model
Output 29.5.1 Smoothed Trend for the GNP Series as per the Hodrick-Prescott Filter
The time series plot of the load (not shown) also shows that, apart from a day-of-the-week seasonal effect, there are no additional easily identifiable patterns in the series. The series has no apparent upward or downward trend. The following statements fit a UCM to the series that takes into account these observations. The particular choice of the model is a result of a little modeling exercise that compared a small number of competing models. The chosen model is adequate but by no means the best possible. The temperature effect is modeled by a deterministic three-degree spline with knots at 30, 40, 50, 65, and 75. The knot locations and the degree were chosen by visual inspection of the plot (Output 29.6.1). An autoreg component is used in place of the simple irregular component, which improved the residual analysis. The last 60 days of data are withheld for out-of-sample forecast evaluation (note the BACK= option in both the ESTIMATE and FORECAST statements). The OUTLIER statement is used to increase the number of outliers reported to 10. Since no CHECKBREAK option is used in the LEVEL statement, only the additive outliers are searched. In this example the use of the EXTRADIFFUSE= option in the ESTIMATE and FORECAST statements is useful for discarding some early one-step-ahead forecasts and residuals with large variance.
   autoreg;
   level plot=smooth;
   splinereg temp knots=30 40 50 65 75 degree=3 variance=0 noest;
   season length=7 var=0 noest;
   estimate plot=panel back=60 extradiffuse=50;
   outlier maxnum=10;
   forecast back=60 lead=60 extradiffuse=50;
run;
The parameter estimates are given in Output 29.6.2, and the residual goodness-of-fit statistics are shown in Output 29.6.3. The residual diagnostic plots are shown in Output 29.6.4. The ACF and PACF plots appear satisfactory, but the normality plots, particularly the Q-Q plot, show possible violations. It appears that, at least in part, this non-normal behavior of the residuals might be attributable to the outliers in the series. The outlier summary table, Output 29.6.5, shows the most likely outlying observations. Notice that most of these outliers are holidays, like July 4th, when the electricity load is lower than usual for that day of the week.
Output 29.6.2 Electricity Load: Parameter Estimates
The UCM Procedure
Final Estimates of the Free Parameters

Component   Parameter              Estimate    Approx Std Error   t Value   Approx Pr > |t|
Level       Error Variance         0.21185     0.05025            4.22      <.0001
AutoReg     Damping Factor         0.57522     0.03466            16.60     <.0001
AutoReg     Error Variance         2.21057     0.20478            10.79     <.0001
temp        Spline Coefficient_1   4.72502     1.93997            2.44      0.0149
temp        Spline Coefficient_2   2.19116     1.71243            1.28      0.2007
temp        Spline Coefficient_3   -7.14492    1.56805            -4.56     <.0001
temp        Spline Coefficient_4   -11.39950   1.45098            -7.86     <.0001
temp        Spline Coefficient_5   -16.38055   1.36977            -11.96    <.0001
temp        Spline Coefficient_6   -18.76075   1.28898            -14.55    <.0001
temp        Spline Coefficient_7   -8.04628    1.09017            -7.38     <.0001
temp        Spline Coefficient_8   -2.30525    1.25102            -1.84     0.0654
Number of non-missing residuals used for computing the fit statistics = 791
Obs    Time        Estimate   StdErr      ChiSq   DF
1281   04JUL2002   -7.99908   1.3417486   35.54   1
916    04JUL2001   -6.55778   1.338431    24.01   1
329    25NOV1999   -5.85047   1.3379735   19.12   1
977    03SEP2001   -5.67254   1.3389138   17.95   1
1341   02SEP2002   -5.49631   1.337843    16.88   1
693    23NOV2000   -5.27968   1.3374368   15.58   1
915    03JUL2001   5.06557    1.3375273   14.34   1
1057   22NOV2001   -5.01550   1.3386184   14.04   1
551    04JUL2000   -4.89965   1.3381557   13.41   1
879    28MAY2001   -4.76135   1.3375349   12.67   1
The plot of the load forecasts for the withheld data is shown in Output 29.6.6.
Output 29.6.6 Electricity Load: Forecast Evaluation of the Withheld Data
1370 958 774 1050 764 1040 771 848 975 714
1140 1140 840 969 821 759 676 890 815 740
In this situation it is known that a shift in the water level occurred within the span of the series, and its effect can be easily taken into account by including an appropriate indicator variable as a regressor. However, in many situations such prior information is not available, and it is useful to detect such a shift in a data analytic fashion. You can check for breaks in the level by using the CHECKBREAK option in the LEVEL statement. The following statements fit a simple locally constant level plus error model to the series:
proc ucm data=nile;
   id year interval=year;
   model waterlevel;
   irregular;
   level plot=smooth checkbreak;
   estimate;
   forecast plot=decomp;
run;
The plot in Output 29.7.2 shows a noticeable drop in the smoothed water level around 1899.
The Outlier Summary table in Output 29.7.3 shows the most likely types of breaks and their locations within the series span. The shift of 1899 is easily detected.
Output 29.7.3 Detection of Structural Breaks in the Nile River Level
Outlier Summary

Estimate     Standard Error   Chi-Square
-315.73791   97.639753        10.46
The following statements specify a UCM that models the level of the river as a locally constant series with a shift in the year 1899, represented by a dummy regressor (SHIFT1899):
data nile;
   set nile;
   shift1899 = ( year >= '1jan1899'd );
run;
proc ucm data=nile;
   id year interval=year;
   model waterlevel = shift1899;
   irregular;
   level;
   estimate;
   forecast plot=decomp;
run;
The plot in Output 29.7.4 shows the smoothed trend, including the correction due to the shift in the year 1899. Notice the simplicity in the shape of the smoothed curve after the incorporation of the shift information.
Output 29.7.4 Smoothed Trend plus Shift of 1899
data. See Chapter 7, "The ARIMA Procedure," for additional details about the ARIMA procedure. However, if there are missing values in the data and the model is nonstationary, then the UCM and ARIMA procedures can produce significantly different parameter estimates and predictions. An article by Kohn and Ansley (1986) suggests a statistically sound method of estimation, prediction, and interpolation for nonstationary ARIMA models with missing data. This method is based on an algorithm that is equivalent to the Kalman filtering and smoothing algorithm used in the UCM procedure. The results of an illustrative example in their article are reproduced here using the UCM procedure. In this example an ARIMA$(0,1,1)\times(0,1,1)_{12}$ model is applied to the logarithm of the air series in the sashelp.air data set. Four different missing value patterns are considered to highlight different aspects of the problem:

Data1. The full data set of 144 observations.
Data2. The set of 78 observations that omit January through November in each of the last 6 years.
Data3. The data set with the 5 observations July 1949, June, July, and August 1957, and July 1960 missing.
Data4. The data set with all July observations missing and June and August 1957 also missing.

The following DATA steps create these data sets:
data Data1;
   set sashelp.air;
   logair = log(air);
run;

data Data2;
   set data1;
   if year(date) >= 1955 and month(date) < 12 then logair = .;
run;

data Data3;
   set data1;
   if (year(date) = 1949 and month(date) = 7) then logair = .;
   if (year(date) = 1957 and
      (month(date) = 6 or month(date) = 7 or month(date) = 8))
      then logair = .;
   if (year(date) = 1960 and month(date) = 7) then logair = .;
run;

data Data4;
   set data1;
   if month(date) = 7 then logair = .;
   if year(date) = 1957 and (month(date) = 6 or month(date) = 8)
      then logair = .;
run;
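A quick way to confirm that the four missing-value patterns are coded as described is to count the observations that remain nonmissing under each rule. The following sketch applies the same conditions directly to sashelp.air in one pass; it assumes that AIR itself has no missing values, and the counter names N1 through N4 are hypothetical.

```sas
data _null_;
   set sashelp.air end=last;
   yr = year(date);
   mo = month(date);
   n1 + 1;                                             /* Data1: all rows  */
   if not (yr >= 1955 and mo < 12) then n2 + 1;        /* Data2 rule       */
   if not ((yr = 1949 and mo = 7) or
           (yr = 1957 and mo in (6, 7, 8)) or
           (yr = 1960 and mo = 7)) then n3 + 1;        /* Data3 rule       */
   if not (mo = 7 or (yr = 1957 and mo in (6, 8)))
      then n4 + 1;                                     /* Data4 rule       */
   if last then put n1= n2= n3= n4=;
run;
```

The counts for Data1 and Data2 should come out as 144 and 78, the figures quoted in the text; Data3 drops 5 observations and Data4 drops the 12 July values plus 2 more.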
The following statements specify the ARIMA$(0,1,1)\times(0,1,1)_{12}$ model for the first data set (Data1):
proc ucm data=Data1;
   id date interval=month;
   model logair;
   irregular q=1 sq=1 s=12;
   deplag lags=(1)(12) phi=1 1 noest;
   estimate outest=est1;
   forecast outfor=for1;
run;
Note that the moving-average part of the model is specified by using the Q=, SQ=, and S= options in the IRREGULAR statement and the differencing operator, $(1 - B)(1 - B^{12})$, is specified by using the DEPLAG statement. The model does not contain an intercept term; therefore no LEVEL statement is needed. The parameter estimates are saved in a data set EST1 by using the OUTEST= option in the ESTIMATE statement and the forecasts and the component estimates are saved in a data set FOR1 by using the OUTFOR= option in the FORECAST statement. The same analysis is performed on the other three data sets, but is not shown here. Output 29.8.1 resembles Table 1 in Kohn and Ansley (1986). This table is generated by merging the parameter estimates from the four analyses. Only the moving-average parameter estimates and their standard errors are reported. The columns EST1 and STD1 correspond to the estimates for Data1. The parameter estimates and their standard errors for the other three data sets are similarly named. Note that the parameter estimates closely match the parameter estimates in the article. However, their standard errors differ slightly. This difference could be the result of different ways of computing the Hessian at the optimum. The white noise error variance estimates are not reported here, but they agree quite closely with those in the article.
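Expanding the differencing operator shows why the DEPLAG statement with both coefficients fixed at 1 reproduces the differencing. The following is a sketch in which $y_t$ denotes logair; the symbols $\theta$, $\Theta$, and $\epsilon_t$ for the moving-average parameters and the white noise are introduced here for illustration, and the sign convention of the moving-average polynomials is an assumption.

```latex
\begin{align*}
(1 - B)(1 - B^{12})\,y_t &= y_t - y_{t-1} - y_{t-12} + y_{t-13} \\
(1 - B)(1 - B^{12})\,y_t &= (1 - \theta B)(1 - \Theta B^{12})\,\epsilon_t
\end{align*}
```

The first line is the twice-differenced series (one regular and one seasonal difference), and the second line is the full model, with the right-hand side supplied by the Q=1, SQ=1, and S=12 options of the IRREGULAR statement.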
Output 29.8.1 Data Sets 1–4: Parameter Estimates and Standard Errors
PARAMETER   est1    std1    est2    std2    est3    std3    est4    std4
MA_1        0.402   0.090   0.457   0.121   0.408   0.092   0.431   0.091
SMA_1       0.557   0.073   0.758   0.236   0.566   0.075   0.573   0.074
Output 29.8.2 resembles Table 2 in Kohn and Ansley (1986). It contains forecasts and their standard errors for the four data sets. The numbers are very close to those in the article.
Output 29.8.3 is based on Data2. It resembles Table 3 in Kohn and Ansley (1986). The columns S_SERIES and VS_SERIES in the OUTFOR= data set contain the interpolated values of logair and their variances. The estimate column in Output 29.8.3 reports interpolated values (which are the same as S_SERIES), and the std column reports their standard errors (which are computed as the square root of VS_SERIES) for January–November 1957. The actual logair values for these months, which are missing in Data2, are also provided for comparison. The numbers are very close to those in the article.
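The estimate and std columns can be derived from the OUTFOR= data set with a short DATA step. The following is a sketch rather than the code used to produce the output: it repeats the Data1 model statements for Data2, and the data set names FOR2 and INTERP57 are hypothetical.

```sas
/* Recreate Data2: January-November missing in each of the last 6 years */
data Data2;
   set sashelp.air;
   logair = log(air);
   if year(date) >= 1955 and month(date) < 12 then logair = .;
run;

proc ucm data=Data2;
   id date interval=month;
   model logair;
   irregular q=1 sq=1 s=12;
   deplag lags=(1)(12) phi=1 1 noest;
   estimate;
   forecast outfor=for2;
run;

/* Interpolated values and standard errors for January-November 1957 */
data interp57;
   set for2;
   where year(date) = 1957 and month(date) < 12;
   estimate = s_series;          /* smoothed (interpolated) value     */
   std      = sqrt(vs_series);   /* standard error from the variance  */
   keep date estimate std;
run;

proc print data=interp57 noobs;
run;
```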
Output 29.8.3 Data Set 2: Interpolated Values and Standard Errors
DATE    logair   estimate   std
JAN57   5.753    5.733      0.045
FEB57   5.707    5.738      0.049
MAR57   5.875    5.893      0.052
APR57   5.852    5.850      0.054
MAY57   5.872    5.843      0.055
JUN57   6.045    5.951      0.055
JUL57   6.142    6.051      0.055
AUG57   6.146    6.055      0.054
SEP57   6.001    5.938      0.052
OCT57   5.849    5.812      0.049
NOV57   5.720    5.680      0.045
Output 29.8.4 resembles Table 4 in Kohn and Ansley (1986). These numbers are based on Data3, and they also are very close to those in the article.
Output 29.8.5 resembles Table 5 in Kohn and Ansley (1986). As before, the numbers are very close to those in the article.
Output 29.8.5 Data Set 4: Interpolated Values and Standard Errors
DATE    logair   estimate   std
JUN57   6.045    6.023      0.030
AUG57   6.146    6.147      0.030
The similarity between the outputs in this example and the results shown in Kohn and Ansley (1986) demonstrates that PROC UCM can be effectively used for nonstationary ARIMA models with missing data.
References
Akaike, H. (1974), "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, AC-19, 716–723.

Anderson, T. W. (1971), The Statistical Analysis of Time Series, New York: John Wiley & Sons.

Bloomfield, P. (2000), Fourier Analysis of Time Series, Second Edition, New York: John Wiley & Sons.

Box, G. E. P. and Jenkins, G. M. (1976), Time Series Analysis: Forecasting and Control, San Francisco: Holden-Day.

Bozdogan, H. (1987), "Model Selection and Akaike's Information Criterion (AIC): The General Theory and Its Analytical Extensions," Psychometrika, 52, 345–370.

Brockwell, P. J. and Davis, R. A. (1991), Time Series: Theory and Methods, Second Edition, New York: Springer-Verlag.

Burnham, K. P. and Anderson, D. R. (1998), Model Selection and Inference: A Practical Information-Theoretic Approach, New York: Springer-Verlag.

Cobb, G. W. (1978), "The Problem of the Nile: Conditional Solution to a Change Point Problem," Biometrika, 65, 243–251.

de Jong, P. and Chu-Chun-Lin, S. (2003), "Smoothing with an Unknown Initial Condition," Journal of Time Series Analysis, vol. 24, no. 2, 141–148.

de Jong, P. and Penzer, J. (1998), "Diagnosing Shocks in Time Series," Journal of the American Statistical Association, vol. 93, no. 442, 796–806.

Durbin, J. and Koopman, S. J. (2001), Time Series Analysis by State Space Methods, Oxford: Oxford University Press.

Hannan, E. J. and Quinn, B. G. (1979), "The Determination of the Order of an Autoregression," Journal of the Royal Statistical Society, Series B, 41, 190–195.

Harvey, A. C. (1989), Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge: Cambridge University Press.

Harvey, A. C. (2001), "Testing in Unobserved Components Models," Journal of Forecasting, 20, 1–19.

Hodrick, R. and Prescott, E. (1997), "Postwar U.S. Business Cycles: An Empirical Investigation," Journal of Money, Credit, and Banking, 29, 1–16.

Hurvich, C. M. and Tsai, C.-L. (1989), "Regression and Time Series Model Selection in Small Samples," Biometrika, 76, 297–307.

Jones, Richard H. (1980), "Maximum Likelihood Fitting of ARMA Models to Time Series with Missing Observations," Technometrics, 22, 389–396.

Kohn, R. and Ansley, C. F. (1986), "Estimation, Prediction, and Interpolation for ARIMA Models with Missing Data," Journal of the American Statistical Association, vol. 81, no. 395, 751–761.

Schwarz, G. (1978), "Estimating the Dimension of a Model," Annals of Statistics, 6, 461–464.

West, M. and Harrison, J. (1999), Bayesian Forecasting and Dynamic Models, Second Edition, New York: Springer-Verlag.
Chapter 30

Vector Error Correction Modeling                           1962
I(2) Model                                                 1978
Multivariate GARCH Modeling                                1981
Output Data Sets                                           1987
OUT= Data Set                                              1987
OUTEST= Data Set                                           1989
OUTHT= Data Set                                            1991
OUTSTAT= Data Set                                          1992
Printed Output                                             1994
ODS Table Names                                            1995
ODS Graphics                                               2000
Computational Issues                                       2001
Examples: VARMAX Procedure                                 2002
   Example 30.1: Analysis of U.S. Economic Variables       2002
   Example 30.2: Analysis of German Economic Variables     2013
   Example 30.3: Numerous Examples                         2025
   Example 30.4: Illustration of ODS Graphics              2028
References                                                 2031
- Schwarz Bayesian criterion (SBC), also known as Bayesian information criterion (BIC)

If you do not want to use the automatic order selection, the VARMAX procedure provides autoregressive order identification aids:

- partial cross-correlations
- Yule-Walker estimates
- partial autoregressive coefficients
- partial canonical correlations

For situations where the stationarity of the time series is in question, the VARMAX procedure provides tests to aid in determining the presence of unit roots and cointegration. These tests include the following:

- Dickey-Fuller tests
- Johansen cointegration test for nonstationary vector processes of integrated order one
- Stock-Watson common trends test for the possibility of cointegration among nonstationary vector processes of integrated order one
- Johansen cointegration test for nonstationary vector processes of integrated order two

For stationary vector time series (or nonstationary series made stationary by appropriate differencing), the VARMAX procedure provides for vector autoregressive and moving-average (VARMA) and Bayesian vector autoregressive (BVAR) models. To cope with the problem of high dimensionality in the parameters of the VAR model, the VARMAX procedure provides both the vector error correction model (VECM) and the Bayesian vector error correction model (BVECM). Bayesian models are used when prior information about the model parameters is available. The VARMAX procedure also allows independent (exogenous) variables with their distributed lags to influence dependent (endogenous) variables in various models such as VARMAX, BVARX, VECMX, and BVECMX models.

Forecasting is one of the main objectives of multivariate time series analysis. After successfully fitting the VARMAX, BVARX, VECMX, and BVECMX models, the VARMAX procedure computes predicted values based on the parameter estimates and the past values of the vector time series.
The model parameter estimation methods are the following:

- least squares (LS)
- maximum likelihood (ML)

The VARMAX procedure provides various hypothesis tests of long-run effects and adjustment coefficients by using the likelihood ratio test based on Johansen cointegration analysis. The VARMAX procedure offers the likelihood ratio test of the weak exogeneity for each variable.

After fitting the model parameters, the VARMAX procedure provides for model checks and residual analysis by using the following tests:

- Durbin-Watson (DW) statistics
- F test for autoregressive conditional heteroscedastic (ARCH) disturbance
- F test for AR disturbance
- Jarque-Bera normality test
- Portmanteau test

The VARMAX procedure supports several modeling features, including the following:

- seasonal deterministic terms
- subset models
- multiple regression with distributed lags
- dead-start model that does not have present values of the exogenous variables
- GARCH-type multivariate conditional heteroscedasticity models

The VARMAX procedure provides a Granger causality test to determine the Granger-causal relationships between two distinct groups of variables. It also provides the following:

- infinite order AR representation
- impulse response function (or infinite order MA representation)
- decomposition of the predicted error covariances
- roots of the characteristic functions for both the AR and MA parts to evaluate the proximity of the roots to the unit circle
- contemporaneous relationships among the components of the vector time series
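$$ y_t = \delta + \sum_{i=1}^{p} \Phi_i y_{t-i} + \epsilon_t $$

where $\epsilon_t$ is a vector white noise process with $\epsilon_t = (\epsilon_{1t}, \ldots, \epsilon_{kt})'$ such that $E(\epsilon_t) = 0$, $E(\epsilon_t \epsilon_t') = \Sigma$, and $E(\epsilon_t \epsilon_s') = 0$ for $t \neq s$; $\delta = (\delta_1, \ldots, \delta_k)'$ is a constant vector and $\Phi_i$ is a $k \times k$ matrix.

Analyzing and modeling the series jointly enables you to understand the dynamic relationships over time among the series and to improve the accuracy of forecasts for individual series by using the additional information available from the related series and their forecasts.

Consider the first-order bivariate vector autoregressive model

$$ y_t = \begin{pmatrix} 1.2 & -0.5 \\ 0.6 & 0.3 \end{pmatrix} y_{t-1} + \epsilon_t, \quad \text{with } \Sigma = \begin{pmatrix} 1.0 & 0.5 \\ 0.5 & 1.25 \end{pmatrix} $$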
The following IML procedure statements simulate a bivariate vector time series from this model to provide test data for the VARMAX procedure:
   proc iml;
      sig = {1.0 0.5, 0.5 1.25};
      phi = {1.2 -0.5, 0.6 0.3};
      /* simulate the vector time series */
      call varmasim(y,phi) sigma = sig n = 100 seed = 34657;
      cn = {'y1' 'y2'};
      create simul1 from y[colname=cn];
      append from y;
   quit;
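As a quick check on the simulation settings, the following Python sketch (not part of the SAS program; the helper name `eigen_moduli_2x2` is introduced here only for illustration) computes the moduli of the eigenvalues of the AR coefficient matrix `phi`. A VAR(1) process is stationary when every eigenvalue of its coefficient matrix lies inside the unit circle.

```python
import cmath

def eigen_moduli_2x2(m):
    """Return the moduli of the two eigenvalues of a 2x2 matrix."""
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    # Roots of the characteristic polynomial z^2 - trace*z + det = 0
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return [abs((trace + root) / 2.0), abs((trace - root) / 2.0)]

# AR(1) coefficient matrix passed to VARMASIM in the IML step
phi = [[1.2, -0.5],
       [0.6,  0.3]]

moduli = eigen_moduli_2x2(phi)
print("eigenvalue moduli:", moduli)
print("stationary:", all(m < 1.0 for m in moduli))
```

Here the eigenvalues form a complex conjugate pair, so both moduli equal the square root of the determinant (0.66), which is below 1.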
The following statements plot the simulated vector time series yt shown in Figure 30.1:
   data simul1;
      set simul1;
      date = intnx( 'year', '01jan1900'd, _n_-1 );
      format date year4.;
   run;

   proc timeseries data=simul1 vectorplot=series;
      id date interval=year;
      var y1 y2;
   run;
The following statements fit a VAR(1) model to the simulated data. First, you specify the input data set in the PROC VARMAX statement. Then, you use the MODEL statement to designate the dependent variables, y1 and y2. To estimate a VAR model with mean zero, you specify the order of the autoregressive model with the P= option and the NOINT option. The MODEL statement fits the model to the data and prints parameter estimates and their significance. The PRINT=ESTIMATES option prints the matrix form of parameter estimates, and the PRINT=DIAGNOSE option prints various diagnostic tests. The LAGMAX=3 option is used to print the output for the residual diagnostic checks. To output the forecasts to a data set, you specify the OUTPUT statement with the OUT= option. If you want to forecast five steps ahead, you use the LEAD=5 option. The ID statement specifies the yearly interval between observations and provides the Time column in the forecast output. The VARMAX procedure output is shown in Figure 30.2 through Figure 30.10.
   /*--- Vector Autoregressive Model ---*/
   proc varmax data=simul1;
      id date interval=year;
      model y1 y2 / p=1 noint lagmax=3
                    print=(estimates diagnose);
      output out=for lead=5;
   run;
[Figure 30.2: descriptive statistics for y1 and y2; N = 100 for each variable]
The VARMAX procedure first displays descriptive statistics. The Type column specifies that the variables are dependent variables. The column N stands for the number of nonmissing observations. Figure 30.3 shows the type and the estimation method of the fitted model for the simulated data. It also shows the AR coefficient matrix in terms of lag 1, the parameter estimates, and their significance, which can indicate how well the model fits the data. The second table schematically represents the parameter estimates and allows for easy verification of their significance in matrix form. In the last table, the first column gives the left-hand-side variable of the equation; the second column is the parameter name ARl_i_j, which indicates the (i, j)th element of the lag l autoregressive coefficient; the last column is the regressor that corresponds to the displayed parameter.
[Figure 30.3: model type and parameter estimates]

Model Parameter Estimates

Equation  Parameter  Standard Error  t Value  Pr > |t|  Variable
y1        AR1_1_1    0.05508          21.06   0.0001    y1(t-1)
          AR1_1_2    0.05898          -8.66   0.0001    y2(t-1)
y2        AR1_2_1    0.05779           9.45   0.0001    y1(t-1)
          AR1_2_2    0.06188           6.22   0.0001    y2(t-1)
The fitted VAR(1) model with estimated standard errors in parentheses is given as

$$ y_t = \begin{pmatrix} \underset{(0.055)}{1.160} & \underset{(0.059)}{-0.511} \\ \underset{(0.058)}{0.546} & \underset{(0.062)}{0.385} \end{pmatrix} y_{t-1} + \epsilon_t $$
The model can also be written as two univariate regression equations.

$$ y_{1t} = 1.160\, y_{1,t-1} - 0.511\, y_{2,t-1} + \epsilon_{1t} $$
$$ y_{2t} = 0.546\, y_{1,t-1} + 0.385\, y_{2,t-1} + \epsilon_{2t} $$
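The t values in the parameter estimates table are the ratios of the estimates to their standard errors. The following Python sketch recomputes them from the rounded estimates in the fitted equation; it is an illustrative check, not VARMAX output, and the dictionary names are chosen here for convenience.

```python
# Estimates (rounded to three decimals in the fitted equation) and the
# standard errors printed in the parameter estimates table.
estimates = {"AR1_1_1": 1.160, "AR1_1_2": -0.511,
             "AR1_2_1": 0.546, "AR1_2_2": 0.385}
std_errors = {"AR1_1_1": 0.05508, "AR1_1_2": 0.05898,
              "AR1_2_1": 0.05779, "AR1_2_2": 0.06188}

# t value = estimate / standard error
t_values = {name: estimates[name] / std_errors[name] for name in estimates}

for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")
```

Because the estimates are rounded, the ratios agree with the printed t values (21.06, -8.66, 9.45, 6.22) only up to rounding error.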
The table in Figure 30.4 shows the innovation covariance matrix estimates and the various information criteria results. When you compare competing models, the model with the smaller information criterion values fits the data better. The variable names in the covariance matrix are printed for convenience; y1 means the innovation for y1, and y2 means the innovation for y2.
Figure 30.4 Innovation Covariance Estimates and Information Criteria
Covariances of Innovations

Variable   y1        y2
y1         1.28875   0.39751
y2         0.39751   1.41839

Information Criteria

AICC   0.554443
HQC    0.595201
AIC    0.552777
SBC    0.65763
FPEC   1.738092
Figure 30.5 shows the cross covariances of the residuals. The values of the lag zero are slightly different from Figure 30.4 due to the different degrees of freedom.
Figure 30.5 Multivariate Diagnostic Checks
Cross Covariances of Residuals

Lag  Variable   y1         y2
0    y1         1.24909    0.36398
     y2         0.36398    1.34203
1    y1         0.01635    0.03087
     y2        -0.07210   -0.09834
2    y1         0.06591    0.07835
     y2         0.01096   -0.05780
3    y1        -0.00203    0.08804
     y2        -0.02129    0.07277
Figure 30.6 and Figure 30.7 show tests for white noise residuals. The output shows that you cannot reject the null hypothesis that the residuals are uncorrelated.
The VARMAX procedure provides diagnostic checks for the univariate form of the equations. The table in Figure 30.8 describes how well each univariate equation fits the data. For the two univariate regression equations in Figure 30.3, the values of $R^2$ in the second column are 0.84 and 0.80, respectively. The standard deviations in the third column are the square roots of the diagonal elements of the covariance matrix from Figure 30.4. The F statistics in the fourth column test the hypotheses $\phi_{11} = \phi_{12} = 0$ and $\phi_{21} = \phi_{22} = 0$, respectively, where $\phi_{ij}$ is the $(i, j)$th element of the matrix $\Phi_1$. The last column shows the p-values of the F statistics. The results show that each univariate model is significant.
The check for white noise residuals in terms of the univariate equation is shown in Figure 30.9. This output contains information that indicates whether the residuals are correlated and heteroscedastic. In the first table, the second column contains the Durbin-Watson test statistics to test the null hypothesis that the residuals are uncorrelated. The third and fourth columns show the Jarque-Bera normality test statistics and their p-values to test the null hypothesis that the residuals are normally distributed. The last two columns show F statistics and their p-values for ARCH(1) disturbances to test the null hypothesis that the residuals have equal covariances (that is, are homoscedastic). The second table includes F statistics and their p-values for AR(1), AR(1,2), AR(1,2,3), and AR(1,2,3,4) models of residuals to test the null hypothesis that the residuals are uncorrelated.
Figure 30.9 Univariate Diagnostic Checks Continued
Univariate Model White Noise Diagnostics

Variable  Durbin Watson  Normality Chi-Square  Pr > ChiSq  ARCH Pr > F
y1        1.96656        3.32                  0.1900      0.7199
y2        2.13609        5.46                  0.0653      0.1503

Univariate Model AR Diagnostics

          AR1                AR2                AR3                AR4
Variable  F Value  Pr > F    F Value  Pr > F    F Value  Pr > F    F Value  Pr > F
y1        0.02     0.8980    0.14     0.8662    0.09     0.9629    0.82     0.5164
y2        0.52     0.4709    0.41     0.6650    0.32     0.8136    0.32     0.8664
The table in Figure 30.10 gives forecasts, their prediction errors, and 95% confidence limits. See the section "Forecasting" on page 1930 for details.
Variable  Obs  Time  Forecast   95% Confidence Limits
y1        101  2000  -3.59212   -5.81713   -1.36711
          102  2001  -3.09448   -6.44435    0.25539
          103  2002  -2.17433   -6.37792    2.02925
          104  2003  -1.11395   -5.87992    3.65203
          105  2004  -0.14342   -5.21463    4.92779
y2        101  2000  -2.09873   -4.43298    0.23551
          102  2001  -2.77050   -5.66469    0.12369
          103  2002  -2.75724   -6.17173    0.65725
          104  2003  -2.24943   -6.20709    1.70823
          105  2004  -1.47460   -5.88782    2.93863
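The one-step-ahead confidence limits in the forecast table can be checked by hand. The Python sketch below assumes that the one-step prediction error variance equals the innovation variance on the diagonal of the covariance matrix in Figure 30.4 and that the limits use the normal quantile; under those assumptions it appears to reproduce the limits for observation 101.

```python
from math import sqrt
from statistics import NormalDist

# Two-sided 95% normal quantile (about 1.96)
z = NormalDist().inv_cdf(0.975)

# One-step-ahead forecasts (Obs 101) and innovation variances
# (diagonal elements of the covariance matrix in Figure 30.4)
forecast = {"y1": -3.59212, "y2": -2.09873}
innovation_var = {"y1": 1.28875, "y2": 1.41839}

limits = {}
for name in forecast:
    half_width = z * sqrt(innovation_var[name])
    limits[name] = (forecast[name] - half_width, forecast[name] + half_width)
    print(name, [round(v, 5) for v in limits[name]])
```

For longer horizons the prediction error variance accumulates the effect of earlier innovations, which is why the printed intervals widen from observation 101 to 105.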
The output in Figure 30.11 shows that parameter estimates are slightly different from those in Figure 30.3. By choosing the appropriate priors, you might be able to get more accurate forecasts by using a BVAR model rather than by using an unconstrained VAR model. See the section Bayesian VAR and VARX Modeling on page 1947 for details.
Model Parameter Estimates

Equation  Parameter  Standard Error  t Value  Pr > |t|  Variable
y1        AR1_1_1    0.05050          20.92   0.0001    y1(t-1)
          AR1_1_2    0.04824          -7.19   0.0001    y2(t-1)
y2        AR1_2_1    0.04889           8.20   0.0001    y1(t-1)
          AR1_2_2    0.05740           8.49   0.0001    y2(t-1)
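$$ \Delta y_t = \delta + \Pi y_{t-1} + \sum_{i=1}^{p-1} \Phi_i^* \Delta y_{t-i} + \epsilon_t $$

where $\Delta$ is the differencing operator, such that $\Delta y_t = y_t - y_{t-1}$.

The equivalent VAR(p) representation is

$$ y_t = \delta + (I_k + \Pi + \Phi_1^*)\, y_{t-1} + \sum_{i=2}^{p-1} (\Phi_i^* - \Phi_{i-1}^*)\, y_{t-i} - \Phi_{p-1}^*\, y_{t-p} + \epsilon_t $$

where $I_k$ is a $k \times k$ identity matrix.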
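Consider the second-order vector autoregressive process

$$ y_t = \begin{pmatrix} -0.2 & 0.1 \\ 0.5 & 0.2 \end{pmatrix} y_{t-1} + \begin{pmatrix} 0.8 & 0.7 \\ -0.4 & 0.6 \end{pmatrix} y_{t-2} + \epsilon_t $$

This process can be given the following VECM(2) representation with the cointegration rank one:

$$ \Delta y_t = \begin{pmatrix} -0.4 \\ 0.1 \end{pmatrix} (1, -2)\, y_{t-1} - \begin{pmatrix} 0.8 & 0.7 \\ -0.4 & 0.6 \end{pmatrix} \Delta y_{t-1} + \epsilon_t $$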
The following PROC IML statements generate simulated data for the VECM(2) form specified above and plot the data as shown in Figure 30.12:

   proc iml;
      sig = 100*i(2);
      phi = {-0.2 0.1, 0.5 0.2, 0.8 0.7, -0.4 0.6};
      call varmasim(y,phi) sigma=sig n=100 initial=0 seed=45876;
      cn = {'y1' 'y2'};
      create simul2 from y[colname=cn];
      append from y;
   quit;

   data simul2;
      set simul2;
      date = intnx( 'year', '01jan1900'd, _n_-1 );
      format date year4.;
   run;

   proc timeseries data=simul2 vectorplot=series;
      id date interval=year;
      var y1 y2;
   run;
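The link between the AR matrices stacked in `phi` and the VECM(2) form can be verified numerically. The following Python sketch (an illustration outside the SAS program; variable names are chosen here) applies the p = 2 case of the VAR/VECM equivalence, $\Pi = \Phi_1 + \Phi_2 - I_2$ and $\Phi_1^* = -\Phi_2$, and confirms that $\Pi$ has rank one.

```python
identity = [[1.0, 0.0], [0.0, 1.0]]

# AR coefficient matrices stacked row-wise in the IML matrix phi
phi1 = [[-0.2, 0.1], [0.5, 0.2]]
phi2 = [[0.8, 0.7], [-0.4, 0.6]]

# Pi multiplies y_{t-1}; Phi1* multiplies the lagged difference
pi_mat = [[phi1[i][j] + phi2[i][j] - identity[i][j] for j in range(2)]
          for i in range(2)]
phi1_star = [[-phi2[i][j] for j in range(2)] for i in range(2)]

# A zero determinant with a nonzero matrix means rank one
det_pi = pi_mat[0][0] * pi_mat[1][1] - pi_mat[0][1] * pi_mat[1][0]

# Normalizing the cointegrating vector on y1: Pi = alpha * beta'
alpha = [pi_mat[0][0], pi_mat[1][0]]
beta = [1.0, pi_mat[0][1] / pi_mat[0][0]]

print("Pi =", pi_mat)
print("det(Pi) =", det_pi)
print("alpha =", alpha, "beta =", beta)
```

Up to floating-point rounding, this gives alpha = (-0.4, 0.1)' and beta = (1, -2)'.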
Cointegration Testing
The following statements use the Johansen cointegration rank test. The COINTTEST=(JOHANSEN) option does the Johansen trace test and is equivalent to specifying COINTTEST with no additional options or the COINTTEST=(JOHANSEN=(TYPE=TRACE)) option.
   /*--- Cointegration Test ---*/
   proc varmax data=simul2;
      model y1 y2 / p=2 noint dftest cointtest=(johansen);
   run;
Figure 30.13 shows the output for Dickey-Fuller tests for the nonstationarity of each series and Johansen cointegration rank test between series.
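[Figure 30.13: Dickey-Fuller tests for y1 and y2, and Johansen cointegration rank test with rows H0: Rank = 0 against H1: Rank > 0 and H0: Rank = 1 against H1: Rank > 1]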
In Dickey-Fuller tests, the second column specifies three types of models, which are zero mean, single mean, or trend. The third column (Rho) and the fifth column (Tau) are the test statistics for unit root testing. Other columns are their p-values. You can see that both series have unit roots. For a description of Dickey-Fuller tests, see the section "PROBDF Function for Dickey-Fuller Tests" on page 158 in Chapter 5, "SAS Macros and Functions."

In the cointegration rank test, the last two columns explain the drift in the model or process. Since the NOINT option is specified, the model is

$$ \Delta y_t = \Pi y_{t-1} + \Phi_1^* \Delta y_{t-1} + \epsilon_t $$
The column Drift In ECM means there is no separate drift in the error correction model, and the column Drift In Process means the process has a constant drift before differencing. H0 is the null hypothesis, and H1 is the alternative hypothesis. The first row tests $r = 0$ against $r > 0$; the second row tests $r = 1$ against $r > 1$. The Trace test statistics in the fourth column are computed by $-T \sum_{i=r+1}^{k} \log(1 - \lambda_i)$, where $T$ is the available number of observations and $\lambda_i$ is the $i$th eigenvalue in the third column. By default, the critical values at the 5% significance level are used for testing. You can compare the test statistics and critical values in each row. There is one cointegrated process in this example since the Trace statistic for testing $r = 0$ against $r > 0$ is greater than the critical value, but the Trace statistic for testing $r = 1$ against $r > 1$ is not greater than the critical value.

The following statements fit a VECM(2) form to the simulated data. From the result in Figure 30.13, the time series are cointegrated with rank=1. You specify the ECM= option with the RANK=1 option. For normalizing the value of the cointegrated vector, you specify the normalized variable with the NORMALIZE= option. The PRINT=(IARR) option provides the VAR(2) representation. The VARMAX procedure output is shown in Figure 30.14 through Figure 30.16.
   /*--- Vector Error-Correction Model ---*/
   proc varmax data=simul2;
      model y1 y2 / p=2 noint lagmax=3
                    ecm=(rank=1 normalize=y1)
                    print=(iarr estimates);
   run;
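The trace statistic formula described for the cointegration rank test can be sketched in Python as follows. The eigenvalues and the number of observations used here are hypothetical values chosen only to illustrate the arithmetic; they are not the values printed by PROC VARMAX, and the function name `trace_statistics` is an assumption of this sketch.

```python
from math import log

def trace_statistics(eigenvalues, n_obs):
    """Return -T * sum_{i=r+1}^{k} log(1 - lambda_i) for r = 0, ..., k-1."""
    ordered = sorted(eigenvalues, reverse=True)
    return [-n_obs * sum(log(1.0 - lam) for lam in ordered[r:])
            for r in range(len(ordered))]

# Hypothetical eigenvalues and sample size, for illustration only
stats = trace_statistics([0.5, 0.1], n_obs=98)

for r, stat in enumerate(stats):
    print(f"H0: rank = {r}   trace = {stat:.4f}")
```

Each statistic is then compared with the critical value in the same row; the estimated rank is the first r for which the statistic does not exceed its critical value.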
The ECM= option produces the estimates of the long-run parameter, $\beta$, and the adjustment coefficient, $\alpha$. In Figure 30.14, 1 indicates the first column of the $\alpha$ and $\beta$ matrices. Since the cointegration rank is 1 in the bivariate system, $\alpha$ and $\beta$ are two-dimensional vectors. The estimated cointegrating vector is $\hat{\beta} = (1, -1.96)'$. Therefore, the long-run relationship between $y_{1t}$ and $y_{2t}$ is $y_{1t} = 1.96\, y_{2t}$. The first element of $\hat{\beta}$ is 1 since y1 is specified as the normalized variable.
Figure 30.14 Parameter Estimates for the VECM(2) Form
The VARMAX Procedure

Type of Model        VECM(2)
Estimation Method    Maximum Likelihood Estimation
Cointegrated Rank    1
Figure 30.15 shows the parameter estimates in terms of lag one coefficients, $y_{t-1}$, and lag one first differenced coefficients, $\Delta y_{t-1}$, and their significance. Alpha * Beta' indicates the coefficients of $y_{t-1}$ and is obtained by multiplying the Alpha and Beta estimates in Figure 30.14. The parameter AR1_i_j corresponds to the elements in the Alpha * Beta' matrix. The t values and p-values corresponding to the parameters AR1_i_j are missing since the parameters AR1_i_j have non-Gaussian distributions. The parameter AR2_i_j corresponds to the elements in the differenced lagged AR coefficient matrix. The D_ prefixed to a variable name in Figure 30.15 implies differencing.
AR Coefficients of Differenced Lag

DIF Lag  Variable   y1         y2
1        y1        -0.74332   -0.74621
         y2         0.40493   -0.57157

Model Parameter Estimates

Equation  Parameter  Standard Error  t Value  Pr > |t|  Variable
D_y1      AR1_1_1    0.04786                            y1(t-1)
          AR1_1_2    0.09359                            y2(t-1)
          AR2_1_1    0.04526         -16.42   0.0001    D_y1(t-1)
          AR2_1_2    0.04769         -15.65   0.0001    D_y2(t-1)
D_y2      AR1_2_1    0.05146                            y1(t-1)
          AR1_2_2    0.10064                            y2(t-1)
          AR2_2_1    0.04867           8.32   0.0001    D_y1(t-1)
          AR2_2_2    0.05128         -11.15   0.0001    D_y2(t-1)
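The fitted model is given as

$$ \Delta y_t = \begin{pmatrix} \underset{(0.048)}{-0.467} & \underset{(0.094)}{0.913} \\ \underset{(0.051)}{0.107} & \underset{(0.100)}{-0.209} \end{pmatrix} y_{t-1} + \begin{pmatrix} \underset{(0.045)}{-0.743} & \underset{(0.048)}{-0.746} \\ \underset{(0.049)}{0.405} & \underset{(0.051)}{-0.572} \end{pmatrix} \Delta y_{t-1} + \epsilon_t $$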
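The coefficient matrix of $y_{t-1}$ in the fitted model is the product Alpha * Beta'. As a rough check, the Python sketch below forms that outer product. The values of alpha are not quoted directly in the text; they are taken here from the first column of the fitted matrix, which equals alpha because the first element of beta is normalized to 1.

```python
# Adjustment coefficients: first column of the fitted Alpha * Beta' matrix
alpha = [-0.467, 0.107]
# Estimated cointegrating vector, normalized on y1
beta = [1.0, -1.96]

# Outer product alpha * beta'
alpha_beta_t = [[a * b for b in beta] for a in alpha]

for row in alpha_beta_t:
    print([round(v, 3) for v in row])
```

The second column differs slightly from the printed 0.913 and -0.209 because beta is rounded to two decimals in the text.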
The PRINT=(IARR) option in the previous SAS statements prints the reparameterized coefficient estimates. For the LAGMAX=3 in the SAS statements, the coefficient matrix of lag 3 is zero. The VECM(2) form in Figure 30.16 can be rewritten as the following second-order vector autoregressive model:

$$ y_t = \begin{pmatrix} -0.210 & 0.167 \\ 0.512 & 0.220 \end{pmatrix} y_{t-1} + \begin{pmatrix} 0.743 & 0.746 \\ -0.405 & 0.572 \end{pmatrix} y_{t-2} + \epsilon_t $$
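The VAR(2) coefficients follow from the VECM(2) estimates through the p = 2 case of the equivalence between the two forms: the lag 1 matrix is $I_2 + \alpha\beta' + \Phi_1^*$ and the lag 2 matrix is $-\Phi_1^*$. The Python sketch below applies this to the rounded estimates of the fitted model; it is an illustrative check rather than procedure output.

```python
identity = [[1.0, 0.0], [0.0, 1.0]]

# Rounded estimates from the fitted VECM(2) form
alpha_beta_t = [[-0.467, 0.913], [0.107, -0.209]]   # coefficients of y_{t-1}
phi1_star = [[-0.743, -0.746], [0.405, -0.572]]     # coefficients of the lagged difference

# VAR(2) coefficient matrices implied by the VECM(2) form
var_lag1 = [[identity[i][j] + alpha_beta_t[i][j] + phi1_star[i][j]
             for j in range(2)] for i in range(2)]
var_lag2 = [[-phi1_star[i][j] for j in range(2)] for i in range(2)]

print("lag 1:", [[round(v, 3) for v in row] for row in var_lag1])
print("lag 2:", var_lag2)
```

The (2,2) element comes out as 0.219 rather than the printed 0.220 because the inputs are rounded to three decimals.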
Figure 30.17 shows the model type fitted to the data, the estimates of the adjustment coefficient ($\alpha$), the parameter estimates in terms of lag one coefficients ($y_{t-1}$), and lag one first differenced coefficients ($\Delta y_{t-1}$).
AR Coefficients of Differenced Lag

DIF Lag  Variable   y1         y2
1        y1        -0.80070   -0.59320
         y2         0.33417   -0.53480
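$$ y_t = \delta + \sum_{i=1}^{p} \Phi_i y_{t-i} + \sum_{i=0}^{s} \Theta_i^* x_{t-i} + \epsilon_t $$

where $x_t = (x_{1t}, \ldots, x_{rt})'$ is an r-dimensional time series vector and $\Theta_i^*$ is a $k \times r$ matrix.

For example, a VARX(1,0) model is

$$ y_t = \delta + \Phi_1 y_{t-1} + \Theta_0^* x_t + \epsilon_t $$

where $y_t = (y_{1t}, y_{2t}, y_{3t})'$ and $x_t = (x_{1t}, x_{2t})'$. The following statements fit the VARX(1,0) model to the given data: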
   data grunfeld;
      input year y1 y2 y3 x1 x2 x3;
      label y1='Gross Investment GE'
            y2='Capital Stock Lagged GE'
            y3='Value of Outstanding Shares GE Lagged'
            x1='Gross Investment W'
            x2='Capital Stock Lagged W'
            x3='Value of Outstanding Shares Lagged W';
   datalines;
   1935 33.1 1170.6 97.8 12.93 191.5 1.8
   1936 45.0 2015.8 104.4 25.90 516.0 .8
   1937 77.2 2803.3 118.0 35.05 729.0 7.4
   ... more lines ...
   /*--- Vector Autoregressive Process with Exogenous Variables ---*/
   proc varmax data=grunfeld;
      model y1-y3 = x1 x2 / p=1 lagmax=5
                            printform=univariate
                            print=(impulsx=(all) estimates);
   run;
The VARMAX procedure output is shown in Figure 30.18 through Figure 30.20. Figure 30.18 shows the descriptive statistics for the dependent (endogenous) and independent (exogenous) variables with labels.
Simple Summary Statistics

Variable  N   Standard Deviation  Label
y1        20   48.58450           Gross Investment GE
y2        20  413.84329           Capital Stock Lagged GE
y3        20  250.61885           Value of Outstanding Shares GE Lagged
x1        20   19.11019           Gross Investment W
x2        20  222.39193           Capital Stock Lagged W
Figure 30.19 shows the parameter estimates for the constant, the lag zero coefficients of exogenous variables, and the lag one AR coefficients. From the schematic representation of parameter estimates, the significance of the parameter estimates can be easily verified. The symbol C means the constant and XL0 means the lag zero coefficients of exogenous variables.
AR

Lag  Variable   y1         y2         y3
1    y1         0.23699    0.00763    0.02941
     y2        -2.46656    0.16379   -0.84090
     y3         0.95116    0.00224    0.93801

Schematic Representation

Variable  C   XL0
y1        .   +.
y2        +   .+
y3        -   ..

Model Parameter Estimates

Equation  Parameter   Estimate
y1        CONST1      -12.01279
          XL0_1_1       1.69281
          XL0_1_2      -0.00859
          AR1_1_1       0.23699
          AR1_1_2       0.00763
          AR1_1_3       0.02941
y2        CONST2      702.08673
          XL0_2_1      -6.09850
          XL0_2_2       2.57980
          AR1_2_1      -2.46656
          AR1_2_2       0.16379
          AR1_2_3      -0.84090
y3        CONST3      -22.42110
          XL0_3_1      -0.02317
          XL0_3_2      -0.01274
          AR1_3_1       0.95116
          AR1_3_2       0.00224
          AR1_3_3       0.93801
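The fitted model is given as

$$
\begin{pmatrix} y_{1t} \\ y_{2t} \\ y_{3t} \end{pmatrix}
=
\begin{pmatrix} \underset{(27.471)}{-12.013} \\ \underset{(256.480)}{702.086} \\ \underset{(10.312)}{-22.421} \end{pmatrix}
+
\begin{pmatrix}
\underset{(0.544)}{1.693} & \underset{(0.054)}{-0.009} \\
\underset{(5.078)}{-6.099} & \underset{(0.501)}{2.580} \\
\underset{(0.204)}{-0.023} & \underset{(0.020)}{-0.013}
\end{pmatrix}
\begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix}
+
\begin{pmatrix}
\underset{(0.207)}{0.237} & \underset{(0.016)}{0.008} & \underset{(0.049)}{0.029} \\
\underset{(1.930)}{-2.467} & \underset{(0.152)}{0.164} & \underset{(0.453)}{-0.841} \\
\underset{(0.078)}{0.951} & \underset{(0.006)}{0.002} & \underset{(0.018)}{0.938}
\end{pmatrix}
\begin{pmatrix} y_{1,t-1} \\ y_{2,t-1} \\ y_{3,t-1} \end{pmatrix}
+
\begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \\ \epsilon_{3t} \end{pmatrix}
$$

with the elements of the coefficient matrices denoted as

$$
\Theta_0^* = \begin{pmatrix} \theta_{11}^* & \theta_{12}^* \\ \theta_{21}^* & \theta_{22}^* \\ \theta_{31}^* & \theta_{32}^* \end{pmatrix},
\qquad
\Phi_1 = \begin{pmatrix} \phi_{11} & \phi_{12} & \phi_{13} \\ \phi_{21} & \phi_{22} & \phi_{23} \\ \phi_{31} & \phi_{32} & \phi_{33} \end{pmatrix}
$$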
In Figure 30.20 of the preceding section, you can see several insignificant parameters. For example, the coefficients XL0_1_2, AR1_1_2, and AR1_3_2 are insignificant. The following statements impose the restrictions $\theta_{12}^* = \phi_{12} = \phi_{32} = 0$ on the VARX(1,0) model.
   /*--- Models with Restrictions and Tests ---*/
   proc varmax data=grunfeld;
      model y1-y3 = x1 x2 / p=1 print=(estimates);
      restrict XL(0,1,2)=0, AR(1,1,2)=0, AR(1,3,2)=0;
   run;
The output in Figure 30.21 shows that the three parameters $\theta_{12}^*$, $\phi_{12}$, and $\phi_{32}$ are replaced by the restricted values, zeros. In the schematic representation of parameter estimates, the three restricted parameters $\theta_{12}^*$, $\phi_{12}$, and $\phi_{32}$ are replaced by *.
AR

Lag  Variable   y1         y2         y3
1    y1         0.27671    0.00000    0.01747
     y2        -2.16968    0.10945   -0.93053
     y3         0.96398    0.00000    0.93412

Schematic Representation

Variable  C   XL0
y1        .   +*
y2        +   .+
y3        -   ..
The output in Figure 30.22 shows the estimates of the Lagrangian parameters and their significance. Based on the p-values associated with the Lagrangian parameters, you cannot reject the null hypotheses $\theta_{12}^* = 0$, $\phi_{12} = 0$, and $\phi_{32} = 0$ at the 0.05 significance level.
Figure 30.22 RESTRICT Statement Results
Testing of the Restricted Parameters

Standard Error
  21.44026
  70.74347
 164.03075
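The following statements use the TEST statement to test $\phi_{31} = 0$ and $\theta_{12}^* = \phi_{12} = \phi_{32} = 0$ for the VARX(1,0) model: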
   proc varmax data=grunfeld;
      model y1-y3 = x1 x2 / p=1;
      test AR(1,3,1)=0;
      test XL(0,1,2)=0, AR(1,1,2)=0, AR(1,3,2)=0;
   run;
In the output in Figure 30.23, the first column is the index corresponding to each TEST statement. You can reject the hypothesis $\phi_{31} = 0$ at the 0.05 significance level, but you cannot reject the joint hypothesis $\theta_{12}^* = \phi_{12} = \phi_{32} = 0$ at the 0.05 significance level.
Figure 30.23 TEST Statement Results
The VARMAX Procedure

Testing of the Parameters

Test  DF  Chi-Square  Pr > ChiSq
1     1   150.31      <.0001
2     3     0.34      0.9522
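The p-values in the TEST statement results are upper-tail chi-square probabilities. The Python sketch below recomputes them with closed-form expressions that hold for 1 and 3 degrees of freedom; the function name `chi2_sf` is introduced here for illustration and the sketch is not a general chi-square routine.

```python
from math import erfc, exp, pi, sqrt

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate for df = 1 or df = 3."""
    if df == 1:
        return erfc(sqrt(x / 2.0))
    if df == 3:
        return erfc(sqrt(x / 2.0)) + sqrt(2.0 * x / pi) * exp(-x / 2.0)
    raise ValueError("only df = 1 and df = 3 are handled in this sketch")

p_test1 = chi2_sf(150.31, 1)   # single restriction AR(1,3,1)=0
p_test2 = chi2_sf(0.34, 3)     # joint test of three restrictions

print(f"Test 1: p = {p_test1:.3g}")
print(f"Test 2: p = {p_test2:.4f}")
```

The second p-value matches the printed 0.9522 only approximately because the chi-square statistic is rounded to two decimals.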
Causality Testing
The following statements use the CAUSAL statement to compute the Granger causality test for a VAR(1) model. For the Granger causality tests, the autoregressive order should be defined by the P= option in the MODEL statement. The variable groups are defined in the MODEL statement as well. Regardless of whether the variables specified in the GROUP1= and GROUP2= options are designated as dependent or exogenous (independent) variables in the MODEL statement, the CAUSAL statement fits the VAR(p) model by considering the variables in the two groups as dependent variables.
   /*--- Causality Testing ---*/
   proc varmax data=grunfeld;
      model y1-y3 = x1 x2 / p=1 noprint;
      causal group1=(x1) group2=(y1-y3);
      causal group1=(y3) group2=(y1 y2);
   run;
The output in Figure 30.24 is associated with the CAUSAL statement. The first CAUSAL statement fits the VAR(1) model by using the variables y1, y2, y3, and x1. The second CAUSAL statement fits the VAR(1) model by using the variables y1, y3, and y2.
Test 1:  Group 1 Variables: x1
         Group 2 Variables: y1 y2 y3

Test 2:  Group 1 Variables: y3
         Group 2 Variables: y1 y2
The null hypothesis of the Granger causality test is that GROUP1 is influenced only by itself, and not by GROUP2. The first column in the output is the index corresponding to each CAUSAL statement. The output shows that you cannot reject that x1 is influenced by itself and not by (y1, y2, y3) at the 0.05 significance level for Test 1. You can reject that y3 is influenced by itself and not by (y1, y2) for Test 2. See the section "VAR and VARX Modeling" on page 1941 for details.
Functional Summary
The statements and options used with the VARMAX procedure are summarized in the following table:
Table 30.1 VARMAX Functional Summary
Description | Statement | Option

Data Set Options
specify the input data set | VARMAX | DATA=
write parameter estimates to an output data set | VARMAX | OUTEST=
include covariances in the OUTEST= data set | VARMAX | OUTCOV
write the diagnostic checking tests for a model and the cointegration test results to an output data set | VARMAX | OUTSTAT=
write actuals, predictions, residuals, and confidence limits to an output data set | OUTPUT | OUT=
write the conditional covariance matrix to an output data set | GARCH | OUTHT=

BY Groups
specify BY-group processing | BY |

ID Variable
specify identifying variable | ID |
specify the time interval between observations | ID | INTERVAL=
control the alignment of SAS Date values | ID | ALIGN=

Options to Control the Optimization Process
specify the optimization options | NLOPTIONS |

Printing Control Options
specify how many lags to print results | MODEL | LAGMAX=
suppress the printed output | MODEL | NOPRINT
request all printing options | MODEL | PRINTALL
request the printing format | MODEL | PRINTFORM=
controls plots produced through ODS GRAPHICS | VARMAX | PLOTS=

PRINT= Option
print the correlation matrix of parameter estimates | MODEL | CORRB
print the cross-correlation matrices of independent variables | MODEL | CORRX
print the cross-correlation matrices of dependent variables | MODEL | CORRY
print the covariance matrices of prediction errors | MODEL | COVPE
print the cross-covariance matrices of the independent variables | MODEL | COVX
print the cross-covariance matrices of the dependent variables | MODEL | COVY
print the covariance matrix of parameter estimates | MODEL | COVB
print the decomposition of the prediction error covariance matrix | MODEL | DECOMPOSE
print the residual diagnostics | MODEL | DIAGNOSE
print the contemporaneous relationships among the components of the vector time series | MODEL | DYNAMIC
print the parameter estimates | MODEL | ESTIMATES
print the infinite order AR representation | MODEL | IARR
print the impulse response function | MODEL | IMPULSE=
print the impulse response function in the transfer function | MODEL | IMPULSX=
print the partial autoregressive coefficient matrices | MODEL | PARCOEF
print the partial canonical correlation matrices | MODEL | PCANCORR
print the partial correlation matrices | MODEL | PCORR
print the eigenvalues of the companion matrix | MODEL | ROOTS
print the Yule-Walker estimates | MODEL | YW

Model Estimation and Order Selection Options
center the dependent variables | MODEL | CENTER
specify the degrees of differencing for the specified model variables | MODEL | DIF=
specify the degrees of differencing for all independent variables | MODEL | DIFX=
specify the degrees of differencing for all dependent variables | MODEL | DIFY=
specify the vector error correction model | MODEL | ECM=
specify the estimation method | MODEL | METHOD=
select the tentative order | MODEL | MINIC=
suppress the current values of independent variables | MODEL | NOCURRENTX
suppress the intercept parameters | MODEL | NOINT
specify the number of seasonal periods | MODEL | NSEASON=
specify the order of autoregressive polynomial | MODEL | P=
specify the Bayesian prior model | MODEL | PRIOR=
specify the order of moving-average polynomial | MODEL | Q=
center the seasonal dummies | MODEL | SCENTER
specify the degree of time trend polynomial | MODEL | TREND=
specify the denominator for error covariance matrix estimates | MODEL | VARDEF=
specify the lag order of independent variables | MODEL | XLAG=
GARCH Related Options
specify the GARCH-type model | GARCH | FORM=
specify the order of the GARCH polynomial | GARCH | P=
specify the order of the ARCH polynomial | GARCH | Q=

Cointegration Related Options
print the results from the weak exogeneity test of the long-run parameters | COINTEG | EXOGENEITY
specify the restriction on the cointegrated coefficient matrix | COINTEG | H=
specify the restriction on the adjustment coefficient matrix | COINTEG | J=
specify the variable name whose cointegrating vectors are normalized | COINTEG | NORMALIZE=
specify a cointegration rank | COINTEG | RANK=
print the Johansen cointegration rank test | MODEL | COINTTEST=(JOHANSEN= )
print the Stock-Watson common trends test | MODEL | COINTTEST=(SW= )
print the Dickey-Fuller unit root test | MODEL | DFTEST=

Tests and Restrictions on Parameters
test the Granger causality | CAUSAL | GROUP1= GROUP2=
place and test restrictions on parameter estimates | RESTRICT |
test hypotheses on parameter estimates | TEST |

Forecasting Control Options
specify the size of confidence limits for forecasting | OUTPUT | ALPHA=
start forecasting before end of the input data | OUTPUT | BACK=
specify how many periods to forecast | OUTPUT | LEAD=
suppress the printed forecasts | OUTPUT | NOPRINT
DATA=SAS-data-set
   specifies the input SAS data set. If the DATA= option is not specified, the PROC VARMAX statement uses the most recently created SAS data set.

OUTEST=SAS-data-set
   writes the parameter estimates to the output data set.

OUTCOV
   writes the covariance matrix for the parameter estimates to the OUTEST= data set. This option is valid only if the OUTEST= option is specified.

OUTSTAT=SAS-data-set
   writes residual diagnostic results to an output data set. If the COINTTEST=(JOHANSEN) option is specified, the results of this option are also written to the output data set.

The following statements are examples of these options in the PROC VARMAX statement:
   proc varmax data=one outest=est outcov outstat=stat;
      model y1-y3 / p=1;
   run;

   proc varmax data=one outest=est outstat=stat;
      model y1-y3 / p=1 cointtest=(johansen);
   run;
PLOTS< (global-plot-option) > = plot-request-option < (options) >
PLOTS< (global-plot-option) > = ( plot-request-option < (options) > ... plot-request-option < (options) > )
controls the plots produced through ODS Graphics. When you specify only one plot, you can omit the parentheses around the plot request. Some examples follow:
   plots=none
   plots=all
   plots(unpack)=residual(residual normal)
   plots=(forecasts model)
You must enable ODS Graphics before requesting plots as shown in the following example. For general information about ODS Graphics, see Chapter 21, "Statistical Graphics Using ODS" (SAS/STAT User's Guide).
   ods graphics on;

   proc varmax data=one plots=impulse(simple);
      model y1-y3 / p=1;
   run;

   proc varmax data=one plots=(model residual);
      model y1-y3 / p=1;
   run;

   proc varmax data=one plots=forecasts;
      model y1-y3 / p=1;
      output lead=12;
   run;
The first VARMAX program produces the simple impulse response plots. The second VARMAX program produces the plots associated with the model and prediction errors. The plots associated with prediction errors are the ACF, PACF, IACF, distribution, white-noise, and Normal quantile plots and the prediction error plot. The third VARMAX program produces the FORECASTS and FORECASTSONLY plots.

The global-plot-option applies to the impulse and prediction error analysis plots generated by the VARMAX procedure. The following global-plot-option is available:

UNPACK
   breaks a graphic that is otherwise paneled into individual component plots.

The following plot-request-options are available:

ALL
   produces all plots appropriate for the particular analysis.

FORECASTS < (forecasts-plot-options ) >
   produces plots of the forecasts. The FORECASTSONLY plot that shows the multistep forecasts in the forecast region is produced by default. The following forecasts-plot-options are available:

   ALL
      produces the FORECASTSONLY and the FORECASTS plots. This is the default.
   FORECASTS
      produces a plot that shows the one-step-ahead as well as the multistep forecasts.
   FORECASTSONLY
      produces a plot that shows only the multistep forecasts.

IMPULSE < (impulse-plot-options ) >
   produces the plots of impulse response function and the impulse response of the transfer function. The following impulse-plot-options are available:

   ALL
      produces all impulse plots. This is the default.
   ACCUM
      produces the accumulated impulse plot.
   ORTH
      produces the orthogonalized impulse plot.
   SIMPLE
      produces the simple impulse plot.

MODEL
   produces plots of dependent variables listed in the MODEL statement and plots of the one-step-ahead predicted values for each dependent variable.

NONE
   suppresses all plots.

RESIDUAL < (residual-plot-options ) >
   produces plots associated with the prediction errors obtained after modeling the data. The following residual-plot-options are available:

   ALL
      produces all plots associated with the analysis of the prediction errors. This is the default.
   RESIDUAL
      produces the prediction error plot.
   DIAGNOSTICS
      produces a panel of plots useful in assessing the autocorrelations and white-noise of the prediction errors. The panel consists of the following:
      - the autocorrelation plot of the prediction errors
      - the partial autocorrelation plot of the prediction errors
      - the inverse autocorrelation plot of the prediction errors
      - the log scaled white noise plot of the prediction errors
   NORMAL
      produces a panel of plots useful in assessing normality of the prediction errors. The panel consists of the following:
      - distribution of the prediction errors with the normal curve overlaid
      - normal quantile plot of the prediction errors
Other Options
In addition, any of the following MODEL statement options can be specified in the PROC VARMAX statement, which is equivalent to specifying the option for every MODEL statement: CENTER, DFTEST=, DIF=, DIFX=, DIFY=, LAGMAX=, METHOD=, MINIC=, NOCURRENTX, NOINT, NOPRINT, NSEASON=, P=, PRINT=, PRINTALL, PRINTFORM=, Q=, SCENTER, TREND=, VARDEF=, and XLAG= options. The following is an example of the options in the PROC VARMAX statement:
   proc varmax data=one lagmax=3 method=ml;
      model y1-y3 / p=1;
   run;
BY Statement
BY variables ;
A BY statement can be used with PROC VARMAX to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use one of the following alternatives:
  - Sort the data using the SORT procedure with a similar BY statement.
  - Specify the BY statement option NOTSORTED or DESCENDING in the BY statement for the VARMAX procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
  - Create an index on the BY variables using the DATASETS procedure.
For more information about the BY statement, see SAS Language Reference: Concepts. For more information about the DATASETS procedure, see the discussion in the Base SAS Procedures Guide. The following is an example of the BY statement:
   proc varmax data=one;
      by region;
      model y1-y3 / p=1;
   run;
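The first alternative listed for unsorted input, sorting with the SORT procedure before the analysis, can be sketched as follows. This is a minimal illustration that assumes the data set ONE contains a BY variable named region, as in the example above; sorting in place is a choice made here for brevity.

```sas
/* Sort the input by the BY variable first (sorts ONE in place). */
proc sort data=one;
   by region;
run;

/* The BY-group analysis can then rely on ascending order of region. */
proc varmax data=one;
   by region;
   model y1-y3 / p=1;
run;
```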
CAUSAL Statement
CAUSAL GROUP1=( variables) GROUP2=( variables) ;
A CAUSAL statement prints the Granger causality test by fitting the VAR(p) model by using all variables defined in GROUP1 and GROUP2. Any number of CAUSAL statements can be specified. The CAUSAL statement works in conjunction with the MODEL statement and uses the variables and the autoregressive order, p, specified in the MODEL statement. Variables in the GROUP1= and GROUP2= options should be defined in the MODEL statement. If the P=0 option is specified in the MODEL statement, the CAUSAL statement is not applicable. The null hypothesis of the Granger causality test is that GROUP1 is influenced only by itself, and not by GROUP2. If the hypothesis test fails to reject the null, then the variables listed in GROUP1 might be considered as independent variables. See the section "VAR and VARX Modeling" on page 1941 for details. The following is an example of the CAUSAL statement. You specify the CAUSAL statement with the GROUP1= and GROUP2= options.
   proc varmax data=one;
      model y1-y3 = x1 / p=1;
      causal group1=(x1) group2=(y1-y3);
      causal group1=(y2) group2=(y1 y3);
   run;
The first CAUSAL statement fits the VAR(1) model by using the variables y1, y2, y3, and x1 and tests the null hypothesis that x1 causes the other variables, y1, y2, and y3, but the other variables do not cause x1. The second CAUSAL statement fits the VAR(1) model by using the variables y1, y3, and y2 and tests the null hypothesis that y2 causes the other variables, y1 and y3, but the other variables do not cause y2.
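The null hypothesis can be written as a zero restriction on a block of the VAR coefficient matrix. The following sketch is for the VAR(1) case; the partition and the symbols $y_{1t}$ (the GROUP1 variables), $y_{2t}$ (the GROUP2 variables), $\delta_i$, $\Phi_{ij}$, and $\epsilon_{it}$ are notation assumed here for illustration.

```latex
\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix}
= \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix}
+ \begin{pmatrix} \Phi_{11} & \Phi_{12} \\ \Phi_{21} & \Phi_{22} \end{pmatrix}
  \begin{pmatrix} y_{1,t-1} \\ y_{2,t-1} \end{pmatrix}
+ \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{pmatrix},
\qquad H_0\colon \Phi_{12} = 0
```

Under $H_0$ the equations for the GROUP1 variables contain no lags of the GROUP2 variables, which is the sense in which GROUP1 is influenced only by itself.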
COINTEG Statement
COINTEG RANK=number < H=( matrix) > < J=( matrix) > < EXOGENEITY > < NORMALIZE=variable > ;
The COINTEG statement fits the vector error correction model to the data, tests the restrictions of the long-run parameters and the adjustment parameters, and tests for the weak exogeneity in the long-run parameters. The cointegrated system uses the maximum likelihood analysis proposed by Johansen and Juselius (1990) and Johansen (1995a, 1995b). Only one COINTEG statement is allowed. You specify the ECM= option in the MODEL statement or the COINTEG statement to fit the VECM(p). The P= option in the MODEL statement is used to specify the autoregressive order of the VECM. The following statements are equivalent for fitting a VECM(2).
   proc varmax data=one;
      model y1-y3 / p=2 ecm=(rank=1);
   run;
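Only the ECM= form appears above. Based on the statement that the COINTEG statement can be used in place of the ECM= option, the COINTEG counterpart would presumably look like the following sketch, which reuses the data set and variable names of the example above.

```sas
/* Assumed equivalent form: the rank is given in the COINTEG  */
/* statement instead of in the ECM= option.                   */
proc varmax data=one;
   model y1-y3 / p=2;
   cointeg rank=1;
run;
```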
To test restrictions of either $\alpha$ or $\beta$ or both, you specify either J= or H= or both, respectively. You specify the EXOGENEITY option in the COINTEG statement for tests of the weak exogeneity in the long-run parameters. The following is an example of the COINTEG statement.
   proc varmax data=one;
      model y1-y3 / p=2;
      cointeg rank=1 h=(1 0, -1 0, 0 1)
                     j=(1 0, 0 0, 0 1) exogeneity;
   run;
EXOGENEITY
formulates the likelihood ratio tests for testing weak exogeneity in the long-run parameters. The null hypothesis is that one variable is weakly exogenous for the others.
H=(matrix)
specifies the restrictions H on the $k \times r$ or $(k+1) \times r$ cointegrated coefficient matrix $\beta$ such that $\beta = H\phi$, where H is known and $\phi$ is unknown. If the VECM(p) is specified with the COINTEG statement or with the ECM= option in the MODEL statement and the ECTREND option is not included with the ECM= specification, then the H matrix has dimension $k \times m$. If the VECM(p) is specified with the COINTEG statement or with the ECM= option in the MODEL statement and the ECTREND option is also used, then the H matrix has dimension $(k+1) \times m$. Here k is the number of dependent variables, and m satisfies $r \le m < k$, where r is defined with the RANK=r option. For example, consider a system that contains four variables and the RANK=1 option with $\beta = (\beta_1, \beta_2, \beta_3, \beta_4)'$. The restriction matrix for the test of $\beta_1 + \beta_2 = 0$ can be specified as
cointeg rank=1 h=(1 0 0, -1 0 0, 0 1 0, 0 0 1);
Here the matrix H is $4 \times 3$ where $k = 4$ and $m = 3$, and each row of the matrix H is separated by commas. When the series has no separate deterministic trend, the constant term should be restricted by $\alpha_{\perp}'\delta_0 = 0$. In the preceding example, the $\beta$ can be either $\beta = (\beta_1, \beta_2, \beta_3, \beta_4, 1)'$ or $\beta = (\beta_1, \beta_2, \beta_3, \beta_4, t)'$. You can specify the restriction matrix for the previous test of $\beta_1 + \beta_2 = 0$ as follows:
cointeg rank=1 h=(1 0 0 0, -1 0 0 0, 0 1 0 0, 0 0 1 0, 0 0 0 1);
When the cointegrated system contains three dependent variables and the RANK=2 option, you can specify the restriction matrix for the test of $\beta_{1j} = -\beta_{2j}$ for $j = 1, 2$ as follows:
cointeg rank=2 h=(1 0, -1 0, 0 1);
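To see why the first H= example above (four variables, RANK=1) encodes the restriction on the first two long-run coefficients, the product of H and the free parameters can be written out. The free parameter vector is denoted $\phi = (\phi_1, \phi_2, \phi_3)'$ here; the element names are notation assumed for this illustration.

```latex
\beta = H\phi =
\begin{pmatrix} 1 & 0 & 0 \\ -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{pmatrix}
= \begin{pmatrix} \phi_1 \\ -\phi_1 \\ \phi_2 \\ \phi_3 \end{pmatrix}
\quad\Longrightarrow\quad \beta_1 + \beta_2 = \phi_1 - \phi_1 = 0
```

The first two rows of H share one free parameter with opposite signs, while the third and fourth coefficients remain unrestricted.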
J=(matrix)
specifies the restrictions J on the $k \times r$ adjustment matrix $\alpha$ such that $\alpha = J\psi$, where J is known and $\psi$ is unknown. The $k \times m$ matrix J is specified by using this option, where k is the number of dependent variables, m satisfies $r \le m < k$, and r is defined with the RANK=r option. For example, when the system contains four variables and the RANK=1 option is used, you can specify the restriction matrix for the test of $\alpha_j = 0$ for $j = 2, 3, 4$ as follows:
cointeg rank=1 j=(1, 0, 0, 0);
When the system contains three variables and the RANK=2 option, you can specify the restriction matrix for the test of $\alpha_{2j} = 0$ for $j = 1, 2$ as follows:
cointeg rank=2 j=(1 0, 0 0, 0 1);
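The J= restriction works the same way as the H= restriction. For the four-variable, RANK=1 example above, J is $4 \times 1$ and the free parameter $\psi$ is a scalar (an assumption that follows from $m = r = 1$ in that example):

```latex
\alpha = J\psi =
\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} \psi
= \begin{pmatrix} \psi \\ 0 \\ 0 \\ 0 \end{pmatrix}
\quad\Longrightarrow\quad \alpha_2 = \alpha_3 = \alpha_4 = 0
```

Only the first variable adjusts to the cointegrating relation; a zero row in J forces the corresponding row of $\alpha$ to zero.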
NORMALIZE=variable
specifies a single dependent (endogenous) variable name whose cointegrating vectors are normalized. If the variable name is different from that specified in the COINTTEST=(JOHANSEN= ) or ECM= option in the MODEL statement, the variable name specified in the COINTEG statement is used. If the normalized variable is not specified, cointegrating vectors are not normalized.
RANK=number
specifies the cointegration rank of the cointegrated system. This option is required in the COINTEG statement. The rank of cointegration should be greater than zero and less than the number of dependent (endogenous) variables. If the value of the RANK= option in the COINTEG statement is different from that specified in the ECM= option, the rank specified in the COINTEG statement is used.
ID Statement
ID variable INTERVAL=value < ALIGN=value > ;
The ID statement specifies a variable that identifies observations in the input data set. The datetime variable specified in the ID statement is included in the OUT= data set if the OUTPUT statement is specified. Note that the ID variable is usually a SAS datetime variable. The values of the ID variable are extrapolated for the forecast observations based on the value of the INTERVAL= option.
ALIGN=value
controls the alignment of SAS dates used to identify output observations. The ALIGN= option allows the following values: BEGINNING | BEG | B, MIDDLE | MID | M, and ENDING | END | E. The default is BEGINNING. The ALIGN= option is used to align the ID variable to the beginning, middle, or end of the time ID interval specified by the INTERVAL= option.
INTERVAL=value
specifies the time interval between observations. This option is required in the ID statement. The INTERVAL= option is used in conjunction with the ID variable to check that the input data are in order and have no missing periods. The INTERVAL= option is also used to extrapolate the ID values past the end of the input data when the OUTPUT statement is specified. The following is an example of the ID statement:
   proc varmax data=one;
      id date interval=qtr align=mid;
      model y1-y3 / p=1;
   run;
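The extrapolation of ID values becomes visible only when forecasts are written to an output data set. The following sketch adds an OUTPUT statement to the example above; the data set name FCST and the forecast horizon of four quarters are hypothetical choices made for illustration.

```sas
/* FCST is a hypothetical output data set name.                */
/* With LEAD=4, the ID variable DATE is extended four quarters */
/* past the end of the input data in the OUT= data set.        */
proc varmax data=one;
   id date interval=qtr;
   model y1-y3 / p=1;
   output out=fcst lead=4;
run;
```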
MODEL Statement
MODEL dependents < = regressors > < , dependents < = regressors > . . . > < / options > ;
The MODEL statement specifies dependent (endogenous) variables and independent (exogenous) variables for the VARMAX model. The multivariate model can have the same or different independent variables corresponding to the dependent variables. As a special case, the VARMAX procedure allows you to analyze one dependent variable. Only one MODEL statement is allowed.
For example, the following statements are equivalent ways of specifying the multivariate model for the vector (y1, y2, y3):
   model y1 y2 y3 </options>;
   model y1-y3 </options>;
The following statements are equivalent ways of specifying the multivariate model with independent variables, where y1, y2, y3, and y4 are the dependent variables and x1, x2, x3, x4, and x5 are the independent variables:
   model y1 y2 y3 y4 = x1 x2 x3 x4 x5 </options>;
   model y1 y2 y3 y4 = x1-x5 </options>;
   model y1 = x1-x5, y2 = x1-x5, y3 y4 = x1-x5 </options>;
   model y1-y4 = x1-x5 </options>;
When the multivariate model has different independent variables that correspond to each of the dependent variables, equations are separated by commas (,) and the model can be specified as illustrated by the following MODEL statement:
model y1 = x1-x3, y2 = x3-x5, y3 y4 = x1-x5 </options>;
The following options can be used in the MODEL statement after a forward slash (/):
CENTER
centers the dependent (endogenous) variables by subtracting their means. Note that centering is done after differencing when the DIF= or DIFY= option is specified. If there are exogenous (independent) variables, this option is not applicable.
model y1 y2 / p=1 center;
DIF(variable (number-list) < ... variable (number-list) >) DIF=(variable (number-list) < ... variable (number-list) >)
specifies the degrees of differencing to be applied to the specified dependent or independent variables. The number-list must contain one or more numbers, each of which should be greater than zero. The differencing can be the same for all variables, or it can vary among variables. For example, the DIF=(y1 (1,4) y3 (1) x2 (2)) option specifies that the series y1 is differenced at lag 1 and at lag 4, which is
$$(1 - B^4)(1 - B)\, y_{1t} = (y_{1t} - y_{1,t-1}) - (y_{1,t-4} - y_{1,t-5});$$
the series y3 is differenced at lag 1, which is $(y_{3t} - y_{3,t-1})$; and the series x2 is differenced at lag 2, which is $(x_{2t} - x_{2,t-2})$.
The following uses the data dy1, y2, x1, and dx2, where $dy1 = (1 - B)\, y_{1t}$ and $dx2 = (1 - B^2)\, x_{2t}$.
model y1 y2 = x1 x2 / p=1 dif=(y1(1) x2(2));
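For comparison, the differenced series that the preceding MODEL statement works with can be computed by hand in a DATA step. This is only a sketch: it assumes that differencing at lag 2 means $x_{2t} - x_{2,t-2}$, as stated in the option description, and the data set name DIFFED is hypothetical.

```sas
/* DIFFED is a hypothetical name for the hand-differenced copy. */
data diffed;
   set one;
   dy1 = dif(y1);    /* y1(t) - y1(t-1): differenced at lag 1 */
   dx2 = dif2(x2);   /* x2(t) - x2(t-2): differenced at lag 2 */
run;
```

The first observation of dy1 and the first two observations of dx2 are missing, because the lagged values do not exist yet.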
DIFX(number-list) DIFX=(number-list)
specifies the degrees of differencing to be applied to all independent variables. The number-list must contain one or more numbers, each of which should be greater than zero. For example, the DIFX=(1) option specifies that all of the independent series are differenced once at lag 1. The DIFX=(1,4) option specifies that all of the independent series are differenced at lag 1 and at lag 4. If independent variables are specified in the DIF= option, then the DIFX= option is ignored. The following statement uses the data y1, y2, dx1, and dx2, where $dx1 = (1 - B)\, x_{1t}$ and $dx2 = (1 - B)\, x_{2t}$.
   model y1 y2 = x1 x2 / p=1 difx(1);
DIFY(number-list) DIFY=(number-list)
specifies the degrees of differencing to be applied to all dependent (endogenous) variables. The number-list must contain one or more numbers, each of which should be greater than zero. For details, see the DIFX= option. If dependent variables are specified in the DIF= option, then the DIFY= option is ignored.
model y1 y2 / p=1 dify(1);
METHOD=value
requests the type of estimates to be computed. The possible values of the METHOD= option are as follows:
LS specifies least squares estimates.
ML specifies maximum likelihood estimates.
When the ECM=, PRIOR=, and Q= options and the GARCH statement are specified, the default ML method is used regardless of the method given by the METHOD= option.
model y1 y2 / p=1 method=ml;
NOCURRENTX
suppresses the current values $x_t$ of the independent variables. In general, the VARX(p, s) model is
$$y_t = \delta + \sum_{i=1}^{p} \Phi_i y_{t-i} + \sum_{i=0}^{s} \Theta^*_i x_{t-i} + \epsilon_t$$
where p is the number of lags of the dependent variables included in the model, and s is the number of lags of the independent variables included in the model, including the contemporaneous values of $x_t$. A VARX(1,2) model can be specified as:
model y1 y2 = x1 x2 / p=1 xlag=2;
If the NOCURRENTX option is specified, it suppresses the current values $x_t$ and starts with $x_{t-1}$. The VARX(p, s) model is redefined as:
$$y_t = \delta + \sum_{i=1}^{p} \Phi_i y_{t-i} + \sum_{i=1}^{s} \Theta^*_i x_{t-i} + \epsilon_t$$
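Written out for the VARX(1,2) example (P=1, XLAG=2), the effect of NOCURRENTX is that the contemporaneous term drops out. The symbols $\delta$, $\Phi_1$, $\Theta^*_i$, and $\epsilon_t$ are assumed here to denote the intercept, the AR coefficient matrix, the coefficient matrices of the independent variables, and the error term.

```latex
% Without NOCURRENTX: the sum over x starts at i = 0
y_t = \delta + \Phi_1 y_{t-1} + \Theta^*_0 x_t + \Theta^*_1 x_{t-1} + \Theta^*_2 x_{t-2} + \epsilon_t

% With NOCURRENTX: the sum over x starts at i = 1
y_t = \delta + \Phi_1 y_{t-1} + \Theta^*_1 x_{t-1} + \Theta^*_2 x_{t-2} + \epsilon_t
```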
NOINT
NSEASON=number
specifies the number of seasonal periods. When the NSEASON=number option is specified, (number − 1) seasonal dummies are added to the regressors. If the NOINT option is specified, the NSEASON= option is not applicable.
model y1 y2 / p=1 nseason=4;
SCENTER
centers seasonal dummies specified by the NSEASON= option. The centered seasonal dummies are generated by $c - (1/s)$, where c is a seasonal dummy generated by the NSEASON=s option.
model y1 y2 / p=1 nseason=4 scenter;
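As a worked instance of the centering formula for the NSEASON=4 example above, each 0/1 seasonal dummy c is shifted by $1/4$. The notation $c^{\text{centered}}$ is introduced here only for illustration.

```latex
c^{\text{centered}} = c - \tfrac{1}{4} =
\begin{cases}
  \phantom{-}\tfrac{3}{4} & \text{in the season where } c = 1 \\
  -\tfrac{1}{4} & \text{in the other three seasons}
\end{cases}
\qquad
\tfrac{3}{4} + 3\left(-\tfrac{1}{4}\right) = 0
```

Each centered dummy therefore sums to zero over a complete seasonal cycle.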
TREND=value
specifies the degree of deterministic time trend included in the model. Valid values are as follows:
LINEAR includes a linear time trend as a regressor.
QUAD includes linear and quadratic time trends as regressors.
VARDEF=value
corrects for the degrees of freedom of the denominator for computing an error covariance matrix for the METHOD=LS option. If the METHOD=ML option is specified, the VARDEF=N option is always used. Valid values are as follows:
DF specifies that the number of nonmissing observations minus the number of regressors be used.
N specifies that the number of nonmissing observations be used.
model y1 y2 / p=1 vardef=n;
LAGMAX=number
specifies the maximum number of lags for which results are computed and displayed by the PRINT=(CORRX CORRY COVX COVY IARR IMPULSE= IMPULSX= PARCOEF PCANCORR PCORR) options. This option is also used to limit the printed results for the cross covariances and cross-correlations of residuals. The default is LAGMAX=min(12, T−2), where T is the number of nonmissing observations.
model y1 y2 / p=1 lagmax=6;
NOPRINT
PRINTALL
requests all printing control options. The options set by the option PRINTALL are DFTEST=, MINIC=, PRINTFORM=BOTH, and PRINT=(CORRB CORRX CORRY COVB COVPE COVX COVY DECOMPOSE DYNAMIC IARR IMPULSE=(ALL) IMPULSX=(ALL) PARCOEF PCANCORR PCORR ROOTS YW). You can also specify this option as the option ALL.
model y1 y2 / p=1 printall;
PRINTFORM=value
requests the printing format of the output generated by the PRINT= option and cross covariances and cross-correlations of residuals. Valid values are as follows:
BOTH prints output in both MATRIX and UNIVARIATE forms.
MATRIX prints output in matrix form. This is the default.
UNIVARIATE prints output by variables.
Printing Options
PRINT=(options)
The following options can be used in the PRINT=( ) option. The options are listed within parentheses. If a number in parentheses follows an option listed below, then the option prints the number of lags specified by number in parentheses. The default is the number of lags specified by the LAGMAX=number option.
CORRB
CORRX CORRX(number )
prints the cross-correlation matrices of exogenous (independent) variables. The number should be greater than zero.
CORRY CORRY(number )
prints the cross-correlation matrices of dependent (endogenous) variables. The number should be greater than zero.
COVB
COVPE COVPE(number )
prints the covariance matrices of number-ahead prediction errors for the VARMAX(p,q,s) model. The number should be greater than zero. If the DIF= or DIFY= option is specified, the covariance matrices of multistep prediction errors are computed based on the differenced data. This option is not applicable when the PRIOR= option is specified. See the section "Forecasting" on page 1930 for details.
COVX COVX(number )
prints the cross-covariance matrices of exogenous (independent) variables. The number should be greater than zero.
COVY COVY(number )
prints the cross-covariance matrices of dependent (endogenous) variables. The number should be greater than zero.
DECOMPOSE DECOMPOSE(number )
prints the decomposition of the prediction error covariances using up to the number of lags specified by number in parentheses for the VARMA(p,q) model. The number should be greater than zero. It can be interpreted as the contribution of innovations in one variable to the mean squared error of the multistep forecast of another variable. The DECOMPOSE option also prints proportions of the forecast error variance. If the DIF= or DIFY= option is specified, the covariance matrices of multistep prediction errors are computed based on the differenced data. This option is not applicable when the PRIOR= option is specified. See the section "Forecasting" on page 1930 for details.
DIAGNOSE
prints the contemporaneous relationships among the components of the vector time series.
ESTIMATES
prints the coefficient estimates and a schematic representation of the significance and sign of the parameter estimates.
IARR IARR(number )
prints the infinite order AR representation of a VARMA process. The number should be greater than zero. If the ECM= option and the COINTEG statement are specified, then the reparameterized AR coefficient matrices are printed.
IMPULSE IMPULSE(number ) IMPULSE=(SIMPLE ACCUM ORTH STDERR ALL) IMPULSE(number )=(SIMPLE ACCUM ORTH STDERR ALL)
prints the impulse response function. The number should be greater than zero. It investigates the response of one variable to an impulse in another variable in a system that involves a number of other variables as well. It is an infinite order MA representation of a VARMA process. See the section "Impulse Response Function" on page 1919 for details. The following options can be used in the IMPULSE=( ) option. The options are specified within parentheses.
ACCUM prints the accumulated impulse response function.
ALL is equivalent to specifying all of SIMPLE, ACCUM, ORTH, and STDERR.
ORTH prints the orthogonalized impulse response function.
SIMPLE prints the impulse response function. This is the default.
STDERR prints the standard errors of the impulse response function, the accumulated impulse response function, or the orthogonalized impulse response function. If the exogenous variables are used to fit the model, then the STDERR option is ignored.
IMPULSX IMPULSX(number ) IMPULSX=(SIMPLE ACCUM ALL) IMPULSX(number )=(SIMPLE ACCUM ALL)
prints the impulse response function related to exogenous (independent) variables. The number should be greater than zero. See the section "Impulse Response Function" on page 1919 for details. The following options can be used in the IMPULSX=( ) option. The options are specified within parentheses.
ACCUM prints the accumulated impulse response matrices for the transfer function.
ALL is equivalent to specifying both SIMPLE and ACCUM.
SIMPLE prints the impulse response matrices for the transfer function. This is the default.
PARCOEF PARCOEF(number )
prints the partial autoregression coefficient matrices, $\Phi_{mm}$, up to the lag number. The number should be greater than zero. With a VAR process, this option is useful for the identification of the order since the $\Phi_{mm}$ have the property that they equal zero for $m > p$ under the hypothetical assumption of a VAR(p) model. See the section "Tentative Order Selection" on page 1935 for details.
PCANCORR PCANCORR(number )
prints the partial canonical correlations of the process at lag m and the test for testing $\Phi_m = 0$ for $m > p$ up to the lag number. The number should be greater than zero. The lag m partial canonical correlations are the canonical correlations between $y_t$ and $y_{t-m}$, after adjustment for the dependence of these variables on the intervening values $y_{t-1}, \ldots, y_{t-m+1}$. See the section "Tentative Order Selection" on page 1935 for details.
PCORR PCORR(number )
prints the partial correlation matrices. The number should be greater than zero. With a VAR process, this option is useful for a tentative order selection by the same property as the partial autoregression coefficient matrices, as described in the PRINT=(PARCOEF) option. See the section "Tentative Order Selection" on page 1935 for details.
ROOTS
prints the eigenvalues of the $kp \times kp$ companion matrix associated with the AR characteristic function $\Phi(B)$, where k is the number of dependent (endogenous) variables, and $\Phi(B)$ is the finite order matrix polynomial in the backshift operator B, such that $B^i y_t = y_{t-i}$. These eigenvalues indicate the stationary condition of the process since the stationary condition on the roots of $|\Phi(B)| = 0$ in the VAR(p) model is equivalent to the condition in the corresponding VAR(1) representation that all eigenvalues of the companion matrix be less than one in absolute value. Similarly, you can use this option to check the invertibility of the MA process. In addition, when the GARCH statement is specified, this option prints the roots of the GARCH characteristic polynomials to check covariance stationarity for the GARCH process.
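For reference, the standard textbook form of the companion matrix for a VAR(p) process with AR coefficient matrices $\Phi_1, \ldots, \Phi_p$ is sketched below. The symbol F and the block layout are notation assumed here; the procedure's internal arrangement may differ, although the eigenvalues are the same.

```latex
F =
\begin{pmatrix}
\Phi_1 & \Phi_2 & \cdots & \Phi_{p-1} & \Phi_p \\
I_k    & 0      & \cdots & 0          & 0      \\
0      & I_k    & \cdots & 0          & 0      \\
\vdots &        & \ddots &            & \vdots \\
0      & 0      & \cdots & I_k        & 0
\end{pmatrix}
\qquad (kp \times kp)
```

The process is stationary when every eigenvalue $\lambda$ of F satisfies $|\lambda| < 1$. For $p = 1$ the companion matrix reduces to $\Phi_1$ itself.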
YW
prints Yule-Walker estimates of the preliminary autoregressive model for the dependent (endogenous) variables. The coefficient matrices are printed using the maximum order of the autoregressive process. Some examples of the PRINT= option are as follows:
   model y1 y2 / p=1 print=(covy(10) corry(10));
   model y1 y2 / p=1 print=(parcoef pcancorr pcorr);
   model y1 y2 / p=1 print=(impulse(8) decompose(6) covpe(6));
   model y1 y2 / p=1 print=(dynamic roots yw);
P=number P=(number-list)
specifies the order of the vector autoregressive process. Subset models of vector autoregressive orders can be specified by listing the desired set of lags. For example, you can specify the P=(1,3,4) option. The P=3 option is equivalent to the P=(1,2,3) option. The default is P=0. If P=0 and there are no exogenous (independent) variables, then the AR polynomial order is automatically determined by minimizing an information criterion. If P=0 and the PRIOR= or ECM= option or both are specified, then the AR polynomial order is determined automatically. If the ECM= option is specified, then subset models of vector autoregressive orders are not allowed and the AR maximum order specified is used. Examples illustrating the P= option follow:
   model y1 y2 / p=3;
   model y1 y2 / p=(1,3);
   model y1 y2 / p=(1,3) prior;
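The difference between the full and the subset specification in the first two statements can be seen by writing the models out. The symbols $\delta$, $\Phi_i$, and $\epsilon_t$ are assumed here to denote the intercept, the AR coefficient matrices, and the error term.

```latex
% P=3, equivalent to P=(1,2,3)
y_t = \delta + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \Phi_3 y_{t-3} + \epsilon_t

% P=(1,3): the lag-2 term is omitted
y_t = \delta + \Phi_1 y_{t-1} + \Phi_3 y_{t-3} + \epsilon_t
```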
Q=number Q=(number-list)
specifies the order of the moving-average error process. Subset models of moving-average orders can be specified by listing the desired set of lags. For example, you can specify the Q=(1,5) option. The default is Q=0.
   model y1 y2 / p=1 q=1;
   model y1 y2 / q=(2);
XLAG=number XLAG=(number-list)
specifies the lags of exogenous (independent) variables. Subset models of distributed lags can be specified by listing the desired set of lags. For example, XLAG=(2) selects only a lag 2 of the exogenous variables. The default is XLAG=0. To exclude the present values of exogenous variables from the model, the NOCURRENTX option must be used.
   model y1 y2 = x1-x3 / xlag=2 nocurrentx;
   model y1 y2 = x1-x3 / p=1 xlag=(2);
MINIC MINIC=(TYPE=value P=number Q=number PERROR=number )
prints the information criterion for the appropriate AR and MA tentative order selection and for the diagnostic checks of the fitted model.
If the MINIC= option is not specified, all types of information criteria are printed for diagnostic checks of the fitted model. The following options can be used in the MINIC=( ) option. The options are specified within parentheses.
P=number P=($p_{min}$:$p_{max}$)
specifies the range of AR orders to be considered in the tentative order selection. The default is P=(0:5). The P=3 option is equivalent to the P=(0:3) option.
PERROR=number PERROR=($p_{\epsilon,min}$:$p_{\epsilon,max}$)
specifies the range of AR orders for obtaining the error series. The default is PERROR=($p_{max}$:$p_{max}+q_{max}$).
Q=number Q=($q_{min}$:$q_{max}$)
specifies the range of MA orders to be considered in the tentative order selection. The default is Q=(0:5).
TYPE=value
specifies the criterion for the model order selection. Valid criteria are as follows:
AIC specifies the Akaike information criterion.
AICC specifies the corrected Akaike information criterion. This is the default criterion.
FPE specifies the final prediction error criterion.
HQC specifies the Hannan-Quinn criterion.
SBC specifies the Schwarz Bayesian criterion. You can also specify this value as TYPE=BIC.
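The MINIC= suboptions described above can be combined in a single MODEL statement. The following is a sketch only: the chosen criterion and order ranges are arbitrary illustrative values, and the data set ONE and the variables y1 and y2 are carried over from the other examples in this section.

```sas
/* Tentative order selection by SBC over AR orders 0-4 and */
/* MA orders 0-2 (ranges chosen only for illustration).    */
proc varmax data=one;
   model y1 y2 / p=1 minic=(type=sbc p=(0:4) q=(0:2));
run;
```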
DFTEST DFTEST=(DLAG=number ) DFTEST=(DLAG=(number ) . . . (number ))
prints the Dickey-Fuller unit root tests. The DLAG=(number) . . . (number) option specifies the regular or seasonal unit root test. Supported values of number are 1, 2, 4, and 12. If the number is greater than one, a seasonal Dickey-Fuller test is performed. If the TREND= option is specified, the seasonal unit root test is not available. The default is DLAG=1.
For example, the DFTEST=(DLAG=(1)(12)) option produces two tables: the Dickey-Fuller regular unit root test and the seasonal unit root test. Some examples of the DFTEST= option follow:
   model y1 y2 / p=2 dftest;
   model y1 y2 / p=2 dftest=(dlag=4);
   model y1 y2 / p=2 dftest=(dlag=(1)(12));
   model y1 y2 / p=2 dftest cointtest;
The following options can be used with the COINTTEST=( ) option. The options are specified within parentheses.
JOHANSEN JOHANSEN=(TYPE=value IORDER=number NORMALIZE=variable)
prints the cointegration rank test for multivariate time series based on Johansen's method. This test is provided when the number of dependent (endogenous) variables is less than or equal to 11. See the section "Vector Error Correction Modeling" on page 1962 for details. The VARX(p,s) model can be written as the error correction model
$$\Delta y_t = \Pi y_{t-1} + \sum_{i=1}^{p-1} \Phi^*_i \Delta y_{t-i} + A D_t + \sum_{i=0}^{s} \Theta^*_i x_{t-i} + \epsilon_t$$
where $\Pi$, $\Phi^*_i$, $A$, and $\Theta^*_i$ are coefficient parameters; $D_t$ is a deterministic term such as a constant, a linear trend, or seasonal dummies. The I(1) model is defined by one reduced-rank condition. If the cointegration rank is $r < k$, then there exist $k \times r$ matrices $\alpha$ and $\beta$ of rank r such that $\Pi = \alpha\beta'$. The I(1) model is rewritten as the I(2) model
$$\Delta^2 y_t = \Pi y_{t-1} - \Psi \Delta y_{t-1} + \sum_{i=1}^{p-2} \Psi_i \Delta^2 y_{t-i} + A D_t + \sum_{i=0}^{s} \Theta^*_i x_{t-i} + \epsilon_t$$
where $\Psi = I_k - \sum_{i=1}^{p-1} \Phi^*_i$ and $\Psi_i = -\sum_{j=i+1}^{p-1} \Phi^*_j$.
The I(2) model is defined by two reduced-rank conditions. One is that $\Pi = \alpha\beta'$, where $\alpha$ and $\beta$ are $k \times r$ matrices of full-rank r. The other is that $\alpha_{\perp}' \Psi \beta_{\perp} = \xi\eta'$, where $\xi$ and $\eta$ are $(k-r) \times s$ matrices with $s \le k-r$; $\alpha_{\perp}$ and $\beta_{\perp}$ are $k \times (k-r)$ matrices of full-rank $k-r$ such that $\alpha'\alpha_{\perp} = 0$ and $\beta'\beta_{\perp} = 0$. The following options can be used in the JOHANSEN=( ) option. The options are specified within parentheses.
IORDER=number
specifies the integrated order.
IORDER=1
prints the cointegration rank test for an integrated order 1 and prints the long-run parameter, $\beta$, and the adjustment coefficient, $\alpha$. This is the default. If the IORDER=1 option is specified, then the AR order should be greater than or equal to 1. When the P=0 option is specified, the value of P is set to 1 for the Johansen test.
IORDER=2
prints the cointegration rank test for integrated orders 1 and 2. If the IORDER=2 option is specified, then the AR order should be greater than or equal to 2. If the P=1 option is specified with the IORDER=2 option, then the value of IORDER is set to 1; if the P=0 option is specified with the IORDER=2 option, then the value of P is set to 2.
NORMALIZE=variable
specifies the dependent (endogenous) variable name whose cointegration vectors are to be normalized. If the normalized variable is different from that specified in the ECM= option or the COINTEG statement, then the value specified in the COINTEG statement is used.
TYPE=value
specifies the type of cointegration rank test to be printed. Valid values are as follows:
MAX prints the cointegration maximum eigenvalue test.
TRACE prints the cointegration trace test. This is the default.
If the NOINT option is not specified, the procedure prints two different cointegration rank tests in the presence of the unrestricted and restricted deterministic terms (constant or linear trend) models. If the IORDER=2 option is specified, the procedure automatically uses the TYPE=TRACE option. Some examples illustrating the COINTTEST= option follow:
   model y1 y2 / p=2 cointtest=(johansen=(type=max normalize=y1));
   model y1 y2 / p=2 cointtest=(johansen=(iorder=2 normalize=y1));
SIGLEVEL=value
sets the size of cointegration rank tests and common trends tests. The SIGLEVEL=value can be set to 0.1, 0.05, or 0.01. The default is SIGLEVEL=0.05.
   model y1 y2 / p=2 cointtest=(johansen siglevel=0.1);
   model y1 y2 / p=2 cointtest=(sw siglevel=0.1);
SW SW=(TYPE=value LAG=number )
prints common trends tests for a multivariate time series based on the Stock-Watson method. This test is provided when the number of dependent (endogenous) variables is less than or equal to 6. See the section Common Trends on page 1960 for details. The following options can be used in the SW=( ) option. The options are listed within parentheses.
LAG=number
specifies the number of lags. The default is LAG=max(1,p) for the TYPE=FILTDIF or TYPE=FILTRES option, where p is the AR maximum order specified by the P= option; LAG=$T^{1/4}$ for the TYPE=KERNEL option, where T is the number of nonmissing observations. If the specified LAG=number exceeds the default, then it is replaced by the default.
TYPE=value
specifies the type of common trends test to be printed. Valid values are as follows:
FILTDIF prints the common trends test based on the filtering method applied to the differenced series. This is the default.
FILTRES prints the common trends test based on the filtering method applied to the residual series.
KERNEL prints the common trends test based on the kernel method.
   model y1 y2 / p=2 cointtest=(sw);
   model y1 y2 / p=2 cointtest=(sw=(type=kernel));
   model y1 y2 / p=2 cointtest=(sw=(type=kernel lag=3));
PRIOR PRIOR=(prior-options)
specifies the prior value of parameters for the BVARX(p, s) model. The BVARX model allows for a subset model specification. If the ECM= option is specified with the PRIOR option, the BVECMX(p, s) form is fitted. To compute the standard errors of the forecasts, a bootstrap procedure is used. See the section "Bayesian VAR and VARX Modeling" on page 1947 for details. The following options can be used with the PRIOR=(prior-options) option. The prior-options are listed within parentheses.
IVAR IVAR=(variables)
specifies an integrated BVAR(p) model. The variables should be specified in the MODEL statement as dependent variables. If you use the IVAR option without variables, then it sets the overall prior mean of the first lag of each variable equal to one in its own equation and sets all other coefficients to zero. If variables are specified, it sets the prior mean of the first lag of the specified variables equal to one in its own equation and sets all other coefficients to zero. When the series $y_t = (y_1, y_2)'$ follows a bivariate BVAR(2) process, the IVAR or IVAR=(y1 y2) option is equivalent to specifying MEAN=(1 0 0 0 0 1 0 0). If the PRIOR=(MEAN=) or ECM= option is specified, the IVAR= option is ignored.
LAMBDA=value
specifies the prior standard deviation of the AR coefficient parameter matrices. It should be a positive number. The default is LAMBDA=1. As the value of the LAMBDA= option is increased, the BVAR(p) model becomes closer to a VAR(p) model.
MEAN=(vector )
specifies the mean vector in the prior distribution for the AR coefficients. If the vector is not specified, the prior value is assumed to be a zero vector. See the section Bayesian VAR and VARX Modeling on page 1947 for details. You can specify the mean vector by order of the equation. Let $(\delta, \Phi_1, \ldots, \Phi_p)$ be the parameter sets to be estimated and $\Phi = (\Phi_1, \ldots, \Phi_p)$ be the AR parameter sets. The mean vector is specified row-wise from $\Phi$; that is, the MEAN=(vec($\Phi'$)) option. For the PRIOR=(mean) option in the BVAR(2),
$$\Phi = \begin{pmatrix} \phi_{1,11} & \phi_{1,12} & \phi_{2,11} & \phi_{2,12} \\ \phi_{1,21} & \phi_{1,22} & \phi_{2,21} & \phi_{2,22} \end{pmatrix} = \begin{pmatrix} 2 & 0.1 & 1 & 0 \\ 0.5 & 3 & 0 & -1 \end{pmatrix}$$
where $\phi_{l,ij}$ is an element of $\Phi$, l is a lag, i is associated with the first dependent variable, and j is associated with the second dependent variable.
model y1 y2 / p=2 prior=(mean=(2 0.1 1 0 0.5 3 0 -1));
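As a quick check of the row-wise ordering, the following Python sketch flattens the two lag matrices of the BVAR(2) example into the list passed to MEAN=. The function name `mean_vector` is an illustrative assumption; only the numbers come from the example above.

```python
# Lag matrices taken from the BVAR(2) example: Phi = (Phi_1, Phi_2)
phi_1 = [[2.0, 0.1],
         [0.5, 3.0]]
phi_2 = [[1.0, 0.0],
         [0.0, -1.0]]


def mean_vector(lag_matrices):
    """Flatten (Phi_1, ..., Phi_p) equation by equation (row-wise)."""
    k = len(lag_matrices[0])
    vector = []
    for i in range(k):              # one equation at a time
        for phi_l in lag_matrices:  # lag 1 first, then lag 2, ...
            vector.extend(phi_l[i])
    return vector


mean = mean_vector([phi_1, phi_2])
print(" ".join(f"{v:g}" for v in mean))
```

The printed values appear in the same order as in the MEAN= list of the MODEL statement.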
The deterministic terms and exogenous variables are considered to shrink toward zero; you must omit prior means of exogenous variables and deterministic terms such as a constant, seasonal dummies, or trends. For a Bayesian error correction model estimated when both the ECM= and PRIOR= options are used, a mean vector for only lagged AR coefficients, $\Phi^*_i$, in terms of regressors $\Delta y_{t-i}$, for $i = 1, \ldots, (p-1)$ is used in the VECM(p) representation. The diffused prior variance of $\alpha$ is used, since $\beta$ is replaced by $\hat\beta$ estimated in a nonconstrained VECM(p) form.
$$\Delta y_t = \alpha z_{t-1} + \sum_{i=1}^{p-1} \Phi^*_i \Delta y_{t-i} + A D_t + \sum_{i=0}^{s} \Theta^*_i x_{t-i} + \epsilon_t$$
where $z_t = \beta' y_t$. For example, in the case of a bivariate ($k = 2$) BVECM(2) form, the option
$$\text{MEAN} = (\phi^*_{1,11} \;\; \phi^*_{1,12} \;\; \phi^*_{1,21} \;\; \phi^*_{1,22})$$
where $\phi^*_{1,ij}$ is the $(i,j)$th element of the matrix $\Phi^*_1$.
NREP=number
specifies the number of periods to compute the measure of forecast accuracy. The default is NREP=$0.5T$, where T is the number of observations.
THETA=value
specifies the prior standard deviation of the AR coefficient parameter matrices. The value is in the interval (0,1). The default is THETA=0.1. As the value of the THETA= option approaches 1, the specified BVAR(p) model approaches a VAR(p) model.
Some examples of the PRIOR= option follow:
   model y1 y2 / p=2 prior;
   model y1 y2 / p=2 prior=(theta=0.2 lambda=5);
   model y1 y2 = x1 / p=2 prior=(theta=0.2 lambda=5);
   model y1 y2 = x1 / p=2 prior=(theta=0.2 lambda=5 mean=(2 0.1 1 0 0.5 3 0 -1));
See the section Bayesian VAR and VARX Modeling on page 1947 for details.
specifies a vector error correction model. The following options can be used in the ECM=( ) option. The options are specified within parentheses.
NORMALIZE=variable
specifies a single dependent variable name whose cointegrating vectors are normalized. If the variable name is different from that specified in the COINTEG statement, then the value specified in the COINTEG statement is used.
RANK=number
specifies the cointegration rank. This option is required in the ECM= option. The value of the RANK= option should be greater than zero and less than or equal to the number of dependent (endogenous) variables, k. If the rank is different from that specified in the COINTEG statement, then the value specified in the COINTEG statement is used.
ECTREND
specifies the restriction on the drift in the VECM(p) form.
There is no separate drift in the VECM(p) form, but a constant enters only through the error correction term.
$$\Delta y_t = \alpha(\beta', \beta_0)(y'_{t-1}, 1)' + \sum_{i=1}^{p-1} \Phi^*_i \Delta y_{t-i} + \epsilon_t$$
There is a separate drift and no separate linear trend in the VECM(p) form, but a linear trend enters only through the error correction term.
$$\Delta y_t = \alpha(\beta', \beta_1)(y'_{t-1}, t)' + \sum_{i=1}^{p-1} \Phi^*_i \Delta y_{t-i} + \delta_0 + \epsilon_t$$
If the NSEASON option is specified, then the NSEASON option is ignored; if the NOINT option is specified, then the ECTREND option is ignored.
Some examples of the ECM= option follow:
   model y1 y2 / p=2 ecm=(rank=1 normalize=y1);
   model y1 y2 / p=2 ecm=(rank=1 ectrend) trend=linear;
See the section Vector Error Correction Modeling on page 1962 for details.
GARCH Statement
GARCH options ;
The GARCH statement specifies a GARCH-type multivariate conditional heteroscedasticity model. The following options can be used in the GARCH statement.
FORM=value
specifies the representation for a GARCH model. Valid values are as follows:
BEKK specifies a BEKK representation. This is the default.
CCC specifies a constant conditional correlation representation.
OUTHT=SAS-data-set
P=number P=(number-list)
specifies the order of the process or the subset of GARCH terms to be fitted. For example, you can specify the P=(1,3) option. The P=3 option is equivalent to the P=(1,2,3) option. The default is P=0.
Q=number Q=(number-list)
specifies the order of the process or the subset of ARCH terms to be fitted. This option is required in the GARCH statement. For example, you can specify the Q=(2) option. The Q=2 option is equivalent to the Q=(1,2) option. For the VAR(1)-ARCH(1) model,
   model y1 y2 / p=1;
   garch q=1 form=bekk;
See the section Multivariate GARCH Modeling on page 1981 for details.
NLOPTIONS Statement
NLOPTIONS options ;
The VARMAX procedure uses the nonlinear optimization (NLO) subsystem to perform nonlinear optimization tasks. For a list of all the options of the NLOPTIONS statement, see Chapter 6, Nonlinear Optimization Methods. An example of the NLOPTIONS statement follows:
   proc varmax data=one;
      nloptions tech=qn;
      model y1 y2 / p=2;
   run;
The VARMAX procedure uses the dual quasi-Newton optimization method by default when no NLOPTIONS statement is specified. However, it uses Newton-Raphson ridge optimization when the NLOPTIONS statement is specified. The following example uses the TECH=QUANEW by default.
   proc varmax data=one;
      model y1 y2 / p=2 method=ml;
   run;
OUTPUT Statement
OUTPUT < options > ;
The OUTPUT statement generates and prints forecasts based on the model estimated in the previous MODEL statement and, optionally, creates an output SAS data set that contains these forecasts. When the GARCH model is estimated, the upper and lower confidence limits of forecasts are calculated by assuming that the error covariance has homoscedastic conditional covariance.
ALPHA=number
sets the forecast confidence limit size, where number is between 0 and 1. When you specify the ALPHA=number option, the upper and lower confidence limits define the $100(1-\alpha)$% confidence interval. The default is ALPHA=0.05, which produces 95% confidence intervals.
BACK=number
specifies the number of observations before the end of the data at which the multistep forecasts begin. The BACK= option value must be less than or equal to the number of observations minus the number of lagged regressors in the model. The default is BACK=0, which means that the forecasts start at the end of the available data.
LEAD=number
specifies the number of multistep forecast values to compute. The default is LEAD=12.
NOPRINT
suppresses the printed forecast values.
OUT=SAS-data-set
writes the forecast values to an output data set.
Some examples of the OUTPUT statements follow:
   proc varmax data=one;
      model y1 y2 / p=2;
      output lead=6 back=2;
   run;

   proc varmax data=one;
      model y1 y2 / p=2;
      output out=for noprint;
   run;
RESTRICT Statement
RESTRICT restriction, . . . , restriction ;
The RESTRICT statement restricts the specified parameters to the specified values. Only one RESTRICT statement is allowed, but multiple restrictions can be specified in one RESTRICT statement. The restrictions form is parameter=value and each restriction is separated by commas. Parameters are referred by the following keywords:
CONST(i) is the intercept parameter of the ith time series $y_{it}$
AR(l,i,j) is the autoregressive parameter of the lag l value of the jth dependent (endogenous) variable, $y_{j,t-l}$, to the ith dependent variable at time t, $y_{it}$
MA(l,i,j) is the moving-average parameter of the lag l value of the jth error process, $\epsilon_{j,t-l}$, to the ith dependent variable at time t, $y_{it}$
XL(l,i,j) is the exogenous parameter of the lag l value of the jth exogenous (independent) variable, $x_{j,t-l}$, to the ith dependent variable at time t, $y_{it}$
SDUMMY(i,j) is the jth seasonal dummy of the ith time series at time t, $y_{it}$, where $j = 1, \ldots, (nseason - 1)$, where nseason is based on the NSEASON= option in the MODEL statement
LTREND(i) is the linear trend parameter of the current value ith time series $y_{it}$
QTREND(i) is the quadratic trend parameter of the current value ith time series $y_{it}$
The following keywords are for the fitted GARCH model. The indexes i and j refer to the position of the element in the coefficient matrix.
GCHC(i,j) is the constant parameter of the covariance matrix, $H_t$, and (i,j) is $1 \le i = j \le k$ for CCC representation and $1 \le i \le j \le k$ for BEKK representations, where k is the number of dependent variables
ACH(l,i,j) is the ARCH parameter of the lag l value of $\epsilon_t \epsilon_t'$, where $i, j = 1, \ldots, k$ for BEKK representation and $i = j = 1, \ldots, k$ for CCC representation
GCH(l,i,j) is the GARCH parameter of the lag l value of covariance matrix, $H_t$, where $i, j = 1, \ldots, k$ for BEKK representation and $i = j = 1, \ldots, k$ for CCC representation
CCC(i,j) is the constant conditional correlation parameter for only the CCC representation; (i,j) is $1 \le i < j \le k$
To use the RESTRICT statement, you need to know the form of the model. If the P=, Q=, and XLAG= options are not specified, then the RESTRICT statement is not applicable.
Restricted parameter estimates are computed by introducing a Lagrangian parameter for each restriction (Pringle and Rayner 1971). The Lagrangian parameter measures the sensitivity of the sum of square errors to the restriction. The estimates of these Lagrangian parameters and their significance are printed in the restriction results table.
The following are examples of the RESTRICT statement. The first example shows a bivariate (k=2) VAR(2) model,
   proc varmax data=one;
      model y1 y2 / p=2;
      restrict AR(1,1,2)=0, AR(2,1,2)=0.3;
   run;
The AR(1,1,2) and AR(2,1,2) parameters are fixed as AR(1,1,2)=0 and AR(2,1,2)=0.3, respectively, and other parameters are to be estimated.
The following shows a bivariate (k=2) VARX(1,1) model with three exogenous variables,
   proc varmax data=two;
      model y1 = x1 x2, y2 = x2 x3 / p=1 xlag=1;
      restrict XL(0,1,1)=-1.2, XL(1,2,3)=0;
   run;
The XL(0,1,1) and XL(1,2,3) parameters are fixed as XL(0,1,1)=-1.2 and XL(1,2,3)=0, respectively, and other parameters are to be estimated.
TEST Statement
TEST restriction, . . . , restriction ;
The TEST statement performs the Wald test for the joint hypothesis specified in the statement. The restrictions form is parameter=value, and each restriction is separated by commas. The restrictions are specified in the same manner as in the RESTRICT statement. See the RESTRICT statement for description of model parameter naming conventions used by the RESTRICT and TEST statements. Any number of TEST statements can be specified. To use the TEST statement, you need to know the form of the model. If the P=, Q=, and XLAG= options are not specified, then the TEST statement is not applicable. See the section Granger Causality Test on page 1944 for the Wald test. The following is an example of the TEST statement. In the case of a bivariate (k=2) VAR(2) model,
   proc varmax data=one;
      model y1 y2 / p=2;
      test AR(1,1,2)=0, AR(2,1,2)=0;
   run;
After estimating the parameters, the TEST statement tests the null hypothesis that AR(1,1,2)=0 and AR(2,1,2)=0.
Missing Values
The VARMAX procedure currently does not support missing values. The procedure uses the first contiguous group of observations with no missing values for any of the MODEL statement variables. Observations at the beginning of the data set with missing values for any MODEL statement variables are not used or included in the output data set. At the end of the data set, observations can have dependent (endogenous) variables with missing values and independent (exogenous) variables with nonmissing values.
VARMAX Model
The vector autoregressive moving-average model with exogenous variables is called the VARMAX(p,q,s) model. The form of the model can be written as
$$y_t = \sum_{i=1}^{p} \Phi_i y_{t-i} + \sum_{i=0}^{s} \Theta^*_i x_{t-i} + \epsilon_t - \sum_{i=1}^{q} \Theta_i \epsilon_{t-i}$$
where the output variables of interest, $y_t = (y_{1t}, \ldots, y_{kt})'$, can be influenced by other input variables, $x_t = (x_{1t}, \ldots, x_{rt})'$, which are determined outside of the system of interest. The variables $y_t$ are referred to as dependent, response, or endogenous variables, and the variables $x_t$ are referred to as independent, input, predictor, regressor, or exogenous variables. The unobserved noise variables, $\epsilon_t = (\epsilon_{1t}, \ldots, \epsilon_{kt})'$, are a vector white noise process.
The VARMAX(p,q,s) model can be written
$$\Phi(B) y_t = \Theta^*(B) x_t + \Theta(B) \epsilon_t$$
where
$$\Phi(B) = I_k - \Phi_1 B - \cdots - \Phi_p B^p$$
$$\Theta^*(B) = \Theta^*_0 + \Theta^*_1 B + \cdots + \Theta^*_s B^s$$
$$\Theta(B) = I_k - \Theta_1 B - \cdots - \Theta_q B^q$$
are matrix polynomials in B in the backshift operator, such that $B^i y_t = y_{t-i}$, the $\Phi_i$ and $\Theta_i$ are $k \times k$ matrices, and the $\Theta^*_i$ are $k \times r$ matrices.
The following assumptions are made:
$E(\epsilon_t) = 0$, $E(\epsilon_t \epsilon_t') = \Sigma$, which is positive-definite, and $E(\epsilon_t \epsilon_s') = 0$ for $t \ne s$.
For stationarity and invertibility of the VARMAX process, the roots of $|\Phi(z)| = 0$ and $|\Theta(z)| = 0$ are outside the unit circle.
The exogenous (independent) variables $x_t$ are not correlated with residuals $\epsilon_t$, $E(x_t \epsilon_t') = 0$.
The exogenous variables can be stochastic or nonstochastic. When the exogenous variables are stochastic and their future values are unknown, forecasts of these future values are needed to forecast the future values of the endogenous (dependent) variables. On occasion, future values of the exogenous variables can be assumed to be known because they are deterministic variables. The VARMAX procedure assumes that the exogenous variables are nonstochastic if future values are available in the input data set. Otherwise, the exogenous variables are assumed to be stochastic and their future values are forecasted by assuming that they follow the VARMA(p,q) model, prior to forecasting the endogenous variables, where p and q are the same as in the VARMAX(p,q,s) model.
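The stationarity condition can be illustrated for the simplest case. For a VAR(1) model, the roots of $|I_k - \Phi_1 z| = 0$ lie outside the unit circle exactly when all eigenvalues of $\Phi_1$ lie inside it. The following Python sketch checks this for the bivariate VAR(1) estimates that appear as the lag 1 impulse responses in Figure 30.30; the helper name `eigenvalues_2x2` is an illustrative assumption, and the closed form applies only to 2 x 2 matrices.

```python
import cmath

# Estimated AR(1) matrix as reported in the impulse response output (Figure 30.30)
phi = [[1.15977, -0.51058],
       [0.54634, 0.38499]]


def eigenvalues_2x2(m):
    """Eigenvalues of a 2 x 2 matrix from its characteristic polynomial."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


# For a VAR(1), |I - Phi z| = 0 has its roots outside the unit circle
# exactly when every eigenvalue of Phi lies strictly inside it.
moduli = [abs(ev) for ev in eigenvalues_2x2(phi)]
is_stationary = all(m < 1.0 for m in moduli)
print(moduli, is_stationary)
```

For higher orders or more variables, the same check would be applied to the eigenvalues of the companion matrix, which needs a general eigenvalue routine.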
State-Space Representation
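Another representation of the VARMAX(p,q,s) model is in the form of a state-variable or a state-space model, which consists of a state equation
$$z_t = F z_{t-1} + K x_t + G \epsilon_t$$
and an observation equation
$$y_t = H z_t$$
where
$$z_t = (y_t', \ldots, y_{t-p+1}', x_t', \ldots, x_{t-s+1}', \epsilon_t', \ldots, \epsilon_{t-q+1}')'$$
$$K = (\Theta^{*\prime}_0, 0_{r \times k}, \ldots, 0_{r \times k}, I_r, 0_{r \times r}, \ldots, 0_{r \times r}, 0_{r \times k}, \ldots, 0_{r \times k})'$$
$$G = (I_k, 0_{k \times k}, \ldots, 0_{k \times k}, 0_{k \times r}, \ldots, 0_{k \times r}, I_k, 0_{k \times k}, \ldots, 0_{k \times k})'$$
$$F = \begin{bmatrix}
\Phi_1 & \cdots & \Phi_{p-1} & \Phi_p & \Theta^*_1 & \cdots & \Theta^*_{s-1} & \Theta^*_s & -\Theta_1 & \cdots & -\Theta_{q-1} & -\Theta_q \\
I_k & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & \cdots & I_k & 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & \cdots & 0 & 0 & I_r & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & I_r & 0 & 0 & \cdots & 0 & 0 \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 & I_k & \cdots & 0 & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & I_k & 0
\end{bmatrix}$$
and
$$H = [I_k, 0_{k \times k}, \ldots, 0_{k \times k}, 0_{k \times r}, \ldots, 0_{k \times r}, 0_{k \times k}, \ldots, 0_{k \times k}]$$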
The exogenous variables $x_t$ are assumed to follow a VARMA(p,q) model
$$x_t = \sum_{i=1}^{p} A_i x_{t-i} + a_t - \sum_{i=1}^{q} C_i a_{t-i}$$
The model can also be expressed as
$$A(B) x_t = C(B) a_t$$
where $A(B) = I_r - A_1 B - \cdots - A_p B^p$ and $C(B) = I_r - C_1 B - \cdots - C_q B^q$ are matrix polynomials in B, and the $A_i$ and $C_i$ are $r \times r$ matrices. Without loss of generality, the AR and MA orders can be taken to be the same as the VARMAX(p,q,s) model, and $a_t$ and $\epsilon_t$ are independent white noise processes.
Under suitable conditions such as stationarity, $x_t$ is represented by an infinite order moving-average process
$$x_t = A(B)^{-1} C(B) a_t = \Psi^x(B) a_t = \sum_{j=0}^{\infty} \Psi^x_j a_{t-j}$$
where $\Psi^x(B) = A(B)^{-1} C(B) = \sum_{j=0}^{\infty} \Psi^x_j B^j$.
The optimal minimum mean squared error (minimum MSE) i-step-ahead forecast of $x_{t+i}$ is
$$x_{t+i|t} = \sum_{j=i}^{\infty} \Psi^x_j a_{t+i-j}$$
$$x_{t+i|t+1} = x_{t+i|t} + \Psi^x_{i-1} a_{t+1}$$
For $i > q$,
$$x_{t+i|t} = \sum_{j=1}^{p} A_j x_{t+i-j|t}$$
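Substituting the moving-average representation of $x_t$, the VARMAX(p,q,s) model can be written as
$$y_t = \Phi(B)^{-1} \Theta^*(B) x_t + \Phi(B)^{-1} \Theta(B) \epsilon_t = \Psi^*(B) \Psi^x(B) a_t + \Psi(B) \epsilon_t = V(B) a_t + \Psi(B) \epsilon_t$$
or
$$y_t = \sum_{j=0}^{\infty} V_j a_{t-j} + \sum_{j=0}^{\infty} \Psi_j \epsilon_{t-j}$$
where $\Psi(B) = \Phi(B)^{-1} \Theta(B) = \sum_{j=0}^{\infty} \Psi_j B^j$, $\Psi^*(B) = \Phi(B)^{-1} \Theta^*(B)$, and $V(B) = \Psi^*(B) \Psi^x(B) = \sum_{j=0}^{\infty} V_j B^j$.
The optimal (minimum MSE) i-step-ahead forecast of $y_{t+i}$ is
$$y_{t+i|t} = \sum_{j=i}^{\infty} V_j a_{t+i-j} + \sum_{j=i}^{\infty} \Psi_j \epsilon_{t+i-j}$$
$$y_{t+i|t+1} = y_{t+i|t} + V_{i-1} a_{t+1} + \Psi_{i-1} \epsilon_{t+1}$$
For $i > q$,
$$\begin{aligned}
y_{t+i|t} &= \sum_{j=1}^{p} \Phi_j y_{t+i-j|t} + \sum_{j=0}^{s} \Theta^*_j x_{t+i-j|t} \\
&= \sum_{j=1}^{p} \Phi_j y_{t+i-j|t} + \Theta^*_0 x_{t+i|t} + \sum_{j=1}^{s} \Theta^*_j x_{t+i-j|t} \\
&= \sum_{j=1}^{p} \Phi_j y_{t+i-j|t} + \Theta^*_0 \sum_{j=1}^{p} A_j x_{t+i-j|t} + \sum_{j=1}^{s} \Theta^*_j x_{t+i-j|t} \\
&= \sum_{j=1}^{p} \Phi_j y_{t+i-j|t} + \sum_{j=1}^{u} (\Theta^*_0 A_j + \Theta^*_j) x_{t+i-j|t}
\end{aligned}$$
where $u = \max(p, s)$. Define $\Pi_j = \Theta^*_0 A_j + \Theta^*_j$. For $i = v > q$ with $v = \max(p, q+1)$, you obtain
$$y_{t+v|t} = \sum_{j=1}^{p} \Phi_j y_{t+v-j|t} + \sum_{j=1}^{u} \Pi_j x_{t+v-j|t} \quad \text{for } u \le v$$
$$y_{t+v|t} = \sum_{j=1}^{p} \Phi_j y_{t+v-j|t} + \sum_{j=1}^{r} \Pi_j x_{t+v-j|t} \quad \text{for } u > v$$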
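From the preceding relations, a state equation is
$$z_{t+1} = F z_t + K x^*_t + G e_{t+1}$$
and an observation equation is
$$y_t = H z_t$$
where
$$z_t = (y_t', y'_{t+1|t}, \ldots, y'_{t+v-1|t}, x_t', x'_{t+1|t}, \ldots, x'_{t+v-1|t})', \quad x^*_t = (x'_{t+v-u}, x'_{t+v-u+1}, \ldots, x'_{t-1})', \quad e_{t+1} = (a'_{t+1}, \epsilon'_{t+1})'$$
$$F = \begin{bmatrix}
0 & I_k & 0 & \cdots & 0 & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & I_k & \cdots & 0 & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\Phi_v & \Phi_{v-1} & \Phi_{v-2} & \cdots & \Phi_1 & \Pi_v & \Pi_{v-1} & \Pi_{v-2} & \cdots & \Pi_1 \\
0 & 0 & 0 & \cdots & 0 & 0 & I_r & 0 & \cdots & 0 \\
0 & 0 & 0 & \cdots & 0 & 0 & 0 & I_r & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 0 & A_v & A_{v-1} & A_{v-2} & \cdots & A_1
\end{bmatrix}$$
$$K = \begin{bmatrix}
0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\Pi_u & \Pi_{u-1} & \cdots & \Pi_{v+1} \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}, \quad
G = \begin{bmatrix}
V_0 & I_k \\
V_1 & \Psi_1 \\
\vdots & \vdots \\
V_{v-1} & \Psi_{v-1} \\
I_r & 0_{r \times k} \\
\Psi^x_1 & 0_{r \times k} \\
\vdots & \vdots \\
\Psi^x_{v-1} & 0_{r \times k}
\end{bmatrix}$$
and
$$H = [I_k, 0_{k \times k}, \ldots, 0_{k \times k}, 0_{k \times r}, \ldots, 0_{k \times r}]$$
Note that the matrix K and the input vector $x^*_t$ are defined only when $u > v$.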
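The VARMAX(p,q,s) model can also be written as a dynamic simultaneous equations model. Because $\Sigma$ is positive-definite, there exists a lower triangular matrix $A_0$ with ones on the diagonal such that $A_0 \Sigma A_0' = \Sigma_d$, where $\Sigma_d$ is a diagonal matrix with positive diagonal elements. Then
$$A_0 y_t = \sum_{i=1}^{p} A_i y_{t-i} + \sum_{i=0}^{s} C^*_i x_{t-i} + C_0 \epsilon_t - \sum_{i=1}^{q} C_i \epsilon_{t-i}$$
where $A_i = A_0 \Phi_i$, $C^*_i = A_0 \Theta^*_i$, $C_0 = A_0$, and $C_i = A_0 \Theta_i$.
As an alternative form,
$$A_0 y_t = \sum_{i=1}^{p} A_i y_{t-i} + \sum_{i=0}^{s} C^*_i x_{t-i} + a_t - \sum_{i=1}^{q} C_i a_{t-i}$$
where $A_i = A_0 \Phi_i$, $C^*_i = A_0 \Theta^*_i$, $C_i = A_0 \Theta_i A_0^{-1}$, and $a_t = C_0 \epsilon_t$ has a diagonal covariance matrix $\Sigma_d$. The PRINT=(DYNAMIC) option returns the parameter estimates that result from estimating the model in this form.
A dynamic simultaneous equations model involves a leading (lower triangular) coefficient matrix for $y_t$ at lag 0 or a leading coefficient matrix for $\epsilon_t$ at lag 0. Such a representation of the VARMAX(p,q,s) model can be more useful in certain circumstances than the standard representation. From the linear combination of the dependent variables obtained by $A_0 y_t$, you can easily see the relationship between the dependent variables in the current time.
The following statements provide the dynamic simultaneous equations of the VAR(1) model.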
   proc iml;
      sig = {1.0 0.5, 0.5 1.25};
      phi = {1.2 -0.5, 0.6 0.3};
      /* simulate the vector time series */
      call varmasim(y,phi) sigma = sig n = 100 seed = 34657;
      cn = {y1 y2};
      create simul1 from y[colname=cn];
      append from y;
   quit;

   data simul1;
      set simul1;
      date = intnx( 'year', '01jan1900'd, _n_-1 );
      format date year4.;
   run;

   proc varmax data=simul1;
      model y1 y2 / p=1 noint print=(dynamic);
   run;
This is the same data set and model used in the section Getting Started: VARMAX Procedure on page 1858. You can compare the results of the VARMA model form and the dynamic simultaneous equations model form.
AR
Lag   Variable         y1         y2
0     y1          1.00000    0.00000
      y2         -0.30845    1.00000
1     y1          1.15977   -0.51058
      y2          0.18861    0.54247

Dynamic Model Parameter Estimates
Standard Error   t Value   Pr > |t|   Variable
       0.05508     21.06     0.0001   y1(t-1)
       0.07140     -7.15     0.0001   y2(t-1)
                                      y1(t)
       0.05779      3.26     0.0015   y1(t-1)
       0.07491      7.24     0.0001   y2(t-1)
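In Figure 30.4 in the section Getting Started: VARMAX Procedure on page 1858, the covariance of $\epsilon_t$ estimated from the VARMAX model form is
$$\Sigma = \begin{pmatrix} 1.28875 & 0.39751 \\ 0.39751 & 1.41839 \end{pmatrix}$$
Figure 30.25 shows the results from estimating the model as a dynamic simultaneous equations model. By the decomposition of $\Sigma$, you get a diagonal matrix ($\Sigma_a$) and a lower triangular matrix ($A_0$) such as $\Sigma_a = A_0 \Sigma A_0'$ where
$$\Sigma_a = \begin{pmatrix} 1.28875 & 0 \\ 0 & 1.29578 \end{pmatrix} \quad \text{and} \quad A_0 = \begin{pmatrix} 1 & 0 \\ -0.30845 & 1 \end{pmatrix}$$
The lower triangular matrix ($A_0$) is shown in the left side of the simultaneous equations model. The parameter estimates in equations system are shown in the right side of the two-equations system. The simultaneous equations model is written as
$$\begin{pmatrix} 1 & 0 \\ -0.30845 & 1 \end{pmatrix} y_t = \begin{pmatrix} 1.15977 & -0.51058 \\ 0.18861 & 0.54247 \end{pmatrix} y_{t-1} + a_t$$
and the two univariate equations are written as
$$y_{1t} = 1.15977\, y_{1,t-1} - 0.51058\, y_{2,t-1} + a_{1t}$$
$$y_{2t} = 0.30845\, y_{1t} + 0.18861\, y_{1,t-1} + 0.54247\, y_{2,t-1} + a_{2t}$$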
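The decomposition can be reproduced numerically. The following Python sketch is an illustration only, not the algorithm that PROC VARMAX uses: it factors $\Sigma = L D L'$ with a unit lower triangular $L$, takes $A_0 = L^{-1}$ so that $A_0 \Sigma A_0' = D$, and forms $A_1 = A_0 \Phi_1$. The function names are assumptions; the input values are the estimates quoted in this section, with $\Phi_1$ taken from the lag 1 row of Figure 30.30.

```python
def ldl_decompose(s):
    """Return (L, d) with s = L diag(d) L' and L unit lower triangular."""
    n = len(s)
    low = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = s[j][j] - sum(low[j][m] ** 2 * d[m] for m in range(j))
        for i in range(j + 1, n):
            partial = sum(low[i][m] * low[j][m] * d[m] for m in range(j))
            low[i][j] = (s[i][j] - partial) / d[j]
    return low, d


def invert_unit_lower(low):
    """Invert a unit lower triangular matrix by forward substitution."""
    n = len(low)
    inv = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for c in range(n):
        for i in range(c + 1, n):
            inv[i][c] = -sum(low[i][m] * inv[m][c] for m in range(c, i))
    return inv


sigma = [[1.28875, 0.39751],
         [0.39751, 1.41839]]
phi = [[1.15977, -0.51058],
       [0.54634, 0.38499]]

low, sigma_a = ldl_decompose(sigma)   # sigma_a holds the diagonal of Sigma_a
a0 = invert_unit_lower(low)           # A_0 Sigma A_0' = diag(sigma_a)
n = len(phi)
a1 = [[sum(a0[i][m] * phi[m][j] for m in range(n)) for j in range(n)]
      for i in range(n)]              # A_1 = A_0 Phi_1
print(a0, sigma_a, a1)
```

Up to rounding, the second row of `a0` and `a1` corresponds to the coefficients of the $y_{2t}$ equation above.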
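The VARMAX(p,q,s) model can be represented as $y_t = \Psi^*(B) x_t + \Psi(B) \epsilon_t$, where $\Psi^*(B) = \Phi(B)^{-1} \Theta^*(B) = \sum_{j=0}^{\infty} \Psi^*_j B^j$ and $\Psi(B) = \Phi(B)^{-1} \Theta(B) = \sum_{j=0}^{\infty} \Psi_j B^j$.
The elements of the matrices $\Psi_j$ from the operator $\Psi(B)$, called the impulse response, can be interpreted as the impact that a shock in one variable has on another variable. Let $\psi_{j,in}$ be the $(i,n)$th element of $\Psi_j$ at lag j, where i is the index for the impulse variable, and n is the index for the response variable (impulse → response). For instance, $\psi_{j,11}$ is an impulse response to $y_{1t}$ → $y_{1t}$, and $\psi_{j,12}$ is an impulse response to $y_{1t}$ → $y_{2t}$.
The elements of the matrices $\Psi^o_j = \Psi_j P$, called the orthogonal impulse response, can be interpreted as the effects of the components of the standardized shock process $u_t = P^{-1} \epsilon_t$ on the process $y_t$ at lag j, where P is a lower triangular matrix such that $\Sigma = PP'$.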
In Figure 30.26, the variables x1 and x2 are impulses and the variables y1, y2, and y3 are responses. You can read the table matching the pairs of impulse → response such as x1 → y1, x1 → y2, x1 → y3, x2 → y1, x2 → y2, and x2 → y3. In the pair of x1 → y1, you can see the long-run responses of y1 to an impulse in x1 (the values are 1.69281, 0.35399, 0.09090, and so on for lag 0, lag 1, lag 2, and so on, respectively).

Variable   Lag          x1          x2
y1           0     1.69281    -0.00859
             1     0.35399     0.01727
             2     0.09090     0.00714
             3     0.05136     0.00214
             4     0.04717     0.00072
             5     0.04620     0.00040
y2           0    -6.09850     2.57980
             1    -5.15484     0.45445
             2    -3.04168     0.04391
             3    -2.23797    -0.01376
             4    -1.98183    -0.01647
             5    -1.87415    -0.01453
y3           0    -0.02317    -0.01274
             1     1.57476    -0.01435
             2     1.80231     0.00398
             3     1.77024     0.01062
             4     1.70435     0.01197
             5     1.63913     0.01187
Figure 30.27 shows the responses of y1, y2, and y3 to a forecast error impulse in x1.
Variable   Lag          x1          x2
y1           0     1.69281    -0.00859
             1     2.04680     0.00868
             2     2.13770     0.01582
             3     2.18906     0.01796
             4     2.23623     0.01867
             5     2.28243     0.01907
y2           0    -6.09850     2.57980
             1   -11.25334     3.03425
             2   -14.29502     3.07816
             3   -16.53299     3.06440
             4   -18.51482     3.04793
             5   -20.38897     3.03340
y3           0    -0.02317    -0.01274
             1     1.55159    -0.02709
             2     3.35390    -0.02311
             3     5.12414    -0.01249
             4     6.82848    -0.00052
             5     8.46762     0.01135
Figure 30.29 shows the accumulated responses of y1, y2, and y3 to a forecast error impulse in x1.
The following statements provide the impulse response function, the accumulated impulse response function, and the orthogonalized impulse response function with their standard errors for a VAR(1) model. Parts of the VARMAX procedure output are shown in Figure 30.30, Figure 30.32, and Figure 30.34.
   proc varmax data=simul1 plot=impulse;
      model y1 y2 / p=1 noint lagmax=5
                    print=(impulse=(all))
                    printform=univariate;
   run;
Figure 30.30 is the output in a univariate format associated with the PRINT=(IMPULSE=) option for the impulse response function. The keyword STD stands for the standard errors of the elements. The matrix in terms of the lag 0 does not print since it is the identity. In Figure 30.30, the variables y1 and y2 of the first row are impulses, and the variables y1 and y2 of the first column are responses. You can read the table matching the impulse → response pairs, such as y1 → y1, y1 → y2, y2 → y1, and y2 → y2. For example, in the pair of y1 → y1 at lag 3, the response is 0.8055. This represents the impact on y1 of one-unit change in y1 after 3 periods. As the lag gets higher, you can see the long-run responses of y1 to an impulse in itself.

Variable   Lag          y1          y2
y1           1     1.15977    -0.51058
           STD     0.05508     0.05898
             2     1.06612    -0.78872
           STD     0.10450     0.10702
             3     0.80555    -0.84798
           STD     0.14522     0.14121
             4     0.47097    -0.73776
           STD     0.17191     0.15864
             5     0.14315    -0.52450
           STD     0.18214     0.16115
y2           1     0.54634     0.38499
           STD     0.05779     0.06188
             2     0.84396    -0.13073
           STD     0.08481     0.08556
             3     0.90738    -0.48124
           STD     0.10307     0.09865
             4     0.78943    -0.64856
           STD     0.12318     0.11661
             5     0.56123    -0.65275
           STD     0.14236     0.13482
Figure 30.31 shows the responses of y1 and y2 to a forecast error impulse in y1 with two standard errors.
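For a pure VAR(1) model the impulse response matrices reduce to powers of the AR matrix, $\Psi_0 = I_k$ and $\Psi_j = \Psi_{j-1} \Phi_1$. The following Python sketch reproduces the point estimates (not the standard errors) of the table from the lag 1 values; the function names are illustrative assumptions, and small differences in the last digit come from using rounded inputs.

```python
phi = [[1.15977, -0.51058],
       [0.54634, 0.38499]]


def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def impulse_responses(phi, max_lag):
    """Psi_0 = I and Psi_j = Psi_{j-1} Phi for a VAR(1); returns lags 0..max_lag."""
    k = len(phi)
    psi = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    responses = [psi]
    for _ in range(max_lag):
        psi = matmul(psi, phi)
        responses.append(psi)
    return responses


psi = impulse_responses(phi, 5)
for lag in range(1, 6):
    print(lag, [[round(v, 5) for v in row] for row in psi[lag]])
```

Row i, column n of `psi[lag]` is laid out like the table: the response of variable i to an impulse in variable n.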
Figure 30.32 is the output in a univariate format associated with the PRINT=(IMPULSE=) option for the accumulated impulse response function. The matrix in terms of the lag 0 does not print since it is the identity.
Variable   Lag          y1          y2
y1           1     2.15977    -0.51058
           STD     0.05508     0.05898
             2     3.22589    -1.29929
           STD     0.21684     0.22776
             3     4.03144    -2.14728
           STD     0.52217     0.53649
             4     4.50241    -2.88504
           STD     0.96922     0.97088
             5     4.64556    -3.40953
           STD     1.51137     1.47122
y2           1     0.54634     1.38499
           STD     0.05779     0.06188
             2     1.39030     1.25426
           STD     0.17614     0.18392
             3     2.29768     0.77302
           STD     0.36166     0.36874
             4     3.08711     0.12447
           STD     0.65129     0.65333
             5     3.64834    -0.52829
           STD     1.07510     1.06309
Figure 30.33 shows the accumulated responses of y1 and y2 to a forecast error impulse in y1 with two standard errors.
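The accumulated responses are running sums of the impulse response matrices, starting from the identity at lag 0. A minimal Python sketch, using the first three impulse response matrices as printed in Figure 30.30 (the variable and function names are illustrative assumptions):

```python
# Impulse response matrices for lags 1-3, copied from Figure 30.30
psi_by_lag = [
    [[1.15977, -0.51058], [0.54634, 0.38499]],
    [[1.06612, -0.78872], [0.84396, -0.13073]],
    [[0.80555, -0.84798], [0.90738, -0.48124]],
]


def accumulate(psi_by_lag):
    """Running sums I + Psi_1 + ... + Psi_l for l = 1, 2, ..."""
    k = len(psi_by_lag[0])
    total = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    sums = []
    for psi in psi_by_lag:
        total = [[total[i][j] + psi[i][j] for j in range(k)] for i in range(k)]
        sums.append(total)
    return sums


accumulated = accumulate(psi_by_lag)
for lag, matrix in enumerate(accumulated, start=1):
    print(lag, [[round(v, 5) for v in row] for row in matrix])
```

The diagonal entries include the lag 0 identity, which is why the lag 1 accumulated response of y1 to itself exceeds the lag 1 impulse response by one.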
Figure 30.34 is the output in a univariate format associated with the PRINT=(IMPULSE=) option for the orthogonalized impulse response function. The two right-hand side columns, y1 and y2, represent the y1_innovation and y2_innovation variables. These are the impulses variables. The left-hand side column contains responses variables, y1 and y2. You can read the table by matching the impulse → response pairs such as y1_innovation → y1, y1_innovation → y2, y2_innovation → y1, and y2_innovation → y2.

Variable   Lag          y1          y2
y1           0     1.13523     0.00000
           STD     0.08068     0.00000
             1     1.13783    -0.58120
           STD     0.10666     0.14110
             2     0.93412    -0.89782
           STD     0.13113     0.16776
             3     0.61756    -0.96528
           STD     0.15348     0.18595
             4     0.27633    -0.83981
           STD     0.16940     0.19230
             5    -0.02115    -0.59705
           STD     0.17432     0.18830
y2           0     0.35016     1.13832
           STD     0.11676     0.08855
             1     0.75503     0.43824
           STD     0.06949     0.10937
             2     0.91231    -0.14881
           STD     0.10553     0.13565
             3     0.86158    -0.54780
           STD     0.12266     0.14825
             4     0.66909    -0.73827
           STD     0.13305     0.15846
             5     0.40856    -0.74304
           STD     0.14189     0.16765

In Figure 30.4, there is a positive correlation between $\epsilon_{1t}$ and $\epsilon_{2t}$. Therefore, shock in y1 can be accompanied by a shock in y2 in the same period. For example, in the pair of y1_innovation → y2, you can see the long-run responses of y2 to an impulse in y1_innovation. Figure 30.35 shows the orthogonalized responses of y1 and y2 to a forecast error impulse in y1 with two standard errors.
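The lag 0 and lag 1 rows of the orthogonalized table can be checked by hand. The following Python sketch assumes the usual construction $\Psi^o_j = \Psi_j P$ with $P$ the lower triangular Cholesky factor of $\Sigma$; the function names are illustrative, and the inputs are the estimates of $\Sigma$ and $\Psi_1$ quoted earlier in this section.

```python
import math

sigma = [[1.28875, 0.39751],
         [0.39751, 1.41839]]
psi_1 = [[1.15977, -0.51058],
         [0.54634, 0.38499]]


def cholesky(s):
    """Lower triangular P with s = P P'."""
    n = len(s)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(p[i][m] * p[j][m] for m in range(j))
            if i == j:
                p[i][j] = math.sqrt(s[i][i] - partial)
            else:
                p[i][j] = (s[i][j] - partial) / p[j][j]
    return p


def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


p = cholesky(sigma)        # orthogonalized response at lag 0 (Psi_0 = I)
ortho_1 = matmul(psi_1, p)  # orthogonalized response at lag 1
print([[round(v, 5) for v in row] for row in p])
print([[round(v, 5) for v in row] for row in ortho_1])
```

The zero in the upper right corner of `p` reflects the ordering of the variables: a y2 innovation has no contemporaneous effect on y1 under this factorization.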
Forecasting
The optimal (minimum MSE) l-step-ahead forecast of $y_{t+l}$ is
$$y_{t+l|t} = \sum_{j=1}^{p} \Phi_j y_{t+l-j|t} + \sum_{j=0}^{s} \Theta^*_j x_{t+l-j|t} - \sum_{j=l}^{q} \Theta_j \epsilon_{t+l-j}, \quad l \le q$$
$$y_{t+l|t} = \sum_{j=1}^{p} \Phi_j y_{t+l-j|t} + \sum_{j=0}^{s} \Theta^*_j x_{t+l-j|t}, \quad l > q$$
with $y_{t+l-j|t} = y_{t+l-j}$ and $x_{t+l-j|t} = x_{t+l-j}$ for $l \le j$. For the forecasts $x_{t+l-j|t}$, see the section State-Space Representation on page 1913.
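The recursion is easiest to see in the special case without exogenous or moving-average terms. The following Python sketch is a simplification under that assumption, with no intercept: each new forecast is built from earlier forecasts or, when $l \le j$, from observed values. The function name `var_forecast` and the starting observation are hypothetical; only the AR matrix comes from the VAR(1) example.

```python
def var_forecast(phis, history, lead):
    """l-step-ahead forecasts y_{t+l|t}, l = 1..lead, for a VAR(p) without
    intercept, exogenous, or MA terms (a simplifying assumption).

    phis    : [Phi_1, ..., Phi_p], each k x k
    history : observed vectors, oldest first, at least p of them
    """
    k = len(phis[0])
    path = [list(y) for y in history]  # observed values, then forecasts
    for _ in range(lead):
        nxt = [0.0] * k
        for j, phi_j in enumerate(phis, start=1):
            lagged = path[-j]          # y_{t+l-j|t}; observed when l <= j
            for i in range(k):
                nxt[i] += sum(phi_j[i][m] * lagged[m] for m in range(k))
        path.append(nxt)
    return path[len(history):]


phi = [[1.15977, -0.51058],
       [0.54634, 0.38499]]
last_obs = [1.0, 2.0]                  # hypothetical final observation
forecasts = var_forecast([phi], [last_obs], lead=3)
print(forecasts)
```

With moving-average or exogenous terms, the extra sums in the formulas above would be added to `nxt` at each step.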
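The covariance matrix of the l-step-ahead prediction error $e_{t+l|t} = y_{t+l} - y_{t+l|t}$ can be written as
$$\Sigma(l) = \sum_{j=0}^{l-1} \Psi_j \Sigma \Psi_j' = \sum_{j=0}^{l-1} \Psi^o_j \Psi^{o\prime}_j$$
where $\Psi^o_j = \Psi_j P$ with a lower triangular matrix P such that $\Sigma = PP'$. Under the assumption of normality of the $\epsilon_t$, the l-step-ahead prediction error $e_{t+l|t}$ is also normally distributed as multivariate $N(0, \Sigma(l))$. Hence, it follows that the diagonal elements $\sigma^2_{ii}(l)$ of $\Sigma(l)$ can be used, together with the point forecasts $y_{i,t+l|t}$, to construct l-step-ahead prediction intervals of the future values of the component series, $y_{i,t+l}$.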
The following statements use the COVPE option to compute the covariance matrices of the prediction errors for a VAR(1) model. The parts of the VARMAX procedure output are shown in Figure 30.36 and Figure 30.37.
   proc varmax data=simul1;
      model y1 y2 / p=1 noint lagmax=5
                    printform=both
                    print=(decompose(5) impulse=(all) covpe(5));
   run;
Figure 30.36 is the output in a matrix format associated with the COVPE option for the prediction error covariance matrices.
Figure 30.36 Covariances of Prediction Errors (COVPE Option)
The VARMAX Procedure

Prediction Error Covariances
Lead   Variable         y1         y2
1      y1          1.28875    0.39751
       y2          0.39751    1.41839
2      y1          2.92119    1.00189
       y2          1.00189    2.18051
3      y1          4.59984    1.98771
       y2          1.98771    3.03498
4      y1          5.91299    3.04856
       y2          3.04856    4.07738
5      y1          6.69463    3.85346
       y2          3.85346    5.07010
Figure 30.37 is the output in a univariate format associated with the COVPE option for the prediction error covariances. This printing format more easily explains the prediction error covariances of each variable.
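The values in Figure 30.36 can be approximated from the estimates of $\Phi_1$ and $\Sigma$ alone. The following Python sketch assumes the standard sum $\Sigma(l) = \sum_{j=0}^{l-1} \Psi_j \Sigma \Psi_j'$ with $\Psi_j = \Phi_1^j$ for the VAR(1) model; the function names are illustrative, and the inputs are the rounded estimates printed in this section.

```python
phi = [[1.15977, -0.51058],
       [0.54634, 0.38499]]
sigma = [[1.28875, 0.39751],
         [0.39751, 1.41839]]


def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def prediction_error_cov(phi, sigma, lead):
    """Sigma(l) = sum_{j<l} Psi_j Sigma Psi_j' for l = 1..lead (VAR(1) case)."""
    k = len(phi)
    psi = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    total = [[0.0] * k for _ in range(k)]
    covariances = []
    for _ in range(lead):
        term = matmul(matmul(psi, sigma), transpose(psi))
        total = [[total[i][j] + term[i][j] for j in range(k)] for i in range(k)]
        covariances.append(total)
        psi = matmul(psi, phi)   # move on to Psi_{j+1}
    return covariances


covpe = prediction_error_cov(phi, sigma, 5)
for lead, matrix in enumerate(covpe, start=1):
    print(lead, [[round(v, 5) for v in row] for row in matrix])
```

The square roots of the diagonal elements are the forecast standard errors used for prediction intervals.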
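When the future values of the exogenous (independent) variables are not specified and are forecast from their VARMA(p,q) model, the optimal forecast $y_{t+l|t}$ is
$$y_{t+l|t} = \sum_{j=l}^{\infty} V_j a_{t+l-j} + \sum_{j=l}^{\infty} \Psi_j \epsilon_{t+l-j}$$
and the forecast error is
$$e_{t+l|t} = \sum_{j=0}^{l-1} V_j a_{t+l-j} + \sum_{j=0}^{l-1} \Psi_j \epsilon_{t+l-j}$$
Therefore, the covariance matrix of the l-step-ahead prediction error is given as
$$\Sigma(l) = \mathrm{Cov}(e_{t+l|t}) = \sum_{j=0}^{l-1} V_j \Sigma_a V_j' + \sum_{j=0}^{l-1} \Psi_j \Sigma \Psi_j'$$
where $\Sigma_a$ is the covariance of the white noise series $a_t$, and $a_t$ is the white noise series for the VARMA(p,q) model of exogenous (independent) variables, which is assumed not to be correlated with $\epsilon_t$ or its lags.
When future exogenous (independent) variables are specified: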
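The optimal forecast $y_{t+l|t}$ of $y_t$ conditioned on the past information and also on known future values $x_{t+1}, \ldots, x_{t+l}$ can be represented as
$$y_{t+l|t} = \sum_{j=0}^{\infty} \Psi^*_j x_{t+l-j} + \sum_{j=l}^{\infty} \Psi_j \epsilon_{t+l-j}$$
and the forecast error is
$$e_{t+l|t} = \sum_{j=0}^{l-1} \Psi_j \epsilon_{t+l-j}$$
Thus, the covariance matrix of the l-step-ahead prediction error is given as
$$\Sigma(l) = \mathrm{Cov}(e_{t+l|t}) = \sum_{j=0}^{l-1} \Psi_j \Sigma \Psi_j'$$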
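In terms of the elements $\psi^o_{j,in}$ of the orthogonalized impulse response matrices $\Psi^o_j$, the mean squared error of the l-step-ahead forecast of variable i is
$$\mathrm{MSE}(y_{i,t+l|t}) = E(y_{i,t+l} - y_{i,t+l|t})^2 = \sum_{j=0}^{l-1} \sum_{n=1}^{k} (\psi^o_{j,in})^2$$
Note that $\sum_{j=0}^{l-1} (\psi^o_{j,in})^2$ is interpreted as the contribution of innovations in variable n to the prediction error covariance of the l-step-ahead forecast of variable i. The proportion, $\omega_{l,in}$, of the l-step-ahead forecast error covariance of variable i accounting for the innovations in variable n is
$$\omega_{l,in} = \sum_{j=0}^{l-1} (\psi^o_{j,in})^2 / \mathrm{MSE}(y_{i,t+l|t})$$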
The following statements use the DECOMPOSE option to compute the decomposition of prediction error covariances and their proportions for a VAR(1) model:
   proc varmax data=simul1;
      model y1 y2 / p=1 noint print=(decompose(15))
                    printform=univariate;
   run;
The proportions of decomposition of prediction error covariances of two variables are given in Figure 30.38. The output explains that about 91.356% of the one-step-ahead prediction error covariances of the variable $y_{2t}$ is accounted for by its own innovations and about 8.644% is accounted for by $y_{1t}$ innovations.
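The one-step-ahead proportions quoted above can be recovered from $\Sigma$ alone, since $\Psi^o_0 = P$. The following Python sketch computes the proportions for several leads of a VAR(1) model, assuming $\Psi^o_j = \Phi_1^j P$ with $P$ the Cholesky factor of $\Sigma$; the function names are illustrative assumptions.

```python
import math

phi = [[1.15977, -0.51058],
       [0.54634, 0.38499]]
sigma = [[1.28875, 0.39751],
         [0.39751, 1.41839]]


def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def cholesky(s):
    """Lower triangular P with s = P P'."""
    n = len(s)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(p[i][m] * p[j][m] for m in range(j))
            if i == j:
                p[i][j] = math.sqrt(s[i][i] - partial)
            else:
                p[i][j] = (s[i][j] - partial) / p[j][j]
    return p


def decomposition_proportions(phi, sigma, lead):
    """omega_{l,in} for l = 1..lead; row i = forecast variable, column n = innovation."""
    k = len(phi)
    psi_o = cholesky(sigma)                  # Psi^o_0 = P
    contrib = [[0.0] * k for _ in range(k)]  # running sums of squared responses
    result = []
    for _ in range(lead):
        for i in range(k):
            for n in range(k):
                contrib[i][n] += psi_o[i][n] ** 2
        result.append([[contrib[i][n] / sum(contrib[i]) for n in range(k)]
                       for i in range(k)])
        psi_o = matmul(phi, psi_o)           # Psi^o_{j+1} = Phi Psi^o_j
    return result


proportions = decomposition_proportions(phi, sigma, 15)
print([[round(v, 5) for v in row] for row in proportions[0]])
```

Each row of a proportion matrix sums to one, because the row sum of the squared orthogonalized responses is the MSE of that variable's forecast.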
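If the dependent variables are differenced, let $z_t$ denote the original series and $\Delta(B)$ the differencing operator specified in the MODEL statement, so that $y_t = \Delta(B) z_t$. This gives the relationship
$$z_t = \Delta^{-1}(B) y_t = \sum_{j=0}^{\infty} \Lambda_j y_{t-j}$$
where $\Delta^{-1}(B) = \sum_{j=0}^{\infty} \Lambda_j B^j$ and $\Lambda_0 = I_k$.
The l-step-ahead prediction of $z_{t+l}$ is
$$z_{t+l|t} = \sum_{j=0}^{l-1} \Lambda_j y_{t+l-j|t} + \sum_{j=l}^{\infty} \Lambda_j y_{t+l-j}$$
The l-step-ahead prediction error of $z_{t+l}$ is
$$\sum_{j=0}^{l-1} \Lambda_j (y_{t+l-j} - y_{t+l-j|t}) = \sum_{j=0}^{l-1} \left( \sum_{u=0}^{j} \Lambda_u \Psi_{j-u} \right) \epsilon_{t+l-j}$$
Letting $\Sigma_z(0) = 0$, the covariance matrix of the l-step-ahead prediction error of $z_{t+l}$, $\Sigma_z(l)$, is
$$\begin{aligned}
\Sigma_z(l) &= \sum_{j=0}^{l-1} \left( \sum_{u=0}^{j} \Lambda_u \Psi_{j-u} \right) \Sigma \left( \sum_{u=0}^{j} \Lambda_u \Psi_{j-u} \right)' \\
&= \Sigma_z(l-1) + \left( \sum_{j=0}^{l-1} \Lambda_j \Psi_{l-1-j} \right) \Sigma \left( \sum_{j=0}^{l-1} \Lambda_j \Psi_{l-1-j} \right)'
\end{aligned}$$
If there are stochastic exogenous (independent) variables, the covariance matrix of the l-step-ahead prediction error of $z_{t+l}$, $\Sigma_z(l)$, is
$$\begin{aligned}
\Sigma_z(l) = \Sigma_z(l-1) &+ \left( \sum_{j=0}^{l-1} \Lambda_j \Psi_{l-1-j} \right) \Sigma \left( \sum_{j=0}^{l-1} \Lambda_j \Psi_{l-1-j} \right)' \\
&+ \left( \sum_{j=0}^{l-1} \Lambda_j V_{l-1-j} \right) \Sigma_a \left( \sum_{j=0}^{l-1} \Lambda_j V_{l-1-j} \right)'
\end{aligned}$$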
The cross-correlation matrix at lag l is
$$\rho(l) = D^{-1} \Gamma(l) D^{-1}$$
where $\Gamma(l)$ is the cross-covariance matrix at lag l and D is a diagonal matrix with the standard deviations of the components of $y_t$ on the diagonal.
The sample cross-covariance matrix at lag l, denoted as $C(l)$, is computed as
$$\hat\Gamma(l) = C(l) = \frac{1}{T} \sum_{t=1}^{T-l} \tilde y_t \tilde y'_{t+l}$$
where $\tilde y_t$ is the centered data and T is the number of nonmissing observations. Thus, $\hat\Gamma(l)$ has $(i,j)$th element $\hat\gamma_{ij}(l) = c_{ij}(l)$. The sample cross-correlation matrix at lag l is computed as
$$\hat\rho_{ij}(l) = c_{ij}(l) / [c_{ii}(0) c_{jj}(0)]^{1/2}, \quad i, j = 1, \ldots, k$$
The following statements use the CORRY option to compute the sample cross-correlation matrices and their summary indicator plots in terms of +, −, and ·, where + indicates significant positive cross-correlations, − indicates significant negative cross-correlations, and · indicates insignificant cross-correlations.
   proc varmax data=simul1;
      model y1 y2 / p=1 noint lagmax=3 print=(corry)
                    printform=univariate;
   run;
Figure 30.39 shows the sample cross-correlation matrices of $y_{1t}$ and $y_{2t}$. As shown, the sample autocorrelation functions for each variable decay quickly, but are significant with respect to two standard errors.
Figure 30.39 Cross-Correlations (CORRY Option)
The VARMAX Procedure

Cross Correlations of Dependent Series by Variable
Variable   Lag          y1          y2
y1           0     1.00000     0.67041
             1     0.83143     0.84330
             2     0.56094     0.81972
             3     0.26629     0.66154
y2           0     0.67041     1.00000
             1     0.29707     0.77132
             2    -0.00936     0.48658
             3    -0.22058     0.22014

Significance indicators at lag 3: y1 ++, y2 -+
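The sample formulas can be written out directly. The following Python sketch computes $\hat\rho_{ij}(l)$ from the definitions of $C(l)$ and $\hat\rho_{ij}(l)$ given above; the function name and the two short series are hypothetical toy data, not the simul1 data set.

```python
import math


def cross_correlations(series, max_lag):
    """Sample cross-correlation matrices rho[l][i][j] for l = 0..max_lag.

    series : list of k equal-length lists (one per variable)
    Uses c_ij(l) = (1/T) * sum_{t=1}^{T-l} ytilde_{i,t} * ytilde_{j,t+l}.
    """
    k = len(series)
    t_obs = len(series[0])
    centered = [[v - sum(s) / t_obs for v in s] for s in series]

    def c(lag, i, j):
        return sum(centered[i][t] * centered[j][t + lag]
                   for t in range(t_obs - lag)) / t_obs

    scale = [math.sqrt(c(0, i, i)) for i in range(k)]
    return [[[c(lag, i, j) / (scale[i] * scale[j]) for j in range(k)]
             for i in range(k)] for lag in range(max_lag + 1)]


y1 = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical toy series
y2 = [2.0, 1.0, 4.0, 3.0, 6.0]   # hypothetical toy series
rho = cross_correlations([y1, y2], max_lag=2)
print(rho[1])
```

Note that the divisor is T at every lag, not T − l, which matches the definition of $C(l)$ above.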
The partial autoregression matrix of lag m, $\Phi_{mm}$, is obtained as the solution for $\Phi_{mm}$ of the Yule-Walker equations of order m,
$$\Gamma(l) = \sum_{i=1}^{m} \Gamma(l-i) \Phi'_{im}, \quad l = 1, 2, \ldots, m$$
The sequence of the partial autoregression matrices $\Phi_{mm}$ of order m has the characteristic property that if the process follows the AR(p), then $\Phi_{pp} = \Phi_p$ and $\Phi_{mm} = 0$ for $m > p$. Hence, the matrices $\Phi_{mm}$ have the cutoff property for a VAR(p) model, and so they can be useful in the identification of the order of a pure VAR model.
The following statements use the PARCOEF option to compute the partial autoregression matrices:
   proc varmax data=simul1;
      model y1 y2 / p=1 noint lagmax=3
                    printform=univariate
                    print=(corry parcoef pcorr pcancorr roots);
   run;
Figure 30.40 shows that the model can be obtained by an AR order $m = 1$ since partial autoregression matrices are insignificant after lag 1 with respect to two standard errors. The matrix for lag 1 is the same as the Yule-Walker autoregressive matrix.
Figure 30.40 Partial Autoregression Matrices (PARCOEF Option)
The VARMAX Procedure

Partial Autoregression
Lag   Variable         y1         y2
1     y1          1.14844   -0.50954
      y2          0.54985    0.37409
2     y1         -0.00724    0.05138
      y2          0.02409    0.05909
3     y1         -0.02578    0.03885
      y2         -0.03720    0.10149
Define the forward autoregression
$$y_t = \sum_{i=1}^{m-1} \Phi_{i,m-1} y_{t-i} + u_{m,t}$$
and the backward autoregression
$$y_{t-m} = \sum_{i=1}^{m-1} \Phi^*_{i,m-1} y_{t-m+i} + u^*_{m,t-m}$$
The matrices $P(m)$ defined by Ansley and Newbold (1979) are given by
$$P(m) = \Sigma^{*1/2}_{m-1} \Phi'_{mm} \Sigma^{-1/2}_{m-1}$$
where
$$\Sigma_{m-1} = \mathrm{Cov}(u_{m,t}) = \Gamma(0) - \sum_{i=1}^{m-1} \Gamma(-i) \Phi'_{i,m-1}$$
and
$$\Sigma^*_{m-1} = \mathrm{Cov}(u^*_{m,t-m}) = \Gamma(0) - \sum_{i=1}^{m-1} \Gamma(m-i) \Phi^{*\prime}_{m-i,m-1}$$
$P(m)$ are the partial cross-correlation matrices at lag m between the elements of $y_t$ and $y_{t-m}$, given $y_{t-1}, \ldots, y_{t-m+1}$. The matrices $P(m)$ have the cutoff property for a VAR(p) model, and so they can be useful in the identification of the order of a pure VAR structure.
The following statements use the PCORR option to compute the partial cross-correlation matrices:
   proc varmax data=simul1;
      model y1 y2 / p=1 noint lagmax=3
                    print=(pcorr)
                    printform=univariate;
   run;
The partial cross-correlation matrices in Figure 30.41 are insignicant after lag 1 with respect to two standard errors. This indicates that an AR order of m D 1 can be an appropriate choice.
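The squared partial canonical correlations $\rho_i^2(m)$ between $y_t$ and $y_{t-m}$, given $y_{t-1}, \ldots, y_{t-m+1}$, are the eigenvalues of the matrix

$$\{\mathrm{Cov}(u_{m,t})\}^{-1} E(u_{m,t}\, u_{m,t-m}^{*\prime}) \{\mathrm{Cov}(u_{m,t-m}^{*})\}^{-1} E(u_{m,t-m}^{*}\, u_{m,t}') = \Phi_{mm}^{*\prime}\, \Phi_{mm}'$$

It follows that the test statistic to test for $\Phi_m = 0$ in the VAR model of order $m > p$ is approximately

$$(T-m)\,\mathrm{tr}\{\Phi_{mm}^{*\prime}\, \Phi_{mm}'\} \approx (T-m) \sum_{i=1}^{k} \rho_i^2(m)$$

and has an asymptotic chi-square distribution with $k^2$ degrees of freedom for $m > p$. The following statements use the PCANCORR option to compute the partial canonical correlations: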
   proc varmax data=simul1;
      model y1 y2 / p=1 noint lagmax=3 print=(pcancorr);
   run;

Figure 30.42 shows that the partial canonical correlations $\rho_i(m)$ between $y_t$ and $y_{t-m}$ are {0.918, 0.773}, {0.092, 0.018}, and {0.109, 0.011} for lags $m = 1$ to 3. After lag $m = 1$, the partial canonical correlations are insignificant with respect to the 0.05 significance level, indicating that an AR order of $m = 1$ can be an appropriate choice.
Figure 30.42 Partial Canonical Correlations (PCANCORR Option)
The VARMAX Procedure

Partial Canonical Correlations

Lag   Correlation1   Correlation2   DF   Chi-Square   Pr > ChiSq
1          0.91783        0.77335    4       142.61       <.0001
2          0.09171        0.01816    4         0.86       0.9307
3          0.10861        0.01078    4         1.16       0.8854
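As a rough check on how the Chi-Square column arises, the statistics can be reproduced from the tabulated correlations by using $(T-m)\sum_i \rho_i^2(m)$. The sample size $T = 100$ is an assumption of this sketch (it is not stated in the table itself); with that value the arithmetic agrees with the figure to two decimals:

```latex
\begin{aligned}
m = 1:&\quad (100-1)\,(0.91783^2 + 0.77335^2) \approx 99 \times 1.44048 \approx 142.61 \\
m = 2:&\quad (100-2)\,(0.09171^2 + 0.01816^2) \approx 98 \times 0.008741 \approx 0.86 \\
m = 3:&\quad (100-3)\,(0.10861^2 + 0.01078^2) \approx 97 \times 0.011912 \approx 1.16
\end{aligned}
```

Each statistic is compared with a chi-square distribution with $k^2 = 4$ degrees of freedom, which matches the DF column.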
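From the selected VAR($p_\epsilon$) model, the first step obtains estimates of the innovation series as the residuals

$$\tilde\epsilon_t = y_t - \sum_{i=1}^{p_\epsilon} \hat\Phi_i^{p_\epsilon}\, y_{t-i} - \hat\delta^{p_\epsilon}, \quad t = p_\epsilon + 1, \ldots, T$$

In the second step, you select the order $(p, q)$ of the VARMA model for $p$ in $(p_{min} : p_{max})$ and $q$ in $(q_{min} : q_{max})$

$$y_t = \delta + \sum_{i=1}^{p} \Phi_i\, y_{t-i} - \sum_{i=1}^{q} \Theta_i\, \tilde\epsilon_{t-i} + \epsilon_t$$

which minimizes a selection criterion like SBC or HQ. The following statements use the MINIC= option to compute a table that contains the information criterion associated with various AR and MA orders: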
   proc varmax data=simul1;
      model y1 y2 / p=1 noint minic=(p=3 q=3);
   run;

Figure 30.43 shows the output associated with the MINIC= option. The criterion takes the smallest value at AR order 1.
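The $p$th-order VAR process is written as

$$y_t - \mu = \sum_{i=1}^{p} \Phi_i (y_{t-i} - \mu) + \epsilon_t \quad \text{or} \quad \Phi(B)(y_t - \mu) = \epsilon_t$$

with $\Phi(B) = I_k - \sum_{i=1}^{p} \Phi_i B^i$.

Equivalently, it can be written as

$$y_t = \delta + \sum_{i=1}^{p} \Phi_i\, y_{t-i} + \epsilon_t \quad \text{or} \quad \Phi(B)\, y_t = \delta + \epsilon_t$$

with $\delta = (I_k - \sum_{i=1}^{p} \Phi_i)\mu$.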
Stationarity
For stationarity, the VAR process must be expressible in the convergent causal infinite MA form as

$$y_t = \mu + \sum_{j=0}^{\infty} \Psi_j\, \epsilon_{t-j}$$

where $\Psi(B) = \Phi(B)^{-1} = \sum_{j=0}^{\infty} \Psi_j B^j$ with $\sum_{j=0}^{\infty} \|\Psi_j\| < \infty$, where $\|A\|$ denotes a norm for the matrix $A$ such as $\|A\|^2 = \mathrm{tr}\{A'A\}$. The matrix $\Psi_j$ can be recursively obtained from the relation $\Phi(B)\Psi(B) = I$; it is

$$\Psi_j = \Phi_1 \Psi_{j-1} + \Phi_2 \Psi_{j-2} + \cdots + \Phi_p \Psi_{j-p}$$

where $\Psi_0 = I_k$ and $\Psi_j = 0$ for $j < 0$.

The stationarity condition is satisfied if all roots of $|\Phi(z)| = 0$ are outside of the unit circle. The stationarity condition is equivalent to the condition in the corresponding VAR(1) representation, $Y_t = \Phi Y_{t-1} + \varepsilon_t$, that all eigenvalues of the $kp \times kp$ companion matrix $\Phi$ be less than one in absolute value, where $Y_t = (y_t', \ldots, y_{t-p+1}')'$, $\varepsilon_t = (\epsilon_t', 0', \ldots, 0')'$, and

$$\Phi = \begin{pmatrix}
\Phi_1 & \Phi_2 & \cdots & \Phi_{p-1} & \Phi_p \\
I_k & 0 & \cdots & 0 & 0 \\
0 & I_k & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & I_k & 0
\end{pmatrix}$$
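The recursion for the MA weights and the eigenvalue condition can be tried out in SAS/IML. The following is a minimal sketch for a VAR(1), where the recursion reduces to $\Psi_j = \Phi_1 \Psi_{j-1}$ and the companion matrix is $\Phi_1$ itself. The coefficient values are borrowed from the VARMASIM call later in this section and serve only as an illustration; they are not estimates.

```sas
proc iml;
   /* Illustrative VAR(1) coefficient matrix (an assumption of this sketch) */
   phi = {1.2 -0.5,
          0.6  0.3};

   /* MA weights: Psi_0 = I_k, Psi_j = Phi_1 * Psi_{j-1} */
   psi = i(2);
   do j = 1 to 4;
      psi = phi * psi;
      print j psi;
   end;

   /* For a VAR(1) the companion matrix is Phi_1.  These eigenvalues are
      complex, so EIGVAL returns real parts in column 1 and imaginary
      parts in column 2. */
   ev = eigval(phi);
   modulus = sqrt(ev[,1]##2 + ev[,2]##2);
   print ev modulus;
quit;
```

Because the two eigenvalues are complex conjugates, their common modulus is the square root of the determinant of `phi`, $\sqrt{0.66} \approx 0.812$, which is below one, so this illustrative process is stationary.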
If the stationarity condition is not satisfied, a nonstationary model (a differenced model or an error correction model) might be more appropriate. The following statements estimate a VAR(1) model and use the ROOTS option to compute the characteristic polynomial roots:

   proc varmax data=simul1;
      model y1 y2 / p=1 noint print=(roots);
   run;

Figure 30.44 shows the output associated with the ROOTS option, which indicates that the series is stationary since the modulus of each eigenvalue is less than one.
Figure 30.44 Stationarity (ROOTS Option)
The VARMAX Procedure

Roots of AR Characteristic Polynomial

Index      Real   Imaginary   Modulus    Radian     Degree
1       0.77238     0.35899    0.8517    0.4351    24.9284
2       0.77238    -0.35899    0.8517   -0.4351   -24.9284
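The Modulus, Radian, and Degree columns follow directly from the real and imaginary parts. The DATA step below is a small sketch that recomputes them for the first root; the variable names are chosen here for illustration only.

```sas
data _null_;
   re = 0.77238;                          /* real part from the ROOTS table      */
   im = 0.35899;                          /* imaginary part from the ROOTS table */
   modulus = sqrt(re**2 + im**2);         /* distance from the origin            */
   radian  = atan2(im, re);               /* argument of the complex root        */
   degree  = radian * 180 / constant('pi');
   put modulus= 8.4 radian= 8.4 degree= 8.4;
run;
```

The values written to the log should agree with the first row of the table up to rounding of the displayed inputs (a modulus of about 0.8517, which is less than one).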
Parameter Estimation
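Consider the stationary VAR($p$) model

$$y_t = \delta + \sum_{i=1}^{p} \Phi_i\, y_{t-i} + \epsilon_t$$

where $y_{-p+1}, \ldots, y_0$ are assumed to be available (for convenience of notation). This can be represented by the general form of the multivariate linear model,

$$Y = XB + E \quad \text{or} \quad y = (X \otimes I_k)\beta + e$$

where

$$\begin{aligned}
Y &= (y_1, \ldots, y_T)' & B &= (\delta, \Phi_1, \ldots, \Phi_p)' \\
X &= (X_0, \ldots, X_{T-1})' & X_t &= (1, y_t', \ldots, y_{t-p+1}')' \\
E &= (\epsilon_1, \ldots, \epsilon_T)' & y &= \mathrm{vec}(Y') \\
\beta &= \mathrm{vec}(B') & e &= \mathrm{vec}(E')
\end{aligned}$$

with vec denoting the column stacking operator.

The conditional least squares estimator of $\beta$ is

$$\hat\beta = ((X'X)^{-1}X' \otimes I_k)\, y$$

and the estimate of $\Sigma$ is

$$\hat\Sigma = (T - (kp+1))^{-1} \sum_{t=1}^{T} \hat\epsilon_t\, \hat\epsilon_t'$$

where $\hat\epsilon_t$ is the residual vector. The LS estimator is consistent and asymptotically normal:

$$\sqrt{T}(\hat\beta - \beta) \xrightarrow{d} N(0, \Gamma_p^{-1} \otimes \Sigma)$$

where $X'X/T$ converges in probability to $\Gamma_p$ and $\xrightarrow{d}$ denotes convergence in distribution.

The (conditional) maximum likelihood estimator in the VAR($p$) model is equal to the (conditional) least squares estimator on the assumption of normality of the error vectors.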
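where $\Sigma_\beta = \Gamma_p^{-1} \otimes \Sigma$ and

$$G_j = \frac{\partial\, \mathrm{vec}(\Psi_j)}{\partial \beta'} = \sum_{i=0}^{j-1} J(\Phi')^{j-1-i} \otimes \Psi_i$$

where $J = [I_k, 0, \ldots, 0]$ is a $k \times kp$ matrix and $\Phi$ is a $kp \times kp$ companion matrix.

where $F_l = \sum_{j=1}^{l} G_j$.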
H D
1D0 k
and
$\sigma = \mathrm{vech}(\Sigma)$.
The variables $y_{1t}$ are said to cause $y_{2t}$, but $y_{2t}$ do not cause $y_{1t}$ if $\Phi_{12}(B) = 0$. The implication of this model structure is that future values of the process $y_{1t}$ are influenced only by its own past and not by the past of $y_{2t}$, where future values of $y_{2t}$ are influenced by the past of both $y_{1t}$ and $y_{2t}$. If the future $y_{1t}$ are not influenced by the past values of $y_{2t}$, then it can be better to model $y_{1t}$ separately from $y_{2t}$.

Consider testing $H_0\colon C\beta = c$, where $C$ is an $s \times (k^2 p + k)$ matrix of rank $s$ and $c$ is an $s$-dimensional vector where $s = k_1 k_2 p$. Assuming that

$$\sqrt{T}(\hat\beta - \beta) \xrightarrow{d} N(0, \Gamma_p^{-1} \otimes \Sigma)$$

you get the Wald statistic

$$T(C\hat\beta - c)'\,[C(\hat\Gamma_p^{-1} \otimes \hat\Sigma)C']^{-1}\,(C\hat\beta - c) \xrightarrow{d} \chi^2(s)$$

For the Granger causality test, the matrix $C$ consists of zeros or ones and $c$ is the zero vector. See Lütkepohl (1993) for more details of the Granger causality test.
VARX Modeling
The vector autoregressive model with exogenous variables is called the VARX($p$,$s$) model. The form of the VARX($p$,$s$) model can be written as

$$y_t = \delta + \sum_{i=1}^{p} \Phi_i\, y_{t-i} + \sum_{i=0}^{s} \Theta_i^{*}\, x_{t-i} + \epsilon_t$$

The parameter estimates can be obtained by representing the general form of the multivariate linear model,

$$Y = XB + E \quad \text{or} \quad y = (X \otimes I_k)\beta + e$$

where

$$\begin{aligned}
Y &= (y_1, \ldots, y_T)' \\
B &= (\delta, \Phi_1, \ldots, \Phi_p, \Theta_0^{*}, \ldots, \Theta_s^{*})' \\
X &= (X_0, \ldots, X_{T-1})' \\
X_t &= (1, y_t', \ldots, y_{t-p+1}', x_{t+1}', \ldots, x_{t-s+1}')'
\end{aligned}$$

The conditional least squares estimator of $\beta$ can be obtained by using the same method as in VAR($p$) modeling. If the multivariate linear model has different independent variables that correspond to dependent variables, the SUR (seemingly unrelated regression) method is used to improve the regression estimates.

The following example fits the ordinary regression model:
   proc varmax data=one;
      model y1-y3 = x1-x5;
   run;

The following example fits the ordinary regression model with different regressors:

   proc varmax data=one;
      model y1 = x1-x3, y2 = x2 x3;
   run;

From the output in Figure 30.20 in the section "Getting Started: VARMAX Procedure" on page 1858, you can see that the parameters XL0_1_2, XL0_2_1, XL0_3_1, and XL0_3_2, which are associated with the exogenous variables, are not significant. The following example fits the VARX(1,0) model with different regressors:

   proc varmax data=grunfeld;
      model y1 = x1, y2 = x2, y3 / p=1 print=(estimates);
   run;

As you can see in Figure 30.45, the symbol "_" in the elements of the matrix corresponds to endogenous variables that do not take the denoted exogenous variables.
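With a normal prior $\beta \sim N(\beta^{*}, V_\beta)$, the prior density is

$$f(\beta) = \left(\frac{1}{2\pi}\right)^{k^2 p/2} |V_\beta|^{-1/2} \exp\left[-\frac{1}{2}(\beta - \beta^{*})'\, V_\beta^{-1} (\beta - \beta^{*})\right]$$

The likelihood function for the Gaussian process becomes

$$\ell(\beta \mid y) = \left(\frac{1}{2\pi}\right)^{kT/2} |I_T \otimes \Sigma|^{-1/2} \exp\left[-\frac{1}{2}(y - (X \otimes I_k)\beta)'(I_T \otimes \Sigma^{-1})(y - (X \otimes I_k)\beta)\right]$$

The posterior mean is

$$\bar\beta = [V_\beta^{-1} + (X'X \otimes \Sigma^{-1})]^{-1}\,[V_\beta^{-1}\beta^{*} + (X' \otimes \Sigma^{-1})\, y]$$

and the posterior covariance matrix is

$$\bar\Sigma_\beta = [V_\beta^{-1} + (X'X \otimes \Sigma^{-1})]^{-1}$$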
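In practice, the prior mean $\beta^{*}$ and the prior variance $V_\beta$ need to be specified. If all the parameters are considered to shrink toward zero, the null prior mean should be specified. According to Litterman (1986), the prior variance can be given by

$$v_{ij}(l) = \begin{cases}
(\lambda / l)^2 & \text{if } i = j \\
(\lambda \theta \sigma_{ii} / (l\, \sigma_{jj}))^2 & \text{if } i \ne j
\end{cases}$$

where $v_{ij}(l)$ is the prior variance of the $(i,j)$th element of $\Phi_l$, $\lambda$ is the prior standard deviation of the diagonal elements of $\Phi_l$, $\theta$ is a constant in the interval $(0,1)$, and $\sigma_{ii}^2$ is the $i$th diagonal element of $\Sigma$. The deterministic terms have diffused prior variance. In practice, you replace the $\sigma_{ii}^2$ by the diagonal element of the ML estimator of $\Sigma$ in the nonconstrained model.

For example, for a bivariate BVAR(2) model,

$$\begin{aligned}
y_{1t} &= 0 + \phi_{1,11}\, y_{1,t-1} + \phi_{1,12}\, y_{2,t-1} + \phi_{2,11}\, y_{1,t-2} + \phi_{2,12}\, y_{2,t-2} + \epsilon_{1t} \\
y_{2t} &= 0 + \phi_{1,21}\, y_{1,t-1} + \phi_{1,22}\, y_{2,t-1} + \phi_{2,21}\, y_{1,t-2} + \phi_{2,22}\, y_{2,t-2} + \epsilon_{2t}
\end{aligned}$$

with the prior covariance matrix

$$V_\beta = \mathrm{Diag}\Big(\lambda^2,\ (\lambda\theta\sigma_{11}/\sigma_{22})^2,\ (\lambda/2)^2,\ (\lambda\theta\sigma_{11}/2\sigma_{22})^2,\ (\lambda\theta\sigma_{22}/\sigma_{11})^2,\ \lambda^2,\ (\lambda\theta\sigma_{22}/2\sigma_{11})^2,\ (\lambda/2)^2\Big)$$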
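To give a feel for how strongly the Litterman prior shrinks off-diagonal and higher-lag coefficients, the following worked example evaluates $v_{ij}(l)$ for a bivariate model. The values $\lambda = 0.9$ and $\theta = 0.1$ match the PRIOR= example that appears later in this section; $\sigma_{11} = 1$ and $\sigma_{22} = 2$ are hypothetical values chosen only for this illustration:

```latex
\begin{aligned}
v_{11}(1) &= (0.9/1)^2 = 0.81 \\
v_{12}(1) &= \left(\frac{0.9 \times 0.1 \times 1}{1 \times 2}\right)^2 = 0.045^2 = 0.002025 \\
v_{21}(1) &= \left(\frac{0.9 \times 0.1 \times 2}{1 \times 1}\right)^2 = 0.18^2 = 0.0324 \\
v_{11}(2) &= (0.9/2)^2 = 0.2025 \\
v_{12}(2) &= \left(\frac{0.9 \times 0.1 \times 1}{2 \times 2}\right)^2 = 0.0225^2 = 0.00050625
\end{aligned}
```

Under these assumed values, the prior is much tighter around zero for cross-variable coefficients than for own-lag coefficients, and it tightens further as the lag increases.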
For the Bayesian estimation of integrated systems, the prior mean is set to the first lag of each variable equal to one in its own equation and all other coefficients at zero. For example, for a bivariate BVAR(2) model,

$$\begin{aligned}
y_{1t} &= 0 + 1\, y_{1,t-1} + 0\, y_{2,t-1} + 0\, y_{1,t-2} + 0\, y_{2,t-2} + \epsilon_{1t} \\
y_{2t} &= 0 + 0\, y_{1,t-1} + 1\, y_{2,t-1} + 0\, y_{1,t-2} + 0\, y_{2,t-2} + \epsilon_{2t}
\end{aligned}$$
$$\mathrm{MSE} = \frac{1}{T} \sum_{t=1}^{T} (A_t - F_t^s)^2$$

where $A_t$ is the actual value at time $t$ and $F_t^s$ is the forecast made $s$ periods earlier.
$$y_t = \delta + \sum_{i=1}^{p} \Phi_i\, y_{t-i} + \sum_{i=0}^{s} \Theta_i^{*}\, x_{t-i} + \epsilon_t$$

The parameter estimates can be obtained by representing the general form of the multivariate linear model,

$$y = (X \otimes I_k)\beta + e$$

The prior means for the AR coefficients are the same as those specified in BVAR($p$). The prior means for the exogenous coefficients are set to zero.

Some examples of the Bayesian VARX model are as follows:

   model y1 y2 = x1 / p=1 xlag=1 prior;

   model y1 y2 = x1 / p=(1 3) xlag=1 nocurrentx
                      prior=(lambda=0.9 theta=0.1);
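$$y_t = \delta + \sum_{i=1}^{p} \Phi_i\, y_{t-i} + \epsilon_t - \sum_{i=1}^{q} \Theta_i\, \epsilon_{t-i}$$

or

$$\Phi(B)\, y_t = \delta + \Theta(B)\, \epsilon_t$$

where $\Phi(B) = I_k - \sum_{i=1}^{p} \Phi_i B^i$ and $\Theta(B) = I_k - \sum_{i=1}^{q} \Theta_i B^i$.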
Parameter Estimation
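Under the assumption of normality of the $\epsilon_t$ with mean vector zero and nonsingular covariance matrix $\Sigma$, consider the conditional (approximate) log-likelihood function of a VARMA($p$,$q$) model with mean zero.

Define $Y = (y_1, \ldots, y_T)'$ and $E = (\epsilon_1, \ldots, \epsilon_T)'$ with $B^i Y = (y_{1-i}, \ldots, y_{T-i})'$ and $B^i E = (\epsilon_{1-i}, \ldots, \epsilon_{T-i})'$; define $y = \mathrm{vec}(Y')$ and $e = \mathrm{vec}(E')$. Then

$$y - \sum_{i=1}^{p} (I_T \otimes \Phi_i) B^i y = e - \sum_{i=1}^{q} (I_T \otimes \Theta_i) B^i e$$

where $B^i y = \mathrm{vec}[(B^i Y)']$ and $B^i e = \mathrm{vec}[(B^i E)']$.

Then, the conditional (approximate) log-likelihood function can be written as follows (Reinsel 1997):

$$\begin{aligned}
\ell &= -\frac{T}{2} \log|\Sigma| - \frac{1}{2} \sum_{t=1}^{T} \epsilon_t'\, \Sigma^{-1} \epsilon_t \\
&= -\frac{T}{2} \log|\Sigma| - \frac{1}{2}\, w'\, \Theta'^{-1} (I_T \otimes \Sigma^{-1})\, \Theta^{-1} w
\end{aligned}$$

where $w = y - \sum_{i=1}^{p} (I_T \otimes \Phi_i) B^i y$, and $\Theta$ is such that $e - \sum_{i=1}^{q} (I_T \otimes \Theta_i) B^i e = \Theta e$.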
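For the exact log-likelihood function of a VARMA($p$,$q$) model, the Kalman filtering method is used, transforming the VARMA process into the state-space form (Reinsel 1997).

The state-space form of the VARMA($p$,$q$) model consists of a state equation

$$z_t = F z_{t-1} + G \epsilon_t$$

and an observation equation

$$y_t = H z_t$$

where for $v = \max(p, q+1)$

$$z_t = (y_t', y_{t+1|t}', \ldots, y_{t+v-1|t}')'$$

$$F = \begin{pmatrix}
0 & I_k & 0 & \cdots & 0 \\
0 & 0 & I_k & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\Phi_v & \Phi_{v-1} & \Phi_{v-2} & \cdots & \Phi_1
\end{pmatrix}, \quad
G = \begin{pmatrix} I_k \\ \Psi_1 \\ \vdots \\ \Psi_{v-1} \end{pmatrix}$$

and

$$H = [I_k, 0, \ldots, 0]$$

The Kalman filtering approach is used for evaluation of the likelihood function. The updating equation is

$$\hat z_{t|t} = \hat z_{t|t-1} + K_t\, \epsilon_{t|t-1}$$

with

$$K_t = P_{t|t-1} H' [H P_{t|t-1} H']^{-1}$$

and the prediction equation is

$$\hat z_{t|t-1} = F \hat z_{t-1|t-1}, \quad P_{t|t-1} = F P_{t-1|t-1} F' + G \Sigma G'$$

with $P_{t|t} = [I - K_t H] P_{t|t-1}$ for $t = 1, 2, \ldots, n$.

The log-likelihood function can be expressed as

$$\ell = -\frac{1}{2} \sum_{t=1}^{T} \left[\log|\Sigma_{t|t-1}| + (y_t - \hat y_{t|t-1})'\, \Sigma_{t|t-1}^{-1} (y_t - \hat y_{t|t-1})\right]$$

where $\hat y_{t|t-1}$ and $\Sigma_{t|t-1}$ are determined recursively from the Kalman filter procedure. To construct the likelihood function from Kalman filtering, you obtain $\hat y_{t|t-1} = H \hat z_{t|t-1}$, $\hat\epsilon_{t|t-1} = y_t - \hat y_{t|t-1}$, and $\Sigma_{t|t-1} = H P_{t|t-1} H'$.

Define the vector $\beta$

$$\beta = (\phi_1', \ldots, \phi_p', \theta_1', \ldots, \theta_q', \mathrm{vech}(\Sigma))'$$

where $\phi_i = \mathrm{vec}(\Phi_i)$ and $\theta_i = \mathrm{vec}(\Theta_i)$.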
The log-likelihood equations are solved by iterative numerical procedures such as the quasi-Newton optimization. The starting values for the AR and MA parameters are obtained from the least squares estimates.
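The asymptotic distribution of the impulse response function for a VARMA($p$,$q$) model is

$$\sqrt{T}\, \mathrm{vec}(\hat\Psi_j - \Psi_j) \xrightarrow{d} N(0, G_j \Sigma_\beta G_j'), \quad j = 1, 2, \ldots$$

where $\Sigma_\beta$ is the covariance matrix of the parameter estimates and

$$G_j = \frac{\partial\, \mathrm{vec}(\Psi_j)}{\partial \beta'} = \sum_{i=0}^{j-1} H'(A')^{j-1-i} \otimes J A^i J'$$

where $H = [I_k, 0, \ldots, 0, I_k, 0, \ldots, 0]'$ is a $k(p+q) \times k$ matrix with the second $I_k$ following after $p$ block matrices; $J = [I_k, 0, \ldots, 0]$ is a $k \times k(p+q)$ matrix; $A$ is a $k(p+q) \times k(p+q)$ matrix,

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$

where

$$A_{11} = \begin{pmatrix}
\Phi_1 & \Phi_2 & \cdots & \Phi_{p-1} & \Phi_p \\
I_k & 0 & \cdots & 0 & 0 \\
0 & I_k & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & I_k & 0
\end{pmatrix}, \quad
A_{12} = \begin{pmatrix}
-\Theta_1 & \cdots & -\Theta_{q-1} & -\Theta_q \\
0 & \cdots & 0 & 0 \\
0 & \cdots & 0 & 0 \\
\vdots & \ddots & \vdots & \vdots \\
0 & \cdots & 0 & 0
\end{pmatrix}$$

$A_{21}$ is a $kq \times kp$ zero matrix, and

$$A_{22} = \begin{pmatrix}
0 & 0 & \cdots & 0 & 0 \\
I_k & 0 & \cdots & 0 & 0 \\
0 & I_k & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & I_k & 0
\end{pmatrix}$$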
Consider a VARMA(1,1) model with mean zero,

$$y_t = \Phi_1\, y_{t-1} + \epsilon_t - \Theta_1\, \epsilon_{t-1}$$

where $\epsilon_t$ is the white noise process with a mean zero vector and the positive-definite covariance matrix $\Sigma$.

The following IML procedure statements simulate a bivariate vector time series from this model to provide test data for the VARMAX procedure:

   proc iml;
      sig = {1.0 0.5, 0.5 1.25};
      phi = {1.2 -0.5, 0.6 0.3};
      theta = {0.5 -0.2, 0.1 0.3};
      /* to simulate the vector time series */
      call varmasim(y,phi,theta) sigma=sig n=100 seed=34657;
      cn = {'y1' 'y2'};
      create simul3 from y[colname=cn];
      append from y;
   run;

The following statements fit a VARMA(1,1) model to the simulated data. You specify the order of the autoregressive model with the P= option and the order of the moving-average model with the Q= option. You specify the quasi-Newton optimization in the NLOPTIONS statement as an optimization method.

   proc varmax data=simul3;
      nloptions tech=qn;
      model y1 y2 / p=1 q=1 noint print=(estimates);
   run;
Figure 30.46 shows the initial values of parameters. The initial values were estimated using the least squares method.
Figure 30.46 Start Parameter Estimates for the VARMA(1, 1) Model
The VARMAX Procedure

Optimization Start
Parameter Estimates

Gradient Objective Function
     -0.987110
      0.163904
      0.826824
     -9.605845
     -1.929771
      2.408518
     -0.632995
      0.888222
Figure 30.47 shows the default option settings for the quasi-Newton optimization technique.
Iter   Restarts   Func Calls   Act Con   Objective Function   Obj Fun Change   Max Abs Gradient Element   Step Size
   1          0           38         0            121.86537           0.1020                     6.3068     0.00376
   2          0           57         0            121.76369           0.1017                     6.9381      0.0100
   3          0           76         0            121.55605           0.2076                     5.1526      0.0100
   4          0           95         0            121.36386           0.1922                     5.7292      0.0100
   5          0          114         0            121.25741           0.1064                     5.5107      0.0100
   6          0          133         0            121.22578           0.0316                     4.9213      0.0240
   7          0          152         0            121.20582           0.0200                     6.0114      0.0316
   8          0          171         0            121.18747           0.0183                     5.5324      0.0457
   9          0          207         0            121.13613           0.0513                     0.5829       0.628
  10          0          226         0            121.13536         0.000766                     0.2054       1.000
  11          0          245         0            121.13528         0.000082                     0.0534       1.000
  12          0          264         0            121.13528         2.537E-6                     0.0101       1.000
  13          0          283         0            121.13528         1.475E-7                   0.000270       1.000
Figure 30.50 shows the AR coefficient matrix in terms of lag 1, the MA coefficient matrix in terms of lag 1, the parameter estimates, and their significance, which is one indication of how well the model fits the data.

Schematic Representation

Variable/Lag   AR1   MA1
y1             +-    +.
y2             ++    .+

Model Parameter Estimates

Equation   Parameter   Standard Error   t Value   Pr > |t|   Variable
y1         AR1_1_1            0.10256      9.93     0.0001   y1(t-1)
           AR1_1_2            0.09643     -4.01     0.0001   y2(t-1)
           MA1_1_1            0.14529      2.22     0.0285   e1(t-1)
           MA1_1_2            0.14199     -0.15     0.8798   e2(t-1)
y2         AR1_2_1            0.10062      3.89     0.0002   y1(t-1)
           AR1_2_2            0.08421      6.57     0.0001   y2(t-1)
           MA1_2_1            0.15699     -1.06     0.2939   e1(t-1)
           MA1_2_2            0.14114      4.15     0.0001   e2(t-1)

The fitted VARMA(1,1) model with estimated standard errors in parentheses is given as

$$y_t = \begin{pmatrix}
1.01809 & -0.38651 \\
(0.10256) & (0.09644) \\
0.39147 & 0.55290 \\
(0.10062) & (0.08421)
\end{pmatrix} y_{t-1} + \epsilon_t - \hat\Theta_1\, \epsilon_{t-1}$$

where $\hat\Theta_1$ holds the MA1 estimates.
VARMAX Modeling
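A VARMAX($p$, $q$, $s$) process is written as

$$y_t = \delta + \sum_{i=1}^{p} \Phi_i\, y_{t-i} + \sum_{i=0}^{s} \Theta_i^{*}\, x_{t-i} + \epsilon_t - \sum_{i=1}^{q} \Theta_i\, \epsilon_{t-i}$$

or

$$\Phi(B)\, y_t = \delta + \Theta^{*}(B)\, x_t + \Theta(B)\, \epsilon_t$$

where $\Phi(B) = I_k - \sum_{i=1}^{p} \Phi_i B^i$, $\Theta^{*}(B) = \Theta_0^{*} + \Theta_1^{*} B + \cdots + \Theta_s^{*} B^s$, and $\Theta(B) = I_k - \sum_{i=1}^{q} \Theta_i B^i$.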
The dimension of the state-space vector of the Kalman filtering method for the parameter estimation of the VARMAX($p$,$q$,$s$) model is large, which takes time and memory for computing. For convenience, the parameter estimation of the VARMAX($p$,$q$,$s$) model uses the two-stage estimation method, which first estimates the deterministic terms and exogenous parameters, and then maximizes the log-likelihood function of a VARMA($p$,$q$) model.

Some examples of VARMAX modeling are as follows:

   model y1 y2 = x1 / q=1;
   nloptions tech=qn;

   model y1 y2 = x1 / p=1 q=1 xlag=1 nocurrentx;
   nloptions tech=qn;
(AICC), the final prediction error criterion (FPE), the Hannan-Quinn criterion (HQC), and the Schwarz Bayesian criterion (SBC, also referred to as BIC):

$$\begin{aligned}
\mathrm{AIC} &= \log(|\tilde\Sigma|) + 2r/T \\
\mathrm{AICC} &= \log(|\tilde\Sigma|) + 2r/(T - r/k) \\
\mathrm{FPE} &= \left(\frac{T + r/k}{T - r/k}\right)^{k} |\tilde\Sigma| \\
\mathrm{HQC} &= \log(|\tilde\Sigma|) + 2r \log(\log(T))/T \\
\mathrm{SBC} &= \log(|\tilde\Sigma|) + r \log(T)/T
\end{aligned}$$

where $r$ denotes the number of parameters estimated, $k$ is the number of dependent variables, $T$ is the number of observations used to estimate the model, and $\tilde\Sigma$ is the maximum likelihood estimate of $\Sigma$. When comparing models, choose the model with the smallest criterion values.

An example of the output was displayed in Figure 30.4.

Portmanteau $Q_s$ statistic

The Portmanteau $Q_s$ statistic is used to test whether correlation remains on the model residuals. The null hypothesis is that the residuals are uncorrelated. Let $C_\epsilon(l)$ be the residual cross-covariance matrices and $\hat\rho_\epsilon(l)$ be the residual cross-correlation matrices as

$$C_\epsilon(l) = T^{-1} \sum_{t=1}^{T-l} \epsilon_t\, \epsilon_{t+l}'$$

and

$$\hat\rho_\epsilon(l) = \hat V_\epsilon^{-1/2}\, C_\epsilon(l)\, \hat V_\epsilon^{-1/2} \quad \text{and} \quad \hat\rho_\epsilon(-l) = \hat\rho_\epsilon(l)'$$

where $\hat V_\epsilon = \mathrm{Diag}(\hat\sigma_{11}^2, \ldots, \hat\sigma_{kk}^2)$ and $\hat\sigma_{ii}^2$ are the diagonal elements of $\hat\Sigma$. The multivariate portmanteau test defined in Hosking (1980) is

$$Q_s = T^2 \sum_{l=1}^{s} (T-l)^{-1}\, \mathrm{tr}\{\hat\rho_\epsilon(l)\, \hat\rho_\epsilon(0)^{-1}\, \hat\rho_\epsilon(-l)\, \hat\rho_\epsilon(0)^{-1}\}$$

The statistic $Q_s$ has approximately the chi-square distribution with $k^2(s - p - q)$ degrees of freedom. An example of the output is displayed in Figure 30.7.
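The information criteria listed earlier in this section are simple functions of $\log|\tilde\Sigma|$, $r$, $k$, and $T$. The DATA step below is a small sketch that evaluates them; all four input values are hypothetical and are used only to show the arithmetic.

```sas
data _null_;
   /* Hypothetical inputs, chosen only for illustration */
   k        = 2;      /* number of dependent variables           */
   T        = 100;    /* number of observations used             */
   r        = 4;      /* number of estimated parameters          */
   detSigma = 1.2;    /* determinant of the ML estimate of Sigma */

   logdet = log(detSigma);
   aic  = logdet + 2*r/T;
   aicc = logdet + 2*r/(T - r/k);
   fpe  = ((T + r/k)/(T - r/k))**k * detSigma;
   hqc  = logdet + 2*r*log(log(T))/T;
   sbc  = logdet + r*log(T)/T;
   put aic= aicc= fpe= hqc= sbc=;
run;
```

Running the same step for each candidate model and comparing the values gives the "smallest criterion" comparison described above.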
Jarque-Bera normality test: This test is helpful in determining whether the model residuals represent a white noise process. This tests the null hypothesis that the residuals have normality.

F tests for autoregressive conditional heteroscedastic (ARCH) disturbances: F test statistics test for the heteroscedastic disturbances in the residuals. This tests the null hypothesis that the residuals have equal covariances.

F tests for AR disturbance: These test statistics are computed from the residuals of the univariate AR(1), AR(1,2), AR(1,2,3) and AR(1,2,3,4) models to test the null hypothesis that the residuals are uncorrelated.
Cointegration
This section briefly introduces the concepts of cointegration (Johansen 1995b).

Definition 1. (Engle and Granger 1987): If a series $y_t$ with no deterministic components can be represented by a stationary and invertible ARMA process after differencing $d$ times, the series is integrated of order $d$, that is, $y_t \sim I(d)$.

Definition 2. (Engle and Granger 1987): If all elements of the vector $y_t$ are $I(d)$ and there exists a cointegrating vector $\beta \ne 0$ such that $\beta' y_t \sim I(d-b)$ for any $b > 0$, the vector process is said to be cointegrated $CI(d,b)$.

A simple example of a cointegrated process is the following bivariate system:

$$\begin{aligned}
y_{1t} &= \gamma\, y_{2t} + \epsilon_{1t} \\
y_{2t} &= y_{2,t-1} + \epsilon_{2t}
\end{aligned}$$

with $\epsilon_{1t}$ and $\epsilon_{2t}$ being uncorrelated white noise processes. In the second equation, $y_{2t}$ is a random walk, $\Delta y_{2t} = \epsilon_{2t}$, $\Delta \equiv 1 - B$. Differencing the first equation results in

$$\Delta y_{1t} = \gamma\, \Delta y_{2t} + \Delta \epsilon_{1t} = \gamma\, \epsilon_{2t} + \epsilon_{1t} - \epsilon_{1,t-1}$$

Thus, both $y_{1t}$ and $y_{2t}$ are $I(1)$ processes, but the linear combination $y_{1t} - \gamma y_{2t}$ is stationary. Hence $y_t = (y_{1t}, y_{2t})'$ is cointegrated with a cointegrating vector $\beta = (1, -\gamma)'$.

In general, if the vector process $y_t$ has $k$ components, then there can be more than one cointegrating vector $\beta$. It is assumed that there are $r$ linearly independent cointegrating vectors with $r < k$, which make the $k \times r$ matrix $\beta$. The rank of matrix $\beta$ is $r$, which is called the cointegration rank of $y_t$.
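The bivariate system above is easy to simulate, which can help when experimenting with the cointegration tests. The following DATA step is a minimal sketch; the data set name, the seed, the series length, and the coefficient value of 2 are all assumptions made for illustration.

```sas
data coint_demo;
   call streaminit(12345);          /* arbitrary seed for reproducibility     */
   gamma = 2;                       /* hypothetical long-run coefficient      */
   y2 = 0;                          /* starting value of the random walk      */
   do t = 1 to 200;
      e1 = rand('normal');          /* white noise for the first equation     */
      e2 = rand('normal');          /* white noise driving the random walk    */
      y2 = y2 + e2;                 /* y2 is a random walk, hence I(1)        */
      y1 = gamma*y2 + e1;           /* y1 inherits the stochastic trend       */
      z  = y1 - gamma*y2;           /* stationary combination (equals e1)     */
      output;
   end;
   keep t y1 y2 z;
run;
```

Plotting `y1` and `y2` against `t` should show two wandering series that move together, while `z` fluctuates around zero.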
Common Trends
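This section briefly discusses the implication of cointegration for the moving-average representation. Let $y_t$ be cointegrated $CI(1,1)$; then $\Delta y_t$ has the Wold representation:

$$\Delta y_t = \delta + \Psi(B)\epsilon_t$$

where $\epsilon_t$ is $iid(0, \Sigma)$, $\Psi(B) = \sum_{j=0}^{\infty} \Psi_j B^j$ with $\Psi_0 = I_k$, and $\sum_{j=0}^{\infty} j|\Psi_j| < \infty$.

Assume that $\epsilon_t = 0$ if $t \le 0$ and $y_0$ is a nonrandom initial value. Then the difference equation implies that

$$y_t = y_0 + \delta t + \Psi(1) \sum_{i=0}^{t} \epsilon_i + \Psi^{*}(B)\epsilon_t$$

where $\Psi^{*}(B) = (1-B)^{-1}(\Psi(B) - \Psi(1))$ and $\Psi^{*}(B)$ is absolutely summable.

Assume that the rank of $\Psi(1)$ is $m = k - r$. When the process $y_t$ is cointegrated, there is a cointegrating $k \times r$ matrix $\beta$ such that $\beta' y_t$ is stationary.

Premultiplying $y_t$ by $\beta'$ results in

$$\beta' y_t = \beta' y_0 + \beta' \Psi^{*}(B)\epsilon_t$$

because $\beta' \Psi(1) = 0$ and $\beta' \delta = 0$.

Stock and Watson (1988) showed that the cointegrated process $y_t$ has a common trends representation derived from the moving-average representation. Since the rank of $\Psi(1)$ is $m = k - r$, there is a $k \times r$ matrix $H_1$ with rank $r$ such that $\Psi(1) H_1 = 0$. Let $H_2$ be a $k \times m$ matrix with rank $m$ such that $H_2' H_1 = 0$; then $A = \Psi(1) H_2$ has rank $m$. The matrix $H = (H_1, H_2)$ has rank $k$. By construction of $H$,

$$\Psi(1) H = [0, A] = A S_m$$

where $S_m = (0_{m \times r}, I_m)$. Since $\beta' \Psi(1) = 0$ and $\beta' \delta = 0$, $\delta$ lies in the column space of $\Psi(1)$ and can be written

$$\delta = \Psi(1)\tilde\delta$$

where $\tilde\delta$ is a $k$-dimensional vector. The common trends representation is written as

$$\begin{aligned}
y_t &= y_0 + \Psi(1)\Big[\tilde\delta t + \sum_{i=0}^{t} \epsilon_i\Big] + \Psi^{*}(B)\epsilon_t \\
&= y_0 + \Psi(1) H \Big[H^{-1}\tilde\delta t + H^{-1} \sum_{i=0}^{t} \epsilon_i\Big] + a_t \\
&= y_0 + A \tau_t + a_t
\end{aligned}$$

and

$$\tau_t = \pi + \tau_{t-1} + v_t$$

where $a_t = \Psi^{*}(B)\epsilon_t$, $\pi = S_m H^{-1} \tilde\delta$, $\tau_t = S_m [H^{-1}\tilde\delta t + H^{-1} \sum_{i=0}^{t} \epsilon_i]$, and $v_t = S_m H^{-1} \epsilon_t$.

Stock and Watson showed that the common trends representation expresses $y_t$ as a linear combination of $m$ random walks ($\tau_t$) with drift $\pi$ plus $I(0)$ components ($a_t$).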
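Let $\beta_\perp$ be a $k \times m$ matrix orthogonal to the cointegrating matrix $\beta$, and let $z_t = \beta' y_t$ and $w_t = \beta_\perp' y_t$. Then

$$\begin{pmatrix} z_t \\ w_t \end{pmatrix} =
\begin{pmatrix} \beta' y_0 \\ \beta_\perp' y_0 \end{pmatrix} +
\begin{pmatrix} 0 \\ \beta_\perp' \delta \end{pmatrix} t +
\begin{pmatrix} 0 \\ \beta_\perp' \Psi(1) \end{pmatrix} \sum_{i=1}^{t} \epsilon_i +
\begin{pmatrix} \beta' \Psi^{*}(B) \\ \beta_\perp' \Psi^{*}(B) \end{pmatrix} \epsilon_t$$

The Stock-Watson common trends test is performed based on the component $w_t$ by testing whether $\beta_\perp' \Psi(1)$ has rank $m$ against rank $s$.

The following statements perform the Stock-Watson test for common trends:

   proc iml;
      sig = 100*i(2);
      phi = {-0.2 0.1, 0.5 0.2, 0.8 0.7, -0.4 0.6};
      call varmasim(y,phi) sigma=sig n=100 initial=0 seed=45876;
      cn = {'y1' 'y2'};
      create simul2 from y[colname=cn];
      append from y;
   quit;

   data simul2;
      set simul2;
      date = intnx( 'year', '01jan1900'd, _n_-1 );
      format date year4. ;
   run;

   proc varmax data=simul2;
      model y1 y2 / p=2 cointtest=(sw);
   run;

In Figure 30.51, the first column is the null hypothesis that $y_t$ has $m \le k$ common trends; the second column is the alternative hypothesis that $y_t$ has $s < m$ common trends; the third column contains the eigenvalues used for the test statistics; the fourth column contains the test statistics using AR($p$) filtering of the data. The table shows the output of the case $p = 2$.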
Figure 30.51 Common Trends Test (COINTTEST=(SW) Option)
The VARMAX Procedure

Common Trend Test

H0: Rank=m   H1: Rank=s   5% Critical Value   Lag
1            0                       -14.10     2
2            0                        -8.80
2            1                       -23.00

The test statistic for testing for 2 versus 1 common trends is more negative (-35.1) than the critical value (-23.0). Therefore, the test rejects the null hypothesis, which means that the series has a single common trend.
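$$y_t = \sum_{i=1}^{p} \Phi_i\, y_{t-i} + \epsilon_t$$

or

$$\Phi(B)\, y_t = \epsilon_t$$

where the initial values, $y_{-p+1}, \ldots, y_0$, are fixed and $\epsilon_t \sim N(0, \Sigma)$. Since the AR operator $\Phi(B)$ can be re-expressed as $\Phi(B) = \Phi^{*}(B)(1-B) + \Phi(1)B$, where $\Phi^{*}(B) = I_k - \sum_{i=1}^{p-1} \Phi_i^{*} B^i$ with $\Phi_i^{*} = -\sum_{j=i+1}^{p} \Phi_j$, the vector error correction model is

$$\Phi^{*}(B)(1-B)\, y_t = \alpha\beta' y_{t-1} + \epsilon_t$$

or

$$\Delta y_t = \alpha\beta' y_{t-1} + \sum_{i=1}^{p-1} \Phi_i^{*}\, \Delta y_{t-i} + \epsilon_t$$

where $\alpha\beta' = -\Phi(1) = -I_k + \Phi_1 + \Phi_2 + \cdots + \Phi_p$.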
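To make the mapping from VAR coefficients to the error correction form concrete, the following worked example applies it with $p = 2$. The coefficient values are taken from the `phi` matrix passed to VARMASIM when the `simul2` data set is generated; reading that $4 \times 2$ matrix as $\Phi_1$ stacked on top of $\Phi_2$ is an assumption of this sketch:

```latex
\Phi_1 = \begin{pmatrix} -0.2 & 0.1 \\ 0.5 & 0.2 \end{pmatrix}, \qquad
\Phi_2 = \begin{pmatrix} 0.8 & 0.7 \\ -0.4 & 0.6 \end{pmatrix}

\alpha\beta' = -I_2 + \Phi_1 + \Phi_2
  = \begin{pmatrix} -0.4 & 0.8 \\ 0.1 & -0.2 \end{pmatrix}
  = \begin{pmatrix} -0.4 \\ 0.1 \end{pmatrix}
    \begin{pmatrix} 1 & -2 \end{pmatrix}, \qquad
\Phi_1^{*} = -\Phi_2
  = \begin{pmatrix} -0.8 & -0.7 \\ 0.4 & -0.6 \end{pmatrix}
```

Under that reading, $\alpha\beta'$ has rank 1, so the simulated system has a single cointegrating relation with $\beta = (1, -2)'$ after normalizing on the first variable. The long-run coefficient of about $-2.05$ reported for this data set later in the section is in line with that value.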
One motivation for the VECM($p$) form is to consider the relation $\beta' y_t = c$ as defining the underlying economic relations and assume that the agents react to the disequilibrium error $\beta' y_t - c$ through the adjustment coefficient $\alpha$ to restore equilibrium; that is, they satisfy the economic relations. The cointegrating vector $\beta$ is sometimes called the long-run parameters.

You can consider a vector error correction model with a deterministic term. The deterministic term $D_t$ can contain a constant, a linear trend, and seasonal dummy variables. Exogenous variables can also be included in the model.

$$\Delta y_t = \Pi y_{t-1} + \sum_{i=1}^{p-1} \Phi_i^{*}\, \Delta y_{t-i} + A D_t + \sum_{i=0}^{s} \Theta_i^{*}\, x_{t-i} + \epsilon_t$$

where $\Pi = \alpha\beta'$.

The alternative vector error correction representation considers the error correction term at lag $t - p$ and is written as

$$\Delta y_t = \sum_{i=1}^{p-1} \Phi_i^{\sharp}\, \Delta y_{t-i} + \Pi^{\sharp} y_{t-p} + A D_t + \sum_{i=0}^{s} \Theta_i^{*}\, x_{t-i} + \epsilon_t$$

If the matrix $\Pi$ has full rank ($r = k$), all components of $y_t$ are $I(0)$. On the other hand, $y_t$ are stationary in difference if $\mathrm{rank}(\Pi) = 0$. When the rank of the matrix $\Pi$ is $r < k$, there are $k - r$ linear combinations that are nonstationary and $r$ stationary cointegrating relations. Note that the linearly independent vector $z_t = \beta' y_t$ is stationary and this transformation is not unique unless $r = 1$. There does not exist a unique cointegrating matrix $\beta$ since the coefficient matrix $\Pi$ can also be decomposed as

$$\Pi = \alpha M M^{-1} \beta' = \alpha^{*} \beta^{*\prime}$$

where $M$ is an $r \times r$ nonsingular matrix.
Different Specifications of Deterministic Trends

When you construct the VECM($p$) form from the VAR($p$) model, the deterministic terms in the VECM($p$) form can differ from those in the VAR($p$) model. When there are deterministic cointegrated relationships among variables, deterministic terms in the VAR($p$) model are not present in the VECM($p$) form. On the other hand, if there are stochastic cointegrated relationships in the VAR($p$) model, deterministic terms appear in the VECM($p$) form via the error correction term or as an independent term in the VECM($p$) form. There are five different specifications of deterministic trends in the VECM($p$) form.

Case 1: There is no separate drift in the VECM($p$) form.

$$\Delta y_t = \alpha\beta' y_{t-1} + \sum_{i=1}^{p-1} \Phi_i^{*}\, \Delta y_{t-i} + \epsilon_t$$

Case 2: There is no separate drift in the VECM($p$) form, but a constant enters only via the error correction term.

$$\Delta y_t = \alpha(\beta', \beta_0)(y_{t-1}', 1)' + \sum_{i=1}^{p-1} \Phi_i^{*}\, \Delta y_{t-i} + \epsilon_t$$

Case 3: There is a separate drift and no separate linear trend in the VECM($p$) form.

$$\Delta y_t = \alpha\beta' y_{t-1} + \sum_{i=1}^{p-1} \Phi_i^{*}\, \Delta y_{t-i} + \delta_0 + \epsilon_t$$

Case 4: There is a separate drift and no separate linear trend in the VECM($p$) form, but a linear trend enters only via the error correction term.

$$\Delta y_t = \alpha(\beta', \beta_1)(y_{t-1}', t)' + \sum_{i=1}^{p-1} \Phi_i^{*}\, \Delta y_{t-i} + \delta_0 + \epsilon_t$$

Case 5: There is a separate linear trend in the VECM($p$) form.

$$\Delta y_t = \alpha\beta' y_{t-1} + \sum_{i=1}^{p-1} \Phi_i^{*}\, \Delta y_{t-i} + \delta_0 + \delta_1 t + \epsilon_t$$
First, focus on Cases 1, 3, and 5 to test the null hypothesis that there are at most $r$ cointegrating vectors. Let

$$\begin{aligned}
Z_{0t} &= \Delta y_t \\
Z_{1t} &= y_{t-1} \\
Z_{2t} &= [\Delta y_{t-1}', \ldots, \Delta y_{t-p+1}', D_t]' \\
Z_0 &= [Z_{01}, \ldots, Z_{0T}]' \\
Z_1 &= [Z_{11}, \ldots, Z_{1T}]' \\
Z_2 &= [Z_{21}, \ldots, Z_{2T}]'
\end{aligned}$$

where $D_t$ can be empty for Case 1, 1 for Case 3, and $(1, t)$ for Case 5.

In Case 2, $Z_{1t}$ and $Z_{2t}$ are defined as

$$\begin{aligned}
Z_{1t} &= [y_{t-1}', 1]' \\
Z_{2t} &= [\Delta y_{t-1}', \ldots, \Delta y_{t-p+1}']'
\end{aligned}$$

Let $\Psi$ be the matrix of parameters consisting of $\Phi_1^{*}, \ldots, \Phi_{p-1}^{*}$, $A$, and $\Theta_0^{*}, \ldots, \Theta_s^{*}$, where the parameters $A$ correspond to the regressors $D_t$. Then the VECM($p$) form is rewritten in these variables as

$$Z_{0t} = \alpha\beta' Z_{1t} + \Psi Z_{2t} + \epsilon_t$$

The log-likelihood function is given by

$$\ell = -\frac{kT}{2} \log 2\pi - \frac{T}{2} \log|\Sigma| - \frac{1}{2} \sum_{t=1}^{T} (Z_{0t} - \alpha\beta' Z_{1t} - \Psi Z_{2t})'\, \Sigma^{-1} (Z_{0t} - \alpha\beta' Z_{1t} - \Psi Z_{2t})$$

The residuals, $R_{0t}$ and $R_{1t}$, are obtained by regressing $Z_{0t}$ and $Z_{1t}$ on $Z_{2t}$, respectively. The regression equation of residuals is

$$R_{0t} = \alpha\beta' R_{1t} + \hat\epsilon_t$$

The crossproducts matrices are computed

$$S_{ij} = \frac{1}{T} \sum_{t=1}^{T} R_{it} R_{jt}', \quad i, j = 0, 1$$

Then the maximum likelihood estimator for $\beta$ is obtained from the eigenvectors that correspond to the $r$ largest eigenvalues of the following equation:

$$|\lambda S_{11} - S_{10} S_{00}^{-1} S_{01}| = 0$$

The eigenvalues of the preceding equation are squared canonical correlations between $R_{0t}$ and $R_{1t}$, and the eigenvectors that correspond to the $r$ largest eigenvalues are the $r$ linear combinations of $y_{t-1}$, which have the largest squared partial correlations with the stationary process $\Delta y_t$ after correcting for lags and deterministic terms. Such an analysis calls for a reduced rank regression of $\Delta y_t$ on $y_{t-1}$ corrected for $(\Delta y_{t-1}, \ldots, \Delta y_{t-p+1}, D_t)$, as discussed by Anderson (1951). Johansen (1988) suggests two test statistics to test the null hypothesis that there are at most $r$ cointegrating vectors

$$H_0\colon \lambda_i = 0 \quad \text{for } i = r+1, \ldots, k$$
Trace Test

The trace statistic for testing the null hypothesis that there are at most $r$ cointegrating vectors is as follows:

$$\lambda_{trace} = -T \sum_{i=r+1}^{k} \log(1 - \lambda_i)$$

The asymptotic distribution of this statistic is given by

$$\mathrm{tr}\left\{\int_0^1 (dW)\tilde W' \left(\int_0^1 \tilde W \tilde W'\, dr\right)^{-1} \int_0^1 \tilde W (dW)'\right\}$$

where $\mathrm{tr}(A)$ is the trace of a matrix $A$, $W$ is the $k - r$ dimensional Brownian motion, and $\tilde W$ is the Brownian motion itself, or the demeaned or detrended Brownian motion according to the different specifications of deterministic trends in the vector error correction model.

Maximum Eigenvalue Test

The maximum eigenvalue statistic for testing the null hypothesis that there are at most $r$ cointegrating vectors is as follows:

$$\lambda_{max} = -T \log(1 - \lambda_{r+1})$$

The asymptotic distribution of this statistic is given by

$$\max\left\{\int_0^1 (dW)\tilde W' \left(\int_0^1 \tilde W \tilde W'\, dr\right)^{-1} \int_0^1 \tilde W (dW)'\right\}$$

where $\max(A)$ is the maximum eigenvalue of a matrix $A$. Osterwald-Lenum (1992) provided detailed tables of the critical values of these statistics.

The following statements use the JOHANSEN option to compute the Johansen cointegration rank trace test of integrated order 1:

   proc varmax data=simul2;
      model y1 y2 / p=2 cointtest=(johansen=(normalize=y1));
   run;

Figure 30.52 shows the output based on the model specified in the MODEL statement; an intercept term is assumed. In the "Cointegration Rank Test Using Trace" table, the column Drift In ECM means there is no separate drift in the error correction model and the column Drift In Process means the process has a constant drift before differencing. The "Cointegration Rank Test Using Trace" table shows the trace statistics based on Case 3 and the "Cointegration Rank Test Using Trace under Restriction" table shows the trace statistics based on Case 2. The output indicates that the series are cointegrated with rank 1 because, in the rows that test rank 1, the trace statistics are smaller than the critical values in both Case 2 and Case 3.
Cointegration Rank Test Using Trace

H0: Rank=r   H1: Rank>r
0            0
1            1

Cointegration Rank Test Using Trace Under Restriction

H0: Rank=r   H1: Rank>r   5% Critical Value
0            0                        19.99
1            1                         9.13
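Both the trace statistic, $-T\sum_{i=r+1}^{k}\log(1-\lambda_i)$, and the maximum eigenvalue statistic, $-T\log(1-\lambda_{r+1})$, are direct functions of the ordered eigenvalues. The DATA step below is a sketch of that arithmetic for $k = 2$; the eigenvalues and the effective sample size are hypothetical values, not output from the procedure.

```sas
data _null_;
   T = 98;                                     /* hypothetical effective sample size    */
   array lambda[2] _temporary_ (0.46 0.01);    /* hypothetical eigenvalues, largest first */
   do r = 0 to 1;
      trace = 0;
      do i = r + 1 to 2;                       /* sum over the k - r smallest eigenvalues */
         trace = trace - T*log(1 - lambda[i]);
      end;
      lmax = -T*log(1 - lambda[r + 1]);        /* uses only the (r+1)th eigenvalue        */
      put r= trace= lmax=;
   end;
run;
```

For the last hypothesis ($r = k - 1$) the two statistics coincide, because only one eigenvalue remains in the sum. Each value would then be compared with the tabulated critical value for the chosen deterministic trend case.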
Figure 30.53 shows which result, either Case 2 (the hypothesis H0) or Case 3 (the hypothesis H1), is appropriate depending on the significance level. Since the cointegration rank is chosen to be 1 by the result in Figure 30.52, look at the last row that corresponds to rank=1. Since the p-value is 0.054, Case 2 cannot be rejected at the 5% significance level, but it can be rejected at the 10% significance level. For modeling of Case 2 and Case 3, see Figure 30.56 and Figure 30.57.
Figure 30.53 Cointegration Rank Test Continued
Hypothesis of the Restriction

Hypothesis   Drift in ECM   Drift in Process
H0(Case 2)   Constant       Constant
H1(Case 3)   Constant       Linear

Rank   DF
0      2
1      1
Figure 30.54 shows the estimates of long-run parameter (Beta) and adjustment coefficients (Alpha) based on Case 3.

Using the NORMALIZE= option, the first row of the Beta table has 1. Considering that the cointegration rank is 1, the long-run relationship of the series is

$$\beta' y_t = \begin{pmatrix} 1 & -2.04869 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = y_{1t} - 2.04869\, y_{2t}$$

$$y_{1t} = 2.04869\, y_{2t}$$
Figure 30.55 shows the estimates of long-run parameter (Beta) and adjustment coefficients (Alpha) based on Case 2.
Figure 30.55 Cointegration Rank Test Continued
Beta Under Restriction

Variable          1           2
y1          1.00000     1.00000
y2         -2.04366    -2.75773
1           6.75919   101.37051

Considering that the cointegration rank is 1, the long-run relationship of the series is

$$\beta' y_t = \begin{pmatrix} 1 & -2.04366 & 6.75919 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ 1 \end{pmatrix} = y_{1t} - 2.04366\, y_{2t} + 6.75919$$

$$y_{1t} = 2.04366\, y_{2t} - 6.75919$$
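$$\hat\Sigma = (Z_0 - Z_2 \hat\Psi' - Z_1 \hat\beta \hat\alpha')'(Z_0 - Z_2 \hat\Psi' - Z_1 \hat\beta \hat\alpha')/T$$

The estimators of the orthogonal complements of $\alpha$ and $\beta$ are

$$\hat\beta_\perp = S_{11}[v_{r+1}, \ldots, v_k]$$

and

$$\hat\alpha_\perp = S_{00}^{-1} S_{01} [v_{r+1}, \ldots, v_k]$$

The ML estimators have the following asymptotic properties:

$$\sqrt{T}\, \mathrm{vec}([\hat\alpha, \hat\Psi] - [\alpha, \Psi]) \xrightarrow{d} N(0, \Sigma_{co})$$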
0 0 Ik
0 0 0 Ik
The following statements are examples of fitting the five different cases of the vector error correction models mentioned in the previous section.

From Figure 30.53 that uses the COINTTEST=(JOHANSEN) option, you can fit the model by using either Case 2 or Case 3 because the test was not significant at the 0.05 level, but was significant at the 0.10 level. Here both models are fitted to show the difference in output display. Figure 30.56 is for Case 2, and Figure 30.57 is for Case 3.

For Case 2,

   proc varmax data=simul2;
      model y1 y2 / p=2 ecm=(rank=1 normalize=y1 ectrend)
                    print=(estimates);
   run;
AR Coefficients of Differenced Lag

DIF Lag   Variable         y1         y2
1         y1         -0.72759   -0.77463
          y2          0.38982   -0.55173

Model Parameter Estimates

Equation   Parameter   Estimate   Standard Error   t Value   Pr > |t|   Variable
D_y1       CONST1      -3.24543          0.33022                        1, EC
           AR1_1_1     -0.48015          0.04886                        y1(t-1)
           AR1_1_2      0.98126          0.09984                        y2(t-1)
           AR2_1_1     -0.72759          0.04623    -15.74     0.0001   D_y1(t-1)
           AR2_1_2     -0.77463          0.04978    -15.56     0.0001   D_y2(t-1)
D_y2       CONST2       0.84748          0.35394                        1, EC
           AR1_2_1      0.12538          0.05236                        y1(t-1)
           AR1_2_2     -0.25624          0.10702                        y2(t-1)
           AR2_2_1      0.38982          0.04955      7.87     0.0001   D_y1(t-1)
           AR2_2_2     -0.55173          0.05336    -10.34     0.0001   D_y2(t-1)

Figure 30.56 can be reported as follows:

$$\Delta y_t = \begin{pmatrix} -0.48015 & 0.98126 & -3.24543 \\ 0.12538 & -0.25624 & 0.84748 \end{pmatrix} \begin{pmatrix} y_{1,t-1} \\ y_{2,t-1} \\ 1 \end{pmatrix} + \begin{pmatrix} -0.72759 & -0.77463 \\ 0.38982 & -0.55173 \end{pmatrix} \Delta y_{t-1} + \epsilon_t$$
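The first coefficient matrix in the Case 2 model is the product $\alpha(\beta', \beta_0)$, so it can be cross-checked against the restricted Beta estimates in Figure 30.55. In the following sketch, $\alpha$ is read off as the AR1_1_1 and AR1_2_1 estimates, which is an interpretation that relies on the normalization $\beta_{1} = 1$:

```latex
\alpha = \begin{pmatrix} -0.48015 \\ 0.12538 \end{pmatrix}, \qquad
(\beta', \beta_0) = \begin{pmatrix} 1 & -2.04366 & 6.75919 \end{pmatrix}

\alpha(\beta', \beta_0) \approx
\begin{pmatrix}
-0.48015 & (-0.48015)(-2.04366) & (-0.48015)(6.75919) \\
 0.12538 & (0.12538)(-2.04366) & (0.12538)(6.75919)
\end{pmatrix}
\approx
\begin{pmatrix}
-0.48015 & 0.98126 & -3.2454 \\
 0.12538 & -0.25623 & 0.84747
\end{pmatrix}
```

The products agree with the reported estimates up to rounding of the displayed inputs, which illustrates that the constant enters the equations only through the error correction term.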
The keyword "EC" in the "Model Parameter Estimates" table means that the ECTREND option is used for fitting the model.

For fitting Case 3,

   proc varmax data=simul2;
      model y1 y2 / p=2 ecm=(rank=1 normalize=y1)
                    print=(estimates);
   run;
AR Coefficients of Differenced Lag

DIF Lag   Variable         y1         y2
1         y1         -0.74052   -0.76305
          y2          0.34820   -0.51194

Model Parameter Estimates

Equation   Parameter   Estimate   Standard Error   t Value   Pr > |t|   Variable
D_y1       CONST1      -2.60825          1.32398     -1.97     0.0518   1
           AR1_1_1     -0.46421          0.05474                        y1(t-1)
           AR1_1_2      0.95103          0.11215                        y2(t-1)
           AR2_1_1     -0.74052          0.05060               0.0001   D_y1(t-1)
           AR2_1_2     -0.76305          0.05352               0.0001   D_y2(t-1)
D_y2       CONST2       3.43005          1.39587               0.0159   1
           AR1_2_1      0.17535          0.05771                        y1(t-1)
           AR1_2_2     -0.35923          0.11824                        y2(t-1)
           AR2_2_1      0.34820          0.05335      6.53     0.0001   D_y1(t-1)
           AR2_2_2     -0.51194          0.05643     -9.07     0.0001   D_y2(t-1)

Figure 30.57 can be reported as follows:

$$\Delta y_t = \begin{pmatrix} -0.46421 & 0.95103 \\ 0.17535 & -0.35923 \end{pmatrix} y_{t-1} + \begin{pmatrix} -0.74052 & -0.76305 \\ 0.34820 & -0.51194 \end{pmatrix} \Delta y_{t-1} + \begin{pmatrix} -2.60825 \\ 3.43005 \end{pmatrix} + \epsilon_t$$
is the s
0 0 1 0
3 0 0 7 7 0 5 1
Restriction $H_0\colon \beta = H\phi$

When the linear restriction $\beta = H\phi$ is given, it implies that the same restrictions are imposed on all cointegrating vectors. You obtain the maximum likelihood estimator of $\beta$ by reduced rank regression of $\Delta y_t$ on $H' y_{t-1}$ corrected for $(\Delta y_{t-1}, \ldots, \Delta y_{t-p+1}, D_t)$, solving the following equation

$$|\rho H' S_{11} H - H' S_{10} S_{00}^{-1} S_{01} H| = 0$$

for the eigenvalues $1 > \rho_1 > \cdots > \rho_s > 0$ and eigenvectors $(v_1, \ldots, v_s)$, with $S_{ij}$ given in the preceding section. Then choose $\hat{\phi} = (v_1, \ldots, v_r)$ that corresponds to the $r$ largest eigenvalues, and the $\hat{\beta}$ is $H\hat{\phi}$. The test statistic for $H_0\colon \beta = H\phi$ is given by

$$T \sum_{i=1}^{r} \log\{(1 - \rho_i)/(1 - \lambda_i)\} \xrightarrow{d} \chi^2_{r(k-s)}$$
If the series has no deterministic trend, the constant term should be restricted by $\alpha_\perp' \delta_0 = 0$ as in Case 2. Then $H$ is given by

$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
Figure 30.58 shows the results of testing $H_0\colon 2\beta_1 + \beta_2 = 0$. The input $H$ matrix is $H = (1\ {-2})'$. The adjustment coefficient is reestimated under the restriction, and the test indicates that you cannot reject the null hypothesis.

Index  Eigenvalue  DF  Chi-Square
1      0.4644      1   0.51
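The error correction model can be written as

$$\Delta y_t = \alpha\beta' y_{t-1} + \sum_{i=1}^{p-1} \Phi_i^* \Delta y_{t-i} + A D_t + \epsilon_t$$

Divide the process $y_t$ into $(y_{1t}', y_{2t}')'$ with dimension $k_1$ and $k_2$ and the $\Sigma$ into

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$

Then the VECM(p) form can be rewritten by using the decomposed parameters and processes:

$$\begin{bmatrix} \Delta y_{1t} \\ \Delta y_{2t} \end{bmatrix} = \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} \beta' y_{t-1} + \sum_{i=1}^{p-1} \begin{bmatrix} \Phi_{1i}^* \\ \Phi_{2i}^* \end{bmatrix} \Delta y_{t-i} + \begin{bmatrix} A_1 \\ A_2 \end{bmatrix} D_t + \begin{bmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{bmatrix}$$

The conditional model for $y_{1t}$ given $y_{2t}$ is

$$\Delta y_{1t} = \omega \Delta y_{2t} + (\alpha_1 - \omega\alpha_2)\beta' y_{t-1} + \sum_{i=1}^{p-1} (\Phi_{1i}^* - \omega\Phi_{2i}^*) \Delta y_{t-i} + (A_1 - \omega A_2) D_t + \epsilon_{1t} - \omega\epsilon_{2t}$$

and the marginal model of $y_{2t}$ is

$$\Delta y_{2t} = \alpha_2 \beta' y_{t-1} + \sum_{i=1}^{p-1} \Phi_{2i}^* \Delta y_{t-i} + A_2 D_t + \epsilon_{2t}$$

where $\omega = \Sigma_{12}\Sigma_{22}^{-1}$.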
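The test of weak exogeneity of $y_{2t}$ for the parameters $(\alpha_1, \beta)$ determines whether $\alpha_2 = 0$. Weak exogeneity means that there is no information about $\beta$ in the marginal model or that the variables $y_{2t}$ do not react to a disequilibrium.

Restriction $H_0\colon \alpha = J\psi$

Consider the null hypothesis $H_0\colon \alpha = J\psi$, where $J$ is a $k \times m$ matrix with $r \le m < k$. From the previous residual regression equation

$$R_{0t} = \alpha\beta' R_{1t} + \hat{\epsilon}_t = J\psi\beta' R_{1t} + \hat{\epsilon}_t$$

you can obtain

$$\bar{J}' R_{0t} = \psi\beta' R_{1t} + \bar{J}'\hat{\epsilon}_t$$
$$J_\perp' R_{0t} = J_\perp' \hat{\epsilon}_t$$

where $\bar{J} = J(J'J)^{-1}$ and $J_\perp$ is orthogonal to $J$ such that $J_\perp' J = 0$. With $\omega = \Sigma_{JJ_\perp}\Sigma_{J_\perp J_\perp}^{-1}$, where $\Sigma_{JJ_\perp} = \bar{J}'\Sigma J_\perp$ and $\Sigma_{J_\perp J_\perp} = J_\perp'\Sigma J_\perp$, the first equation can be written as

$$\bar{J}' R_{0t} = \psi\beta' R_{1t} + \omega J_\perp' R_{0t} + \bar{J}'\hat{\epsilon}_t - \omega J_\perp' \hat{\epsilon}_t$$

Using the marginal distribution of $J_\perp' R_{0t}$ and the conditional distribution of $\bar{J}' R_{0t}$, the new residuals are computed as

$$\tilde{R}_{Jt} = \bar{J}' R_{0t} - S_{JJ_\perp} S_{J_\perp J_\perp}^{-1} J_\perp' R_{0t}$$
$$\tilde{R}_{1t} = R_{1t} - S_{1J_\perp} S_{J_\perp J_\perp}^{-1} J_\perp' R_{0t}$$

where

$$S_{JJ_\perp} = \bar{J}' S_{00} J_\perp, \quad S_{J_\perp J_\perp} = J_\perp' S_{00} J_\perp, \quad \text{and} \quad S_{J_\perp 1} = J_\perp' S_{01}$$

In terms of $\tilde{R}_{Jt}$ and $\tilde{R}_{1t}$, the MLE of $\beta$ is computed by using the reduced rank regression. Let

$$S_{ij.J_\perp} = \frac{1}{T} \sum_{t=1}^{T} \tilde{R}_{it} \tilde{R}_{jt}', \quad \text{for } i, j = 1, J$$

Under the null hypothesis $H_0\colon \alpha = J\psi$, the MLE $\tilde{\beta}$ is computed by solving the equation

$$|\rho S_{11.J_\perp} - S_{1J.J_\perp} S_{JJ.J_\perp}^{-1} S_{J1.J_\perp}| = 0$$

Then $\tilde{\beta} = (v_1, \ldots, v_r)$, where the eigenvectors correspond to the $r$ largest eigenvalues. The likelihood ratio test for $H_0\colon \alpha = J\psi$ is

$$T \sum_{i=1}^{r} \log\{(1 - \rho_i)/(1 - \lambda_i)\} \xrightarrow{d} \chi^2_{r(k-m)}$$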
The test of weak exogeneity of $y_{2t}$ is a special case of the test $\alpha = J\psi$, considering $J = (I_{k_1}, 0)'$. Consider the previous example with four variables ($m_t, y_t, i_t^b, i_t^d$). If $r = 1$, you formulate the weak exogeneity of ($y_t, i_t^b, i_t^d$) for $m_t$ as $J = [1, 0, 0, 0]'$ and the weak exogeneity of $i_t^d$ for ($m_t, y_t, i_t^b$) as $J = [I_3, 0]'$. The following statements test the weak exogeneity of other variables, assuming $r = 1$:

proc varmax data=simul2;
   model y1 y2 / p=2 ecm=(rank=1 normalize=y1);
   cointeg rank=1 exogeneity;
run;

Figure 30.59 shows that neither variable is weakly exogenous with respect to the other: the hypothesis of weak exogeneity is rejected for both y1 and y2.
Figure 30.59 Testing of Weak Exogeneity (EXOGENEITY Option)
The VARMAX Procedure

Testing Weak Exogeneity of Each Variable

Variable  DF  Chi-Square  Pr > ChiSq
y1        1   53.46       <.0001
y2        1    8.76       0.0031
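The p-values in Figure 30.59 can be checked by hand, because a chi-square variate with one degree of freedom is the square of a standard normal variate. The following Python sketch is only an illustration (the helper name `chi2_sf_df1` is hypothetical and is not part of PROC VARMAX); it recomputes the upper-tail probabilities from the reported statistics.

```python
import math

def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    # With 1 df, X = Z**2 for standard normal Z, so
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

# Chi-square statistics reported in Figure 30.59 (each has DF = 1).
for name, stat in [("y1", 53.46), ("y2", 8.76)]:
    print(f"{name}: chi-square = {stat:6.2f}, p-value = {chi2_sf_df1(stat):.4f}")
```

Both p-values fall well below 0.05, which is why weak exogeneity is rejected for each variable.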
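Assuming that $y_0 = 0$, the linear process $y_t$ can be written as

$$y_t = \delta t + \sum_{i=1}^{t} \sum_{j=0}^{t-i} \Psi_j \epsilon_i$$

so that, for any $l > 0$,

$$y_{t+l} = \delta(t + l) + \sum_{i=1}^{t} \sum_{j=0}^{t+l-i} \Psi_j \epsilon_i + \sum_{i=1}^{l} \sum_{j=0}^{l-i} \Psi_j \epsilon_{t+i}$$

The $l$-step-ahead forecast is

$$y_{t+l|t} = \delta(t + l) + \sum_{i=1}^{t} \sum_{j=0}^{t+l-i} \Psi_j \epsilon_i$$

Note that

$$\lim_{l \to \infty} \beta' y_{t+l|t} = 0$$

since $\lim_{l \to \infty} \sum_{j=0}^{t+l-i} \Psi_j = \Psi(1)$ and $\beta'\Psi(1) = 0$. The long-run forecast of the cointegrated system shows that the cointegrated relationship holds, although there might exist some deviations from the equilibrium status in the short run. The covariance matrix of the prediction error $e_{t+l|t} = y_{t+l} - y_{t+l|t}$ is

$$\Sigma(l) = \sum_{i=1}^{l} \left[ \left( \sum_{j=0}^{l-i} \Psi_j \right) \Sigma \left( \sum_{j=0}^{l-i} \Psi_j' \right) \right]$$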
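When the linear process is represented as a VECM(p) model, you can obtain

$$\Delta y_t = \Pi y_{t-1} + \sum_{j=1}^{p-1} \Phi_j^* \Delta y_{t-j} + \delta + \epsilon_t$$

The transition equation is defined as

$$z_t = F z_{t-1} + e_t$$

where $z_t = (y_{t-1}', \Delta y_t', \Delta y_{t-1}', \ldots, \Delta y_{t-p+2}')'$ is the state vector and the transition matrix is

$$F = \begin{bmatrix} I_k & I_k & 0 & \cdots & 0 \\ \Pi & (\Pi + \Phi_1^*) & \Phi_2^* & \cdots & \Phi_{p-1}^* \\ 0 & I_k & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & I_k & 0 \end{bmatrix}$$

where 0 is a $k \times k$ zero matrix.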
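The error correction model with exogenous variables can be written as follows:

$$\Delta y_t = \alpha\beta' y_{t-1} + \sum_{i=1}^{p-1} \Phi_i^* \Delta y_{t-i} + A D_t + \sum_{i=0}^{s} \Theta_i^* x_{t-i} + \epsilon_t$$

The following statements demonstrate how to fit VECMX($p, s$), where $p = 2$ and $s = 1$ from the P=2 and XLAG=1 options:

proc varmax data=simul3;
   model y1 y2 = x1 / p=2 xlag=1 ecm=(rank=1);
run;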
I(2) Model
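The VARX(p,s) model can be written in the error correction form:

$$\Delta y_t = \alpha\beta' y_{t-1} + \sum_{i=1}^{p-1} \Phi_i^* \Delta y_{t-i} + A D_t + \sum_{i=0}^{s} \Theta_i^* x_{t-i} + \epsilon_t$$

Let $\Phi^* = I_k - \sum_{i=1}^{p-1} \Phi_i^*$.

If the condition $\mathrm{rank}(\alpha_\perp' \Phi^* \beta_\perp) = k - r$ fails and $\alpha_\perp' \Phi^* \beta_\perp$ has reduced rank $\alpha_\perp' \Phi^* \beta_\perp = \xi\eta'$, where $\xi$ and $\eta$ are $(k - r) \times s$ matrices with $s \le k - r$, then $\alpha_\perp$ and $\beta_\perp$ are defined as $k \times (k - r)$ matrices of full rank such that $\alpha'\alpha_\perp = 0$ and $\beta'\beta_\perp = 0$. If $\xi$ and $\eta$ have full rank $s$, then the process $y_t$ is $I(2)$, which has the following implication for the moving-average representation: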
$$y_t = B_0 + B_1 t + C_2 \sum_{j=1}^{t} \sum_{i=1}^{j} \epsilon_i + C_1 \sum_{i=1}^{t} \epsilon_i + C_0(B)\epsilon_t$$

The matrices $C_1$, $C_2$, and $C_0(B)$ are determined by the cointegration properties of the process, and $B_0$ and $B_1$ are determined by the initial values. For details, see Johansen (1995a). The implication of the $I(2)$ model for the autoregressive representation is given by
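$$\Delta^2 y_t = \Pi y_{t-1} - \Psi \Delta y_{t-1} + \sum_{i=1}^{p-2} \Psi_i \Delta^2 y_{t-i} + A D_t + \sum_{i=0}^{s} \Theta_i^* x_{t-i} + \epsilon_t$$

where $\Psi_i = -\sum_{j=i+1}^{p-1} \Phi_j^*$ and $\Psi = I_k - \sum_{i=1}^{p-1} \Phi_i^*$.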
Table 30.2
Johansen (1995a) proposed the two-step procedure to analyze the $I(2)$ model. In the first step, the values of $(r, \alpha, \beta)$ are estimated using the reduced rank regression analysis, performing the regression analysis of $\Delta^2 y_t$, $\Delta y_{t-1}$, and $y_{t-1}$ on $\Delta^2 y_{t-1}, \ldots, \Delta^2 y_{t-p+2}$, and $D_t$. This gives residuals $R_{0t}$, $R_{1t}$, and $R_{2t}$, and residual product moment matrices

$$M_{ij} = \frac{1}{T} \sum_{t=1}^{T} R_{it} R_{jt}' \quad \text{for } i, j = 0, 1, 2$$

Perform the reduced rank regression analysis of $\Delta^2 y_t$ on $y_{t-1}$ corrected for $\Delta y_{t-1}$, $\Delta^2 y_{t-1}, \ldots, \Delta^2 y_{t-p+2}$, and $D_t$, and solve the eigenvalue problem of the equation

$$|\lambda M_{22.1} - M_{20.1} M_{00.1}^{-1} M_{02.1}| = 0$$

where $M_{ij.1} = M_{ij} - M_{i1} M_{11}^{-1} M_{1j}$ for $i, j = 0, 2$.
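In the second step, if $(r, \alpha, \beta)$ are known, the values of $(s, \xi, \eta)$ are determined using the reduced rank regression analysis, regressing $\hat{\alpha}_\perp' \Delta^2 y_t$ on $\hat{\beta}_\perp' \Delta y_{t-1}$ corrected for $\Delta^2 y_{t-1}, \ldots, \Delta^2 y_{t-p+2}$, $D_t$, and $\hat{\beta}' \Delta y_{t-1}$. The reduced rank regression analysis reduces to the solution of an eigenvalue problem for the equation

$$|\rho M_{\beta_\perp \beta_\perp . \beta} - M_{\beta_\perp \alpha_\perp . \beta} M_{\alpha_\perp \alpha_\perp . \beta}^{-1} M_{\alpha_\perp \beta_\perp . \beta}| = 0$$

where

$$M_{\beta_\perp \beta_\perp . \beta} = \beta_\perp' (M_{11} - M_{11}\beta(\beta' M_{11} \beta)^{-1} \beta' M_{11}) \beta_\perp$$
$$M_{\beta_\perp \alpha_\perp . \beta}' = M_{\alpha_\perp \beta_\perp . \beta} = \bar{\alpha}_\perp' (M_{01} - M_{01}\beta(\beta' M_{11} \beta)^{-1} \beta' M_{11}) \beta_\perp$$
$$M_{\alpha_\perp \alpha_\perp . \beta} = \bar{\alpha}_\perp' (M_{00} - M_{01}\beta(\beta' M_{11} \beta)^{-1} \beta' M_{10}) \bar{\alpha}_\perp$$

where $\bar{\alpha} = \alpha(\alpha'\alpha)^{-1}$. The solution gives eigenvalues $1 > \rho_1 > \cdots > \rho_s > 0$ and eigenvectors $(v_1, \ldots, v_s)$.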
The likelihood ratio test for the reduced rank model $H_{r,s}$ with rank $s$ in the model $H_{r,k-r} = H_r^0$ is given by

$$Q_{r,s} = -T \sum_{i=s+1}^{k-r} \log(1 - \rho_i), \quad s = 0, \ldots, k - r - 1$$

The following statements compute the rank test to test for cointegrated order 2:
The last two columns in Figure 30.60 explain the cointegration rank test with integrated order 1. The results indicate that there is a cointegrated relationship with cointegration rank 1 at the 0.05 significance level because the test statistic of 0.5552 is smaller than the critical value of 3.84. Now, look at the row associated with $r = 1$. Compare the test statistic value, 211.84512, to the critical value, 3.84, for the cointegrated order 2. There is no evidence that the series are integrated of order 2 at the 0.05 significance level.
Figure 30.60 Cointegrated I(2) Test (IORDER= Option)
The VARMAX Procedure

Cointegration Rank Test for I(2)

r\k-r-s       2            Trace of I(1)  5% CV of I(1)
0             720.40735    61.7522        15.34
1                           0.5552         3.84
5% CV I(2)     15.34000
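The reading of Figure 30.60 described above is a sequential comparison of trace statistics with critical values. The following Python sketch restates that decision rule; the function name `select_rank` is a hypothetical helper, and the numbers are the ones quoted from Figure 30.60.

```python
def select_rank(trace_stats, critical_values):
    """Return the smallest r whose trace statistic does not exceed its critical value."""
    for r, (stat, cv) in enumerate(zip(trace_stats, critical_values)):
        if stat <= cv:
            return r
    return len(trace_stats)  # every hypothesis rejected: full rank

# "Trace of I(1)" and "5% CV of I(1)" columns for r = 0 and r = 1.
trace_i1 = [61.7522, 0.5552]
cv_i1 = [15.34, 3.84]
rank = select_rank(trace_i1, cv_i1)

# I(2) statistic in the row r = 1, compared with its 5% critical value.
i2_stat, i2_cv = 211.84512, 3.84
i2_hypothesis_rejected = i2_stat > i2_cv

print("cointegration rank:", rank)
print("I(2) hypothesis rejected:", i2_hypothesis_rejected)
```

The rank-0 hypothesis is rejected (61.7522 > 15.34) and the rank-1 hypothesis is not (0.5552 < 3.84), so the selected rank is 1. The large I(2) statistic rejects the hypothesis of second-order integration.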
BEKK Representation
Engle and Kroner (1995) propose a general multivariate GARCH model and call it a BEKK representation. Let $F(t-1)$ be the sigma field generated by the past values of $\epsilon_t$, and let $H_t$ be the conditional covariance matrix of the $k$-dimensional random vector $\epsilon_t$. Let $H_t$ be measurable with respect to $F(t-1)$; then the multivariate GARCH model can be written as

$$\epsilon_t | F(t-1) \sim N(0, H_t)$$
$$H_t = C + \sum_{i=1}^{q} A_i' \epsilon_{t-i} \epsilon_{t-i}' A_i + \sum_{i=1}^{p} G_i' H_{t-i} G_i$$

where $C$, $A_i$, and $G_i$ are $k \times k$ parameter matrices.
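Consider a bivariate GARCH(1,1) model as follows:

$$H_t = \begin{bmatrix} c_{11} & c_{12} \\ c_{12} & c_{22} \end{bmatrix} + \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}' \begin{bmatrix} \epsilon_{1,t-1}^2 & \epsilon_{1,t-1}\epsilon_{2,t-1} \\ \epsilon_{2,t-1}\epsilon_{1,t-1} & \epsilon_{2,t-1}^2 \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} + \begin{bmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{bmatrix}' H_{t-1} \begin{bmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{bmatrix}$$

or, element by element,

$$h_{11,t} = c_{11} + a_{11}^2 \epsilon_{1,t-1}^2 + 2a_{11}a_{21}\epsilon_{1,t-1}\epsilon_{2,t-1} + a_{21}^2 \epsilon_{2,t-1}^2 + g_{11}^2 h_{11,t-1} + 2g_{11}g_{21}h_{12,t-1} + g_{21}^2 h_{22,t-1}$$
$$h_{12,t} = c_{12} + a_{11}a_{12}\epsilon_{1,t-1}^2 + (a_{21}a_{12} + a_{11}a_{22})\epsilon_{1,t-1}\epsilon_{2,t-1} + a_{21}a_{22}\epsilon_{2,t-1}^2 + g_{11}g_{12}h_{11,t-1} + (g_{21}g_{12} + g_{11}g_{22})h_{12,t-1} + g_{21}g_{22}h_{22,t-1}$$
$$h_{22,t} = c_{22} + a_{12}^2 \epsilon_{1,t-1}^2 + 2a_{12}a_{22}\epsilon_{1,t-1}\epsilon_{2,t-1} + a_{22}^2 \epsilon_{2,t-1}^2 + g_{12}^2 h_{11,t-1} + 2g_{12}g_{22}h_{12,t-1} + g_{22}^2 h_{22,t-1}$$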
For the BEKK representation of the bivariate GARCH(1,1) model, the SAS statements are:
model y1 y2; garch q=1 p=1 form=bekk;
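To see how the matrix form of the bivariate BEKK GARCH(1,1) recursion relates to its element-by-element expansion, the following Python sketch evaluates one step of $H_t = C + A'\epsilon_{t-1}\epsilon_{t-1}'A + G'H_{t-1}G$ and compares the (1,1) element with the scalar formula for $h_{11,t}$. All numeric values and helper names here are hypothetical illustrations; they are not estimates from any example in this chapter.

```python
def matmul(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][m] * Y[m][j] for m in range(2)) for j in range(2)] for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

def bekk_step(C, A, G, eps, H_prev):
    """One step of H_t = C + A' eps eps' A + G' H_{t-1} G."""
    outer = [[eps[i] * eps[j] for j in range(2)] for i in range(2)]
    arch = matmul(matmul(transpose(A), outer), A)
    garch = matmul(matmul(transpose(G), H_prev), G)
    return [[C[i][j] + arch[i][j] + garch[i][j] for j in range(2)] for i in range(2)]

# Hypothetical parameter values, chosen only for illustration.
C = [[1.0, 0.2], [0.2, 0.5]]
A = [[0.3, 0.1], [0.05, 0.4]]
G = [[0.6, 0.0], [0.1, 0.5]]
eps = [0.8, -1.1]
H_prev = [[2.0, 0.3], [0.3, 1.5]]

H = bekk_step(C, A, G, eps, H_prev)

# Scalar expansion of the (1,1) element.
h11 = (C[0][0]
       + A[0][0] ** 2 * eps[0] ** 2
       + 2 * A[0][0] * A[1][0] * eps[0] * eps[1]
       + A[1][0] ** 2 * eps[1] ** 2
       + G[0][0] ** 2 * H_prev[0][0]
       + 2 * G[0][0] * G[1][0] * H_prev[0][1]
       + G[1][0] ** 2 * H_prev[1][1])

print(H[0][0], h11)
```

The quadratic forms keep $H_t$ symmetric whenever $C$ and $H_{t-1}$ are symmetric.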
CCC Representation
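Bollerslev (1990) proposes a multivariate GARCH model with time-varying conditional variances and covariances but constant conditional correlations. The conditional covariance matrix $H_t$ consists of

$$H_t = D_t \Gamma D_t$$

where $D_t$ is a $k \times k$ stochastic diagonal matrix with element $\sigma_{i,t}$ and $\Gamma$ is a $k \times k$ time-invariant matrix with the typical element $\rho_{ij}$. The elements of $H_t$ are

$$h_{ii,t} = c_i + \sum_{l=1}^{q} a_{ii,l}\,\epsilon_{i,t-l}^2 + \sum_{l=1}^{p} g_{ii,l}\,h_{ii,t-l}, \quad i = 1, \ldots, k$$
$$h_{ij,t} = \rho_{ij}(h_{ii,t}\,h_{jj,t})^{1/2}, \quad i \ne j$$

The log-likelihood function of the multivariate GARCH model, written without a constant term, is

$$\ell = -\frac{1}{2} \sum_{t=1}^{T} \left[ \log|H_t| + \epsilon_t' H_t^{-1} \epsilon_t \right]$$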
The log-likelihood function is maximized by an iterative numerical method such as quasi-Newton optimization. The starting values for the regression parameters are obtained from the least squares estimates. The covariance of $\epsilon_t$ is used as the starting values for the GARCH constant parameters, and the starting value used for the other GARCH parameters is either $10^{-6}$ or $10^{-3}$ depending on the GARCH model's representation. For the identification of the parameters of a BEKK representation GARCH model, the diagonal elements of the GARCH constant, the ARCH, and the GARCH parameters are restricted to be positive.
Covariance Stationarity
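Define the multivariate GARCH process as

$$h_t = \sum_{i=1}^{\infty} G(B)^{i-1} [c + A(B)\eta_t]$$

where $h_t = \mathrm{vec}(H_t)$, $c = \mathrm{vec}(C_0)$, and $\eta_t = \mathrm{vec}(\epsilon_t \epsilon_t')$. This representation is equivalent to a GARCH($p, q$) model by the following algebra:

$$h_t = c + A(B)\eta_t + \sum_{i=2}^{\infty} G(B)^{i-1} [c + A(B)\eta_t]$$
$$\phantom{h_t} = c + A(B)\eta_t + G(B) \sum_{i=1}^{\infty} G(B)^{i-1} [c + A(B)\eta_t]$$
$$\phantom{h_t} = c + A(B)\eta_t + G(B) h_t$$

Defining $A(B) = \sum_{i=1}^{q} (A_i \otimes A_i)' B^i$ and $G(B) = \sum_{i=1}^{p} (G_i \otimes G_i)' B^i$ gives a BEKK representation.

A necessary and sufficient condition for covariance stationarity of the multivariate GARCH process is that all the eigenvalues of $A(1) + G(1)$ are less than one in modulus.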
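For an ARCH(1) model in BEKK form, $A(1) + G(1)$ reduces to $(A_1 \otimes A_1)'$, and the eigenvalues of a Kronecker product are the pairwise products of the eigenvalues of its factors. The following Python sketch uses that fact to check the stationarity condition for a bivariate case. The matrix `A1` is hypothetical and chosen only for illustration, and `eig2x2` is an illustrative helper rather than anything PROC VARMAX exposes.

```python
import cmath

def eig2x2(M):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = cmath.sqrt(tr * tr / 4.0 - det)
    return (tr / 2.0 + disc, tr / 2.0 - disc)

# Hypothetical ARCH(1) coefficient matrix (illustration only).
A1 = [[0.7154, 0.10],
      [0.0,    0.3722]]

lam = eig2x2(A1)
# Eigenvalues of (A1 kron A1)' are all pairwise products of eigenvalues of A1.
kron_eigs = [x * y for x in lam for y in lam]
moduli = sorted((abs(z) for z in kron_eigs), reverse=True)
covariance_stationary = max(moduli) < 1.0

print([round(m, 4) for m in moduli], covariance_stationary)
```

When GARCH terms are present, the same check applies to the sum $A(1) + G(1)$, whose eigenvalues generally have to be computed numerically.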
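data garch;
   /* The DATA and RETAIN statements were lost from this listing;    */
   /* the seed value below is an assumption, any valid seed works.   */
   retain seed 16587;
   esq1 = 0; esq2 = 0;
   ly1 = 0;  ly2 = 0;
   do i = 1 to 1000;
      /* ARCH(1) conditional variance and innovation for each series */
      ht = 6.25 + 0.5*esq1;
      call rannor(seed,ehat);
      e1 = sqrt(ht)*ehat;
      ht = 1.25 + 0.7*esq2;
      call rannor(seed,ehat);
      e2 = sqrt(ht)*ehat;
      /* VAR(1) mean equations */
      y1 = 2 + 1.2*ly1 - 0.5*ly2 + e1;
      y2 = 4 + 0.6*ly1 + 0.3*ly2 + e2;
      /* discard the first 500 observations as burn-in */
      if i>500 then output;
      esq1 = e1*e1; esq2 = e2*e2;
      ly1 = y1; ly2 = y2;
   end;
   keep y1 y2;
run;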
The following statements fit a VAR(1)-ARCH(1) model to the data. For a VAR-ARCH model, you specify the order of the autoregressive model with the P=1 option in the MODEL statement and the Q=1 option in the GARCH statement. In order to produce the initial and final values of parameters, the TECH=QN option is specified in the NLOPTIONS statement.

proc varmax data=garch;
   model y1 y2 / p=1
                 print=(roots estimates diagnose);
   garch q=1;
   nloptions tech=qn;
run;
Figure 30.61 through Figure 30.65 show the details of this example. Figure 30.61 shows the initial values of parameters.
N   Parameter   Estimate
1   CONST1       2.249575
2   CONST2       3.902673
3   AR1_1_1      1.231775
4   AR1_2_1      0.576890
5   AR1_1_2     -0.528405
6   AR1_2_2      0.343714
7   GCHC1_1      9.929763
8   GCHC1_2      0.193163
9   GCHC2_2      4.063245
10  ACH1_1_1     0.001000
11  ACH1_2_1     0
12  ACH1_1_2     0
13  ACH1_2_2     0.001000
Figure 30.63 shows the conditional variance using the BEKK representation of the ARCH(1) model. The ARCH parameters are estimated by the vectorized parameter matrices.
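$$\epsilon_t | F(t-1) \sim N(0, H_t)$$
$$H_t = C + A_1' \epsilon_{t-1} \epsilon_{t-1}' A_1$$

GARCH Model Parameter Estimates

Parameter   Standard Error
GCHC1_1     0.73116
GCHC1_2     0.21706
GCHC2_2     0.19398
ACH1_1_1    0.07470
ACH1_2_1    0.06971
ACH1_1_2    0.02622
ACH1_2_2    0.06844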
Figure 30.64 shows the AR parameter estimates and their significance. The fitted VAR(1) model with the previous conditional covariance ARCH model is written as follows:

$$y_t = \begin{bmatrix} 1.94399 \\ 4.07390 \end{bmatrix} + \begin{bmatrix} 1.22094 & -0.52712 \\ 0.60826 & 0.30301 \end{bmatrix} y_{t-1} + \epsilon_t$$
Figure 30.65 shows the roots of the AR and ARCH characteristic polynomials. The eigenvalues have a modulus less than one.
Figure 30.65 Roots for the VAR(1)ARCH(1) Model
Roots of AR Characteristic Polynomial

Index  Real     Imaginary  Modulus  Radian   Degree
1      0.76198   0.33163   0.8310    0.4105   23.5197
2      0.76198  -0.33163   0.8310   -0.4105  -23.5197

Roots of GARCH Characteristic Polynomial

Index  Real     Imaginary  Modulus  Radian  Degree
1      0.51180  0.00000    0.5118   0.0000  0.0000
2      0.26627  0.00000    0.2663   0.0000  0.0000
3      0.26627  0.00000    0.2663   0.0000  0.0000
4      0.13853  0.00000    0.1385   0.0000  0.0000
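As a cross-check on the AR roots in Figure 30.65, the following Python sketch computes the eigenvalues of the fitted VAR(1) coefficient matrix reported with Figure 30.64 and converts them to modulus, radian, and degree form. It is only a verification aid; the variable names are illustrative, and the closed-form 2x2 eigenvalue formula is used instead of a numerical library.

```python
import cmath
import math

# Fitted AR(1) coefficient matrix of the VAR(1)-ARCH(1) example.
Phi = [[1.22094, -0.52712],
       [0.60826,  0.30301]]

tr = Phi[0][0] + Phi[1][1]
det = Phi[0][0] * Phi[1][1] - Phi[0][1] * Phi[1][0]
disc = cmath.sqrt((tr / 2.0) ** 2 - det)
roots = [tr / 2.0 + disc, tr / 2.0 - disc]

for z in roots:
    radian = cmath.phase(z)
    print(f"real={z.real:.5f} imag={z.imag:.5f} "
          f"modulus={abs(z):.4f} radian={radian:.4f} degree={math.degrees(radian):.4f}")
```

Both roots have modulus below one, in agreement with the table.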
- the BY variables
- the ID variable
- the MODEL statement dependent (endogenous) variables. These variables contain the actual values from the input data set.
- FORi, numeric variables that contain the forecasts. The FORi variables contain the forecasts for the ith endogenous variable in the MODEL statement list. Forecasts are one-step-ahead predictions until the end of the data or until the observation specified by the BACK= option. Multistep forecasts can be computed after that point based on the LEAD= option.
- RESi, numeric variables that contain the residual for the forecast of the ith endogenous variable in the MODEL statement list. For multistep forecast observations, the actual values are missing and the RESi variables contain missing values.
- STDi, numeric variables that contain the standard deviation for the forecast of the ith endogenous variable in the MODEL statement list. The values of the STDi variables can be used to construct univariate confidence limits for the corresponding forecasts.
- LCIi, numeric variables that contain the lower confidence limits for the corresponding forecasts of the ith endogenous variable in the MODEL statement list.
- UCIi, numeric variables that contain the upper confidence limits for the corresponding forecasts of the ith endogenous variable in the MODEL statement list.

The OUT= data set contains the values shown in Table 30.3 and Table 30.4 for a bivariate case.
Table 30.3 OUT= Data Set
Obs  y1          STD1
1    $y_{11}$    $\sigma_{11}$
2    $y_{12}$    $\sigma_{11}$
...

Table 30.4

Obs  y2          STD2
1    $y_{21}$    $\sigma_{22}$
2    $y_{22}$    $\sigma_{22}$
...
The output in Figure 30.66 shows part of the results of the OUT= data set for the preceding example.
Figure 30.66 OUT= Data Set
Obs  date  y1        FOR1      RES1      STD1     LCI1      UCI1
98   1997  -0.58433  -0.13500  -0.44934  1.13523  -2.36001   2.09002
99   1998  -2.07170  -1.00649  -1.06522  1.13523  -3.23150   1.21853
100  1999  -3.38342  -2.58612  -0.79730  1.13523  -4.81113  -0.36111
101  2000   .        -3.59212   .        1.13523  -5.81713  -1.36711
102  2001   .        -3.09448   .        1.70915  -6.44435   0.25539
103  2002   .        -2.17433   .        2.14472  -6.37792   2.02925
104  2003   .        -1.11395   .        2.43166  -5.87992   3.65203
105  2004   .        -0.14342   .        2.58740  -5.21463   4.92779

Obs  y2        FOR2      LCI2      UCI2
98    0.64397  -0.34932  -2.68357  1.98492
99    0.35925  -0.07132  -2.40557  2.26292
100  -0.64999  -0.99354  -3.32779  1.34070
101   .        -2.09873  -4.43298  0.23551
102   .        -2.77050  -5.66469  0.12369
103   .        -2.75724  -6.17173  0.65725
104   .        -2.24943  -6.20709  1.70823
105   .        -1.47460  -5.88782  2.93863
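The LCIi and UCIi values can be reproduced from the FORi and STDi columns. The following Python sketch assumes symmetric normal-theory limits at the 0.05 level (the level is an assumption here, although it is consistent with the values in Figure 30.66); the function name `confidence_limits` is hypothetical.

```python
from statistics import NormalDist

def confidence_limits(forecast, std, alpha=0.05):
    """Symmetric normal confidence limits: forecast -/+ z * std."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return forecast - z * std, forecast + z * std

# Observation 98 for y1 in Figure 30.66: FOR1 = -0.13500, STD1 = 1.13523.
lci1, uci1 = confidence_limits(-0.13500, 1.13523)
print(f"LCI1 = {lci1:.5f}, UCI1 = {uci1:.5f}")
```

Small differences in the last digit can occur because the printed forecasts and standard deviations are rounded.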
- LTREND, a numeric variable that contains the estimates of linear trend parameters and their standard errors
- QTREND, a numeric variable that contains the estimates of quadratic trend parameters and their standard errors
- XLl_i, numeric variables that contain the estimates of exogenous parameters and their standard errors, where l is the lag lth coefficient matrix and i = 1, ..., r, where r is the number of exogenous variables
- ARl_i, numeric variables that contain the estimates of autoregressive parameters and their standard errors, where l is the lag lth coefficient matrix and i = 1, ..., k, where k is the number of endogenous variables
- MAl_i, numeric variables that contain the estimates of moving-average parameters and their standard errors, where l is the lag lth coefficient matrix and i = 1, ..., k, where k is the number of endogenous variables
- ACHl_i, numeric variables that contain the estimates of the ARCH parameters of the covariance matrix and their standard errors, where l is the lag lth coefficient matrix and i = 1, ..., k for BEKK and CCC representations, where k is the number of endogenous variables
- GCHl_i, numeric variables that contain the estimates of the GARCH parameters of the covariance matrix and their standard errors, where l is the lag lth coefficient matrix and i = 1, ..., k for BEKK and CCC representations, where k is the number of endogenous variables
- GCHC_i, numeric variables that contain the estimates of the constant parameters of the covariance matrix and their standard errors, where i = 1, ..., k for BEKK representation, k is the number of endogenous variables, and i = 1 for CCC representation
- CCC_i, numeric variables that contain the estimates of the conditional constant correlation parameters for CCC representation, where i = 2, ..., k

The OUTEST= data set contains the values shown in Table 30.5 for a bivariate case.
Table 30.5 OUTEST= Data Set
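Obs  NAME  AR1_1                AR1_2                AR2_1                AR2_2
1    y1    $\phi_{1,11}$        $\phi_{1,12}$        $\phi_{2,11}$        $\phi_{2,12}$
2          se($\phi_{1,11}$)    se($\phi_{1,12}$)    se($\phi_{2,11}$)    se($\phi_{2,12}$)
3    y2    $\phi_{1,21}$        $\phi_{1,22}$        $\phi_{2,21}$        $\phi_{2,22}$
4          se($\phi_{1,21}$)    se($\phi_{1,22}$)    se($\phi_{2,21}$)    se($\phi_{2,22}$)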
The output in Figure 30.67 shows the results of the OUTEST= data set.
Figure 30.67 OUTEST= Data Set
Obs  NAME  TYPE  AR1_1     AR1_2     AR2_1     AR2_2
1    y1    EST   -0.46680   0.91295  -0.74332  -0.74621
2          STD    0.04786   0.09359   0.04526   0.04769
3    y2    EST    0.10667  -0.20862   0.40493  -0.57157
4          STD    0.05146   0.10064   0.04867   0.05128
The output in Figure 30.68 shows part of the OUTHT= data set.

Figure 30.68 OUTHT= Data Set

Obs  h1_1     h1_2      h2_2
495  9.36568  -1.10406  2.44644
496  8.46807  -0.17464  1.60330
497  9.19686   0.09762  1.69639
498  8.40787  -0.33463  2.07687
499  8.88429   0.03646  1.69401
500  8.60844  -0.40260  1.79703
- Alpha_i, numeric variables that contain adjustment parameter estimates, $\alpha$

If the JOHANSEN=(IORDER=2) option is specified, the following items are added:

- EValueI2_i, numeric variables that contain eigenvalues for the cointegration rank test of integrated order 2
- EValueI1, a numeric variable that contains eigenvalues for the cointegration rank test of integrated order 1
- Eta_i, numeric variables that contain the parameter estimates in integrated order 2, $\eta$
- Xi_i, numeric variables that contain the parameter estimates in integrated order 2, $\xi$

The OUTSTAT= data set contains the values shown in Table 30.7 for a bivariate case.
Table 30.7 OUTSTAT= Data Set
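Obs  NAME  SIGMA_1        SIGMA_2        AICC  RSquare   FValue
1    y1    $\sigma_{11}$  $\sigma_{12}$  aicc  $R_1^2$   $F_1$
2    y2    $\sigma_{21}$  $\sigma_{22}$  .     $R_2^2$   $F_2$

Obs  EValueI2_2  EValueI1  Beta_1        Beta_2
1    $e_{12}$    $e_1$     $\beta_{11}$  $\beta_{12}$
2    .           $e_2$     $\beta_{21}$  $\beta_{22}$

Obs  Alpha_1        Alpha_2        Eta_1        Eta_2        Xi_1        Xi_2
1    $\alpha_{11}$  $\alpha_{12}$  $\eta_{11}$  $\eta_{12}$  $\xi_{11}$  $\xi_{12}$
2    $\alpha_{21}$  $\alpha_{22}$  $\eta_{21}$  $\eta_{22}$  $\xi_{21}$  $\xi_{22}$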
The output in Figure 30.69 shows the results of the OUTSTAT= data set.
Printed Output
The default printed output produced by the VARMAX procedure is described in the following list:

- descriptive statistics, which include the number of observations used, the names of the variables, their means and standard deviations (STD), their minimums and maximums, the differencing operations used, and the labels of the variables
- a type of model to fit the data and an estimation method
- a table of parameter estimates that shows the following for each parameter: the variable name for the left-hand side of equation, the parameter name, the parameter estimate, the approximate standard error, t value, the approximate probability (Pr > |t|), and the variable name for the right-hand side of equations in terms of each parameter
- the innovation covariance matrix
- the information criteria

If PRINT=ESTIMATES is specified, the VARMAX procedure prints the following list with the default printed output:

- the estimates of the constant vector (or seasonal constant matrix), the trend vector, the coefficient matrices of the distributed lags, the AR coefficient matrices, and the MA coefficient matrices
- the ALPHA and BETA parameter estimates for the error correction model
- the schematic representation of parameter estimates

If PRINT=DIAGNOSE is specified, the VARMAX procedure prints the following list with the default printed output:

- the cross-covariance and cross-correlation matrices of the residuals
- the tables of test statistics for the hypothesis that the residuals of the model are white noise:
  - Durbin-Watson (DW) statistics
  - F test for autoregressive conditional heteroscedastic (ARCH) disturbances
  - F test for AR disturbance
  - Jarque-Bera normality test
  - Portmanteau test
ODS Table Name        Description                                                          Option

ODS Tables Created by the MODEL Statement
AccumImpulse          Accumulated impulse response matrices                                IMPULSE=(ACCUM) IMPULSE=(ALL)
AccumImpulsebyVar     Accumulated impulse response by variable                             IMPULSE=(ACCUM) IMPULSE=(ALL)
AccumImpulseX         Accumulated transfer function matrices                               IMPULSX=(ACCUM) IMPULSX=(ALL)
AccumImpulseXbyVar    Accumulated transfer function by variable                            IMPULSX=(ACCUM) IMPULSX=(ALL)
Alpha                 $\alpha$ coefficients                                                JOHANSEN=
AlphaInECM            $\alpha$ coefficients when rank=r                                    ECM=
AlphaOnDrift          $\alpha$ coefficients under the restriction of a deterministic term  JOHANSEN=
AlphaBetaInECM        $\Pi = \alpha\beta'$ coefficients when rank=r                        ECM=
ANOVA                 Univariate model diagnostic checks for the residuals                 PRINT=DIAGNOSE
ARCoef                AR coefficients                                                      P=
ARRoots               Roots of AR characteristic polynomial                                ROOTS with P=
Beta                  $\beta$ coefficients                                                 JOHANSEN=
BetaInECM             $\beta$ coefficients when rank=r                                     ECM=
BetaOnDrift           $\beta$ coefficients under the restriction of a deterministic term   JOHANSEN=
Constant                        Constant estimates                                                  without NOINT
CorrB                           Correlations of parameter estimates                                 CORRB
CorrResiduals                   Correlations of residuals                                           PRINT=DIAGNOSE
CorrResidualsbyVar              Correlations of residuals by variable                               PRINT=DIAGNOSE
CorrResidualsGraph              Schematic representation of correlations of residuals               PRINT=DIAGNOSE
CorrXGraph                      Schematic representation of sample correlations of independent series  CORRX
CorrYGraph                      Schematic representation of sample correlations of dependent series    CORRY
CorrXLags                       Correlations of independent series                                  CORRX
CorrXbyVar                      Correlations of independent series by variable                      CORRX
CorrYLags                       Correlations of dependent series                                    CORRY
CorrYbyVar                      Correlations of dependent series by variable                        CORRY
CovB                            Covariances of parameter estimates                                  COVB
CovInnovation                   Covariances of the innovations                                      default
CovPredictError                 Covariance matrices of the prediction error                         COVPE
CovPredictErrorbyVar            Covariances of the prediction error by variable                     COVPE
CovResiduals                    Covariances of residuals                                            PRINT=DIAGNOSE
CovResidualsbyVar               Covariances of residuals by variable                                PRINT=DIAGNOSE
CovXLags                        Covariances of independent series                                   COVX
CovXbyVar                       Covariances of independent series by variable                       COVX
CovYLags                        Covariances of dependent series                                     COVY
CovYbyVar                       Covariances of dependent series by variable                         COVY
DecomposeCovPredictError        Decomposition of the prediction error covariances                   DECOMPOSE
DecomposeCovPredictErrorbyVar   Decomposition of the prediction error covariances by variable       DECOMPOSE
DFTest                          Dickey-Fuller test                                                  DFTEST
DiagnostAR                      Test the AR disturbance for the residuals                           PRINT=DIAGNOSE
DiagnostWN                      Test the ARCH disturbance and normality for the residuals           PRINT=DIAGNOSE
DynamicARCoef                   AR coefficients of the dynamic model                                DYNAMIC
DynamicConstant                 Constant estimates of the dynamic model                             DYNAMIC
DynamicCovInnovation            Covariances of the innovations of the dynamic model                 DYNAMIC
DynamicLinearTrend              Linear trend estimates of the dynamic model                         DYNAMIC
DynamicMACoef               MA coefficients of the dynamic model                                    DYNAMIC
DynamicSConstant            Seasonal constant estimates of the dynamic model                        DYNAMIC
DynamicParameterEstimates   Parameter estimates table of the dynamic model                          DYNAMIC
DynamicParameterGraph       Schematic representation of the parameters of the dynamic model         DYNAMIC
DynamicQuadTrend            Quadratic trend estimates of the dynamic model                          DYNAMIC
DynamicSeasonGraph          Schematic representation of the seasonal dummies of the dynamic model   DYNAMIC
DynamicXLagCoef             Dependent coefficients of the dynamic model                             DYNAMIC
Hypothesis                  Hypothesis of different deterministic terms in cointegration rank test  JOHANSEN=
HypothesisTest              Test hypothesis of different deterministic terms in cointegration rank test  JOHANSEN=
EigenvalueI2                Eigenvalues in integrated order 2                                       JOHANSEN=(IORDER=2)
Eta                         $\eta$ coefficients                                                     JOHANSEN=(IORDER=2)
InfiniteARRepresent         Infinite order AR representation                                        IARR
InfoCriteria                Information criteria                                                    default
LinearTrend                 Linear trend estimates                                                  TREND=
MACoef                      MA coefficients                                                         Q=
MARoots                     Roots of MA characteristic polynomial                                   ROOTS with Q=
MaxTest                     Cointegration rank test using the maximum eigenvalue                    JOHANSEN=(TYPE=MAX)
Minic                       Tentative order selection                                               MINIC MINIC=
ModelType                   Type of model                                                           default
NObs                        Number of observations                                                  default
OrthoImpulse                Orthogonalized impulse response matrices                                IMPULSE=(ORTH) IMPULSE=(ALL)
OrthoImpulsebyVar           Orthogonalized impulse response by variable                             IMPULSE=(ORTH) IMPULSE=(ALL)
ParameterEstimates          Parameter estimates table                                               default
ParameterGraph              Schematic representation of the parameters                              PRINT=ESTIMATES
PartialAR                   Partial autoregression matrices                                         PARCOEF
PartialARGraph              Schematic representation of partial autoregression                      PARCOEF
PartialCanCorr              Partial canonical correlation analysis                                  PCANCORR
PartialCorr                 Partial cross-correlation matrices                                      PCORR
PartialCorrbyVar            Partial cross-correlations by variable                                  PCORR
PartialCorrGraph                Schematic representation of partial cross-correlations                  PCORR
PortmanteauTest                 Chi-square test table for residual cross-correlations                   PRINT=DIAGNOSE
ProportionCovPredictError       Proportions of prediction error covariance decomposition                DECOMPOSE
ProportionCovPredictErrorbyVar  Proportions of prediction error covariance decomposition by variable    DECOMPOSE
RankTestI2                      Cointegration rank test in integrated order 2                           JOHANSEN=(IORDER=2)
RestrictMaxTest                 Cointegration rank test using the maximum eigenvalue under the restriction of a deterministic term  JOHANSEN=(TYPE=MAX) without NOINT
RestrictTraceTest               Cointegration rank test using the trace under the restriction of a deterministic term  JOHANSEN=(TYPE=TRACE) without NOINT
QuadTrend                       Quadratic trend estimates                                               TREND=QUAD
SeasonGraph                     Schematic representation of the seasonal dummies                        PRINT=ESTIMATES
SConstant                       Seasonal constant estimates                                             NSEASON=
SimpleImpulse                   Impulse response matrices                                               IMPULSE=(SIMPLE) IMPULSE=(ALL)
SimpleImpulsebyVar              Impulse response by variable                                            IMPULSE=(SIMPLE) IMPULSE=(ALL)
SimpleImpulseX                  Impulse response matrices of transfer function                          IMPULSX=(SIMPLE) IMPULSX=(ALL)
SimpleImpulseXbyVar             Impulse response of transfer function by variable                       IMPULSX=(SIMPLE) IMPULSX=(ALL)
Summary                         Simple summary statistics                                               default
SWTest                          Common trends test                                                      SW=
TraceTest                       Cointegration rank test using the trace                                 JOHANSEN=(TYPE=TRACE)
Xi                              $\xi$ coefficient matrix                                                JOHANSEN=(IORDER=2)
XLagCoef                        Dependent coefficients                                                  XLAG=
YWEstimates                     Yule-Walker estimates                                                   YW
ODS Tables Created by the GARCH Statement
ARCHCoef                  ARCH coefficients                                    Q=
GARCHCoef                 GARCH coefficients                                   P=
GARCHConstant             GARCH constant estimates                             PRINT=ESTIMATES
GARCHParameterEstimates   GARCH parameter estimates table                      default
GARCHParameterGraph       Schematic representation of the GARCH parameters     PRINT=ESTIMATES
GARCHRoots                Roots of GARCH characteristic polynomial             ROOTS
ODS Tables Created by the COINTEG Statement or the ECM option
AlphaInECM         $\alpha$ coefficients when rank=r                                   PRINT=ESTIMATES
AlphaBetaInECM     $\Pi = \alpha\beta'$ coefficients when rank=r                       PRINT=ESTIMATES
AlphaOnAlpha       $\alpha$ coefficients under the restriction of $\alpha$             J=
AlphaOnBeta        $\alpha$ coefficients under the restriction of $\beta$              H=
AlphaTestResults   Hypothesis testing of $\alpha$                                      J=
BetaInECM          $\beta$ coefficients when rank=r                                    PRINT=ESTIMATES
BetaOnBeta         $\beta$ coefficients under the restriction of $\beta$               H=
BetaOnAlpha        $\beta$ coefficients under the restriction of $\alpha$              J=
BetaTestResults    Hypothesis testing of $\beta$                                       H=
GrangerRepresent   Coefficient of Granger representation                               PRINT=ESTIMATES
HMatrix            Restriction matrix for $\beta$                                      H=
JMatrix            Restriction matrix for $\alpha$                                     J=
WeakExogeneity     Testing weak exogeneity of each dependent variable with respect to BETA  EXOGENEITY
ODS Tables Created by the CAUSAL Statement
CausalityTest   Granger causality test     default
GroupVars       Two groups of variables    default

ODS Tables Created by the RESTRICT Statement
Restrict        Restriction table          default

ODS Tables Created by the TEST Statement
Test            Wald test                  default

ODS Tables Created by the OUTPUT Statement
Forecasts       Forecasts table            without NOPRINT

Note that the ODS table names suffixed by "byVar" can be obtained with the PRINTFORM=UNIVARIATE option.
ODS Graphics
This section describes the use of ODS for creating statistical graphs with the VARMAX procedure. To request these graphs, you must specify the ODS GRAPHICS ON statement. When ODS Graphics is in effect, the VARMAX procedure produces a variety of plots for each dependent variable.

The procedure displays the following plots for each dependent variable in the MODEL statement with the PLOT= option in the VARMAX statement:

- impulse response function
- impulse response of the transfer function
- time series and predicted series
- prediction errors
- distribution of the prediction errors
- normal quantile of the prediction errors
- ACF of the prediction errors
- PACF of the prediction errors
- IACF of the prediction errors
- log scaled white noise test of the prediction errors

The procedure displays forecast plots for each dependent variable in the OUTPUT statement with the PLOT= option in the VARMAX statement.
ODS Graph Name          Plot Description                                          Statement
ErrorACFPlot            Autocorrelation function of prediction errors             MODEL
ErrorIACFPlot           Inverse autocorrelation function of prediction errors     MODEL
ErrorPACFPlot           Partial autocorrelation function of prediction errors     MODEL
ErrorDiagnosticsPanel   Diagnostics of prediction errors                          MODEL
ErrorNormalityPanel     Histogram and Q-Q plot of prediction errors               MODEL
ErrorDistribution       Distribution of prediction errors                         MODEL
ErrorQQPlot             Q-Q plot of prediction errors                             MODEL
ErrorWhiteNoisePlot     White noise test of prediction errors                     MODEL
ErrorPlot               Prediction errors                                         MODEL
ModelPlot               Time series and predicted series                          MODEL
AccumulatedIRFPanel     Accumulated impulse response function                     MODEL
AccumulatedIRFXPanel    Accumulated impulse response of transfer function         MODEL
OrthogonalIRFPanel      Orthogonalized impulse response function                  MODEL
SimpleIRFPanel          Simple impulse response function                          MODEL
SimpleIRFXPanel         Simple impulse response of transfer function              MODEL
ModelForecastsPlot      Time series and forecasts                                 OUTPUT
ForecastsOnlyPlot       Forecasts                                                 OUTPUT
Computational Issues
Computational Method
The VARMAX procedure uses numerous linear algebra routines and frequently uses the sweep operator (Goodnight 1979) and the Cholesky root (Golub and Van Loan 1983). In addition, the VARMAX procedure uses the nonlinear optimization (NLO) subsystem to perform nonlinear optimization tasks for the maximum likelihood estimation. The optimization requires intensive computation.
Convergence Problems
For some data sets, the computation algorithm can fail to converge. Nonconvergence can result from a number of causes, including flat or ridged likelihood surfaces and ill-conditioned data. If you experience convergence problems, the following points might be helpful:

- Data that contain extreme values can affect results in PROC VARMAX. Rescaling the data can improve stability.
- Changing the TECH=, MAXITER=, and MAXFUNC= options in the NLOPTIONS statement can improve the stability of the optimization process.
- Specifying a different model that might fit the data more closely can improve convergence.
Memory
Let $T$ be the length of each series, $k$ be the number of dependent variables, $p$ be the order of autoregressive terms, and $q$ be the order of moving-average terms. The number of parameters to estimate for a VARMA($p, q$) model is

$$k + (p + q)k^2 + k(k + 1)/2$$

As $k$ increases, the number of parameters to estimate increases very quickly. Furthermore, the memory requirement for VARMA($p, q$) increases quadratically as $k$ and $T$ increase. For a VARMAX($p, q, s$) model and GARCH-type multivariate conditional heteroscedasticity models, the number of parameters to estimate and the memory requirements are considerable.
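To get a feel for how quickly the parameter count grows with $k$, the formula above can be evaluated directly. The following Python sketch is a plain restatement of that formula; the function name `varma_parameter_count` is hypothetical.

```python
def varma_parameter_count(k: int, p: int, q: int) -> int:
    """Number of parameters in a VARMA(p, q) model with k dependent variables:
    k constants, (p + q) * k**2 AR and MA coefficients, and k(k + 1)/2
    distinct elements of the innovation covariance matrix."""
    return k + (p + q) * k ** 2 + k * (k + 1) // 2

for k in (2, 4, 10):
    print(k, varma_parameter_count(k, p=2, q=1))
```

For $p = 2$ and $q = 1$, moving from 2 to 10 dependent variables raises the count from 17 to 365.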
Computing Time
PROC VARMAX is computationally intensive, and execution times can be long. Extensive CPU time is often required to compute the maximum likelihood estimates.
The following statements plot the series and proceed with the VARMAX procedure.
proc varmax data=us_money;
   id date interval=qtr;
   model y1-y4 / p=2 lagmax=6 dftest
                 print=(iarr(3) estimates diagnose)
                 cointtest=(johansen=(iorder=2))
                 ecm=(rank=1 normalize=y1);
   cointeg rank=1 normalize=y1 exogeneity;
run;

This example performs the Dickey-Fuller test for stationarity, the Johansen cointegration test of integrated order 2, and the exogeneity test. The VECM(2) is fit to the data. From the outputs shown in Output 30.1.5, you can see that the series has unit roots and is cointegrated in rank 1 with integrated order 1. The fitted VECM(2) is given as
0 yt B D B @ 0 1 0 0:0408 0:0860 C B CCB 0:0052 A @ 0:0144 0:0140 0:0281 0:0022 0:0051 0:3535 0:2390 0:0223 0:0329 0:0065 0:0131 0:0010 0:0024 0:2026 0:4080 0:0312 0:0741 1
1
The $\Delta$ prefixed to a variable name implies differencing. Output 30.1.3 through Output 30.1.14 show the details. Output 30.1.3 shows the descriptive statistics.
Output 30.1.3 Descriptive Statistics
Analysis of U.S. Economic Variables

The VARMAX Procedure

Number of Observations        136
Number of Pairwise Missing      0

Simple Summary Statistics

Variable  Label
y1        log(real money stock M1)
y2        log(GNP in bil. of 1982 dollars)
y3        Discount rate on 91-day T-bills
y4        Yield on 20-year Treasury bonds

Output 30.1.4 shows the output for Dickey-Fuller tests for the nonstationarity of each series. The null hypothesis is that the series has a unit root. The null hypothesis is not rejected at the 0.05 level for any of the series, so all series are treated as having a unit root.
Output 30.1.4 Unit Root Tests
Unit Root Test

Variable  Type             Rho  Pr < Rho     Tau  Pr < Tau
y1        Zero Mean       0.05    0.6934    1.14    0.9343
          Single Mean    -2.97    0.6572   -0.76    0.8260
          Trend          -5.91    0.7454   -1.34    0.8725
y2        Zero Mean       0.13    0.7124    5.14    0.9999
          Single Mean    -0.43    0.9309   -0.79    0.8176
          Trend          -9.21    0.4787   -2.16    0.5063
y3        Zero Mean      -1.28    0.4255   -0.69    0.4182
          Single Mean    -8.86    0.1700   -2.27    0.1842
          Trend         -18.97    0.0742   -2.86    0.1803
y4        Zero Mean       0.40    0.7803    0.45    0.8100
          Single Mean    -2.79    0.6790   -1.29    0.6328
          Trend         -12.12    0.2923   -2.33    0.4170
The Johansen cointegration rank test shows whether the series are integrated of order 1 or 2, as shown in Output 30.1.5. The last two columns in Output 30.1.5 give the cointegration rank test with integrated order 1. The results indicate a cointegrated relationship with cointegration rank 1 at the 0.05 significance level: the hypothesis of rank 0 is rejected because the test statistic of 55.9633 exceeds the critical value of 47.21, whereas the test statistic of 20.6542 for rank 1 is smaller than the critical value of 29.38. Now, look at the row associated with r = 1. Compare the test statistic and critical value pairs (219.62395, 29.38), (89.21508, 15.34), and (27.32609, 3.84). Each test statistic exceeds its critical value, so there is no evidence that the series are integrated of order 2 at the 0.05 significance level.
Output 30.1.5 Cointegration Rank Test
Cointegration Rank Test for I(2)

r\k-r-s              4          3          2          1   Trace of I(1)   5% CV of I(1)
0            384.60903  214.37904                               55.9633           47.21
1                       219.62395   89.21508   27.32609        20.6542           29.38
2                                                                2.6477           15.34
3                                                                0.0149            3.84
5% CV I(2)    47.21000   29.38000   15.34000    3.84000
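The rank decision for the I(1) trace test can be retraced with a few lines of Python, used here only as a scratchpad outside SAS. This is a sketch of the usual sequential rule (pick the smallest rank whose trace statistic falls below its 5% critical value), with the statistics and critical values copied from Output 30.1.5; the helper name `select_rank` is hypothetical.

```python
def select_rank(trace_stats, critical_values):
    """Return the smallest rank r whose trace statistic is below its critical value.

    trace_stats[r] tests H0: rank = r against H1: rank > r.
    If every hypothesis is rejected, the full rank is returned.
    """
    for r, (stat, cv) in enumerate(zip(trace_stats, critical_values)):
        if stat < cv:
            return r
    return len(trace_stats)


# "Trace of I(1)" and "5% CV of I(1)" columns from Output 30.1.5.
trace = [55.9633, 20.6542, 2.6477, 0.0149]
cv = [47.21, 29.38, 15.34, 3.84]

rank = select_rank(trace, cv)
print(rank)
```

Rank 0 is rejected (55.9633 > 47.21) and rank 1 is not (20.6542 < 29.38), which matches the RANK=1 setting used in the ECM= option.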
Output 30.1.6 shows the estimates of the long-run parameter, $\beta$, and the adjustment coefficient, $\alpha$.
Alpha

Variable         1         2         3         4
y1        -0.01396   0.01396  -0.01119   0.00008
y2        -0.02811  -0.02739  -0.00032   0.00076
y3        -0.00215  -0.04967  -0.00183  -0.00072
y4         0.00510  -0.02514  -0.00220   0.00016

Xi

Variable         1         2         3         4
y1        -0.00842  -0.00052  -0.00208  -0.00250
y2         0.00141   0.00213  -0.00736  -0.00058
y3        -0.00445   0.00541  -0.00150   0.00310
y4        -0.00211  -0.00064  -0.00130   0.00197
Output 30.1.8 shows that the VECM(2) is fit to the data. The ECM=(RANK=1) option produces the estimates of the long-run parameter, $\beta$, and the adjustment coefficient, $\alpha$.
Output 30.1.9 shows the parameter estimates in terms of the constant, the lag one coefficients ($y_{t-1}$) contained in the $\alpha\beta'$ estimates, and the coefficients associated with the lag one first differences ($\Delta y_{t-1}$).
Parameter Alpha * Beta' Estimates

Variable        y1        y2        y3        y4
y1        -0.01396   0.00648  -0.20263   0.13059
y2        -0.02811   0.01306  -0.40799   0.26294
y3        -0.00215   0.00100  -0.03121   0.02011
y4         0.00510  -0.00237   0.07407  -0.04774

AR Coefficients of Differenced Lag

DIF Lag  Variable        y1        y2        y3        y4
1        y1         0.34603   0.09131  -0.35351  -0.96895
         y2         0.09936   0.03791   0.23900   0.28661
         y3         0.18118   0.07859   0.02234   0.40508
         y4         0.03222   0.04961  -0.03292   0.18568
Equation  Parameter   Estimate
D_y1      CONST1       0.04076
          AR1_1_1     -0.01396
          AR1_1_2      0.00648
          AR1_1_3     -0.20263
          AR1_1_4      0.13059
          AR2_1_1      0.34603
          AR2_1_2      0.09131
          AR2_1_3     -0.35351
          AR2_1_4     -0.96895
D_y2      CONST2       0.08595
          AR1_2_1     -0.02811
          AR1_2_2      0.01306
          AR1_2_3     -0.40799
          AR1_2_4      0.26294
          AR2_2_1      0.09936
          AR2_2_2      0.03791
          AR2_2_3      0.23900
          AR2_2_4      0.28661
D_y3      CONST3       0.00518
          AR1_3_1     -0.00215
          AR1_3_2      0.00100
          AR1_3_3     -0.03121
          AR1_3_4      0.02011
          AR2_3_1      0.18118
          AR2_3_2      0.07859
          AR2_3_3      0.02234
          AR2_3_4      0.40508
D_y4      CONST4      -0.01438
          AR1_4_1      0.00510
          AR1_4_2     -0.00237
          AR1_4_3      0.07407
          AR1_4_4     -0.04774
          AR2_4_1      0.03222
          AR2_4_2      0.04961
          AR2_4_3     -0.03292
          AR2_4_4      0.18568
Output 30.1.11 shows the innovation covariance matrix estimates, the various information criteria results, and the tests for white noise residuals. The residuals have significant correlations at lags 2 and 3. The Portmanteau test results are significant. These results suggest that a VECM(3) model might fit better than the VECM(2) model.
Information Criteria

AICC   -40.6284
HQC    -40.4343
AIC    -40.6452
SBC    -40.1262
FPEC   2.23E-18
Schematic Representation of Cross Correlations of Residuals Variable/ Lag 0 1 2 3 4 5 6 y1 y2 y3 y4 ++.. ++++ .+++ .+++ .... .... .... .... ++.. .... +.-. .... .... .... ..++ ..+. +... .... -... .... ..-.... .... .... .... .... .... ....
. is between
Portmanteau Test for Cross Correlations of Residuals

Up To Lag   DF   Chi-Square   Pr > ChiSq
3           16        53.90       <.0001
4           32        74.03       <.0001
5           48       103.08       <.0001
6           64       116.94       <.0001
Output 30.1.12 describes how well each univariate equation fits the data. The residuals for y3 and y4 depart from normality. There are no AR effects in the residuals except for y3, and no ARCH effects except for y4.
Univariate Model White Noise Diagnostics

          Durbin           Normality                  ARCH
Variable  Watson   Chi-Square  Pr > ChiSq   F Value   Pr > F
y1        2.13418        7.19      0.0275      1.62   0.2053
y2        2.04003        1.20      0.5483      1.23   0.2697
y3        1.86892      253.76      <.0001      1.78   0.1847
y4        1.98440      105.21      <.0001     21.01   <.0001

Univariate Model AR Diagnostics

                AR1               AR2               AR3               AR4
Variable  F Value  Pr > F   F Value  Pr > F   F Value  Pr > F   F Value  Pr > F
y1           0.68  0.4126      2.98  0.0542      2.01  0.1154      2.48  0.0473
y2           0.05  0.8185      0.12  0.8842      0.41  0.7453      0.30  0.8762
y3           0.56  0.4547      2.86  0.0610      4.83  0.0032      3.71  0.0069
y4           0.01  0.9340      0.16  0.8559      1.21  0.3103      0.95  0.4358
Output 30.1.14 shows whether each variable is weakly exogenous with respect to the other variables. At the 0.05 significance level, y1 is not weakly exogenous with respect to y2, y3, and y4; y2 is not weakly exogenous with respect to y1, y3, and y4; and y3 and y4 are weakly exogenous with respect to the other variables.
Output 30.1.14 Weak Exogeneity Test
Testing Weak Exogeneity of Each Variables

Variable   DF   Chi-Square   Pr > ChiSq
y1          1         6.55       0.0105
y2          1        12.54       0.0004
y3          1         0.09       0.7695
y4          1         1.81       0.1786
data use;
   set west;
   where date < '01jan79'd;
   keep date y1 y2 y3;
run;

proc varmax data=use;
   id date interval=qtr;
   model y1-y3 / p=2 dify=(1)
                 print=(decompose(6) impulse=(stderr) estimates diagnose)
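First, the differenced data is modeled as a VAR(2) with the following result:

\[
\Delta \mathbf{y}_t =
\begin{pmatrix} -0.01672 \\ 0.01577 \\ 0.01293 \end{pmatrix}
+ \begin{pmatrix}
-0.31963 & 0.14599 & 0.96122 \\
0.04393 & -0.15273 & 0.28850 \\
-0.00242 & 0.22481 & -0.26397
\end{pmatrix} \Delta \mathbf{y}_{t-1}
+ \begin{pmatrix}
-0.16055 & 0.11460 & 0.93439 \\
0.05003 & 0.01917 & -0.01020 \\
0.03388 & 0.35491 & -0.02223
\end{pmatrix} \Delta \mathbf{y}_{t-2}
+ \boldsymbol{\epsilon}_t
\]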
The parameter estimates AR1_1_2, AR1_1_3, AR2_1_2, and AR2_1_3 are insignificant, and the VARX model is fitted in the next step. The detailed output is shown in Output 30.2.1 through Output 30.2.8. Output 30.2.1 shows the descriptive statistics.
Output 30.2.1 Descriptive Statistics
Analysis of German Economic Variables
The VARMAX Procedure

Number of Observations                        75
Number of Pairwise Missing                     0
Observation(s) eliminated by differencing      1

Simple Summary Statistics

Variable  Difference    N  Label
y1        1            75  logarithm of investment
y2        1            75  logarithm of income
y3        1            75  logarithm of consumption
AR

Lag  Variable        y1        y2        y3
1    y1        -0.31963   0.14599   0.96122
     y2         0.04393  -0.15273   0.28850
     y3        -0.00242   0.22481  -0.26397
2    y1        -0.16055   0.11460   0.93439
     y2         0.05003   0.01917  -0.01020
     y3         0.03388   0.35491  -0.02223
Model Parameter Estimates

                                 Standard
Equation  Parameter   Estimate      Error  t Value  Pr > |t|  Variable
y1        CONST1      -0.01672    0.01723    -0.97    0.3352  1
          AR1_1_1     -0.31963    0.12546    -2.55    0.0132  y1(t-1)
          AR1_1_2      0.14599    0.54567     0.27    0.7899  y2(t-1)
          AR1_1_3      0.96122    0.66431     1.45    0.1526  y3(t-1)
          AR2_1_1     -0.16055    0.12491    -1.29    0.2032  y1(t-2)
          AR2_1_2      0.11460    0.53457     0.21    0.8309  y2(t-2)
          AR2_1_3      0.93439    0.66510     1.40    0.1647  y3(t-2)
y2        CONST2       0.01577    0.00437     3.60    0.0006  1
          AR1_2_1      0.04393    0.03186     1.38    0.1726  y1(t-1)
          AR1_2_2     -0.15273    0.13857    -1.10    0.2744  y2(t-1)
          AR1_2_3      0.28850    0.16870     1.71    0.0919  y3(t-1)
          AR2_2_1      0.05003    0.03172     1.58    0.1195  y1(t-2)
          AR2_2_2      0.01917    0.13575     0.14    0.8882  y2(t-2)
          AR2_2_3     -0.01020    0.16890    -0.06    0.9520  y3(t-2)
y3        CONST3       0.01293    0.00353     3.67    0.0005  1
          AR1_3_1     -0.00242    0.02568    -0.09    0.9251  y1(t-1)
          AR1_3_2      0.22481    0.11168     2.01    0.0482  y2(t-1)
          AR1_3_3     -0.26397    0.13596    -1.94    0.0565  y3(t-1)
          AR2_3_1      0.03388    0.02556     1.33    0.1896  y1(t-2)
          AR2_3_2      0.35491    0.10941     3.24    0.0019  y2(t-2)
          AR2_3_3     -0.02223    0.13612    -0.16    0.8708  y3(t-2)
Output 30.2.4 shows the innovation covariance matrix estimates, the various information criteria results, and the tests for white noise residuals. The residuals are uncorrelated except at lag 3 for the y2 variable.
Information Criteria

AICC   -24.4884
HQC    -24.2869
AIC    -24.5494
SBC    -23.8905
FPEC   2.18E-11

Cross Correlations of Residuals

Lag  Variable        y1        y2        y3
0    y1         1.00000   0.13242   0.28275
     y2         0.13242   1.00000   0.55526
     y3         0.28275   0.55526   1.00000
1    y1         0.01461  -0.00666  -0.02394
     y2        -0.01125  -0.00167  -0.04515
     y3        -0.00993  -0.06780  -0.09593
2    y1         0.07253  -0.00226  -0.01621
     y2        -0.08096  -0.01066  -0.02047
     y3        -0.02660  -0.01392  -0.02263
3    y1         0.09915   0.04484   0.05243
     y2        -0.00289   0.14059   0.25984
     y3        -0.03364   0.05374   0.05644

Schematic Representation of Cross Correlations of Residuals

Variable/Lag    0    1    2    3
y1            +.+  ...  ...  ...
y2            .++  ...  ...  ..+
y3            +++  ...  ...  ...

Portmanteau Test for Cross Correlations of Residuals

Up To Lag   DF   Chi-Square   Pr > ChiSq
3            9         9.69       0.3766
Output 30.2.5 describes how well each univariate equation fits the data. The residuals depart from normality but have no AR effects. The residuals for the y1 variable have an ARCH effect.
Output 30.2.5 Diagnostic Checks Continued
Univariate Model ANOVA Diagnostics

          Standard
Variable  Deviation
y1          0.04615
y2          0.01172
y3          0.00944

Univariate Model White Noise Diagnostics

          Durbin           Normality                  ARCH
Variable  Watson   Chi-Square  Pr > ChiSq   F Value   Pr > F
y1        1.96269       10.22      0.0060     12.39   0.0008
y2        1.98145       11.98      0.0025      0.38   0.5386
y3        2.14583       34.25      <.0001      0.10   0.7480

Univariate Model AR Diagnostics

                AR1               AR2               AR3               AR4
Variable  F Value  Pr > F   F Value  Pr > F   F Value  Pr > F   F Value  Pr > F
y1           0.01  0.9029      0.19  0.8291      0.39  0.7624      1.39  0.2481
y2           0.00  0.9883      0.00  0.9961      0.46  0.7097      0.34  0.8486
y3           0.68  0.4129      0.38  0.6861      0.30  0.8245      0.21  0.9320
Output 30.2.6 shows, in matrix format, the impulse response function and its standard errors produced by the PRINT=(IMPULSE=) option. The y3 variable in the first row is an impulse variable. The y1 variable in the first column is a response variable. The responses of y1 to an impulse in y3 at lags 1 to 3 are 0.96122, 0.41555, and -0.40789, which are decreasing.
Variable  Lag         y1        y2        y3
y1        1     -0.31963   0.14599   0.96122
          STD    0.12546   0.54567   0.66431
          2     -0.05430   0.26174   0.41555
          STD    0.12919   0.54728   0.66311
          3      0.11904   0.35283  -0.40789
          STD    0.08362   0.38489   0.47867
y2        1      0.04393  -0.15273   0.28850
          STD    0.03186   0.13857   0.16870
          2      0.02858   0.11377  -0.08820
          STD    0.03184   0.13425   0.16250
          3     -0.00884   0.07147   0.11977
          STD    0.01583   0.07914   0.09462
y3        1     -0.00242   0.22481  -0.26397
          STD    0.02568   0.11168   0.13596
          2      0.04517   0.26088   0.10998
          STD    0.02563   0.10820   0.13101
          3     -0.00055  -0.09818   0.09096
          STD    0.01646   0.07823   0.10280
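The impulse responses in Output 30.2.6 can be retraced from the VAR(2) coefficient estimates. The sketch below, in plain Python outside SAS, assumes the standard recursion for a VAR(2) in the differenced series: Psi_1 = A1, Psi_2 = A1 Psi_1 + A2, and Psi_3 = A1 Psi_2 + A2 Psi_1. The helper names `matmul` and `matadd` are hypothetical, and the standard errors are not reproduced.

```python
# AR coefficient estimates at lags 1 and 2 (rows: equations y1, y2, y3).
A1 = [[-0.31963, 0.14599, 0.96122],
      [0.04393, -0.15273, 0.28850],
      [-0.00242, 0.22481, -0.26397]]
A2 = [[-0.16055, 0.11460, 0.93439],
      [0.05003, 0.01917, -0.01020],
      [0.03388, 0.35491, -0.02223]]


def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def matadd(a, b):
    """Elementwise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


psi1 = A1
psi2 = matadd(matmul(A1, psi1), A2)
psi3 = matadd(matmul(A1, psi2), matmul(A2, psi1))

# Response of y1 (row 0) to an impulse in y3 (column 2) at lags 1 to 3.
print([round(psi[0][2], 5) for psi in (psi1, psi2, psi3)])
```

Up to rounding of the printed coefficients, the y1 row and y3 column of these matrices agree with the lag 1 to 3 values quoted in the text.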
The proportions of decomposition of the prediction error covariances of the three variables are given in Output 30.2.7. If you look at the y3 variable in the first column, the output shows that about 64.713% of the one-step-ahead prediction error covariance of the variable $y_{3t}$ is accounted for by its own innovations, about 7.995% is accounted for by $y_{1t}$ innovations, and about 27.292% is accounted for by $y_{2t}$ innovations.
The table in Output 30.2.8 gives forecasts and their prediction error covariances.
Output 30.2.8 Forecasts
Forecasts

                                    Standard       95% Confidence
Variable  Obs  Time     Forecast      Error            Limits
y1         77  1979:1    6.54027    0.04615    6.44982    6.63072
           78  1979:2    6.55105    0.05825    6.43688    6.66522
           79  1979:3    6.57217    0.06883    6.43725    6.70708
           80  1979:4    6.58452    0.08021    6.42732    6.74173
           81  1980:1    6.60193    0.09117    6.42324    6.78063
y2         77  1979:1    7.68473    0.01172    7.66176    7.70770
           78  1979:2    7.70508    0.01691    7.67193    7.73822
           79  1979:3    7.72206    0.02156    7.67980    7.76431
           80  1979:4    7.74266    0.02615    7.69140    7.79392
           81  1980:1    7.76240    0.03005    7.70350    7.82130
y3         77  1979:1    7.54024    0.00944    7.52172    7.55875
           78  1979:2    7.55489    0.01282    7.52977    7.58001
           79  1979:3    7.57472    0.01808    7.53928    7.61015
           80  1979:4    7.59344    0.02205    7.55022    7.63666
           81  1980:1    7.61232    0.02578    7.56179    7.66286
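The confidence limits in the forecast table are consistent with the usual normal-theory interval, forecast plus or minus a standard normal quantile times the standard error. The following Python sketch checks this for the first y1 forecast; the function name `confidence_limits` is hypothetical, and the normal-quantile construction is an assumption that happens to reproduce the printed limits.

```python
from statistics import NormalDist


def confidence_limits(forecast, std_error, level=0.95):
    """Two-sided normal confidence limits: forecast +/- z * std_error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    return forecast - z * std_error, forecast + z * std_error


# First y1 forecast (Obs 77, 1979:1) and its standard error from Output 30.2.8.
lower, upper = confidence_limits(6.54027, 0.04615)
print(round(lower, 5), round(upper, 5))
```

The interval widens with the forecast horizon only because the standard error grows; the multiplier stays fixed.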
Output 30.2.9 shows that you cannot reject Granger noncausality from (y2, y3) to y1 at the 0.05 significance level.
Test 1:  Group 1 Variables: y1
         Group 2 Variables: y2 y3
The following SAS statements treat the variable y1 as the exogenous variable and fit the VARX(2,1) model to the data.
proc varmax data=use;
   id date interval=qtr;
   model y2 y3 = y1 / p=2 dify=(1) difx=(1) xlag=1 lagmax=3
                      print=(estimates diagnose);
run;
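The fitted VARX(2,1) model is written as

\[
\begin{pmatrix} \Delta y_{2t} \\ \Delta y_{3t} \end{pmatrix}
= \begin{pmatrix} 0.01542 \\ 0.01319 \end{pmatrix}
+ \begin{pmatrix} 0.02520 \\ 0.05130 \end{pmatrix} \Delta y_{1t}
+ \begin{pmatrix} 0.03870 \\ 0.00363 \end{pmatrix} \Delta y_{1,t-1}
+ \begin{pmatrix} -0.12258 & 0.25811 \\ 0.24367 & -0.31809 \end{pmatrix}
  \begin{pmatrix} \Delta y_{2,t-1} \\ \Delta y_{3,t-1} \end{pmatrix}
+ \begin{pmatrix} 0.01651 & 0.03498 \\ 0.34921 & -0.01664 \end{pmatrix}
  \begin{pmatrix} \Delta y_{2,t-2} \\ \Delta y_{3,t-2} \end{pmatrix}
+ \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{pmatrix}
\]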
The detailed output is shown in Output 30.2.10 through Output 30.2.13. Output 30.2.10 shows the parameter estimates in terms of the constant, the current and the lag one coefficients of the exogenous variable, and the lag two coefficients of the dependent variables.
AR

Lag  Variable        y2        y3
1    y2        -0.12258   0.25811
     y3         0.24367  -0.31809
2    y2         0.01651   0.03498
     y3         0.34921  -0.01664

Equation  Parameter   Estimate
y2        CONST1       0.01542
          XL0_1_1      0.02520
          XL1_1_1      0.03870
          AR1_1_1     -0.12258
          AR1_1_2      0.25811
          AR2_1_1      0.01651
          AR2_1_2      0.03498
y3        CONST2       0.01319
          XL0_2_1      0.05130
          XL1_2_1      0.00363
          AR1_2_1      0.24367
          AR1_2_2     -0.31809
          AR2_2_1      0.34921
          AR2_2_2     -0.01664
Output 30.2.12 shows the innovation covariance matrix estimates, the various information criteria results, and the tests for white noise residuals. The residuals are uncorrelated except at lag 3 for the y2 variable.
Information Criteria

AICC   -18.3902
HQC    -18.2558
AIC    -18.4309
SBC    -17.9916
FPEC    9.91E-9

Cross Correlations of Residuals

Lag  Variable        y2        y3
0    y2         1.00000   0.56462
     y3         0.56462   1.00000
1    y2        -0.02312  -0.05927
     y3        -0.07056  -0.09145
2    y2        -0.02849  -0.05262
     y3        -0.05804  -0.08567
3    y2         0.16071   0.29588
     y3         0.10882   0.13002

Portmanteau Test for Cross Correlations of Residuals

Up To Lag   DF   Chi-Square   Pr > ChiSq
3            4         8.38       0.0787
Output 30.2.13 describes how well each univariate equation fits the data. The residuals depart from normality but have no ARCH or AR effects.
Univariate Model White Noise Diagnostics

          Durbin           Normality           ARCH
Variable  Watson   Chi-Square  Pr > ChiSq    Pr > F
y2        2.02413       14.54      0.0007    0.4842
y3        2.13414       32.27      <.0001    0.7782

Univariate Model AR Diagnostics

                AR1               AR2               AR3               AR4
Variable  F Value  Pr > F   F Value  Pr > F   F Value  Pr > F   F Value  Pr > F
y2           0.04  0.8448      0.04  0.9570      0.62  0.6029      0.42  0.7914
y3           0.62  0.4343      0.62  0.5383      0.72  0.5452      0.36  0.8379
/* Check if the series is nonstationary */
proc varmax data=a;
   model y1 y2 / p=1 dftest print=(roots);
run;

/* Fit VAR(1) in differencing */
proc varmax data=a;
   model y1 y2 / p=1 print=(roots) dify=(1);
run;

/* Fit VAR(1) in seasonal differencing */
proc varmax data=a;
   model y1 y2 / p=1 dify=(4) lagmax=5;
run;

/* Fit VAR(1) in both regular and seasonal differencing */
proc varmax data=a;
   model y1 y2 / p=1 dify=(1,4) lagmax=5;
run;

/* Fit VAR(1) in different differencing */
proc varmax data=a;
   model y1 y2 / p=1 dif=(y1(1,4) y2(1)) lagmax=5;
run;

/* Options related to prediction */
proc varmax data=a;
   model y1 y2 / p=1 lagmax=3
                 print=(impulse covpe(5) decompose(5));
run;

/* Options related to tentative order selection */
proc varmax data=a;
   model y1 y2 / p=1 lagmax=5 minic
                 print=(parcoef pcancorr pcorr);
run;

/* Automatic selection of the AR order */
proc varmax data=a;
   model y1 y2 / minic=(type=aic p=5);
run;

/* Compare results of LS and Yule-Walker Estimators */
proc varmax data=a;
   model y1 y2 / p=1 print=(yw);
run;

/* BVAR(1) of the nonstationary series y1 and y2 */
proc varmax data=a;
   model y1 y2 / p=1
                 prior=(lambda=1 theta=0.2 ivar);
run;

/* BVAR(1) of the nonstationary series y1 */
proc varmax data=a;
   model y1 y2 / p=1
                 prior=(lambda=0.1 theta=0.15 ivar=(y1));
run;

/* Data b Generated Process */
proc iml;
   sig = { 0.5   0.14 -0.08 -0.03,
           0.14  0.71  0.16  0.1,
          -0.08  0.16  0.65  0.23,
          -0.03  0.1   0.23  0.16};
   sig = sig * 0.0001;
   phi = { 1.2 -0.5  0.   0.1,
           0.6  0.3 -0.2  0.5,
           0.4  0.  -0.2  0.1,
          -1.0  0.2  0.7 -0.2};
   call varmasim(y,phi) sigma = sig n = 100 seed = 32567;
   cn = {y1 y2 y3 y4};
   create b from y[colname=cn];
   append from y;
quit;

/* Cointegration Rank Test using Trace statistics */
proc varmax data=b;
   model y1-y4 / p=2 lagmax=4 cointtest;
run;

/* Cointegration Rank Test using Max statistics */
proc varmax data=b;
   model y1-y4 / p=2 lagmax=4 cointtest=(johansen=(type=max));
run;

/* Common Trends Test using Filter(Differencing) statistics */
proc varmax data=b;
   model y1-y4 / p=2 lagmax=4 cointtest=(sw);
run;

/* Common Trends Test using Filter(Residual) statistics */
proc varmax data=b;
   model y1-y4 / p=2 lagmax=4 cointtest=(sw=(type=filtres lag=1));
run;

/* Common Trends Test using Kernel statistics */
proc varmax data=b;
   model y1-y4 / p=2 lagmax=4 cointtest=(sw=(type=kernel lag=1));
run;

/* Cointegration Rank Test for I(2) */
proc varmax data=b;
   model y1-y4 / p=2 lagmax=4 cointtest=(johansen=(iorder=2));
run;

/* Fit VECM(2) with rank=3 */
proc varmax data=b;
   model y1-y4 / p=2 lagmax=4 print=(roots iarr)
                 ecm=(rank=3 normalize=y1);
run;

/* Weak Exogenous Testing for each variable */
proc varmax data=b outstat=bbb;
   model y1-y4 / p=2 lagmax=4
                 ecm=(rank=3 normalize=y1);
   cointeg rank=3 exogeneity;
run;

/* Hypotheses Testing for long-run and adjustment parameter */
proc varmax data=b outstat=bbb;
   model y1-y4 / p=2 lagmax=4
                 ecm=(rank=3 normalize=y1);
   cointeg rank=3 normalize=y1
           h=(1 0 0, 0 1 0, -1 0 0, 0 0 1)
           j=(1 0 0, 0 1 0, 0 0 1, 0 0 0);
run;

/* Ordinary regression model */
proc varmax data=grunfeld;
   model y1 y2 = x1-x3;
run;

/* Ordinary regression model with subset lagged terms */
proc varmax data=grunfeld;
   model y1 y2 = x1 / xlag=(1,3);
run;

/* VARX(1,1) with no current time Exogenous Variables */
proc varmax data=grunfeld;
   model y1 y2 = x1 / p=1 xlag=1 nocurrentx;
run;

/* VARX(1,1) with different Exogenous Variables */
proc varmax data=grunfeld;
   model y1 = x3, y2 = x1 x2 / p=1 xlag=1;
run;

/* VARX(1,2) in difference with current Exogenous Variables */
proc varmax data=grunfeld;
   model y1 y2 = x1 / p=1 xlag=2 difx=(1) dify=(1);
run;
workers and its interaction with the series of masonry workers. The series and predict plots, the residual plot, and the forecast plot are created in Output 30.4.1 through Output 30.4.3. These are a selection of the plots created by the VARMAX procedure.
title "Illustration of ODS Graphics";
proc varmax data=sashelp.workers plot(unpack)=(residual model forecasts);
   id date interval=month;
   model electric masonry / dify=(1,12) noint p=1;
   output lead=12;
run;
References
Anderson, T. W. (1951), "Estimating Linear Restrictions on Regression Coefficients for Multivariate Normal Distributions," Annals of Mathematical Statistics, 22, 327-351.

Ansley, C. F. and Newbold, P. (1979), "Multivariate Partial Autocorrelations," ASA Proceedings of the Business and Economic Statistics Section, 349-353.

Bollerslev, T. (1990), "Modeling the Coherence in Short-Run Nominal Exchange Rates: A Multivariate Generalized ARCH Model," Review of Economics and Statistics, 72, 498-505.

Engle, R. F. and Granger, C. W. J. (1987), "Co-integration and Error Correction: Representation, Estimation and Testing," Econometrica, 55, 251-276.

Engle, R. F. and Kroner, K. F. (1995), "Multivariate Simultaneous Generalized ARCH," Econometric Theory, 11, 122-150.

Golub, G. H. and Van Loan, C. F. (1983), Matrix Computations, Baltimore and London: Johns Hopkins University Press.

Goodnight, J. H. (1979), "A Tutorial on the SWEEP Operator," The American Statistician, 33, 149-158.

Hosking, J. R. M. (1980), "The Multivariate Portmanteau Statistic," Journal of the American Statistical Association, 75, 602-608.

Johansen, S. (1988), "Statistical Analysis of Cointegration Vectors," Journal of Economic Dynamics and Control, 12, 231-254.

Johansen, S. (1995a), "A Statistical Analysis of Cointegration for I(2) Variables," Econometric Theory, 11, 25-59.

Johansen, S. (1995b), Likelihood-Based Inference in Cointegrated Vector Autoregressive Models, New York: Oxford University Press.

Johansen, S. and Juselius, K. (1990), "Maximum Likelihood Estimation and Inference on Cointegration: With Applications to the Demand for Money," Oxford Bulletin of Economics and Statistics, 52, 169-210.

Koreisha, S. and Pukkila, T. (1989), "Fast Linear Estimation Methods for Vector Autoregressive Moving Average Models," Journal of Time Series Analysis, 10, 325-339.

Litterman, R. B. (1986), "Forecasting with Bayesian Vector Autoregressions: Five Years of Experience," Journal of Business & Economic Statistics, 4, 25-38.

Lütkepohl, H. (1993), Introduction to Multiple Time Series Analysis, Berlin: Springer-Verlag.

Osterwald-Lenum, M. (1992), "A Note with Quantiles of the Asymptotic Distribution of the Maximum Likelihood Cointegration Rank Test Statistics," Oxford Bulletin of Economics and Statistics, 54, 461-472.

Pringle, R. M. and Rayner, D. L. (1971), Generalized Inverse Matrices with Applications to Statistics, Second Edition, New York: McGraw-Hill Inc.

Quinn, B. G. (1980), "Order Determination for a Multivariate Autoregression," Journal of the Royal Statistical Society, B, 42, 182-185.

Reinsel, G. C. (1997), Elements of Multivariate Time Series Analysis, Second Edition, New York: Springer-Verlag.

Spliid, H. (1983), "A Fast Estimation for the Vector Autoregressive Moving Average Models with Exogenous Variables," Journal of the American Statistical Association, 78, 843-849.

Stock, J. H. and Watson, M. W. (1988), "Testing for Common Trends," Journal of the American Statistical Association, 83, 1097-1107.
Chapter 31
ODS Table Names
Examples: X11 Procedure
   Example 31.1: Component Estimation - Monthly Data
   Example 31.2: Components Estimation - Quarterly Data
   Example 31.3: Outlier Detection and Removal
References
In the multiplicative model, the trend cycle component $C_t$ keeps the same scale as the original series $O_t$, while $S_t$, $D_t$, and $I_t$ vary around 1.0. In all printed tables and in the output data set, these latter components are expressed as percentages and thus vary around 100.0 (in the additive case, they vary around 0.0).

The naming convention used in PROC X11 for the tables follows the original U.S. Bureau of the Census X-11 Seasonal Adjustment program specification (Shiskin, Young, and Musgrave 1967). Also, see the section "Printed Output" on page 2074. This convention is outlined in Figure 31.1.

The tables corresponding to parts A through C are intermediate calculations. The final estimates of the individual components are found in the D tables: table D10 contains the final seasonal factors, table D12 contains the final trend cycle, and table D13 contains the final irregular series. If you are primarily interested in seasonally adjusting a series without consideration of intermediate calculations or diagnostics, you only need to look at table D11, the final seasonally adjusted series. For further details about the X-11-ARIMA tables, see Ladiray and Quenneville (2001).
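As a compact restatement of the two decompositions referred to above, the following is a sketch in standard X-11 notation. It is assumed here rather than quoted from this section, and $D_t$ is taken to be the trading-day component.

```latex
O_t = C_t \times S_t \times D_t \times I_t \quad \text{(multiplicative)}
O_t = C_t + S_t + D_t + I_t \quad \text{(additive)}
```

Under the multiplicative form, for example, a seasonal factor of 1.05 would appear in the printed tables as 105.0, whereas the trend cycle stays in the units of the original series.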
/*--- X-11 ARIMA ---*/
proc x11 data=sales;
   monthly date=date;
   var sales;
   tables d11;
run;
D11 Final Seasonally Adjusted Series

Year      JAN      FEB      MAR      APR      MAY      JUN
1978        .        .        .        .        .        .
1979  124.935  126.533  125.282  125.650  127.754  129.648
1980  128.734  139.542  143.726  143.854  148.723  144.530
1981  176.329  166.264  167.433  167.509  173.573  175.541
1982  186.747  202.467  192.024  202.761  197.548  206.344
1983  233.109  223.345  218.179  226.389  224.249  227.700
1984  238.261  239.698  246.958  242.349  244.665  247.005
1985  275.766  282.316  294.169  285.034  294.034  296.114
1986  325.471  332.228  330.401  330.282  333.792  331.349
1987  363.592  373.118  368.670  377.650  380.316  376.297
1988  370.966  384.743  386.833  405.209  380.840  389.132
1989  428.276  418.236  429.409  446.467  437.639  440.832
1990  480.631  474.669  486.137  483.140  481.111  499.169
------------------------------------------------------------
Avg   277.735  280.263  282.435  286.358  285.354  288.638

D11 Final Seasonally Adjusted Series

Year      JUL      AUG      SEP      OCT      NOV      DEC    Total
1978        .        .  123.507  125.776  124.735  129.870  503.887
1979  127.880  129.285  126.562  134.905  133.356  136.117  1547.91
1980  140.120  153.475  159.281  162.128  168.848  165.159  1798.12
1981  179.301  182.254  187.448  197.431  184.341  184.304  2141.73
1982  211.690  213.691  214.204  218.060  228.035  240.347  2513.92
1983  222.045  222.127  222.835  212.227  230.187  232.827  2695.22
1984  251.247  253.805  264.924  266.004  265.366  277.025  3037.31
1985  294.196  309.162  311.539  319.518  318.564  323.921  3604.33
1986  337.095  341.127  346.173  350.183  360.792  362.333  4081.23
1987  379.668  375.607  374.257  372.672  368.135  364.150  4474.13
1988  385.479  377.147  397.404  403.156  413.843  416.142  4710.89
1989  450.103  454.176  460.601  462.029  427.499  485.113  5340.38
1990  485.370  485.103        .        .        .        .  3875.33
---------------------------------------------------------------------
Avg   288.683  291.413  265.728  268.674  268.642  276.442

Total: 40324   Mean: 280.03   S.D.: 111.31
You can compare the original series, table B1, and the final seasonally adjusted series, table D11, by plotting them together. These tables are requested and named in the OUTPUT statement.
title 'Monthly Retail Sales Data (in $1000)';
proc x11 data=sales noprint;
   monthly date=date;
   var sales;
   output out=out b1=sales d11=adjusted;
run;
   / markers
     markerattrs=(color=red symbol=asterisk)
     lineattrs=(color=red)
     legendlabel="original" ;
   series x=date y=adjusted / markers
     markerattrs=(color=blue symbol=circle)
     lineattrs=(color=blue)
     legendlabel="adjusted" ;
   yaxis label='Original and Seasonally Adjusted Time Series';
run;
X-11-ARIMA
An inherent problem with the X-11 method is the revision of the seasonal factor estimates as new data become available. The X-11 method uses a set of centered moving averages to estimate the seasonal components. These moving averages apply symmetric weights to all observations except those at the beginning and end of the series, where asymmetric weights have to be applied. These
asymmetric weights can cause poor estimates of the seasonal factors, which then can cause large revisions when new data become available. While large revisions to seasonally adjusted values are not common, they can happen. When they do happen, they undermine the credibility of the X-11 seasonal adjustment method.

A method to address this problem was developed at Statistics Canada (Dagum 1980, 1982a). This method, known as X-11-ARIMA, applies an ARIMA model to the original data (after adjustments, if any) to forecast the series one or more years. This extended series is then seasonally adjusted, allowing symmetric weights to be applied to the end of the original data. This method was tested against a large number of Canadian economic series and was found to greatly reduce the amount of revisions as new data were added.

The X-11-ARIMA method is available in PROC X11 through the use of the ARIMA statement. The ARIMA statement extends the original series either with a user-specified ARIMA model or by an automatic selection process in which the best model from a set of five predefined ARIMA models is used.

The following example illustrates the use of the ARIMA statement. The ARIMA statement does not contain a user-specified model, so the best model is chosen by the automatic selection process. Forecasts from this best model are then used to extend the original series by one year. The following partial listing shows parameter estimates and model diagnostics for the ARIMA model chosen by the automatic selection process.
proc x11 data=sales;
   monthly date=date;
   var sales;
   arima;
run;
* Does not include log determinant

Criteria Summary for Model 2: (0,1,2)(0,1,1)s, Log Transform
   Box-Ljung Chi-square: 22.03 with 21 df Prob= 0.40
      (Criteria prob > 0.05)
   Test for over-differencing: sum of MA parameters = 0.57
      (must be < 0.90)
   MAPE - Last Three Years:    2.84 (Must be < 15.00 %)
        - Last Year:           3.04
        - Next to Last Year:   1.96
        - Third from Last Year: 3.51
Table D11 (final seasonally adjusted series) is now constructed using symmetric weights on observations at the end of the actual data. This should result in better estimates of the seasonal factors and, thus, smaller revisions in Table D11 as more data become available.
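The three acceptance criteria printed in the criteria summary can be read as a simple conjunction of threshold checks. The Python sketch below is a reading aid only: the function name `passes_selection_criteria` and its parameter names are hypothetical, and the treatment of the three-year MAPE as the plain average of the yearly values is an assumption that happens to match the printed 2.84.

```python
def passes_selection_criteria(box_ljung_prob, ma_sum, mape_3yr,
                              prob_limit=0.05, ma_sum_limit=0.90,
                              mape_limit=15.0):
    """Check the three criteria listed in the criteria summary.

    box_ljung_prob must exceed prob_limit (no evidence of lack of fit),
    ma_sum must stay below ma_sum_limit (no over-differencing),
    mape_3yr must stay below mape_limit (acceptable forecast error, in %).
    """
    return (box_ljung_prob > prob_limit
            and ma_sum < ma_sum_limit
            and mape_3yr < mape_limit)


# Values printed for Model 2: (0,1,2)(0,1,1)s, Log Transform.
yearly_mape = [3.04, 1.96, 3.51]
mape_3yr = sum(yearly_mape) / len(yearly_mape)

accepted = passes_selection_criteria(0.40, 0.57, mape_3yr)
print(round(mape_3yr, 2), accepted)
```

All three checks pass for the listed model, which is consistent with it being the model chosen by the automatic selection process.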
Either the MONTHLY or QUARTERLY statement must be specified, depending on the type of time series data you have. The PDWEIGHTS and MACURVES statements can be used only with the MONTHLY statement. The TABLES statement controls the printing of tables, while the OUTPUT statement controls the creation of the OUT= data set.
Functional Summary
The statements and options controlling the X11 procedure are summarized in the following table.

Description                                                      Statement            Option

Data Set Options
specify input data set                                           PROC X11             DATA=
write the trading-day regression results to an output data set   PROC X11             OUTTDR=
write the stable seasonality test results to an output data set  PROC X11             OUTSTB=
write table values to an output data set                         OUTPUT               OUT=
add extrapolated values to the output data set                   PROC X11             OUTEXTRAP
add year ahead estimates to the output data set                  PROC X11             YRAHEADOUT
write the sliding spans analysis results to an output data set   PROC X11             OUTSPAN=

Printing Control Options
suppress all printed output                                      PROC X11             NOPRINT
suppress all printed ARIMA output                                ARIMA                NOPRINT
print all ARIMA output                                           ARIMA                PRINTALL
print selected tables and charts                                 TABLES
print selected groups of tables                                  MONTHLY, QUARTERLY   PRINTOUT=
print selected groups of charts                                  MONTHLY, QUARTERLY   CHARTS=
print preliminary tables associated with ARIMA processing        ARIMA                PRINTFP
specify number of decimals for printed tables                    MONTHLY, QUARTERLY   NDEC=
suppress all printed SSPAN output                                SSPAN                NOPRINT
print all SSPAN output                                           SSPAN                PRINTALL

Date Information Options
specify a SAS date variable                                      MONTHLY, QUARTERLY   DATE=
specify the beginning date                                       MONTHLY, QUARTERLY   START=
specify the ending date                                          MONTHLY, QUARTERLY   END=
specify beginning year for trading-day regression                MONTHLY              TDCOMPUTE=

Declaring the Role of Variables
specify BY-group processing                                      BY
specify the variables to be seasonally adjusted                  VAR
specify identifying variables                                    ID
specify the prior monthly factor                                 MONTHLY              PMFACTOR=

Controlling the Table Computations
use additive adjustment                                          MONTHLY, QUARTERLY   ADDITIVE
specify seasonal factor moving average length                    MACURVES
specify the extreme value limit for trading-day regression       MONTHLY              EXCLUDE=
specify the lower bound for extreme irregulars                   MONTHLY, QUARTERLY   FULLWEIGHT=
specify the upper bound for extreme irregulars                   MONTHLY, QUARTERLY   ZEROWEIGHT=
include the length-of-month in trading-day regression            MONTHLY              LENGTH
specify trading-day regression action                            MONTHLY              TDREGR=
compute summary measure only                                     MONTHLY, QUARTERLY   SUMMARY
modify extreme irregulars prior to trend cycle estimation        MONTHLY, QUARTERLY   TRENDADJ
specify moving average length in trend cycle estimation          MONTHLY, QUARTERLY   TRENDMA=
specify weights for prior trading-day factors                    PDWEIGHTS
DATA= SAS-data-set
specifies the input SAS data set used. If it is omitted, the most recently created SAS data set is used.
OUTEXTRAP
adds the extra observations used in ARIMA processing to the output data set. When ARIMA forecasting/backcasting is requested, extra observations are appended to the ends of the series, and the calculations are carried out on this extended series. The appended observations are not normally written to the OUT= data set. However, if OUTEXTRAP is specified, these extra observations are written to the output data set. If a DATE= variable is specified in the MONTHLY/QUARTERLY statement, the date variable is extrapolated to identify forecasts/backcasts. The OUTEXTRAP option can be abbreviated as OUTEX.
NOPRINT
suppresses any printed output. The NOPRINT option overrides any PRINTOUT=, CHARTS=, or TABLES statement and any output associated with the ARIMA statement.
OUTSPAN= SAS-data-set
specifies the output data set to store the sliding spans analysis results. Tables A1, C18, D10, and D11 for each span are written to this data set. See the section "The OUTSPAN= Data Set" on page 2071 for details.
OUTSTB= SAS-data-set
specifies the output data set to store the stable seasonality test results (table D8). All the information in the analysis of variance table associated with the stable seasonality test is contained in the variables written to this data set. See the section "OUTSTB= Data Set" on page 2071 for details.
OUTTDR= SAS-data-set
specifies the output data set to store the trading-day regression results (tables B15 and C15). All the information in the analysis of variance table associated with the trading-day regression is contained in the variables written to this data set. This option is valid only when TDREGR=PRINT, TEST, or ADJUST is specified in the MONTHLY statement. See the section "OUTTDR= Data Set" on page 2072 for details.
YRAHEADOUT
adds one-year-ahead forecast values to the output data set for tables C16, C18, and D10. The original purpose of this option was to avoid recomputation of the seasonal adjustment factors when new data became available. While computing costs were an important factor when the X-11 method was developed, this is no longer the case and this option is obsolete. See the section The YRAHEADOUT Option on page 2067 for details.
ARIMA Statement
ARIMA options ;
The ARIMA statement applies the X-11-ARIMA method to the series specified in the VAR statement. This method uses an ARIMA model estimated from the original data to extend the series one or more years. The ARIMA statement options control the ARIMA model used and the estimation, forecasting, and printing of this model. There are two ways of obtaining an ARIMA model to extend the series. A model can be given explicitly with the MODEL= and TRANSFORM= options. Alternatively, the best-fitting model from a set of five predefined models is found automatically whenever the MODEL= option is absent. See the section "Details of Model Selection" on page 2068 for details.
BACKCAST= n
specifies the number of years to backcast the series. The default is BACKCAST=0. See the section "Effect of Backcast and Forecast Length" on page 2067 for details.
CHICR= value
specifies the significance level used as the criterion for the Box-Ljung chi-square test for lack of fit when testing the five predefined models. The default is CHICR=0.05. The CHICR= option values must be between 0.01 and 0.90. The hypothesis being tested is that of model adequacy. Nonrejection of the hypothesis is evidence for an adequate model. Making the CHICR= value smaller makes it easier to accept the model. See the section "Criteria Details" on page 2069 for further details on the CHICR= option.
CONVERGE= value
specifies the convergence criterion for the estimation of an ARIMA model. The default value is 0.001. The CONVERGE= value must be positive.
FORECAST= n
specifies the number of years to forecast the series. The default is FORECAST=1. See the section "Effect of Backcast and Forecast Length" on page 2067 for details.
MAPECR= value
specifies the criterion for the mean absolute percent error (MAPE) when testing the five predefined models. A small MAPE value is evidence for an adequate model; a large MAPE value results in the model being rejected. The MAPECR= value is the boundary for acceptance/rejection. Thus a larger MAPECR= value would make it easier for a model to pass the criterion. The default is MAPECR=15. The MAPECR= option values must be between 1 and 100. See the section "Criteria Details" on page 2069 for further details on the MAPECR= option.
MAXITER= n
specifies the maximum number of iterations in the estimation process. MAXITER must be between 1 and 60; the default value is 15.
METHOD= CLS | ULS | ML
specifies the estimation method. ML requests maximum likelihood, ULS requests unconditional least squares, and CLS requests conditional least squares. METHOD=CLS is the default. The maximum likelihood estimates are more expensive to compute than the conditional least squares estimates. In some cases, however, they can be preferable. For further information on the estimation methods, see "Estimation Details" on page 248 in Chapter 7, "The ARIMA Procedure."
MODEL= ( P=n1 Q=n2 SP=n3 SQ=n4 DIF=n5 SDIF=n6 < NOINT > < CENTER >)
specifies the ARIMA model. The AR and MA orders are given by P=n1 and Q=n2, respectively, while the seasonal AR and MA orders are given by SP=n3 and SQ=n4, respectively. The lag corresponding to seasonality is determined by the MONTHLY or QUARTERLY statement. Similarly, differencing and seasonal differencing are given by DIF=n5 and SDIF=n6, respectively. For example,
arima model=( p=2 q=1 sp=1 dif=1 sdif=1 );
specifies a $(2,1,1)(1,1,0)_s$ model, where s, the seasonality, is either 12 (monthly) or 4 (quarterly). More examples of the MODEL= syntax are given in the section "Details of Model Selection" on page 2068.
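The correspondence between the MODEL= keywords and the (p,d,q)(P,D,Q)s notation can be sketched in a few lines of Python. This is only an illustration of the naming convention described above; the function `arima_notation` and its `seasonality` argument are hypothetical and are not part of PROC X11.

```python
def arima_notation(p=0, q=0, sp=0, sq=0, dif=0, sdif=0, seasonality=12):
    """Format MODEL= values as a (p,d,q)(P,D,Q)s label.

    p/q are the AR and MA orders, sp/sq the seasonal AR and MA orders,
    dif/sdif the regular and seasonal differencing orders.
    seasonality is 12 for a MONTHLY series and 4 for a QUARTERLY series.
    """
    if seasonality not in (4, 12):
        raise ValueError("seasonality must be 12 (monthly) or 4 (quarterly)")
    return f"({p},{dif},{q})({sp},{sdif},{sq}){seasonality}"


# model=( p=2 q=1 sp=1 dif=1 sdif=1 ) on a monthly series
label = arima_notation(p=2, q=1, sp=1, dif=1, sdif=1, seasonality=12)
print(label)
```

Orders that are not given in the MODEL= list are taken as 0 in this sketch, which is why SQ does not appear in the example statement but shows up as 0 in the label.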
NOINT
suppresses the fitting of a constant (or intercept) parameter in the model. (That is, the parameter is omitted.)
CENTER
centers each time series by subtracting its sample mean. The analysis is done on the centered data. Later, when forecasts are generated, the mean is added back. Note that centering is done after differencing. The CENTER option is normally used in conjunction with the NOINT option. For example, to fit an AR(1) model on the centered data without an intercept, use the following ARIMA statement:
arima model=( p=1 center noint );
NOPRINT
suppresses the normal printout generated by the ARIMA statement. Note that the effect of specifying the NOPRINT option in the ARIMA statement is different from the effect of specifying the NOPRINT in the PROC X11 statement, since the former only affects ARIMA output.
OVDIFCR= value
specifies the criterion for the over-differencing test when testing the five predefined models. When the MA parameters in one of these models sum to a number close to 1.0, this is an indication of over-parameterization and the model is rejected. The OVDIFCR= value is the boundary for this rejection; values greater than this value fail the over-differencing test. A larger OVDIFCR= value would make it easier for a model to pass the criterion. The default is OVDIFCR=0.90. The OVDIFCR= option values must be between 0.80 and 0.99. See the section "Criteria Details" on page 2069 for further details on the OVDIFCR= option.
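The way the three criteria (CHICR=, MAPECR=, and OVDIFCR=) act as accept/reject boundaries can be sketched as follows. This is a rough illustration based only on the option descriptions above, not the procedure's actual logic: the function name, its arguments, and the treatment of values exactly on a boundary are assumptions, and the precise definitions of the test statistics are in the "Criteria Details" section.

```python
def passes_selection_criteria(chisq_pvalue, mape, ma_sum,
                              chicr=0.05, mapecr=15.0, ovdifcr=0.90):
    """Return True if a candidate model passes all three criteria.

    chisq_pvalue: p-value of the Box-Ljung lack-of-fit test (hypothetical input)
    mape:         mean absolute percent error of the model's forecasts
    ma_sum:       sum of the MA parameters used in the over-differencing test
    """
    # Valid ranges stated in the option descriptions.
    if not 0.01 <= chicr <= 0.90:
        raise ValueError("CHICR= must be between 0.01 and 0.90")
    if not 1 <= mapecr <= 100:
        raise ValueError("MAPECR= must be between 1 and 100")
    if not 0.80 <= ovdifcr <= 0.99:
        raise ValueError("OVDIFCR= must be between 0.80 and 0.99")

    fit_ok = chisq_pvalue > chicr      # nonrejection of model adequacy
    mape_ok = mape < mapecr            # small MAPE is evidence of adequacy
    overdiff_ok = ma_sum <= ovdifcr    # sums above the boundary fail (boundary case assumed to pass)
    return fit_ok and mape_ok and overdiff_ok


print(passes_selection_criteria(0.30, 8.0, 0.50))               # passes with the defaults
print(passes_selection_criteria(0.02, 8.0, 0.50))               # rejected at CHICR=0.05
print(passes_selection_criteria(0.02, 8.0, 0.50, chicr=0.01))   # accepted with a smaller CHICR=
```

The last two calls show the direction of each boundary: lowering CHICR=, or raising MAPECR= or OVDIFCR=, makes a model easier to accept.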
PRINTALL
provides the same output as the default printing for all models fit and, in addition, prints an estimation summary and chi-square statistics for each model fit. See "Printed Output" on page 2074 for details.
PRINTFP
prints the results for the initial pass of X11 made to exclude trading-day effects. This option has an effect only when the TDREGR= option specifies ADJUST, TEST, or PRINT. In these cases, an initial pass of the standard X11 method is required to get rid of calendar effects before doing any ARIMA estimation. Usually this first pass is not of interest, and by default no tables are printed. However, specifying PRINTFP in the ARIMA statement causes any tables printed in the final pass to also be printed for this initial pass.
The ARIMA statement in PROC X11 allows certain transformations on the series before estimation. The specified transformation is applied only to a user-specified model. If TRANSFORM= is specified and the MODEL= option is not specified, the transformation request is ignored and a warning is printed. The LOG transformation requests that the natural log of the series be used for estimation. The resulting forecast values are transformed back to the original scale. A general power transformation of the form $X_t \rightarrow (X_t + a)^b$ is obtained by specifying
transform= ( a ** b )
If the constant a is not specified, it is assumed to be zero. The specified ARIMA model is then estimated using the transformed series. The resulting forecast values are transformed back to the original scale.
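A minimal Python sketch of the power transformation and its inverse is shown here to make the round trip concrete. It assumes that "transformed back to the original scale" means plain algebraic inversion; whether the procedure applies any further correction is not stated in this section. The function names are hypothetical.

```python
import math


def power_transform(x, a=0.0, b=1.0):
    """Apply X -> (X + a)**b; a defaults to zero as in TRANSFORM=( a ** b )."""
    return (x + a) ** b


def inverse_power_transform(y, a=0.0, b=1.0):
    """Undo the power transformation (assumes b != 0 and y >= 0)."""
    return y ** (1.0 / b) - a


def log_transform(x):
    """Natural log, as requested by the LOG transformation (assumes x > 0)."""
    return math.log(x)


def inverse_log_transform(y):
    """Map a value on the log scale back to the original scale."""
    return math.exp(y)


# Square-root-type transformation with a = 1, b = 0.5
y = power_transform(100.0, a=1.0, b=0.5)
x_back = inverse_power_transform(y, a=1.0, b=0.5)
print(y, x_back)
```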
BY Statement
BY variables ;
A BY statement can be used with PROC X11 to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input DATA= data set to be sorted in order of the BY variables.
ID Statement
ID variables ;
If you are creating an output data set, use the ID statement to put values of the ID variables, in addition to the table values, into the output data set. The ID statement has no effect when an output data set is not created. If the DATE= variable is specified in the MONTHLY or QUARTERLY statement, this variable is included automatically in the OUTPUT data set. If no DATE= variable is specified, the variable _DATE_ is added. The date variable (or _DATE_) values outside the range of the actual data (from ARIMA forecasting or backcasting, or from YRAHEADOUT) are extrapolated, while all other ID variables are missing.
MACURVES Statement
MACURVES month=option . . . ;
The MACURVES statement specifies the length of the moving-average curves for estimating the seasonal factors for any month. This statement can be used only with monthly time series data. The month=option specifications consist of the month name (or the first three letters of the month name), an equal sign, and one of the following option values:
3        specifies a three-term moving average for the month
3X3      specifies a three-by-three moving average
3X5      specifies a three-by-five moving average
3X9      specifies a three-by-nine moving average
STABLE   specifies a stable seasonal factor (average of all values for the month)
For example, the statement
macurves jan=3 feb=3x3 mar=3x5 apr=3x9;
uses a three-term moving average to estimate seasonal factors for January, a 3×3 (a three-term moving average of a three-term moving average) for February, a 3×5 (a three-term moving average of a five-term moving average) for March, and a 3×9 (a three-term moving average of a nine-term moving average) for April. The numeric values used for the weights of the various moving averages and a discussion of the derivation of these weights are given in Shiskin, Young, and Musgrave (1967). A general discussion of moving average weights is given in Dagum (1985). If the specification for a month is omitted, the X11 procedure uses a three-by-three moving average for the first estimate of each iteration and a three-by-five average for the second estimate.
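Because an m-by-n moving average is a moving average of a moving average, its central weights are the convolution of two sets of equal weights. The Python sketch below computes these composite weights. It covers only the symmetric weights used in the interior of a series; the weights applied near the ends of the series are given in Shiskin, Young, and Musgrave (1967) and are not reproduced here. The function names are hypothetical.

```python
def simple_ma(n):
    """Equal weights of an n-term moving average."""
    return [1.0 / n] * n


def convolve(a, b):
    """Discrete convolution of two weight lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def composite_weights(m, n):
    """Central weights of an m-term moving average of an n-term moving average."""
    return convolve(simple_ma(m), simple_ma(n))


w33 = composite_weights(3, 3)   # proportional to 1, 2, 3, 2, 1 (divided by 9)
w35 = composite_weights(3, 5)   # proportional to 1, 2, 3, 3, 3, 2, 1 (divided by 15)
print([round(w * 9) for w in w33])
print([round(w * 15) for w in w35])
```

A 3×3 average therefore spans five values of the same month in consecutive years, a 3×5 spans seven, and a 3×9 spans eleven, which is why longer curves give smoother, more slowly changing seasonal factors.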
MONTHLY Statement
MONTHLY options ;
The MONTHLY statement must be used when the input data to PROC X11 are a monthly time series. The MONTHLY statement specifies options that determine the computations performed by PROC X11 and what is included in its output. Either the DATE= or START= option must be used. The following options can appear in the MONTHLY statement.
ADDITIVE
performs additive adjustments. If the ADDITIVE option is omitted, PROC X11 performs multiplicative adjustments.
CHARTS= STANDARD | FULL | NONE
specifies the charts produced by the procedure. The default is CHARTS=STANDARD, which specifies 12 monthly seasonal charts and a trend cycle chart. If you specify CHARTS=FULL (or CHARTS=ALL), the procedure prints additional charts of irregular and seasonal factors. To print no charts, specify CHARTS=NONE. The TABLES statement can also be used to specify particular monthly charts to be printed. If no CHARTS= option is given, and a TABLES statement is given, the TABLES statement overrides the default value of CHARTS=STANDARD; that is, no charts (or tables) are printed except those specified in the TABLES statement. However, if both the CHARTS= option and a TABLES statement are given, the charts corresponding to the CHARTS= option and those requested by the TABLES statement are printed. For example, suppose you wanted only charts G1, the final seasonally adjusted series and trend cycle, and G4, the final irregular and final modified irregular series. You would specify the following statements:
monthly date=date; tables g1 g4;
DATE= variable
specifies a variable that gives the date for each observation. The starting and ending dates are obtained from the first and last values of the DATE= variable, which must contain SAS date values. The procedure checks values of the DATE= variable to ensure that the input observations are sequenced correctly. This variable is automatically added to the OUT= data set if one is requested and extrapolated if necessary. If the DATE= option is not specified, the START= option must be specified. The DATE= option and the START= and END= options can be used in combination to subset a series for processing. For example, suppose you have 12 years of monthly data (144 observations, no missing values) beginning in January 1970 and ending in December 1981, and you wanted to seasonally adjust only six years beginning in January 1974. Specifying
monthly date=date start=jan1974 end=dec1979;
would seasonally adjust only this subset of the data. If instead you wanted to adjust the last eight years of data, only the START= option is needed:
monthly date=date start=jan1974;
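The effect of START= and END= on the example series can be checked with a short Python sketch. It only mimics the subsetting arithmetic (which observations fall inside the requested window); the helper names are hypothetical and nothing here calls SAS.

```python
from datetime import date


def month_range(start, end):
    """Yield first-of-month dates from start through end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def subset(dates, start=None, end=None):
    """Keep the dates on or after start and on or before end."""
    return [d for d in dates
            if (start is None or d >= start) and (end is None or d <= end)]


# 12 years of monthly data, January 1970 through December 1981
series_dates = list(month_range(date(1970, 1, 1), date(1981, 12, 1)))

# start=jan1974 end=dec1979: six years
six_years = subset(series_dates, date(1974, 1, 1), date(1979, 12, 1))

# start=jan1974 only: the last eight years
last_eight = subset(series_dates, start=date(1974, 1, 1))

print(len(series_dates), len(six_years), len(last_eight))
```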
END= mmmyyyy
specifies that only the part of the input series ending with the month and year given be adjusted (for example, END=DEC1970). See the DATE=variable option for using the START= and END= options to subset a series for processing.
EXCLUDE= value
excludes from the trading-day regression any irregular values that are more than value standard deviations from the mean. The EXCLUDE=value must be between 0.1 and 9.9, with the default value being 2.5.
FULLWEIGHT= value
assigns weights to irregular values based on their distance from the mean in standard deviation units. The weights are used for estimating seasonal and trend cycle components. Irregular values less than the FULLWEIGHT= value (in standard deviation units) are assigned full weights of 1, values that fall between the ZEROWEIGHT= and FULLWEIGHT= limits are assigned weights linearly graduated between 0 and 1, and values greater than the ZEROWEIGHT= limit are assigned a weight of 0. For example, if ZEROWEIGHT=2 and FULLWEIGHT=1, a value 1.3 standard deviations from the mean would be assigned a graduated weight. The FULLWEIGHT=value must be between 0.1 and 9.9 but must be less than the ZEROWEIGHT=value. The default is FULLWEIGHT=1.5.
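The weighting rule described for FULLWEIGHT= and ZEROWEIGHT= amounts to a piecewise-linear function of the distance from the mean. The Python sketch below is an illustration under the assumption that "linearly graduated" means straight-line interpolation between the two limits; the function name is hypothetical.

```python
def irregular_weight(z, fullweight=1.5, zeroweight=2.5):
    """Weight for an irregular value z standard deviations from the mean.

    Full weight (1) up to FULLWEIGHT=, zero weight beyond ZEROWEIGHT=,
    and a linearly graduated weight in between.
    """
    if not (0.1 <= fullweight < zeroweight <= 9.9):
        raise ValueError("need 0.1 <= FULLWEIGHT < ZEROWEIGHT <= 9.9")
    distance = abs(z)
    if distance <= fullweight:
        return 1.0
    if distance >= zeroweight:
        return 0.0
    return (zeroweight - distance) / (zeroweight - fullweight)


# The example from the text: ZEROWEIGHT=2, FULLWEIGHT=1, value 1.3
# standard deviations from the mean receives a graduated weight.
w = irregular_weight(1.3, fullweight=1.0, zeroweight=2.0)
print(round(w, 3))
```

Under this interpolation the value at 1.3 standard deviations gets a weight of about 0.7, so it still contributes to the seasonal and trend cycle estimates but less than an unremarkable value would.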
LENGTH
includes length-of-month allowance in computing trading-day factors. If this option is omitted, length-of-month allowances are included with the seasonal factors.
NDEC= n
specifies the number of decimal places shown in the printed tables in the listing. This option has no effect on the precision of the variable values in the output data set.
PMFACTOR= variable
specifies a variable containing the prior monthly factors. Use this option if you have previous knowledge of monthly adjustment factors. The PMFACTOR= option can be used to make the following adjustments:
- adjust the level of all or part of a series with discontinuities
- adjust for the influence of holidays that fall on different dates from year to year, such as the effect of Easter on certain retail sales
- adjust for unreasonable weather influence on series, such as housing starts
- adjust for changing starting dates of fiscal years (for budget series) or model years (for automobiles)
- adjust for temporary dislocating events, such as strikes
See the section "Prior Daily Weights and Trading-Day Regression" on page 2065 for details and examples using the PMFACTOR= option.
PRINTOUT= STANDARD | LONG | FULL | NONE
specifies the tables to be printed by the procedure. If the PRINTOUT=STANDARD option is specified, between 17 and 27 tables are printed, depending on the other options that are specified. PRINTOUT=LONG prints between 27 and 39 tables, and PRINTOUT=FULL prints between 44 and 59 tables. Specifying PRINTOUT=NONE results in no tables being printed; however, charts are still printed. The default is PRINTOUT=STANDARD. The TABLES statement can also be used to specify particular monthly tables to be printed. If no PRINTOUT= option is specified, and a TABLES statement is given, the TABLES statement overrides the default value of PRINTOUT=STANDARD; that is, no tables (or charts) are printed except those given in the TABLES statement. However, if both the PRINTOUT= option and a TABLES statement are specified, the tables corresponding to the PRINTOUT= option and those requested by the TABLES statement are printed.
START= mmmyyyy
adjusts only the part of the input series starting with the specified month and year. When the DATE= option is not used, the START= option gives the year and month of the first input observation (for example, START=JAN1966). START= must be specified if DATE= is not given. If START= is specified (and no DATE= option is given), and an OUT= data set is requested, a variable named _DATE_ is added to the data set, giving the date value for each observation. See the DATE= variable option for using the START= and END= options to subset a series.
SUMMARY
specifies that the data are already seasonally adjusted and the procedure is to produce summary measures. If the SUMMARY option is omitted, the X11 procedure performs seasonal adjustment of the input data before calculating summary measures.
TDCOMPUTE= year
uses the part of the input series beginning with January of the specified year to derive trading-day weights. If this option is omitted, the entire series is used.
TDREGR= NONE | PRINT | ADJUST | TEST
specifies the treatment of trading-day regression. TDREGR=NONE omits the computation of the trading-day regression. TDREGR=PRINT computes and prints the trading-day regressions but does not adjust the series. TDREGR=ADJUST computes and prints the trading-day regression and adjusts the irregular components to obtain preliminary weights. TDREGR=TEST adjusts the final series if the trading-day regression estimates explain significant variation on the basis of an F test (or residual trading-day variation if prior weights are used). The default is TDREGR=NONE. See the section "Prior Daily Weights and Trading-Day Regression" on page 2065 for details and examples using the TDREGR= option. If ARIMA processing is requested, any value of TDREGR= other than the default TDREGR=NONE causes PROC X11 to perform an initial pass (see the section "Details: X11 Procedure" on page 2056 and the PRINTFP option). The significance level reported in Table C15 should be viewed with caution. The dependent variable in the trading-day regression is the irregular component formed by an averaging operation. This induces a correlation in the dependent variable and hence in the residuals from which the F test is computed. Hence the distribution of the trading-day regression F statistics differs from an exact F; see Cleveland and Devlin (1980) for details.
TRENDADJ
modifies extreme irregular values prior to computing the trend cycle estimates in the first iteration. If the TRENDADJ option is omitted, the trend cycle is computed without modifications for extremes.
TRENDMA= 9 | 13 | 23
specifies the number of terms in the moving average to be used by the procedure in estimating the variable trend cycle component. The value of the TRENDMA= option must be 9, 13, or 23. If the TRENDMA= option is omitted, the procedure selects an appropriate moving average. For information about the number of terms in the moving average, see Shiskin, Young, and Musgrave (1967).
ZEROWEIGHT= value
assigns weights to irregular values based on their distance from the mean in standard deviation units. The weights are used for estimating seasonal and trend cycle components. Irregular values beyond the standard deviation limit specified in the ZEROWEIGHT= option are assigned zero weights. Values that fall between the two limits (ZEROWEIGHT= and FULLWEIGHT=) are assigned weights linearly graduated between 0 and 1. For example, if ZEROWEIGHT=2 and FULLWEIGHT=1, a value 1.3 standard deviations from the mean would be assigned a graduated weight. The ZEROWEIGHT=value must be between 0.1 and 9.9 but must be greater than the FULLWEIGHT=value. The default is ZEROWEIGHT=2.5. The ZEROWEIGHT= option can be used in conjunction with the FULLWEIGHT= option to adjust outliers from a monthly or quarterly series. See Example 31.3 later in this chapter for an illustration of this use.
OUTPUT Statement
OUTPUT OUT= SAS-data-set tablename=var1 var2 . . . ;
The OUTPUT statement creates an output data set containing specified tables. The data set is named by the OUT= option.
OUT= SAS-data-set
If OUT= is omitted, the SAS System names the new data set by using the DATAn convention. For each table to be included in the output data set, write the X11 table identification keyword, an equal sign, and a list of new variable names:
tablename = var1 var2 ...
The tablename keywords that can be used in the OUTPUT statement are listed in the section "Printed Output" on page 2074. The following is an example of a VAR statement and an OUTPUT statement:
var z1 z2 z3; output out=out_x11 b1=s d11=w x y;
The variable s contains the table B1 values for the variable z1, while the table D11 values for variables z1, z2, and z3 are contained in variables w, x, and y, respectively. As this example shows, the list of variables following a tablename= keyword can be shorter than the VAR variable list. In addition to the variables named by tablename=var1 var2 ..., the ID variables, and BY variables, the output data set contains a date identifier variable. If the DATE= option is given in the MONTHLY or QUARTERLY statement, the DATE= variable is the date identifier. If no DATE= option is given, a variable named _DATE_ is the date identifier.
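The positional pairing between the VAR list and each tablename= list can be sketched in Python as follows. This is only an illustration of the naming rule in the example above; the function `map_output_variables` is hypothetical, and raising an error when a list is longer than the VAR list is an assumption rather than documented behavior.

```python
def map_output_variables(var_list, table_specs):
    """Pair each new output variable name with a VAR variable, by position.

    var_list:    names from the VAR statement, in order
    table_specs: {tablename: [new variable names]} from the OUTPUT statement
    """
    mapping = {}
    for table, new_names in table_specs.items():
        if len(new_names) > len(var_list):
            # Assumed handling; the text only says the list can be shorter.
            raise ValueError(f"too many names given for table {table}")
        # zip stops at the shorter list, so extra VAR variables get no output
        mapping[table] = dict(zip(var_list, new_names))
    return mapping


# var z1 z2 z3;  output out=out_x11 b1=s d11=w x y;
result = map_output_variables(["z1", "z2", "z3"],
                              {"b1": ["s"], "d11": ["w", "x", "y"]})
print(result)
```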
PDWEIGHTS Statement
PDWEIGHTS day=w . . . ;
The PDWEIGHTS statement can be used to specify one to seven daily weights. The statement can be used only with monthly series that are seasonally adjusted using the multiplicative model. These weights are used to compute prior trading-day factors, which are then used to adjust the original series prior to the seasonal adjustment process. Only relative weights are needed; the X11 procedure adjusts the weights so that they sum to 7.0. The weights can also be corrected by the procedure on the basis of estimates of trading-day variation from the input data. See the section "Prior Daily Weights and Trading-Day Regression" on page 2065 for details and examples using the PDWEIGHTS statement. Each day=w option specifies a weight (w) for the named day. The day can be any day, Sunday through Saturday. The day keyword can be the full spelling of the day or the three-letter abbreviation. For example, SATURDAY=1.0 and SAT=1.0 are both valid. Each weight w must be a numeric value between 0.0 and 10.0. The following is an example of a PDWEIGHTS statement:
pdweights sun=.2 mon=.9 tue=1 wed=1 thu=1 fri=.8 sat=.3;
Any number of days can be specified with one PDWEIGHTS statement. The default weight value for any day that is not specified is 0. If you do not use a PDWEIGHTS statement, the program computes daily weights if TDREGR=ADJUST is specified. See Shiskin, Young, and Musgrave (1967) for details.
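The statement that only relative weights are needed can be illustrated with a short Python sketch that rescales a set of daily weights to sum to 7.0. Proportional rescaling is an assumption about how the adjustment is done, and the function name is hypothetical; the defaults (unspecified days get 0, weights between 0.0 and 10.0) follow the description above.

```python
DAYS = ("sun", "mon", "tue", "wed", "thu", "fri", "sat")


def normalize_pdweights(**weights):
    """Rescale daily weights proportionally so that they sum to 7.0."""
    for day, w in weights.items():
        if day not in DAYS:
            raise ValueError(f"unknown day keyword: {day}")
        if not 0.0 <= w <= 10.0:
            raise ValueError("each weight must be between 0.0 and 10.0")
    # Days that are not specified default to a weight of 0.
    full = {day: float(weights.get(day, 0.0)) for day in DAYS}
    total = sum(full.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {day: 7.0 * w / total for day, w in full.items()}


# pdweights sun=.2 mon=.9 tue=1 wed=1 thu=1 fri=.8 sat=.3;
scaled = normalize_pdweights(sun=0.2, mon=0.9, tue=1, wed=1, thu=1, fri=0.8, sat=0.3)
print({day: round(w, 3) for day, w in scaled.items()})
```

The original weights sum to 5.2, so every weight is multiplied by 7/5.2 in this sketch; the ratios between days are unchanged.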
QUARTERLY Statement
QUARTERLY options ;
The QUARTERLY statement must be used when the input data are quarterly time series. This statement includes options that determine the computations performed by the procedure and what is in the printed output. The DATE= option or the START= option must be used. The following options can appear in the QUARTERLY statement.
ADDITIVE
performs additive adjustments. If this option is omitted, the procedure performs multiplicative adjustments.
CHARTS= STANDARD | FULL | NONE
specifies the charts to be produced by the procedure. The default value is CHARTS=STANDARD, which specifies four quarterly seasonal charts and a trend cycle chart. If you specify CHARTS=FULL (or CHARTS=ALL), the procedure prints additional charts of irregular and seasonal factors. To print no charts, specify CHARTS=NONE. The TABLES statement can also be used to specify particular charts to be printed. The presence of a TABLES statement overrides the default value of CHARTS=STANDARD; that is, if a TABLES statement is specified, and no CHARTS= option is specified, no charts (or tables) are printed except those given in the TABLES statement. However, if both the CHARTS= option and a TABLES statement are given, the charts corresponding to the CHARTS= option and those requested by the TABLES statement are printed. For example, suppose you wanted only charts G1, the final seasonally adjusted series and trend cycle, and G4, the final irregular and final modified irregular series. This is accomplished by specifying the following statements:
quarterly date=date; tables g1 g4;
DATE= variable
specifies a variable that gives the date for each observation. The starting and ending dates are obtained from the first and last values of the DATE= variable, which must contain SAS date values. The procedure checks values of the DATE= variable to ensure that the input observations are sequenced correctly. This variable is automatically added to the OUT= data set if one is requested, and extrapolated if necessary. If the DATE= option is not specified, the START= option must be specified. The DATE= option and the START= and END= options can be used in combination to subset a series for processing. For example, suppose you have a series with 10 years of quarterly data (40 observations, no missing values) beginning in 1970Q1 and ending in 1979Q4, and you want to seasonally adjust only four years beginning in 1974Q1 and ending in 1977Q4. Specifying
quarterly date=variable start='1974q1' end='1977q4';
seasonally adjusts only this subset of the data. If instead you wanted to adjust the last six years of data, only the START= option is needed:
quarterly date=variable start='1974q1';
END= yyyyQq
specifies that only the part of the input series ending with the quarter and year given be adjusted (for example, END='1973Q4'). The specification must be enclosed in quotes, and q must be 1, 2, 3, or 4. See the DATE= variable option for using the START= and END= options to subset a series.
FULLWEIGHT= value
assigns weights to irregular values based on their distance from the mean in standard deviation units. The weights are used for estimating seasonal and trend cycle components. Irregular values less than the FULLWEIGHT= value (in standard deviation units) are assigned full weights of 1, values that fall between the ZEROWEIGHT= and FULLWEIGHT= limits are assigned weights linearly graduated between 0 and 1, and values greater than the ZEROWEIGHT= limit are assigned a weight of 0. For example, if ZEROWEIGHT=2 and FULLWEIGHT=1, a value 1.3 standard deviations from the mean would be assigned a graduated weight. The default is FULLWEIGHT=1.5.
NDEC= n
specifies the number of decimal places shown on the output tables. This option has no effect on the precision of the variables in the output data set.
PRINTOUT= STANDARD | LONG | FULL | NONE
specifies the tables to print. If PRINTOUT=STANDARD is specified, between 17 and 27 tables are printed, depending on the other options that are specified. PRINTOUT=LONG prints between 27 and 39 tables, and PRINTOUT=FULL prints between 44 and 59 tables. Specifying PRINTOUT=NONE results in no tables being printed. The default is PRINTOUT=STANDARD. The TABLES statement can also specify particular quarterly tables to be printed. If no PRINTOUT= is given, and a TABLES statement is given, the TABLES statement overrides the default value of PRINTOUT=STANDARD; that is, no tables (or charts) are printed except those given in the TABLES statement. However, if both the PRINTOUT= option and a TABLES statement are given, the tables corresponding to the PRINTOUT= option and those requested by the TABLES statement are printed.
START= yyyyQq
adjusts only the part of the input series starting with the quarter and year given. When the DATE= option is not used, the START= option gives the year and quarter of the first input observation (for example, START='1967Q1'). The specification must be enclosed in quotes, and q must be 1, 2, 3, or 4. START= must be specified if the DATE= option is not given. If START= is specified (and no DATE= is given), and an OUT= data set is requested, a variable named _DATE_ is added to the data set, giving the date value for a given observation. See the DATE= option for using the START= and END= options to subset a series.
SUMMARY
specifies that the input is already seasonally adjusted and that the procedure is to produce summary measures. If this option is omitted, the procedure performs seasonal adjustment of the input data before calculating summary measures.
TRENDADJ
modifies extreme irregular values prior to computing the trend cycle estimates. If this option is omitted, the trend cycle is computed without modification for extremes.
ZEROWEIGHT= value
assigns weights to irregular values based on their distance from the mean in standard deviation units. The weights are used for estimating seasonal and trend cycle components. Irregular values beyond the standard deviation limit specified in the ZEROWEIGHT= option are assigned zero weights. Values that fall between the two limits (ZEROWEIGHT= and FULLWEIGHT=) are assigned weights linearly graduated between 0 and 1. For example, if ZEROWEIGHT=2 and FULLWEIGHT=1, a value 1.3 standard deviations from the mean would be assigned a graduated weight. The default is ZEROWEIGHT=2.5. The ZEROWEIGHT= option can be used in conjunction with the FULLWEIGHT= option to adjust outliers from a monthly or quarterly series. See Example 31.3 later in this chapter for an illustration of this use.
SSPAN Statement
SSPAN options ;
The SSPAN statement applies sliding spans analysis to determine the suitability of seasonal adjustment for an economic series. The following options can appear in the SSPAN statement:
NDEC= n
specifies the number of decimal places shown on selected sliding span reports. This option has no effect on the precision of the variable values in the OUTSPAN= output data set.
CUTOFF= value
gives the percentage value for determining an excessive difference within a span for the seasonal factors, the seasonally adjusted series, and month-to-month and year-to-year differences in the seasonally adjusted series. The default value is 3.0. The use of the CUTOFF=value in determining the maximum percent difference (MPD) is described in the section Computational Details for Sliding Spans Analysis on page 2062. Caution should be used in changing the default CUTOFF=value. The empirical threshold ranges found by the U.S. Census Bureau no longer apply when value is changed.
TDCUTOFF= value
gives the percentage value for determining an excessive difference within a span for the trading-day factors. The default value is 2.0. The use of the TDCUTOFF=value in determining the maximum percent difference (MPD) is described in the section Computational Details for Sliding Spans Analysis on page 2062. Caution should be used in changing the default TDCUTOFF=value. The empirical threshold ranges found by the U.S. Census Bureau no longer apply when the value is changed.
NOPRINT
suppresses all sliding span reports. See Computational Details for Sliding Spans Analysis on page 2062 for more details on sliding span reports.
prints the summary sliding spans report S 0 through S 6.E, along with detail reports S 7.A through S 7.E.
TABLES Statement
TABLES tablenames ;
The TABLES statement prints the tables specified in addition to the tables that are printed as a result of the PRINTOUT= option in the MONTHLY or QUARTERLY statement. Table names are listed in Table 31.4 later in this chapter. To print only selected tables, omit the PRINTOUT= option in the MONTHLY or QUARTERLY statement and list the tables to be printed in the TABLES statement. For example, to print only the final seasonal factors and final seasonally adjusted series, use the statement
tables d10 d11;
VAR Statement
VAR variables ;
The VAR statement is used to specify the variables in the input data set that are to be analyzed by the procedure. Only numeric variables can be specified. If the VAR statement is omitted, all numeric variables are analyzed except those appearing in a BY or ID statement or the variable named in the DATE= option in the MONTHLY or QUARTERLY statement.
availability of time series software. For further discussions about statistical problems associated with the X-11 method, see Ghysels (1990). Seasonal adjustment methods began to be developed in the 1920s and 1930s, before there were suitable analytic models available and before electronic computing devices were in existence. The lack of any suitable model led to methods that worked the same for any series; that is, methods that were not model-based and that could be applied to any series. Experience with economic series had shown that a given mathematical form could adequately represent a time series only for a fixed length; as more data were added, the model became inadequate. This suggested an approach that used moving averages. For further analysis of the properties of X-11 moving averages, see Cleveland and Tiao (1976). The basic method was to break up an economic time series into long-term trend, long-term cyclical movements, seasonal movements, and irregular fluctuations. Early investigators found that it was not possible to uniquely decompose the trend and cycle components. Thus, these two were grouped together; the resulting component is usually referred to as the trend cycle component. It was also found that estimating seasonal components in the presence of trend produced biased estimates of the seasonal components, but, at the same time, estimating trend in the presence of seasonality was difficult. This eventually led to the iterative approach used in the X-11 method. Two other problems were encountered by early investigators. First, some economic series appear to have changing or evolving seasonality. Second, moving averages were very sensitive to extreme values. The estimation method used in the X-11 method allows for evolving seasonal components. For the second problem, the X-11 method uses repeated adjustment of extreme values.
All of these problems encountered in the early investigation of seasonal adjustment methods suggested the use of moving averages in estimating components. Even with the use of moving averages instead of a model-based method, massive amounts of hand calculations were required. Only a small number of series could be adjusted, and little experimentation could be done to evaluate variations on the method.

With the advent of electronic computing in the 1950s, work on seasonal adjustment methods proceeded rapidly. These methods still used the framework previously described; variants of these basic methods could now be easily tested against a large number of series. Much of the work was done by Julian Shiskin and others at the U.S. Bureau of the Census beginning in 1954 and culminating after a number of variants into the X-11 Variant of the Census Method II Seasonal Adjustment Program, which PROC X11 implements. References for this work during this period include Shiskin and Eisenpress (1957), Shiskin (1958), and Marris (1961). The authoritative documentation for the X-11 Variant is in Shiskin, Young, and Musgrave (1967). This document is not equivalent to a program specification; however, the FORTRAN code that implements the X-11 Variant is in the public domain. A less detailed description of the X-11 Variant is given in U.S. Bureau of the Census (1969).
The trading-day component can be further factored as

$$D_t = D_{r,t} \, D_{tr,t}$$

where $D_{tr,t}$ are the trading-day factors derived from the prior daily weights, and $D_{r,t}$ are the residual trading-day factors estimated from the trading-day regression. For further information about estimating trading-day variation, see Young (1965).
identified and modified, final estimates of the seasonal component, seasonally adjusted series, trend cycle, and irregular components are produced. Step 2 consists of three substeps:

a) During the first iteration, a centered, 12-term moving average is applied to the original series $O_t$ to provide a preliminary estimate $\hat{C}_t$ of the trend cycle curve $C_t$. This moving average combines 13 (a 2-term moving average of a 12-term moving average) consecutive monthly values, removing the $S_t$ and $I_t$. Next, it obtains a preliminary estimate $\widehat{S_t I_t}$ by

$$\widehat{S_t I_t} = \frac{O_t}{\hat{C}_t}$$

b) A moving average is then applied to the $\widehat{S_t I_t}$ to obtain an estimate $\hat{S}_t$ of the seasonal factors. $\widehat{S_t I_t}$ is then divided by this estimate to obtain an estimate $\hat{I}_t$ of the irregular component. Next, a moving standard deviation is calculated from the irregular component and is used in assigning a weight to each monthly value for measuring its degree of extremeness. These weights are used to modify extreme values in $\widehat{S_t I_t}$. New seasonal factors are estimated by applying a moving average to the modified value of $\widehat{S_t I_t}$. A preliminary seasonally adjusted series is obtained by dividing the original series by these new seasonal factors. A second estimate of the trend cycle is obtained by applying a weighted moving average to this seasonally adjusted series.
c) The same process is used to obtain second estimates of the seasonally adjusted series and improved estimates of the irregular component. This irregular component is again modified for extreme values and then used to provide estimates of trading-day factors and refined weights for the identification of extreme values.

3. Using the same computations, a second iteration is performed on the original series that has been adjusted by the trading-day factors and irregular weights developed in the first iteration. The second iteration produces final estimates of the trading-day factors and irregular weights.

4. A third and final iteration is performed using the original series that has been adjusted for trading-day factors and irregular weights computed during the second iteration. During the third iteration, PROC X11 develops final estimates of seasonal factors, the seasonally adjusted series, the trend cycle, and the irregular components. The procedure computes summary measures of variation and produces a moving average of the final adjusted series.
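The first substep of the iteration (the centered 12-term moving average and the preliminary seasonal-irregular ratios) can be illustrated with a small Python sketch. This is not PROC X11 code and does not reproduce the extreme-value weighting or the later iterations; the function names and the sample series are hypothetical, and a multiplicative model is assumed.

```python
def centered_2x12_ma(series):
    """Centered 12-term moving average (a 2-term average of 12-term averages).

    Returns a list aligned with `series`. The first and last six positions
    are None because the 13-value window does not fit there.
    """
    n = len(series)
    # The 13 weights: 1/24 on the two end points, 1/12 on the 11 interior points.
    weights = [1 / 24] + [1 / 12] * 11 + [1 / 24]
    trend = [None] * n
    for t in range(6, n - 6):
        window = series[t - 6:t + 7]
        trend[t] = sum(w * x for w, x in zip(weights, window))
    return trend


def preliminary_si_ratios(series, trend):
    """Divide the original series by the trend estimate (multiplicative model)."""
    return [None if c is None else o / c for o, c in zip(series, trend)]


# Hypothetical monthly data: a linear trend times a fixed seasonal pattern.
seasonal = [0.9, 0.95, 1.0, 1.05, 1.1, 1.0, 0.9, 0.95, 1.0, 1.05, 1.1, 1.0]
original = [(100 + t) * seasonal[t % 12] for t in range(48)]
trend = centered_2x12_ma(original)
si = preliminary_si_ratios(original, trend)
```

Because the weights are symmetric and sum to one, the filter passes a constant or linear trend through unchanged, which is why it is a reasonable first estimate of the trend cycle.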
from a dominating irregular component or sudden changes in the seasonal component. When the seasonal component estimates of a series are unstable in this manner, they have little meaning and the series is likely to be unsuitable for seasonal adjustment.

Sliding spans analysis, developed at the Statistical Research Division of the U.S. Census Bureau (Findley et al. 1990; Findley and Monsell 1986), performs a repeated seasonal adjustment on subsets or spans of the full series. In particular, an initial span of the data, typically eight years in length, is seasonally adjusted, and Tables C18 (the trading-day factors, if trading-day regression is performed), D10 (the seasonal factors), and D11 (the seasonally adjusted series) are retained for further processing. Next, one year of data is deleted from the beginning of the initial span and one year of data is added. This new span is seasonally adjusted as before, with the same tables retained. This process continues until the end of the data is reached. The beginning and ending dates of the spans are such that the last observation in the original data is also the last observation in the last span. This is discussed in more detail in the following paragraphs.

The following notation for the components or differences computed in the sliding spans analysis follows Findley et al. (1990). The symbol $X_t(k)$ means component $X$ in month (or quarter) $t$, computed from data in the $k$th span. These components are now defined.

Seasonal Factors (Table D10): $S_t(k)$
Trading-Day Factors (Table C18): $TD_t(k)$
Seasonally Adjusted Data (Table D11): $SA_t(k)$
Month-to-Month Changes in the Seasonally Adjusted Data: $MM_t(k)$
Year-to-Year Changes in the Seasonally Adjusted Data: $YY_t(k)$

The key measure is the maximum percent difference across spans. For example, consider a series that begins in January 1972, ends in December 1984, and has four spans, each of length 8 years (see Figure 1 in Findley et al. (1990), p. 346).
Consider $S_t(k)$, the seasonal factor (Table D10) for month $t$ for span $k$, and let $N_t$ denote the set of spans containing month $t$; that is,

$$N_t = \{ k : \text{span } k \text{ contains month } t \}$$

In the middle years of the series there is overlap of all four spans, and $N_t$ will contain all four. The last year of the series will have only one span, while the beginning can have 1 or 0 spans depending on the original length. Since we are interested in how much the seasonal factors vary for a given month across the spans, a natural quantity to consider is

$$\max_{k \in N_t} S_t(k) - \min_{k \in N_t} S_t(k)$$

In the case of the multiplicative model, it is useful to compute a percentage difference; define the maximum percentage difference (MPD) at time $t$ as

$$\mathrm{MPD}_t = \frac{\max_{k \in N_t} S_t(k) - \min_{k \in N_t} S_t(k)}{\min_{k \in N_t} S_t(k)}$$
The seasonal factor for month $t$ is then unreliable if $\mathrm{MPD}_t$ is large. While no exact significance level can be computed for this statistic, empirical levels have been established by considering over 500 economic series (Findley et al. 1990; Findley and Monsell 1986). For these series it was found that for four spans, stable series typically had less than 15% of the MPD values exceeding 3.0%, while in marginally stable series, between 15% and 25% of the MPD values exceeded 3.0%. A series in which 25% or more of the MPD values exceeded 3.0% is almost always unstable. While these empirical values cannot be considered an exact significance level, they provide a useful empirical basis for deciding if a series is suitable for seasonal adjustment. These percentage values are shifted down when fewer than four spans are used.
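A minimal Python sketch of the MPD calculation and the four-span guideline quoted above follows. It is an illustration only, not the procedure's implementation: the function names and the sample seasonal factors are hypothetical, the MPD is scaled by 100 so that it can be compared directly with the 3.0% threshold, and the boundary at exactly 25% is assigned to "unstable" as an assumption.

```python
def max_percent_difference(values_by_span):
    """MPD for one month, in percent: 100 * (max - min) / min over the spans
    that contain the month. Spans that do not contain it are passed as None."""
    present = [v for v in values_by_span if v is not None]
    if len(present) < 2:
        return None  # no comparison is possible with fewer than two spans
    lo, hi = min(present), max(present)
    return 100.0 * (hi - lo) / lo


def classify_stability(mpd_values, threshold=3.0):
    """Share of months whose MPD exceeds `threshold`, with the empirical label."""
    usable = [m for m in mpd_values if m is not None]
    if not usable:
        raise ValueError("no month is covered by two or more spans")
    share = 100.0 * sum(m > threshold for m in usable) / len(usable)
    if share < 15.0:
        label = "stable"
    elif share < 25.0:
        label = "marginally stable"
    else:  # assumption: exactly 25% is treated as unstable
        label = "unstable"
    return share, label


# Hypothetical seasonal factors (Table D10 style) for three months, four spans.
factors = {
    "1978-01": [98.0, 98.5, 99.0, None],
    "1978-02": [101.0, 101.2, 101.1, 101.3],
    "1978-03": [105.0, 110.0, 104.0, 106.0],
}
mpd = {month: max_percent_difference(v) for month, v in factors.items()}
share, label = classify_stability(mpd.values())
```

With only three months the resulting share is not meaningful in practice; the guideline was derived from full-length series with four spans.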
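Seasonal Factors (Table D10): the range of $S_t(k)$ across spans is

$$\max_{k \in N_t} S_t(k) - \min_{k \in N_t} S_t(k)$$

and the maximum percentage difference is

$$\mathrm{MPD}_t = \frac{\max_{k \in N_t} S_t(k) - \min_{k \in N_t} S_t(k)}{\min_{k \in N_t} S_t(k)}$$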
A series for which less than 15% of the MPD values of D10 exceed 3.0% is stable; between 15% and 25% is marginally stable; and greater than 25% is unstable. Span reports S 2.A through S 2.C give the various breakdowns for the number of times the MPD exceeded these levels.
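Trading-Day Factors (Table C18): with $TD_t(k)$ in place of $S_t(k)$, the maximum percentage difference is

$$\mathrm{MPD}_t = \frac{\max_{k \in N_t} TD_t(k) - \min_{k \in N_t} TD_t(k)}{\min_{k \in N_t} TD_t(k)}$$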
The U.S. Census Bureau currently gives no recommendation concerning MPD thresholds for the trading-day factors. Span reports S 3.A through S 3.C give the various breakdowns for MPD thresholds. When TDREGR=NONE is specified, no trading-day computations are done, and this table is skipped.
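Seasonally Adjusted Data (Table D11): with $SA_t(k)$ in place of $S_t(k)$, the maximum percentage difference is

$$\mathrm{MPD}_t = \frac{\max_{k \in N_t} SA_t(k) - \min_{k \in N_t} SA_t(k)}{\min_{k \in N_t} SA_t(k)}$$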
A series for which less than 15% of the MPD values of D11 exceed 3.0% is stable; between 15% and 25% is marginally stable; and greater than 25% is unstable. Span reports S 4.A through S 4.C give the various breakdowns for the number of times the MPD exceeded these levels.
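For the additive model, the month-to-month change for span $k$ is defined as

$$MM_t(k) = SA_t - SA_{t-1}$$

while for the multiplicative model

$$MM_t(k) = \frac{SA_t - SA_{t-1}}{SA_{t-1}}$$

where the seasonally adjusted values are those computed from span $k$. Since this quantity is already in percentage form, the MPD for both the additive and multiplicative model is defined as

$$\mathrm{MPD}_t = \max_{k \in N1_t} MM_t(k) - \min_{k \in N1_t} MM_t(k)$$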
The current recommendation of the U.S. Census Bureau is that if 35% or more of the MPD values of the month-to-month differences of D11 exceed 3.0%, then the series is usually not stable; 40% exceeding this level clearly marks an unstable series. Span reports S 5.A.1 through S 5.C give the various breakdowns for the number of times the MPD exceeds these levels.
Year-to-Year Changes in the Seasonally Adjusted Data
(Appropriate changes in notation for a quarterly series are obvious.) For the additive model, the year-to-year change for span $k$ is defined as

$$YY_t(k) = SA_t - SA_{t-12}$$

while for the multiplicative model

$$YY_t(k) = \frac{SA_t - SA_{t-12}}{SA_{t-12}}$$

Since this quantity is already in percentage form, the MPD for both the additive and multiplicative model is defined as

$$\mathrm{MPD}_t = \max_{k \in N1_t} YY_t(k) - \min_{k \in N1_t} YY_t(k)$$
The current recommendation of the U.S. Census Bureau is that if 10% or more of the MPD values of the year-to-year differences of D11 exceed 3.0%, then the series is usually not stable. Span reports S 6.A through S 6.C give the various breakdowns for the number of times the MPD exceeds these levels.
Data Requirements
The input data set must contain either quarterly or monthly time series, and the data must be in chronological order. For the standard X-11 method, there must be at least three years of observations (12 for quarterly time series or 36 for monthly) in the input data sets or in each BY group in the input data set if a BY statement is used. For the X-11-ARIMA method, there must be at least five years of observations (20 for quarterly time series or 60 for monthly) in the input data sets or in each BY group in the input data set if a BY statement is used.
Missing Values
Missing values at the beginning of a series to be adjusted are skipped. Processing starts with the first nonmissing value and continues until the end of the series or until another missing value is found.

Missing values are not allowed for the DATE= variable. The procedure terminates if missing values are found for this variable.

Missing values found in the PMFACTOR= variable are replaced by 100 for the multiplicative model (default) and by 0 for the additive model.

Missing values can occur in the output data set. If the time series specified in the OUTPUT statement is not computed by the procedure, the values of the corresponding variable are missing. If the time series specified in the OUTPUT statement is a moving average, the values of the corresponding variable are missing for the first n and last n observations, where n depends on the length of the moving average. Additionally, if the time series specified is an irregular component modified for extremes, only the modified values are given, and the remaining values are missing.
   output out=x11out a1=a1 a4=a4 b1=b1 c14=c14
          c16=c16 c18=c18 d11=d11;
run;
Tables of interest include A1, A4, B15, B16, C14, C15, C18, and D11. Table A4 contains the adjustment factors derived from the prior daily weights; Table C14 contains the extreme irregular values excluded from trading-day regression; Table C15 contains the trading-day-regression results; Table C16 contains the monthly factors derived from the trading-day regression; and Table C18 contains the final trading-day factors derived from the combined daily weights. Finally, Table D11 contains the final seasonally adjusted series.
Table A2 contains the prior monthly factors (the values of PMF), and Table A3 contains the prior adjusted series.
$$D10_t = D10_{t-12} + \frac{1}{2} \left( D10_{t-12} - D10_{t-24} \right), \qquad t = N+1, \ldots, N+12$$
If the input data are monthly time series, 12 extra observations are added to the end of the output data set. (If a BY statement is used, 12 extra observations are added to the end of each BY group.) If the input data are a quarterly time series, four extra observations are added to the end of the output data set. (If a BY statement is used, four extra observations are added to each BY group.)

The DATE= variable (or _DATE_) is extrapolated for the extra observations generated by the YRAHEADOUT option, while all other ID variables will have missing values.

If ARIMA processing is requested, and if both the OUTEXTRAP and YRAHEADOUT options are specified in the PROC X11 statement, an additional 12 (or 4) observations are added to the end of the output data set for monthly (or quarterly) data after the ARIMA forecasts, using the same linear combination of past values as before.
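The extra year of seasonal factors can be sketched in Python as follows. The sketch assumes the usual X-11 year-ahead rule, in which each projected factor equals the factor one year earlier plus half of its change over the preceding year; that rule, the function name, and the sample factors are assumptions made for illustration and are not taken from PROC X11 itself.

```python
def year_ahead_factors(d10, period=12):
    """Project one extra year of seasonal factors.

    Assumed rule: f[t] = f[t - period] + 0.5 * (f[t - period] - f[t - 2 * period])
    for the `period` time points following the end of the series.
    """
    if len(d10) < 2 * period:
        raise ValueError("at least two full years of factors are required")
    extended = list(d10)
    n = len(d10)
    for t in range(n, n + period):
        last_year = extended[t - period]
        year_before = extended[t - 2 * period]
        extended.append(last_year + 0.5 * (last_year - year_before))
    return extended[n:]


# Hypothetical factors: every month moved from 100 to 102 over the last year,
# so half of that change (1.0) is added for the projected year.
projected = year_ahead_factors([100.0] * 12 + [102.0] * 12)
```

For quarterly data, `period=4` yields the four extra observations described above.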
macurves dec=3x9;
would require five additional December values to compute the seasonal moving average.
Model #   Multiplicative
1         log transform
2         log transform
3         log transform
4         log transform
5         no transform
The selection process proceeds as follows. The five models are estimated and one-step-ahead forecasts are produced in the order shown in Table 31.2. As each model is estimated, the following three criteria are checked:

The mean absolute percent error (MAPE) for the last three years of the series must be less than 15%.
The significance probability for the Box-Ljung chi-square for up to lag 24 for monthly (8 for quarterly) must be greater than 0.05.
The over-differencing criterion must not exceed 0.9.

The descriptions of these three criteria are given in the section Criteria Details on page 2069. The default values for these criteria are those used by X11ARIMA/88 from Statistics Canada; these defaults can be changed by the MAPECR=, CHICR=, and OVDIFCR= options.

A model that fails any one of these three criteria is excluded from further consideration. In addition, if the ARIMA estimation fails for a given model, a warning is issued, and the model is excluded. The final set of all models considered consists of those that pass all three criteria and are estimated successfully. From this set, the model with the smallest MAPE for the last three years is chosen.

If all five models fail, ARIMA processing is skipped for the variable being processed, and the standard X-11 seasonal adjustment is performed. A note is written to the log with this information.
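The selection rule can be summarized in a short Python sketch. The data structure and function below are hypothetical and only mirror the rule as described; the keyword arguments are named after the MAPECR=, CHICR=, and OVDIFCR= options for readability, and the fitted values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CandidateFit:
    model_number: int
    estimated: bool      # False if the ARIMA estimation failed
    mape: float          # percent, computed over the last three years
    chi_sq_prob: float   # significance probability of the Box-Ljung chi-square
    overdiff: float      # largest MA-parameter quantity checked for over-differencing


def select_model(fits: Sequence[CandidateFit], mapecr: float = 15.0,
                 chicr: float = 0.05, ovdifcr: float = 0.9) -> Optional[CandidateFit]:
    """Return the passing model with the smallest MAPE, or None if all fail."""
    passing = [
        f for f in fits
        if f.estimated
        and f.mape < mapecr
        and f.chi_sq_prob > chicr
        and f.overdiff <= ovdifcr
    ]
    if not passing:
        return None  # fall back to the standard X-11 adjustment
    return min(passing, key=lambda f: f.mape)


# Hypothetical results for the five predefined models.
fits = [
    CandidateFit(1, True, 4.2, 0.30, 0.62),
    CandidateFit(2, True, 3.1, 0.02, 0.55),   # fails the chi-square criterion
    CandidateFit(3, False, 2.0, 0.50, 0.10),  # estimation failed
    CandidateFit(4, True, 3.8, 0.12, 0.95),   # fails the over-differencing criterion
    CandidateFit(5, True, 5.0, 0.40, 0.70),
]
chosen = select_model(fits)
```

Here models 1 and 5 pass all three criteria, and model 1 is chosen because its MAPE is smaller.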
The chosen model is then used to forecast the series one or more years (determined by the FORECAST= option in the ARIMA statement). These forecasts are appended to the original data (or the prior and calendar-adjusted data).

If a BACKCAST= option is specified, the chosen model form is used, but the parameters are reestimated using the reversed series. Using these parameters, the reversed series is forecast for the number of years specified by the BACKCAST= option. These forecasts are then reversed and appended to the beginning of the original series, or the prior and calendar-adjusted series, to produce the backcasts.

Note that the final selection rule (the smallest MAPE using the last three years) emphasizes the quality of the forecasts at the end of the series. This is consistent with the purpose of the X-11-ARIMA methodology, which is to improve the estimates of seasonal factors and thus minimize revisions to recent past data as new data become available.
Criteria Details
Mean Absolute Percent Error (MAPE)
For the MAPE criterion, only the last three years of the original series (or prior and calendar adjusted series) are used in computing the MAPE. Let $y_t$, $t = 1, \ldots, n$, be the last three years of the series, and denote its one-step-ahead forecast by $\hat{y}_t$, where $n = 36$ for a monthly series and $n = 12$ for a quarterly series. With this notation, the MAPE criterion is computed as

$$\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{|y_t|}$$
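As a small worked illustration of the MAPE formula, the following Python sketch computes it for a hypothetical three-point series; the function name and data are assumptions, and a real check would use the 36 monthly (or 12 quarterly) values described above.

```python
def mape(actual, forecast):
    """Mean absolute percent error: (100 / n) * sum(|y - yhat| / |y|)."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be nonempty and equal in length")
    n = len(actual)
    return 100.0 / n * sum(abs(y - f) / abs(y) for y, f in zip(actual, forecast))


# Relative errors are 0.10, 0.05, and 0.00, so the MAPE is 100 * 0.15 / 3 = 5.0.
example_mape = mape([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
```

A value of 5.0 would pass the default 15% criterion. Note that the measure is undefined when any actual value is zero.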
Box-Ljung Chi-Square
The Box-Ljung chi-square is a lack-of-fit test based on the model residuals. This test statistic is computed using the Ljung-Box formula

$$\chi^2_m = n(n+2) \sum_{k=1}^{m} \frac{r_k^2}{n-k}$$

where $n$ is the number of residuals that can be computed for the time series, and

$$r_k = \frac{\sum_{t=1}^{n-k} a_t a_{t+k}}{\sum_{t=1}^{n} a_t^2}$$

where the $a_t$'s are the residual sequence. This formula has been suggested by Ljung and Box (1978) as yielding a better fit to the asymptotic chi-square distribution. Some simulation studies of the finite sample properties of this statistic are given by Davies, Triggs, and Newbold (1977) and by Ljung and Box (1978).

For monthly series, $m = 24$, while for quarterly series, $m = 8$.
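The statistic can be computed directly from the formula, as in the following Python sketch. The function name and the toy residuals are hypothetical; the residuals are used as given (not mean-centered), matching the definition of $r_k$ above, and the significance probability is not computed here because it requires a chi-square distribution function.

```python
def ljung_box(residuals, m):
    """Ljung-Box statistic n(n+2) * sum_{k=1..m} r_k^2 / (n - k)."""
    n = len(residuals)
    if not 0 < m < n:
        raise ValueError("m must satisfy 0 < m < number of residuals")
    denom = sum(a * a for a in residuals)
    total = 0.0
    for k in range(1, m + 1):
        # Lag-k residual autocorrelation, as defined in the text.
        r_k = sum(residuals[t] * residuals[t + k] for t in range(n - k)) / denom
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total


# Toy check: for residuals 1, -1, 1, -1 and m = 1,
# r_1 = -3/4, so the statistic is 4 * 6 * (9/16) / 3 = 4.5.
q = ljung_box([1.0, -1.0, 1.0, -1.0], 1)
```

In practice `m` would be 24 for a monthly series and 8 for a quarterly series, as stated above.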
Over-Differencing Test
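From Table 31.2 you can see that all models have a single seasonal MA factor and at most two nonseasonal MA factors. Also, all models have seasonal and nonseasonal differencing. Consider model 2 applied to a monthly series $y_t$ with $E(y_t) = \mu$:

$$(1 - B^{1})(1 - B^{12})(y_t - \mu) = (1 - \theta_1 B - \theta_2 B^2)(1 - \theta_3 B^{12}) a_t$$

If $\theta_3 = 1.0$, then the factors $(1 - \theta_3 B^{12})$ and $(1 - B^{12})$ cancel, resulting in a lower-order model. Similarly, if $\theta_1 + \theta_2 = 1.0$, then

$$(1 - \theta_1 B - \theta_2 B^2) = (1 - B)(1 - \alpha B)$$

for some $\alpha \ne 0.0$. Again, this results in cancellation and a lower-order model. Since the parameters are not exact, it is not reasonable to require that

$$\theta_3 < 1.0 \quad \text{and} \quad \theta_1 + \theta_2 < 1.0$$

Instead, an approximate test is performed by requiring that

$$\theta_3 \le 0.9 \quad \text{and} \quad \theta_1 + \theta_2 \le 0.9$$

The default value of 0.9 can be changed by the OVDIFCR= option. Similar reasoning applies to the other models.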
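The approximate test amounts to comparing two quantities with the OVDIFCR= value: the sum of the nonseasonal MA parameters and the seasonal MA parameter. A minimal Python sketch follows; the function name and parameter values are hypothetical, and the sketch covers only models whose nonseasonal part is a pure MA factor.

```python
def passes_overdifferencing(nonseasonal_ma, seasonal_ma, ovdifcr=0.9):
    """True if neither MA factor is close to cancelling a differencing factor.

    nonseasonal_ma: nonseasonal MA parameters (one or two values).
    seasonal_ma:    the single seasonal MA parameter.
    """
    return sum(nonseasonal_ma) <= ovdifcr and seasonal_ma <= ovdifcr


ok = passes_overdifferencing([0.4, 0.3], 0.8)            # sum 0.7, seasonal 0.8
too_close = passes_overdifferencing([0.6, 0.35], 0.8)    # sum 0.95 exceeds 0.9
seasonal_fail = passes_overdifferencing([0.4, 0.3], 0.95)
```

A model for which the function returns False would be excluded from further consideration under the selection rule.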
ARIMA Statement Options
MODEL=( Q=1 SQ=1 DIF=1 SDIF=1 ) TRANSFORM=LOG
MODEL=( Q=2 SQ=1 DIF=1 SDIF=1 ) TRANSFORM=LOG
MODEL=( P=2 SQ=1 DIF=1 SDIF=1 ) TRANSFORM=LOG
MODEL=( Q=2 SQ=1 DIF=2 SDIF=1 ) TRANSFORM=LOG
MODEL=( P=2 Q=2 SQ=1 DIF=1 SDIF=1 )
are placed in the OUT= data set. A list of tables available for monthly and quarterly series is given later, in Table 31.4.
that overlapping spans will contain identical values for this variable.
C18, a numeric variable that contains the trading-day factors for the seasonal adjustment for the current span
D11, a numeric variable that contains the seasonally adjusted series for the current span
DATE, a numeric variable that contains the date within the current span
SPAN, a numeric variable that contains the current span. The first span is the earliest span
separate sliding spans analysis is performed on each variable in the VAR list.
VARNAME, a character variable containing the name of each variable in the VAR list
TABLE, a character variable specifying the table from which the analysis of variance is performed. When ARIMA processing is requested, and two passes of X11 are required (when TDREGR=PRINT, TEST, or ADJUST), Table D8 and the stable seasonality test are computed twice: once in the initial pass, then again in the final pass. Both of these computations are put in the OUTSTB data set and are identified by D18.1 and D18.2, respectively.
SOURCE, a character variable corresponding to the source column in the analysis of vari-
source term
DF, a numeric variable containing the degrees of freedom associated with the corresponding source term
MS, a numeric variable containing the mean square associated with the corresponding source
These types are (a) the regression, (b) the listing of the standard error associated with length-of-month, and (c) the analysis of variance. The first seven observations in the OUTTDR data set correspond to the regression on days of the week; thus the _TYPE_ variable is given the value REGRESS (day-of-week regression coefficient). The next four observations correspond to 31-, 30-, 29-, and 28-day months and are given the value _TYPE_=LOM_STD (length-of-month standard errors). Finally, the last three observations correspond to the analysis of variance table, and _TYPE_=ANOVA.
PARM, a character variable, further identifying the nature of the observation. PARM is set to
SOURCE, a character variable containing the source in the regression. This variable is missing
weight found from regression). The variable is missing for all _TYPE_=LOM_STD and _TYPE_=ANOVA.
PRWGT, a numeric variable containing the prior weight. The prior weight is 1.0 if PDWEIGHTS are not specified. This variable is missing for all _TYPE_=LOM_STD and _TYPE_=ANOVA.
COEFF, a numeric variable containing the calculated regression coefficient for the given day. This variable is missing for all _TYPE_=LOM_STD and _TYPE_=ANOVA.
STDERR, a numeric variable containing the standard errors. For observations with _TYPE_=REGRESS, this is the standard error corresponding to the regression coefficient. For observations with _TYPE_=LOM_STD, this is the standard error for the corresponding length-of-month. This variable is missing for all _TYPE_=ANOVA.
T1, a numeric variable containing the t statistic corresponding to the test that the combined weight is different from the prior weight. This variable is missing for all _TYPE_=LOM_STD and _TYPE_=ANOVA.
T2, a numeric variable containing the t statistic corresponding to the test that the combined weight is different from 1.0. This variable is missing for all _TYPE_=LOM_STD and _TYPE_=ANOVA.
PROBT1, a numeric variable containing the significance level for t statistic T1. The variable
source term. This variable is missing for all _TYPE_=REGRESS and LOM_STD.
DF, a numeric variable containing the degrees of freedom associated with the corresponding source term. This variable is missing for all _TYPE_=REGRESS and LOM_STD.
MS, a numeric variable containing the mean square associated with the corresponding source term. This variable is missing for the source term Total and for all _TYPE_=REGRESS and LOM_STD.
F, a numeric variable containing the F statistic for the Regression source term. The variable is missing for the source terms Total and Error, and for all _TYPE_=REGRESS and LOM_STD.
PROBF, a numeric variable containing the significance level for the F statistic. This variable is missing for the source terms Total and Error and for all _TYPE_=REGRESS and LOM_STD.
Printed Output
The output from PROC X11, both printed tables and the series written to the OUT= data set, depends on whether the data are monthly or quarterly. For the printed tables, the output depends further on the value of the PRINTOUT= option and the TABLE statement, along with other options specified.

The printed output is organized into tables identified by a part letter and a sequence number within the part. The seven major parts of the X11 procedure are as follows:

A   prior adjustments (optional)
B   preliminary estimates of irregular component weights and regression trading-day factors
C   final estimates of irregular component weights and regression trading-day factors
D   final estimates of seasonal, trend cycle, and irregular components
E   analytical tables
F   summary measures
G   charts
Table 31.4 describes the individual tables and charts. Most tables apply both to quarterly and monthly series. Those that apply only to a monthly time series are indicated by an M in the notes section, while P indicates the table is not a time series, and is only printed, not output to the OUT= data set.
Table 31.4 Table Names and Descriptions
Description
original series
prior monthly adjustment factors
original series adjusted for prior monthly factors
prior trading-day adjustments
prior adjusted or original series
ARIMA forecasts
ARIMA backcasts
prior adjusted or original series extended by ARIMA backcasts and forecasts
prior adjusted or original series
trend cycle
unmodified seasonal-irregular (S-I) ratios
replacement values for extreme S-I ratios
seasonal factors
seasonally adjusted series
trend cycle
unmodified S-I ratios
replacement values for extreme S-I ratios
Notes M M M M M
Table   Description                                                         Notes
B10     seasonal factors
B11     seasonally adjusted series
B13     irregular series
B14     extreme irregular values excluded from trading-day regression      M
B15     preliminary trading-day regression                                  M,P
B16     trading-day adjustment factors                                      M
B17     preliminary weights for irregular components
B18     trading-day factors derived from combined daily weights            M
B19     original series adjusted for trading-day and prior variation       M
C1      original series modified by preliminary weights and adjusted for trading-day and prior variation
C2      trend cycle
C4      modified S-I ratios
C5      seasonal factors
C6      seasonally adjusted series
C7      trend cycle
C9      modified S-I ratios
C10     seasonal factors
C11     seasonally adjusted series
C13     irregular series
C14     extreme irregular values excluded from trading-day regression      M
C15     final trading-day regression                                        M,P
C16     final trading-day adjustment factors derived from regression coefficients   M
C17     final weight for irregular components
C18     final trading-day factors derived from combined daily weights      M
C19     original series adjusted for trading-day and prior variation       M
D1      original series modified for final weights and adjusted for trading-day and prior variation
D2      trend cycle
D4      modified S-I ratios
D5      seasonal factors
D6      seasonally adjusted series
D7      trend cycle
D8      final unmodified S-I ratios
D9      final replacement values for extreme S-I ratios
D10     final seasonal factors
D11     final seasonally adjusted series
D12     final trend cycle
D13     final irregular series
E1      original series with outliers replaced
E2      modified seasonally adjusted series
E3      modified irregular series
E4      ratios of annual totals
E5      percent changes in original series
Table   Description                                                         Notes
E6      percent changes in final seasonally adjusted series
F1      MCD moving average
F2      summary measures                                                    P
G1      chart of final seasonally adjusted series and trend cycle          P
G2      chart of S-I ratios with extremes, S-I ratios without extremes, and final seasonal factors   P
G3      chart of S-I ratios with extremes, S-I ratios without extremes, and final seasonal factors in calendar order   P
G4      chart of final irregular and final modified irregular series       P
The actual number of tables printed depends on the options and statements specified. If a table is not computed, it is not printed. For example, if TDREGR=NONE is specified, none of the tables associated with the trading-day are printed.
Stable, Moving, and Combined Seasonality Tests on the Final Unmodified SI Ratios (Table D8)
PROC X11 displays four tests used to identify stable seasonality and moving seasonality and to measure identifiable seasonality. These tests are displayed after Table D8. They are "Stable Seasonality Test," "Moving Seasonality Test," "Nonparametric Test for the Presence of Seasonality Assuming Stability," and "Summary of Results and Combined Test for the Presence of Identifiable Seasonality." The motivation, interpretation, and statistical details of all these tests are now given.
Motivation
The seasonal component of this time series, $S_t$, is defined as the intrayear variation that is repeated constantly (stable) or in an evolving fashion from year to year (moving seasonality). If the increase in the seasonal factors from year to year is too large, then the seasonal factors will introduce distortion into the model. It is important to determine if seasonality is identifiable without distorting the series.

To determine if stable seasonality is present in a series, PROC X11 computes a one-way analysis of variance by using the seasons (months or quarters) as the factor on the Final Unmodified SI Ratios (Table D8). This is the appropriate table to use because the removal of the trend cycle is equivalent to detrending. PROC X11 prints this test, labeled "Stable Seasonality Test," immediately after Table D8.

The X11 seasonal adjustment method tests for moving seasonality. Moving seasonality can be a source of distortion when seasonal factors are used in the model. PROC X11 computes and prints a test for moving seasonality. The test is a two-way analysis of variance that uses months (or quarters) and years. As in the Stable Seasonality Test, this analysis of variance is performed on the Final Unmodified SI Ratios (Table D8). PROC X11 prints this test, labeled "Moving Seasonality Test," after the Stable Seasonality Test.

PROC X11 next computes a nonparametric Kruskal-Wallis chi-squared test for stable seasonality, "Nonparametric Test for the Presence of Seasonality Assuming Stability." The Kruskal-Wallis test is performed on the ranks of the Final Unmodified SI Ratios (Table D8). For further details about the Kruskal-Wallis test, see Lehmann (1998, pp. 204-210).

The results of the preceding three tests are combined into a joint test to measure identifiable seasonality, "Summary of Results and Combined Test for the Presence of Identifiable Seasonality." This test combines the two F tests previously described, along with the Kruskal-Wallis chi-squared test for stable seasonality, to determine identifiable seasonality. This test is printed after "Nonparametric Test for the Presence of Seasonality Assuming Stability."
The Stable Seasonality Test is a one-way analysis of variance on the Final Unmodified SI Ratios with seasons (months or quarters) as the factor. To determine whether stable seasonality is present in a series, PROC X11 computes a one-way analysis of variance by using the seasons (months or quarters) as the factor on the Final Unmodified SI Ratios (Table D8). This is the appropriate table to use because the removal of the trend cycle is similar to detrending.

A large F statistic and a small significance level are evidence that a significant amount of variation in the SI ratios is due to months or quarters, which in turn is evidence of seasonality; the null hypothesis of no month/quarter effect is rejected. Conversely, a small F statistic and a large significance level (close to 1.0) are evidence that variation due to month or quarter could be due to random error, and the null hypothesis of no month/quarter effect is not rejected. The interpretation and utility of seasonal adjustment are problematic under such conditions.
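The one-way analysis of variance behind the Stable Seasonality Test can be sketched in Python as follows. This is a textbook F statistic with the season as the single factor, applied to a hypothetical quarterly set of SI ratios; it is not the procedure's code, the function name and data are assumptions, and the significance level is omitted because it requires an F distribution function.

```python
def stable_seasonality_f(si_ratios, period=12):
    """One-way ANOVA F statistic with the season (month or quarter) as the factor."""
    n = len(si_ratios)
    if n <= period:
        raise ValueError("more observations than seasons are required")
    # Group the ratios by season: every `period`-th value belongs to the same season.
    groups = [si_ratios[s::period] for s in range(period)]
    grand_mean = sum(si_ratios) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-season variation; F is undefined")
    ms_between = ss_between / (period - 1)
    ms_within = ss_within / (n - period)
    return ms_between / ms_within


# Hypothetical quarterly SI ratios for three years (Q1 low, Q3 high).
si = [90.0, 100.0, 110.0, 100.0,
      92.0, 100.0, 108.0, 100.0,
      88.0, 100.0, 112.0, 100.0]
f_stable = stable_seasonality_f(si, period=4)
```

In this example the between-quarter mean square is 200 and the within-quarter mean square is 2, so the F value is 100, which would indicate strong stable seasonality.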
The F test for moving seasonality is performed by a two-way analysis of variance. The two factors are seasons (months or quarters) and years. The years effect is tested separately; the null hypothesis is no effect due to years after accounting for variation due to months or quarters. For further details about the moving seasonality test, see Lothian (1984a, b, 1978) and Higginson (1975).

The significance levels reported in both the moving and stable seasonality tests are only approximate. Table D8, the Final Unmodified SI Ratios, is constructed from an averaging operation that induces a correlation in the residuals from which the F test is computed. Hence the computed F statistic differs from an exact F statistic; see Cleveland and Devlin (1980) for details.

The test for identifiable seasonality is performed by combining the F tests for stable and moving seasonality, along with a Kruskal-Wallis test for stable seasonality. The following description is based on Lothian and Morry (1978b); other details can be found in Dagum (1988, 1983).

Let $F_s$ and $F_m$ denote the F value for the stable and moving seasonality tests, respectively. The combined test is performed as shown in Table 31.5 and as follows:

1. If the null hypothesis in the stable seasonality test is not rejected at the 0.10% significance level (0.001), this is an indication that the series is not seasonal. PROC X11 displays "Identifiable Seasonality Not Present."

2. If the null hypothesis in step 1 is rejected, then compute the following quantities:

$$T_1 = \frac{7}{F_s}, \qquad T_2 = \frac{3 F_m}{F_s}$$

If the moving seasonality null hypothesis is not rejected at the 5.0% significance level (0.05) and if $T \ge 1.0$, the null hypothesis of identifiable seasonality not present is accepted and PROC X11 displays "Identifiable Seasonality Not Present."

3. If the null hypothesis of identifiable seasonality not present has not been accepted, but $T_1 \ge 1.0$, $T_2 \ge 1.0$, or the Kruskal-Wallis chi-squared test fails at the 0.10% significance level (0.001), then PROC X11 displays "Identifiable Seasonality Probably Not Present."

4. If the $F_s$ and Kruskal-Wallis chi-squared tests pass, and if none of the combined measures described in steps 2 and 3 fail, then the null hypothesis of identifiable seasonality not present is rejected, and PROC X11 displays "Identifiable Seasonality Present."
Table S 0.A gives the variable name, the length and number of spans, and the beginning and ending dates of each span.
Table S 0.B
Table S 0.B gives the summary of the two F tests performed during the standard X11 seasonal adjustments for stable and moving seasonality on Table D8, the final SI ratios. These tests are described in the section "Printed Output" on page 2074.
Table S 1.A
Table S 1.A gives the range analysis of seasonal factors. This includes the means for each month (or quarter) within a span, the maximum percentage difference across spans for each month, and the average. The minimum and maximum within a span are also indicated. For example, for a monthly series and an analysis with four spans, the January row would contain a column for each span, with the value representing the average seasonal factor (Table D10) over all January calendar months occurring within the span. Beside each span column is a character column with either a MIN, MAX, or blank value, indicating which calendar month had the minimum and maximum value over that span.

Denote the average over the $j$th calendar month in span $k$, $k = 1, \ldots, 4$, by $\bar{S}_j(k)$; then the maximum percent difference (MPD) for month $j$ is defined by

$$\mathrm{MPD}_j = \frac{\max_{k=1,\ldots,4} \bar{S}_j(k) - \min_{k=1,\ldots,4} \bar{S}_j(k)}{\min_{k=1,\ldots,4} \bar{S}_j(k)}$$

The last numeric column of Table S 1.A is the average value over all spans for each calendar month, with the minimum and maximum row flagged as in the span columns.
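As a small worked illustration of the MPD definition, the following Python sketch computes the ratio for one calendar month from its span averages. The function name is hypothetical, and the sketch assumes positive values (as with multiplicative seasonal factors near 100). It returns the ratio exactly as defined above; multiplying by 100 to express it as a percentage is an assumption about how the value is compared with percentage cutoffs.

```python
def max_percent_difference(span_means):
    """MPD for one calendar month: (max - min) / min over the span averages."""
    lowest = min(span_means)
    highest = max(span_means)
    return (highest - lowest) / lowest


# Hypothetical January averages of the seasonal factors in four spans.
january = [100.0, 102.0, 101.0, 104.0]
mpd = max_percent_difference(january)   # (104 - 100) / 100
```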
Table S 1.B
Table S 1.B gives a summary of range measures for each span. The first column, Range Means, is calculated by computing the maximum and minimum over all months or quarters in a span, then taking the difference. The next column is the range ratio means, which is simply the ratio of the previously described maximum and minimum. The next two columns are the minimum and maximum seasonal factors over the entire span, while the range sf column is the difference of these. Finally, the last column is the ratio of the Max SF and Min SF columns.
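The columns of Table S 1.B are simple functions of two sets of numbers for a span. The Python sketch below is one reading of the description, not the procedure's code. The function and key names are hypothetical, and it assumes that "Range Means" is taken over the per-month (or per-quarter) means of the span and that the remaining columns use every seasonal factor in the span.

```python
def range_measures(period_means, seasonal_factors):
    """Summarize one span in the manner of Table S 1.B.

    period_means: mean seasonal factor for each month or quarter in the span.
    seasonal_factors: every seasonal factor (Table D10 value) in the span.
    """
    mean_max, mean_min = max(period_means), min(period_means)
    sf_max, sf_min = max(seasonal_factors), min(seasonal_factors)
    return {
        "range_means": mean_max - mean_min,        # difference of the means
        "range_ratio_means": mean_max / mean_min,  # ratio of the means
        "min_sf": sf_min,
        "max_sf": sf_max,
        "range_sf": sf_max - sf_min,               # difference of the extremes
        "range_ratio_sf": sf_max / sf_min,         # ratio of Max SF to Min SF
    }
```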
Breakdown Tables
Table S 2.A.1 begins the breakdown analysis for the various series considered in the sliding spans analysis. The key concept here is the MPD described above in the section Table S 1.A on page 2080 and in the section Computational Details for Sliding Spans Analysis on page 2062. For a month or quarter that appears in two or more spans, the maximum percentage difference is computed and tested against a cutoff level. If it exceeds this cutoff, it is counted as an instance of exceeding the level. It is of interest to see if such instances fall disproportionately in certain months and years. Tables S 2.A.1 through S 6.A.3 display this breakdown for all series considered.
Table S 2.A.1
Table S 2.A.1 gives the monthly (quarterly) breakdown for the seasonal factors (Table D10). The first column identifies the month or quarter. The next column is the number of times the MPD for D10 exceeded 3.0%, followed by the total count. The last is the average maximum percentage difference for the corresponding month or quarter.
Table S 2.A.2
Table S 2.A.2 gives the same information as Table S 2.A.1, but on a yearly basis.
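The monthly and yearly breakdowns differ only in the grouping key. The following Python sketch shows one way such a breakdown could be tabulated. The record layout and function name are hypothetical, MPD values are assumed to be expressed in percent, and the average is assumed to be taken over all tested observations in the group rather than only the flagged ones.

```python
from collections import defaultdict


def breakdown(records, key, cutoff=3.0):
    """Group MPD values by 'month' or 'year' and summarize each group.

    records: iterable of dicts with keys 'month', 'year', and 'mpd' (percent).
    Returns {group: (number exceeding cutoff, total tested, average MPD)}.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["mpd"])
    table = {}
    for group, mpds in sorted(groups.items()):
        exceeded = sum(1 for m in mpds if m > cutoff)
        table[group] = (exceeded, len(mpds), sum(mpds) / len(mpds))
    return table


# Hypothetical MPD values for observations that appear in two or more spans.
records = [
    {"month": 1, "year": 1985, "mpd": 4.0},
    {"month": 1, "year": 1986, "mpd": 1.0},
    {"month": 2, "year": 1985, "mpd": 0.5},
]
by_month = breakdown(records, "month")   # layout of Table S 2.A.1
by_year = breakdown(records, "year")     # layout of Table S 2.A.2
```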
Table S 2.A.3
The description of Table S 2.A.3 requires the definition of Sign Change and Turning Point. First, some motivation. Recall that for a highly stable series, adding or deleting a small number of observations should not affect the estimation of the various components of a seasonal adjustment procedure.

Consider Table D10, the seasonal factors in a sliding spans analysis that uses four spans. For a given observation t, looking across the four spans, we can easily pick out large differences if they occur. More subtle differences can occur when estimates go from above to below (or vice versa) a base level. In the case of a multiplicative model, the seasonal factors have a base level of 100.0. So it is useful to enumerate those instances where both a large change (an MPD value exceeding 3.0%) and a change of sign (with respect to the base) occur.

Let B denote the base value (which in general depends on the component being considered and the model type, multiplicative or additive). If, for span 1, $S_t(1)$ is below B (that is, $S_t(1) - B$ is negative) and for some subsequent span k, $S_t(k)$ is above B (that is, $S_t(k) - B$ is positive), then a positive Change in Sign has occurred at observation t. Similarly, if, for span 1, $S_t(1)$ is above B, and for some subsequent span k, $S_t(k)$ is below B, then a negative Change in Sign has occurred. Both cases, positive or negative, constitute a Change in Sign; the actual direction is indicated in Tables S 7.A through S 7.E, which are described below.

Another behavior of interest occurs when component estimates increase then decrease (or vice versa) across spans for a given observation. Using the preceding example, the seasonal factors at observation t could first increase, then decrease across the four spans. This behavior, combined with an MPD exceeding the level, is of interest in questions of stability.

Again, consider Table D10, the seasonal factors in a sliding spans analysis that uses four spans. For a given observation t (contained in at least three spans), note the level of D10 for the first span. Continue across the spans until a difference of 1.0% or greater occurs (or no more spans are left), noting whether the difference is up or down. If the difference is up, continue until a difference of 1.0% or greater occurs downward (or no more spans are left). If such an up-down combination occurs, the observation is counted as an up-down turning point. A similar description applies to a down-up turning point. Tables S 7.A through S 7.E, described below, show the occurrence of turning points, indicating whether up-down or down-up. Note that it requires at least three spans to test for a turning point. Hence Tables S 2.A.3 through S 6.A.3 show a reduced number in the Turning Point row for the Total Tested column, and in Tables S 7.A through S 7.E, the turning point symbols can occur only where three or more spans overlap.

With these descriptions of sign change and turning point, we now describe Table S 2.A.3.
The first column gives the type or category, the second column gives the total number of observations falling into the category, the third column gives the total number tested, and the last column gives the percentage for the number found in the category. The first category (row) of the table is for flagged observations, that is, those observations where the MPD exceeded the appropriate cutoff level (3.0% is the default for the seasonal factors). The second category is for level changes, while the third category is for turning points. The fourth category is for flagged sign changes, that is, for those observations that are sign changes, how many are also flagged. Note that the total tested column for this category equals the number found for sign change, reflecting the definition of the fourth category. The fifth category is for flagged turning points, that is, for those observations that are turning points, how many are also flagged. The footnote to Table S 2.A.3 gives the U.S. Census Bureau recommendation for thresholds, as described in the section "Computational Details for Sliding Spans Analysis" on page 2062.
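The sign change and turning point definitions can be sketched in a few lines of Python. This is a hedged illustration rather than the procedure's algorithm. The function names are hypothetical. The input is assumed to be the list of estimates for one observation across the spans that contain it, earliest span first. The 1.0% difference is assumed to be measured relative to a nonzero reference level, which starts at the first span's value and moves to the value at which the first qualifying change is detected.

```python
def sign_change(values, base):
    """Return 'positive', 'negative', or None for one observation.

    Positive: span 1 is below the base and a later span is above it.
    Negative: span 1 is above the base and a later span is below it.
    """
    first = values[0]
    for later in values[1:]:
        if first < base < later:
            return "positive"
        if first > base > later:
            return "negative"
    return None


def turning_point(values, threshold=1.0):
    """Return 'U-D', 'D-U', or None; at least three spans are required."""
    if len(values) < 3:
        return None
    ref = values[0]          # level noted for the first span
    direction = None
    for v in values[1:]:
        pct = 100.0 * (v - ref) / abs(ref)
        if direction is None:
            if abs(pct) >= threshold:
                direction = "U" if pct > 0 else "D"
                ref = v      # assumed: later changes are measured from here
        elif direction == "U" and pct <= -threshold:
            return "U-D"
        elif direction == "D" and pct >= threshold:
            return "D-U"
    return None
```

For example, with a base of 100.0, the estimates 99.0, 99.5, 100.4, 101.0 give a positive sign change, and the estimates 100.0, 101.5, 100.2, 100.1 give an up-down turning point under these assumptions.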
Table S 2.B
Table S 2.B gives the histogram of flagged observations for the seasonal factors (Table D10) using the appropriate cutoff value (default 3.0%). This table looks at the spread of the number of times the MPD exceeded the corresponding level. The range is divided into four intervals: 3.0% to 4.0%, 4.0% to 5.0%, 5.0% to 6.0%, and greater than 6.0%. The first column shows the symbol used in Table S 7.A, the second column gives the range in interval notation, and the last column gives the number found in the corresponding interval. Note that the sum of the last column should agree with the Number Found column of the Flagged MPD row in Table S 2.A.3.
Table S 2.C
Table S 2.C gives selected percentiles for the MPD for the seasonal factors (Table D10).
Tables S 3.A.1 through S 3.A.3 relate to the trading-day factors (Table C18) and follow the same format as Tables S 2.A.1 through S 2.A.3. The only difference between these tables and Tables S 2.A.1 through S 2.A.3 is the default cutoff value of 2.0% instead of the 3.0% used for the seasonal factors.
Tables S 3.B through S 3.C, applied to the trading-day factors (Table C18), have the same format as Tables S 2.B through S 2.C. The default cutoff value is different, with corresponding differences in the intervals in S 3.B.
Tables S 4.A.1 through S 4.A.3 relate to the seasonally adjusted series (Table D11) and follow the same format as Tables S 2.A.1 through S 2.A.3. The same default cutoff value of 3.0% is used.
Tables S 4.B through S 4.C, applied to the seasonally adjusted series (Table D11), have the same format as Tables S 2.B through S 2.C.
Tables S 5.A.1 through S 5.A.3 relate to the month-to-month (or quarter-to-quarter) differences in the seasonally adjusted series and follow the same format as Tables S 2.A.1 through S 2.A.3. The same default cutoff value of 3.0% is used.
Tables S 5.B through S 5.C, applied to the month-to-month (or quarter-to-quarter) differences in the seasonally adjusted series, have the same format as Tables S 2.B through S 2.C. The same default cutoff value of 3.0% is used.
Tables S 6.A.1 through S 6.A.3 relate to the year-to-year differences in the seasonally adjusted series and follow the same format as Tables S 2.A.1 through S 2.A.3. The same default cutoff value of 3.0% is used.
Tables S 6.B through S 6.C, applied to the year-to-year differences in the seasonally adjusted series, have the same format as Tables S 2.B through S 2.C. The same default cutoff value of 3.0% is used.
Table S 7.A
Table S 7.A gives the entire listing of the seasonal factors (Table D10) for each span. The first column gives the date for each observation included in the spans. Note that the dates do not cover the entire original data set. Only those observations included in one or more spans are listed. The next N columns (where N is the number of spans) are the individual spans, starting at the earliest span. The span columns are labeled by their beginning and ending dates.

Following the last span is the Sign Change column. As explained in the description of Table S 2.A.3, a sign change occurs at a given observation when the seasonal factor estimates go from above to below, or below to above, a base level. For the seasonal factors, 100.0 is the base level for the multiplicative model, 0.0 for the additive model. A blank value indicates no sign change, a U indicates a movement upward from the base level, and a D indicates a movement downward from the base level.

The next column is the Turning Point column. As explained in the description of Table S 2.A.3, a turning point occurs when there is an upward then downward movement, or downward then upward movement, of sufficient magnitude. A blank value indicates no turning point, a U-D indicates a movement upward then downward, and a D-U indicates a movement downward then upward.

The next column is the maximum percentage difference (MPD). This quantity, described in the section "Computational Details for Sliding Spans Analysis" on page 2062, is the main computation for sliding spans analysis. A measure of how extreme the MPD value is appears in the last column, the Level of Excess column. The symbols used and their meanings are described in Table S 2.B. If a given observation has not exceeded the cutoff, the Level of Excess column is blank.
Table S 7.B
Table S 7.B gives the entire listing of the trading-day factors (Table C18) for each span. The format of this table is exactly like that of Table S 7.A.
Table S 7.C
Table S 7.C gives the entire listing of the seasonally adjusted data (Table D11) for each span. The format of this table is exactly like that of Table S 7.A except for the Sign Change column, which is not printed. The seasonally adjusted data have the same units as the original data; there is no natural base level as in the case of a percentage. Hence the sign change is not appropriate for D11.
Table S 7.D
Table S 7.D gives the entire listing of the month-to-month (or quarter-to-quarter) changes in seasonally adjusted data for each span. The format of this table is exactly like that of Table S 7.A.
Table S 7.E
Table S 7.E gives the entire listing of the year-to-year changes in seasonally adjusted data for each span. The format of this table is exactly like that of Table S 7.A.
ODS Table Name  Description                                          Option

ODS Tables Created by the MONTHLY and QUARTERLY Statements
Preface         X11 Seasonal Adjustment Program information          always printed
                giving credits, dates, and so on                     unless NOPRINT
A1              Table A1: original series
A2              Table A2: prior monthly
A3              Table A3: original series adjusted for prior monthly factors
A4              Table A4: prior trading day adjustment factors with and without length of month adjustment
Table 31.5 continued

ODS Table Name  Description
A5              Table A5: original series adjusted for priors
B1              Table B1: original series or original series adjusted for priors
B2              Table B2: trend cycle, centered nn-term moving average
B3              Table B3: unmodified SI ratios
B4              Table B4: replacement values for extreme SI ratios
B5              Table B5: seasonal factors
B6              Table B6: seasonally adjusted series
B7              Table B7: trend cycle, Henderson curve
B8              Table B8: unmodified SI ratios
B9              Table B9: replacement values for extreme SI ratios
B10             Table B10: seasonal factors
B11             Table B11: seasonally adjusted series
B13             Table B13: irregular series
B15             Table B15: preliminary trading day regression
B16             Table B16: trading day adjustment factors derived from regression
B17             Table B17: preliminary weights for irregular component
B18             Table B18: trading day adjustment factors from combined weights
B19             Table B19: original series adjusted for preliminary combined trading day weights
C1              Table C1: original series adjusted for preliminary weights
C2              Table C2: trend cycle, centered nn-term moving average
C4              Table C4: modified SI ratios
C5              Table C5: seasonal factors
C6              Table C6: seasonally adjusted series
C7              Table C7: trend cycle, Henderson curve
C9              Table C9: modified SI ratios
C10             Table C10: seasonal factors
C11             Table C11: seasonally adjusted series
C13             Table C13: irregular series
C15             Table C15: final trading day regression
C16             Table C16: trading day adjustment factors derived from regression
C17             Table C17: final weights for irregular component
Table 31.5 continued

ODS Table Name           Description
C18                      Table C18: trading day adjustment factors from combined weights
C19                      Table C19: original series adjusted for final combined trading day weights
D1                       Table D1: original series adjusted for final weights nn-term moving average
D4                       Table D4: modified SI ratios
D5                       Table D5: seasonal factors
D6                       Table D6: seasonally adjusted series
D7                       Table D7: trend cycle, Henderson curve
D8                       Table D8: final unmodified SI ratios
D10                      Table D10: final seasonal factors
D11                      Table D11: final seasonally adjusted series
D12                      Table D12: final trend cycle, Henderson curve
D13                      Table D13: final irregular series
E1                       Table E1: original series modified for extremes
E2                       Table E2: modified seasonally adjusted series
E3                       Table E3: modified irregular series
E5                       Table E5: month-to-month changes in original series
E6                       Table E6: month-to-month changes in final seasonally adjusted series
F1                       Table F1: MCD moving average
A13                      Table A13: ARIMA forecasts
A14                      Table A14: ARIMA backcasts
A15                      Table A15: ARIMA extrapolation
B14                      Table B14: irregular values excluded from trading day regression
                         Table C14: irregular values excluded from trading day regression
                         Table D9: final replacement values
                         adjusted prior daily weights
                         final/preliminary trading day regression, part 1
TDR_1                    final/preliminary trading day regression, part 2

Table 31.5 continued

ODS Table Name           Description
D9A                      year-to-year change in irregular and seasonal components and moving seasonality ratio
                         stable seasonality test
                         moving seasonality test
                         nonparametric test for the presence of seasonality assuming stability
CombinedSeasonalityTest  summary of results and combined test for the presence of identifiable seasonality
f2a                      F2 summary measures, part 1
f2b                      F2 summary measures, part 2
f2c                      F2 summary measures, part 3
f2d                      I/C ratio for monthly/quarterly span
f2f                      average % change with regard to sign and standard deviation over span
E4                       differences or ratios of annual totals for original and adjusted series
ChartG1                  chart G1
ChartG2                  chart G2
ODS Tables Created by the ARIMA Statement

ODS Table Name   Description                                  Option
CriteriaSummary  criteria summary                             ARIMA statement
ConvergeSummary  convergence summary
ArimaEst         ARIMA estimation results, part 1
ArimaEst2        ARIMA estimation results, part 2
Model_Summary    model summary
Ljung_BoxQ       table of Ljung-Box Q statistics
A13              Table A13: ARIMA forecasts
A14              Table A14: ARIMA backcasts
A15              Table A15: ARIMA extrapolation

ODS Tables Created by the SSPAN Statement

ODS Table Name   Description                                            Option
SPR0A_1          S 0.A sliding spans analysis, number, length of spans  default printing
SpanDates        S 0.A sliding spans analysis: dates of spans
Table 31.5 continued

ODS Table Name   Description                                               Option
SPR0B            S 0.B summary of F tests for stable and moving seasonality
SPR1_1           S 1.A range analysis of seasonal factors
SPR1_b           S 1.B summary of range measures
SPRXA            S X.A.1 breakdown of differences by month or quarter
SPRXB_2          S X.B histogram of flagged observations
SPRXA_2          S X.A.2 breakdown of differences by year
MpdStats         S X.C: statistics for maximum percentage differences
S_X_A_3          S X.A.3 breakdown summary of flagged observations
SPR7_X           S 7.X sliding spans analysis                              PRINTALL
   monthly date=date;
   var sales;
   tables b1 d11;
   output out=out b1=series d10=d10 d11=d11
                  d12=d12 d13=d13;
run;

title 'Monthly Retail Sales Data (in $1000)';
proc sgplot data=out;
   series x=date y=series / markers
                            markerattrs=(color=red symbol=asterisk)
                            lineattrs=(color=red)
                            legendlabel="original" ;
   series x=date y=d11    / markers
                            markerattrs=(color=blue symbol=circle)
                            lineattrs=(color=blue)
                            legendlabel="adjusted" ;
   yaxis label='Original and Seasonally Adjusted Time Series';
run;

title 'Monthly Seasonal Factors (in percent)';
proc sgplot data=out;
   series x=date y=d10 / markers markerattrs=(symbol=CircleFilled) ;
run;

title 'Monthly Retail Sales Data (in $1000)';
proc sgplot data=out;
   series x=date y=d12 / markers markerattrs=(symbol=CircleFilled) ;
run;

title 'Monthly Irregular Factors (in percent)';
proc sgplot data=out;
   series x=date y=d13 / markers markerattrs=(symbol=CircleFilled) ;
run;
title 'Monthly Retail Sales Data (in $1000)';
proc x11 data=quarter;
   var fy35rr;
   quarterly date=date;
   tables b1 d11;
run;
Year        1st        2nd        3rd        4th      Total
1971      6.590      6.010      6.510      6.180     25.290
1972      5.520      5.590      5.840      6.330     23.280
1973      6.520      7.350      9.240     10.080     33.190
1974      9.910     11.150     12.400     11.640     45.100
1975      9.940      8.160      8.220      8.290     34.610
1976      7.540      7.440      7.800      7.280     30.060
-----------------------------------------------------------
Avg       7.670      7.617      8.335      8.300
Total: 191.53    Mean: 7.9804    S.D.: 1.9424
Year        1st        2nd        3rd        4th      Total
1971      6.877      6.272      6.222      5.956     25.326
1972      5.762      5.836      5.583      6.089     23.271
1973      6.820      7.669      8.840      9.681     33.009
1974     10.370     11.655     11.855     11.160     45.040
1975     10.418      8.534      7.853      7.947     34.752
1976      7.901      7.793      7.444      6.979     30.116
-----------------------------------------------------------
Avg       8.025      7.960      7.966      7.969
Total: 191.51    Mean: 7.9797    S.D.: 1.9059
   end;
run;

proc x11 data=a;
   monthly date=date additive
           fullweight=3.0 zeroweight=3.5;
   var x;
   table d9;
   output out=b b1=original e1=e1;
run;

proc sgplot data=b;
   series x=date y=original / markers
                              markerattrs=(color=red symbol=asterisk)
                              lineattrs=(color=red)
                              legendlabel="unmodified" ;
   series x=date y=e1       / markers
                              markerattrs=(color=blue symbol=circle)
                              lineattrs=(color=blue)
                              legendlabel="modified" ;
   yaxis label='Original and Outlier Adjusted Time Series';
run;
D9 Final Replacement Values for Extreme SI Ratios (printed table; the recoverable entries are -10.671 in the JUN column and 11.180 in the JUL column, and the remaining recoverable cells are missing values)
References
Bell, W. R. and Hillmer, S. C. (1984), "Issues Involved with the Seasonal Adjustment of Economic Time Series," Journal of Business and Economic Statistics, 2(4).

Bobbit, L. G. and Otto, M. C. (1990), "Effects of Forecasts on the Revisions of Seasonally Adjusted Data Using the X-11 Adjustment Procedure," Proceedings of the Business and Economic Statistics Section of the American Statistical Association, 449–453.

Buszuwski, J. A. (1987), "Alternative ARIMA Forecasting Horizons When Seasonally Adjusting Producer Price Data with X-11-ARIMA," Proceedings of the Business and Economic Statistics Section of the American Statistical Association, 488–493.

Cleveland, W. P. and Tiao, G. C. (1976), "Decomposition of Seasonal Time Series: A Model for Census X-11 Program," Journal of the American Statistical Association, 71(355).

Cleveland, W. S. and Devlin, S. J. (1980), "Calendar Effects in Monthly Time Series: Detection by Spectrum Analysis and Graphical Methods," Journal of the American Statistical Association, 75(371), 487–496.
Dagum, E. B. (1980), The X-11-ARIMA Seasonal Adjustment Method, Statistics Canada.

Dagum, E. B. (1982a), "The Effects of Asymmetric Filters on Seasonal Factor Revision," Journal of the American Statistical Association, 77(380), 732–738.

Dagum, E. B. (1982b), "Revisions of Seasonally Adjusted Data Due to Filter Changes," Proceedings of the Business and Economic Section, the American Statistical Association, 39–45.

Dagum, E. B. (1982c), "Revisions of Time Varying Seasonal Filters," Journal of Forecasting, 1(2), 173–187.

Dagum, E. B. (1983), The X-11-ARIMA Seasonal Adjustment Method, Technical Report 12-564E, Statistics Canada.

Dagum, E. B. (1985), "Moving Averages," in S. Kotz and N. L. Johnson, eds., Encyclopedia of Statistical Sciences, volume 5, New York: John Wiley & Sons.

Dagum, E. B. (1988), The X-11-ARIMA/88 Seasonal Adjustment Method: Foundations and User's Manual, Ottawa: Statistics Canada.

Dagum, E. B. and Laniel, N. (1987), "Revisions of Trend Cycle Estimators of Moving Average Seasonal Adjustment Method," Journal of Business and Economic Statistics, 5(2), 177–189.

Davies, N., Triggs, C. M., and Newbold, P. (1977), "Significance Levels of the Box-Pierce Portmanteau Statistic in Finite Samples," Biometrika, 64, 517–522.

Findley, D. F. and Monsell, B. C. (1986), "New Techniques for Determining If a Time Series Can Be Seasonally Adjusted Reliably, and Their Application to U.S. Foreign Trade Series," in M. R. Perryman and J. R. Schmidt, eds., Regional Econometric Modeling, 195–228, Amsterdam: Kluwer-Nijhoff.

Findley, D. F., Monsell, B. C., Shulman, H. B., and Pugh, M. G. (1990), "Sliding Spans Diagnostics for Seasonal and Related Adjustments," Journal of the American Statistical Association, 85(410), 345–355.

Ghysels, E. (1990), "Unit Root Tests and the Statistical Pitfalls of Seasonal Adjustment: The Case of U.S. Post War Real GNP," Journal of Business and Economic Statistics, 8(2), 145–152.

Higginson, J. (1975), "An F Test for the Presence of Moving Seasonality When Using Census Method II-X-11 Variant," StatCan Staff Paper STC2102E, Seasonal Adjustment and Time Series Analysis Staff, Statistics Canada, Ottawa.

Huot, G., Chui, L., Higginson, J., and Gait, N. (1986), "Analysis of Revisions in the Seasonal Adjustment of Data Using X11ARIMA Model-Based Filters," International Journal of Forecasting, 2, 217–229.

Ladiray, D. and Quenneville, B. (2001), Seasonal Adjustment with the X-11 Method, New York: Springer-Verlag.

Laniel, N. (1985), "Design Criteria for the 13-Term Henderson End-Weights," Working Paper, Methodology Branch, Ottawa: Statistics Canada.

Lehmann, E. L. (1998), Nonparametrics: Statistical Methods Based on Ranks, San Francisco: Holden-Day.
Ljung, G. M. and Box, G. E. P. (1978), "On a Measure of Lack of Fit in Time Series Models," Biometrika, 65(2), 297–303.

Lothian, J. (1978), "The Identification and Treatment of Moving Seasonality in the X-11 Seasonal Adjustment Method," StatCan Staff Paper STC0803E, Seasonal Adjustment and Time Series Analysis Staff, Statistics Canada, Ottawa.

Lothian, J. (1984a), "The Identification and Treatment of Moving Seasonality in the X-11-ARIMA Seasonal Adjustment Method," StatCan Staff Paper, Seasonal Adjustment and Time Series Analysis Staff, Statistics Canada, Ottawa.

Lothian, J. (1984b), "The Identification and Treatment of Moving Seasonality in X-11-ARIMA," in Proceedings of the Business and Economic Statistics Section of the American Statistical Association, 166–171.

Lothian, J. and Morry, M. (1978a), "Selection of Models for the Automated X-11-ARIMA Seasonal Adjustment Program," StatCan Staff Paper STC1789, Seasonal Adjustment and Time Series Analysis Staff, Statistics Canada, Ottawa.

Lothian, J. and Morry, M. (1978b), "A Test for the Presence of Identifiable Seasonality When Using the X-11-ARIMA Program," StatCan Staff Paper STC2118, Seasonal Adjustment and Time Series Analysis Staff, Statistics Canada, Ottawa.

Marris, S. (1961), "The Treatment of Moving Seasonality in Census Method II," in Seasonal Adjustment on Electronic Computers, 257–309, Paris: Organisation for Economic Co-operation and Development.

Monsell, B. C. (1984), "The Substantive Changes in the X-11 Procedure of X-11-ARIMA," SRD Research Report Census/SRD/RR-84/10, Bureau of the Census, Statistical Research Division.

Pierce, D. A. (1980), "Data Revisions with Moving Average Seasonal Adjustment Procedures," Journal of Econometrics, 14, 95–114.

Shiskin, J. (1958), "Decomposition of Economic Time Series," Science, 128(3338).

Shiskin, J. and Eisenpress, H. (1957), "Seasonal Adjustment by Electronic Computer Methods," Journal of the American Statistical Association, 52(280).

Shiskin, J., Young, A. H., and Musgrave, J. C. (1967), "The X-11 Variant of the Census Method II Seasonal Adjustment Program," Technical Report 15, U.S. Department of Commerce, Bureau of the Census.

U.S. Bureau of the Census (1969), X-11 Information for the User, U.S. Department of Commerce, Washington, DC: Government Printing Office.

Young, A. H. (1965), "Estimating Trading Day Variation in Monthly Economic Time Series," Technical Report 12, U.S. Department of Commerce, Bureau of the Census, Washington, DC.
Chapter 32
Example 32.3: Seasonal Adjustment
Example 32.4: RegARIMA Automatic Model Selection
Example 32.5: Automatic Outlier Detection
Example 32.6: User-Defined Regressors
Example 32.7: MDLINFOIN= and MDLINFOOUT= Data Sets
Example 32.8: Setting Regression Parameters
Example 32.9: Illustration of ODS Graphics
Acknowledgments: X12 Procedure
References
Graphics are now available with the X12 procedure. For more information, see the section ODS Graphics on page 2141.
The results of the seasonal adjustment are in Table D11 (the final seasonally adjusted series) in the displayed output, as shown in Figure 32.1.
Figure 32.1 Basic Seasonal Adjustment
The X12 Procedure

Table D 11: Final Seasonally Adjusted Data
For variable sales

Year       JAN      FEB      MAR      APR      MAY      JUN
           JUL      AUG      SEP      OCT      NOV      DEC     Total

1978         .        .        .        .        .        .
             .        .  124.560  124.649  124.920  129.002   503.131
1979   125.087  126.759  125.252  126.415  127.012  130.041
       128.056  129.165  127.182  133.847  133.199  135.847   1547.86
1980   128.767  139.839  143.883  144.576  148.048  145.170
       140.021  153.322  159.128  161.614  167.996  165.388   1797.75
1981   175.984  166.805  168.380  167.913  173.429  175.711
       179.012  182.017  186.737  197.367  183.443  184.907   2141.71
1982   186.080  203.099  193.386  201.988  198.322  205.983
       210.898  213.516  213.897  218.902  227.172  240.453   2513.69
1983   231.839  224.165  219.411  225.907  225.015  226.535
       221.680  222.177  222.959  212.531  230.552  232.565   2695.33
1984   237.477  239.870  246.835  242.642  244.982  246.732
       251.023  254.210  264.670  266.120  266.217  276.251   3037.03
1985   275.485  281.826  294.144  286.114  293.192  296.601
       293.861  309.102  311.275  319.239  319.936  323.663   3604.44
1986   326.693  330.341  330.383  330.792  333.037  332.134
       336.444  341.017  346.256  350.609  361.283  362.519   4081.51
1987   364.951  371.274  369.238  377.242  379.413  376.451
       378.930  375.392  374.940  373.612  368.753  364.885   4475.08
1988   371.618  383.842  385.849  404.810  381.270  388.689
       385.661  377.706  397.438  404.247  414.084  416.486   4711.70
1989   426.716  419.491  427.869  446.161  438.317  440.639
       450.193  454.638  460.644  463.209  427.728  485.386   5340.99
1990   477.259  477.753  483.841  483.056  481.902  499.200
       484.893  485.245        .        .        .        .   3873.15
--------------------------------------------------------------------
Avg    277.330  280.422  282.373  286.468  285.328  288.657
       288.389  291.459  265.807  268.829  268.774  276.446

Total: 40323    Mean: 280.02    S.D.: 111.31
Min: 124.56     Max: 499.2
You can compare the original series (Table A1) and the final seasonally adjusted series (Table D11) by plotting them together as shown in Figure 32.2. These tables are requested in the OUTPUT statement and are written to the OUT= data set. Note that the default variable name used in the output data set is the input variable name followed by an underscore and the corresponding table name.
proc x12 data=sales date=date noprint;
   var sales;
   x11;
   output out=out a1 d11;
run;

proc sgplot data=out;
   series x=date y=sales_A1  / name="A1" markers
                               markerattrs=(color=red symbol=asterisk)
                               lineattrs=(color=red);
   series x=date y=sales_D11 / name="D11" markers
                               markerattrs=(symbol=circle)
                               lineattrs=(color=blue);
   yaxis label='Original and Seasonally Adjusted Time Series';
run;
PROC X12 options ;
   VAR variables ;
   BY variables ;
   ID variables ;
   EVENT variables ;
   USERDEFINED variables ;
   TRANSFORM options ;
   ADJUST options ;
   IDENTIFY options ;
   AUTOMDL options ;
   OUTLIER options ;
   REGRESSION options ;
   INPUT variables ;
   ARIMA options ;
   ESTIMATE options ;
   X11 options ;
   FORECAST options ;
   OUTPUT options ;
   TABLES options ;
The PROC X12 statements perform basically the same function as the Census Bureau's X-12-ARIMA specs. Specs (specifications) are used in X-12-ARIMA to control the computations and output. The PROC X12 statement performs some of the same functions as the Series spec in the Census Bureau's X-12-ARIMA software. The ADJUST statement performs some of the same functions as the Transform spec. The TRANSFORM, IDENTIFY, AUTOMDL, OUTLIER, REGRESSION, ARIMA, ESTIMATE, X11, and FORECAST statements are designed to perform the same functions as the corresponding X-12-ARIMA specs, although full compatibility is not yet available. The Census Bureau documentation X-12-ARIMA Reference Manual (U.S. Bureau of the Census 2001b) can provide additional insight into the functionality of these statements.
Functional Summary
Table 32.1 summarizes the statements and options that control the X12 procedure.
Table 32.1 X12 Syntax Summary
Description                                                  Statement     Option

Data Set Options
specify input data set
specify regression and ARIMA information
output regression and ARIMA information
write table values to an output data set

Display Control Options
suppress all displayed output                                PROC X12      NOPRINT
request tables that are not displayed by default
specify that the summary line not be displayed
display iterations history
display information on restarted iterations
display regression model parameter estimates
display automatic model information

Date Information Options
specify the date variable                                    PROC X12
specify the date of the first observation                    PROC X12
specify the beginning and/or ending date of the subset       PROC X12
specify the interval of the time series                      PROC X12
specify the interval of the time series                      PROC X12

Declaring the Role of Variables
specify BY-group processing                                  BY
specify identifying variables                                ID
specify the variables to be seasonally adjusted              VAR
specify user-defined variables available for regression      USERDEFINED

Controlling the Table Computations
suppress trimming of leading/trailing missing values
transform or prior-adjust the series
transform or prior-adjust the series
adjust the series by using a predefined adjustment variable
use differencing to identify the ARIMA part of the model
specify automatic outlier detection
estimate the regARIMA model specified by the REGRESSION
   and ARIMA statements or the MDLINFOIN= option
specify seasonal adjustment                                  X11
specify the number of forecasts to extend the series         FORECAST      LEAD=
   for seasonal adjustment

Specifying the Regression Model
specify Census regression variables
specify user-defined regression variables
specify user-defined event definition data set
specify user-defined event regression variables

Specifying the ARIMA Model
use X-12-ARIMA TRAMO-based method to choose a model
specify the ARIMA part of the model
The PROC X12 statement provides information about the time series to be processed by PROC X12. Either the START= or the DATE= option must be specified.
The original series is displayed in Table A1. If there are missing values in the original series and a regARIMA model is specified or automatically selected, then Table MV1 is displayed. Table MV1 contains the original series with missing values replaced by the predicted values from the fitted model. Table B1 is displayed when the original data are altered (for example, through an ARIMA model estimation, prior adjustment factors, or regression) or the series is extended with forecasts.
Although the X-12-ARIMA method handles missing values, there are some restrictions. In order for PROC X12 to process the series, no month or quarter can contain missing values for all years. For instance, if the third quarter contained only missing values for all years, then processing is skipped for that series. In addition, if more than half the values for a month or a quarter are missing, then a warning message is displayed in the log file, and other errors might occur later in processing. If a series contains many missing values, other methods of missing value replacement should be considered prior to seasonally adjusting the series.
The following options can appear in the PROC X12 statement.
DATA=SAS-data-set
specifies the input SAS data set used. If this option is omitted, the most recently created SAS data set is used.
DATE=variable DATEVAR=variable
specifies a variable that gives the date for each observation. Unless specified in the SPAN= option, the starting and ending dates are obtained from the first and last values of the DATE= variable, which must contain SAS date or datetime values. The procedure checks values of the DATE= variable to ensure that the input observations are sequenced correctly in ascending order. If the INTERVAL= option or the SEASONS= option is specified, the values of the date variable must be consistent with the specified seasonality or interval. If neither the INTERVAL= option nor the SEASONS= option is specified, then the procedure tries to determine the type of data from the values of the date variable. This variable is automatically added to the OUT= data set if a data set is requested in an OUTPUT statement, and the date values for the variable are extrapolated if necessary. If the DATE= option is not specified, the START= option must be specified.
START=mmmyy START=yyQq STARTDATE=mmmyy STARTDATE=yyQq
specifies the date of the first observation. Unless the SPAN= option is used, the starting and ending dates are the dates of the first and last observations, respectively. Either this option or the DATE= option is required. When using this option, use either the INTERVAL= option or the SEASONS= option to specify monthly or quarterly data. If neither the INTERVAL= option nor the SEASONS= option is present, monthly data are assumed. Note that for a quarterly date, the specification must be enclosed in quotes. A four-digit year can be specified; if a two-digit year is specified, the value specified in the YEARCUTOFF= SAS system option applies. When using the START= option with BY processing, the start date is applied to the first observation in each BY group.
SPAN=(mmmyy ,mmmyy ) SPAN=(yyQq ,yyQq )
specifies the dates of the first and last observations to define a subset for processing. A single date in parentheses is interpreted to be the starting date of the subset. To specify only the ending date, use SPAN=(,mmmyy). If the starting or ending date is omitted, then the first or last date, respectively, of the input data set is assumed. A four-digit year can be specified; if a two-digit year is specified, the value specified in the YEARCUTOFF= SAS system option applies.
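As a minimal sketch of the SPAN= option, the following statements restrict processing to a subset of a monthly series. The data set MONTHLY and the variables DATE and SALES are hypothetical names chosen for illustration.

```sas
/* Process only January 1995 through December 1999 */
proc x12 data=monthly date=date span=(jan1995,dec1999);
   var sales;
   x11;
run;

/* Give only the ending date; the subset then starts at the
   first observation of the input data set */
proc x12 data=monthly date=date span=(,dec1999);
   var sales;
   x11;
run;
```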
INTERVAL=interval
specifies the frequency of the input time series. If the input data consist of quarterly observations, then INTERVAL=QTR should be used. If the input data consist of monthly observations, then INTERVAL=MONTH should be used. If the INTERVAL= option is not specified and SEASONS=4, then INTERVAL=QTR is assumed; likewise, SEASONS=12 implies INTERVAL=MONTH. If both the INTERVAL= option and the SEASONS= option are specified, the values should not conflict. If neither the INTERVAL= option nor the SEASONS= option is specified and the START= option is specified, then the data are assumed to be monthly. If a date variable is specified using the DATE= option, it is not necessary to specify the INTERVAL= option or the SEASONS= option; however, if specified, the values of the INTERVAL= option or the SEASONS= option should not conflict with the values of the date variable. See Chapter 4, "Date Intervals, Formats, and Functions," for more details about intervals.
SEASONS=number
specifies the number of observations in a seasonal cycle. If the SEASONS= option is not specified and INTERVAL=QTR, then SEASONS=4 is assumed. If the SEASONS= option is not specified and INTERVAL=MONTH, then SEASONS=12 is assumed. If the SEASONS= option is specified, its value should not conflict with the values of the INTERVAL= option or the values of the date variable. See the preceding descriptions for the START=, DATE=, and INTERVAL= options for more details.
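The interplay of the DATE=, START=, INTERVAL=, and SEASONS= options can be sketched as follows. The data sets MONTHLY and QTRLY and the variables DATE and SALES are hypothetical; the statements are an illustration of the rules above, not a prescribed usage.

```sas
/* Monthly data with a SAS date variable: the type of data is
   determined from the values of DATE */
proc x12 data=monthly date=date;
   var sales;
   x11;
run;

/* Monthly data without a date variable: START= is required, and
   monthly data are assumed when INTERVAL= and SEASONS= are omitted */
proc x12 data=monthly start=jan1990;
   var sales;
   x11;
run;

/* Quarterly data without a date variable: the quarterly START=
   value is quoted, and SEASONS=4 implies INTERVAL=QTR */
proc x12 data=qtrly start='1990q1' seasons=4;
   var sales;
   x11;
run;
```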
NOTRIMMISS
suppresses the default, by which leading and trailing missing values are trimmed from the series. If NOTRIMMISS is used, PROC X12 automatically generates missing value regressors for any missing value within the span of the series, including leading and trailing missing values.
INEVENT=SAS-data-set
specifies the input data set that defines any user-defined event variables. This option can be omitted if events are not specified or if only SAS predefined events are specified in an EVENT statement. For more information about the format of this data set, see the section "INEVENT= Data Set" on page 2144.
MDLINFOIN=SAS-data-set
specifies an optional input data set that contains model information that can replace the information contained in the TRANSFORM, REGRESSION, ARIMA, and AUTOMDL statements. The MDLINFOIN= data set can contain BY-group and series names; it is useful for providing specific information about each series to be seasonally adjusted. See the section "MDLINFOIN= and MDLINFOOUT= Data Sets" on page 2142 for details.
MDLINFOOUT=SAS-data-set
specifies the optional output data set that contains the transformation, regression, and ARIMA information related to each seasonally adjusted series. The data set is sorted by the BY-group variables, if any, and by series names. The MDLINFOOUT= data set can be used as input for the MDLINFOIN= option. See the section "MDLINFOIN= and MDLINFOOUT= Data Sets" on page 2142 for details.
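Because the MDLINFOOUT= data set can be used as input for the MDLINFOIN= option, one possible workflow is to select a model once and reuse it in a later run. The following is a sketch under that assumption; the data set names MONTHLY and MDL and the variables DATE and SALES are hypothetical.

```sas
/* Step 1: let AUTOMDL choose a model and save the model
   information in WORK.MDL */
proc x12 data=monthly date=date mdlinfoout=mdl;
   var sales;
   automdl;
   estimate;
   x11;
run;

/* Step 2: reuse the saved model information instead of
   respecifying TRANSFORM, REGRESSION, or ARIMA statements */
proc x12 data=monthly date=date mdlinfoin=mdl;
   var sales;
   estimate;
   x11;
run;
```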
NOPRINT
suppresses all displayed output.
BY Statement
BY variables ;
A BY statement can be used with PROC X12 to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input DATA= data set to be sorted in order of the BY variables.
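A short sketch of BY-group processing follows. The BY variable REGION, the data set MONTHLY, and the variables DATE and SALES are hypothetical; the PROC SORT step is included because the procedure expects sorted input.

```sas
/* Sort by the BY variable, then by date within each group */
proc sort data=monthly;
   by region date;
run;

/* Seasonally adjust SALES separately for each value of REGION */
proc x12 data=monthly date=date;
   by region;
   var sales;
   x11;
run;
```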
ID Statement
ID variables ;
If you are creating an output data set, use the ID statement to copy values of the ID variables, in addition to the table values, into the output data set. In addition, if the VAR statement is omitted, all numeric variables that are not identified as BY variables, ID variables, the DATE= variable, or user-defined regressors are processed as time series. The ID statement has no effect when a VAR statement is specified and an output data set is not created. If the DATE= variable is specified in the PROC X12 statement, this variable is included automatically in the OUTPUT data set. If no DATE= variable is specified, the variable _DATE_ is added. The date variable (or _DATE_) values outside the range of the actual data (from forecasting) are extrapolated, while all other ID variables are missing in the forecast horizon.
EVENT Statement
EVENT variables < / options > ;
The EVENT statement specifies EVENTs to be included in the regression portion of the regARIMA model. Multiple EVENT statements can be specified. If a MDLINFOIN= data set is not specified, then all variables specified in the EVENT statements are applied to all BY-groups and all time series that are processed. If a MDLINFOIN= data set is specified, then the EVENT statements apply only if no regression information for the BY-group and series is available in the MDLINFOIN= data set. The EVENTs specified in the EVENT statements either must be SAS predefined EVENTs or must be defined in the data set specified in the INEVENT=SAS-data-set option of the PROC X12 statement. For a list of SAS predefined EVENTs, see the section "EVENTKEY Statement" in Chapter 6, "The HPFEVENTS Procedure" (SAS High-Performance Forecasting User's Guide).
The EVENT statement can also be used to include outlier, level shift, and temporary change regressors that are available as predefined U.S. Census Bureau variables in the X-12-ARIMA program. For example, the following statements specify an additive outlier in January 1970 and a level shift that begins in July 1971:
   proc x12 data=ICMETI seasons=12 start=jan1968;
      event AO01JAN1970D CBLS01JUL1971D;
and the following statements specify an additive outlier in the second quarter 1970 and a temporary change that begins in the fourth quarter 1971:
   proc x12 data=ICMETI seasons=4 start='1970q1';
      event AO01APR1970D TC01OCT1971D;
The following options can appear in the EVENT statement.
B=(value <F> . . . )
specifies initial or fixed values for the EVENT parameters. For details about the B= option, see the B=(value <F> . . . ) option in the section "REGRESSION Statement" on page 2124.
USERTYPE=AO USERTYPE=CONSTANT USERTYPE=EASTER USERTYPE=HOLIDAY USERTYPE=LABOR
USERTYPE=LOM USERTYPE=LOMSTOCK USERTYPE=LOQ USERTYPE=LPYEAR USERTYPE=LS USERTYPE=RP USERTYPE=SCEASTER USERTYPE=SEASONAL USERTYPE=TC USERTYPE=TD USERTYPE=TDSTOCK USERTYPE=THANKS USERTYPE=USER
For details about the USERTYPE= option, see the USERTYPE= option in the section REGRESSION Statement on page 2124.
INPUT Statement
INPUT variables < / options > ;
The INPUT statement specifies variables in the PROC X12 DATA= data set that are to be used as regressors in the regression portion of the regARIMA model. The variables in the data set should contain the values for each observation that define the regressor. Future values of regression variables should also be included in the DATA= data set if the time series listed in the VAR statement is to be extended with regARIMA forecasts. Multiple INPUT statements can be specified. If a MDLINFOIN= data set is not specified, then all variables listed in the INPUT statements are applied to all BY-groups and all time series that are processed. If a MDLINFOIN= data set is specified, then the INPUT statements apply only if no regression information for the BY-group and series is available in the MDLINFOIN= data set. The following options can appear in the INPUT statement.
B=(value <F> . . . )
specifies initial or fixed values for the INPUT variable parameters. For details about the B= option, see the B=(value <F> . . . ) option in the section "REGRESSION Statement" on page 2124.
USERTYPE=AO USERTYPE=CONSTANT USERTYPE=EASTER USERTYPE=HOLIDAY USERTYPE=LABOR USERTYPE=LOM USERTYPE=LOMSTOCK
USERTYPE=LOQ USERTYPE=LPYEAR USERTYPE=LS USERTYPE=RP USERTYPE=SCEASTER USERTYPE=SEASONAL USERTYPE=TC USERTYPE=TD USERTYPE=TDSTOCK USERTYPE=THANKS USERTYPE=USER
For details about the USERTYPE= option, see the USERTYPE= option in the section REGRESSION Statement on page 2124.
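The following sketch shows one way an INPUT statement might be combined with an ARIMA model. The regressor PROMO, the data set MONTHLY, and the variables DATE and SALES are hypothetical, and the choice of USERTYPE=HOLIDAY is only an illustration; its effect is described with the REGRESSION statement.

```sas
/* PROMO is a user-defined regressor stored in the DATA= data set;
   it should include future values if forecasts are requested */
proc x12 data=monthly date=date;
   var sales;
   input promo / usertype=holiday;
   arima model=((0,1,1)(0,1,1));
   estimate;
   x11;
run;
```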
ADJUST Statement
ADJUST options ;
The ADJUST statement adjusts the series for leap year and length-of-period factors prior to estimating a regARIMA model. The Prior Adjustment Factors table is associated with the ADJUST statement. The following option can appear in the ADJUST statement.
PREDEFINED=LOM PREDEFINED=LOQ PREDEFINED=LPYEAR
specifies length-of-month adjustment, length-of-quarter adjustment, or leap year adjustment. PREDEFINED=LOM and PREDEFINED=LOQ are equivalent; the actual adjustment is determined by the interval of the time series. Also, since leap year adjustment is a limited form of length-of-period adjustment, only one type of predefined adjustment can be specified. The PREDEFINED= option should not be used in conjunction with PREDEFINED=TD or PREDEFINED=TD1COEF in the REGRESSION statement or MODE=ADD or MODE=PSEUDOADD in the X11 statement. PREDEFINED=LPYEAR cannot be specified unless the series is log transformed. If the series is to be transformed by using a Box-Cox or logistic transformation, the series is first adjusted according to the ADJUST statement, then transformed. In the case of a length-of-month adjustment for the series with observations $Y_t$, each observation is first divided by the number of days in that month, $m_t$, and then multiplied by the average length of month (30.4375), resulting in $(30.4375 \times Y_t)/m_t$. Length-of-quarter adjustments are performed in a similar manner, resulting in $(91.3125 \times Y_t)/q_t$, where $q_t$ is the length in days of quarter $t$.
Forecasts of the transformed and adjusted data are transformed and adjusted back to the original scale for output.
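As a worked illustration of the length-of-month adjustment, note that the two constants equal the average month and quarter lengths implied by a 365.25-day year. The symbol $Y_t^{\mathrm{adj}}$ for the adjusted value is introduced here only for illustration and is not notation from this chapter.

```latex
\[
\frac{365.25}{12} = 30.4375, \qquad \frac{365.25}{4} = 91.3125
\]
\[
m_t = 28:\quad Y_t^{\mathrm{adj}} = \frac{30.4375}{28}\,Y_t \approx 1.0871\,Y_t,
\qquad
m_t = 31:\quad Y_t^{\mathrm{adj}} = \frac{30.4375}{31}\,Y_t \approx 0.9819\,Y_t
\]
```

A 28-day February is therefore scaled up by roughly 8.7 percent, while a 31-day month is scaled down by roughly 1.8 percent.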
ARIMA Statement
ARIMA options ;
The ARIMA statement specifies the ARIMA part of the regARIMA model. This statement defines a pure ARIMA model if no REGRESSION statements, INPUT statements, or EVENT statements are specified. The ARIMA part of the model can include multiplicative seasonal factors. The following option can appear in the ARIMA statement.
MODEL=((p d q) (P D Q)s)
specifies the ARIMA model. The format follows standard Box-Jenkins notation (Box, Jenkins, and Reinsel 1994). The nonseasonal AR and MA orders are given by p and q, respectively, while the seasonal AR and MA orders are given by P and Q. The number of differences and seasonal differences are given by d and D, respectively. The notation (p d q) and (P D Q) can also be specified as (p, d, q) and (P, D, Q). The maximum lag of any AR or MA parameter is 36. The maximum value of a difference order, d or D, is 144. All values for p, d, q, P, D, and Q should be nonnegative integers. The lag that corresponds to seasonality is s; s should be a positive integer. If s is omitted, it is set equal to the value used in the SEASONS= option in the PROC X12 statement. For example, the following statements specify an ARIMA $(2,1,1)(1,1,0)_{12}$ model:
   proc x12 data=ICMETI seasons=12 start=jan1968;
      arima model=((2,1,1)(1,1,0));
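Written out in backshift notation, an ARIMA (2,1,1)(1,1,0) model with seasonal lag 12 takes the following form. This is a sketch under the usual Box-Jenkins sign conventions; the symbols $\phi_i$, $\Phi_1$, $\theta_1$, $z_t$ (the series being modeled), and $a_t$ (the white noise innovation) are assumptions introduced for illustration rather than notation defined in this section.

```latex
\[
(1 - \phi_1 B - \phi_2 B^2)\,(1 - \Phi_1 B^{12})\,(1 - B)\,(1 - B^{12})\, z_t
  = (1 - \theta_1 B)\, a_t
\]
```

The two nonseasonal AR terms come from p=2, the single seasonal AR term from P=1, the factors $(1 - B)$ and $(1 - B^{12})$ from d=1 and D=1, and the single MA term from q=1; Q=0 contributes no seasonal MA factor.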
ESTIMATE Statement
ESTIMATE options ;
The ESTIMATE statement estimates the regARIMA model. The regARIMA model is specified by the REGRESSION, INPUT, EVENT, and ARIMA statements or by the MDLINFOIN= data set. Estimation output includes point estimates and standard errors for all estimated AR, MA, and regression parameters; the maximum likelihood estimate of the variance $\sigma^2$; t statistics for individual regression parameters; $\chi^2$ statistics for assessing the joint significance of the parameters associated with certain regression effects (if included in the model); and likelihood-based model selection statistics (if the exact likelihood function is used). The regression effects for which $\chi^2$ statistics are produced are fixed seasonal effects.
Tables displayed in the output associated with estimation are "Exact ARMA Likelihood Estimation Iteration Tolerances," "Average Absolute Percentage Error in within-Sample Forecasts," "ARMA Iteration History," "AR/MA Roots," "Exact ARMA Likelihood Estimation Iteration Summary," "Regression Model Parameter Estimates," "Chi-Squared Tests for Groups of Regressors," "Exact ARMA Maximum Likelihood Estimation," and "Estimation Summary."
The following options can appear in the ESTIMATE statement.
MAXITER=value
specifies the maximum number of iterations (for estimating the AR and MA parameters) allowed. For models with regression variables, this limit applies to the total number of ARMA iterations over all iterations of the iterative generalized least squares (IGLS) algorithm. For models without regression variables, this is the maximum number of iterations allowed for the set of ARMA iterations. The default is MAXITER=200.
TOL=value
specifies the convergence tolerance for the nonlinear estimation. Absolute changes in the log-likelihood are compared to the TOL= value to check convergence of the estimation iterations. For models with regression variables, the TOL= value is used to check convergence of the IGLS iterations (where the regression parameters are reestimated for each new set of AR and MA parameters). For models without regression variables, there are no IGLS iterations, and the TOL= value is then used to check convergence of the nonlinear iterations used to estimate the AR and MA parameters. The default value is TOL=0.00001. The minimum tolerance value is a positive value based on the machine precision and the length of the series. If a tolerance less than the minimum supported value is specified, an error message is displayed and the series is not processed.
ITPRINT
specifies that the "Iterations History" table be displayed. This includes detailed output for estimation iterations (including log-likelihood values and parameters) and counts of function evaluations and iterations. It is useful to examine the "Iterations History" table when errors occur within estimation iterations. By default, only successful iterations are displayed, unless the PRINTERR option is specified. An unsuccessful iteration is an iteration that is restarted due to a problem such as a root inside the unit circle. Successful iterations have a status of 0. If restarted iterations are displayed, a note at the end of the table gives definitions for status codes that indicate a restarted iteration. For restarted iterations, the number of function evaluations and the number of iterations will be -1, which is displayed as missing. If regression parameters are included in the model, then both IGLS and ARMA iterations are included in the table. The number of function evaluations is a cumulative total.
PRINTERR
causes restarted iterations to be included in the "Iterations History" table (if ITPRINT is specified) or creates the "Restarted Iterations" table (if ITPRINT is not specified). Whether or not PRINTERR is specified, a WARNING message is printed to the log file if any iteration is restarted during estimation.
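The ESTIMATE options can be combined as in the following sketch, which raises the iteration limit, loosens the tolerance relative to the default, and requests the iteration history including restarted iterations. The data set MONTHLY, the variables DATE and SALES, and the particular option values are hypothetical.

```sas
proc x12 data=monthly date=date;
   var sales;
   arima model=((0,1,1)(0,1,1));
   /* allow up to 500 ARMA iterations, use a looser tolerance than
      the default 0.00001, and display all iterations */
   estimate maxiter=500 tol=0.0001 itprint printerr;
run;
```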
FORECAST Statement
FORECAST options ;
The FORECAST statement uses the estimated model to forecast the time series. The output contains point forecasts and forecast statistics for the transformed and original series. The following option can appear in the FORECAST statement.
LEAD=value
specifies the number of periods ahead to forecast. The default is the number of periods in a year (4 or 12), and the maximum is 60. Tables that contain forecasts, standard errors, and confidence limits are displayed in association with the FORECAST statement. If the data are transformed, then two tables are displayed: one table for the original data, and one table for the transformed data.
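A minimal sketch of a two-year forecast for a monthly series follows. The data set MONTHLY and the variables DATE and SALES are hypothetical. The TRANSFORM statement's FUNCTION=LOG option is not described in this section; it is used here on the assumption that it requests a log transformation, so that forecast tables for both the original and the transformed data would be expected.

```sas
proc x12 data=monthly date=date;
   var sales;
   transform function=log;          /* assumed: log-transform the series */
   arima model=((0,1,1)(0,1,1));
   estimate;
   forecast lead=24;                /* 24 months ahead; the maximum is 60 */
run;
```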
IDENTIFY Statement
IDENTIFY options ;
The IDENTIFY statement is used to produce plots of the sample autocorrelation function (ACF) and partial autocorrelation function (PACF) for identifying the ARIMA part of a regARIMA model. The sample ACF and PACF are produced for all combinations of the nonseasonal and seasonal differences of the data specified by the DIFF= and SDIFF= options. If the model includes a regression component (specified using the REGRESSION, INPUT, and EVENT statements or the MDLINFOIN= data set), then the ACFs and PACFs are calculated for the specified differences of the regression residuals. If the model does not include a regression component, then the ACFs and PACFs are calculated for the specified differences of the original data.
Tables displayed in association with identification are "Autocorrelation of Model Residuals" and "Partial Autocorrelation of Model Residuals." If the model includes a regression component, then the "Regression Model Parameter Estimates" table is also available.
The following options can appear in the IDENTIFY statement.
DIFF=(order, order, order )
specifies orders of nonseasonal differencing to use in model identification. The value 0 specifies no differencing; the value 1 specifies one nonseasonal difference $(1 - B)$; the value 2 specifies two nonseasonal differences $(1 - B)^2$; and so forth. The ACFs and PACFs are produced for all orders of nonseasonal differencing specified, in combination with all orders of seasonal differencing specified in the SDIFF= option. The default is DIFF=(0). You can specify up to three values for nonseasonal differences.
SDIFF=(order, order, order )
specifies orders of seasonal differencing to use in model identification. The value 0 specifies no seasonal differencing; the value 1 specifies one seasonal difference $(1 - B^s)$; the value 2 specifies two seasonal differences $(1 - B^s)^2$; and so forth. Here the value for s corresponds to the period specified in the SEASONS= option in the PROC X12 statement. The value of the SEASONS= option is supplied explicitly or is implicitly supplied through the INTERVAL= option or the values of the DATE= variable. The ACFs and PACFs are produced for all orders of seasonal differencing specified, in combination with all orders of nonseasonal differencing specified in the DIFF= option. The default is SDIFF=(0). You can specify up to three values for seasonal differences. For example, the following statement produces ACFs and PACFs for two levels of differencing: $(1 - B)$ and $(1 - B)(1 - B^s)$:
identify diff=(1) sdiff=(0, 1);
PRINTREG
causes the Regression Model Parameter Estimates table to be printed if the REGRESSION statement is present. By default, the table is not printed.
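The following sketch combines the IDENTIFY options with a regression component. With DIFF=(0,1) and SDIFF=(0,1), ACFs and PACFs would be produced for the four combinations of no differencing, $(1 - B)$, $(1 - B^s)$, and $(1 - B)(1 - B^s)$, computed on the regression residuals. The data set MONTHLY and the variables DATE and SALES are hypothetical.

```sas
proc x12 data=monthly date=date;
   var sales;
   regression predefined=td;   /* trading-day regressors */
   /* all combinations of nonseasonal orders 0,1 and seasonal orders 0,1;
      PRINTREG also prints the regression parameter estimates */
   identify diff=(0,1) sdiff=(0,1) printreg;
run;
```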
AUTOMDL Statement
AUTOMDL options ;
The AUTOMDL statement is used to invoke the automatic model selection procedure of the X-12-ARIMA method. This method is based largely on the TRAMO (time series regression with ARIMA noise, missing values, and outliers) method by Gomez and Maravall (1997a, b). If the AUTOMDL statement is used without the OUTLIER statement, then only missing values regressors are included in the regARIMA model. If the AUTOMDL and the OUTLIER statements are used, then both missing values regressors and regressors for automatically identified outliers are included in the regARIMA model.
If both the AUTOMDL statement and the ARIMA statement are present, the ARIMA statement is ignored. The ARIMA statement specifies the model, while the AUTOMDL statement allows the X12 procedure to select the model. If the AUTOMDL statement is specified and a data set is specified in the MDLINFOIN= option of the PROC X12 statement, then the AUTOMDL statement is ignored if the specified data set contains a model specification for the series. If no model for the series is specified in the data set specified in the MDLINFOIN= option, then the AUTOMDL (or ARIMA) statement is used to determine the model. Thus, it is possible to give a specific model for some series and automatically identify the model for other series by using both the MDLINFOIN= option and the AUTOMDL statement.
When AUTOMDL is specified, the X12 procedure compares a model selected using a TRAMO method to a default model. The TRAMO method is implemented first, and involves two parts: identifying the orders of differencing and identifying the ARIMA model. The table "ARIMA Estimates for Unit Root Identification" provides details about the identification of the orders of differencing, while the table "Results of Unit Root Test for Identifying Orders of Differencing" shows the orders of differencing selected by TRAMO. The table "Models Estimated by Automatic ARIMA Model Selection Procedure" provides details regarding the TRAMO automatic model selection, and the table "Best Five ARIMA Models Chosen by Automatic Modeling" ranks the best five models estimated using the TRAMO method. The next available table, "Comparison of Automatically Selected Model and Default Model," compares the model selected by the TRAMO method to a default model. At this point in the processing, if the default model is selected over the TRAMO model, then PROC X12 displays a note. No note is displayed if the TRAMO model is selected. PROC X12 then performs checks for unit roots, over-differencing, and insignificant ARMA coefficients. If the model is changed due to any of these tests, a note is displayed. The last table, "Final Automatic Model Selection," shows the results of the automatic model selection.
The following options can appear in the AUTOMDL statement:
MAXORDER=(nonseasonal order, seasonal order )
specifies the maximum orders of nonseasonal and seasonal ARMA polynomials for the automatic ARIMA model identification procedure. The maximum order for the nonseasonal ARMA parameters should be between 1 and 4; the maximum order for the seasonal ARMA should be 1 or 2.
DIFFORDER=(nonseasonal order, seasonal order )
specifies the fixed orders of differencing to be used in the automatic ARIMA model identification procedure. When the DIFFORDER= option is used, only the AR and MA orders are automatically identified. Acceptable values for the regular differencing orders are 0, 1, and 2; acceptable values for the seasonal differencing orders are 0 and 1. If the MAXDIFF= option is also specified, then the DIFFORDER= option is ignored. There are no default values for DIFFORDER. If neither the DIFFORDER= option nor the MAXDIFF= option is specified, then the default is MAXDIFF=(2,1).
MAXDIFF=(nonseasonal order, seasonal order )
specifies the maximum orders of regular and seasonal differencing for the automatic identification of differencing orders. When MAXDIFF is specified, the differencing orders are identified first, and then the AR and MA orders are identified. Acceptable values for the regular differencing orders are 1 and 2; the only acceptable value for the seasonal differencing order is 1. If both the MAXDIFF= option and the DIFFORDER= option are specified, then the DIFFORDER= option is ignored. If neither the DIFFORDER= nor the MAXDIFF= option is specified, the default is MAXDIFF=(2,1).
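The difference between fixing the differencing orders and letting the procedure identify them can be sketched as follows. The data set MONTHLY, the variables DATE and SALES, and the MAXORDER= values are hypothetical choices for illustration.

```sas
/* Fix the differencing at (1,1); only the AR and MA orders
   are identified automatically */
proc x12 data=monthly date=date;
   var sales;
   automdl difforder=(1,1) maxorder=(3,1);
   estimate;
   x11;
run;

/* Let the procedure identify differencing orders up to (2,1),
   which is also the default limit, and then the AR and MA orders */
proc x12 data=monthly date=date;
   var sales;
   automdl maxdiff=(2,1) maxorder=(3,1);
   estimate;
   x11;
run;
```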
NOINT
suppresses the fitting of a constant (or intercept) parameter in the model. (That is, the parameter is omitted.)
PRINT=UNITROOTTEST PRINT=AUTOCHOICE PRINT=UNITROOTTESTMDL PRINT=AUTOCHOICEMDL PRINT=BEST5MODEL
lists the tables to be displayed in the output.
PRINT=AUTOCHOICE displays the tables titled "Comparison of Automatically Selected Model and Default Model" and "Final Automatic Model Selection." The "Comparison of Automatically Selected Model and Default Model" table compares a default model to the model chosen by the TRAMO-based automatic modeling method. The "Final Automatic Model Selection" table indicates which model has been chosen automatically. If the PRINT= option is not specified, then the PRINT=AUTOCHOICE tables are displayed by default.
PRINT=UNITROOTTEST causes the table titled "Results of Unit Root Test for Identifying Orders of Differencing" to be printed. This table displays the orders that were automatically selected by AUTOMDL. Unless the nonseasonal and seasonal differences are specified using the DIFFORDER= option, AUTOMDL automatically identifies the orders of differencing.
PRINT=UNITROOTTESTMDL displays the table titled "ARIMA Estimates for Unit Root Identification." This table summarizes the various models that were considered by the TRAMO automatic selection method while identifying the orders of differencing and the statistics associated with those models. The unit root identification method first attempts to obtain the coefficients by using the Hannan-Rissanen method. If Hannan-Rissanen estimation cannot be performed, the algorithm attempts to obtain the coefficients by using conditional likelihood estimation.
PRINT=AUTOCHOICEMDL displays the table "Models Estimated by Automatic ARIMA Model Selection Procedure." This table summarizes the various models that were considered by the TRAMO automatic model selection method and their measures of fit.
PRINT=BEST5MODEL displays the table "Best Five ARIMA Models Chosen by Automatic Modeling." This table ranks the five best models that were considered by the TRAMO automatic modeling method.
BALANCED
specifies that the automatic modeling procedure prefer balanced models over unbalanced models. A balanced model is one in which the sum of the AR, differencing, and seasonal differencing orders equals the sum of the MA and seasonal MA orders. Specifying BALANCED gives the same preference as the TRAMO program. If BALANCED is not specified, all models are given equal consideration.
HRINITIAL
specifies that Hannan-Rissanen estimation be done before exact maximum likelihood estimation to provide initial values. If HRINITIAL is specified, then models for which the Hannan-Rissanen estimation has an unacceptable coefficient are rejected.
ACCEPTDEFAULT
specifies that the default model be chosen if its Ljung-Box Q is acceptable.
LJUNGBOXLIMIT=value
specifies acceptance criteria for the confidence coefficient of the Ljung-Box Q statistic. If the Ljung-Box Q for a final model is greater than this value, the model is rejected, the outlier critical value is reduced, and outlier identification is redone with the reduced value (see the REDUCECV= option). The value specified must be greater than 0 and less than 1. The default value is 0.95.
REDUCECV=value
specifies the percentage by which the outlier critical value is reduced when a final model is found to have an unacceptable confidence coefficient for the Ljung-Box Q statistic. This value should be between 0 and 1. The default value is 0.14286.
ARMACV=value
specifies the threshold value for the t statistics associated with the highest order ARMA coefficients. As a check of model parsimony, the parameter estimates and t statistics of the highest order ARMA coefficients are examined to determine if the coefficient is insignificant.
An ARMA coefficient is considered to be insignificant if both of the following conditions hold: the absolute value of the parameter estimate is below 0.15 (for 150 or fewer observations) or below 0.1 (for more than 150 observations), and the t value (displayed in the table "Exact ARMA Maximum Likelihood Estimation") is below the value specified in the ARMACV= option. If the highest order ARMA coefficient is found to be insignificant, then the order of the ARMA model is reduced. For example, if AUTOMDL identifies a (3 1 1)(0 0 1) model and the parameter estimate of the seasonal MA lag of order 1 is 0.09 and its t value is 0.55, then the ARIMA model is reduced to at least (3 1 1)(0 0 0). After the model is reestimated, the check for insignificant coefficients is performed again. If ARMACV=0.54 is specified in the preceding example, then the coefficient is not found to be insignificant and the model is not reduced.
If a constant regressor is allowed in the model and if the t value (displayed in the table "Regression Model Parameter Estimates") is below the ARMACV= critical value, then the constant regressor is considered to be insignificant and is removed. Note that if a constant regressor is added to or removed from the model and then the ARIMA model changes, then the t statistic for the constant regressor also changes. Thus, changing the ARMACV= value does not necessarily add or remove a constant term from the model. The value specified in the ARMACV= option should be greater than zero. The default value is 1.0.
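The insignificance rule for the highest order ARMA coefficient can be sketched as a small DATA step. This is an illustration of the rule as described, not the procedure's internal code; all variable names and values are hypothetical, and applying the absolute value to the t value is an assumption.

```sas
data _null_;
   nobs   = 120;    /* number of observations in the series */
   est    = 0.09;   /* estimate of the highest order ARMA coefficient */
   tvalue = 0.55;   /* its t value */
   armacv = 1.0;    /* ARMACV= threshold (the default) */

   /* the size limit for the estimate depends on the series length */
   if nobs <= 150 then limit = 0.15;
   else limit = 0.10;

   /* both conditions must hold (assumption: absolute t value is compared) */
   insignificant = (abs(est) < limit) and (abs(tvalue) < armacv);
   put insignificant=;
run;
```

With these values the log shows insignificant=1, so the model order would be reduced. Setting armacv to 0.54 makes the second condition fail, and the flag becomes 0.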
OUTPUT Statement
OUTPUT OUT= SAS-data-set tablename1 tablename2 . . . ;
The OUTPUT statement creates an output data set that contains specified tables. The data set is named by the OUT= option.
OUT=SAS-data-set
names the data set to contain the specified tables. If the OUT= option is omitted, the SAS System names the new data set by using the default DATAn convention. For each table to be included in the output data set, you must specify the X12 tablename keyword. The keyword corresponds to the title label used by the Census Bureau X-12-ARIMA software. Currently available tables are A1, A2, A6, A7, A8, A8AO, A8LS, A8TC, A9, A10, B1, C17, C20, D1, D7, D8, D9, D10, D10B, D10D, D11, D11A, D11F, D11R, D12, D13, D16, D16B, D18, E5, E6, E6A, E6R, E7, and MV1. If no table is specified in the OUTPUT statement, Table A1 is output to the OUT= data set by default. The tablename keywords that can be used in the OUTPUT statement are listed in the section "Displayed Output/ODS Table Names/OUTPUT Tablename Keywords" on page 2139. The following is an example of a VAR statement and an OUTPUT statement:
var sales costs;
output out=out_x12 b1 d11;
Note that the default variable name used in the output data set is the input variable name followed by an underscore and the corresponding table name. The variable sales_B1 contains the Table B1 values for the variable sales, the variable costs_B1 contains the Table B1 values for costs, while the Table D11 values for sales are contained in the variable sales_D11, and the variable costs_D11 contains the Table D11 values for costs. If necessary, the variable name is shortened so that the table name can be added. If the DATE= variable is specified in the PROC X12 statement, then that variable is included in the output data set; otherwise, a variable named _DATE_ is written to the OUT= data set as the date identifier.
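The naming convention for the output variables can be illustrated with a short sketch. It is not the procedure's own logic: the helper name is hypothetical, and trimming the input name on the right to stay within the 32-character limit for SAS variable names is an assumption about how the shortening is done.

```python
MAX_NAME_LEN = 32  # SAS variable names are limited to 32 characters


def output_name(var, table):
    """Build '<var>_<TABLE>', trimming the input name if the result is too long."""
    suffix = "_" + table.upper()
    # Assumption: "shortened" means the input variable name is truncated on the right.
    room = MAX_NAME_LEN - len(suffix)
    return var[:room] + suffix


# Names produced for: var sales costs; output out=out_x12 b1 d11;
names = [output_name(v, t) for t in ("b1", "d11") for v in ("sales", "costs")]
print(names)  # ['sales_B1', 'costs_B1', 'sales_D11', 'costs_D11']
```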
OUTLIER Statement
OUTLIER options ;
The OUTLIER statement specifies that the X12 procedure perform automatic detection of additive (point) outliers, temporary change outliers, level shifts, or any combination of the three when using the specified model. After outliers are identified, the appropriate regression variables are incorporated into the model as "Automatically Identified Outliers," and the model is reestimated. This procedure is repeated until no additional outliers are found.
The OUTLIER statement also identifies potential outliers and lists them in the table "Potential Outliers" in the displayed output. Potential outliers are identified by decreasing the critical value by 0.5.
In the output, the default initial critical values used for outlier detection in a given analysis are displayed in the table "Critical Values to Use in Outlier Detection." Outliers that are detected and incorporated into the model are displayed in the output in the table "Regression Model Parameter Estimates," where the regression variable is listed as "Automatically Identified."
The following options can appear in the OUTLIER statement.
SPAN=(mmmyy,mmmyy)
SPAN=('yyQq','yyQq')
gives the dates of the first and last observations to define a subset for searching for outliers. A single date in parentheses is interpreted to be the starting date of the subset. To specify only the ending date, use SPAN=(,mmmyy) or SPAN=(,'yyQq'). If the starting or ending date is omitted, then the first or last date, respectively, of the input data set is assumed. A four-digit year can be specified; if a two-digit year is specified, the value specified in the YEARCUTOFF= SAS system option applies.
TYPE=NONE
TYPE=(outlier types)
lists the outlier types to be detected by the automatic outlier identification method. TYPE=NONE turns off outlier detection. The valid outlier types are AO, LS, and TC. The default is TYPE=(AO LS).
CV=value
specifies an initial critical value to use for detection of all types of outliers. The absolute value of the t statistic associated with an outlier parameter estimate is compared with the critical value to determine the significance of the outlier. If the CV= option is not specified, then the default initial critical value is computed using a formula presented by Ljung (1993), which is based on the number of observations or model span used in the analysis. Table 32.2 gives default critical values for various series lengths. Increasing the critical value decreases the sensitivity of the outlier detection routine and can reduce the number of observations treated as outliers. The automatic model identification process might lower the critical value by a certain percentage if it fails to identify an acceptable model.
Table 32.2 Default Critical Values for Outlier Identification

Number of Observations    Outlier Critical Value
          1                       1.96
          2                       2.24
          3                       2.44
          4                       2.62
          5                       2.74
          6                       2.84
          7                       2.92
          8                       2.99
          9                       3.04
         10                       3.09
         11                       3.13
         12                       3.16
         24                       3.42
         36                       3.55
         48                       3.63
         72                       3.73
         96                       3.80
        120                       3.85
        144                       3.89
        168                       3.92
        192                       3.95
        216                       3.97
        240                       3.99
        264                       4.01
        288                       4.03
        312                       4.04
        336                       4.05
        360                       4.07

AOCV=value
specifies a critical value to use for additive (point) outliers. If AOCV is specified, this value overrides any default critical value for AO outliers. See the CV= option for more details.
LSCV=value
specifies a critical value to use for level shift outliers. If LSCV is specified, this value overrides any default critical value for LS outliers. See the CV= option for more details.
TCCV=value
specifies a critical value to use for temporary change outliers. If TCCV is specified, this value overrides any default critical value for TC outliers. See the CV= option for more details.
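How the critical values interact can be sketched as a small lookup. The dictionary copies Table 32.2; everything else is illustrative. The function names are hypothetical, the lookup covers only the tabulated series lengths (the Ljung formula itself is not reproduced), and letting a type-specific value take precedence over CV= is an assumption.

```python
# Default critical values copied from Table 32.2 (number of observations -> value).
DEFAULT_CV = {
    1: 1.96, 2: 2.24, 3: 2.44, 4: 2.62, 5: 2.74, 6: 2.84, 7: 2.92,
    8: 2.99, 9: 3.04, 10: 3.09, 11: 3.13, 12: 3.16, 24: 3.42, 36: 3.55,
    48: 3.63, 72: 3.73, 96: 3.80, 120: 3.85, 144: 3.89, 168: 3.92,
    192: 3.95, 216: 3.97, 240: 3.99, 264: 4.01, 288: 4.03, 312: 4.04,
    336: 4.05, 360: 4.07,
}


def outlier_thresholds(n_obs, cv=None, aocv=None, lscv=None, tccv=None):
    """Return detection and potential-outlier thresholds for each outlier type."""
    # Without CV=, fall back to the tabulated default (KeyError if not tabulated).
    base = cv if cv is not None else DEFAULT_CV[n_obs]
    overrides = {"AO": aocv, "LS": lscv, "TC": tccv}
    result = {}
    for kind, override in overrides.items():
        # Assumption: AOCV=, LSCV=, and TCCV= take precedence over CV= as well.
        detect = override if override is not None else base
        # Potential outliers use the critical value decreased by 0.5.
        result[kind] = {"detect": detect, "potential": round(detect - 0.5, 2)}
    return result


thresholds = outlier_thresholds(144, aocv=3.5)
print(thresholds["AO"])  # {'detect': 3.5, 'potential': 3.0}
print(thresholds["LS"])  # {'detect': 3.89, 'potential': 3.39}
```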
REGRESSION Statement
REGRESSION PREDEFINED= variables < / options > ;
REGRESSION USERVAR= variables < / options > ;
The REGRESSION statement includes regression variables in a regARIMA model or specifies regression variables whose effects are to be removed by the IDENTIFY statement to aid in ARIMA model identification. Predefined regression variables are selected with the PREDEFINED= option. User-defined regression variables are specified with the USERVAR= option. The currently available predefined variables are listed below in Table 32.3.
Table A6 in the displayed output generated by the X12 procedure provides information related to trading day effects. Table A7 provides information related to holiday effects. Tables A8, A8AO, A8LS, and A8TC provide information related to outlier factors. Ramps and level shifts are combined in the A8LS table. The A8AO, A8LS, and A8TC tables are available only when more than one outlier type is present in the model. Table A9 provides information about user-defined regression effects. Table A10 provides information about the user-defined seasonal component.
Missing values in the span of an input series automatically create missing value regressors. See the NOTRIMMISS option of the PROC X12 statement and the section "Missing Values" on page 2136 for further details about missing values.
Combining your model with additional predefined regression variables can result in a singularity problem. If a singularity occurs, then you might need to alter either the model or the choices of the predefined regressors in order to successfully perform the regression.
In order to seasonally adjust a series that uses a regARIMA model, the factors derived from the regression coefficients must be the same type as the factors generated by the seasonal adjustment procedure, so that combined adjustment factors can be derived and adjustment diagnostics can be generated. If the regARIMA model is applied to a log-transformed series, the regression factors are expressed in the form of ratios, which match seasonal factors generated by the multiplicative (or log-additive) adjustment modes. Conversely, if the regARIMA model is fit to the original series, the regression factors are measured on the same scale as the original series, which match seasonal factors generated by the additive adjustment mode. Note that the default transformation (no transformation) and the default seasonal adjustment mode (multiplicative) are in conflict. Thus, when you specify the X11 statement and any of the REGRESSION, INPUT, or EVENT statements, it is necessary to also specify either a transform option (using the TRANSFORM statement) or a mode (using the MODE= option of the X11 statement) in order to seasonally adjust the data that uses the regARIMA model.
According to Ladiray and Quenneville (2001), "X-12-ARIMA is based on the same principle [as the X-11 method] but proposes, in addition, a complete module, called Reg-ARIMA, that allows for the initial series to be corrected for all sorts of undesirable effects. These effects are estimated using regression models with ARIMA errors (Findley et al. [23])." In order to correct the series for effects in this manner, the REGRESSION statement must be specified. The effects that can be corrected in this manner are listed in the PREDEFINED= option below.
Either the PREDEFINED= option or the USERVAR= option can be specified in a single REGRESSION statement, but not both. Multiple REGRESSION statements can be used. The following options can appear in the REGRESSION statement.
PREDEFINED=CONSTANT < / B= >
PREDEFINED=LOM
PREDEFINED=LOMSTOCK
PREDEFINED=LOQ
PREDEFINED=LPYEAR
PREDEFINED=SEASONAL
PREDEFINED=TD
PREDEFINED=TDNOLPYEAR
PREDEFINED=TD1COEF
PREDEFINED=TD1NOLPYEAR
PREDEFINED=EASTER(value)
PREDEFINED=SCEASTER(value)
PREDEFINED=LABOR(value)
PREDEFINED=THANK(value)
PREDEFINED=TDSTOCK(value)
PREDEFINED=SINCOS(value . . . )
lists the predefined regression variables to be included in the model. Data values for these variables are calculated by the program, mostly as functions of the calendar. Table 32.3 gives definitions for the available predefined variables. The values LOM and LOQ are actually equivalent: the actual regression is controlled by the PROC X12 SEASONS= option. Multiple predefined regression variables can be used. The syntax for using both a length-of-month and a seasonal regression can be in one of the following forms:
regression predefined=lom seasonal;
regression predefined=(lom seasonal);
regression predefined=lom predefined=seasonal;
Certain restrictions apply when you use more than one predefined regression variable. Only one of TD, TDNOLPYEAR, TD1COEF, or TD1NOLPYEAR can be specified. LPYEAR cannot be used with TD, TD1COEF, LOM, LOMSTOCK, or LOQ. LOM or LOQ cannot be used with TD or TD1COEF.
The following restriction also applies to the SINCOS predefined regression variable. If SINCOS is specified, then the INTERVAL= option or the SEASONS= option must also be specified because there are restrictions to this regression variable based on the frequency of the data.
The predefined regression variables TDSTOCK, SCEASTER, EASTER, LABOR, THANK, and SINCOS require extra parameters. Only one TDSTOCK regressor can be implemented in the regression model. If multiple TDSTOCK variables are specified, PROC X12 uses the last TDSTOCK variable specified. For SCEASTER, EASTER, LABOR, THANK, and SINCOS, multiple regressors can be implemented in the model by specifying the variables with different parameters. The syntax for specifying two EASTER regressors with widths 7 and 14 would be:
regression predefined=easter(7) easter(14);
For SINCOS, specifying a parameter includes both the sine and the cosine regressor except for the highest order allowed (2 for quarterly data and 6 for monthly data). The most common use of the SINCOS variable for quarterly data would be:
regression predefined=sincos(1,2);
Table 32.3

Regression Effect
    Variable Definitions

trend constant CONSTANT
    $(1 - B)^{-d} (1 - B^s)^{-D} I(t \ge 1)$, where $I(t \ge 1) = 1$ for $t \ge 1$ and $I(t \ge 1) = 0$ for $t < 1$

length-of-month (monthly flow) LOM
    $m_t - \bar{m}$, where $m_t$ = length of month $t$ (in days) and $\bar{m} = 30.4375$ (average length of month)

stock length-of-month LOMSTOCK
    $SLOM_t = m_t - \bar{m} - \mu(l)$ for $t = 1$
    $SLOM_t = SLOM_{t-1} + m_t - \bar{m}$ otherwise
    where $\bar{m}$ and $m_t$ are defined in LOM and
    $\mu(l) = 0.375$ when 1st February in series is a leap year
    $\mu(l) = 0.125$ when 2nd February in series is a leap year
    $\mu(l) = -0.125$ when 3rd February in series is a leap year
    $\mu(l) = -0.375$ when 4th February in series is a leap year

length-of-quarter (quarterly flow) LOQ
    $q_t - \bar{q}$, where $q_t$ = length of quarter $t$ (in days) and $\bar{q} = 91.3125$ (average length of quarter)

leap year (monthly and quarterly flow) LPYEAR
    $LY_t = 0.75$ in leap year February (first quarter)
    $LY_t = -0.25$ in other Februaries (first quarter)
    $LY_t = 0$ otherwise

fixed seasonal SEASONAL
    $M_{1,t} = 1$ in January, $-1$ in December, $0$ otherwise; ...; $M_{11,t} = 1$ in November, $-1$ in December, $0$ otherwise

fixed seasonal SINCOS
    $\sin(w_j t), \cos(w_j t)$, where $w_j = 2\pi j / 12$, $1 \le j \le s/2$, and $s$ is the seasonal period (drop $\sin(w_j t) \equiv 0$ for $j = s/2$)

trading day (monthly or quarterly flow) TD, TDNOLPYEAR
    $T_{1,t}$ = (no. of Mondays) $-$ (no. of Sundays), ..., $T_{6,t}$ = (no. of Saturdays) $-$ (no. of Sundays)

one-coefficient trading day TD1COEF, TD1NOLPYEAR
    (no. of weekdays) $- \frac{5}{2}$ (no. of Saturdays and Sundays)

stock trading day TDSTOCK(w)
    $D_{1,t} = 1$ if the $\tilde{w}$th day of month $t$ is a Monday, $-1$ if the $\tilde{w}$th day of month $t$ is a Sunday, $0$ otherwise; ...; $D_{6,t} = 1$ if the $\tilde{w}$th day of month $t$ is a Saturday, $-1$ if the $\tilde{w}$th day of month $t$ is a Sunday, $0$ otherwise
    where $\tilde{w}$ is the smaller of $w$ and the length of month $t$. For end-of-month stock series, set $w$ to 31; that is, specify TDSTOCK(31). Restriction: $1 \le w \le 31$.

Statistics Canada Easter SCEASTER(w)
    If Easter falls before April $w$, let $n_E$ be the number of the $w$ days on or before Easter that fall in March. Then:
    $E(w,t) = n_E / w$ in March, $-n_E / w$ in April, $0$ otherwise
    If Easter falls on or after April $w$, then $E(w,t) = 0$. (Note: This variable is 0 except in March and April (or first and second quarter).) Restriction: $1 \le w \le 24$.

Easter holiday EASTER(w)
    $E(w,t) = \frac{1}{w} n_t$, where $n_t$ is the number of the $w$ days before Easter that fall in month (or quarter) $t$. (Note: This variable is 0 except in February, March, and April (or first and second quarter). It is nonzero in February only for $w > 22$.) Restriction: $1 \le w \le 25$.

Labor Day LABOR(w)
    $L(w,t) = \frac{1}{w} \times$ [no. of the $w$ days before Labor Day that fall in month $t$]. (Note: This variable is 0 except in August and September.) Restriction: $1 \le w \le 25$.

Thanksgiving THANK(w)
    $ThC(w,t)$ = proportion of days from $w$ days before Thanksgiving through December 24 that fall in month $t$ (negative values of $w$ indicate days after Thanksgiving). (Note: This variable is 0 except in November and December.) Restriction: $-8 \le w \le 17$.
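Several of the calendar-based definitions in Table 32.3 are simple enough to compute by hand. The sketch below reproduces the LOM, LPYEAR, and TD definitions for monthly data by using Python's standard calendar module. It is an illustration, not the PROC X12 implementation: the function names are hypothetical, and the leap-year values 0.75 and -0.25 are the usual X-12-ARIMA values for this regressor, taken here as an assumption.

```python
import calendar

M_BAR = 30.4375  # average length of month, as given for LOM


def lom(year, month):
    """Length-of-month regressor: m_t minus the average month length."""
    return calendar.monthrange(year, month)[1] - M_BAR


def lpyear(year, month):
    """Leap-year regressor for monthly data (assumed values 0.75, -0.25, 0)."""
    if month != 2:
        return 0.0
    return 0.75 if calendar.isleap(year) else -0.25


def td(year, month):
    """Six trading-day contrasts: (no. of Mondays..Saturdays) minus (no. of Sundays)."""
    counts = [0] * 7  # index 0 = Monday ... index 6 = Sunday
    days = calendar.monthrange(year, month)[1]
    for day in range(1, days + 1):
        counts[calendar.weekday(year, month, day)] += 1
    return [counts[i] - counts[6] for i in range(6)]


print(lom(2008, 2))     # -1.4375 (29 days in February 2008)
print(lpyear(2008, 2))  # 0.75
print(td(2008, 2))      # [0, 0, 0, 0, 1, 0] (five Fridays, four of every other day)
```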
USERVAR=variables
specifies variables in the PROC X12 DATA= data set that are to be used as regressors. The variables in the data set should contain the values for each observation that define the regressor; regression variables should also be defined for forecast values if the time series is to be extended with regARIMA forecasts. Missing values are not permitted within the data span, including forecasts, of the user-defined regressors. Example 32.6 shows how you would create an input data set that contains both the series to be seasonally adjusted and a user-defined input variable. Note that all regression variables in the USERVAR= option apply to all time series to be seasonally adjusted unless the MDLINFOIN= data set specifies different regression information.
B=(value <F> . . . )
specifies initial or fixed values for the regression parameters in the order in which they appear in the PREDEFINED= and USERVAR= options. Each B= list applies to the PREDEFINED= or USERVAR= variable list that immediately precedes the slash. The PREDEFINED= option and the USERVAR= option cannot be specified in the same REGRESSION statement; however, multiple REGRESSION statements can be specified. For example, the following statements set an initial value for the user-defined regressor, x, of 1:
regression predefined=LOM ;
regression uservar=x / b=1 2 ;
In this example, the B= option applies only to the USERVAR= option. The value 2 is discarded because there is only one variable in the USERVAR= list. To assign an initial value of 1 to the LOM regressor and 2 to the x regressor, use the following statements:
regression predefined=LOM / b=1 ;
regression uservar=x / b=2 ;
An F immediately following the numerical value indicates that this is not an initial value, but a fixed value. See Example 32.8 for an example that uses fixed parameters. In PROC X12, individual parameters can be fixed while other parameters in the same model are estimated.
USERTYPE=CONSTANT
USERTYPE=SEASONAL
USERTYPE=TD
USERTYPE=LOM
USERTYPE=LOQ
USERTYPE=LPYEAR
USERTYPE=TDSTOCK
USERTYPE=LOMSTOCK
USERTYPE=EASTER
USERTYPE=LABOR
USERTYPE=THANKS
USERTYPE=AO
USERTYPE=LS
USERTYPE=RP
USERTYPE=HOLIDAY
USERTYPE=SCEASTER
USERTYPE=USER
USERTYPE=TC
enables a user-defined variable to be processed in the same manner as a U.S. Census predefined variable. For instance, the U.S. Census Bureau EASTER(w) regression effects are included in the "RegARIMA Holiday Component" table (A7). You should specify USERTYPE=EASTER to include a user-defined variable that is processed exactly as the U.S. Census predefined EASTER(w) variable, including inclusion in the A7 table. Each USERTYPE= list applies to the USERVAR= variable list that immediately precedes the slash. USERTYPE= does not apply to U.S. Census predefined variables. The same rules for assigning B= values to regression variables apply for USERTYPE= options. See the example in B=(value <F> . . . ).
TABLES Statement
TABLES tablename1 tablename2 . . . options ;
The TABLES statement enables you to alter the display of the PROC X12 tables. You can specify the display of tables that are not displayed by default by PROC X12, and the NOSUM option enables you to suppress the printing of the period summary line in the time series tables.
tablename1 tablename2 . . .
keywords that correspond to the title label used by the Census Bureau X12-ARIMA software. For each table to be included in the displayed output, you must specify the X12 tablename keyword. Currently available tables are C20, D1, and D7. Although these tables are not displayed by default, their values are sometimes useful in understanding the X-12-ARIMA method. For further description of the available tables, see the section "Displayed Output/ODS Table Names/OUTPUT Tablename Keywords" on page 2139.
NOSUM
NOSUMMARY
NOSUMMARYLINE
applies to the tables available for output in the OUTPUT Statement. By default, these tables include a summary line that gives the average, total, or standard deviation for the historical data by period. The NOSUM option suppresses the display of the summary line in the listing. Also, if the tables are output with ODS, the summary line is not an observation in the data set. Thus, the output to the data set is only the time series, both the historical data and the forecast data, if available.
TRANSFORM Statement
TRANSFORM options ;
The TRANSFORM statement transforms or adjusts the series prior to estimating a regARIMA model. With this statement, the series can be Box-Cox (power) transformed. The Prior Adjustment Factors table is associated with the TRANSFORM statement. Only one of the following options can appear in the TRANSFORM statement.
POWER=value
transforms the input series, $Y_t$, by using a Box-Cox power transformation,
$Y_t \rightarrow y_t = \log(Y_t)$ for $\lambda = 0$
$Y_t \rightarrow y_t = \lambda^2 + (Y_t^{\lambda} - 1)/\lambda$ for $\lambda \ne 0$
The power $\lambda$ must be specified (for example, POWER=0.33). The default is no transformation ($\lambda = 1$); that is, POWER=1. The log transformation (POWER=0), square root transformation (POWER=0.5), and the inverse transformation (POWER=-1) are equivalent to the corresponding FUNCTION= option.
Table 32.4 Power Values Related to the Census Bureau Function Argument

FUNCTION=          Range for $Y_t$                Equivalent Power Argument
NONE               all values                     power = 1
LOG                $Y_t > 0$ for all t            power = 0
SQRT               $Y_t \ge 0$ for all t          power = 0.5
INVERSE            $Y_t \ne 0$ for all t          power = -1
LOGISTIC           $0 < Y_t < 1$ for all t        none equivalent
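The Box-Cox formula and the equivalences listed in Table 32.4 can be checked numerically. The sketch below is a plain transcription of the formula for a single value; the function name is hypothetical, and input validation (for example, rejecting nonpositive values for the log) is left out.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transform of one value, in the form used by POWER=."""
    if lam == 0:
        return math.log(y)
    # The lam**2 term makes lam=1 the identity: 1 + (y - 1) = y.
    return lam ** 2 + (y ** lam - 1.0) / lam


print(box_cox(5.0, 1))    # 5.0  (no transformation)
print(box_cox(4.0, 0.5))  # 2.25 (square root case: 0.25 + 2 * (sqrt(4) - 1))
print(box_cox(2.0, -1))   # 1.5  (inverse case: 2 - 1/2)
```

With this form of the transformation, POWER=1 leaves the series unchanged, which is why it corresponds to no transformation.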
FUNCTION=NONE
FUNCTION=LOG
FUNCTION=SQRT
FUNCTION=INVERSE
FUNCTION=LOGISTIC
FUNCTION=AUTO
TYPE=NONE | LOG | SQRT | INVERSE | LOGISTIC | AUTO
specifies the transformation to be applied to the series prior to estimating a regARIMA model. The transformation used by FUNCTION=NONE, LOG, SQRT, INVERSE, and LOGISTIC is related to the POWER= option as shown in Table 32.4. FUNCTION=AUTO uses selection based on Akaike's information criterion (AIC) to decide between a log transformation and no transformation. The default is FUNCTION=NONE.
However, the FUNCTION= and POWER= options are not completely equivalent. In some cases, using the FUNCTION= option causes the program to automatically select other options. For instance, FUNCTION=NONE causes the default mode to be MODE=ADD in the X11 statement. Also, the choice of transformation invoked by the FUNCTION=AUTO option can impact the default mode of the X11 statement.
Note that there are restrictions on the value used in the POWER= and FUNCTION= options when preadjustment factors for seasonal adjustment are generated from a regARIMA model. When seasonal adjustment is requested with the X11 statement, any value of the POWER= option can be used for the purpose of forecasting the series with a regARIMA model. However, this is not the case when factors generated from the regression coefficients are used to adjust either the original series or the final seasonally adjusted series. In this case, the only accepted transformations are the log transformation, which can be specified as POWER=0 (for multiplicative or log-additive seasonal adjustments), and no transformation, which can be specified as POWER=1 (for additive seasonal adjustments). If no seasonal adjustment is performed, any POWER= transformation can be used. The preceding restrictions also apply to FUNCTION=NONE and FUNCTION=LOG.
USERDEFINED Statement
USERDEFINED variables ;
The USERDEFINED statement is used to identify the variables in the input data set that are available for user-defined regression. Only numeric variables can be specified. Note that specifying variables in the USERDEFINED statement does not include the variables as regressors. If a variable is specified in the INPUT statement or REGRESSION USERVAR= option, it is not necessary to include that variable in the USERDEFINED statement. However, if a variable is specified in the MDLINFOIN= data set and is not specified in an INPUT statement or in the REGRESSION USERVAR= option, then the variable should be specified in the USERDEFINED statement in order to make the variable available for regression.
VAR Statement
VAR variables ;
The VAR statement is used to specify the variables in the input data set that are to be analyzed by the procedure. Only numeric variables can be specified. If the VAR statement is omitted, all numeric variables are analyzed except those that appear in a BY statement, ID statement, INPUT statement, USERDEFINED statement, the USERVAR= option of the REGRESSION statement, or the variable named in the DATE= option in the PROC X12 statement.
X11 Statement
X11 options ;
The X11 statement is an optional statement for invoking seasonal adjustment by an enhanced version of the methodology of the Census Bureau X-11 and X-11Q programs. You can control the type of seasonal adjustment decomposition calculated with the MODE= option. The output includes the final tables and diagnostics for the X-11 seasonal adjustment method listed in Table 32.5. Tables C20, D1, and D7 are not displayed by default; you can display these tables by using the TABLES statement.
Table 32.5

Table Name   Description
B1           original series, adjusted for prior effects and forecast extended
C17          final weights for the irregular component
C20          final extreme value adjustment factors
D1           modified original data, D iteration
D7           preliminary trend cycle, D iteration
D8           final unmodified SI ratios (differences)
D8A          F tests for stable and moving seasonality, D8
D9           final replacement values for extreme SI ratios (differences), D iteration
D9A          moving seasonality ratios for each period
D10          final seasonal factors
D10B         seasonal factors, adjusted for user-defined seasonal
D10D         final seasonal difference
D11          final seasonally adjusted series
D11A         final seasonally adjusted series with forced yearly totals
D11R         rounded final seasonally adjusted series (with forced yearly totals)
D12          final trend cycle
D13          final irregular component
D16          combined seasonal and trading day factors
D16B         final adjustment differences
D18          combined calendar adjustment factors
E4           ratio of yearly totals of original and seasonally adjusted series
E5           percent changes (differences) in original series
E6           percent changes (differences) in seasonally adjusted series
E6A          percent changes (differences) in seasonally adjusted series with forced yearly totals (D11.A)
E6R          percent changes (differences) in rounded seasonally adjusted series (D11.R)
E7           percent changes (differences) in final trend component series
F2A-F2I      X11 diagnostic summary
F3           monitoring and quality assessment statistics
F4           day of the week trading day component factors
G            spectral plots
For more details about the X-11 seasonal adjustment diagnostics, see Shiskin, Young, and Musgrave (1967), Lothian and Morry (1978a), and Ladiray and Quenneville (2001). The following options can appear in the X11 statement.
MODE=ADD
MODE=MULT
MODE=LOGADD
MODE=PSEUDOADD
specifies the type of seasonal adjustment decomposition to be performed. There are four choices: multiplicative (MODE=MULT), additive (MODE=ADD), pseudo-additive (MODE=PSEUDOADD), and log-additive (MODE=LOGADD) decomposition. If this option is omitted, the procedure performs multiplicative adjustments. Table 32.6 shows the values of the MODE= option and the corresponding models for the original (O) and the seasonally adjusted (SA) series.
Table 32.6 Modes of Seasonal Adjustment and Their Models
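For orientation, the decomposition models usually associated with these four modes in the X-11 literature can be written as follows. This is a sketch under stated assumptions rather than a transcription of the table: the symbols C (trend cycle), S (seasonal), and I (irregular) are assumed notation, while O and SA are the original and seasonally adjusted series named in the MODE= description.

```latex
\begin{array}{lll}
\text{MODE=}     & \text{Model for } O           & \text{Model for } SA \\
\text{MULT}      & O = C \times S \times I       & SA = C \times I      \\
\text{ADD}       & O = C + S + I                 & SA = C + I           \\
\text{PSEUDOADD} & O = C \times (S + I - 1)      & SA = C \times I      \\
\text{LOGADD}    & \log(O) = C + S + I           & SA = \exp(C + I)
\end{array}
```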
OUTFORECAST
determines whether forecasts are included in certain tables sent to the output data set. If OUTFORECAST is specified, then forecast values are included in the output data set for tables A6, A7, A8, A9, A10, B1, D10, D10B, D10D, D16, D16B, and D18. The default is not to include forecasts.
SEASONALMA=S3X1
SEASONALMA=S3X3
SEASONALMA=S3X5
SEASONALMA=S3X9
SEASONALMA=S3X15
SEASONALMA=STABLE
SEASONALMA=X11DEFAULT
SEASONALMA=MSR
specifies which seasonal moving average (also called seasonal "filter") be used to estimate the seasonal factors. These seasonal moving averages are $n \times m$ moving averages, meaning that an n-term simple average is taken of a sequence of consecutive m-term simple averages. X11DEFAULT is the method used by the U.S. Census Bureau's X-11-ARIMA program. The default for PROC X12 is SEASONALMA=MSR, which is the methodology of Statistics Canada's X-11-ARIMA/88 program. Table 32.7 describes the seasonal filter options available for the entire series:
Table 32.7

Filter Name   Description of Filter
S3X1          a $3 \times 1$ moving average
S3X3          a $3 \times 3$ moving average
S3X5          a $3 \times 5$ moving average
S3X9          a $3 \times 9$ moving average
S3X15         a $3 \times 15$ moving average
STABLE        stable seasonal filter. A single seasonal factor for each calendar month or quarter is generated by calculating the simple average of all the values for each month or quarter (taken after detrending and outlier adjustment).
X11DEFAULT    a $3 \times 3$ moving average is used to calculate the initial seasonal factors in each iteration, and a $3 \times 5$ moving average to calculate the final seasonal factors
MSR           filter chosen automatically by using the moving seasonality ratio of X-11-ARIMA/88 (Dagum 1988)
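The phrase "an n-term simple average of consecutive m-term simple averages" determines the filter weights directly. The sketch below derives them with exact fractions; the function name is hypothetical, and the code shows only the weights, not how PROC X12 applies them or treats the ends of the series.

```python
from fractions import Fraction


def composite_weights(n, m):
    """Weights of an n-term simple average of consecutive m-term simple averages."""
    weights = [Fraction(0)] * (n + m - 1)
    for i in range(n):        # position of each m-term average
        for j in range(m):    # term inside that m-term average
            weights[i + j] += Fraction(1, n * m)
    return weights


print([str(w) for w in composite_weights(3, 3)])
# ['1/9', '2/9', '1/3', '2/9', '1/9']
print([str(w) for w in composite_weights(3, 5)])
# ['1/15', '2/15', '1/5', '1/5', '1/5', '2/15', '1/15']
```

A $3 \times m$ filter therefore spans $m + 2$ terms, so the longer filters such as S3X15 smooth the seasonal factors much more heavily than S3X1.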
TRENDMA=value
specifies which Henderson moving average be used to estimate the final trend cycle. Any odd number greater than one and less than or equal to 101 can be specified. Example: TRENDMA=23. If no selection is made, the program selects a trend moving average based on statistical characteristics of the data. For monthly series, a 9-, 13-, or 23-term Henderson moving average is selected. For quarterly series, the program chooses either a 5- or a 7-term Henderson moving average.
FINAL=AO
FINAL=LS
FINAL=TC
FINAL=ALL
lists the types of prior adjustment factors, obtained from the regression and outlier statements, that are to be removed from the final seasonally adjusted series. Additive outliers (FINAL=AO), level change and ramp outliers (FINAL=LS), and temporary change (FINAL=TC) can be removed. If this option is not specified, the final seasonally adjusted series contains these effects.
FORCE=TOTALS
FORCE=ROUND
FORCE=BOTH
specifies that the seasonally adjusted series be modified to (a) force the yearly totals of the seasonally adjusted series and the original series to be the same (FORCE=TOTALS), (b) adjust the seasonally adjusted values for each calendar year so that the sum of the rounded seasonally adjusted series for any year equals the rounded annual total (FORCE=ROUND), or (c) first force the yearly totals, then round the adjusted series (FORCE=BOTH). When FORCE=TOTALS, the differences between the annual totals are distributed over the seasonally adjusted values in a way that approximately preserves the month-to-month (or quarter-to-quarter) movements of the original series. For more details, see Huot (1975) and Cholette (1979). This forcing procedure is not recommended if the seasonal pattern is changing or if trading day adjustment is performed. Forcing the seasonally adjusted totals to be the same as the original series annual totals can degrade the quality of the seasonal adjustment, especially when the seasonal pattern is undergoing change. It is not natural if trading day adjustment is performed because the aggregate trading day effect over a year is variable and moderately different from zero.
Missing Values
PROC X12 can process a series with missing values. Missing values in a series are considered to be one of two types:
One type of missing value is a leading or trailing missing value, which occurs before the first nonmissing value or after the last nonmissing value, respectively, in the span of a series. The span of a series can be determined either explicitly by the SPAN= option of the PROC X12 statement or implicitly by the START= or DATE= options. By default, leading and trailing missing values are ignored. The NOTRIMMISS option of the PROC X12 statement causes leading and trailing missing values to also be processed using the X-12-ARIMA missing value method.
The second type of missing value is an embedded missing value. These missing values occur between the first nonmissing value and the last nonmissing value in the span of the series. Embedded missing values are processed using X-12-ARIMA's missing value method described below.
When the X-12-ARIMA method encounters a missing value, it inserts an additive outlier for that observation into the set of regression variables for the model of the series and then replaces the missing observation with a value large enough to be considered an outlier during model estimation. After the regARIMA model is estimated, the X-12-ARIMA method adjusts the original series by using factors generated from these missing value outlier regressors. The adjusted values are estimates of the missing values, and the adjusted series is displayed in Table MV1.
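The distinction between the two types of missing values can be made concrete with a short sketch. It only classifies positions within a span and does not perform the X-12-ARIMA replacement itself; the function name is hypothetical, missing values are represented by None, and treating an all-missing span as leading is an assumption.

```python
def classify_missing(series):
    """Split positions of missing values (None) into leading, embedded, and trailing."""
    present = [i for i, v in enumerate(series) if v is not None]
    if not present:
        # Assumption: a span with no nonmissing values is treated as all leading.
        return {"leading": list(range(len(series))), "embedded": [], "trailing": []}
    first, last = present[0], present[-1]
    missing = [i for i, v in enumerate(series) if v is None]
    return {
        "leading": [i for i in missing if i < first],
        "embedded": [i for i in missing if first < i < last],
        "trailing": [i for i in missing if i > last],
    }


print(classify_missing([None, 112.0, None, 129.0, None]))
# {'leading': [0], 'embedded': [2], 'trailing': [4]}
```

Under the default behavior described above, only the embedded position would receive a missing value regressor; with NOTRIMMISS, all three would.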
Combined Test for the Presence of Identifiable Seasonality
For seasonality to be identifiable, the series should be identified as seasonal by using the "Test for the Presence of Seasonality Assuming Stability" and "Nonparametric Test for the Presence of Seasonality Assuming Stability." Also, since the presence of moving seasonality can cause distortion, it is important to evaluate the moving seasonality in conjunction with the stable seasonality to determine if the seasonality is identifiable. The test for identifiable seasonality is performed by combining the F tests for stable and moving seasonality, along with a Kruskal-Wallis test for stable seasonality. The description below is based on Lothian and Morry (1978b); other details can be found in Dagum (1988, 1983).
Let $F_s$ and $F_m$ denote the F value for the stable and moving seasonality tests, respectively. The combined test is performed as shown in Table 32.3 and as follows:
1. If the null hypothesis in the stable seasonality test is not rejected at the 0.10% significance level (0.001), then since the series is not seasonal, PROC X12 displays "Identifiable Seasonality Not Present."
2. If the null hypothesis in step 1 is rejected, then compute the following quantities:
$T_1 = \frac{7}{F_m}$
$T_2 = \frac{3 F_m}{F_s}$
If the moving seasonality null hypothesis is not rejected at the 5.0% significance level (0.05) and if $T \ge 1.0$, the null hypothesis of identifiable seasonality not present is accepted and PROC X12 displays "Identifiable Seasonality Not Present."
3. If the null hypothesis of identifiable seasonality not present has not been accepted, but $T_1 \ge 1.0$, $T_2 \ge 1.0$, or the Kruskal-Wallis chi-squared test fails at the 0.10% significance level (0.001), then PROC X12 displays "Identifiable Seasonality Probably Not Present."
4. If the $F_s$ and Kruskal-Wallis chi-squared tests pass, and if none of the combined measures described in steps 2 and 3 fail, then the null hypothesis of identifiable seasonality not present is rejected, and PROC X12 displays "Identifiable Seasonality Present."
The "Summary of Results for Combined Test for the Presence of Identifiable Seasonality" table displays the $T_1$, $T_2$, and $T$ values and the significance levels for the stable seasonality test, the moving seasonality test, and the Kruskal-Wallis test. The last item in the table is the result of the combined test for identifiable seasonality.
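The four steps of the combined test amount to a short decision procedure, sketched below exactly as the steps are worded in this section. The function and argument names are hypothetical, the three test results are passed in as p-values, and the quantities T1, T2, and T are taken as inputs rather than computed. Treating "not rejected at level a" as "p-value greater than or equal to a" is an assumption.

```python
def combined_seasonality_test(p_stable, p_moving, p_kw, t1, t2, t):
    """Return the message from the combined test for identifiable seasonality."""
    # Step 1: stable seasonality F test not rejected at the 0.1% level.
    if p_stable >= 0.001:
        return "Identifiable Seasonality Not Present"
    # Step 2: as worded in the text, moving seasonality not rejected at 5% and T >= 1.0.
    if p_moving >= 0.05 and t >= 1.0:
        return "Identifiable Seasonality Not Present"
    # Step 3: any of T1 >= 1.0, T2 >= 1.0, or a Kruskal-Wallis failure at the 0.1% level.
    if t1 >= 1.0 or t2 >= 1.0 or p_kw >= 0.001:
        return "Identifiable Seasonality Probably Not Present"
    # Step 4: all tests pass and no combined measure fails.
    return "Identifiable Seasonality Present"


# Hypothetical inputs: strongly seasonal series with small T1, T2, and T.
print(combined_seasonality_test(0.0001, 0.20, 0.0001, 0.10, 0.05, 0.27))
# Identifiable Seasonality Present
```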
Computations
For more details about the computations used in PROC X12, see X-12-ARIMA Reference Manual (U.S. Bureau of the Census 2001b). For more details about the X-11 method of decomposition, see Seasonal Adjustment with the X-11 Method (Ladiray and Quenneville 2001).
Displayed Output/ODS Table Names/OUTPUT Tablename Keywords
Table 32.8 describes the individual tables and charts.
"P" indicates that the table is displayed only and is not available for output to the OUT= data set. Data from displayed tables can be extracted into data sets by using the Output Delivery System (ODS). For more information about the features of the ODS Graphics system, including the many ways that you can control or customize the plots produced by SAS procedures, see Chapter 21, "Statistical Graphics Using ODS" (SAS/STAT User's Guide). For more information about the SAS Output Delivery System, see the SAS Output Delivery System: User's Guide.
When tables available through the OUTPUT statement are output using ODS, the summary line is included in the ODS output by default. The summary line gives the average, standard deviation, or total by each period. The value -1 for YEAR indicates that the summary line is a total; the value -2 for YEAR indicates that the summary line is an average; and the value -3 for YEAR indicates that the line is a standard deviation. The value of YEAR for historical and forecast values is greater than or equal to zero. Thus, a negative value indicates a summary line. You can suppress the summary line altogether by specifying the NOSUM option in the TABLES statement. However, the NOSUM option also suppresses the display of the summary line in the displayed table.
"T" indicates that the table is available using the OUTPUT statement, but is not displayed by default; you must request that these tables be displayed by using the TABLES statement.
The actual number of tables displayed depends on the options and statements specified. If a table is not computed, it is not displayed.
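Because a negative YEAR value marks a summary line, the series values and the summary lines in an ODS output data set can be separated with a simple filter. The sketch below works on plain dictionaries that stand in for rows already read from such a data set; the row layout, the column name VALUE, and the function name are hypothetical, and the codes -1, -2, and -3 are assumed to be the negative values the text refers to.

```python
SUMMARY_KIND = {-1: "total", -2: "average", -3: "standard deviation"}


def split_rows(rows):
    """Separate series rows (YEAR >= 0) from summary lines (negative YEAR)."""
    series = [r for r in rows if r["YEAR"] >= 0]
    summary = [(SUMMARY_KIND[r["YEAR"]], r) for r in rows if r["YEAR"] < 0]
    return series, summary


# Hypothetical rows standing in for one column of an ODS output table.
rows = [
    {"YEAR": 1949, "VALUE": 112.0},
    {"YEAR": 1950, "VALUE": 115.0},
    {"YEAR": -2, "VALUE": 113.5},
]
series, summary = split_rows(rows)
print(len(series))    # 2
print(summary[0][0])  # average
```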
Table 32.8 Table Names and Descriptions
Description
Notes
P P
ARIMA estimates for unit root identification                          P
results of unit root test for identifying orders of differencing     P
models estimated by automatic ARIMA model selection procedure        P
best five ARIMA models chosen by automatic modeling                  P
comparison of automatically selected model and default model         P
final automatic model choice                                         P
Modeling Tables

Table                     Description                                                              Notes
MissingExtreme            extreme or missing values                                                P
ARMAIterationTolerances   exact ARMA likelihood estimation iteration tolerances                    P
IterHistory               ARMA iteration history                                                   P
OutlierDetection          critical values to use in outlier detection                              P
PotentialOutliers         potential outliers                                                       P
ARMAIterationSummary      exact ARMA likelihood estimation iteration summary                       P
RegParameterEstimates     regression model parameter estimates                                     P
RegressorGroupChiSq       chi-squared tests for groups of regressors                               P
ARMAParameterEstimates    exact ARMA maximum likelihood estimation                                 P
AvgFcstErr                average absolute percentage error in within(out)sample fore(back)casts   P
Roots                     (non)seasonal (AR)MA roots                                               P
MLESummary                estimation summary                                                       P
ForecastCL                forecasts, standard errors, and confidence limits                        P
MV1                       original series adjusted for missing value regressors

Sequenced Tables

Table   Description
A1      original series
A2      prior-adjustment factors
A6      regARIMA trading day component
A7      regARIMA holiday component
A8      regARIMA combined outlier component
A8AO    regARIMA AO outlier component
A8LS    regARIMA level change outlier component
A8TC    regARIMA temporary change outlier component
A9      regARIMA user-defined regression component
A10     regARIMA user-defined seasonal component
B1      prior-adjusted or original series
C17     final weight for irregular components
C20     final extreme value adjustment factors
Table    Description
D1       modified original data, D iteration
D7       preliminary trend cycle, D iteration
D8       final unmodified S-I ratios
D8A      seasonality tests
D9       final replacement values for extreme S-I ratios
D9A      moving seasonality ratio
D10      final seasonal factors
D10B     seasonal factors, adjusted for user-defined seasonal
D10D     final seasonal difference
D11      final seasonally adjusted series
D11A     final seasonally adjusted series with forced yearly totals
D11F     factors applied to get adjusted series with forced yearly totals
D11R     rounded final seasonally adjusted series (with forced yearly totals)
D12      final trend cycle
D13      final irregular series
D16      combined adjustment factors
D16B     final adjustment differences
D18      combined calendar adjustment factors
E4       ratios of annual totals
E5       percent changes in original series
E6       percent changes in final seasonally adjusted series
E6A      percent changes (differences) in seasonally adjusted series with forced yearly totals (D11.A)
E6R      percent changes (differences) in rounded seasonally adjusted series (D11.R)
E7       differences in final trend cycle
F2A-I    summary measures
F3       quality assessment statistics
F4       day of the week trading day component factors
G        spectral analysis
Notes T T P P
P P P P
ODS Graphics
This section describes the use of ODS for creating graphics with the X12 procedure. To request these graphs, you must specify the ODS GRAPHICS statement. The graphics available through ODS GRAPHICS are ACF plots, PACF plots, and spectral graphs. ACF and PACF plots are not available unless the IDENTIFY statement is used. A spectral plot of the original series is always available; however, additional spectral plots are provided when the X11 statement is used. When the ODS GRAPHICS statement is not used, the plots are integrated into the ACF, PACF, and spectral tables as a column of the table.
Plot Description
autocorrelation of model residuals
partial autocorrelation of model residuals
spectral plot of original or adjusted series
_NAME_
_MODELTYPE_
ted in other SAS procedures, PROC X12 recognizes only REG and ARIMA.

_MODELPART_ further qualifies the regression or ARIMA information in the observation. For _MODELTYPE_=REG, valid values of _MODELPART_ are INPUT, EVENT, and PREDEFINED. A value of INPUT indicates that this observation refers to the user-defined variable whose name is given in _DSVAR_. Likewise, a value of EVENT indicates that the observation refers to the SAS or user-defined EVENT whose name is given in _DSVAR_. PREDEFINED indicates that the name given in _DSVAR_ is a predefined U.S. Census Bureau variable. If only model information is included in the data set (that is, all observations have _MODELTYPE_=ARIMA), then the _MODELPART_ variable can be omitted. However, valid values for model information are FORECAST, a period (.), or blank.

_COMPONENT_ further qualifies the regression or ARIMA information in the observation. For _MODELTYPE_=REG, the only valid value of _COMPONENT_ is SCALE. Other SAS procedures might allow other values. For _MODELTYPE_=ARIMA, the valid values of _COMPONENT_ are TRANSFORM, CONSTANT, NONSEASONAL, and SEASONAL. TRANSFORM indicates that the observation contains the information that would be supplied in the TRANSFORM statement. CONSTANT is specified to control the constant term in the model. NONSEASONAL and SEASONAL refer to the AR, MA, and differencing terms in the ARIMA model.

_PARMTYPE_ further qualifies the regression or ARIMA information in the observation. For regression information, the value of _PARMTYPE_ is the same as the value of the REGRESSION USERTYPE= option. Since the USERTYPE= option applies only to user-defined events and variables, the value of _PARMTYPE_ does not alter processing in observations where _MODELPART_=PREDEFINED. However, it is consistent to use a value for _PARMTYPE_ that matches the Census predefined variable. For the constant term in model information, _PARMTYPE_ should be SCALE. For transformation information, the value of _PARMTYPE_ should be NONE, LOG, LOGIT, SQRT, or BOXCOX. For ARIMA model information, _PARMTYPE_ should be AR, MA, or DIF.

_DSVAR_ specifies the variable name associated with this observation. For regression information, the value of _DSVAR_ is the name of the user-defined variable, the EVENT, or the Census predefined variable. For model information, _DSVAR_ should match the name of the series being processed. If the model information applies to more than one series, then _DSVAR_ can be blank or a period (.).

_VALUE_ contains a numerical value that is used as a parameter for certain types of information. For certain Census predefined variables, _VALUE_ is the associated parameter value. For example, the REGRESSION statement option PREDEFINED=EASTER(6) would be implemented using _DSVAR_=EASTER and _VALUE_=6. For a BOXCOX transformation, _VALUE_ would be the associated parameter value. For _COMPONENT_=SEASONAL, if _VALUE_ is nonmissing, then _VALUE_ is used as the seasonal period. If _VALUE_ is missing for _COMPONENT_=SEASONAL, then the seasonal period is determined by the interval of the series.

_FACTOR_ applies only to AR and MA ARIMA model information. The actual value of _FACTOR_ should be the same for all observations related to lags within the same ARIMA factor. Thus, the value of _FACTOR_ identifies lags that belong in the same factor.

_LAG_ identifies the degree for differencing and AR and MA lags. If _COMPONENT_=SEASONAL, then the value in _LAG_ is multiplied by the seasonal period indicated by the value of _VALUE_.

_SHIFT_ contains the shift value for transfer functions. This value is not processed by PROC X12.

_NOEST_ indicates whether a parameter associated with the observation is to be estimated. For example, the NOINT option would be indicated by constant information with _NOEST_=1 and _EST_=0. _NOEST_=1 indicates that the value in _EST_ is a fixed value. _NOEST_ pertains to the constant term, to AR and MA lags, and to regression information.

_EST_ contains an initial or fixed value for a parameter associated with the observation that is to be estimated. _NOEST_=1 indicates that the value in _EST_ is a fixed value. _EST_ pertains to the constant term, to AR and MA lags, and to regression information.

Three additional variables each contain output information about estimated parameters.
_CLASS_ _KEYNAME_
_STARTDATE_ _ENDDATE_
_DATEINTRVL_ contains the interval for the date do-list. The default for _DATEINTRVL_ is no interval, designated by a period (.).

_STARTDT_ contains either the datetime timing value or the first datetime timing value to use in a do-list. The default for _STARTDT_ is no datetime, designated by a missing value.

_ENDDT_ contains the last datetime timing value to use in a do-list. The default for _ENDDT_ is no datetime, designated by a missing value.

_DTINTRVL_ contains the interval for the datetime do-list. The default for _DTINTRVL_ is no interval, designated by a period (.).

_STARTOBS_ contains either the observation number timing value or the first observation number timing value to use in a do-list. The default for _STARTOBS_ is no observation number, designated by a missing value.

_ENDOBS_ contains the last observation number timing value to use in a do-list. The default for _ENDOBS_ is no observation number, designated by a missing value.

_OBSINTRVL_ contains the interval length of the observation number do-list. The default for _OBSINTRVL_ is no interval, designated by a period (.).

_TYPE_ type of EVENT. The default for _TYPE_ is POINT.

_VALUE_ value for nonzero observation. The default for _VALUE_ is 1.0.

_PULSE_ INTERVAL that defines the units for the DURATION values. The default for _PULSE_ is no interval, designated by a period (.).

_DUR_BEFORE_ number of durations before the timing value. The default for _DUR_BEFORE_ is 0.

_DUR_AFTER_ number of durations after the timing value. The default for _DUR_AFTER_ is 0.

_SLOPE_BEFORE_ determines whether the curve is GROWTH or DECAY before the timing value for _TYPE_=RAMP, _TYPE_=RAMPP, and _TYPE_=TC. The default for _SLOPE_BEFORE_ is GROWTH.

_SLOPE_AFTER_ determines whether the curve is GROWTH or DECAY after the timing value for _TYPE_=RAMP, _TYPE_=RAMPP, and _TYPE_=TC. The default for _SLOPE_AFTER_ is GROWTH unless _TYPE_=TC; then the default is DECAY.

_SHIFT_ number of _PULSE_= intervals to shift the timing value. The shift can be positive (forward in time) or negative (backward in time). If _PULSE_= is not specified, then the shift is in observations. The default for _SHIFT_ is 0.

_TCPARM_ parameter for EVENT of TYPE=TC. The default for _TCPARM_ is 0.5.

_RULE_ rule to use when combining events or when timing values of an event overlap. The default for _RULE_ is ADD.

_PERIOD_ frequency interval at which the event should be repeated. If this value is missing, then the event is not periodic. The default for _PERIOD_ is no interval, designated by a period (.).

_LABEL_ label or description for the event. If you do not specify a label, then the default label value is displayed as a period (.). For events that produce dummy variables, either the user-supplied label or the default label is used. For COMPLEX events, the _LABEL_ value is merely a description of the group of events.
NOTE: The P-values approximate the probability of observing a Q-value at least this large when the model fitted is correct. When DF is positive, small values of P, customarily those below 0.05 indicate model inadequacy.
Average absolute percentage error in within-sample forecasts:
For variable sales

   Last year:          2.81
   Last-1 year:        6.38
   Last-2 year:        7.69
   Last three years:   5.63

Exact ARMA Likelihood Estimation Iteration Summary
For variable sales

   Number of ARMA iterations        6
   Number of Function Evaluations   19

Exact ARMA Maximum Likelihood Estimation
For variable sales

   Lag   Estimate   Standard Error   t Value
     1    0.40181          0.07887      5.09
    12    0.55695          0.07626      7.30

Estimation Summary
For variable sales

   Number of Residuals              131
   Number of Parameters Estimated   3
   Variance Estimate                1.3E-03
   Standard Error Estimate          3.7E-02
   Log likelihood                   244.6965
   Transformation Adjustment        -735.2943
   Adjusted Log likelihood          -490.5978
   AIC                              987.1956
   AICC (F-corrected-AIC)           987.3845
   Hannan Quinn                     990.7005
   BIC                              995.8211
   ods output D8A#1=SalesD8A_1;
   ods output D8A#2=SalesD8A_2;
   ods output D8A#3=SalesD8A_3;
   ods output D8A#4=SalesD8A_4;

   proc x12 data=sales date=date;
      var sales;
      transform power=0;
      arima model=( (0,1,1)(0,1,1) );
      estimate;
      x11;
   run;

   title 'Stable Seasonality Test';
   proc print data=SalesD8A_1 LABEL;
   run;

   title 'Nonparametric Stable Seasonality Test';
   proc print data=SalesD8A_2 LABEL;
   run;

   title 'Moving Seasonality Test';
   proc print data=SalesD8A_3 LABEL;
   run;

   title 'Combined Seasonality Test';
   proc print data=SalesD8A_4 LABEL NOOBS;
      var _NAME_ Name1 Label1 cValue1;
   run;
** Seasonality present at the 0.1 percent level.

Nonparametric Test for the Presence of Seasonality Assuming Stability

   Kruskal-Wallis Statistic   DF   Probability Level
                   131.9546   11                .00%

Moving Seasonality Test

   F-Value
   3.370317 **

** Moving seasonality present at the one percent level.

Summary of Results and Combined Test for the Presence of Identifiable Seasonality

   Seasonality Tests:                             Probability Level
   Stable Seasonality F-test                      0.000
   Moving Seasonality F-test                      0.001
   Kruskal-Wallis Chi-square Test                 0.000

   Combined Measures:                             Value
   T1 = 7/F_Stable                                0.04
   T2 = 3*F_Moving/F_Stable                       0.05
   T = (T1 + T2)/2                                0.04

   Combined Test of Identifiable Seasonality:     Present
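The combined measures in the summary table follow directly from the two F-values reported for this example (190.9544 for the stable seasonality test and 3.370317 for the moving seasonality test). The following DATA step is a minimal sketch of that arithmetic, using the formulas exactly as they are labeled in the table; the variable names are illustrative and are not PROC X12 output names.

```sas
data _null_;
   F_Stable = 190.9544;    /* stable seasonality F-value from this example */
   F_Moving = 3.370317;    /* moving seasonality F-value from this example */

   T1 = 7 / F_Stable;             /* T1 = 7/F_Stable          */
   T2 = 3 * F_Moving / F_Stable;  /* T2 = 3*F_Moving/F_Stable */
   T  = (T1 + T2) / 2;            /* T  = (T1 + T2)/2         */

   /* Write the three measures to the SAS log */
   put T1= 6.4 T2= 6.4 T= 6.4;
run;
```

Rounded to two decimal places, the three measures agree with the printed values 0.04, 0.05, and 0.04. All three are well below 1.0, which is consistent with the "Present" result.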
The four ODS statements in the preceding example direct output from the D8A tables into four data sets: SalesD8A_1, SalesD8A_2, SalesD8A_3, and SalesD8A_4. It is best to direct the output to four different data sets because the four tables associated with table D8A have varying formats. The ODS data sets are shown in Output 32.3.2, Output 32.3.3, Output 32.3.4, and Output 32.3.5.
Output 32.3.2 Table D8.A as Output in a Data Set by Using ODS
Stable Seasonality Test

   Obs   Sum of Squares    DF   Mean Square    F-Value   FT_AST
     1         23571.41    11      2142.855   190.9544   **
     2          1481.28   132      11.22182          .
     3         25052.69   143             .          .

Nonparametric Stable Seasonality Test

   Obs   _NAME_   DF
     1   sales    11

Moving Seasonality Test

   Obs    DF    F-Value   FT_AST
     1    10   3.370317   **
     2   110          .
      ModelEstimation.AutoModel.UnitRootTest
      ModelEstimation.AutoModel.AutoChoiceModel
      ModelEstimation.AutoModel.Best5Model
      ModelEstimation.AutoModel.AutomaticModelChoice
      ModelEstimation.AutoModel.FinalModelChoice
      ModelEstimation.AutoModel.AutomdlNote;

   proc x12 data=sales date=date;
      var sales;
      transform function=log;
      regression predefined=td;
      automdl maxorder=(1,1)
              print=unitroottest unitroottestmdl autochoicemdl best5model;
      estimate;
      x11;
      output out=out(obs=23) a1 a2 a6 b1 c17 c20 d1 d7 d8 d9 d10 d11 d12 d13 d16 d18;
   run;

   proc print data=out(obs=23);
      title 'Output Variables Related to Trading Day Regression';
   run;
The automatic model selection output is shown in Output 32.4.1, Output 32.4.2, and Output 32.4.3. The first table, "ARIMA Estimate for Unit Root Identification," gives details of the method that TRAMO uses to automatically select the orders of differencing. The second table, "Results of Unit Root Test for Identifying Orders of Differencing," shows that TRAMO has determined a regular difference order of 1 and a seasonal difference order of 1. The third table, "Models estimated by Automatic ARIMA Model Selection procedure," shows all the models examined by the TRAMO-based method. The fourth table, "Best Five ARIMA Models Chosen by Automatic Modeling," shows the top five models in order of rank and their BIC2 statistic. The fifth table, "Comparison of Automatically Selected Model and Default Model," compares the model selected by the TRAMO-based method to the default X-12-ARIMA model. The sixth table, "Final Automatic Model Selection," shows which model was actually selected.
Estimated Model
   ( 2, 0, 0)( 1, 0, 0)    3 rows
   ( 1, 1, 1)( 1, 0, 1)    4 rows
   ( 1, 1, 1)( 1, 1, 1)    4 rows

Results of Unit Root Test for Identifying Orders of Differencing
For variable sales

   Regular difference order    1
   Seasonal difference order   1
   Mean Significant            no

Estimated Model
   [table of candidate model orders; the column values are not recoverable from the extracted text]

Best Five ARIMA Models Chosen by Automatic Modeling
For variable sales

   Rank   Estimated Model             BIC2
      1   ( 0, 1, 1)( 0, 1, 1)    -3.69426
      2   ( 1, 1, 0)( 0, 1, 1)    -3.69037
      3   ( 1, 1, 1)( 0, 1, 1)    -3.65918
      4   ( 0, 1, 0)( 0, 1, 1)    -3.61628
      5   ( 0, 1, 1)( 0, 1, 0)    -3.45663

Comparison of Automatically Selected Model and Default Model
For variable sales

   Statistics of Fit
     Plbox       Rvr   Number of Outliers
   0.62560   0.03546                    0
   0.62561   0.03546                    0

Final Automatic Model Selection
For variable sales

   Source of Model          Estimated Model
   Automatic Model Choice   ( 0, 1, 1)( 0, 1, 1)
Table 32.10 and Output 32.4.4 illustrate the regARIMA modeling method. Table 32.10 shows the relationships among the PROC X12 output variables that result from a regARIMA model. Note that some of these formulas apply only to this example. Output 32.4.4 shows the values of these variables for the first 23 observations in the example.
Table 32.10
Table, Title, Type, Formula

A1    time series data (for the span analyzed)
A2    prior-adjustment factors; leap year (from trading day regression) adjustments
A6    regARIMA trading day component; leap year prior adjustments included from Table A2
      Type: factor
B1    original series (prior adjusted) (adjusted for regARIMA factors)
      Formula: B1 = A1/A6 *
C17   final weights for irregular component
      Formula: calculated using moving standard deviation
C20   final extreme value adjustment factors
      Formula: calculated using C16 and C17
D1    modified original data, D iteration
      Formula: D1 = B1/C20 **
D7    Type: data
      Formula: calculated using Henderson moving average
D8    Type: factor
      Formula: D8 = B1/D7 ***
D9    Type: factor
      Formula: if C17 shows extreme values, D9 = D1/D7; D9 = . otherwise
D10   final seasonal factors
      Type: factor
      Formula: calculated using moving averages
D11   final seasonally adjusted data (also adjusted for trading day)
      Type: data
      Formula: D11 = B1/D10 ****
D12   final trend cycle
      Formula: calculated using Henderson moving average
D13   final irregular component
      Formula: D13 = D11/D12
D16   combined adjustment factors (includes seasonal, trading day factors)
      Formula: D16 = A1/D11
D18   combined calendar adjustment factors (includes trading day factors)
      Formula: D18 = D16/D10; D18 = A6 *****

*     because only TD specified
**    D1 = C19/C20; C19 = B1 in this example
***   D8 = C19/D7; TD specified in regression
****  D11 = C19/D10; B1 = C19 for this example
***** regression TD is the only calendar adjustment factor in this example
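The formulas in Table 32.10 can be checked by hand against the first observation (SEP78) of the example output. The following DATA step is a minimal sketch of that check. It takes the printed values of A1, A6, D10, and D12 as inputs and recomputes B1, D11, D13, D16, and D18; the DATA step variable names simply mirror the table names and are not the sales_-prefixed names in the OUT= data set.

```sas
data _null_;
   /* Printed values for the first observation (SEP78) */
   A1  = 112;        /* original series                 */
   A6  = 1.01328;    /* regARIMA trading day component  */
   D10 = 0.90264;    /* final seasonal factor           */
   D12 = 124.448;    /* final trend cycle               */

   B1  = A1 / A6;    /* prior-adjusted series (only TD specified) */
   D11 = B1 / D10;   /* final seasonally adjusted series          */
   D13 = D11 / D12;  /* final irregular component                 */
   D16 = A1 / D11;   /* combined adjustment factor                */
   D18 = D16 / D10;  /* combined calendar factor; expected to be close to A6 */

   /* Write the recomputed values to the SAS log */
   put B1= 8.3 D11= 8.3 D13= 8.5 D16= 8.5 D18= 8.5;
run;
```

The recomputed B1, D11, and D13 agree with the printed values 110.532, 122.453, and 0.98398 up to rounding of the inputs, and D18 comes out close to A6, as the last formula in the table implies for this example.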
Obs  DATE   sales_A1  sales_A2  sales_A6  sales_B1  sales_D8  sales_D10  sales_D11  sales_D12  sales_D13
  1  SEP78       112   1.00000   1.01328   110.532   0.89040    0.90264    122.453    124.448    0.98398
  2  OCT78       118   1.00000   0.99727   118.323   0.94731    0.94328    125.438    125.115    1.00258
  3  NOV78       132   1.00000   0.98960   133.388   1.06161    1.06320    125.459    125.723    0.99790
  4  DEC78       129   1.00000   1.00957   127.777   1.01225    0.99534    128.375    126.205    1.01720
  5  JAN79       121   1.00000   0.99408   121.721   0.96179    0.97312    125.083    126.479    0.98896
  6  FEB79       135   0.99115   0.99115   136.205   1.07521    1.05931    128.579    126.587    1.01574
  7  MAR79       148   1.00000   1.00966   146.584   1.15580    1.17842    124.391    126.723    0.98160
  8  APR79       148   1.00000   0.99279   149.075   1.17347    1.18283    126.033    126.902    0.99315
  9  MAY79       136   1.00000   0.99406   136.813   1.07360    1.06125    128.916    127.257    1.01303
 10  JUN79       119   1.00000   1.01328   117.440   0.91822    0.91663    128.121    127.747    1.00293
 11  JUL79       104   1.00000   0.99727   104.285   0.81156    0.81329    128.226    128.421    0.99848
 12  AUG79       118   1.00000   0.99678   118.381   0.91589    0.91135    129.897    129.316    1.00449
 13  SEP79       115   1.00000   1.00229   114.737   0.88151    0.90514    126.761    130.347    0.97249
 14  OCT79       126   1.00000   0.99408   126.751   0.96581    0.93820    135.100    131.507    1.02732
 15  NOV79       141   1.00000   1.00366   140.486   1.05869    1.06183    132.306    132.937    0.99525
 16  DEC79       135   1.00000   0.99872   135.173   1.00429    0.99339    136.072    134.720    1.01004
 17  JAN80       125   1.00000   0.99406   125.747   0.91906    0.97481    128.996    136.763    0.94321
 18  FEB80       149   1.02655   1.03400   144.100   1.03509    1.06153    135.748    138.996    0.97663
 19  MAR80       170   1.00000   0.99872   170.217   1.20245    1.17965    144.295    141.221    1.02177
 20  APR80       170   1.00000   0.99763   170.404   1.18520    1.18499    143.802    143.397    1.00283
 21  MAY80       158   1.00000   1.00966   156.489   1.07239    1.06005    147.624    145.591    1.01397
 22  JUN80       133   1.00000   0.99279   133.966   0.90436    0.91971    145.662    147.968    0.98442
 23  JUL80       114   1.00000   0.99406   114.681   0.76108    0.81275    141.103    150.771    0.93588
Output 32.5.1 PROC X12 Output When Potential Outliers Are Identified

Automatic Outlier Identification
The X12 Procedure

Critical Values to use in Outlier Detection
For variable sales

   Begin               SEP1978
   End                 AUG1990
   Observations        144
   Method              Add One
   AO Critical Value   3.889838
   LS Critical Value   3.889838

NOTE: The following time series values might later be identified as outliers when data are added or revised. They were not identified as outliers in this run either because their test t-statistics were slightly below the critical value or because they were eliminated during the backward deletion step of the identification procedure, when a non-robust t-statistic is used.

Potential Outliers
For variable sales

   Type of Outlier   Date      t Value for AO
   AO                NOV1989            -3.48
In the next example, reducing the critical value to 3.3 causes the outlier identification routine to identify outliers more aggressively, as shown in Output 32.5.2. The first table shows the critical values. The second table shows that three additive outliers and a level shift have been included in the regression model. The third table shows how the inclusion of outliers in the model affects the ARMA parameters.
   proc x12 data=sales date=date;
      var sales;
      transform function=log;
      arima model=((0,1,1) (0,1,1));
      outlier cv=3.3;
      estimate;
      x11;
      output out=outlier(obs=50) a1 a8 a8ao a8ls b1 d10;
   run;

   proc print data=outlier(obs=50);
   run;
Regression Model Parameter Estimates
For variable sales

   Parameter    NoEst   Estimate   Standard Error
   AO JAN1981   Est      0.09590          0.02168
   LS FEB1983   Est     -0.09673          0.02488
   AO OCT1983   Est     -0.08032          0.02146
   AO NOV1989   Est     -0.10323          0.02480
The first 50 observations of the A1, A8, A8AO, A8LS, B1, and D10 series are displayed in Output 32.5.3. You can confirm the following relationships from the data:

   A8 = A8AO × A8LS
   B1 = A1 / A8

The seasonal factors are stored in the variable sales_D10.
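These relationships can also be traced back to the regression estimates. The following DATA step is a minimal sketch under two assumptions that are not stated in this section: that with the log transform each outlier effect becomes a multiplicative factor obtained by exponentiating the estimate, and that a level shift regressor acts on the observations before the shift date with the opposite sign. Both assumptions reproduce the printed A8 and B1 values for SEP78 and JAN81. The variable names are illustrative.

```sas
data _null_;
   ao_jan1981 =  0.09590;   /* AO JAN1981 estimate from the regression table */
   ls_feb1983 = -0.09673;   /* LS FEB1983 estimate from the regression table */

   /* Assumption: observations before FEB1983 carry the factor exp(-estimate) */
   A8LS = exp(-ls_feb1983);
   /* Assumption: the AO factor applies only in JAN1981 */
   A8AO = exp(ao_jan1981);

   /* SEP78: level shift factor only; A1 = 112 */
   A8_sep78 = A8LS;
   B1_sep78 = 112 / A8_sep78;

   /* JAN81: additive outlier and level shift factors; A1 = 172 */
   A8_jan81 = A8AO * A8LS;
   B1_jan81 = 172 / A8_jan81;

   /* Write the recomputed values to the SAS log */
   put A8_sep78= 8.5 B1_sep78= 8.3;
   put A8_jan81= 8.5 B1_jan81= 8.3;
run;
```

Under these assumptions the factors come out near 1.10156 and 1.21243, and the adjusted values near 101.674 and 141.864, which matches the sales_A8 and sales_B1 columns up to rounding of the estimates.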
Obs  DATE   sales_A1  sales_A8  sales_B1
  1  SEP78       112   1.10156   101.674
  2  OCT78       118   1.10156   107.121
  3  NOV78       132   1.10156   119.830
  4  DEC78       129   1.10156   117.107
  5  JAN79       121   1.10156   109.844
  6  FEB79       135   1.10156   122.553
  7  MAR79       148   1.10156   134.355
  8  APR79       148   1.10156   134.355
  9  MAY79       136   1.10156   123.461
 10  JUN79       119   1.10156   108.029
 11  JUL79       104   1.10156    94.412
 12  AUG79       118   1.10156   107.121
 13  SEP79       115   1.10156   104.397
 14  OCT79       126   1.10156   114.383
 15  NOV79       141   1.10156   128.000
 16  DEC79       135   1.10156   122.553
 17  JAN80       125   1.10156   113.475
 18  FEB80       149   1.10156   135.263
 19  MAR80       170   1.10156   154.327
 20  APR80       170   1.10156   154.327
 21  MAY80       158   1.10156   143.433
 22  JUN80       133   1.10156   120.738
 23  JUL80       114   1.10156   103.490
 24  AUG80       140   1.10156   127.093
 25  SEP80       145   1.10156   131.632
 26  OCT80       150   1.10156   136.171
 27  NOV80       178   1.10156   161.589
 28  DEC80       163   1.10156   147.972
 29  JAN81       172   1.21243   141.864
 30  FEB81       178   1.10156   161.589
 31  MAR81       199   1.10156   180.653
 32  APR81       199   1.10156   180.653
 33  MAY81       184   1.10156   167.036
 34  JUN81       162   1.10156   147.064
 35  JUL81       146   1.10156   132.539
 36  AUG81       166   1.10156   150.695
 37  SEP81       171   1.10156   155.234
 38  OCT81       180   1.10156   163.405
 39  NOV81       193   1.10156   175.206
 40  DEC81       181   1.10156   164.312
 41  JAN82       183   1.10156   166.128
 42  FEB82       218   1.10156   197.901
 43  MAR82       230   1.10156   208.795
 44  APR82       242   1.10156   219.688
 45  MAY82       209   1.10156   189.731
 46  JUN82       191   1.10156   173.391
 47  JUL82       172   1.10156   156.142
 48  AUG82       194   1.10156   176.114
 49  SEP82       196   1.10156   177.930
 50  OCT82       196   1.10156   177.930
From the two previous examples, you can examine how outlier detection affects the seasonally adjusted series. Output 32.5.4 shows a plot of A1 versus B1 in the series where outliers are detected. B1 has been adjusted for the additive outliers and the level shift.
   proc sgplot data=outlier;
      series x=date y=sales_A1 / name='A1' markers
                                 markerattrs=(color=red symbol=circle)
                                 lineattrs=(color=red);
      series x=date y=sales_B1 / name='B1' markers
                                 markerattrs=(color=black symbol=asterisk)
                                 lineattrs=(color=black);
      yaxis label='Original and Outlier Adjusted Time Series';
   run;
Output 32.5.5 compares the seasonal factors (table D10) of the series unadjusted for outliers to the series adjusted for outliers. The seasonal factors are based on the B1 series.
   data both;
      merge nooutlier(rename=(sales_D10=unadj_D10)) outlier;
   run;

   title 'Results of Outlier Identification on Final Seasonal Factors';
   proc sgplot data=both;
      series x=date y=unadj_D10 / name='unadjusted' markers
                                  markerattrs=(color=red symbol=circle)
                                  lineattrs=(color=red)
                                  legendlabel='Unadjusted for Outliers';
      series x=date y=sales_D10 / name='adjusted' markers
                                  markerattrs=(color=blue symbol=asterisk)
                                  lineattrs=(color=blue)
                                  legendlabel='Adjusted for Outliers';
      yaxis label='Final Seasonal Factors';
   run;
Output 32.5.5 Seasonal Factors Based on Original and Outlier Adjusted Series
Since the regARIMA model forecasts one year ahead, the user-defined regressor must be defined for 144 observations that start in January of 1949. You can construct a simple length-of-month regressor by using the following DATA step.

   title 'User-defined Regressor for Data to be Seasonally Adjusted';
   data regressors(keep=date LengthOfMonth);
      set sashelp.air;
      LengthOfMonth = INTNX('MONTH',date,1) - date;
   run;
The two data sets must be merged in order to use them as input to PROC X12. The BY statement is used to align the regressors with the time series by the time ID variable DATE.
   title 'Data Set Containing Series and Regressors';
   data datain;
      merge regressors salesdata;
      by date;
   run;

   proc print data=datain(firstobs=121);
   run;

The last 24 observations of the input data set are displayed in Output 32.6.1. Note that the regressor variable is defined for one year (12 observations) beyond the span of the time series to be seasonally adjusted.

Output 32.6.1 PROC X12 Input Data Set with User-Defined Regressor
Data Set Containing Series and Regressors

   Obs   DATE    AIR   LengthOfMonth
   121   JAN59   360              31
   122   FEB59   342              28
   123   MAR59   406              31
   124   APR59   396              30
   125   MAY59   420              31
   126   JUN59   472              30
   127   JUL59   548              31
   128   AUG59   559              31
   129   SEP59   463              30
   130   OCT59   407              31
   131   NOV59   362              30
   132   DEC59   405              31
   133   JAN60     .              31
   134   FEB60     .              29
   135   MAR60     .              31
   136   APR60     .              30
   137   MAY60     .              31
   138   JUN60     .              30
   139   JUL60     .              31
   140   AUG60     .              31
   141   SEP60     .              30
   142   OCT60     .              31
   143   NOV60     .              30
   144   DEC60     .              31
The DATAIN data set is now ready to be used as input to PROC X12. Note that the DATE= variable and the user-defined regressors are automatically excluded from the variables to be seasonally adjusted.

   title 'regARIMA Model with User-defined Regressor';
   proc x12 data=datain date=DATE interval=MONTH;
      transform function=log;
      regression uservar=LengthOfMonth;
      automdl;
      x11;
   run;

The parameter estimates for the regARIMA model are shown in Output 32.6.2.
t Value 2.55
Exact ARMA Maximum Likelihood Estimation
For variable AIR

   Lag   Estimate   Standard Error   t Value
     1    0.33678          0.08506      3.96
    12    0.54078          0.07726      7.00

   title 'Model Identification Output to MDLINFOOUT= Data Set';
   proc x12 data=b1 start=1980q1 interval=qtr MdlInfoOut=mdl;
      automdl;
      outlier;
   run;

   proc print data=mdl;
   run;
Final Automatic Model Selection
For variable y

   Source of Model          Estimated Model
   Automatic Model Choice   ( 2, 1, 0)( 0, 0, 0)

Regression Model Parameter Estimates
For variable y

   Parameter   NoEst    Estimate   Standard Error   t Value
   AO 1984Q3   Est     102.36589          5.96584     17.16

Exact ARMA Maximum Likelihood Estimation
For variable y

   Parameter        Lag   Estimate   Standard Error   t Value
   Nonseasonal AR     1    0.40892          0.20213      2.02
                      2   -0.53710          0.20975     -2.56
Output 32.7.2 PROC X12 MDLINFOOUT= Data Set Model with Outlier Detection
Model Identification Output to MDLINFOOUT= Data Set

   Obs   _NAME_   _MODELTYPE_   _MODELPART_   _COMPONENT_   _PARMTYPE_   _DSVAR_        _VALUE_
     1   y        REG           EVENT         SCALE         AO           AO01JUL1984D   .
     2   y        ARIMA         FORECAST      NONSEASONAL   DIF          y              .
     3   y        ARIMA         FORECAST      NONSEASONAL   AR           y              .
     4   y        ARIMA         FORECAST      NONSEASONAL   AR           y              .

   Obs   _FACTOR_   _LAG_   _SHIFT_   _NOEST_   _STATUS_   _SCORE_   _LABEL_
     1   .          .       .         0         .
     2   .          1       .         .         .
     3   1          1       .         0         .
     4   1          2       .         0         .
Suppose that after examining the output from the preceding example, you decide that an Easter regressor should be added to the model. The following statements create a data set that contains the model identified above and add a U.S. Census Bureau predefined Easter(25) regression. The new model data set, which is to be used as input in the MDLINFOIN= option, is shown in Output 32.7.3.

   data pluseaster;
      _NAME_ = 'y';
      _MODELTYPE_ = 'REG';
      _MODELPART_ = 'PREDEFINED';
      _COMPONENT_ = 'SCALE';
      _PARMTYPE_ = 'EASTER';
      _DSVAR_ = 'EASTER';
      _VALUE_ = 25;
   run;

   data mdlpluseaster;
      set mdl;
   run;

   proc append base=mdlpluseaster data=pluseaster force;
   run;

   proc print data=mdlpluseaster;
   run;
Output 32.7.3 MDLINFOIN= Data Set Model with Easter(25) Regression Added
Model Identification Output to MDLINFOOUT= Data Set

   Obs   _NAME_   _MODELTYPE_   _MODELPART_   _COMPONENT_   _PARMTYPE_   _DSVAR_
     1   y        REG           EVENT         SCALE         AO           AO01JUL1984D
     2   y        ARIMA         FORECAST      NONSEASONAL   DIF          y
     3   y        ARIMA         FORECAST      NONSEASONAL   AR           y
     4   y        ARIMA         FORECAST      NONSEASONAL   AR           y
     5   y        REG           PREDEFINED    SCALE         EASTER       EASTER

   Obs   _VALUE_   _FACTOR_   _LAG_   _SHIFT_   _NOEST_   _STATUS_   _SCORE_   _LABEL_
     1   .         .          .       .         0         .
     2   .         .          1       .         .         .
     3   .         1          1       .         0         .
     4   .         1          2       .         0         .
     5   25        .          .       .         .         .
The following statements estimate the regression and ARIMA parameters by using the model described in the new data set mdlpluseaster. The results of estimating the new model are shown in Output 32.7.4.

   proc x12 data=b1 start=1980q1 interval=qtr
            MdlInfoIn=mdlpluseaster MdlInfoOut=mdl2;
      estimate;
   run;
Exact ARMA Maximum Likelihood Estimation
For variable y

   Parameter        Lag   Estimate   Standard Error   t Value
   Nonseasonal AR     1    0.44376          0.20739      2.14
                      2   -0.54050          0.21656     -2.50
The new model estimation results are displayed in the data set mdl2 shown in Output 32.7.5.
proc print data=mdl2; run;
Output 32.7.5 MDLINFOOUT= Data Set, Estimation of Model with Easter(25) Regression Added
Model Identification Output to MDLINFOOUT= Data Set

   Obs   _NAME_   _MODELTYPE_   _MODELPART_   _COMPONENT_   _PARMTYPE_   _DSVAR_
     1   y        REG           PREDEFINED    SCALE         EASTER       EASTER
     2   y        REG           EVENT         SCALE         AO           AO01JUL1984D
     3   y        ARIMA         FORECAST      NONSEASONAL   DIF          y
     4   y        ARIMA         FORECAST      NONSEASONAL   AR           y
     5   y        ARIMA         FORECAST      NONSEASONAL   AR           y

   Obs   _VALUE_   _FACTOR_   _LAG_   _SHIFT_   _NOEST_   _STATUS_   _SCORE_   _LABEL_
     1   25        .          .       .         0         .
     2   .         .          .       .         0         .
     3   .         .          1       .         .         .
     4   .         1          1       .         0         .
     5   .         1          2       .         0         .
   title 'Estimate Easter(25) Parameter';
   proc x12 data=sales date=date MdlInfoOut=mdlout1;
      var sales;
      regression predefined=easter(25);
      automdl;
   run;
   Type     t Value
   Easter     -1.45

Exact ARMA Maximum Likelihood Estimation
For variable sales

   Parameter        Lag   Estimate   Standard Error   t Value
   Nonseasonal AR     1    0.62148          0.09279      6.70
                      2    0.23354          0.10385      2.25
                      3   -0.07191          0.09055     -0.79
   Nonseasonal MA     1    0.97377          0.03771     25.82
   Seasonal MA       12    0.10558          0.10205      1.03
The MDLINFOOUT= data set, mdlout1, that contains the model and parameter estimates is shown in Output 32.8.2.
proc print data=mdlout1; run;
Output 32.8.2 MDLINFOOUT= Data Set, Estimation of Automatic Model ID with Easter(25) Regression
Estimate Easter(25) Parameter

   Obs   _MODELTYPE_   _MODELPART_   _COMPONENT_
     1   REG           PREDEFINED    SCALE
     2   ARIMA         FORECAST      NONSEASONAL
     3   ARIMA         FORECAST      SEASONAL
     4   ARIMA         FORECAST      NONSEASONAL
     5   ARIMA         FORECAST      NONSEASONAL
     6   ARIMA         FORECAST      NONSEASONAL
     7   ARIMA         FORECAST      NONSEASONAL
     8   ARIMA         FORECAST      SEASONAL

   Obs   _VALUE_   _FACTOR_   _LAG_   _SHIFT_   _NOEST_   _STATUS_   _SCORE_   _LABEL_
     1   25        .          .       .         0         .
     2   .         .          1       .         .         .
     3   .         .          1       .         .         .
     4   .         1          1       .         0         .
     5   .         1          2       .         0         .
     6   .         1          3       .         0         .
     7   .         1          1       .         0         .
     8   .         2          1       .         0         .
To fix the Easter(25) parameter while adding a regressor that is weighted according to the number of Saturdays in a month, either use the REGRESSION and EVENT statements or create a MDLINFOIN= data set. The following statements show the method for using the REGRESSION statement to fix the EASTER parameter and the EVENT statement to add the SATURDAY regressor. The output is shown in Output 32.8.3.
   title 'Use SAS Statements to Alter Model';
   proc x12 data=sales date=date MdlInfoOut=mdlout2grm;
      var sales;
      regression predefined=easter(25) / b=-5.029298 F;
      event Saturday;
      automdl;
   run;
Output 32.8.3 Automatic Model ID with Fixed Easter(25) and Saturday Regression
Use SAS Statements to Alter Model

The X12 Procedure

Regression Model Parameter Estimates
For variable sales

Parameter    NoEst   Estimate   Standard Error   t Value
Saturday     Est      3.23225          1.16701      2.77
Easter[25]   Fixed   -5.02930                .         .

Exact ARMA Maximum Likelihood Estimation
For variable sales

Parameter        Lag   Estimate   Standard Error   t Value
Nonseasonal AR     1   -0.32506          0.08256     -3.94
To fix the EASTER regressor and add the new SATURDAY regressor by using a DATA step, create the data set mdlin2 as shown in the following statements. The data set mdlin2 is displayed in Output 32.8.4.
   title 'Use a SAS DATA Step to Create a MdlInfoIn= Data Set';

   /* One-observation data set that describes the SATURDAY event regressor */
   data plusSaturday;
      _NAME_ = 'sales';
      _MODELTYPE_ = 'REG';
      _MODELPART_ = 'EVENT';
      _COMPONENT_ = 'SCALE';
      _PARMTYPE_ = 'USER';
      _DSVAR_ = 'SATURDAY';
   run;

   /* Fix the EASTER parameter at its previously estimated value */
   data mdlin2;
      set mdlout1;
      if ( _DSVAR_ = 'EASTER' ) then do;
         _NOEST_ = 1;
         _EST_ = -5.029298;
      end;
   run;

   proc append base=mdlin2 data=plusSaturday force;
   run;

   proc print data=mdlin2;
   run;
Output 32.8.4 MDLINFOIN= Data Set, Fixed Easter(25) and Added Saturday Regression, Previously Identified Model

Use a SAS DATA Step to Create a MdlInfoIn= Data Set

Obs  _MODELTYPE_  _MODELPART_  _COMPONENT_
  1  REG          PREDEFINED   SCALE
  2  ARIMA        FORECAST     NONSEASONAL
  3  ARIMA        FORECAST     SEASONAL
  4  ARIMA        FORECAST     NONSEASONAL
  5  ARIMA        FORECAST     NONSEASONAL
  6  ARIMA        FORECAST     NONSEASONAL
  7  ARIMA        FORECAST     NONSEASONAL
  8  ARIMA        FORECAST     SEASONAL
  9  REG          EVENT        SCALE

Obs  _VALUE_  _FACTOR_  _LAG_  _SHIFT_  _NOEST_  _STATUS_  _SCORE_  _LABEL_
  1       25         .      .        .        1         .
  2        .         .      1        .        .         .
  3        .         .      1        .        .         .
  4        .         1      1        .        0         .
  5        .         1      2        .        0         .
  6        .         1      3        .        0         .
  7        .         1      1        .        0         .
  8        .         2      1        .        0         .
  9        .         .      .        .        .         .
The data set mdlin2 can be used to replace the regression and model information contained in the REGRESSION, EVENT, and AUTOMDL statements. Note that the model specified in the mdlin2 data set is the same model as the automatically identified model. The following example uses the mdlin2 data set as input; the results are displayed in Output 32.8.5.
   title 'Use Updated Data Set to Alter Model';
   proc x12 data=sales date=date MdlInfoIn=mdlin2 MdlInfoOut=mdlout2DS;
      var sales;
      estimate;
   run;
Output 32.8.5 Estimate MDLINFOIN= File for Model with Fixed Easter(25) and Saturday Regression, Previously Identified Model

Use Updated Data Set to Alter Model

The X12 Procedure

Regression Model Parameter Estimates
For variable sales

Parameter    NoEst   Estimate   Standard Error   t Value
SATURDAY     Est      3.41762          1.07641      3.18
Easter[25]   Fixed   -5.02930                .         .

Exact ARMA Maximum Likelihood Estimation
For variable sales

Parameter        Lag   Estimate   Standard Error   t Value
Nonseasonal AR     1    0.62225          0.09175      6.78
                   2    0.30429          0.10109      3.01
                   3   -0.14862          0.08859     -1.68
Nonseasonal MA     1    0.97125          0.03798     25.57
Seasonal MA       12    0.11691          0.10000      1.17
The following statements specify almost the same information as is contained in the data set mdlin2. Note that the ARIMA statement is used to specify the lags of the model. However, the initial AR and MA parameter values are the defaults, whereas initial values can be specified when the mdlin2 data set is used as input. The results are displayed in Output 32.8.6.
   title 'Use SAS Statements to Alter Model';
   proc x12 data=sales date=date MdlInfoOut=mdlout3grm;
      var sales;
      regression predefined=easter(25) / b=-5.029298 F;
      event Saturday;
      arima model=((3 1 1)(0 1 1));
      estimate;
   run;

   proc print data=mdlout3grm;
   run;
Output 32.8.6 MDLINFOOUT= Statement, Fixed Easter(25) and Added Saturday Regression, Previously Identified Model

Use SAS Statements to Alter Model

Obs  _MODELTYPE_  _MODELPART_  _COMPONENT_
  1  REG          EVENT        SCALE
  2  REG          PREDEFINED   SCALE
  3  ARIMA        FORECAST     NONSEASONAL
  4  ARIMA        FORECAST     SEASONAL
  5  ARIMA        FORECAST     NONSEASONAL
  6  ARIMA        FORECAST     NONSEASONAL
  7  ARIMA        FORECAST     NONSEASONAL
  8  ARIMA        FORECAST     NONSEASONAL
  9  ARIMA        FORECAST     SEASONAL

Obs  _VALUE_  _FACTOR_  _LAG_  _SHIFT_  _NOEST_  _STATUS_  _SCORE_  _LABEL_
  1        .         .      .        .        0         .
  2       25         .      .        .        1         .
  3        .         .      1        .        .         .
  4        .         .      1        .        .         .
  5        .         1      1        .        0         .
  6        .         1      2        .        0         .
  7        .         1      3        .        0         .
  8        .         1      1        .        0         .
  9        .         2      1        .        0         .
The MDLINFOOUT= data set provides a method for comparing the results of the model identification. The data set mdlout3grm, which results from using the ARIMA MODEL= option, can be compared to the data set mdlout2DS, which results from using the MDLINFOIN= data set with initial values for the AR and MA parameters. The mdlout2DS data set is shown in Output 32.8.7, and the results of the comparison are shown in Output 32.8.8. The slight difference in the estimated parameters can be attributed to the difference in the initial values for the AR and MA parameters.
Output 32.8.7 MDLINFOOUT= Data Set, Fixed Easter(25) and Added Saturday Regression, Previously Identified Model

Use SAS Statements to Alter Model

Obs  _MODELTYPE_  _MODELPART_  _COMPONENT_
  1  REG          EVENT        SCALE
  2  REG          PREDEFINED   SCALE
  3  ARIMA        FORECAST     NONSEASONAL
  4  ARIMA        FORECAST     SEASONAL
  5  ARIMA        FORECAST     NONSEASONAL
  6  ARIMA        FORECAST     NONSEASONAL
  7  ARIMA        FORECAST     NONSEASONAL
  8  ARIMA        FORECAST     NONSEASONAL
  9  ARIMA        FORECAST     SEASONAL

Obs  _VALUE_  _FACTOR_  _LAG_  _SHIFT_  _NOEST_  _STATUS_  _SCORE_  _LABEL_
  1        .         .      .        .        0         .
  2       25         .      .        .        1         .
  3        .         .      1        .        .         .
  4        .         .      1        .        .         .
  5        .         1      1        .        0         .
  6        .         1      2        .        0         .
  7        .         1      3        .        0         .
  8        .         1      1        .        0         .
  9        .         2      1        .        0         .
   title 'Compare Results of SAS Statement Input and MdlInfoIn= Input';
   proc compare base=mdlout3grm compare=mdlout2DS;
      var _EST_;
   run;
Output 32.8.8 Compare Parameter Estimates from Different MDLINFOOUT= Data Sets
Value Comparison Results for Variables

Value of Parameter Estimate

Obs   Base _EST_   Compare _EST_       Diff.      % Diff
  1       3.4176          3.4176   0.0000225    0.000658
  5       0.6223          0.6222   -0.000033   -0.005237
  6       0.3043          0.3043   -0.000021   -0.006977
  7      -0.1486         -0.1486   0.0000235     -0.0158
  8       0.9713          0.9713   -0.000024   -0.002452
  9       0.1168          0.1169   0.0000759      0.0650
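When differences this small are expected, it can be convenient to have PROC COMPARE judge values as equal within a tolerance instead of listing every discrepancy. The following statements are a minimal sketch that uses the METHOD= and CRITERION= options of PROC COMPARE; the tolerance of 0.0001 is an illustrative assumption, not a recommended value.

```sas
/* Sketch: judge _EST_ values equal when they differ by less than 1E-4. */
/* The tolerance is an assumption chosen for illustration only.         */
title 'Compare Estimates within an Absolute Tolerance';
proc compare base=mdlout3grm compare=mdlout2DS
             method=absolute criterion=0.0001;
   var _EST_;
run;
```

With this tolerance, differences of the size shown in Output 32.8.8 (the largest is 0.0000759) would be judged equal.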
References
Box, G. E. P., Jenkins, G. M., and Reinsel, G. C. (1994), Time Series Analysis: Forecasting and Control, Third Edition, Englewood Cliffs, NJ: Prentice Hall.

Cholette, P. A. (1979), A Comparison and Assessment of Various Adjustment Methods of Sub-annual Series to Yearly Benchmarks, StatCan Staff Paper STC2119, Seasonal Adjustment and Time Series Staff, Statistics Canada, Ottawa.

Dagum, E. B. (1983), The X-11-ARIMA Seasonal Adjustment Method, Technical Report 12-564E, Statistics Canada.

Dagum, E. B. (1988), The X-11-ARIMA/88 Seasonal Adjustment Method: Foundations and User's Manual, Ottawa: Statistics Canada.

Findley, D. F., Monsell, B. C., Bell, W. R., Otto, M. C., and Chen, B. C. (1998), New Capabilities and Methods of the X-12-ARIMA Seasonal Adjustment Program, Journal of Business and Economic Statistics, 16, 127-176.

Gomez, V. and Maravall, A. (1997a), Guide for Using the Programs TRAMO and SEATS, Beta Version, Banco de España.

Gomez, V. and Maravall, A. (1997b), Program TRAMO and SEATS: Instructions for the User, Beta Version, Banco de España.

Huot, G. (1975), Quadratic Minimization Adjustment of Monthly or Quarterly Series to Annual Totals, StatCan Staff Paper STC2104, Statistics Canada, Seasonal Adjustment and Time Series Staff, Ottawa.

Ladiray, D. and Quenneville, B. (2001), Seasonal Adjustment with the X-11 Method, New York: Springer-Verlag.

Ljung, G. M. (1993), On Outlier Detection in Time Series, Journal of the Royal Statistical Society, B, 55, 559-567.

Lothian, J. and Morry, M. (1978a), A Set of Quality Control Statistics for the X-11-ARIMA Seasonal Adjustment Method, StatCan Staff Paper STC1788E, Seasonal Adjustment and Time Series Analysis Staff, Statistics Canada, Ottawa.

Lothian, J. and Morry, M. (1978b), A Test for the Presence of Identifiable Seasonality When Using the X-11-ARIMA Program, StatCan Staff Paper STC2118, Seasonal Adjustment and Time Series Analysis Staff, Statistics Canada, Ottawa.

Shiskin, J., Young, A. H., and Musgrave, J. C. (1967), The X-11 Variant of the Census Method II Seasonal Adjustment Program, Technical Report 15, U.S. Department of Commerce, Bureau of the Census.

U.S. Bureau of the Census (2001a), X-12-ARIMA Quick Reference for UNIX, Version 0.2.8, Washington, DC.

U.S. Bureau of the Census (2001b), X-12-ARIMA Reference Manual, Version 0.2.8, Washington, DC.

U.S. Bureau of the Census (2001c), X-12-ARIMA Seasonal Adjustment Program, Version 0.2.8, Washington, DC.
Part III
Chapter 33
Introduction
The SASECRSP interface engine enables SAS users to access and process time series, events, portfolios, and group data that reside in CRSPAccess databases. It also provides a seamless interface between CRSP, COMPUSTAT, and SAS data processing. Currently, SASECRSP supports access of CRSP Stock databases, CRSP Indices databases and CRSP/Compustat Merged databases.
Opening a Database
The SASECRSP engine uses the LIBNAME statement to enable you to specify which CRSPAccess database you would like to access and how you would like to perform selection on that database. To specify the database, you supply the combination of a physical path to indicate the location of the CRSPAccess data files and a set identifier (SETID) to identify the database desired from those available at the physical path. Specify one SETID from Table 33.1. Note that the CRSP environment variable CRSPDB_SASCAL must be defined before the SASECRSP engine can access the CRSPAccess database calendars that provide the time ID variables and enable the LIBNAME to successfully assign.
Table 33.1 CRSPAccess Databases SETIDs

Data Set
CRSP Stock, daily data
CRSP Stock, monthly data
CRSP/Compustat Merged (CCM) data
CRSP Indices data, monthly index groups
CRSP Indices data, monthly index series
CRSP Indices data, daily index groups
CRSP Indices data, daily index series
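Because the LIBNAME assignment depends on CRSPDB_SASCAL, it can help to confirm that the variable is visible to the SAS session before you assign the libref. The following statements are a minimal sketch that uses the %SYSGET macro function; the libref mstk and the directory path are hypothetical placeholders, and SETID=20 (monthly CRSP Stock data) is used only as an illustration.

```sas
/* Sketch: display the value of CRSPDB_SASCAL as seen by this SAS session. */
%let crspcal = %sysget(CRSPDB_SASCAL);
%put NOTE: CRSPDB_SASCAL resolves to: &crspcal;

/* Hypothetical UNIX path; replace it with the location of your data. */
libname mstk sasecrsp '/path/to/crsp/data/' setid=20;
```

If the %PUT statement shows an empty value, the environment variable is not defined for the session and the LIBNAME assignment should be expected to fail.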
Usually you do not want to open the entire CRSPAccess database, so for efficiency and ease of use, SASECRSP supports a variety of options for performing data selection on your CRSPAccess database with the LIBNAME statement. These options enable you to open and retrieve data for only the portion of the database you desire. The availability of some of these options depends on the type of database being opened.
CRSP Stock Databases

When accessing the CRSP Stock databases, you can select which securities you want to access by specifying their PERMNOs with the PERMNO= option. PERMNO is CRSP's unique permanent issue identification number and the primary key for the CRSP Stock databases. Alternatively, a number of secondary keys can also be used to select stock data. For example, you could use the PERMCO= option to read selected securities based on CRSP's unique permanent company identification number, PERMCO. A full list of possible keys for accessing CRSP Stock data is shown in Table 33.2.
Table 33.2 Keys for Accessing CRSP Stock Data

Key      Access By
PERMNO   CRSP's unique permanent issue identification number. This is the primary key for CRSP Stock databases.
PERMCO   CRSP's unique permanent company identification number
CUSIP    CUSIP number
HCUSIP   historical CUSIP
SICCD    standard industrial classification (SIC) code
TICKER   TICKER symbol (for active companies only)
CRSP/Compustat Merged Databases

When accessing Compustat data via the CCM database, you can select which companies you want to access by specifying their GVKEYs. GVKEY is Compustat's unique identifier and primary key. You can specify a GVKEY to include with the GVKEY= option. Two secondary keys, PERMNO and PERMCO, are also supported for access via their respective PERMNO= and PERMCO= options. A full list of possible keys for accessing CCM data is shown in Table 33.3.
Table 33.3 Keys for Accessing CCM Data

Key      Access By
GVKEY    Compustat's unique identifier and primary key for CCM database
PERMNO   CRSP's unique permanent issue identification number
PERMCO   CRSP's unique permanent company identification number
CRSP Indices Databases

When accessing CRSP Indices data, you can select which indices you want to access by specifying their INDNOs. INDNO is the primary key for CRSP Indices databases. You can specify which INDNO to use with the INDNO= option. No secondary key access is supported for CRSP Indices. A full list of possible keys for accessing CRSP Indices data is shown in Table 33.4.
Table 33.4

Key      Access By
INDNO    CRSP's unique permanent index identifier number. This is the primary key for CRSP Indices databases and enables you to specify which index series or groups you want to select.
Regardless of the database you are accessing, you can always use the INSET= and RANGE= options for subsetting and selection. The RANGE= option subsets the data timewise. The INSET= option enables you to specify which issues or companies you want to select from the CRSP database by using an input SAS data set.
Structure of a SAS Data Set That Contains Time Series Data
based calendar date and is in a CRSP date format. This means dates are stored as an offset in an array of trading days or a trading day calendar. There are five different CRSP trading day calendars, and the one used depends on the frequency of the data member. For example, the CRSP date for a daily time series refers to a daily trading day calendar. The five trading day calendars are annual, quarterly, monthly, weekly, and daily. For your convenience, the format and informat for this field are set so that the CRSP date is automatically converted to an integer date representation when viewed or printed. For data programming, the SASECRSP engine provides 23 different user functions for date conversions between CRSP, SAS, and integer dates.

The CCM database contains members whose dates are based on the fiscal calendar of the corresponding company, so a comprehensive set of time ID variables is provided. CRSPDT, RCALDT, and FISCALDT provide day-based dates, each with its own format.
CRSPDT
This time ID variable provides a date in CRSP date format similar to CALDT. CRSPDT differs only in that its format and informat are not set for automatic conversion to integer dates, because this is already provided by FISCALDT and RCALDT. For fiscal members, CRSPDT is based on the fiscal calendar of the company.

FISCALDT
This time ID variable provides the same date that CRSPDT does, but in integer format. It is the result of performing a CRSP-to-integer date conversion on CRSPDT. Because the date that CRSPDT holds is fiscal, FISCALDT is also fiscal.

RCALDT
This time ID variable is also an integer date, just like FISCALDT, but it has been shifted so that the date is on calendar time as opposed to being fiscal.

For example, Microsoft's fiscal year ends in June, so if you look at its annual period descriptor for the 2002 fiscal year, its time ID variables are 78 for CRSPDT, 20021231 for FISCALDT, and 20020628 for RCALDT.

In summary, a total of three time ID variables are provided for fiscal time series members. One is in CRSP date format, and the other two are in integer format; the only difference between the two integer variables is that one is based on the fiscal calendar of the company and the other is not. For more information about how CALDT, CRSPDT, and date conversions are handled, see the section Understanding CRSP Date Formats, Informats, and Functions on page 2208.

The CCM database also contains fiscal array members, which are all the segment data members. They are unlike the fiscal time series in that they are not associated with a calendar and also have their time ID variables embedded in the data as a data field. Generally both fiscal and calendar time ID variables are embedded. However, segment members segsrc, segcur, and segitm have only one fiscal time ID variable embedded. For your convenience, SASECRSP calculates and provides CALYR, the calendar version of the embedded fiscal time ID variable, for these three segment members. Note that due to limitations of the data, all segment member time ID variables are year-based.
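The integer dates held by FISCALDT and RCALDT are numbers of the form YYYYMMDD, so they are not SAS date values until they are converted. The following DATA step is a minimal sketch of one way to do the conversion with Base SAS functions only; it does not use the date conversion functions that the SASECRSP engine provides, and the variable names fiscal_sas and rcal_sas are hypothetical.

```sas
/* Sketch: convert YYYYMMDD integer dates to SAS date values        */
/* by using Base SAS functions (an assumption made for this         */
/* illustration; the engine's own conversion functions are not used). */
data _null_;
   fiscaldt = 20021231;   /* integer date from the Microsoft example */
   rcaldt   = 20020628;
   fiscal_sas = input(put(fiscaldt, 8.), yymmdd8.);
   rcal_sas   = input(put(rcaldt, 8.), yymmdd8.);
   put fiscal_sas= date9. rcal_sas= date9.;
   /* writes fiscal_sas=31DEC2002 rcal_sas=28JUN2002 to the log */
run;
```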
Table 33.5

Option          Description
SETID=          specifies which CRSP database subset to open. This option is required. See Table 33.1 for a complete list of supported SETIDs.
PERMNO=         specifies a CRSP PERMNO to be selected for access
PERMCO=         specifies a CRSP PERMCO to be selected for access
CUSIP=          specifies a current CUSIP to be selected for access
HCUSIP=         specifies a historic CUSIP to be selected for access
TICKER=         specifies a TICKER to be selected for access (for active companies only)
SICCD=          specifies a SIC Code to be selected for access
GVKEY=          specifies a Compustat GVKEY to be selected for access
INDNO=          specifies a CRSP INDNO to be selected for access
RANGE=          specifies the range of data to keep in format YYYYMMDD-YYYYMMDD
INSET=          uses a SAS data set named setname as input for issues
CRSPLINKPATH=   specifies location of the CRSP link history. This option is required for accessing CRSP data with Compustat's GVKEY.
The physical name required by the LIBNAME statement should point to the directory of CRSPAccess data files where the CRSP database you want to open is located. Note that the physical name must end in a slash for UNIX environments and a backslash for Windows environments.

The CRSP environment variable CRSPDB_SASCAL must be defined before the SASECRSP engine can access the CRSPAccess database calendars, and it is necessary for the SASECRSP LIBNAME to assign successfully. This environment variable should be defined automatically by either the CRSP software installation or, in later versions, the CRSP data installation. Because the variable is occasionally not set properly, always check that CRSPDB_SASCAL is set to the location where your most recent CRSP data resides. Remember to include the required final slash or backslash.

After the LIBNAME is assigned, you can access any of the available data sets/members within the opened database. For a complete description of available data sets and their fields, see the section Data Elements Reference: SASECRSP Interface Engine on page 2234.

The following options can be used in the LIBNAME libref SASECRSP statement:
SETID=crsp_setidnumber
Specifies the CRSP database you want to read from. SETID is a required option. Choose one SETID from the seven possible values in Table 33.1. The SETID limits the frequency selection of time series that are included in the SAS data set. As an example, to access monthly CRSP Stock data, you would use the following statement:

   LIBNAME myLib sasecrsp 'physical-name' SETID=20;
PERMNO=crsp_permnumber
By default, the SASECRSP engine reads all keys for the CRSPAccess database that you specified in your SASECRSP libref. The PERMNO= option enables you to select data from your CRSP database by the PERMNOs (or other keys) you specify. PERMNOs are CRSP's unique permanent issue identification numbers. There is no limit to the number of crsp_permnumber options that you can use. From a performance standpoint, the PERMNO= option does efficient random access and reads only the data for the PERMNOs specified. The following LIBNAME statement reads data only for Microsoft Corporation (PERMNO=10107) and International Business Machines Corporation (PERMNO=12490) by using the primary PERMNO key, and thus is very efficient.

   LIBNAME myLib sasecrsp 'physical-name'
           SETID=20
           PERMNO=10107
           PERMNO=12490;

The PERMCO=, CUSIP=, HCUSIP=, SICCD=, TICKER=, GVKEY=, and INDNO= options behave similarly, and you can use them in conjunction with or in place of the PERMNO= option. For example, you could have used the following statement to access monthly data for Microsoft and IBM:

   LIBNAME myLib sasecrsp 'physical-name'
           SETID=20
           TICKER='MSFT'
           CUSIP='59491810';
Details on the use of other key selection options are described separately later. PERMNOs specified by this option can select the companies or issues to keep for CRSP Stock or for CRSP/Compustat Merged databases, but PERMNO is not a supported option for CRSP Indices databases. Use the INDNO= option for the CRSP Indices database, and use the PERMNO= option with CRSP US Stock and with CRSP/Compustat Merged databases. Details on the use of key selection options for each type of database follow.

STK Databases
PERMNO is the primary key for CRSP Stock databases. Every valid PERMNO you specify with the PERMNO= option keeps exactly one issue.

CCM Databases
PERMNO can be used as a secondary key for the CCM database through CRSPLink. Linking between the CRSP and Compustat databases is a complex, many-to-many relationship between PERMNO/PERMCOs and GVKEYs. When accessing CCM data by PERMNO, all GVKEYs that link to the given PERMNO are amalgamated to provide seamless access to all linked data. However, note that accessing CCM data by PERMNO is logically different from accessing it by its linked GVKEYs.

In particular, when the PERMNO you specify is linked to several different GVKEYs, one link is designated as the primary link. This designation is set by CRSP and its researchers, and serves in a specific role for the link information in the CCM database. Only data for the primary link is retrieved for the header. For other members, including all time series members, all links are examined, but data is extracted only for the active period of the links and only if the data is within any user-specified date restriction. If two or more GVKEY-to-PERMNO links overlap in time, data from the later (more recent) GVKEY is used. For more information about CRSP links, see Link Used Array in the CRSP/Compustat Merged Database Guide.

For example, PERMNO=10083 is CRSP's unique issue identifier for Teknowledge Incorporated, and later (due to a name change) Cimflex Teknowledge Corporation. To access CCM data for IBM Corporation, Teknowledge Inc., and Cimflex Teknowledge Corp., you can use the following statement:

   LIBNAME myLib1 sasecrsp 'physical-name'
           SETID=200
           GVKEY=6066       /* IBM */
           PERMNO=10083;    /* Teknowledge and Cimflex */
Teknowledge Inc. and Cimflex Corp. have separate GVKEYs in the CCM database, so the previous statement is actually an example of using one PERMNO to access data for several (linked) GVKEYs. The first link to GVKEY=11947 spans March 5, 1986, to December 31, 1988, and the second link to GVKEY=15495 spans February 2, 1989, to September 9, 1993. An alternate way of accessing the data is to use the linked GVKEYs directly, as in this statement:

   LIBNAME myLib2 sasecrsp 'physical-name'
           SETID=200
           GVKEY=6066
           GVKEY=11947
           GVKEY=15495;
These two LIBNAME statements look similar, but do not perform the same operation. myLib1 assumes you are selecting the issue data for PERMNO=10083, so only observations from the CCM database that are within the time period of the used links are accessed. In the previous example for myLib1, only data ranging from March 5, 1986, to December 31, 1988, are extracted for GVKEY=11947 and only data ranging from February 28, 1989, to September 9, 1993, are extracted for GVKEY=15495. Furthermore, while both GVKEYs 11947 and 15495 are linked to the PERMNO, GVKEY 15495 is the primary link, and when accessing the header, only 15495 is used. If the two links overlap, the data from the later (more recent) GVKEY, 15495, is used.

In contrast, myLib2 uses an open range for all three specified keys. If there are data overlapping in time between GVKEY 11947 and 15495, data for both are reported. Similarly, when accessing the header, data for both 11947 and 15495 are retrieved.

IND Databases
INDNO is the primary key for accessing CRSP Indices databases. PERMNO is not available as a key for the IND (CRSP Indices) database; use INDNO for efficient access of the IND database.
GVKEY=crsp_gvkey
The GVKEY= option is similar to the PERMNO= option. It enables you to use Compustat's Permanent SPC Identifier key (GVKEY) to select the companies or issues to keep. There is no limit to the number of crsp_gvkey options that you can use.

STK Databases
GVKEY can serve as a secondary key for accessing CRSP Stock databases. This requires the additional use of the CRSPLINKPATH= option. Linking between the Compustat and CRSP databases is a complex, many-to-many relationship between GVKEYs and PERMNO/PERMCOs. When accessing CRSP data by GVKEY, all links of the specified GVKEY are followed and processed. No additional logic is applied, and link ranges are ignored. Accessing CRSP data by GVKEY is identical to accessing CRSP data by all of its linked PERMNOs. For example, Wolverine Exploration Co. and Amerac Energy Corp. have different PERMNOs but the same GVKEY, and there are two identical ways of accessing CRSP Stock data on these two entities.

   LIBNAME myLib1 sasecrsp 'physical-name'
           SETID=10
           PERMNO=13638     /* Wolverine Exploration */
           PERMNO=84641;    /* Amerac Energy */

   LIBNAME myLib2 sasecrsp 'physical-name'
           SETID=10
           CRSPLINKPATH='physical-name'
           GVKEY=1544;

The CRSPLINKPATH= option is required when accessing CRSP Stock databases by GVKEY. See the discussion later in this section on the CRSPLINKPATH= option.

CCM Databases
GVKEY is the primary key for accessing the CCM database. Every valid GVKEY you specify keeps exactly one company.

IND Databases
GVKEY is not available as a key for accessing CRSP Indices databases; use INDNO, the primary key for these databases, instead.
PERMCO=crsp_permcompany
The PERMCO= option is similar to the PERMNO= option. It enables you to use CRSP's unique permanent company identification key (PERMCO) to select the companies or issues to keep. There is no limit to the number of crsp_permcompany options that you can use.

STK Databases
PERMCO is a secondary key for accessing CRSP Stock databases. One PERMCO can map to multiple PERMNOs. Access by PERMCO is equivalent to access by all mapped PERMNOs.

CCM Databases
PERMCO can also be used as a secondary key for accessing the CCM database. Linking between the CRSP and CCM databases is a complex, many-to-many relationship. When accessing CCM data by PERMCO, all linking GVKEYs are amalgamated and processed. Link active ranges are respected. Only data for the primary link is returned for the header. In cases when the active ranges of various links overlap, the most recent link is used. See the PERMNO= option for more details.

IND Databases
PERMCO is not available as a key for accessing CRSP Indices databases; use INDNO instead.
CUSIP=crsp_cusip
The CUSIP= option is similar to the PERMNO= option. It enables you to use the CUSIP key to select the companies or issues to keep. There is no limit to the number of crsp_cusip options that you can use.

STK Databases
CUSIP is a secondary key for accessing CRSP Stock databases. One CUSIP maps to one PERMNO.

CCM Databases
CUSIP is not available as a key for accessing CCM databases.

IND Databases
CUSIP is not available as a key for accessing CRSP Indices databases; use INDNO instead.
HCUSIP=crsp_hcusip
The HCUSIP= option is similar to the PERMNO= option. It enables you to use the historical CUSIP key, HCUSIP, to select the companies or issues to keep. There is no limit to the number of crsp_hcusip options that you can use.

STK Databases
HCUSIP is a secondary key for accessing CRSP Stock databases. One HCUSIP maps to one PERMNO.

CCM Databases
HCUSIP is not available as a key for accessing CCM databases.

IND Databases
HCUSIP is not available as a key for accessing CRSP Indices databases; use INDNO instead.
TICKER=crsp_ticker
The TICKER= option is similar to the PERMNO= option. It enables you to use the TICKER key to select the companies or issues to keep. There is no limit to the number of crsp_ticker options that you can use.

STK Databases
TICKER is a secondary key for accessing CRSP Stock databases. One TICKER maps to one PERMNO. Note that some PERMNOs are inaccessible by TICKER.

CCM Databases
TICKER is not available as a key for accessing CCM databases.

IND Databases
TICKER is not available as a key for accessing CRSP Indices databases; use INDNO instead.
SICCD=crsp_siccd
The SICCD= option is similar to the PERMNO= option. It enables you to use the Standard Industrial Classification (SIC) Code (SICCD) to select the companies or issues to keep. There is no limit to the number of crsp_siccd options that you can use.

STK Databases
SICCD is a secondary key for accessing CRSP Stock databases. One SICCD can map to multiple PERMNOs. Access by SICCD is equivalent to access by all PERMNOs that have ever been classified under the specified SICCD, even if only once.

CCM Databases
SICCD is not available as a key for accessing CCM databases.

IND Databases
SICCD is not available as a key for accessing CRSP Indices databases; use INDNO instead.
INDNO=crsp_indno
The INDNO= option is similar to the PERMNO= option. It enables you to use CRSP's permanent index number, INDNO, to select the index series or groups to keep. There is no limit to the number of crsp_indno options that you can use.

STK Databases
INDNO is not available as a key for accessing CRSP Stock databases, but it can be used in the combined CRSP Stock and Indices databases.

CCM Databases
INDNO is not available as a key for accessing CCM databases; use GVKEY instead.

IND Databases
INDNO is the primary key for accessing CRSP Indices databases. Every INDNO you specify keeps exactly one index series or group. For example, you can use the following statement to access the CRSP NYSE Value-Weighted and Equal-Weighted daily market indices:

   LIBNAME myLib3 sasecrsp 'physical-name'
           SETID=460
           INDNO=1000000    /* Value-Weighted */
           INDNO=1000001;   /* Equal-Weighted */
CRSPLINKPATH=crsp_linkpath
To access CRSP Stock data with GVKEYs, use the CRSPLINKPATH= option. CRSPLINKPATH= specifies the physical location where your CCM database resides.

NOTE: The physical name must end in a slash for UNIX environments and a backslash for Windows environments.
RANGE=crsp_begdt-crsp_enddt
To limit the time range of data read from your CRSPAccess database, specify the RANGE= option in your SASECRSP libref, where crsp_begdt is the beginning date in YYYYMMDD format and crsp_enddt is the ending date of the range in YYYYMMDD format. As an example, to access monthly stock data for Microsoft Corporation and for International Business Machines Corporation for the first quarter of 1999, you can use the following statement:

   LIBNAME myLib sasecrsp 'physical-name'
           SETID=20
           PERMNO=10107
           PERMNO=12490
           RANGE='19990101-19990331';
The given beginning and ending dates are interpreted as calendar dates by default. If you want these dates to be interpreted as fiscal dates, you must prepend the character "f" to the range. For example, the following statement extracts data for the 1994 fiscal year of both Microsoft and IBM.

   LIBNAME myLib sasecrsp 'physical-name'
           SETID=20
           PERMNO=10107
           PERMNO=12490
           RANGE='f19940101-19941231';
The result of the previous statement is that data from actual calendar date July 1, 1993, to June 30, 1994, is extracted for Microsoft because its fiscal year end month is June. Data from January 1, 1994, to December 31, 1994, is extracted for IBM because its fiscal year end month is December. See Example 33.10 for a more detailed example.

The RANGE= option can be used on all CRSP Stock, Indices, and CCM members. When this option is applied to segment data members, however, the behavior is slightly different in the following ways.

Dates associated with segment member data records are in years and can resolve only to years. This is unique to segment members. All other CRSP data members have a date resolution to the day. For example, monthly time series, though monthly, resolve to the last trading day of the month. However, segment members have a maximum resolution of years because they are not mapped to a calendar in the CRSP/Compustat database. Hence, when range restrictions are applied to segment members, only the YYYY year portion of the range is considered.

Multiple dates are sometimes associated with a particular segment member record. In such cases, the preferred date for use in determining the date range restriction is the data year as opposed to the source year. This multiple date behavior is unique to segment members. All other CRSP data members are associated with only one date.
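The fiscal-to-calendar mapping in the Microsoft and IBM example can be reproduced with ordinary SAS date arithmetic. The following DATA step is a minimal sketch, not the engine's implementation: it assumes that a fiscal year is labeled by the calendar year in which it ends, and the variable names fyear and fye_month are hypothetical.

```sas
/* Sketch: calendar window covered by fiscal year 1994 for a June */
/* fiscal year end (Microsoft) and a December year end (IBM).     */
data _null_;
   fyear = 1994;
   do fye_month = 6, 12;
      /* last day of the fiscal year end month */
      end_date   = intnx('month', mdy(fye_month, 1, fyear), 0, 'end');
      /* first day of the month that is 11 months earlier */
      start_date = intnx('month', end_date, -11, 'beginning');
      put fye_month= start_date= date9. end_date= date9.;
   end;
   /* log: fye_month=6  start_date=01JUL1993 end_date=30JUN1994 */
   /*      fye_month=12 start_date=01JAN1994 end_date=31DEC1994 */
run;
```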
INSET=setname[,keyfieldname,keyfieldtype,date1field,date2field,datetype]
When you specify a SAS data set named setname as input for issues, the SASECRSP engine assumes that a default PERMNO field that contains selected CRSP PERMNOs is present in the data set. If optional parameters are used, they must all be specified; the only acceptable shorthand is to drop parameters from the very end of the list. Dropped parameters use their defaults. The optional parameters are explained below:

keyfieldname
   label of the field that contains the keys to be selected. If unspecified, the default is PERMNO.

keyfieldtype
   specifies the CRSPAccess key type of the provided keys. Possible key types are: PERMNO, PERMCO, CUSIP, HCUSIP, TICKER, SICCD, GVKEY, or INDNO. If unspecified, the default is PERMNO.

date1field
   beginning date of the specific date range restriction being applied to this key. If either date1field or date2field is omitted, the default is for there to be no date range restriction.

date2field
   ending date of the specific date range restriction being applied to this key. If either date1field or date2field is omitted, the default is for there to be no date range restriction.

datetype
   indicates whether the provided beginning and ending dates are calendar dates or fiscal dates. A fiscal date type means the dates given are based on the fiscal calendar of the respective company or GVKEY. A calendar date means the dates are based on the standard Julian calendar. The strings calendar and fiscal are used to indicate the respective date types. If unspecified, the default type is calendar.
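The trailing-parameter shorthand and the defaults described above can be sketched as follows. This Python fragment only illustrates the rule; it is not how the SASECRSP engine parses the option, and the names InsetSpec and parse_inset are hypothetical.

```python
from typing import NamedTuple, Optional


class InsetSpec(NamedTuple):
    """Field names taken from an INSET= value, with the documented defaults."""
    setname: str
    keyfieldname: str = "PERMNO"
    keyfieldtype: str = "PERMNO"
    date1field: Optional[str] = None   # no date range restriction by default
    date2field: Optional[str] = None
    datetype: Optional[str] = None     # None: dates are treated as calendar

    @property
    def has_date_range(self) -> bool:
        # A restriction applies only when both date fields are given.
        return self.date1field is not None and self.date2field is not None


def parse_inset(option: str) -> InsetSpec:
    """Split an INSET= value; parameters may be dropped only from the end."""
    parts = [p.strip() for p in option.split(",")]
    if len(parts) > 6 or any(p == "" for p in parts):
        raise ValueError("parameters may be omitted only from the end of the list")
    return InsetSpec(*parts)


print(parse_inset("indices,INDNO,INDNO"))
print(parse_inset("testin2,PERMNO,PERMNO,DATE1,DATE2").has_date_range)
```

Because omitted positions simply fall back to the tuple defaults, a value such as setname alone selects keys from a PERMNO field with no date restriction.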
Fiscal dates are applicable only to members with fiscal data. Fiscal members consist of all period descriptors, items, and segment members of the CCM database. If a fiscal date range is applied to nonfiscal members, it is ignored.

Individual date range restrictions specified by the inset can be used in combination with the RANGE= option on the LIBNAME statement. In such a case, only data from the intersection of the individual date restriction and the global RANGE= option date restriction are read.
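The intersection rule can be written down in a few lines. The following Python sketch is an illustration only, not engine code; the function name intersect is hypothetical. The sample values are the ones used in the inset-with-RANGE example later in this section (GVKEY 6066 and GVKEY 12141 with RANGE=19990801-20020421).

```python
from typing import Optional, Tuple

DateRange = Tuple[int, int]  # (begin, end) as YYYYMMDD integers


def intersect(key_range: DateRange, global_range: DateRange) -> Optional[DateRange]:
    """Return the overlap of two date ranges, or None if they do not overlap.

    YYYYMMDD integers sort in chronological order, so max() and min()
    can be applied to them directly.
    """
    begin = max(key_range[0], global_range[0])
    end = min(key_range[1], global_range[1])
    return (begin, end) if begin <= end else None


global_range = (19990801, 20020421)
print(intersect((19800101, 20000201), global_range))  # GVKEY 6066
print(intersect((20010101, 20051231), global_range))  # GVKEY 12141
print(intersect((19800101, 19851231), global_range))  # no overlap
```

A result of None corresponds to the case in which no observations are read for that key.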
General Use of Inset for Specifying Lists of Keys

The following example illustrates the use of the INSET= option to select a few Index Series from the Indices database, companies from the CCM database, and securities from the Stock database. Libref ind2 is used for accessing the Indices database with the two specified INDNOs. Libref comp2 is used to access the CCM database with the two specified PERMCOs. Libref sec3 is used to access the Stock database with the three specified TICKERs. Note the use of shorthand in specifying the INSET= option. The date1field, date2field, and datetype fields are all omitted, thereby using the default of no range restriction (though the range restriction set by the RANGE= option on the LIBNAME statement still applies). For details including sample output, see Example 33.4.
data indices;
   indno=1000000; output;
   indno=1000001; output;
run;

libname ind2 sasecrsp "%sysget(CRSP_MSTK)"
   setid=420
   inset=indices,INDNO,INDNO
   range=19990101-19990401;

title2 'Total Returns for NYSE Value and Equal Weighted Market Indices';
proc print data=ind2.tret label;
run;

data companies;
   permco=8045;  output;   /* Oracle */
   permco=20483; output;   /* Citigroup */
run;

libname comp2 sasecrsp "%sysget(CRSP_CST)"
   setid=200
   inset=companies,PERMCO,PERMCO
   range=20040101-20040531;

title2 'Link Info of Selected PERMCOs';
proc print data=comp2.link label;
run;

title3 'Dividends Per Share for Oracle and Citigroup';
proc print data=comp2.div label;
run;

data securities;
   ticker='BAC'; output;
   ticker='DUK'; output;
   ticker='GSK'; output;
run;

libname sec3 sasecrsp "%sysget(CRSP_MSTK)"
   setid=20
   inset=securities,TICKER,TICKER
   range=19970820-19970920;

title2 'PERMNOs and General Header Info of Selected TICKERs';
proc print data=sec3.stkhead(keep=permno htick htsymbol) label;
run;

title3 'Average Price for Bank of America, Duke and GlaxoSmithKline';
proc print data=sec3.prc label;
run;
Key-Specific Date Range Restriction with Insets

Suppose you not only want to select keys with your inset, but also want to specify a date range restriction for each key individually. The following example shows how to do this. Again, shorthand enables you to omit the datetype field. The provided dates default to a calendar interpretation. For details including the sample output, see Example 33.5.
title2 'INSET=testin2 uses date ranges along with PERMNOs:';
title3 '10107, 12490, 14322, 25788';
title4 'Begin dates and end dates for each permno are used in the INSET';

data testin2;
   permno = 10107; date1 = ...; date2 = ...; output;
   permno = 12490; date1 = ...; date2 = ...; output;
   permno = 14322; date1 = ...; date2 = ...; output;
   permno = 25778; date1 = ...; date2 = ...; output;
run;

libname mstk2 sasecrsp "%sysget(CRSP_MSTK)"
   setid=20
   inset=testin2,PERMNO,PERMNO,DATE1,DATE2;

data b;
   set mstk2.prc;
run;

proc print data=b;
run;
Fiscal Date Range Restrictions with Insets

You can use fiscal dates on the date range restrictions inside insets by specifying the date type. The following example shows two identical accesses, except one inset uses the date range restriction in fiscal terms, and the other inset uses the date range restriction in calendar terms. For details including sample output, see Example 33.10.
data comp_fiscal;
   /* Crude Petroleum & Natural Gas */
   compkey=2416; begdate=19860101; enddate=19861231;
   datetype='fiscal'; output;
   /* Commercial Intertech */
   compkey=3248; begdate=19940101; enddate=19941231;
   datetype='fiscal'; output;
run;

data comp_calendar;
   /* Crude Petroleum & Natural Gas */
   compkey=2416; begdate=19860101; enddate=19861231;
   datetype='calendar'; output;
   /* Commercial Intertech */
   compkey=3248; begdate=19940101; enddate=19941231;
   datetype='calendar'; output;
run;

libname fisclib sasecrsp "%sysget(CRSP_CST)"
   SETID=200
   INSET=comp_fiscal,compkey,gvkey,begdate,enddate,datetype;

libname callib sasecrsp "%sysget(CRSP_CST)"
   SETID=200
   INSET=comp_calendar,compkey,gvkey,begdate,enddate,datetype;

title2 'Quarterly Period Descriptors with Fiscal Date Range';
proc print data=fisclib.qperdes(drop = peftnt1 peftnt2 peftnt3 peftnt4
                                       peftnt5 peftnt6 peftnt7 peftnt8
                                       candxc flowcd
                                       spbond spdebt sppaper);
run;

title2 'Quarterly Period Descriptors with Calendar Date Range';
proc print data=callib.qperdes(drop = peftnt1 peftnt2 peftnt3 peftnt4
                                      peftnt5 peftnt6 peftnt7 peftnt8
                                      candxc flowcd
                                      spbond spdebt sppaper);
run;
Inset Ranges in Conjunction with the LIBNAME Range

Suppose you want to specify individual date restrictions but also impose a common range. This example demonstrates two companies, each with its own date range restriction, but both companies are also subject to a common range set in the LIBNAME statement by the RANGE= option. As a result, data from August 1, 1999, to February 1, 2000, is retrieved for IBM, and data from January 1, 2001, to April 21, 2002, is retrieved for Microsoft. For details including sample output, see Example 33.11.
data two_companies;
   gvkey=6066;  date1=19800101; date2=20000201; output;
   gvkey=12141; date1=20010101; date2=20051231; output;
run;

libname mylib sasecrsp "%sysget(CRSP_CST)"
   SETID=200
   INSET=two_companies,gvkey,gvkey,date1,date2
   RANGE=19990801-20020421;

proc sql;
   select prcc.gvkey, prcc.caldt, prcc, ern
      from mylib.prcc as prcc, mylib.ern as ern
      where prcc.caldt = ern.caldt and
            prcc.gvkey = ern.gvkey;
quit;
data set on the DATA statement, it causes the engine supervisor to create a SAS data set using the specified name in either the SAS WORK library or, if specified, the USER library. The contents of the SAS data set include the DATE of each observation, the series name of each series read from the CRSPAccess database, event variables, and the label or description of each series/event or array.

You can use PROC PRINT and PROC CONTENTS to print your output data set and its contents. Alternatively, you can view your SAS output observations by opening the desired output data set in the SAS Explorer. You can also use PROC SQL with your SASECRSP libref to create a custom view of your data.

In general, CRSP missing values are represented as . in the SAS data set. When accessing the CRSP STOCK data, SASECRSP uses the mapping shown in Table 33.6 for converting CRSP missing values into SAS missing codes.
Table 33.6 Mapping of CRSP Stock Missing Values to SAS Missing Codes

CRSP Stock   SAS   Condition
-99          .     No valid price
-88          .A    Out of range
-77          .B    Off-exchange
-66          .C    No valid previous price
-55          .D    No delisting information
-44          .E    No valid comparison for an excess return
When accessing the CCM database, CRSP uses certain Compustat missing codes which SASECRSP then converts into SAS missing codes. Table 33.7 shows the mapping of Compustat missing codes for the CCM database.
Table 33.7 Mapping of Compustat and SAS Missing Codes

SAS   Condition
.     No data for data item
.S    Data is only on a semi-annual basis
.A    Data is only on an annual basis
.C    Combined into other item
.N    Data is not meaningful
.I    Reported as insignificant
Missing value codes conform with Compustat's Strategic Insight and binary conventions for missing values. See "Notes on Missing Values" in the second chapter of the CRSP/Compustat Merged Database Guide for more information about how CRSP handles Compustat missing codes.
Date               SAS Date   CRSP Date (Daily)   CRSP Date (Monthly)
July 31, 1962      942        21                  440
August 31, 1962    973        44                  441
Dec. 30, 1998      14,243     9190                NA*
Dec. 31, 1998      14,244     9191                877

* Not available if an exact match is requested.
Having an understanding of the internal differences in representing SAS dates, CRSP dates, and CRSP integer dates helps you use the SASECRSP formats, informats, and functions effectively. Always keep in mind the frequency of the CRSP calendar that you are accessing when you specify a CRSP date.
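The relationship between integer dates and SAS dates can be checked independently of SASECRSP. A SAS date value is the number of days since January 1, 1960, and an integer date is the number YYYYMMDD. The following Python sketch uses hypothetical helper names (int_to_sas, sas_to_int) and covers only these two representations; CRSP dates are indexes into a CRSP calendar and cannot be computed without that calendar.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date value 0


def int_to_sas(intdt: int) -> int:
    """Convert a YYYYMMDD integer date to a SAS date value."""
    d = date(intdt // 10000, intdt // 100 % 100, intdt % 100)
    return (d - SAS_EPOCH).days


def sas_to_int(sasdt: int) -> int:
    """Convert a SAS date value to a YYYYMMDD integer date."""
    d = SAS_EPOCH + timedelta(days=sasdt)
    return d.year * 10000 + d.month * 100 + d.day


for intdt in (19620731, 19620831, 19981230, 19981231):
    print(intdt, int_to_sas(intdt))
```

The printed SAS date values can be compared with the SAS Date column shown for the same four dates.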
types are: CRSPDTA, CRSPDTQ, CRSPDTM, CRSPDTW, CRSPDTD, CRSPDRA, CRSPDRQ, CRSPDRM, CRSPDRW, and CRSPDRD. Table 33.9 shows some samples that use the monthly and daily calendar as examples. The Annual (CRSPDTA and CRSPDRA), Quarterly (CRSPDTQ and CRSPDRQ), and the Weekly (CRSPDTW and CRSPDRW) formats work analogously.
Table 33.9 Sample CRSPDT Formats for Daily and Monthly Data

                                                                 19620630, 19620731
August 31, 1962   44, 441      19620831   19620831 +   19620831   19620801, 19620831
Dec. 30, 1998     9190, NA*    19981230   19981230 +   NA*        NA*
Dec. 31, 1998     9191, 877    19981231   19981231 +   19981231   19981201, 19981231

+ Daily ranges look similar to monthly ranges if they are Mondays or immediately following a trading holiday.
* When working with exact matches, no CRSP monthly date exists for December 30, 1998.
Table 33.10
19620815 19620831
.(missing) 441
441 441
440 441
+ If missing, then missing. If 441, then 19620831. If 440, then 19620731.
* If missing, then missing. If 441, then 19620801 to 19620831. If 440, then 19620630 to 19620731.
Table 33.11
Function Group            Function Name   Argument One   Argument Two   Return Value

CRSP dates to integer dates for December 31, 1998
Annual                    crspdcia        74             None           19981231
Quarterly                 crspdciq        293            None           19981231
Monthly                   crspdcim        877            None           19981231
Weekly                    crspdciw        1905           None           19981231
Daily                     crspdcid        9191           None           19981231

CRSP dates to SAS dates for December 31, 1998
Annual                    crspdcsa        74             None           14,244
Quarterly                 crspdcsq        293            None           14,244
Monthly                   crspdcsm        877            None           14,244
Weekly                    crspdcsw        1905           None           14,244
Daily                     crspdcsd        9191           None           14,244

Integer dates to CRSP dates (exact is illustrated, but can be forward or backward)
Annual                    crspdica        19981231       0              74
Quarterly                 crspdicq        19981231       0              293
Monthly                   crspdicm        19981231       0              877
Weekly                    crspdicw        19981231       0              1905
Daily                     crspdicd        19981231       0              9191

SAS dates to CRSP dates (exact is illustrated, but can be forward or backward)
Annual                    crspdsca        14,244         0              74
Quarterly                 crspdscq        14,244         0              293
Monthly                   crspdscm        14,244         0              877
Weekly                    crspdscw        14,244         0              1905
Daily                     crspdscd        14,244         0              9191

Integer dates to SAS dates for December 31, 1998
Integer to SAS            crspdi2s        19981231       None           14,244

SAS dates to integer dates for December 31, 1998
SAS to Integer            crspds2i        14,244         None           19981231

Fiscal to calendar shifting of integer dates for December 31, 1998
Fiscal to Calendar Shift  crspdf2c        20021231       8              20020831
Example 33.1: Specifying PERMNOs and RANGE on the LIBNAME Statement
Example 33.2: Using the LIBNAME Statement to Access All Keys
database, and four securities from the Stock database for data extraction. Note that because each CRSP database might be in a different location and has to be opened separately, a total of three different librefs are used, one for each database.
data indices;
   indno=1000000; output;
   indno=1000001; output;
run;

libname _all_ clear;
libname ind2 sasecrsp "%sysget(CRSP_MSTK)"
   setid=420
   inset=indices,INDNO,INDNO
   range=19990101-19990401;

title2 'Total Returns for NYSE Value and Equal Weighted Market Indices';
proc print data=ind2.tret label;
run;
Output 33.4.1 shows the result of selecting two CRSP Index Series from the Indices database.
Output 33.4.1 IND Data Extracted Using INSET= Option
Total Returns for NYSE Value and Equal Weighted Market Indices

Obs    INDNO      CALDT       TRET
  1    1000000    19990129     0.012419
  2    1000000    19990226    -0.024179
  3    1000000    19990331     0.028591
  4    1000001    19990129    -0.007822
  5    1000001    19990226    -0.041127
  6    1000001    19990331     0.015204
/* Oracle */ /* Citigroup */

libname comp2 sasecrsp "%sysget(CRSP_CST)"
   setid=200
   inset=companies,PERMCO,PERMCO
   range=20040101-20040531;

title2 'Using the Link Info of Selected PERMCOs';
proc print data=comp2.link label;
run;

title3 'To Show Dividends Per Share for Oracle and Citigroup';
proc print data=comp2.div label;
run;
Output 33.4.2 shows the result of selecting two companies from the CCM database by using the CCM LINK data and the INSET= option.
Output 33.4.2 CCM LINK Data Extracted By Using INSET= Option
Using the Link Info of Selected PERMCOs
To Show Dividends Per Share for Oracle and Citigroup

Obs    GVKEY    LINKDT      LINKENDT    NPERMNO    NPERMCO    LINKTYPE    LINKFLAG
  1    12142    19860312    20991231    10104       8045      LC          BBB
  2     3243    19861029    20991231    70519      20483      LC          BBB
Output 33.4.3 shows the result of selecting two companies from the CCM database by using the CCM DIV data and the INSET= option.
Output 33.4.3 CCM DIV Data Extracted By Using INSET= Option
Using the Link Info of Selected PERMCOs
To Show Dividends Per Share for Oracle and Citigroup

Obs    GVKEY    CALDT       DIV
  1    12142    20040130    0.0000
  2    12142    20040227    0.0000
  3    12142    20040331    0.0000
  4    12142    20040430    0.0000
  5    12142    20040528    0.0000
  6     3243    20040130    0.4000
  7     3243    20040227    0.0000
  8     3243    20040331    0.0000
  9     3243    20040430    0.4000
 10     3243    20040528    0.0000
This example selects three securities from the Stock database by using TICKERs in the INSET= option for data extraction.
data securities;
   ticker='BAC'; output;
   ticker='DUK'; output;
   ticker='GSK'; output;
run;

libname sec3 sasecrsp "%sysget(CRSP_MSTK)"
   setid=20
   inset=securities,TICKER,TICKER
   range=19970820-19970920;

title2 'PERMNOs and General Header Info of Selected TICKERs';
proc print data=sec3.stkhead(keep=permno htick htsymbol) label;
run;

title3 'Average Price for Bank of America, Duke and GlaxoSmithKline';
proc print data=sec3.prc label;
run;
Output 33.4.4 shows the STK header data for the TICKERs specified by using the INSET= option.
Output 33.4.4 STK Header Data Extracted Using INSET= Option
PERMNOs and General Header Info of Selected TICKERs Average Price for Bank of America, Duke and GlaxoSmithKline Obs 1 2 3 Obs PERMNO 59408 27959 75064 PERMCO COMPNO ISSUNO HEXCD HSHRCD 4005 0 2523 1 1 1 11 11 31 HSICCD BEGDT
HCOMNAM BANK OF AMERICA CORP DUKE ENERGY CORP NEW GLAXOSMITHKLINE PLC HTRDSTAT A A A HSECSTAT R R R
06050510 BAC 26441C10 DUK 37733W10 GSK HNAICS 522110 221122 325412 HPRIMEXC N N N
Output 33.4.5 shows the STK price data for the TICKERs specified by using the INSET= option.
Output 33.4.5 STK Price Data Extracted Using INSET= Option
PERMNOs and General Header Info of Selected TICKERs
Average Price for Bank of America, Duke and GlaxoSmithKline

Obs    PERMNO    CALDT       PRC
  1    59408     19970829    59.75000
  2    27959     19970829    48.43750
  3    75064     19970829    39.93750
Example 33.5: Specifying Ranges for Individual Keys with the INSET= Option
Insets enable you to define options specific to each individual key. This example uses an inset to select four PERMNOs and specifies a different date restriction for each PERMNO.
title2 'INSET=testin2 uses date ranges along with PERMNOs:';
title3 '10107, 12490, 14322, 25788';
title4 'Begin dates and end dates for each permno are used in the INSET';

data testin2;
   permno = 10107; date1 = ...; date2 = ...; output;
   permno = 12490; date1 = ...; date2 = ...; output;
   permno = 14322; date1 = ...; date2 = ...; output;
   permno = 25778; date1 = ...; date2 = ...; output;
run;

libname _all_ clear;
libname mstk2 sasecrsp "%sysget(CRSP_MSTK)"
   setid=20
   inset=testin2,PERMNO,PERMNO,DATE1,DATE2;

data b;
   set mstk2.prc;
run;

proc print data=b;
run;
Output 33.5.1 shows CRSP Stock price time series data selected by PERMNO in the INSET= option, where each PERMNO has its own time span specified in the INSET= option.
Example 33.6: Converting Dates By Using the CRSP Date Functions
data a (keep = crspdt crspdt2 crspdt3
               sasdt sasdt2 sasdt3
               intdt intdt2 intdt3);
   format crspdt crspdt2 crspdt3 crspdtd8.;
   format sasdt sasdt2 sasdt3 yymmdd6.;
   format intdt intdt2 intdt3 8.;
   format exact 2.;

   crspdt = 1;
   sasdt = '2jul1962'd;
   intdt = 19620702;
   exact = 0;

   /* Call the CRSP date to integer function */
   intdt2 = crspdcid(crspdt);
   /* Call the SAS date to integer function */
   intdt3 = crspds2i(sasdt);
   /* Call the integer to CRSP date function */
   crspdt2 = crspdicd(intdt,exact);
   /* Call the SAS date to CRSP date conversion function */
   crspdt3 = crspdscd(sasdt,exact);
   /* Call the CRSP date to SAS date conversion function */
   sasdt2 = crspdcsd(crspdt);
   /* Call the integer to SAS date conversion function */
   sasdt3 = crspdi2s(intdt);
run;

title3 'PROC PRINT showing data for sasecrsp';
proc print data=a;
run;

title3 'PROC CONTENTS showing formats for sasecrsp';
proc contents data=a;
run;
Output 33.6.1 shows the OUT= data set created by the DATA step.
Output 33.6.1 Date Conversions By Using the CRSP Date Functions
OUT= Data Set PROC CONTENTS showing formats for sasecrsp Obs crspdt 1 crspdt2 crspdt3 sasdt sasdt2 sasdt3 intdt intdt2 intdt3
Output 33.6.2 shows the contents of the OUT= data set by alphabetically listing the variables and their attributes.
Output 33.6.2 Contents of Date Conversions By Using the CRSP Date Functions
Alphabetic List of Variables and Attributes

#    Variable    Type    Len    Format
1    crspdt      Num     8      CRSPDTD8.
2    crspdt2     Num     8      CRSPDTD8.
3    crspdt3     Num     8      CRSPDTD8.
7    intdt       Num     8      8.
8    intdt2      Num     8      8.
9    intdt3      Num     8      8.
4    sasdt       Num     8      YYMMDD6.
5    sasdt2      Num     8      YYMMDD6.
6    sasdt3      Num     8      YYMMDD6.
title2 'PERMNO=10083 access of CCM data';
title3 'Sales (Net)';

data permnoaccess;
   set crsp1a.iqitems(keep=gvkey rcaldt fiscaldt iq2);
run;

proc print data=permnoaccess;
run;
Output 33.7.1 shows PERMNO access of CCM quarterly Sales (Net) data.
Output 33.7.1 PERMNO Access of CCM Data
Comparing various access methods for CCM data
PERMNO=10083 access of CCM data
Sales (Net)

Obs    GVKEY    RCALDT      FISCALDT    IQ2
  1    15495    19870331    19870930     4.5680
  2    15495    19870630    19871231     5.0240
  3    15495    19870930    19880331     4.4380
  4    15495    19871231    19880630     3.8090
  5    15495    19880331    19880930     3.5420
  6    15495    19880630    19881230     2.5940
  7    15495    19890331    19890331     6.4660
  8    15495    19890630    19890630    10.1020
  9    15495    19890929    19890929    12.0650
 10    15495    19891229    19891229    10.8780
title2 'GVKEY=11947 and GVKEY=15495 access of CCM data';
title3 'Sales (Net)';

data gvkeyaccess;
   set crsp2.iqitems(keep=gvkey rcaldt fiscaldt iq2);
run;

proc print data=gvkeyaccess;
run;

title3 'LINK: Link information';
proc print data=crsp2.link;
run;

/* Show how PERMNO and PERMCO access are the same */
title4 'Proc compare of PERMNO vs. PERMCO access of CCM data';
proc compare base=crsp1a.iqitems compare=crsp1b.iqitems brief;
run;
Output 33.7.3 shows CRSP link information and comparison of GVKEY to PERMNO access.
Output 33.7.3 Link Information and Comparison
Comparing various access methods for CCM data
GVKEY=11947 and GVKEY=15495 access of CCM data
LINK: Link information
Proc compare of PERMNO vs. PERMCO access of CCM data

Obs    GVKEY    LINKDT      LINKENDT    NPERMNO    NPERMCO    LINKTYPE    LINKFLAG
  1    11947    19860305    19890226    10083      8026       LC          BBB
  2    15495    19880101    19890226        0         0       NR          XXX
  3    15495    19890227    19930909    10083      8026       LC          BBB
Example 33.8: Comparing PERMNO and GVKEY Access of CRSP Stock Data
You can access CRSP data by using GVKEYs. Access in this manner requires the use of the CRSPLINKPATH= option and is identical to access by the corresponding PERMNO(s). Links between PERMNOs and GVKEYs are used without reference to their active period. Link information is used solely for finding corresponding GVKEYs.

This example shows two ways of accessing CRSP Stock data: one by PERMNOs and the other by the corresponding GVKEY. Several members are compared, showing they are equivalent.
title 'Comparing PERMNO and GVKEY access of CRSP Stock data';

libname _all_ clear;
libname crsp1 sasecrsp "%sysget(CRSP_MSTK)"
   setid=20
   permno=13638 permno=84641
   range=19900101-19910101;

libname crsp2 sasecrsp "%sysget(CRSP_MSTK)"
   setid=20
   crsplinkpath="%sysget(CRSP_CST)"
   gvkey=1544
   range=19900101-19910101;

title1 'PERMNO=13638 and PERMNO=84641 access of CRSP data';
proc print data=crsp1.stkhead;
run;

%macro compareMember(memb);
   title1 "Proc compare on &memb between PERMNO and GVKEY";
   proc compare base=crsp1.&memb compare=crsp2.&memb brief;
   run;
%mend;

%compareMember(stkhead);
%compareMember(prc);
%compareMember(ret);
%compareMember(askhi);
%compareMember(vol);
Output 33.8.1 compares PERMNO with GVKEY access of CRSP Stock members STKHEAD, PRC, RET, ASKHI, and VOL, showing that they are equal.
Output 33.8.1 Comparing PERMNO and GVKEY Access of CRSP Stock Data
Proc compare on vol between PERMNO and GVKEY Obs 1 2 Obs PERMNO 13638 84641 PERMCO COMPNO ISSUNO HEXCD HSHRCD 428 0 3 2 11 11 HSICCD BEGDT
HPRIMEXC Q A
HTRDSTAT A A
HSECSTAT R R
Output 33.9.1 shows Earnings Per Share for several companies for the 1994 fiscal year.
Output 33.9.1 Earnings Per Share by GVKEY Access for the 1994 Fiscal Year.
Extract data for fiscal year 1994 for several companies

Obs    GVKEY    RCALDT      FISCALDT    IQ4          IQ9       IQ19      IQ69
  1     6066    19940331    19940331    1100.0000    0.6300    0.6400     392.0000
  2     6066    19940630    19940630    1092.0000    1.1300    1.1400     688.0000
  3     6066    19940930    19940930    1053.0000    1.1600    1.1800     710.0000
  4     6066    19941230    19941230    1118.0000    2.0300    2.0600    1231.0000
  5    12141    19930930    19940331     134.0000    0.7900    0.7900     239.0000
  6    12141    19931231    19940630     150.0000    0.9500    0.9500     289.0000
  7    12141    19940331    19940930     156.0000    0.8400    0.8400     256.0000
  8    12141    19940630    19941230     170.0000    0.5900    0.5900     362.0000
  9    10107    19940331    19940331            A    0.4600    0.4600      11.3890
 10    10107    19940630    19940630            A    0.7100    0.7100      17.3670
 11    10107    19940930    19940930            A    0.7600    0.7600      18.8070
 12    10107    19941230    19941230     100.9630    0.5400    0.5400      13.4190
Note how two time ID variables are kept. Raw Calendar Trading Date provides the actual calendar date. Fiscal Trading Date provides the date according to the company's fiscal calendar, which depends on the company's fiscal year-end month. For example, Observation 8 is Microsoft's fourth fiscal quarter, hence a Fiscal Trading Date of December 30, 1994. Since Microsoft's fiscal year ends in June, its fourth fiscal quarter corresponds to the second calendar quarter of the year, so the Raw Calendar Trading Date shows June 30, 1994. The shift calculation of six months (in this case) required to compute the Raw Calendar Trading Date is done automatically by the SASECRSP engine.

Keep in mind that fiscal date ranges are applicable only to fiscal members. When fiscal date range restrictions are applied to nonfiscal members, they are ignored. The missing value .A seen in observations 9 through 12 indicates that the data is reported only on an annual basis.
Example 33.10: Using Different Types of Range Restrictions in the INSET
title1 'Quarterly Period Descriptors';
title2 'Using the Fiscal Date Range';
proc print data=fisclib.qperdes(drop = peftnt1 peftnt2 peftnt3 peftnt4
                                       peftnt5 peftnt6 peftnt7 peftnt8
                                       candxc flowcd
                                       spbond spdebt sppaper);
run;
Output 33.10.1 shows quarterly period descriptors for the 1986 and 1994 fiscal years.
Output 33.10.1 Using Inset with Fiscal Date Range
Quarterly Period Descriptors
Using the Fiscal Date Range

Obs    FISCALDT    DATQTR    FISCYR    CALQTR    UPCODE    SRCDOC
  1    19860331    1          3        2         3         53
  2    19860630    2          3        3         3         53
  3    19860930    3          3        4         3         53
  4    19861231    4          3        1         3         53
  5    19940331    1         10        4         3         53
  6    19940630    2         10        1         3         53
  7    19940930    3         10        2         3         53
  8    19941230    4         10        3         3         53

Obs    SPBOND    SPDEBT    SPPAPER    SPRANK    MAJIND    INDIND    FLOWCD    CANDXC
  1    0         0         0          17        0         0         1         0
  2    0         0         0          18        0         0         1         0
  3    0         0         0          18        0         0         1         0
  4    0         0         0          21        0         0         1         0
  5    0         0         0          16        0         0         7         0
  6    0         0         0          16        0         0         7         0
  7    0         0         0          16        0         0         7         0
  8    0         0         0          16        0         0         7         0

PEFTNT1 through PEFTNT8 are blank for all observations.
The next PRINT procedure uses the calendar datetype in its INSET= option instead of the fiscal datetype, producing different results for the Crude Petroleum and Natural Gas Company when the report is based on calendar dates instead of fiscal dates. The differences shown in observations 1 through 4 are due to Crude Petroleum and Natural Gas Company's fiscal year ending in March instead of December.
Since Commercial Intertech does not shift its fiscal year, but uses a fiscal year ending in December, the fiscal report and the calendar report match exactly for the company's corresponding observations 5 through 8 in Output 33.10.1 and Output 33.10.2 respectively.
title1 'Quarterly Period Descriptors';
title2 'Using the Calendar Date Range';
proc print data=callib.qperdes(drop = peftnt1 peftnt2 peftnt3 peftnt4
                                      peftnt5 peftnt6 peftnt7 peftnt8
                                      candxc flowcd
                                      spbond spdebt sppaper);
run;
Output 33.10.2 shows quarterly period descriptors for the designated calendar date range.
Output 33.10.2 Using Inset with Calendar Date Range
Quarterly Period Descriptors
Using the Calendar Date Range

Obs    FISCALDT    DATQTR    FISCYR    CALQTR    UPCODE    SRCDOC
  1    19851231    4          3        1         3         53
  2    19860331    1          3        2         3         53
  3    19860630    2          3        3         3         53
  4    19860930    3          3        4         3         53
  5    19940331    1         10        4         3         53
  6    19940630    2         10        1         3         53
  7    19940930    3         10        2         3         53
  8    19941230    4         10        3         3         53

Obs    SPBOND    SPDEBT    SPPAPER    SPRANK    MAJIND    INDIND    FLOWCD    CANDXC
  1    0         0         0          17        0         0         1         0
  2    0         0         0          17        0         0         1         0
  3    0         0         0          18        0         0         1         0
  4    0         0         0          18        0         0         1         0
  5    0         0         0          16        0         0         7         0
  6    0         0         0          16        0         0         7         0
  7    0         0         0          16        0         0         7         0
  8    0         0         0          16        0         0         7         0

PEFTNT1 through PEFTNT8 are blank for all observations.
Fiscal date range restrictions are valid only for fiscal members and can be used in either the INSET= option or the RANGE= option. Use calendar date ranges for nonfiscal members.

NOTE: Fiscal date ranges are ignored when used with nonfiscal members.
Example 33.11: Using INSET Ranges with the LIBNAME RANGE Option
It is possible to specify both individual range restrictions with an INSET and a global date range restriction via the RANGE= option on the LIBNAME statement. In such cases, only observations that satisfy both date range restrictions are returned. The effective range restriction becomes the intersection of the two specified range restrictions. If this intersection is empty, no observations are returned.

This example extracts data for two companies, IBM and Microsoft. Each company has an individual range restriction specified in the inset. Furthermore, a global range restriction is set by the RANGE= option on the LIBNAME statement. As a result, the effective date range restriction for IBM becomes August 1, 1999, to February 1, 2000, and the effective date range restriction for Microsoft becomes January 1, 2001, to April 21, 2002.
data two_companies;
   gvkey=6066;  date1=19800101; date2=20000201; output;
   gvkey=12141; date1=20010101; date2=20051231; output;
run;

libname _all_ clear;
libname mylib sasecrsp "%sysget(CRSP_CST)"
   SETID=200
   INSET=two_companies,gvkey,gvkey,date1,date2
   RANGE=19990801-20020421;

title1 'Two Companies, Two Range Selections';
title2 'Global RANGE Statement Used With Individual Inset Ranges';
title3 'Results Show Intersection of Both Range Restrictions';

proc sql;
   select prcc.gvkey, prcc.caldt, prcc, ern
      from mylib.prcc as prcc, mylib.ern as ern
      where prcc.caldt = ern.caldt and
            prcc.gvkey = ern.gvkey;
quit;
Output 33.11.1 shows the combined effect of both INSET and RANGE date restrictions on the closing prices and earnings per share for IBM and Microsoft.
For more about using the SQL procedure, see the chapter on SQL in Base SAS Procedures Guide.
Table 33.12
Data Set Name   Reference Table Title                     Reference Table
STKHEAD         Header Identification and Summary Data    Table 33.15
NAMES           Name History Array                        Table 33.16
DISTS           Distribution Event Array                  Table 33.17
SHARES          Shares Outstanding Observation Array      Table 33.18
DELIST          Delisting History Array                   Table 33.19
NASDIN          NASDAQ Information Array                  Table 33.20
PRC             Price or Bid/Ask Average Time Series      Table 33.21
RET             Returns Time Series                       Table 33.21
BIDLO           Bid or Low Price Time Series              Table 33.21
ASKHI           Ask or High Price Time Series             Table 33.21
BID             Bid Time Series                           Table 33.21
ASK             Ask Time Series                           Table 33.21
RETX            Returns Without Dividends Time Series     Table 33.21
SPREAD          Spread Between Bid and Ask                Table 33.21
ALTPRC          Price Alternate Time Series               Table 33.21
VOL             Volume Time Series                        Table 33.21
NUMTRD          Number of Trades Time Series              Table 33.21
ALTPRCDT        Price Alternate Date Time Series          Table 33.21
PORT1           Portfolio Data for Portfolio Type 1       Table 33.22
PORT2           Portfolio Data for Portfolio Type 2       Table 33.22
PORT3           Portfolio Data for Portfolio Type 3       Table 33.22
PORT4           Portfolio Data for Portfolio Type 4       Table 33.22
PORT5           Portfolio Data for Portfolio Type 5       Table 33.22
PORT6           Portfolio Data for Portfolio Type 6       Table 33.22
PORT7           Portfolio Data for Portfolio Type 7       Table 33.22
PORT8           Portfolio Data for Portfolio Type 8       Table 33.22
PORT9           Portfolio Data for Portfolio Type 9       Table 33.22
GROUP16         Group Data for Group Type 16              Table 33.22
Table 33.13
Data Set Name   Reference Table Title                          Reference Table
CSTHEAD         Compustat Header Data                          Table 33.23
CSTNAME         Compustat Description History Array            Table 33.24
LINK            CRSP Compustat Link History                    Table 33.25
APERDES         Annual Period Descriptors Time Series          Table 33.26
QPERDES         Quarterly Period Descriptors Time Series       Table 33.27
IAITEMS         Annual Data Items                              Table 33.28
IQITEMS         Quarterly Data Items                           Table 33.29
BAITEMS         Bank Annual Data Items                         Table 33.30
BQITEMS         Bank Quarterly Data Items                      Table 33.31
PRCH            High Price Time Series                         Table 33.32
PRCL            Low Price Time Series                          Table 33.32
PRCC            Closing Price Time Series                      Table 33.32
DIV             Dividends Per Share Time Series                Table 33.32
ERN             Earnings Per Share Time Series                 Table 33.32
SHSTRD          Shares Traded Time Series                      Table 33.32
DIVRTE          Annualized Dividend Rate Time Series           Table 33.32
RAWADJ          Raw Adjustment Factor Time Series              Table 33.32
CUMADJ          Cumulative Adjustment Factor Time Series       Table 33.32
BKV             Book Value Per Share Time Series               Table 33.32
CHEQVM          Cash Equivalent Distribution Time Series       Table 33.32
CSHOQ           Common Shares Outstanding Time Series          Table 33.32
NAVM            Net Asset Value Time Series                    Table 33.32
OEPS12          Earnings/Share from Operations                 Table 33.32
GICS            Global Industry Class. Std. code               Table 33.32
CPSPIN          S&P Index Primary Marker Time Series           Table 33.32
DIVFT           Dividends per share footnotes                  Table 33.32
RAWADJFT        Raw adjustment factor footnotes                Table 33.32
COMSTAFT        Comparability status footnotes                 Table 33.32
ISAFT           Issue status alert footnotes                   Table 33.32
SEGSRC          Operating Segment Source History               Table 33.33
SEGPROD         Operating Segment Products History             Table 33.34
SEGCUST         Operating Segment Customer History             Table 33.35
SEGDTL          Operating Segment Detail History               Table 33.36
SEGNAICS        Operating Segment NAICS History                Table 33.37
SEGGEO          Geographic Segment History                     Table 33.38
SEGCUR          Segment Currency Data                          Table 33.39
SEGITM          Segment Item Data                              Table 33.40
Table 33.14
Data Set Name   Reference Table Title                          Reference Table
INDHEAD         Index Header Data                              Table 33.41
REBAL           Index Rebalancing History Arrays               Table 33.42
REBAL           Index Rebalancing History Group Arrays         Table 33.43
LIST            Index Membership List Arrays                   Table 33.44
LIST            Index Membership List Groups Arrays            Table 33.45
USDCNT          Portfolio Used Count Array                     Table 33.46
TOTCNT          Portfolio Total Count Array                    Table 33.47
USDCNT          Portfolio Used Count Time Series Groups        Table 33.48
TOTCNT          Portfolio Total Count Time Series Groups       Table 33.49
USDVAL          Portfolio Used Value Array                     Table 33.50
TOTVAL          Portfolio Total Value Array                    Table 33.51
USDVAL          Portfolio Used Value Time Series Groups        Table 33.52
TOTVAL          Portfolio Total Value Time Series Groups       Table 33.53
TRET            Total Returns Time Series                      Table 33.54
ARET            Appreciation Returns Time Series               Table 33.55
IRET            Income Returns Time Series                     Table 33.56
TRET            Total Returns Time Series Groups               Table 33.57
ARET            Income Returns Time Series Groups              Table 33.58
IRET            Income Returns Time Series Groups              Table 33.59
TIND            Total Return Index Levels Time Series          Table 33.60
AIND            Appreciation Index Levels Time Series          Table 33.61
IIND            Income Index Levels Time Series                Table 33.62
TIND            Total Return Index Levels Groups               Table 33.63
AIND            Appreciation Index Levels Groups               Table 33.64
IIND            Income Index Levels Time Series Groups         Table 33.65
Table 33.15
Fields | Label | Type
PERMNO | PERMNO | Numeric
PERMCO | PERMCO | Numeric
COMPNO | NASDAQ Company Number | Numeric
ISSUNO | NASDAQ Issue Number | Numeric
HEXCD | Exchange Code Header | Numeric
HSHRCD | Share Code Header | Numeric
HSICCD | Standard Industrial Classification Code | Numeric
BEGDT | Begin of Stock Data | Numeric
ENDDT | End of Stock Data | Numeric
DLSTCD | Delisting Code Header | Numeric
HCUSIP | CUSIP Header | Character
HTICK | Ticker Symbol Header | Character
HCOMNAM | Company Name Header | Character
HTSYMBOL | Trading Symbol Header | Character
HNAICS | North American Industry Classification Header | Character
HPRIMEXC | Primary Exchange Header | Character
HTRDSTAT | Trading Status Header | Character
HSECSTAT | Security Status Header | Character
Fields | Label | Type
PERMNO | PERMNO | Numeric
NAMEDT | Names Date | Numeric
NAMEENDT | Names Ending Date | Numeric
SHRCD | Share Code | Numeric
EXCHCD | Exchange Code | Numeric
SICCD | Standard Industrial Classification Code | Numeric
NCUSIP | CUSIP | Numeric
TICKER | Ticker Symbol | Character
COMNAM | Company Name | Character
SHRCLS | Share Class | Numeric
TSYMBOL | Trading Symbol | Character
NAICS | North American Industry Classification System | Character
PRIMEXCH | Primary Exchange | Character
TRDSTAT | Trading Status | Character
Table 33.16
continued
Fields | Type
SECSTAT | Character
Fields | Label | Type
PERMNO | PERMNO | Numeric
DISTCD | Distribution Code | Numeric
DIVAMT | Dividend Cash Amount | Numeric
FACPR | Factor to Adjust Price | Numeric
FACSHR | Factor to Adjust Share | Numeric
DCLRDT | Distribution Declaration Date | Numeric
EXDT | Ex-Distribution Date | Numeric
RCRDDT | Record Date | Numeric
PAYDT | Payment Date | Numeric
ACPERM | Acquiring PERMNO | Numeric
ACCOMP | Acquiring PERMCO | Numeric
Label
PERMNO
Shares Outstanding
Shares Observation Date
Shares Observation End Date
Shares Outstanding Observation Flag
Fields | Label | Type
PERMNO | PERMNO | Numeric
DLSTDT | Delisting Date | Numeric
DLSTCD | Delisting Code | Numeric
NWPERM | New PERMNO | Numeric
NWCOMP | New PERMCO | Numeric
NEXTD | Delisting Next Price Date | Numeric
DLAMT | Delisting Amount | Numeric
DLRETX | Delisting Return Without Dividends | Numeric
DLPRC | Delisting Price | Numeric
Table 33.19
continued
Label
PERMNO
NASDAQ Traits Code
NASDAQ Traits Date
NASDAQ Traits End Date
NASDAQ National Market Indicator
Market Maker Count
Nasd Index Code
Table 33.21
Data Set Name, Long Name | Fields | Label | Type
PRC, Price or Bid/Ask Average Time Series | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | PRC | Price or Bid/Ask Aver | Numeric
RET, Returns Time Series | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | RET | Returns | Numeric
ASKHI, Ask or High Price Time Series | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | ASKHI | Ask or High Price | Numeric
BIDLO, Bid or Low Price Time Series | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | BIDLO | Bid or Low Price | Numeric
BID, Bid Time Series | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | BID | Bid | Numeric
ASK, Ask Time Series | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | ASK | Ask | Numeric
RETX, Returns without Dividends | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | RETX | Returns w/o Dividends | Numeric
SPREAD, Spread Between Bid and Ask Time Series | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | SPREAD | Spread Between Bid Ask | Numeric
ALTPRC, Price Alternate Time Series | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | ALTPRC | Price Alternate | Numeric
VOL, Volume Time Series | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | VOL | Volume | Numeric
NUMTRD, Number of Trades Time Series | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | NUMTRD | Number of Trades | Numeric
ALTPRCDT, Alternate Price Date Time Series | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | ALTPRCDT | Alternate Price Date | Numeric
Data Set | Fields | Label | Type
PORT1, Portfolio data for Portfolio Type 1 | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | PORT1 | Portfolio Assignment for Portfolio Type 1 | Numeric
 | STAT1 | Portfolio Statistic for Portfolio Type 1 | Numeric
PORT2, Portfolio data for Portfolio Type 2 | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | PORT2 | Portfolio Assignment for Portfolio Type 2 | Numeric
 | STAT2 | Portfolio Statistic for Portfolio Type 2 | Numeric
PORT3, Portfolio data for Portfolio Type 3 | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | PORT3 | Portfolio Assignment for Portfolio Type 3 | Numeric
 | STAT3 | Portfolio Statistic for Portfolio Type 3 | Numeric
PORT4, Portfolio data for Portfolio Type 4 | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | PORT4 | Portfolio Assignment for Portfolio Type 4 | Numeric
 | STAT4 | Portfolio Statistic for Portfolio Type 4 | Numeric
PORT5, Portfolio data for Portfolio Type 5 | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | PORT5 | Portfolio Assignment for Portfolio Type 5 | Numeric
 | STAT5 | Portfolio Statistic for Portfolio Type 5 | Numeric
PORT6, Portfolio data for Portfolio Type 6 | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | PORT6 | Portfolio Assignment for Portfolio Type 6 | Numeric
 | STAT6 | Portfolio Statistic for Portfolio Type 6 | Numeric
Table 33.22
continued
Data Set | Fields | Label | Type
PORT7, Portfolio data for Portfolio Type 7 | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | PORT7 | Portfolio Assignment for Portfolio Type 7 | Numeric
 | STAT7 | Portfolio Statistic for Portfolio Type 7 | Numeric
PORT8, Portfolio data for Portfolio Type 8 | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | PORT8 | Portfolio Assignment for Portfolio Type 8 | Numeric
 | STAT8 | Portfolio Statistic for Portfolio Type 8 | Numeric
PORT9, Portfolio data for Portfolio Type 9 | PERMNO | PERMNO | Numeric
 | CALDT | Calendar Trading Date | Numeric
 | PORT9 | Portfolio Assignment for Portfolio Type 9 | Numeric
 | STAT9 | Portfolio Statistic for Portfolio Type 9 | Numeric
GROUP16, Group data for Group Type 16 | PERMNO | PERMNO | Numeric
 | GRPDT | Group Beginning Date | Numeric
 | GRPENDDT | Group Ending Date | Numeric
 | GRPFLAG | Group Flag of Associated Index | Numeric
 | GRPSU | Group Subflag | Numeric
Fields | Label | Type
GVKEY | GVKEY | Numeric
IPERM | Header CRSP issue PERMNO link | Numeric
ICOMP | Header CRSP company PERMCO link | Numeric
BEGYR | Annual date of earliest data (yyyy) | Numeric
ENDYR | Annual date of latest data (yyyy) | Numeric
BEGQTR | Quarterly date of earliest data (yyyy.q) | Numeric
ENDQTR | Quarterly date of latest data (yyyy.q) | Numeric
AVAILFLG | Code of available CST data types | Numeric
DNUM | Industry code | Numeric
FILE | File identification code | Numeric
ZLIST | Exchange listing and S&P Index code | Numeric
STATE | State identification code | Numeric
COUNTY | County identification code | Numeric
STINC | State incorporation code | Numeric
FINC | Foreign incorporation code | Numeric
XREL | S&P Industry Index relative code | Numeric
STK | Stock ownership code | Numeric
DUP | Duplicate file code | Numeric
Table 33.23
continued
Fields | Label | Type
CCNDX | Current Canadian Index Code | Numeric
GICS | Global Industry Class. Std. code | Numeric
IPODT | IPO date | Numeric
BEGDT | First date of Compustat data | Numeric
ENDDT | Last date of Compustat data | Numeric
FUNDF1 | Fundamental File Identification Code 1 | Numeric
FUNDF2 | Fundamental File Identification Code 2 | Numeric
FUNDF3 | Fundamental File Identification Code 3 | Numeric
NAICS | North American Industry Classification | Character
CPSPIN | Primary S&P index marker | Character
CSSPIN | Secondary S&P index marker | Character
CSSPII | Subset S&P index marker | Character
SUBDBT | Current S&P Subordinated Debt Rating | Character
CPAPER | Current S&P Commercial Paper Rating | Character
SDBT | Current S&P Senior Debt Rating | Character
SDBTIM | Current S&P Senior Debt Rating-Footnote | Character
CNUM | CUSIP issuer code | Character
CIC | Issuer number | Character
CONAME | Company name | Character
INAME | Industry name | Character
SMBL | Stock ticker symbol | Character
EIN | Employer Identification Number | Character
INCORP | Incorporation ISO Country Code | Character
RCST3 | Reserved 3 | Character
Fields | Label | Type
GVKEY | GVKEY | Numeric
CHGDT | Effective date of this description | Numeric
CHGENDDT | Last effective date of this description | Numeric
DNUM | Industry code | Numeric
FILE | File identification code | Numeric
ZLIST | Exchange listing and S&P Index code | Numeric
STATE | State identification code | Numeric
COUNTY | County identification code | Numeric
STINC | State incorporation code | Numeric
FINC | Foreign incorporation code | Numeric
XREL | S&P Industry Index relative code | Numeric
STK | Stock ownership code | Numeric
DUP | Duplicate file code | Numeric
CCNDX | Current Canadian Index Code | Numeric
GICS | Global Industry Classification Std. code | Numeric
Table 33.24
continued
Fields | Label | Type
IPODT | IPO date | Numeric
RCST1 | Reserved 1 | Numeric
RCST2 | Reserved 2 | Numeric
FUNDF1 | Fundamental File Identification Code 1 | Numeric
FUNDF2 | Fundamental File Identification Code 2 | Numeric
FUNDF3 | Fundamental File Identification Code 3 | Numeric
NAICS | North American Industry Classification | Character
CPSPIN | Primary S&P index marker | Character
CSSPIN | Secondary S&P index marker | Character
CSSPII | Subset S&P index marker | Character
SUBDBT | Current S&P Subordinated Debt Rating | Character
CPAPER | Current S&P Commercial Paper Rating | Character
SDBT | Current S&P Senior Debt Rating | Character
SDBTIM | Current S&P Senior Debt Rating-Footnote | Character
CNUM | CUSIP issuer code | Character
CIC | Issuer number | Character
CONAME | Company name | Character
INAME | Industry name | Character
SMBL | Stock ticker symbol | Character
EIN | Employer Identification Number | Character
INCORP | Incorporation ISO Country Code | Character
RCST3 | Reserved 3 | Character
Label
GVKEY
First date link is valid
Last date link is valid
CRSP PERMNO linked
CRSP PERMCO linked
Link type code
Linking Flag
Label
GVKEY
CRSP Date
Raw Calendar Trading Date
Fiscal Trading Date
Table 33.26
continued
Fields | Label | Type
DATYR | Data Year | Numeric
DATQTR | Data Quarter | Numeric
FISCYR | Fiscal year-end month of data | Numeric
CALYR | Calendar year | Numeric
CALQTR | Calendar quarter | Numeric
UPCODE | Update code | Numeric
SRCDOC | Source document code | Numeric
SPBOND | S&P Senior Debt Rating | Numeric
SPDEBT | S&P Subordinated Debt Rating | Numeric
SPPAPER | S&P Commercial Paper Rating | Numeric
SPRANK | Common Stock Ranking | Numeric
MAJIND | Major Index Code | Numeric
INDIND | S&P Industry Index Code | Numeric
REPDT | Report date of quarterly earnings | Numeric
FLOWCD | Flow of funds statement format code | Numeric
CANDXC | Canadian index code | Numeric
PEFTNT1 | Period Descriptor Footnote 1 Source Document Code | Character
PEFTNT2 | Period Descriptor Footnote 2 Month of Deletion | Character
PEFTNT3 | Period Descriptor Footnote 3 Year of Deletion | Character
PEFTNT4 | Period Descriptor Footnote 4 Reason For Deletion | Character
PEFTNT5 | Period Descriptor Footnote 5 Unused | Character
PEFTNT6 | Period Descriptor Footnote 6 Unused | Character
PEFTNT7 | Period Descriptor Footnote 7 Unused | Character
PEFTNT8 | Period Descriptor Footnote 8 Unused | Character
Fields | Label | Type
GVKEY | GVKEY | Numeric
CRSPDT | CRSP Date | Numeric
RCALDT | Raw Calendar Trading Date | Numeric
FISCALDT | Fiscal Trading Date | Numeric
DATYR | Data Year | Numeric
DATQTR | Data Quarter | Numeric
FISCYR | Fiscal year-end month of data | Numeric
CALYR | Calendar year | Numeric
CALQTR | Calendar quarter | Numeric
UPCODE | Update code | Numeric
SRCDOC | Source document code | Numeric
SPBOND | S&P Senior Debt Rating | Numeric
SPDEBT | S&P Subordinated Debt Rating | Numeric
SPPAPER | S&P Commercial Paper Rating | Numeric
SPRANK | Common Stock Ranking | Numeric
Table 33.27
continued
Fields | Label | Type
MAJIND | Major Index Code | Numeric
INDIND | S&P Industry Index Code | Numeric
REPDT | Report date of quarterly earnings | Numeric
FLOWCD | Flow of funds statement format code | Numeric
CANDXC | Canadian index code | Numeric
PEFTNT1 | Period Descriptor Footnote 1 Comparability Status | Character
PEFTNT2 | Period Descriptor Footnote 2 Company Status Alert | Character
PEFTNT3 | Period Descriptor Footnote 3 S&P Senior Debt Rating | Character
PEFTNT4 | Period Descriptor Footnote 4 Reason For Deletion | Character
PEFTNT5 | Period Descriptor Footnote 5 Unused | Character
PEFTNT6 | Period Descriptor Footnote 6 Unused | Character
PEFTNT7 | Period Descriptor Footnote 7 Unused | Character
PEFTNT8 | Period Descriptor Footnote 8 Unused | Character
Fields | Label | Type
GVKEY | GVKEY | Numeric
CRSPDT | CRSP Date | Numeric
RCALDT | Raw Calendar Trading Date | Numeric
FISCALDT | Fiscal Trading Date | Numeric
IA1 | Cash and Short Term Investments | Numeric
IA2 | Receivables - Total | Numeric
IA3 | Inventories - Total | Numeric
IA4 | Current Assets - Total | Numeric
IA5 | Current Liabilities Total | Numeric
IA6 | Assets Total/Liabilities and Stockholders' Equity Total | Numeric
IA7 | Property, Plant, and Equipment Total (Gross) | Numeric
IA8 | Property, Plant, and Equipment Total (Net) | Numeric
IA9 | Long-Term Debt - Total | Numeric
IA10 | Preferred Stock Liquidating Value | Numeric
IA11 | Common Equity Tangible | Numeric
IA12 | Sales (Net) | Numeric
IA13 | Operating Income Before Depreciation | Numeric
IA14 | Depreciation and Amortization (Income Statement) | Numeric
IA15 | Interest Expense | Numeric
IA16 | Income Taxes - Total | Numeric
IA17 | Special Items | Numeric
IA18 | Income Before Extraordinary Items | Numeric
IA19 | Dividends - Preferred | Numeric
IA20 | Income Before Extraordinary Items Adj for Com Stk Equiv Dollar Savings | Numeric
IA21 | Dividends - Common | Numeric
IA22 | Price - High | Numeric
Table 33.28
continued
Fields | Label | Type
IA23 | Price - Low | Numeric
IA24 | Price - Close | Numeric
IA25 | Common Shares Outstanding | Numeric
IA26 | Dividends Per Share by Ex-Date | Numeric
IA27 | Adjustment Factor (Cumulative) by Ex-Date | Numeric
IA28 | Common Shares Traded | Numeric
IA29 | Employees | Numeric
IA30 | Property, Plant, and Equipment Capital Expenditures (Schedule V) | Numeric
IA31 | Investments and Advances Equity Method | Numeric
IA32 | Investments and Advances - Other | Numeric
IA33 | Intangibles | Numeric
IA34 | Debt in Current Liabilities | Numeric
IA35 | Deferred Taxes and Investment Tax Credit (Balance Sheet) | Numeric
IA36 | Retained Earnings | Numeric
IA37 | Invested Capital Total | Numeric
IA38 | Minority Interest Balance Sheet | Numeric
IA39 | Convertible Debt and Preferred Stock | Numeric
IA40 | Common Shares Reserved for Conversion - Total | Numeric
IA41 | Cost of Goods Sold | Numeric
IA42 | Labor and Related Expense | Numeric
IA43 | Pension and Retirement Expense | Numeric
IA44 | Debt - Due in One Year | Numeric
IA45 | Advertising Expense | Numeric
IA46 | Research and Development Expense | Numeric
IA47 | Rental Expense | Numeric
IA48 | Extraordinary Items and Discontinued Operations | Numeric
IA49 | Minority Interest (Income Account) | Numeric
IA50 | Deferred Taxes (Income Account) | Numeric
IA51 | Investment Tax Credit (Income Account) | Numeric
IA52 | Net Operating Loss Carry Forward Unused Portion | Numeric
IA53 | Earnings Per Share (Basic) - Including Extraordinary Items | Numeric
IA54 | Common Shares Used to Calculate Earnings Per Share (Basic) | Numeric
IA55 | Equity in Earnings | Numeric
IA56 | Preferred Stock Redemption Value | Numeric
IA57 | Earnings Per Share (Diluted) - Excluding Extraordinary Items | Numeric
IA58 | Earnings Per Share (Basic) - Excluding Extraordinary Items | Numeric
IA59 | Inventory Valuation Method | Numeric
IA60 | Common Equity - Total | Numeric
IA61 | Non-operating Income (Expense) | Numeric
IA62 | Interest Income | Numeric
IA63 | Income Taxes - Federal Current | Numeric
IA64 | Income Taxes - Foreign Current | Numeric
IA65 | Amortization of Intangibles | Numeric
IA66 | Discontinued Operations | Numeric
IA67 | Receivables Estimated Doubtful | Numeric
Table 33.28
continued
Fields | Label | Type
IA68 | Current Assets - Other | Numeric
IA69 | Assets - Other | Numeric
IA70 | Accounts Payable | Numeric
IA71 | Income Taxes Payable | Numeric
IA72 | Current Liabilities Other | Numeric
IA73 | Property, Plant, and Equipment - Construction in Progress (Net) | Numeric
IA74 | Deferred Taxes (Balance Sheet) | Numeric
IA75 | Liabilities - Other | Numeric
IA76 | Inventories - Raw Materials | Numeric
IA77 | Inventories - Work in Progress | Numeric
IA78 | Inventories Finished Goods | Numeric
IA79 | Debt - Convertible | Numeric
IA80 | Debt - Subordinated | Numeric
IA81 | Debt - Notes | Numeric
IA82 | Debt - Debentures | Numeric
IA83 | Long-Term Debt Other | Numeric
IA84 | Debt - Capitalized Lease Obligations | Numeric
IA85 | Common Stock | Numeric
IA86 | Treasury Stock Memo Entry | Numeric
IA87 | Treasury Stock Number of Common Shares | Numeric
IA88 | Treasury Stock - Total Dollar Amount | Numeric
IA89 | Pension Costs - Unfunded Vested Benefits | Numeric
IA90 | Pension Costs - Unfunded Past or Prior Service | Numeric
IA91 | Debt - Maturing In Second Year | Numeric
IA92 | Debt - Maturing In Third Year | Numeric
IA93 | Debt - Maturing In Fourth Year | Numeric
IA94 | Debt - Maturing In Fifth Year | Numeric
IA95 | Rental Commitments Minimum - Five Years Total | Numeric
IA96 | Rental Commitments Minimum - First Year | Numeric
IA97 | Retained Earnings Unrestricted | Numeric
IA98 | Order Backlog | Numeric
IA99 | Retained Earnings Restatement | Numeric
IA100 | Common Shareholders | Numeric
IA101 | Interest Expense on Long-Term Debt | Numeric
IA102 | Excise Taxes | Numeric
IA103 | Depreciation Expense (Schedule VI) | Numeric
IA104 | Short-Term Borrowing Average | Numeric
IA105 | Short-Term Borrowings Average Interest Rate | Numeric
IA106 | Equity In Net Loss (Earnings) (Statement of Cash Flows) | Numeric
IA107 | Sale of Property, Plant, and Equipment (Statement of Cash Flows) | Numeric
IA108 | Sale of Common and Preferred Stock (Statement of Cash Flows) | Numeric
IA109 | Sale of Investments (Statement of Cash Flows) | Numeric
IA110 | Funds from Operations Total (Statement Changes) | Numeric
IA111 | Long-Term Debt Issuance (Statement of Cash Flows) | Numeric
IA112 | Sources of Funds Total (Statement of Changes) | Numeric
Table 33.28
continued
Fields | Label | Type
IA113 | Increase in Investment (Statement of Cash Flows) | Numeric
IA114 | Long-Term Debt Reduction (Statement of Cash Flows) | Numeric
IA115 | Purchase of Common and Preferred Stock (Statement of Cash Flows) | Numeric
IA116 | Uses of Funds - Total (Statement of Changes) | Numeric
IA117 | Sales (Restated) | Numeric
IA118 | Income Before Extraordinary Items (Restated) | Numeric
IA119 | Earnings Per Share (Basic) - Excluding Extraordinary Items (Restated) | Numeric
IA120 | Assets - Total (Restated) | Numeric
IA121 | Working Capital (Restated) | Numeric
IA122 | Pretax Income (Restated) | Numeric
IA123 | Income Before Extraordinary Items (Statement of Cash Flows) | Numeric
IA124 | Extraordinary Items & Discontinued Operations (Statement of Cash Flows) | Numeric
IA125 | Depreciation and Amortization (Statement of Cash Flows) | Numeric
IA126 | Deferred Taxes (Statement of Cash Flows) | Numeric
IA127 | Cash Dividends (Statement of Cash Flows) | Numeric
IA128 | Capital Expenditures (Statement of Cash Flows) | Numeric
IA129 | Acquisitions (Statement of Cash Flows) | Numeric
IA130 | Preferred Stock Carrying Value | Numeric
IA131 | Cost of Goods Sold (Restated) | Numeric
IA132 | Selling, General, and Administrative Expense (Restated) | Numeric
IA133 | Depreciation and Amortization Restated | Numeric
IA134 | Interest Expenses (Restated) | Numeric
IA135 | Income Taxes - Total (Restated) | Numeric
IA136 | Extraordinary Items and Discontinued Operations (Restated) | Numeric
IA137 | Earnings Per Share (Basic) - Including Extraordinary Items (Restated) | Numeric
IA138 | Common Shares Used To Calculate Earnings Per Share (Basic) Restated | Numeric
IA139 | Earnings Per Share (Diluted) - Excluding Extraordinary Items (Restated) | Numeric
IA140 | Earnings Per Share (Diluted) - Including Extraordinary Items (Restated) | Numeric
IA141 | Property, Plant, and Equipment - Total Net (Restated) | Numeric
IA142 | Long-Term Debt - Total (Restated) | Numeric
IA143 | Retained Earnings (Restated) | Numeric
IA144 | Stockholders' Equity (Restated) | Numeric
IA145 | Capital Expenditures (Restated) | Numeric
IA146 | Employees (Restated) | Numeric
IA147 | Interest Capitalized | Numeric
IA148 | Long-Term Debt Tied to Prime | Numeric
IA149 | Auditor/Auditor's Opinion | Numeric
IA150 | Foreign Currency Adjustment Income Account | Numeric
Table 33.28
continued
Fields | Label | Type
IA151 | Receivables - Trade | Numeric
IA152 | Deferred Charges | Numeric
IA153 | Accrued Expense | Numeric
IA154 | Debt - Subordinated Convertible | Numeric
IA155 | Property, Plant, and Equipment - Buildings (Net) | Numeric
IA156 | Property, Plant, and Equipment - Machinery and Equipment (Net) | Numeric
IA157 | Property, Plant, and Equipment Natural Resources (Net) | Numeric
IA158 | Property, Plant, and Equipment - Land and Improvements (Net) | Numeric
IA159 | Property, Plant, and Equipment - Leases (Net) | Numeric
IA160 | Prepaid Expense | Numeric
IA161 | Income Tax Refund | Numeric
IA162 | Cash | Numeric
IA163 | Rental Income | Numeric
IA164 | Rental Commitments Minimum - Second Year | Numeric
IA165 | Rental Commitments Minimum - Third Year | Numeric
IA166 | Rental Commitments Minimum - Fourth Year | Numeric
IA167 | Rental Commitments Minimum - Fifth Year | Numeric
IA168 | Compensating Balance | Numeric
IA169 | Earnings Per Share (Diluted) - Including Extraordinary Items | Numeric
IA170 | Pretax Income | Numeric
IA171 | Common Shares Used to Calculate Earnings Per Share (Diluted) | Numeric
IA172 | Net Income (Loss) | Numeric
IA173 | Income Taxes - State Current | Numeric
IA174 | Depletion Expense (Schedule VI) | Numeric
IA175 | Preferred Stock Redeemable | Numeric
IA176 | Blank | Numeric
IA177 | Net Income (Loss) (Restated) | Numeric
IA178 | Operating Income After Depreciation | Numeric
IA179 | Working Capital (Balance Sheet) | Numeric
IA180 | Working Capital Change Total (Statement of Changes) | Numeric
IA181 | Liabilities - Total | Numeric
IA182 | Property, Plant, and Equipment - Beginning Balance (Schedule V) | Numeric
IA183 | Accounting Changes Cumulative Effect | Numeric
IA184 | Property, Plant, and Equipment Retirements (Schedule V) | Numeric
IA185 | Property, Plant, and Equipment - Other Changes (Schedule V) | Numeric
IA186 | Inventories - Other | Numeric
IA187 | Property, Plant, and Equipment - Ending Balance (Schedule V) | Numeric
IA188 | Debt - Senior Convertible | Numeric
IA189 | Selling, General, and Administrative Expense | Numeric
IA190 | Non-operating Income (Expense) | Numeric
IA191 | Common Stock Equivalents - Dollar Savings | Numeric
IA192 | Extraordinary Items | Numeric
IA193 | Short-Term Investments | Numeric
IA194 | Receivables - Current Other | Numeric
IA195 | Current Assets - Other Excluding Prepaid Expenses | Numeric
Table 33.28
continued
Fields | Label | Type
IA196 | Depreciation, Depletion and Amortization (Accumulated) (Balance Sheet) | Numeric
IA197 | Price - Fiscal Year High | Numeric
IA198 | Price - Fiscal Year Low | Numeric
IA199 | Price - Fiscal Year Close | Numeric
IA200 | Common Shares Reserved for Conversion Convertible Debt | Numeric
IA201 | Dividends Per Share by Payable Date | Numeric
IA202 | Adjustment Factor (Cumulative) by Payable Date | Numeric
IA203 | Common Shares Reserved for Conversion Preferred Stock | Numeric
IA204 | Goodwill | Numeric
IA205 | Assets - Other Excluding Deferred Charges | Numeric
IA206 | Notes Payable | Numeric
IA207 | Current Liabilities Other - Excluding Accrued Expenses | Numeric
IA208 | Investment Tax Credit (Balance Sheet) | Numeric
IA209 | Preferred Stock Nonredeemable | Numeric
IA210 | Capital Surplus | Numeric
IA211 | Income Taxes - Other | Numeric
IA212 | Blank | Numeric
IA213 | Sale of Prop, Plnt, & Equip & Sale of Invs Loss(Gain)(Stmt of Csh Flo) | Numeric
IA214 | Preferred Stock Convertible | Numeric
IA215 | Common Shares Reserved for Conversion Stock Options | Numeric
IA216 | Stockholders' Equity Total | Numeric
IA217 | Funds from Operations Other (Statement of Cash Flow) | Numeric
IA218 | Sources of Funds Other (Statement of Changes) | Numeric
IA219 | Uses of Funds - Other (Statement of Changes) | Numeric
IA220 | Depreciation (Accumulated) Beginning Balance (Schedule VI) | Numeric
IA221 | Depreciation (Accumulated) Retirements (Schedule VI) | Numeric
IA222 | Depreciation (Accumulated) - Other Changes (Schedule VI) | Numeric
IA223 | Depreciation (Accumulated) - Ending Balance (Schedule VI) | Numeric
IA224 | Non-operating Income (Expense) (Restated) | Numeric
IA225 | Minority Interest (Restated) | Numeric
IA226 | Treasury Stock (Dollar Amount) - Common | Numeric
IA227 | Treasury Stock (Dollar Amount) - Preferred | Numeric
IA228 | Currency Translation Rate | Numeric
IA229 | Common Shares Reserved for Conversion Warrants and Other | Numeric
IA230 | Retained Earnings Cumulative Translation Adjustment | Numeric
IA231 | Retained Earnings Other Adjustments | Numeric
IA232 | Common Stock - Per Share Carrying Value | Numeric
IA233 | Earnings per Share from Operations | Numeric
IA234 | ADR Ratio | Numeric
IA235 | Common Equity Liquidation Value | Numeric
IA236 | Working Capital Change Other - Increase (Decrease) (Stmnt of Changes) | Numeric
IA237 | Income Before Extraordinary Items Available for Common | Numeric
Table 33.28
continued
Fields | Label | Type
IA238 | Marketable Securities Adjustment (Balance Sheet) | Numeric
IA239 | Interest Capitalized Net Income Effect | Numeric
IA240 | Inventories - LIFO Reserve | Numeric
IA241 | Debt - Mortgages and Other Secured | Numeric
IA242 | Dividends - Preferred In Arrears | Numeric
IA243 | Pension Benefits Present Value of Vested | Numeric
IA244 | Pension Benefits Present Value of Nonvested | Numeric
IA245 | Pension Benefits Net Assets | Numeric
IA246 | Pension Discount Rate (Assumed Rate of Return) | Numeric
IA247 | Pension Benefits Information Date | Numeric
IA248 | Acquisition - Income Contribution | Numeric
IA249 | Acquisitions - Sales Contribution | Numeric
IA250 | Property, Plant, and Equipment - Other (Net) | Numeric
IA251 | Depreciation (Accumulated) - Land and Improvements | Numeric
IA252 | Depreciation (Accumulated) Natural Resources | Numeric
IA253 | Depreciation (Accumulated) Buildings | Numeric
IA254 | Depreciation (Accumulated) Machinery and Equipment | Numeric
IA255 | Depreciation (Accumulated) - Leases | Numeric
IA256 | Depreciation (Accumulated) Construction in Progress | Numeric
IA257 | Depreciation (Accumulated) - Other | Numeric
IA258 | Net Income Adjusted for Common Stock Equivalents | Numeric
IA259 | Retained Earnings Unadjusted | Numeric
IA260 | Property, Plant, and Equipment - Land and Improvements at Cost | Numeric
IA261 | Property, Plant, and Equipment - Natural Resources at Cost | Numeric
IA262 | Blank | Numeric
IA263 | Property, Plant, and Equipment - Buildings at Cost | Numeric
IA264 | Property, Plant, and Equipment - Machinery and Equipment at Cost | Numeric
IA265 | Property, Plant, and Equipment - Leases at Cost | Numeric
IA266 | Property, Plant, and Equipment Construction in Progress at Cost | Numeric
IA267 | Property, Plant, and Equipment - Other at Cost | Numeric
IA268 | Debt Unamortized Debt Discount and Other | Numeric
IA269 | Deferred Taxes Federal | Numeric
IA270 | Deferred Taxes Foreign | Numeric
IA271 | Deferred Taxes - State | Numeric
IA272 | Pretax Income Domestic | Numeric
IA273 | Pretax Income Foreign | Numeric
IA274 | Cash & Cash Equivalent Increase (Decrease) (Statement of Cash Flows) | Numeric
IA275 | Blank | Numeric
IA276 | S&P Major Index Code Historical | Numeric
IA277 | S&P Industry Index Code - Historical | Numeric
IA278 | Fortune Industry Code Historical | Numeric
IA279 | Fortune Rank | Numeric
IA280 | S&P Long-Term Domestic Issuer Credit Rating Historical | Numeric
IA281 | Blank | Numeric
Table 33.28
continued
Fields | Label | Type
IA282 | S&P Common Stock Ranking | Numeric
IA283 | S&P Short-Term Domestic Issuer Credit Rating - Historical | Numeric
IA284 | Pension Vested Benefit Obligation | Numeric
IA285 | Pension - Accumulated Benefit Obligation | Numeric
IA286 | Pension - Projected Benefit Obligation | Numeric
IA287 | Pension Plan Assets | Numeric
IA288 | Pension - Unrecognized Prior Service Cost | Numeric
IA289 | Pension - Other Adjustments | Numeric
IA290 | Pension - Prepaid Accrued Cost | Numeric
IA291 | Pension Vested Benefit Obligation Underfunded | Numeric
IA292 | Periodic Postretirement Benefit Cost Net | Numeric
IA293 | Pension - Accumulated Benefit Obligation (Underfunded) | Numeric
IA294 | Pension - Projected Benefit Obligation (Underfunded) | Numeric
IA295 | Periodic Pension Cost (Net) | Numeric
IA296 | Pension Plan Assets (Underfunded) | Numeric
IA297 | Pension - Unrecognized Prior Service Cost (Underfunded) | Numeric
IA298 | Pension - Additional Minimum Liability | Numeric
IA299 | Pension - Other Adjustments (Underfunded) | Numeric
IA300 | Pension - Prepaid Accrued Cost (Underfunded) | Numeric
IA301 | Changes in Current Debt (Statement of Cash Flows) | Numeric
IA302 | Accounts Receivable Decrease (Increase) (Statement of Cash Flows) | Numeric
IA303 | Inventory - Decrease (Increase) (Statement of Cash Flows) | Numeric
IA304 | Accounts Payable & Accrued Liabilities Inc (Decrease) (St Cash Flows) | Numeric
IA305 | Income Taxes - Accrued Increase (Decrease) Statement of Cash Flow | Numeric
IA306 | Blank | Numeric
IA307 | Assets and Liabilities Other (Net Change) Statement of Cash Flow | Numeric
IA308 | Operating Activities Net Cash Flow Statement of Cash Flow | Numeric
IA309 | Short-Term Investments Change (Statement of Cash Flows) | Numeric
IA310 | Investing Activities Other (Statement of Cash Flows) | Numeric
IA311 | Investing Activities Net Cash Flow (Statement of Cash Flows) | Numeric
IA312 | Financing Activities Other (Statement of Cash Flows) | Numeric
IA313 | Financing Activities Net Cash Flow (Statement of Cash Flows) | Numeric
IA314 | Exchange Rate Effect (Statement of Cash Flows) | Numeric
IA315 | Interest Paid - Net (Statement of Cash Flows) | Numeric
IA316 | Blank | Numeric
IA317 | Income Taxes Paid (Statement of Cash Flows) | Numeric
IA318 | Format Code (Statement of Cash Flows) | Numeric
IA319 | Dilution Adjustment | Numeric
IA320 | S&P Subordinated Debt Rating | Numeric
IA321 | Interest Income Total (Financial Services) | Numeric
IA322 | Dilution Available Excluding | Numeric
IA323 | Earnings per share from Operations (Diluted) | Numeric
Table 33.28
continued
Fields | Label | Type
IA324 | Historical SIC Code | Numeric
IA325 | Blank | Numeric
IA326 | Blank | Numeric
IA327 | Contingent Liabilities Guarantees | Numeric
IA328 | Debt - Finance Subsidiary | Numeric
IA329 | Debt - Consolidated Subsidiary | Numeric
IA330 | Postretirement Benefit Asset (Liability) (Net) | Numeric
IA331 | Pension Plans Service Cost | Numeric
IA332 | Pension Plans Interest Cost | Numeric
IA333 | Pension Plans Return on Plan Assets (Actual) | Numeric
IA334 | Pension Plans - Other Periodic Cost Components (Net) | Numeric
IA335 | Pension Plans - Rate of Compensation Increase | Numeric
IA336 | Pension Plans Anticipated Long-Term Rate of Return on Plan Assets | Numeric
IA337 | Risk-Adjusted Capital Ratio - Tier 1 | Numeric
IA338 | Blank | Numeric
IA339 | Interest Expense Total (Financial Services) | Numeric
IA340 | Net Interest Income (Tax Equivalent) | Numeric
IA341 | Non-performing Assets Total | Numeric
IA342 | Provision for Loan/Asset Losses | Numeric
IA343 | Reserve for Loan/Asset Losses | Numeric
IA344 | Net Interest Margin | Numeric
IA345 | Blank | Numeric
IA346 | Blank | Numeric
IA347 | Blank | Numeric
IA348 | Risk-Adjusted Capital Ratio - Total | Numeric
IA349 | Net Charge-Offs | Numeric
IA350 | Blank | Numeric
IA351 | Current Assets Discontinued Operations | Numeric
IA352 | Other Intangibles | Numeric
IA353 | Long-Term Assets of Discontinued Operations | Numeric
IA354 | Other Current Assets Excluding Discontinued | Numeric
IA355 | Other Assets Excluding Discontinued Operations | Numeric
IA356 | Deferred Revenue Current | Numeric
IA357 | Accumulated Other Comprehensive Income | Numeric
IA358 | Deferred Compensation | Numeric
IA359 | Other Stockholders' Equity Adjustments | Numeric
IA360 | Acquisition/Merger Pretax | Numeric
IA361 | Acquisition/Merger After-Tax | Numeric
IA362 | Acquisition/Merger Basic EPS Effect | Numeric
IA363 | Acquisition/Merger Diluted EPS Effect | Numeric
IA364 | Gain/Loss Pretax | Numeric
IA365 | Gain/Loss After-Tax | Numeric
IA366 | Gain/Loss Basic EPS Effect | Numeric
IA367 | Gain/Loss Diluted EPS Effect | Numeric
IA368 | Impairments of Goodwill Pretax | Numeric
Table 33.28
continued
Fields | Label | Type
IA369 | Impairments of Goodwill After-Tax | Numeric
IA370 | Impairments of Goodwill Basic EPS Effect | Numeric
IA371 | Impairments of Goodwill Diluted EPS Effect | Numeric
IA372 | Settlement (Litigation/Insurance) Pretax | Numeric
IA373 | Settlement (Litigation/Insurance) Aftertax | Numeric
IA374 | Settlement (Litigation/Insurance) Basic EPS | Numeric
IA375 | Settlement (Litigation/Insurance) Diluted EPS | Numeric
IA376 | Restructuring Costs Pretax | Numeric
IA377 | Restructuring Costs Aftertax | Numeric
IA378 | Restructuring Costs Basic EPS Effect | Numeric
IA379 | Restructuring Costs Diluted EPS Effect | Numeric
IA380 | Writedowns Pretax | Numeric
IA381 | Writedowns After-Tax | Numeric
IA382 | Writedowns Basic EPS Effect | Numeric
IA383 | Writedowns Diluted EPS Effect | Numeric
IA384 | Other Special Items Pretax | Numeric
IA385 | Other Special Items Aftertax | Numeric
IA386 | Other Special Items Basic EPS Effect | Numeric
IA387 | Other Special Items Diluted EPS Effect | Numeric
IA388 | In Process Research & Development | Numeric
IA389 | Thereafter Rent Commitments | Numeric
IA390 | Accumulated Depreciation of Real Estate Property | Numeric
IA391 | Total Real Estate Property | Numeric
IA392 | Gain/Loss on Sale of Property | Numeric
IA393 | Depreciation and Amortization of Property | Numeric
IA394 | Goodwill Amortization | Numeric
IA395 | See Footnote for 394 | Numeric
IA396 | Common Shares Issued | Numeric
IA397 | Deferred Revenue Long-Term | Numeric
IA398 | Stock Compensation Expense | Numeric
IA399 | Implied Option Expense | Numeric
IA400 | Blank | Numeric
Label GVKEY CRSP Date Raw Calendar Trading Date Fiscal Trading Date Selling, General, and Administrative Expense Sales (Net) Minority Interest (Income Account)
Table 33.29
continued
Fields IQ4 IQ5 IQ6 IQ7 IQ8 IQ9 IQ10 IQ11 IQ12 IQ13 IQ14 IQ15 IQ16 IQ17 IQ18 IQ19 IQ20 IQ21 IQ22 IQ23 IQ24 IQ25 IQ26 IQ27 IQ28 IQ29 IQ30 IQ31 IQ32 IQ33 IQ34 IQ35 IQ36 IQ37 IQ38 IQ39 IQ40 IQ41 IQ42 IQ43 IQ44
Label Research and Development Expense Depreciation and Amortization (Income Statement) Income Taxes - Total Earnings Per Share (Diluted) - Including Extraordinary Items Income Before Extraordinary Items Earnings Per Share (Diluted) - Excluding Extraordinary Items Income Before Extraordinary Items - Adj for Com Stk Equiv Dollar Savings Earnings Per Share (Basic) - Including Extraordinary Items Price - Close 1st Month of Quarter Price - Close 2nd Month of Quarter Price - Close 3rd Month of Quarter Common Shares Used to Calculate Earnings per Share (Basic) Dividends per Share by Ex-Date Adjustment Factor Cumulative by Ex-Date Common Shares Traded Earnings Per Share (Basic) - Excluding Extraordinary Items Dividends - Common Indicated Annual Operating Income Before Depreciation Interest Expense Pretax Income Dividends - Preferred Income Before Extraordinary Items Available for Common Extraordinary Items and Discontinued Operations Earnings Per Share (Basic) Excluding Extraordinary Items 12 Mo Moving Common Shares Used to Calculate Earnings Per Share - 12 Month Moving Interest Income Total (Financial Services) Cost of Goods Sold Non-operating Income (Expense) Special Items Discontinued Operations Foreign Currency Adjustment (Income Account) Deferred Taxes (Income Account) Cash and Short Term Investments Receivables - Total Inventories - Total Current Assets - Other Current Assets - Total Depreciation, Depletion and Amortization (Accumulated) (Balance Sheet) Property, Plant, and Equipment - Total (Net) Assets - Other Assets - Total/Liabilities and Stockholders' Equity - Total
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.29
continued
Fields IQ45 IQ46 IQ47 IQ48 IQ49 IQ50 IQ51 IQ52 IQ53 IQ54 IQ55 IQ56 IQ57 IQ58 IQ59 IQ60 IQ61 IQ62 IQ63 IQ64 IQ65 IQ66 IQ67 IQ68 IQ69 IQ70 IQ71 IQ72 IQ73 IQ74 IQ75 IQ76 IQ77 IQ78 IQ79 IQ80 IQ81 IQ82 IQ83 IQ84 IQ85 IQ86
Label Debt in Current Liabilities Accounts Payable Income Taxes Payable Current Liabilities Other Current Liabilities Total Liabilities - Other Long-Term Debt - Total Deferred Taxes and Investment Tax Credit (Balance Sheet) Minority Interest (Balance Sheet) Liabilities - Total Preferred Stock Carrying Value Common Stock Capital Surplus Retained Earnings Common Equity - Total Stockholders' Equity Total Common Shares Outstanding Invested Capital Total Price - High 1st Month of Quarter Price - High 2nd Month of Quarter Price - High 3rd Month of Quarter Price - Low 1st Month of Quarter Price - Low 2nd Month of Quarter Price - Low 3rd Month of Quarter Net Income (Loss) Interest Expense Total (Financial Services) Preferred Stock Redeemable Dividends per Share by Payable Date Working Capital Change Other - Increase (Decrease) (Stmnt of Changes) Cash & Cash Equivalents Increase (Decrease) (Statement of Cash Flows) Changes in Current Debt (Statement of Cash Flows) Income Before Extraordinary Items (Statement of Cash Flows) Depreciation and Amortization (Statement of Cash Flows) Extraordinary Items & Discontinued Operation (Statement of Cash Flows) Deferred Taxes (Statement of Cash Flows) Equity in Net Loss (Earnings) (Statement of Cash Flows) Funds from Operations Other (Statement of Cash Flows) Funds from Operations Total (Statement of Changes) Sale of Property, Plant and Equipment (Statement of Cash Flows) Sale of Common and Preferred Stock (Statement of Cash Flows) Sale of Investments (Statement of Cash Flows) Long-Term Debt Issuance (Statement of Cash Flows)
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.29
continued
Fields IQ87 IQ88 IQ89 IQ90 IQ91 IQ92 IQ93 IQ94 IQ95 IQ96 IQ97 IQ98 IQ99 IQ100 IQ101 IQ102 IQ103 IQ104 IQ105 IQ106 IQ107 IQ108 IQ109 IQ110 IQ111 IQ112 IQ113 IQ114 IQ115 IQ116 IQ117 IQ118 IQ119 IQ120 IQ121 IQ122 IQ123 IQ124 IQ125
Label Sources of Funds Other (Statement of Changes) Sources of Funds Total (Statement of Changes) Cash Dividends (Statement of Cash Flows) Capital Expenditures (Statement of Cash Flows) Increase in Investment (Statement of Cash Flows) Long-Term Debt Reduction (Statement of Cash Flows) Purchase of Common and Preferred Stock (Statement of Cash Flows) Acquisitions (Statement of Cash Flows) Uses of Funds - Other (Statement of Changes) Uses of Funds - Total (Statement of Changes) Net Interest Income (Tax Equivalent) Treasury Stock Total Dollar Amount Non-Performing Assets Total Adjustment Factor (Cumulative) by Payable Date Working Capital Change Total (Statement of Changes) Sale of Prop, Plnt, & Equip & Sale of Invs Loss (Gain) (Stmt of Csh Flo) Accounts Receivable Decrease (Increase) (Statement of Cash Flows) Inventory - Decrease (Increase) (Statement of Cash Flows) Accounts Payable & Accrued Liabilities Inc (Decrease) (St Cash Flows) Income Taxes - Accrued Increase (Decrease) (Statement of Cash Flows) Assets and Liabilities Other (Net Change) (Statement of Cash Flows) Operating Activities Net Cash Flow (Statement of Cash Flows) Short-Term Investments Change (Statement of Cash Flows) Investing Activities Other (Statement of Cash Flows) Investing Activities Net Cash Flow (Statement of Cash Flows) Financing Activities Other (Statement of Cash Flows) Financing Activities Net Cash Flow (Statement of Cash Flows) Exchange Rate Effect (Statement of Cash Flows) Interest Paid - Net (Statement of Cash Flows) Income Taxes Paid (Statement of Cash Flows) Accounting Changes Cumulative Effect Property, Plant and Equipment - Total (Gross) Extraordinary Items Common Stock Equivalents Dollar Savings Currency Translation Rate Accounts Payable Expanded Blank Common Shares for Diluted EPS Dilution Adjustment
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.29
continued
Fields IQ126 IQ127..IQ170 IQ171 IQ172 IQ173 IQ174 IQ175 IQ176 IQ177 IQ178 IQ179 IQ180 IQ181 IQ182..IQ232 IQ233 IQ234 IQ235 IQ236 IQ237 IQ238 IQ239 IQ240 IQ241 IQ242 IQ243 IQ244 IQ245 IQ246 IQ247 IQ248 IQ249 IQ250 IQ251 IQ252 IQ253 IQ254 IQ255 IQ256 IQ257 IQ258 IQ259 IQ260 IQ261 IQ262
Label Dilution Available Excluding Blank Provision for Loan/Asset Losses Reserve for Loan/Asset Losses Net Interest Margin Risk-Adjusted Capital Ratio - Tier 1 Risk-Adjusted Capital Ratio - Total Net Charge-Offs Earnings per Share from Operations Earnings per Share from Operations Earnings per Share (Diluted) - Excluding Extraordinary Items 12 Mo Mov Earnings per Share from Operations (Diluted) 12 Months Moving Earnings per Share from Operations (Diluted) Blank Total Long-Term Investments Goodwill Other Intangibles Other Long-Term Assets Unadjusted Retained Earnings Accumulated Other Comprehensive Income Deferred Compensation Other Stockholders' Equity Adjustments Acquisition/Merger Pretax Acquisition/Merger After-Tax Acquisition/Merger Basic EPS Effect Acquisition/Merger Diluted EPS Effect Gain/Loss Pretax Gain/Loss After-Tax Gain/Loss Diluted EPS Effect Gain/Loss Basic EPS Effect Impairments of Goodwill Pretax Impairments of Goodwill After-Tax Impairments of Goodwill Basic EPS Effect Impairments of Goodwill Diluted EPS Effect Settlement (Litigation/Insurance) Pretax Settlement (Litigation/Insurance) Aftertax Settlement (Litigation/Insurance) Basic EPS Settlement (Litigation/Insurance) Diluted EP Restructuring Costs Pretax Restructuring Costs Aftertax Restructuring Costs Basic EPS Effect Restructuring Costs Diluted EPS Effect Writedowns Pretax Writedowns After-Tax
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.29
continued
Fields IQ263 IQ264 IQ265 IQ266 IQ267 IQ268 IQ269 IQ270 IQ271 IQ272 IQ273 IQ274 IQ275 IQ276 IQ277 IQ278 IQ279 IQ280
Label Writedowns Basic EPS Effect Writedowns Diluted EPS Effect Other Special Items Pretax Other Special Items Aftertax Other Special Items Basic EPS Effect Other Special Items Diluted EPS Effect Accumulated Depreciation of Real Estate Property Total Real Estate Property Gain/Loss on Sale of Property Depreciation and Amortization of Property ADR Ratio In Process Research & Development Goodwill Amortization See Footnote for 275 Common Shares Issued Stock Compensation Expense Blank Blank
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Fields GVKEY CRSPDT RCALDT FISCALDT BA1 BA2 BA3 BA4 BA5 BA6 BA7 BA8 BA9 BA10 BA11 BA12 BA13 BA14 BA15 BA16 BA17
Label GVKEY CRSP Date Raw Calendar Trading Date Fiscal Trading Date Cash and Due from Banks U.S. Treasury Securities Securities of Other U.S. Government Agencies and Corporations Due from Banks (Memorandum Entry) Other Securities (Taxable) Total Taxable Investment Securities Obligations of States and Political Subdivisions Total Investment Securities Geographic Designation Code Trading Account Securities Federal Funds Sold and Securities Purchased under Agreements to Resell Treasury Stock - Dollar Amount - Common Foreign Loans Real Estate Loans Total Real Estate Loans Insured or Guaranteed by U.S. Government Treasury Stock Dollar Amount Preferred Loans to Financial Institutions
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.30
continued
Fields BA18 BA19 BA20 BA21 BA22 BA23 BA24 BA25 BA26 BA27 BA28 BA29 BA30 BA31 BA32 BA33 BA34 BA35 BA36 BA37 BA38 BA39 BA40 BA41 BA42 BA43 BA44 BA45 BA46 BA47 BA48 BA49 BA50 BA51 BA52 BA53 BA54 BA55 BA56 BA57 BA58 BA59
Label Loans for Purchasing or Carrying Securities Interest Income Total (Financial Services) Commercial or Industrial Loans Loans to Individuals for Household, Family, Other Consumer Expenditures Other Loans Loans (Gross) Unearned Discount/Income Interest on Due from Banks (Restated) Interest Income on Fed Funds Sold & Secs Purchased under Agmnt to Resell Other Interest Income (Restated) Bank Premises, Furniture, and Fixtures Real Estate Other than Bank Premises Investments in Nonconsolidated Subsidiaries Direct Lease Financing Customers' Liability to this Bank on Acceptances Outstanding Other Assets Intangible Assets Aggregate Miscellaneous Assets Total Assets (Gross) Trading Account Income (Restated) Other Current Operating Revenue (Restated) Interest Expense on Fed Funds Purchd & Secs Sold under Agmnts to Repur Assets Held for Sale Total Demand Deposits Net Interest Margin Consumer Type Time Deposit Total Savings Deposits Money Market Certificates of Deposit All Other Time Deposit Total Time Deposits (Other than Savings) Risk-Adjusted Capital Ratio - Tier 1 Interest on Long-Term Debt and Not Classified as Capital (Restated) Interest on Other Borrowed Money (Restated) Other Interest Expense (Restated) Salaries and Related Expenses (Restated) Total Deposits Worldwide Total Domestic Deposits Total Foreign Deposits Demand Deposits of IPC Time and Savings Deposits of IPC Deposits of U.S. Government Deposits of States and Political Subdivisions
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.30
continued
Fields BA60 BA61 BA62 BA63 BA64 BA65 BA66 BA67 BA68 BA69 BA70 BA71 BA72 BA73 BA74 BA75 BA76 BA77 BA78 BA79 BA80 BA81 BA82 BA83 BA84 BA85 BA86 BA87 BA88 BA89 BA90 BA91 BA92 BA93 BA94 BA95 BA96 BA97 BA98 BA99 BA100 BA101 BA102
Label Deposits of Foreign Governments Deposits of Commercial Banks Certified and Officers' Checks Other Deposits Risk-Adjusted Capital Ratio - Total Federal Funds Purchased & Securities Sold under Agreements to Repurchase Commercial Paper Long-Term Debt Not Classified as Capital Other Liabilities for Borrowed Money Total Borrowings Valuation Portion of Reserve for Loan Losses Mortgage Indebtedness Acceptances Executed by or for the Account of this Bank and Outstanding Other Liabilities (Excluding Valuation Reserves) Deferred Portion of Reserve for Loan Losses Contingency Portion of Reserve for Loan Losses Total Liabilities (Excluding Valuation Reserves) Minority Interest in Consolidated Subsidiaries Reserve(s) for Bad Debt Losses on Loans Depreciation and Amortization Reserves on Securities Total Reserves on Loan and Securities Fixed Expense (Occupancy and Equipment - Net) (Restated) Other Current Operating Expense (Restated) Capital Notes and Debentures Minority Interest (Income Account) (Restated) Preferred Stock Par Value Number of Shares of Preferred Stock Outstanding Common Stock Par Value Number of Shares Authorized Number of Shares Outstanding Number of Shares Reserved for Conversion Treasury Stock - Cost Number of Treasury Shares Held Special Items Surplus Undivided Profits Reserves for Contingencies and Other Capital Reserves Total Extraordinary Items - Net of Taxes (Restated) Total Book Value Net Income Per Share Excluding Extraordinary Items (Restated) Total Liabilities, Reserves and Capital Accounts Total Capital Accounts and Minority Interest (Invested Capital)
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.30
continued
Fields BA103 BA104 BA105 BA106 BA107 BA108 BA109 BA110 BA111 BA112 BA113 BA114 BA115 BA116 BA117 BA118 BA119 BA120 BA121 BA122 BA123 BA124 BA125 BA126 BA127 BA128 BA129 BA130 BA131 BA132 BA133 BA134 BA135 BA136 BA137 BA138 BA139 BA140 BA141
Label Net Current Op Erngs Per Share - Excluding Extraordinary Items Fully Foreign Exchange Gains and Losses Interest and Fees on Loans Interest Inc on Federal Funds Sold & Secs Purchased und Agmnts to Resell Blank Interest and Discount on U.S. Treasury Securities Interest on Securities of U.S. Government Agencies and Corporations Interest and Dividends on Other Taxable Securities Total Taxable Investment Revenue Interest on Obligation of States & Political Subdivisions Other Interest Income Trading Account Interest (Memorandum Entry) Total Interest and Dividends on Investment Aggregate Loan and Investment Revenue Trust Department Income Service Charges on Deposit Accounts Other Svce Charges, Collection & Exchange Charges, Comms & Fees Trading Account Income Other Current Operating Revenue Interest on Due from Banks Aggregate Other Current Operating Revenue Total Current Operating Revenue Number of Employees Salaries and Wages of Officers and Employees Pension and Employee Benefits Average Fed Funds Purchd & Securities Sold under Agmnts to Repurchase Interest on Deposits Interest Exp on Fed Fnds Purchd & Securities Sold under Agmnts to Repur Interest on Borrowed Money Interest on Long-Term Debt - Not Classified as Capital Total Interest on Deposits and Borrowing Interest on Capital Notes and Debentures Provision for Loan Losses Occupancy Expense of Bank Premises - Net Total Interest Expense Rental Income Furniture and Equipment Depreciation, Rental Cost, Servicing, Etc. Number of Employees (Restated) Number of Domestic Officers (Restated)
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.30
continued
Fields BA142 BA143 BA144 BA145 BA146 BA147 BA148 BA149 BA150 BA151 BA152 BA153 BA154 BA155 BA156 BA157 BA158 BA159 BA160 BA161 BA162 BA163 BA164 BA165 BA166 BA167 BA168 BA169 BA170 BA171 BA172 BA173 BA174 BA175 BA176 BA177 BA178 BA179 BA180 BA181 BA182
Label Other Current Operating Expense Aggregate Other Current Operating Expense Total Current Operating Expense Current Operating Earnings before Income Tax Income Taxes Applicable to Current Operating Earnings Net Current Operating Earnings Minority Interest (Income Account) Net Current Operating Earnings after Minority Interest Average Cash and Due from Banks (Restated) Average Loans Domestic (Restated) Average Loans Foreign (Restated) Net Pre-Tax Profit or Loss on Securities Sold or Redeemed Average Fed Fnds Sold & Secs Purchased under Agmnts to Resell (Rest) Average Trading Account Securities (Restated) Average Deposits (Restated) Tax Effect on Profit or Loss on Securities Sold or Redeemed Net Aft-Tax Profit/Loss on Secs Sld or Redmd Prior to Eff of Min Int Minority Interest in Aft-Tax Profit/Loss on Securities Sold or Redeemed Net After-Tax & Aft-Min Int Profit/Loss on Secs Sld or Redeemed Net Income Preferred Dividend Deductions Savings Due to Common Stock Equivalents Net Income Available for Common Net Current Operating Earnings Available for Common Interest and Fees on Loans (Restated) Taxable Investment Income (Restated) Non-Taxable Investment Income (Restated) Total Interest Income (Restated) Trust Department Income (Restated) Total Current Operating Revenue (Restated) Interest on Deposits (Restated) Total Interest on Deposits and Borrowing (Restated) Interest on Capital Notes and Debentures (Restated) Total Interest Expense (Restated) Provision for Loan Losses (Restated) Cash Dividends Declared on Common Stock Cash Dividends Declared on Preferred Stock Total Current Operating Expense (Restated) Current Operating Earnings before Income Tax (Restated) Income Taxes Applicable to Current Operating Earnings (Restated) Net After-Tax & Aft-Min Int Profit/Loss on Secs Sld/Redeemed (Rest)
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.30
continued
Fields BA183 BA184 BA185 BA186 BA187 BA188 BA189 BA190 BA191 BA192 BA193 BA194 BA195 BA196 BA197 BA198 BA199 BA200 BA201 BA202 BA203 BA204 BA205 BA206 BA207 BA208 BA209 BA210 BA211 BA212 BA213 BA214 BA215 BA216 BA217 BA218
Label Net Income (Restated) Net Current Operating Earnings Per Share (Restated) Total Extraordinary Items Net of Taxes Common Shares Used in Calculating Earnings Per Shares (Restated) Additions to Reserves for Bad Debts Due to Mergers and Absorptions Additions to Reserves for Bad Debts Due to Recoveries Credtd to Rsrvs Deductions from Reserves for Bad Debts Due to Losses Charged to Reserves Net Credit/Charge to Reserves for Bad Debts from Loan Recs or Chg-offs Transfers to Reserves for Bad Debts from Inc and/or to/from Undiv Prfts Average Preferred Stock Par Value (Restated) Net Current Operating Earnings Per Share Excluding Extraordinary Items Net Income per Share Excluding Extraordinary Items Net Income per Share Including Extraordinary Items Common Shares Used in Calculating Earnings per Share Net Cur Op Earnings per Shares - Exc Extraordinary Items & Fully Diluted Net Income per Share Excluding Extraordinary Items - Fully Diluted Net Income per Share Including Extraordinary Items - Fully Diluted Common Shares Used in Calculating Fully Diluted Earnings per Share Common Dividends Paid per Share by Ex-Date Annualized Dividend Rate Market Price - High Market Price - Low Market Price - Close Common Shares Traded Average Reserve for Bad Debt Losses on Loans (Restated) Number of Domestic Offices Number of Foreign Offices Average Loans (Restated) Average Assets (Gross) Average Loans (Gross) Average Cash and Due from Banks Average Taxable Investments Average Non-Taxable Investments Average Deposits Average Deposits Time and Savings Average Deposits Demand
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.30
continued
Fields  Label                                                            Type
BA219   Average Borrowings                                               Numeric
BA220   Average Fed Funds Sold & Secs Purchased under Agrmnts to Resell  Numeric
BA221   Average Book Value                                               Numeric
BA222   Average Fed Funds Purchd & Secs Sold under Agmnts to Repurchase  Numeric
BA223   Average Taxable Investments (Restated)                           Numeric
BA224   Average Nontaxable Investments                                   Numeric
BA225   Average Assets (Restated)                                        Numeric
BA226   Average Deposits Time and Savings (Restated)                     Numeric
BA227   Average Deposits Demand (Restated)                               Numeric
BA228   Average Deposits Foreign (Restated)                              Numeric
BA229   Average Borrowings (Restated)                                    Numeric
BA230   Average Long-Term Debt (Restated)                                Numeric
BA231   Average Book Value (Restated)                                    Numeric
BA232   Adjustment Factor Cumulative by Ex-Date                          Numeric
Fields GVKEY CRSPDT RCALDT FISCALDT BQ1 BQ2 BQ3 BQ4 BQ5 BQ6 BQ7 BQ8 BQ9 BQ10 BQ11 BQ12 BQ13 BQ14 BQ15 BQ16 BQ17 BQ18 BQ19 BQ20
Label GVKEY CRSP Date Raw Calendar Trading Date Fiscal Trading Date Cash and Due from Banks U.S. Treasury Securities Securities of Other U.S. Government Agencies and C Due from Banks (Memorandum Entry) Other Securities (Taxable) Total Taxable Investment Securities Obligations of States and Political Subdivisions Total Investment Securities Geographic Designation Code Trading Account Securities Federal Funds Sold & Secs Purchd under Agrmnts to S&P Senior Debt Rating Unearned Discount/Income Loans (Gross) Treasury Stock Dollar Amount - Common Treasury Stock - Dollar Amount - Preferred Interest Income Total (Financial Services) Assets Held for Sale Bank Premises, Furniture, and Fixtures Real Estate Other than Bank Premises
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.31
continued
Fields BQ21 BQ22 BQ23 BQ24 BQ25 BQ26 BQ27 BQ28 BQ29 BQ30 BQ31 BQ32 BQ33 BQ34 BQ35 BQ36 BQ37 BQ38 BQ39 BQ40 BQ41 BQ42 BQ43 BQ44 BQ45 BQ46 BQ47 BQ48 BQ49 BQ50 BQ51 BQ52 BQ53 BQ54 BQ55 BQ56 BQ57 BQ58 BQ59 BQ60 BQ61 BQ62 BQ63 BQ64 BQ65
Label Investments in Nonconsolidated Subsidiaries Direct Lease Financing Customers' Liability to this Bank on Acceptances Other Assets Intangible Assets Aggregate Miscellaneous Assets Total Assets (Gross) Net Interest Margin Risk-Adjusted Capital Ratio - Tier 1 Total Demand Deposits Average Investments Average Loans (Gross) Total Savings Deposits Average Assets (Gross) Average Deposits Demand Total Time Deposits (Other than Savings) Average Deposits Time and Savings Average Deposits Average Deposits Foreign Average Borrowings Average Total Stockholders' Equity Total Deposits Worldwide Total Domestic Deposit Total Foreign Deposits Risk-Adjusted Capital Ratio - Total Federal Funds Purchased & Secs Sold under Agrmnts Commercial Paper Long-Term Debt - Not Classified as Capital Other Liabilities for Borrowed Money Total Borrowings Depreciation and Amortization Mortgage Indebtedness Acceptances Executed by or for Account of this Ban Other Liabilities (Excluding Valuation Reserves) Special Items Blank Total Liabilities (Excluding Valuation Reserves) Minority Interest in Consolidated Subsidiaries Reserve(s) for Bad Debt Losses on Loans Valuation Portion of Reserves for Loan Losses Deferred Portion of Reserve for Loan Losses Contingency Portion of Reserve for Loan Losses Blank Capital Notes and Debentures Blank
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.31
continued
Fields BQ66 BQ67 BQ68 BQ69 BQ70 BQ71 BQ72 BQ73 BQ74 BQ75 BQ76 BQ77 BQ78 BQ79 BQ80 BQ81 BQ82 BQ83 BQ84 BQ85 BQ86 BQ87 BQ88 BQ89 BQ90 BQ91 BQ92 BQ93 BQ94 BQ95 BQ96 BQ97 BQ98 BQ99 BQ100 BQ101 BQ102 BQ103 BQ104 BQ105 BQ106 BQ107 BQ108
Label Preferred Stock Par Value Common Stock Par Value Number of Shares Outstanding Surplus Undivided Profits Reserves for Contingencies & Other Capital Reserve Blank Total Book Value Blank Total Liabilities, Reserves and Capital Accounts Total Capital Accounts and Minority Interest (Invested Capital) Report Date of Quarterly Earnings Per Share Interest and Fees on Loans Interest Inc on Fed Funds Sld & Secs Purchased under Agrmnts to Resell Blank Interest and Discount on U.S. Treasury Securities Interest on Securities of U.S. Government Agencies Interest and Dividends on Other Taxable Securities Total Taxable Investment Revenue Interest on Obligation of States and Political Subdivisions Foreign Exchange Gains and Losses Total Interest and Dividends on Investments Other Interest Income Trust Department Income Service Charges on Deposit Accounts Other Svce Charges, Collections & Exchange Charges Trading Account Income Other Current Operating Revenue Interest on Due from Banks Aggregate Other Current Operating Revenue Total Current Operating Revenue Salaries and Wages of Officers and Employees Pension and Employee Benefits Blank Interest on Deposits Interest Expense on Fed Funds Purchased & Secs Sld Interest on Other Borrowed Money Interest on Long-Term Debt - Not Classified as Cap Total Interest Expense Interest on Capital Notes and Debentures Provision for Loan Losses Occupancy Expense of Bank Premises - Net Furniture and Equipment: Depreciation Rental, Costs, Servicing, Etc.
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.31
continued
Fields BQ109 BQ110 BQ111 BQ112 BQ113 BQ114 BQ115 BQ116 BQ117 BQ118 BQ119 BQ120 BQ121 BQ122 BQ123 BQ124 BQ125 BQ126 BQ127 BQ128 BQ129 BQ130 BQ131 BQ132 BQ133 BQ134 BQ135 BQ136 BQ137 BQ138 BQ139 BQ140 BQ141 BQ142 BQ143 BQ144 BQ145 BQ146
Label Blank Other Current Operating Expense Aggregate Other Current Operating Expense Total Current Operating Expense Current Operating Earnings before Income Tax Income Taxes Applicable to Current Operating Earnings Net Current Operating Earnings Minority Interest (Income Account) Net Current Operating Earnings after Minority Interest Net Pre-Tax Profit or Loss on Securities Sold or Redeemed Blank Blank Tax Effect on Profit or Loss on Securities Sold or Redeemed Minority Interest in After-Tax Profit or Loss on Sec Sold or Redeemed Net After-Tax & After-Min Int Profit or Loss on Secs Sld or Redeemed Net Income Preferred Dividend Deductions Savings Due to Common Stock Equivalents Net Income Available for Common Net Current Operating Earnings Available for Common Cash Dividends Declared on Common Stock Cash Dividends Declared on Preferred Stock Net After-Tax Transfers bet Undivided Profits & Valuation Reserves Total Extraordinary Items Net of Taxes Net Credit or Charge to Reserves for Bad Debts for Loan Recs or Crg-Offs Blank Common Dividends Paid per Share by Payable Date Adjustment Factor Cumulative by Payable Date Net Current Operating Earnings per Share Excluding Extraordinary Items Net Income per Share Excluding Extraordinary Items Net Income per Share Including Extraordinary Items Common Shares Used in Calculating Quarterly Earnings per Share Net Cur Op Erns per Share - Excluding Extraordinary Items 12 Mo Moving Net Income per Share Excluding Extraordinary Items 12 Mo Moving Net Income per Share Including Extraordinary Items 12 Mo Moving Common Shares Used in Calculating 12 Mo Moving Earnings per Share Net Current Op Earngs per Share - Extraordinary Net Income per Share Excluding Extraordinary Items Fully Diluted
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.31
continued
Fields BQ147 BQ148 BQ149 BQ150 BQ151 BQ152 BQ153 BQ154 BQ155 BQ156 BQ157 BQ158 BQ159 BQ160 BQ161 BQ162 BQ163 BQ164 BQ165
Label Net Income per Share Including Extraordinary Items Fully Diluted Common Shares Used in Calculating Quartly Fully Diluted Earnings per Shr Net Cur Op Erngs/Share Ex Extrd Items Fully Diluted 12 Mo Mov Net Inc/Share Excldg Extraordinary Items - Fully Net Inc per Share Inc Extraordinary Items - Fully Diluted - 12 Mo Mov Common Shrs Used in Calc 12 Mo Moving Fully Diluted 12 Mo Mov Market Price 1st Month of Quarter High Market Price 1st Month of Quarter Low Market Price 1st Month of Quarter Close Market Price 2nd Month of Quarter High Market Price 2nd Month of Quarter Low Market Price 2nd Month of Quarter Close Market Price 3rd Month of Quarter High Market Price 3rd Month of Quarter Low Market Price 3rd Month of Quarter Close Common Dividends Paid per Share by Ex-Date Annualized Dividend Rate Common Shares Traded (Quarterly) Adjustment Factor Cumulative by Ex-Date
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Data Set PRCH High Price Time Series PRCL Low Price Time Series PRCC Closing Price Time Series DIV Dividends Per Share Time Series ERN Earnings Per Share Time Series SHSTRD Shares Traded
Fields GVKEY CALDT PRCH GVKEY CALDT PRCL GVKEY CALDT PRCC GVKEY CALDT DIV GVKEY CALDT ERN GVKEY CALDT
Label GVKEY Calendar Trading Date High Price GVKEY Calendar Trading Date Low Price GVKEY Calendar Trading Date Closing Price GVKEY Calendar Trading Date Dividends Per share GVKEY Calendar Trading Date Earnings Per Share GVKEY Calendar Trading Date
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.32
continued
Data Set Time Series DIVRTE Annualized Dividend Rate Time Series RAWADJ Adjustment Factor Time Series CUMADJ Cumulative Adjustment Factor Time Series BKV Book Value Per Share Time Series CHEQVM Cash Equivalent Distribution CSHOQ Common Share Outstanding NAVM Net Asset Value Time Series OEPS12 Earnings/Share From Operations GICS Global Industry Class Standard Code CPSPIN S&P Index Primary Marker Time Series DIVFT Dividends per Share Footnotes RAWADJFT Raw Adjustment Factor Footnotes COMSTAFT Comparability Status Footnotes ISAFT Issue Status Alert Footnotes
Fields SHSTRD GVKEY CALDT DIVRTE GVKEY CALDT RAWADJ GVKEY CALDT CUMADJ GVKEY CALDT BKV GVKEY CALDT CHECQVM GVKEY CALDT CSHOQ GVKEY CALDT NAVM GVKEY CALDT OEPS12 GVKEY CALDT GICS GVKEY CALDT CPSPIN GVKEY CALDT DIVFT GVKEY CALDT RAWADJFT GVKEY CALDT COMSTAFT GVKEY CALDT ISAFT
Label Shares Traded GVKEY Calendar Trading Date Annualized Dividend Rate GVKEY Calendar Trading Date Raw Adjustment Factor GVKEY Calendar Trading Date Cumulative Adjustment Factor GVKEY Calendar Trading Date Book Value Per Share GVKEY Calendar Trading Date Cash Equivalent Distributions GVKEY Calendar Trading Date Common Shares Outstanding GVKEY Calendar Trading Date Net Asset Value GVKEY Calendar Trading Date Earnings/Share from Operations GVKEY Calendar Trading Date Global Industry Class. Std. code GVKEY Calendar Trading Date S&P Index Primary Marker GVKEY Calendar Trading Date Dividends per share footnotes GVKEY Calendar Trading Date Raw adjustment factor footnotes GVKEY Calendar Trading Date Comparability status footnotes GVKEY Calendar Trading Date Issue status alert footnotes
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Character Numeric Numeric Character Numeric Numeric Character Numeric Numeric Character Numeric Numeric Character
Fields    Label                                   Type
GVKEY     GVKEY                                   Numeric
SRCYR     Segment Source year                     Numeric
SRCFYR    Segment Source fiscal year end month    Numeric
CALYR     Calendar Year                           Numeric
RCST1     Reserved 1                              Numeric
SSRCE     Source Document code                    Character
SUCODE    Update code                             Character
CURCD     ISO currency code                       Character
SRCCUR    Source ISO currency code                Character
HNAICS    Segment Primary historical NAICS        Character
Fields    Label                                   Type
GVKEY     GVKEY                                   Numeric
SRCYR     Segment Source year                     Numeric
SRCFYR    Segment Source fiscal year end month    Numeric
CALYR     Calendar Year                           Numeric
PDID      Product Identifier                      Numeric
PSID      Segment Link - segment identifier       Numeric
PSALE     External Revenues                       Numeric
RCST1     Reserved 1                              Numeric
PNAICS    Product NAICS code                      Character
PSTYPE    Segment link - segment type             Character
PNAME     Product Name                            Character
Fields    Label                                   Type
GVKEY     GVKEY                                   Numeric
SRCYR     Segment Source year                     Numeric
SRCFYR    Segment Source fiscal year end month    Numeric
CALYR     Calendar Year                           Numeric
CDID      Customer Identifier (cio)               Numeric
CSID      Segment Link - segment identifier       Numeric
CSALE     Customer Revenues                       Numeric
RCST1     Reserved 1                              Numeric
CTYPE     Customer type                           Character
Table 33.35
continued
Label Geographic area code Geographic area type Segment link - segment type Customer Name
Fields    Label                                   Type
GVKEY     GVKEY                                   Numeric
SRCYR     Segment Source year                     Numeric
SRCFYR    Segment Source fiscal year end month    Numeric
CALYR     Calendar Year                           Numeric
SID       Segment Identifier                      Numeric
RCST1     Reserved 1                              Numeric
STYPE     Segment type                            Character
SOPTP1    Operating segment type 1                Character
SOPTP2    Operating segment type 2                Character
SGEOTP    Geographic segment type                 Character
SNAME     Segment Name                            Character
Fields    Label                                   Type
GVKEY     GVKEY                                   Numeric
SRCYR     Segment Source year                     Numeric
SRCFYR    Segment Source fiscal year end month    Numeric
CALYR     Calendar Year                           Numeric
SID       Segment Identifier                      Numeric
RANK      Ranking                                 Numeric
SIC       Segment SIC Code                        Numeric
RST1      Reserved 1                              Numeric
SNAICS    Segment NAICS code                      Character
STYPE     Segment type                            Character
Fields GVKEY
Label GVKEY
Type Numeric
Table 33.38
continued
Label Segment Source year Segment Source fiscal year end month Calendar Year Segment Identifier Reserved 1 Segment type Geographic area code Geographic segment type
Fields      Label                                         Type
GVKEY       GVKEY                                         Numeric
DATYR       Segment Data year (year)                      Numeric
DATFYR      Segment Data fiscal year end month (fyr)      Numeric
CALYR       Segment Calendar Year (cyr)                   Numeric
SRCYRFYR    Segment Source year and source fiscal (fyr)   Numeric
XRATE       Period end exchange rate                      Numeric
XRATE12     12-month moving Exchange rate                 Numeric
SRCCUR      Source currency code                          Character
CURCD       ISO Currency code (USD)                       Character
Fields    Label                                   Type
GVKEY     GVKEY                                   Numeric
DATYR     Data year (year)                        Numeric
FISCYR    Data Fiscal year end month (fyr)        Numeric
SRCYR     Source year                             Numeric
SRCFYR    Source fiscal year end month            Numeric
CALYR     Data calendar year (cyr)                Numeric
SID       Segment Identifier                      Numeric
EMP       Employees                               Numeric
SALE      Net Sales                               Numeric
OIBD      Operating income before depreciation    Numeric
DP        Depreciation and amortization           Numeric
OIAD      Operating income after depreciation     Numeric
CAPX      Capital expenditures                    Numeric
IAT       Identifiable/total Assets               Numeric
EQEARN    Equity in earnings                      Numeric
INVEQ     Investments at equity                   Numeric
Table 33.40
continued
Fields     Label                                      Type
RD         Research and development                   Numeric
OBKLG      Order backlog                              Numeric
EXPORTS    Export sales                               Numeric
INTSEG     Intersegment eliminations                  Numeric
OPINC      Operating profit                           Numeric
PI         Pretax income                              Numeric
IB         Income Before Extraordinary Items          Numeric
NI         Net Income (loss)                          Numeric
RCST1      Reserved 1                                 Numeric
RCST2      Reserved 2                                 Numeric
RCST3      Reserved 3                                 Numeric
SALEF      Footnote 1 - sales                         Character
OPINCF     Footnote 2 - operating profit              Character
CAPXF      Footnote 3 - capital expenditures          Character
EQEARNF    Footnote 4 - equity in earnings            Character
EMPF       Footnote 5 - employees                     Character
RDF        Footnote 6 - research and development      Character
STYPE      Segment type                               Character
Label Permanent index identification number Permanent index group identification number Index primary link Portfolio number if subset series Index Name Index Group Name
Label INDNO Rebalancing beginning date Rebalancing ending date Count used as of rebalancing Maximum count during period
Table 33.42
continued
Label Available count as of rebalancing Count at end of period Identifier at minimum value Identifier at maximum value Smallest statistic in period Largest statistic in period Median statistic in period Average statistic in period
Fields INDNO RBEGDT1 RBEGDT2 RBEGDT3 RBEGDT4 RBEGDT5 RBEGDT6 RBEGDT7 RBEGDT8 RBEGDT9 RBEGDT10 RENDDT1 RENDDT2 RENDDT3 RENDDT4 RENDDT5 RENDDT6 RENDDT7 RENDDT8 RENDDT9 RENDDT10 USDCNT1 USDCNT2 USDCNT3 USDCNT4 USDCNT5 USDCNT6 USDCNT7 USDCNT8 USDCNT9 USDCNT10
Label INDNO Rebalancing beginning date for port 1 Rebalancing beginning date for port 2 Rebalancing beginning date for port 3 Rebalancing beginning date for port 4 Rebalancing beginning date for port 5 Rebalancing beginning date for port 6 Rebalancing beginning date for port 7 Rebalancing beginning date for port 8 Rebalancing beginning date for port 9 Rebalancing beginning date for port 10 Rebalancing ending date for port 1 Rebalancing ending date for port 2 Rebalancing ending date for port 3 Rebalancing ending date for port 4 Rebalancing ending date for port 5 Rebalancing ending date for port 6 Rebalancing ending date for port 7 Rebalancing ending date for port 8 Rebalancing ending date for port 9 Rebalancing ending date for port 10 Count used as of rebalancing for port 1 Count used as of rebalancing for port 2 Count used as of rebalancing for port 3 Count used as of rebalancing for port 4 Count used as of rebalancing for port 5 Count used as of rebalancing for port 6 Count used as of rebalancing for port 7 Count used as of rebalancing for port 8 Count used as of rebalancing for port 9 Count used as of rebalancing for port10
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.43
continued
Fields MAXCNT1 MAXCNT2 MAXCNT3 MAXCNT4 MAXCNT5 MAXCNT6 MAXCNT7 MAXCNT8 MAXCNT9 MAXCNT10 TOTCNT1 TOTCNT2 TOTCNT3 TOTCNT4 TOTCNT5 TOTCNT6 TOTCNT7 TOTCNT8 TOTCNT9 TOTCNT10 ENDCNT1 ENDCNT2 ENDCNT3 ENDCNT4 ENDCNT5 ENDCNT6 ENDCNT7 ENDCNT8 ENDCNT9 ENDCNT10 MINID1 MINID2 MINID3 MINID4 MINID5 MINID6 MINID7 MINID8 MINID9 MINID10 MAXID1 MAXID2 MAXID3 MAXID4 MAXID5
Label Maximum count during period for port 1 Maximum count during period for port 2 Maximum count during period for port 3 Maximum count during period for port 4 Maximum count during period for port 5 Maximum count during period for port 6 Maximum count during period for port 7 Maximum count during period for port 8 Maximum count during period for port 9 Maximum count during period for port 10 Available count as of rebalancing for port 1 Available count as of rebalancing for port 2 Available count as of rebalancing for port 3 Available count as of rebalancing for port 4 Available count as of rebalancing for port 5 Available count as of rebalancing for port 6 Available count as of rebalancing for port 7 Available count as of rebalancing for port 8 Available count as of rebalancing for port 9 Available count as of rebalancing for port10 Count at end of period for port 1 Count at end of period for port 2 Count at end of period for port 3 Count at end of period for port 4 Count at end of period for port 5 Count at end of period for port 6 Count at end of period for port 7 Count at end of period for port 8 Count at end of period for port 9 Count at end of period for port 10 Identifier at minimum value for port 1 Identifier at minimum value for port 2 Identifier at minimum value for port 3 Identifier at minimum value for port 4 Identifier at minimum value for port 5 Identifier at minimum value for port 6 Identifier at minimum value for port 7 Identifier at minimum value for port 8 Identifier at minimum value for port 9 Identifier at minimum value for port 10 Identifier at maximum value for port 1 Identifier at maximum value for port 2 Identifier at maximum value for port 3 Identifier at maximum value for port 4 Identifier at maximum value for port 5
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.43
continued
Fields MAXID6 MAXID7 MAXID8 MAXID9 MAXID10 MINSTA1 MINSTA2 MINSTA3 MINSTA4 MINSTA5 MINSTA6 MINSTA7 MINSTA8 MINSTA9 MINSTA10 MAXSTA1 MAXSTA2 MAXSTA3 MAXSTA4 MAXSTA5 MAXSTA6 MAXSTA7 MAXSTA8 MAXSTA9 MAXSTA10 MEDSTA1 MEDSTA2 MEDSTA3 MEDSTA4 MEDSTA5 MEDSTA6 MEDSTA7 MEDSTA8 MEDSTA9 MEDSTA10 AVGSTA1 AVGSTA2 AVGSTA3 AVGSTA4 AVGSTA5 AVGSTA6 AVGSTA7 AVGSTA8 AVGSTA9 AVGSTA10
Label Identifier at maximum value for port 6 Identifier at maximum value for port 7 Identifier at maximum value for port 8 Identifier at maximum value for port 9 Identifier at maximum value for port 10 Smallest statistic in period for port 1 Smallest statistic in period for port 2 Smallest statistic in period for port 3 Smallest statistic in period for port 4 Smallest statistic in period for port 5 Smallest statistic in period for port 6 Smallest statistic in period for port 7 Smallest statistic in period for port 8 Smallest statistic in period for port 9 Smallest statistic in period for port 10 Largest statistic in period for port 1 Largest statistic in period for port 2 Largest statistic in period for port 3 Largest statistic in period for port 4 Largest statistic in period for port 5 Largest statistic in period for port 6 Largest statistic in period for port 7 Largest statistic in period for port 8 Largest statistic in period for port 9 Largest statistic in period for port 10 Median statistic in period for port 1 Median statistic in period for port 2 Median statistic in period for port 3 Median statistic in period for port 4 Median statistic in period for port 5 Median statistic in period for port 6 Median statistic in period for port 7 Median statistic in period for port 8 Median statistic in period for port 9 Median statistic in period for port 10 Average statistic in period for port 1 Average statistic in period for port 2 Average statistic in period for port 3 Average statistic in period for port 4 Average statistic in period for port 5 Average statistic in period for port 6 Average statistic in period for port 7 Average statistic in period for port 8 Average statistic in period for port 9 Average statistic in period for port 10
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Label INDNO Issue identifier First date included Last date included Code for subcategory of list Weight during range
Label INDNO Issue identifier First date included Last date included Code for subcategory of list Weight during range
Fields INDNO CALDT USDCNT1 USDCNT2 USDCNT3 USDCNT4 USDCNT5 USDCNT6 USDCNT7 USDCNT8 USDCNT9 USDCNT10 USDCNT11 USDCNT12 USDCNT13 USDCNT14 USDCNT15 USDCNT16 USDCNT17
Label INDNO Calendar Trading Date Used Count for Port 1 Used Count for Port 2 Used Count for Port 3 Used Count for Port 4 Used Count for Port 5 Used Count for Port 6 Used Count for Port 7 Used Count for Port 8 Used Count for Port 9 Used Count for Port 10 Used Count for Port 11 Used Count for Port 12 Used Count for Port 13 Used Count for Port 14 Used Count for Port 15 Used Count for Port 16 Used Count for Port 17
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Fields INDNO CALDT TOTCNT1 TOTCNT2 TOTCNT3 TOTCNT4 TOTCNT5 TOTCNT6 TOTCNT7 TOTCNT8 TOTCNT9 TOTCNT10 TOTCNT11 TOTCNT12 TOTCNT13 TOTCNT14 TOTCNT15 TOTCNT16
Label INDNO Calendar Trading Date Total Count for Port 1 Total Count for Port 2 Total Count for Port 3 Total Count for Port 4 Total Count for Port 5 Total Count for Port 6 Total Count for Port 7 Total Count for Port 8 Total Count for Port 9 Total Count for Port10 Total Count for Port11 Total Count for Port12 Total Count for Port13 Total Count for Port14 Total Count for Port15 Total Count for Port16
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.49
continued
Fields TOTCNT17
Type Numeric
Fields INDNO CALDT USDVAL1 USDVAL2 USDVAL3 USDVAL4 USDVAL5 USDVAL6 USDVAL7 USDVAL8 USDVAL9 USDVAL10 USDVAL11 USDVAL12 USDVAL13 USDVAL14 USDVAL15 USDVAL16 USDVAL17
Label INDNO Calendar Trading Date Used Value for Port 1 Used Value for Port 2 Used Value for Port 3 Used Value for Port 4 Used Value for Port 5 Used Value for Port 6 Used Value for Port 7 Used Value for Port 8 Used Value for Port 9 Used Value for Port 10 Used Value for Port 11 Used Value for Port 12 Used Value for Port 13 Used Value for Port 14 Used Value for Port 15 Used Value for Port 16 Used Value for Port 17
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Fields INDNO CALDT TOTVAL1 TOTVAL2 TOTVAL3 TOTVAL4 TOTVAL5 TOTVAL6 TOTVAL7 TOTVAL8 TOTVAL9 TOTVAL10 TOTVAL11 TOTVAL12 TOTVAL13 TOTVAL14 TOTVAL15 TOTVAL16 TOTVAL17
Label INDNO Calendar Trading Date Total Value for Port 1 Total Value for Port 2 Total Value for Port 3 Total Value for Port 4 Total Value for Port 5 Total Value for Port 6 Total Value for Port 7 Total Value for Port 8 Total Value for Port 9 Total Value for Port10 Total Value for Port11 Total Value for Port12 Total Value for Port13 Total Value for Port14 Total Value for Port15 Total Value for Port16 Total Value for Port17
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Fields INDNO CALDT TRET1 TRET2 TRET3 TRET4 TRET5 TRET6 TRET7 TRET8 TRET9 TRET10 TRET11 TRET12 TRET13 TRET14 TRET15 TRET16 TRET17
Label INDNO Calendar Trading Date Total Returns for Port 1 Total Returns for Port 2 Total Returns for Port 3 Total Returns for Port 4 Total Returns for Port 5 Total Returns for Port 6 Total Returns for Port 7 Total Returns for Port 8 Total Returns for Port 9 Total Returns for Port 10 Total Returns for Port 11 Total Returns for Port 12 Total Returns for Port 13 Total Returns for Port 14 Total Returns for Port 15 Total Returns for Port 16 Total Returns for Port 17
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Label INDNO Calendar Trading Date Appreciation Returns for Port 1 Appreciation Returns for Port 2 Appreciation Returns for Port 3 Appreciation Returns for Port 4 Appreciation Returns for Port 5 Appreciation Returns for Port 6
Table 33.58
continued
Fields ARET7 ARET8 ARET9 ARET10 ARET11 ARET12 ARET13 ARET14 ARET15 ARET16 ARET17
Label Appreciation Returns for Port 7 Appreciation Returns for Port 8 Appreciation Returns for Port 9 Appreciation Returns for Port 10 Appreciation Returns for Port 11 Appreciation Returns for Port 12 Appreciation Returns for Port 13 Appreciation Returns for Port 14 Appreciation Returns for Port 15 Appreciation Returns for Port 16 Appreciation Returns for Port 17
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Fields INDNO CALDT IRET1 IRET2 IRET3 IRET4 IRET5 IRET6 IRET7 IRET8 IRET9 IRET10 IRET11 IRET12 IRET13 IRET14 IRET15 IRET16 IRET17
Label INDNO Calendar Trading Date Income Returns for Port 1 Income Returns for Port 2 Income Returns for Port 3 Income Returns for Port 4 Income Returns for Port 5 Income Returns for Port 6 Income Returns for Port 7 Income Returns for Port 8 Income Returns for Port 9 Income Returns for Port 10 Income Returns for Port 11 Income Returns for Port 12 Income Returns for Port 13 Income Returns for Port 14 Income Returns for Port 15 Income Returns for Port 16 Income Returns for Port 17
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
TIND Group Data Set - Total Return Index Levels Time Series Groups
Table 33.63 TIND Group Data Set - Total Return Index Levels Time Series Groups
Fields INDNO CALDT TIND1 TIND2 TIND3 TIND4 TIND5 TIND6 TIND7 TIND8 TIND9 TIND10 TIND11 TIND12 TIND13 TIND14 TIND15 TIND16 TIND17
Label INDNO Calendar Trading Date Total Return Index Levels for Port 1 Total Return Index Levels for Port 2 Total Return Index Levels for Port 3 Total Return Index Levels for Port 4 Total Return Index Levels for Port 5 Total Return Index Levels for Port 6 Total Return Index Levels for Port 7 Total Return Index Levels for Port 8 Total Return Index Levels for Port 9 Total Return Index Levels for Port 10 Total Return Index Levels for Port 11 Total Return Index Levels for Port 12 Total Return Index Levels for Port 13 Total Return Index Levels for Port 14 Total Return Index Levels for Port 15 Total Return Index Levels for Port 16 Total Return Index Levels for Port 17
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Fields INDNO CALDT AIND1 AIND2 AIND3 AIND4 AIND5 AIND6 AIND7 AIND8 AIND9 AIND10 AIND11 AIND12 AIND13 AIND14 AIND15 AIND16 AIND17
Label INDNO Calendar Trading Date Appreciation Index Levels for Port 1 Appreciation Index Levels for Port 2 Appreciation Index Levels for Port 3 Appreciation Index Levels for Port 4 Appreciation Index Levels for Port 5 Appreciation Index Levels for Port 6 Appreciation Index Levels for Port 7 Appreciation Index Levels for Port 8 Appreciation Index Levels for Port 9 Appreciation Index Levels for Port 10 Appreciation Index Levels for Port 11 Appreciation Index Levels for Port 12 Appreciation Index Levels for Port 13 Appreciation Index Levels for Port 14 Appreciation Index Levels for Port 15 Appreciation Index Levels for Port 16 Appreciation Index Levels for Port 17
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Fields INDNO CALDT IIND1 IIND2 IIND3 IIND4 IIND5 IIND6 IIND7 IIND8 IIND9 IIND10 IIND11 IIND12 IIND13 IIND14 IIND15 IIND16
Label INDNO Calendar Trading Date Income Index Levels for Port 1 Income Index Levels for Port 2 Income Index Levels for Port 3 Income Index Levels for Port 4 Income Index Levels for Port 5 Income Index Levels for Port 6 Income Index Levels for Port 7 Income Index Levels for Port 8 Income Index Levels for Port 9 Income Index Levels for Port 10 Income Index Levels for Port 11 Income Index Levels for Port 12 Income Index Levels for Port 13 Income Index Levels for Port 14 Income Index Levels for Port 15 Income Index Levels for Port 16
Type Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
Table 33.65
continued
Fields IIND17
Type Numeric
References
Center for Research in Security Prices (2003), CRSP/Compustat Merged Database Guide, Chicago: The University of Chicago Graduate School of Business.
Center for Research in Security Prices (2003), CRSP Data Description Guide, Chicago: The University of Chicago Graduate School of Business, [http://www.crsp.uchicago.edu/support/documentation/index.html].
Center for Research in Security Prices (2002), CRSP Programmer's Guide, Chicago: The University of Chicago Graduate School of Business, [http://www.crsp.uchicago.edu/support/documentation/index.html].
Center for Research in Security Prices (2003), CRSPAccess Database Format Release Notes, Chicago: The University of Chicago Graduate School of Business, [http://www.crsp.uchicago.edu/support/documentation/index.html].
Center for Research in Security Prices (2003), CRSP Utilities Guide, Chicago: The University of Chicago Graduate School of Business, [http://www.crsp.uchicago.edu/support/documentation/index.html].
Center for Research in Security Prices (2002), CRSP SFA Guide, Chicago: The University of Chicago Graduate School of Business, [http://www.crsp.uchicago.edu/support/documentation/index.html].
Acknowledgments
Many people have been instrumental in the development of the ETS Interface engine. The individuals listed here have been especially helpful.
Janet Eder, Center for Research in Security Prices, University of Chicago Graduate School of Business.
Ken Kraus, Center for Research in Security Prices, University of Chicago Graduate School of Business.
Bob Spatz, Center for Research in Security Prices, University of Chicago Graduate School of Business.
Rick Langston, SAS Institute, Cary, NC.
Kelly Fellingham, SAS Institute, Cary, NC.
Peng Zang, SAS Institute, Atlanta, GA.
The final responsibility for the SAS System lies with SAS Institute alone. We hope that you will always let us know your opinions about the SAS System and its documentation. It is through your participation that SAS software is continuously improved.
Chapter 34
conversion of very large databases, you may want to define the FAME_TEMP environment variable to point to a location where there is ample space for FAME WORK. SASEFAME provides seamless access to FAME databases via FAME's C Host Language Interface (CHLI). The FAME CHLI does not support writing more than 2 gigabytes to the FAME WORK area; when this limit is exceeded, SASEFAME terminates with a system error. The SASEFAME engine finishes the CHLI whenever a fatal error occurs. To restart the engine after a fatal error, terminate the current SAS session and start a new SAS session.
As shown in Example 34.7, you simply provide the frdb_m port number and node name of your FAME master server in your libref. For more details, refer to "Starting the Master Server" in the Guide to FAME Database Servers.
Table 34.1
Option        Description
CONVERT=      specifies the FAME frequency and the FAME technique
WILDCARD=     specifies a FAME wildcard to match data object series names within the FAME database, which limits the selection of time series that are included in the SAS data set
RANGE=        specifies the range of data to keep in format 'ddmonyyyy'd - 'ddmonyyyy'd
INSET=        uses a SAS data set named setname and/or WHERE= FAME namelist as selection input for BY variables such as tickers or issues stored in a FAME string case series
CROSSLIST=    specifies a FAME crosslist namelist to perform selection based on the crossproduct of two FAME namelists
FAMEOUT=      specifies the FAME data object class/type you want output to the SAS data set
Since physical name specifies the location of the folder where your FAME database resides, it should end in a backslash if you are in a Windows environment, or a forward slash if you are in a UNIX environment. If you are accessing a remote FAME database by using FAME CHLI, you can use the following syntax for physical name:
#port_number@hostname physical_path_name
The following options can be used in the LIBNAME libref SASEFAME statement.
CONVERT=( FREQ=fame_frequency TECH=fame_technique)
specifies the FAME frequency and the FAME technique just as you would in the FAME CONVERT function. There are four possible values for fame_technique: CONSTANT (default), CUBIC, DISCRETE, or LINEAR. All FAME frequencies except PPY and YPP are supported by the SASEFAME engine. For a more complete discussion of FAME frequencies and SAS time intervals, see the section "Mapping FAME Frequencies to SAS Time Intervals" on page 2298. For all possible fame_frequency values, refer to "Understanding Frequencies" in the User's Guide to FAME. For example:
LIBNAME libref sasefame 'physical-name' CONVERT=(TECH=CONSTANT FREQ=TWICEMONTHLY);
WILDCARD="fame_wildcard"
By default, the SASEFAME engine reads all time series in the FAME database that you name in your SASEFAME libref. You can limit the time series read from the FAME database by specifying the WILDCARD= option in your LIBNAME statement. The fame_wildcard is a quoted string containing the FAME wildcard you want to use. The wildcard is matched against the data object names of the series in the FAME database that resides in the library you are assigning. For more information about wildcarding, see "Specifying Wildcards" in the User's Guide to FAME. For example, to read all time series in the TEST library being accessed by the SASEFAME engine, you would specify
LIBNAME test sasefame 'physical name of test database' WILDCARD="?";
To read series with names such as A_DATA, B_DATA, or C_DATA, you could specify
LIBNAME test sasefame 'physical name of test database' WILDCARD="^_DATA";
When you use the WILDCARD= option, you are limiting the number of series that are read and converted to the desired frequency. This option can help you save resources when processing large databases or when processing a large number of observations, such as daily or hourly frequencies. Since the SASEFAME engine uses the FAME WORK database to store the converted time series, using wildcards is recommended to prevent your WORK space from getting too large. When the FAMEOUT= option is also specified, the wildcard is applied to the type of data object series you specify in the FAMEOUT= option.
RANGE='fame_begdt'd-'fame_enddt'd
To limit the time range of data read from your FAME database, specify the RANGE= option in your SASEFAME libref, where fame_begdt is the beginning date in ddmonyyyy format and fame_enddt is the ending date of the range in ddmonyyyy format. As an example, to read a series with a date range that spans the first quarter of 1999, you could use the following statement:
LIBNAME test sasefame 'physical name of test database' RANGE='01jan1999'd - '31mar1999'd;
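The options described so far can be combined in one LIBNAME statement. The following is only a minimal sketch, not one of the manual's numbered examples: the directory path and the database member name mydb are hypothetical placeholders, and the sketch assumes that CONVERT=, WILDCARD=, and RANGE= are accepted together in the same statement.

```sas
/* Allow variable names that contain special characters such as "." */
options validvarname=any;

/* Hypothetical path; it ends in a forward slash as on a UNIX host. */
libname test sasefame '/path/to/famedir/'
   convert=(freq=monthly tech=constant)  /* convert series to monthly      */
   wildcard="^_DATA"                     /* names such as A_DATA, B_DATA   */
   range='01jan1999'd - '31mar1999'd;    /* keep first quarter of 1999     */

/* mydb is a hypothetical FAME database in that directory. */
data work.subset;
   set test.mydb;
run;

proc print data=work.subset;
run;
```

Restricting both the series (WILDCARD=) and the dates (RANGE=) before conversion keeps the amount of data written to the FAME WORK database small.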
INSET=(setname WHERE=fame_bygroup )
When you specify a SAS data set named setname as input for a BY group such as tickers, the SASEFAME engine uses the fame_bygroup to select time series that are named using the following convention. Selected variable names are glued together by the ticker name concatenated with the glue character (such as DOT) to the series name that is specified in the CROSSLIST= option or the fame_namelist. See the following example that uses both the CROSSLIST= option and the INSET= option with the WHERE= clause.
There are two methods for performing the crosslist selection function. The first method uses two FAME namelists, and the second method uses one namelist and one BY group specified in the WHERE= clause of the INSET= option. Using the CROSSLIST= option with the optional fame_namelist1 causes the SASEFAME engine to perform a crossproduct of the first namelist's members with the second namelist's members, using a glue symbol "." to join the two. For example, if your FAME database has a namelist named TICKER defined by
Ticker = {AOL, C, CVX, F, GM, HPQ, IBM, INDUA, INTC, SPX, SUNW, XOM}
and your time series are named in fame_namelist2 as adjust, close, high, low, open, volume, uclose, uhigh, ulow, uopen, uvolume when you specify
LIBNAME test sasefame 'physical name of test database'
   RANGE='01jan1999'd - '31mar1999'd
   CROSSLIST=(nl(ticker), {adjust, close, high, low, open, volume, uclose, uhigh, ulow, uopen, uvolume}) ;
then the 132 variables shown in Table 34.2 are selected by the CROSSLIST= option.
Table 34.2 SAS Variables Selected by CROSSLIST= Option
AOL.ADJUST AOL.CLOSE AOL.HIGH AOL.LOW AOL.OPEN AOL.UCLOSE AOL.UHIGH AOL.ULOW AOL.UOPEN AOL.UVOLUME AOL.VOLUME GM.ADJUST GM.CLOSE GM.HIGH GM.LOW GM.OPEN GM.UCLOSE GM.UHIGH GM.ULOW GM.UOPEN GM.UVOLUME GM.VOLUME
C.ADJUST C.CLOSE C.HIGH C.LOW C.OPEN C.UCLOSE C.UHIGH C.ULOW C.UOPEN C.UVOLUME C.VOLUME HPQ.ADJUST HPQ.CLOSE HPQ.HIGH HPQ.LOW HPQ.OPEN HPQ.UCLOSE HPQ.UHIGH HPQ.ULOW HPQ.UOPEN HPQ.UVOLUME HPQ.VOLUME
CVX.ADJUST CVX.CLOSE CVX.HIGH CVX.LOW CVX.OPEN CVX.UCLOSE CVX.UHIGH CVX.ULOW CVX.UOPEN CVX.UVOLUME CVX.VOLUME IBM.ADJUST IBM.CLOSE IBM.HIGH IBM.LOW IBM.OPEN IBM.UCLOSE IBM.UHIGH IBM.ULOW IBM.UOPEN IBM.UVOLUME IBM.VOLUME
F.ADJUST F.CLOSE F.HIGH F.LOW F.OPEN F.UCLOSE F.UHIGH F.ULOW F.UOPEN F.UVOLUME F.VOLUME INDUA.ADJUST INDUA.CLOSE INDUA.HIGH INDUA.LOW INDUA.OPEN INDUA.UCLOSE INDUA.UHIGH INDUA.ULOW INDUA.UOPEN INDUA.UVOLUME INDUA.VOLUME
Table 34.2
continued
INTC.ADJUST INTC.CLOSE INTC.HIGH INTC.LOW INTC.OPEN INTC.UCLOSE INTC.UHIGH INTC.ULOW INTC.UOPEN INTC.UVOLUME INTC.VOLUME
SPX.ADJUST SPX.CLOSE SPX.HIGH SPX.LOW SPX.OPEN SPX.UCLOSE SPX.UHIGH SPX.ULOW SPX.UOPEN SPX.UVOLUME SPX.VOLUME
SUNW.ADJUST SUNW.CLOSE SUNW.HIGH SUNW.LOW SUNW.OPEN SUNW.UCLOSE SUNW.UHIGH SUNW.ULOW SUNW.UOPEN SUNW.UVOLUME SUNW.VOLUME
XOM.ADJUST XOM.CLOSE XOM.HIGH XOM.LOW XOM.OPEN XOM.UCLOSE XOM.UHIGH XOM.ULOW XOM.UOPEN XOM.UVOLUME XOM.VOLUME
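The crossproduct that yields these names can be illustrated with an ordinary DATA step. This sketch does not access FAME at all; it only assumes that each selected name is the ticker, the glue symbol ".", and the series name joined in that order, as in Table 34.2. The data set name crossnames and its variable names are made up for illustration.

```sas
/* Build every ticker.series combination from the two lists. */
data crossnames;
   length ticker $5 series $7 varname $13;
   do ticker = 'AOL', 'C', 'CVX', 'F', 'GM', 'HPQ',
               'IBM', 'INDUA', 'INTC', 'SPX', 'SUNW', 'XOM';
      do series = 'ADJUST', 'CLOSE', 'HIGH', 'LOW', 'OPEN', 'VOLUME',
                  'UCLOSE', 'UHIGH', 'ULOW', 'UOPEN', 'UVOLUME';
         /* CATX strips blanks and joins the parts with the glue symbol */
         varname = catx('.', ticker, series);
         output;
      end;
   end;
run;

/* Count the generated names. */
proc sql;
   select count(*) as n_names from crossnames;
quit;
```

With 12 tickers and 11 series names, the nested loops write 12 x 11 = 132 rows, which matches the number of variables listed in Table 34.2.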
Instead of using two namelists, you can use the WHERE= clause from your INSET= option to perform the crossproduct of the BY variables specified in your INSET via the WHERE= clause, with the members named in your namelist. Suppose you have defined a SAS input data set named INSETA and you want to use it as input for your CROSSLIST= option instead of using the FAME namelist:
DATA INSETA;
   LENGTH tick $5;
   /* AOL, C, CVX, F, GM, HPQ, IBM, INDUA, INTC, SPX, SUNW, XOM */
   tick='AOL'; output;
   tick='C'; output;
   tick='CVX'; output;
   tick='F'; output;
   tick='GM'; output;
   tick='HPQ'; output;
   tick='IBM'; output;
   tick='INDUA'; output;
   tick='INTC'; output;
   tick='SPX'; output;
   tick='SUNW'; output;
   tick='XOM'; output;
RUN;
LIBNAME test sasefame 'physical name of test database'
   RANGE='01jan1999'd - '31mar1999'd
   INSET=(inseta, where=tick)
   CROSSLIST=( {adjust, close, high, low, open, volume, uclose, uhigh, ulow, uopen, uvolume}) ;
Whether you use a SAS INSET with a WHERE clause or you use a FAME namelist for your CROSSLIST= selection, the two methods are equivalent ways of performing the same selection function. In the preceding example, the FAME ticker namelist corresponds to the SAS INSET's BY variable named TICK.
Note that the WHERE= fame_bygroup must match your BY variable name used in your INSET= in order for the CROSSLIST= option to perform the desired selection. If one of the time series listed in your fame_namelist2 does not exist, the SASEFAME engine stops processing the remainder of the namelist. For complete results you should make sure that your fame_namelist2 is accurate and does not name unknown variables. The same holds true for fame_namelist1 and the BY variable values named in your INSET= option and used in your WHERE= clause.
FAMEOUT=fame_data_object_class_type
specifies the class and type of the FAME data series objects you want in your SAS output data set. The possible values for fame_data_object_class_type are FORMULA, TIME, BOOLEAN, CASE, DATE, and STRING. If the FAMEOUT= option is not specified, numeric time series are output to the SAS data set. FAMEOUT=CASE defaults to case series of numeric type, so if you want another type of case series in your output, then you must specify it. Scalar data objects are not supported.
A more detailed discussion of how to map FAME frequencies to SAS time intervals follows. For other types of data such as string case series, date case series, boolean case series, and formulas, see Example 34.11.
Table 34.3

FAME Frequency             SAS Time Interval
WEEKLY (SUNDAY)            WEEK.2
WEEKLY (MONDAY)            WEEK.3
WEEKLY (TUESDAY)           WEEK.4
WEEKLY (WEDNESDAY)         WEEK.5
WEEKLY (THURSDAY)          WEEK.6
WEEKLY (FRIDAY)            WEEK.7
WEEKLY (SATURDAY)          WEEK.1
BIWEEKLY (ASUNDAY)         WEEK2.2
BIWEEKLY (AMONDAY)         WEEK2.3
BIWEEKLY (ATUESDAY)        WEEK2.4
BIWEEKLY (AWEDNESDAY)      WEEK2.5
BIWEEKLY (ATHURSDAY)       WEEK2.6
BIWEEKLY (AFRIDAY)         WEEK2.7
BIWEEKLY (ASATURDAY)       WEEK2.1
BIWEEKLY (BSUNDAY)         WEEK2.9
BIWEEKLY (BMONDAY)         WEEK2.10
BIWEEKLY (BTUESDAY)        WEEK2.11
BIWEEKLY (BWEDNESDAY)      WEEK2.12
BIWEEKLY (BTHURSDAY)       WEEK2.13
BIWEEKLY (BFRIDAY)         WEEK2.14
BIWEEKLY (BSATURDAY)       WEEK2.8
BIMONTHLY (NOVEMBER)       MONTH2.2
BIMONTHLY                  MONTH2.1
QUARTERLY (OCTOBER)        QTR.2
QUARTERLY (NOVEMBER)       QTR.3
QUARTERLY                  QTR.1
ANNUAL (JANUARY)           YEAR.2
ANNUAL (FEBRUARY)          YEAR.3
ANNUAL (MARCH)             YEAR.4
ANNUAL (APRIL)             YEAR.5
ANNUAL (MAY)               YEAR.6
ANNUAL (JUNE)              YEAR.7
ANNUAL (JULY)              YEAR.8
ANNUAL (AUGUST)            YEAR.9
ANNUAL (SEPTEMBER)         YEAR.10
ANNUAL (OCTOBER)           YEAR.11
ANNUAL (NOVEMBER)          YEAR.12
ANNUAL                     YEAR.1
SEMIANNUAL (JULY)          SEMIYEAR.2
SEMIANNUAL (AUGUST)        SEMIYEAR.3
SEMIANNUAL (SEPTEMBER)     SEMIYEAR.4
SEMIANNUAL (OCTOBER)       SEMIYEAR.5
SEMIANNUAL (NOVEMBER)      SEMIYEAR.6
SEMIANNUAL                 SEMIYEAR.1
YPP                        not supported
PPY                        not supported
SECONDLY                   SECOND
MINUTELY                   MINUTE
HOURLY                     HOUR
DAILY                      DAY
BUSINESS                   WEEKDAY
TENDAY                     TENDAY
TWICEMONTHLY               SEMIMONTH
MONTHLY                    MONTH
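The shifted SAS intervals paired with FAME frequencies here can be explored with the Base SAS INTNX function. The following sketch assumes an arbitrary date of 15 August 1999 and the interval YEAR.7, which corresponds to FAME ANNUAL (JUNE); the variable names are illustrative only.

```sas
data _null_;
   d = '15aug1999'd;
   /* YEAR.7 intervals begin on 1 July, so a year ending in June is one interval */
   int_start = intnx('year.7', d, 0, 'beginning');
   int_end   = intnx('year.7', d, 0, 'end');
   put int_start= date9. int_end= date9.;
run;
```

The same pattern can be applied to the other shifted intervals, such as WEEK.2 or QTR.3, to see where each interval starts and ends before converting a FAME series.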
In the preceding example, the FAME database is called oecd1.db and it resides in the famedir directory. The DATA statement names the SAS output data set a, which will reside in mydir. All time series in the FAME oecd1.db database will be converted to an annual frequency and stored in the mydir.a SAS data set. Since the time series variable names contain the special glue symbol ".", the SAS OPTIONS statement specifies VALIDVARNAME=ANY. Refer to the SAS Language Reference: Dictionary for more about this option. The FAME environment variable is the location of the FAME installation. On Windows, the log would look like this:
1    options validvarname=any;
2    %let FAME=%sysget(FAME);
3    %put(&FAME);
(C:\PROGRA~1\FAME)
4    %let FAMETEMP=%sysget(FAME_TEMP);
5    %put(&FAMETEMP);
(\\ge\U11\saskff\fametemp\)
6
7    libname famedir sasefame "&FAME\util"
8       convert=(freq=annual technique=constant);
NOTE: Libref FAMEDIR was successfully assigned as follows:
      Engine:        FAMECHLI
      Physical Name: C:\PROGRA~1\FAME\util
9
10   libname mydir '\\dntsrc\usrtmp\saskff';
NOTE: Libref MYDIR was successfully assigned as follows:
      Engine:        V9
      Physical Name: \\dntsrc\usrtmp\saskff
11
12   data mydir.a; /* add data set to mydir */
13   set famedir.oecd1;
AUS.DIRDES -- SERIES (NUMERIC by ANNUAL)
AUS.DIRDES copied to work data base as AUS.DIRDES.
For more about the glue DOT character, refer to "Glueing Names Together" in the User's Guide to FAME. In the preceding log, the variable name AUS.DIRDES uses the glue DOT between AUS and DIRDES. The PROC PRINT statement creates Output 34.1.1 showing all of the observations in the mydir.a SAS data set.
Obs    DEU.DIRDES    DNK.DIRDES    ESP.DIRDES
  1       3538.60       258.100       508.200
  2       3777.20       284.800       623.600
  3       2953.30          .          723.600
  4          .             .             .

Obs    FRA.DIRDES    GBR.DIRDES    GRC.DIRDES
  1       2573.50       2627.00        60.600
  2       2856.50       2844.10       119.800
  3       3005.20          .             .
  4          .             .             .

Obs    ISL.DIRDES    ITA.DIRDES    JPN.DIRDES
  1          .          1861.50       9657.20
  2       10.3000       1968.00      10405.90
  3       11.0000       2075.00          .
  4       11.8000       2137.80          .
NZL. PRT. SWE. DIRDES NZL.HERD DIRDES PRT.HERD DIRDES . 143.800 . . 111.5 10158.20 . . . . . . YUG. DIRDES 233.000 205.100 . . . 1076 . .
Example 34.2: Reading Time Series from the FAME Database
The WILDCARD=WSPCA? option limits reading only those series whose names begin with WSPCA. The KEEP statement further restricts the SAS data set to include only the series named WSPCA and the DATE variable. The time interval used for the conversion is TWICEMONTHLY.
Example 34.3: Writing Time Series to the SAS Data Set
Obs    DEU.DIRDES    DNK.DIRDES    ESP.DIRDES
  1       3538.60       258.100       508.200
  2       3777.20       284.800       623.600
  3       2953.30          .          723.600
  4          .             .             .

Obs    FRA.DIRDES    GBR.DIRDES    GRC.DIRDES
  1       2573.50       2627.00        60.600
  2       2856.50       2844.10       119.800
  3       3005.20          .             .
  4          .             .             .

Obs    ISL.DIRDES    NLD.DIRDES    NOR.DIRDES
  1          .            883            .
  2       10.3000         945         308.900
  3       11.0000          .             .
  4       11.8000          .          352.000

Obs    PRT.DIRDES    SWE.DIRDES    YUG.DIRDES
  1        111.5           .          233.000
  2          .           1076         205.100
  3          .             .             .
  4          .             .             .

Obs    NZL.HERD    PRT.HERD    SWE.HERD
  1        .        10158.20       .
  2     143.800        .         11104
  3        .           .           .
  4        .           .           .
Note that the SAS option VALIDVARNAME=ANY was used at the beginning of this example due to special characters being present in the time series names. SAS variables that contain certain special characters are called n-literals and are referenced in SAS code as shown in this example. You can rename your SAS variables by using the RENAME statement as follows. Output 34.3.2 shows how to use n-literals when selecting variables you want to keep, and how to rename some of your kept variables.
options validvarname=any;

%let FAME=%sysget(FAME);
%put(&FAME);
%let FAMETEMP=%sysget(FAME_TEMP);
%put(&FAMETEMP);

libname famedir sasefame "%sysget(FAME_DATA)"
   convert=(freq=annual technique=constant);

libname mydir "%sysget(FAME_TEMP)";

data mydir.a; /* add data set to mydir */
   set famedir.oecd1;
   /* keep and rename */
   keep date 'ita.dirdes'n--'jpn.herd'n 'tur.dirdes'n--'usa.herd'n;
   rename 'ita.dirdes'n='italy.dirdes'n
          'jpn.dirdes'n='japan.dirdes'n
          'tur.dirdes'n='turkey.dirdes'n
          'usa.dirdes'n='united.states.of.america.dirdes'n ;
run;

title3 "keep statement using n-literals";
title4 "rename statement using n-literals";
proc print data=mydir.a;
run;
Obs    united.states.of.america.dirdes
  1                           14786.00
  2                           16566.90
  3                           18326.10
  4                           20246.20
  5                           22159.50
  6                           23556.10
  7                           24953.80
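The n-literal technique can be tried without a FAME database. The following is a minimal sketch under stated assumptions: the data set names DEMO and DEMO2, the new name AUS_DIRDES, and the values 1 and 2 are hypothetical, and only the dotted variable names imitate those produced by the SASEFAME engine.

```sas
options validvarname=any;

/* Hypothetical data set with dotted names like those read from FAME */
data demo;
   'AUS.DIRDES'n = 1;
   'AUS.HERD'n   = 2;
run;

/* KEEP refers to the old name; RENAME gives it a plain SAS name */
data demo2;
   set demo;
   keep 'AUS.DIRDES'n;
   rename 'AUS.DIRDES'n = aus_dirdes;
run;

proc print data=demo2;
run;
```

Renaming to a name without special characters means later steps can refer to the variable without an n-literal.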
Output 34.4.1 Listing of OUT=MYDIR.A of the OECD1 FAME Data Using RANGE= Option
OECD1: TECH=constant, FREQ=annual keep statement using n-literals rename statement using n-literals AUS. AUT. BEL. Obs DATE DIRDES AUS.HERD DIRDES AUT.HERD DIRDES BEL.HERD 1 2 3 1988 1989 1990 CHE. DIRDES 632.100 . . FIN. DIRDES 247.700 259.700 271.000 IRL. DIRDES 49.6000 50.2000 51.7000 750 . . 1072.90 . . . . . . . . 374 . . CAN. DIRDES CAN.HERD 2006 2214 2347
Obs    DEU.DIRDES    DNK.DIRDES    ESP.DIRDES
  1       3538.60       258.100       508.200
  2       3777.20       284.800       623.600
  3       2953.30          .          723.600

Obs    FRA.DIRDES    GBR.DIRDES    GRC.DIRDES
  1       2573.50       2627.00        60.600
  2       2856.50       2844.10       119.800
  3       3005.20          .             .

Obs    ISL.DIRDES    ITA.DIRDES    JPN.DIRDES
  1          .           1861.5       9657.20
  2       10.3000        1968.0      10405.90
  3       11.0000        2075.0          .
NZL. PRT. SWE. DIRDES NZL.HERD DIRDES PRT.HERD DIRDES . 143.800 . 111.5 10158.20 . . . . YUG. DIRDES 233.000 205.100 . . 1076 .
The WHERE statement can be used in the DATA step to process the time ID variable DATE only when it falls in the range you are interested in. In Output 34.4.2, you can see that the result from the WHERE statement is the same as the result in Output 34.4.1 using the RANGE= option.
options validvarname=any;

%let FAME=%sysget(FAME);
%put(&FAME);
%let FAMETEMP=%sysget(FAME_TEMP);
%put(&FAMETEMP);

libname famedir SASEFAME "%sysget(FAME_DATA)"
   convert=(freq=annual technique=constant);

libname mydir "%sysget(FAME_TEMP)";

data mydir.a; /* add data set to mydir */
   set famedir.oecd1;
   /* where only */
   where date between '01jan88'd and '31dec90'd;
run;

proc print data=mydir.a;
run;
Output 34.4.2 Listing of OUT=MYDIR.A of the OECD1 FAME Data Using WHERE Statement
OECD1: TECH=constant, FREQ=annual keep statement using n-literals rename statement using n-literals AUS. AUT. BEL. Obs DATE DIRDES AUS.HERD DIRDES AUT.HERD DIRDES BEL.HERD 1 2 3 1988 1989 1990 CHE. DIRDES 632.100 . . FIN. DIRDES 247.700 259.700 271.000 IRL. DIRDES 49.6000 50.2000 51.7000 750 . . 1072.90 . . . . . . . . 374 . . CAN. DIRDES CAN.HERD 2006 2214 2347
Obs    DEU.DIRDES    DNK.DIRDES    ESP.DIRDES
  1       3538.60       258.100       508.200
  2       3777.20       284.800       623.600
  3       2953.30          .          723.600

Obs    FRA.DIRDES    GBR.DIRDES    GRC.DIRDES
  1       2573.50       2627.00        60.600
  2       2856.50       2844.10       119.800
  3       3005.20          .             .

Obs    ISL.DIRDES    ITA.DIRDES    JPN.DIRDES
  1          .           1861.5       9657.20
  2       10.3000        1968.0      10405.90
  3       11.0000        2075.0          .
NZL. PRT. SWE. DIRDES NZL.HERD DIRDES PRT.HERD DIRDES . 143.800 . 111.5 10158.20 . . . . YUG. DIRDES 233.000 205.100 . . 1076 .
Refer to the SAS Language Reference: Concepts for more information on KEEP, DROP, RENAME, and WHERE statements.
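The effect of a WHERE statement on a date variable can be tried on a small synthetic data set. This is only a sketch: the data sets ANNUAL and SUBSET and the variables YR and VALUE are hypothetical stand-ins for a data set read through the SASEFAME engine, and four-digit years are used so that the result does not depend on the YEARCUTOFF= setting.

```sas
/* Hypothetical annual series: one observation per year, 1985 to 1992 */
data annual;
   format date year4.;
   do yr = 1985 to 1992;
      date  = mdy(1, 1, yr);
      value = yr - 1984;
      output;
   end;
   drop yr;
run;

/* Keep only observations whose DATE falls in 1988 through 1990 */
data subset;
   set annual;
   where date between '01jan1988'd and '31dec1990'd;
run;

proc print data=subset;
run;
```

SUBSET contains the three observations dated 1988, 1989, and 1990, which mirrors the three-observation listings in the preceding outputs.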
Example 34.5: Creating a View Using the SQL Procedure and SASEFAME
The following statements create a view of OECD data by using the SQL procedure's FROM and USING clauses, as shown in Output 34.5.1. Refer to the Base SAS Procedures Guide for details on SQL views.
title1 'famesql5: PROC SQL Dual Embedded Libraries w/ FAME option';
options validvarname=any;

%let FAME=%sysget(FAME);
%put(&FAME);
%let FAMETEMP=%sysget(FAME_TEMP);
%put(&FAMETEMP);

title2 'OECD1: Dual Embedded Library Allocations with FAME Option';
proc sql;
   create view fameview as
      select date, 'fin.herd'n
         from lib1.oecd1
         using libname lib1 sasefame "%sysget(FAME_DATA)"
                  convert=(tech=constant freq=annual),
               libname temp "%sysget(FAME_TEMP)";
quit;

title2 'OECD1: Print of View from Embedded Library with FAME Option';
proc print data=fameview;
run;
The following statements create a view of DRI Basic Economic data by using the SQL procedure's FROM and USING clauses, as shown in Output 34.5.2.
title2 'SUBECON: Dual Embedded Library Allocations with FAME Option';
options validvarname=any;

%let FAME=%sysget(FAME);
%put(&FAME);
%let FAMETEMP=%sysget(FAME_TEMP);
%put(&FAMETEMP);

proc sql;
   create view fameview as
      select date, gaa
         from lib1.subecon
         using libname lib1 sasefame "%sysget(FAME_DATA)"
                  convert=(tech=constant freq=annual),
               libname temp "%sysget(FAME_TEMP)";
quit;

title2 'SUBECON: Print of View from Embedded Library with FAME Option';
proc print data=fameview;
run;
Output 34.5.2 Printout of the FAME View of DRI Basic Economic Data
famesql5: PROC SQL Dual Embedded Libraries w/ FAME option
SUBECON: Print of View from Embedded Library with FAME Option

Obs    DATE          GAA
  1    1946            .
  2    1947            .
  3    1948     23174.00
  4    1949     19003.00
  5    1950     24960.00
  6    1951     21906.00
  7    1952     20246.00
  8    1953     20912.00
  9    1954     21056.00
 10    1955     27168.00
 11    1956     27638.00
 12    1957     26723.00
 13    1958     22929.00
 14    1959     29729.00
 15    1960     28444.00
 16    1961     28226.00
 17    1962     32396.00
 18    1963     34932.00
 19    1964     40024.00
 20    1965     47941.00
 21    1966     51429.00
 22    1967     49164.00
 23    1968     51208.00
 24    1969     49371.00
 25    1970     44034.00
 26    1971     52352.00
 27    1972     62644.00
 28    1973     81645.00
 29    1974     91028.00
 30    1975     89494.00
 31    1976    109492.00
 32    1977    130260.00
 33    1978    154357.00
 34    1979    173428.00
 35    1980    156096.00
 36    1981    147765.00
 37    1982    113216.00
 38    1983    133495.00
 39    1984    146448.00
 40    1985    128521.99
 41    1986    111337.99
 42    1987    160785.00
 43    1988    210532.00
 44    1989    201637.00
 45    1990    218702.00
 46    1991    210666.00
 47    1992            .
 48    1993            .
The following statements create a view of the DB77 database by using the SQL procedure's FROM and USING clauses, as shown in Output 34.5.3.
title2 'DB77: Dual Embedded Library Allocations with FAME Option';
options validvarname=any;

%let FAME=%sysget(FAME);
%put(&FAME);
%let FAMETEMP=%sysget(FAME_TEMP);
%put(&FAMETEMP);

proc sql;
   create view fameview as
      select date, ann, 'qandom.x'n
         from lib1.db77
         using libname lib1 sasefame "%sysget(FAME_DATA)"
                  convert=(tech=constant freq=annual),
               libname temp "%sysget(FAME_TEMP)";
quit;

title2 'DB77: Print of View from Embedded Library with FAME Option';
proc print data=fameview;
run;
The following statements create a view of the DRI economic database by using the SQL procedure's FROM and USING clauses, as shown in Output 34.5.4.
title2 'DRIECON: Dual Embedded Library Allocations with FAME Option';
options validvarname=any;

%let FAME=%sysget(FAME);
%put(&FAME);
%let FAMETEMP=%sysget(FAME_TEMP);
%put(&FAMETEMP);

proc sql;
   create view fameview as
      select date, husts
         from lib1.driecon
         using libname lib1 sasefame "%sysget(FAME_DATA)"
                  convert=(tech=constant freq=annual)
                  range='01jan1980'd - '01jan2006'd ,
               libname temp "%sysget(FAME_TEMP)";
quit;

title2 'DRIECON: Print of View from Embedded Library with FAME Option';
proc print data=fameview;
run;
Note that the SAS option VALIDVARNAME=ANY was used at the beginning of this example because special characters are present in the time series names. The output from this example shows that each FAME view is the output of the SASEFAME engine's processing. Different engine options could have been used in the USING LIBNAME clause if desired.
Output 34.5.4 Printout of the FAME View of DRI Basic Economic Data
famesql5: PROC SQL Dual Embedded Libraries w/ FAME option
DRIECON: Print of View from Embedded Library with FAME Option

Obs    DATE      HUSTS
  1    1980    1.29990
  2    1981    1.09574
  3    1982    1.05862
  4    1983    1.70580
  5    1984    1.76351
  6    1985    1.74258
  7    1986    1.81205
  8    1987    1.62914
  9    1988    1.48748
 10    1989    1.38218
 11    1990    1.20161
 12    1991    1.00878
 13    1992    1.20159
 14    1993    1.29201
 15    1994    1.44669
 16    1995    1.36158
 17    1996    1.46952
 18    1997    1.47760
 19    1998    1.56250
Example 34.6: Reading Other FAME Data Objects with the FAMEOUT= Option
Suppose you want to see the source for your formulas or you are interested in string case series instead of numeric time series. You can designate the data objects that are output to your SAS data set by using the FAMEOUT= option. In this example, the FAMEOUT=FORMULA option selects the formulas and their source definitions to be output, as shown in Output 34.6.1. Note that the RANGE= option is ignored since no time series are selected when FAMEOUT=FORMULA is specified.
options validvarname=any ls=90;

%let FAME=%sysget(FAME);
%put(&FAME);
%let FAMETEMP=%sysget(FAME_TEMP);
%put(&FAMETEMP);

libname lib6 sasefame "%sysget(FAME_DATA)"
   fameout=formula
   convert=(frequency=business technique=constant)
   range='02jan1995'd - '25jul1997'd
   wildcard="?YIELD?" ;

data crout;
   set lib6.training;
   keep 'S.GM.YIELD.A'n -- 'S.XON.YIELD.A'n ;
run;

title2 'Contents of OUT=CROUT from the FAMEOUT=FORMULA Option';
title3 'Using WILDCARD="?YIELD?"';
proc contents data=crout;
run;
Output 34.6.1 Contents of OUT=CROUT from the FAMEOUT=FORMULA Option of the TRAINING FAME Data
famesql5: PROC SQL Dual Embedded Libraries w/ FAME option
Contents of OUT=CROUT from the FAMEOUT=FORMULA Option
Using WILDCARD="?YIELD?"

The CONTENTS Procedure

Alphabetic List of Variables and Attributes

 #    Variable            Type    Len
 1    S.GM.YIELD.A        Char     82
 2    S.GM__PP.YIELD.A    Char     82
 3    S.HWP.YIELD.A       Char     82
 4    S.IBM.YIELD.A       Char     82
 5    S.INDUT.YIELD.A     Char     82
 6    S.SPAL.YIELD.A      Char     82
 7    S.SPALN.YIELD.A     Char     82
 8    S.SUNW.YIELD.A      Char     82
 9    S.XOM.YIELD.A       Char     82
10    S.XON.YIELD.A       Char     82
The FAMEOUT=FORMULA option restricts the SAS data set to include only formulas. The WILDCARD="?YIELD?" option further limits reading to only those formulas whose names have YIELD in them. This output is shown in Output 34.6.2.
options validvarname=any linesize=79;

title2 'PROC PRINT of OUT=CROUT from the FAMEOUT=FORMULA Option';
title3 'Using WILDCARD="?YIELD?"';
proc print data=crout noobs;
run;
Output 34.6.2 Listing of OUT=CROUT from the FAMEOUT=FORMULA Option of the TRAINING FAME Data
famesql5: PROC SQL Dual Embedded Libraries w/ FAME option
PROC PRINT of OUT=CROUT from the FAMEOUT=FORMULA Option
Using WILDCARD="?YIELD?"

S.GM.YIELD.A
   (%SPLC2TF(C37044210X01, IAD_DATE.H, IAD.H)/C37044210X01.CLOSE)*C37044210X01.ADJ
S.GM__PP.YIELD.A
   (%SPLC2TF(C37044210X01, IAD_DATE.H, IAD.H)/C37044210X01.CLOSE)*C37044210X01.ADJ
S.HWP.YIELD.A
   (%SPLC2TF(C42823610X01, IAD_DATE.H, IAD.H)/C42823610X01.CLOSE)*C42823610X01.ADJ
S.IBM.YIELD.A
   (%SPLC2TF(C45920010X01, IAD_DATE.H, IAD.H)/C45920010X01.CLOSE)*C45920010X01.ADJ
S.INDUT.YIELD.A
   (%SPLC2TF(C00000110X00, IAD_DATE.H, IAD.H)/C00000110X00.CLOSE)*C00000110X00.ADJ
S.SPAL.YIELD.A
   (%SPLC2TF(C00000117X00, IAD_DATE.H, IAD.H)/C00000117X00.CLOSE)*C00000117X00.ADJ
S.SPALN.YIELD.A
   (%SPLC2TF(C00000117X00, IAD_DATE.H, IAD.H)/C00000117X00.CLOSE)*C00000117X00.ADJ
S.SUNW.YIELD.A
   (%SPLC2TF(C86681010X60, IAD_DATE.H, IAD.H)/C86681010X60.CLOSE)*C86681010X60.ADJ
S.XOM.YIELD.A
   (%SPLC2TF(C30231G10X01, IAD_DATE.H, IAD.H)/C30231G10X01.CLOSE)*C30231G10X01.ADJ
S.XON.YIELD.A
   (%SPLC2TF(C30231G10X01, IAD_DATE.H, IAD.H)/C30231G10X01.CLOSE)*C30231G10X01.ADJ
Example 34.8: Selecting Time Series Using CROSSLIST= Option and KEEP Statement
This example shows how to use two FAME namelists to perform selection. Note that fame_namelist1 could be easily generated using the FAME WILDLIST function. For more about WILDLIST, refer to The WILDLIST Function in the FAME Command Reference Volume 2, Functions. In the following statements, 11 tickers are selected in fame_namelist1, but when you use the KEEP statement, the resulting data set contains only the desired IBM ticker shown in Output 34.8.1 and Output 34.8.2.
libname lib8 sasefame "%sysget(FAME_DATA)"
   convert=(frequency=business technique=constant)
   crosslist=( { IBM,SPALN,SUNW,XOM },
               { adjust, close, high, low, open, volume,
                 uclose, uhigh, ulow,uopen,uvolume } );

data trout; /* eleven companies, keep only the IBM ticker this time */
   set lib8.training;
   where date between '01mar02'd and '20mar02'd;
   keep IBM: ;
run;

title2 'Contents of OUT=trout for BYGROUP Tick=IBM.';
proc contents data=trout;
run;

title2 'TRAINING DB, Pricing Timeseries for IBM Ticker in CROSSLIST=';
title3 'OUT=TROUT from the PRINT Procedure';
proc print data=trout;
run;
Output 34.8.1 Contents of the IBM Time Series in the Training FAME Data
DRIECON Database, Using FAME with REMOTE ACCESS VIA CHLI
Contents of OUT=trout for BYGROUP Tick=IBM.

The CONTENTS Procedure

Alphabetic List of Variables and Attributes

 #    Variable       Type    Len
 1    IBM.ADJUST     Num       8
 2    IBM.CLOSE      Num       8
 3    IBM.HIGH       Num       8
 4    IBM.LOW        Num       8
 5    IBM.OPEN       Num       8
 6    IBM.UCLOSE     Num       8
 7    IBM.UHIGH      Num       8
 8    IBM.ULOW       Num       8
 9    IBM.UOPEN      Num       8
10    IBM.UVOLUME    Num       8
11    IBM.VOLUME     Num       8
Output 34.8.2 Listing of Ticker IBM Time Series in the Training FAME Data
DRIECON Database, Using FAME with REMOTE ACCESS VIA CHLI
TRAINING DB, Pricing Timeseries for IBM Ticker in CROSSLIST=
OUT=TROUT from the PRINT Procedure

Obs  IBM.ADJUST  IBM.CLOSE  IBM.HIGH  IBM.LOW  IBM.OPEN  IBM.UCLOSE
  1       1       103.020   103.100    98.500   98.600    103.020
  2       1       105.900   106.540   103.130  103.350    105.900
  3       1       105.670   106.500   104.160  104.250    105.670
  4       1       106.300   107.090   104.750  105.150    106.300
  5       1       103.710   107.500   103.240  107.300    103.710
  6       1       105.090   107.340   104.820  104.820    105.090
  7       1       105.240   105.970   103.600  104.350    105.240
  8       1       108.500   108.850   105.510  105.520    108.500
  9       1       107.180   108.650   106.700  108.300    107.180
 10       1       106.600   107.950   106.590  107.020    106.600
 11       1       106.790   107.450   105.590  106.550    106.790
 12       1       106.350   108.640   106.230  107.100    106.350
 13       1       107.490   108.050   106.490  106.850    107.490
 14       1       105.500   106.900   105.490  106.900    105.500

Obs  IBM.UHIGH  IBM.ULOW  IBM.UOPEN  IBM.UVOLUME  IBM.VOLUME
  1   103.100     98.500    98.600      104890      104890
  2   106.540    103.130   103.350      107650      107650
  3   106.500    104.160   104.250       75617       75617
  4   107.090    104.750   105.150       76874       76874
  5   107.500    103.240   107.300      109720      109720
  6   107.340    104.820   104.820      107260      107260
  7   105.970    103.600   104.350       86391       86391
  8   108.850    105.510   105.520      110640      110640
  9   108.650    106.700   108.300       64086       64086
 10   107.950    106.590   107.020       53335       53335
 11   107.450    105.590   106.550      108640      108640
 12   108.640    106.230   107.100       53048       53048
 13   108.050    106.490   106.850       46148       46148
 14   106.900    105.490   106.900       48367       48367
Example 34.9: Selecting Time Series Using CROSSLIST= Option with a FAME Namelist of Tickers
This example demonstrates selection by using the CROSSLIST= option. Of the 11 companies in the FAME ticker namelist, only the ticker IBM is specified in the KEEP statement. The results are shown in Output 34.9.1 and Output 34.9.2.
libname lib9 sasefame "%sysget(FAME_DATA)"
   convert=(frequency=business technique=constant)
   range='07jul1997'd - '25jul1997'd
   crosslist=( nl(ticker),
               { adjust, close, high, low, open, volume,
                 uclose, uhigh, ulow, uopen, uvolume } );

data crout; /* eleven companies in the FAME ticker namelist */
   set lib9.training;
   keep IBM: ;
run;

title2 'TRAINING DB, Pricing Timeseries for eleven Tickers in CROSSLIST=';
title3 'OUT=CROUT from the PRINT Procedure';
proc print data=crout;
run;

title2 'Contents of OUT=crout from the FAME Crosslist function';
title3 'Using TICKER namelist.';
proc contents data=crout;
run;
Output 34.9.1 Listing of OUT=CROUT Using CROSSLIST= Option in the Training FAME Data
DRIECON Database, Using FAME with REMOTE ACCESS VIA CHLI
Contents of OUT=crout from the FAME Crosslist function
Using TICKER namelist.

Obs  IBM.ADJUST  IBM.CLOSE  IBM.HIGH  IBM.LOW  IBM.OPEN  IBM.UCLOSE
  1      0.5      47.2500   47.7500   47.0000  47.5000     94.500
  2      0.5      47.8750   47.8750   47.2500  47.2500     95.750
  3      0.5      48.0938   48.3438   47.6563  48.0000     96.188
  4      0.5      47.8750   48.0938   47.0313  47.3438     95.750
  5      0.5      47.8750   48.6875   47.8125  47.9063     95.750
  6      0.5      47.6250   48.2188   47.0000  47.8125     95.250
  7      0.5      48.0000   48.1250   46.6875  47.4375     96.000
  8      0.5      48.8125   49.0000   47.6875  47.8750     97.625
  9      0.5      49.8125   50.8750   48.5625  48.9063     99.625
 10      0.5      52.2500   52.6250   50.0000  50.0000    104.500
 11      0.5      51.8750   53.1563   51.0938  52.6250    103.750
 12      0.5      51.5000   51.7500   49.6875  50.0313    103.000
 13      0.5      52.5625   53.5000   51.5938  52.1875    105.125
 14      0.5      53.9063   54.2188   52.2500  52.8125    107.813
 15      0.5      53.5000   54.2188   52.8125  53.9688    107.000

Obs  IBM.UHIGH  IBM.ULOW  IBM.UOPEN  IBM.UVOLUME  IBM.VOLUME
  1    95.500    94.000    95.000      129012        64506
  2    95.750    94.500    94.500      102796        51398
  3    96.688    95.313    96.000      177276        88638
  4    96.188    94.063    94.688      127900        63950
  5    97.375    95.625    95.813      137724        68862
  6    96.438    94.000    95.625      128976        64488
  7    96.250    93.375    94.875      149612        74806
  8    98.000    95.375    95.750      215440       107720
  9   101.750    97.125    97.813      315504       157752
 10   105.250   100.000   100.000      463480       231740
 11   106.313   102.188   105.250      328184       164092
 12   103.500    99.375   100.063      368276       184138
 13   107.000   103.188   104.375      219880       109940
 14   108.438   104.500   105.625      204088       102044
 15   108.438   105.625   107.938      146600        73300
Output 34.9.2 Contents of OUT=CROUT Using CROSSLIST= Option in the Training FAME Data
Alphabetic List of Variables and Attributes

 #    Variable       Type    Len
 1    IBM.ADJUST     Num       8
 2    IBM.CLOSE      Num       8
 3    IBM.HIGH       Num       8
 4    IBM.LOW        Num       8
 5    IBM.OPEN       Num       8
 6    IBM.UCLOSE     Num       8
 7    IBM.UHIGH      Num       8
 8    IBM.ULOW       Num       8
 9    IBM.UOPEN      Num       8
10    IBM.UVOLUME    Num       8
11    IBM.VOLUME     Num       8
Example 34.10: Selecting Time Series Using CROSSLIST= Option with INSET= and WHERE=TICK
Suppose that instead of having a FAME namelist with the tickers for companies whose data you are interested in, you have an input SAS data set (INSET) that specifies the tickers to select. You can specify your selection by using the WHERE= clause of the INSET= option, as in the following statements. The results are shown in Output 34.10.1 and Output 34.10.2.
data inseta;
   length tick $5; /* need $5 so SPALN is not truncated */
   tick='AOL';   output;
   tick='C';     output;
   tick='CPQ';   output;
   tick='CVX';   output;
   tick='F';     output;
   tick='GM';    output;
   tick='HWP';   output;
   tick='IBM';   output;
   tick='SPALN'; output;
   tick='SUNW';  output;
   tick='XOM';   output;
run;

libname lib10 sasefame "%sysget(FAME_DATA)"
   convert=(frequency=business technique=constant)
   range='07jul1997'd - '25jul1997'd
   inset=( inseta where=tick )
   crosslist=
      ( {adjust, close, high, low, open, volume,
         uclose, uhigh, ulow,uopen,uvolume} );

data trout; /* eleven companies with unique TICKs specified in INSETA */
   set lib10.training;
   keep IBM: ;
run;

title2 'TRAINING DB, Pricing Timeseries for eleven Tickers in CROSSLIST=';
title3 'OUT=TROUT from the PRINT Procedure';
proc print data=trout;
run;

title2 'Contents of OUT=trout from the FAME Crosslist function';
title3 'Using INSET with WHERE=TICK.';
proc contents data=trout;
run;
Output 34.10.1 Listing of OUT=TROUT Using CROSSLIST= and INSET= Options in the Training FAME Data
DRIECON Database, Using FAME with REMOTE ACCESS VIA CHLI
Contents of OUT=trout from the FAME Crosslist function
Using INSET with WHERE=TICK.

Obs  IBM.ADJUST  IBM.CLOSE  IBM.HIGH  IBM.LOW  IBM.OPEN  IBM.UCLOSE
  1      0.5      47.2500   47.7500   47.0000  47.5000     94.500
  2      0.5      47.8750   47.8750   47.2500  47.2500     95.750
  3      0.5      48.0938   48.3438   47.6563  48.0000     96.188
  4      0.5      47.8750   48.0938   47.0313  47.3438     95.750
  5      0.5      47.8750   48.6875   47.8125  47.9063     95.750
  6      0.5      47.6250   48.2188   47.0000  47.8125     95.250
  7      0.5      48.0000   48.1250   46.6875  47.4375     96.000
  8      0.5      48.8125   49.0000   47.6875  47.8750     97.625
  9      0.5      49.8125   50.8750   48.5625  48.9063     99.625
 10      0.5      52.2500   52.6250   50.0000  50.0000    104.500
 11      0.5      51.8750   53.1563   51.0938  52.6250    103.750
 12      0.5      51.5000   51.7500   49.6875  50.0313    103.000
 13      0.5      52.5625   53.5000   51.5938  52.1875    105.125
 14      0.5      53.9063   54.2188   52.2500  52.8125    107.813
 15      0.5      53.5000   54.2188   52.8125  53.9688    107.000

Obs  IBM.UHIGH  IBM.ULOW  IBM.UOPEN  IBM.UVOLUME  IBM.VOLUME
  1    95.500    94.000    95.000      129012        64506
  2    95.750    94.500    94.500      102796        51398
  3    96.688    95.313    96.000      177276        88638
  4    96.188    94.063    94.688      127900        63950
  5    97.375    95.625    95.813      137724        68862
  6    96.438    94.000    95.625      128976        64488
  7    96.250    93.375    94.875      149612        74806
  8    98.000    95.375    95.750      215440       107720
  9   101.750    97.125    97.813      315504       157752
 10   105.250   100.000   100.000      463480       231740
 11   106.313   102.188   105.250      328184       164092
 12   103.500    99.375   100.063      368276       184138
 13   107.000   103.188   104.375      219880       109940
 14   108.438   104.500   105.625      204088       102044
 15   108.438   105.625   107.938      146600        73300
Output 34.10.2 Contents of OUT=TROUT Using CROSSLIST= and INSET= Options in the Training FAME Data
Alphabetic List of Variables and Attributes

 #    Variable       Type    Len
 1    IBM.ADJUST     Num       8
 2    IBM.CLOSE      Num       8
 3    IBM.HIGH       Num       8
 4    IBM.LOW        Num       8
 5    IBM.OPEN       Num       8
 6    IBM.UCLOSE     Num       8
 7    IBM.UHIGH      Num       8
 8    IBM.ULOW       Num       8
 9    IBM.UOPEN      Num       8
10    IBM.UVOLUME    Num       8
11    IBM.VOLUME     Num       8
title3 'Using FAMEOUT=CASE BOOLEAN option without range';
proc contents data=booout;
run;

title2 'ALLTYPES FAMEOUT=BOOLCASE and open wildcard for BOOLEAN CASE Series';
title3 'OUT=BOOOUT from the PRINT Procedure';
proc print data=booout;
run;
Output 34.11.1 Contents of OUT=BOOOUT Using FAMEOUT=BOOLCASE and Open Wildcard for Boolean Case Series
***famallt: FAMEOUT option, Different Type Values***
Contents of OUT=booout
Using FAMEOUT=CASE BOOLEAN option without range

The CONTENTS Procedure

Alphabetic List of Variables and Attributes

 #    Variable    Type    Len
 1    BOO0        Num       8
 2    BOO1        Num       8
 3    BOO2        Num       8
 4    BOOM        Num       8
 5    BOO_RES     Num       8
Output 34.11.2 Listing of OUT=BOOOUT Using FAMEOUT=BOOLCASE and Open Wildcard for Boolean Case Series
***famallt: FAMEOUT option, Different Type Values***
ALLTYPES FAMEOUT=BOOLCASE and open wildcard for BOOLEAN CASE Series
OUT=BOOOUT from the PRINT Procedure

Obs    BOO0    BOO1    BOO2    BOOM    BOO_RES
  1      0       1       0       1        .
  2      0       0       1       0        .
  3      0       0       0     251        .
  4      0       1       1       1        .
  5      0       1       0       1        .
  6      0       0       .       0        .
  7      0       0       .       0        .
  8      0       1       .       1        .
  9      0       .       0       .        .
 10      0       .       .       .        .
 11      1       .       .       .        .
 12      1       .       .       .        .
 13      1       .       1       .        .
 14      1       .       .       .        .
 15      1       .       .       .        .
 16      1       .       .       .        .
 17      1       .       0       .        .
 18      1       .       .       .        .
 19      1       .       .       .        .
 20      1       .       .       .        .
Suppose instead of boolean case series, you prefer to see numeric case series. Case series can be numeric or boolean or string series or date series. In addition to the existing case series in your FAME database, you can have formulas that resolve to numeric case series. SASEFAME will resolve all formulas that belong to the class and type of series data object that you specify in your FAMEOUT= option. The following statements write all numeric case series to your SAS data set. The results are shown in Output 34.11.3 and Output 34.11.4.
libname lib5 sasefame "%sysget(FAME_DATA)"
   fameout=case
   wildcard="?" ;

data csout;
   set lib5.alltypes;
run;

title2 'Contents of OUT=csout';
title3 'Using FAMEOUT=CASE option without range';
proc contents data=csout;
run;

title2 'ALLTYPES, FAMEOUT=CASE and open wildcard for Numeric Case Series';
title3 'OUT=CSOUT from the PRINT Procedure';
proc print data=csout;
run;
Output 34.11.3 Contents of OUT=CSOUT Using FAMEOUT=CASE and Open Wildcard for Numeric Case Series
***famallt: FAMEOUT option, Different Type Values***
Contents of OUT=csout
Using FAMEOUT=CASE option without range

The CONTENTS Procedure

Alphabetic List of Variables and Attributes

 #    Variable    Type    Len
 1    FRM1        Num       8
 2    NUM0        Num       8
 3    NUM1        Num       8
 4    NUM2        Num       8
 5    NUMM        Num       8
 6    NUM_RES     Num       8
 7    PRC0        Num       8
 8    PRC1        Num       8
 9    PRC2        Num       8
10    PRCM        Num       8
11    PRC_RES     Num       8
Output 34.11.4 Listing of OUT=CSOUT Using FAMEOUT=CASE and Open Wildcard for Numeric Case Series
***famallt: FAMEOUT option, Different Type Values***
ALLTYPES, FAMEOUT=CASE and open wildcard for Numeric Case Series
OUT=CSOUT from the PRINT Procedure

Obs    NUM0    NUM1         NUMM    NUM_RES    PRC1         PRCM    PRC_RES
  1     -9       0            0        .         0            0        .
  2     -8       1            1        .         1            1        .
  3     -7       2    1.7014E38        .         2    1.7014E38        .
  4     -6       3            3        .         3            3        .
  5     -5       4            4        .         4            4        .
  6     -4       5            5        .         5            5        .
  7     -3       6            6        .         6            6        .
  8     -2       7            7        .         7            7        .
  9     -1       .            .        .         .            .        .
 10      0       .            .        .         .            .        .
 11      1       .            .        .         .            .        .
 12      2       .            .        .         .            .        .
 13      3       .            .        .         .            .        .
 14      4       .            .        .         .            .        .
 15      5       .            .        .         .            .        .
 16      6       .            .        .         .            .        .
 17      7       .            .        .         .            .        .
 18      8       .            .        .         .            .        .
 19      9       .            .        .         .            .        .
 20     10       .            .        .         .            .        .
Instead of numeric case series, you could decide to extract date case series. Case series can be numeric or boolean or string series or date series. In addition to the existing case series in your FAME database, you can have formulas that resolve to date case series. SASEFAME will resolve all formulas that belong to the class and type of series data object that you specify in your FAMEOUT= option. The following statements write all date case series to your SAS data set. The results are shown in Output 34.11.5 and Output 34.11.6.
libname lib6 sasefame "%sysget(FAME_DATA)"
   fameout=datecase
   wildcard="?" ;

data cdout;
   set lib6.alltypes;
run;

title2 'Contents of OUT=cdout';
title3 'Using FAMEOUT=DATECASE option without range';
proc contents data=cdout;
run;

title2 'ALLTYPES, FAMEOUT=DATECASE and open wildcard for Date Case Series';
title3 'OUT=CDOUT from the PRINT Procedure';
proc print data=cdout;
run;
The next example shows how to extract string case series. Case series can be numeric or boolean or string series or date series. In addition to the existing string case series in your FAME database, you can have formulas that resolve to string case series. SASEFAME will resolve all formulas that belong to the class and type of series data object that you specify in your FAMEOUT= option. The following statements write all string case series to your SAS data set. The results are shown in Output 34.11.7 and Output 34.11.8.
libname lib7 sasefame "%sysget(FAME_DATA)"
   fameout=stringcase
   wildcard="?" ;

data cstrout;
   set lib7.alltypes;
run;

title2 'Contents of OUT=cstrout';
title3 'Using FAMEOUT=STRINGCASE option without range';
proc contents data=cstrout;
run;

title2 'ALLTYPES, FAMEOUT=STRINGCASE and open wildcard for STRING CASE Series';
title3 'OUT=CSTROUT from the PRINT Procedure';
proc print data=cstrout;
run;
Output 34.11.7 Contents of OUT=CSTROUT Using FAMEOUT=STRINGCASE and Open Wildcard for String Case Series
***famallt: FAMEOUT option, Different Type Values***
Contents of OUT=cstrout
Using FAMEOUT=STRINGCASE option without range

The CONTENTS Procedure

Alphabetic List of Variables and Attributes

 #    Variable    Type    Len
 1    STR0        Char     16
 2    STR1        Char     16
 3    STR2        Char     16
 4    STRM        Char     16
Output 34.11.8 Listing of OUT=CSTROUT Using FAMEOUT=STRINGCASE and Open Wildcard for String Case Series
***famallt: FAMEOUT option, Different Type Values*** ALLTYPES, FAMEOUT=STRINGCASE and open wildcard for STRING CASE Series OUT=CSTROUT from the PRINT Procedure Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 STR0 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 STR1 0 1 2 3 4 5 6 7 STR2 1.333333 1 0.6666667 0.3333333 0 STRM 0 1 2 3 4 5 7 -1.333333
-2.666667
-4
Suppose you prefer to extract all the source for the formulas in your FAME database. The following statements show the source of all formulas written to your SAS data set. The results are shown in Output 34.11.9 and Output 34.11.10. Another example of FAMEOUT=FORMULA option is shown in Example 34.6.
   libname lib8 sasefame "%sysget(FAME_DATA)"
      fameout=formula wildcard="?" ;

   data cforout;
      set lib8.alltypes;
   run;

   title2 'Contents of OUT=cforout';
   title3 'Using FAMEOUT=FORMULA option without range';
   proc contents data=cforout;
   run;

   title2 'ALLTYPES, FAMEOUT=FORMULA and open wildcard for FORMULA Series';
   title3 'OUT=CFOROUT from the PRINT Procedure';
   proc print data=cforout noobs;
   run;
If you want all series of every type, you can merge the resulting data sets together. For more about merging SAS data sets, see SAS Language Reference: Concepts.
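As a minimal sketch of such a merge, the following statements combine the two data sets created above, cstrout and cforout, observation by observation. The sketch assumes that the two data sets have no variable names in common and that a one-to-one merge by position is what you want; the output name call_types is hypothetical.

```sas
/* One-to-one merge: observation n of each input data set is       */
/* combined into observation n of the output. No BY statement is   */
/* used, so rows are matched by position only.                     */
data call_types;              /* hypothetical output data set name */
   merge cstrout cforout;
run;

proc contents data=call_types;
run;
```

If two inputs share a variable name, the value from the data set listed last in the MERGE statement is the one kept, so rename variables first if that is not intended.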
Acknowledgments
Many people have been instrumental in the development of the ETS Interface engine. The individuals listed here have been especially helpful.

Jeff Kaplan, SunGard Data Management Solutions, Ann Arbor, MI
Rick Langston, SAS Institute, Cary, NC
Kelly Fellingham, SAS Institute, Cary, NC

The final responsibility for the SAS System lies with SAS Institute alone. We hope that you will always let us know your opinions about the SAS System and its documentation. It is through your participation that SAS software is continuously improved.
Chapter 35
from the HAVER database. To force aggregation of all selected variables to the frequency specified by the FREQ= option, the FORCE=FREQ option can be used. Aggregation is only supported from a more frequent time interval to a less frequent time interval, such as from weekly to monthly. If a conversion to a more frequent frequency is attempted, all missing values are returned by the HAVER DLX API (application programming interface). See "Aggregating to Quarterly Frequency Using the FORCE=FREQ Option" on page 2350. The FORCE= option is ignored if the FREQ= option is not specified.
ERROR: No variable selected with current options.
ERROR: No variable selected with current options.
ERROR: No variable selected with current options.
ERROR: No variable selected with current options.
ERROR: No variable selected with current options.
When this happens, you get one error message for each input database that contains no variables matching the options selected on the SASEHAVR libref that you double-clicked.
Table 35.1
Option         Description
FREQ=          specifies the HAVER frequency
FREQUENCY=     specifies the HAVER frequency
INTERVAL=      specifies the HAVER frequency
START=         specifies a HAVER start date to limit the selection of time series that begins with the specified date
STARTDATE=     specifies a HAVER start date to limit the selection of time series that begins with the specified date
STDATE=        specifies a HAVER start date to limit the selection of time series that begins with the specified date
BEGIN=         specifies a HAVER start date to limit the selection of time series that begins with the specified date
END=           specifies a HAVER end date to limit the selection of time series that ends with the specified date
ENDDATE=       specifies a HAVER end date to limit the selection of time series that ends with the specified date
ENDATE=        specifies a HAVER end date to limit the selection of time series that ends with the specified date
KEEP=          specifies a list of comma-delimited HAVER variables to keep in the output SAS data set
DROP=          specifies a list of comma-delimited HAVER variables to drop from the output SAS data set
GROUP=         specifies a list of comma-delimited HAVER groups to keep in the output SAS data set
KEEPGROUP=     specifies a list of comma-delimited HAVER groups to keep in the output SAS data set
DROPGROUP=     specifies a list of comma-delimited HAVER groups to drop from the output SAS data set
SOURCE=        specifies a list of comma-delimited HAVER sources to keep in the output SAS data set
KEEPSOURCE=    specifies a list of comma-delimited HAVER sources to keep in the output SAS data set
DROPSOURCE=    specifies a list of comma-delimited HAVER sources to drop from the output SAS data set
FORCE=FREQ     specifies that all selected variables should be aggregated to the frequency specified in the FREQ= option
The physical name specifies the location of the folder where your HAVER DLX database resides.
The following options can be used in the LIBNAME libref SASEHAVR statement:
FREQ=haver_frequency
specifies the HAVER frequency. All HAVER frequencies are supported by the SASEHAVR engine. Accepted frequency values are annual, year, yearly, quarter, quarterly, qtr, monthly, month, mon, week.1, week.2, week.3, week.4, week.5, week.6, week.7, weekly, week, daily, day.
START=start_date
specifies the start date for the time series in the form YYYYMMDD.
END=end_date
specifies the end date for the time series in the form YYYYMMDD.
KEEP=haver_variables
specifies the list of HAVER variables to be included in the output SAS data set. This list is comma-delimited and must be surrounded by quotes.
DROP=haver_variables
specifies the list of HAVER variables to be excluded from the output SAS data set. This list is comma-delimited and must be surrounded by quotes.
GROUP=haver_groups
specifies the list of HAVER groups to be included in the output SAS data set. This list is comma-delimited and must be surrounded by quotes.
DROPGROUP=haver_groups
specifies the list of HAVER groups to be excluded from the output SAS data set. This list is comma-delimited and must be surrounded by quotes.
SOURCE=haver_sources
specifies the list of HAVER sources to be included in the output SAS data set. This list is comma-delimited and must be surrounded by quotes.
DROPSOURCE=haver_sources
specifies the list of HAVER sources to be excluded from the output SAS data set. This list is comma-delimited and must be surrounded by quotes.
FORCE=FREQ
specifies that the selected variables are to be aggregated to the frequency in the FREQ= option. Aggregation is only supported from a more frequent time interval to a less frequent time interval, such as from weekly to monthly. See "Aggregating to Quarterly Frequency Using the FORCE=FREQ Option" on page 2350 for sample output and for suggested error recovery when an attempted conversion to a higher frequency yields missing values. This option is ignored if the FREQ= option is not set. For a more complete discussion of HAVER frequencies and SAS time intervals, see the section "Mapping HAVER Frequencies to SAS Time Intervals" on page 2346. As an example,

   LIBNAME libref sasehavr 'physical-name' FREQ=MONTHLY;
By default, the SASEHAVR engine reads all time series in the HAVER database that you reference when using your SASEHAVR libref. The haver_startdate is specified in the form YYYYMMDD. The start date limits the data to observations that begin at the specified date. For example, to read the time series in the TEST library starting on July 4, 1996, you would specify

   LIBNAME test sasehavr 'physical-name' STARTDATE=19960704;
When you use the START= option, you are limiting the range of observations that are read from the time series and that are converted to the desired frequency. Start dates can help you save resources when processing large databases or when processing a large number of observations. It is also possible to select specific variables to be included or excluded from the SAS data set by using the KEEP= or the DROP= option.

   LIBNAME test sasehavr 'physical-name' KEEP="ABC*, XYZ??";
   LIBNAME test sasehavr 'physical-name' DROP="*SC*, #T#";
When the KEEP= or the DROP= option is used, the resulting SAS data set keeps or drops the variables that you select in that option. Three wildcards are currently available: *, ?, and #. The * wildcard matches any character string at that position in the variable name. The ? wildcard matches any single alphanumeric character. Finally, the # wildcard matches a single numeric character. You can also select time series in your data by using the GROUP= or the SOURCE= option to select on GROUP name or on SOURCE name. Alternatively, you can deselect time series by using the DROPGROUP= or the DROPSOURCE= option.
   LIBNAME test sasehavr 'physical-name' GROUP="CBA, *ZYX";
   LIBNAME test sasehavr 'physical-name' DROPGROUP="TKN*, XCZ?";
   LIBNAME test sasehavr 'physical-name' SOURCE="FRB";
   LIBNAME test sasehavr 'physical-name' DROPSOURCE="NYSE";
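To get a feel for the three wildcards described above, the following DATA step emulates them with a Perl regular expression. This is only an illustration of the stated matching rules, not the engine's own matching code. The data set name wildcard_demo, its variable names, and the sample series names are all hypothetical.

```sas
/* Emulate the *, ?, and # wildcards with PRXMATCH.                */
/* Illustration only: SASEHAVR performs its own matching.          */
data wildcard_demo;
   length name $ 8 pattern $ 12 regex $ 60;
   input name $ pattern $;
   regex = pattern;
   regex = tranwrd(regex, '?', '[A-Za-z0-9]');  /* one alphanumeric character */
   regex = tranwrd(regex, '#', '[0-9]');        /* one numeric character      */
   regex = tranwrd(regex, '*', '.*');           /* any character string       */
   /* matched is 1 when the whole name fits the pattern, else 0 */
   matched = (prxmatch(cats('/^', regex, '$/'), strip(name)) > 0);
   datalines;
ABC123 ABC*
XYZ12 XYZ??
XYZ1 XYZ??
USCPI *SC*
1T2 #T#
AT2 #T#
;
run;

proc print data=wildcard_demo;
run;
```

Under this emulation, a name such as XYZ1 does not match XYZ?? because the pattern requires exactly two characters after XYZ.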
SASEHAVR selects only the variables that are of the frequency specified in the FREQ= option. If this option is not specified, SASEHAVR selects the variables that match the frequency of the first selected variable. If no other selection criteria are specified, by default, the first selected variable is the first physical DLXRecord read from the HAVER database. The FORCE=FREQ option can be specified to force the aggregation of all selected variables to the frequency specified in the FREQ= option. Aggregation is only supported from a more frequent time interval to a less frequent time interval, such as from weekly to monthly. See "Aggregating to Quarterly Frequency Using the FORCE=FREQ Option" on page 2350 for suggested recovery from using a frequency that does not aggregate the data appropriately. The FORCE= option is ignored if the FREQ= option is not specified.
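The difference between selecting by frequency and forcing a frequency can be sketched with two LIBNAME statements. The librefs q1 and q2 are hypothetical; the HAVER_DATA environment variable, the haverw database, and the hwoutq data set name are borrowed from the examples in this chapter.

```sas
/* Select only the series that are stored at quarterly frequency. */
libname q1 sasehavr "%sysget(HAVER_DATA)" freq=quarterly;

/* Also aggregate more frequent series (for example, weekly or    */
/* monthly) to quarterly. FORCE=FREQ has no effect without FREQ=. */
libname q2 sasehavr "%sysget(HAVER_DATA)" freq=quarterly force=freq;

data hwoutq;
   set q2.haverw;
run;
```

Series that are stored at a lower frequency than the one requested (for example, annual series) cannot be aggregated this way and come back as missing values.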
HAVER Frequency         SAS Time Interval    FREQ=
ANNUAL                  YEAR                 YEARLY
QUARTERLY               QTR                  QTRLY
MONTHLY                 MONTH                MON
WEEKLY (SUNDAY)         WEEK.1               WEEK.1
WEEKLY (MONDAY)         WEEK.2               WEEK.2
WEEKLY (TUESDAY)        WEEK.3               WEEK.3
WEEKLY (WEDNESDAY)      WEEK.4               WEEK.4
WEEKLY (THURSDAY)       WEEK.5               WEEK.5
WEEKLY (FRIDAY)         WEEK.6               WEEK.6
WEEKLY (SATURDAY)       WEEK.7               WEEK.7
WEEKLY WEEK.1-WEEK.7    WEEKLY               WEEKLY
DAILY                   WEEKDAY17W           DAY
The important diagnostic to note here is the warning message, which tells you that the data starts in January 1970 (HAVER date 7001) and ends in March 2001 (HAVER date 10103). Because the specified range falls outside the range of the data, no observations are in range, so the engine uses the default range stated in the warning messages. Changing the START= and END= options so that they overlap the data range results in data spanning from JAN1970 to MAR2001. To view the entire range of selected data, remove the START= and END= options from your LIBNAME statement:

   libname kgs sasehavr "%sysget(HAVER_DATA)"
      keep="FCSEED, FCSEEI, FCSEEM, BGSX, BGSM, FXDUSBC"
      group="I01, F56, M02, R30"
      source="JPM,CEN,OMB" ;

NOTE: Libref KGS was successfully assigned as follows:
      Engine:        HAVERDLX
      Physical Name: C:\haver
   data kgse5;
      set kgs.haver;
NOTE: Defaulting to MONTHLY frequency.
   run;
NOTE: There were 375 observations read from the data set KGS.HAVER.
NOTE: The data set WORK.KGSE5 has 375 observations and 4 variables.
In the preceding example, the important diagnostic message is the first error statement, which tells you that the RANGE of HAVER dates is invalid for the specified frequency. A valid range is one that overlaps the dates (610104-1050318). Removing the range altogether will cause the engine to output the entire range of data.
libname lib1 sasehavr "%sysget(HAVER_DATA)" freq=Weekly;
NOTE: Libref LIB1 was successfully assigned as follows:
      Engine:        HAVERDLX
      Physical Name: C:\haver
   libname lib2 "\\dntsrc\usrtmp\saskff" ;
NOTE: Libref LIB2 was successfully assigned as follows:
      Engine:        V9
      Physical Name: \\dntsrc\usrtmp\saskff
   data lib2.wweek;
      set lib1.intwkly;
      keep date m11: ;
   run;
NOTE: There were 2307 observations read from the data set LIB1.INTWKLY.
NOTE: The data set LIB2.WWEEK has 2307 observations and 35 variables.
Because the START= and END= options take day-based dates, it is important to give a range whose dates correspond to the FREQ= option, especially with weekly frequencies such as week.1-week.7. Because FREQ=week.4 selects weeks that begin on Wednesday, the start and end dates need to be specified as Wednesday dates.

   libname lib1 sasehavr "%sysget(HAVER_DATA)"
      freq=Week.4 start=20050302 end=20050309;
NOTE: Libref LIB1 was successfully assigned as follows:
      Engine:        HAVERDLX
      Physical Name: \\tappan\crsp1\haver
   title2 'Weekly dataset with freq=week.4 range is small';
   libname lib2 "\\dntsrc\usrtmp\saskff" ;
NOTE: Libref LIB2 was successfully assigned as follows:
      Engine:        V9
      Physical Name: \\dntsrc\usrtmp\saskff
   data lib2.wweek;
      set lib1.intwkly;
      keep date m11: ;
   run;
NOTE: There were 2 observations read from the data set LIB1.INTWKLY.
NOTE: The data set LIB2.WWEEK has 2 observations and 25 variables.
Giving dates that do not match the frequency (for example, Tuesday dates with the Wednesday-based FREQ=week.4) results in the following errors.

ERROR: Fatal error in GetDate routine. Remove the range statement or change the START= date to be consistent with the freq=option.
ERROR: No observations found in specified range.
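One way to avoid this error is to check a candidate START= date against the weekly interval before assigning the libref. The following DATA step is a sketch that uses the standard WEEKDAY and INTNX functions; the variable names are hypothetical, and the date is the START= value from the example above.

```sas
data _null_;
   start   = input('20050302', yymmdd8.);     /* START= value as a SAS date    */
   dow     = weekday(start);                  /* 1=Sunday, ..., 7=Saturday     */
   aligned = intnx('week.4', start, 0, 'b');  /* first day of that WEEK.4 week */
   if start = aligned then
      put 'START= date is aligned with WEEK.4: ' start date9. ' ' dow=;
   else
      put 'Consider using START=' aligned yymmddn8.;
run;
```

March 2, 2005, is a Wednesday, so this date is already aligned with FREQ=week.4. For other weekly frequencies, change the interval name (for example, 'week.6' for weeks that begin on Friday).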
Note that the time series IUM has an annual frequency, so the attempt to convert it to a quarterly frequency produces all missing values in the output range: aggregation produces only missing values when going from a lower frequency to a higher (forced) frequency.
In the preceding example, the HAVER database is called haverw and it resides in the directory referenced in lib1. The DATA statement names the SAS output data set hwouty, which will reside in saswork. All time series in the HAVER haverw database are listed alphabetically in Output 35.1.1.
Output 35.1.1 Examining the Contents of HAVER Analytics Database, haverw.dat
Haver Analytics Database, HAVERW.DAT
PROC CONTENTS for Time Series converted to yearly frequency

The CONTENTS Procedure

Alphabetic List of Variables and Attributes

#  Variable  Type  Len  Format  Label
1  DATE      Num     8  YEAR4.  Date of Observation
2  FA        Num     8          Total Assets: All Commercial Banks (SA, Bil.$)
3  FCM1M     Num     8          1-Month Treasury Bill Market Bid Yield at Constant Maturity (%)
4  FM1       Num     8          Money Stock: M1 (SA, Bil.$)
5  FTA1MA    Num     8          Treasury 4-Week Bill: Total Amount Accepted (Bil$)
6  FTB3      Num     8          3-Month Treasury Bills, Auction (% p.a.)
7  LICN      Num     8          Unemployment Insurance: Initial Claims, State Programs (NSA, Thous)
You could use the following SAS statements to create a SAS data set named hwouty and to print its contents.
   libname lib1 sasehavr "%sysget(HAVER_DATA)"
      freq=yearly start=19920101 end=20041231 force=freq;

   data hwouty;
      set lib1.haverw;
   run;

   title1 'Haver Analytics Database, Frequency=yearly, infile=haverw.dat';
   title2 'Define a range inside the data range for OUT= dataset,';
   title3 'Using the START=19920101 END=20041231 LIBNAME options.';
   proc print data=hwouty;
   run;
The preceding LIBNAME lib1 statement specifies that all time series in the haverw database be converted to yearly frequency and that only the range of data from January 1, 1992, to December 31, 2004, be selected. The resulting SAS data set hwouty is shown in Output 35.1.2.
Output 35.1.2 Defining a Range Inside the Data Range for Yearly Time Series
Haver Analytics Database, Frequency=yearly, infile=haverw.dat
Define a range inside the data range for OUT= dataset,
Using the START=19920101 END=20041231 LIBNAME options.

Obs  DATE      FA    FCM1M      FM1  FTA1MA     FTB3     LICN
  1  1992  3466.3        .   965.31       .  3.45415  407.340
  2  1993  3624.6        .  1077.69       .  3.01654  344.934
  3  1994  3875.8        .  1144.85       .  4.28673  340.054
  4  1995  4209.3        .  1142.70       .  5.51058  357.038
  5  1996  4399.1        .  1106.46       .  5.02096  351.358
  6  1997  4820.3        .  1069.23       .  5.06885  321.513
  7  1998  5254.8        .  1079.56       .  4.80726  317.077
  8  1999  5608.1        .  1101.14       .  4.66154  301.581
  9  2000  6115.4        .  1104.07       .  5.84644  301.108
 10  2001  6436.2  2.31368  1136.31  11.753  3.44471  402.583
 11  2002  7024.9  1.63115  1192.03  18.798  1.61548  402.796
 12  2003  7302.9  1.02346  1268.40  16.089  1.01413  399.137
 13  2004  7950.5  1.26642  1337.89  13.019  1.37557  345.109
      set lib1.haverw;
   run;

   title1 'HAVER Analytics Database, Frequency=quarterly, infile=haverw.dat';
   title2 'Define a range inside the data range for OUT= dataset';
   title3 'Using the START=20010401 END=20041231 LIBNAME options.';
   proc print data=hwoutq;
   run;

Example 35.3: Viewing Monthly Time Series from a HAVER Database

   title2 'Define a range inside the data range for OUT= dataset';
   title3 'Using the START=20040401 END=20041231 LIBNAME options.';
The result from using the range of April 1, 2004, to December 31, 2004, is shown in Output 35.3.1.
Output 35.3.1 Defining a Range Inside the Data Range for Monthly Time Series
Haver Analytics Database, Frequency=monthly, infile=haverw.dat
Define a range inside the data range for OUT= dataset
Using the START=20040401 END=20041231 LIBNAME options.

Obs  DATE         FA   FCM1M      FM1  FTA1MA     FTB3    LICN
  1  APR2004  7703.8  0.9140  1325.73  16.946  0.93900  317.36
  2  MAY2004  7704.7  0.9075  1332.96  25.043  1.03375  297.00
  3  JUN2004  7769.8  1.0275  1339.50  12.547  1.26625  315.45
  4  JUL2004  7859.5  1.1840  1330.13  21.823  1.34900  357.32
  5  AUG2004  7890.0  1.3650  1347.84  25.213  1.48000  276.70
  6  SEP2004  7949.5  1.5400  1352.40  21.549  1.65000  270.70
  7  OCT2004  7967.6  1.6140  1355.28  21.322  1.74750  304.24
  8  NOV2004  8053.4  1.9125  1366.06  21.862  2.05625  335.85
  9  DEC2004  7950.5  1.9640  1365.60  13.019  2.20200  441.16
Output 35.4.1 Defining a Range Inside the Data Range for Weekly Time Series
HAVER Analytics Database, Frequency=weekly, infile=haverw.dat
Define a range inside the data range for OUT= dataset
Using the START=20040901 END=20041231 LIBNAME options.

Obs  DATE           FA  FCM1M     FM1  FTA1MA   FTB3   LICN
  1  29AUG2004  7890.0   1.39  1360.8  27.342  1.515  275.2
  2  05SEP2004  7906.2   1.46  1353.7  25.213  1.580  273.7
  3  12SEP2004  7962.7   1.57  1338.3  25.255  1.635  250.6
  4  19SEP2004  7982.1   1.57  1345.6  15.292  1.640  275.8
  5  26SEP2004  7987.9   1.56  1359.7  15.068  1.685  282.7
  6  03OCT2004  7949.5   1.54  1366.0  21.549  1.710  279.6
  7  10OCT2004  7932.4   1.56  1362.3  17.183  1.685  338.7
  8  17OCT2004  7956.9   1.59  1350.1  17.438  1.680  279.8
  9  24OCT2004  7957.3   1.63  1346.0  12.133  1.770  317.6
 10  31OCT2004  7967.6   1.75  1362.7  21.322  1.855  305.5
 11  07NOV2004  7954.1   1.84  1350.4  22.028  1.950  354.8
 12  14NOV2004  8009.7   1.89  1354.8  25.495  2.045  311.9
 13  21NOV2004  7938.3   1.93  1364.5  24.000  2.075  356.0
 14  28NOV2004  8053.4   1.99  1381.3  24.424  2.155  320.7
 15  05DEC2004  8010.7   2.05  1379.3  21.862  2.195  472.7
 16  12DEC2004  8054.8   2.08  1355.1  22.178  2.210  370.6
 17  19DEC2004  8019.2   1.98  1358.3  12.066  2.200  374.7
 18  26DEC2004  7995.5   1.89  1366.3  12.787  2.180  446.6

Example 35.5: Viewing Daily Time Series from a HAVER Database
Output 35.5.1 shows the output of PROC CONTENTS with the time id variable DATE followed by the time series variables FCM10, FCM1M, FFED, FFP1D, FXAUS, and TCC with their corresponding attributes such as type, length, format, and label.
Output 35.5.1 Examining the Contents of a Daily HAVER Analytics Database, haverd.dat
HAVER Analytics Database, HAVERD.DAT
PROC CONTENTS for Time Series converted to daily frequency

The CONTENTS Procedure

Alphabetic List of Variables and Attributes

#  Variable  Type  Len  Format  Label
1  DATE      Num     8  DATE9.  Date of Observation
2  FCM10     Num     8          10-Year Treasury Note Yield at Constant Maturity (Avg, % p.a.)
3  FCM1M     Num     8          1-Month Treasury Bill Market Bid Yield at Constant Maturity (%)
4  FFED      Num     8          Federal Funds [Effective] Rate (% p.a.)
5  FFP1D     Num     8          1-Day AA Financial Commercial Paper (% per annum)
6  FXAUS     Num     8          Foreign Exchange Rate: Australia (US$/Australian$)
7  TCC       Num     8          Treasury: Closing Operating Cash Balance (Today, Mil.$)
Example 35.6: Limiting the Range of Time Series from a HAVER Database
Suppose you limit the range of data to the month of December:
   libname lib1 sasehavr "%sysget(HAVER_DATA)"
      freq=daily start=20041201 end=20041231;

   data hwoutd;
      set lib1.haverd;
   run;

   title1 'Haver Analytics Database, Frequency=daily, infile=haverd.dat';
   title2 'Define a range inside the data range for OUT= dataset';
   title3 'Using the START=20041201 END=20041231 LIBNAME options.';
   proc print data=hwoutd;
   run;
Note that Output 35.6.1 for daily conversion shows the frequency as the SAS time interval for WEEKDAY.
Output 35.6.1 Defining a Range Inside the Data Range for Daily Time Series
Haver Analytics Database, Frequency=daily, infile=haverd.dat
Define a range inside the data range for OUT= dataset
Using the START=20041201 END=20041231 LIBNAME options.

Obs  DATE       FCM10  FCM1M  FFED  FFP1D   FXAUS    TCC
  1  01DEC2004   4.38   2.06  2.04   2.01  0.7754   7564
  2  02DEC2004   4.40   2.06  2.00   1.98  0.7769   8502
  3  03DEC2004   4.27   2.06  1.98   1.96  0.7778   7405
  4  06DEC2004   4.24   2.09  2.04   1.98  0.7748   7019
  5  07DEC2004   4.23   2.08  1.99   1.99  0.7754  15520
  6  08DEC2004   4.14   2.08  2.01   1.98  0.7545  12329
  7  09DEC2004   4.19   2.07  2.05   2.03  0.7532   5441
  8  10DEC2004   4.16   2.07  2.09   2.07  0.7495   6368
  9  13DEC2004   4.16   2.04  2.18   2.13  0.7592  11395
 10  14DEC2004   4.14   2.01  2.24   2.22  0.7566  13695
 11  15DEC2004   4.09   1.98  2.31   2.27  0.7652  39765
 12  16DEC2004   4.19   1.93  2.26   2.24  0.7563  33640
 13  17DEC2004   4.21   1.95  2.23   2.20  0.7607  32764
 14  20DEC2004   4.21   1.97  2.26   2.21  0.7644  36216
 15  21DEC2004   4.18   1.92  2.24   2.21  0.7660  35056
 16  22DEC2004   4.21   1.84  2.25   2.22  0.7656  34599
 17  23DEC2004   4.23   1.83  2.34   2.08  0.7654  24467
 18  24DEC2004      .      .  2.27      .  0.7689  26898
 19  27DEC2004   4.30   1.90  2.24   2.26  0.7777  31874
 20  28DEC2004   4.31   1.88  2.24   2.24  0.7787  30513
 21  29DEC2004   4.33   1.76  2.23   2.23  0.7709  34754
 22  30DEC2004   4.27   1.68  2.24   2.18  0.7785  20045
 23  31DEC2004   4.24   1.89  1.97   2.18  0.7805  24690
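Because daily HAVER data is mapped to the SAS WEEKDAY interval, weekend dates do not appear in the listing. As a rough cross-check, the number of weekdays in the requested range can be counted with the standard INTCK function. This sketch assumes that both end points are weekdays, as they are for December 1 and December 31, 2004; the variable names are hypothetical.

```sas
data _null_;
   start_dt = '01dec2004'd;
   end_dt   = '31dec2004'd;
   /* INTCK counts the WEEKDAY interval boundaries between the two */
   /* dates; adding 1 counts the starting day itself.              */
   n_weekdays = intck('weekday', start_dt, end_dt) + 1;
   put n_weekdays=;
run;
```

The value written to the log can be compared with the number of observations in Output 35.6.1. Holidays are still counted as weekdays, which is consistent with the rows in the listing that contain missing values.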
Example 35.7: Using the WHERE Statement to Subset Time Series from a HAVER Database
Using a WHERE statement in the DATA step can be useful for further subsetting.
   libname lib1 sasehavr "%sysget(HAVER_DATA)"
      freq=daily start=20041101 end=20041231;

   data hwoutd;
      set lib1.haverd;
      where date between '01nov2004'd and '01dec2004'd;
   run;

   title1 'Haver Analytics Database, Frequency=daily, infile=haverd.dat';
   title2 'Define a range inside the data range for OUT= dataset';
   title3 'Using the START=20041101 END=20041231 LIBNAME options.';
   title4 'Subset further: where date between 01nov2004 and 31dec2004.';
   proc print data=hwoutd;
   run;
Output 35.7.1 shows that the time slice of November 1, 2004, to December 31, 2004, is narrowed further by the DATE test on the WHERE statement to stop at December 1, 2004.
Output 35.7.1 Defining a Range Using START=20041101 END=20041231 along with the WHERE statement
Haver Analytics Database, Frequency=daily, infile=haverd.dat
Define a range inside the data range for OUT= dataset
Using the START=20041101 END=20041231 LIBNAME options.
Subset further: where date between 01nov2004 and 31dec2004.

Obs  DATE       FCM10  FCM1M  FFED  FFP1D   FXAUS    TCC
  1  01NOV2004   4.11   1.79  1.83   1.80  0.7460  35111
  2  02NOV2004   4.10   1.86  1.74   1.74  0.7447  34091
  3  03NOV2004   4.09   1.83  1.73   1.73  0.7539  14862
  4  04NOV2004   4.10   1.85  1.77   1.75  0.7585  23304
  5  05NOV2004   4.21   1.86  1.76   1.75  0.7620  19872
  6  08NOV2004   4.22   1.88  1.80   1.84  0.7578  21095
  7  09NOV2004   4.22   1.89  1.79   1.81  0.7618  16390
  8  10NOV2004   4.25   1.88  1.92   1.85  0.7592  12872
  9  11NOV2004      .      .  1.92      .       .  12872
 10  12NOV2004   4.20   1.91  2.02   1.96  0.7685  28926
 11  15NOV2004   4.20   1.92  2.06   2.03  0.7719  10480
 12  16NOV2004   4.21   1.93  1.98   1.95  0.7728  13417
 13  17NOV2004   4.14   1.90  1.99   1.93  0.7833  10506
 14  18NOV2004   4.12   1.91  1.99   1.94  0.7786   6293
 15  19NOV2004   4.20   1.98  1.99   1.93  0.7852   5100
 16  22NOV2004   4.18   1.98  2.01   1.96  0.7839   6045
 17  23NOV2004   4.19   1.99  2.00   1.95  0.7860  18135
 18  24NOV2004   4.20   1.98  2.02   1.89  0.7863  14109
 19  25NOV2004      .      .  2.02      .       .  14109
 20  26NOV2004   4.24   2.01  2.01   1.97  0.7903  20588
 21  29NOV2004   4.34   2.02  2.03   2.00  0.7852  24322
 22  30NOV2004   4.36   2.07  2.02   2.04  0.7723  18033
 23  01DEC2004   4.38   2.06  2.04   2.01  0.7754   7564
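The same subset can also be written with the WHERE= data set option instead of a WHERE statement. The following sketch assumes that the SASEHAVR engine honors the WHERE= data set option in the same way as the WHERE statement shown in this example; the output data set name hwoutd2 is hypothetical.

```sas
libname lib1 sasehavr "%sysget(HAVER_DATA)"
   freq=daily start=20041101 end=20041231;

/* WHERE= data set option: the condition is attached to the input */
/* data set rather than to the DATA step as a whole.              */
data hwoutd2;   /* hypothetical output data set name */
   set lib1.haverd(where=(date between '01nov2004'd and '01dec2004'd));
run;
```

Attaching the condition to the input data set can be convenient when a DATA step reads several data sets and only one of them should be subset.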
Example 35.8: Using the KEEP Option to Subset Time Series from a HAVER Database
To select specific time series, the KEEP= or DROP= option can also be used as follows.

   libname lib1 sasehavr "%sysget(HAVER_DATA)"
      freq=daily start=20041101 end=20041231 keep="FCM*";

   data hwoutd;
      set lib1.haverd;
   run;

   title1 'Haver Analytics Database, Frequency=daily, infile=haverd.dat';
   title2 'Define a range inside the data range for OUT= dataset';
   title3 'Using the START=20041101 END=20041231 LIBNAME options.';
   title4 'Subset further: Using keep="FCM*" LIBNAME option ';
   proc print data=hwoutd;
   run;
Output 35.8.1 shows two series that are selected by using KEEP="FCM*" on the LIBNAME statement.
Output 35.8.1 Using the KEEP Option along with Defining a Range Using START=20041101 END=20041231
Haver Analytics Database, Frequency=daily, infile=haverd.dat
Define a range inside the data range for OUT= dataset
Using the START=20041101 END=20041231 LIBNAME options.
Subset further: Using keep="FCM*" LIBNAME option

Obs  DATE       FCM10  FCM1M
  1  01NOV2004   4.11   1.79
  2  02NOV2004   4.10   1.86
  3  03NOV2004   4.09   1.83
  4  04NOV2004   4.10   1.85
  5  05NOV2004   4.21   1.86
  6  08NOV2004   4.22   1.88
  7  09NOV2004   4.22   1.89
  8  10NOV2004   4.25   1.88
  9  11NOV2004      .      .
 10  12NOV2004   4.20   1.91
 11  15NOV2004   4.20   1.92
 12  16NOV2004   4.21   1.93
 13  17NOV2004   4.14   1.90
 14  18NOV2004   4.12   1.91
 15  19NOV2004   4.20   1.98
 16  22NOV2004   4.18   1.98
 17  23NOV2004   4.19   1.99
 18  24NOV2004   4.20   1.98
 19  25NOV2004      .      .
 20  26NOV2004   4.24   2.01
 21  29NOV2004   4.34   2.02
 22  30NOV2004   4.36   2.07
 23  01DEC2004   4.38   2.06
 24  02DEC2004   4.40   2.06
 25  03DEC2004   4.27   2.06
 26  06DEC2004   4.24   2.09
 27  07DEC2004   4.23   2.08
 28  08DEC2004   4.14   2.08
 29  09DEC2004   4.19   2.07
 30  10DEC2004   4.16   2.07
 31  13DEC2004   4.16   2.04
 32  14DEC2004   4.14   2.01
 33  15DEC2004   4.09   1.98
 34  16DEC2004   4.19   1.93
 35  17DEC2004   4.21   1.95
 36  20DEC2004   4.21   1.97
 37  21DEC2004   4.18   1.92
 38  22DEC2004   4.21   1.84
 39  23DEC2004   4.23   1.83
 40  24DEC2004      .      .
 41  27DEC2004   4.30   1.90
 42  28DEC2004   4.31   1.88
 43  29DEC2004   4.33   1.76
 44  30DEC2004   4.27   1.68
 45  31DEC2004   4.24   1.89
The DROP option can be used to drop specific variables from a HAVER database. To specify this option, use DROP= instead of KEEP=.
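As a sketch, the KEEP= example above can be turned into a DROP= example by changing only the option name. With the haverd database used in this chapter, this would be expected to exclude the series whose names begin with FCM and to retain the remaining daily series.

```sas
libname lib1 sasehavr "%sysget(HAVER_DATA)"
   freq=daily start=20041101 end=20041231 drop="FCM*";

data hwoutd;
   set lib1.haverd;
run;

/* List the variables that remain after the DROP= selection. */
proc contents data=hwoutd;
run;
```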
Example 35.9: Using the SOURCE Option to Subset Time Series from a HAVER Database
To select specific variables that belong to a certain source, the SOURCE= or DROPSOURCE= option can be used, similar to the way you use the KEEP= or DROP= option.

   libname lib1 sasehavr "%sysget(HAVER_DATA)"
      freq=daily start=20041101 end=20041223 source="FRB";

   data hwoutd;
      set lib1.haverd;
   run;

   title1 'Haver Analytics Database, Frequency=daily, infile=haverd.dat';
   title2 'Define a range inside the data range for OUT= dataset';
   title3 'Using the START=20041101 END=20041223 LIBNAME options.';
   title4 'Subset further: Using source="FRB" LIBNAME option';
   proc print data=hwoutd;
   run;
Output 35.9.1 shows four series that are selected by using SOURCE="FRB" on the LIBNAME statement.
Output 35.9.1 Using the SOURCE Option along with Defining a Range Using START=20041101 END=20041223
Haver Analytics Database, Frequency=daily, infile=haverd.dat
Define a range inside the data range for OUT= dataset
Using the START=20041101 END=20041223 LIBNAME options.
Subset further: Using source="FRB" LIBNAME option

Obs  DATE       FCM10  FFED  FFP1D   FXAUS
  1  01NOV2004   4.11  1.83   1.80  0.7460
  2  02NOV2004   4.10  1.74   1.74  0.7447
  3  03NOV2004   4.09  1.73   1.73  0.7539
  4  04NOV2004   4.10  1.77   1.75  0.7585
  5  05NOV2004   4.21  1.76   1.75  0.7620
  6  08NOV2004   4.22  1.80   1.84  0.7578
  7  09NOV2004   4.22  1.79   1.81  0.7618
  8  10NOV2004   4.25  1.92   1.85  0.7592
  9  11NOV2004      .  1.92      .       .
 10  12NOV2004   4.20  2.02   1.96  0.7685
 11  15NOV2004   4.20  2.06   2.03  0.7719
 12  16NOV2004   4.21  1.98   1.95  0.7728
 13  17NOV2004   4.14  1.99   1.93  0.7833
 14  18NOV2004   4.12  1.99   1.94  0.7786
 15  19NOV2004   4.20  1.99   1.93  0.7852
 16  22NOV2004   4.18  2.01   1.96  0.7839
 17  23NOV2004   4.19  2.00   1.95  0.7860
 18  24NOV2004   4.20  2.02   1.89  0.7863
 19  25NOV2004      .  2.02      .       .
 20  26NOV2004   4.24  2.01   1.97  0.7903
 21  29NOV2004   4.34  2.03   2.00  0.7852
 22  30NOV2004   4.36  2.02   2.04  0.7723
 23  01DEC2004   4.38  2.04   2.01  0.7754
 24  02DEC2004   4.40  2.00   1.98  0.7769
 25  03DEC2004   4.27  1.98   1.96  0.7778
 26  06DEC2004   4.24  2.04   1.98  0.7748
 27  07DEC2004   4.23  1.99   1.99  0.7754
 28  08DEC2004   4.14  2.01   1.98  0.7545
 29  09DEC2004   4.19  2.05   2.03  0.7532
 30  10DEC2004   4.16  2.09   2.07  0.7495
 31  13DEC2004   4.16  2.18   2.13  0.7592
 32  14DEC2004   4.14  2.24   2.22  0.7566
 33  15DEC2004   4.09  2.31   2.27  0.7652
 34  16DEC2004   4.19  2.26   2.24  0.7563
 35  17DEC2004   4.21  2.23   2.20  0.7607
 36  20DEC2004   4.21  2.26   2.21  0.7644
 37  21DEC2004   4.18  2.24   2.21  0.7660
 38  22DEC2004   4.21  2.25   2.22  0.7656
 39  23DEC2004   4.23  2.34   2.08  0.7654
Example 35.10: Using the GROUP Option to Subset Time Series from a HAVER Database
To select specific variables that belong to a certain group, the GROUP= or DROPGROUP= option can also be used, similar to the way you use the KEEP= or DROP= option. Output 35.10.1, Output 35.10.2, and Output 35.10.3 show three different cross sections of the same database, haverw, by specifying three unique GROUP= options: GROUP="F*" on LIBNAME lib1, GROUP="M*" on LIBNAME lib2, and GROUP="E*" on LIBNAME lib3.

   libname lib1 sasehavr "%sysget(HAVER_DATA)"
      freq=week.6 force=freq start=20040102 end=20041001 group="F*";

   data hwoutw;
      set lib1.haverw;
   run;

   title1 'Haver Analytics Database, Frequency=week.6, infile=haverw.dat';
   title2 'Define a range inside the data range for OUT= dataset';
   title3 'Using the START=20040102 END=20041001 LIBNAME options.';
   title4 'Subset further: Using group="F*" LIBNAME option';
   proc print data=hwoutw;
   run;
Output 35.10.1 Using the GROUP=F* Option along with Defining a Range
Haver Analytics Database, Frequency=week.6, infile=haverw.dat
Define a range inside the data range for OUT= dataset
Using the START=20040102 END=20041001 LIBNAME options.
Subset further: Using group="F*" LIBNAME option

Obs  DATE       FCM1M  FTA1MA   FTB3
  1  02JAN2004   0.86  16.089  0.885
  2  09JAN2004   0.88  12.757  0.920
  3  16JAN2004   0.84  12.141  0.870
  4  23JAN2004   0.79  12.593  0.875
  5  30JAN2004   0.86  17.357  0.890
  6  06FEB2004   0.90  21.759  0.920
  7  13FEB2004   0.90  21.557  0.920
  8  20FEB2004   0.92  21.580  0.915
  9  27FEB2004   0.96  21.390  0.930
 10  05MAR2004   0.97  24.119  0.940
 11  12MAR2004   0.96  24.294  0.930
 12  19MAR2004   0.94  23.334  0.945
 13  26MAR2004   0.95  21.400  0.930
 14  02APR2004   0.95  21.818  0.945
 15  09APR2004   0.94  17.255  0.930
 16  16APR2004   0.92  14.143  0.915
 17  23APR2004   0.89  14.136  0.935
 18  30APR2004   0.87  16.946  0.970
 19  07MAY2004   0.89  22.772  0.985
 20  14MAY2004   0.89  23.113  1.060
 21  21MAY2004   0.91  25.407  1.040
 22  28MAY2004   0.94  25.043  1.050
 23  04JUN2004   0.97  27.847  1.130
 24  11JUN2004   1.01  27.240  1.230
 25  18JUN2004   1.05  17.969  1.390
 26  25JUN2004   1.08  12.159  1.315
 27  02JUL2004   1.11  12.547  1.355
 28  09JUL2004   1.14  21.303  1.320
 29  16JUL2004   1.16  25.024  1.315
 30  23JUL2004   1.21  25.327  1.330
 31  30JUL2004   1.30  21.823  1.425
 32  06AUG2004   1.34  21.631  1.465
 33  13AUG2004   1.37  28.237  1.470
 34  20AUG2004   1.36  26.070  1.470
 35  27AUG2004   1.39  27.342  1.515
 36  03SEP2004   1.46  25.213  1.580
 37  10SEP2004   1.57  25.255  1.635
 38  17SEP2004   1.57  15.292  1.640
 39  24SEP2004   1.56  15.068  1.685
 40  01OCT2004   1.54  21.549  1.710
   libname lib2 sasehavr "%sysget(HAVER_DATA)"
      freq=week.6 force=freq start=20040102 end=20041001 group="M*";

   data hwoutw;
      set lib2.haverw;
   run;

   title1 'Haver Analytics Database, Frequency=week.6, infile=haverw.dat';
   title2 'Define a range inside the data range for OUT= dataset';
   title3 'Using the START=20040102 END=20041001 LIBNAME options.';
   title4 'Subset further: Using group="M*" LIBNAME option';
   proc print data=hwoutw;
   run;

Output 35.10.2 Using the GROUP=M* Option along with Defining a Range
Haver Analytics Database, Frequency=week.6, infile=haverw.dat
Define a range inside the data range for OUT= dataset
Using the START=20040102 END=20041001 LIBNAME options.
Subset further: Using group="M*" LIBNAME option

Obs  DATE           FA     FM1
  1  02JAN2004  7302.9  1298.2
  2  09JAN2004  7351.2  1294.3
  3  16JAN2004  7378.5  1286.8
  4  23JAN2004  7434.7  1296.7
  5  30JAN2004  7492.4  1305.1
  6  06FEB2004  7510.4  1303.1
  7  13FEB2004  7577.8  1309.1
  8  20FEB2004  7648.7  1317.0
  9  27FEB2004  7530.6  1321.1
 10  05MAR2004  7546.7  1316.2
 11  12MAR2004  7602.0  1312.7
 12  19MAR2004  7603.0  1324.0
 13  26MAR2004  7625.5  1337.6
 14  02APR2004  7637.3  1337.9
 15  09APR2004  7667.4  1327.3
 16  16APR2004  7692.5  1321.8
 17  23APR2004  7698.4  1322.2
 18  30APR2004  7703.8  1331.6
 19  07MAY2004  7686.8  1342.5
 20  14MAY2004  7734.6  1325.5
 21  21MAY2004  7695.8  1330.1
 22  28MAY2004  7704.7  1337.7
 23  04JUN2004  7715.1  1329.0
 24  11JUN2004  7754.0  1324.4
 25  18JUN2004  7753.2  1336.4
 26  25JUN2004  7796.2  1345.8
 27  02JUL2004  7769.8  1351.4
 28  09JUL2004  7852.3  1330.1
 29  16JUL2004  7852.8  1326.3
 30  23JUL2004  7854.7  1323.5
 31  30JUL2004  7859.5  1340.6
 32  06AUG2004  7847.9  1337.3
 33  13AUG2004  7888.7  1340.1
 34  20AUG2004  7851.8  1347.3
 35  27AUG2004  7890.0  1360.8
 36  03SEP2004  7906.2  1353.7
 37  10SEP2004  7962.7  1338.3
 38  17SEP2004  7982.1  1345.6
 39  24SEP2004  7987.9  1359.7
 40  01OCT2004  7949.5  1366.0
   libname lib3 sasehavr "%sysget(HAVER_DATA)"
           freq=week.6 force=freq start=20040102
           end=20041001 group="E*";

   data hwoutw;
      set lib3.haverw;
   run;

   title1 'Haver Analytics Database, Frequency=week.6, infile=haverw.dat';
   title2 'Define a range inside the data range for OUT= dataset';
   title3 'Using the START=20040102 END=20041001 LIBNAME options.';
   title4 'Subset further: Using group="E*" LIBNAME option';

   proc print data=hwoutw;
   run;
Output 35.10.3 Using the GROUP=E* Option along with Defining a Range
Haver Analytics Database, Frequency=week.6, infile=haverw.dat
Define a range inside the data range for OUT= dataset
Using the START=20040102 END=20041001 LIBNAME options.
Subset further: Using group="E*" LIBNAME option

Obs    DATE          LICN
  1    02JAN2004    552.8
  2    09JAN2004    677.9
  3    16JAN2004    490.8
  4    23JAN2004    382.3
  5    30JAN2004    406.3
  6    06FEB2004    433.2
  7    13FEB2004    341.6
  8    20FEB2004    328.2
  9    27FEB2004    342.1
 10    05MAR2004    339.0
 11    12MAR2004    312.1
 12    19MAR2004    304.5
 13    26MAR2004    296.8
 14    02APR2004    304.2
 15    09APR2004    350.7
 16    16APR2004    335.0
 17    23APR2004    313.7
 18    30APR2004    283.2
 19    07MAY2004    292.8
 20    14MAY2004    297.1
 21    21MAY2004    294.0
 22    28MAY2004    304.1
 23    04JUN2004    308.2
 24    11JUN2004    312.4
 25    18JUN2004    322.5
 26    25JUN2004    318.7
 27    02JUL2004    349.9
 28    09JUL2004    444.5
 29    16JUL2004    394.4
 30    23JUL2004    315.7
 31    30JUL2004    282.1
 32    06AUG2004    291.5
 33    13AUG2004    268.0
 34    20AUG2004    272.1
 35    27AUG2004    275.2
 36    03SEP2004    273.7
 37    10SEP2004    250.6
 38    17SEP2004    275.8
 39    24SEP2004    282.7
 40    01OCT2004    279.6
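The LIBNAME steps in this example differ only in the libref and the GROUP= value. As a hedged sketch, the repetition could be folded into a macro. The macro name HAVERGRP, its GRP parameter, and the libref HLIB are hypothetical names introduced here for illustration; the code assumes the HAVER_DATA environment variable and the HAVERW database used in the example, and it uses only the LIBNAME options already shown.

```sas
/* Hypothetical helper: print one GROUP= subset of the HAVERW database. */
/* GRP is a group pattern such as F*, M*, or E*.                        */
%macro havergrp(grp);
   libname hlib sasehavr "%sysget(HAVER_DATA)"
           freq=week.6 force=freq
           start=20040102 end=20041001
           group="&grp";

   data hwoutw;
      set hlib.haverw;
   run;

   title1 "Haver Analytics Database, Frequency=week.6, infile=haverw.dat";
   title2 "Subset: group=&grp";

   proc print data=hwoutw;
   run;

   /* Reset titles and release the libref before the next call. */
   title;
   libname hlib clear;
%mend havergrp;

%havergrp(F*)
%havergrp(M*)
%havergrp(E*)
```

Clearing the libref at the end of each call is a design choice in this sketch: it lets one libref name be reassigned with a different GROUP= value on the next call.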
Database Name   Offering Type              Description
USECON          U.S. Economic Indicators   HAVER Analytics primary database of US Economic, Financial data
USNA            U.S. Economic Indicators   Complete US National Income and Product Accounts from the Bureau of Economic Analysis
SURVEYS         U.S. Economic Indicators   Business and Consumer Expectations, surveys
SURVEYW         U.S. Economic Indicators   Business and Consumer Expectations, weekly surveys
CPIDATA         U.S. Economic Indicators   Consumer Price Indexes, monthly in CPI Detailed Report
PPI             U.S. Economic Indicators   Producer Price Indexes by Bureau of Labor Statistics
PPIR            U.S. Economic Indicators   Producer Price Indexes by Bureau of Labor Statistics
LABOR           U.S. Economic Indicators   Employment and Earnings by Bureau of Labor Statistics
EMPL            U.S. Economic Indicators   Household Employment Survey, monthly by Bureau of Labor Statistics
CEW             U.S. Economic Indicators   Covered Employment and Wages, monthly, quarterly
IP              U.S. Economic Indicators   Industrial Production and Capacity Utilization by Federal Reserve Board
FFUNDS          U.S. Economic Indicators   Flow of Funds Data by Federal Reserve Board
CAPSTOCK        U.S. Economic Indicators   Capital Stock by Bureau of Economic Analysis
USINT           U.S. Economic Indicators   U.S. International Trade (TIC) data by Country and Product
CBDB            Specialized Databases      Conference Board Database, monthly
Table 35.3
continued
Offering Type Specialized Databases Specialized Databases Specialized Databases Specialized Databases Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators Financial Indicators
Description US Business Cycle Indicators Consumer Sentiment Survey from The University of Michigan Surveys of Consumers Fiber Business Cycle Indicators from the Foundation of International Business and Economic Research Cyclical Indicators from Economic Cycle Research Institute, weekly and monthly U.S. Daily Statistics Data Country Daily Statistics U.S. Weekly Statistics Country Weekly Statistics Standard and Poor's Industry Groups, daily Standard and Poor's Industry Groups, weekly Standard and Poor's Industry Groups, monthly Standard and Poor's Analysts Annual Handbook, yearly Morgan Stanley Capital International, daily Morgan Stanley Capital International, weekly Morgan Stanley Capital International, monthly U.S. Bond Indexes CITIGROUP Bond Performance Indexes by CITIGROUP Global Markets, formerly Salomon Smith Barney Mutual Fund Activity from the Investment Company Institute Quarterly Financial Report Capacity Utilization by Federal Reserve Board Mortgage Delinquency Rates by Mortgage Bankers Association Consumer Delinquency Rates by American Bankers Association, monthly
ECRI DAILY INTDAILY WEEKLY INTWKLY SPD SPW SPM SPAH MSCID MSCIW MSCIM BONDINDX BONDS
Offering Type Financial Indicators Financial Indicators Industry Industry Industry Industry
Description FDIC Banking Statistics TIC data by Country and Product U.S. Government Financial Statistics by U.S. Treasury U.S. Industry Statistics Home Sales from National Association of Realtors U.S. and International Energy Statistics, from Pennwell Publishing's Oil and Gas Journal U.S. and International Energy Statistics, from Pennwell Publishing's Oil and Gas Journal, annual Weekly Oil Statistics U.S. Electric Output from the Edison Electric Institute, weekly Annual Survey of Manufactures Railcar Loadings from Association of American Railroads and Atlantic Systems Weekly Chemical Prices from the Chemical Division of Access Intelligence Baltic Freight Indexes, from the Baltic Exchange in London International Macroeconomic Data Purchasing Managers Surveys by NTC Research Country Surveys Japan from Nomura Research Institute Japan from Nomura Research Institute, weekly Canada from Statistics Canada and the Bank of Canada Canada from Statistics Canada and the Bank of Canada United Kingdom United Kingdom Surveys, by NTC Research Germany from the Deutsche Bundesbank and Statistics Bundesamt France, Statistics from INSEE, the Bank of France and the Ministry of France
OILWKLY EEI ASM RAILSHAR CHEMWEEK BALTIC G10+ PMI INTSRVYS JAPAN JAPANW CANSIM CANSIMR UK UKSRVYS GERMANY FRANCE
Industry Industry Industry Industry Industry Industry Industrial Countries Industrial Countries Industrial Countries Industrial Countries Industrial Countries Industrial Countries Industrial Countries Industrial Countries Industrial Countries Industrial Countries Industrial Countries
Database Name   Offering Type                 Description
ITALY           Industrial Countries          Italy, from Istituto Nazionale di Statistica and Banca d'Italia
SPAIN           Industrial Countries          Spain, from the Instituto Nacional de Estadistica and the Banco de Espana
IRELAND         Industrial Countries          Ireland, from the Central Statistics Office and Central Bank
NORDIC          Industrial Countries          Norway, Sweden, Denmark, Finland
ALPMED          Industrial Countries          Austria, Switzerland, Greece, Portugal
BENELUX         Industrial Countries          Belgium, Netherlands, Luxembourg, monthly
ANZ             Industrial Countries          Australia and New Zealand
EMERGELA        Emerging Markets              Latin American Macroeconomic Data
EMERGEPR        Emerging Markets              Asia/Pacific Rim Emerging Markets
CHINA           Emerging Markets              CEIC Premium China Database, from CEIC
EMERGECW        Emerging Markets              Central and Eastern Europe and Western Asia
EMERGEMA        Emerging Markets              Middle East and African Emerging Markets
EUROSTAT        International Organizations   European Union Data from EUROSTAT and OECD
OECDMEI         International Organizations   OECD Main Economic Indicators
OECDNAQ         International Organizations   OECD Quarterly National Accounts
OECDNA          International Organizations   OECD Annual National Accounts
OECDLFS         International Organizations   OECD Quarterly Labor Force
OUTLOOK         International Organizations   OECD Economic Outlook
IFS             International Organizations   International Financial Statistics from International Monetary Fund
IFSANN          International Organizations   International Financial Statistics, annual from International Monetary Fund
IMFDOTM         International Organizations   Direction of Trade Statistics, monthly from International Monetary Fund
IMFDOT          International Organizations   Direction of Trade Statistics from International Monetary Fund
Offering Type International Organizations International Organizations International Organizations International Organizations Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data
Description World Commodity Prices from World Development Prospects Group (Pinksheets) Global Development Finance from World Bank Debt Tables United Nations Population Projections International Comparisons from Bureau of Labor Statistics Short Term U.S. Economic Forecasts from Macroeconomic Advisors Long Term U.S. Economic Forecasts from Macroeconomic Advisors Canadian Quarterly Model from Centre for Spatial Economics Canadian Provincial Model from Centre for Spatial Economics OEF Global Macroeconomic Forecasts from Oxford Economic Forecasting OEF Global Macroeconomic Forecasts from Oxford Economic Forecasting OEF Global Macroeconomic Forecasts from Oxford Economic Forecasting OEF Global Macroeconomic Forecasts from Oxford Economic Forecasting OEF Global Industry from Oxford Economic Forecasting EIU Market Indicators and Forecasts from the Economist Intelligence Unit EIU Market Indicators and Forecasts from the Economist Intelligence Unit EIU Market Indicators and Forecasts from the Economist Intelligence Unit
MA4CSTL
CQM
CPM
OEFQMACR
OEFMAJOR
OEFINTER
OEFMINOR
OEFQIND
EIUIAMER
EIUIASIA
EIUIEEUR
Offering Type Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data
Description EIU Market Indicators and Forecasts from the Economist Intelligence Unit EIU Market Indicators and Forecasts from the Economist Intelligence Unit EIU Market Indicators and Forecasts from the Economist Intelligence Unit EIU Market Indicators and Forecasts from the Economist Intelligence Unit EIU Country Data from the Economist Intelligence Unit EIU Country Data from the Economist Intelligence Unit EIU Country Data from the Economist Intelligence Unit EIU Country Data from the Economist Intelligence Unit EIU Country Data from the Economist Intelligence Unit EIU Country Data from the Economist Intelligence Unit EIU Country Data from the Economist Intelligence Unit EIU Country Data from the Economist Intelligence Unit Action Economics Forecast Medians and As Reported Data MMS Survey Medians and As First Reported Data from MMS International
EIUISUBS
EIUIWEUR
EIUIREGS
EIUDAMER
EIUDASIA
EIUDEEUR
EIUDMENA
EIUDSUBS
EIUDWEUR
EIUDOECD
EIUDREGS
AS1REPNA
MMSAMER
Offering Type Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data Forecasts and As Reported Data U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional U.S. Regional
Description MMS Survey Medians and As First Reported Data from MMS International Economic Survey Forecasts
SURVEYS
AS4CAST
ASREPGDP
As Reported U.S. Gross Domestic Product from Bureau of Economic Analysis Monthly Payroll Employment from Bureau of Labor Statistics Labor Force and Unemployment from Bureau of Labor Statistics Labor Force and Unemployment from Bureau of Labor Statistics Annual Employment by Industry Annual Employment by Industry Residential Building Permits Residential Building Permits Residential Building Permits Residential Building Permits Residential Building Permits Selected Regional Indicators Selected Regional Indicators Personal Income Personal Income Personal Income Personal Income Personal Income Mortgage Delinquency Rates from Mortgage Bankers Association Consumer Delinquency Rates from American Bankers Association Bankrupts by County and MSA Gross State Product from BEA Population by Age and Sex Population by Age and Sex Trade by Port Exports by Industry and Country from the World Institute for Strategic Economic Research and the Census Bureau
LABORR EMPLR EMPLC BEAEMPL BEAEMPM PERMITS PERMITY PERMITP PERMITC PERMITA REGIONAL REGIONAW PIQR PIR PIRMSA PICOUNTY PIRC1 to 9 MBAMTG DLINQR BANKRUPT GSP USPOP USPOPC PORTS EXPRQ1 to 9
Description Exports by Industry and Country from the World Institute for Strategic Economic Research and the Census Bureau Exports by Industry and Country from the World Institute for Strategic Economic Research and the Census Bureau Government Financial Statistics from the Census Bureau and Rockefeller Institute of Government FDIC Banking Statistics
EXPORT99
U.S. Regional
GOVFINR FDICR
References
HAVER ANALYTICS (2001), DLX API Programmer's Reference, New York, NY. 10165 [http://www.haver.com/]
HAVER ANALYTICS (2005), DLX Database Profile, New York, NY. 10165
HAVER ANALYTICS (2005), DATA LINK EXPRESS, Time Series Data Base Management System, New York, NY. 10165 [http://www.haver.com/]
Acknowledgments
Many people have been instrumental in the development of the ETS Interface engine. The individuals listed here have been especially helpful.
Maurine Haver, HAVER ANALYTICS, New York, NY.
Lai Cheng, HAVER ANALYTICS, New York, NY.
Rick Langston, SAS Institute, Cary, NC.
The final responsibility for the SAS System lies with SAS Institute alone. We hope that you will always let us know your opinions about the SAS System and its documentation. It is through your participation that SAS software is continuously improved.
Part IV
Chapter 36
Introduction
The Time Series Forecasting system forecasts future values of time series variables by extrapolating trends and patterns in the past values of the series or by extrapolating the effect of other variables on the series. The system provides convenient point-and-click windows to control the time series analysis and forecasting tools of SAS/ETS software. You can use the system in a fully automatic mode, or you can use the system's diagnostic features and time series modeling tools interactively to develop forecasting models customized to best predict your time series. The system provides both graphical and statistical features to help you choose the best forecasting method for each series.
The following is a brief summary of the features of the Time Series Forecasting system. You can use the system in the following ways:
- use a wide variety of forecasting methods, including several kinds of exponential smoothing models, Winters method, and ARIMA (Box-Jenkins) models. You can also produce forecasts by combining the forecasts from several models.
- use predictor variables in forecasting models. Forecasting models can include time trend curves, regressors, intervention effects (dummy variables), adjustments you specify, and dynamic regression (transfer function) models.
- view plots of the data, predicted versus actual values, prediction errors, and forecasts with confidence limits, as well as autocorrelations and results of white noise and stationarity tests. Any of these plots can be zoomed and can represent raw or transformed series.
- use hold-out samples to select the best forecasting method
- compare goodness-of-fit measures for any two forecasting models side by side or list all models sorted by a particular fit statistic
- view the predictions and errors for each model in a spreadsheet or compare the fit of any two models in a spreadsheet
- examine the fitted parameters of each forecasting model and their statistical significance
- control the automatic model selection process: the set of forecasting models considered, the goodness-of-fit measure used to select the best model, and the time period used to fit and evaluate models
- customize the system by adding forecasting models for the automatic model selection process and for point-and-click manual selection
- save your work in a project catalog
- print an audit trail of the forecasting process
- show source statements for PROC ARIMA code
- save and print system output including spreadsheets and graphs
view, with the observations representing the time periods. The data can also be stored in an external spreadsheet or data base if you license SAS/ACCESS software. The Time Series Forecasting System chapters refer to series and variables. Since time series are stored as variables in SAS data sets or data views, these terms are used interchangeably. However, the term series is preferred when attention is focused on the sequence of data values, and the term variable is preferred when attention is focused on the data set.
Chapter 37
This chapter outlines the forecasting process and introduces the major windows of the system through three example sessions. The first example, beginning with the section "The Time Series Forecasting Window," shows how to use the system for fully automated forecasting of a set of time series. This example also introduces the system's features for viewing data and forecasts through tables and interactive graphs. It also shows how to save and restore forecasting work in SAS catalogs. The second example, beginning with the section "Develop Models Window," introduces the features for developing the best forecasting models for individual time series. The chapter concludes with an example showing how to create dating variables for your data in the form expected by the system.
After working through the examples in this chapter, you should be able to do the following:
- select a data set of time series to work with and specify its periodicity and time ID variable
- use the automatic forecasting model selection feature to create forecasting models for the variables in a data set
- produce and save forecasts of variables in a data set
- examine your data and forecasts as tables of values and through interactive graphs
- save and restore your forecasting models by using project files in a SAS catalog and edit project information
- use some of the model development features to fit and select forecasting models for individual time series variables
This chapter introduces these topics and helps you get started using the system. Later chapters present these topics in greater detail and document more advanced features and options.
You can invoke the Forecasting System from the SAS Explorer window by opening an existing forecasting project. By default these projects are stored in the FMSPROJ catalog in the SASUSER library. Select SASUSER in the Explorer to display its contents. Then select FMSPROJ. This catalog is created the first time you use the Forecasting System. If you have saved projects, they appear in the Explorer with the forecasting graph icon, as shown in Figure 37.2. Double-click one of the projects, or select it with the right mouse button and then select Open from the pop-up menu, as shown in the figure. This opens the Forecasting System and opens the selected project.
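As a minimal sketch, the contents of the default project catalog can also be listed in code with PROC CATALOG. This assumes that SASUSER.FMSPROJ already exists, that is, that the Forecasting System has been used at least once in this SASUSER library.

```sas
/* List the entries in the default forecasting project catalog.   */
/* Assumes SASUSER.FMSPROJ exists; if it does not, PROC CATALOG   */
/* reports an error instead of a listing.                         */
proc catalog catalog=sasuser.fmsproj;
   contents;
run;
quit;
```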
To invoke the Forecasting System in the SAS desktop environment, select the Solutions menu from the menu bar, select Desktop, and then open the Analysis folder. You can run the Time Series Forecasting System or the Time Series Viewer directly, or you can drag and drop. Figure 37.3 illustrates dragging a data set (known as a table in the Desktop environment) and dropping it on the Forecasting icon. In this example, the tables reside in a user-defined folder called Time Series Data.
If you are using SAS/ASSIST software, select the Planning button and then select Forecasting from the pop-up menu. Any of these methods takes you to the Time Series Forecasting window, as shown in Figure 37.4.
At the top of the window is a data selection area for specifying a project file and the input data set containing historical data (the known past values) for the time series variables that you want to forecast. This area also contains buttons for opening viewers to explore your input data either graphically, one series at a time, or as a table, one data set at a time. The Project and Description fields are used to specify a project file for saving and restoring forecasting models created by the system. Using project files is discussed later, and these fields are ignored for now. The lower part of the window contains six buttons:
Develop Models
opens the Develop Models window, which you use to develop and fit forecasting models interactively for individual time series.
Fit Models Automatically
opens the Automatic Model Fitting window, which you use to search automatically for the best forecasting model for multiple series in the input data set.
Produce Forecasts
opens the Produce Forecasts window, which you use to compute forecasts for all the variables in the input data set for which forecasting models have been fit.
Manage Projects
opens the Manage Forecasting Project window, which lists the time series for which you have fit forecasting models. You can drill down on a series to see the models that have been fit. You can delete series or models from the project, re-evaluate or refit models, and explore models and forecasts graphically or in tabular form.
Exit
When SAS date values are used, the ID variable contains dates within the time periods corresponding to the observations. For example, for monthly data, the values for the time ID variable can be the date of the first day of the month corresponding to each observation, or the time ID variable can contain the date of the last day in the month. (Any date within the period serves as the time ID for the observation.) If your data set already contains a valid time ID variable with SAS date or datetime values, the next step is to specify this time ID variable in the Time ID field. If the time ID variable is named DATE, the system fills in the Time ID field automatically. If your data set does not contain a time ID, you must add a valid time ID variable before beginning the forecasting process. The Forecasting System provides features that make this easy to do. See Chapter 38, Creating Time ID Variables, for details.
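The statement that any date within the period can serve as the time ID can be checked with the INTNX function. The following sketch uses an arbitrary example month (March 2004) and illustrative variable names; it shows that the first and last day of a month both align to the same monthly period.

```sas
data _null_;
   first_day = '01mar2004'd;
   /* Last day of the month that contains first_day */
   last_day = intnx('month', first_day, 0, 'end');
   /* With an increment of 0, INTNX returns the start of the       */
   /* interval that contains the date, so both dates give the      */
   /* same value when they fall in the same month.                 */
   same_period = (intnx('month', first_day, 0) = intnx('month', last_day, 0));
   put first_day= date9. last_day= date9. same_period=;
run;
```

The log line should show `same_period=1`, since both dates fall in March 2004.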
To save your work, fill in the Project field with the name of a SAS catalog member in which the system will store the model information when you exit the system. Later, you will select the same catalog member name when you first enter the Forecasting System, and the model information will be reloaded. Note that any number of people can work with the same project file. If you are working on a forecasting project as part of a team, you should take care to avoid conflicting updates to the project file by different team members.
Summary
This is the basic outline of how the Forecasting System works. The system offers many other features and options that you might need to use (for example, the time range of the data used to fit models and how far into the future to forecast). These options will become apparent as you work with the Forecasting System. As an introductory example, the following sections use the Automatic Model Fitting and Produce Forecasts windows to perform automated forecasting of the series in an example data set.
The Libraries list shows the SAS librefs that are currently allocated in your SAS session. Initially, the SASUSER library is selected, and the SAS Data Sets list shows the data sets available in your SASUSER library. In the Libraries list, select the row that starts with SASHELP. The Data Set Selection window now lists the data sets in the SASHELP library, as shown in Figure 37.6.
Use the vertical scroll bar on the SAS Data Sets list to scroll down the list until the data set CITIQTR appears. Then select the CITIQTR row. This selects the data set SASHELP.CITIQTR as the input data set. Figure 37.7 shows the Data Set Selection window after selection of CITIQTR from the SAS Data Sets list.
Note that the Time ID field is now set to DATE and the Interval field is set to QTR. These fields are explained in the following section. Now select the OK button to complete selection of the CITIQTR data set. This closes the Data Set Selection window and returns to the Time Series Forecasting window, as shown in Figure 37.8.
If your data set does not contain a time ID variable with SAS date values, you can add a time ID variable using one of the windows described in Chapter 38, Creating Time ID Variables. Once the time ID variable is known, the Forecasting System examines the ID values to determine the time interval between observations. The data set SASHELP.CITIQTR contains quarterly observations. Therefore, the system determined that the data have a quarterly interval, and set the Interval field to QTR. If the system cannot determine the data frequency from the values of the time ID variable, you must specify the time interval between observations. You can specify the time interval by using the Interval combo box. In addition to the interval names provided in the pop-up list, you can type in more complex interval names to specify an interval that is a multiple of other intervals or that has date values in the middle of the interval (such as monthly data with time ID values falling on the 10th day of the month). See Chapter 3, Working with Time Series Data, and Chapter 4, Date Intervals, Formats, and Functions, for more information about time intervals, SAS date values, and ID variables for time series data sets.
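A rough way to confirm a data set's interval yourself is to count interval boundaries between consecutive time ID values with the INTCK function. This is only a sketch of the idea, not the procedure the Forecasting System uses internally; the data set name WORK.GAPS and the variable QUARTERS are illustrative.

```sas
/* Count quarter boundaries between consecutive DATE values. */
data work.gaps;
   set sashelp.citiqtr(keep=date);
   /* LAG returns missing for the first observation, so the  */
   /* first value of QUARTERS is missing.                    */
   quarters = intck('qtr', lag(date), date);
run;

/* If the data are regularly quarterly, every nonmissing gap is 1. */
proc freq data=work.gaps;
   tables quarters / missing;
run;
```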
The first part of the Automatic Model Fitting window confirms the project filename and the input data set name. The Series to Process field shows the number and lists the names of the variables in the input data set to which the Automatic Model Fitting process will be applied. By default, all numeric variables (except the time ID variable) are processed. However, you can specify that models be generated for only a select subset of these variables. Click the Select button to the right of the Series to Process field. This opens the Series to Process window, as shown in Figure 37.10.
Use the mouse and the CTRL key to select the personal consumption expenditures series (GC), the personal consumption expenditures for durable goods series (GCD), and the disposable personal income series (GYD), as shown in Figure 37.11. (Remember to hold down the CTRL key as you make the selections; otherwise, selecting a second series will deselect the first.)
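To preview which variables would be processed by default (all numeric variables except the time ID), you can list them in code. The following sketch uses PROC CONTENTS; the output data set name WORK.VARS is illustrative, and the time ID is assumed to be named DATE, as it is in SASHELP.CITIQTR.

```sas
/* Write variable names, types, and labels to a data set. */
proc contents data=sashelp.citiqtr
              out=work.vars(keep=name type label) noprint;
run;

/* In the PROC CONTENTS OUT= data set, TYPE=1 marks numeric variables. */
proc print data=work.vars noobs;
   where type = 1 and upcase(name) ne 'DATE';
run;
```

The labels in this listing are a convenient way to match series names such as GC, GCD, and GYD to their descriptions.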
Now select the OK button. This returns you to the Automatic Model Fitting window. The Series to Process field now shows the selected variables. The Selection Criterion field shows the goodness-of-fit measure that the Forecasting System will use to select the best fitting model for each series. By default, the selection criterion is the root mean squared error. To illustrate how you can control the selection criterion, this example uses the mean absolute percent error to select the best fitting models. Click the Select button to the right of the Selection Criterion field. This opens a list of statistics of fit, as shown in Figure 37.12.
Select Mean Absolute Percent Error and then select the OK button. The Automatic Model Fitting window now appears as shown in Figure 37.13.
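To make the two criteria concrete, the following sketch computes both from a handful of actual and predicted values, using the usual textbook definitions: the root mean squared error as the square root of the average squared error, and the mean absolute percent error as the average of 100 times the absolute error divided by the actual value. The four data rows are made-up numbers for illustration only, and the system's own computation might differ in details such as the treatment of missing values.

```sas
/* Hypothetical actual and predicted values, for illustration only. */
data work.fit;
   input actual predict;
   sq_err  = (actual - predict)**2;
   abs_pct = 100 * abs((actual - predict) / actual);
   datalines;
100 98
110 113
120 119
130 128
;
run;

/* RMSE penalizes large errors more heavily; MAPE is scale-free. */
proc sql;
   select sqrt(mean(sq_err)) as rmse,
          mean(abs_pct)      as mape
   from work.fit;
quit;
```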
Now that all the options are set appropriately, select the Run button. The Forecasting System now displays a notice, shown in Figure 37.14, confirming that models will be fit for three series using the automatic forecasting model search feature. This prompt is displayed because it is possible to fit models for a large number of series at once, which might take a lot of time. So the system gives you a chance to cancel if you accidentally ask to fit models for more series than you intended. Select the OK button.
The system now fits several forecasting models to each of the three series you selected. While the models are being fit, the Forecasting System displays notices indicating what it is doing so that you can observe its progress, as shown in Figure 37.15.
For each series, the system saves the model that produces the smallest mean absolute percent error. You can have the system save all the models fit by selecting Automatic Fit from the Options menu. After the Automatic Model Fitting process has completed, the results are displayed in the Automatic Model Fitting Results window, as shown in Figure 37.16.
This resizable window shows the list of series names and descriptive labels for the forecasting models chosen for them, as well as the values of the model selection criterion and other statistics of fit. Select the Close button. This returns you to the Automatic Model Fitting window. You can now fit models for other series in this data set or change to a different data set and fit models for series in the new data set. Select the Close button to return to the Time Series Forecasting window.
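The idea of fitting several candidate models and keeping, for each series, the one with the smallest mean absolute percent error can be sketched in code with PROC ESM. This is not the Forecasting System's own search, which considers a broader model list; the macro name FIT_ESM, the three candidate models, and the WORK data set names are assumptions chosen for illustration.

```sas
/* Hypothetical helper: fit one exponential smoothing model to the */
/* three series and keep only the statistics of fit.               */
%macro fit_esm(model);
   proc esm data=sashelp.citiqtr out=_null_ outstat=work.stat_&model;
      id date interval=qtr;
      forecast gc gcd gyd / model=&model;
   run;
%mend fit_esm;

%fit_esm(linear)
%fit_esm(damptrend)
%fit_esm(addwinters)

/* Stack the statistics and tag each row with its model name. */
data work.allstats;
   length model $ 12;
   set work.stat_linear(in=a)
       work.stat_damptrend(in=b)
       work.stat_addwinters(in=c);
   if a then model = 'linear';
   else if b then model = 'damptrend';
   else model = 'addwinters';
run;

/* Order by series name and ascending MAPE. */
proc sort data=work.allstats;
   by _name_ mape;
run;

/* Keep the first (smallest MAPE) row for each series. */
data work.best;
   set work.allstats;
   by _name_;
   if first._name_;
   keep _name_ model mape rmse;
run;
```

Because missing values sort first in SAS, a model whose MAPE could not be computed would be selected by this sketch; a production version would filter those rows out before sorting.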
The Produce Forecasts window shows the input data set information and indicates the variables in the input data set for which forecasting models exist. Forecasts will be produced for these series. If you want to produce forecasts for only some of these series, use the Select button at the right of the Series field to select the series to forecast. The Data Set field in the Forecast Output box contains the name of the SAS data set in which the system will store the forecasts. The default output data set is WORK.FORECAST. You can set the forecast horizon by using the controls on the line labeled Horizon. The default horizon is 12 periods. You can change it by specifying the number of periods, number of years, or the date of the last forecast period. Position the cursor in the date field and change the forecast ending date to 1 January 1996 by typing jan1996 and pressing the ENTER key. The window now appears as shown in Figure 37.18.
Now select the Run button to produce the forecasts. The system indicates that the forecasts have been stored in the output data set. Select OK to dismiss the notice.
Simple
The data set contains time ID variables and the forecast variables, and it contains one observation per time period. Observations for earlier time periods contain actual values copied from the input data set; later observations contain the forecasts.
Interleaved
The data set contains time ID variables, the variable TYPE, and the forecast variables. There are several observations per time period, with the meaning of each observation identified by the TYPE variable.
Concatenated
The data set contains the variable SERIES, time ID variables, and the variables ACTUAL, PREDICT, ERROR, UPPER, LOWER, and STD. There is one observation per time period per forecast series. The variable SERIES contains the name of the forecast series, and the data set is sorted by SERIES and DATE.
Figure 37.19 shows the default simple format. This form of the forecast data set contains time ID variables and the variables that you forecast. The forecast variables contain actual values or predicted values, depending on whether the date of the observation is within the range of data supplied in the input data set. Select File and Close to close the Viewtable window.
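A data set with the same general layout as the simple format (one observation per period, actual values followed by forecasts) can be produced in code. The sketch below uses PROC ESM with a linear exponential smoothing model and a 12-period horizon; the model choice and the output name WORK.FCST_SIMPLE are assumptions for illustration and do not reproduce the models selected by the Forecasting System.

```sas
/* Forecast three series 12 quarters ahead. The OUT= data set     */
/* holds the time ID and the forecast variables, with forecasts   */
/* appended after the historical values.                          */
proc esm data=sashelp.citiqtr out=work.fcst_simple lead=12;
   id date interval=qtr;
   forecast gc gcd gyd / model=linear;
run;

/* Inspect the first few observations of the result. */
proc print data=work.fcst_simple(obs=5);
run;
```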
Now select the Run button again. The system presents a warning notice reminding you that the data set WORK.FORECAST already exists and asking if you want to replace it. Select Replace. The forecasts are stored in the data set WORK.FORECAST again, this time in the Interleaved format. Dismiss the notice that the forecast was stored. Now select the Output button again. This opens a Viewtable window to display the data set, as shown in Figure 37.21.
In the interleaved format, there are several output observations for each input observation, identified by the TYPE variable. The values of the forecast variables for observations with different TYPE values are as follows.
ACTUAL
actual values copied from the input data set
ERROR
the difference between the actual and predicted values
LOWER
the lower confidence limits
PREDICT
the predicted values from the forecasting model. These are within-sample, one-step-ahead predictions for observations within the historical period, or multistep predictions for observations within the forecast period.
STD
the estimated standard deviations of the prediction errors
UPPER
the upper confidence limits
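The interleaved layout described above can be sketched outside of SAS. The following Python fragment is only an illustration, not the Forecasting System's code: the function name interleave, the sample values, and the use of PREDICT plus or minus 1.96 times STD for the limits are all assumptions made for this sketch. It expands one record per time period into one row per TYPE value, leaving ACTUAL and ERROR missing in the forecast period.

```python
# Order of TYPE values as listed in the text.
TYPES = ["ACTUAL", "ERROR", "LOWER", "PREDICT", "STD", "UPPER"]


def interleave(dates, actual, predict, std, z=1.96):
    """Return one (date, TYPE, value) row per TYPE value for each period.

    actual[i] is None for periods beyond the historical data.
    The limits predict +/- z * std are an assumption of this sketch.
    """
    rows = []
    for d, a, p, s in zip(dates, actual, predict, std):
        values = {
            "ACTUAL": a,
            "ERROR": None if a is None else a - p,
            "LOWER": p - z * s,
            "PREDICT": p,
            "STD": s,
            "UPPER": p + z * s,
        }
        rows.extend((d, t, values[t]) for t in TYPES)
    return rows


# Hypothetical data: one historical period and one forecast period.
rows = interleave(["1991Q1", "1991Q2"], [10.0, None], [9.0, 14.0], [1.0, 2.0])
for row in rows:
    print(row)
```

Each input period yields six rows, so a data set in this layout has several times as many observations as the simple format.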
This completes the example of how to use the Produce Forecasts window. Select File and Close to close the Viewtable window. Select the Close button to return to the Time Series Forecasting window.
Forecasting Projects
The system collects all the forecasting models you create, together with the options you set, into a package called a forecasting project. You can save this information in a SAS catalog entry and restore your work in later forecasting sessions. You can store any number of forecasting projects under different catalog entry names. To see how this works, select the Manage Projects button. This opens the Manage Forecasting Project window, as shown in Figure 37.23.
Figure 37.23 Manage Forecasting Project Window
The table in this window lists the series for which forecasting models have been fit, and it shows for each series the forecasting model used to produce the forecasts. This window provides several features that allow you to manage the information in your forecasting project. You can select a row of the table to drill down to the list of models fit to the series. Select the GYD row of the table, either by double-clicking with the mouse or by clicking once to highlight the table row and then selecting List Models from the toolbar or from the Tools menu. This opens the Model List window for this series, as shown in Figure 37.24.
Because the Automatic Model Fitting process kept only the best fitting model, only one model appears in the model list. You can fit and retain any number of models for each series, and all the models fit and kept will appear in the series model list. Select Close from the toolbar or from the File menu to return to the Manage Forecasting Project window.
Select the OK button. This returns you to the Project Management window and displays a message indicating that the project was saved. Select Close from the toolbar or from the File menu to return to the Time Series Forecasting window. Now select the Exit button. The system asks if you are sure you want to exit the system; select Yes. The forecasting application now terminates. Open the forecasting application again. A new project name is displayed by default. Now restore the forecasting project you saved previously. Select the Browse button to the right of the Project field. This opens the Forecasting Project File Selection window, as shown in Figure 37.26.
Select the WORK library from the Libraries list. The Catalogs list now shows all the SAS catalogs in the WORK library. Select the TEST catalog. The Projects list now shows the list of forecasting projects in the catalog TEST. So far, you have created only one project file, TESTPROJ; so TESTPROJ is the only entry in the Projects list, as shown in Figure 37.27.
Select TESTPROJ from the Projects list and then select the OK button. This returns you to the Time Series Forecasting window. The system loads the project information you saved in TESTPROJ and displays a message indicating this. The Project field is now set to WORK.TEST.TESTPROJ, and the description is the description you previously gave to TESTPROJ, as shown in Figure 37.28.
If you now select the Manage Projects button, you will see the list of series and forecasting models you created in the previous forecasting session.
Sharing Projects
If you plan to work with others on a forecasting project, you might need to consider how project information can be shared. The series, models, and results of your project are stored in a forecasting project (FMSPROJ) catalog entry in the location you specify, as illustrated in the previous section. You need only read access to the catalog to work with it, but you must have write access to save the project. Multiple users cannot open a project for update at the same time, but they can do so at different times if they all have write access to the catalog where it is stored.
Project options settings such as the model selection criterion and number of models to keep are stored in an SLIST catalog entry in the SASUSER or TSFSUSER library. Write access to this catalog is required. If you have only read access to the SASUSER library, you can use the -RSASUSER option when starting SAS. You will be prompted for a location for the TSFSUSER library, if it is not already assigned.
If you want to use TSFSUSER routinely, assign it before you start the Time Series Forecasting System. Select New from the SAS Explorer File menu. In the New Library window, type TSFSUSER for the name. Click the Browse button and select the directory or folder you want to use. Turn on the enable at startup option so this library will be assigned automatically in subsequent sessions.
The SASUSER library is typically used for private settings saved by individual users. This is the default location for project options. If a work group shares a single options catalog (SASUSER or TSFSUSER points to the same location for all users), then only one user can use the system at a time.
Introduction
From the Time Series Forecasting window, select the Browse button to the right of the Data Set field to open the Data Set Selection window. Select the USECON data set from the SASHELP library. This data set contains monthly data on the U.S. economy. Select OK to close the selection window. Now select the Develop Models button. This opens the Series Selection window, as shown in Figure 37.29. You can enlarge this window for easier viewing of lists of data sets and series.
Select the series CHEMICAL: Sales of Chemicals and Allied Products, and then select the OK button. This opens the Develop Models window, as shown in Figure 37.30.
The Data Set, Interval, and Series fields in the upper part of the Develop Models window indicate the series with which you are currently working. You can change the settings of these fields by selecting the Browse button. The Data Range, Fit Range, and Evaluation Range fields show the time period over which data are available for the current series, and what parts of that time period are used to fit forecasting models to the series and to evaluate how well the models fit the data. You can change the settings of these fields by selecting the Set Ranges button.
The bottom part of the Develop Models window consists of a table of forecasting models fit to the series. Initially, the list is empty, as indicated by the message No models. You can fit any number of forecasting models to each series and designate which one you want to use to produce forecasts.
Graphical tools are available for exploring time series and fitted models. The two icons below the Browse button access the Time Series Viewer and the Model Viewer. Select the left icon. This opens the Time Series Viewer window, as shown in Figure 37.31.
The Time Series Viewer displays a plot of the CHEMICAL series. The Time Series Viewer offers many useful features, which are explored in later sections. The Time Series Viewer appears in a separate resizable window. You can switch back and forth between the Time Series Viewer window and other windows. For now, return to the Develop Models window. You can close the Time Series Viewer window or leave it open. (To close the Time Series Viewer window, select Close from the toolbar or from the File menu.)
Fitting Models
To open a menu of model fitting choices, select Edit from the menu bar and then select Fit Model, or select Fit Models from List in the toolbar, or simply select a blank line in the table as shown in Figure 37.32.
The Forecasting System provides several ways to specify forecasting models. The eight choices given by the menu shown in Figure 37.32 are as follows:
Fit Models Automatically
performs for the current series the same automatic model selection process that the Automatic Model Fitting window applies to a set of series.
Fit Models from List
presents a list of commonly used forecasting models for convenient point-and-click selection.
Fit Smoothing Model
displays the Smoothing Model Specification window, which enables you to specify several kinds of exponential smoothing and Winters method forecasting models.
Fit ARIMA Model
displays the ARIMA Model Specification window, which enables you to specify many kinds of autoregressive integrated moving average (ARIMA) models, including seasonal ARIMA models and ARIMA models with regressors, transfer functions, and other predictors.
Fit Factored ARIMA Model
displays the Factored ARIMA Model Specification window, which enables you to specify more general ARIMA models, including subset models and models with unusual or multiple seasonal cycles. It also supports regressors, transfer functions, and other predictors.
Fit Custom Model
displays the Custom Model Specification window, which enables you to construct a forecasting model by specifying separate options for transforming the data, modeling the trend, modeling seasonality, modeling autocorrelation of the errors, and modeling the effect of regressors and other independent predictors.
Combine Forecasts
displays the Forecast Combination Model Specification window, which enables you to specify models that produce forecasts by combining, or averaging, the forecasts from other models. (This option is not available unless you have fit at least two models.)
Use External Forecasts
displays the External Forecast Model Specification window, which enables you to use judgmental or externally produced forecasts that have been saved in a separate series in the data set.
All of the forecasting models used by the system are ultimately specified through one of the four windows: Smoothing Model Specification, ARIMA Model Specification, Factored ARIMA Model Specification, or Custom Model Specification. You can specify the same models with either the ARIMA Model Specification window or the Custom Model Specification window, but the Custom Model Specification window can provide a more natural way to specify models for those who are less familiar with the Box-Jenkins style of time series model specification.
The Automatic Model feature, the Models to Fit window, and the Forecast Combination Model Specification window all deal with lists of forecasting models previously defined through the Smoothing Model, ARIMA Model, or Custom Model specification windows. These windows are discussed in detail in later sections.
To get started using the Develop Models window, select the Fit Models from List item from the menu shown in Figure 37.32. This opens the Models to Fit window, as shown in Figure 37.33.
You can select several models to fit at once by holding down the CTRL key as you make the selections. Select Linear Trend and Double (Brown) Exponential Smoothing, as shown in Figure 37.34, and then select the OK button.
The system fits the two models you selected. After the models are fit, the labels of the two models and their goodness-of-fit statistic are added to the model table, as shown in Figure 37.35.
Trend model, choose it by selecting the corresponding check box in the Forecast Model column. To use a different goodness-of-fit criterion, select the button with the current criterion name on it (Root Mean Square Error or Mean Absolute Percent Error). This opens the Model Selection Criterion window, as shown in Figure 37.36.
Figure 37.36 Model Selection Criterion Window
The system provides many measures of fit that you can use as the model selection criterion. To avoid confusion, only the most popular of the available fit statistics are shown in this window by default. To display the complete list, you can select the Show all option. You can control the subset of statistics listed in this window through the Statistics of Fit item in the Options menu on the Develop Models window. Initially, Root Mean Square Error is selected. Select R-Square and then select the OK button. This changes the fit statistic displayed in the model list, as shown in Figure 37.37.
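To make the three criteria named above concrete, the following Python sketch computes Root Mean Square Error, Mean Absolute Percent Error, and R-Square from their usual textbook definitions. This is an illustration only: the function names and sample numbers are hypothetical, and the exact formulas the system uses (for example, any degrees-of-freedom adjustment) might differ from these.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared error."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mape(actual, predicted):
    """Mean absolute percent error; actual values must be nonzero."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def r_square(actual, predicted):
    """One minus the ratio of error sum of squares to total sum of squares."""
    mean = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1.0 - sse / sst


# Hypothetical actual values and model predictions.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(rmse(actual, predicted), mape(actual, predicted), r_square(actual, predicted))
```

Smaller values are better for the two error measures, while larger values are better for R-Square, so the ranking of models can change when you switch the criterion.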
Now that you have fit some models to the series, you can use the Model Viewer button to take a closer look at the predictions of these models.
Model Viewer
In the Develop Models window, select the row in the table containing the Linear Trend model so that this model is highlighted. The model list should now appear as shown in Figure 37.38.
Note that the Linear Trend model is now highlighted, but the Forecast Model column still shows the Double Exponential Smoothing model as the model chosen to produce the final forecasts for the series. Selecting a model in the list means that this is the model that menu items such as View Model, Delete, Edit, and Refit will act upon. Choosing a model by selecting its check box in the Forecast Model column means that this model will be used by the Produce Forecasts process to generate forecasts. Now open the Model Viewer by selecting the right-hand icon under the Browse button, or by selecting Model Predictions in the toolbar or from the View menu. The Model Viewer displays the Linear Trend model, as shown in Figure 37.39.
This graph shows the linear trend line representing the model predicted values together with a plot of the actual data values, which fluctuate about the trend line.
If the model being viewed includes a transformation, prediction errors are defined as the difference between the transformed series actual values and model predictions. You can choose to graph instead the difference between the untransformed series values and untransformed model predictions, which are called model residuals. You can also graph normalized prediction errors or normalized model residuals. Use the Residual Plot Options submenu under the Options menu.
Autocorrelation Plots
Select the third icon from the top in the vertical toolbar. This switches the Viewer to display a plot of autocorrelations of the model prediction errors at different lags, as shown in Figure 37.41. Autocorrelations, partial autocorrelations, and inverse autocorrelations are displayed, with lines overlaid at plus and minus two standard errors. You can switch the graphs so that the bars represent significance probabilities by selecting the Correlation Probabilities item on the toolbar or from the View menu. For more information about the meaning and use of autocorrelation plots, see Chapter 7, The ARIMA Procedure.
The white noise test bar chart shows significance probabilities of the Ljung-Box chi-square statistic. Each bar shows the probability computed on autocorrelations up to the given lag. Longer bars favor rejection of the null hypothesis that the prediction errors represent white noise. In this example, they are all significant beyond the 0.001 probability level, so that you reject the null hypothesis. In other words, the high level of significance at all lags makes it clear that the linear trend model is inadequate for this series.
The second bar chart shows significance probabilities of the augmented Dickey-Fuller test for unit roots. For example, the bar at lag three indicates a probability of 0.0014, so that you reject the null hypothesis that the series is nonstationary. The third bar chart is similar to the second except that it represents the seasonal lags. Since this series has a yearly seasonal cycle, the bars represent yearly intervals.
You can select any of the bars to display an interpretation. Select the fourth bar of the middle chart. This displays the Recommendation for Current View, as shown in Figure 37.43. This window gives an interpretation of the test represented by the bar that was selected; it is significant, therefore a stationary series is likely. It also gives a recommendation: You do not need to perform a simple difference to make the series stationary.
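The autocorrelations and the Ljung-Box statistic discussed above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the system's implementation: the function names are hypothetical, the sample errors are an artificial slowly varying sine wave (the kind of pattern left behind when a trend model misses a cycle), and the value 22.458 is the standard upper 0.001 critical value of a chi-square distribution with 6 degrees of freedom, with no adjustment made for estimated model parameters.

```python
import math


def autocorrelations(errors, max_lag):
    """Sample autocorrelations r_1..r_max_lag of the prediction errors."""
    n = len(errors)
    mean = sum(errors) / n
    dev = [e - mean for e in errors]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


def ljung_box_q(errors, max_lag):
    """Ljung-Box statistic Q = n(n+2) * sum over k of r_k**2 / (n - k)."""
    n = len(errors)
    r = autocorrelations(errors, max_lag)
    return n * (n + 2) * sum(rk * rk / (n - k) for k, rk in enumerate(r, start=1))


# Artificial, strongly autocorrelated prediction errors.
errors = [math.sin(2.0 * math.pi * t / 24.0) for t in range(48)]
q6 = ljung_box_q(errors, 6)
print(q6, q6 > 22.458)
```

A statistic far above the critical value corresponds to the long bars in the white noise chart: the errors are clearly not white noise, so the model has left structure unexplained. A common large-sample approximation for the standard error of each autocorrelation under white noise is 1 divided by the square root of n; the bands the system draws might be computed differently.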
For the linear trend model, the parameters are the intercept and slope coefficients. The table shows the values of the fitted coefficients together with standard errors and t tests for the statistical significance of the estimates. The model residual variance is also shown.
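As a rough sketch of where such a table comes from, the following Python code fits a linear trend by ordinary least squares and reports the intercept, the slope, their standard errors, and the residual variance. The time index 1, 2, ..., n, the function names, and the sample series are assumptions of this sketch; the system might scale the trend variable or estimate the variance differently.

```python
import math


def fit_linear_trend(y):
    """Least-squares fit of y[t] = intercept + slope * t for t = 1..n."""
    n = len(y)
    t = range(1, n + 1)
    t_bar = (n + 1) / 2.0
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    residuals = [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]
    variance = sum(e * e for e in residuals) / (n - 2)  # residual variance
    se_slope = math.sqrt(variance / sxx)
    se_intercept = math.sqrt(variance * (1.0 / n + t_bar ** 2 / sxx))
    return {
        "intercept": (intercept, se_intercept),
        "slope": (slope, se_slope),
        "variance": variance,
    }


def t_statistic(estimate, std_error):
    """t value for the hypothesis that the coefficient is zero."""
    return estimate / std_error


# Hypothetical series: an upward trend with a small alternating disturbance.
fit = fit_linear_trend([11.0, 14.0, 21.0, 24.0, 31.0, 34.0, 41.0, 44.0])
print(fit["slope"], t_statistic(*fit["slope"]))
```

A large t value relative to its reference distribution indicates that the corresponding coefficient is statistically distinguishable from zero.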
Data Table
Select the last icon at the bottom of the vertical toolbar to the right of the graph. This switches the Viewer to display the forecast data set as a table, as shown in Figure 37.48.
To view the full data set, use the vertical and horizontal scroll bars on the data table or enlarge the window.
Chapter 38
The Forecasting System requires that the input data set contain a time ID variable. If the data you want to forecast are not in this form, you can use features of the Forecasting System to help you add time ID variables to your data set. This chapter shows examples of how to use these features.
Submit these SAS statements to create the data set NO_ID. This data set contains the single variable Y. Assume that Y is a quarterly series and starts in the first quarter of 1991. In the Time Series Forecasting window, use the Browse button to the right of the Data set field to bring up the Data Set Selection window. Select the WORK library, and then select the NO_ID data set. You must create a time ID variable for the data set. Click the Create button to the right of the Time ID field. This opens a menu of choices for creating the Time ID variable, as shown in Figure 38.1.
Select the first choice, Create from starting date and frequency. This opens the Time ID Creation from Starting Date window shown in Figure 38.2.
Enter the starting date, 1991:1, in the Starting Date field. Select the Interval list arrow and select QTR. The Interval value QTR means that the time interval between successive observations is a quarter of a year; that is, the data frequency is quarterly. Now select the OK button. The system prompts you for the name of the new data set. If you want to create a new copy of the input data set with the DATE variable added, enter a name for the new data set. If you want to replace the NO_ID data set with the new copy containing DATE, just select the OK button without changing the name. For this example, change the New name field to WITH_ID and select the OK button. The data set WITH_ID is created containing the series Y from NO_ID and the added ID variable DATE. The system returns to the Data Set Selection window, which now appears as shown in Figure 38.3.
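The idea of generating a time ID from a starting date and a frequency can be sketched in Python as follows. This is not how the system is implemented; the function name is hypothetical, and the sketch assumes that each quarterly DATE value is the first day of its quarter.

```python
from datetime import date


def quarter_start_dates(start_year, start_quarter, n_obs):
    """Return n_obs quarterly dates beginning at the given year and quarter."""
    dates = []
    for i in range(n_obs):
        q_index = (start_quarter - 1) + i      # quarters since Q1 of start_year
        year = start_year + q_index // 4
        month = 3 * (q_index % 4) + 1          # 1, 4, 7, or 10
        dates.append(date(year, month, 1))
    return dates


# Starting date 1991:1 with interval QTR, for the first five observations.
for d in quarter_start_dates(1991, 1, 5):
    print(d.isoformat())
```

Because the dates depend only on the observation position, this approach is appropriate only when the data set has no gaps and is already in time order.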
Select the Table button to see the new data set WITH_ID. This opens a VIEWTABLE window for the data set WITH_ID, as shown in Figure 38.4. Select File and Close to close the VIEWTABLE window.
Select the OK button. This opens the New Data Set Name window. Enter OBS_ID in the New data set name field. Enter T in the New ID variable name field. Now select the OK button. The new data set OBS_ID is created, and the system returns to the Data Set Selection window, which now appears as shown in Figure 38.6.
The Interval field for OBS_ID has the value 1. This means that the values of the time ID variable T increment by one between successive observations. Select the Table button to look at the OBS_ID data set, as shown in Figure 38.7.
Select File and Close to close the VIEWTABLE window. Select the OK button from the Data Set Selection window to return to the Time Series Forecasting window.
   data id_parts;
      input yr qtr y;
   datalines;
   91 1 10
   91 2 15
   91 3 20
   91 4 25
   92 1 30
   92 2 35
   92 3 40
   92 4 45
   93 1 50
   93 2 55
   93 3 60
   93 4 65
   94 1 70
   94 2 75
   94 3 80
   94 4 85
   ;
   run;
Submit these SAS statements to create the data set ID_PARTS. This data set contains the three variables YR, QTR, and Y. YR and QTR are ID variables that together date the observations, but each variable provides only part of the date information. Because the forecasting system requires a single dating variable containing SAS date values, you need to combine YR and QTR to create a single variable DATE. Type ID_PARTS in the Data Set field and press the ENTER key. (You could also use the Browse button to open the Data Set Selection window, as in the previous example, and complete this example from there.) Select the Create button at the right of the Time ID field. This opens the menu of Create Time ID choices, as shown in Figure 38.8.
Select the second choice, Create from existing variables. This opens the window shown in Figure 38.9.
In the Variables list, select YR. In the Date Part list, select YEAR as shown in Figure 38.10.
Now click the right-pointing arrow button. The variable YR and the part code YEAR are added to the Existing Time IDs list. Next select QTR from the Variables list, select QTR from the Date Part list, and click the arrow button. This adds the variable QTR and the part code QTR to the Existing Time IDs list, as shown in Figure 38.11.
Now select the OK button. This opens the New Data Set Name window. Change the New data set name field to NEWDATE, and then select the OK button. The data set NEWDATE is created, and the system returns to the Time Series Forecasting window with NEWDATE as the selected Data Set. The Time ID field is set to DATE, and the Interval field is set to QTR.
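Combining partial date variables such as YR and QTR into a single date can be illustrated with a short Python sketch. The function name is hypothetical, and two choices here are assumptions rather than documented behavior: two-digit years are read as 19xx, and the resulting date is the first day of the quarter.

```python
from datetime import date


def date_from_parts(yr, qtr, century=1900):
    """Combine a year part and a quarter part into one date value."""
    if not 1 <= qtr <= 4:
        raise ValueError("qtr must be between 1 and 4")
    year = yr + century if yr < 100 else yr    # assumption: 91 means 1991
    return date(year, 3 * (qtr - 1) + 1, 1)


# The first and last observations of the ID_PARTS example.
print(date_from_parts(91, 1).isoformat())
print(date_from_parts(94, 4).isoformat())
```

Unlike dates generated from observation positions, dates built from the parts remain correct even if some periods are missing from the data set.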
Chapter 39
This chapter explores the tools available through the Develop Models window for investigating the properties of time series and for specifying and fitting models. The first section shows you how to diagnose time series properties in order to determine the class of models appropriate for forecasting series with such properties. Later sections show you how to specify and fit different kinds of forecasting models.
Series Diagnostics
The series diagnostics tool helps you determine the kinds of forecasting models that are appropriate for the data series so that you can limit the search for the best forecasting model. The series diagnostics address these three questions: Is a log transformation needed to stabilize the variance? Is a time trend present in the data? Is there a seasonal pattern to the data?
The automatic model fitting process, which you used in the previous chapter through the Automatic Model Fitting window, performs series diagnostics and selects trial models from a list according to the results. You can also look at the diagnostic information and make your own decisions as to the kinds of models appropriate for the series.
The following example illustrates the series diagnostics features. Select Develop Models from the Time Series Forecasting window. Select the library SASHELP, the data set CITIMON, and the series RCARD. This series represents domestic retail sales of passenger cars. To look at this series, select View Series from the Develop Models window. This opens the Time Series Viewer window, as shown in Figure 39.1.
Select Diagnose Series from the Tools menu. You can do this from the Develop Models window or from the Time Series Viewer window. Figure 39.2 shows this from the Develop Models window.
Each of the three series characteristics (need for log transformation, presence of a trend, and seasonality) has a set of options for Yes, No, and Maybe. Yes indicates that the series has the characteristic and that forecasting models fit to the series should be able to model and predict this behavior. No indicates that you do not need to consider forecasting models designed to predict series with this characteristic. Maybe indicates that models with and without the characteristic should be considered. Initially, all these values are set to Maybe. To have the system diagnose the series characteristics, select the Automatic Series Diagnostics button. This runs the diagnostic routines described in Chapter 44, Forecasting Process Details, and sets the options according to the results. In this example, Trend and Seasonality are changed from Maybe to Yes, while Log Transform remains set to Maybe. These diagnostic criteria affect the models displayed when you use the Models to Fit window or the Automatic Model Selection model-fitting options described in the following section. You can set the criteria manually, according to your judgment, by selecting any of the options, whether you have used the Automatic Series Diagnostics button or not. For this exercise, leave them as set by the automatic diagnostics. Select the OK button to close the Series Diagnostics window.
Since you have performed series diagnostics, the models shown are the subset that fits the diagnostic criteria. Suppose you want to consider models other than those in this subset because you are undecided about including a trend in the model. Select the Show all models option. Now the entire model selection list is shown. Scroll through the list until you find Log Seasonal Exponential Smoothing, as shown in Figure 39.5.
This is a nontrended model, which seems a good candidate. Select this model, and then select the OK button. The model is fit to the series and then appears in the table with the value of the selected fit criterion, as shown in Figure 39.6.
You can edit the model list that appears in the Models to Fit window by selecting Options and Model Selection List from the menu bar or by selecting the Edit Model List toolbar icon. You can then delete models you are not interested in from the default list and add models using any of the model specification methods described in this chapter. When you save your project, the edited model selection list is saved in the project file. In this way, you can use the Select from List item and the Automatic Model Selection item to select models from a customized search set.
Using automatic selection, the system does not pause to warn you of model fitting errors, such as failure of the estimates to converge (you can track these using the audit trail feature). By default, only the best fitting model is kept. However, you can control the number of automatically fit models retained in the Develop Models list, and the following example shows how to do this. From the menu bar, choose Options and Automatic Fit. This opens the Automatic Model Selection Options window. Click the Models to Keep list arrow, and select All models, as shown in Figure 39.7. Now select OK.
Figure 39.7 Selecting Number of Automatic Fit Models to Keep
Next, select Fit Models Automatically by clicking the middle of the table or using the toolbar or Edit menu. The Automatic Model Selection window appears, showing the diagnostic criteria in effect and the number of models to be fit, as shown in Figure 39.8.
Select the OK button. After the models have been fit, all of them appear in the table, in addition to the model which you fit earlier, as shown in Figure 39.9.
The Smoothing Model Specification window consists of several parts. At the top is the series name and a field for the label of the model you are specifying. The model label is filled in with an automatically generated label as you specify options. You can type over the automatic label with your own label for the model. To restore the automatic label, enter a blank label. The Smoothing Methods box lists the different methods available. Below the Smoothing Methods box is the Transformation field, which is used to apply the smoothing method to transformed series values. The Smoothing Weights box specifies how the smoothing weights are determined. By default, the smoothing weights are automatically set to optimize the fit of the model to the data. See Chapter 44, Forecasting Process Details, for more information about how the smoothing weights are fit. Under smoothing methods, select Winters Method - Additive. Notice the smoothing weights box to the right. The third item, Damping, is grayed out, while the other items, Level, Trend, and Season, show the word Optimize. This tells you that these three smoothing weights are applicable to the smoothing method that you selected and that the system is currently set to optimize these weights for you. Next, specify a transformation using the Transformation list. A menu of transformation choices pops up, as shown in Figure 39.11.
You can specify a logarithmic, logistic, square root, or Box-Cox transformation. For this example, select Square Root from the list. The Transformation field is now set to Square Root. This means that the system will first take the square roots of the series values, apply the additive version of the Winters method to the square root series, and then produce the predictions for the original series by squaring the Winters method predictions (and multiplying by a variance factor if the Mean Prediction option is set in the Forecast Options window). See Chapter 44, Forecasting Process Details, for more information about predictions from transformed models. The Smoothing Model Specification window should now appear as shown in Figure 39.12. Select the OK button to fit the model. The model is added to the table of fitted models in the Develop Models window.
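The transform, forecast, and back-transform sequence described above can be sketched in Python. To keep the example short, simple exponential smoothing with a fixed weight stands in for the additive Winters method with optimized weights, and the variance factor used for mean predictions is omitted; both simplifications, and the function names, are assumptions of this sketch.

```python
import math


def ses_forecast(series, weight):
    """One-step-ahead forecast from simple exponential smoothing."""
    level = series[0]
    for y in series[1:]:
        level = weight * y + (1.0 - weight) * level
    return level


def forecast_with_sqrt_transform(series, weight=0.3):
    """Forecast on the square root scale, then square the result."""
    transformed = [math.sqrt(y) for y in series]   # step 1: transform
    pred_t = ses_forecast(transformed, weight)     # step 2: forecast transformed series
    return pred_t ** 2                             # step 3: back-transform


# Hypothetical two-point series.
print(forecast_with_sqrt_transform([4.0, 16.0], weight=0.5))   # via square roots
print(ses_forecast([4.0, 16.0], 0.5))                          # no transformation
```

The two printed values differ, which shows that forecasting a transformed series and back-transforming is not the same as forecasting the original series directly.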
This ARIMA Model Specification window is structured according to the Box and Jenkins approach to time series modeling. You can specify the same time series models with the Custom Model Specification window and the ARIMA Model Specification window, but the windows are structured differently, and you may find one more convenient than the other. At the top of the ARIMA Model Specification window is the name and label of the series and the label of the model you are specifying. The model label is filled in with an automatically generated label as you specify options. You can type over the automatic label with your own label for the model. To restore the automatic label, enter a blank label.
Using the ARIMA Model Specification window, you can specify autoregressive (p), differencing (d), and moving average (q) orders for both simple and seasonal factors. You can specify transformations with the Transformation list. You can also specify whether an intercept is included in the ARIMA model. In addition to specifying seasonal and nonseasonal ARIMA processes, you can also specify predictor variables and other terms as inputs to the model. ARIMA models with inputs are sometimes called ARIMAX models or Box-Tiao models. Another term for this kind of model is dynamic regression.
In the lower part of the ARIMA Model Specification window is the list of predictors to the model (initially empty). You can specify predictors by using the Add button. This opens a menu of different kinds of independent effects, as shown in Figure 39.14.
Figure 39.14 Add Predictors Menu
The kinds of predictor effects allowed include time trends, regressors, adjustments, dynamic regression (transfer functions), intervention effects, and seasonal dummy variables. How to use different kinds of predictors is explained in Chapter 41, Using Predictor Variables.
As an example, in the ARIMA Options box, set the order of differencing d to 1 and the moving average order q to 2. You can either type in these values or click the arrows and select the values from pop-up lists. These selections specify an ARIMA(0,1,2) or IMA(1,2) model. (See Chapter 7, The ARIMA Procedure, for more information about the notation used for ARIMA models.) Notice that the model label at the top is now IMA(1,2) NOINT, meaning that the data are differenced once and a second-order moving-average term is included with no intercept. In the Seasonal ARIMA Options box, set the seasonal moving-average order Q to 1. This adds a first-order moving-average term at the seasonal (12 month) lag. Finally, select Log in the Transformation combo box. The model label is now Log ARIMA(0,1,2)(0,0,1)s NOINT, and the window appears as shown in Figure 39.15.
Figure 39.15 Log ARIMA(0,1,2)(0,0,1)s Specified
Select the OK button to fit the model. The model is fit and added to the Develop Models table.
The Factored ARIMA Model Specification window is similar to the ARIMA Model Specification window and has the same features, but it uses a more general specification of the autoregressive (p), differencing (d), and moving-average (q) terms. To specify these terms, select the corresponding Set button, as shown in Figure 39.16. For example, to specify autoregressive terms, select the first Set button. This opens the AR Polynomial Specification Window, shown in Figure 39.17.
To add AR polynomial terms, select the New button. This opens the Polynomial Specification Window, shown in Figure 39.18. Specify the first lag you want to include by using the Lag spin box, then select the Add button. Repeat this process, adding each lag you want to include in the current list. All lags must be specified. For example, if you add only lag 3, the model contains only lag 3, not 1 through 3. As an example, add lags 1 and 3, then select the OK button. The AR Polynomial Specification Window now shows (1,3) in the list of polynomials. Now select New again. Add lags 6 and 12 and select OK. Now the AR Polynomial Specification Window shows (1,3) and (6,12) as shown in Figure 39.17. Select OK to close this window. The Factored ARIMA Model Specification Window now shows the factored model p=(1,3)(6,12). Use the same technique to specify the q terms, or moving-average part of the model. There is no limit to the number of lags or the number of factors you can include in the model.
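To see what a factored specification such as p=(1,3)(6,12) implies, it can help to multiply the factors out. The Python sketch below does this for polynomials in the backshift operator B, using the common convention that a factor with lags 1 and 3 is 1 - phi1*B - phi3*B**3. The coefficient values are hypothetical, and the function names are not part of SAS.

```python
def poly_from_lags(coeffs_by_lag):
    """Build 1 - sum(phi_k * B**k) as a {power: coefficient} dictionary."""
    poly = {0: 1.0}
    for lag, phi in coeffs_by_lag.items():
        poly[lag] = -phi
    return poly


def multiply(p, q):
    """Multiply two polynomials stored as {power: coefficient} dictionaries."""
    out = {}
    for i, a in p.items():
        for j, b in q.items():
            out[i + j] = out.get(i + j, 0.0) + a * b
    return out


factor1 = poly_from_lags({1: 0.5, 3: 0.2})     # hypothetical coefficients
factor2 = poly_from_lags({6: 0.3, 12: 0.1})    # hypothetical coefficients
product = multiply(factor1, factor2)
lags = sorted(k for k, v in product.items() if k > 0 and v != 0.0)
print(lags)  # [1, 3, 6, 7, 9, 12, 13, 15]
```

The expanded model involves eight lags but is driven by only four parameters, which is why a factored specification is more parsimonious than listing all eight lags in a single polynomial.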
To specify differencing lags, select the middle Set button to open the Differencing Specification window. Specify lags by using the spin box and add them to the list with the Add button. When you select OK to close the window, the differencing lags appear after d= in the Factored ARIMA Model Specification window, within a single pair of parentheses.
You can use the Factored ARIMA Model Specification window to specify any model that you can specify with the ARIMA Model and Custom Model windows, but the notation is more similar to that of the ARIMA procedure (see Chapter 7, "The ARIMA Procedure").
Consider as an example the classic Airline model fit to the International Airline Travel series, SASHELP.AIR. This is a factored model with one moving-average term at lag one and one moving-average term at the seasonal lag, with first-order differencing at the simple and seasonal lags. Using the ARIMA Model Specification window, you specify the value 1 for the q and d terms and also for the Q and D terms, which represent the seasonal lags. For monthly data, the seasonal lags represent lag 12, since a yearly seasonal cycle is assumed.
By contrast, the Factored ARIMA Model Specification window makes no assumptions about seasonal cycles. The Airline model is written as IMA d=(1,12) q=(1)(12) NOINT. To specify the differencing terms, add the values 1 and 12 in the Differencing Specification window and select OK. Then select New in the MA Polynomial Specification window, add the value 1, and select OK. To add the factored term, select New again, add the value 12, and select OK. Remember to select No in the Intercept radio box, since it is not selected by default. Select OK to close the Factored ARIMA Model Specification window and fit the model.
You can show that the results are the same as they are when you specify the model by using the ARIMA Model Specification window and when you select Airline Model from the default model list. If you are familiar with the ARIMA procedure (Chapter 7, "The ARIMA Procedure"), you might want to turn on the Show Source Statements option before fitting the model, then examine the procedure source statements in the log window after fitting the model.
The strength of the Factored ARIMA Specification approach lies in its ability to construct unusual ARIMA models, such as the following:
Subset models
   These are models of order n, where fewer than n lags are specified. For example, an AR order 3 model might include lags 1 and 3 but not lag 2.
Unusual seasonal cycles
   For example, a monthly series might cycle two or four times per year instead of just once.
Multiple cycles
   For example, a daily sales series might peak on a certain day each week and also once a year at the Christmas season. Given sufficient data, you can fit a three-factor model, such as IMA d=(1) q=(1)(7)(365).
Models with high-order lags take longer to fit and often fail to converge. To save time, select the Conditional Least Squares or Unconditional Least Squares estimation method (see Figure 39.16). Once you have narrowed down the list of candidate models, change to the Maximum Likelihood estimation method.
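As a rough illustration of how the Airline model and the estimation-method advice map onto ARIMA procedure statements, consider the sketch below. It is not a copy of the source statements the system writes to the log. It assumes the untransformed AIR variable in SASHELP.AIR and simply shows a quick conditional least squares fit followed by a maximum likelihood fit.

```sas
proc arima data=sashelp.air;
   /* d=(1,12): difference at lag 1 and at lag 12 */
   identify var=air(1,12) noprint;
   /* Screening fit: conditional least squares is usually quicker */
   estimate q=(1)(12) noint method=cls;
   /* Fit for the retained candidate: maximum likelihood */
   estimate q=(1)(12) noint method=ml;
   /* A three-factor model for daily data would follow the same  */
   /* pattern, for example q=(1)(7)(365) after differencing.     */
run;
quit;
```

Both ESTIMATE statements refer to the same differenced series from the preceding IDENTIFY statement, so the two sets of estimates can be compared directly.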
You can specify the same time series models with the Custom Model Specification window and the ARIMA Model Specification window, but the windows are structured differently, and you might find one more convenient than the other.
At the top of the Custom Model Specification window is the name and label of the series and the label of the model you are specifying. The model label is filled in with an automatically generated label as you specify options. You can type over the automatic label with your own label for the model. To restore the automatic label, enter a blank label.
The middle part of the Custom Model Specification window consists of four fields: Transformation, Trend Model, Seasonal Model, and Error Model. These fields allow you to specify the model in four parts. Each part specifies how a different aspect of the pattern of the time series is modeled and predicted.
The Predictors list at the bottom of the Custom Model Specification window allows you to include different kinds of predictor variables in the forecasting model. The Predictors feature for the Custom Model Specification window is like the Predictors feature for the ARIMA Model Specification window, except that time trend predictors are provided through the Trend Model field and seasonal dummy variable predictors are provided through the Seasonal Model field.
To illustrate how to use the Custom Model Specification window, the following example specifies the same model you fit by using the ARIMA Model Specification window. First, specify the data transformation to use. Select Log using the Transformation combo box. Second, specify how to model the trend in the series. Select First Difference in the Trend Model combo box, as shown in Figure 39.20.
Figure 39.20 Trend Model Options
Next, specify how to model the seasonal pattern in the series. Select Seasonal ARIMA in the Seasonal Model combo box, as shown in Figure 39.21.
This opens the Seasonal ARIMA Model Options window, as shown in Figure 39.22.
Specify a first-order seasonal moving-average term by typing 1 or by selecting 1 from the Moving Average: Q= combo box pop-up menu, and then select the OK button. Finally, specify how to model the autocorrelation pattern in the model prediction errors. Select the Set button to the right of the Error Model field. This opens the Error Model Options window, as shown in Figure 39.23. This window allows you to specify an ARMA error process. Set the Moving Average order q to 2, and then select the OK button.
The Custom Model Specification window should now appear as shown in Figure 39.24. The model label at the top of the Custom Model Specification window should now read Log ARIMA(0,1,2)(0,0,1)s NOINT, just as it did when you used the ARIMA Model Specification window.
Now that you have seen how the Custom Model Specification window works, select Cancel to exit the window without fitting the model. This should return you to the Develop Models window.
selection process. The model is then considered automatically whenever you use the automatic model selection feature for any series. To edit the model selection list, select Model Selection List from the Options menu as shown in Figure 39.25, or select the Edit Model List toolbar icon.
Figure 39.25 Model Selection List Option
This selection brings up the Model Selection List editor window, as shown in Figure 39.26. This window consists of the model selection list and an Auto Fit column, which controls for each model whether the model is included in the list of models used by the automatic model selection process.
To add a model to the list, select Add Model from the Edit menu and then select Smoothing Model, ARIMA Model, Factored ARIMA Model, or Custom Model from the submenu. Alternatively, click the corresponding icon on the toolbar. As an example, select Smoothing Model. This brings up the Smoothing Model Specification window. Note that the series name is -Null-. This means that you are not specifying a model to be fit to a particular series, but are specifying a model to be added to the selection list for later reference. Specify a smoothing model. For example, select Simple Smoothing and then select the Square Root transformation. The window appears as shown in Figure 39.27.
Select the OK button to add the model to the end of the model selection list and return to the Model Selection List window, as shown in Figure 39.28. You can now select the Fit Models from List model-fitting option to use the edited selection list.
Figure 39.28 Model Added to Selection List
If you want to delete one or more models from the list, select the model labels to highlight them in the list. Click a second time to clear a selected model. Then select Delete from the Edit pulldown menu, or the corresponding toolbar icon. As an example, delete the Square Root Simple Exponential Smoothing model that you just added.
The Model Selection List editor window gives you a lot of flexibility for managing multiple model lists, as explained in the section "Model Selection List Editor Window" on page 2611. For example, you can create your own model lists from scratch or modify or combine previously saved model lists and those provided with the software, and you can save them and designate one as the default for future projects. Now select Close from the File menu (or the Close icon) to close the Model Selection List editor window.
At the top of the Forecast Combination window is the name and label of the series and the label of the model you are specifying. The model label is filled in with an automatically generated label as you specify options. You can type over the automatic label with your own label for the model. To restore the automatic label, enter a blank label.
The middle part of the Forecast Combination window consists of the list of models that you have fit to the series. This table shows the label and goodness-of-fit measure for each model and the combining weight assigned to the model. The Weight column controls how much weight is given to each model in the combined forecasts. A missing weight means that the model is not used. Initially, all the models have missing weight values.
You can enter the weight values you want to use in the Weight column. Alternatively, you can select models from the Model Description column, and weight values for the models you select are set automatically. To remove a model from the combination, select it again. This resets its weight value to missing.
At the bottom of the Forecast Combination window are two buttons: Normalize Weights and Fit Regression Weights. The Normalize Weights button adjusts the nonmissing weight values so that they sum to one. The Fit Regression Weights button uses linear regression to compute the weight values that produce the combination of model predictions with the best fit to the series. If no models are selected, the Fit Regression Weights button fits weights for all the models in the list. You can compute regression weights for only some of the models by first selecting the models you want to combine and then selecting Fit Regression Weights. In this case, only the nonmissing Weight values are replaced with regression weights.
As an example of how to combine forecasting models, select all the models in the list. After you have finished selecting the models, all the models in the list should have equal weight values, which implies a simple average of the forecasts. Now select the Fit Regression Weights button. The system performs a linear regression of the series on the predictions from the models with nonmissing weight values and replaces the weight values with the estimated regression coefficients. These are the combining weights that produce the smallest mean square prediction error within the sample. The Forecast Combination window should now appear as shown in Figure 39.30. (Note that some of the regression weight values are negative.)
Figure 39.30 Combining Models
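The idea behind regression weights can be sketched outside the windowing system with the REG procedure. Everything in this sketch is an assumption made for illustration: the data set PREDS, the variables Y, PRED1, and PRED2, and the numbers themselves are made up, and the NOINT option is a guess, since the text does not say whether the system includes an intercept in this regression.

```sas
/* Hypothetical data: actual values and in-sample predictions */
/* from two already-fitted models (all numbers are made up).  */
data work.preds;
   input y pred1 pred2;
   datalines;
142.1 141.0 143.5
139.6 140.8 138.9
145.0 143.2 146.1
150.2 149.0 151.7
151.1 152.3 150.0
154.3 153.1 155.9
;
run;

/* Regress the series on the model predictions. The estimated */
/* coefficients play the role of combining weights.           */
proc reg data=work.preds;
   model y = pred1 pred2 / noint;   /* NOINT is an assumption */
run;
quit;
```

Because the coefficients are unconstrained least squares estimates, nothing forces them to be positive or to sum to one, which is consistent with the negative weights noted in the text.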
Select the OK button to fit the combined model. Now the Develop Models window shows this model to be the best fitting according to the root mean square error, as shown in Figure 39.31.
Notice that the combined model has a smaller root mean square error than any one of the models included in the combination. The confidence limits for forecast combinations are produced by taking a weighted average of the mean square prediction errors for the component forecasts, ignoring the covariance between the prediction errors.
To include external forecasts in the Time Series Forecasting process, you must first supply the external forecast as a variable in the input data set. You then specify a special kind of forecasting model whose predictions are identical to the external forecast recorded in the data set. As an example, suppose you have 12 months of sales data and five months of sales forecasts based on a consensus opinion of the sales staff. The following statements create a SAS data set containing made-up numbers for this situation.
   data widgets;
      /* Actual sales through May 1995; staff forecasts afterward */
      input date monyy5. sales staff;
      format date monyy5.;
      label sales = "Widget Sales"
            staff = "Sales Staff Consensus Forecast";
      datalines;
   jun94 142.1 .
   jul94 139.6 .
   aug94 145.0 .
   sep94 150.2 .
   oct94 151.1 .
   nov94 154.3 .
   dec94 158.7 .
   jan95 155.9 .
   feb95 159.2 .
   mar95 160.8 .
   apr95 162.0 .
   may95 163.3 .
   jun95 . 166.
   jul95 . 168.
   aug95 . 170.
   sep95 . 171.
   oct95 . 177.
   ;
   run;
Submit the preceding statements in the SAS Program Editor window. From the Time Series Forecasting window, select Develop Models. In the Series Selection window, select the data set WORK.WIDGETS and the variable SALES. The Develop Models window should now appear as shown in Figure 39.32.
Now select Edit, Fit Model, and External Forecasts from the menu bar of the Develop Models window, as shown in Figure 39.33, or the Use External Forecasts toolbar icon.
This selection opens the External Forecast Model Specification window. Select the STAFF variable as shown in Figure 39.34.
Select the OK button. The external forecast model is now fit and added to the Develop Models list, as shown in Figure 39.35.
You can now use this model for comparison with the predictions from other forecasting models that you fit, or you can include it in a forecast combination model. Note that no fitting is actually performed for an external forecast model. The predictions of the external forecast model are simply the values of the external forecast series read from the input data set. The goodness-of-fit statistics for such models depend on the values that the external forecast series contains for observations within the period of fit. In this case, no STAFF values are given for past periods, and therefore the fit statistics for the model are missing.
Chapter 40
The Time Series Forecasting System provides a variety of tools for identifying potential forecasting models and for choosing the best fitting model. It allows you to decide how much control you want to have over the process, from a hands-on approach to one that is completely automated. This chapter begins with an exploration of the tools available through the Series Viewer and Model Viewer. It presents an example of identifying models graphically and exercising your knowledge of model properties. The remainder of the chapter shows you how to compare models by using a variety of statistics and by controlling the fit and evaluation time ranges. It concludes by showing you how to refit existing models and how to compare models using hold-out samples.
From the Series Selection window, select SASHELP as the library, WORKERS as the data set, and MASONRY as the time series, and then click the Graph button. The Time Series Viewer displays a plot of the series, as shown in Figure 40.2.
Select the Zoom In icon, the first one on the window's horizontal toolbar. Notice that the mouse cursor changes shape and that "Note: Click on a corner of the region, then drag to the other corner" appears on the message line. Outline an area, as shown in Figure 40.3, by clicking the mouse at the upper-left corner, holding the button down, dragging to the lower-right corner, and releasing the button.
You can repeat the process to zoom in still further. To return to the previous view, select the Zoom Out icon, the second icon on the window's horizontal toolbar.
The third icon on the horizontal toolbar is used to link or unlink the viewer window. By default, the viewer is linked, meaning that it is automatically updated to reflect selection of a different time series. To see this, return to the Series Selection window by clicking on it or using the Window menu or Next Viewer toolbar icon. Select the Electric series in the Time Series Variables list box. Notice that the Time Series Viewer window is updated to show a plot of the ELECTRIC series. Select the Link/Unlink icon if you prefer to unlink the viewer so that it is not automatically updated in this way. Successive selections toggle between the linked and unlinked state. A note on the message line informs you of the state of the Time Series Viewer window.
When a Time Series Viewer window is linked, selecting View Series again makes the linked Viewer window active. When no Time Series Viewer window is linked, selecting View Series opens an additional Time Series Viewer window. You can bring up as many Time Series Viewer windows as you want.
Having seen the plot in Figure 40.2, you might suspect that the series is nonstationary and seasonal. You can gain further insight into this by examining the sample autocorrelation function (ACF), partial autocorrelation function (PACF), and inverse autocorrelation function (IACF) plots. To switch the display to the autocorrelation plots, select the second icon from the top on the vertical toolbar at the right side of the Time Series Viewer. The plot appears as shown in Figure 40.5.
Figure 40.5 Sample Autocorrelation Plots
Each bar represents the value of the correlation coefficient at the given lag. The overlaid lines represent confidence limits computed at plus and minus two standard errors. You can switch the graphs to show significance probabilities by selecting Correlation Probabilities under the Options pull-down menu, or by selecting the Toggle ACF Probabilities toolbar icon. The slow decline of the ACF suggests that first differencing might be warranted. To see the effect of first differencing, select the simple difference icon, the fifth icon from the left on the window's horizontal toolbar. The plot now appears as shown in Figure 40.6.
Since the ACF still displays slow decline at seasonal lags, seasonal differencing is appropriate (in addition to the first differencing already applied). Select the Seasonal Difference icon, the sixth icon from the left on the horizontal toolbar. The plot now appears as shown in Figure 40.7.
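The same sequence of diagnostic views can be approximated with IDENTIFY statements in the ARIMA procedure, which report the ACF, IACF, and PACF for each level of differencing. This is a sketch of an equivalent batch workflow rather than what the viewer runs internally. It assumes the MASONRY variable in SASHELP.WORKERS and a seasonal lag of 12 for this monthly series.

```sas
proc arima data=sashelp.workers;
   identify var=masonry;         /* original series              */
   identify var=masonry(1);      /* first difference             */
   identify var=masonry(1,12);   /* first and seasonal (lag 12)  */
run;
quit;
```

Comparing the three sets of autocorrelations side by side mirrors the steps of toggling the simple and seasonal difference icons in the viewer.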
When you select the OK button, the model is fit and you are returned to the Develop Models window. Click on an empty part of the table and choose Fit Models from List from the pop-up menu. Select Airline Model from the window. (Airline Model is a common name for the ARIMA(0,1,1)(0,1,1)s model, which is often used for seasonal data with a linear trend.) Select the OK button. Once the model has been fit, the table shows the two models and their root mean square errors. Notice that the Airline Model provides only a slight improvement over the differencing model, ARIMA(0,1,0)(0,1,0)s. Select the first row to highlight the differencing model, as shown in Figure 40.9.
Now select the View Selected Model Graphically button, below the Browse button at the right side of the Develop Models window. The Model Viewer window appears, showing the actual data and model predictions for the MASONRY series. (Note that predicted values are missing for the first 13 observations due to simple and seasonal differencing.) To examine the ACF plot for the model prediction errors, select the third icon from the top on the vertical toolbar. For this model, the prediction error ACF is the same as the ACF of the original data with first differencing and seasonal differencing applied. This is apparent if you bring the Time Series Viewer back into view for comparison.
Return to the Develop Models window by clicking on it or using the Window pull-down menu or the Next Viewer toolbar icon. Select the second row of the table in the Develop Models window to highlight the Airline Model. The Model Viewer is automatically updated to show the prediction error ACF of the newly selected model, as shown in Figure 40.10.
Figure 40.10 Prediction Error ACF Plot for the Airline Model
Another helpful tool available within the Model Viewer is the parameter estimates table. Select the fifth icon from the top of the vertical toolbar. The table gives the parameter estimates for the two moving-average terms in the Airline Model, as well as the model residual variance, as shown in Figure 40.11.
You can adjust the column widths in the table by dragging the vertical borders of the column titles with the mouse. Notice that neither of the parameter estimates is significantly different from zero at the 0.05 level of significance, since Prob>|t| is greater than 0.05. This suggests that the Airline Model should be discarded in favor of the more parsimonious differencing model, which has no parameters to estimate.
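A comparable parameter estimates table can be produced in batch with the ARIMA procedure. The sketch below is an approximation under stated assumptions: it uses the MASONRY variable in SASHELP.WORKERS, a seasonal lag of 12, and maximum likelihood estimation, and its output need not match the Model Viewer figures exactly, because the system's estimation settings and time ranges are not reproduced here.

```sas
proc arima data=sashelp.workers;
   /* Simple and seasonal differencing, as in the Airline Model */
   identify var=masonry(1,12) noprint;
   /* One MA term at lag 1 and one at lag 12, no intercept */
   estimate q=(1)(12) noint method=ml;
run;
quit;
```

The ESTIMATE output lists each moving-average estimate with its standard error, t value, and approximate probability, along with the variance estimate, so the significance check described above can be repeated from the listing.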
The statistics available as model selection criteria are a subset of the statistics available for informational purposes. To access the entire set, select Options from the menu bar, and then select Statistics of Fit. The Statistics of Fit Selection window appears, as shown in Figure 40.12.
Figure 40.12 Statistics of Fit
By default, five of the more well known statistics are selected. You can select and deselect statistics by clicking the check boxes in the left column. For this exercise, select All, and notice that all the check boxes become checked. Select the OK button to close the window. Now if you choose Statistics of Fit in the Model Viewer window, all of the statistics are shown for the selected model.
To change the model selection criterion, click the Root Mean Square Error button or select Options from the menu bar and then select Model Selection Criterion. Notice that most of the statistics of fit are shown, but those that are not relevant to model selection, such as number of observations, are not shown. Select Schwarz Bayesian Information Criterion and click OK. Since this statistic puts a high penalty on models with larger numbers of parameters, the ARIMA(0,1,0)(0,1,0)s model comes out with the better fit.
Notice that changing the selection criterion does not automatically select the model that is best according to that criterion. You can always choose the model you want to use for forecasts by selecting its check box in the Forecast Model column.
Now bring up the Model Selection Criterion window again and select Akaike Information Criterion. This statistic puts a lesser penalty on number of parameters, and the Airline Model comes out as the better fitting model.
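The different penalties can be made concrete with a small DATA step. This sketch assumes the common least squares forms of the two criteria, n*log(SSE/n) + 2k for the Akaike criterion and n*log(SSE/n) + k*log(n) for the Schwarz criterion. The values of N and SSE are made up and are not taken from the MASONRY example.

```sas
/* Hypothetical values: N observations, K parameters, and a   */
/* made-up error sum of squares SSE for each of two models.   */
data _null_;
   n = 54;
   do k = 0, 2;
      if k = 0 then sse = 1000;   /* model with no parameters  */
      else sse = 900;             /* model with two parameters */
      aic = n*log(sse/n) + 2*k;        /* penalty of 2 per parameter      */
      sbc = n*log(sse/n) + k*log(n);   /* penalty of log(n) per parameter */
      put k= sse= aic= sbc=;
   end;
run;
```

With these made-up numbers, the two-parameter model has the smaller AIC while the zero-parameter model has the smaller SBC, the same kind of reversal the text describes. The reversal arises because log(54) is close to 4, so each parameter costs about twice as much under the Schwarz criterion.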
Whether or not you have a right mouse button, the same choices are available under Edit and View from the menu bar. If the model viewer has been invoked, it is automatically updated to show the selected model, unless you have unlinked the viewer by using the Link/Unlink toolbar button.
Select the highlighted model in the table again. Notice that it is no longer highlighted. When no models are highlighted, the right mouse button pop-up menu changes, and items on the menu bar that apply to a selected model become unavailable. For example, you can choose Edit from the menu bar, but you cannot choose the Edit Model or Delete Model selections unless you have highlighted a model in the table.
When you select the check box in the Forecast Model column of the table, the model in that row becomes the forecasting model. This is the model that is used the next time forecasts are generated by choosing View Forecasts or by using the Produce Forecasts window. Note that this forecasting model flag is automatically set when you use Fit Automatic Model or when you fit an individual model that fits better, using the current selection criterion, than the current forecasting model.
Comparing Models
Select Tools and Compare Models from the menu bar. This displays the Model Fit Comparison table, as shown in Figure 40.14.
The two models you have fit are shown as Model 1 and Model 2. When there are more than two models, you can bring any two of them into the table by selecting the up and down arrows. In this way, it is easy to do pairwise comparisons on any number of models, looking at as many statistics of fit as you like. Since you previously chose to display all statistics of fit, all of them are shown in the comparison table. Use the vertical scroll bar to move through the list. After you have examined the model comparison table, select the Close button to return to the Develop Models window.
button, or select Options and Time Ranges from the menu bar. This brings up the Time Ranges Specification window, as shown in Figure 40.15.
Figure 40.15 Time Ranges Specification Window
For this example, suppose the early data in the series is unreliable, and you want to use the range June 1978 to the latest available for both model fitting and model evaluation. You can either type JUN1978 in the From column for Period of Fit and Period of Evaluation, or you can advance these dates by clicking the right-pointing arrows. The outer arrow advances the date by a large amount (in this case, by a year), and the inner arrow advances it by a single period (in this case, by a month). Once you have changed the Period of Fit and the Period of Evaluation to JUN1978 in the From column, select the OK button to return to the Develop Models window. Notice that these time ranges are updated at the top of the window, but the models already fit have not been affected. Your changes to the time ranges affect subsequently fit models.
Notice that setting the hold-out sample to 12 automatically sets the fit range to JUN1978-JUL1981 and the evaluation range to AUG1981-JUL1982. If you had set the period of fit and period of evaluation to these ranges, the hold-out sample would have been automatically set to 12 periods. Select the OK button to return to the Develop Models window. Now refit the models again. Select Tools and Compare Models to compare the models now that they have been fit to the period June 1978 through July 1981 and evaluated for the hold-out sample period August 1981 through July 1982. Note that the fit statistics for the hold-out sample are based on one-step-ahead forecasts. (See "Statistics of Fit" in Chapter 44, "Forecasting Process Details.") As shown in Figure 40.17, the ARIMA(0,1,0)(0,1,0)s model now seems to provide a better fit to the data than does the Airline model. It should be noted that the results can be quite different if you choose a different size hold-out sample.
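The split between the period of fit and the hold-out period can be pictured with a DATA step that routes observations to two data sets by date. This is only a sketch of the date arithmetic. The output data set names FITRANGE and HOLDOUT are hypothetical, and the step does not reproduce the one-step-ahead evaluation that the system performs.

```sas
/* Split SASHELP.WORKERS into a fit range (JUN1978-JUL1981)   */
/* and a 12-month hold-out range (AUG1981-JUL1982).           */
data work.fitrange work.holdout;
   set sashelp.workers;
   where date >= '01jun1978'd;             /* drop the early data */
   if date <= '01jul1981'd then output work.fitrange;
   else output work.holdout;
run;
```

A model estimated from FITRANGE alone never sees the HOLDOUT observations, which is what makes the hold-out statistics a test of out-of-sample performance.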
Chapter 41
Forecasting models predict the future values of a series by using two sources of information: the past values of the series and the values of other time series variables. Other variables used to predict a series are called predictor variables. Predictor variables that are used to predict the dependent series can be variables in the input data set, such as regressors and adjustment variables, or they can be special variables computed by the system as functions of time, such as trend curves, intervention variables, and seasonal dummies.
You can specify seven different types of predictors in forecasting models by using the ARIMA Model or Custom Model Specification windows. You cannot specify predictor variables with the Smoothing Model Specification window. Figure 41.1 shows the menu of options for adding predictors to an ARIMA model that is opened by clicking the Add button. The Add menu for the Custom Model Specification window is similar.
These types of predictors are as follows.
Linear Trend
   adds a variable that indexes time as a predictor series. A straight line time trend is fit to the series by regression when you specify a linear trend.
Trend Curve
   provides a menu of various functions of time that you can add to the model to fit nonlinear time trends. The Linear Trend option is a special case of the Trend Curve option for which the trend curve is a straight line.
Regressors
   allows you to predict the series by regressing it on other variables in the data set.
Adjustments
   allows you to specify other variables in the data set that supply adjustments to the forecast.
Dynamic Regressor
   allows you to select a predictor variable from the input data set and specify a complex model for the way that the predictor variable affects the dependent series.
Interventions
   allows you to model the effect of special events that intervene to change the pattern of the dependent series. Examples of intervention effects are strikes, tax increases, and special sales promotions.
Seasonal Dummies
   adds seasonal dummy variables as predictors.
You can add any number of predictors to a forecasting model, and you can combine predictor variables with other model options. The following sections explain these seven kinds of predictors in greater detail and provide examples of their use. The examples illustrate these different kinds of predictors by using series in the SASHELP.USECON data set. Select the Develop Models button from the main window. Select the data set SASHELP.USECON and select the series PETROL. Then select the View Series Graphically button from the Develop Models window. The plot of the example series PETROL appears as shown in Figure 41.2.
Figure 41.2 Sales of Petroleum and Coal
Linear Trend
From the Develop Models window, select Fit ARIMA Model. From the ARIMA Model Specification window, select Add and then select Linear Trend from the menu (shown in Figure 41.1).
The description for the linear trend item shown in the Predictors list has the following meaning. The first part of the description, Trend Curve, describes the type of predictor. The second part, _LINEAR_, gives the variable name of the predictor series. In this case, the variable is a time index that the system computes. This variable is included in the output forecast data set. The final part, Linear Trend, describes the predictor.
Notice that the model you have specified consists only of the time index regressor _LINEAR_ and an intercept. Although this window is normally used to specify ARIMA models, in this case no ARIMA model options are specified, and the model is a simple regression on time.
Select the OK button. The Linear Trend model is fit and added to the model list in the Develop Models window. Now open the Model Viewer by using the View Model Graphically icon or the Model Predictions item under the View pull-down menu or toolbar. This displays a plot of the model predictions and actual series values, as shown in Figure 41.4. The predicted values lie along the least squares trend line.
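A simple regression on time of this kind can be sketched with a DATA step and the REG procedure. The sketch makes several assumptions: the time index T is built from the observation number, which may not match how the system computes _LINEAR_, and the data set name PETROL_TREND is hypothetical.

```sas
/* Build a time index. T is a hypothetical stand-in for the   */
/* system-computed _LINEAR_ variable.                         */
data work.petrol_trend;
   set sashelp.usecon;
   t = _n_;                 /* 1, 2, 3, ... in date order */
run;

/* Intercept plus linear time trend, fit by least squares */
proc reg data=work.petrol_trend;
   model petrol = t;
run;
quit;
```

The fitted values from this regression lie on a straight line, which corresponds to the least squares trend line seen in the Model Viewer plot.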
These trend curves work in much the same way as the Linear Trend option (which is a special case of a trend curve and one of the choices on the menu), but with the Trend Curve menu you have a choice of various nonlinear time trends. Select Quadratic Trend. This adds a quadratic time trend to the Predictors list, as shown in Figure 41.6.
Now select the OK button. The quadratic trend model is fit and added to the list of models in the Develop Models window. The Model Viewer displays a plot of the quadratic trend model, as shown in Figure 41.7.
This curve does not fit the PETROL series very well, but the View Model plot illustrates how time trend models work. You might want to experiment with different trend models to see what the different trend curves look like. Some of the trend curves require transforming the dependent series. When you specify one of these curves, a notice is displayed reminding you that a transformation is needed, and the Transformation field is automatically filled in. Therefore, you cannot control the Transformation specification when some kinds of trend curves are specified. See the section "Time Trend Curves" on page 2515 in Chapter 44, "Forecasting Process Details," for more information about the different trend curves.
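A quadratic time trend amounts to regressing the series on a time index and its square. The sketch below illustrates this under assumptions: T and T2 are hypothetical variable names, PETROL_QUAD is a hypothetical data set name, and the index is built from the observation number rather than taken from the system.

```sas
/* Time index and its square (hypothetical names T and T2) */
data work.petrol_quad;
   set sashelp.usecon;
   t  = _n_;
   t2 = t*t;
run;

/* Intercept, linear term, and quadratic term */
proc reg data=work.petrol_quad;
   model petrol = t t2;
run;
quit;
```

Other trend curves follow the same pattern with different functions of T, and some of them are fit to a transformed dependent series, which is why the Transformation field is filled in automatically for those curves.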
Regressors
From the Develop Models window, select Fit ARIMA Model. From the ARIMA Model Specification window, select Add and then select Regressors from the menu (shown in Figure 41.1).
This displays the Regressors Selection window, as shown in Figure 41.8. This window allows you to select any number of other series in the input data set as regressors to predict the dependent series.
Figure 41.8 Regressors Selection Window
Select CHEMICAL, Sales: Chemicals and Allied Products, and VEHICLES, Sales: Motor Vehicles and Parts. (Note: You do not need to use the CTRL key when selecting more than one regressor.) Then select the OK button. The two variables you selected are added to the Predictors list as regressor type predictors, as shown in Figure 41.9.
You must have forecasts of the future values of the regressor variables in order to use them as predictors. To do this, you can specify a forecasting model for each regressor, have the system automatically select forecasting models for the regressors, or supply predicted future values for the regressors in the input data set.
Even if you have supplied future values for a regressor variable, the system requires a forecasting model for the regressor. Future values that you supply in the input data set take precedence over predicted values from the regressor's forecasting model when the system computes the forecasts for the dependent series.
Select the OK button. The system starts to fit the regression model but then stops and displays a warning that the regressors that you selected do not have forecasting models, as shown in Figure 41.10.
Regressors ! 2521
If you want the system to create forecasting models automatically for the regressor variables by using the automatic model selection process, select the OK button. If not, you can select the Cancel button to abort tting the regression model. For this example, select the OK button. The system now performs the automatic model selection process for CHEMICAL and VEHICLES. The selected forecasting models for CHEMICAL and VEHICLES are added to the model lists for those series. If you switch the current time series in the Develop Models window to CHEMICAL or VEHICLES, you will see the model that the system selected for that series. Once forecasting models have been t for all regressors, the system proceeds to t the regression model for PETROL. The tted regression model is added to the model list displayed in the Develop Models window.
Adjustments
An adjustment predictor is a variable in the input data set that is used to adjust the forecast values produced by the forecasting model. Unlike a regressor, an adjustment variable does not have a regression coefficient. No model fitting is performed for adjustments. Nonmissing values of the adjustment series are simply added to the model prediction for the corresponding period. Missing adjustment values are ignored. If you supply adjustment values for observations within the period of fit, the adjustment values are subtracted from the actual values, and the model is fit to these adjusted values. To add adjustments, select Add and then select Adjustments from the pop-up menu (shown in Figure 41.1). This displays the Adjustments Selection window. The Adjustments Selection window functions the same as the Regressors Selection window (which is shown in Figure 41.8). You can select any number of adjustment variables as predictors. Unlike regressors, adjustments do not require forecasting models for the adjustment variables. If a variable that is used as an adjustment does have a forecasting model fit to it, the adjustment variable's forecasting model is ignored when the variable is used as an adjustment. You can use forecast adjustments to account for expected future events that have no precedent in the past and so cannot be modeled by regression. For example, suppose you are trying to forecast the sales of a product, and you know that a special promotional campaign for the product is planned during part of the period you want to forecast. If such sales promotion programs have been frequent in the past, then you can record the past and expected future level of promotional efforts in a variable in the data set and use that variable as a regressor in the forecasting model. However, if this is the first sales promotion of its kind for this product, you have no way to estimate the effect of the promotion from past data. In this case, the best you can do is to make an educated guess at the effect the promotion will have and add that guess to what your forecasting model would predict in the absence of the special sales campaign.
Adjustments are also useful for making judgmental alterations to forecasts. For example, suppose you have produced forecast sales data for the next 12 months. Your supervisor believes that the forecasts are too optimistic near the end and asks you to prepare a forecast graph in which the numbers that you have forecast are reduced by 1000 in the last three months. You can accomplish this task by editing the input data set so that it contains observations for the actual data range of sales plus 12 additional observations for the forecast period, and a new variable called, for example, ADJUSTMENT. Because adjustment values are added to the model prediction, the variable ADJUSTMENT contains the value -1000 for the last three observations and is missing for all other observations. You fit the same model previously selected for forecasting by using the ARIMA Model Specification or Custom Model Specification window, but with an adjustment added that uses the variable ADJUSTMENT. Now when you graph the forecasts by using the Model Viewer, the last three periods of the forecast are reduced by 1000. The confidence limits are unchanged, which helps draw attention to the fact that the adjustments to the forecast deviate from what would be expected statistically.
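The arithmetic of the supervisor example can be sketched as follows. This is a minimal Python illustration under stated assumptions: the forecast values are made up, missing adjustment values are represented by None, and the function name is hypothetical. It shows only the additive rule described above, not how the system stores or applies adjustments internally.

```python
def apply_adjustments(model_forecast, adjustments):
    """Add nonmissing adjustment values to the model prediction for the
    corresponding period; missing (None) adjustments are ignored."""
    return [
        forecast if adj is None else forecast + adj
        for forecast, adj in zip(model_forecast, adjustments)
    ]

# Hypothetical 12-month forecast produced by the forecasting model.
model_forecast = [5000.0 + 100.0 * month for month in range(12)]

# ADJUSTMENT is missing for the first nine months and -1000 for the last three.
adjustment = [None] * 9 + [-1000.0] * 3

adjusted_forecast = apply_adjustments(model_forecast, adjustment)
```

Only the last three periods change, and each is lowered by exactly 1000, which matches the behavior described for the forecast graph.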
Dynamic Regressor
Selecting Dynamic Regressor from the Add Predictors menu (shown in Figure 41.1) allows you to specify a complex time series model of the way that a predictor variable influences the series that you are forecasting. When you specify a predictor variable as a simple regressor, only the current period value of the predictor affects the forecast for the period. By specifying the predictor with the Dynamic Regression option, you can use past values of the predictor series, and you can model effects that take place gradually. Dynamic regression models are an advanced feature that you are unlikely to find useful unless you have studied the theory of statistical time series analysis. You might want to skip this section if you are not trained in time series modeling. The term dynamic regression was introduced by Pankratz (1991) and refers to what Box and Jenkins (1976) named transfer function models. In dynamic regression, you have a time series model, similar to an ARIMA model, that predicts how changes in the predictor series affect the dependent series over time. The dynamic regression model relates the predictor variable to the expected value of the dependent series in the same way that an ARIMA model relates the fluctuations of the dependent series about its conditional mean to the random error term (which is also called the innovation series). Refer to Pankratz (1991) and Box and Jenkins (1976) for more information about dynamic regression or transfer function models. See also Chapter 7, "The ARIMA Procedure." From the Develop Models window, select Fit ARIMA Model. From the ARIMA Model Specification window, select Add and then select Linear Trend from the menu (shown in Figure 41.1). Now select Add and select Dynamic Regressor. This displays the Dynamic Regressors Selection window, as shown in Figure 41.11.
You can select only one predictor series when specifying a dynamic regression model. For this example, select VEHICLES, Sales: Motor Vehicles and Parts. Then select the OK button. This displays the Dynamic Regression Specification window, as shown in Figure 41.12.
This window consists of four parts. The Input Transformations fields enable you to transform or lag the predictor variable. For example, you might use the lagged logarithm of the variable as the predictor series. The Order of Differencing fields enable you to specify simple and seasonal differencing of the predictor series. For example, you might use changes in the predictor variable instead of the variable itself as the predictor series. The Numerator Factors and Denominator Factors fields enable you to specify the orders of simple and seasonal numerator and denominator factors of the transfer function. Simple regression is a special case of dynamic regression in which the dynamic regression model consists of only a single regression coefficient for the current value of the predictor series. If you select the OK button without specifying any options in the Dynamic Regression Specification window, a simple regressor will be added to the model. For this example, use the Simple Order combo box for Denominator Factors and set its value to 1. The window now appears as shown in Figure 41.13.
This model is equivalent to regression on an exponentially weighted infinite distributed lag of VEHICLES (in the same way an MA(1) model is equivalent to single exponential smoothing). Select the OK button to add the dynamic regressor to the model predictors list. In the ARIMA Model Specification window, the Predictors list should now contain two items, a linear trend and a dynamic regressor for VEHICLES, as shown in Figure 41.14.
This model is a multiple regression of PETROL on a time trend variable and an infinite distributed lag of VEHICLES. Select the OK button to fit the model. As with simple regressors, if VEHICLES does not already have a forecasting model, an automatic model selection process is performed to find a forecasting model for VEHICLES before the dynamic regression model for PETROL is fit.
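The equivalence between a first-order denominator factor and an exponentially weighted distributed lag can be checked numerically. The sketch below is an illustration only: the coefficient names omega (numerator) and delta (denominator), their values, and the input series are assumptions, and the predictor is taken to be zero before the start of the sample. In practice the system estimates the coefficients from the data.

```python
def recursive_effect(x, omega, delta):
    """Transfer function omega / (1 - delta*B) applied recursively:
    effect[t] = delta * effect[t-1] + omega * x[t]."""
    effect, previous = [], 0.0
    for value in x:
        previous = delta * previous + omega * value
        effect.append(previous)
    return effect

def distributed_lag_effect(x, omega, delta):
    """The same effect written as an exponentially weighted lag:
    effect[t] = sum over k of omega * delta**k * x[t-k]."""
    return [
        sum(omega * delta ** k * x[t - k] for k in range(t + 1))
        for t in range(len(x))
    ]

x = [1.0, 0.0, 2.0, 0.0, 0.0]          # hypothetical predictor values
recursive = recursive_effect(x, omega=0.5, delta=0.8)
explicit = distributed_lag_effect(x, omega=0.5, delta=0.8)
```

Both lists agree up to rounding error, and the weights on past values of the predictor decline geometrically at the rate delta.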
Interventions
An intervention is a special indicator variable, computed automatically by the system, that identifies time periods affected by unusual events that influence or intervene in the normal path of the time series you are forecasting. When you add an intervention predictor, the indicator variable of the intervention is used as a regressor, and the impact of the intervention event is estimated by regression analysis. To add an intervention to the Predictors list, you must use the Intervention Specification window to specify the time or times that the intervening event took place and to specify the type of intervention.
You can add interventions either through the Interventions item of the Add action or by selecting Tools from the menu bar and then selecting Define Interventions. Intervention specifications are associated with the series. You can specify any number of interventions for each series, and once you define interventions you can select them for inclusion in forecasting models. If you select the Include Interventions option in the Options menu, any interventions that you have previously specified for a series are automatically added as predictors to forecasting models for the series. From the Develop Models window, invoke the series viewer by selecting the View Series Graphically icon or Series under the View menu. This displays the Time Series Viewer, as was shown in Figure 41.2. Note that the trend in the PETROL series shows several clear changes in direction. The upward trend in the first part of the series reverses in 1981. There is a sharp drop in the series towards the end of 1985, after which the trend is again upwardly sloped. Finally, in 1991 the series takes a sharp upward excursion but quickly returns to the trend line. You might have no idea what events caused these changes in the trend of the series, but you can use these patterns to illustrate the use of intervention predictors. To do this, you fit a linear trend model to the series, but modify that trend line by adding intervention effects to model the changes in trend you observe in the series plot.
The top of the Intervention Specification window shows the current series and the label for the new intervention (initially blank). At the right side of the window is a scrollable table showing the values of the series. This table helps you locate the dates of the events you want to model. At the left of the window is an area titled Intervention Specification that contains the options for defining the intervention predictor. The Date field specifies the time that the intervention occurs. You can type a date value in the Date field, or you can set the Date value by selecting a row from the table of series values at the right side of the window. The area titled Type of Intervention controls the kind of indicator variable constructed to model the intervention effect. You can specify the following kinds of interventions:
Point
is used to indicate an event that occurs in a single time period. An example of a point event is a strike that shuts down production for part of a time period. The value of the intervention's indicator variable is zero except for the date specified.
Step
is used to indicate a continuing event that changes the level of the series. An example of a step event is a change in the law, such as a tax rate increase. The value of the intervention's indicator variable is zero before the date specified and 1 thereafter.
Ramp
is used to indicate a continuing event that changes the trend of the series. The value of the intervention's indicator variable is zero before the date specified, and it increases linearly with time thereafter.
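The three indicator variables can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the system's code: periods are numbered from zero, the point indicator is taken to be 1 at the event date, the step indicator is taken to be 1 from the event date onward, and the ramp is taken to be zero at the event date and to grow by 1 each period after it. The function name is hypothetical.

```python
def intervention_indicator(n_periods, event_index, kind):
    """Build the indicator variable for a point, step, or ramp intervention."""
    values = []
    for t in range(n_periods):
        if kind == "point":
            values.append(1.0 if t == event_index else 0.0)
        elif kind == "step":
            values.append(1.0 if t >= event_index else 0.0)
        elif kind == "ramp":
            values.append(float(max(0, t - event_index)))
        else:
            raise ValueError(f"unknown intervention type: {kind}")
    return values

point = intervention_indicator(6, 2, "point")
step = intervention_indicator(6, 2, "step")
ramp = intervention_indicator(6, 2, "ramp")
```

Each list is then used as an ordinary regressor, so the estimated coefficient measures the size of the pulse, the shift in level, or the change in slope.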
The areas titled Effect Time Window and Effect Decay Pattern specify how to model the effect that the intervention has on the dependent series. These options are not used for simple interventions; they are discussed later in this chapter.
Now that you know the date that the trend reversal occurred, enter that date in the Date field of the Intervention Specification window. Select Ramp as the type of intervention. The window should now appear as shown in Figure 41.17.
Select the OK button. This adds the intervention to the list of interventions for the PETROL series, and returns you to the Interventions for Series window, as shown in Figure 41.18.
This window allows you to select interventions for inclusion in the forecasting model. Since you need to define other interventions, select the Add button. This returns you to the Intervention Specification window (shown in Figure 41.15).
Select the table row for February 1986 to set the Date field. Select Step as the intervention type. The window should now appear as shown in Figure 41.19.
Figure 41.19 Step Intervention Specified
Select the OK button to add this intervention to the list for the series. Since the trend reverses again after the drop, add a ramp intervention for the same date as the step intervention. Select Add from the Interventions for Series window. Enter FEB86 in the Date field, select Ramp, and then select the OK button.
shift in level or trend. For this pattern, you need a complex intervention model for an event that causes a sharp rise followed by a rapid return to the previous trend line. To specify this model, use the Effect Time Window and Effect Decay Pattern options. The Effect Time Window option controls the number of lags of the intervention's indicator variable used to model the effect of the intervention on the dependent series. For a simple intervention, the number of lags is zero, which means that the effect of the intervention is modeled by fitting a single regression coefficient to the intervention's indicator variable. When you set the number of lags greater than zero, regression coefficients are fit to lags of the indicator variable. This allows you to model interventions whose effects take place gradually, or to model rebound effects. For example, severe weather might reduce production during one period but cause an increase in production in the following period as producers struggle to catch up. You could model this by using a point intervention with an effect time window of 1 lag. This would fit two coefficients for the intervention, one for the immediate effect and one for the delayed effect. The Effect Decay Pattern option controls how the effect of the intervention dissipates over time. None specifies that there is no gradual decay: for point interventions, the effect ends immediately; for step and ramp interventions, the effect continues indefinitely. Exp specifies that the effect declines at an exponential rate. Wave specifies that the effect declines like an exponentially damped sine wave (or as the sum of two exponentials, depending on the fit to the data). If you are familiar with time series analysis, these options might be clearer if you note that together the Effect Time Window and Effect Decay Pattern options define the numerator and denominator orders of a transfer function or dynamic regression model for the indicator variable of the intervention. See the section "Dynamic Regressor" on page 2523 for more information. For this example, select 2 lags as the value of the Effect Time Window option, and select Exp as the Effect Decay Pattern option. The window should now appear as shown in Figure 41.20.
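The role of the Effect Time Window option can be pictured as adding lagged copies of the indicator variable to the regression. The following Python sketch illustrates only that idea; the function name is hypothetical, periods are numbered from zero, and the decay pattern (the denominator part of the transfer function) is deliberately left out.

```python
def lagged_indicators(indicator, n_lags):
    """Return the indicator and its first n_lags lags, one list per
    regression coefficient. Lagged values before the start of the
    series are taken to be zero."""
    columns = []
    for lag in range(n_lags + 1):
        columns.append([0.0] * lag + indicator[: len(indicator) - lag])
    return columns

# Point intervention at period 2 of a six-period series, with a window of 1 lag.
point = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
regressors = lagged_indicators(point, n_lags=1)
```

With one lag there are two columns, so two coefficients are fit: one for the immediate effect in the event period and one for the delayed effect in the period that follows, as in the severe weather example.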
Select the OK button. This returns you to the ARIMA Model Specification window, which now lists items in the Predictors list, as shown in Figure 41.22.
Select the OK button to fit this model. After the model is fit, bring up the Model Viewer. You will see a plot of the model predictions, as shown in Figure 41.23.
You can use the Zoom In feature to take a closer look at how the complex intervention effect fits the excursion in the series starting in August 1990.
or level change. Step and ramp interventions are valuable when there are other patterns in the data, such as seasonality, autocorrelated errors, and error variance, that are stable across the changes in level or trend. Step and ramp interventions enable you to fit seasonal and error autocorrelation patterns to the whole series while fitting the trend only to the latter part of the series. Point interventions are a useful tool for dealing with outliers in the data. A point intervention will fit the series value at the specified date exactly, and it has the effect of removing that point from the analysis. When you specify an effect time window, a point intervention will exactly fit as many additional points as the number of lags specified.
Seasonal Dummies
A Seasonal Dummies predictor is a special feature that adds to the model seasonal indicator or dummy variables to serve as regressors for seasonal effects. From the Develop Models window, select Fit ARIMA Model. From the ARIMA Model Specification window, select Add and then select Seasonal Dummies from the menu (shown in Figure 41.1). A Seasonal Dummies input is added to the Predictors list, as shown in Figure 41.24.
Select the OK button. A model consisting of an intercept and 11 seasonal dummy variables is fit and added to the model list in the Develop Models window. This is effectively a mean model with a separate mean for each month. Now return to the Model Viewer, which displays a plot of the model predictions and actual series values, as shown in Figure 41.25. This is obviously a poor model for this series, but it serves to illustrate how seasonal dummy variables work.
Now select the parameter estimates icon, the fifth from the top on the vertical toolbar. This displays the Parameter Estimates table, as shown in Figure 41.26.
Since the data for this example are monthly, the Seasonal Dummies option added 11 seasonal dummy variables. These include a dummy regressor variable that is 1.0 for January and 0 for other months, a regressor that is 1.0 only for February, and so forth through November. Because the model includes an intercept, no dummy variable is added for December. The December effect is measured by the intercept, while the effect of other seasons is measured by the difference between the intercept and the estimated regression coefficient for the season's dummy variable. The same principle applies for other data frequencies: the Seasonal Dummy 1 parameter always refers to the first period in the seasonal cycle; and, when an intercept is present in the model, there is no seasonal dummy parameter for the last period in the seasonal cycle.
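The construction of the dummy variables can be sketched as follows. This Python fragment is an illustration only; the function name and the 1-based numbering of periods within the seasonal cycle are assumptions. It builds one dummy for every period except the last, which is the layout described above for a model that includes an intercept.

```python
def seasonal_dummies(period_in_cycle, season_length=12):
    """Return the season_length - 1 dummy values for one observation.
    Dummy s is 1.0 when the observation falls in period s of the cycle;
    the last period of the cycle has no dummy of its own."""
    return [
        1.0 if period_in_cycle == s else 0.0
        for s in range(1, season_length)
    ]

january = seasonal_dummies(1)      # Seasonal Dummy 1 is on
november = seasonal_dummies(11)    # Seasonal Dummy 11 is on
december = seasonal_dummies(12)    # all dummies are zero
```

For a December observation every dummy is zero, so only the intercept contributes to the prediction. The same function covers other frequencies, for example season_length=4 for quarterly data.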
References
Box, G.E.P. and Jenkins, G.M. (1976), Time Series Analysis: Forecasting and Control, San Francisco: Holden-Day.
Pankratz, Alan (1991), Forecasting with Dynamic Regression Models, New York: John Wiley & Sons.
Chapter 42
Command Reference
Contents
TSVIEW Command and Macro
   Syntax
   Examples
FORECAST Command and Macro
   Syntax
   Examples
Syntax
The TSVIEW command has the following form:
TSVIEW [options] ;
The following options can be specified for the command and the macro.
DATA=data set name
specifies the name of the SAS data set containing the input data.
VAR=time series variable name
specifies the series variable name. It must be a numeric variable contained in the data set.
ID=time id variable name
specifies the time ID variable name for the data set. If the ID= option is not specified, the system attempts to locate the variables named DATE, DATETIME, and TIME in the data set specified by the DATA= option.
INTERVAL=interval name
Examples
TSVIEW Command
tsview data=sashelp.air var=air
tsview data=dept.prod var=units id=period interval=qtr
%TSVIEW Macro
%tsview( data=sashelp.air, var=air);
%tsview( data=dept.prod, var=units, id=period, interval=qtr);
Bring up the system starting at a different window than the default Time Series Forecasting window.
Run the system in unattended mode so that a task such as creating a forecast data set is accomplished without any user interaction. By submitting such commands repeatedly from a SAS/AF or SAS/EIS application, it is possible to do batch processing for many data sets or by-group processing for many subsets of a data set.
You can create a project in unattended mode and later open it for inspection interactively. You can also create a project interactively in order to set options, fit a model, or edit the list of models, and then use this project later in unattended mode. The Forecast Command Builder, a point-and-click SAS/AF application, makes it easy to specify, run, save, and rerun forecasting jobs by using the FORECAST command. To use it, enter the following on the command line (not the program editor):
%FCB
or
AF C=SASHELP.FORCAST.FORCCMD.FRAME.
Syntax
The FORECAST command has the following form:
FORECAST [options] ;
The following options can be specified for the command and the macro.
PROJECT=project name
specifies the name of the SAS catalog entry in which forecasting models and other results are stored and from which previously stored results are loaded into the forecasting system.
DATA=data set name
specifies the name of the SAS data set containing the input data.
VAR=time series variable name
specifies the series variable name. It must be a numeric variable contained in the data set.
ID=time id variable name
specifies the time ID variable name for the data set. If the ID= option is not specified, the system attempts to locate the variables named DATE, DATETIME, and TIME in the data set specified by the DATA= option. However, it is recommended that you specify the time ID variable whenever you are using the ENTRY= argument.
INTERVAL=interval name
specifies the time ID interval between observations in the data set. Commonly used intervals are year, semiyear, qtr, month, semimonth, week, weekday, day, hour, minute, and second. See Chapter 4, "Date Intervals, Formats, and Functions," for information about more complex interval specifications. If the INTERVAL= option is not specified, the system attempts to determine the interval based on the time ID variable. However, it is recommended that you specify the interval whenever you are using the ENTRY= argument.
STAT=statistic
specifies the name of the goodness-of-fit statistic to be used as the model selection criterion. The default is RMSE. Valid names are
sse        sum of square error
mse        mean square error
rmse       root mean square error
mae        mean absolute error
mape       mean absolute percent error
aic        Akaike information criterion
sbc        Schwarz Bayesian information criterion
rsquare    R-square
adjrsq     adjusted R-square
rwrsq      random walk R-square
arsq       Amemiya's adjusted R-square
apc        Amemiya's prediction criterion
CLIMIT=integer
specifies the level of the confidence limits to be computed for the forecast. This integer represents a percentage; for example, 925 indicates 92.5% confidence limits. The default is 95; that is, 95% confidence limits.
HORIZON=integer
specifies the number of periods into the future for which forecasts are computed. The default is 12 periods. The maximum is 9999.
ENTRY=name
specifies the name of an entry point into the system. Valid names are
main       starts the system at the Time Series Forecasting window (default).
devmod     starts the system at the Develop Models window.
viewmod    starts the system at the Model Viewer window. Specify a project that contains a forecasting model by using the PROJECT= option. If a project containing a model is not specified, the message "No forecasting model to view" appears.
viewser    starts the system at the Time Series Viewer window.
autofit    runs the system in unattended mode, fitting a forecasting model automatically and saving it in a project. If PROJECT= is not specified, the default project name SASUSER.FMSPROJ.PROJ is used.
forecast   runs the system in unattended mode to generate a forecast data set. The name of this data set is specified by the OUT= parameter. If OUT= is not specified, a window appears to prompt for the name and label of the output data set. If PROJECT= is not specified, the default project name SASUSER.FMSPROJ.PROJ is used. If the project does not exist or does not contain a forecasting model for the specified series, automatic model fitting is performed and the forecast is computed by using the automatically selected model. If the project exists and contains a forecasting model for the specified series, the forecast is computed by using this model. If the series covers a different time range than it did when the project was created, use the REFIT or REEVAL keyword to reset the time ranges.
OUT=argument
specifies a one- or two-level name of a SAS data set in which forecasts are saved. Use in conjunction with ENTRY=FORECAST. If omitted, the system prompts for the name of the forecast data set.
KEEP=argument
specifies the number of models to keep in the project when automatic model fitting is performed. This corresponds to Models to Keep in the Automatic Model Selection Options window. A value greater than 9 indicates that all models are kept. The default is 1.
DIAG=YES|NO
specifies which models to search with regard to series diagnostics. DIAG=YES causes the automatic model selection process to search only over those models that are consistent with the series diagnostics. DIAG=NO causes the automatic model selection process to search over all models in the selection list, without regard for the series diagnostics. This corresponds to Models to Fit in the Automatic Model Selection Options window. The default is YES.
REFIT=keyword
(for macro usage) refits a previously saved forecasting model by using the current fit range; that is, it reestimates the model parameters. Refitting also causes the model to be reevaluated (statistics of fit recomputed), and it causes the time ranges to be reset if the data range has changed (for example, if new observations have been added to the series). This keyword has no effect if you do not use the PROJECT= argument to reference an existing project containing a forecasting model. Use the REFIT keyword if you have added new data to the input series and you want to refit the forecasting model and update the forecast by using the new time ranges. Be sure to use the same project, data set, and series names that you used previously.
REEVAL=keyword
(for macro usage) reevaluates a previously saved forecasting model by using the current evaluation range; that is, it recomputes the statistics of fit. Reevaluating also causes the time ranges to be reset if the data range has changed (for example, if new observations have been added to the series). It does not refit the model parameters. This keyword has no effect if you also specify REFIT, or if you do not use the PROJECT= argument to reference an existing project containing a forecasting model. Use the REEVAL keyword if you have added new data to the input series and want to update your forecast by using a previously fit forecasting model and the same project, data set, and series names that you used previously.
Examples
FORECAST Command
The following command opens the Time Series Forecasting window with the data set name and series name filled in. The time ID variable is also filled in since the data set contains the variable DATE. The interval is filled in because the system recognizes that the observations are monthly.
forecast data=sashelp.air var=air
The following command opens the Time Series Forecasting window with the project, data set name, series, time ID, and interval fields filled in, assuming that the project SAMPROJ was previously saved either interactively or by using unattended mode as depicted below. Previously fit models appear when the Develop Models or Manage Projects window is opened.
forecast project=samproj
The following command runs the system in unattended mode, fitting a model automatically, storing it in the project SAMPROJ in the default catalog SASUSER.FMSPROJ, and placing the forecasts in the data set WORK.SAMPOUT.
forecast data=sashelp.workers var=electric id=date interval=month project=samproj entry=forecast out=sampout
The following command assumes that a new month's data have been added to the data set from the previous example and that an updated forecast is needed that uses the previously fit model. Time ranges are automatically updated to include the new data since the REEVAL keyword is included. Substitute REFIT for REEVAL if you want the system to reestimate the model parameters.
forecast data=sashelp.workers var=electric id=date interval=month project=samproj entry=forecast out=sampout reeval
The following command opens the Model Viewer with the project created in the previous example and with 99 percent confidence limits in the forecast graph.
forecast data=sashelp.workers var=electric id=date interval=month project=samproj entry=viewmod climit=99
The final example illustrates using unattended mode with an existing project that has been defined interactively. In this example, the goal is to add a model to the model selection list, to specify that all models in that list be fit, and that all models that are fit successfully be retained. First open the Time Series Forecasting window and specify a new project name, WORKPROJ. Then select Develop Models, choosing SASHELP.WORKERS as the data set and MASONRY as the series. Now select Model Selection List from the Options menu. In the Model Selection List window, click Actions, then Add, and then ARIMA Model. Define the model ARIMA(0,1,0)(0,1,0)s NOINT by setting the differencing value to 1 under both ARIMA Options and Seasonal ARIMA Options. Select OK to save the model and OK to close the Model Selection List window. Now select Automatic Fit from the Options menu. In the Automatic Model Selection Options window, select All autofit models in selection list in the Models to fit radio box, select All models from the Models to keep combo box, and then click OK to close the window. Select Save Project from the File menu, and then close the Develop Models window and the Time Series Forecasting window. You now have a project with a new model added to the selection list, options set for automatic model fitting, and one series selected but no models fit. Now enter the command:
forecast data=sashelp.workers var=electric id=date interval=month project=workproj entry=forecast out=workforc
The system runs in unattended mode to update the project and create the forecast data set WORKFORC. Check the messages in the Log window to find out if the run was successful and which model was selected for forecasting. To see the forecast data set, issue the command viewtable WORKFORC. To see the contents of the project, open the Time Series Forecasting window, open the project WORKPROJ, and select Manage Projects. You will see that the variable ELECTRIC was added to the project and has a forecasting model. Select this row in the table and then select List Models from the Tools menu. You will see that all of the models in the selection list that fit successfully are there, including the new model you added to the selection list.
%FORECAST Macro
This example demonstrates the use of the %FORECAST macro to start the Time Series Forecasting System from a SAS program submitted from the Editor window. The SQL procedure is used to create a view of a subset of a products data set. Then the %FORECAST macro is used to produce forecasts.
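/* Create a view that contains only the type A products, ordered by date. */
proc sql;
   create view selprod as
   select * from products
   where type eq 'A'
   order by date;
quit;

/* Produce forecasts for the view in unattended mode. */
%forecast(data=selprod, var=amount, id=date, interval=day,
          entry=forecast, out=typea, proj=proda, refit= );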
Chapter 43
Window Reference
Contents
Overview
Adjustments Selection Window
AR/MA Polynomial Specification Window
ARIMA Model Specification Window
ARIMA Process Specification Window
Automatic Model Fitting Window
Automatic Model Fitting Results Window
Automatic Model Selection Options Window
Custom Model Specification Window
Data Set Selection Window
Default Time Ranges Window
Develop Models Window
Differencing Specification Window
Dynamic Regression Specification Window
Dynamic Regressors Selection Window
Error Model Options Window
External Forecast Model Specification Window
Factored ARIMA Model Specification Window
Forecast Combination Model Specification Window
Forecasting Project File Selection Window
Forecast Options Window
Intervention Specification Window
Interventions for Series Window
Manage Forecasting Project Window
Model Fit Comparison Window
Model List Window
Model Selection Criterion Window
Model Selection List Editor Window
Model Viewer Window
Models to Fit Window
Polynomial Specification Window
Produce Forecasts Window
Regressors Selection Window
Save Data As
Save Graph As
Seasonal ARIMA Model Options Window
Series Diagnostics Window
Series Selection Window
Series to Process Window
Series Viewer Transformations Window
Smoothing Model Specification Window
Smoothing Weight Optimization Window
Statistics of Fit Selection Window
Time ID Creation 1,2,3 Window
Time ID Creation from Several Variables Window
Time ID Creation from Starting Date Window
Time ID Creation Using Informat Window
Time ID Variable Specification Window
Time Ranges Specification Window
Time Series Forecasting Window
Time Series Simulation Window
Time Series Viewer Window
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
2630 2631 2632 2633 2636 2637 2639 2641 2643 2644 2644 2646 2647 2648 2649 2651 2653 2654
Overview
This chapter provides a reference to the various windows of the Time Series Forecasting System. The windows are presented in alphabetical order by name. Each section describes the purpose of the window, how to open it, and its controls, fields, and menus. For windows that have their own menus, there is a description of each menu item under the heading Menu Bar. These windows also have a toolbar with icons that duplicate the more commonly used menu items. Each icon has a screen tip: a brief description that appears when you hover the mouse cursor over the icon. If you don't see the screen tips, open the SAS Preferences window, under the Options submenu of the Tools menu. Select the View tab and make sure the Screen tips check box is checked.
is a table that lists the names and labels in the input data set available for selection as adjustments. The variables you select are highlighted. Selecting a highlighted row again deselects that variable.
OK
closes the Adjustments Selection window and adds the selected variables as adjustments in the model.
Cancel
resets all selections to their initial values upon entry to the window.
Lists the polynomials that have been specified. Each polynomial is represented by a comma-delimited list of lag values enclosed in parentheses.
New
Opens the Polynomial Specification window to add a new polynomial to the model.
Edit
Opens the Polynomial Specification window to edit a polynomial that has been selected. If no polynomial is selected, this button is unavailable.
Remove
Removes a selected polynomial from the list. If none are selected, this button is unavailable.
Remove All
Move Up
Moves a selected polynomial up one position in the list. If no polynomial is selected, or the first one is selected, this button is unavailable.
Move Down
Moves a selected polynomial down one position in the list. If no polynomial is selected, or the last one is selected, this button is unavailable.
OK
Closes the window and returns the specified list of polynomials to the Factored ARIMA Model Specification window.
Cancel
Closes the window and discards any changes made to the list of polynomials.
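The list notation described above (each polynomial shown as a comma-delimited list of lag values in parentheses) can be illustrated with a short sketch. This is only an illustration of the notation, not code used by the system; the function names `parse_polynomials` and `format_polynomials` are hypothetical.

```python
import re


def parse_polynomials(text):
    """Parse a string such as "(1,2)(12)" into a list of lag lists.

    Each parenthesized group is treated as one polynomial whose entries
    are integer lag values separated by commas.
    """
    groups = re.findall(r"\(([^()]*)\)", text)
    return [
        [int(lag) for lag in group.split(",") if lag.strip()]
        for group in groups
    ]


def format_polynomials(polynomials):
    """Format a list of lag lists back into the parenthesized notation."""
    return "".join(
        "(" + ",".join(str(lag) for lag in lags) + ")"
        for lags in polynomials
    )


# Two polynomials: one with lags 1 and 2, one with lag 12.
factors = parse_polynomials("(1,2)(12)")
```

In this sketch, moving a polynomial up or down in the list corresponds to swapping adjacent elements of `factors`.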
is a descriptive label for the model that you specify. You can type a label in this field or allow the system to provide a label. If you leave the label blank, a label is generated automatically based on the options you specify.
ARIMA Options
specify the orders of the ARIMA model. You can either type in a value or click the arrow to select from a list.
Autoregressive
defines the order of simple differencing, for example, first difference or second difference.
Moving Average
specify the orders of the seasonal part of the ARIMA model. You can either type in a value or click the arrow to select from a list.
Autoregressive
defines the order of seasonal differencing, for example, first difference or second difference at the seasonal lags.
Moving Average
defines the series transformation for the model. When a transformation is specified, the ARIMA model is fit to the transformed series, and forecasts are produced by applying the inverse transformation to the ARIMA model predictions. The available transformations are: Log, Logistic, Square Root, Box-Cox, and None.
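The transform-then-invert idea described above can be sketched as follows. This section does not give the formulas the system uses, so the sketch assumes common textbook forms (the logit for Logistic, and the usual power form for Box-Cox); the function name `make_transform` and the `boxcox_lambda` parameter are hypothetical.

```python
import math


def make_transform(name, boxcox_lambda=None):
    """Return a (forward, inverse) pair of functions for a transformation.

    The formulas are assumed textbook forms, not taken from the system.
    """
    if name == "None":
        return (lambda x: x), (lambda y: y)
    if name == "Log":
        return math.log, math.exp
    if name == "Square Root":
        return math.sqrt, (lambda y: y * y)
    if name == "Logistic":
        # Assumes series values strictly between 0 and 1.
        return (
            lambda x: math.log(x / (1.0 - x)),
            lambda y: 1.0 / (1.0 + math.exp(-y)),
        )
    if name == "Box-Cox":
        if boxcox_lambda is None:
            raise ValueError("Box-Cox requires a lambda parameter")
        lam = boxcox_lambda
        if lam == 0:
            # The limiting case of the power form is the log transform.
            return math.log, math.exp
        return (
            lambda x: (x ** lam - 1.0) / lam,
            lambda y: (lam * y + 1.0) ** (1.0 / lam),
        )
    raise ValueError("unknown transformation: " + name)


# A model would be fit to forward(x); a prediction made on the
# transformed scale is mapped back with inverse().
forward, inverse = make_transform("Log")
back_on_original_scale = inverse(forward(120.0))
```

Applying the inverse directly to a transformed-scale prediction is only one way to return to the original scale. The Forecast Options window controls the kind of predicted values computed for models that include series transformations.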
Intercept
specify whether a mean or intercept parameter is included in the ARIMA model. By default, the Intercept option is set to No when the model includes differencing and Yes when there is no differencing.
Predictors
closes the ARIMA Model Specification window without fitting the model. Any options you specified are lost.
Reset
resets all options to their initial values upon entry to the ARIMA Model Specification window. This might be useful when editing an existing model specification; otherwise, Reset has the same function as Clear.
Clear
Delete
opens a menu of different time trend curves and adds the curve you select to the Predictors list. Certain trend curve specifications also set the Transformation field.
Add Regressors
opens the Regressors Selection window to enable you to select other series in the input data set as regressors to predict the dependent series and add them to the Predictors list.
Add Adjustments
opens the Adjustments Selection window to enable you to select other series in the input data set for use as adjustments to the forecasts and add them to the Predictors list.
Add Dynamic Regressor
opens the Dynamic Regressor Selection window to enable you to select a series in the input data set as a predictor of the dependent series and also specify a transfer function model for the effect of the predictor series.
Add Interventions
opens the Interventions for Series window to enable you to define and select intervention effects and add them to the Predictors list.
Add Seasonal Dummies
is a table of autoregressive terms for the simulated ARIMA process. Enter a value for Factor, Lag, and Value for each term of the AR part of the process you want to simulate. For a nonfactored AR model, make the Factor values the same for all terms. For a factored AR model, use different Factor values to group the terms into the factors.
MA Parameters
is a table of moving-average terms for the simulated ARIMA process. Enter a value for Factor, Lag, and Value for each term of the MA part of the process you want to simulate. For a nonfactored MA model, make the Factor values the same for all terms. For a factored MA model, use different Factor values to group the terms into the factors.
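To see how Factor values group terms, the following sketch multiplies the factors of a parameter table into a single polynomial in the backshift operator B. It assumes the sign convention in which each factor has the form 1 - value*B**lag - ...; that convention is an assumption here, and the function name `expand_factors` is hypothetical.

```python
from collections import defaultdict


def expand_factors(terms):
    """Multiply out a factored lag polynomial.

    terms is an iterable of (factor, lag, value) rows. Rows that share a
    Factor value belong to the same factor. Each factor is assumed to be
    1 - sum(value * B**lag); the result maps lag -> coefficient of the
    product polynomial.
    """
    factors = defaultdict(dict)
    for factor, lag, value in terms:
        factors[factor][lag] = factors[factor].get(lag, 0.0) - value

    product = {0: 1.0}
    for factor in sorted(factors):
        poly = {0: 1.0, **factors[factor]}
        expanded = defaultdict(float)
        for lag_a, coef_a in product.items():
            for lag_b, coef_b in poly.items():
                expanded[lag_a + lag_b] += coef_a * coef_b
        product = dict(expanded)
    return product


# Factored: (1 - 0.5 B)(1 - 0.3 B**12), two different Factor values.
factored = expand_factors([(1, 1, 0.5), (2, 12, 0.3)])

# Nonfactored: 1 - 0.5 B - 0.2 B**2, the same Factor value for all terms.
nonfactored = expand_factors([(1, 1, 0.5), (1, 2, 0.2)])
```

In the factored example the product contains a cross term at lag 13, which is what distinguishes a factored model from a nonfactored model that uses only lags 1 and 12.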
OK
closes the ARIMA Process Specification window and adds the specified process to the Series to Generate list in the Time Series Simulation window.
Cancel
closes the window without adding to the Series to Generate list. Any options you specified are lost.
Reset
resets all the fields to their initial values upon entry to the window.
Clear
the name of the SAS catalog entry in which the results of the model search process are stored.
Input Data Set
is the name of the current input data set. You can type in a one-level or two-level data set name here.
Browse button
opens the Data Set Selection window for selecting an input data set.
Time ID
is the name of the ID variable for the input data set. You can type in the variable name here or use the Select or Create button.
Time ID Select button
opens the Time ID Variable Specification window.
Time ID Create button
opens a menu of choices of methods for creating a time ID variable for the input data set. Use this feature if the input data set does not already contain a valid time ID variable.
Interval
is the time interval between observations (data frequency) in the current input data set. You can type in an interval name or select one by using the combo box pop-up menu.
Series to Process
indicates the number and names of time series variables for which forecasting model selection will be applied.
Series to Process Select button
opens the Series to Process window to let you select the series for which you want to fit models.
Selection Criterion
shows the goodness-of-fit statistic that will be used to determine the best fitting model for each series.
Selection Criterion Select button
opens the Model Selection Criterion window to enable you to select the goodness-of-fit statistic that will be used to determine the best fitting model for each series.
Run button
opens the Automatic Model Fitting Results window to display the models fit during the current invocation of the Automatic Model Fitting window. The results appear automatically when model fitting is complete, but this button enables you to redisplay the results window.
Close button
Menu Bar
File
Import Data
is available if you license SAS/Access software. It opens an Import Wizard, which you can use to import your data from an external spreadsheet or data base to a SAS data set for use in the Time Series Forecasting System.
Export Data
is available if you license SAS/Access software. It opens an Export Wizard, which you can use to export a SAS data set, such as a forecast data set created with the Time Series Forecasting System, to an external spreadsheet or data base.
Print Setup
opens the Print Setup window, which allows you to access your operating system print setup.
Close
opens the Automatic Model Fitting Results window to show the forecasting models fit during the current invocation of the Automatic Model Fitting window. This is the same as the Models Fit button.
Tools
Fit Models
performs the automatic model selection process for the selected series. This is the same as the Run button.
Options
opens the Default Time Ranges window to enable you to control how the system sets the time ranges for series.
Model Selection List
opens the Model Selection List editor window. Use this action to control the forecasting models considered by the automatic model selection process and displayed in the Models to Fit window.
Model Selection Criterion
opens the Model Selection Criterion window, which presents a list of goodness-of-fit statistics and enables you to select the fit statistic that is displayed in the table and used by the automatic model selection process to determine the best fitting model. This action is the same as the Selection Criterion Select button.
Statistics of Fit
opens the Statistics of Fit Selection window, which presents a list of statistics that the system can display. Use this action to customize the list of statistics shown in the Statistics of Fit table and available for selection in the Model Selection Criterion menu.
Forecast Options
opens the Forecast Options window, which enables you to control the widths of forecast confidence limits and control the kind of predicted values computed for models that include series transformations.
Forecast Data Set
Beginning
aligns dates that the system generates to identify forecast observations in output data sets to the beginning of the time intervals.
Middle
aligns dates that the system generates to identify forecast observations in output data sets to the midpoints of the time intervals.
End
aligns dates that the system generates to identify forecast observations in output data sets to the end of the time intervals.
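The three alignment choices can be illustrated for a monthly interval with the following sketch. It assumes that the midpoint is the date halfway between the first and last day of the month, rounded down; the exact midpoint rule the system uses is not stated in this section. The function names `month_bounds` and `align` are hypothetical.

```python
from datetime import date, timedelta


def month_bounds(year, month):
    """Return the first and last calendar day of the given month."""
    first = date(year, month, 1)
    # First day of the following month, rolling over the year in December.
    next_first = date(year + (month == 12), month % 12 + 1, 1)
    return first, next_first - timedelta(days=1)


def align(year, month, alignment):
    """Return the date that identifies a monthly interval.

    alignment is "Beginning", "Middle", or "End". The midpoint rule is an
    assumption made for illustration.
    """
    first, last = month_bounds(year, month)
    if alignment == "Beginning":
        return first
    if alignment == "End":
        return last
    if alignment == "Middle":
        return first + (last - first) // 2
    raise ValueError("unknown alignment: " + alignment)


january_dates = [align(2008, 1, a) for a in ("Beginning", "Middle", "End")]
```

For January 2008 this sketch yields January 1, January 16, and January 31 for the three alignments.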
Automatic Fit
opens the Automatic Model Selection Options window, which enables you to control the number of models retained by the automatic model selection process and whether the models considered for automatic selection are subset according to the series diagnostics.
Tool Bar Type
Image Only
displays the toolbar items with both text and icon images.
Include Interventions
controls whether intervention effects defined for the current series are automatically added as predictors to the models considered by the automatic selection process. A check mark or filled check box next to this item indicates that the option is turned on.
Print Audit Trail
prints to the SAS log information about the models fit by the system. A check mark or filled check box next to this item indicates that the audit option is turned on.
Show Source Statements
controls whether SAS statements submitted by the forecasting system are printed in the SAS log. When the Show Source Statements option is selected, the system sets the SAS system option SOURCE before submitting SAS statements; otherwise, the system uses the NOSOURCE option. Note that only some of the functions performed by the forecasting system are accomplished by submitting SAS statements. A check mark or filled check box next to this item indicates that the option is turned on.
Table Contents
The results table displays the series name in the first column and the model label in the second column. If you have chosen to retain more than one model by using the Automatic Model Selection Options window, more than one row appears in the table for each series; that is, there is a row for each model fit. If you have already fit models to the same series before invoking the Automatic Model Fitting window, those models do not appear here, since the Automatic Model Fitting Results window is intended to show the results of the current operation of Automatic Model Fitting. To see all models that have been fit, use the Manage Projects window.
The third column of the table shows the values of the current model selection criterion statistic. Additional columns show the values of other fit statistics. The set of statistics shown is selectable by using the Statistics of Fit Selection window. The table can be sorted by any column other than Series Name by clicking on the column heading.
opens the Model Viewer window on the model currently selected in the table.
Stats
opens the Statistics of Fit Selection window. This controls the set of goodness-of-fit statistics displayed in the table and in other parts of the Time Series Forecasting System.
Compare
opens the Model Fit Comparison window for the series currently selected in the table. This button is unavailable if the currently selected row in the table represents a series for which fewer than two models have been fit.
Save
opens an output data set dialog, enabling you to specify a SAS data set to which the contents of the table are saved. Note that this operation saves what you see in the table. If you want to save the models themselves for use in a future session, use the Manage Projects window.
Print
closes the window and returns to the Automatic Model Fitting window.
Menu Bar
File
Save
opens an output data set dialog, enabling you to specify a SAS data set to which the contents of the table are saved. This is the same as the Save button.
Print
prints the contents of the table. This is the same as the Print button.
Import Data
is available if you license SAS/Access software. It opens an Import Wizard, which you can use to import your data from an external spreadsheet or data base to a SAS data set for use in the Time Series Forecasting System.
Export Data
is available if you license SAS/Access software. It opens an Export Wizard, which you can use to export a SAS data set, such as a forecast data set created with the Time Series Forecasting System, to an external spreadsheet or data base.
Print Setup
opens the Print Setup window, which allows you to access your operating system print setup.
Close
closes the window and returns to the Automatic Model Fitting window.
View
Model Predictions
opens the Model Viewer to display a predicted and actual plot for the currently highlighted model.
Prediction Errors
opens the Model Viewer to display the prediction errors for the currently highlighted model.
Prediction Error Autocorrelations
opens the Model Viewer to display the prediction error autocorrelations, partial autocorrelations, and inverse autocorrelations for the currently highlighted model.
opens the Model Viewer to display graphs of white noise and stationarity tests on the prediction errors of the currently highlighted model.
Parameter Estimates
opens the Model Viewer to display the parameter estimates table for the currently highlighted model.
Statistics of Fit
opens the Model Viewer window to display goodness-of-fit statistics for the currently highlighted model.
Forecast Graph
opens the Model Viewer to graph the forecasts for the currently highlighted model.
Forecast Table
opens the Model Viewer to display forecasts for the currently highlighted model in a table.
Tools
Compare Models
opens the Model Fit Comparison window to display fit statistics for selected pairs of forecasting models. This item is unavailable until you select a series in the table for which the automatic model fitting run selected two or more models.
Options
Statistics of Fit
opens the Statistics of Fit Selection window. This is the same as the Stats button.
Column Labels
selects long or short column labels for the table. Long column labels are used by default.
ID Columns
freezes or unfreezes the series and model columns. By default they are frozen so that they remain visible when you scroll the table horizontally to view other columns.
when selected, causes the automatic model selection process to search only over those models consistent with the series diagnostics.
All models in selection list
when selected, causes the automatic model selection process to search over all models in the search list, without regard for the series diagnostics.
Models to keep
specifies how many of the models tried by the automatic model selection process are retained and added to the model list for the series. You can specify the best fitting model only, the best n models, where n can be 1 through 9, or all models tried.
OK
closes the window and saves the automatic model selection options you specified.
Cancel
closes the window without changing the automatic model selection options.
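The effect of the Models to keep setting can be sketched as a simple ranking step. The sketch assumes a selection criterion for which smaller values indicate a better fit (as with an error statistic); the function name `keep_models` and the model labels are hypothetical.

```python
def keep_models(results, keep="best"):
    """Return the models retained after an automatic selection run.

    results is a list of (label, criterion_value) pairs, where a smaller
    criterion value is assumed to mean a better fit. keep is "best",
    "all", or an integer n from 1 through 9.
    """
    ranked = sorted(results, key=lambda item: item[1])
    if keep == "all":
        return ranked
    n = 1 if keep == "best" else int(keep)
    if not 1 <= n <= 9:
        raise ValueError("n must be 1 through 9")
    return ranked[:n]


# Hypothetical candidate models with illustrative criterion values.
tried = [("Model A", 5.1), ("Model B", 4.2), ("Model C", 4.6)]
best_two = keep_models(tried, keep=2)
```

With more than one model retained, each retained model becomes its own row for the series in the results table.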
is a descriptive label for the model that you specify. You can type a label in this field or allow the system to provide a label. If you leave the label blank, a label is generated automatically based on the options you specify.
Transformation
defines the series transformation for the model. When a transformation is specified, the model is fit to the transformed series, and forecasts are produced by applying the inverse transformation to the resulting forecasts. The following transformations are available:
Log
parameter.
Trend Model
controls the model options to model and forecast the series trend. Select from the following:
Linear Trend
brings up a menu of different time trend curves and adds the curve you select to the Predictors list.
First Difference
controls the model options to model and forecast the series seasonality. Select from the following:
Seasonal ARIMA
opens the Seasonal ARIMA Model Options window to enable you to specify an ARIMA model for the seasonal pattern in the series.
Seasonal Difference
displays the current settings of the autoregressive and moving-average terms, if any, for modeling the prediction error autocorrelation pattern in the series.
Set button
opens the Error Model Options window to enable you to set the autoregressive and moving-average terms for modeling the prediction error autocorrelation pattern in the series.
Intercept
specifies whether a mean or intercept parameter is included in the model. By default, the Intercept option is set to No when the model includes differencing and set to Yes when there is no differencing.
Predictors
Cancel
closes the Custom Model Specification window without fitting the model. Any options you specified are lost.
Reset
resets all options to their initial values upon entry to the Custom Model Specification window. This might be useful when editing an existing model specification; otherwise, Reset has the same function as Clear.
Clear
opens a menu of types of predictors to add to the Predictors list. Select from the following:
Linear Trend
opens a menu of different time trend curves and adds the curve you select to the Predictors list.
Regressors
opens the Regressors Selection window to enable you to select other series in the input data set as regressors to predict the dependent series and add them to the Predictors list.
Adjustments
opens the Adjustments Selection window to enable you to select other series in the input data set for use as adjustments to the forecasts and add them to the Predictors list.
Dynamic Regressor
opens the Dynamic Regressor Selection window to enable you to select a series in the input data set as a predictor of the dependent series and also specify a transfer function model for the effect of the predictor series.
Interventions
opens the Interventions for Series window to enable you to dene and select intervention effects and add them to the Predictors list.
Seasonal Dummies
adds a Seasonal Dummies predictor item to the Predictors list. This is unavailable if the series interval is not one which has a seasonal cycle.
Delete
If you select an entry in the Predictors list and click the right mouse button, the system displays a menu of actions that encompass the features of the Add, Delete, and Edit buttons.
Access this window by using the Browse button to the right of the Data Set field in the Time Series Forecasting, Automatic Model Fitting, and Produce Forecasts windows. It functions in the same way as the Series Selection window, except that it does not allow you to select or view a time series variable.
is a SAS libname assigned within the current SAS session. If you know the libname associated with the data set of interest, you can type it in this field. If it is a valid choice, it will appear in the libraries list and will be highlighted. The SAS Data Sets list will be populated with data sets associated with that libname. See also Libraries under Selection Lists.
Data Set
is the name of a SAS data set (data file or data view) that resides under the selected libname. If you know the name, you can type it in and press Return. If it is a valid choice, it will appear in the SAS Data Sets list and will be highlighted.
Time ID
is the name of the ID variable for the selected input data set. To specify the ID variable, you can type the ID variable name in this field or select the control arrows to the right of the field.
Time ID Select button
opens the Time ID Variable Specification window.
Time ID Create button
opens a menu of methods for creating a time ID variable for the input data set. Use this feature if the data set does not already contain a valid time ID variable.
Interval
is the time interval between observations (data frequency) in the selected data set. If the interval is not automatically identified by the system, you can type in the interval name or select it from a list by clicking the combo box arrow. For more information about intervals, see Chapter 4, Date Intervals, Formats, and Functions, in this book.
OK
closes the Data Set Selection window and makes the selected data set the current input data set.
Cancel
resets the fields to their initial values upon entry to the window.
Refresh
updates all fields and lists in the window. If you assign a new libname without exiting the Data Set Selection window, use the refresh action to update the Libraries list so that it will include the newly assigned libname.
Selection Lists
Libraries
displays a list of currently assigned libnames. You can select a libname by clicking it with the left mouse button, which is equivalent to typing its name in the Library field. If you cannot locate the library or directory you are interested in, go to the SAS Explorer window, select New from the File menu, then select Library and OK. This opens the New Library window. You can also assign a libname by submitting a LIBNAME statement from the Editor window. Select the Refresh button to make the new libname available in the libraries list.
SAS Data Sets
displays a list of the SAS data sets (data files or data views) contained in the selected library. You can select one of these by clicking with the left mouse button, which is equivalent to typing its name in the Data Set field. You can double-click a data set name to select it and exit the window.
specifies the forecast horizon as either a number of periods or years from the last nonmissing data value or as a fixed date. You can type a number or date value in this field. Date values must be entered in a form recognized by a SAS date informat. (See SAS Language Reference: Concepts for information about SAS date informats.)
Forecast Horizon Units
indicates whether the value in the forecast horizon field represents periods or years or a date. Click the arrow and select one from the pop-up list.
Hold-out Sample Size
specifies that a number of observations, number of years, or percent of the data at the end of the data range be used for the period of evaluation with the remainder of data used as the period of fit.
Hold-out Sample Size Units
indicates whether the hold-out sample size represents periods or years or percent of data range.
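The division of a data range into a period of fit and a period of evaluation can be sketched as follows. The sketch assumes that a percent is rounded down to a whole number of observations and that the number of periods per year is known; both are assumptions, and the function name `split_ranges` is hypothetical.

```python
import math


def split_ranges(n_obs, holdout, units="periods", periods_per_year=12):
    """Split observation indexes into fit and evaluation ranges.

    The last hold-out observations form the period of evaluation and the
    remainder forms the period of fit. How the system rounds a percent
    is not stated in this section; rounding down is an assumption.
    """
    if units == "periods":
        size = holdout
    elif units == "years":
        size = holdout * periods_per_year
    elif units == "percent":
        size = math.floor(n_obs * holdout / 100)
    else:
        raise ValueError("unknown units: " + units)

    # Keep the hold-out size within the available data range.
    size = max(0, min(int(size), n_obs))
    fit_range = range(0, n_obs - size)
    evaluation_range = range(n_obs - size, n_obs)
    return fit_range, evaluation_range


# Ten years of monthly data with the last 12 observations held out.
fit_range, evaluation_range = split_ranges(120, 12, units="periods")
```

In this example the first 108 observations are used to fit models and the last 12 are used to evaluate them. The sketch does not cover what the system does when the hold-out size is zero.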
Period of Fit
specifies how much of the data range for a series is to be used as the period of fit for models fit to the series. ALL indicates that all the available data is used. You can specify a number of periods, number of years, or a fixed date, depending on the value of the units field to the right. When you specify a date, the start of the period of fit is the specified date or the first nonmissing series value, whichever is more recent. Date values must be entered in a form recognized by a SAS date informat. (See SAS Language Reference: Concepts for information about SAS date informats.) When you specify the number of periods or years, the start of the period of fit is computed as the date that number of periods or years from the end of the data.
Period of Fit Units
closes the window without saving changes. Any options you specified are lost.
Defaults
resets the options to their initial values upon entry to the window.
is the time interval (data frequency) for the input data set.
Series
opens the Series Selection window to enable you to change the current input data set or series.
Data Range
is the date of the first and last nonmissing data values available for the current series in the input data set.
Fit Range
is the current period of fit setting. This is the range of data that will be used to fit models to the series.
Evaluation Range
is the current period of evaluation setting. This is the range of data that will be used to calculate the goodness-of-fit statistics for models fit to the series.
Set Ranges button
opens the Time Ranges Specification window to enable you to change the fit range or evaluation range. Note: A new fit range is applied when new models are fit or when existing models are refit. A new evaluation range is applied when new models are fit or when existing models are refit or reevaluated. Changing the ranges does not automatically refit or reevaluate any models in the table: use the Refit Models or Reevaluate Models items under the Edit menu.
View Series Graphically icon
opens the Time Series Viewer window to display plots of the current series.
View Selected Model Graphically icon
opens the Model Viewer to display graphs and tables for the currently highlighted model.
Forecast Model
is the column of the model table that contains check boxes to select which model is used to produce the final forecasts for the current series.
Model Title
is the column of the model table that contains the descriptive labels of the forecasting models fit to the current series.
Root Mean Square Error (or other statistic name) button
is the button above the right side of the table. It displays the name of the current model selection criterion: a statistic that measures how well each model in the table fits the values of the current series for observations within the evaluation range. Clicking this button opens the Model Selection Criterion window to let you select a different statistic. When you select a statistic, the model table in the Develop Models window is updated to show current values of that statistic.
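As an illustration of a selection criterion computed over the evaluation range, the following sketch computes a root mean square error from actual and predicted values. It assumes the usual definition (the square root of the mean squared prediction error) and skips observations with missing values; the function name `rmse` and the sample values are hypothetical.

```python
import math


def rmse(actual, predicted, evaluation_range):
    """Root mean square error over the observations in evaluation_range.

    Observations where either value is None (missing) are skipped. The
    divisor is the number of observations used, which is an assumption
    about how the statistic is defined.
    """
    errors = [
        actual[i] - predicted[i]
        for i in evaluation_range
        if actual[i] is not None and predicted[i] is not None
    ]
    if not errors:
        raise ValueError("no nonmissing observations in the evaluation range")
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Only the last two observations fall within the evaluation range.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 5.0, 2.0]
criterion = rmse(actual, predicted, range(2, 4))
```

Changing the evaluation range changes which observations enter the statistic, which is why models must be reevaluated before the table reflects a new range.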
Menu Bar
File
New Project
opens a dialog that lets you create a new project, assign it a name and description, and make it the active project.
Open Project
opens a dialog that lets you select and load a previously saved project.
Save Project
saves the current state of the system (including all the models fit to a series) to the current project catalog entry.
Save Project as
saves the current state of the system with a prompt for the name of the catalog entry in which to store the information.
Clear Project
clears the system, deleting all the models for all series.
Save Forecast
writes forecasts from the currently highlighted model to an output data set.
Save Forecast As
prompts for an output data set name and saves the forecasts from the currently highlighted model.
Output Forecast Data Set
opens a dialog for specifying the default data set used when you select Save Forecast.
Import Data
is available if you license SAS/Access software. It opens an Import Wizard, which you can use to import your data from an external spreadsheet or data base to a SAS data set for use in the Time Series Forecasting System.
Export Data
is available if you license SAS/Access software. It opens an Export Wizard, which you can use to export a SAS data set, such as a forecast data set created with the Time Series Forecasting System, to an external spreadsheet or data base.
Print Setup
opens the Print Setup window, which enables you to access your operating system print setup.
Close
closes the Develop Models window and returns to the main window.
Edit
Fit Model
Automatic Fit
enables you to modify the specification of the currently highlighted model in the table and fit the modified model. The new model replaces the current model in the table.
Delete Model
Refit Models
All Models
refits all models in the table by using data within the current fit range.
Selected Model
refits the currently highlighted model by using data within the current fit range.
Reevaluate Models
All Models
recomputes statistics of fit for all models in the table by using data within the current evaluation range.
Selected Model
recomputes statistics of fit for the currently highlighted model by using data within the current evaluation range.
View
Project
opens the Time Series Viewer window to display plots of the current series. This is the same as the View Series Graphically icon.
Model Predictions
opens the Model Viewer to display a predicted versus actual plot for the currently highlighted model. This is the same as the View Selected Model Graphically icon.
Prediction Errors
opens the Model Viewer to display the prediction errors for the currently highlighted model.
Prediction Error Autocorrelations
opens the Model Viewer to display the prediction error autocorrelations, partial autocorrelations, and inverse autocorrelations for the currently highlighted model.
Prediction Error Tests
opens the Model Viewer to display graphs of white noise and stationarity tests on the prediction errors of the currently highlighted model.
Parameter Estimates
opens the Model Viewer to display the parameter estimates table for the currently highlighted model.
Statistics of Fit
opens the Model Viewer window to display goodness-of-fit statistics for the currently highlighted model.
Forecast Graph
opens the Model Viewer to graph the forecasts for the currently highlighted model.
Forecast Table
opens the Model Viewer to display forecasts for the currently highlighted model in a table.
Tools
Diagnose Series
opens the Series Diagnostics window to determine the kinds of forecasting models appropriate for the current series.
Define Interventions
opens the Interventions for Series window to enable you to edit or add intervention effects for use in modeling the current series.
Sort Models
sorts the models in the table by the values of the currently displayed fit statistic.
Compare Models
opens the Model Fit Comparison window to display fit statistics for selected pairs of forecasting models. This is unavailable if there are fewer than two models in the table.
Generate Data
opens the Time Series Simulation window. This window enables you to simulate ARIMA time series processes and is useful for educational exercises or testing the system.
Options
Time Ranges
opens the Time Ranges Specification window to enable you to change the fit and evaluation time ranges and the forecast horizon. This action is the same as the Set Ranges button.
Default Time Ranges
opens the Default Time Ranges window to enable you to control how the system sets the time ranges for series when you do not explicitly set time ranges with the Time Ranges Specification window. Settings made by using this window do not affect series you are already working with; they take effect when you select a new series.
Model Selection List
opens the Model Selection List editor window. Use this action to edit the set of forecasting models considered by the automatic model selection process and displayed by the Models to Fit window.
Model Selection Criterion
opens the Model Selection Criterion window, which presents a list of goodness-of-fit statistics and enables you to select the fit statistic that is displayed in the table and used by the automatic model selection process to determine the best-fitting model. This action is the same as clicking the button above the table that displays the name of the current model selection criterion.
Statistics of Fit
opens the Statistics of Fit Selection window, which presents a list of statistics that the system can display. Use this action to customize the list of statistics shown in the Model Viewer, Automatic Model Fitting Results, and Model Fit Comparison windows and available for selection in the Model Selection Criterion menu.
Forecast Options
opens the Forecast Options window, which enables you to control the widths of forecast confidence limits and control the kind of predicted values computed for models that include series transformations.
Alignment of Dates
Beginning
aligns dates that the system generates to identify forecast observations in output data sets to the beginning of the time intervals.
Middle
aligns dates that the system generates to identify forecast observations in output data sets to the midpoints of the time intervals.
End
aligns dates that the system generates to identify forecast observations in output data sets to the end of the time intervals.
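The three alignment choices can be illustrated with a small sketch for monthly intervals. This is not the system's own date logic: the function name `align_month` is hypothetical, and taking the midpoint as the first day plus half the interval length (in whole days) is an assumption made for illustration.

```python
from datetime import date, timedelta

def align_month(year, month, alignment):
    """Return the date that identifies a monthly interval (illustrative only)."""
    first = date(year, month, 1)
    # The day before the first day of the next month is the last day of this month.
    next_first = date(year + (month == 12), month % 12 + 1, 1)
    last = next_first - timedelta(days=1)
    if alignment == "beginning":
        return first
    if alignment == "end":
        return last
    if alignment == "middle":
        # Assumption: midpoint = first day plus half the interval length, in whole days.
        return first + timedelta(days=(last - first).days // 2)
    raise ValueError("alignment must be 'beginning', 'middle', or 'end'")

print(align_month(2024, 3, "beginning"))  # 2024-03-01
print(align_month(2024, 3, "middle"))     # 2024-03-16
print(align_month(2024, 3, "end"))        # 2024-03-31
```

The same forecast observation is therefore labeled with a different date value depending on the alignment setting; the forecast values themselves do not change.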
Automatic Fit
opens the Automatic Model Selection Options window, which enables you to control the number of models retained by the automatic model selection process and whether the models considered for automatic selection are subset according to the series diagnostics.
Include Interventions
controls whether intervention effects defined for the current series are automatically added as predictors to the models considered by the automatic selection process and displayed by the Models to Fit window. When the Include Interventions option is selected, the series interventions are also automatically added to the predictors list when you specify a model in the ARIMA and Custom Models Specification windows. A check mark or filled check box next to this item indicates that the option is turned on.
Print Audit Trail
prints to the SAS log information about the models fit by the system. A check mark or filled check box next to this item indicates that the audit option is turned on.
Show Source Statements
controls whether SAS statements submitted by the forecasting system are printed in the SAS log. When the Show Source Statements option is selected, the system sets the SAS system option SOURCE before submitting SAS statements; otherwise, the system uses the NOSOURCE option. Note that only some of the functions performed by the forecasting system are accomplished by submitting SAS statements. A check mark or filled check box next to this item indicates that the option is turned on.
opens the Model Viewer for the selected model. This action is the same as the View Model Graphically icon.
View Parameter Estimates
opens the Model Viewer to display the parameter estimates table for the currently highlighted model. This is the same as the Parameter Estimates item in the View menu.
View Statistics of Fit
opens the Model Viewer to display a table of goodness-of-fit statistics for the currently highlighted model. This is the same as the Statistics of Fit item in the View menu.
Edit Model
enables you to modify the specification of the currently highlighted model in the table and fit the modified model. This is the same as the Edit Model item in the Edit menu.
Refit Model
refits the highlighted model by using data within the current fit range. This is the same as the Selected Model item under the Refit Models submenu of the Edit menu.
Reevaluate Model
reevaluates the highlighted model by using data within the current evaluation range. This is the same as the Selected Model item under the Reevaluate Models submenu of the Edit menu.
Delete Model
deletes the currently highlighted model from the model table. This is the same as the Delete Model item under the Edit menu.
View Forecasts
opens the Model Viewer to display the forecasts for the currently highlighted model. This is the same as the Forecast Graph item under the View menu. When the model list is empty or when no model is selected, the right mouse button opens the same menu of model fitting actions as the left mouse button.
specifies a lag value to add to the list. Type in a positive integer or select one by clicking the spin box arrows. Duplicates are allowed.
Add
adds the value in the Lag spin box to the list of differencing lags.
Remove
closes the window and returns the specified list to the Factored ARIMA Model Specification window.
Cancel
closes the window and discards any lags added to the list.
is a descriptive label for the dynamic regression model. You can type a label in this field or allow the system to provide the label. If you leave the label blank, a label is generated automatically based on the options you specify. When no options are specified, the label is the name and variable label of the predictor variable.
Input Transformation
displays the transformation specified for the predictor variable. When a transformation is specified, the transfer function model is fit to the transformed input variable.
Lagging periods
is the order of differencing, d. Set this field to 1 to use the changes in the predictor variable.
Seasonal Order of Differencing
is the order of seasonal differencing, D. Set this field to 1 to difference the predictor variable at the seasonal lags, for example, to use the year-over-year or week-over-week changes in the predictor variable.
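As a rough illustration of what first-order simple and seasonal differencing do to a predictor series, the sketch below differences a short made-up series. The helper name `difference`, the sample values, and the season length of 4 (quarterly data) are assumptions for the example, not part of the system.

```python
def difference(values, lag=1):
    """Return values[t] - values[t - lag] for every t where both terms exist."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

x = [10, 12, 15, 19, 24, 30]           # hypothetical predictor series

changes = difference(x, lag=1)         # d = 1: period-to-period changes
seasonal_changes = difference(x, lag=4)  # D = 1 with an assumed season length of 4

print(changes)           # [2, 3, 4, 5, 6]
print(seasonal_changes)  # [14, 18]
```

Each differencing pass shortens the series by the lag used, which is why the seasonally differenced series has only two values here.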
closes the window and adds the dynamic regression model specified to the model predictors list.
Cancel
closes the window without adding the dynamic regression model. Any options you specified are lost.
Reset
resets all options to their initial values upon entry to the window. This might be useful when editing a predictor specification; otherwise, Reset has the same function as Clear.
Clear
is a table listing the variables in the input data set. Select one variable in this list as the predictor series.
OK
opens the Dynamic Regression Specification window for you to specify the form of the dynamic regression for the selected predictor series, and then closes the Dynamic Regressors Selection window and adds the specified dynamic regression to the model predictors list.
Cancel
closes the window without adding the dynamic regression model. Any options you specified are lost.
Reset
resets all options to their initial values upon entry to the window.
Use these combo boxes to specify the orders of the ARIMA model. You can either type in a value or click the combo box arrow to select from a pop-up list.
Autoregressive
closes the Error Model Options window and returns to the Custom Model Specification window.
Cancel
closes the Error Model Options window and returns to the Custom Model Specification window, discarding any changes made.
Reset
resets all options to their initial values upon entry to the window.
closes the window and adds the external forecast to the project.
Cancel
where p, d, and q represent autoregressive, differencing, and moving-average terms, respectively. Access it from the Develop Models menu, where it is invoked from the Fit Model item under Edit in the menu bar, or from the pop-up menu when you click an empty area of the model table.
The Factored ARIMA Model Specification window is identical to the ARIMA Model Specification window, except that the p, d, and q terms are specified in a more general, less restrictive way. Only those controls and fields that differ from the ARIMA Model Specification window are described here.
is a descriptive label for the model. You can type a label in this field or allow the system to provide a label. If you leave the label blank, a label is generated automatically based on the p, d, and q terms that you specify. For example, if you specify p=(1,2,3), d=(1), q=(12) and no intercept, the model label is ARIMA p=(1,2,3) d=(1) q=(12) NOINT. For monthly data, this is equivalent to the model ARIMA(3,1,0)(0,0,1)s NOINT as specified in the ARIMA Model Specification window or the Custom Model Specification window.
ARIMA Options
specifies the ARIMA model in terms of the autoregressive lags (p), differencing lags (d), and moving-average lags (q).
Autoregressive
defines the autoregressive part of the model. Select the Set button to open the AR Polynomial Specification window, where you can add any set of autoregressive lags grouped into any number of factors.
Differencing
specifies differencing to be applied to the input data. Select the Set button to open the Differencing Specification window, where you can specify any set of differencing lags.
Moving Average
defines the moving-average part of the model. Select the Set button to open the MA Polynomial Specification window, where you can add any set of moving-average lags grouped into any number of factors.
Estimation Method
specifies the method used to estimate the model parameters. The Conditional Least Squares and Unconditional Least Squares methods generally require fewer computing resources and are more likely to succeed in fitting complex models. The Maximum Likelihood method requires more resources but provides a better fit in some cases. See also Estimation Details in Chapter 7, The ARIMA Procedure.
is a descriptive label for the model that you specify. You can type a label in this field or allow the system to provide a label. If you leave the label blank, a label is generated automatically based on the options you specify.
Weight
is a column of the forecasting model table that contains the weight values for each model. The forecasts for the combined model are computed as a weighted average of the predictions from the models in the table that use these weights. Models with missing weight values are not included in the forecast combination. You can type weight values in these fields or you can use other features of the window to set the weights.
Model Description
is a column of the forecasting model table that contains the descriptive labels of the forecasting models fit to the current series that are available for combination.
Root Mean Square Error (or other statistic name) button
is the button above the right side of the table. It displays the name of the current model selection criterion: a statistic that measures how well each model in the table fits the values of the current series for observations within the evaluation range. Clicking this button opens the Model Selection Criterion window to enable you to select a different statistic.
Normalize Weights button
replaces each nonmissing value in the Weights column with the current value divided by the sum of the weights. The resulting weights are proportional to original weights and sum to 1.
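The normalization rule, and the weighted-average combination it feeds, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, `None` stands in for a missing weight, and the sample weights and predictions are made up.

```python
def normalize_weights(weights):
    """Divide each nonmissing weight by the sum of the nonmissing weights."""
    total = sum(w for w in weights if w is not None)
    # Note: a total of zero would raise ZeroDivisionError; this sketch does not handle it.
    return [None if w is None else w / total for w in weights]

def combine(predictions, weights):
    """Weighted average of model predictions; models with missing weights are skipped."""
    return sum(w * p for w, p in zip(weights, predictions) if w is not None)

weights = [2.0, None, 6.0]             # the second model has a missing weight
normalized = normalize_weights(weights)
print(normalized)                      # [0.25, None, 0.75]

# One-step predictions from the three models (hypothetical values).
print(combine([100.0, 90.0, 120.0], normalized))  # 115.0
```

The normalized weights keep their original proportions (1:3 here) and sum to 1, so the combined forecast stays on the same scale as the individual model predictions.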
computes weight values for the models in the table by regressing the series on the predictions from the models. The values in the Weights column are replaced by the estimated coefficients produced by this linear regression. If some weight values are nonmissing and some are missing, only models with nonmissing weight values are included in the regression. If all weights are missing, all models are used.
OK
closes the Forecast Combination Model Specification window and fits the model.
Cancel
closes the Forecast Combination Model Specification window without fitting the model. Any options you specified are lost.
Reset
resets all options to their initial values upon entry to the Forecast Combination Model Specification window. This might be useful when editing an existing model specification; otherwise, Reset has the same function as Clear.
Clear
Selection Lists
Libraries
is a list of currently assigned libraries. When you select a library from this list, the catalogs in that library are shown in the catalog selection list.
Catalogs
is a list of catalogs contained in the currently selected library. When you select a catalog from this list, any forecasting project entries stored in that catalog are shown in the projects selection list.
Projects
restores selections to those which were set before the window was opened.
specifies the size of the confidence limits for the forecast values. For example, a value of 0.95 specifies 95% confidence intervals. You can type in a number or select from the pop-up list.
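To show how a confidence level such as 0.95 translates into interval width, here is a minimal sketch that assumes normally distributed forecast errors. The function name, the forecast value, and the standard error are hypothetical; the system's actual computation depends on the model and is not reproduced here.

```python
from statistics import NormalDist

def confidence_limits(forecast, std_error, level=0.95):
    """Two-sided limits under an assumed normal forecast error distribution."""
    # For level = 0.95 the quantile is taken at 0.975, giving z of about 1.96.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return forecast - z * std_error, forecast + z * std_error

lower, upper = confidence_limits(100.0, 10.0, level=0.95)
print(round(lower, 1), round(upper, 1))  # 80.4 119.6
```

Raising the level (for example, to 0.99) widens the limits, and lowering it narrows them.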
Predictions for transformed models
controls how forecast values are computed for models that employ a series transformation. See the section Predictions for Transformed Models in Chapter 44, Forecasting Process Details, for more information. The values are as follows.
Mean
specifies that forecast values be predictions of the conditional mean of the series.
Median
specifies that forecast values be predictions of the conditional median of the series.
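The difference between the two settings is easiest to see for a log transformation. The sketch below assumes the model predicts log(y) with a normal error distribution, in which case the median prediction on the original scale is exp(mu) and the mean prediction is exp(mu + sigma^2 / 2). The function name and the numbers are illustrative; other transformations use different formulas, which are covered in the chapter referenced above.

```python
import math

def log_model_predictions(mu, sigma):
    """Mean and median predictions of y when log(y) is predicted as N(mu, sigma^2)."""
    median = math.exp(mu)                   # back-transform of the predicted log value
    mean = math.exp(mu + sigma ** 2 / 2)    # includes the log-normal variance adjustment
    return mean, median

mean, median = log_model_predictions(mu=0.0, sigma=1.0)
print(median)          # 1.0
print(mean > median)   # True
```

For a log transformation the mean prediction is always at least as large as the median prediction, and the gap grows with the prediction error variance.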
OK
closes the window and saves the option settings you specified.
Cancel
closes the window without changing the forecast options. Any options you specified are lost.
is a descriptive label for the intervention effect that you specify. You can type a label in this field or allow the system to provide the label. If you leave the label blank, a label is generated automatically based on the options you specify.
Date
is the date that the intervention occurs. You can type a date value in this field, or you can set the date by selecting a row of the data table on the right side of the window.
Type of Intervention
Point
specifies that the intervention variable is zero except for the specified date.
Step
specifies that the intervention variable is zero before the specified date and a constant 1.0 after the date.
Ramp
specifies that the intervention variable is an increasing linear function of time after the date of the intervention and zero before the intervention date.
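The three intervention types correspond to three shapes of indicator variable. The sketch below builds each one for a series of n observations. It is illustrative only: the function name is hypothetical, observations are indexed from 0, and the values at the intervention date itself (1.0 for Point and Step, 0.0 for Ramp) are assumptions, since the descriptions above do not state them.

```python
def intervention_variable(kind, n, start):
    """Build an intervention variable of length n; start is the index of the intervention date."""
    if kind == "point":
        # Assumed to be 1.0 at the intervention date and zero everywhere else.
        return [1.0 if t == start else 0.0 for t in range(n)]
    if kind == "step":
        # Zero before the date; assumed to be 1.0 from the date onward.
        return [1.0 if t >= start else 0.0 for t in range(n)]
    if kind == "ramp":
        # Zero up to the date, then increasing by 1.0 per period.
        return [float(t - start) if t > start else 0.0 for t in range(n)]
    raise ValueError("kind must be 'point', 'step', or 'ramp'")

print(intervention_variable("point", 5, 2))  # [0.0, 0.0, 1.0, 0.0, 0.0]
print(intervention_variable("step", 5, 2))   # [0.0, 0.0, 1.0, 1.0, 1.0]
print(intervention_variable("ramp", 5, 2))   # [0.0, 0.0, 0.0, 1.0, 2.0]
```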
Number of lags
specifies the numerator order for the transfer function model for the intervention effect. Select a value from the pop-up list.
Effect Decay Pattern
specifies the denominator order for the transfer function model for the intervention effect. The value Exp specifies a single lag denominator factor; the value Wave specifies a two-lag denominator factor.
OK
closes the window and adds the intervention effect specified to the series interventions list.
Cancel
closes the window without adding the intervention. Any options you specified are lost.
Reset
resets all options to their initial values upon entry to the window. This might be useful when editing an intervention specification; otherwise, Reset has the same function as Clear.
Clear
closes the window. If you access this window from the ARIMA Model Specification window or the Custom Model Specification window, any interventions that are selected (highlighted) in the list are added to the model. If you access this window from the Tools menu, all interventions in the list are saved for the current series.
Cancel
closes the window without returning a selection or changing the interventions list. Any options you specified are lost.
Reset
opens the Intervention Specification window to specify a new intervention effect and add it to the list.
Delete
opens the Intervention Specification window to edit the currently selected (highlighted) intervention.
is the name of the SAS catalog entry in which forecasting models and other results will be stored and from which previously stored results are loaded into the forecasting system. You can specify the project by typing a SAS catalog entry name in this field or by selecting the Browse button to the right of this field. If you specify the name of an existing catalog entry, the information in the project file is loaded. If you specify a one-level name, it is assumed to be the name of a project in the fmsproj catalog in the sasuser library. For example, typing samproj is equivalent to typing sasuser.fmsproj.samproj.
Project Browse button
opens the Forecasting Project File Selection window to enable you to select and load the project from a list of previously stored project files.
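The one-level naming rule for the Project field can be sketched as a small name-resolution helper. Only the one-level case (for example, samproj becoming sasuser.fmsproj.samproj) comes from the description above; the function name is hypothetical, and returning any other name unchanged is an assumption made for this sketch.

```python
def resolve_project_name(name, default_library="sasuser", default_catalog="fmsproj"):
    """Expand a one-level project name to library.catalog.entry form."""
    parts = name.split(".")
    if len(parts) == 1:
        # One-level name: assume the fmsproj catalog in the sasuser library.
        return f"{default_library}.{default_catalog}.{parts[0]}"
    # Assumption: names with more than one level are returned unchanged.
    return name

print(resolve_project_name("samproj"))             # sasuser.fmsproj.samproj
print(resolve_project_name("mylib.mycat.myproj"))  # mylib.mycat.myproj
```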
Description
is a descriptive label for the forecasting project. The description you type in this field will be stored with the catalog entry shown in the Project field if you save the project.
is the name of the time series variable represented in the given row of the table.
Series Frequency
is the input data set that provided the data for the series.
Forecasting Model
is the descriptive label for the forecasting model selected for the series.
Statistic Name
is the statistic of fit for the forecasting model selected for the series.
Number of Models
is the total number of forecasting models fit to the series. If there is more than one model for a series, use the Model List window to see a list of models.
Series Label
is the time ID variable for the input data set for the series.
Series Data Range
Menu Bar
File
New
opens a dialog which lets you create a new project, assign it a name and description, and make it the active project.
Open
opens a dialog that lets you select and load a previously saved project.
Close
closes the Manage Forecasting Project window and returns to the main window.
Save
saves the current state of the system (including all the models fit to a series) to the current project catalog entry.
Save As
saves the current state of the system with a prompt for the name of the catalog entry in which to store the information.
Save to Data Set
saves the current project file information in a SAS data set. The contents of the data set are the same as the information displayed in the series list table.
Delete
is available if you license SAS/Access software. It opens an Import Wizard, which you can use to import your data from an external spreadsheet or data base to a SAS data set for use in the Time Series Forecasting System.
Export Data
is available if you license SAS/Access software. It opens an Export Wizard, which you can use to export a SAS data set, such as a forecast data set created with the Time Series Forecasting System, to an external spreadsheet or data base.
Print
opens the Print Setup window, which allows you to access your operating system print setup.
Edit
Delete Series
deletes all models for the selected (highlighted) row of the table and removes the series from the project.
Clear
resets the system, deleting all series and models from the project.
Reset
Data Set
opens a Viewtable window to display the input data set for the selected (highlighted) series.
Series
opens the Time Series Viewer window to display plots of the selected (highlighted) series.
Model
opens the Model Viewer window to show the current forecasting model for the selected series.
Forecast
opens the Model Viewer to display plots of the forecasts produced by the forecasting model for the selected (highlighted) series.
Tools
Diagnose Series
opens the Series Diagnostics window to perform the automatic series diagnostic process to determine the kinds of forecasting models appropriate for the selected (highlighted) series.
List Models
opens the Model List window for the selected (highlighted) series, which displays a list of all the models that you fit for the series. This action is the same as double-clicking the mouse on the table row.
Generate Data
opens the Time Series Simulation window. This window enables you to simulate ARIMA time series processes and is useful for educational exercises or testing the system.
Refit Models
All Series
refits all the models for all the series in the project by using data within the current fit range.
Selected Series
refits all the models for the currently highlighted series by using data within the current fit range.
Reevaluate Models
All Series
reevaluates all the models for all the series in the project by using data within the current evaluation range.
Selected Series
reevaluates all the models for the currently highlighted series by using data within the current evaluation range.
Options
Time Ranges
opens the Time Ranges Specification window to enable you to change the fit and evaluation time ranges and the forecast horizon.
Default Time Ranges
opens the Default Time Ranges window to enable you to control how the system sets the time ranges for series when you do not explicitly set time ranges with the Time Ranges Specification window. Settings made by using this window do not affect series you are already working with; they take effect when you select a new series.
Model Selection List
opens the Model Selection List editor window. Use this to edit the set of forecasting models considered by the automatic model selection process and displayed by the Models to Fit window.
Statistics of Fit
opens the Statistics of Fit Selection window, which controls which of the available statistics will be displayed.
Forecast Options
opens the Forecast Options window, which enables you to control the widths of forecast confidence limits and control the kind of predicted values computed for models that include series transformations.
Column Labels
enables you to set long or short column labels. Long labels are used by default.
Include Interventions
controls whether intervention effects defined for the current series are automatically added as predictors to the models considered by the automatic selection process and displayed by the Model Selection List editor window. When the Include Interventions option is selected, the series interventions are also automatically added to the predictors list when you specify a model in the ARIMA and Custom Models Specification windows.
Print Audit Trail
prints to the SAS log information about the models fit by the system. A check mark or filled check box next to this item indicates that the audit option is turned on.
Show Source Statements
controls whether SAS statements submitted by the forecasting system are printed in the SAS log. When the Show Source Statements option is selected, the system sets the SAS system option SOURCE before submitting SAS statements; otherwise, the system uses the NOSOURCE option. Note that only some of the functions performed by the forecasting system are accomplished by submitting SAS statements. A check mark or filled check box next to this item indicates that the option is turned on.
deletes the highlighted series and its models from the project. This is the same as Delete Series in the Edit menu.
Refit All Models
refits all models attached to the highlighted series by using data within the current fit range. This is the same as the Selected Series item under Refit Models in the Tools menu.
Reevaluate All Models
reevaluates all models attached to the highlighted series by using data within the current evaluation range. This is the same as the Selected Series item under Reevaluate Models in the Tools menu.
List Models
invokes the Model List window. This is the same as List Models under the Tools menu.
View Series
opens the Time Series Viewer window to display plots of the highlighted series. This is the same as the Series item under the View menu.
View Forecasting Model
invokes the Model Viewer window to display the forecasting model for the highlighted series. This is the same as the Model item under the View menu.
View Forecast
opens the Model Viewer window to display the forecasts for the highlighted series. This is the same as the Forecast item under the View menu.
Refresh
displays the starting and ending dates of the series data range.
Model 1
enables you to change the model identified as Model 1 if it is not already the first model in the list of models associated with the series. Select this button to cycle upward through the list of models.
Model 1 downward arrow button
enables you to change the model identified as Model 1 if it is not already the last model in the list of models. Select this button to cycle downward through the list of models.
Model 2
enables you to change the model identified as Model 2 if it is not already the first model in the list of models associated with the series. Select this button to cycle upward through the list of models.
Model 2 downward arrow button
enables you to change the model identified as Model 2 if it is not already the last model in the list of models. Select this button to cycle downward through the list of models.
Close
opens a dialog for specifying the name and label of a SAS data set to which the statistics will be saved. The data set will contain all available statistics and their values for Model 1 and Model 2, as well as a flag variable that is set to 1 for those statistics that were displayed.
Print
prints the contents of the table to the SAS Output window. If you find that the contents do not appear immediately in the Output window, you need to set scrolling options. Select Preferences under the Options submenu of the Tools menu. In the Preferences window, select the Advanced tab, then set output scroll lines to a number greater than zero. If you want to route the contents to a printer, go to the Output window and select Print from the File menu.
Statistics
opens the Statistics of Fit Selection window for controlling which statistics are displayed.
is the time interval (data frequency) for the input data set.
Series
is the date of the rst and last nonmissing data values available for the current series in the input data set.
Fit Range
is the current period of fit setting. This is the range of data that will be used to fit models to the series. It might be different from the fit ranges shown in the table, which were in effect when the models were fit.
Evaluation Range
is the current period of evaluation setting. This is the range of data that will be used to calculate the goodness-of-fit statistics for models fit to the series. It might be different from the evaluation ranges shown in the table, which were in effect when the models were fit.
View Series Graphically icon
opens the Time Series Viewer window to display plots of the current series.
opens the Model Viewer to display graphs and tables for the currently highlighted model.
opens the Model Viewer on the selected model. This is the same as Model Predictions under the View menu.
View Parameter Estimates
opens the Model Viewer to display the parameter estimates table for the currently highlighted model. This is the same as Parameter Estimates under the View menu.
View Statistics of Fit
opens the Model Viewer to display the statistics of fit table for the currently highlighted model. This is the same as Statistics of Fit under the View menu.
Edit Model
opens the appropriate model specification window for changing the attributes of the highlighted model and fitting the modified model.
Refit Model
opens the Model Viewer to show the forecasts for the highlighted model. This is the same as Forecast Graph under the View menu.
Menu Bar
File
Save
opens a dialog which lets you save the contents of the table to a specified SAS data set.
Import Data
is available if you license SAS/Access software. It opens an Import Wizard, which you can use to import your data from an external spreadsheet or data base to a SAS data set for use in the Time Series Forecasting System.
Export Data
is available if you license SAS/Access software. It opens an Export Wizard, which you can use to export a SAS data set, such as a forecast data set created with the Time Series Forecasting System, to an external spreadsheet or data base.
Print
sends the contents of the table to a printer as defined through Print Setup.
Print Setup
opens the Print Setup window, which allows you to access your operating system print setup.
Close
closes the window and returns to the Manage Forecasting Projects window.
Edit
Edit Model
enables you to modify the specification of the currently highlighted model in the table and fit the modified model. The new model replaces the current model in the table.
Refit Model
refits the currently highlighted model using data within the current fit range.
Reevaluate Model
recomputes statistics of fit for the currently highlighted model using data within the current evaluation range.
Delete Model
restores the contents of the Model List window to the state initially displayed.
View
Series
opens the Time Series Viewer window to display plots of the current series. This is the same as the View Series Graphically icon.
Model Predictions
opens the Model Viewer to display a predicted and actual plot for the currently highlighted model. This is the same as the View Model Graphically icon.
Prediction Errors
opens the Model Viewer to display the prediction errors for the currently highlighted model.
Prediction Error Autocorrelations
opens the Model Viewer to display the prediction error autocorrelations, partial autocorrelations, and inverse autocorrelations for the currently highlighted model.
opens the Model Viewer to display graphs of white noise and stationarity tests on the prediction errors of the currently highlighted model.
Parameter Estimates
opens the Model Viewer to display the parameter estimates table for the currently highlighted model.
Statistics of Fit
opens the Model Viewer window to display goodness-of-fit statistics for the currently highlighted model.
Forecast Graph
opens the Model Viewer to graph the forecasts for the currently highlighted model.
Forecast Table
opens the Model Viewer to display forecasts for the currently highlighted model in a table.
Options
Statistics of Fit
opens the Statistics of Fit Selection window, which presents a list of statistics that the system can display. Use this action to customize the list of statistics shown in the Model Viewer, Automatic Model Fitting Results, and Model Fit Comparison windows and available for selection in the Model Selection Criterion menu.
Column Labels
enables you to set long or short column labels. Long labels are used by default.
when selected, lists only those model selection criterion statistics that are selected in the Statistics of Fit Selection window.
Show all
closes the window and sets the model selection criterion to the statistic you specified.
Cancel
Turn the autofit option on or off for individual models. Those that are not flagged for autofit will be available by using the Models to Fit window but not by using automatic model fitting.
Delete models from the list that are not needed for your project.
Reorder the models in the list.
Edit models in the list.
Create a new empty list.
Add new models to the list.
Having modified the current model list, you can save it for future use in several ways:
Save it in a catalog so it can be opened later in the Model Selection List Editor.
Save it as the user default to be used automatically when new projects are created.
Select Close to close the Model Selection List Editor and attach the modified model selection list to the current project.
Select Cancel to close the Model Selection List Editor without changing the current project's model selection list.
Since model selection lists are not bound to specific data sources, care must be taken when including data-specific features such as interventions and regressors. When you add an ARIMA, Factored ARIMA, or Custom model to the list, you can add regressors by selecting from the variables in the current data set. If there is no current data set, you will be prompted to specify a data set so you can select regressors from the series it contains. If you use a model list that has models with a particular regressor name on a data set that does not contain a series of that name, model fitting will fail. However, you can make global changes to the regressor names in the model list by using Set regressor names. For example, you might use the list of dynamic regression models found in the sashelp.forcast catalog. It uses the regressor name price. If your regressor series is named x, you can specify price as the current regressor name and x as the change to name. The change will be applied to all models in the list that contain the specified regressor name.
Interventions cannot be defined for models defined from the Model Selection List Editor. However, you can define interventions by using the Intervention Specification window and apply them to your models by turning on the Include Interventions option.
Auto Fit
The Auto Fit column of check boxes enables you to exclude some of the models from the automatic fitting process without having to delete them from the list. By default, all models are checked, meaning that they are all used for automatic fitting.
Model
This column displays the descriptions of all models in the model selection list. You can select one or more models by clicking them. Selected models are highlighted and become the object of the actions Edit, Move, and Delete.
Menu Bar
File
New
opens a dialog for selecting one or more existing model selection lists to open. If you select multiple lists, they are all opened at once as a concatenated list. This helps you build large specialized model lists quickly by mixing and matching various existing lists such as the various ARIMA model lists included in SASHELP.FORCAST. By default, the lists you open replace the current model list. Select the append radio button if you want to append them to the current model list.
Open System Default
exits the window without applying any changes to the current project's model selection list.
Close
closes the window and applies any changes made to the project's model selection list.
Save
opens a dialog for saving the edited model selection list in a catalog of your choice.
Save as User Default
saves your edited model list as a default list for new projects. The location of this saved list is shown on the message line. When you create new projects, the system searches for this model list and uses it if it is found. If it is not found, the system uses the original default model list supplied with the product.
Edit
Reset
restores the list to its initial state when the window was invoked.
Add Model
enables you to add new models to the selection list. You can use the Smoothing Model Specification window, the ARIMA Model Specification window, the Factored ARIMA Model Specification window, or the Custom Model Specification window.
Edit Selected
opens the appropriate model specification window for changing the attributes of the highlighted model and adding the modified model to the selection list. The original model is not deleted.
Move Selected
enables you to reorder the models in the list. Select one or more models, then select Move Selected from the menu or toolbar. A note appears on the message line: Select the row after which the selected models are to be moved. Then select any unhighlighted row in the table. The selected models will be moved after this row.
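The reordering rule described for Move Selected can be sketched in a few lines of Python. This is only an illustration of the rule, not the system's implementation; the function name `move_selected` and the model labels are hypothetical.

```python
def move_selected(models, selected, after_row):
    """Return a new list in which the models at the indices in `selected`
    are placed immediately after the model at index `after_row`."""
    if after_row in selected:
        # The target must be an unhighlighted row.
        raise ValueError("target row must not be one of the selected rows")
    moved = [models[i] for i in sorted(selected)]
    remaining = [m for i, m in enumerate(models) if i not in selected]
    # Position of the target row once the selected rows are taken out.
    pos = sum(1 for i in range(after_row + 1) if i not in selected)
    return remaining[:pos] + moved + remaining[pos:]


# Hypothetical model labels, used only for illustration.
models = ["Mean", "Linear Trend", "Random Walk", "Simple Smoothing", "Airline Model"]
reordered = move_selected(models, {0, 1}, after_row=3)
print(reordered)
```

Here the first two models are highlighted and then dropped after the fourth row, so they end up between "Simple Smoothing" and "Airline Model".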
Delete
deletes any highlighted models from the list. This item is not available if no models are selected.
Set Regressor Names
opens a dialog for changing all occurrences of a given regressor name in the models of the current model selection list to a name that you specify.
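As a rough illustration of what a global regressor rename does, the following Python sketch applies one name change to every model in a list. The dictionary layout, the function name `set_regressor_names`, and the case-insensitive comparison are assumptions made for the example; they do not describe how the system stores model lists.

```python
def set_regressor_names(model_list, current_name, change_to):
    """Return a copy of the model list in which every occurrence of
    `current_name` among the regressors is replaced by `change_to`."""
    renamed = []
    for model in model_list:
        regressors = [
            change_to if r.lower() == current_name.lower() else r
            for r in model["regressors"]
        ]
        renamed.append({**model, "regressors": regressors})
    return renamed


# Hypothetical model list: two models use the regressor "price".
model_list = [
    {"label": "Dynamic regression 1", "regressors": ["price"]},
    {"label": "Dynamic regression 2", "regressors": ["price", "income"]},
    {"label": "No regressors", "regressors": []},
]
updated = set_regressor_names(model_list, "price", "x")
for model in updated:
    print(model["label"], model["regressors"])
```

Models that do not contain the named regressor pass through unchanged, which matches the behavior described above.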
Select All
You can access Model Viewer in a number of ways, including the View Model Graphically icon of the Develop Models and Model List windows, the Graph button of the Automatic Model Fitting Results window, and the Model item under the View menu in the Manage Forecasting Project window. In addition, you can go directly to a selected view in the Model Viewer window by selecting Model Predictions, Prediction Errors, Statistics of Fit, Prediction Error Autocorrelations, Prediction Error Tests, Parameter Estimates, Forecast Graph, or Forecast Table from the View menu or corresponding toolbar icon or pop-up menu item in the Develop Models, Model List, or Automatic Model Fitting Results windows. The state of the Model Viewer window is controlled by the current model and the currently selected view. You can resize this window, and you can use other windows without closing the Model Viewer window. By default, the Model Viewer window is automatically updated to display the new model when you switch to working with another model (that is, when you highlight a different model). You can unlink the Model Viewer window from the current model selection by selecting the Link/Unlink icon from the window's horizontal toolbar. See "Link/Unlink" in the section "Toolbar Icons" on page 2616. For more information, see the section "Model Viewer" on page 2427.
Toolbar Icons
The Model Viewer window contains a horizontal row of icons called the Toolbar. Corresponding menu items appear under various menus. The function of each icon is explained in the following list.
Zoom in
In the Model Predictions, Prediction Errors, and Forecast Graph views, the Zoom In action changes the mouse cursor into cross hairs that you can use with the left mouse button to define a region of the graph to zoom in on. In the Prediction Error Autocorrelations and Prediction Error Tests views, Zoom In reduces the number of lags displayed.
Zoom out
disconnects or connects the Model Viewer window to the model table (Develop Models window, Model List window, or Automatic Model Fitting Results window). When the viewer is linked, selecting another model in the model table causes the model viewer to be updated to show the selected model. When the Viewer is unlinked, selecting another model does not affect the viewer. This feature is useful for comparing two or more models graphically. You can display a model of interest in the Model Viewer, unlink it, then select another model and open another Model Viewer window for that model. Position the viewer windows side by side for convenient comparisons of models, or use the Next Viewer icon or F12 function key to switch between them.
Save
saves the contents of the Model Viewer window. By default, an HTML page is created. This enables you to display graphs and tables by using the Results Viewer or publish them on the Web or your intranet. See also Save Graph As and Save Data As under Menu Bar below.
closes the Model Viewer window and returns to the window from which it was invoked.
displays a plot of actual series values and model predictions over time. Click individual points in the graph to get a display of the type (actual or predicted), ID value, and data value in the upper right corner of the window.
Prediction Errors
displays a plot of model prediction errors (residuals) over time. Click individual points in the graph to get a display of the prediction error value in the upper right corner of the window.
Prediction Error Autocorrelations
displays horizontal bar charts of the sample autocorrelation, partial autocorrelation, and inverse autocorrelation functions for the model prediction errors. Overlaid line plots represent confidence limits computed at plus and minus two standard errors. Click any of the bars to display its value.
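To make the "plus and minus two standard errors" limits concrete, here is a minimal Python sketch that computes sample autocorrelations of a residual series and compares them with a two-standard-error bound. It assumes the common white noise approximation in which the standard error of each autocorrelation is 1/sqrt(n); the system may compute its standard errors differently, and the function names are hypothetical.

```python
import math


def sample_autocorrelations(errors, max_lag):
    """Sample autocorrelations of the prediction errors for lags 1..max_lag."""
    n = len(errors)
    mean = sum(errors) / n
    dev = [e - mean for e in errors]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


def two_standard_error_limit(n):
    """Approximate limit, assuming each autocorrelation has standard error 1/sqrt(n)."""
    return 2.0 / math.sqrt(n)


# Artificial residuals that alternate in sign, so they are clearly not white noise.
errors = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
acf = sample_autocorrelations(errors, 2)
limit = two_standard_error_limit(len(errors))
outside = [k + 1 for k, r in enumerate(acf) if abs(r) > limit]
print(acf, limit, outside)
```

Bars that extend beyond the limits, like both lags in this artificial example, suggest that the prediction errors still contain autocorrelation the model has not captured.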
Prediction Error Tests
displays horizontal bar charts that represent results of white noise and stationarity tests on the model prediction errors. The first bar chart shows the significance probability of the Ljung-Box chi-square statistic computed on autocorrelations up to the given lag. Longer bars favor rejection of the null hypothesis that the series is white noise. Click any of the bars to display an interpretation. The second bar chart shows tests of stationarity of the model prediction errors, where longer bars favor the conclusion that the series is stationary. Each bar displays the significance probability of the augmented Dickey-Fuller unit root test to the given autoregressive lag. Long bars represent higher levels of significance against the null hypothesis that the series contains a unit root. For seasonal data, a third bar chart appears for seasonal root tests. Click any of the bars to display an interpretation.
Parameter Estimates
displays a table showing model parameter estimates along with standard errors and t tests for the null hypothesis that the parameter is zero.
Statistics of Fit
displays a table of statistics of fit for the selected model. The set of statistics shown can be changed by using the Statistics of Fit item under Options in the menu bar.
Forecast Graph
displays a plot of actual and predicted values for the series data range, followed by a horizontal reference line and forecasted values with confidence limits. Click individual points in the graph to get a display of the type, date/time, and value of the data point in the upper right corner of the window.
Forecast Table
displays a data table with columns containing the date/time, actual, predicted, error (residual), lower confidence limit, and upper confidence limit values, together with any predictor series.
Menu Bar
File
Save Graph
saves the plot displayed in the viewer window as a SAS/GRAPH grseg catalog entry. When the current view is a table, this menu item is not available. See also "Save" in the section "Toolbar Icons" on page 2616. If a graphics catalog entry name has not already been specified, this action functions like Save Graph As.
Save Graph As
saves the current graph as a SAS/GRAPH grseg catalog entry in a SAS catalog that you specify and/or as an Output Delivery System (ODS) object. By default, an HTML page is created, with the graph embedded as a gif image.
Save Data
saves the data displayed in the viewer window in a SAS data set, where applicable.
Save Data As
saves the data in a SAS data set that you specify and/or as an Output Delivery System (ODS) object. By default, an HTML page is created, with the data displayed as a table.
Import Data
is available if you license SAS/Access software. It opens an Import Wizard, which you can use to import your data from an external spreadsheet or data base to a SAS data set for use in the Time Series Forecasting System.
Export Data
is available if you license SAS/Access software. It opens an Export Wizard, which you can use to export a SAS data set, such as a forecast data set created with the Time Series Forecasting System, to an external spreadsheet or data base.
Print Graph
prints the contents of the viewer window if the current view is a graph. This is the same as the Print toolbar icon. If the current view is a table, this menu item is not available.
Print Data
opens the Print Setup window, which allows you to access your operating system print setup.
Print Preview
opens a preview window to show how your plots will appear when printed.
Close
closes the Model Viewer window and returns to the window from which it was invoked.
Edit
Edit Model
enables you to modify the specification of the current model and to fit the modified model, which is then displayed in the viewer.
Refit Model
refits the current model by using data within the current fit range. This action also causes the ranges to be reset if the data range has changed.
Reevaluate Model
reevaluates the current model by using data within the current evaluation range. This action also causes the ranges to be reset if the data range has changed.
View
See View Selection Icons on page 2617. It describes each of the items available under View, except Zoom Way Out.
Zoom Way Out
zooms the plot out as far as it will go, undoing all prior zoom in operations.
Tools
Link Viewer
Time Ranges
opens the Time Ranges Specification window to enable you to change the period of fit, period of evaluation, or forecast horizon to be applied to subsequently fit models.
Statistics of Fit
opens the Statistics of Fit Selection window, which presents a list of statistics that the system can display. Use this action to customize the list of statistics shown in the statistics of fit table and available for selection in the Model Selection Criterion menu.
Forecast Options
opens the Forecast Options window, which enables you to control the widths of forecast confidence limits and control the kind of predicted values computed for models that include series transformations.
Residual Plot Options
Provides a choice of four methods of computing prediction errors for models which include a data transformation.
Prediction Errors
computes the difference between the transformed series actual values and model predictions.
computes the difference between the untransformed series values and the untransformed model predictions.
Normalized Model Residuals
opens a window to enable you to specify the number of lags shown in the Prediction Error Autocorrelations and Prediction Error Tests views. You can also use the Zoom In and Zoom Out actions to control the number of lags displayed.
Correlation Probabilities
controls whether the bar charts in the Prediction Error Autocorrelations view represent significance probabilities or values of the correlation coefficient. A check mark or filled check box next to this item indicates that significance probabilities are displayed. In each case the bar graph horizontal axis label changes accordingly.
Include Interventions
controls whether intervention effects defined for the current series are automatically added as predictors to the models considered by the automatic selection process. A check mark or filled check box next to this item indicates that the option is turned on.
Print Audit Trail
prints to the SAS log information about the models fit by the system. A check mark or filled check box next to this item indicates that the audit option is turned on.
Show Source Statements
controls whether SAS statements submitted by the forecasting system are printed in the SAS log. When the Show Source Statements option is selected, the system sets the SAS system option SOURCE before submitting SAS statements; otherwise, the system uses the NOSOURCE option. Note that only some of the functions performed by the forecasting system are accomplished by submitting SAS statements. A check mark or filled check box next to this item indicates that the option is turned on.
mouse cursor at one corner of the region, press the left mouse button, and move the mouse cursor to the opposite corner of the region while holding the left mouse button down. When you release the mouse button, the plot is redrawn to show an expanded view of the data within the region you selected.
To select a model to be fit, use the left mouse button. To select more than one model to fit, drag with the mouse, or select the first model, then press the Shift key while selecting the last model. For noncontiguous selections, press the Control key while selecting with the mouse. To begin fitting the models, double-click the last selection or select the OK button. If series diagnostics have been performed, the radio box is available. If the Subset by series diagnostics radio button is selected, only those models in the selection list that fit the diagnostic criteria will be shown for selection. If you want to choose models that do not fit the diagnostic criteria, select the Show all models button.
when selected, lists all available models, regardless of the setting of the series diagnostics options.
Subset by series diagnostics
when selected, lists only the available models that are consistent with the series diagnostics options.
OK
closes the window without fitting any models. Any selections you made are lost.
specifies a lag value to add to the list. Type in a positive integer or select one by clicking the
adds the value in the Lag spin box to the list of polynomial lags. Duplicate values are not allowed.
Remove
closes the window and returns the specified polynomial to the AR or MA polynomial specification window.
Cancel
closes the window and discards any polynomial lags added to the list.
is the name of the time ID variable for the input data set. To specify this variable, you can type the ID variable name in this field or use the Select button.
Time ID Select button
opens the Time ID Variable Specification window.
Create button
opens a menu of choices of methods for creating a time ID variable for the input data set. Use this feature if the input data set does not already contain a valid time ID variable.
Interval
is the time interval between observations (data frequency) in the current input data set. If the interval is not automatically filled in by the system, you can type in an interval name here, or select one from the pop-up list.
Series
indicates the number and names of time series variables for which forecasts will be produced.
Series Select button
opens the Series to Process window to let you select the series for which you want to produce forecasts.
Forecast Output Data Set
is the name of the output data set that will contain the forecasts. Type the name of the output data set in this field or click the Browse button.
Forecast Output Browse button
opens a dialog to let you locate an existing data set to which to save the forecasts.
Format
enables you to select one of three formats for the forecast data set:
Simple
specifies the simple format for the output data set. The data set contains the time ID variable and the forecast variables and contains one observation per time period. Observations for earlier time periods contain actual values copied from the input data set; later observations contain the forecasts.
Interleaved
specifies the interleaved format for the output data set. The data set contains the time ID variable, the variable TYPE, and the forecast variables. There are several observations per time period, with the meaning of each observation identified by the TYPE variable.
Concatenated
specifies the concatenated format for the output data set. The data set contains the variable SERIES, the time ID variable, and the variables ACTUAL, PREDICT, ERROR, LOWER, and UPPER. There is one observation per time period per forecast series. The variable SERIES contains the name of the forecast series, and the data set is sorted by SERIES and DATE.
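The three layouts are easier to compare side by side. The Python sketch below builds each one from the same small set of values. It is only an illustration of the row structure described above: the input dictionary, the function names, the TYPE values shown, and the sign convention for ERROR (actual minus predicted) are assumptions for the example, and the dates are assumed to be already in ascending order.

```python
def simple_format(dates, series):
    """One row per period: the actual value where it exists, otherwise the forecast."""
    rows = []
    for i, d in enumerate(dates):
        row = {"DATE": d}
        for name, s in series.items():
            actual = s["actual"][i]
            row[name] = actual if actual is not None else s["predict"][i]
        rows.append(row)
    return rows


def interleaved_format(dates, series, types=("actual", "predict", "lower", "upper")):
    """Several rows per period; the TYPE column says what each row contains."""
    rows = []
    for i, d in enumerate(dates):
        for t in types:
            row = {"DATE": d, "TYPE": t.upper()}
            for name, s in series.items():
                row[name] = s[t][i]
            rows.append(row)
    return rows


def concatenated_format(dates, series):
    """One row per period per series, ordered by series name and then date."""
    rows = []
    for name in sorted(series):
        s = series[name]
        for i, d in enumerate(dates):
            actual, predict = s["actual"][i], s["predict"][i]
            error = None if actual is None else actual - predict
            rows.append({"SERIES": name, "DATE": d, "ACTUAL": actual,
                         "PREDICT": predict, "ERROR": error,
                         "LOWER": s["lower"][i], "UPPER": s["upper"][i]})
    return rows


# Hypothetical data: two historical periods and one forecast period.
dates = ["2024-01", "2024-02", "2024-03"]
series = {"sales": {"actual": [10.0, 12.0, None], "predict": [9.5, 12.5, 13.0],
                    "lower": [8.0, 11.0, 11.0], "upper": [11.0, 14.0, 15.0]}}
simple = simple_format(dates, series)
interleaved = interleaved_format(dates, series)
concatenated = concatenated_format(dates, series)
print(len(simple), len(interleaved), len(concatenated))
```

The simple layout is the most compact, while the concatenated layout is convenient when forecasts for many series are accumulated in one data set.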
Horizon
is the number of periods or years to forecast beyond the end of the input data range. To specify the forecast horizon, you can type a value in this field or select one from the pop-up list.
Horizon periods
selects the units to apply to the horizon. By default, the horizon value represents number of periods. For example, if the interval is month, the horizon represents the number of months to forecast. Depending on the interval, you can also select weeks or years, so that the horizon is measured in those units.
Horizon date
is the ending date of the forecast horizon. You can type in a date that uses a form recognized by a SAS date informat, or you can increment or decrement the date shown by using the left and right arrows. The outer arrows change the date by a larger amount than the inner arrows. The date field and the horizon field reset each other, so you can use either one to specify the forecasting horizon.
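The way the horizon and the horizon date determine each other can be sketched for a monthly interval as follows. This is a simplified Python illustration under stated assumptions: dates are represented by the first day of each month, only the "periods" and "years" units are handled, and the function names are hypothetical.

```python
from datetime import date


def add_months(start, months):
    """Shift a first-of-month date forward by a whole number of months."""
    index = start.year * 12 + (start.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def horizon_to_date(last_obs, horizon, unit="periods"):
    """Ending date of the forecast horizon for monthly data."""
    months = horizon * 12 if unit == "years" else horizon
    return add_months(last_obs, months)


def date_to_horizon(last_obs, horizon_date):
    """Number of monthly periods between the last observation and the horizon date."""
    return (horizon_date.year - last_obs.year) * 12 + (horizon_date.month - last_obs.month)


last_obs = date(2023, 11, 1)            # last period in the input data
end = horizon_to_date(last_obs, 12)     # 12 periods ahead
print(end, date_to_horizon(last_obs, end))
```

Because each function is the inverse of the other for a given interval, changing either the horizon or the date fixes the other value.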
Run button
produces forecasts for the selected series and stores the forecasts in the specified output SAS data set.
Output button
opens a Viewtable window to display the output data set. This button becomes available once the forecasts have been written to the data set.
Close button
closes the Produce Forecasts window and returns to the Time Series Forecasting window.
Menu Bar
File
Import Data
is available if you license SAS/Access software. It opens an Import Wizard, which you can use to import your data from an external spreadsheet or data base to a SAS data set for use in the Time Series Forecasting System.
Export Data
is available if you license SAS/Access software. It opens an Export Wizard, which you can use to export a SAS data set, such as a forecast data set created with the Time Series Forecasting System, to an external spreadsheet or data base.
Print Setup
opens the Print Setup window, which allows you to access your operating system print setup.
Close
closes the Produce Forecasts window and returns to the Time Series Forecasting window.
View
opens a Viewtable window to browse the output data set. This is the same as the Output button.
Tools
Produce Forecasts
produces forecasts for the selected series and stores the forecasts in the specified output SAS data set. This is the same as the Run button.
Options
opens the Default Time Ranges window to enable you to control how the system sets the time ranges when new series are selected.
Model Selection List
opens the Model Selection List editor window. Use this to edit the set of forecasting models considered by the automatic model selection process and displayed by the Models to Fit window.
Model Selection Criterion
opens the Model Selection Criterion window, which presents a list of goodness-of-fit statistics and enables you to select the fit statistic that is displayed in the table and used by the automatic model selection process to determine the best fitting model.
Statistics of Fit
opens the Statistics of Fit Selection window, which presents a list of statistics that the system can display. Use this action to customize the list of statistics shown in the Statistics of Fit table and available for selection in the Model Selection Criterion window.
Forecast Options
opens the Forecast Options window, which enables you to control the widths of forecast confidence limits and control the kind of predicted values computed for models that include series transformations.
Forecast Data Set
enables you to select one of three formats for the forecast data set. See Format, which is described previously in this section.
Alignment of Dates
Beginning
aligns dates that the system generates to identify forecast observations in output data sets to the beginning of the time intervals.
Middle
aligns dates that the system generates to identify forecast observations in output data sets to the midpoints of the time intervals.
End
aligns dates that the system generates to identify forecast observations in output data sets to the end of the time intervals.
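The three alignment choices can be illustrated for a monthly interval with a short Python sketch. It takes the midpoint to be the date halfway between the first and last day of the month, which is an assumption made for the example; the function names are hypothetical and the system's own date arithmetic may differ in detail.

```python
from datetime import date, timedelta


def month_bounds(year, month):
    """First and last day of the given month."""
    begin = date(year, month, 1)
    if month == 12:
        next_begin = date(year + 1, 1, 1)
    else:
        next_begin = date(year, month + 1, 1)
    return begin, next_begin - timedelta(days=1)


def align(year, month, alignment):
    """Date used to identify a monthly observation under each alignment."""
    begin, end = month_bounds(year, month)
    if alignment == "beginning":
        return begin
    if alignment == "middle":
        return begin + (end - begin) // 2
    if alignment == "end":
        return end
    raise ValueError("alignment must be 'beginning', 'middle', or 'end'")


for choice in ("beginning", "middle", "end"):
    print(choice, align(2024, 1, choice))
```

The alignment changes only the date value that labels each forecast observation, not the interval the observation belongs to.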
Automatic Fit
opens the Automatic Model Selection Options window, which enables you to control the number of models retained by the automatic model selection process and whether the models considered for automatic selection are subset according to the series diagnostics.
Set Toolbar Type
Image Only
controls whether intervention effects defined for the current series are automatically added as predictors to the models considered by the automatic selection process. A check mark or filled check box next to this item indicates that the option is turned on.
Print Audit Trail
prints to the SAS log information about the models fit by the system. A check mark or filled check box next to this item indicates that the audit option is turned on.
Show Source Statements
controls whether SAS statements submitted by the forecasting system are printed in the SAS log. When the Show Source Statements option is selected, the system sets the SAS system option SOURCE before submitting SAS statements; otherwise, the system uses the NOSOURCE option. Note that only some of the functions performed by the forecasting system are accomplished by submitting SAS statements. A check mark or filled check box next to this item indicates that the option is turned on.
is a table listing the names and labels of the variables in the input data set available for selection as regressors. The variables that you select are highlighted. Selecting a highlighted row again deselects that variable.
OK
closes the Regressors Selection window and adds the selected variables as regressors in the model.
Cancel
closes the window without adding any regressors. Any selections you made are lost.
Reset
resets all options to their initial values upon entry to the window.
Save Data As
Use Save Data As from the Time Series Viewer Window or the Model Viewer Window to save data displayed in a table to a SAS data set or external file. Use Save Forecast As from the Develop Models Window to save forecasts and related data including the series name, model, and interval. It supports append mode, enabling you to accumulate the forecasts of multiple series in a single data set.
To save your data in a SAS data set, provide a library name or assign one by using the Browse button, then provide a data set name or accept the default. Enter a descriptive label for the data set in the Label field. Click OK to save the data set. If you specify an existing data set, it will be overwritten, except in the case of Save Forecast As. External file output takes advantage of the Output Delivery System (ODS) and is designed primarily for creating HTML tables for Web reporting. You can build a set of Web pages quickly and use the ODS Results window to view and organize them. To use this feature, check Save External File in the External File Output box. To set ODS options, click Results Preferences, then select the Results tab in the Preferences dialog. If you have previously saved data of the current type, the system remembers your previous labels and titles. To reuse them, click the arrow button to the right of each of these window fields. Use the Customize button if you need to specify the name of a custom macro that contains ODS statements. The default macro simply runs the PRINT procedure. A custom macro can be used to add PRINT procedure and/or ODS statements to customize the type and organization of output files produced.
Save Graph As
Use Save Graph As from the Time Series Viewer Window or the Model Viewer Window to save any of the graphs in a catalog or external file.
To save your graph as a grseg catalog entry, enter a two-level name for the catalog or select Browse to open an Open dialog. Use it to select an existing library or assign a new one and then select a catalog to contain the graph. Click the Open button to open the catalog and close the dialog. Then enter a graphics entry name (eight characters or less) and a label or accept the defaults and click the OK button to save the graph. External file output takes advantage of the Output Delivery System (ODS) and is designed primarily for creating gif images and HTML for Web reporting. You can build a set of Web pages that contain graphs and use the Results window to view and organize them. To use this feature, check Save External File in the External File Output box. To set ODS options, click Results Preferences, then select the Results tab in the Preferences dialog. If you have previously saved graphs of the current type, the system remembers your previous labels and titles. To reuse them, click the arrow button to the right of each of these window fields. Use the Customize button if you need to specify the name of a custom macro that contains ODS statements. The default macro simply runs the GREPLAY procedure. Users familiar with ODS might want to add statements to the macro to customize the type and organization of output files produced.
Use these combo boxes to specify the orders of the ARIMA model. You can either type in a value or click the combo box arrow to select from a pop-up list.
Autoregressive
closes the Seasonal ARIMA Model Options window and returns to the Custom Model Specification window.
Cancel
closes the Seasonal ARIMA Model Options window and returns to the Custom Model Specification window, discarding any changes made.
Reset
resets all options to their initial values upon entry to the window.
For each of the options Log Transform, Trend, and Seasonality, the value Yes means that only models appropriate for series with that characteristic should be considered. The value No means that only models appropriate for series without that characteristic should be considered. The value Maybe means that models should be considered without regard for that characteristic.
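The Yes/No/Maybe rule amounts to a simple filter over the model list. The Python sketch below shows one way to express it; the model labels, the per-model feature flags, and the function names are hypothetical and serve only to illustrate the rule.

```python
def consistent(option, model_has_feature):
    """Yes keeps only models with the feature, No keeps only models
    without it, and Maybe keeps models regardless of the feature."""
    if option == "Maybe":
        return True
    return (option == "Yes") == model_has_feature


def subset_by_diagnostics(models, diagnostics):
    """Labels of the models consistent with all three diagnostic settings."""
    keys = ("log", "trend", "seasonality")
    return [m["label"] for m in models
            if all(consistent(diagnostics[k], m[k]) for k in keys)]


# Hypothetical models with assumed feature flags.
models = [
    {"label": "Mean", "log": False, "trend": False, "seasonality": False},
    {"label": "Linear Trend", "log": False, "trend": True, "seasonality": False},
    {"label": "Log Linear Trend", "log": True, "trend": True, "seasonality": False},
    {"label": "Seasonal Smoothing", "log": False, "trend": False, "seasonality": True},
    {"label": "Winters Method (Additive)", "log": False, "trend": True, "seasonality": True},
]
diagnostics = {"log": "No", "trend": "Yes", "seasonality": "Maybe"}
kept = subset_by_diagnostics(models, diagnostics)
print(kept)
```

Setting all three options to Maybe keeps every model, which corresponds to showing all models regardless of the series diagnostics.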
Series Characteristics
Log Transform
specifies whether forecasting models with or without a logarithmic series transformation are appropriate for the series.
Trend
specifies whether forecasting models with or without a trend component are appropriate for the series.
Seasonality
specifies whether forecasting models with or without a seasonal component are appropriate for the series.
Automatic Series Diagnostics
performs the automatic series diagnostic process. The options Log Transform, Trend, and Seasonality are set according to the results of statistical tests.
OK
closes the Series Diagnostics window without changing the series diagnostics options. Any options you specified are lost.
Reset
resets all options to their initial values upon entry to the Series Diagnostics window.
Clear
This window appears automatically when you select the View Series Graphically or Develop Models buttons in the Time Series Forecasting window and no series has been selected, and when you open the Time Series Viewer as a standalone tool. It is also invoked by using the Browse button in the Develop Models window. The system requires that series names be unique for each frequency (interval) within the forecasting project. If you select a series from the current input data set that already exists in the project with the same interval but a different input data set name, the system warns you and gives you the option to cancel the selection, to refit all models associated with the series by using the data from the current input data set, to delete the models for the series, or to inherit the existing models.
is a SAS libname assigned within the current SAS session. If you know the libname associated with the data set of interest, you can type it in this field and press Return. If it is a valid choice, it will appear in the libraries list and will be highlighted. The SAS Data Sets list will be populated with data sets associated with that libname.
Data Set
is the name of a SAS data set (data file or data view) that resides under the selected libname. If you know the name, you can type it in and press Return. If it is a valid choice, it will appear in the SAS Data Sets list and will be highlighted, and the Time Series Variables list will be populated with the numeric variables in the data set.
Variable
is the name of a numeric variable contained in the selected data set. You can type the variable name in this field or you can select the variable with the mouse from the Time Series Variables list.
Time ID
is the name of the ID variable for the input data set. To specify the ID variable, you can type the ID variable name in this field or click the Select button.
Select button
opens the Time ID Variable Specification window to let you select an existing variable in the data set as the Time ID.
Create button
opens a menu of methods for creating a time ID variable for the input data set. Use this feature if the data set does not already contain a valid time ID variable.
Interval
is the time interval between observations (data frequency) in the selected data set. If the interval is not automatically filled in by the system, you can type in an interval name or select one from the pop-up list. For more information about intervals, see Chapter 4, "Date Intervals, Formats, and Functions," in this book.
OK
This button is present when you have selected Develop Models from the Time Series Forecasting window. It closes the Series Selection window and makes the selected series the current series.
Close
If you have selected the View Series Graphically icon from the Time Series Forecasting window, this button returns you to that window. If you have selected a series, it remains selected as the current series. If you are using the Time Series Viewer as a standalone application, this button closes the application.
Cancel
This button is present when you have selected Develop Models from the Time Series Forecasting window. It closes the Series Selection window without applying any selections made.
Reset
opens a Viewtable window for browsing the selected data set. This can assist you in locating the variable containing data you are looking for.
Graph
opens the Time Series Viewer window to display the selected time series variable. You can switch to a different series in the Series Selection window without closing the Time Series Viewer window. Position the windows so they are both visible, or use the Next Viewer toolbar icon or F12 function key to switch between windows.
Refresh
updates all elds and lists on the window. If you assign a new libname without exiting the Series Selection window, use the refresh action to update the Libraries list so that it will include the newly assigned libname. Also use the Refresh action to update the variables list if the input data set is changed.
Selection Lists
Libraries
displays a list of currently assigned libnames. You can select a libname by clicking it, which is equivalent to typing its name in the Library field. If you cannot locate the library or directory you are interested in, go to the SAS Explorer window, select New from the File menu, then select Library and OK. This opens the New Library dialog window. You can also assign a libname by submitting a LIBNAME statement from the Editor window. Select the Refresh button to make the new libname available in the Libraries list.
SAS Data Sets
displays a list of the SAS data sets (data files or data views) located under the selected libname. You can select one of these by clicking it, which is equivalent to typing its name in the Data Set field.
Time Series Variables
displays a list of numeric variables contained within the selected data set. You can select one of these by clicking it, which is equivalent to typing its name in the Variable field. You can double-click a series to select it and exit the window.
closes the window and applies the series selection(s) which have been made. At least one series must be selected.
Cancel
closes the window, ignoring series selections which have been made, if any.
Clear
is the transformation applied to the time series displayed in the Time Series Viewer window. Select Log, Logistic, Square Root, Box-Cox, or none from the pop-up list.
Simple Differencing
is the order of differencing applied to the time series displayed in the Time Series Viewer window. Select a number from 0 to 5 from the pop-up list.
Seasonal Differencing
is the order of seasonal differencing applied to the time series displayed in the Time Series Viewer window. Select a number from 0 to 3 from the pop-up list.
Percent Change
is a check box that if selected displays the series in terms of percent change from the previous period.
Additive Decomposition
is a check box that produces a display of a selected series component derived by using additive decomposition.
Multiplicative Decomposition
is a check box that produces a display of a selected series component derived using multiplicative decomposition.
Component
selects the series component to display when either decomposition option is turned on. You can display the seasonally adjusted component, the trend-cycle component, the seasonal component, or the irregular component (that is, the residual that remains after removal of the other components). The heading in the viewer window shows which component is currently displayed.
OK
applies the transformation options you selected to the series displayed in the Time Series Viewer window and closes the Series Viewer Transformations window.
Cancel
closes the Series Viewer Transformations window without changing the series displayed by the Time Series Viewer window.
Apply
applies the transformation options you selected to the series displayed in the Time Series Viewer window without closing the Series Viewer Transformations window.
Reset
resets the transformation options to their initial values upon entry to the Series Viewer Transformations window.
Clear
is a descriptive label for the model that you specify. You can type a label in this field or allow the system to provide a label. If you leave the label blank, a label is generated automatically based on the options you specify.
Smoothing Methods
Simple Smoothing
specifies simple (single) exponential smoothing.
Double (Brown) Smoothing
specifies double exponential smoothing by using Brown's one-parameter model (single exponential smoothing applied twice).
Seasonal Smoothing
specifies seasonal exponential smoothing. (This is like Winters method with the trend term omitted.)
Linear (Holt) Smoothing
specifies exponential smoothing of both the series level and trend (Holt's two-parameter model).
Damped-Trend Smoothing
specifies exponential smoothing of both the series level and trend with a trend damping weight.
Winters Method - Additive
displays the values used for the smoothing weights. By default, the Smoothing Weights fields are set to optimize, which means that the system computes the weight values that best fit the data. You can also type smoothing weight values in these fields.
Level
is the smoothing weight used by the damped-trend method to damp the forecasted trend towards zero as the forecast horizon increases.
Season
is the smoothing weight used for the seasonal factors in Winters method and seasonal exponential smoothing.
Transformation
displays the series transformation specified for the model. When a transformation is specified, the model is fit to the transformed series, and forecasts are produced by applying the inverse transformation to the model predictions. Select Log, Logistic, Square Root, Box-Cox, or None from the pop-up list.
Bounds
displays the constraints imposed on the fitted smoothing weights. Select one of the following from the pop-up list:
Zero-One/Additive
sets the smoothing weight optimization region to the intersection of the region bounded by the intervals from zero (0.001) to one (0.999) and the additive invertible region. This is the default.
Zero-One Boundaries
sets the smoothing weight optimization region to the region bounded by the intervals from zero (0.001) to one (0.999).
Additive Invertible
sets the smoothing weight optimization region to the additive invertible region.
Unrestricted
opens the Smoothing Weights window to enable you to customize the constraints for smoothing weights optimization.
OK
closes the Smoothing Model Specification window and fits the model you specified.
Cancel
closes the Smoothing Model Specification window without fitting the model. Any options you specified are lost.
Reset
resets all options to their initial values upon entry to the window. This might be useful when editing an existing model specification; otherwise, Reset has the same function as Clear.
Clear
when selected, restricts the fitted smoothing weights to be within the bounds that you specify with the Smoothing Weight Bounded Region options.
Additive invertible region
when selected, restricts the fitted smoothing weights to be within the additive invertible region of the parameter space of the ARIMA model equivalent to the smoothing model. (See the section "Smoothing Models" on page 2669 for details.)
Additive invertible and bounded region
when selected, restricts the fitted smoothing weights to be both within the additive invertible region and within bounds that you specify.
Smoothing Weight Bounded Region
is a group of numeric entry fields that enable you to specify lower and upper limits on the fitted value of each smoothing weight. The fields that appear in this part of the window depend on the kind of smoothing model that you specified.
OK
closes the window and sets the options that you specified.
Cancel
closes the window without changing any options. Any values you specified are lost.
Reset
resets all options to their initial values upon entry to the window.
Clear
list the available statistics. Select a row of the table to select or deselect the statistic shown in that row.
OK
is the name of the time ID variable to be created. You can type any valid SAS variable name in this field.
OK
closes the window and proceeds to the next step in the time ID creation process.
Cancel
closes the window without creating a time ID variable. Any options you specified are lost.
is a list of variables in the input data set. Select existing ID variables from this list.
Date Part
is a list of date parts that you can specify for the selected ID variable. For each ID variable that you select from the Variables list, select the Date Part value that describes what the values of the ID variable represent.
arrow button
moves the selected existing ID variable and date part specification to the Existing Time IDs list. Once you have done this, you can select another ID variable from the Variables list.
New variable
is the name of the time ID variable to be created. You can type any valid SAS variable name in this field.
New interval
is the time interval between observations in the input data set implied by the date part ID variables you have selected.
OK
closes the window and proceeds to the next step in the time ID creation process.
Cancel
closes the window without creating a time ID. Any options you specified are lost.
Reset
resets the options to their initial values upon entry to the window.
is the starting date for the time series in the data set. Enter a date value in this field, using a form recognizable by a SAS date informat, for example, 1998:1, feb1997, or 03mar1998.
Interval
is the time interval between observations in the data set. Select an interval from the pop-up list.
New ID variable name
is the name of the time ID variable to be created. You can type any valid SAS variable name in this field.
OK
closes the window and proceeds to the next step in the time ID creation process.
Cancel
closes the window without changing the input data set. Any options you specified are lost.
is the name of an existing ID variable in the input data set. Click the Select button to select a variable.
Select button
opens a list of variables in the input data set for you to select from.
Informat
is a SAS date or datetime informat for reading date or datetime values from the values of the specified existing ID variable. You can type in an informat or select one from the pop-up list.
First Obs
is the value of the variable you selected from the first observation in the data set, displayed here for convenience.
Date Value
is the SAS date or datetime value read from the first observation value by using the informat that you specified.
is the name of the time ID variable to be created. You can type any valid SAS variable name in this field.
OK
closes the window and proceeds to the next step in the time ID creation process.
Cancel
closes the window without changing the input data set. Any options you specified are lost.
Reset
resets the options to their initial values upon entry to the window.
is the time interval between observations (data frequency) in the input data set.
Select a Time ID Variable
is a selection list of variables in the input data set. Select one variable to assign it as the time ID variable.
OK
closes the window and retains the selection made, if it is a valid time ID.
Cancel
restores the time ID variable to the one assigned when the window was initially opened, if any.
is the time interval (data frequency) for the input data set.
Series
gives the dates of the first and last nonmissing data values available for the current series in the input data set.
Period of Fit
gives the starting and ending dates of the period of fit. This is the time range used for estimating model parameters. By default, it is the same as the data range. You can type dates in these fields, or you can use the arrow buttons to the left and right of the date fields to decrement or increment the date values shown. Date values must be entered in a form recognized by a SAS date informat. (See SAS Language Reference: Concepts for information about SAS date informats.) The inner arrows increment by periods; the outer arrows increment by larger amounts, depending on the data interval.
Period of Evaluation
gives the starting and ending dates of the period of evaluation. This is the time range used for evaluating models in terms of statistics of fit. By default, it is the same as the data range. You can type dates in these fields, or you can use the control arrows to the left and right of the date fields to decrement or increment the date values shown. Date values must be entered in a form recognized by a SAS date informat. (See SAS Language Reference: Concepts for information about SAS date informats.) The inner arrows increment by periods; the outer arrows increment by larger amounts, depending on the data interval.
Forecast Horizon
is the forecasting horizon expressed as a number of forecast periods or number of years (or number of weeks for daily data). You can type a number or select one from the pop-up list. The ending date for the forecast period is automatically updated when you change the number of forecast periods.
Forecast Horizon - Units
indicates whether the Forecast Horizon value represents periods or years (or weeks for daily data).
Forecast Horizon Date Value
is the date of the last forecast observation. You can type a date in this field, or you can use the arrow buttons to the left and right of the date field to decrement or increment the date values shown. Date values must be entered in a form recognized by a SAS date informat. (See SAS Language Reference: Concepts for information about SAS date informats.) The Forecast Horizon is automatically updated when you change the ending date for the forecast period.
Hold-out Sample
specifies that a number of observations or years (or weeks) of data at the end of the data range are used for the period of evaluation with the remainder of data used as the period of fit. You can type a number in this field or select one from the pop-up list. When the hold-out sample value is changed, the Period of Fit and Period of Evaluation ranges are changed to reflect the hold-out sample specification.
Hold-out Sample - Units
indicates whether the hold-out sample field represents periods or years (or weeks for daily data).
OK
closes the window without saving changes. Any options you specified are lost.
Reset
resets the options to their initial values upon entry to the window.
Clear
is the name of the SAS catalog entry in which forecasting models and other results will be stored and from which previously stored results are loaded into the forecasting system. You can specify the project by typing a SAS catalog entry name in this field or by selecting the Browse button to the right of this field. If you specify the name of an existing catalog entry, the information in the project file is loaded. If you specify a one-level name, the catalog name is assumed to be fmsproj and the library is assumed to be sasuser. For example, samproj is equivalent to sasuser.fmsproj.samproj.
Project Browse button
opens the Forecasting Project File Selection window to enable you to select and load the project from a list of previously stored projects.
Description
is a descriptive label for the forecasting project. The description you type in this field will be stored with the catalog entry shown in the Project field.
Data Set
is the name of the current input data set. To specify the input data set, you can type the data set name in this field or use the Browse button to the right of the field.
Data Set Browse button
opens the Data Set Selection window to enable you to select the input data set.
Time ID
is the name of the ID variable for the input data set. To specify the ID variable, you can type the ID variable name in this field or use the Select button. If the time ID variable is named date, time, or datetime, it is automatically picked up by the system.
Select button
opens a menu of choices of methods for creating a time ID variable for the input data set. Use this feature if the input data set does not already contain a valid time ID variable.
Interval
is the time interval between observations (data frequency) in the current input data set. If the interval is not automatically filled in, you can type an interval name or select one from the pop-up list. For more information about intervals, see the section "Time Series Data Sets, ID Variables, and Time Intervals" on page 2395.
View Series Graphically icon
opens the Time Series Viewer window to display plots of series in the current input data set.
View Data as a Table
opens a Viewtable window for browsing the selected input data set.
Develop Models
opens the Develop Models window to enable you to fit forecasting models to individual time series and choose the best models to use to produce the final forecasts of each series.
opens the Automatic Model Fitting window for applying the automatic model selection process to all series or to selected series in an input data set.
Produce Forecast
opens the Produce Forecasts window for producing forecasts for the series in the current input data set for which you have fit forecasting models.
Manage Projects
opens the Manage Forecasting Project window for viewing or editing information stored in projects.
Exit
is the name of the data set to be created. Type in a one-level or two-level SAS data set name.
Interval
is the time interval between observations (data frequency) in the simulated data set. Type in an interval name or select one from the pop-up list.
Seed
is the seed for the random number generator used to produce the simulated time series.
N Observations
is the starting date for the simulated observations. Type in a date in a form recognizable by a SAS date informat, for example, 1998:1, feb1997, or 03mar1998.
Ending Date
is the ending date for the simulated observations. Type in a date in a form recognizable by a SAS date informat.
Series to Generate
opens the ARIMA Process Specification window to enable you to add entries to the Series to Generate list.
Delete Series
closes the Time Series Simulation window, performs the specified simulations, and creates the specified data set.
Cancel
closes the window without creating a simulated data set. Any options you specified are lost.
The state of the Time Series Viewer window is controlled by the current series, the current series transformation specification, and the currently selected view. You can resize this window, and you can use other windows without closing the Time Series Viewer window. You can explore a number of series conveniently by keeping the Series Selection window open. Each time you make a selection, the viewer window is updated to show the selected series. Keep both windows visible, or switch between them by using the Next Viewer toolbar icon or the F12 function key.
You can open multiple Time Series Viewer windows. This enables you to "freeze" a plot so you can come back to it later, or compare two plots side by side on your screen. To do this, unlink the viewer by using the Link/Unlink icon on the window's toolbar or the corresponding item in the Tools menu. While the viewer window remains unlinked, it is not updated when other selections are made in the Series Selection window. Instead, when you select a series and click the Graph button, a new Time Series Viewer window is invoked. You can continue this process to open as many viewer windows as you want. The Next Viewer icon and corresponding F12 function key are useful for navigating between windows when they are not simultaneously visible on your screen.
A wide range of series transformations is available. Basic transformations are available from the window's horizontal toolbar, and others are available by selecting Other Transformations from the Tools menu.
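As a rough illustration of what the basic toolbar transformations (log, simple difference, and seasonal difference, each described later in this section) do to a series, here is a minimal Python sketch. It is not how the viewer is implemented; the function names and the `monthly` series are hypothetical and exist only for this example.

```python
import math

def log_transform(series):
    """Natural log of each value; values must be strictly positive."""
    return [math.log(y) for y in series]

def difference(series, lag=1):
    """Subtract from each value the value `lag` periods earlier."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical monthly series: linear trend plus a bump every December.
monthly = [100 + 2 * t + 10 * (t % 12 == 11) for t in range(24)]

simple_diff = difference(monthly)                   # simple difference
seasonal_diff = difference(monthly, lag=12)         # value minus value one year earlier
log_then_diff = difference(log_transform(monthly))  # transformations can be combined
```

For monthly data the seasonal lag is 12, so the seasonal difference removes the repeating December bump and leaves only the year-over-year change.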
Zoom in
changes the mouse cursor into cross hairs that you can use with the left mouse button to drag out a region of the time series plot to zoom in on. In the Autocorrelations view and the White Noise and Stationarity Tests view, Zoom In reduces the number of lags displayed.
Zoom out
reverses the previous Zoom In action and expands the time range of the plot to show more of the series. In the Autocorrelations view and the White Noise and Stationarity Tests view, Zoom Out increases the number of lags displayed.
Link/Unlink viewer
disconnects or connects the Time Series Viewer window to the window in which the series was selected. When the Viewer is linked, it always shows the current series. If you select another series, linked Viewers are updated. Unlinking a Viewer freezes its current state, and changing the current series has no effect on the Viewer's display. The View Series action creates a new Series Viewer window if there is no linked Viewer. By using the unlink feature, you can open several Time Series Viewer windows and display several different series simultaneously.
Log Transform
applies a log transform to the current view. This can be combined with other transformations; the current transformations are shown in the title.
Difference
applies a simple difference to the current view. This can be combined with other transformations; the current transformations are shown in the title.
Seasonal Difference
applies a seasonal difference to the current view. For example, if the data are monthly, the seasonal cycle is one year. Each value has subtracted from it the value from one year previous. This can be combined with other transformations; the current transformations are shown in the title.
Close
closes the Time Series Viewer window and returns to the window from which it was invoked.
displays plots of the sample autocorrelations, partial autocorrelation, and inverse autocorrelation functions for the series, with lines overlaid at plus and minus two standard errors.
White Noise and Stationarity Tests
displays horizontal bar charts that represent results of white noise and stationarity tests. The first bar chart shows the significance probability of the Ljung-Box chi-square statistic computed on autocorrelations up to the given lag. Longer bars favor rejection of the null hypothesis that the series is white noise. Click any of the bars to display an interpretation.
The second bar chart shows tests of stationarity, where longer bars favor the conclusion that the series is stationary. Each bar displays the significance probability of the augmented Dickey-Fuller unit root test to the given autoregressive lag. Long bars represent higher levels of significance against the null hypothesis that the series contains a unit root. For seasonal data, a third bar chart appears for seasonal root tests. Click any of the bars to display an interpretation.
Data Table
displays a data table containing the values in the input data set.
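The Ljung-Box statistic behind the white noise bar chart described above can be sketched in a few lines of Python. This uses the standard textbook definition, Q = n(n+2) Σ r_k²/(n−k) summed over lags 1 to m. It is an assumption that the system computes the statistic exactly this way, and the function names are hypothetical. The significance probability would come from comparing Q to a chi-square distribution, which is omitted here.

```python
def autocorrelations(series, max_lag):
    """Sample autocorrelations r_1..r_max_lag about the series mean."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]

def ljung_box_q(series, max_lag):
    """Ljung-Box Q statistic on autocorrelations up to max_lag."""
    n = len(series)
    r = autocorrelations(series, max_lag)
    return n * (n + 2) * sum(rk * rk / (n - k)
                             for k, rk in enumerate(r, start=1))

# A strongly alternating series is far from white noise, so Q is large.
alternating = [1.0, -1.0] * 10
q1 = ljung_box_q(alternating, 1)
```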
Menu Bar
File
Save Graph
saves the current plot as a SAS/GRAPH grseg catalog entry in a default or most recently specified catalog. This item is unavailable in the Data Table view.
Save Graph as
saves the current graph as a SAS/GRAPH grseg catalog entry in a SAS catalog that you specify and/or as an Output Delivery System (ODS) object. By default, an HTML page is created, with the graph embedded as a gif image. This item is unavailable in the Data Table view.
Save Data
saves the data displayed in the viewer window to an output SAS data set. This item is unavailable in the Series view.
Save Data as
saves the data in a SAS data set that you specify and/or as an Output Delivery System (ODS) object. By default, an HTML page is created, with the data displayed as a table.
Import Data
is available if you license SAS/Access software. It opens an Import Wizard, which you can use to import your data from an external spreadsheet or data base to a SAS data set for use in the Time Series Forecasting System.
Export Data
is available if you license SAS/Access software. It opens an Export Wizard, which you can use to export a SAS data set, such as a forecast data set created with the Time Series Forecasting System, to an external spreadsheet or data base.
Print Graph
prints the plot displayed in the viewer window. This item is unavailable in the Data Table view.
Print Data
prints the data displayed in the viewer window. This item is unavailable in the Series view.
Print Setup
opens the Print Setup window, which allows you to access your operating system print setup.
Print Preview
opens a preview window to show how your plots will look when printed.
Close
closes the Time Series Viewer window and returns to the window from which it was invoked.
View
Series
displays a plot of series values over time. This is the same as the Series icon in the vertical toolbar.
Autocorrelations
displays plots of the sample autocorrelation, partial autocorrelation, and inverse autocorrelation functions for the series. This is the same as the Autocorrelations icon in the vertical toolbar.
White Noise and Stationarity Tests
displays horizontal bar charts representing results of white noise and stationarity tests. This is the same as the White Noise and Stationarity Tests icon in the vertical toolbar.
Data Table
displays a data table containing the values in the input data set. This is the same as the Data Table icon in the vertical toolbar.
Zoom In
zooms the display. This is the same as the Zoom In icon in the window's horizontal toolbar.
Zoom Out
undoes the last zoom in action. This is the same as the Zoom Out icon in the window's horizontal toolbar.
Zoom Way Out
reverses all previous Zoom In actions and expands the time range of the plot to show all of the series, or shows the maximum number of lags in the Autocorrelations View or the White Noise and Stationarity Tests view.
Tools
Log Transform
applies a log transformation. This is the same as the Log Transform icon in the window's horizontal toolbar.
Difference
applies simple differencing. This is the same as the Difference icon in the window's horizontal toolbar.
Seasonal Difference
applies seasonal differencing. This is the same as the Seasonal Difference icon in the window's horizontal toolbar.
Other Transformations
opens the Series Viewer Transformations window to enable you to apply a wide range of transformations.
Diagnose Series
opens the Series Diagnostics window to determine the kinds of forecasting models appropriate for the current series.
Define Interventions
opens the Interventions for Series window to enable you to edit or add intervention effects for use in modeling the current series.
Link Viewer
connects or disconnects the Time Series Viewer window to the window from which series are selected. This is the same as the Link item in the window's horizontal toolbar.
Options
Number of Lags
opens a window to let you specify the number of lags shown in the Autocorrelations view and the White Noise and Stationarity Tests view. You can also use the Zoom In and Zoom Out actions to control the number of lags displayed.
Correlation Probabilities
controls whether the bar charts in the Autocorrelations view represent significance probabilities or values of the correlation coefficient. A check mark or filled check box next to this item indicates that significance probabilities are displayed. In each case the bar graph horizontal axis label changes accordingly.
Chapter 44
This chapter provides computational details on several aspects of the Time Series Forecasting System.
Parameter Estimation
The parameter estimation process for ARIMA and smoothing models is described graphically in Figure 44.1.
Figure 44.1 Model Fitting Flow Diagram
The specification of smoothing and ARIMA models is described in Chapter 39, "Specifying Forecasting Models." Computational details for these kinds of models are provided in the sections "Smoothing Models" on page 2669 and "ARIMA Models" on page 2680. The results of the parameter estimation process are displayed in the Parameter Estimates table of the Model Viewer window along with the estimate of the model variance and the final smoothing state.
Model Evaluation
The model evaluation process is described graphically in Figure 44.2.
Model evaluation is based on the one-step-ahead prediction errors for observations within the period of evaluation. The one-step-ahead predictions are generated from the model specification and parameter estimates. The predictions are inverse transformed (median or mean) and adjustments are removed. The prediction errors (the difference of the dependent series and the predictions) are used to compute the statistics of fit, which are described in the section "Series Diagnostic Tests" on page 2687. The results generated by the evaluation process are displayed in the Statistics of Fit table of the Model Viewer window.
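To make the evaluation step concrete, the following Python sketch computes a few common statistics of fit from one-step-ahead prediction errors. The particular statistics shown (MSE, RMSE, MAE, MAPE) and the function name are assumptions chosen for illustration; the system's own definitions are given in the section referenced above. The sketch also assumes that the predictions have already been inverse transformed and had adjustments handled.

```python
import math

def statistics_of_fit(actual, predicted):
    """Summary statistics of the prediction errors (actual minus predicted)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }

# Hypothetical values within a period of evaluation.
actual = [100.0, 110.0, 120.0, 125.0]
one_step = [98.0, 112.0, 117.0, 125.0]
fit = statistics_of_fit(actual, one_step)
```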
Forecasting
The forecast generation process is described graphically in Figure 44.3.
The forecasting process is similar to the model evaluation process described in the preceding section, except that k-step-ahead predictions are made from the end of the data through the specified forecast horizon, and prediction standard errors and confidence limits are calculated. The forecasts and confidence limits are displayed in the Forecast plot or table of the Model Viewer window.
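A minimal Python sketch of k-step-ahead forecasts with confidence limits follows, using simple exponential smoothing as the example model. Two things here are assumptions rather than statements from this chapter: the k-step prediction variance σ²(1 + (k−1)α²), which is the standard textbook result for simple exponential smoothing, and normally distributed prediction errors. The function names are hypothetical.

```python
from statistics import NormalDist

def ses_forecast(last_level, alpha, sigma2, horizon):
    """k-step-ahead predictions and standard errors for simple smoothing.

    Assumes the textbook variance sigma2 * (1 + (k - 1) * alpha**2).
    """
    predictions = [last_level] * horizon  # flat forecast from the last state
    std_errors = [(sigma2 * (1 + (k - 1) * alpha ** 2)) ** 0.5
                  for k in range(1, horizon + 1)]
    return predictions, std_errors

def confidence_limits(predictions, std_errors, level=0.95):
    """Symmetric limits: prediction plus or minus z times the standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return [(p - z * s, p + z * s) for p, s in zip(predictions, std_errors)]

preds, ses = ses_forecast(last_level=50.0, alpha=0.3, sigma2=4.0, horizon=3)
limits = confidence_limits(preds, ses)
```

The limits widen as the horizon grows because the prediction variance accumulates with each additional step.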
Adjustments
Adjustment predictors are subtracted from the response time series prior to model parameter estimation, evaluation, and forecasting. After the predictions of the adjusted response time series are obtained from the forecasting model, the adjustments are added back to produce the forecasts. If $y_t$ is the response time series and $X_{i,t}$, $1 \le i \le m$, are $m$ adjustment predictor series, then the adjusted response series $w_t$ is
$$w_t = y_t - \sum_{i=1}^{m} X_{i,t}$$
Parameter estimation for the model is performed by using the adjusted response time series $w_t$. The forecasts $\hat{w}_t$ of $w_t$ are adjusted to obtain the forecasts $\hat{y}_t$ of $y_t$:
$$\hat{y}_t = \hat{w}_t + \sum_{i=1}^{m} X_{i,t}$$
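The adjustment step described above amounts to subtracting the adjustment series before modeling and adding them back afterward. A minimal Python sketch, with hypothetical function names and data:

```python
def adjust(response, adjustments):
    """w_t = y_t minus the sum over i of X_{i,t}."""
    return [y - sum(x[t] for x in adjustments)
            for t, y in enumerate(response)]

def unadjust(adjusted_predictions, adjustments):
    """y-hat_t = w-hat_t plus the sum over i of X_{i,t}."""
    return [w + sum(x[t] for x in adjustments)
            for t, w in enumerate(adjusted_predictions)]

y = [10.0, 12.0, 15.0]
x1 = [1.0, 1.0, 1.0]   # hypothetical adjustment series
x2 = [0.0, 2.0, 0.0]

w = adjust(y, [x1, x2])          # series the model is actually fit to
restored = unadjust(w, [x1, x2])  # adding the adjustments back recovers y
```

In practice the adjustment series must also extend over the forecast horizon so that they can be added back to the forecasts.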
Series Transformations
For pure ARIMA models, transforming the response time series can aid in obtaining stationary noise series. For general ARIMA models with inputs, transforming the response time series or one or more of the input time series can provide a better model fit. Similarly, the fit of smoothing models can improve when the response series is transformed. There are four transformations available, for strictly positive series only. Let $y_t > 0$ be the original time series, and let $w_t$ be the transformed series. The transformations are defined as follows:
Log
is the logarithmic transformation, $w_t = \ln(y_t)$
Logistic
is the logistic transformation, $w_t = \ln\left(c y_t / (1 - c y_t)\right)$, where the scaling factor $c$ is $c = (1 - e^{-6})\,10^{-\mathrm{ceil}(\log_{10}(\max(y_t)))}$ and $\mathrm{ceil}(x)$ is the smallest integer greater than or equal to $x$.
Square Root
is the square root transformation, $w_t = \sqrt{y_t}$
Box Cox
is the Box-Cox transformation, $w_t = (y_t^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$, and $w_t = \ln(y_t)$ for $\lambda = 0$
Parameter estimation is performed by using the transformed series. The transformed model predictions and confidence limits are then obtained from the transformed time series and these parameter estimates. The transformed model predictions $\hat{w}_t$ are used to obtain either the minimum mean absolute error (MMAE) or minimum mean squared error (MMSE) predictions $\hat{y}_t$, depending on the setting of the forecast options. The model is then evaluated based on the residuals of the original time series and these predictions. The transformed model confidence limits are inverse-transformed to obtain the forecast confidence limits.
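The four transformations can be sketched in Python as follows. This is an illustration under stated assumptions, not the system's code: the logistic scaling factor is taken as c = (1 − e⁻⁶)·10^(−ceil(log10(max y))) as read from this section, the Box-Cox form is the standard (y^λ − 1)/λ, and the function names are hypothetical.

```python
import math

def logistic_scale(series):
    """Scaling factor c that keeps c * y strictly between 0 and 1."""
    return (1 - math.exp(-6)) * 10 ** (-math.ceil(math.log10(max(series))))

def transform(series, kind, lam=None):
    """Apply one of the four transformations to a strictly positive series."""
    if min(series) <= 0:
        raise ValueError("series must be strictly positive")
    if kind == "log" or (kind == "boxcox" and lam == 0):
        return [math.log(y) for y in series]
    if kind == "logistic":
        c = logistic_scale(series)
        return [math.log(c * y / (1 - c * y)) for y in series]
    if kind == "sqrt":
        return [math.sqrt(y) for y in series]
    if kind == "boxcox":
        if lam is None:
            raise ValueError("Box-Cox requires a lambda value")
        return [(y ** lam - 1) / lam for y in series]
    raise ValueError("unknown transformation: " + kind)

ys = [20.0, 50.0, 80.0]
c = logistic_scale(ys)
w = transform(ys, "logistic")
# The inverse-logistic transformation recovers the original values.
back = [1 / (c * (1 + math.exp(-v))) for v in w]
```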
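The minimum mean absolute error (MMAE) predictions are the median of the conditional probability density function at each point in time. If $w_t = F(y_t)$ is the transformation with inverse $y_t = F^{-1}(w_t)$, then
$$\mathrm{median}(\hat{y}_t) = F^{-1}(E[w_t]) = F^{-1}(\hat{w}_t)$$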
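The minimum mean squared error (MMSE) predictions are the mean of the conditional probability density function at each point in time. Assuming that the prediction errors are normally distributed with variance $\sigma_t^2$, the MMSE predictions for each of the transformations are as follows:
Log
is the conditional expectation of the inverse-logarithmic transformation, $\hat{y}_t = E[e^{w_t}] = \exp(\hat{w}_t + \sigma_t^2/2)$
Logistic
is the conditional expectation of the inverse-logistic transformation, $\hat{y}_t = E\left[\frac{1}{c(1 + \exp(-w_t))}\right]$, where the scaling factor $c = (1 - e^{-6})\,10^{-\mathrm{ceil}(\log_{10}(\max(y_t)))}$.
Square Root
is the conditional expectation of the inverse-square root transformation, $\hat{y}_t = E[w_t^2] = \hat{w}_t^2 + \sigma_t^2$
Box Cox
is the conditional expectation of the inverse Box-Cox transformation, $\hat{y}_t = E[(\lambda w_t + 1)^{1/\lambda}]$ for $\lambda \neq 0$, and $\hat{y}_t = E[e^{w_t}] = \exp(\hat{w}_t + \sigma_t^2/2)$ for $\lambda = 0$
The expectations of the inverse logistic and Box-Cox ($\lambda \neq 0$) transformations do not generally have explicit solutions and are computed by using numerical integration.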
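The difference between median (MMAE) and mean (MMSE) predictions can be checked numerically. The Python sketch below gives the closed forms for the log and square root cases and a simple trapezoidal-rule integrator for cases without an explicit solution. The integrator is an illustrative stand-in, since the numerical integration method that the system uses is not stated here, and all function names are hypothetical.

```python
import math

def mmae_log(w_hat):
    """Median prediction: inverse transformation of the transformed prediction."""
    return math.exp(w_hat)

def mmse_log(w_hat, var):
    """Mean prediction for the log transformation (normal errors assumed)."""
    return math.exp(w_hat + var / 2)

def mmse_sqrt(w_hat, var):
    """Mean prediction for the square root transformation."""
    return w_hat ** 2 + var

def mmse_numeric(inverse, w_hat, var, n=2001, width=8.0):
    """E[inverse(w)] for w ~ N(w_hat, var), by the trapezoidal rule."""
    sd = math.sqrt(var)
    low = w_hat - width * sd
    step = 2 * width * sd / (n - 1)
    total = 0.0
    for i in range(n):
        w = low + i * step
        density = (math.exp(-0.5 * ((w - w_hat) / sd) ** 2)
                   / (sd * math.sqrt(2 * math.pi)))
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * inverse(w) * density * step
    return total

# Inverse-logistic case with a hypothetical scaling factor c.
c = 0.009975
logistic_mean = mmse_numeric(lambda w: 1 / (c * (1 + math.exp(-w))), 0.5, 0.25)
```

For the log transformation the mean prediction always exceeds the median prediction, by the factor exp(σ²/2).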
Smoothing Models
This section details the computations performed for the exponential smoothing and Winters method forecasting models.
The underlying model assumed by the smoothing models has the following form:
$$Y_t = \mu_t + \beta_t t + s_p(t) + \epsilon_t$$
where
$\mu_t$ represents the time-varying mean term.
$\beta_t$ represents the time-varying slope.
$s_p(t)$ represents the time-varying seasonal contribution for one of the $p$ seasons.
$\epsilon_t$ are disturbances.
For smoothing models without trend terms, $\beta_t = 0$; and for smoothing models without seasonal terms, $s_p(t) = 0$. Each smoothing model is described in the following sections. At each time $t$, the smoothing models estimate the time-varying components described above with the smoothing state. After initialization, the smoothing state is updated for each observation using the smoothing equations. The smoothing state at the last nonmissing observation is used for predictions.
The smoothing state at time $t$ consists of the following:
$L_t$ is a smoothed level that estimates $\mu_t$.
$T_t$ is a smoothed trend that estimates $\beta_t$.
$S_{t-j}$, $j = 0, \ldots, p-1$, are seasonal factors that estimate $s_p(t)$.
The smoothing process starts with an initial estimate of the smoothing state, which is subsequently updated for each observation by using the smoothing equations.
The smoothing equations determine how the smoothing state changes as time progresses. Knowledge of the smoothing state at time $t-1$ and that of the time series value at time $t$ uniquely determine the smoothing state at time $t$. The smoothing weights determine the contribution of the previous smoothing state to the current smoothing state. The smoothing equations for each smoothing model are listed in the following sections.
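The idea that the state at time t follows from the state at time t−1 and the new observation can be sketched with linear (Holt) smoothing. The equations used below are the standard textbook form and are an assumption here, since the model-specific equations appear in later sections. The initial state is simply supplied by the caller, and the function names are hypothetical.

```python
def holt_update(state, y, alpha, gamma):
    """Return the (level, trend) state at time t from the state at t-1 and Y_t."""
    level_prev, trend_prev = state
    level = alpha * y + (1 - alpha) * (level_prev + trend_prev)
    trend = gamma * (level - level_prev) + (1 - gamma) * trend_prev
    return level, trend

def holt_smooth(series, alpha, gamma, initial_state):
    """Run the update over every observation and return the final state."""
    state = initial_state
    for y in series:
        state = holt_update(state, y, alpha, gamma)
    return state

# A perfectly linear series leaves a correctly initialized state on the line.
final_level, final_trend = holt_smooth(
    [12.0, 14.0, 16.0, 18.0, 20.0], alpha=0.3, gamma=0.1,
    initial_state=(10.0, 2.0))
```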
Missing Values
When a missing value is encountered at time t, the smoothed values are updated using the error-correction form of the smoothing equations with the one-step-ahead prediction error, $e_t$, set to zero. The missing value is estimated using the one-step-ahead prediction at time $t-1$, that is, $\hat{Y}_{t-1}(1)$ (Aldrin 1989). The error-correction forms of each of the smoothing models are listed in the following sections.
The one-step-ahead prediction errors, $e_t = Y_t - \hat{Y}_{t-1}(1)$, are also the model residuals, and the sum of squares of the one-step-ahead prediction errors is the objective function used in smoothing weight optimization. The variance of the prediction errors is used to calculate the confidence limits (Sweet 1985, McKenzie 1986, Yar and Chatfield 1990, and Chatfield and Yar 1991). The equations for the variance of the prediction errors for each smoothing model are listed in the following sections. Note: $\mathrm{var}(\epsilon_t)$ is estimated by the mean square of the one-step-ahead prediction errors.
Smoothing Weights
Depending on the smoothing model, the smoothing weights consist of the following:
$\alpha$ is a level smoothing weight.
$\gamma$ is a trend smoothing weight.
$\delta$ is a seasonal smoothing weight.
$\phi$ is a trend damping weight.
Larger smoothing weights (less damping) permit the more recent data to have a greater influence on the predictions. Smaller smoothing weights (more damping) give less weight to recent data.
Standard Errors
The standard errors associated with the smoothing weights are calculated from the Hessian matrix of the sum of squared, one-step-ahead prediction errors with respect to the smoothing weights used in the optimization process.
Weights Near Zero or One
Sometimes the optimization process results in weights near zero or one. For simple or double (Brown) exponential smoothing, a level weight near zero implies that simple differencing of the time series might be appropriate. For linear (Holt) exponential smoothing, a level weight near zero implies that the smoothed trend is constant and that an ARIMA model with deterministic trend might be a more appropriate model. For damped-trend linear exponential smoothing, a damping weight near one implies that linear (Holt) exponential smoothing might be a more appropriate model. For Winters method and seasonal exponential smoothing, a seasonal weight near one implies that a nonseasonal model might be more appropriate and a seasonal weight near zero implies that deterministic seasonal factors might be present.
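Simple Exponential Smoothing
The model equation for simple exponential smoothing is
\[ Y_t = \mu_t + \epsilon_t \]
The smoothing equation is
\[ L_t = \alpha Y_t + (1 - \alpha) L_{t-1} \]
The error-correction form of the smoothing equation is
\[ L_t = L_{t-1} + \alpha e_t \]
(Note: For missing values, $e_t = 0$.)
The k-step prediction equation is
\[ \hat{Y}_t(k) = L_t \]
The ARIMA model equivalency to simple exponential smoothing is the ARIMA(0,1,1) model
\[ (1 - B) Y_t = (1 - \theta B) \epsilon_t, \qquad \theta = 1 - \alpha \]
The moving-average form of the equation is
\[ Y_t = \epsilon_t + \sum_{j=1}^{\infty} \alpha \epsilon_{t-j} \]
For simple exponential smoothing, the additive-invertible region is
\[ \{ 0 < \alpha < 2 \} \]
The variance of the prediction errors is estimated as
\[ \mathrm{var}(e_t(k)) = \mathrm{var}(\epsilon_t) \left[ 1 + \sum_{j=1}^{k-1} \alpha^2 \right] = \mathrm{var}(\epsilon_t) \left( 1 + (k-1) \alpha^2 \right) \]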
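The error-correction update for simple exponential smoothing, including the treatment of missing values, can be sketched as follows. This is a minimal illustration: the function names, the use of None to mark a missing value, and the caller-supplied initial level are assumptions of the example, not part of the system.

```python
def simple_exponential_smoothing(series, alpha, initial_level):
    """Run the error-correction form L_t = L_{t-1} + alpha * e_t.

    Returns the final level and the list of one-step-ahead prediction errors.
    A value of None marks a missing observation, for which e_t is set to zero.
    """
    level = initial_level
    errors = []
    for y in series:
        # The one-step-ahead prediction made at time t-1 is the previous level.
        e = 0.0 if y is None else y - level
        errors.append(e)
        level = level + alpha * e
    return level, errors

def k_step_variance(var_eps, alpha, k):
    """Prediction error variance var(eps) * (1 + (k - 1) * alpha**2)."""
    return var_eps * (1.0 + (k - 1) * alpha ** 2)
```

Because the k-step prediction equals the final level for every k, only the variance grows with the forecast horizon.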
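Double (Brown) Exponential Smoothing
The model equation for double exponential smoothing is
\[ Y_t = \mu_t + \beta_t t + \epsilon_t \]
The smoothing equations are
\[ L_t = \alpha Y_t + (1 - \alpha) L_{t-1} \]
\[ T_t = \alpha (L_t - L_{t-1}) + (1 - \alpha) T_{t-1} \]
This method can be equivalently described in terms of two successive applications of simple exponential smoothing:
\[ S_t^{[1]} = \alpha Y_t + (1 - \alpha) S_{t-1}^{[1]} \]
\[ S_t^{[2]} = \alpha S_t^{[1]} + (1 - \alpha) S_{t-1}^{[2]} \]
where $S_t^{[1]}$ are the smoothed values of $Y_t$, and $S_t^{[2]}$ are the smoothed values of $S_t^{[1]}$. The prediction equation then takes the form:
\[ \hat{Y}_t(k) = \left( 2 + \alpha k / (1 - \alpha) \right) S_t^{[1]} - \left( 1 + \alpha k / (1 - \alpha) \right) S_t^{[2]} \]
The error-correction forms of the smoothing equations are
\[ L_t = L_{t-1} + T_{t-1} + \alpha e_t \]
\[ T_t = T_{t-1} + \alpha^2 e_t \]
(Note: For missing values, $e_t = 0$.)
The k-step prediction equation is
\[ \hat{Y}_t(k) = L_t + \left( (k - 1) + 1/\alpha \right) T_t \]
The ARIMA model equivalency to double exponential smoothing is the ARIMA(0,2,2) model,
\[ (1 - B)^2 Y_t = (1 - \theta B)^2 \epsilon_t, \qquad \theta = 1 - \alpha \]
The moving-average form of the equation is
\[ Y_t = \epsilon_t + \sum_{j=1}^{\infty} \left( 2\alpha + (j - 1) \alpha^2 \right) \epsilon_{t-j} \]
For double exponential smoothing, the additive-invertible region is
\[ \{ 0 < \alpha < 2 \} \]
The variance of the prediction errors is estimated as
\[ \mathrm{var}(e_t(k)) = \mathrm{var}(\epsilon_t) \left[ 1 + \sum_{j=1}^{k-1} \left( 2\alpha + (j - 1) \alpha^2 \right)^2 \right] \]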
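Linear (Holt) Exponential Smoothing
The model equation for linear exponential smoothing is
\[ Y_t = \mu_t + \beta_t t + \epsilon_t \]
The smoothing equations are
\[ L_t = \alpha Y_t + (1 - \alpha)(L_{t-1} + T_{t-1}) \]
\[ T_t = \gamma (L_t - L_{t-1}) + (1 - \gamma) T_{t-1} \]
The error-correction forms of the smoothing equations are
\[ L_t = L_{t-1} + T_{t-1} + \alpha e_t \]
\[ T_t = T_{t-1} + \alpha \gamma e_t \]
(Note: For missing values, $e_t = 0$.)
The k-step prediction equation is
\[ \hat{Y}_t(k) = L_t + k T_t \]
The ARIMA model equivalency to linear exponential smoothing is the ARIMA(0,2,2) model,
\[ (1 - B)^2 Y_t = (1 - \theta_1 B - \theta_2 B^2) \epsilon_t \]
\[ \theta_1 = 2 - \alpha - \alpha\gamma, \qquad \theta_2 = \alpha - 1 \]
The moving-average form of the equation is
\[ Y_t = \epsilon_t + \sum_{j=1}^{\infty} (\alpha + j \alpha \gamma) \epsilon_{t-j} \]
For linear exponential smoothing, the additive-invertible region is
\[ \{ 0 < \alpha < 2 \}, \qquad \{ 0 < \gamma < 4/\alpha - 2 \} \]
The variance of the prediction errors follows from the moving-average form and is estimated as
\[ \mathrm{var}(e_t(k)) = \mathrm{var}(\epsilon_t) \left[ 1 + \sum_{j=1}^{k-1} (\alpha + j \alpha \gamma)^2 \right] \]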
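As a check on the two forms of the linear (Holt) smoothing equations, the following sketch applies one update in the smoothing-equation form and one in the error-correction form and then forecasts k steps ahead. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
def holt_step(level, trend, y, alpha, gamma):
    """One update in smoothing-equation form."""
    new_level = alpha * y + (1.0 - alpha) * (level + trend)
    new_trend = gamma * (new_level - level) + (1.0 - gamma) * trend
    return new_level, new_trend

def holt_step_ec(level, trend, y, alpha, gamma):
    """The same update in error-correction form.

    e is the one-step-ahead prediction error; pass y=None for a missing value.
    """
    e = 0.0 if y is None else y - (level + trend)
    return level + trend + alpha * e, trend + alpha * gamma * e

def holt_forecast(level, trend, k):
    """k-step prediction L_t + k * T_t."""
    return level + k * trend
```

Both update functions should return the same state for a nonmissing observation, which is a quick way to confirm that a hand-written implementation of either form is consistent with the other.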
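Damped-Trend Linear Exponential Smoothing
The model equation for damped-trend linear exponential smoothing is
\[ Y_t = \mu_t + \beta_t t + \epsilon_t \]
The smoothing equations are
\[ L_t = \alpha Y_t + (1 - \alpha)(L_{t-1} + \phi T_{t-1}) \]
\[ T_t = \gamma (L_t - L_{t-1}) + (1 - \gamma) \phi T_{t-1} \]
The error-correction forms of the smoothing equations are
\[ L_t = L_{t-1} + \phi T_{t-1} + \alpha e_t \]
\[ T_t = \phi T_{t-1} + \alpha \gamma e_t \]
(Note: For missing values, $e_t = 0$.)
The k-step prediction equation is
\[ \hat{Y}_t(k) = L_t + \sum_{i=1}^{k} \phi^i T_t \]
The ARIMA model equivalency to damped-trend linear exponential smoothing is the ARIMA(1,1,2) model,
\[ (1 - \phi B)(1 - B) Y_t = (1 - \theta_1 B - \theta_2 B^2) \epsilon_t \]
\[ \theta_1 = 1 + \phi - \alpha - \alpha\gamma\phi, \qquad \theta_2 = (\alpha - 1)\phi \]
The moving-average form of the equation (assuming $|\phi| < 1$) is
\[ Y_t = \epsilon_t + \sum_{j=1}^{\infty} \left( \alpha + \alpha\gamma\phi (\phi^j - 1)/(\phi - 1) \right) \epsilon_{t-j} \]
For damped-trend linear exponential smoothing, the additive-invertible region is
\[ \{ 0 < \alpha < 2 \}, \qquad \{ 0 < \phi\gamma < 4/\alpha - 2 \} \]
The variance of the prediction errors is estimated as
\[ \mathrm{var}(e_t(k)) = \mathrm{var}(\epsilon_t) \left[ 1 + \sum_{j=1}^{k-1} \left( \alpha + \alpha\gamma\phi (\phi^j - 1)/(\phi - 1) \right)^2 \right] \]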
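Seasonal Exponential Smoothing
The model equation for seasonal exponential smoothing is
\[ Y_t = \mu_t + s_p(t) + \epsilon_t \]
The smoothing equations are
\[ L_t = \alpha (Y_t - S_{t-p}) + (1 - \alpha) L_{t-1} \]
\[ S_t = \delta (Y_t - L_t) + (1 - \delta) S_{t-p} \]
The error-correction forms of the smoothing equations are
\[ L_t = L_{t-1} + \alpha e_t \]
\[ S_t = S_{t-p} + \delta (1 - \alpha) e_t \]
(Note: For missing values, $e_t = 0$.)
The k-step prediction equation is
\[ \hat{Y}_t(k) = L_t + S_{t-p+k} \]
The ARIMA model equivalency to seasonal exponential smoothing is the ARIMA(0,1,p+1)(0,1,0)_p model,
\[ (1 - B)(1 - B^p) Y_t = (1 - \theta_1 B - \theta_2 B^p - \theta_3 B^{p+1}) \epsilon_t \]
\[ \theta_1 = 1 - \alpha, \qquad \theta_2 = 1 - \delta(1 - \alpha), \qquad \theta_3 = (1 - \alpha)(\delta - 1) \]
The moving-average form of the equation is
\[ Y_t = \epsilon_t + \sum_{j=1}^{\infty} \psi_j \epsilon_{t-j} \]
\[ \psi_j = \begin{cases} \alpha, & \text{for } j \bmod p \neq 0 \\ \alpha + \delta(1 - \alpha), & \text{for } j \bmod p = 0 \end{cases} \]
For seasonal exponential smoothing, the additive-invertible region is
\[ \{ \max(-p\alpha, 0) < \delta(1 - \alpha) < (2 - \alpha) \} \]
The variance of the prediction errors follows from the moving-average form and is estimated as
\[ \mathrm{var}(e_t(k)) = \mathrm{var}(\epsilon_t) \left[ 1 + \sum_{j=1}^{k-1} \psi_j^2 \right] \]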
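Multiplicative Seasonal Smoothing
The model equation for the multiplicative version of seasonal smoothing is
\[ Y_t = \mu_t s_p(t) + \epsilon_t \]
The smoothing equations are
\[ L_t = \alpha (Y_t / S_{t-p}) + (1 - \alpha) L_{t-1} \]
\[ S_t = \delta (Y_t / L_t) + (1 - \delta) S_{t-p} \]
The error-correction forms of the smoothing equations are
\[ L_t = L_{t-1} + \alpha e_t / S_{t-p} \]
\[ S_t = S_{t-p} + \delta (1 - \alpha) e_t / L_t \]
(Note: For missing values, $e_t = 0$.)
The k-step prediction equation is
\[ \hat{Y}_t(k) = L_t S_{t-p+k} \]
The multiplicative version of seasonal smoothing does not have an ARIMA equivalent; however, when the seasonal variation is small, the ARIMA additive-invertible region of the additive version of seasonal smoothing described in the preceding section can approximate the stability region of the multiplicative version.
The variance of the prediction errors is estimated as
\[ \mathrm{var}(e_t(k)) = \mathrm{var}(\epsilon_t) \left[ \sum_{i=0}^{\infty} \sum_{j=0}^{p-1} \left( \psi_{j+ip} S_{t+k} / S_{t+k-j} \right)^2 \right] \]
where $\psi_j$ are as described for the additive version of seasonal smoothing and $\psi_j = 0$ for $j \ge k$.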
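Winters Method (Additive Version)
The model equation for the additive version of Winters method is
\[ Y_t = \mu_t + \beta_t t + s_p(t) + \epsilon_t \]
The smoothing equations are
\[ L_t = \alpha (Y_t - S_{t-p}) + (1 - \alpha)(L_{t-1} + T_{t-1}) \]
\[ T_t = \gamma (L_t - L_{t-1}) + (1 - \gamma) T_{t-1} \]
\[ S_t = \delta (Y_t - L_t) + (1 - \delta) S_{t-p} \]
The error-correction forms of the smoothing equations are
\[ L_t = L_{t-1} + T_{t-1} + \alpha e_t \]
\[ T_t = T_{t-1} + \alpha \gamma e_t \]
\[ S_t = S_{t-p} + \delta (1 - \alpha) e_t \]
(Note: For missing values, $e_t = 0$.)
The k-step prediction equation is
\[ \hat{Y}_t(k) = L_t + k T_t + S_{t-p+k} \]
The ARIMA model equivalency to the additive version of Winters method is the ARIMA(0,1,p+1)(0,1,0)_p model,
\[ (1 - B)(1 - B^p) Y_t = \left[ 1 - \sum_{i=1}^{p+1} \theta_i B^i \right] \epsilon_t \]
\[ \theta_j = \begin{cases} 1 - \alpha - \alpha\gamma, & j = 1 \\ -\alpha\gamma, & 2 \le j \le p - 1 \\ 1 - \alpha\gamma - \delta(1 - \alpha), & j = p \\ (1 - \alpha)(\delta - 1), & j = p + 1 \end{cases} \]
The moving-average form of the equation is
\[ Y_t = \epsilon_t + \sum_{j=1}^{\infty} \psi_j \epsilon_{t-j} \]
\[ \psi_j = \begin{cases} \alpha + j \alpha \gamma, & \text{for } j \bmod p \neq 0 \\ \alpha + j \alpha \gamma + \delta(1 - \alpha), & \text{for } j \bmod p = 0 \end{cases} \]
For the additive version of Winters method (see Archibald 1990), the additive-invertible region is
\[ \{ \max(-p\alpha, 0) < \delta(1 - \alpha) < (2 - \alpha) \} \]
\[ \{ 0 < \alpha\gamma < 2 - \alpha - \delta(1 - \alpha)(1 - \cos(\vartheta)) \} \]
where $\vartheta$ is the smallest nonnegative solution to the equations listed in Archibald (1990).
The variance of the prediction errors is estimated as
\[ \mathrm{var}(e_t(k)) = \mathrm{var}(\epsilon_t) \left[ 1 + \sum_{j=1}^{k-1} \psi_j^2 \right] \]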
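The moving-average weights and the prediction error variance for the additive version of Winters method can be computed directly from the smoothing weights. The sketch below assumes the moving-average weights are alpha + j*alpha*gamma, with delta*(1 - alpha) added at seasonal lags, and uses hypothetical function names; it is an illustration, not the system's code.

```python
def winters_psi(j, p, alpha, gamma, delta):
    """Moving-average weight psi_j for additive Winters with p seasons."""
    psi = alpha + j * alpha * gamma
    if j % p == 0:
        # Seasonal lags pick up the extra seasonal contribution.
        psi += delta * (1.0 - alpha)
    return psi

def winters_variance(var_eps, k, p, alpha, gamma, delta):
    """k-step prediction error variance var(eps) * (1 + sum of psi_j**2, j < k)."""
    total = sum(winters_psi(j, p, alpha, gamma, delta) ** 2 for j in range(1, k))
    return var_eps * (1.0 + total)
```

For k = 1 the sum is empty, so the one-step variance reduces to the estimated disturbance variance, as expected.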
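Winters Method (Multiplicative Version)
The model equation for the multiplicative version of Winters method is
\[ Y_t = (\mu_t + \beta_t t) s_p(t) + \epsilon_t \]
The smoothing equations are
\[ L_t = \alpha (Y_t / S_{t-p}) + (1 - \alpha)(L_{t-1} + T_{t-1}) \]
\[ T_t = \gamma (L_t - L_{t-1}) + (1 - \gamma) T_{t-1} \]
\[ S_t = \delta (Y_t / L_t) + (1 - \delta) S_{t-p} \]
The error-correction forms of the smoothing equations are
\[ L_t = L_{t-1} + T_{t-1} + \alpha e_t / S_{t-p} \]
\[ T_t = T_{t-1} + \alpha \gamma e_t / S_{t-p} \]
\[ S_t = S_{t-p} + \delta (1 - \alpha) e_t / L_t \]
(Note: For missing values, $e_t = 0$.)
The k-step prediction equation is
\[ \hat{Y}_t(k) = (L_t + k T_t) S_{t-p+k} \]
The multiplicative version of Winters method does not have an ARIMA equivalent; however, when the seasonal variation is small, the ARIMA additive-invertible region of the additive version of Winters method described in the preceding section can approximate the stability region of the multiplicative version.
The variance of the prediction errors is estimated as
\[ \mathrm{var}(e_t(k)) = \mathrm{var}(\epsilon_t) \left[ \sum_{i=0}^{\infty} \sum_{j=0}^{p-1} \left( \psi_{j+ip} S_{t+k} / S_{t+k-j} \right)^2 \right] \]
where $\psi_j$ are as described for the additive version of Winters method and $\psi_j = 0$ for $j \ge k$.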
ARIMA Models
Autoregressive integrated moving-average (ARIMA) models predict values of a dependent time series with a linear combination of its own past values, past errors (also called shocks or innovations), and current and past values of other time series (predictor time series). The Time Series Forecasting System uses the ARIMA procedure of SAS/ETS software to fit and forecast ARIMA models. The maximum likelihood method is used for parameter estimation. Refer to Chapter 7, "The ARIMA Procedure," for details of ARIMA model estimation and forecasting. This section summarizes the notation used for ARIMA models.
Given a dependent time series $\{Y_t : 1 \le t \le n\}$, mathematically the ARIMA model is written as
\[ (1 - B)^d Y_t = \mu + \frac{\theta(B)}{\phi(B)} a_t \]
where
t indexes time.
$\mu$ is the mean term.
B is the backshift operator; that is, $B X_t = X_{t-1}$.
$\phi(B)$ is the autoregressive operator, represented as a polynomial in the backshift operator: $\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$.
$\theta(B)$ is the moving-average operator, represented as a polynomial in the backshift operator: $\theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q$.
$a_t$ is the independent disturbance, also called the random error.
Given a dependent time series $\{Y_t : 1 \le t \le n\}$, mathematically the ARIMA seasonal model is written as
\[ (1 - B)^d (1 - B^s)^D Y_t = \mu + \frac{\theta(B)\,\theta_s(B^s)}{\phi(B)\,\phi_s(B^s)} a_t \]
where
$\phi_s(B^s)$ is the seasonal autoregressive operator, represented as a polynomial in the backshift operator: $\phi_s(B^s) = 1 - \phi_{s,1} B^s - \cdots - \phi_{s,P} B^{sP}$.
$\theta_s(B^s)$ is the seasonal moving-average operator, represented as a polynomial in the backshift operator: $\theta_s(B^s) = 1 - \theta_{s,1} B^s - \cdots - \theta_{s,Q} B^{sQ}$.
For example, the mathematical form of the ARIMA(1,0,1)(1,1,2)_12 model is
\[ (1 - B^{12}) Y_t = \mu + \frac{(1 - \theta_1 B)(1 - \theta_{s,1} B^{12} - \theta_{s,2} B^{24})}{(1 - \phi_1 B)(1 - \phi_{s,1} B^{12})} a_t \]
Given the ith predictor time series $\{X_{i,t} : 1 \le t \le n\}$, the transfer function is written as [Dif($d_i$)Lag($k_i$)N($q_i$)/D($p_i$)] where
$d_i$ is the simple order of the differencing for the ith predictor time series, $(1 - B)^{d_i} X_{i,t}$ (rarely should $d_i > 2$ be needed).
$k_i$ is the pure time delay (lag) for the effect of the ith predictor time series, $X_{i,t} B^{k_i} = X_{i,t-k_i}$.
$p_i$ is the simple order of the denominator for the ith predictor time series.
$q_i$ is the simple order of the numerator for the ith predictor time series.
The mathematical notation used to describe a transfer function is
\[ \Psi_i(B) = \frac{\omega_i(B)}{\delta_i(B)} (1 - B)^{d_i} B^{k_i} \]
where
B is the backshift operator; that is, $B X_t = X_{t-1}$.
$\delta_i(B)$ is the denominator polynomial of the transfer function for the ith predictor time series: $\delta_i(B) = 1 - \delta_{i,1} B - \cdots - \delta_{i,p_i} B^{p_i}$.
$\omega_i(B)$ is the numerator polynomial of the transfer function for the ith predictor time series: $\omega_i(B) = 1 - \omega_{i,1} B - \cdots - \omega_{i,q_i} B^{q_i}$.
The numerator factors for a transfer function for a predictor series are like the MA part of the ARMA model for the noise series. The denominator factors for a transfer function for a predictor series are like the AR part of the ARMA model for the noise series. Denominator factors introduce exponentially weighted, infinite distributed lags into the transfer function. For example, the transfer function for the ith predictor time series with
$k_i = 3$ (time lag is 3)
$d_i = 1$ (simple order of differencing is one)
$p_i = 1$ (simple order of the denominator is one)
$q_i = 2$ (simple order of the numerator is two)
would be written as [Dif(1)Lag(3)N(2)/D(1)]. The mathematical notation for the transfer function in this example is
\[ \Psi_i(B) = \frac{(1 - \omega_{i,1} B - \omega_{i,2} B^2)}{(1 - \delta_{i,1} B)} (1 - B) B^3 \]
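To make the [Dif(1)Lag(3)N(2)/D(1)] example concrete, the following sketch applies that filter to a predictor series: the series is differenced once, delayed three periods, passed through the two-term numerator, and then through the one-term denominator as a recursion. The function name and the convention that values before the start of the series are zero are assumptions of this illustration.

```python
def apply_transfer(x, omega1, omega2, delta1):
    """Apply (1 - omega1*B - omega2*B^2) / (1 - delta1*B) * (1 - B) * B^3 to x.

    Pre-sample values are treated as zero (an assumption for this sketch).
    """
    def get(seq, t):
        return seq[t] if 0 <= t < len(seq) else 0.0

    n = len(x)
    # Differenced and lagged input: u_t = X_{t-3} - X_{t-4}.
    u = [get(x, t - 3) - get(x, t - 4) for t in range(n)]
    out = []
    for t in range(n):
        previous = out[t - 1] if t >= 1 else 0.0
        numerator = u[t] - omega1 * get(u, t - 1) - omega2 * get(u, t - 2)
        # Denominator 1 - delta1*B becomes the recursion v_t = delta1*v_{t-1} + numerator.
        out.append(delta1 * previous + numerator)
    return out
```

Feeding in a unit step is a convenient way to see the pure delay of three periods followed by the pattern produced by the numerator and denominator terms.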
The general transfer function notation for the ith predictor time series $X_{i,t}$ with seasonal factors is [Dif($d_i$)($D_i$)_s Lag($k_i$) N($q_i$)($Q_i$)_s / D($p_i$)($P_i$)_s] where
$D_i$ is the seasonal order of the differencing for the ith predictor time series (rarely should $D_i > 1$ be needed).
$P_i$ is the seasonal order of the denominator for the ith predictor time series (rarely should $P_i > 2$ be needed).
$Q_i$ is the seasonal order of the numerator for the ith predictor time series (rarely should $Q_i > 2$ be needed).
s is the length of the seasonal cycle.
The mathematical notation used to describe a seasonal transfer function is
\[ \Psi_i(B) = \frac{\omega_i(B)\,\omega_{s,i}(B^s)}{\delta_i(B)\,\delta_{s,i}(B^s)} (1 - B)^{d_i} (1 - B^s)^{D_i} B^{k_i} \]
where
$\delta_{s,i}(B^s)$ is the denominator seasonal polynomial of the transfer function for the ith predictor time series: $\delta_{s,i}(B^s) = 1 - \delta_{s,i,1} B^s - \cdots - \delta_{s,i,P_i} B^{sP_i}$.
$\omega_{s,i}(B^s)$ is the numerator seasonal polynomial of the transfer function for the ith predictor time series: $\omega_{s,i}(B^s) = 1 - \omega_{s,i,1} B^s - \cdots - \omega_{s,i,Q_i} B^{sQ_i}$.
For example, the transfer function for the ith predictor time series $X_{i,t}$ whose seasonal cycle $s = 12$ with
$d_i = 2$ (simple order of differencing is two)
$D_i = 1$ (seasonal order of differencing is one)
$q_i = 2$ (simple order of the numerator is two)
$Q_i = 1$ (seasonal order of the numerator is one)
would be written as [Dif(2)(1)_s N(2)(1)_s]. The mathematical notation for the transfer function in this example is
\[ \Psi_i(B) = (1 - \omega_{i,1} B - \omega_{i,2} B^2)(1 - \omega_{s,i,1} B^{12})(1 - B)^2 (1 - B^{12}) \]
Predictor Series
This section discusses time trend curves, seasonal dummies, interventions, and adjustments.
Quadratic trend: variable name _QUAD_, with $X_t = (t - c)^2$. Note that a quadratic trend implies a linear trend as a special case and results in two regressors: _QUAD_ and _LINEAR_.
Cubic trend: variable name _CUBE_, with $X_t = (t - c)^3$. Note that a cubic trend implies a quadratic trend as a special case and results in three regressors: _CUBE_, _QUAD_, and _LINEAR_.
Logistic trend: variable name _LOGIT_, with $X_t = t$. The model is a linear time trend applied to the logistic transform of the dependent series. Thus, specifying a logistic trend is equivalent to specifying the logistic series transformation and a linear time trend. A logistic trend predictor can be used only in conjunction with the logistic transformation, which is set automatically when you specify logistic trend.
Logarithmic trend: variable name _LOG_, with $X_t = \ln(t)$.
Exponential trend: variable name _EXP_, with $X_t = t$. The model is a linear time trend applied to the logarithms of the dependent series. Thus, specifying an exponential trend is equivalent to specifying the log series transformation and a linear time trend. An exponential trend predictor can be used only in conjunction with the log transformation, which is set automatically when you specify exponential trend.
Hyperbolic trend: variable name _HYP_, with $X_t = 1/t$.
Power curve trend: variable name _POW_, with $X_t = \ln(t)$. The model is a logarithmic time trend applied to the logarithms of the dependent series. Thus, specifying a power curve is equivalent to specifying the log series transformation and a logarithmic time trend. A power curve predictor can be used only in conjunction with the log transformation, which is set automatically when you specify a power curve trend.
EXP(A+B/TIME) trend: variable name _ERT_, with $X_t = 1/t$. The model is a hyperbolic time trend applied to the logarithms of the dependent series. Thus, specifying this trend curve is equivalent to specifying the log series transformation and a hyperbolic time trend. This trend curve can be used only in conjunction with the log transformation, which is set automatically when you specify this trend.
Intervention Effects
Interventions are used for modeling events that occur at specific times. That is, they are known changes that affect the dependent series or outliers. The ith intervention series is included in the output data set with variable name _INTVi_, which is a reserved variable name.
Point Interventions
The point intervention is a one-time event. The ith intervention series $X_{i,t}$ has a point intervention at time $t_{int}$ when the series is nonzero only at time $t_{int}$; that is,
\[ X_{i,t} = \begin{cases} 1, & t = t_{int} \\ 0, & \text{otherwise} \end{cases} \]
Step Interventions
Step interventions are continuing, and the input time series flags periods after the intervention. For a step intervention, before time $t_{int}$, the ith intervention series $X_{i,t}$ is zero and then steps to a constant level thereafter; that is,
\[ X_{i,t} = \begin{cases} 1, & t \ge t_{int} \\ 0, & \text{otherwise} \end{cases} \]
Ramp Interventions
A ramp intervention is a continuing intervention that increases linearly after the intervention time. For a ramp intervention, before time $t_{int}$, the ith intervention series $X_{i,t}$ is zero and increases linearly thereafter; that is, proportional to time.
\[ X_{i,t} = \begin{cases} t - t_{int}, & t \ge t_{int} \\ 0, & \text{otherwise} \end{cases} \]
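The three intervention types can be generated as simple indicator series. The sketch below indexes time from 1 to n and uses hypothetical function names; it only illustrates the definitions above.

```python
def point_intervention(n, t_int):
    """1 at the intervention time, 0 elsewhere."""
    return [1.0 if t == t_int else 0.0 for t in range(1, n + 1)]

def step_intervention(n, t_int):
    """0 before the intervention time, 1 from then on."""
    return [1.0 if t >= t_int else 0.0 for t in range(1, n + 1)]

def ramp_intervention(n, t_int):
    """0 before the intervention time, then t - t_int."""
    return [float(t - t_int) if t >= t_int else 0.0 for t in range(1, n + 1)]
```

Note that the ramp is zero at the intervention time itself and first becomes positive one period later, which follows directly from the definition.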
Intervention Effect
Given the ith intervention series $X_{i,t}$, you can define how the intervention takes effect by filters (transfer functions) of the form
\[ \Psi_i(B) = \frac{1 - \omega_{i,1} B - \cdots - \omega_{i,q_i} B^{q_i}}{1 - \delta_{i,1} B - \cdots - \delta_{i,p_i} B^{p_i}} \]
where B is the backshift operator; that is, $B X_t = X_{t-1}$.
The denominator of the transfer function determines the decay pattern of the intervention effect, whereas the numerator terms determine the size of the intervention effect time window. For example, the following intervention effects are associated with the respective transfer functions.
Immediately: $\Psi_i(B) = 1$
Gradually: $\Psi_i(B) = 1/(1 - \delta_{i,1} B)$
1 lag window: $\Psi_i(B) = 1 - \omega_{i,1} B$
3 lag window: $\Psi_i(B) = 1 - \omega_{i,1} B - \omega_{i,2} B^2 - \omega_{i,3} B^3$
Intervention Notation
The notation used to describe intervention effects has the form type:$t_{int}$($q_i$)/($p_i$), where type is point, step, or ramp; $t_{int}$ is the time of the intervention (for example, OCT87); $q_i$ is the transfer function numerator order; and $p_i$ is the transfer function denominator order. If $q_i = 0$, the part ($q_i$) is omitted; if $p_i = 0$, the part /($p_i$) is omitted. In the Intervention Specification window, the Number of Lags option specifies the transfer function numerator order $q_i$, and the Effect Decay Pattern option specifies the transfer function denominator order $p_i$. In the Effect Decay Pattern options, values and resulting $p_i$ are: None, $p_i = 0$; Exp, $p_i = 1$; Wave, $p_i = 2$. For example, a step intervention with date 08MAR90 and effect pattern Exp is denoted Step:08MAR90/(1) and has a transfer function filter $\Psi_i(B) = 1/(1 - \delta_1 B)$. A ramp intervention immediately applied on 08MAR90 is denoted Ramp:08MAR90 and has a transfer function filter $\Psi_i(B) = 1$.
Given the seasonal cycle s, the seasonal dummy regressors are $\{X_{i,t} : 1 \le i \le s-1,\ 1 \le t \le n\}$ for models that include an intercept term and $\{X_{i,t} : 1 \le i \le s,\ 1 \le t \le n\}$ for models that exclude an intercept term. Each element of a seasonal dummy regressor is either zero or one, based on the following rule:
\[ X_{i,t} = \begin{cases} 1, & \text{when } i = t \bmod s \\ 0, & \text{otherwise} \end{cases} \]
Note that if the model includes an intercept term, the number of seasonal dummy regressors is one less than s to ensure that the linear system is full rank. The seasonal dummy variables are included in the output data set with variable names prefixed with SDUMMYi and sequentially numbered. They are reserved variable names.
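The seasonal dummy rule can be sketched as follows. One detail is an assumption of this illustration: a remainder of zero from t mod s is labeled as season s, so that the model without an intercept has a nonzero regressor for every season. The function name is hypothetical.

```python
def seasonal_dummies(n, s, intercept=True):
    """Build seasonal dummy regressors for t = 1..n with seasonal cycle s.

    Returns one row per time point. With an intercept there are s - 1
    columns; without an intercept there are s columns.
    """
    count = s - 1 if intercept else s
    rows = []
    for t in range(1, n + 1):
        season = t % s or s  # assumption: remainder 0 is treated as season s
        rows.append([1 if i == season else 0 for i in range(1, count + 1)])
    return rows
```

With an intercept, the season that is left out acts as the reference level, so its rows are all zeros.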
1. Log transform test. The log test fits a high-order autoregressive model to the series and to the log of the series and compares goodness-of-fit measures for the prediction errors of the two models. If this test finds that log transforming the series is suitable, the Log Transform option is set to YES, and the subsequent diagnostic tests are performed on the log transformed series.
2. Trend test. The resultant series is tested for presence of a trend by using an augmented Dickey-Fuller test and a random walk with drift test. If either test finds that the series appears to have a trend, the Trend option is set to YES, and the subsequent diagnostic tests are performed on the differenced series.
3. Seasonality test. The resultant series is tested for seasonality. A seasonal dummy model with AR(1) errors is fit and the joint significance of the seasonal dummy estimates is tested. If the seasonal dummies are significant, the AIC statistic for this model is compared to the AIC for an AR(1) model without seasonal dummies. If the AIC for the seasonal model is lower than that of the nonseasonal model, the Seasonal option is set to YES.
Statistics of Fit
This section explains the goodness-of-fit statistics reported to measure how well different models fit the data. The statistics of fit for the various forecasting models can be viewed or stored in a data set by using the Model Viewer window.
Statistics of fit are computed by using the actual and forecasted values for observations in the period of evaluation. One-step forecasted values are used whenever possible, including the case when a hold-out sample contains no missing values. If a one-step forecast for an observation cannot be computed due to missing values for previous series observations, a multi-step forecast is computed, using the minimum number of steps as the previous nonmissing values in the data range permit.
The various statistics of fit reported are as follows. In these formulas, n is the number of nonmissing observations and k is the number of fitted parameters in the model.
Number of Nonmissing Observations. The number of nonmissing observations used to fit the model.
Number of Observations. The total number of observations used to fit the model, including both missing and nonmissing observations.
Number of Missing Actuals. The number of missing actual values.
Number of Missing Predicted Values. The number of missing predicted values.
Number of Model Parameters. The number of parameters fit to the data. For combined forecast, this is the number of forecast components.
Total Sum of Squares (Uncorrected). The total sum of squares for the series, SST, uncorrected for the mean: $\sum_{t=1}^{n} y_t^2$.
Total Sum of Squares (Corrected). The total sum of squares for the series, SST, corrected for the mean: $\sum_{t=1}^{n} (y_t - \bar{y})^2$, where $\bar{y}$ is the series mean.
Sum of Square Errors. The sum of the squared prediction errors, SSE. $\mathrm{SSE} = \sum_{t=1}^{n} (y_t - \hat{y}_t)^2$, where $\hat{y}_t$ is the one-step predicted value.
Mean Squared Error. The mean squared prediction error, MSE, calculated from the one-step-ahead forecasts. $\mathrm{MSE} = \frac{1}{n} \mathrm{SSE}$. This formula enables you to evaluate small hold-out samples.
Root Mean Squared Error. The root mean square error (RMSE), $\sqrt{\mathrm{MSE}}$.
Mean Absolute Percent Error. The mean absolute percent prediction error (MAPE), $\frac{100}{n} \sum_{t=1}^{n} |(y_t - \hat{y}_t)/y_t|$. The summation ignores observations where $y_t = 0$.
Mean Absolute Error. The mean absolute prediction error, $\frac{1}{n} \sum_{t=1}^{n} |y_t - \hat{y}_t|$.
R-Square. The $R^2$ statistic, $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$. If the model fits the series badly, the model error sum of squares, SSE, can be larger than SST and the $R^2$ statistic will be negative.
Adjusted R-Square. The adjusted $R^2$ statistic, $1 - \left( \frac{n-1}{n-k} \right)(1 - R^2)$.
Amemiya's Adjusted R-Square. Amemiya's adjusted $R^2$, $1 - \left( \frac{n+k}{n-k} \right)(1 - R^2)$.
Random Walk R-Square. The random walk $R^2$ statistic (Harvey's $R^2$ statistic by using the random walk model for comparison), $1 - \left( \frac{n-1}{n} \right) \mathrm{SSE}/\mathrm{RWSSE}$, where $\mathrm{RWSSE} = \sum_{t=2}^{n} (y_t - y_{t-1} - \mu)^2$, and $\mu = \frac{1}{n-1} \sum_{t=2}^{n} (y_t - y_{t-1})$.
Akaike's Information Criterion. Akaike's information criterion (AIC), $n \ln(\mathrm{MSE}) + 2k$.
Schwarz Bayesian Information Criterion. Schwarz Bayesian information criterion (SBC or BIC), $n \ln(\mathrm{MSE}) + k \ln(n)$.
Amemiya's Prediction Criterion. Amemiya's prediction criterion, $\frac{1}{n} \mathrm{SST} \left( \frac{n+k}{n-k} \right)(1 - R^2) = \left( \frac{n+k}{n-k} \right) \frac{1}{n} \mathrm{SSE}$.
Maximum Error. The largest prediction error.
Minimum Error. The smallest prediction error.
Maximum Percent Error. The largest percent prediction error, $100 \max((y_t - \hat{y}_t)/y_t)$. The summation ignores observations where $y_t = 0$.
Minimum Percent Error. The smallest percent prediction error, $100 \min((y_t - \hat{y}_t)/y_t)$. The summation ignores observations where $y_t = 0$.
Mean Error. The mean prediction error, $\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)$.
Mean Percent Error. The mean percent prediction error, $\frac{100}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)/y_t$. The summation ignores observations where $y_t = 0$.
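Several of the statistics of fit can be computed directly from paired actual and predicted values, as the following sketch shows. The function name and the returned dictionary keys are hypothetical. Two choices are assumptions of this illustration: $R^2$ uses the corrected total sum of squares, and MAPE skips terms where the actual value is zero while still dividing by n.

```python
import math

def fit_statistics(actual, predicted, k):
    """Compute selected statistics of fit for nonmissing actual/predicted pairs.

    k is the number of fitted model parameters.
    """
    n = len(actual)
    mean = sum(actual) / n
    sst = sum((y - mean) ** 2 for y in actual)  # corrected for the mean
    sse = sum((y - f) ** 2 for y, f in zip(actual, predicted))
    mse = sse / n
    # Percent errors skip observations whose actual value is zero.
    mape = 100.0 / n * sum(abs((y - f) / y) for y, f in zip(actual, predicted) if y != 0)
    mae = sum(abs(y - f) for y, f in zip(actual, predicted)) / n
    r2 = 1.0 - sse / sst
    return {
        "SSE": sse,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAPE": mape,
        "MAE": mae,
        "RSQUARE": r2,
        "ADJRSQ": 1.0 - (n - 1) / (n - k) * (1.0 - r2),
        "AIC": n * math.log(mse) + 2 * k,
        "SBC": n * math.log(mse) + k * math.log(n),
    }
```

A perfect fit gives an MSE of zero, for which the logarithm in AIC and SBC is undefined, so a caller would need to handle that case separately.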
References
Akaike, H. (1974), "A New Look at the Statistical Model Identification," IEEE Transaction on Automatic Control, AC-19, 716-723.
Aldrin, M. and Damsleth, E. (1989), "Forecasting Nonseasonal Time Series with Missing Observations," Journal of Forecasting, 8, 97-116.
Anderson, T.W. (1971), The Statistical Analysis of Time Series, New York: John Wiley & Sons.
Ansley, C. (1979), "An Algorithm for the Exact Likelihood of a Mixed Autoregressive Moving-Average Process," Biometrika, 66, 59.
Ansley, C. and Newbold, P. (1980), "Finite Sample Properties of Estimators for Autoregressive Moving-Average Models," Journal of Econometrics, 13, 159.
Archibald, B.C. (1990), "Parameter Space of the Holt-Winters Model," International Journal of Forecasting, 6, 199-209.
Bartolomei, S.M. and Sweet, A.L. (1989), "A Note on the Comparison of Exponential Smoothing Methods for Forecasting Seasonal Series," International Journal of Forecasting, 5, 111-116.
Bhansali, R.J. (1980), "Autoregressive and Window Estimates of the Inverse Correlation Function," Biometrika, 67, 551-566.
Bowerman, B.L. and O'Connell, R.T. (1979), Time Series and Forecasting: An Applied Approach, North Scituate, Massachusetts: Duxbury Press.
Box, G.E.P. and Cox D.R. (1964), "An Analysis of Transformations," Journal of Royal Statistical Society B, No. 26, 211-243.
Box, G.E.P. and Jenkins, G.M. (1976), Time Series Analysis: Forecasting and Control, Revised Edition, San Francisco: Holden-Day.
Box, G.E.P. and Tiao, G.C. (1975), "Intervention Analysis with Applications to Economic and Environmental Problems," JASA, 70, 70-79.
Brocklebank, J.C. and Dickey, D.A. (1986), SAS System for Forecasting Time Series, 1986 Edition, Cary, North Carolina: SAS Institute Inc.
Brown, R.G. (1962), Smoothing, Forecasting, and Prediction of Discrete Time Series, New York: Prentice-Hall.
Brown, R.G. and Meyer, R.F. (1961), "The Fundamental Theorem of Exponential Smoothing," Operations Research, 9, 673-685.
Chatfield, C. (1978), "The Holt-Winters Forecasting Procedure," Applied Statistics, 27, 264-279.
Chatfield, C., and Prothero, D.L. (1973), "Box-Jenkins Seasonal Forecasting: Problems in a Case Study," Journal of the Royal Statistical Society, Series A, 136, 295-315.
Chatfield, C. and Yar, M. (1988), "Holt-Winters Forecasting: Some Practical Issues," The Statistician, 37, 129-140.
Chatfield, C. and Yar, M. (1991), "Prediction Intervals for Multiplicative Holt-Winters," International Journal of Forecasting, 7, 31-37.
Cogger, K.O. (1974), "The Optimality of General-Order Exponential Smoothing," Operations Research, 22, 858.
Cox, D. R. (1961), "Prediction by Exponentially Weighted Moving Averages and Related Methods," Journal of the Royal Statistical Society, Series B, 23, 414-422.
Davidson, J. (1981), "Problems with the Estimation of Moving-Average Models," Journal of Econometrics, 16, 295.
Dickey, D. A., and Fuller, W.A. (1979), "Distribution of the Estimators for Autoregressive Time Series with a Unit Root," Journal of the American Statistical Association, 74(366), 427-431.
Dickey, D. A., Hasza, D. P., and Fuller, W.A. (1984), "Testing for Unit Roots in Seasonal Time Series," Journal of the American Statistical Association, 79(386), 355-367.
Fair, R.C. (1986), "Evaluating the Predictive Accuracy of Models," in Handbook of Econometrics, Vol. 3., Griliches, Z. and Intriligator, M.D., eds., New York: North Holland.
Fildes, R. (1979), "Quantitative Forecasting - the State of the Art: Extrapolative Models," Journal of Operational Research Society, 30, 691-710.
Fuller, W.A. (1976), Introduction to Statistical Time Series, New York: John Wiley & Sons.
Gardner, E.S., Jr. (1984), "The Strange Case of the Lagging Forecasts," Interfaces, 14, 47-50.
Gardner, E.S., Jr. (1985), "Exponential Smoothing: the State of the Art," Journal of Forecasting, 4, 1-38.
Granger, C.W.J. and Newbold, P. (1977), Forecasting Economic Time Series, New York: Academic Press, Inc.
Greene, W.H. (1993), Econometric Analysis, Second Edition, New York: Macmillan Publishing Company.
Hamilton, J. D. (1994), Time Series Analysis, Princeton: Princeton University Press.
Harvey, A.C. (1981), Time Series Models, New York: John Wiley & Sons.
Harvey, A.C. (1984), "A Unified View of Statistical Forecasting Procedures," Journal of Forecasting, 3, 245-275.
Hopewood, W.S., McKeown, J.C., and Newbold, P. (1984), "Time Series Forecasting Models Involving Power Transformations," Journal of Forecasting, Vol 3, No. 1, 57-61.
Jones, Richard H. (1980), "Maximum Likelihood Fitting of ARMA Models to Time Series with Missing Observations," Technometrics, 22, 389-396.
Judge, G.G., Griffiths, W.E., Hill, R.C., and Lee, T.C. (1980), The Theory and Practice of Econometrics, New York: John Wiley & Sons.
Ledolter, J. and Abraham, B. (1984), "Some Comments on the Initialization of Exponential Smoothing," Journal of Forecasting, 3, 79-84.
Ljung, G.M. and Box, G.E.P. (1978), "On a Measure of Lack of Fit in Time Series Models," Biometrika, 65, 297-303.
Makridakis, S., Wheelwright, S.C., and McGee, V.E. (1983), Forecasting: Methods and Applications, Second Edition, New York: John Wiley & Sons.
McKenzie, Ed (1984), "General Exponential Smoothing and the Equivalent ARMA Process," Journal of Forecasting, 3, 333-344.
McKenzie, Ed (1986), "Error Analysis for Winters Additive Seasonal Forecasting System," International Journal of Forecasting, 2, 373-382.
Montgomery, D.C. and Johnson, L.A. (1976), Forecasting and Time Series Analysis, New York: McGraw-Hill.
Morf, M., Sidhu, G.S., and Kailath, T. (1974), "Some New Algorithms for Recursive Estimation on Constant Linear Discrete Time Systems," I.E.E.E. Transactions on Automatic Control, AC-19, 315-323.
Nelson, C.R. (1973), Applied Time Series for Managerial Forecasting, San Francisco: Holden-Day.
Newbold, P. (1981), "Some Recent Developments in Time Series Analysis," International Statistical Review, 49, 53-66.
Newton, H. Joseph and Pagano, Marcello (1983), "The Finite Memory Prediction of Covariance Stationary Time Series," SIAM Journal of Scientific and Statistical Computing, 4, 330-339.
Pankratz, Alan (1983), Forecasting with Univariate Box-Jenkins Models: Concepts and Cases, New York: John Wiley & Sons.
Pankratz, Alan (1991), Forecasting with Dynamic Regression Models, New York: John Wiley & Sons.
Pankratz, A. and Dudley, U. (1987), "Forecast of Power-Transformed Series," Journal of Forecasting, Vol 6, No. 4, 239-248.
Pearlman, J.G. (1980), "An Algorithm for the Exact Likelihood of a High-Order Autoregressive Moving-Average Process," Biometrika, 67, 232-233.
Priestly, M.B. (1981), Spectral Analysis and Time Series, Volume 1: Univariate Series, New York: Academic Press, Inc.
Roberts, S.A. (1982), "A General Class of Holt-Winters Type Forecasting Models," Management Science, 28, 808-820.
Schwarz, G. (1978), "Estimating the Dimension of a Model," Annals of Statistics, 6, 461-464.
Sweet, A.L. (1985), "Computing the Variance of the Forecast Error for the Holt-Winters Seasonal Models," Journal of Forecasting, 4, 235-243.
Winters, P.R. (1960), "Forecasting Sales by Exponentially Weighted Moving Averages," Management Science, 6, 324-342.
Yar, M. and Chatfield, C. (1990), "Prediction Intervals for the Holt-Winters Forecasting Procedure," International Journal of Forecasting, 6, 127-137.
Woodfield, T.J. (1987), "Time Series Intervention Analysis Using SAS Software," Proceedings of the Twelfth Annual SAS Users Group International Conference, 331-339. Cary, NC: SAS Institute Inc.
Part V
Investment Analysis
Chapter 45
Overview
Contents
About Investment Analysis
Starting Investment Analysis
Getting Help
Using Help
Software Requirements
Is this a reasonable price? Investment Analysis can be beneficial to users in many industries for a variety of decisions:
manufacturing: cost justification of automation or any capital investment, replacement analysis of major equipment, or economic comparison of alternative designs
government: setting funds for services
finance: investment analysis and portfolio management for fixed-income securities
The other way is to type INVEST into the toolbar's command prompt, as displayed in Figure 45.2.
Figure 45.2 Initializing Investment Analysis with the Toolbar
Getting Help
You can get help in Investment Analysis in three ways. One way is to use the Help Menu, as displayed in Figure 45.3. This is the right-most menu item on the menu bar.
Figure 45.3 The Help Menu
Help buttons, as in Figure 45.4, provide another way to access help. Most dialog boxes provide help buttons in their lower-right corners.
Figure 45.4 A Help Button
Also, the toolbar has a button (see Figure 45.5) that invokes the help system. This is the right-most icon on the toolbar.
Figure 45.5 The Help Icon
Each of these methods invokes a browser that gives specific help for the active window.
Using Help
The chapters pertaining to Investment Analysis in this document typically have a section that introduces you to a menu and summarizes the options available through the menu. Such chapters then have sections titled Task and Dialog Box Guides. The Task section provides a description of how to perform many useful tasks. The Dialog Box Guide lists all dialog boxes pertinent to those tasks and gives a brief description of each element of each dialog box.
Software Requirements
Investment Analysis uses the following SAS software:
Base SAS
SAS/ETS
SAS/GRAPH (optional, to view bond pricing and breakeven graphs)
Chapter 46
Portfolios
Contents
The File Menu
Tasks
    Creating a New Portfolio
    Saving a Portfolio
    Opening an Existing Portfolio
    Saving a Portfolio to a Different Name
    Selecting Investments within a Portfolio
Dialog and Utility Guide
    Investment Analysis
    Menu Bar Options
    Right-Clicking within the Portfolio Area
The File menu offers the following items:
New Portfolio creates an empty portfolio with a new name.
Open Portfolio opens the standard SAS Open dialog box where you select a portfolio to open.
Save Portfolio saves the current portfolio to its current name.
Save Portfolio As opens the standard SAS Save As dialog box where you supply a new portfolio name for the current portfolio.
Close closes Investment Analysis.
Exit closes SAS (Windows only).
Tasks
The Portfolio Name is WORK.INVEST.INVEST1 as displayed in Figure 46.2, unless you have saved a portfolio to that name in the past. In that case, some other unused portfolio name is given to the new portfolio.
Saving a Portfolio
From the Investment Analysis dialog box, select File → Save Portfolio. The portfolio is saved to a catalog entry with the name in the Portfolio Name box.
Click Open to load the portfolio. The portfolio should look like Figure 46.4.
Figure 46.4 The Opened Portfolio
Investment Analysis
Figure 46.7 Investment Analysis Dialog Box
Investment Portfolio Name holds the name of the portfolio. The name is of the form library.catalog_entry.portfolio. The default portfolio name is work.invest.invest1, as in Figure 46.7.
Portfolio Description provides a more descriptive explanation of the portfolio's contents. You can edit this description any time this dialog box is active.
The Portfolio area contains the list of investments comprising the particular portfolio. Each investment in the Portfolio area displays the following attributes:
    Name is the name of the investment. It must be a valid SAS name. It is used to distinguish investments when performing analyses and computations.
    Label is a place where you can provide a more descriptive explanation of the investment.
    Type is the type of investment, which is fixed when you create the investment. It is one of the following: LOAN, SAVINGS, DEPRECIATION, BOND, or OTHER.
Additional tools to aid in the management of your portfolio are available by selecting from the menu bar or by right-clicking within the Portfolio area.
The menu bar (shown in Figure 46.8) provides many tools to aid in the management of portfolios and the investments that comprise them. The following menu items provide functionality particular to Investment Analysis:
File opens and saves portfolios.
Investment creates new investments within the portfolio.
Compute performs constant dollar, after tax, and currency conversion computations on generic cashflows.
Analyze analyzes investments to aid in decision-making.
Tools sets default values of inflation and income tax rates.
After selecting an investment, right-clicking in the Portfolio area pops up a menu (see Figure 46.9) that offers the following options:
Edit opens the selected investment within the portfolio.
Duplicate creates a duplicate of the selected investment within the portfolio.
Delete removes the selected investment from the portfolio.
If you wish to perform one of these actions on a collection of investments, you must select a collection of investments (as described in the section "Selecting Investments within a Portfolio" on page 2705) before right-clicking.
Chapter 47
Investments
Contents
The Investment Menu . . . 2709
Tasks . . . 2711
    Loan Tasks . . . 2711
    Specifying Savings Terms to Create an Account Summary . . . 2718
    Depreciation Tasks . . . 2719
    Bond Tasks . . . 2722
    Generic Cashflow Tasks . . . 2726
Dialog Box Guide . . . 2733
    Loan . . . 2733
    Loan Initialization Options . . . 2735
    Loan Prepayments . . . 2736
    Balloon Payments . . . 2737
    Rate Adjustment Terms . . . 2737
    Rounding Off . . . 2739
    Savings . . . 2739
    Depreciation . . . 2741
    Depreciation Table . . . 2743
    Bond . . . 2743
    Bond Analysis . . . 2745
    Bond Price . . . 2746
    Generic Cashflow . . . 2747
    Right-Clicking within Generic Cashflow's Cashflow Specification Area . . . 2748
    Flow Specification . . . 2749
    Forecast Specification . . . 2751
The Investment menu, shown in Figure 47.1, offers the following items:
New → Loan opens the Loan dialog box. Loans are useful for acquiring capital to pursue various interests. Available terms include rate adjustments for variable rate loans, initialization costs, prepayments, and balloon payments.
New → Savings opens the Savings dialog box. Savings are necessary when planning for the future, whether for business or personal purposes. Account summary calculations available per deposit include starting balance, deposits, interest earned, and ending balance.
New → Depreciation opens the Depreciation dialog box. Depreciations are relevant in tax calculation. The available depreciation methods are Straight Line, Sum-of-years Digits, Depreciation Table, and Declining Balance. Depreciation Tables are necessary when depreciation calculations must conform to set yearly percentages. Declining Balance with conversion to Straight Line is also provided.
New → Bond opens the Bond dialog box. Bonds have widely varying terms depending on the issuer. Because bond issuers frequently auction their bonds, the ability to price a bond between the issue date and maturity date is desirable. Fixed-coupon bonds may be analyzed for the following: price versus yield-to-maturity, duration, and convexity. These are available at different times in the bond's life.
New → Generic Cashflow opens the Generic Cashflow dialog box. Generic cashflows are the most flexible investments. Only a sequence of date-amount pairs is necessary for specification. You can enter date-amount pairs and load values from SAS data sets to specify any type of investment. You can generate uniform, arithmetic, and geometric cashflows with ease. SAS's forecasting ability is available to forecast future cashflows as well. The new graphical display aids in visualization of the cashflow and enables the user to change the frequency of the cashflow view to aggregate and disaggregate the view.
Edit opens the specification dialog box for an investment selected within the portfolio.
Duplicate creates a duplicate of an investment selected within the portfolio.
Delete removes an investment selected from the portfolio.
If you want to edit, duplicate, or delete a collection of investments, you must select a collection of investments as described in the section "Selecting Investments within a Portfolio" on page 2705 before performing the menu option.
Tasks
Loan Tasks
Suppose you want to buy a home that costs $100,000. You can make a down payment of $20,000. Hence, you need a loan of $80,000. You are able to acquire a 30-year loan at 7% interest starting January 1, 2000. Let's use Investment Analysis to specify and analyze this loan. In the Investment Analysis dialog box, select Investment → New → Loan from the menu bar to open the Loan dialog box.
Adding Prepayments
Now let's observe the effect of prepayments on the loan. Consider the loan described in the section "Loan Tasks" on page 2711. You must pay a minimum of $532.24 each month to keep up with payments. However, let's say you dislike entering this amount in your checkbook. You would rather pay $550.00 to keep the arithmetic simpler. This would constitute a uniform prepayment of $17.76 each month. In the Loan dialog box, click Prepayments, which opens the Loan Prepayments dialog box shown in Figure 47.4.
You can specify an arbitrary sequence of prepayments in the Prepayments area. If you want a uniform prepayment, clear the Prepayments area and enter the uniform payment amount in the Uniform Prepayment text box. That amount will be added to each payment until the loan is paid off. To specify this uniform prepayment, follow these steps:
1. Enter 17.76 for the Uniform Prepayment.
2. Click OK to return to the Loan dialog box.
3. Click Create Amortization Schedule, and the amortization schedule is updated, as displayed in Figure 47.5.
The last payment falls in January 2030 without prepayments and in February 2027 with them, so the $17.76 prepayments pay the loan off almost three years earlier. To continue this example you must remove the prepayments from the loan specification, following these steps:
1. Reopen the Loan Prepayments dialog box from the Loan dialog box by clicking Prepayments.
2. Enter 0 for Uniform Prepayment.
3. Click OK to return to the Loan dialog box.
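The figures in this example can be checked with a short calculation. The following is a minimal Python sketch, not part of Investment Analysis; it assumes the standard level-payment annuity formula with monthly compounding, and the function names are hypothetical.

```python
def periodic_payment(principal, annual_rate_pct, n_payments, periods_per_year=12):
    """Level payment that fully amortizes `principal` over `n_payments` periods."""
    r = annual_rate_pct / 100.0 / periods_per_year
    return principal * r / (1.0 - (1.0 + r) ** -n_payments)


def payments_until_paid(principal, annual_rate_pct, payment, periods_per_year=12):
    """Count the payments needed when `payment` is paid every period."""
    r = annual_rate_pct / 100.0 / periods_per_year
    if payment <= principal * r:
        raise ValueError("payment does not cover the first period's interest")
    balance, count = principal, 0
    while balance > 0.005:                      # stop once under half a cent
        owed = balance * (1.0 + r)              # balance plus one period of interest
        balance = owed - min(payment, owed)     # the last payment may be partial
        count += 1
    return count


def payment_month(start_year, start_month, k):
    """(year, month) of the k-th payment, due k months after the start date."""
    years, months = divmod(start_month - 1 + k, 12)
    return start_year + years, months + 1


minimum = periodic_payment(80_000, 7.0, 360)
print(round(minimum, 2))                                                   # 532.24
print(payment_month(2000, 1, payments_until_paid(80_000, 7.0, minimum)))   # (2030, 1)
print(payment_month(2000, 1, payments_until_paid(80_000, 7.0, 550.0)))     # (2027, 2)
```

Paying $550.00 instead of the minimum shortens the loan from 360 to 325 payments under these assumptions, which agrees with the January 2030 and February 2027 payoff dates quoted above.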
You can specify an arbitrary sequence of balloon payments by adding date-amount pairs to the Balloon Payments area. To specify these balloon payments, follow these steps:
1. Right-click within the Balloon Payment area (which pops up a menu) and release on New.
2. Set the pair's Date to 01JAN2007.
3. Set Amount to 6000.
4. Right-click within the Balloon Payment area (which pops up a menu) and release on New.
5. Set the new pair's Date to 01JAN2023.
6. Set its Amount to 6000.
7. Click OK to return to the Loan dialog box.
8. Click Create Amortization Schedule, and the amortization schedule is updated.
Your monthly payment is now $500.30, a difference of approximately $32 each month. To continue this example you must remove the balloon payments from the loan specification, following these steps:
1. Reopen the Balloon Payments dialog box.
2. Right-click within the Balloon Payment area (which pops up a menu) and release on Clear.
3. Click OK to return to the Loan dialog box.
To specify these loan adjustment terms, follow these steps:
1. Enter 3 for the Life Cap. The Life Cap is the maximum deviation from the Initial Rate.
2. Enter 1 for the Periodic Cap.
3. Enter 36 for the Adjustment Frequency.
4. Confirm that Worst Case is selected from the Rate Adjustment Assumption options.
5. Click OK to return to the Loan dialog box.
6. Enter 6 for the Initial Rate.
7. Click Create Amortization Schedule, and the amortization schedule is updated.
Your monthly payment drops to $479.64 each month. However, if the worst-case scenario plays out, the payments will increase to $636.84 in nine years. Figure 47.8 displays amortization table information for the final few months under this scenario.
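To see why the worst case takes nine years to unfold, the rate path implied by these terms can be sketched in Python. This is an illustration only: it assumes the worst case raises the rate by the full Periodic Cap at every adjustment until the Life Cap is reached, the function name is hypothetical, and it does not attempt to reproduce how Investment Analysis recomputes the payment after each adjustment.

```python
def worst_case_rate_path(initial_rate, life_cap, periodic_cap, adjust_every, n_months):
    """Map month -> rate, raising the rate by `periodic_cap` at each adjustment
    until it reaches `initial_rate + life_cap` (assumed worst-case behavior)."""
    ceiling = initial_rate + life_cap
    rate = initial_rate
    path = {0: rate}
    for month in range(adjust_every, n_months, adjust_every):
        bumped = min(rate + periodic_cap, ceiling)
        if bumped != rate:          # record only the months where the rate changes
            path[month] = bumped
            rate = bumped
    return path


path = worst_case_rate_path(6.0, 3.0, 1.0, 36, 360)
print(path)                         # {0: 6.0, 36: 7.0, 72: 8.0, 108: 9.0}

# Payment at the 6% initial rate, using the standard level-payment formula.
r = 6.0 / 100.0 / 12
initial_payment = 80_000 * r / (1.0 - (1.0 + r) ** -360)
print(round(initial_payment, 2))    # 479.64
```

The rate reaches its 9% ceiling at month 108, that is, nine years after the start date, which is when the text above says the payment reaches its worst-case level.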
You must specify the savings before generating the account summary. After you have specified the savings, click Create Account Summary to compute the ending date and balance and to generate the account summary displayed in Figure 47.9.
Figure 47.9 Creating an Account Summary
Depreciation Tasks
Commercial assets are considered to lose value as time passes. For tax purposes, you want to quantify this loss. This investment structure helps calculate appropriate values. Suppose you spend $50,000 for a commercial fishing boat that is considered to have a ten-year useful life. How would you depreciate it? In the Investment Analysis dialog box, select Investment → New → Depreciation from the menu bar to open the Depreciation dialog box.
2. Enter 50000 for the Cost.
3. Enter 2000 for the Year of Purchase.
4. Enter 10 for the Useful Life.
5. Enter 0 for the Salvage Value.
You must specify the depreciation before generating the depreciation schedule. After you have specified the depreciation, click Create Depreciation Schedule to generate a depreciation schedule like the one displayed in Figure 47.10.
Figure 47.10 Creating a Depreciation Schedule
The default depreciation method is Declining Balance (with Conversion to Straight Line). Try the following methods to see how they each affect the schedule:
Straight Line
Sum-of-years Digits
Declining Balance (without conversion to Straight Line)
It might be useful to compare the value of the boat at 5 years for each method. A description of these methods is available in the section "Depreciation Methods" on page 2785.
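The suggested five-year comparison can also be sketched outside the dialog box. The Python below is a minimal illustration that assumes the usual textbook definitions of the three methods (and a declining-balance factor of 2); the section "Depreciation Methods" remains the reference for what Investment Analysis actually computes, and the function names are hypothetical.

```python
def straight_line(cost, salvage, life):
    """Equal depreciation in every year of the useful life."""
    return [(cost - salvage) / life] * life


def sum_of_years_digits(cost, salvage, life):
    """Year k (1-based) receives (life - k + 1) / (1 + 2 + ... + life) of the base."""
    total = life * (life + 1) // 2
    return [(cost - salvage) * (life - k) / total for k in range(life)]


def declining_balance(cost, salvage, life, factor=2.0):
    """Depreciate factor/life of the remaining book value, never below salvage."""
    book, schedule = cost, []
    for _ in range(life):
        dep = min(book * factor / life, book - salvage)
        schedule.append(dep)
        book -= dep
    return schedule


def book_value_after(cost, schedule, years):
    """Book value once the first `years` entries have been deducted."""
    return cost - sum(schedule[:years])


for name, method in [("Straight Line", straight_line),
                     ("Sum-of-years Digits", sum_of_years_digits),
                     ("Declining Balance", declining_balance)]:
    print(name, round(book_value_after(50_000, method(50_000, 0, 10), 5), 2))
# Straight Line 25000.0
# Sum-of-years Digits 13636.36
# Declining Balance 16384.0
```

Under these assumptions the accelerated methods leave a noticeably lower book value for the boat after five years than Straight Line does.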
Click OK to return to the Depreciation dialog box. Click Create Depreciation Schedule and the depreciation schedule fills (see Figure 47.12).
Note that this depreciation schedule has eleven entries. Under the half-year convention, you deduct half a year of depreciation in the first year, which leaves half a year to deduct after the useful life is over.
Figure 47.12 Depreciation Table with MACRS10
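A table-driven schedule of this kind can be sketched in Python. The percentages below are the commonly published 10-year, half-year-convention MACRS rates; whether the MACRS10 table in Investment Analysis holds exactly these values is an assumption, and the names are hypothetical.

```python
# Assumed yearly percentages for 10-year property, half-year convention.
MACRS_10_YEAR_PCT = [10.00, 18.00, 14.40, 11.52, 9.22, 7.37,
                     6.55, 6.55, 6.56, 6.55, 3.28]


def table_schedule(cost, first_year, rates_pct):
    """Rows of (year, start book value, depreciation, end book value)."""
    rows, book = [], cost
    for offset, pct in enumerate(rates_pct):
        dep = cost * pct / 100.0        # each rate applies to the original cost
        rows.append((first_year + offset, book, dep, book - dep))
        book -= dep
    return rows


schedule = table_schedule(50_000, 2000, MACRS_10_YEAR_PCT)
print(len(schedule))                    # 11
for year, start, dep, end in schedule:
    print(year, round(start, 2), round(dep, 2), round(end, 2))
```

A ten-year life produces eleven rows because the first and last rows each cover only half a year.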
Bond Tasks
Suppose someone offers to sell you a 20-year utility bond that was issued six years ago. It has a $1,000 face value and pays semi-year coupons at 2%. You can purchase it for $780. Would you be satisfied with this bond if you expect an 8% minimum attractive rate of return (MARR)? In the Investment Analysis dialog box, select Investment → New → Bond from the menu bar to open the Bond dialog box.
2. Enter 1000 for the Face Value.
3. Enter 2 for the Coupon Rate. The Coupon Payment updates to 20.
4. Select SEMIYEAR for Coupon Interval.
5. Enter 28 for the Number of Coupons. Because 14 years remain before the bond matures, the bond still has 28 semiyear coupons to pay. The Maturity Date updates.
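The question posed for this bond can be approximated by discounting its remaining cashflows. The following Python sketch is an illustration, not the calculation Investment Analysis performs: it works with a yield per coupon period and assumes that an 8% annual MARR corresponds to 4% per half-year; the function names are hypothetical.

```python
def bond_price(face, coupon, n_coupons, y):
    """Present value of the remaining coupons plus the face value,
    discounted at yield `y` per coupon period."""
    v = 1.0 / (1.0 + y)
    return sum(coupon * v ** t for t in range(1, n_coupons + 1)) + face * v ** n_coupons


def yield_per_period(face, coupon, n_coupons, price, lo=0.0, hi=1.0):
    """Bisection for the per-period yield at which the bond is worth `price`."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if bond_price(face, coupon, n_coupons, mid) > price:
            lo = mid            # price too high, so the yield must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


value_at_marr = bond_price(1000, 20, 28, 0.04)
print(round(value_at_marr, 2))                  # 666.74
implied = yield_per_period(1000, 20, 28, 780)
print(implied)                                  # about 0.032 per half-year
```

Under these assumptions the bond is worth roughly $667 to an investor who requires 4% per half-year, so the $780 asking price delivers less than the required return.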
Click Return to return to the Bond Analysis dialog box. In the Bond Analysis dialog box, click OK to return to the Bond dialog box. In the Bond dialog box, click OK to return to the Investment Analysis dialog box.
The following sections describe how to use most of these right-click options. The Specify and Forecast menu items are described in the sections "Including a Generated Cashflow" on page 2729 and "Including a Forecasted Cashflow" on page 2731.
To add a new date-amount pair manually, follow these steps:
1. Right-click in the Cashflow Specification area as shown in Figure 47.18, and release on Add.
2. Enter 01JAN01 for the date.
3. Enter 100 for the amount.
Copying a Date-Amount Pair
To copy a selected date-amount pair, follow these steps:
1. Select the pair you just created.
2. Right-click in the Cashflow Specification area as shown in Figure 47.18, but this time release on Copy.
Sorting All of the Date-Amount Pairs
Change the second date to 01JAN00. Now the dates are unsorted. Right-click in the Cashflow Specification area as shown in Figure 47.18, and release on Sort.
To delete a selected date-amount pair, follow these steps:
1. Select a date-amount pair.
2. Right-click in the Cashflow Specification area as shown in Figure 47.18, and release on Delete.
To clear all date-amount pairs, right-click in the Cashflow Specification area as shown in Figure 47.18, and release on Clear.
To load date-amount pairs from a SAS data set into the Cashflow Specification area, follow these steps:
1. Right-click in the Cashflow Specification area, and release on Load. This opens the Load Dataset dialog box.
2. Enter SASHELP.RETAIL for Dataset Name.
3. Click OK to return to the Generic Cashflow dialog box.
If there is a Date variable in the SAS data set, Investment Analysis loads it into the list. If there is no Date variable, it loads the first available date or date-time-formatted variable. Investment Analysis then searches the SAS data set for an Amount variable to use. If none exists, it takes the first numeric variable that is not used by the Date variable.
To save date-amount pairs from the Cashflow Specification area to a SAS data set, follow these steps:
1. Right-click in the Cashflow Specification area and release on Save. This opens the Save Dataset dialog box.
2. Enter the name of the SAS data set for Dataset Name.
3. Click OK to return to the Generic Cashflow dialog box.
Clicking Subtract subtracts the current cashflow from the Generic Cashflow dialog box, and it returns you to the Generic Cashflow dialog box. You can generate arithmetic and geometric specifications by clicking them within the Series Flow Type area. However, you must enter a value for the Gradient. In both cases the Level value is the value of the list at the Starting Date. With an arithmetic flow type, entries increment by the value Gradient for each Time Interval. With a geometric flow type, entries increase by the factor Gradient for each Time Interval. Figure 47.20 displays an arithmetic cashflow with a Level of 100 and a Gradient of 12.
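The two series types can be sketched in a few lines of Python. This is an illustration with hypothetical function names; it assumes that "increase by the factor Gradient" means each entry is the previous entry multiplied by Gradient, which the text above does not spell out.

```python
def arithmetic_flow(level, gradient, n_entries):
    """Entries start at `level` and grow by `gradient` each time interval."""
    return [level + gradient * k for k in range(n_entries)]


def geometric_flow(level, gradient, n_entries):
    """Entries start at `level` and are multiplied by `gradient` each interval
    (assumed interpretation of a geometric Gradient)."""
    return [level * gradient ** k for k in range(n_entries)]


print(arithmetic_flow(100, 12, 5))   # [100, 112, 124, 136, 148]
print(geometric_flow(100, 2, 4))     # [100, 200, 400, 800]
```

The first call mirrors the Level of 100 and Gradient of 12 used for the arithmetic cashflow in Figure 47.20; the geometric values are made up for illustration.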
Clicking Subtract subtracts the current forecast from the Generic Cashflow dialog box, and it returns you to the Generic Cashflow dialog box. To review the values from the SAS data set you forecast, click View Table or View Graph. You can adjust the following values for the SAS data set you forecast: Time ID Variable, Time Interval, and Analysis Variable. You can adjust the following values for the forecast: the Horizon, the Confidence, and choice of predicted value, lower confidence limit, and upper confidence limit.
box to the right of the horizontal scroll bar to alter the number of entries you want to view. The number in that box must be no greater than the number of entries in the cashflow list. Reducing this number has the effect of zooming in on a portion of the cashflow. When the number is less than the number of entries in the cashflow list, you can use the scroll bar at the bottom of the chart to scroll through the chart.
Loan
Selecting Investment → New → Loan from the Investment Analysis dialog box's menu bar opens the Loan dialog box displayed in Figure 47.22.
Figure 47.22 Loan Dialog Box
The following items are displayed:
Name holds the name you assign to the loan. You can set the name here or within the Portfolio area of the Investment Analysis dialog box. This must be a valid SAS name.
The Loan Specification area gives access to the values that define the loan.
    Loan Amount holds the borrowed amount.
    Periodic Payment holds the value of the periodic payments.
    Number of Payments holds the number of payments in loan terms.
    Payment Interval holds the frequency of the Periodic Payment.
    Compounding Interval holds the compounding frequency.
    Initial Rate holds the interest rate (a nominal percentage between 0 and 120) you pay on the loan.
    Start Date holds the SAS date when the loan is initialized. The first payment is due one Payment Interval after this time.
Initialization opens the Loan Initialization Options dialog box where you can define initialization costs and down payments relevant to the loan.
Prepayments opens the Loan Prepayments dialog box where you can specify the SAS dates and amounts of any prepayments.
Balloon Payments opens the Balloon Payments dialog box where you can specify the SAS dates and amounts of any balloon payments.
Rate Adjustments opens the Rate Adjustment Terms dialog box where you can specify terms for a variable-rate loan.
Rounding Off opens the Rounding Off dialog box where you can select the number of decimal places for calculations.
Create Amortization Schedule becomes available when you adequately define the loan within the Loan Specification area. Clicking it generates the amortization schedule.
Amortization Schedule fills when you click Create Amortization Schedule. The schedule contains a row for the loan's start date and each payment date with information about the following:
    Date is a SAS date, either the loan's start date or a payment date.
    Beginning Principal Amount is the balance at that date.
    Periodic Payment Amount is the expected payment at that date.
    Interest Payment is zero for the loan's start date; otherwise it holds the interest since the previous date.
    Principal Repayment is the amount of the payment that went toward the principal.
    Ending Principal is the balance at the end of the payment interval.
Print becomes available when you generate the amortization schedule. Clicking it sends the contents of the amortization schedule to the SAS session print device.
Save Data As becomes available when you generate the amortization schedule. Clicking it opens the Save Output Dataset dialog box where you can save the amortization table (or portions thereof) as a SAS data set.
OK returns you to the Investment Analysis dialog box. If this is a new loan specification, clicking OK appends the current loan specification to the portfolio. If this is an existing loan specification, clicking OK returns the altered loan specification to the portfolio.
Cancel returns you to the Investment Analysis dialog box. If this is a new loan specification, clicking Cancel discards the current loan specification. If this is an existing loan specification, clicking Cancel discards the current changes.
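The columns of the amortization schedule can be illustrated with a small Python sketch. It assumes a fixed-rate loan with monthly payments and monthly compounding, uses hypothetical function and key names, and is not the routine that Investment Analysis runs.

```python
from datetime import date


def add_months(d, k):
    """Date k months after d (safe when d.day <= 28, as for a day-1 start date)."""
    years, month_index = divmod(d.month - 1 + k, 12)
    return date(d.year + years, month_index + 1, d.day)


def amortization_schedule(amount, annual_rate_pct, n_payments, start):
    """One row for the start date, then one row per monthly payment date."""
    r = annual_rate_pct / 100.0 / 12
    payment = amount * r / (1.0 - (1.0 + r) ** -n_payments)
    rows = [{"date": start, "beginning_principal": amount, "payment": 0.0,
             "interest": 0.0, "principal_repayment": 0.0, "ending_principal": amount}]
    balance = amount
    for k in range(1, n_payments + 1):
        interest = balance * r                  # interest since the previous date
        repayment = payment - interest          # part of the payment that cuts principal
        rows.append({"date": add_months(start, k), "beginning_principal": balance,
                     "payment": payment, "interest": interest,
                     "principal_repayment": repayment,
                     "ending_principal": balance - repayment})
        balance -= repayment
    return rows


rows = amortization_schedule(80_000, 7.0, 360, date(2000, 1, 1))
first = rows[1]
print(first["date"], round(first["interest"], 2), round(first["principal_repayment"], 2))
# 2000-02-01 466.67 65.58
print(rows[-1]["date"])                         # 2030-01-01
```

Each row's Ending Principal becomes the next row's Beginning Principal, and the final row ends at zero apart from floating-point rounding.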
The following items are displayed:
The Price, Loan Amount and Downpayment area contains the following information:
    Purchase Price holds the actual price of the asset. This value equals the loan amount plus the downpayment.
    Loan Amount holds the loan amount.
    % of Price (to the right of Loan Amount) updates when you enter the Purchase Price and either the Loan Amount or Downpayment. This holds the percentage of the Purchase Price that comprises the Loan Amount. Setting the percentage manually causes the Loan Amount and Downpayment to update.
    Downpayment holds any downpayment paid for the asset.
    % of Price (to the right of Downpayment) updates when you enter the Purchase Price and either the Loan Amount or Downpayment. This holds the percentage of the Purchase Price that comprises the Downpayment. Setting the percentage manually causes the Loan Amount and Downpayment to update.
The Initialization Costs and Discount Points area contains the following information:
    Loan Amount holds a copy of the Loan Amount above.
    Initialization Costs holds the value of any initialization costs.
    % of Amount (to the right of Initialization Costs) updates when you enter the Purchase Price and either the Initialization Costs or Discount Points. This holds the percentage of the Loan Amount that comprises the Initialization Costs. Setting the percentage manually causes the Initialization Costs to update.
    Discount Points holds the value of any discount points.
    % of Amount (to the right of Discount Points) updates when you enter the Purchase Price and either the Initialization Costs or Discount Points. This holds the percentage of the Loan Amount that comprises the Discount Points. Setting the percentage manually causes the Discount Points to update.
OK returns you to the Loan dialog box, saving the information that is entered.
Cancel returns you to the Loan dialog box, discarding any changes made since you opened the dialog box.
Loan Prepayments
Clicking Prepayments in the Loan dialog box opens the Loan Prepayments dialog box displayed in Figure 47.24.
Figure 47.24 Loan Prepayments Dialog Box
The following items are displayed:
Uniform Prepayment holds the value of a regular prepayment concurrent with the usual periodic payment.
Prepayments holds a list of date-amount pairs to accommodate any prepayments. Right-clicking within the Prepayments area reveals many helpful tools for managing date-amount pairs.
OK returns you to the Loan dialog box, storing the information entered on the prepayments.
Cancel returns you to the Loan dialog box, discarding any prepayments entered since you opened the dialog box.
Balloon Payments
Clicking Balloon Payments in the Loan dialog box opens the Balloon Payments dialog box displayed in Figure 47.25.
Figure 47.25 Balloon Payments Dialog Box
The following items are displayed:
Balloon Payments holds a list of date-amount pairs to accommodate any balloon payments. Right-clicking within the Balloon Payments area reveals many helpful tools for managing date-amount pairs.
OK returns you to the Loan dialog box, storing the information entered on the balloon payments.
Cancel returns you to the Loan dialog box, discarding any balloon payments entered since you opened the dialog box.
The following items are displayed:
The Rate Adjustment Terms area contains the following items:
    Life Cap holds the maximum deviation from the Initial Rate allowed over the life of the loan.
    Periodic Cap holds the maximum adjustment allowed per adjustment.
    Adjustment Frequency holds how often (in months) the lender can adjust the interest rate.
The Rate Adjustment Assumption determines the scenario the adjustments will take.
    Worst Case uses information from the Rate Adjustment Terms area to forecast a worst-case scenario.
    Best Case uses information from the Rate Adjustment Terms area to forecast a best-case scenario.
    Fixed Case specifies a fixed-rate loan.
    Estimated Case uses information from the Rate Adjustment Terms and Estimated Rates areas to forecast a scenario based on the estimated rates.
Estimated Rates holds a list of date-rate pairs, where each date is a SAS date and the rate is a nominal percentage between 0 and 120. The Estimated Case assumption uses these rates for its calculations. Right-clicking within the Estimated Rates area reveals many helpful tools for managing date-rate pairs.
OK returns you to the Loan dialog box, taking rate adjustment information into account.
Cancel returns you to the Loan dialog box, discarding any rate adjustment information provided since opening the dialog box.
Rounding Off
Clicking Rounding Off in the Loan dialog box opens the Rounding Off dialog box displayed in Figure 47.27.
Figure 47.27 Rounding Off Dialog Box
The following items are displayed:
Decimal Places fixes the number of decimal places your results will display.
OK returns you to the Loan dialog box. Numeric values will then be represented with the number of decimals specified in Decimal Places.
Cancel returns you to the Loan dialog box. Numeric values will be represented with the number of decimals specified prior to opening this dialog box.
Savings
Selecting Investment → New → Savings from the Investment Analysis dialog box's menu bar opens the Savings dialog box displayed in Figure 47.28.
The following items are displayed:
Name holds the name you assign to the savings. You can set the name here or within the Portfolio area of the Investment Analysis dialog box. This must be a valid SAS name.
The Savings Specification area contains the following items:
    Periodic Deposit holds the value of your regular deposits.
    Number of Deposits holds the number of deposits into the account.
    Initial Rate holds the interest rate (a nominal percentage between 0 and 120) the savings account earns.
    Start Date holds the SAS date when deposits begin.
    Deposit Interval holds the frequency of your Periodic Deposit.
    Compounding Interval holds how often the interest compounds.
Create Account Summary becomes available when you adequately define the savings within the Savings Specification area. Clicking it generates the account summary.
Account Summary fills when you click Create Account Summary. The schedule contains a row for each deposit date with information about the following:
    Date is the SAS date of a deposit.
    Starting Balance is the balance at that date.
    Deposits is the deposit at that date.
    Interest Earned is the interest earned since the previous date.
    Ending Balance is the balance after the payment.
Print becomes available when you generate an account summary. Clicking it sends the contents of the account summary to the SAS session print device.
Save Data As becomes available when you generate an account summary. Clicking it opens the Save Output Dataset dialog box where you can save the account summary (or portions thereof) as a SAS data set.
OK returns you to the Investment Analysis dialog box. If this is a new savings, clicking OK appends the current savings specification to the portfolio. If this is an existing savings specification, clicking OK returns the altered savings to the portfolio.
Cancel returns you to the Investment Analysis dialog box. If this is a new savings, clicking Cancel discards the current savings specification. If this is an existing savings, clicking Cancel discards the current changes.
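The account summary columns can be sketched as follows. The example values (12 monthly deposits of 100 at 6%) are hypothetical, as are the function and key names, and the sketch assumes that the deposit interval equals the compounding interval and that each row's Ending Balance is the Starting Balance plus the Interest Earned plus the deposit. Investment Analysis may use a different timing convention.

```python
def account_summary(deposit, n_deposits, annual_rate_pct, periods_per_year=12):
    """One row per deposit date under the assumed timing convention."""
    r = annual_rate_pct / 100.0 / periods_per_year
    rows, balance = [], 0.0
    for k in range(1, n_deposits + 1):
        interest = balance * r                      # earned since the previous date
        ending = balance + interest + deposit
        rows.append({"deposit_no": k, "starting_balance": balance,
                     "deposit": deposit, "interest_earned": interest,
                     "ending_balance": ending})
        balance = ending
    return rows


summary = account_summary(100.0, 12, 6.0)
for row in summary[:2]:
    print(row["deposit_no"], round(row["starting_balance"], 2),
          round(row["interest_earned"], 2), round(row["ending_balance"], 2))
# 1 0.0 0.0 100.0
# 2 100.0 0.5 200.5
print(round(summary[-1]["ending_balance"], 2))      # 1233.56
```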
Depreciation
Selecting Investment → New → Depreciation from the Investment Analysis dialog box's menu bar opens the Depreciation dialog box displayed in Figure 47.29.
Figure 47.29 Depreciation Dialog Box
The following items are displayed:
Name holds the name you assign to the depreciation. You can set the name here or within the Portfolio area of the Investment Analysis dialog box. This must be a valid SAS name.
The Depreciable Asset Specification area contains the following items:
    Cost holds the asset's original cost.
    Year of Purchase holds the asset's year of purchase.
    Useful Life holds the asset's useful life (in years).
    Salvage Value holds the asset's value at the end of its Useful Life.
The Depreciation Method area holds the depreciation methods available:
    Straight Line
    Sum-of-years Digits
    Depreciation Table
    Declining Balance
        DB Factor: choice of 2, 1.5, or 1
        Conversion to SL: choice of Yes or No
Create Depreciation Schedule becomes available when you adequately define the depreciation within the Depreciable Asset Specification area. Clicking the Create Depreciation Schedule button then fills the Depreciation Schedule area.
Depreciation Schedule fills when you click Create Depreciation Schedule. The schedule contains a row for each year. Each row holds:
    Year is a year.
    Start Book Value is the starting book value for that year.
    Depreciation is the depreciation value for that year.
    End Book Value is the ending book value for that year.
Print becomes available when you generate the depreciation schedule. Clicking it sends the contents of the depreciation schedule to the SAS session print device.
Save Data As becomes available when you generate the depreciation schedule. Clicking it opens the Save Output Dataset dialog box where you can save the depreciation table (or portions thereof) as a SAS data set.
OK returns you to the Investment Analysis dialog box. If this is a new depreciation specification, clicking OK appends the current depreciation specification to the portfolio. If this is an existing depreciation specification, clicking OK returns the altered depreciation specification to the portfolio.
Cancel returns you to the Investment Analysis dialog box. If this is a new depreciation specification, clicking Cancel discards the current depreciation specification. If this is an existing depreciation specification, clicking Cancel discards the current changes.
Depreciation Table
Clicking Depreciation Table in the Depreciation Method area of the Depreciation dialog box opens the Depreciation Table dialog box displayed in Figure 47.30.
Figure 47.30 Depreciation Table Dialog Box
The following items are displayed:
The Depreciation area holds a list of year-rate pairs where the rate is an annual depreciation rate (a percentage between 0% and 100%). Right-clicking within the Depreciation area reveals many helpful tools for managing year-rate pairs.
OK returns you to the Depreciation dialog box with the current list of depreciation rates from the Depreciation area.
Cancel returns you to the Depreciation dialog box, discarding any edits made to the Depreciation area since you opened the dialog box.
Bond
Selecting Investment → New → Bond from the Investment Analysis dialog box's menu bar opens the Bond dialog box displayed in Figure 47.31.
The following items are displayed: Name holds the name you assign to the bond. You can set the name here or within the Portfolio area of the Investment Analysis dialog box. This must be a valid SAS name. Bond Specication Face Value holds the bonds value at maturity. Coupon Payment holds the amount of money you receive periodically as the bond matures. Coupon Rate holds the rate (a nominal percentage between 0% and 120%) of the Face Value that denes the Coupon Payment. Coupon Interval holds how often the bond pays its coupons. Number of Coupons holds the number of coupons before maturity. Maturity Date holds the SAS date when you can redeem the bond for its Face Value. The Valuation area becomes available when you adequately dene the bond within the Bond Specication area. Entering either the Value or the Yield causes the calculation of the other. If you respecify the bond after performing a calculation here, you must reenter the Value or Yield value to update the calculation. Value holds the bonds value if expecting the specied Yield. Yield holds the bonds yield if the bond is valued at the amount of Value. You must specify the bond before analyzing it. After you have specied the bond, clicking Analyze opens the Bond Analysis dialog box where you can compare various values and yields.
OK returns you to the Investment Analysis dialog box. If this is a new bond specification, clicking OK appends the current bond specification to the portfolio. If this is an existing bond specification, clicking OK returns the altered bond specification to the portfolio.
Cancel returns you to the Investment Analysis dialog box. If this is a new bond specification, clicking Cancel discards the current bond specification. If this is an existing bond specification, clicking Cancel discards the current changes.
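The link between Value and Yield in the Valuation area can be illustrated with a standard present-value calculation. The following Python sketch is not the Investment Analysis implementation. It assumes level coupons, a yield compounded once per coupon period, valuation exactly one period before the first coupon, and rates given as fractions. All function and parameter names are hypothetical.

```python
def bond_value(face_value, coupon_rate, coupons_per_year, n_coupons, annual_yield):
    """Present value of the remaining coupons plus the face value."""
    coupon = face_value * coupon_rate / coupons_per_year
    y = annual_yield / coupons_per_year  # yield per coupon period
    pv_coupons = sum(coupon / (1 + y) ** k for k in range(1, n_coupons + 1))
    pv_face = face_value / (1 + y) ** n_coupons
    return pv_coupons + pv_face


def bond_yield(face_value, coupon_rate, coupons_per_year, n_coupons, value,
               lo=0.0, hi=1.0, tol=1e-10):
    """Invert bond_value by bisection; the value falls as the yield rises.

    Assumes the yield lies between lo and hi (0% and 100% by default).
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bond_value(face_value, coupon_rate, coupons_per_year, n_coupons, mid) > value:
            lo = mid  # computed value too high, so the yield must be higher
        else:
            hi = mid
    return (lo + hi) / 2
```

Under these assumptions a bond valued at its face value has a yield equal to its coupon rate, which gives a quick sanity check on either function.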
Bond Analysis
Clicking Analyze from the Bond dialog box opens the Bond Analysis dialog box displayed in Figure 47.32.
Figure 47.32 Bond Analysis
The following items are displayed:
Analysis Specifications
  Yield-to-maturity holds the percentage yield upon which to center the analysis.
  +/- holds the maximum deviation percentage to consider from the Yield-to-maturity.
  Increment by holds the percentage increment by which the analysis is calculated.
  Reference Price holds the reference price.
  Analysis Dates holds a list of SAS dates for which you perform the bond analysis.
You must specify the analysis before valuing the bond for the various yields. After you adequately specify the analysis, click Create Bond Valuation Summary to generate the bond valuation summary.
Bond Valuation Summary fills when you click Create Bond Valuation Summary. The schedule contains a row for each rate with information concerning the following:
  Date is the SAS date when the Value gives the particular Yield.
  Yield is the percent yield that corresponds to the Value at the given Date.
  Value is the value of the bond at Date for the given Yield.
  Percent Change is the percent change if the Reference Price is specified.
  Duration is the duration.
  Convexity is the convexity.
Graphics opens the Bond Price graph that represents the price versus yield-to-maturity.
Print becomes available when you generate the Bond Valuation Summary. Clicking it sends the contents of the summary to the SAS session print device.
Save Data As becomes available when you fill the Bond Valuation Summary area. Clicking it opens the Save Output Dataset dialog box where you can save the valuation summary (or portions thereof) as a SAS Dataset.
Return takes you back to the Bond dialog box.
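The manual does not say which definitions of duration and convexity the summary uses. As a rough illustration only, the Python sketch below builds the grid of yields implied by Yield-to-maturity, +/-, and Increment by, and computes Macaulay duration and a textbook convexity measure, both expressed in coupon periods. The formulas, the per-period yield, and all names are assumptions rather than the product's documented method.

```python
def yield_grid(center, deviation, increment):
    """Yields from center - deviation to center + deviation in equal steps."""
    steps = int(round(deviation / increment))
    return [center + i * increment for i in range(-steps, steps + 1)]


def bond_metrics(face_value, coupon_rate, n_coupons, y):
    """Return (value, duration, convexity) for one coupon per period.

    y and coupon_rate are per-period fractions; duration is Macaulay
    duration in periods (an assumed definition).
    """
    flows = [(k, face_value * coupon_rate + (face_value if k == n_coupons else 0.0))
             for k in range(1, n_coupons + 1)]
    value = sum(cf / (1 + y) ** k for k, cf in flows)
    duration = sum(k * cf / (1 + y) ** k for k, cf in flows) / value
    convexity = sum(k * (k + 1) * cf / (1 + y) ** (k + 2) for k, cf in flows) / value
    return value, duration, convexity
```

For a zero-coupon bond the Macaulay duration equals the number of periods to maturity, which makes a convenient check of the sketch.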
Bond Price
Clicking Graphics from the Bond Analysis dialog box opens the Bond Price dialog box displayed in Figure 47.33.
It possesses the following item: Return takes you back to the Bond Analysis dialog box.
Generic Cashflow
Selecting Investment → New → Generic Cashflow from the Investment Analysis dialog box's menu bar opens the Generic Cashflow dialog box displayed in Figure 47.34.
Figure 47.34 Generic Cashflow
The following items are displayed:
Name holds the name you assign to the generic cashflow. You can set the name here or within the Portfolio area of the Investment Analysis dialog box. This must be a valid SAS name.
Cashflow Specification holds date-amount pairs that correspond to deposits and withdrawals (or benefits and costs) for the cashflow. Each date is a SAS date. Right-clicking within the Cashflow Specification area reveals many helpful tools for managing date-amount pairs.
The Cashflow Chart fills with a graph representing the cashflow when the Cashflow Specification area is nonempty. The box to the right of the scroll bar controls the number of entries with which to fill the graph. If the number in this box is less than the total number of entries, you can use the scroll bar to view different segments of the cashflow. The left box below the scroll bar holds the frequency for drilling purposes.
OK returns you to the Investment Analysis dialog box. If this is a new generic cashflow specification, clicking OK appends the current cashflow specification to the portfolio. If this is an existing cashflow specification, clicking OK returns the altered cashflow specification to the portfolio.
Cancel returns you to the Investment Analysis dialog box. If this is a new cashflow specification, clicking Cancel discards the current cashflow specification. If this is an existing cashflow specification, clicking Cancel discards the current changes.
Add creates a blank pair. Delete removes the currently highlighted pair. Copy duplicates the currently selected pair. Sort arranges the entered pairs in chronological order. Clear empties the area of all pairs.
Save opens the Save Dataset dialog box where you can save the entered pairs as a SAS Dataset for later use.
Load opens the Load Dataset dialog box where you select a SAS Dataset to populate the area.
Specify opens the Flow Specification dialog box where you can generate date-rate pairs to include in your cashflow.
Forecast opens the Forecast Specification dialog box where you can generate the forecast of a SAS data set to include in your cashflow.
If you want to perform one of these actions on a collection of pairs, you must select a collection of pairs before right-clicking. To select an adjacent list of pairs, do the following: click the first pair, hold down the SHIFT key, and click the final pair. After the list of pairs is selected, you can release the SHIFT key.
Flow Specification
Figure 47.36 Flow Specification
The following items are displayed:
Flow Time Specification
  Time Interval holds the uniform frequency of the entries.
  You can set the Starting Date when you set the Time Interval. It holds the SAS date the entries will start.
  You can set the Ending Date when you set the Time Interval. It holds the SAS date the entries will end.
  Number of Periods holds the number of entries.
Flow Value Specification
  Series Flow Type describes the movement the entries can assume:
    Uniform assumes all entries are equal.
    Arithmetic assumes the entries begin at Level and increase by the value of Gradient per entry.
    Geometric assumes the entries begin at Level and increase by a factor of Gradient per entry.
  Level holds the starting amount for all flow types.
  You can set the Gradient when you select either Arithmetic or Geometric series flow type. It holds the arithmetic and geometric gradients, respectively, for the Arithmetic and Geometric flow types.
When the cashflow entries are adequately defined, the Cashflow Chart fills with a graph displaying the dates and values of the entries. The box to the right of the scroll bar controls the number of entries with which to fill the graph. If the number in this box is less than the total number of entries, you can use the scroll bar to view different segments of the cashflow. The left box below the scroll bar holds the frequency.
Subtract becomes available when the collection of entries is adequately specified. Clicking Subtract then returns you to the Generic Cashflow dialog box and subtracts the entries from the current cashflow.
Add becomes available when the collection of entries is adequately specified. Clicking Add then returns you to the Generic Cashflow dialog box and adds the entries to the current cashflow.
Cancel returns you to the Generic Cashflow dialog box without changing the cashflow.
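The three series flow types can be sketched in a few lines of Python. This is an illustration of the descriptions above, not the dialog box's actual code. It assumes that the geometric Gradient is a plain multiplier (for example 1.05), which the text does not state explicitly, and the function name is hypothetical.

```python
def flow_values(flow_type, level, n_periods, gradient=0.0):
    """Generate the entry amounts for one of the three series flow types."""
    if flow_type == "uniform":
        # every entry equals Level
        return [level for _ in range(n_periods)]
    if flow_type == "arithmetic":
        # start at Level and add Gradient per entry
        return [level + gradient * k for k in range(n_periods)]
    if flow_type == "geometric":
        # start at Level and multiply by Gradient per entry (assumed multiplier)
        return [level * gradient ** k for k in range(n_periods)]
    raise ValueError("unknown flow type: " + flow_type)
```

For example, `flow_values("arithmetic", 100, 3, 10)` yields the amounts 100, 110, and 120, which would then be paired with dates spaced by the chosen Time Interval.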
Forecast Specification
Figure 47.37 Forecast Specification
The following items are displayed:
Historical Data Specification
  Data Set holds the name of the SAS data set to forecast.
  Browse opens the standard SAS Open dialog box to help select a SAS data set to forecast.
  Time ID Variable holds the time ID variable to forecast over.
  Time Interval fixes the time interval for the Time ID Variable.
  Analysis Variable holds the data variable upon which to forecast.
  View Table opens a table that displays the contents of the specified SAS data set.
  View Graph opens the Time Series Viewer that graphically displays the contents of the specified SAS data set.
Forecast Specification
  Horizon holds the number of periods into the future you want to forecast.
  Confidence holds the confidence limit for applicable forecasts.
  Compute Forecast fills the Cashflow Chart with the forecast.
The box below Forecast Specification holds the type of forecast you want to generate:
  Predicted Value
  Lower Confidence Limit
  Upper Confidence Limit
The Cashflow Chart fills when you click Compute Forecast. The box to the right of the scroll bar controls the number of entries with which to fill the graph. If the number in this box is less than the total number of entries, you can use the scroll bar to view different segments of the cashflow. The left box below the scroll bar holds the frequency.
Subtract becomes available when the collection of entries is adequately specified. Clicking Subtract then returns you to the Generic Cashflow dialog box and subtracts the forecast from the current cashflow.
Add becomes available when the collection of entries is adequately specified. Clicking Add then returns you to the Generic Cashflow dialog box and adds the forecast to the current cashflow.
Cancel returns you to the Generic Cashflow dialog box without changing the cashflow.
Chapter 48
Computations
Contents
The Compute Menu . . . . . 2753
Tasks . . . . . 2754
Taxing a Cashflow . . . . . 2754
Converting Currency . . . . . 2756
Deflating Cashflows . . . . . 2758
Dialog Box Guide . . . . . 2760
After Tax Cashflow Calculation . . . . . 2760
Currency Conversion . . . . . 2761
Constant Dollar Calculation . . . . . 2763
The Compute menu offers the following options that apply to generic cashflows.
After Tax Cashflow opens the After Tax Cashflow Calculation dialog box. Computing an after tax cashflow is useful when taxes affect investment alternatives differently. Comparing after tax cashflows provides a more accurate determination of the cashflows' profitabilities. You can set default values for income tax rates by selecting Tools → Define Rate → Income Tax Rate from the Investment Analysis dialog box. This opens the Income Tax Specification dialog box where you can enter the tax rates.
Currency Conversion opens the Currency Conversion dialog box. Currency conversion is necessary when investments are in different currencies. For data concerning currency conversion rates, see http://dsbb.imf.org/, the International Monetary Fund's Dissemination Standards Bulletin Board.
Constant Dollars opens the Constant Dollar Calculation dialog box. A constant dollar (inflation adjusted monetary value) calculation takes cashflow and inflation information and discounts the cashflow to a level where the buying power of the monetary unit is constant over time. Groups quantify inflation (in the form of price indices and inflation rates) for countries and industries by averaging the growth of prices for various products and sectors of the economy. For data concerning price indices, see the United States Department of Labor at http://www.dol.gov/ and the International Monetary Fund's Dissemination Standards Bulletin Board at http://dsbb.imf.org/. You can set default values for inflation rates by clicking Tools → Define Rate → Inflation from the Investment Analysis dialog box. This opens the Inflation Specification dialog box where you can enter the inflation rates.
Tasks
The next few sections show how to perform computations for the following situation. Suppose you buy a $10,000 certificate of deposit that pays 12% interest a year for five years. Your earnings are taxed at a rate of 30% federally and 7% locally. Also, you want to transfer all the money to an account in England. American dollars convert to British pounds at an exchange rate of $1.00 to £0.60. The inflation rate in England is 3%.
The instructions in this example assume familiarity with the following:
The right-clicking options of the Cashflow Specification area in the Generic Cashflow dialog box (described in the section "Right-Clicking within Generic Cashflow's Cashflow Specification Area" on page 2748).
The Save Data As button located in many dialog boxes (described in the section "Saving Output to SAS Data Sets" on page 2781).
Taxing a Cashflow
Consider the example described in the section "The Compute Menu" on page 2753. To create the earnings, follow these steps:
1. Select Investment → New → Generic Cashflow to create a generic cashflow.
2. Enter CD_INTEREST for the Name.
3. Enter 1200 for each of the five years starting one year from today as displayed in Figure 48.2.
4. Click OK to return to the Investment Analysis dialog box.
To compute the tax on the earnings, follow these steps:
1. Select CD_INTEREST from the Portfolio area.
2. Select Compute → After Tax Cashflow from the pull-down menu.
3. Enter 30 for Federal Tax.
4. Enter 7 for Local Tax. Note that Combined Tax updates.
5. Click Create After Tax Cashflow. The After Tax Cashflow area fills, as displayed in Figure 48.3.
Save the taxed earnings to a SAS data set named WORK.CD_AFTERTAX. Click Return to return to the Investment Analysis dialog box.
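The arithmetic behind an after-tax cashflow can be sketched in Python as follows. The manual does not state how Combined Tax is derived from the federal and local rates, so the helper below uses one common convention (local tax is deducted before federal tax applies) purely as a labeled assumption. The value shown in the dialog box may differ, and the function names are hypothetical.

```python
def combined_tax_rate(federal_pct, local_pct):
    """ASSUMPTION: local tax is deducted from income before federal tax applies.

    This convention is not confirmed by the manual.
    """
    return local_pct + federal_pct * (100 - local_pct) / 100


def after_tax(amounts, combined_pct):
    """Amount retained after taxes for each entry of the cashflow."""
    return [amount * (1 - combined_pct / 100) for amount in amounts]


# Five yearly interest payments of 1200, as in CD_INTEREST
rate = combined_tax_rate(30, 7)
retained = after_tax([1200] * 5, rate)
```

Whatever convention produces the combined rate, the second function shows the step that matters for the later tasks: each entry is scaled by one minus the combined rate.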
Converting Currency
Consider the example described in the section "The Compute Menu" on page 2753. To create the cashflow to convert, follow these steps:
1. Select Investment → New → Generic Cashflow to open a new generic cashflow.
2. Enter CD_DOLLARS for the Name.
3. Load WORK.CD_AFTERTAX into its Cashflow Specification.
4. Add -10,000 for today and +10,000 for five years from today to the cashflow as displayed in Figure 48.4.
5. Sort the transactions by date to aid your reading.
6. Click OK to return to the Investment Analysis dialog box.
To convert from American dollars to British pounds, follow these steps:
1. Select CD_DOLLARS from the portfolio.
2. Select Compute → Currency Conversion from the pull-down menu. This opens the Currency Conversion dialog box.
3. Select USD for the From Currency.
4. Select GBP for the To Currency.
5. Enter 0.60 for the Exchange Rate.
6. Click Apply Currency Conversion to fill the Currency Conversion area as displayed in Figure 48.5.
Save the converted values to a SAS data set named WORK.CD_POUNDS. Click Return to return to the Investment Analysis dialog box.
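The conversion itself amounts to multiplying every amount by the exchange rate. The short Python sketch below illustrates this with the purchase and redemption entries of the example. It assumes a single fixed rate for all dates, and the integer year offsets standing in for SAS dates are a simplification introduced here.

```python
def convert(cashflow, exchange_rate):
    """Multiply each amount by the To-per-From exchange rate."""
    return [(date, amount * exchange_rate) for date, amount in cashflow]


# (years from today, amount in USD): purchase today, redemption in five years
cd_dollars = [(0, -10000.0), (5, 10000.0)]
cd_pounds = convert(cd_dollars, 0.60)
```

Signs are preserved, so deposits and withdrawals keep their direction in the converted cashflow.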
Deflating Cashflows
Consider the example described in the section "The Compute Menu" on page 2753. To create the cashflow to deflate, follow these steps:
1. Select Investment → New → Generic Cashflow to open a new generic cashflow.
2. Enter CD_DEFLATED for Name.
3. Load WORK.CD_POUNDS into its Cashflow Specification (see Figure 48.6).
4. Click OK to return to the Investment Analysis dialog box.
To deflate the values, follow these steps:
1. Select CD_DEFLATED from the portfolio.
2. Select Compute → Constant Dollars from the menu. This opens the Constant Dollar Calculation dialog box.
3. Clear the Variable Inflation List area.
4. Enter 3 for the Constant Inflation Rate.
5. Click Create Constant Dollar Equivalent to generate a constant dollar equivalent summary (see Figure 48.7).
You can save the deflated cashflow to a SAS data set for use in an internal rate of return analysis or breakeven analysis. Click Return to return to the Investment Analysis dialog box.
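A constant dollar calculation with a fixed inflation rate can be sketched as dividing each amount by the inflation accumulated since the base date. The Python sketch below is an illustration only. It assumes annual compounding, whole-year offsets from the base date in place of SAS dates, and a hypothetical function name; the product's exact day-count and compounding rules are not given in this section.

```python
def constant_dollars(cashflow, inflation_pct):
    """Deflate (years_from_base, amount) pairs at a constant annual inflation rate."""
    factor = 1 + inflation_pct / 100
    return [(years, amount / factor ** years) for years, amount in cashflow]


# With 3% inflation, 103 received one year out buys what 100 buys today
deflated = constant_dollars([(0, 100.0), (1, 103.0)], 3)
```

Entries at the base date are unchanged, and later entries shrink more the further out they fall.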
The following items are displayed:
Name holds the name of the investment for which you are computing the after-tax cashflow.
Federal Tax holds the federal tax rate (a percentage between 0% and 100%).
Local Tax holds the local tax rate (a percentage between 0% and 100%).
Combined Tax holds the effective tax rate from federal and local income taxes.
Create After Tax Cashflow becomes available when Combined Tax is not empty. Clicking Create After Tax Cashflow then fills the After Tax Cashflow area.
After Tax Cashflow fills when you click Create After Tax Cashflow. It holds a list of date-amount pairs where the amount is the amount retained after taxes for that date.
Print becomes available when you fill the after-tax cashflow. Clicking it sends the contents of the after-tax cashflow to the SAS session print device.
Save Data As becomes available when you fill the after-tax cashflow. Clicking it opens the Save Output Dataset dialog box where you can save the resulting cashflow (or portions thereof) as a SAS Dataset.
Return returns you to the Investment Analysis dialog box.
Currency Conversion
Having selected a generic cashflow from the Investment Analysis dialog box, to perform a currency conversion, select Compute → Currency Conversion from the Investment Analysis dialog box's menu bar. This opens the Currency Conversion dialog box displayed in Figure 48.9.
The following items are displayed:
Name holds the name of the investment to which you are applying the currency conversion.
From Currency holds the name of the currency the cashflow currently represents.
To Currency holds the name of the currency to which you wish to convert.
Exchange Rate holds the rate of exchange between the From Currency and the To Currency.
Apply Currency Conversion becomes available when you fill Exchange Rate. Clicking Apply Currency Conversion fills the Currency Conversion area.
Currency Conversion fills when you click Apply Currency Conversion. The schedule contains a row for each cashflow item with the following information:
  Date is a SAS date within the cashflow.
  The From Currency value is the amount in the original currency at that date.
  The To Currency value is the amount in the new currency at that date.
Print becomes available when you fill the Currency Conversion area. Clicking it sends the contents of the conversion table to the SAS session print device.
Save Data As becomes available when you fill the Currency Conversion area. Clicking it opens the Save Output Dataset dialog box where you can save the conversion table (or portions thereof) as a SAS Dataset.
Return returns you to the Investment Analysis dialog box.
The following items are displayed:
Name holds the name of the investment for which you are computing the constant dollars value.
Constant Inflation Rate holds the constant inflation rate (a percentage between 0% and 120%). This value is used if the Variable Inflation List area is empty.
Variable Inflation List holds date-rate pairs that describe how inflation varies over time. Each date is a SAS date, and the rate is a percentage between 0% and 120%. Each date refers to when that inflation rate begins. Right-clicking within the Variable Inflation List area reveals many helpful tools for managing date-rate pairs. If you assume a fixed inflation rate, just insert that rate in Constant Inflation Rate.
Dates holds the SAS date(s) at which you wish to compute the constant dollar equivalent. Right-clicking within the Dates area reveals many helpful tools for managing date lists.
Create Constant Dollar Equivalent becomes available when you enter inflation rate information. Clicking it fills the constant dollar equivalent summary with the computed constant dollar values.
Constant Dollar Equivalent Summary fills with a summary when you click Create Constant Dollar Equivalent. The first column lists the dates of the generic cashflow. The second column contains the constant dollar equivalent of the original generic cashflow item of that date.
Print becomes available when you fill the constant dollar equivalent summary. Clicking it sends the contents of the summary to the SAS session print device.
Save Data As becomes available when you fill the constant dollar equivalent summary. Clicking it opens the Save Output Dataset dialog box where you can save the summary (or portions thereof) as a SAS Dataset.
Return returns you to the Investment Analysis dialog box.
Chapter 49
Analyses
Contents
The Analyze Menu . . . . . 2765
Tasks . . . . . 2766
Performing Time Value Analysis . . . . . 2766
Computing an Internal Rate of Return . . . . . 2768
Performing a Benefit-Cost Ratio Analysis . . . . . 2769
Computing a Uniform Periodic Equivalent . . . . . 2770
Performing a Breakeven Analysis . . . . . 2771
Dialog Box Guide . . . . . 2773
Time Value Analysis . . . . . 2773
Uniform Periodic Equivalent . . . . . 2774
Internal Rate of Return . . . . . 2775
Benefit-Cost Ratio Analysis . . . . . 2776
Breakeven Analysis . . . . . 2777
Breakeven Graph . . . . . 2778
The Analyze menu offers the following options for use on applicable investments.
Time Value opens the Time Value Analysis dialog box. Time value analysis involves moving money through time across a defined minimum attractive rate of return (MARR) so that you can compare value at a consistent date. The MARR can be constant or variable over time.
Periodic Equivalent opens the Uniform Periodic Equivalent dialog box. Uniform periodic equivalent analysis determines the payment needed to convert a cashflow to uniform amounts over time, given a periodicity, a number of periods, and a MARR. This option helps when making comparisons where one alternative is uniform (such as renting) and another is not (such as buying).
Internal Rate of Return opens the Internal Rate of Return dialog box. The internal rate of return of a cashflow is the interest rate that makes the time value equal to 0. This calculation assumes uniform periodicity of the cashflow. It is particularly applicable where the choice of MARR would be difficult.
Benefit-Cost Ratio opens the Benefit-Cost Ratio Analysis dialog box. The benefit-cost ratio divides the time value of the benefits by the time value of the costs. For example, governments often use this analysis when deciding whether to commit to a public works project.
Breakeven Analysis opens the Breakeven Analysis dialog box. Breakeven analysis computes time values at various MARRs to compare, which can be advantageous when it is difficult to determine a MARR. This analysis can help you determine how the cashflow's profitability varies with your choice of MARR. A graph displaying the relationships between time value and MARR is also available.
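The definitions above translate directly into small calculations. The Python sketch below illustrates time value, internal rate of return, and benefit-cost ratio for a cashflow given as (period, amount) pairs. It is a simplified illustration, not the product's algorithm: it assumes a constant MARR per period, evenly spaced integer periods in place of SAS dates, a bisection search for the IRR, and hypothetical function names.

```python
def time_value(cashflow, marr, at_period=0):
    """Move every amount to at_period at a constant per-period MARR."""
    return sum(amount * (1 + marr) ** (at_period - t) for t, amount in cashflow)


def internal_rate_of_return(cashflow, lo=-0.99, hi=10.0, tol=1e-10):
    """Rate at which the time value is 0, found by bisection on [lo, hi]."""
    f_lo = time_value(cashflow, lo)
    if f_lo * time_value(cashflow, hi) > 0:
        raise ValueError("no sign change in the search interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = time_value(cashflow, mid)
        if f_lo * f_mid <= 0:
            hi = mid  # the root lies in the lower half
        else:
            lo, f_lo = mid, f_mid
    return (lo + hi) / 2


def benefit_cost_ratio(cashflow, marr):
    """Time value of the positive entries divided by that of the negative ones."""
    benefits = [(t, a) for t, a in cashflow if a > 0]
    costs = [(t, -a) for t, a in cashflow if a < 0]
    return time_value(benefits, marr) / time_value(costs, marr)
```

As a check, paying 1000 now and receiving 1100 one period later has an internal rate of return of 10%, and at a 10% MARR its time value is 0 and its benefit-cost ratio is 1.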
Tasks
To compute the time values of these investments, follow these steps:
1. Select both cashflows.
2. Select Analyze → Time Value. This opens the Time Value Analysis dialog box.
3. Enter the date 01JAN1998 into the Dates area.
4. Enter 13 for the Constant MARR.
5. Click Create Time Value Summary.
As shown in Figure 49.3, option 1 has a time value of $264,000.00 naturally on 01JAN1998. However, option 2 has a time value of $263,408.94, which is slightly less expensive.
The results displayed in Figure 49.4 indicate that the internal rates of return for investments 2, 4, and 5 are greater than 9%. Hence, each of these is acceptable.
7. Enter 12 for the Constant MARR. 8. Click Create Time Value Summary.
Figure 49.6 Computing a Uniform Periodic Equivalent
Figure 49.6 indicates that renting costs about $1,300 less each year. Hence, renting is more financially sound. Notice the periodic equivalent for renting is not $23,000. This is because the $23,000 per year does not account for the MARR.
7. Click Create Breakeven Analysis Summary to fill the Breakeven Analysis Summary area as displayed in Figure 49.7.
Figure 49.7 Performing a Breakeven Analysis
Click Graphics to view a plot displaying the relationship between time value and MARR.
Figure 49.8 Viewing a Breakeven Graph
As shown in Figure 49.8, renting is better if you want a MARR of 12%. However, if your MARR should drop to 10.5%, buying would be better. With a single investment, knowing where the graph has a time value of 0 tells the MARR when a venture switches from being profitable to being a loss. With multiple investments, knowing where the graphs for the various investments cross each other tells at what MARR a particular investment becomes more profitable than another.
The following items are displayed:
Analysis Specifications
  Dates holds the list of dates as of which to perform the time value analysis. Right-clicking within the Dates area reveals many helpful tools for managing date lists.
  Constant MARR holds the desired MARR for the time value analysis. This value is used if the MARR List area is empty.
  MARR List holds date-rate pairs that express your desired MARR as it changes over time. Each date refers to when that expected MARR begins. Right-clicking within the MARR List area reveals many helpful tools for managing date-rate pairs.
Create Time Value Summary becomes available when you adequately specify the analysis within the Analysis Specifications area. Clicking Create Time Value Summary then fills the Time Value Summary area.
Time Value Summary fills when you click Create Time Value Summary. The table contains a row for each date in the Dates area. The remainder of each row holds the time values at that date, one value for each investment selected.
Print becomes available when you fill the time value summary. Clicking it sends the contents of the summary to the SAS session print device.
Save Data As becomes available when you fill the time value summary. Clicking it opens the Save Output Dataset dialog box where you can save the summary (or portions thereof) as a SAS Dataset.
Return takes you back to the Investment Analysis dialog box.
The following items are displayed:
Analysis Specifications
  Start Date holds the date the uniform periodic equivalents begin.
  Number of Periods holds the number of uniform periodic equivalents.
  Interval holds how often the uniform periodic equivalents occur.
  Constant MARR holds the Minimum Attractive Rate of Return.
Create Periodic Equivalent Summary becomes available when you adequately fill the Analysis Specifications area. Clicking Create Periodic Equivalent Summary then fills the periodic equivalent summary.
Periodic Equivalent Summary fills with two columns when you click Create Periodic Equivalent Summary. The first column lists the investments selected. The second column lists the computed periodic equivalent amount.
Print becomes available when you fill the periodic equivalent summary. Clicking it sends the contents of the summary to the SAS session print device.
Save Data As becomes available when you generate the periodic equivalent summary. Clicking it opens the Save Output Dataset dialog box where you can save the summary (or portions thereof) as a SAS Dataset.
Return takes you back to the Investment Analysis dialog box.
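A uniform periodic equivalent can be illustrated with the standard capital recovery formula, which spreads a present value over a number of equal payments. The Python sketch below assumes payments at the end of each period and a MARR expressed per period as a fraction. These timing conventions and the function name are assumptions, and the dialog box may place the payments differently relative to Start Date.

```python
def uniform_periodic_equivalent(present_value, marr, n_periods):
    """Equal end-of-period payment whose time value equals present_value."""
    if marr == 0:
        return present_value / n_periods
    return present_value * marr / (1 - (1 + marr) ** -n_periods)
```

Because each payment is discounted at the MARR, the equivalent payment differs from a simple average of the cashflow, which is why a yearly rent and its periodic equivalent need not match.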
IRR Summary contains a row for each deposit. Each row holds:
  Name holds the name of the investment.
  IRR holds the internal rate of return for that investment.
  Interval holds the interest rate interval for that IRR.
Print becomes available when you fill the IRR summary. Clicking it sends the contents of the summary to the SAS session print device.
Save Data As opens the Save Output Dataset dialog box where you can save the IRR summary (or portions thereof) as a SAS data set.
Return takes you back to the Investment Analysis dialog box.
The following items are displayed:
Analysis Specifications
  Dates holds the dates as of which to compute the Benefit-Cost ratios.
  Constant MARR holds the desired MARR. This value is used if the MARR List area is empty.
  MARR List holds date-rate pairs that express your desired MARR as it changes over time. Each date refers to when that expected MARR begins. Right-clicking within the MARR List area reveals many helpful tools for managing date-rate pairs.
Create Benefit-Cost Ratio Summary becomes available when you adequately specify the analysis. Clicking Create Benefit-Cost Ratio Summary fills the benefit-cost ratio summary.
Benefit-Cost Ratio Summary fills when you click Create Benefit-Cost Ratio Summary. The area contains a row for each date in the Dates area. The remainder of each row holds the benefit-cost ratios at that date, one value for each investment selected.
Print becomes available when you fill the benefit-cost ratio summary. Clicking it sends the contents of the summary to the SAS session print device.
Save Data As becomes available when you generate the benefit-cost ratio summary. Clicking it opens the Save Output Dataset dialog box where you can save the summary (or portions thereof) as a SAS Dataset.
Return takes you back to the Investment Analysis dialog box.
Breakeven Analysis
Having selected a generic cashflow from the Investment Analysis dialog box, to perform a breakeven analysis, select Analyze → Breakeven Analysis from the Investment Analysis dialog box's menu bar. This opens the Breakeven Analysis dialog box displayed in Figure 49.13.
Figure 49.13 Breakeven Analysis Dialog Box
The following items are displayed:
Analysis Specification
  Analysis holds the analysis type. Only Time Value is currently available.
  Date holds the date for which you perform this analysis.
  Variable holds the variable upon which the breakeven analysis will vary. Only MARR is currently available.
  Value holds the desired rate upon which to center the analysis.
  +/- holds the maximum deviation from the Value to consider.
  Increment by holds the increment by which the analysis is calculated.
Create Breakeven Analysis Summary becomes available when you adequately specify the analysis. Clicking Create Breakeven Analysis Summary then fills the Breakeven Analysis Summary area.
Breakeven Analysis Summary fills when you click Create Breakeven Analysis Summary. The schedule contains a row for each MARR and date.
Graphics becomes available when you fill the Breakeven Analysis Summary area. Clicking it opens the Breakeven Graph, which represents the time value versus MARR.
Print becomes available when you fill the breakeven analysis summary. Clicking it sends the contents of the summary to the SAS session print device.
Save Data As becomes available when you generate the breakeven analysis summary. Clicking it opens the Save Output Dataset dialog box where you can save the summary (or portions thereof) as a SAS Dataset.
Return takes you back to the Investment Analysis dialog box.
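The Value, +/-, and Increment by fields describe a grid of MARRs at which the time value is evaluated. The Python sketch below shows one way such a table could be produced for a single cashflow. It assumes integer periods in place of SAS dates, a time value taken at period 0, and MARRs given in percent; the function name is hypothetical and the code is not taken from the product.

```python
def breakeven_table(cashflow, center_pct, deviation_pct, increment_pct):
    """Rows of (MARR in percent, time value at period 0) over the MARR grid."""
    steps = int(round(deviation_pct / increment_pct))
    rows = []
    for i in range(-steps, steps + 1):
        marr_pct = center_pct + i * increment_pct
        r = marr_pct / 100
        value = sum(amount / (1 + r) ** t for t, amount in cashflow)
        rows.append((marr_pct, value))
    return rows


# Pay 1000 now, receive 1100 in one period: breaks even at a 10% MARR
table = breakeven_table([(0, -1000.0), (1, 1100.0)], 10, 2, 1)
```

Scanning the rows for a change of sign in the time value locates the breakeven MARR, which is the point the Breakeven Graph shows as a crossing of the horizontal axis.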
Breakeven Graph
Suppose you perform a breakeven analysis in the Breakeven Analysis dialog box. Once you create the breakeven analysis summary, you can click the Graphics button to open the Breakeven Graph dialog box displayed in Figure 49.14.
The following item is displayed: Return takes you back to the Breakeven Analysis dialog box.
Chapter 50
Details
Contents
Investments and Data Sets . . . . . 2781
Saving Output to SAS Data Sets . . . . . 2781
Loading a SAS Data Set into a List . . . . . 2783
Saving Data from a List to a SAS Data Set . . . . . 2783
Right Mouse Button Options . . . . . 2784
Depreciation Methods . . . . . 2785
Straight Line (SL) . . . . . 2785
Sum-of-Years Digits . . . . . 2785
Declining Balance (DB) . . . . . 2787
Rate Information . . . . . 2788
The Tools Menu . . . . . 2788
Dialog Box Guide . . . . . 2789
Minimum Attractive Rate of Return (MARR) . . . . . 2789
Income Tax Specification . . . . . 2790
Inflation Specification . . . . . 2791
Reference . . . . . 2792
The following items are displayed:
Dataset Name holds the SAS data set name to which you want to save.
Browse opens the standard SAS Open dialog box, which enables you to select an existing SAS data set to overwrite.
Dataset Label holds the SAS data set's label.
Dataset Variables organizes variables. The variables listed in the Selected area will be included in the SAS data set. You can select variables one at a time, by clicking the single right-arrow after each selection to move it to the Selected area. If the desired SAS data set has many variables you want to save, it may be simpler to follow these steps:
  1. Click the double right-arrow to select all available variables.
  2. Remove any unwanted variable by selecting it from the Selected area and clicking the single left-arrow.
The double left-arrow removes all selected variables from the proposed SAS data set. The up and down arrows below the Available and Selected boxes enable you to scroll up and down the list of variables in their respective boxes.
Save Dataset attempts to save the SAS data set. If the SAS data set name exists, you are asked if you want to replace the existing SAS data set, append to the existing SAS data set, or cancel the current save attempt. You then return to this dialog box ready to create another SAS data set to save.
The following items are displayed:
Dataset Name holds the name of the SAS data set that you want to load.
Browse opens the standard SAS Open dialog box, which aids in finding a SAS data set to load. If there is a Date variable in the SAS data set, Investment Analysis loads it into the list. If there is no Date variable, it loads the first available time-formatted variable. If an amount or rate variable is needed, Investment Analysis searches the SAS data set for an Amount or Rate variable to use. Otherwise, it takes the first numeric variable that is not used by the Date variable.
Dataset Label holds a SAS data set label.
OK attempts to load the SAS data set specified in Dataset Name. If the specified SAS data set exists, clicking OK returns you to the calling dialog box with the selected SAS data set filling the list. If the specified SAS data set does not exist and you click OK, you receive an error message and no SAS data set is loaded.
Cancel returns you to the calling dialog box without loading a SAS data set.
The following items are displayed:
Dataset Name holds the name of the SAS data set to which you want to save.
Browse opens the standard SAS Save As dialog box, which enables you to find an existing SAS data set to overwrite.
Dataset Label holds a user-defined description to be saved as the label of the SAS data set.
OK saves the current data to the SAS data set specified in Dataset Name. If the specified SAS data set does not already exist, clicking OK saves the SAS data set and returns you to the calling dialog box. If the specified SAS data set already exists, clicking OK warns you and enables you to replace the old SAS data set with the new one or cancel the save attempt.
Cancel aborts the save process. Clicking Cancel returns you to the calling dialog box without attempting to save.
Add creates a blank row.
Delete removes any currently selected row.
Copy duplicates the currently selected row.
Sort arranges the rows in chronological order according to the date variable.
Clear empties the table of all rows.
Save opens the Save Dataset dialog box, where you can save all rows to a SAS data set for later use.
Load opens the Load Dataset dialog box, where you select a SAS data set to fill the rows.
If you want to perform one of these actions on a collection of rows, you must select the rows before right-clicking. To select an adjacent list of rows, click the first row, hold down SHIFT, and click the final row. After the list of rows is selected, you can release the SHIFT key.
Depreciation Methods
Suppose an asset's price is $20,000 and it has a salvage value of $5,000 in five years. The following sections describe various methods to quantify the depreciation.
Sum-of-Years Digits
An asset often loses more of its value early in its lifetime, so a method that exhibits this dynamic is desirable. Assume an asset depreciates from price $P$ to salvage value $S$ in $N$ years. First compute the sum of years as $T = 1 + 2 + \cdots + N$. The depreciation for the years after the asset's purchase is:
Table 50.1 Annual Depreciation

Year    Annual Depreciation
1       $\frac{N}{T}(P - S)$
2       $\frac{N-1}{T}(P - S)$
3       $\frac{N-2}{T}(P - S)$
...     ...
N       $\frac{1}{T}(P - S)$
For the $i$th year of the asset's use, the annual depreciation is:

$$\frac{N + 1 - i}{T}(P - S)$$

For our example, $N = 5$ and the sum of years is $T = 1 + 2 + 3 + 4 + 5 = 15$. The depreciation during the first year is

$$(\$20{,}000 - \$5{,}000)\,\frac{5}{15} = \$5{,}000$$
Table 50.2 describes how Sum-of-Years Digits would depreciate the asset.
Table 50.2 Sum-of-Years Example

Year    Depreciation                             Year-End Value
1       ($20,000 − $5,000) × 5/15 = $5,000       $15,000.00
2       ($20,000 − $5,000) × 4/15 = $4,000       $11,000.00
3       ($20,000 − $5,000) × 3/15 = $3,000       $8,000.00
4       ($20,000 − $5,000) × 2/15 = $2,000       $6,000.00
5       ($20,000 − $5,000) × 1/15 = $1,000       $5,000.00
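Subtracting the five years of depreciation from the price recovers the salvage value:

$$
\begin{aligned}
S &= P - (\text{5 years' depreciation}) \\
  &= P - \left( \frac{5}{15}(P - S) + \frac{4}{15}(P - S) + \frac{3}{15}(P - S) + \frac{2}{15}(P - S) + \frac{1}{15}(P - S) \right) \\
  &= P - (P - S)
\end{aligned}
$$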
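The sum-of-years calculation above can be sketched in a few lines of Python. This is only an illustration of the formula $\frac{N+1-i}{T}(P-S)$; the function name `sum_of_years_schedule` and its parameter names are hypothetical and are not part of Investment Analysis.

```python
def sum_of_years_schedule(price, salvage, years):
    """Return (year, depreciation, year_end_value) rows for sum-of-years digits."""
    total = years * (years + 1) // 2  # T = 1 + 2 + ... + N
    value = price
    rows = []
    for i in range(1, years + 1):
        # Depreciation in year i is (N + 1 - i) / T * (P - S).
        depreciation = (years + 1 - i) * (price - salvage) / total
        value -= depreciation
        rows.append((i, depreciation, value))
    return rows


# The running example: P = $20,000, S = $5,000, N = 5.
schedule = sum_of_years_schedule(20_000, 5_000, 5)
for year, depreciation, value in schedule:
    print(f"{year}  {depreciation:10,.2f}  {value:10,.2f}")
```

The last row's year-end value equals the salvage value, which matches the check that the five years of depreciation sum to $P - S$.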
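With a declining balance factor of 1/N = 1/5, each year's depreciation is one-fifth of the asset's value at the start of that year:

Year    Depreciation                    Year-End Value
1       $20,000.00 / 5 = $4,000.00      $16,000.00
2       $16,000.00 / 5 = $3,200.00      $12,800.00
3       $12,800.00 / 5 = $2,560.00      $10,240.00
4       $10,240.00 / 5 = $2,048.00      $8,192.00
5       $8,192.00 / 5 = $1,638.40       $6,553.60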
DB Factor
You could also accelerate the depreciation by increasing the factor (and hence the rate) at which depreciation occurs. Other commonly accepted depreciation rates are 200% (called double declining balance, because the depreciation factor becomes $\frac{2}{N}$) and 150%. Investment Analysis enables you to choose among these three factors for declining balance: $\frac{2}{N}$ (with 200% depreciation), $\frac{1.5}{N}$ (with 150%), and $\frac{1}{N}$ (with 100%).
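As a rough sketch of how the factor changes the schedule, the following Python function computes year-end values under declining balance. The function name `declining_balance_values` and the `factor` argument (1, 1.5, or 2, divided by N to give the yearly rate) are illustrative assumptions, not names used by Investment Analysis.

```python
def declining_balance_values(price, years, factor=1.0):
    """Return [V(0), V(1), ..., V(N)] for declining balance at rate factor / N."""
    rate = factor / years
    values = [price]
    for _ in range(years):
        current = values[-1]
        # Each year the asset loses a fixed fraction of its current value.
        values.append(current - rate * current)
    return values


# 100% declining balance on the running example: P = $20,000, N = 5.
values = declining_balance_values(20_000, 5, factor=1.0)
# 200% (double declining balance) depreciates faster.
double_values = declining_balance_values(20_000, 5, factor=2.0)
print([round(v, 2) for v in values])
print([round(v, 2) for v in double_values])
```

Note that neither schedule ends at the $5,000 salvage value by itself, which is why an adjustment to force the salvage value is needed.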
If $V(N) > S$ according to the usual calculation for $V(N)$, redefine $V(N)$ to equal $S$. If $V(i) < S$ according to the usual calculation for $V(i)$ for some $i$ (and hence for all subsequent $V(i)$ values), you can redefine all such $V(i)$ to equal $S$. This alteration to declining balance forces the depreciated value of the asset after $N$ years to be $S$ and keeps $V(i)$ no less than $S$.
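The adjustment just described amounts to two small operations on a list of year-end values. The sketch below assumes the values are already computed and stored as a Python list indexed by year; the function name `force_salvage` is hypothetical.

```python
def force_salvage(values, salvage):
    """Adjust declining balance values [V(0), ..., V(N)] to respect salvage value S."""
    # Keep every V(i) no less than S.
    adjusted = [max(v, salvage) for v in values]
    # Force the value after N years to equal S, even if DB left it above S.
    adjusted[-1] = salvage
    return adjusted


# 100% declining balance leaves V(5) above S = $5,000, so only V(5) changes.
slow = force_salvage([20000.0, 16000.0, 12800.0, 10240.0, 8192.0, 6553.6], 5000.0)
# 200% declining balance drops below S in year 3, so V(3), V(4), V(5) become S.
fast = force_salvage([20000.0, 12000.0, 7200.0, 4320.0, 2592.0, 1555.2], 5000.0)
print(slow)
print(fast)
```

In the first case the final-year depreciation jumps to $8,192.00 - $5,000.00 = $3,192.00, which illustrates why a large gap between $V(N)$ and $S$ makes this adjustment unattractive.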
Conversion to SL
The second (and preferred) way to force declining balance to assume the salvage value is by conversion to straight line. If $V(N) > S$, the first way redefines $V(N)$ to equal $S$; you can think of this as converting to the straight line method for the last time step. If the $V(N)$ value supplied by DB is appreciably larger than $S$, then the depreciation in the final year would be unrealistically large. An alternative is to compute the DB and SL steps at each time step and take whichever step gives the larger depreciation (unless DB drops below the salvage value). After SL assumes a larger depreciation, it continues to be larger over the life of the asset. SL forces the value at the final time to equal the salvage value. As an algorithm, this looks like the following statements:
V(0) = P;
for i = 1 to N
    if DB step > SL step from (i, V(i))
        take a DB step to make V(i);
    else
        break;
for j = i to N
    take a SL step to make V(j);
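One way to make the statements above concrete is the following Python sketch. It rests on assumptions that the text does not spell out: the DB step is taken to be the rate (factor divided by N) times the current value, the SL step is the remaining depreciable amount divided by the remaining years, and a DB step that would fall below the salvage value triggers the switch to SL. The function name `db_with_sl_conversion` is hypothetical.

```python
def db_with_sl_conversion(price, salvage, years, factor=1.0):
    """Return [V(0), ..., V(N)] for declining balance with conversion to straight line."""
    rate = factor / years
    values = [price]
    for i in range(1, years + 1):
        current = values[-1]
        remaining = years - i + 1
        db_step = rate * current                     # assumed DB step
        sl_step = (current - salvage) / remaining    # assumed SL step
        if db_step > sl_step and current - db_step >= salvage:
            values.append(current - db_step)
        else:
            # Switch to SL and keep the same step for all remaining years.
            for _ in range(remaining):
                current -= sl_step
                values.append(current)
            break
    return values


# Running example with 100% and 200% declining balance.
converted = db_with_sl_conversion(20_000, 5_000, 5, factor=1.0)
converted_double = db_with_sl_conversion(20_000, 5_000, 5, factor=2.0)
print([round(v, 2) for v in converted])
print([round(v, 2) for v in converted_double])
```

In the final year the SL step equals the whole remaining depreciable amount, so a DB step can never be both larger and stay at or above salvage; the schedule therefore always finishes on a straight line step and ends at the salvage value.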
The MACRS, which is discussed in the section that describes the Depreciation Table window, is actually a variation on the declining balance method with conversion to the straight line method.
Rate Information
The Tools → Define Rates menu offers the following options.
MARR opens the Minimum Attractive Rate of Return (MARR) dialog box.
Income Tax Rate opens the Income Tax Specification dialog box.
Inflation opens the Inflation Specification dialog box.
Name holds the name that you assign to the MARR specification. This must be a valid SAS name.
Constant MARR holds the numeric value that you choose to be the constant MARR. This value is used if the MARR List table editor is empty.
MARR List holds date-MARR pairs, where the date refers to when the particular MARR value begins. Each date is a SAS date.
OK returns you to the Investment Analysis dialog box. Clicking it causes the preceding MARR specification to be assumed when you do not specify MARR rates in a dialog box that needs MARR rates.
Cancel returns you to the Investment Analysis dialog box, discarding any work that was done in the MARR dialog box.
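To illustrate how a list of date-MARR pairs can be read, the following Python sketch looks up the MARR in effect on a given date. It is not the Investment Analysis implementation: the function name `marr_in_effect` is hypothetical, and falling back to the constant MARR for dates before the first listed date is an assumption (the text states only that the constant is used when the list is empty).

```python
from bisect import bisect_right
from datetime import date


def marr_in_effect(on_date, marr_list, constant_marr):
    """Return the MARR that applies on on_date.

    marr_list holds (start_date, marr) pairs; each MARR begins on its date.
    """
    if not marr_list:
        return constant_marr  # constant MARR is used when the list is empty
    ordered = sorted(marr_list)
    starts = [start for start, _ in ordered]
    position = bisect_right(starts, on_date)
    if position == 0:
        # Assumption: before the first listed date, use the constant MARR.
        return constant_marr
    return ordered[position - 1][1]


pairs = [(date(2008, 1, 1), 8.0), (date(2009, 1, 1), 9.5)]
print(marr_in_effect(date(2008, 6, 15), pairs, 7.0))
print(marr_in_effect(date(2009, 3, 1), pairs, 7.0))
print(marr_in_effect(date(2008, 6, 15), [], 7.0))
```

The same lookup pattern would apply to any list in which each entry takes effect on its date and stays in effect until the next entry begins.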
Name holds the name that you assign to the Income Tax specification. This must be a valid SAS name.
Federal Tax holds the numeric value that you want to be the constant Federal Tax.
Local Tax holds the numeric value that you want to be the constant Local Tax.
Taxrate List holds date-Income Tax triples, where the date refers to when the particular Income Tax value begins. Each date is a SAS date, and the value is a percentage between 0% and 100%.
OK returns you to the Investment Analysis dialog box. Clicking it causes the preceding income tax specification to be the default income tax rates when using the After Tax Cashflow Calculation dialog box.
Cancel returns you to the Investment Analysis dialog box, discarding any changes that were made since this dialog box was opened.
Inflation Specification
Selecting Tools → Define Rates → Inflation from the Investment Analysis dialog box menu bar opens the Inflation Specification dialog box displayed in Figure 50.8.
Name holds the name that you assign to the Inflation specification. This must be a valid SAS name.
Constant Rate holds the numeric value that you want to be the constant inflation rate. This value is used if the Inflation Rate List table editor is empty.
Inflation Rate List holds date-rate pairs, where the date refers to when the particular inflation rate begins. Each date is a SAS date, and the rate is a percentage between 0% and 120%.
OK returns you to the Investment Analysis dialog box. Clicking it causes the preceding inflation specification to be assumed when you use the Constant Dollar Calculation dialog box and do not specify inflation rates.
Cancel returns you to the Investment Analysis dialog box, discarding any changes that were made since this dialog box was opened.
Reference
Newnan, Donald G. and Lavelle, Jerome P. (1998), Engineering Economic Analysis, Austin, Texas: Engineering Press.
Subject Index
@CRSPDB Date Informats SASECRSP engine, 2209 @CRSPDR Date Informats SASECRSP engine, 2209 @CRSPDT Date Informats SASECRSP engine, 2209 2SLS estimation method, see two-stage least squares 3SLS estimation method, see three-stage least squares add factors, see adjustments additive model ARIMA model, 212 additive Winters method seasonal forecasting, 803 additive-invertible region smoothing weights, 2671 ADDWINTERS method FORECAST procedure, 803 adjacency graph MODEL procedure, 1178 adjust operators, 754 adjustable rate mortgage, see LOAN procedure LOAN procedure, 828 adjusted R squared MODEL procedure, 1027 adjusted R-square statistics of t, 1819, 2689 adjustments, 2554, 2667 add factors, 2522 forecasting models, 2522 specifying, 2522 After Tax Cashow Calculation, 2754 AGGREGATE method EXPAND procedure, 740 aggregation of time series data, 721, 724 aggregation of time series EXPAND procedure, 721, 724 AIC, see Akaike information criterion, see Akaikes information criterion Akaike Information Criterion VARMAX procedure, 1957 Akaike information criterion AIC, 251 ARIMA procedure, 251 AUTOREG procedure, 370 used to select state space models, 1591 Akaike information criterion corrected AUTOREG procedure, 370 Akaikes information criterion AIC, 2689 statistics of t, 2689 alignment of dates, 2582 time intervals, 132 alignment of dates, 142, 2582 Almon lag polynomials, see polynomial distributed lags MODEL procedure, 1102 alternatives to DIF function, 108 LAG function, 108 Amemiyas prediction criterion statistics of t, 2689 Amemiyas R-square statistics of t, 1820, 2689 AMO, 59 amortization schedule LOAN procedure, 856 analyzing models MODEL procedure, 1174 and goal seeking ordinary differential equations (ODEs), 1074 and state space models stationarity, 1571 and tests for autocorrelation lagged dependent variables, 326 and the OUTPUT statement output data sets, 83 Annuity, see Uniform Periodic Equivalent AR initial conditions conditional least squares, 1091 Hildreth-Lu, 1091 
maximum likelihood, 1091 unconditional least squares, 1091 Yule-Walker, 1091 ARCH model AUTOREG procedure, 314 autoregressive conditional heteroscedasticity, 314 ARIMA model additive model, 212 ARIMA procedure, 190
autoregressive integrated moving-average model, 190, 2680 Box-Jenkins model, 190 factored model, 212 multiplicative model, 212 notation for, 207 seasonal model, 212 simulating, 2560, 2653 subset model, 212 ARIMA model specication, 2557, 2589 ARIMA models forecasting models, 2465 specifying, 2465 ARIMA procedure Akaike information criterion, 251 ARIMA model, 190 ARIMAX model, 190, 213 ARMA model, 190 autocorrelations, 193 autoregressive parameters, 256 BY groups, 227 conditional forecasts, 257 condence limits, 257 correlation plots, 193 cross-correlation function, 240 data requirements, 221 differencing, 210, 247, 254 factored model, 212 nite memory forecasts, 258 forecasting, 257, 259 Gauss-Marquardt method, 250 ID variables, 259 innite memory forecasts, 257 input series, 213 interaction effects, 217 intervention model, 213, 216, 219, 298 inverse autocorrelation function, 239 invertibility, 255 Iterative Outlier Detection, 307 log transformations, 258 Marquardt method, 250 Model Identication, 300 moving-average parameters, 256 naming model parameters, 256 ODS graph names, 276 ODS Graphics, 224 Outlier Detection, 305 output data sets, 262264, 267, 268 output table names, 272 predicted values, 257 prewhitening, 246, 247 printed output, 269 rational transfer functions, 219
regression model with ARMA errors, 213, 214 residuals, 257 Schwarz Bayesian criterion, 251 seasonal model, 212 stationarity, 194 subset model, 212 syntax, 221 time intervals, 259 transfer function model, 213, 218, 251 unconditional forecasts, 258 ARIMA process specication, 2560 ARIMAX model ARIMA procedure, 190, 213 ARIMAX models and design matrix, 217 ARMA model ARIMA procedure, 190 autoregressive moving-average model, 190 MODEL procedure, 1088 notation for, 207 as time ID observation numbers, 2443 asymptotic distribution of impulse response functions VARMAX procedure, 1943, 1951 asymptotic distribution of the parameter estimation VARMAX procedure, 1951 at annual rates percent change calculations, 110 attributes DATASOURCE procedure, 520 attributes of variables DATASOURCE procedure, 545 audit trail, 2582 augmented Dickey-Fuller tests, 231, 246 autocorrelation, 1058 autocorrelation tests, 1058 Durbin-Watson test, 343 Godfrey Lagrange test, 1058 Godfreys test, 344 autocorrelations ARIMA procedure, 193 multivariate, 1573 plotting, 193 prediction errors, 2430 series, 2495 automatic forecasting FORECAST procedure, 774 STATESPACE procedure, 1568 automatic generation forecasting models, 2396 automatic inclusion of
interventions, 2582 automatic model selection criterion, 2610 options, 2568 automatic selection forecasting models, 2459 AUTOREG procedure Akaike information criterion, 370 Akaike information criterion corrected, 370 ARCH model, 314 autoregressive error correction, 316 BY groups, 340 Cholesky root, 359 Cochrane-Orcutt method, 361 conditional variance, 383 condence limits, 355 dual quasi-Newton method, 367 Durbin h test, 326 Durbin t test, 326 Durbin-Watson test, 324 EGARCH model, 335 EGLS method, 361 estimation methods, 357 factored model, 329 GARCH model, 314 GARCH-M model, 335 Gauss-Marquardt method, 360 generalized Durbin-Watson tests, 324 heteroscedasticity, 329 Hildreth-Lu method, 362 IGARCH model, 335 Kalman lter, 360 lagged dependent variables, 326 maximum likelihood method, 362 nonlinear least-squares, 362 ODS graph names, 388 output data sets, 384 output table names, 386 Prais-Winsten estimates, 361 predicted values, 355, 381, 382 printed output, 385 quasi-Newton method, 342 random walk model, 397 residuals, 356 Schwarz Bayesian criterion, 370 serial correlation correction, 316 stepwise autoregression, 327 structural predictions, 381 subset model, 329 Toeplitz matrix, 358 trust region method, 342 two-step full transform method, 361 Yule-Walker equations, 358 Yule-Walker estimates, 357
autoregressive conditional heteroscedasticity, see ARCH model autoregressive error correction AUTOREG procedure, 316 autoregressive integrated moving-average model, see ARIMA model, see ARIMA model autoregressive models FORECAST procedure, 797 MODEL procedure, 1088 autoregressive moving-average model, see ARMA model autoregressive parameters ARIMA procedure, 256 auxiliary data sets DATASOURCE procedure, 520 auxiliary equations, 1074 MODEL procedure, 1074 balance of payment statistics data les, see DATASOURCE procedure balloon payment mortgage, see LOAN procedure LOAN procedure, 828 bandwidth functions, 1013 Base SAS software, 48 Basmann test SYSLIN procedure, 1639, 1654 batch mode, see unattended mode Bayesian vector autoregressive models VARMAX procedure, 1904, 1947 BEA data les, see DATASOURCE procedure BEA national income and product accounts PC Format DATASOURCE procedure, 590 BEA S-pages, see DATASOURCE procedure Benet-Cost Ratio Analysis, 2769 between between estimators, 1291 Between Estimators PANEL procedure, 1291 between estimators, 1291 between, 1291 between levels and rates interpolation, 125 between stocks and ows interpolation, 125 BIC, see Schwarz Bayesian information criterion block structure MODEL procedure, 1178 BLS consumer price index surveys DATASOURCE procedure, 591 BLS data les, see DATASOURCE procedure BLS national employment, hours, and earnings survey DATASOURCE procedure, 592
BLS producer price index survey DATASOURCE procedure, 591 BLS state and area employment, hours, and earnings survey DATASOURCE procedure, 593 Bond, 2722 BOPS data le DATASOURCE procedure, 609 boundaries smoothing weights, 2671 bounds on parameter estimates, 644, 975, 1386 BOUNDS statement, 644, 975, 1386 Box Cox transformations, 2667 Box Cox transformation, see transformations Box-Cox transformation BOXCOXAR macro, 150 Box-Jenkins model, see ARIMA model BOXCOXAR macro Box-Cox transformation, 150 output data sets, 151 SAS macros, 150 break even analysis LOAN procedure, 852 Breakeven Analysis, 2771 Breusch-Pagan test, 1050 heteroscedasticity tests, 1050 Brown smoothing model, see double exponential smoothing Bureau of Economic Analysis data les, see DATASOURCE procedure Bureau of Labor Statistics data les, see DATASOURCE procedure buydown rate loans, see LOAN procedure LOAN procedure, 828 BY groups ARIMA procedure, 227 AUTOREG procedure, 340 COUNTREG procedure, 491 cross-sectional dimensions and, 79 ESM procedure, 689 EXPAND procedure, 731 FORECAST procedure, 795 MDC procedure, 891 PANEL procedure, 1271 PDLREG procedure, 1355 SIMILARITY procedure, 1450 SIMLIN procedure, 1516 SPECTRA procedure, 1546 STATESPACE procedure, 1586 SYSLIN procedure, 1637 TIMESERIES procedure, 1688 TSCSREG procedure, 1735 UCM procedure, 1760
X11 procedure, 2046 X12 procedure, 2111 BY groups and time series cross-sectional form, 79 calculation of leads, 112 calculations smoothing models, 2669 calendar calculations functions for, 95, 143 interval functions and, 104 time intervals and, 104 calendar calculations and INTCK function, 104 INTNX function, 104 time intervals, 104 calendar functions and date values, 96 calendar variables, 95 computing dates from, 96 computing from dates, 96 computing from datetime values, 98 canonical correlation analysis for selection of state space models, 1570, 1593 STATESPACE procedure, 1570, 1593 Cashow, see Generic Cashow CATALOG procedure, 49 SAS catalogs, 49 Cauchy distribution estimation example, 1219 examples, 1219 CDT (COMPUTAB data table) COMPUTAB procedure, 454 ceiling of time intervals, 102 Censored Regression Models QLIM procedure, 1399 Census X-11 method, see X11 procedure Census X-11 methodology X11 procedure, 2059 Census X-12 method, see X12 procedure Center for Research in Security Prices data les, see DATASOURCE procedure centered moving time window operators, 745, 746 change vector, 1027 changes in trend forecasting models, 2530 changing by interpolation frequency, 125, 721, 733 periodicity, 125, 721 sampling frequency, 125
changing periodicity EXPAND procedure, 125 time series data, 125, 721 character functions, 50 character variables MODEL procedure, 1155 CHART procedure, 49 histograms, 49 checking data periodicity INTNX function, 103 time intervals, 103 Chirp-Z algorithm SPECTRA procedure, 1549 choice of instrumental variables, 1084 Cholesky root AUTOREG procedure, 359 Chow test, 343, 344, 375 Chow test for structural change, 343 Chow tests, 1081 MODEL procedure, 1081 CITIBASE format DATASOURCE procedure, 523 CITIBASE old format DATASOURCE procedure, 594 CITIBASE PC format DATASOURCE procedure, 595 classical decomposition operators, 749 Cochrane-Orcutt method AUTOREG procedure, 361 coherency cross-spectral analysis, 1553 coherency of cross-spectrum SPECTRA procedure, 1553 cointegration VARMAX procedure, 1959 cointegration test, 345, 377 cointegration testing VARMAX procedure, 1902, 1963 collinearity diagnostics MODEL procedure, 1032, 1041 column blocks COMPUTAB procedure, 455 column selection COMPUTAB procedure, 452, 453 COLxxxxx: label COMPUTAB procedure, 446 combination models forecasting models, 2482 specifying, 2482 combined seasonality test, 2076, 2136 combined with cross-sectional dimension interleaved time series, 82
combined with interleaved time series cross-sectional dimensions, 82 combining forecasts, 2591, 2666 combining time series data sets, 119 Command Reference, 2545 common trends VARMAX procedure, 1959 common trends testing VARMAX procedure, 1903, 1961 COMPARE procedure, 49 comparing SAS data sets, 49 comparing forecasting models, 2505, 2605 comparing forecasting models, 2505, 2605 comparing loans LOAN procedure, 835, 852, 856 comparing SAS data sets, see COMPARE procedure compiler listing MODEL procedure, 1171 COMPUSTAT data les, see DATASOURCE procedure DATASOURCE procedure, 595 COMPUSTAT IBM 360/370 general format 48 quarter les DATASOURCE procedure, 597 COMPUSTAT IBM 360/370 general format annual les DATASOURCE procedure, 596 COMPUSTAT universal character format 48 quarter les DATASOURCE procedure, 599 COMPUSTAT universal character format annual les DATASOURCE procedure, 598 COMPUTAB procedure CDT (COMPUTAB data table), 454 column blocks, 455 column selection, 452, 453 COLxxxxx: label, 446 consolidation tables, 446 controlling row and column block execution, 453 input block, 455 missing values, 457 order of calculations, 451 output data sets, 457 program ow, 448 programming statements, 445 reserved words, 456 row blocks, 456 ROWxxxxx: label, 446 table cells, direct access to, 456 computational details
VARMAX procedure, 2001 computing calendar variables from datetime values, 98 computing ceiling of intervals INTNX function, 102 computing dates from calendar variables, 96 computing datetime values from date values, 97 computing ending date of intervals INTNX function, 101 computing from calendar variables datetime values, 97 computing from dates calendar variables, 96 computing from datetime values calendar variables, 98 date values, 97 time variables, 98 computing from time variables datetime values, 97 computing lags RETAIN statement, 108 computing midpoint date of intervals INTNX function, 101 computing time variables from datetime values, 98 computing widths of intervals INTNX function, 101 concatenated data set, 2410 concentrated likelihood Hessian, 1021 conditional forecasts ARIMA procedure, 257 conditional least squares AR initial conditions, 1091 MA Initial Conditions, 1092 conditional logit model MDC procedure, 870, 871, 904 conditional t distribution GARCH model, 367 conditional variance AUTOREG procedure, 383 predicted values, 383 predicting, 383 condence limits, 2436 ARIMA procedure, 257 AUTOREG procedure, 355 FORECAST procedure, 807 forecasts, 2436 PDLREG procedure, 1359 STATESPACE procedure, 1601 VARMAX procedure, 1988 consolidation tables
COMPUTAB procedure, 446 Constant Dollar Calculation, 2758 constrained estimation heteroscedasticity models, 351 Consumer Price Index Surveys, see DATASOURCE procedure contemporaneous correlation of errors across equations, 1649 contents of SAS data sets, 49 CONTENTS procedure, 49 SASECRSP engine, 2194 SASEFAME engine, 2291 continuous compounding LOAN procedure, 850 contrasted with ow variables stocks, 724 contrasted with ows or rates levels, 724 contrasted with missing values omitted observations, 78 contrasted with omitted observations missing observations, 78 missing values, 78 contrasted with stock variables ows, 724 contrasted with stocks or levels rates, 724 control charts, 56 control key for multiple selections, 2398 control variables MODEL procedure, 1153 controlling row and column block execution COMPUTAB procedure, 453 controlling starting values MODEL procedure, 1034 convergence criteria MODEL procedure, 1028 convergence problems VARMAX procedure, 2001 conversion methods EXPAND procedure, 739 convert option SASEFAME engine, 2291 Converting Dates Using the CRSP Date Functions SASECRSP engine, 2208 converting frequency of time series data, 721 COPY procedure, 49 copying SAS data sets, 49 CORR procedure, 49
corrected sum of squares statistics of t, 2689 correlation plots ARIMA procedure, 193 cospectrum estimate cross-spectral analysis, 1553 SPECTRA procedure, 1553 counting time intervals, 99, 102 counting time intervals INTCK function, 102 COUNTREG procedure bounds on parameter estimates, 490 BY groups, 491 output table names, 508 restrictions on parameter estimates, 493 syntax, 487 covariance estimates GARCH model, 343 Covariance of GMM estimators, 1015 covariance of the parameter estimates, 1008 covariance stationarity VARMAX procedure, 1983 covariates heteroscedasticity models, 350, 1390 CPORT procedure, 49 CPU requirements VARMAX procedure, 2002 creating time ID variable, 2439 creating a FAME view, see SASEFAME engine creating a Haver view, see SASEHAVR engine creating from Model Viewer HTML, 2618 creating from Time Series Viewer HTML, 2657 criterion automatic model selection, 2610 cross sectional dimensions represented by different series, 79 cross sections DATASOURCE procedure, 527, 529, 532, 543 cross-correlation function ARIMA procedure, 240 cross-equation covariance matrix MODEL procedure, 1026 seemingly unrelated regression, 1011 cross-periodogram cross-spectral analysis, 1542, 1553 SPECTRA procedure, 1553 cross-reference MODEL procedure, 1170 cross-sectional dimensions, 79
combined with interleaved time series, 82 ID variables for, 79 represented with BY groups, 79 transposing time series, 121 cross-sectional dimensions and BY groups, 79 cross-spectral analysis coherency, 1553 cospectrum estimate, 1553 cross-periodogram, 1542, 1553 cross-spectrum, 1553 quadrature spectrum, 1553 SPECTRA procedure, 1542, 1553 cross-spectrum cross-spectral analysis, 1553 SPECTRA procedure, 1553 crossproducts estimator of the covariance matrix, 1021 crossproducts matrix, 1043 crosstabulations, see FREQ procedure CRSP and SAS Dates SASECRSP engine, 2208 CRSP annual data DATASOURCE procedure, 605 CRSP calendar/indices les DATASOURCE procedure, 601 CRSP daily binary les DATASOURCE procedure, 600 CRSP daily character les DATASOURCE procedure, 600 CRSP daily IBM binary les DATASOURCE procedure, 600 CRSP daily security les DATASOURCE procedure, 602 CRSP data les, see DATASOURCE procedure CRSP Date Formats SASECRSP engine, 2208 CRSP Date Functions SASECRSP engine, 2208 CRSP Date Informats SASECRSP engine, 2209 CRSP Integer Date Format SASECRSP engine, 2208 CRSP monthly binary les DATASOURCE procedure, 600 CRSP monthly character les DATASOURCE procedure, 600 CRSP monthly IBM binary les DATASOURCE procedure, 600 CRSP monthly security les DATASOURCE procedure, 603 CRSP stock les DATASOURCE procedure, 600 CRSPAccess Database
DATASOURCE procedure, 600 CRSPDB_SASCAL environment variable SASECRSP engine, 2194 CRSPDCI Date Functions SASECRSP engine, 2210 CRSPDCS Date Functions SASECRSP engine, 2210 CRSPDI2S Date Function SASECRSP engine, 2210 CRSPDIC Date Functions SASECRSP engine, 2210 CRSPDS2I Date Function SASECRSP engine, 2210 CRSPDSC Date Functions SASECRSP engine, 2210 CRSPDT Date Formats SASECRSP engine, 2208 cubic trend curves, 2684 cubic trend, 2684 cumulative statistics operators, 747 Currency Conversion, 2756 custom model specication, 2569 custom models forecasting models, 2472 specifying, 2472 CUSUM statistics, 355, 368 Da Silva method PANEL procedure, 1302 damped-trend exponential smoothing, 2675 smoothing models, 2675 data frequency, see time intervals data periodicity FORECAST procedure, 796 data requirements ARIMA procedure, 221 FORECAST procedure, 806 X11 procedure, 2064 data set, 2405 concatenated, 2410 forecast data set, 2406 forms of, 2406 interleaved, 2408 simple, 2407 data set selection, 2391, 2573 DATA step, 49 SAS data sets, 49 DATASETS procedure, 49 DATASOURCE procedure attributes, 520 attributes of variables, 545 auxiliary data sets, 520 balance of payment statistics data les, 520
BEA data les, 520 BEA national income and product accounts PC Format, 590 BEA S-pages, 520 BLS consumer price index surveys, 591 BLS data les, 520 BLS national employment, hours, and earnings survey, 592 BLS producer price index survey, 591 BLS state and area employment, hours, and earnings survey, 593 BOPS data le, 609 Bureau of Economic Analysis data les, 520 Bureau of Labor Statistics data les, 520 Center for Research in Security Prices data les, 520 CITIBASE format, 523 CITIBASE old format, 594 CITIBASE PC format, 595 COMPUSTAT data les, 520, 595 COMPUSTAT IBM 360/370 general format 48 quarter les, 597 COMPUSTAT IBM 360/370 general format annual les, 596 COMPUSTAT universal character format 48 quarter les, 599 COMPUSTAT universal character format annual les, 598 Consumer Price Index Surveys, 520 cross sections, 527, 529, 532, 543 CRSP annual data, 605 CRSP calendar/indices les, 601 CRSP daily binary les, 600 CRSP daily character les, 600 CRSP daily IBM binary les, 600 CRSP daily security les, 602 CRSP data les, 520 CRSP monthly binary les, 600 CRSP monthly character les, 600 CRSP monthly IBM binary les, 600 CRSP monthly security les, 603 CRSP stock les, 600 CRSPAccess Database, 600 direction of trade statistics data les, 520 DOTS data le, 609 DRI Data Delivery Service data les, 520 DRI data les, 520, 593 DRI/McGraw-Hill data les, 520, 593 DRIBASIC data les, 594 DRIBASIC economics format, 523 DRIDDS data les, 594 employment, hours, and earnings survey, 520 event variables, 542, 543, 549
FAME data les, 520 FAME Information Services Databases, 520, 605 formatting variables, 545 frequency of data, 524 frequency of input data, 539 generic variables, 550 GFS data les, 610 Global Insight data les, 520, 593, 594 Global Insight DRI data les, 593 government nance statistics data les, 520 Haver Analytics data les, 607 ID variable, 549 IMF balance of payment statistics, 609 IMF data les, 520 IMF direction of trade statistics, 609 IMF Economic Information System data les, 608 IMF government nance statistics, 610 IMF International Financial Statistics, 527 IMF international nancial statistics, 608 indexing the OUT= data set, 538, 581 input le, 538, 539 international nancial statistics data les, 520 International Monetary Fund data les, 520, 608 labeling variables, 546 lengths of variables, 534, 546 main economic indicators (OECD) data les, 520 national accounts data les (OECD), 520 national income and product accounts, 520, 590 NIPA Tables, 590 obtaining descriptive information, 525, 529531, 550553 OECD ANA data les, 610 OECD annual national accounts, 610 OECD data les, 520 OECD main economic indicators, 612 OECD MEI data les, 612 OECD QNA data les, 611 OECD quarterly national accounts, 611 Organization for Economic Cooperation and Development data les, 520, 610 OUTALL= data set, 530 OUTBY= data set, 529 OUTCONT= data set, 525, 531 output data sets, 523, 548, 550553 Producer Price Index Survey, 520 reading data les, 523 renaming variables, 532, 547 SAS YEARCUTOFF= option, 544
state and area employment, hours, and earnings survey, 520 stock data les, 520 subsetting data les, 523, 536 time range, 544 time range of data, 526 time series variables, 524, 549 type of input data le, 538 U.S. Bureau of Economic Analysis data les, 590 U.S. Bureau of Labor Statistics data les, 591 variable list, 547 DATE ID variables, 72 date values, 2389 calendar functions and, 96 computing datetime values from, 97 computing from datetime values, 97 difference between dates, 101 formats, 70, 138 formats for, 70 functions, 143 incrementing by intervals, 99 informats, 69, 136 informats for, 69 INTNX function and, 99 normalizing to intervals, 101 SAS representation for, 68 syntax for, 68 time intervals, 129 time intervals and, 100 DATE variable, 72 dates alignment of, 2582 DATETIME ID variables, 72 datetime values computing calendar variables from, 98 computing from calendar variables, 97 computing from time variables, 97 computing time variables from, 98 formats, 70, 142 formats for, 70 functions, 143 informats, 69, 136 informats for, 69 SAS representation for, 69 syntax for, 69 time intervals, 129 DATETIME variable, 72 dating variables, 2446 decomposition of prediction error covariance VARMAX procedure, 1897, 1933
default time ranges, 2575 dened INTCK function, 99 interleaved time series, 80 INTNX function, 98 omitted observations, 78 time values, 69 denition S matrix, 1008 time series, 2380 degrees of freedom correction, 1026 denominator factors transfer function model, 218 dependency list MODEL procedure, 1174 Depreciation, 2719 derivatives MODEL procedure, 1157 DERT. variable, 1068 descriptive statistics, see UNIVARIATE procedure design matrix ARIMAX models and, 217 details generalized method of moments, 1011 developing forecasting models, 2417, 2576 developing forecasting models, 2417, 2576 DFPVALUE macro Dickey-Fuller test, 153 SAS macros, 153 DFTEST macro Dickey-Fuller test, 154 output data sets, 155 SAS macros, 154 seasonality, testing for, 154 stationarity, testing for, 154 diagnostic tests, 2453, 2687 time series, 2453 diagnostics and debugging MODEL procedure, 1168 Dickey-Fuller test, 2688 DFPVALUE macro, 153 DFTEST macro, 154 PROBDF Function, 158 signicance probabilities, 158 signicance probabilities for, 153 unit root, 158 VARMAX procedure, 1901 Dickey-Fuller tests, 231 DIF function alternatives to, 108 explained, 106 higher order differences, 109, 110
introduced, 105 MODEL procedure version, 109 multiperiod lags and, 109 percent change calculations and, 110–112 pitfalls of, 107 second difference, 109 DIF function and differencing, 105–107 difference between dates date values, 101 differences with X11ARIMA/88 X11 procedure, 2058 Differencing, 2584 differencing ARIMA procedure, 210, 247, 254 DIF function and, 105–107 higher order, 109, 110 MODEL procedure and, 109 multiperiod differences, 109 percent change calculations and, 110–112 RETAIN statement and, 108 second difference, 109 STATESPACE procedure, 1587 testing order of, 154 time series data, 105–112 VARMAX procedure, 1894 different forms of output data sets, 82 differential algebraic equations ordinary differential equations (ODEs), 1148 differential equations See ordinary differential equations, 1070 direction of trade statistics data files, see DATASOURCE procedure discussed EXPAND procedure, 124 distributed lag regression models PDLREG procedure, 1349 distribution of time series, 724 distribution of time series data, 724 distribution of time series EXPAND procedure, 724 DOT as a GLUE character SASEFAME engine, 2294 DOTS data file DATASOURCE procedure, 609 double exponential smoothing, see exponential smoothing, 2673 Brown smoothing model, 2673 smoothing models, 2673
DRI Data Delivery Service data files, see DATASOURCE procedure DRI data files, see DATASOURCE procedure DATASOURCE procedure, 593 DRI data files in FAME.db, see SASEFAME engine DRI/McGraw-Hill data files, see DATASOURCE procedure DATASOURCE procedure, 593 DRI/McGraw-Hill data files in FAME.db, see SASEFAME engine DRIBASIC data files DATASOURCE procedure, 594 DRIBASIC economics format DATASOURCE procedure, 523 DRIDDS data files DATASOURCE procedure, 594 DROP in the DATA step SASEFAME engine, 2304 dual quasi-Newton method AUTOREG procedure, 367 Durbin h test AUTOREG procedure, 326 Durbin t test AUTOREG procedure, 326 Durbin-Watson MODEL procedure, 1025 Durbin-Watson test autocorrelation tests, 343 AUTOREG procedure, 324 for first-order autocorrelation, 324 for higher-order autocorrelation, 324 p-values for, 324 Durbin-Watson tests, 343 linearized form, 349 dynamic models SIMLIN procedure, 1512, 1513, 1519, 1534 dynamic multipliers SIMLIN procedure, 1519, 1520 dynamic regression, 190, 213, 2585, 2586 specifying, 2523 dynamic regressors forecasting models, 2523 dynamic simulation, 1068 MODEL procedure, 1068, 1117 SIMLIN procedure, 1513 dynamic simultaneous equation models VARMAX procedure, 1916 econometrics features in SAS/ETS software, 23 editing selection list forecasting models, 2478 EGARCH model
AUTOREG procedure, 335 EGLS method AUTOREG procedure, 361 embedded in time series missing values, 78 embedded missing values, 78 embedded missing values in time series data, 78 Empirical Distribution Estimation MODEL procedure, 1023 employment, hours, and earnings survey, see DATASOURCE procedure ending dates of time intervals, 101 endogenous variables SYSLIN procedure, 1616 endpoint restrictions for polynomial distributed lags, 1350, 1356 Enterprise Guide, 58 Enterprise Miner Time Series nodes, 59 ENTROPY procedure input data sets, 663 missing values, 662 ODS graph names, 666 output data sets, 664 output table names, 665 Environment variable, CRSPDB_SASCAL SASECRSP engine, 2194 EQ. variables, 1059, 1155 equality restriction linear models, 648 nonlinear models, 999, 1076 equation translations MODEL procedure, 1155 equation variables MODEL procedure, 1152 Error model options, 2587 error sum of squares statistics of fit, 2689 ERROR. variables, 1155 errors across equations contemporaneous correlation of, 1649 ESACF (Extended Sample Autocorrelation Function method), 241 ESM procedure BY groups, 689 ODS graph names, 705 EST= data set SIMLIN procedure, 1521 ESTIMATE statement, 981 estimation convergence problems MODEL procedure, 1038 estimation methods AUTOREG procedure, 357
MODEL procedure, 1007 estimation of ordinary differential equations, 1070 MODEL procedure, 1070 evaluation range, 2649 event variables DATASOURCE procedure, 542, 543, 549 example Cauchy distribution estimation, 1219 generalized method of moments, 1054, 1105, 1108, 1109, 1111 Goldfeld Quandt Switching Regression Model, 1221 Mixture of Distributions, 1225 Multivariate Mixture of Distributions, 1225 ordinary differential equations (ODEs), 1215 The D-method, 1221 example of Bayesian VAR modeling VARMAX procedure, 1866 example of Bayesian VECM modeling VARMAX procedure, 1873 example of causality testing VARMAX procedure, 1881 example of cointegration testing VARMAX procedure, 1869 example of multivariate GARCH modeling VARMAX procedure, 1983 example of restricted parameter estimation and testing VARMAX procedure, 1878 example of VAR modeling VARMAX procedure, 1859 example of VARMA modeling VARMAX procedure, 1952 example of vector autoregressive modeling with exogenous variables VARMAX procedure, 1874 example of vector error correction modeling VARMAX procedure, 1868 example, COUNTREG, 509 examples Cauchy distribution estimation, 1219 Monte Carlo simulation, 1218 Simulating from a Mixture of Distributions, 1225 Switching Regression example, 1221 systems of differential equations, 1215 examples of time intervals, 135 exogenous variables SYSLIN procedure, 1616 EXPAND procedure AGGREGATE method, 740
aggregation of time series, 721, 724 BY groups, 731 changing periodicity, 125 conversion methods, 739 discussed, 124 distribution of time series, 724 extrapolation, 736 frequency, 721 ID variables, 733, 735 interpolation methods, 739 interpolation of missing values, 124 JOIN method, 740 ODS graph names, 758 output data sets, 756 range of output observations, 736 SPLINE method, 739 STEP method, 740 time intervals, 735 transformation of time series, 726, 742 transformation operations, 742 EXPAND procedure and interpolation, 124 time intervals, 124 experimental design, 56 explained DIF function, 106 LAG function, 106 explosive differential equations, 1147 ordinary differential equations (ODEs), 1147 exponential trend curves, 2685 exponential smoothing, see smoothing models double exponential smoothing, 798 FORECAST procedure, 774, 798 single exponential smoothing, 798 triple exponential smoothing, 798 exponential trend, 2685 Extended Sample Autocorrelation Function (ESACF) method, 241 external forecasts, 2666 external forecasts, 2666 external sources forecasting models, 2485, 2588 extrapolation EXPAND procedure, 736 Factored ARIMA, 2555, 2584, 2622 Factored ARIMA model specification, 2589 Factored ARIMA models forecasting models, 2468 specifying, 2468 factored model
ARIMA model, 212 ARIMA procedure, 212 AUTOREG procedure, 329 FAME data files, see DATASOURCE procedure, see SASEFAME engine FAME glue symbol named DOT SASEFAME engine, 2299 FAME Information Services Databases, see DATASOURCE procedure, see SASEFAME engine DATASOURCE procedure, 605 fast Fourier transform SPECTRA procedure, 1549 fatal error when reading from a FAME data base SASEFAME engine, 2290 FCMP procedure, 49 SAS functions, 49 features in SAS/ETS software econometrics, 23 FIML estimation method, see full information maximum likelihood Financial Functions PROBDF Function, 158 financial functions, 51 finishing the FAME CHLI SASEFAME engine, 2290 finite Fourier transform SPECTRA procedure, 1542 finite memory forecasts ARIMA procedure, 258 first-stage R squares, 1087 fitting forecasting models, 2420 fitting forecasting models, 2420 fixed effects model one-way, 1283 two-way, 1285 fixed rate mortgage, see LOAN procedure LOAN procedure, 828 flows contrasted with stock variables, 724 for first-order autocorrelation Durbin-Watson test, 324 for higher-order autocorrelation Durbin-Watson test, 324 for interleaved time series ID variables, 80 for multiple selections control key, 2398 for nonlinear models instrumental variables, 1084 for selection of state space models canonical correlation analysis, 1570, 1593 for time series data
ID variables, 67 forecast combination, 2591, 2666 FORECAST command, 2546 forecast data set, see output data set forecast horizon, 2575, 2649 forecast options, 2595 FORECAST procedure ADDWINTERS method, 803 automatic forecasting, 774 autoregressive models, 797 BY groups, 795 confidence limits, 807 data periodicity, 796 data requirements, 806 exponential smoothing, 774, 798 forecasting, 774 Holt two-parameter exponential smoothing, 774, 803 ID variables, 795 missing values, 795 output data sets, 806, 808 predicted values, 807 residuals, 807 seasonal forecasting, 799, 803 seasonality, 805 smoothing weights, 803 STEPAR method, 797 stepwise autoregression, 774, 797 time intervals, 796 time series methods, 786 time trend models, 784 Winters method, 774, 799 FORECAST procedure and interleaved time series, 80, 81 Forecast Studio, 51 forecasting, 2664 ARIMA procedure, 257, 259 FORECAST procedure, 774 MODEL procedure, 1120 STATESPACE procedure, 1568, 1597 VARMAX procedure, 1930 Forecasting menusystem, 45 forecasting models adjustments, 2522 ARIMA models, 2465 automatic generation, 2396 automatic selection, 2459 changes in trend, 2530 combination models, 2482 comparing, 2505, 2605 custom models, 2472 developing, 2417, 2576 dynamic regressors, 2523 editing selection list, 2478
external sources, 2485, 2588 Factored ARIMA models, 2468 fitting, 2420 interventions, 2527 level shifts, 2532 linear trend, 2513 predictor variables, 2511 reference, 2508 regressors, 2518 seasonal dummy variables, 2539 selecting from a list, 2457 smoothing models, 2462, 2669 sorting, 2504, 2581 specifying, 2453 transfer functions, 2682 trend curves, 2515 forecasting of Bayesian vector autoregressive models VARMAX procedure, 1948 forecasting process, 2389 forecasting project, 2410 managing, 2599 Project Management window, 2411 saving and restoring, 2412 sharing, 2416 forecasts, 2437 confidence limits, 2436 external, 2666 plotting, 2436 producing, 2404, 2623 form of state space models, 1568 formats date values, 70, 138 datetime values, 70, 142 recommended for time series ID, 71 time values, 142 formats for date values, 70 datetime values, 70 formatting variables DATASOURCE procedure, 545 forms of data set, 2406 Fourier coefficients SPECTRA procedure, 1553 Fourier transform SPECTRA procedure, 1542 fractional operators, 751 FREQ procedure, 49 crosstabulations, 49 frequency changing by interpolation, 125, 721, 733 EXPAND procedure, 721
of time series observations, 84, 125 SPECTRA procedure, 1552 time intervals and, 84, 125 frequency of data, see time intervals DATASOURCE procedure, 524 frequency of input data DATASOURCE procedure, 539 frequency option SASEHAVR engine, 2340 from interleaved form transposing time series, 119 from standard form transposing time series, 122 full information maximum likelihood FIML estimation method, 1614 MODEL procedure, 1019 SYSLIN procedure, 1624, 1649 Fuller Battese variance components, 1292 Fuller's modification to LIML SYSLIN procedure, 1654 functions, 50 date values, 143 datetime values, 143 lag functions, 1160 mathematical functions, 1159 random-number functions, 1159 time intervals, 143 time values, 143 functions across time MODEL procedure, 1160 functions for calendar calculations, 95, 143 time intervals, 98, 143 functions of parameters nonlinear models, 981 G4 inverse, 985 GARCH in mean model, see GARCH-M model GARCH model AUTOREG procedure, 314 conditional t distribution, 367 covariance estimates, 343 generalized autoregressive conditional heteroscedasticity, 314 heteroscedasticity models, 350 initial values, 348 starting values, 342 t distribution, 367 GARCH-M model, 366 AUTOREG procedure, 335 GARCH in mean model, 366 Gauss-Marquardt method ARIMA procedure, 250
AUTOREG procedure, 360 Gauss-Newton method, 1027 Gaussian distribution MODEL procedure, 980 General Form Equations Jacobi method, 1142 Seidel method, 1142 generalized autoregressive conditional heteroscedasticity, see GARCH model generalized Durbin-Watson tests AUTOREG procedure, 324 generalized least squares PANEL procedure, 1300 generalized least squares estimator of the covariance matrix, 1021 generalized least-squares Yule-Walker method as, 361 Generalized Method of Moments V matrix, 1012, 1017 generalized method of moments details, 1011 example, 1054, 1105, 1108, 1109, 1111 generating models, 2561 Generic Cashflow, 2726 generic variables DATASOURCE procedure, 550 GFS data files DATASOURCE procedure, 610 giving dates to time series data, 67 Global Insight data files DATASOURCE procedure, 593, 594 Global Insight DRI data files, see DATASOURCE procedure DATASOURCE procedure, 593 global statements, 50 GLUE symbol SASEFAME engine, 2294 GMM simulated method of moments, 1016 SMM, 1016 GMM in Panel: Arellano and Bond's Estimator Panel GMM, 1304 goal seeking MODEL procedure, 1138 goal seeking problems, 1074 Godfrey Lagrange test autocorrelation tests, 1058 Godfrey's test, 344 autocorrelation tests, 344 Goldfeld Quandt Switching Regression Model example, 1221 goodness of fit, see statistics of fit
goodness-of-fit statistics, see statistics of fit, 2643, see statistics of fit government finance statistics data files, see DATASOURCE procedure gradient of the objective function, 1042, 1043 Granger causality test VARMAX procedure, 1944 graphics SAS/GRAPH software, 52 graphs, see Model Viewer, see Time Series Viewer grid search MODEL procedure, 1036 Hausman specification test, 1079 MODEL procedure, 1079 Haver Analytics data files DATASOURCE procedure, 607 Haver data files, see SASEHAVR engine Haver Information Services Databases, see SASEHAVR engine HCCME 2SLS, 1057 HCCME 3SLS, 1057 HCCME = hccme=0, 1313 PANEL procedure, 1313 HCCME OLS, 1055 HCCME SUR, 1057 hccme=0 HCCME =, 1313 help system, 23 Henze-Zirkler test, 1048 normality tests, 1048 heteroscedastic errors, 1011 heteroscedastic extreme value model MDC procedure, 881, 906 heteroscedasticity, 947, 1050 AUTOREG procedure, 329 Lagrange multiplier test, 351 testing for, 329 Heteroscedasticity Corrected Covariance Matrices, 1313 heteroscedasticity models, see GARCH model constrained estimation, 351 covariates, 350, 1390 link function, 350 heteroscedasticity tests Breusch-Pagan test, 1050 Lagrange multiplier test, 351 White's test, 1050 Heteroscedasticity-Consistent Covariance Matrix Estimation, 1055 higher order differencing, 109, 110
higher order differences DIF function, 109, 110 higher order sums summation, 114 Hildreth-Lu AR initial conditions, 1091 Hildreth-Lu method AUTOREG procedure, 362 histograms, see CHART procedure hold-out sample, 2575 hold-out samples, 2508 Holt smoothing model, see linear exponential smoothing Holt two-parameter exponential smoothing FORECAST procedure, 774, 803 Holt-Winters Method, see Winters Method Holt-Winters method, see Winters method homoscedastic errors, 1050 HTML creating from Model Viewer, 2618 creating from Time Series Viewer, 2657 hyperbolic trend curves, 2685 hyperbolic trend, 2685 ID groups MDC procedure, 891 ID values for time intervals, 100 ID variable, see time ID variable DATASOURCE procedure, 549 ID variable for time series data, 67 ID variables, 2395 ARIMA procedure, 259 DATE, 72 DATETIME, 72 EXPAND procedure, 733, 735 for interleaved time series, 80 for time series data, 67 FORECAST procedure, 795 PANEL procedure, 1273 SIMLIN procedure, 1517 sorting by, 72 STATESPACE procedure, 1586 TSCSREG procedure, 1735 X11 procedure, 2046, 2048 X12 procedure, 2111 ID variables for cross-sectional dimensions, 79 interleaved time series, 80 time series cross-sectional form, 79 IGARCH model AUTOREG procedure, 335
IMF balance of payment statistics DATASOURCE procedure, 609 IMF data files, see DATASOURCE procedure IMF direction of trade statistics DATASOURCE procedure, 609 IMF Economic Information System data files DATASOURCE procedure, 608 IMF government finance statistics DATASOURCE procedure, 610 IMF International Financial Statistics DATASOURCE procedure, 527 IMF international financial statistics DATASOURCE procedure, 608 IML, see SAS/IML software impact multipliers SIMLIN procedure, 1519, 1524 impulse function intervention model and, 216 impulse response function VARMAX procedure, 1898, 1919 impulse response matrix of a state space model, 1600 in SAS data sets time series, 2380 in standard form output data sets, 83 incrementing by intervals date values, 99 incrementing dates INTNX function, 99 incrementing dates by time intervals, 98, 99 independent variables, see predictor variables indexing OUT= data set, 549 indexing the OUT= data set DATASOURCE procedure, 538, 581 inequality restriction linear models, 648 nonlinear models, 975, 999, 1076 infinite memory forecasts ARIMA procedure, 257 infinite order AR representation VARMAX procedure, 1898 infinite order MA representation VARMAX procedure, 1898, 1919 informats date values, 69, 136 datetime values, 69, 136 time values, 136 informats for date values, 69 datetime values, 69 initial values, 348, 896
GARCH model, 348 initializations smoothing models, 2670 initializing lags MODEL procedure, 1163 SIMLIN procedure, 1522 innovation vector of a state space model, 1569 input block COMPUTAB procedure, 455 input data set, 2391, 2573 input data sets ENTROPY procedure, 663 MODEL procedure, 1104 input file DATASOURCE procedure, 538, 539 input matrix of a state space model, 1569 input series ARIMA procedure, 213 INPUT variables X12 procedure, 2113 inputs, see predictor variables installment loans, see LOAN procedure instrumental regression, 1010 instrumental variables, 1010 choice of, 1084 for nonlinear models, 1084 number to use, 1085 SYSLIN procedure, 1616 instruments, 1008 INTCK function calendar calculations and, 104 counting time intervals, 102 defined, 99 INTCK function and time intervals, 99, 102 interaction effects ARIMA procedure, 217 interest rates LOAN procedure, 851 interim multipliers SIMLIN procedure, 1515, 1520, 1523, 1524 interleaved data set, 2408 interleaved form output data sets, 82 interleaved form of time series data set, 80 interleaved time series and _TYPE_ variable, 80, 81 combined with cross-sectional dimension, 82 defined, 80
FORECAST procedure and, 80, 81 ID variables for, 80 plots of, 90 Internal Rate of Return, 2768 internal rate of return LOAN procedure, 852 internal variables MODEL procedure, 1153 international financial statistics data files, see DATASOURCE procedure International Monetary Fund data files, see DATASOURCE procedure DATASOURCE procedure, 608 interpolation between levels and rates, 125 between stocks and flows, 125 EXPAND procedure and, 124 of missing values, 124, 723 time series data, 125 to higher frequency, 125 to lower frequency, 125 interpolation methods EXPAND procedure, 739 interpolation of missing values, 124 time series data, 124, 125, 723 interpolation of missing values EXPAND procedure, 124 interpolation of time series step function, 740 interrupted time series analysis, see intervention model interrupted time series model, see intervention model interval functions, see time intervals, functions interval functions and calendar calculations, 104 INTERVAL= option and time intervals, 84 intervals, see time intervals, 2395 intervention analysis, see intervention model intervention model ARIMA procedure, 213, 216, 219, 298 interrupted time series analysis, 216 interrupted time series model, 213 intervention analysis, 216 intervention model and impulse function, 216 step function, 217 intervention notation, 2686 intervention specification, 2595, 2597 interventions, 2685 automatic inclusion of, 2582 forecasting models, 2527
point, 2685 predictor variables, 2685 ramp, 2686 specifying, 2527 step, 2685 INTNX function calendar calculations and, 104 checking data periodicity, 103 computing ceiling of intervals, 102 computing ending date of intervals, 101 computing midpoint date of intervals, 101 computing widths of intervals, 101 defined, 98 incrementing dates, 99 normalizing dates in intervals, 101 INTNX function and date values, 99 time intervals, 98 introduced DIF function, 105 LAG function, 105 percent change calculations, 110 time variables, 95 inverse autocorrelation function ARIMA procedure, 239 invertibility ARIMA procedure, 255 VARMAX procedure, 1949 Investment Analysis System, 46 Investment Portfolio, 2706 invoking the system, 2384 IRoR, 2768 irregular component X11 procedure, 2034, 2040 iterated generalized method of moments, 1015 iterated seemingly unrelated regression SYSLIN procedure, 1649 iterated three-stage least squares SYSLIN procedure, 1649 Iterative Outlier Detection ARIMA procedure, 307 Jacobi method MODEL procedure, 1141 Jacobi method with General Form Equations MODEL procedure, 1142 Jacobian, 1009, 1027 Jarque-Bera test, 344 normality tests, 344 JMP, 57 JOIN method EXPAND procedure, 740 joint generalized least squares, see seemingly unrelated regression
jointly dependent variables SYSLIN procedure, 1616 K-class estimation SYSLIN procedure, 1648 Kalman filter AUTOREG procedure, 360 STATESPACE procedure, 1570 used for state space modeling, 1570 KEEP in the DATA step SASEFAME engine, 2304 kernels, 1013, 1549 SPECTRA procedure, 1549 Kolmogorov-Smirnov test, 1048 normality tests, 1048 KPSS (Kwiatkowski, Phillips, Schmidt, Shin) test, 346 KPSS test, 346, 378 unit roots, 378 Kruskal-Wallis test, 2076 labeling variables DATASOURCE procedure, 546 LAG function alternatives to, 108 explained, 106 introduced, 105 MODEL procedure version, 109 multiperiod lags and, 109 percent change calculations and, 110–112 pitfalls of, 107 LAG function and Lags, 106 lags, 105, 107 lag functions functions, 1160 MODEL procedure, 1160 lag lengths MODEL procedure, 1162 lag logic MODEL procedure, 1161 lagged dependent variables and tests for autocorrelation, 326 AUTOREG procedure, 326 lagged endogenous variables SYSLIN procedure, 1616 lagging time series data, 105–112 Lagrange multiplier test heteroscedasticity, 351 heteroscedasticity tests, 351 linear hypotheses, 650 nonlinear hypotheses, 1006, 1078, 1410 Lags
LAG function and, 106 lags LAG function and, 105, 107 MODEL procedure and, 109 multiperiod lagging, 109 percent change calculations and, 110–112 RETAIN statement and, 108 SIMLIN procedure, 1522 lambda, 1028 language differences MODEL procedure, 1164 large problems MODEL procedure, 1045 leads calculation of, 112 multiperiod, 112 time series data, 112 left-hand side expressions nonlinear models, 1152 lengths of variables DATASOURCE procedure, 534, 546 level shifts forecasting models, 2532 specifying, 2532 levels contrasted with flows or rates, 724 LIBNAME libref SASEHAVR physical name on UNIX SASEFAME engine, 2299 LIBNAME libref SASEHAVR physical name on Windows SASEFAME engine, 2299 LIBNAME interface engine for FAME database, see SASEFAME engine LIBNAME interface engine for Haver database, see SASEHAVR engine LIBNAME statement SASECRSP engine, 2190 SASEFAME engine, 2290 SASEHAVR engine, 2340 likelihood confidence intervals, 1082 MODEL procedure, 1082 likelihood ratio test linear hypotheses, 650 nonlinear hypotheses, 1006, 1078 limitations on ordinary differential equations (ODEs), 1147 limitations on ordinary differential equations MODEL procedure, 1147 Limited Dependent Variable Models QLIM procedure, 1399 limited information maximum likelihood LIML estimation method, 1614
SYSLIN procedure, 1648 LIML estimation method, see limited information maximum likelihood linear trend curves, 2684 linear dependencies MODEL procedure, 1041 linear exponential smoothing, 2674 Holt smoothing model, 2674 smoothing models, 2674 linear hypotheses Lagrange multiplier test, 650 likelihood ratio test, 650 Wald test, 650 linear hypothesis testing, 1312 PANEL procedure, 1312 linear models equality restriction, 648 inequality restriction, 648 restricted estimation, 648 linear structural equations SIMLIN procedure, 1519 linear trend, 2513, 2684 forecasting models, 2513 linearized form Durbin-Watson tests, 349 link function heteroscedasticity models, 350 Loan, 2711 LOAN procedure adjustable rate mortgage, 827, 828 amortization schedule, 856 balloon payment mortgage, 827, 828 break even analysis, 852 buydown rate loans, 827, 828 comparing loans, 835, 852, 856 continuous compounding, 850 fixed rate mortgage, 827, 828 installment loans, 827 interest rates, 851 internal rate of return, 852 loan repayment schedule, 856 loan summary table, 856 loans analysis, 827 minimum attractive rate of return, 852 mortgage loans, 827 output data sets, 853, 854 output table names, 856 present worth of cost, 852 rate adjustment cases, 846 taxes, 852 true interest rate, 852 types of loans, 828 loan repayment schedule
LOAN procedure, 856 loan summary table LOAN procedure, 856 loans analysis, see LOAN procedure log transformations, 2667 log likelihood value, 344 log test, 2688 log transformation, see transformations log transformations ARIMA procedure, 258 LOGTEST macro, 156 logarithmic trend curves, 2685 logarithmic trend, 2685 logistic transformations, 2667 trend curves, 2684 logistic trend, 2684 logit QLIM Procedure, 1376 LOGTEST macro log transformations, 156 output data sets, 157 SAS macros, 156 long-run relations testing VARMAX procedure, 1972 %MA and %AR macros combined, 1099 MA Initial Conditions conditional least squares, 1092 maximum likelihood, 1092 unconditional least squares, 1092 macros, see SAS macros MAE AUTOREG procedure, 368 main economic indicators (OECD) data files, see DATASOURCE procedure main economic indicators (OECD) data files in FAME.db, see SASEFAME engine managing forecasting project, 2599 managing forecasting projects, 2599 MAPE AUTOREG procedure, 368 Mardia's test, 1048 normality tests, 1048 Marquardt method ARIMA procedure, 250 Marquardt-Levenberg method, 1028 MARR, see minimum attractive rate of return, 2789 mathematical functions, 51 functions, 1159
matrix language SAS/IML software, 54 maximizing likelihood functions, 56 maximum likelihood AR initial conditions, 1091 MA Initial Conditions, 1092 maximum likelihood method AUTOREG procedure, 362 MDC procedure binary data modeling example, 918 binary logit example, 918, 921 binary probit example, 918 bounds on parameter estimates, 890 BY groups, 891 conditional logit example, 921 conditional logit model, 870, 871, 904 goodness-of-fit measures, 915 Hausman's specification and likelihood ratio tests for nested logit, 918 heteroscedastic extreme value model, 881, 906 ID groups, 891 introductory examples, 871 mixed logit model, 886, 907 multinomial discrete choice, 903 multinomial probit example, 924 multinomial probit model, 880, 909 nested logit example, 931 nested logit model, 876, 910 output table names, 917 restrictions on parameter estimates, 901 syntax, 887 mean absolute error statistics of fit, 2689 mean absolute percent error statistics of fit, 1819, 2689 mean percent error statistics of fit, 2690 mean prediction error statistics of fit, 2690 mean square error statistics of fit, 1819 mean squared error statistics of fit, 2689 MEANS procedure, 49 measurement equation observation equation, 1569 of a state space model, 1569 MELO estimation method, see minimum expected loss estimator memory requirements MODEL procedure, 1046 VARMAX procedure, 2002 menu interfaces
to SAS/ETS software, 45, 46 merging series time series data, 119 merging time series data sets, 119 Michaelis-Menten Equations, 1074 midpoint dates of time intervals, 101 MINIC (Minimum Information Criterion) method, 243 minimization methods MODEL procedure, 1027 minimization summary MODEL procedure, 1030 minimum attractive rate of return LOAN procedure, 852 MARR, 852 minimum expected loss estimator MELO estimation method, 1648 SYSLIN procedure, 1648 minimum information criteria method VARMAX procedure, 1940 Minimum Information Criterion (MINIC) method, 243 missing observations contrasted with omitted observations, 78 missing values, 748, 1106 COMPUTAB procedure, 457 contrasted with omitted observations, 78 embedded in time series, 78 ENTROPY procedure, 662 FORECAST procedure, 795 interpolation of, 124 MODEL procedure, 1025, 1143 smoothing models, 2670 time series data, 723 time series data and, 77 VARMAX procedure, 1912 missing values and time series data, 77, 78 MISSONLY operator, 748 mixed logit model MDC procedure, 886, 907 Mixture of Distributions example, 1225 MMAE, 2668 MMSE, 2668 model evaluation, 2662 Model Identification ARIMA procedure, 300 model list, 2425, 2606 MODEL procedure adjacency graph, 1178 adjusted R squared, 1027 Almon lag polynomials, 1102
analyzing models, 1174 ARMA model, 1088 autoregressive models, 1088 auxiliary equations, 1074 block structure, 1178 character variables, 1155 Chow tests, 1081 collinearity diagnostics, 1032, 1041 compiler listing, 1171 control variables, 1153 controlling starting values, 1034 convergence criteria, 1028 cross-equation covariance matrix, 1026 cross-reference, 1170 dependency list, 1174 derivatives, 1157 diagnostics and debugging, 1168 Durbin-Watson, 1025 dynamic simulation, 1068, 1117 Empirical Distribution Estimation, 1023 equation translations, 1155 equation variables, 1152 estimation convergence problems, 1038 estimation methods, 1007 estimation of ordinary differential equations, 1070 forecasting, 1120 full information maximum likelihood, 1019 functions across time, 1160 Gaussian distribution, 980 goal seeking, 1138 grid search, 1036 Hausman specification test, 1079 initializing lags, 1163 input data sets, 1104 internal variables, 1153 Jacobi method, 1141 Jacobi method with General Form Equations, 1142 lag functions, 1160 lag lengths, 1162 lag logic, 1161 language differences, 1164 large problems, 1045 likelihood confidence intervals, 1082 limitations on ordinary differential equations, 1147 linear dependencies, 1041 memory requirements, 1046 minimization methods, 1027 minimization summary, 1030 missing values, 1025, 1143 model variables, 1152 Monte Carlo simulation, 1218
Moore-Penrose generalized inverse, 985 moving average models, 1088 Multivariate t-Distribution Estimation, 1022 n-period-ahead forecasting, 1117 nested iterations, 1027 Newton's Method, 1141 nonadditive errors, 1059 normal distribution, 980 ODS graph names, 1116 ordinary differential equations and goal seeking, 1074 output data sets, 1110 output table names, 1113 parameters, 1152 polynomial distributed lag models, 1102 program listing, 1169 program variables, 1155 properties of the estimates, 1025 quasi-random number generators, 1130 R squared, 1027, 1034 random-number generating functions, 1159 restrictions on parameters, 1098 S matrix, 1026 S-iterated methods, 1027 Seidel method, 1142 Seidel method with General Form Equations, 1142 SIMNLIN procedure, 945 simulated nonlinear least squares, 1019 simulation, 1120 solution mode output, 1132 solution modes, 1117, 1140 SOLVE Data Sets, 1148 starting values, 1031, 1038 static simulation, 1068 static simulations, 1117 stochastic simulation, 1120 storing programs, 1167 summary statistics, 1135 SYSNLIN procedure, 945 systems of ordinary differential equations, 1215 tests on parameters, 1078 time variable, 1074 troubleshooting estimation convergence problems, 1030 troubleshooting simulation problems, 1143 using models to forecast, 1120 using solution modes, 1117 variables in model program, 1151 _WEIGHT_ variable, 1052 MODEL procedure and differencing, 109 lags, 109
MODEL procedure version DIF function, 109 LAG function, 109 model selection, 2621 model selection criterion, 2502, 2610 model selection for X-11-ARIMA method X11 procedure, 2068 model selection list, 2611 model variables MODEL procedure, 1152 Model Viewer, 2427, 2615 graphs, 2419 plots, 2419 saving graphs and tables, 2628, 2630 Monte Carlo simulation, 1120, 1218 examples, 1218 MODEL procedure, 1218 Moore-Penrose generalized inverse, 985 mortgage loans, see LOAN procedure moving average function, 1160 moving average models, 1089 MODEL procedure, 1088 moving averages percent change calculations, 111, 112 moving between computer systems SAS data sets, 49 moving product operators, 752 moving rank operator, 751 moving seasonality test, 2076 moving t-value operators, 755 moving time window operators, 745 moving-average parameters ARIMA procedure, 256 multinomial discrete choice independence from irrelevant alternatives, 905 MDC procedure, 903 multinomial probit model MDC procedure, 880, 909 multiperiod leads, 112 multiperiod differences differencing, 109 multiperiod lagging lags, 109 multiperiod lags and DIF function, 109 LAG function, 109 summation, 114 multiple selections, 2398 multiplicative model ARIMA model, 212 multiplicative seasonal smoothing, 2677 smoothing models, 2677
multipliers SIMLIN procedure, 1515, 1516, 1519, 1520, 1523, 1524 multipliers for higher order lags SIMLIN procedure, 1520, 1534 multivariate autocorrelations, 1573 normality tests, 1048 partial autocorrelations, 1592 multivariate forecasting STATESPACE procedure, 1568 multivariate GARCH Modeling VARMAX procedure, 1907 Multivariate Mixture of Distributions example, 1225 multivariate model diagnostic checks VARMAX procedure, 1957 Multivariate t-Distribution Estimation MODEL procedure, 1022 multivariate time series STATESPACE procedure, 1568 n-period-ahead forecasting MODEL procedure, 1117 naming time intervals, 84, 130 naming model parameters ARIMA procedure, 256 national accounts data files (OECD), see DATASOURCE procedure national accounts data files (OECD) in FAME.db, see SASEFAME engine national income and product accounts, see DATASOURCE procedure DATASOURCE procedure, 590 negative log likelihood function, 1020 negative log-likelihood function, 1022 Nerlove variance components, 1294 nested iterations MODEL procedure, 1027 nested logit model MDC procedure, 876, 910 Newton's Method MODEL procedure, 1141 Newton-Raphson optimization methods, 490, 896 Newton-Raphson method, 490, 896 NIPA Tables DATASOURCE procedure, 590 NLO Overview NLO system, 165 NLO system NLO Overview, 165
Options, 165 output table names, 183 remote monitoring, 181 NOMISS operator, 748 nonadditive errors MODEL procedure, 1059 nonlinear hypotheses Lagrange multiplier test, 1006, 1078, 1410 likelihood ratio test, 1006, 1078 Wald test, 1006, 1078, 1410 nonlinear least-squares AUTOREG procedure, 362 nonlinear models equality restriction, 999, 1076 functions of parameters, 981 inequality restriction, 975, 999, 1076 left-hand side expressions, 1152 restricted estimation, 975, 999, 1076 test of hypotheses, 1005 nonmissing observations statistics of fit, 2688 nonseasonal ARIMA model notation, 2680 nonseasonal transfer function notation, 2682 nonstationarity, see stationarity normal distribution MODEL procedure, 980 normality tests, 1048 Henze-Zirkler test, 1048 Jarque-Bera test, 344 Kolmogorov-Smirnov test, 1048 Mardia's test, 1048 multivariate, 1048 Shapiro-Wilk test, 1048 normalizing dates in intervals INTNX function, 101 normalizing to intervals date values, 101 notation nonseasonal ARIMA model, 2680 nonseasonal transfer function, 2682 seasonal ARIMA model, 2681 seasonal transfer function, 2683 notation for ARIMA model, 207 ARMA model, 207 number of observations statistics of fit, 2688 number to use instrumental variables, 1085 numerator factors transfer function model, 218
OBJECT convergence measure, 1028 objective function, 1008 observation equation, see measurement equation observation numbers, 2644 as time ID, 2443 time ID variable, 2443 obtaining descriptive information DATASOURCE procedure, 525, 529–531, 550–553 ODS graph names ARIMA procedure, 276 AUTOREG procedure, 388 ENTROPY procedure, 666 ESM procedure, 705 EXPAND procedure, 758 MODEL procedure, 1116 SIMILARITY procedure, 1483 SYSLIN procedure, 1660 TIMESERIES procedure, 1714 UCM procedure, 1813 VARMAX procedure, 2000 ODS Graphics ARIMA procedure, 224 UCM procedure, 1754 OECD ANA data files DATASOURCE procedure, 610 OECD annual national accounts DATASOURCE procedure, 610 OECD data files, see DATASOURCE procedure OECD data files in FAME.db, see SASEFAME engine OECD main economic indicators DATASOURCE procedure, 612 OECD MEI data files DATASOURCE procedure, 612 OECD QNA data files DATASOURCE procedure, 611 OECD quarterly national accounts DATASOURCE procedure, 611 of a state space model impulse response matrix, 1600 innovation vector, 1569 input matrix, 1569 measurement equation, 1569 state transition equation, 1569 state vector, 1568 transition equation, 1569 transition matrix, 1569 of a time series unit root, 154 of interleaved time series overlay plots, 90 of missing values interpolation, 124, 723
of time series distribution, 724 overlay plots, 88 sampling frequency, 71, 84, 125 simulation, 2560, 2653 stationarity, 210 summation, 113 time ranges, 77 of time series data set standard form, 76 time series cross-sectional form, 79 of time series observations frequency, 84, 125 periodicity, 71, 84, 125 omitted observations contrasted with missing values, 78 defined, 78 replacing with missing values, 104 omitted observations in time series data, 78 one-way fixed effects model, 1283 random effects model, 1291 one-way fixed effects model PANEL procedure, 1283 one-way fixed-effects model, 1283 one-way random effects model PANEL procedure, 1291 one-way random-effects model, 1291 operations research SAS/OR software, 55 optimization methods Newton-Raphson, 490, 896 quasi-Newton, 350, 490, 896 trust region, 350, 490, 896 optimizations smoothing weights, 2671 Options NLO system, 165 options automatic model selection, 2568 order of calculations COMPUTAB procedure, 451 order statistics, see RANK procedure Ordinal Discrete Choice Modeling QLIM procedure, 1396 ordinary differential equations (ODEs) and goal seeking, 1074 differential algebraic equations, 1148 example, 1215 explosive differential equations, 1147 limitations on, 1147 systems of, 1215 ordinary differential equations and goal seeking
MODEL procedure, 1074 Organization for Economic Cooperation and Development data files, see DATASOURCE procedure DATASOURCE procedure, 610 Organization for Economic Cooperation and Development data files in FAME.db, see SASEFAME engine orthogonal polynomials PDLREG procedure, 1350 OUT= data set indexing, 549 OUTALL= data set DATASOURCE procedure, 530 OUTBY= data set DATASOURCE procedure, 529 OUTCONT= data set DATASOURCE procedure, 525, 531 Outlier Detection ARIMA procedure, 305 Output Data Sets VARMAX procedure, 1987 output data sets and the OUTPUT statement, 83 ARIMA procedure, 262–264, 267, 268 AUTOREG procedure, 384 BOXCOXAR macro, 151 COMPUTAB procedure, 457 DATASOURCE procedure, 523, 548, 550–553 DFTEST macro, 155 different forms of, 82 ENTROPY procedure, 664 EXPAND procedure, 756 FORECAST procedure, 806, 808 in standard form, 83 interleaved form, 82 LOAN procedure, 853, 854 LOGTEST macro, 157 MODEL procedure, 1110 PANEL procedure, 1320–1322 PDLREG procedure, 1363 produced by SAS/ETS procedures, 82 SIMLIN procedure, 1522, 1523 SPECTRA procedure, 1552 STATESPACE procedure, 1601, 1602 SYSLIN procedure, 1655, 1656 X11 procedure, 2071, 2072 Output Delivery System (ODS), 2618, 2657 OUTPUT statement SAS/ETS procedures using, 83 output table names ARIMA procedure, 272 AUTOREG procedure, 386
COUNTREG procedure, 508 ENTROPY procedure, 665 LOAN procedure, 856 MDC procedure, 917 MODEL procedure, 1113 NLO system, 183 PANEL procedure, 1323 PDLREG procedure, 1364 QLIM procedure, 1415 SIMLIN procedure, 1525 SPECTRA procedure, 1554 STATESPACE procedure, 1605 SYSLIN procedure, 1659 TSCSREG procedure, 1738 X11 procedure, 2085 over identification restrictions SYSLIN procedure, 1654 overlay plot of time series data, 88 overlay plots of interleaved time series, 90 of time series, 88 _TYPE_ variable and, 90 p-values for Durbin-Watson test, 324 panel data TSCSREG procedure, 1727 Panel GMM, 1304 GMM in Panel: Arellano and Bond's Estimator, 1304 PANEL procedure Between Estimators, 1291 BY groups, 1271 Da Silva method, 1302 generalized least squares, 1300 HCCME =, 1313 ID variables, 1273 linear hypothesis testing, 1312 one-way fixed effects model, 1283 one-way random effects model, 1291 output data sets, 1320–1322 output table names, 1323 Parks method, 1300 Pooled Estimator, 1291 predicted values, 1279 printed output, 1323 R-square measure, 1316 residuals, 1280 specification tests, 1317 two-way fixed effects model, 1285 two-way random effects model, 1294 Zellner's two-stage method, 1301 parameter change vector, 1042
parameter estimates, 2433 parameter estimation, 2662 parameters MODEL procedure, 1152 UCM procedure, 1757–1768, 1770–1780 Pareto charts, 56 Parks method PANEL procedure, 1300 partial autocorrelations multivariate, 1592 partial autoregression coefficient VARMAX procedure, 1899, 1936 partial canonical correlation VARMAX procedure, 1899, 1939 partial correlation VARMAX procedure, 1938 PDL, see polynomial distributed lags PDLREG procedure BY groups, 1355 confidence limits, 1359 distributed lag regression models, 1349 orthogonal polynomials, 1350 output data sets, 1363 output table names, 1364 polynomial distributed lags, 1350 predicted values, 1359 residuals, 1359 restricted estimation, 1360 percent change calculations at annual rates, 110 introduced, 110 moving averages, 111, 112 period-to-period, 110 time series data, 110–112 year-over-year, 110 yearly averages, 111 percent change calculations and DIF function, 110–112 differencing, 110–112 LAG function, 110–112 lags, 110–112 percent operators, 755 period of evaluation, 2506 period of fit, 2506, 2575, 2649 period-to-period percent change calculations, 110 Periodic Equivalent, see Uniform Periodic Equivalent periodicity changing by interpolation, 125, 721 of time series observations, 71, 84, 125 periodicity of time series data, 84, 125 periodicity of time series
time intervals, 84, 125 periodogram SPECTRA procedure, 1542, 1553 Phillips-Ouliaris test, 345, 377 Phillips-Perron test, 345, 376 unit roots, 345, 346, 376 Phillips-Perron tests, 231 Physical Names on Supported hosts SASEFAME engine, 2299 Physical path name syntax for variety of environments SASEFAME engine, 2299 pitfalls of DIF function, 107 LAG function, 107 plot axis and time intervals, 88 plot axis for time series SGPLOT procedure, 88 PLOT procedure, 49 plotting time series, 92 time series data, 92 plot reference lines and time intervals, 88 plots, see Model Viewer, see Time Series Viewer plots of interleaved time series, 90 plotting autocorrelations, 193 forecasts, 2436 prediction errors, 2429 residual, 91 time series data, 86 plotting time series PLOT procedure, 92 SGPLOT procedure, 87 Time Series Viewer procedure, 86 point interventions, 2685 point interventions, 2685 point-in-time values, 721, 724 polynomial distributed lag models MODEL procedure, 1102 polynomial distributed lags Almon lag polynomials, 1349 endpoint restrictions for, 1350, 1356 PDL, 1349 PDLREG procedure, 1350 Polynomial specification, 2555, 2584, 2622 pooled pooled estimator, 1291 Pooled Estimator PANEL procedure, 1291 pooled estimator, 1291
pooled, 1291 Portfolio, see Investment Portfolio power curve trend curves, 2685 power curve trend, 2685 PPC convergence measure, 1028 Prais-Winsten estimates AUTOREG procedure, 361 PRED. variables, 1155 predetermined variables SYSLIN procedure, 1616 predicted values ARIMA procedure, 257 AUTOREG procedure, 355, 381, 382 conditional variance, 383 FORECAST procedure, 807 PANEL procedure, 1279 PDLREG procedure, 1359 SIMLIN procedure, 1513, 1518 STATESPACE procedure, 1597, 1601 structural, 356, 381, 1359 SYSLIN procedure, 1640 transformed models, 1061 predicting conditional variance, 383 prediction error covariance VARMAX procedure, 1897, 1930, 1932 prediction errors autocorrelations, 2430 plotting, 2429 residuals, 2498 stationarity, 2431 predictions smoothing models, 2670 predictive Chow test, 344, 376 predictive Chow tests, 1081 predictor variables forecasting models, 2511 independent variables, 2511 inputs, 2511 interventions, 2685 seasonal dummies, 2687 specifying, 2511 trend curves, 2684 Present Value Analysis, see Time Value Analysis present worth of cost LOAN procedure, 852 prewhitening ARIMA procedure, 246, 247 principal component, 1041 PRINT procedure, 49 printing SAS data sets, 49 printed output ARIMA procedure, 269
AUTOREG procedure, 385 PANEL procedure, 1323 SIMLIN procedure, 1523 STATESPACE procedure, 1603 SYSLIN procedure, 1657 X11 procedure, 2074 printing SAS data sets, 49 printing SAS data sets, see PRINT procedure probability functions, 51 PROBDF Function Dickey-Fuller test, 158 Financial Functions, 158 significance probabilities, 158 significance probabilities for Dickey-Fuller tests, 158 PROBDF function defined, 158 probit QLIM Procedure, 1376 produced by SAS/ETS procedures output data sets, 82 Producer Price Index Survey, see DATASOURCE procedure producing forecasts, 2404, 2623 producing forecasts, 2623 program flow COMPUTAB procedure, 448 program listing MODEL procedure, 1169 program variables MODEL procedure, 1155 programming statements COMPUTAB procedure, 445 Project Management window forecasting project, 2411 properties of the estimates MODEL procedure, 1025 properties of time series, 2453 PROTO procedure, 49 printing SAS data sets, 49 QLIM Procedure, 1376 logit, 1376 probit, 1376 selection, 1376 tobit, 1376 QLIM procedure Bivariate Limited Dependent Variable Modeling, 1406 Box-Cox Modeling, 1406 BY groups, 1387 Censored Regression Models, 1399
Frontier, 1403 Heteroscedasticity, 1405 Limited Dependent Variable Models, 1399 Multivariate Limited Dependent Models, 1409 Ordinal Discrete Choice Modeling, 1396 Output, 1411 output table names, 1415 Selection Models, 1408 syntax, 1382 Tests on Parameters, 1410 Truncated Regression Models, 1402 Types of Tobit Model, 1400 quadratic trend curves, 2684 quadratic trend, 2684 quadrature spectrum cross-spectral analysis, 1553 SPECTRA procedure, 1553 quasi-Newton optimization methods, 350, 490, 896 quasi-Newton method, 350, 490, 896 AUTOREG procedure, 342 quasi-random number generators MODEL procedure, 1130 R convergence measure, 1028 R square statistic statistics of fit, 1819 R squared MODEL procedure, 1027, 1034 R-square measure PANEL procedure, 1316 R-square statistic statistics of fit, 2689 SYSLIN procedure, 1651 R-squared measure, 1316 ramp interventions, 2686 ramp function, see ramp interventions ramp interventions, 2686 ramp function, 2685 Ramsey's test, see RESET test random effects model one-way, 1291 two-way, 1294 random number functions, 51 random walk model AUTOREG procedure, 397 random walk R-square statistics of fit, 1820, 2689 random-number functions functions, 1159 random-number generating functions
MODEL procedure, 1159 random-walk with drift tests, 231 range of output observations EXPAND procedure, 736 RANGE= option in the LIBNAME statement SASEFAME engine, 2307 RANK procedure, 49 order statistics, 49 rate adjustment cases LOAN procedure, 846 rates contrasted with stocks or levels, 724 ratio operators, 755 rational transfer functions ARIMA procedure, 219 reading time series data, 66, 127 reading data files DATASOURCE procedure, 523 reading from a FAME data base SASEFAME engine, 2290 reading from a HAVER DLX database SASEHAVR engine, 2340 reading from CRSP data files SASECRSP engine, 2194 reading, with DATA step time series data, 125, 126 recommended for time series ID formats, 71 recursive residuals, 356, 368 reduced form coefficients SIMLIN procedure, 1519, 1524, 1528 SYSLIN procedure, 1653 reference forecasting models, 2508 SGPLOT procedure, 88 regression model with ARMA errors ARIMA procedure, 213, 214 regressor selection, 2627 regressors forecasting models, 2518 specifying, 2518 relation to ARMA models state space models, 1599 Remote FAME Access, Using FAME CHLI SASEFAME engine, 2292 remote monitoring NLO system, 181 RENAME in the DATA step SASEFAME engine, 2304 renaming SAS data sets, 49 renaming variables DATASOURCE procedure, 532, 547
replacing with missing values omitted observations, 104 represented by different series cross sectional dimensions, 79 represented with BY groups cross-sectional dimensions, 79 reserved words COMPUTAB procedure, 456 RESET test, 344 Ramsey's test, 344 RESID. variables, 1054, 1059, 1155 residual plotting, 91 residual analysis, 2498 residuals, see prediction errors ARIMA procedure, 257 AUTOREG procedure, 356 FORECAST procedure, 807 PANEL procedure, 1280 PDLREG procedure, 1359 SIMLIN procedure, 1518 STATESPACE procedure, 1601 structural, 356, 1359 SYSLIN procedure, 1640 restarting the SASEFAME engine SASEFAME engine, 2290 RESTRICT statement, 352, 648, 999 restricted estimates STATESPACE procedure, 1587 restricted estimation, 352 linear models, 648 nonlinear models, 975, 999, 1076 PDLREG procedure, 1360 SYSLIN procedure, 1641, 1642 restricted vector autoregression, 1098 restrictions on parameters MODEL procedure, 1098 RETAIN statement computing lags, 108 RETAIN statement and differencing, 108 lags, 108 root mean square error statistics of fit, 1819, 2689 row blocks COMPUTAB procedure, 456 ROWxxxxx: label COMPUTAB procedure, 446 RPC convergence measure, 1028 S convergence measure, 1028 S matrix definition, 1008 MODEL procedure, 1026
S matrix used in estimation, 1026 S-iterated methods MODEL procedure, 1027 sample cross covariances VARMAX procedure, 1897, 1935 sample cross-correlations VARMAX procedure, 1896, 1935 sample data sets, 2380, 2393 sampling frequency changing by interpolation, 125 of time series, 71, 84, 125 time intervals and, 84 sampling frequency of time series data, 84, 125 sampling frequency of time series time intervals, 84, 125 SAS and CRSP Dates SASECRSP engine, 2208 SAS catalogs, see CATALOG procedure SAS data sets contents of, 49 copying, 49 DATA step, 49 moving between computer systems, 49 printing, 49 renaming, 49 sorting, 49 structured query language, 50 summarizing, 49, 50 transposing, 50 SAS data sets and time series data, 65 SAS DATA step SASECRSP engine, 2194 SASEFAME engine, 2291 SASEHAVR engine, 2341 SAS Date Format SASECRSP engine, 2208 SAS language features for time series data, 64 SAS macros BOXCOXAR macro, 150 DFPVALUE macro, 153 DFTEST macro, 154 LOGTEST macro, 156 macros, 149 SAS options statement, using VALIDVARNAME=ANY SASEFAME engine, 2299, 2304 SAS output data set SASECRSP engine, 2206 SASEFAME engine, 2297 SASEHAVR engine, 2346 SAS representation for
date values, 68 datetime values, 69 SAS Risk Products, 60 SAS source statements, 2582 SAS YEARCUTOFF= option DATASOURCE procedure, 544 SAS/ETS procedures using OUTPUT statement, 83 SAS/GRAPH software, 52 graphics, 52 SAS/HPF, 51 SAS/IML software, 54 IML, 54 matrix language, 54 SAS/OR software, 55 operations research, 55 SAS/QC software, 56 statistical quality control, 56 SAS/STAT software, 53 SASECRSP engine @CRSPDB Date Informats, 2209 @CRSPDR Date Informats, 2209 @CRSPDT Date Informats, 2209 CONTENTS procedure, 2194 Converting Dates Using the CRSP Date Functions, 2208 CRSP and SAS Dates, 2208 CRSP Date Formats, 2208 CRSP Date Functions, 2208 CRSP Date Informats, 2209 CRSP Integer Date Format, 2208 CRSPDB_SASCAL environment variable, 2194 CRSPDCI Date Functions, 2210 CRSPDCS Date Functions, 2210 CRSPDI2S Date Function, 2210 CRSPDIC Date Functions, 2210 CRSPDS2I Date Function, 2210 CRSPDSC Date Functions, 2210 CRSPDT Date Formats, 2208 Environment variable, CRSPDB_SASCAL, 2194 LIBNAME statement, 2190 reading from CRSP data files, 2194 SAS and CRSP Dates, 2208 SAS DATA step, 2194 SAS Date Format, 2208 SAS output data set, 2206 SETID option, 2194 SQL procedure, creating a view, 2194 SASEFAME engine CONTENTS procedure, 2291 convert option, 2291 creating a FAME view, 2289
DOT as a GLUE character, 2294 DRI data files in FAME.db, 2289 DRI/McGraw-Hill data files in FAME.db, 2289 DROP in the DATA step, 2304 FAME data files, 2289 FAME glue symbol named DOT, 2299 FAME Information Services Databases, 2289 fatal error when reading from a FAME data base, 2290 finishing the FAME CHLI, 2290 GLUE symbol, 2294 KEEP in the DATA step, 2304 LIBNAME libref SASEHAVR physical name on UNIX, 2299 LIBNAME libref SASEHAVR physical name on Windows, 2299 LIBNAME interface engine for FAME databases, 2289 LIBNAME statement, 2290 main economic indicators (OECD) data files in FAME.db, 2289 national accounts data files (OECD) in FAME.db, 2289 OECD data files in FAME.db, 2289 Organization for Economic Cooperation and Development data files in FAME.db, 2289 Physical Names on Supported hosts, 2299 Physical path name syntax for variety of environments, 2299 RANGE= option in the LIBNAME statement, 2307 reading from a FAME data base, 2290 Remote FAME Access, Using FAME CHLI, 2292 RENAME in the DATA step, 2304 restarting the SASEFAME engine, 2290 SAS DATA step, 2291 SAS options statement, using VALIDVARNAME=ANY, 2299, 2304 SAS output data set, 2297 Special characters in SAS Variable names, the glue symbol DOT, 2299 SQL procedure, using clause, 2291 SQL procedure, creating a view, 2291 Supported hosts, 2290 Using CROSSLIST= option to create a view, 2292 Using INSET= option with the CROSSLIST= option to create a view, 7, 2292
Using RANGE= option to create a view, 2292 Using WHERE clause with INSET= option to create a view, 2292 Using WILDCARD= option to create a view, 2292 VALIDVARNAME=ANY, SAS option statement, 2299, 2304 viewing a FAME database, 2289 WHERE in the DATA step, 2307 SASEHAVR engine creating a Haver view, 2339 frequency option, 2340 Haver data files, 2339 Haver Information Services Databases, 2339 LIBNAME interface engine for Haver databases, 2339 LIBNAME statement, 2340 reading from a HAVER DLX database, 2340 SAS DATA step, 2341 SAS output data set, 2346 viewing a Haver database, 2339 SASHELP library, 2393 saving and restoring forecasting project, 2412 Savings, 2718 SBC, see Schwarz Bayesian criterion, see Schwarz Bayesian information criterion scale operators, 754 SCAN (Smallest Canonical) correlation method, 244 Schwarz Bayesian criterion ARIMA procedure, 251 AUTOREG procedure, 370 SBC, 251 Schwarz Bayesian information criterion BIC, 2689 SBC, 2689 statistics of fit, 2689 seasonal adjustment time series data, 2034, 2103 X11 procedure, 2034, 2040 X12 procedure, 2103 seasonal ARIMA model notation, 2681 Seasonal ARIMA model options, 2631 seasonal component X11 procedure, 2034 X12 procedure, 2103 seasonal dummies, 2687 predictor variables, 2687 seasonal dummy variables forecasting models, 2539 specifying, 2539
seasonal exponential smoothing, 2676 smoothing models, 2676 seasonal forecasting additive Winters method, 803 FORECAST procedure, 799, 803 WINTERS method, 799 seasonal model ARIMA model, 212 ARIMA procedure, 212 seasonal transfer function notation, 2683 seasonal unit root test, 246 seasonality FORECAST procedure, 805 testing for, 154 seasonality test, 2688 seasonality tests, 2076 seasonality, testing for DFTEST macro, 154 second difference DIF function, 109 differencing, 109 See ordinary differential equations differential equations, 1070 seemingly unrelated regression, 1010 cross-equation covariance matrix, 1011 joint generalized least squares, 1614 SUR estimation method, 1614 SYSLIN procedure, 1622, 1649 Zellner estimation, 1614 Seidel method MODEL procedure, 1142 Seidel method with General Form Equations MODEL procedure, 1142 selecting from a list forecasting models, 2457 selection QLIM Procedure, 1376 selection criterion, 2610 sequence operators, 752 serial correlation correction AUTOREG procedure, 316 series autocorrelations, 2495 series adjustments, 2667 series diagnostics, 2453, 2632, 2687 series selection, 2633 series transformations, 2496 set operators, 753 SETID option SASECRSP engine, 2194 SETMISS operator, 749 SGMM simulated generalized method of moments, 1016
SGPLOT procedure plot axis for time series, 88 plotting time series, 87 reference, 88 time series data, 87 Shapiro-Wilk test, 1048 normality tests, 1048 sharing forecasting project, 2416 Shewhart control charts, 56 shifted time intervals, 131 shifted intervals, see time intervals, shifted significance probabilities Dickey-Fuller test, 158 PROBDF Function, 158 unit root, 158 significance probabilities for Dickey-Fuller test, 153 significance probabilities for Dickey-Fuller tests PROBDF Function, 158 SIMILARITY procedure BY groups, 1450 ODS graph names, 1483 SIMLIN procedure BY groups, 1516 dynamic models, 1512, 1513, 1519, 1534 dynamic multipliers, 1519, 1520 dynamic simulation, 1513 EST= data set, 1521 ID variables, 1517 impact multipliers, 1519, 1524 initializing lags, 1522 interim multipliers, 1515, 1520, 1523, 1524 lags, 1522 linear structural equations, 1519 multipliers, 1515, 1516, 1519, 1520, 1523, 1524 multipliers for higher order lags, 1520, 1534 output data sets, 1522, 1523 output table names, 1525 predicted values, 1513, 1518 printed output, 1523 reduced form coefficients, 1519, 1524, 1528 residuals, 1518 simulation, 1513 statistics of fit, 1524 structural equations, 1519 structural form, 1519 total multipliers, 1516, 1520, 1523, 1524 TYPE=EST data set, 1519 SIMNLIN procedure, see MODEL procedure simple data set, 2407
simple exponential smoothing, 2672 smoothing models, 2672 simulated method of moments GMM, 1016 simulated nonlinear least squares MODEL procedure, 1019 simulating ARIMA model, 2560, 2653 Simulating from a Mixture of Distributions examples, 1225 simulation MODEL procedure, 1120 of time series, 2560, 2653 SIMLIN procedure, 1513 time series, 2560, 2653 simultaneous equation bias, 1009 SYSLIN procedure, 1615 single equation estimators SYSLIN procedure, 1648 single exponential smoothing, see exponential smoothing sliding spans analysis, 2060 Smallest Canonical (SCAN) correlation method, 244 SMM, 1016 GMM, 1016 SMM simulated method of moments, 1016 smoothing equations, 2669 smoothing models, 2669 smoothing model specification, 2639, 2641 smoothing models calculations, 2669 damped-trend exponential smoothing, 2675 double exponential smoothing, 2673 exponential smoothing, 2669 forecasting models, 2462, 2669 initializations, 2670 linear exponential smoothing, 2674 missing values, 2670 multiplicative seasonal smoothing, 2677 predictions, 2670 seasonal exponential smoothing, 2676 simple exponential smoothing, 2672 smoothing equations, 2669 smoothing state, 2669 smoothing weights, 2671 specifying, 2462 standard errors, 2672 underlying model, 2669 Winters Method, 2678, 2679 smoothing state, 2669 smoothing models, 2669 smoothing weights, 2641, 2671 additive-invertible region, 2671
boundaries, 2671 FORECAST procedure, 803 optimizations, 2671 smoothing models, 2671 specifications, 2671 weights, 2671 solution mode output MODEL procedure, 1132 solution modes MODEL procedure, 1117, 1140 SOLVE Data Sets MODEL procedure, 1148 SORT procedure, 49 sorting, 49 sorting, see SORT procedure forecasting models, 2504, 2581 SAS data sets, 49 time series data, 72 sorting by ID variables, 72 Special characters in SAS Variable names, the glue symbol DOT SASEFAME engine, 2299 specification tests PANEL procedure, 1317 specifications smoothing weights, 2671 specifying adjustments, 2522 ARIMA models, 2465 combination models, 2482 custom models, 2472 dynamic regression, 2523 Factored ARIMA models, 2468 forecasting models, 2453 interventions, 2527 level shifts, 2532 predictor variables, 2511 regressors, 2518 seasonal dummy variables, 2539 smoothing models, 2462 state space models, 1578 time ID variable, 2648 trend changes, 2530 trend curves, 2515 SPECTRA procedure BY groups, 1546 Chirp-Z algorithm, 1549 coherency of cross-spectrum, 1553 cospectrum estimate, 1553 cross-periodogram, 1553 cross-spectral analysis, 1542, 1553 cross-spectrum, 1553 fast Fourier transform, 1549
finite Fourier transform, 1542 Fourier coefficients, 1553 Fourier transform, 1542 frequency, 1552 kernels, 1549 output data sets, 1552 output table names, 1554 periodogram, 1542, 1553 quadrature spectrum, 1553 spectral analysis, 1542 spectral density estimate, 1542, 1553 spectral window, 1547 white noise test, 1551, 1554 spectral analysis SPECTRA procedure, 1542 spectral density estimate SPECTRA procedure, 1542, 1553 spectral window SPECTRA procedure, 1547 SPLINE method EXPAND procedure, 739 splitting series time series data, 118 splitting time series data sets, 118 SQL procedure, 50 structured query language, 50 SQL procedure, creating a view SASECRSP engine, 2194 SQL procedure, using clause SASEFAME engine, 2291 SQL procedure, creating a view SASEFAME engine, 2291 square root transformations, 2667 square root transformation, see transformations stable seasonality test, 2076 standard errors smoothing models, 2672 standard form of time series data set, 76 standard form of time series data, 76 STANDARD procedure, 50 standardized values, 50 standardized values, see STANDARD procedure starting dates of time intervals, 100, 101 starting values GARCH model, 342 MODEL procedure, 1031, 1038 Stat Studio software, 54 state and area employment, hours, and earnings survey, see DATASOURCE procedure state space model
UCM procedure, 1787 state space models form of, 1568 relation to ARMA models, 1599 specifying, 1578 state vector of, 1568 STATESPACE procedure, 1568 state transition equation of a state space model, 1569 state vector of a state space model, 1568 state vector of state space models, 1568 state-space representation VARMAX procedure, 1913 STATESPACE procedure automatic forecasting, 1568 BY groups, 1586 canonical correlation analysis, 1570, 1593 confidence limits, 1601 differencing, 1587 forecasting, 1568, 1597 ID variables, 1586 Kalman filter, 1570 multivariate forecasting, 1568 multivariate time series, 1568 output data sets, 1601, 1602 output table names, 1605 predicted values, 1597, 1601 printed output, 1603 residuals, 1601 restricted estimates, 1587 state space models, 1568 time intervals, 1585 Yule-Walker equations, 1590 static simulation, 1068 MODEL procedure, 1068 static simulations MODEL procedure, 1117 stationarity and state space models, 1571 ARIMA procedure, 194 nonstationarity, 194 of time series, 210 prediction errors, 2431 testing for, 154 VARMAX procedure, 1941, 1949 stationarity tests, 231, 246, 345 stationarity, testing for DFTEST macro, 154 statistical quality control SAS/QC software, 56 statistics of fit, 1819, 2425, 2434, 2643, 2688 adjusted R-square, 1819, 2689
Akaike's information criterion, 2689 Amemiya's prediction criterion, 2689 Amemiya's R-square, 1820, 2689 corrected sum of squares, 2689 error sum of squares, 2689 goodness of fit, 2434 goodness-of-fit statistics, 1819, 2688 mean absolute error, 2689 mean absolute percent error, 1819, 2689 mean percent error, 2690 mean prediction error, 2690 mean square error, 1819 mean squared error, 2689 nonmissing observations, 2688 number of observations, 2688 R square statistic, 1819 R-square statistic, 2689 random walk R-square, 1820, 2689 root mean square error, 1819, 2689 Schwarz Bayesian information criterion, 2689 SIMLIN procedure, 1524 uncorrected sum of squares, 2689 step interventions, 2685 step function, see step interventions interpolation of time series, 740 intervention model and, 217 step interventions, 2685 step function, 2685 STEP method EXPAND procedure, 740 STEPAR method FORECAST procedure, 797 stepwise autoregression AUTOREG procedure, 327 FORECAST procedure, 774, 797 stochastic simulation MODEL procedure, 1120 stock data files, see DATASOURCE procedure stocks contrasted with flow variables, 724 stored in SAS data sets time series data, 75 storing programs MODEL procedure, 1167 structural predicted values, 356, 381, 1359 residuals, 356, 1359 structural change Chow test for, 343 structural equations SIMLIN procedure, 1519 structural form
SIMLIN procedure, 1519 structural predictions AUTOREG procedure, 381 structured query language, see SQL procedure SAS data sets, 50 subset model ARIMA model, 212 ARIMA procedure, 212 AUTOREG procedure, 329 subsetting data, see WHERE statement subsetting data files DATASOURCE procedure, 523, 536 summarizing SAS data sets, 49, 50 summary of time intervals, 132 summary statistics MODEL procedure, 1135 summation higher order sums, 114 multiperiod lags and, 114 of time series, 113 summation of time series data, 113, 114 Supported hosts SASEFAME engine, 2290 SUR estimation method, see seemingly unrelated regression Switching Regression example examples, 1221 syntax for date values, 68 datetime values, 69 time intervals, 84 time values, 69 SYSLIN procedure Basmann test, 1639, 1654 BY groups, 1637 endogenous variables, 1616 exogenous variables, 1616 full information maximum likelihood, 1624, 1649 Fuller's modification to LIML, 1654 instrumental variables, 1616 iterated seemingly unrelated regression, 1649 iterated three-stage least squares, 1649 jointly dependent variables, 1616 K-class estimation, 1648 lagged endogenous variables, 1616 limited information maximum likelihood, 1648 minimum expected loss estimator, 1648 ODS graph names, 1660
output data sets, 1655, 1656 output table names, 1659 over identication restrictions, 1654 predetermined variables, 1616 predicted values, 1640 printed output, 1657 R-square statistic, 1651 reduced form coefcients, 1653 residuals, 1640 restricted estimation, 1641, 1642 seemingly unrelated regression, 1622, 1649 simultaneous equation bias, 1615 single equation estimators, 1648 system weighted MSE, 1651 system weighted R-square, 1651, 1657 tests of hypothesis, 1643, 1644 three-stage least squares, 1622, 1649 two-stage least squares, 1619, 1648 SYSNLIN procedure, see MODEL procedure system weighted MSE SYSLIN procedure, 1651 system weighted R-square SYSLIN procedure, 1651, 1657 systems of ordinary differential equations (ODEs), 1215 systems of differential equations examples, 1215 systems of ordinary differential equations MODEL procedure, 1215 t distribution GARCH model, 367 table cells, direct access to COMPUTAB procedure, 456 table names UCM procedure, 1810 TABULATE procedure, 50 tabulating data, 50 tabulating data, see TABULATE procedure taxes LOAN procedure, 852 tentative order selection VARMAX procedure, 1935 test of hypotheses nonlinear models, 1005 TEST statement, 353 testing for heteroscedasticity, 329 seasonality, 154 stationarity, 154 unit root, 154 testing order of differencing, 154
testing over-identifying restrictions, 1015 tests of hypothesis SYSLIN procedure, 1643, 1644 tests of parameters, 353, 649, 1005 tests on parameters MODEL procedure, 1078 The D-method example, 1221 three-stage least squares, 1011 3SLS estimation method, 1614 SYSLIN procedure, 1622, 1649 time functions, 95 time ID creation, 2644, 2646, 2647 time ID variable, 2389 creating, 2439 ID variable, 2389 observation numbers, 2443 specifying, 2648 time intervals, 2395 alignment of, 132 ARIMA procedure, 259 calendar calculations and, 104 ceiling of, 102 checking data periodicity, 103 counting, 99, 102 data frequency, 2384 date values, 129 datetime values, 129 ending dates of, 101 examples of, 135 EXPAND procedure, 735 EXPAND procedure and, 124 FORECAST procedure, 796 frequency of data, 2384 functions, 143 functions for, 98, 143 ID values for, 100 incrementing dates by, 98, 99 INTCK function and, 99, 102 INTERVAL= option and, 84 intervals, 84 INTNX function and, 98 midpoint dates of, 101 naming, 84, 130 periodicity of time series, 84, 125 plot axis and, 88 plot reference lines and, 88 sampling frequency of time series, 84, 125 shifted, 131 starting dates of, 100, 101 STATESPACE procedure, 1585 summary of, 132 syntax for, 84 UCM procedure, 1767
use with SAS/ETS procedures, 85 VARMAX procedure, 1892 widths of, 101, 735 time intervals and calendar calculations, 104 date values, 100 frequency, 84, 125 sampling frequency, 84 time intervals, functions interval functions, 98 time intervals, shifted shifted intervals, 131 time range DATASOURCE procedure, 544 time range of data DATASOURCE procedure, 526 time ranges, 2506, 2575, 2649 of time series, 77 time ranges of time series data, 77 time series definition, 2380 diagnostic tests, 2453 in SAS data sets, 2380 simulation, 2560, 2653 time series cross sectional form TSCSREG procedure and, 80 time series cross-sectional form BY groups and, 79 ID variables for, 79 of time series data set, 79 TSCSREG procedure and, 1727 time series cross-sectional form of time series data set, 79 time series data aggregation of, 721, 724 changing periodicity, 125, 721 converting frequency of, 721 differencing, 105–112 distribution of, 724 embedded missing values in, 78 giving dates to, 67 ID variable for, 67 interpolation, 125 interpolation of, 124, 125, 723 lagging, 105–112 leads, 112 merging series, 119 missing values, 723 missing values and, 77, 78 omitted observations in, 78 overlay plot of, 88 percent change calculations, 110–112 periodicity of, 84, 125
PLOT procedure, 92 plotting, 86 reading, 66, 127 reading, with DATA step, 125, 126 sampling frequency of, 84, 125 SAS data sets and, 65 SAS language features for, 64 seasonal adjustment, 2034, 2103 SGPLOT procedure, 87 sorting, 72 splitting series, 118 standard form of, 76 stored in SAS data sets, 75 summation of, 113, 114 time ranges of, 77 Time Series Viewer, 86 transformation of, 726, 742 transposing, 119, 121 time series data and missing values, 77 time series data set interleaved form of, 80 time series cross-sectional form of, 79 time series forecasting, 2651 Time Series Forecasting System invoking, 2546 invoking from SAS/AF and SAS/EIS applications, 2546 running in unattended mode, 2546 time series methods FORECAST procedure, 786 time series variables DATASOURCE procedure, 524, 549 Time Series Viewer, 2419, 2491, 2654 graphs, 2419 invoking, 2545 plots, 2419 saving graphs and tables, 2628, 2630 time series data, 86 Time Series Viewer procedure plotting time series, 86 time trend models FORECAST procedure, 784 Time Value Analysis, 2766 time values defined, 69 formats, 142 functions, 143 informats, 136 syntax for, 69 time variable, 1074 MODEL procedure, 1074 time variables computing from datetime values, 98
introduced, 95 TIMEPLOT procedure, 50, 93 TIMESERIES procedure BY groups, 1688 ODS graph names, 1714 to higher frequency interpolation, 125 to lower frequency interpolation, 125 to SAS/ETS software menu interfaces, 45, 46 to standard form transposing time series, 119, 121 tobit QLIM Procedure, 1376 Toeplitz matrix AUTOREG procedure, 358 total multipliers SIMLIN procedure, 1516, 1520, 1523, 1524 trading-day component X11 procedure, 2034, 2040 transfer function model ARIMA procedure, 213, 218, 251 denominator factors, 218 numerator factors, 218 transfer functions, 2682 forecasting models, 2682 transformation of time series data, 726, 742 transformation of time series EXPAND procedure, 726, 742 transformations, 2637 Box Cox, 2667 Box Cox transformation, 2667 log, 2667 log transformation, 2667 logistic, 2667 square root, 2667 square root transformation, 2667 transformed models predicted values, 1061 transition equation of a state space model, 1569 transition matrix of a state space model, 1569 TRANSPOSE procedure, 50, 119, 121, 122, 126 transposing SAS data sets, 50 TRANSPOSE procedure and transposing time series, 119 transposing SAS data sets, 50 time series data, 119, 121 transposing SAS data sets, see TRANSPOSE procedure
transposing time series cross-sectional dimensions, 121 from interleaved form, 119 from standard form, 122 to standard form, 119, 121 TRANSPOSE procedure and, 119 trend changes specifying, 2530 trend curves, 2684 cubic, 2684 exponential, 2685 forecasting models, 2515 hyperbolic, 2685 linear, 2684 logarithmic, 2685 logistic, 2684 power curve, 2685 predictor variables, 2684 quadratic, 2684 specifying, 2515 trend cycle component X11 procedure, 2034, 2040 trend test, 2688 TRIM operator, 748 TRIMLEFT operator, 748 TRIMRIGHT operator, 748 triple exponential smoothing, see exponential smoothing troubleshooting estimation convergence problems MODEL procedure, 1030 troubleshooting simulation problems MODEL procedure, 1143 true interest rate LOAN procedure, 852 Truncated Regression Models QLIM procedure, 1402 trust region optimization methods, 350, 490, 896 trust region method, 350, 490, 896 AUTOREG procedure, 342 TSCSREG procedure BY groups, 1735 estimation techniques, 1730 ID variables, 1735 output table names, 1738 panel data, 1727 TSCSREG procedure and time series cross sectional form, 80 time series cross-sectional form, 1727 TSVIEW command, 2545 two-stage least squares, 1009 2SLS estimation method, 1614 SYSLIN procedure, 1619, 1648 two-step full transform method
AUTOREG procedure, 361 two-way fixed effects model, 1285 random effects model, 1294 two-way fixed effects model PANEL procedure, 1285 two-way fixed-effects model, 1285 two-way random effects model PANEL procedure, 1294 two-way random-effects model, 1294 type of input data file DATASOURCE procedure, 538 _TYPE_ variable and interleaved time series, 80, 81 overlay plots, 90 TYPE=EST data set SIMLIN procedure, 1519 types of loans LOAN procedure, 828 Types of Tobit Model QLIM procedure, 1400 U.S. Bureau of Economic Analysis data files DATASOURCE procedure, 590 U.S. Bureau of Labor Statistics data files DATASOURCE procedure, 591 UCM procedure BY groups, 1760 ODS graph names, 1813 ODS Graphics, 1754 ODS table names, 1810 parameters, 1757–1768, 1770–1780 state space model, 1787 Statistical Graphics, 1800 syntax, 1750 table names, 1810 time intervals, 1767 unattended mode, 2546 unconditional forecasts ARIMA procedure, 258 unconditional least squares AR initial conditions, 1091 MA Initial Conditions, 1092 uncorrected sum of squares statistics of fit, 2689 underlying model smoothing models, 2669 Uniform Periodic Equivalent, 2770 unit root Dickey-Fuller test, 158 of a time series, 154 significance probabilities, 158 testing for, 154 unit roots
KPSS test, 378 Phillips-Perron test, 345, 346, 376 univariate autoregression, 1093 univariate model diagnostic checks VARMAX procedure, 1958 univariate moving average models, 1099 UNIVARIATE procedure, 50, 1219 descriptive statistics, 50 unlinking viewer windows, 2495 unrestricted vector autoregression, 1095 use with SAS/ETS procedures time intervals, 85 used for state space modeling Kalman filter, 1570 used to select state space models Akaike information criterion, 1591 vector autoregressive models, 1590 Yule-Walker estimates, 1590 Using CROSSLIST= option to create a view SASEFAME engine, 2292 Using INSET= option with the CROSSLIST= option to create a view SASEFAME engine, 7, 2292 using models to forecast MODEL procedure, 1120 Using RANGE= option to create a view SASEFAME engine, 2292 using solution modes MODEL procedure, 1117 Using WHERE clause with INSET= option to create a view SASEFAME engine, 2292 Using WILDCARD= option to create a view SASEFAME engine, 2292 V matrix Generalized Method of Moments, 1012, 1017 VALIDVARNAME=ANY, SAS option statement SASEFAME engine, 2299, 2304 variable list DATASOURCE procedure, 547 variables in model program MODEL procedure, 1151 variance components Fuller Battese, 1292 Nerlove, 1294 Wallace Hussain, 1293 Wansbeek Kapteyn's, 1292 VARMAX procedure Akaike Information Criterion, 1957 asymptotic distribution of impulse response functions, 1943, 1951
asymptotic distribution of the parameter estimation, 1951 Bayesian vector autoregressive models, 1904, 1947 cointegration, 1959 cointegration testing, 1902, 1963 common trends, 1959 common trends testing, 1903, 1961 computational details, 2001 confidence limits, 1988 convergence problems, 2001 covariance stationarity, 1983 CPU requirements, 2002 decomposition of prediction error covariance, 1897, 1933 Dickey-Fuller test, 1901 differencing, 1894 dynamic simultaneous equation models, 1916 example of Bayesian VAR modeling, 1866 example of Bayesian VECM modeling, 1873 example of causality testing, 1881 example of cointegration testing, 1869 example of multivariate GARCH modeling, 1983 example of restricted parameter estimation and testing, 1878 example of VAR modeling, 1859 example of VARMA modeling, 1952 example of vector autoregressive modeling with exogenous variables, 1874 example of vector error correction modeling, 1868 forecasting, 1930 forecasting of Bayesian vector autoregressive models, 1948 Granger causality test, 1944 impulse response function, 1898, 1919 infinite order AR representation, 1898 infinite order MA representation, 1898, 1919 invertibility, 1949 long-run relations testing, 1972 memory requirements, 2002 minimum information criteria method, 1940 missing values, 1912 multivariate GARCH Modeling, 1907 multivariate model diagnostic checks, 1957 ODS graph names, 2000 Output Data Sets, 1987 partial autoregression coefficient, 1899, 1936 partial canonical correlation, 1899, 1939 partial correlation, 1938
prediction error covariance, 1897, 1930, 1932 sample cross covariances, 1897, 1935 sample cross-correlations, 1896, 1935 state-space representation, 1913 stationarity, 1941, 1949 tentative order selection, 1935 time intervals, 1892 univariate model diagnostic checks, 1958 vector autoregressive models, 1941 vector autoregressive models with exogenous variables, 1944 vector autoregressive moving-average models, 1912, 1949 vector error correction models, 1906, 1962 weak exogeneity testing, 1974 Yule-Walker estimates, 1899 vector autoregressive models, 1098 used to select state space models, 1590 VARMAX procedure, 1941 vector autoregressive models with exogenous variables VARMAX procedure, 1944 vector autoregressive moving-average models VARMAX procedure, 1912, 1949 vector error correction models VARMAX procedure, 1906, 1962 vector moving average models, 1101 viewing a FAME database, see SASEFAME engine viewing a Haver database, see SASEHAVR engine viewing time series, 2419 Wald test linear hypotheses, 650 nonlinear hypotheses, 1006, 1078, 1410 Wallace Hussain variance components, 1293 Wansbeek Kapteyn's variance components, 1292 weak exogeneity testing VARMAX procedure, 1974 _WEIGHT_ variable MODEL procedure, 1052 weights, see smoothing weights WHERE in the DATA step SASEFAME engine, 2307 WHERE statement subsetting data, 50 white noise test SPECTRA procedure, 1551, 1554 white noise test of the residuals, 233 white noise test of the series, 232
White's test, 1050 heteroscedasticity tests, 1050 widths of time intervals, 101, 735 WINTERS method seasonal forecasting, 799 Winters Method, 2678, 2679 Holt-Winters Method, 2678 smoothing models, 2678, 2679 Winters method FORECAST procedure, 774, 799 Holt-Winters method, 803 X-11 ARIMA methodology X11 procedure, 2058 X-11 seasonal adjustment method, see X11 procedure X-11-ARIMA seasonal adjustment method, see X11 procedure X-12 seasonal adjustment method, see X12 procedure X-12-ARIMA seasonal adjustment method, see X12 procedure X11 procedure BY groups, 2046 Census X-11 method, 2034 Census X-11 methodology, 2059 data requirements, 2064 differences with X11ARIMA/88, 2058 ID variables, 2046, 2048 irregular component, 2034, 2040 model selection for X-11-ARIMA method, 2068 output data sets, 2071, 2072 output table names, 2085 printed output, 2074 seasonal adjustment, 2034, 2040 seasonal component, 2034 trading-day component, 2034, 2040 trend cycle component, 2034, 2040 X-11 ARIMA methodology, 2058 X-11 seasonal adjustment method, 2034 X-11-ARIMA seasonal adjustment method, 2034 X12 procedure BY groups, 2111 Census X-12 method, 2102 ID variables, 2111 INPUT variables, 2113 seasonal adjustment, 2103 seasonal component, 2103 X-12 seasonal adjustment method, 2102 X-12-ARIMA seasonal adjustment method, 2102
year-over-year percent change calculations, 110 yearly averages percent change calculations, 111 Yule-Walker AR initial conditions, 1091 Yule-Walker equations AUTOREG procedure, 358 STATESPACE procedure, 1590 Yule-Walker estimates AUTOREG procedure, 357 used to select state space models, 1590 VARMAX procedure, 1899 Yule-Walker method as generalized least-squares, 361 Zellner estimation, see seemingly unrelated regression Zellner's two-stage method PANEL procedure, 1301 zooming graphs, 2493
Syntax Index
2SLS option FIT statement (MODEL), 986, 1009 PROC SYSLIN statement, 1635 3SLS option FIT statement (MODEL), 986, 1011, 1111 PROC SYSLIN statement, 1635 A option PROC SPECTRA statement, 1545 A= option FIXED statement (LOAN), 841 ABORT, 1166 ABS function, 1159 ACCEPTDEFAULT option AUTOMDL statement (X12), 2120 ACCUMULATE= option FORECAST statement (ESM), 689 ID statement (ESM), 692 ID statement (SIMILARITY), 1451 ID statement (TIMESERIES), 1694 INPUT statement (SIMILARITY), 1454 TARGET statement (SIMILARITY), 1456 VAR statement (TIMESERIES), 1698 ADDITIVE option MONTHLY statement (X11), 2047 QUARTERLY statement (X11), 2052 ADDMAXIT= option MODEL statement (MDC), 894 ADDRANDOM option MODEL statement (MDC), 894 ADDVALUE option MODEL statement (MDC), 895 ADF= option ARM statement (LOAN), 846 ADJMEAN option PROC SPECTRA statement, 1545 ADJSMMV option FIT statement (MODEL), 985 ADJUST statement X12 procedure, 2114 ADJUSTFREQ= option ARM statement (LOAN), 846 ALIGN= option FORECAST statement (ARIMA), 143, 237 ID statement (ENG), 143 ID statement (ESM), 143, 693 ID statement (HPF), 143 ID statement (HPFDIAGNOSE), 143 ID statement (HPFEVENTS), 143 ID statement (SIMILARITY), 143, 1452 ID statement (TIMESERIES), 143, 1694 ID statement (UCM), 143, 1767 ID statement (VARMAX), 143, 1892 PROC DATASOURCE statement, 143, 538 PROC EXPAND statement, 143, 729, 734 PROC FORECAST statement, 143, 790 ALL option COMPARE statement (LOAN), 848 MODEL statement (AUTOREG), 342 MODEL statement (MDC), 896 MODEL statement (PDLREG), 1356 MODEL statement (SYSLIN), 1639 PROC SYSLIN statement, 1636 TEST statement (ENTROPY), 650 TEST statement (MODEL), 1006 TEST statement (PANEL), 1281 TEST statement (QLIM), 1395 ALPHA= option FORECAST statement (ARIMA), 237 FORECAST statement (ESM), 689 FORECAST statement (UCM), 1766 IDENTIFY statement (ARIMA), 228 MODEL statement (SYSLIN), 1639 OUTLIER statement (ARIMA), 236 OUTPUT statement (VARMAX), 1909 PROC FORECAST statement, 790 PROC SYSLIN statement, 1635 
ALPHA= option OUTLIER statement (UCM), 1773 ALPHACLI= option OUTPUT statement (AUTOREG), 354 OUTPUT statement (PDLREG), 1358 ALPHACLM= option OUTPUT statement (AUTOREG), 354 OUTPUT statement (PDLREG), 1358 ALPHACSM= option OUTPUT statement (AUTOREG), 354 ALTPARM option ESTIMATE statement (ARIMA), 232, 253 ALTW option PROC SPECTRA statement, 1545 AMOUNT= option FIXED statement (LOAN), 841 AMOUNTPCT= option FIXED statement (LOAN), 842 AOCV= option
OUTLIER statement (X12), 2123 APCT= option FIXED statement (LOAN), 842 %AR macro, 1097, 1098 AR option IRREGULAR statement (UCM), 1770 AR= option BOXCOXAR macro, 151 DFTEST macro, 155 ESTIMATE statement (ARIMA), 234 LOGTEST macro, 156 PROC FORECAST statement, 790 ARCHTEST option MODEL statement (AUTOREG), 342 ARCOS function, 1159 ARIMA procedure, 221 syntax, 221 ARIMA procedure, PROC ARIMA statement PLOT option, 224 ARIMA statement X11 procedure, 2043 X12 procedure, 2115 ARM statement LOAN procedure, 845 ARMACV= option AUTOMDL statement (X12), 2120 ARMAX= option PROC STATESPACE statement, 1583 ARSIN function, 1159 ARTEST= option MODEL statement (PANEL), 1276 ASCII option PROC DATASOURCE statement, 538 ASTART= option PROC FORECAST statement, 790 AT= option COMPARE statement (LOAN), 849 ATAN function, 1159 ATOL= option MODEL statement (PANEL), 1276 ATTRIBUTE statement DATASOURCE procedure, 545 AUTOMDL statement X12 procedure, 2118 AUTOREG procedure, 335 syntax, 335 AUTOREG statement UCM procedure, 1757 B option ARM statement (LOAN), 847 BACK= option ESTIMATE statement (UCM), 1763 FORECAST statement (ARIMA), 238
FORECAST statement (UCM), 1766 OUTPUT statement (VARMAX), 1909 PROC ESM statement, 686 PROC STATESPACE statement, 1585 BACKCAST= option ARIMA statement (X11), 2043 BACKLIM= option ESTIMATE statement (ARIMA), 235 BACKSTEP option MODEL statement (AUTOREG), 348 BALANCED option AUTOMDL statement (X12), 2120 BALLOON statement LOAN procedure, 845 BALLOONPAYMENT= option BALLOON statement (LOAN), 845 BANDOPT= option MODEL statement (PANEL), 1276 BASE = option PROC PANEL statement, 1272 BCX option MODEL statement (QLIM), 1392 BESTCASE option ARM statement (LOAN), 847 BI option COMPARE statement (LOAN), 849 BLOCK option PROC MODEL statement, 972, 1178 BLOCKSEASON statement UCM procedure, 1758 BLOCKSIZE= option BLOCKSEASON statement (UCM), 1759 BLUS= option OUTPUT statement (AUTOREG), 354 BOUNDS statement COUNTREG procedure, 490 ENTROPY procedure, 644 MDC procedure, 890 MODEL procedure, 975 QLIM procedure, 1386 BOXCOXAR macro, 151 macro variable, 152 BP option COMPARE statement (LOAN), 849 MODEL statement (PANEL), 1276 BREAKINTEREST option COMPARE statement (LOAN), 849 BREAKPAYMENT option COMPARE statement (LOAN), 849 BREUSCH= option FIT statement (MODEL), 989 BSTART= option PROC FORECAST statement, 790
BTOL= option MODEL statement (PANEL), 1276 BTWNG option MODEL statement (PANEL), 1276, 1277 BUYDOWN statement LOAN procedure, 848 BUYDOWNRATES= option BUYDOWN statement (LOAN), 848 BY statement ARIMA procedure, 227 AUTOREG procedure, 340 COMPUTAB procedure, 446 COUNTREG procedure, 491 ENTROPY procedure, 646 ESM procedure, 689 EXPAND procedure, 731 FORECAST procedure, 795 MDC procedure, 891 MODEL procedure, 976 PANEL procedure, 1271 PDLREG procedure, 1355 QLIM procedure, 1387 SIMILARITY procedure, 1450 SIMLIN procedure, 1516 SPECTRA procedure, 1546 STATESPACE procedure, 1586 SYSLIN procedure, 1637 TIMESERIES procedure, 1688 TSCSREG procedure, 1735 UCM procedure, 1760 VARMAX procedure, 1888 X11 procedure, 2046 X12 procedure, 2111 CANCORR option PROC STATESPACE statement, 1583 CAPS= option ARM statement (LOAN), 846 CAUCHY option ERRORMODEL statement (MODEL), 980 CAUSAL statement VARMAX procedure, 1889 CDEC= option PROC COMPUTAB statement, 439 CDF= option ERRORMODEL statement (MODEL), 981 CELL statement COMPUTAB procedure, 443 CENSORED option ENDOGENOUS statement (QLIM), 1388 MODEL statement (ENTROPY), 647 CENTER option ARIMA statement (X11), 2045 IDENTIFY statement (ARIMA), 228
MODEL statement (AUTOREG), 340 MODEL statement (VARMAX), 1893 PROC SPECTRA statement, 1545 CEV= option OUTPUT statement (AUTOREG), 354 CHAR option COLUMNS statement (COMPUTAB), 440 ROWS statement (COMPUTAB), 442 CHARTS= option MONTHLY statement (X11), 2048 QUARTERLY statement (X11), 2053 CHECKBREAK option LEVEL statement (UCM), 1772 CHICR= option ARIMA statement (X11), 2044 CHISQUARED option ERRORMODEL statement (MODEL), 980 CHOICE= option MODEL statement (MDC), 892 CHOW= option FIT statement (MODEL), 990, 1081 MODEL statement (AUTOREG), 343 CLAG option LAG statement (PANEL), 1275 CLAG statement LAG statement (PANEL), 1275 CLASS statement PANEL procedure, 1272 CLEAR option IDENTIFY statement (ARIMA), 228 CLIMIT= option FORECAST command (TSFS), 2548 CMPMODEL options, 972 COEF option MODEL statement (AUTOREG), 343 PROC SPECTRA statement, 1546 COEF= option HETERO statement (AUTOREG), 351 COINTEG statement VARMAX procedure, 1890, 1973 COINTTEST= option MODEL statement (VARMAX), 1902 COINTTEST=(JOHANSEN) option MODEL statement (VARMAX), 1902 COINTTEST=(JOHANSEN=(IORDER=)) option MODEL statement (VARMAX), 1902, 1980 COINTTEST=(JOHANSEN=(NORMALIZE=)) option MODEL statement (VARMAX), 1903, 1966 COINTTEST=(JOHANSEN=(TYPE=)) option MODEL statement (VARMAX), 1903 COINTTEST=(SIGLEVEL=) option MODEL statement (VARMAX), 1903
COINTTEST=(SW) option MODEL statement (VARMAX), 1903, 1961 COINTTEST=(SW=(LAG=)) option MODEL statement (VARMAX), 1903 COINTTEST=(SW=(TYPE=)) option MODEL statement (VARMAX), 1904 COLLIN option ENTROPY procedure, 641 FIT statement (MODEL), 990 column headings option COLUMNS statement (COMPUTAB), 440 COLUMNS statement COMPUTAB procedure, 440 COMPARE statement LOAN procedure, 848 COMPOUND= option FIXED statement (LOAN), 842 COMPRESS= option TARGET statement (SIMILARITY), 1457 COMPUTAB procedure, 436 syntax, 436 CONDITIONAL OUTPUT statement (QLIM), 1393 CONST= option BOXCOXAR macro, 151 LOGTEST macro, 157 CONSTANT= option OUTPUT statement (AUTOREG), 355 OUTPUT statement (PDLREG), 1359 CONTROL, 1206 CONTROL statement MODEL procedure, 979, 1151 CONVERGE= option ARIMA statement (X11), 2044 ENTROPY procedure, 643 ESTIMATE statement (ARIMA), 235 FIT statement (MODEL), 992, 1028, 1036, 1038 MODEL statement (AUTOREG), 348 MODEL statement (MDC), 892 MODEL statement (PDLREG), 1357 PROC SYSLIN statement, 1635 SOLVE statement (MODEL), 1004 CONVERT statement EXPAND procedure, 732 CONVERT= option LIBNAME statement (SASEFAME), 2293 COPULA= option SOLVE statement (MODEL), 1003 CORR option FIT statement (MODEL), 990 MODEL statement (PANEL), 1277 MODEL statement (TSCSREG), 1736 CORR statement
TIMESERIES procedure, 1689 CORRB option ESTIMATE statement (MODEL), 982 FIT statement (MODEL), 990 MODEL statement, 492 MODEL statement (AUTOREG), 343 MODEL statement (MDC), 896 MODEL statement (PANEL), 1277 MODEL statement (PDLREG), 1356 MODEL statement (SYSLIN), 1639 MODEL statement (TSCSREG), 1736 QLIM procedure, 1385 CORROUT option PROC PANEL statement, 1270 PROC QLIM statement, 1385 PROC TSCSREG statement, 1734 CORRS option FIT statement (MODEL), 990 COS function, 1159 COSH function, 1159 COST option ENDOGENOUS statement (QLIM), 1389 COUNTREG procedure, 487 syntax, 487 COV option FIT statement (MODEL), 990 COV3OUT option PROC SYSLIN statement, 1635 COVB option ESTIMATE statement (MODEL), 982 FIT statement (MODEL), 990 MODEL statement, 492 MODEL statement (AUTOREG), 343 MODEL statement (MDC), 896 MODEL statement (PANEL), 1277 MODEL statement (PDLREG), 1356 MODEL statement (SYSLIN), 1639 MODEL statement (TSCSREG), 1736 PROC STATESPACE statement, 1584 QLIM procedure, 1385 COVBEST= option ENTROPY procedure, 641 FIT statement (MODEL), 985, 1021 COVEST= option MODEL statement (AUTOREG), 343 MODEL statement (MDC), 895 PROC COUNTREG statement, 490 QLIM procedure, 1385 COVOUT option ENTROPY procedure, 642 FIT statement (MODEL), 988 PROC AUTOREG statement, 339 PROC COUNTREG statement, 489 PROC MDC statement, 890
PROC PANEL statement, 1270 PROC QLIM statement, 1385 PROC SYSLIN statement, 1635 PROC TSCSREG statement, 1734 COVS option FIT statement (MODEL), 990, 1026 CPEV= option OUTPUT statement (AUTOREG), 354 CROSS option PROC SPECTRA statement, 1546 CROSSCORR statement TIMESERIES procedure, 1690 CROSSCORR= option IDENTIFY statement (ARIMA), 228 CROSSLIST= option LIBNAME statement (SASEFAME), 2295 CROSSPLOTS= option PROC TIMESERIES statement, 1686 CROSSVAR statement TIMESERIES procedure, 1698 CRSPLINKPATH= option LIBNAME statement (SASECRSP), 2201 CSPACE= option PROC COMPUTAB statement, 439 CSTART= option PROC FORECAST statement, 791 CUSIP= option LIBNAME statement (SASECRSP), 2199 CUSUM= option OUTPUT statement (AUTOREG), 355 CUSUMLB= option OUTPUT statement (AUTOREG), 355 CUSUMSQ= option OUTPUT statement (AUTOREG), 355 CUSUMSQLB= option OUTPUT statement (AUTOREG), 355 CUSUMSQUB= option OUTPUT statement (AUTOREG), 355 CUSUMUB= option OUTPUT statement (AUTOREG), 355 CUTOFF= option SSPAN statement (X11), 2055 CV= option OUTLIER statement (X12), 2122 CWIDTH= option PROC COMPUTAB statement, 439 CYCLE statement UCM procedure, 1760 DASILVA option MODEL statement (PANEL), 1277 MODEL statement (TSCSREG), 1737 DATA Step IF Statement, 73
WHERE Statement, 73 DATA step DROP statement, 74 KEEP statement, 74 DATA= option ENTROPY procedure, 641 FIT statement (MODEL), 987, 1104 FORECAST command (TSFS), 2546, 2547 IDENTIFY statement (ARIMA), 229 PROC ARIMA statement, 224 PROC AUTOREG statement, 339 PROC COMPUTAB statement, 438 PROC COUNTREG statement, 489 PROC ESM statement, 686 PROC EXPAND statement, 729 PROC FORECAST statement, 791 PROC MDC statement, 889 PROC MODEL statement, 970 PROC PANEL statement, 1270 PROC PDLREG statement, 1355 PROC QLIM statement, 1384 PROC SIMILARITY statement, 1448 PROC SIMLIN statement, 1515, 1522 PROC SPECTRA statement, 1546 PROC STATESPACE statement, 1582 PROC SYSLIN statement, 1635 PROC TIMESERIES statement, 1686 PROC TSCSREG statement, 1734 PROC UCM statement, 1754 PROC VARMAX statement, 1886 PROC X11 statement, 2042 PROC X12 statement, 2109 SOLVE statement (MODEL), 1000, 1151 TSVIEW command (TSFS), 2546, 2547 DATASOURCE procedure, 536 syntax, 536 DATE function, 144 DATE= option MONTHLY statement (X11), 2048 PROC X12 statement, 2109 QUARTERLY statement (X11), 2053 DATEJUL function, 96 DATEJUL function, 144 DATEPART function, 97, 144 DATETIME function, 144 DAY function, 96, 144 DBNAME= option PROC DATASOURCE statement, 538 DBTYPE= option PROC DATASOURCE statement, 538
DECOMP statement TIMESERIES procedure, 1691 DEGREE= option SPLINEREG statement (UCM), 1778 SPLINESEASON statement (UCM), 1779 DELTA= option ESTIMATE statement (ARIMA), 235 DEPLAG statement UCM procedure, 1762 DETAILS option FIT statement (MODEL), 1046 PROC MODEL statement, 973 DETTOL= option PROC STATESPACE statement, 1584 DFPVALUE macro, 153 macro variable, 154, 155 DFTEST macro, 154 DFTEST option MODEL statement (VARMAX), 1901, 2002 DFTEST=(DLAG=) option MODEL statement (VARMAX), 1901 DHMS function, 97 DHMS function, 144 DIAG= option FORECAST command (TSFS), 2549 DIF function, 105 DIF function MODEL procedure, 109 DIF= option BOXCOXAR macro, 151 DFTEST macro, 155 INPUT statement (SIMILARITY), 1454 LOGTEST macro, 157 MODEL statement (VARMAX), 1893 TARGET statement (SIMILARITY), 1458 VAR statement (TIMESERIES), 1698 DIFF= option IDENTIFY statement (X12), 2117 DIFFORDER= option AUTOMDL statement (X12), 2119 DIFX= option MODEL statement (VARMAX), 1894 DIFY= option MODEL statement (VARMAX), 1894, 2013 DIMMAX= option PROC STATESPACE statement, 1584 DISCRETE option ENDOGENOUS statement (QLIM), 1388 DIST= option COUNTREG statement (COUNTREG), 492
MODEL statement (AUTOREG), 341 MODEL statement (COUNTREG), 492 DISTRIBUTION= option ENDOGENOUS statement (QLIM), 1388 DLAG= option DFPVALUE macro, 153 DFTEST macro, 155 DO, 1164 DOL option ROWS statement (COMPUTAB), 442 DOWNPAYMENT= option FIXED statement (LOAN), 842 DOWNPAYPCT= option FIXED statement (LOAN), 842 DP= option FIXED statement (LOAN), 842 DPCT= option FIXED statement (LOAN), 842 DROP statement DATASOURCE procedure, 541 DROP= option FIT statement (MODEL), 984 LIBNAME statement (SASEHAVR), 2344 DROPEVENT statement DATASOURCE procedure, 543 DROPGROUP= option LIBNAME statement (SASEHAVR), 2344 DROPH= option SEASON statement (UCM), 1775 DROPSOURCE= option LIBNAME statement (SASEHAVR), 2344 DUAL option ENTROPY procedure, 643 DUL option ROWS statement (COMPUTAB), 442 DW option FIT statement (MODEL), 990 MODEL statement (SYSLIN), 1639 DW= option MODEL statement (AUTOREG), 343 MODEL statement (PDLREG), 1357 DWPROB option FIT statement (MODEL), 990 MODEL statement (AUTOREG), 343 MODEL statement (PDLREG), 1357 DYNAMIC option FIT statement (MODEL), 985, 1070, 1072 SOLVE statement (MODEL), 1002, 1068, 1117 EBCDIC option PROC DATASOURCE statement, 538 ECM= option MODEL statement (VARMAX), 1906
ECM=(ECTREND) option MODEL statement (VARMAX), 1906, 1970 ECM=(NORMALIZE=) option MODEL statement (VARMAX), 1871, 1906 ECM=(RANK=) option MODEL statement (VARMAX), 1871, 1906 Empirical Distribution Estimation ERRORMODEL statement (MODEL), 1023 EMPIRICAL= option ERRORMODEL statement (MODEL), 981 END= option ID statement (ESM), 693 ID statement (SIMILARITY), 1452 ID statement (TIMESERIES), 1695 LIBNAME statement (SASEHAVR), 2344 MONTHLY statement (X11), 2048 QUARTERLY statement (X11), 2053 ENDOGENOUS statement MODEL procedure, 979, 1151 SIMLIN procedure, 1516 SYSLIN procedure, 1637 ENTROPY procedure, 638 syntax, 638 ENTRY= option FORECAST command (TSFS), 2548 EPSILON = option FIT statement (MODEL), 992 ERRORMODEL statement MODEL procedure, 980 ERRSTD OUTPUT statement (QLIM), 1393 ESACF option IDENTIFY statement (ARIMA), 229 ESM, 681 ESM procedure, 684 syntax, 684 EST= option PROC SIMLIN statement, 1515, 1521 ESTDATA= option FIT statement (MODEL), 988, 1105 SOLVE statement (MODEL), 1000, 1121, 1149 ESTIMATE statement ARIMA procedure, 232 MODEL procedure, 981 UCM procedure, 1763 X12 procedure, 2115 ESTIMATEDCASE= option ARM statement (LOAN), 847 ESTPRINT option PROC SIMLIN statement, 1515 ESUPPORTS= option MODEL statement (ENTROPY), 647 EVENT statement
X12 procedure, 2112 EXCLUDE= option FIT statement (MODEL), 1086 INSTRUMENTS statement (MODEL), 995 MONTHLY statement (X11), 2048 EXOGENEITY option COINTEG statement (VARMAX), 1890, 1976 EXOGENOUS statement MODEL procedure, 983, 1151 SIMLIN procedure, 1517 EXP function, 1159 EXPAND procedure, 727 CONVERT statement, 742 syntax, 727 EXPAND= option TARGET statement (SIMILARITY), 1458 EXPECTED OUTPUT statement (QLIM), 1393 EXTRADIFFUSE= option ESTIMATE statement (UCM), 1763 FORECAST statement (UCM), 1766 EXTRAPOLATE option PROC EXPAND statement, 730 F option ERRORMODEL statement (MODEL), 980 FACTOR= option PROC EXPAND statement, 729, 733, 734 FAMEOUT= option LIBNAME statement (SASEFAME), 2297 FAMEPRINT option PROC DATASOURCE statement, 538 FCMPOPT statement SIMILARITY procedure, 1451 FILETYPE= option PROC DATASOURCE statement, 538 FIML option FIT statement (MODEL), 985, 1019, 1105, 1211 PROC SYSLIN statement, 1635 FINAL= option X11 statement (X12), 2135 FIRST option PROC SYSLIN statement, 1636 FIT statement MODEL procedure, 984 FIT statement, MODEL procedure GINV= option, 985 FIXED statement LOAN procedure, 841 FIXEDCASE option ARM statement (LOAN), 847 FIXONE option
MODEL statement (PANEL), 1277 MODEL statement (TSCSREG), 1736 FIXONETIME option MODEL statement (PANEL), 1277 FIXTWO option MODEL statement (PANEL), 1277 MODEL statement (TSCSREG), 1737 FLATDATA statement PANEL procedure, 1272 FLOW option PROC MODEL statement, 973 FORCE=FREQ option LIBNAME statement (SASEHAVR), 2344 FORCE= option X11 statement (X12), 2135 FORECAST macro, 2547 FORECAST option SOLVE statement (MODEL), 1002, 1117, 1120 FORECAST procedure, 788 syntax, 788 FORECAST statement ARIMA procedure, 237 ESM procedure, 689 UCM procedure, 1765 X12 procedure, 2116 FORECAST= option ARIMA statement (X11), 2044 FORM statement STATESPACE procedure, 1586 FORM= option GARCH statement, 1907 FORMAT statement DATASOURCE procedure, 545 FORMAT= option ATTRIBUTE statement (DATASOURCE), 545 COLUMNS statement (COMPUTAB), 441 ID statement (ESM), 693 ID statement (SIMILARITY), 1453 ID statement (TIMESERIES), 1695 ROWS statement (COMPUTAB), 443 FREQ= option LIBNAME statement (SASEHAVR), 2344 FREQUENCY= option PROC DATASOURCE statement, 539 FROM= option PROC EXPAND statement, 729, 733, 734 FRONTIER option ENDOGENOUS statement (QLIM), 1389 FSRSQ option FIT statement (MODEL), 991, 1010, 1087 FULLER option
MODEL statement (PANEL), 1277 MODEL statement (TSCSREG), 1737 FULLWEIGHT= option MONTHLY statement (X11), 2049 QUARTERLY statement (X11), 2053 FUNCTION= option TRANSFORM statement (X12), 2131 FUZZ= option PROC COMPUTAB statement, 438 GARCH statement VARMAX procedure, 1907 GARCH= option MODEL statement (AUTOREG), 341 GCE option ENTROPY procedure, 641 GCENM option ENTROPY procedure, 641 GCONV= option ENTROPY procedure, 643 GENERAL= option ERRORMODEL statement (MODEL), 980 GENGMMV option FIT statement (MODEL), 985 GETDER function, 1158 GINV option MODEL statement (AUTOREG), 343 GINV= option FIT statement (MODEL), 985 MODEL statement (PANEL), 1277 GME option ENTROPY procedure, 641 GMED option ENTROPY procedure, 641 GMENM option ENTROPY procedure, 641 GMM option FIT statement (MODEL), 985, 1011, 1054, 1105, 1108, 1109, 1111 MODEL statement (PANEL), 1277 GODFREY option FIT statement (MODEL), 991 MODEL statement (AUTOREG), 344 GRAPH option PROC MODEL statement, 972, 1178 GRID option ESTIMATE statement (ARIMA), 235 GRIDVAL= option ESTIMATE statement (ARIMA), 235 GROUP1 option CAUSAL statement (VARMAX), 1889 GROUP2 option CAUSAL statement (VARMAX), 1889 GROUP= option
LIBNAME statement (SASEHAVR), 2344 GVKEY= option LIBNAME statement (SASECRSP), 2198 H= option COINTEG statement (VARMAX), 1890, 1973 HALTONSTART= option MODEL statement (MDC), 892 HAUSMAN option FIT statement (MODEL), 991, 1080 HCCME= option FIT statement (MODEL), 986 MODEL statement (PANEL), 1277 HCUSIP= option LIBNAME statement (SASECRSP), 2199 HESSIAN= option FIT statement (MODEL), 992, 1021 HETERO statement AUTOREG procedure, 350 HEV option MODEL statement (MDC), 892 HMS function, 97, 144 HOLIDAY function, 144 HORIZON= option FORECAST command (TSFS), 2548 HOUR function, 144 HRINITIAL option AUTOMDL statement (X12), 2120 HT= option OUTPUT statement (AUTOREG), 354 I option FIT statement (MODEL), 991, 1043 MODEL statement (PDLREG), 1357 MODEL statement (SYSLIN), 1639 ID statement ENTROPY procedure, 646 ESM procedure, 691 EXPAND procedure, 733 FORECAST procedure, 795 MDC procedure, 891 MODEL procedure, 993 PANEL procedure, 1273 SIMILARITY procedure, 1451 SIMLIN procedure, 1517 STATESPACE procedure, 1586 TIMESERIES procedure, 1693 TSCSREG procedure, 1735 UCM procedure, 1767 VARMAX procedure, 1892
X11 procedure, 2046 X12 procedure, 2111 ID= option FORECAST command (TSFS), 2546, 2547 FORECAST statement (ARIMA), 238 OUTLIER statement (ARIMA), 237 TSVIEW command (TSFS), 2546, 2547 IDENTIFY statement ARIMA procedure, 228, 236 X12 procedure, 2117 IDENTITY statement SYSLIN procedure, 1638 IF, 1164 INCLUDE, 1168 INCLUDE statement MODEL procedure, 993 INDEX option PROC DATASOURCE statement, 538 INDID= option PROC PANEL statement, 1272 INDNO= option LIBNAME statement (SASECRSP), 2200 INEVENT= option PROC X12 statement, 2111 INFILE= option PROC DATASOURCE statement, 538 INIT statement COMPUTAB procedure, 444 COUNTREG procedure, 491 QLIM procedure, 1391 INIT= option FIXED statement (LOAN), 843 INITIAL statement STATESPACE procedure, 1587 INITIAL= option FIT statement (MODEL), 984, 1072 FIXED statement (LOAN), 843 MODEL statement (AUTOREG), 348 MODEL statement (MDC), 896 INITIALPCT= option FIXED statement (LOAN), 843 INITMISS option PROC COMPUTAB statement, 438 INITPCT= option FIXED statement (LOAN), 843 INITVAL= option ESTIMATE statement (ARIMA), 234 INPUT statement SIMILARITY procedure, 1454 X12 procedure, 2113 INPUT= option ESTIMATE statement (ARIMA), 232, 253 INSET= option LIBNAME statement (SASECRSP), 2202
LIBNAME statement (SASEFAME), 2294 INSTRUMENTS statement MODEL procedure, 994, 1084 SYSLIN procedure, 1638 INTCINDEX function, 145 INTCK function, 99, 145 INTCYCLE function, 145 INTEGRATE= option MODEL statement (MDC), 892 INTERIM= option PROC SIMLIN statement, 1515 INTERVAL= option FIXED statement (LOAN), 843 FORECAST command (TSFS), 2546, 2548 FORECAST statement (ARIMA), 238, 259 ID statement (ESM), 693 ID statement (SIMILARITY), 1453 ID statement (TIMESERIES), 1695 ID statement (UCM), 1767 ID statement (VARMAX), 1892 PROC DATASOURCE statement, 539 PROC FORECAST statement, 791 PROC STATESPACE statement, 1585 PROC X12 statement, 2110 TSVIEW command (TSFS), 2546, 2548 INTFIT function, 145 INTFMT function, 145 INTGET function, 145 INTGPRINT option SOLVE statement (MODEL), 1004 INTINDEX function, 146 INTNX function, 98, 146 INTONLY option INSTRUMENTS statement (MODEL), 995 INTORDER= option MODEL statement (MDC), 892 INTPER= option PROC FORECAST statement, 791 PROC STATESPACE statement, 1585 INTSEAS function, 146 INTSHIFT function, 147 INTTEST function, 147 IRREGULAR statement UCM procedure, 1768 IT2SLS option FIT statement (MODEL), 986 IT3SLS option FIT statement (MODEL), 986 PROC SYSLIN statement, 1636
ITALL option FIT statement (MODEL), 991, 1044 ITDETAILS option FIT statement (MODEL), 991, 1042 ITGMM option FIT statement (MODEL), 986, 1015 MODEL statement (PANEL), 1277 ITOLS option FIT statement (MODEL), 986 ITPRINT option ENTROPY procedure, 642 ESTIMATE statement (X12), 2116 FIT statement (MODEL), 992, 1036, 1042 MODEL statement, 492 MODEL statement (AUTOREG), 344 MODEL statement (MDC), 896 MODEL statement (PANEL), 1278 MODEL statement (PDLREG), 1357 PROC STATESPACE statement, 1584 PROC SYSLIN statement, 1637 QLIM procedure, 1385 SOLVE statement (MODEL), 1005, 1145 ITSUR option FIT statement (MODEL), 986, 1011 PROC SYSLIN statement, 1636 J= option COINTEG statement (VARMAX), 1891 JACOBI option SOLVE statement (MODEL), 1003 JULDATE function, 96, 147 K option PROC SPECTRA statement, 1546 K= option MODEL statement (SYSLIN), 1639 PROC SYSLIN statement, 1636 KEEP= option PROC PANEL statement, 1272 KEEP statement DATASOURCE procedure, 541 KEEP= option FORECAST command (TSFS), 2549 LIBNAME statement (SASEHAVR), 2344 KEEPEVENT statement DATASOURCE procedure, 542 KEEPH= option SEASON statement (UCM), 1776 KERNEL option FIT statement (MODEL), 1108 KERNEL= option FIT statement (MODEL), 986, 1013 KLAG= option PROC STATESPACE statement, 1584
KNOTS= option SPLINEREG statement (UCM), 1778 SPLINESEASON statement (UCM), 1779 L= option FIXED statement (LOAN), 841 _LABEL_ option COLUMNS statement (COMPUTAB), 440 ROWS statement (COMPUTAB), 442 LABEL statement DATASOURCE procedure, 546 MODEL procedure, 995 LABEL= option ATTRIBUTE statement (DATASOURCE), 545 FIXED statement (LOAN), 843 LAG function, 105 LAG function MODEL procedure, 109 LAG statement PANEL procedure, 1274 LAGDEP option MODEL statement (AUTOREG), 344 MODEL statement (PDLREG), 1357 LAGDEP= option MODEL statement (AUTOREG), 344 MODEL statement (PDLREG), 1357 LAGDV option MODEL statement (AUTOREG), 344 MODEL statement (PDLREG), 1357 LAGDV= option MODEL statement (AUTOREG), 344 MODEL statement (PDLREG), 1357 LAGGED statement SIMLIN procedure, 1517 LAGMAX= option MODEL statement (VARMAX), 1896 PROC STATESPACE statement, 1582 LAGRANGE option TEST statement (ENTROPY), 650 TEST statement (MODEL), 1006 LAGS= option CORR statement (TIMESERIES), 1690 CROSSCORR statement (TIMESERIES), 1691 DEPLAG statement (UCM), 1762 LAMBDA= option DECOMP statement (TIMESERIES), 1692 LAMBDAHI= option BOXCOXAR macro, 151 LAMBDALO= option BOXCOXAR macro, 151 LCL= option
OUTPUT statement (AUTOREG), 355 OUTPUT statement (PDLREG), 1359 LCLM= option OUTPUT statement (AUTOREG), 355 OUTPUT statement (PDLREG), 1359 LDW option MODEL statement (AUTOREG), 349 LEAD= option FORECAST statement (ARIMA), 238 FORECAST statement (UCM), 1766 FORECAST statement (X12), 2117 OUTPUT statement (VARMAX), 1909 PROC ESM statement, 686 PROC FORECAST statement, 791 PROC STATESPACE statement, 1585 LENGTH option MONTHLY statement (X11), 2049 LENGTH statement DATASOURCE procedure, 546 LENGTH= option ATTRIBUTE statement (DATASOURCE), 545 SEASON statement (UCM), 1776 SPLINESEASON statement (UCM), 1780 LEVEL statement UCM procedure, 1771 LIBNAME libref SASECRSP statement, 2195 LIFE= option FIXED statement (LOAN), 841 LIKE option TEST statement (ENTROPY), 650 TEST statement (MODEL), 1006 LIMIT1= option MODEL statement (QLIM), 1392 LIML option PROC SYSLIN statement, 1636 LINK= option HETERO statement (AUTOREG), 351 LIST option FIT statement (MODEL), 1197 PROC MODEL statement, 972, 1169 LISTALL option PROC MODEL statement, 972 LISTCODE option PROC MODEL statement, 972, 1171 LISTDEP option PROC MODEL statement, 972, 1174 LISTDER option PROC MODEL statement, 973 LJC option COLUMNS statement (COMPUTAB), 441 ROWS statement (COMPUTAB), 443 LJUNGBOXLIMIT= option AUTOMDL statement (X12), 2120
LM option TEST statement (ENTROPY), 650 TEST statement (MODEL), 1006 TEST statement (PANEL), 1281 TEST statement (QLIM), 1395 LOAN procedure, 838 syntax, 838 LOG function, 1159 LOG10 function, 1159 LOG2 function, 1159 LOGLIKL option MODEL statement (AUTOREG), 344 LOGNORMALPARM= option MODEL statement (MDC), 893 LOGTEST macro, 156 macro variable, 157 LOWERBOUND= option ENDOGENOUS statement (QLIM), 1388, 1389 LR option TEST statement (ENTROPY), 650 TEST statement (MODEL), 1006 TEST statement (PANEL), 1281 TEST statement (QLIM), 1395 LRECL= option PROC DATASOURCE statement, 539 LSCV= option OUTLIER statement (X12), 2123 LTEBOUND= option FIT statement (MODEL), 992, 1147 MODEL statement (MODEL), 1147 SOLVE statement (MODEL), 1147 M= option MODEL statement (PANEL), 1278 MODEL statement (TSCSREG), 1737 %MA macro, 1100, 1101 MA= option ESTIMATE statement (ARIMA), 234 MACURVES statement X11 procedure, 2046 MAPECR= option ARIMA statement (X11), 2044 MARGINAL OUTPUT statement (QLIM), 1393 MARGINALS option MODEL statement (ENTROPY), 647 MARKOV option ENTROPY procedure, 641 MARR= option COMPARE statement (LOAN), 849 MAXAD= option ARM statement (LOAN), 846
MAXADJUST= option ARM statement (LOAN), 846 MAXBAND= option MODEL statement (PANEL), 1278 MAXDIFF= option AUTOMDL statement (X12), 2119 MAXERROR= option PROC ESM statement, 686 PROC TIMESERIES statement, 1686 MAXERRORS= option PROC FORECAST statement, 791 PROC MODEL statement, 973 MAXIT= option ESTIMATE statement (ARIMA), 235 PROC STATESPACE statement, 1584 PROC SYSLIN statement, 1636 MAXITER= option ARIMA statement (X11), 2044 ENTROPY procedure, 643 ESTIMATE statement (ARIMA), 235 ESTIMATE statement (X12), 2116 FIT statement (MODEL), 992 MODEL statement (AUTOREG), 349 MODEL statement (MDC), 896 MODEL statement (PANEL), 1278 MODEL statement (PDLREG), 1357 PROC SYSLIN statement, 1636 SOLVE statement (MODEL), 1004 MAXNUM= option OUTLIER statement (ARIMA), 237 OUTLIER statement (UCM), 1773 MAXORDER= option AUTOMDL statement (X12), 2119 MAXPCT= option OUTLIER statement (ARIMA), 237 OUTLIER statement (UCM), 1773 MAXR= option ARM statement (LOAN), 846 MAXRATE= option ARM statement (LOAN), 846 MAXSUBITER= option ENTROPY procedure, 643 FIT statement (MODEL), 992, 1028 SOLVE statement (MODEL), 1004 MDC procedure, 887 syntax, 887 MDC procedure, MODEL statement ADDMAXIT= option, 894 ADDRANDOM option, 894 ADDVALUE option, 895 ALL option, 896 CHOICE= option, 892 CONVERGE= option, 892
CORRB option, 896 COVB option, 896 COVEST= option, 895 HALTONSTART= option, 892 HEV= option, 892 INITIAL= option, 896 ITPRINT option, 896 LOGNORMALPARM= option, 893 MAXITER= option, 896 MIXED= option, 893 NCHOICE option, 893 NOPRINT option, 896 NORMALEC= option, 893 NORMALPARM= option, 893 NSIMUL option, 893 OPTMETHOD= option, 896 RANDINIT option, 894 RANDNUM= option, 893 RANK option, 894 RESTART= option, 894 SAMESCALE option, 895 SEED= option, 895 SPSCALE option, 895 TYPE= option, 895 UNIFORMEC= option, 893 UNIFORMPARM= option, 893 UNITVARIANCE= option, 895 MDC procedure, OUTPUT statement OUT= option, 901 P= option, 901 XBETA= option, 901 MDC procedure, PROC MDC statement COVOUT option, 890 DATA= option, 889 OUTEST= option, 890 MDCDATA statement, 890 MDLINFOIN= option PROC X12 statement, 2111 MDLINFOOUT= option PROC X12 statement, 2111 MDY function, 96, 147 MEAN= option MODEL statement (AUTOREG), 342 MEASURE= option TARGET statement (SIMILARITY), 1459 MEDIAN option FORECAST statement (ESM), 690 MELO option PROC SYSLIN statement, 1636 MEMORYUSE option PROC MODEL statement, 974 METHOD= option
ARIMA statement (X11), 2044 CONVERT statement (EXPAND), 732, 739 ENTROPY procedure, 644 ESTIMATE statement (ARIMA), 232 FIT statement (MODEL), 992, 1027 MODEL statement (AUTOREG), 349 MODEL statement (PDLREG), 1357 MODEL statement (VARMAX), 1894 PROC COUNTREG statement, 490 PROC EXPAND statement, 730, 739 PROC FORECAST statement, 791 QLIM procedure, 1386 MILLS OUTPUT statement (QLIM), 1393 MINIC option IDENTIFY statement (ARIMA), 229 PROC STATESPACE statement, 1583 MINIC= option MODEL statement (VARMAX), 1900 MINIC=(P=) option MODEL statement (VARMAX), 1901, 1940 MINIC=(PERROR=) option MODEL statement (VARMAX), 1901 MINIC=(Q=) option MODEL statement (VARMAX), 1901, 1940 MINIC=(TYPE=) option MODEL statement (VARMAX), 1901 MINR= option ARM statement (LOAN), 846 MINRATE= option ARM statement (LOAN), 846 MINTIMESTEP= option FIT statement (MODEL), 993, 1147 MODEL statement (MODEL), 1147 SOLVE statement (MODEL), 1147 MINUTE function, 147 MISSING= option FIT statement (MODEL), 988 MIXED option MODEL statement (MDC), 893 MODE= option DECOMP statement (TIMESERIES), 1692 X11 statement (X12), 2133 MODEL procedure, 962 syntax, 962 MODEL statement AUTOREG procedure, 340 COUNTREG procedure, 491 ENTROPY procedure, 646 MDC procedure, 892 PANEL procedure, 1275, 1276 PDLREG procedure, 1356 QLIM procedure, 1391
SYSLIN procedure, 1638 TSCSREG procedure, 1736 UCM procedure, 1772 VARMAX procedure, 1892 MODEL= option ARIMA statement (X11), 2044 ARIMA statement (X12), 2115 FORECAST statement (ESM), 690 PROC MODEL statement, 971, 1167 MOMENT statement MODEL procedure, 996 MONTH function, 96, 147 MONTHLY statement X11 procedure, 2047 MTITLE= option COLUMNS statement (COMPUTAB), 440 MU= option ESTIMATE statement (ARIMA), 235 +n option COLUMNS statement (COMPUTAB), 441 ROWS statement (COMPUTAB), 442 N2SLS option FIT statement (MODEL), 986 N3SLS option FIT statement (MODEL), 986 NAHEAD= option SOLVE statement (MODEL), 1002, 1117, 1118 _NAME_ option COLUMNS statement (COMPUTAB), 441 ROWS statement (COMPUTAB), 442 NBACKCAST= option FORECAST statement (ESM), 690 NBLOCKS= option BLOCKSEASON statement (UCM), 1759 NCHOICE option MODEL statement (MDC), 893 NDEC= option MONTHLY statement (X11), 2049 PROC MODEL statement, 973 QUARTERLY statement (X11), 2054 SSPAN statement (X11), 2055 NDRAW option FIT statement (MODEL), 986 NDRAW= option QLIM procedure, 1386 NEST statement MDC procedure, 897 NESTIT option FIT statement (MODEL), 993, 1027 NEWTON option SOLVE statement (MODEL), 1003
NKNOTS= option SPLINEREG statement (UCM), 1778 NLAG= option CORR statement (TIMESERIES), 1690 CROSSCORR statement (TIMESERIES), 1691 IDENTIFY statement (ARIMA), 229 MODEL statement (AUTOREG), 341 MODEL statement (PDLREG), 1358 NLAGS= option PROC FORECAST statement, 790 NLAMBDA= option BOXCOXAR macro, 151 NLOPTIONS statement MDC procedure, 900 QLIM procedure, 1392 UCM procedure, 1772 VARMAX procedure, 1908, 1953 NO2SLS option FIT statement (MODEL), 987 NO3SLS option FIT statement (MODEL), 987 NOCENTER option PROC STATESPACE statement, 1583 NOCOMPRINT option COMPARE statement (LOAN), 850 NOCONST option HETERO statement (AUTOREG), 352 NOCONSTANT option ESTIMATE statement (ARIMA), 233 NOCURRENTX option MODEL statement (VARMAX), 1894 NODF option ESTIMATE statement (ARIMA), 233 NODIFFS option MODEL statement (PANEL), 1278 NOEST option AUTOREG statement (UCM), 1757 BLOCKSEASON statement (UCM), 1759 CYCLE statement (UCM), 1761 DEPLAG statement (UCM), 1762 ESTIMATE statement (ARIMA), 235 IRREGULAR statement (UCM), 1768, 1770 LEVEL statement (UCM), 1772 PROC STATESPACE statement, 1584 RANDOMREG statement (UCM), 1774 SEASON statement (UCM), 1776 SLOPE statement (UCM), 1777 SPLINEREG statement (UCM), 1779 SPLINESEASON statement (UCM), 1780 NOESTIM option MODEL statement (PANEL), 1278 NOGENGMMV option
FIT statement (MODEL), 987 NOINCLUDE option PROC SYSLIN statement, 1636 NOINT option ARIMA statement (X11), 2045 AUTOMDL statement (X12), 2119 ESTIMATE statement (ARIMA), 233 INSTRUMENTS statement (MODEL), 995 MODEL statement (AUTOREG), 340, 342 MODEL statement (COUNTREG), 492 MODEL statement (ENTROPY), 647 MODEL statement (PANEL), 1278 MODEL statement (PDLREG), 1358 MODEL statement (QLIM), 1392 MODEL statement (SYSLIN), 1639 MODEL statement (TSCSREG), 1737 MODEL statement (VARMAX), 1895 NOINTERCEPT option INSTRUMENTS statement (MODEL), 995 NOLEVELS option MODEL statement (PANEL), 1278 NOLS option ESTIMATE statement (ARIMA), 236 NOMEAN option MODEL statement (TSCSREG), 1737 NOMISS option IDENTIFY statement (ARIMA), 230 MODEL statement (AUTOREG), 350 NOOLS option FIT statement (MODEL), 987 NOOUTALL option FORECAST statement (ARIMA), 238 PROC ESM statement, 686 NOP option FIXED statement (LOAN), 844 NOPRINT option ARIMA statement (X11), 2045 COLUMNS statement (COMPUTAB), 441 ESTIMATE statement (ARIMA), 233 FIXED statement (LOAN), 844 FORECAST statement (ARIMA), 238 IDENTIFY statement (ARIMA), 230 MODEL statement (AUTOREG), 344 MODEL statement (MDC), 896 MODEL statement (PANEL), 1278 MODEL statement (PDLREG), 1358 MODEL statement (SYSLIN), 1639 MODEL statement (TSCSREG), 1737 MODEL statement (VARMAX), 1896 OUTPUT statement (VARMAX), 1909 PROC COMPUTAB statement, 439 PROC COUNTREG statement, 490 PROC ENTROPY statement, 642 PROC MODEL statement, 973
PROC QLIM statement, 1385 PROC SIMLIN statement, 1515 PROC STATESPACE statement, 1582 PROC SYSLIN statement, 1637 PROC UCM statement, 1754 PROC VARMAX statement (VARMAX), 1988 PROC X11 statement, 2043 ROWS statement (COMPUTAB), 443 SSPAN statement (X11), 2055 NOPROFILE ESTIMATE statement (UCM), 1763 NORED option PROC SIMLIN statement, 1515 NORMAL option ERRORMODEL statement (MODEL), 980 FIT statement (MODEL), 991 MODEL statement (AUTOREG), 344 NORMALEC= option MODEL statement (MDC), 893 NORMALIZE= option COINTEG statement (VARMAX), 1891, 2002 INPUT statement (SIMILARITY), 1454 TARGET statement (SIMILARITY), 1460 NORMALPARM= option MODEL statement (MDC), 893 NORTR option PROC COMPUTAB statement, 439 NOSTABLE option ESTIMATE statement (ARIMA), 236 NOSTORE option PROC MODEL statement, 971 NOSUM TABLES statement (X12), 2130 NOSUMMARYPRINT option FIXED statement (LOAN), 844 NOSUMPR option FIXED statement (LOAN), 844 NOTFSTABLE option ESTIMATE statement (ARIMA), 236 NOTRANS option PROC COMPUTAB statement, 439 NOTRANSPOSE option PROC COMPUTAB statement, 438 NOTSORTED option ID statement (ESM), 693 ID statement (SIMILARITY), 1453 ID statement (TIMESERIES), 1695 NOZERO option COLUMNS statement (COMPUTAB), 441 ROWS statement (COMPUTAB), 443 NPARMS= option CORR statement (TIMESERIES), 1690
NPERIODS= option DECOMP statement (TIMESERIES), 1693 TREND statement (TIMESERIES), 1697 NPREOBS option FIT statement (MODEL), 987 NSEASON= option MODEL statement (VARMAX), 1895 NSIMUL option MODEL statement (MDC), 893 NSSTART=MAX option PROC FORECAST statement, 792 NSSTART= option PROC FORECAST statement, 792 NSTART=MAX option PROC FORECAST statement, 792 NSTART= option PROC FORECAST statement, 792 NVDRAW option FIT statement (MODEL), 987 NWKDOM function, 147 OBSERVED= option CONVERT statement (EXPAND), 732, 737 PROC EXPAND statement, 730 OFFSET= option BLOCKSEASON statement (UCM), 1759 MODEL statement (COUNTREG), 492 SPLINESEASON statement (UCM), 1780 OL option ROWS statement (COMPUTAB), 443 OLS option FIT statement (MODEL), 987, 1180 PROC SYSLIN statement, 1636 ONEPASS option SOLVE statement (MODEL), 1003 OPTIONS option PROC COMPUTAB statement, 439 OPTMETHOD= option MODEL statement (AUTOREG), 350 MODEL statement (MDC), 896 ORDER= option ENDOGENOUS statement (QLIM), 1388 PROC SIMILARITY statement, 1448 OTHERWISE, 1166 OUT= option FLATDATA statement (PANEL), 1273 OUT1STEP option PROC FORECAST statement, 793 OUT= option BOXCOXAR macro, 151 DFTEST macro, 155 ENTROPY procedure, 641 FIT statement (MODEL), 988, 1110, 1182
FIXED statement (LOAN), 844, 853 FORECAST command (TSFS), 2549 FORECAST statement (ARIMA), 238, 262 LOGTEST macro, 157 OUTPUT statement (AUTOREG), 354 OUTPUT statement (COUNTREG), 493 OUTPUT statement (MDC), 901 OUTPUT statement (PANEL), 1279 OUTPUT statement (PDLREG), 1358 OUTPUT statement (QLIM), 1393 OUTPUT statement (SIMLIN), 1518 OUTPUT statement (SYSLIN), 1655 OUTPUT statement (VARMAX), 1909, 1987 OUTPUT statement (X11), 2051, 2071 OUTPUT statement (X12), 2121 PROC ARIMA statement, 227 PROC COMPUTAB statement, 439 PROC DATASOURCE statement, 539, 548 PROC ESM statement, 686 PROC EXPAND statement, 729, 756 PROC FORECAST statement, 792, 806 PROC SIMILARITY statement, 1448 PROC SIMLIN statement, 1523 PROC SPECTRA statement, 1546, 1552 PROC STATESPACE statement, 1585, 1601 PROC SYSLIN statement, 1635 PROC TIMESERIES statement, 1686 SOLVE statement (MODEL), 1001, 1121, 1150 TEST statement (ENTROPY), 650 TEST statement (MODEL), 1006 OUTACTUAL option FIT statement (MODEL), 988 PROC FORECAST statement, 792 SOLVE statement (MODEL), 1001 OUTALL option FIT statement (MODEL), 988 PROC FORECAST statement, 793 SOLVE statement (MODEL), 1001 OUTALL= option PROC DATASOURCE statement, 540, 552 OUTAR= option PROC STATESPACE statement, 1583, 1601 OUTBY= option PROC DATASOURCE statement, 540, 551 OUTCOMP= option COMPARE statement (LOAN), 850, 854 OUTCONT= option PROC DATASOURCE statement, 540, 550 OUTCORR option ESTIMATE statement (ARIMA), 234 PROC PANEL statement, 1270 PROC TSCSREG statement, 1734
OUTCORR= option PROC TIMESERIES statement, 1686 OUTCOV option ENTROPY procedure, 642 ESTIMATE statement (ARIMA), 234 ESTIMATE statement (MODEL), 982 FIT statement (MODEL), 988 PROC PANEL statement, 1270 PROC SYSLIN statement, 1635 PROC TSCSREG statement, 1734 PROC VARMAX statement, 1886, 1989 OUTCOV3 option PROC SYSLIN statement, 1635 OUTCOV= option IDENTIFY statement (ARIMA), 230, 263 OUTCROSSCORR= option PROC TIMESERIES statement, 1686 OUTDECOMP= option PROC TIMESERIES statement, 1686 OUTERRORS option SOLVE statement (MODEL), 1001 OUTEST= option ENTROPY procedure, 642 ENTROPY statement, 664 ESTIMATE statement (ARIMA), 234, 264 ESTIMATE statement (MODEL), 982 ESTIMATE statement (UCM), 1764 FIT statement (MODEL), 988, 1112 PROC AUTOREG statement, 339 PROC COUNTREG statement, 489 PROC ESM statement, 687 PROC EXPAND statement, 729, 756 PROC FORECAST statement, 793, 808 PROC MDC statement, 890 PROC PANEL statement, 1270, 1321 PROC QLIM statement, 1385 PROC SIMLIN statement, 1515, 1522 PROC SYSLIN statement, 1635, 1655 PROC TSCSREG statement, 1734 PROC VARMAX statement, 1886, 1989 OUTESTALL option PROC FORECAST statement, 793 OUTESTTHEIL option PROC FORECAST statement, 793 OUTEVENT= option PROC DATASOURCE statement, 540, 553 OUTEXTRAP option PROC X11 statement, 2042 OUTFITSTATS option PROC FORECAST statement, 793 OUTFOR= option FORECAST statement (UCM), 1766 PROC ESM statement, 687 OUTFORECAST option
X11 statement (X12), 2134 OUTFULL option PROC FORECAST statement, 793 OUTHT= option GARCH statement, 1907 PROC VARMAX statement, 1991 OUTL= option ENTROPY procedure, 642 ENTROPY statement, 665 OUTLAGS option FIT statement (MODEL), 988 SOLVE statement (MODEL), 1001 OUTLIER statement UCM procedure, 1773 X12 procedure, 2122 OUTLIMIT option PROC FORECAST statement, 793 OUTMEASURE= option PROC SIMILARITY statement, 1448 OUTMODEL= option ESTIMATE statement (ARIMA), 234, 267 PROC MODEL statement, 971, 1167 PROC STATESPACE statement, 1584, 1602 OUTP= option ENTROPY procedure, 642 ENTROPY statement, 664 OUTPARMS= option FIT statement (MODEL), 1112 PROC MODEL statement, 970, 1107 OUTPATH= option PROC SIMILARITY statement, 1448 OUTPREDICT option FIT statement (MODEL), 988 SOLVE statement (MODEL), 1001 OUTPROCINFO= option PROC ESM statement, 687 OUTPUT OUT=, 384 OUTPUT statement AUTOREG procedure, 354 COUNTREG procedure, 493 PANEL procedure, 1279 PDLREG procedure, 1358 PROC PANEL statement, 1320 QLIM procedure, 1392 SIMLIN procedure, 1517 SYSLIN procedure, 1640 VARMAX procedure, 1908 X11 procedure, 2051 X12 procedure, 2121 OUTRESID option FIT statement (MODEL), 988, 1182 PROC FORECAST statement, 793 SOLVE statement (MODEL), 1001
OUTS= option ENTROPY procedure, 642 FIT statement (MODEL), 989, 1026, 1112 OUTSEASON= option PROC TIMESERIES statement, 1686 OUTSELECT= option PROC DATASOURCE statement, 540 OUTSEQUENCE= option PROC SIMILARITY statement, 1448 OUTSN= option FIT statement (MODEL), 989 OUTSPAN= option PROC X11 statement, 2043, 2071 VAR statement (X11), 2071 OUTSSCP= option PROC SYSLIN statement, 1635, 1656 OUTSTAT= option DFTEST macro, 155 ESTIMATE statement (ARIMA), 234, 268 PROC ESM statement, 687 PROC VARMAX statement, 1886, 1992 OUTSTB= option PROC X11 statement, 2043, 2071 OUTSTD option PROC FORECAST statement, 793 OUTSUM= option FIXED statement (LOAN), 844 PROC ESM statement, 687 PROC LOAN statement, 841, 854 PROC SIMILARITY statement, 1448 PROC TIMESERIES statement, 1687 OUTSUSED= option ENTROPY procedure, 642 FIT statement (MODEL), 989, 1026, 1113 OUTTDR= option PROC X11 statement, 2043, 2072 OUTTRANS= option PROC PANEL statement, 1270, 1322 OUTTREND= option PROC TIMESERIES statement, 1687 OUTUNWGTRESID option FIT statement (MODEL), 989 OUTV= option FIT statement (MODEL), 989, 1109, 1113 OUTVARS statement MODEL procedure, 997 OVDIFCR= option ARIMA statement (X11), 2045 OVERID option MODEL statement (SYSLIN), 1639 OVERPRINT option ROWS statement (COMPUTAB), 443
P option IRREGULAR statement (UCM), 1770 PROC SPECTRA statement, 1546 P= option ESTIMATE statement (ARIMA), 233 FIXED statement (LOAN), 842 GARCH statement, 1907 IDENTIFY statement (ARIMA), 230 MODEL statement (AUTOREG), 341 MODEL statement (VARMAX), 1900 OUTPUT statement (AUTOREG), 355 OUTPUT statement (MDC), 901 OUTPUT statement (PANEL), 1279 OUTPUT statement (PDLREG), 1359 OUTPUT statement (SIMLIN), 1518 _PAGE_ option COLUMNS statement (COMPUTAB), 441 ROWS statement (COMPUTAB), 443 PANEL procedure, 1268 syntax, 1268 PARAMETERS statement MODEL procedure, 997, 1151 PARKS option MODEL statement (PANEL), 1278 MODEL statement (TSCSREG), 1737 PARMS= option FIT statement (MODEL), 984 PARMSDATA= option PROC MODEL statement, 970, 1107 SOLVE statement (MODEL), 1001 PARMTOL= option PROC STATESPACE statement, 1585 PARTIAL option MODEL statement (AUTOREG), 344 MODEL statement (PDLREG), 1358 PASTMIN= option PROC STATESPACE statement, 1584 PATH= option TARGET statement (SIMILARITY), 1460 PAYMENT= option FIXED statement (LOAN), 842 PCHOW= option FIT statement (MODEL), 991, 1081 MODEL statement (AUTOREG), 344 PDATA= option ENTROPY procedure, 641 ENTROPY statement, 663 %PDL macro, 1103 PDLREG procedure, 1354 syntax, 1354 PDWEIGHTS statement X11 procedure, 2052 PERIOD= option CYCLE statement (UCM), 1761
PERMCO= option LIBNAME statement (SASECRSP), 2199 PERMNO= option LIBNAME statement (SASECRSP), 2196 PERROR= option IDENTIFY statement (ARIMA), 230 PH option PROC SPECTRA statement, 1546 PHI option MODEL statement (PANEL), 1278 MODEL statement (TSCSREG), 1737 PHI= option DEPLAG statement (UCM), 1762 PLOT HAXIS=, 88 PLOT option AUTOREG statement (UCM), 1758 BLOCKSEASON statement (UCM), 1759 CYCLE statement (UCM), 1761 ESTIMATE statement (ARIMA), 233 ESTIMATE statement (UCM), 1764 FORECAST statement (UCM), 1766 IRREGULAR statement (UCM), 1768 MODEL statement (SYSLIN), 1639 PROC ARIMA statement, 224 PROC UCM statement, 1754 RANDOMREG statement (UCM), 1774 SEASON statement (UCM), 1776 SLOPE statement (UCM), 1777 SPLINEREG statement (UCM), 1779 SPLINESEASON statement (UCM), 1780 PLOT= option PROC ESM statement, 687 PLOTS option PROC ARIMA statement, 224 PROC AUTOREG statement, 339 PROC ENTROPY statement, 642 PROC MODEL statement, 970 PROC PANEL statement, 1271 PROC UCM statement, 1754 PLOTS= option PROC EXPAND statement, 730 PROC SIMILARITY statement, 1449 PROC TIMESERIES statement, 1687 PM= option OUTPUT statement (AUTOREG), 356 OUTPUT statement (PDLREG), 1359 PMFACTOR= option MONTHLY statement (X11), 2049 PNT= option FIXED statement (LOAN), 843 PNTPCT= option FIXED statement (LOAN), 843 POINTPCT= option
FIXED statement (LOAN), 843 POINTS= option FIXED statement (LOAN), 843 POISSON option ERRORMODEL statement (MODEL), 980 POOLED option MODEL statement (PANEL), 1278 POWER= option TRANSFORM statement (X12), 2130 PRC= option FIXED statement (LOAN), 844 pred OUTPUT statement (COUNTREG), 493 PREDEFINED= option ADJUST statement (X12), 2114 PREDEFINED statement (X12), 2125 PREDICTED OUTPUT statement (QLIM), 1393 PREDICTED= option OUTPUT statement (AUTOREG), 355 OUTPUT statement (PANEL), 1279 OUTPUT statement (PDLREG), 1359 OUTPUT statement (SIMLIN), 1518 OUTPUT statement (SYSLIN), 1640 PREDICTEDM= option OUTPUT statement (AUTOREG), 356 OUTPUT statement (PDLREG), 1359 PREPAYMENTS= option FIXED statement (LOAN), 843 PRICE= option FIXED statement (LOAN), 844 PRIMAL option ENTROPY procedure, 643 PRINT option AUTOREG statement (UCM), 1758 BLOCKSEASON statement (UCM), 1760 CYCLE statement (UCM), 1761 ESTIMATE statement (UCM), 1765 FORECAST statement (UCM), 1767 IRREGULAR statement (UCM), 1768 LEVEL statement (UCM), 1772 OUTLIER statement (UCM), 1773 PROC STATESPACE statement, 1585 RANDOMREG statement (UCM), 1774 SEASON statement (UCM), 1776 SLOPE statement (UCM), 1777 SPLINESEASON statement (UCM), 1780 SSPAN statement (X11), 2056 STEST statement (SYSLIN), 1644 TEST statement (SYSLIN), 1646 PRINT= option AUTOMDL statement (X12), 2119 BOXCOXAR macro, 152 LOGTEST macro, 157
MODEL statement (VARMAX), 1896 PROC SIMILARITY statement, 1449 PROC TIMESERIES statement, 1688 PRINT=(CORRB) option MODEL statement (VARMAX), 1896 PRINT=(CORRX) option MODEL statement (VARMAX), 1896 PRINT=(CORRY) option MODEL statement (VARMAX), 1897, 1936 PRINT=(COVB) option MODEL statement (VARMAX), 1897 PRINT=(COVPE) option MODEL statement (VARMAX), 1897, 1931 PRINT=(COVX) option MODEL statement (VARMAX), 1897 PRINT=(COVY) option MODEL statement (VARMAX), 1897 PRINT=(DECOMPOSE) option MODEL statement (VARMAX), 1897, 1933 PRINT=(DIAGNOSE) option MODEL statement (VARMAX), 1897 PRINT=(DYNAMIC) option MODEL statement (VARMAX), 1897, 1917 PRINT=(ESTIMATES) option MODEL statement (VARMAX), 1897 PRINT=(IARR) option MODEL statement (VARMAX), 1871, 1898 PRINT=(IMPULSE) option MODEL statement (VARMAX), 1924 PRINT=(IMPULSE=) option MODEL statement (VARMAX), 1898 PRINT=(IMPULSX) option MODEL statement (VARMAX), 1920 PRINT=(IMPULSX=) option MODEL statement (VARMAX), 1898 PRINT=(PARCOEF) option MODEL statement (VARMAX), 1899, 1937 PRINT=(PCANCORR) option MODEL statement (VARMAX), 1899, 1939 PRINT=(PCORR) option MODEL statement (VARMAX), 1899, 1938 PRINT=(ROOTS) option MODEL statement (VARMAX), 1899, 1942 PRINT=(YW) option MODEL statement (VARMAX), 1899 PRINT= option PROC ESM statement, 688 PRINTALL option ARIMA statement (X11), 2045 ESTIMATE statement (ARIMA), 236 FIT statement (MODEL), 991 FORECAST statement (ARIMA), 239 MODEL statement, 492 MODEL statement (VARMAX), 1896
PROC MODEL statement, 974 PROC QLIM statement, 1385 PROC UCM statement, 1757 SOLVE statement (MODEL), 1005 SSPAN statement (X11), 2056 PRINTDETAILS option PROC ESM statement, 688 PROC SIMILARITY statement, 1450 PROC TIMESERIES statement, 1688 PRINTERR option ESTIMATE statement (X12), 2116 PRINTFORM= option MODEL statement (VARMAX), 1896, 1920 PRINTFP option ARIMA statement (X11), 2045 PRINTOUT= option MONTHLY statement (X11), 2049 PROC STATESPACE statement, 1583 QUARTERLY statement (X11), 2054 PRINTREG option IDENTIFY statement (X12), 2118 PRIOR option MODEL statement (VARMAX), 1904 PRIOR=(IVAR) option MODEL statement (VARMAX), 1904 PRIOR=(LAMBDA=) option MODEL statement (VARMAX), 1904 PRIOR=(MEAN=) option MODEL statement (VARMAX), 1904 PRIOR=(NREP=) option MODEL statement (VARMAX), 1905 PRIOR=(THETA=) option MODEL statement (VARMAX), 1905 PRIORS statement ENTROPY procedure, 648 PRL= option FIT statement (MODEL), 984, 1083 PROB OUTPUT statement (COUNTREG), 493 OUTPUT statement (QLIM), 1393 PROBALL OUTPUT statement (QLIM), 1393 PROBDF function, 158 macro, 158 PROBZERO OUTPUT statement (COUNTREG), 493 PROC ARIMA statement, 224 PROC AUTOREG OUTEST=, 384 PROC AUTOREG statement, 339 PROC COMPUTAB NOTRANS, 447 OUT=, 457
PROC COMPUTAB statement, 438 PROC DATASOURCE statement, 537 PROC ENTROPY statement, 640 PROC ESM statement, 686 PROC EXPAND statement, 728 PROC FORECAST statement, 790 PROC LOAN statement, 840 PROC MDC statement, 889 PROC MODEL statement, 970 PROC PANEL statement, 1270 PROC PDLREG statement, 1355 PROC SIMILARITY statement, 1447 PROC SIMLIN statement, 1515 PROC SPECTRA statement, 1545 PROC STATESPACE statement, 1582 PROC SYSLIN statement, 1634 PROC TIMESERIES statement, 1686 PROC TSCSREG statement, 1734 PROC UCM statement, 1753, see UCM procedure PROC VARMAX statement, 1885 PROC X11 statement, 2042 PROC X12 statement, 2109 PRODUCTION option ENDOGENOUS statement (QLIM), 1389 PROFILE ESTIMATE statement (UCM), 1765 PROJECT= option FORECAST command (TSFS), 2547 PSEUDO= option SOLVE statement (MODEL), 1003 PURE option ENTROPY procedure, 641 PURGE option RESET statement (MODEL), 999 PUT, 1165 PWC option COMPARE statement (LOAN), 849 PWOFCOST option COMPARE statement (LOAN), 849 Q option IRREGULAR statement (UCM), 1770 Q= option ESTIMATE statement (ARIMA), 233 GARCH statement, 1907 IDENTIFY statement (ARIMA), 230 MODEL statement (AUTOREG), 341 MODEL statement (VARMAX), 1900, 1953 QLIM procedure, 1382 syntax, 1382 QLIM procedure, CLASS statement, 1387 QLIM procedure, TEST statement, 1394 QLIM procedure, WEIGHT statement, 1395
QTR function, 147 QUARTERLY statement X11 procedure, 2052 QUASI= option SOLVE statement (MODEL), 1003 QUIET= option FCMPOPT statement (SIMILARITY), 1451 R= option FIXED statement (LOAN), 842 OUTPUT statement (AUTOREG), 356 OUTPUT statement (PANEL), 1280 OUTPUT statement (PDLREG), 1359 OUTPUT statement (SIMLIN), 1518 RANDINIT option MODEL statement (MDC), 894 RANDNUM= option MODEL statement (MDC), 893 RANDOM= option SOLVE statement (MODEL), 1003, 1121, 1136 RANDOMREG UCM procedure, 1773 RANGE, 1153 RANGE option MODEL statement (ENTROPY), 648 RANGE statement DATASOURCE procedure, 544 MODEL procedure, 998 RANGE= option LIBNAME statement (SASECRSP), 2201 LIBNAME statement (SASEFAME), 2294 RANK option MODEL statement (MDC), 894 RANK= option COINTEG statement (VARMAX), 1892, 1973 RANONE option MODEL statement (PANEL), 1279 MODEL statement (TSCSREG), 1737 RANTWO option MODEL statement (PANEL), 1279 MODEL statement (TSCSREG), 1737 RAO option TEST statement (ENTROPY), 650 TEST statement (MODEL), 1006 RATE= option FIXED statement (LOAN), 842 RECFM= option PROC DATASOURCE statement, 539 RECPEV= option OUTPUT statement (AUTOREG), 356 RECRES= option
OUTPUT statement (AUTOREG), 356 REDUCECV= option AUTOMDL statement (X12), 2120 REDUCED option PROC SYSLIN statement, 1637 REEVAL option FORECAST command (TSFS), 2549 REFIT option FORECAST command (TSFS), 2549 REGRESSION statement X12 procedure, 2124 Remote FAME data access physical name using #port number, 2293 RENAME statement DATASOURCE procedure, 547 REPLACEBACK option FORECAST statement (ESM), 690 REPLACEMISSING option FORECAST statement (ESM), 690 RESET option MODEL statement (AUTOREG), 344 RESET statement MODEL procedure, 999 RESIDDATA= option SOLVE statement (MODEL), 1001 RESIDEST option PROC STATESPACE statement, 1585 RESIDUAL OUTPUT statement (QLIM), 1393 RESIDUAL= option OUTPUT statement (AUTOREG), 356 OUTPUT statement (PANEL), 1280 OUTPUT statement (PDLREG), 1359 OUTPUT statement (SIMLIN), 1518 OUTPUT statement (SYSLIN), 1640 RESIDUALM= option OUTPUT statement (AUTOREG), 356 OUTPUT statement (PDLREG), 1359 RESTART option MODEL statement (MDC), 894 RESTRICT statement AUTOREG procedure, 352 COUNTREG procedure, 493 ENTROPY procedure, 648 MDC procedure, 901 MODEL procedure, 999 PDLREG procedure, 1360 QLIM procedure, 1393 STATESPACE procedure, 1587 SYSLIN procedure, 1641 VARMAX procedure, 1909 RETAIN statement MODEL procedure, 1167 RHO option
MODEL statement (PANEL), 1279 MODEL statement (TSCSREG), 1737 RHO= option AUTOREG statement (UCM), 1758 CYCLE statement (UCM), 1761 RKNOTS option SPLINESEASON statement (UCM), 1780 RM= option OUTPUT statement (AUTOREG), 356 OUTPUT statement (PDLREG), 1359 ROBUST option MODEL statement (PANEL), 1279 ROUND= NONE option FIXED statement (LOAN), 844 ROUND= option FIXED statement (LOAN), 844 row titles option ROWS statement (COMPUTAB), 442 ROWS statement COMPUTAB procedure, 441 RTS= option PROC COMPUTAB statement, 439 S option IRREGULAR statement (UCM), 1770 PROC SPECTRA statement, 1546 SAMESCALE option MODEL statement (MDC), 895 SAR option IRREGULAR statement (UCM), 1770 SATISFY= option SOLVE statement (MODEL), 1000 SCALE= option INPUT statement (SIMILARITY), 1455 SCAN option IDENTIFY statement (ARIMA), 230 SCENTER option MODEL statement (VARMAX), 1895 SCHEDULE option FIXED statement (LOAN), 845 SCHEDULE= option FIXED statement (LOAN), 845 SCHEDULE= YEARLY option FIXED statement (LOAN), 845 SDATA= option ENTROPY procedure, 642, 663 FIT statement (MODEL), 989, 1107, 1187 SOLVE statement (MODEL), 1001, 1121, 1148 SDIAG option PROC SYSLIN statement, 1636 SDIF= option INPUT statement (SIMILARITY), 1455 TARGET statement (SIMILARITY), 1460
VAR statement (TIMESERIES), 1698 SDIFF= option IDENTIFY statement (X12), 2117 SEASON statement TIMESERIES procedure, 1696 UCM procedure, 1774 SEASONALITY= option PROC ESM statement, 688 PROC SIMILARITY statement, 1450 PROC TIMESERIES statement, 1688 SEASONALMA= option X11 statement (X12), 2134 SEASONS= option PROC FORECAST statement, 793 PROC X12 statement, 2110 SECOND function, 147 SEED= option MODEL statement (MDC), 895 QLIM procedure, 1386 SOLVE statement (MODEL), 1004, 1121 SEIDEL option SOLVE statement (MODEL), 1003 SELECT, 1166 SELECT option ENDOGENOUS statement (QLIM), 1390 SETID= option LIBNAME statement (SASECRSP), 2195 SETMISSING= option FORECAST statement (ESM), 690 ID statement (ESM), 693 ID statement (SIMILARITY), 1453 ID statement (TIMESERIES), 1695 INPUT statement (SIMILARITY), 1455 TARGET statement (SIMILARITY), 1461 VAR statement (TIMESERIES), 1699 SICCD= option LIBNAME statement (SASECRSP), 2200 SIGCORR= option PROC STATESPACE statement, 1584 SIGMA= option OUTLIER statement (ARIMA), 236 SIGSQ= option FORECAST statement (ARIMA), 239 SIMILARITY procedure, 1446 syntax, 1446 SIMLIN procedure, 1514 syntax, 1514 SIMPLE option PROC SYSLIN statement, 1637 SIMULATE option SOLVE statement (MODEL), 1002 SIN function, 1159 SINGLE option
SOLVE statement (MODEL), 1003, 1141 SINGULAR= option ESTIMATE statement (ARIMA), 236 FIT statement (MODEL), 993 MODEL statement (TSCSREG), 1737 PROC FORECAST statement, 794 PROC STATESPACE statement, 1585 PROC SYSLIN statement, 1636 SINGULAR=option MODEL statement (PANEL), 1279 SINH function, 1159 SINTPER= option PROC FORECAST statement, 794 SKIP option ROWS statement (COMPUTAB), 443 SKIPFIRST= option ESTIMATE statement (UCM), 1765 FORECAST statement (UCM), 1767 SKIPLAST= option ESTIMATE statement (UCM), 1763 SLAG option LAG statement (PANEL), 1275 SLAG statement LAG statement (PANEL), 1275 SLENTRY= option PROC FORECAST statement, 794 SLIDE= option TARGET statement (SIMILARITY), 1461 SLOPE statement UCM procedure, 1777 SLSTAY= option MODEL statement (AUTOREG), 348 PROC FORECAST statement, 794 SMA option IRREGULAR statement (UCM), 1771 SOLVE statement MODEL procedure, 1000 SOLVEPRINT option SOLVE statement (MODEL), 1005 SORTNAMES option PROC ESM statement, 688 PROC SIMILARITY statement, 1450 PROC TIMESERIES statement, 1688 SOURCE= option LIBNAME statement (SASEHAVR), 2344 SP option IRREGULAR statement (UCM), 1771 SPAN= option OUTLIER statement (X12), 2122 PROC X12 statement, 2110 SPECTRA procedure, 1544 syntax, 1544 SPLINEREG UCM procedure, 1778
SPLINESEASON UCM procedure, 1779 SPSCALE option MODEL statement (MDC), 895 SQ option IRREGULAR statement (UCM), 1771 SQRT function, 1159 SRESTRICT statement SYSLIN procedure, 1642 SSPAN statement X11 procedure, 2055 START= option FIT statement (MODEL), 984, 1034, 1180 FIXED statement (LOAN), 844 ID statement (ESM), 694 ID statement (SIMILARITY), 1453 ID statement (TIMESERIES), 1696 LIBNAME statement (SASEHAVR), 2344 MODEL statement (AUTOREG), 348 MONTHLY statement (X11), 2050 PROC FORECAST statement, 794 PROC SIMLIN statement, 1516 PROC X12 statement, 2110 QUARTERLY statement (X11), 2054 SOLVE statement (MODEL), 1002 STARTITER option FIT statement (MODEL), 1035 STARTITER= option FIT statement (MODEL), 993 STARTSUM= option PROC ESM statement, 688 STARTUP= option MODEL statement (AUTOREG), 342 STAT= option FORECAST command (TSFS), 2548 STATESPACE procedure, 1580 syntax, 1580 STATIC option FIT statement (MODEL), 1070 SOLVE statement (MODEL), 1002, 1068, 1117 STATIONARITY= option IDENTIFY statement (ARIMA), 231 MODEL statement (AUTOREG), 345 STATS option SOLVE statement (MODEL), 1005, 1135 STB option MODEL statement (PDLREG), 1358 MODEL statement (SYSLIN), 1639 STD= option HETERO statement (AUTOREG), 351 STEST statement SYSLIN procedure, 1643 SUMBY statement
COMPUTAB procedure, 446 SUMMARY option MONTHLY statement (X11), 2050 QUARTERLY statement (X11), 2054 SUMONLY option PROC COMPUTAB statement, 439 SUR option ENTROPY procedure, 641 FIT statement (MODEL), 987, 1010, 1183 PROC SYSLIN statement, 1636 SYSLIN procedure, 1632 syntax, 1632 T option ERRORMODEL statement (MODEL), 981, 1022 TABLES statement X11 procedure, 2056 X12 procedure, 2129 TABLES table names TABLES statement (X12), 2130 TAN function, 1159 TANH function, 1159 TARGET statement SIMILARITY procedure, 1456 TAX= option COMPARE statement (LOAN), 849 TAXRATE= option COMPARE statement (LOAN), 849 TCCV= option OUTLIER statement (X12), 2124 TDCOMPUTE= option MONTHLY statement (X11), 2050 TDCUTOFF= option SSPAN statement (X11), 2055 TDREGR= option MONTHLY statement (X11), 2050 TECH= option ENTROPY procedure, 644 TECHNIQUE= option ENTROPY procedure, 644 TEST statement AUTOREG procedure, 353 ENTROPY procedure, 649 MODEL procedure, 1005 SYSLIN procedure, 1644 VARMAX procedure, 1880, 1911 TEST= option HETERO statement (AUTOREG), 351 THEIL option SOLVE statement (MODEL), 1005, 1135 TI option COMPARE statement (LOAN), 849 TICKER= option
LIBNAME statement (SASECRSP), 2200 TIME function, 148 TIME option MODEL statement (PANEL), 1279 TIME= option FIT statement (MODEL), 989, 1074 SOLVE statement (MODEL), 1074 TIMEPART function, 97, 148 TIMESERIES procedure, 1683 syntax, 1683 TIN=, 733 _TITLES_ option COLUMNS statement (COMPUTAB), 441 TO= option PROC EXPAND statement, 729, 733, 734 TODAY function, 148 TOL= option ESTIMATE statement (X12), 2116 TOTAL option PROC SIMLIN statement, 1516 TOUT=, 733 TR option MODEL statement (AUTOREG), 342 TRACE option PROC MODEL statement, 974 TRACE= option FCMPOPT statement (SIMILARITY), 1451 TRANSFORM statement X12 procedure, 2130 TRANSFORM=, 733 TRANSFORM= option ARIMA statement (X11), 2046 FORECAST statement (ESM), 691 INPUT statement (SIMILARITY), 1455 OUTPUT statement (AUTOREG), 356 OUTPUT statement (PDLREG), 1359 TARGET statement (SIMILARITY), 1461 VAR statement (TIMESERIES), 1699 TRANSFORMIN= option CONVERT statement (EXPAND), 733, 742 TRANSFORMOUT= option CONVERT statement (EXPAND), 733, 742 TRANSIN=, 733 TRANSOUT=, 733 TRANSPOSE procedure, 119 TRANSPOSE= option CORR statement (TIMESERIES), 1690 CROSSCORR statement (TIMESERIES), 1691 DECOMP statement (TIMESERIES), 1693 SEASON statement (TIMESERIES), 1696 TREND statement (TIMESERIES), 1698 TREND statement
TIMESERIES procedure, 1697 TREND= option DFPVALUE macro, 153 DFTEST macro, 155 MODEL statement (VARMAX), 1895 PROC FORECAST statement, 794 TREND=LINEAR option MODEL statement (VARMAX), 1970 TRENDADJ option MONTHLY statement (X11), 2050 QUARTERLY statement (X11), 2054 TRENDMA= option MONTHLY statement (X11), 2050 X11 statement (X12), 2135 TRIMMISS= option INPUT statement (SIMILARITY), 1456 TRIMMISSING= option INPUT statement (SIMILARITY), 1461 TRUEINTEREST option COMPARE statement (LOAN), 849 TRUNCATED option ENDOGENOUS statement (QLIM), 1389 TSCSREG procedure, 1733 syntax, 1733 TSFS, 2379 TSFS procedure, 2379 TSNAME = option PROC PANEL statement, 1273 TSVIEW macro, 2546 TWOSTEP option MODEL statement (PANEL), 1279 TYPE= option FIT statement (MODEL), 989 MODEL statement (AUTOREG), 341 MODEL statement (MDC), 895 OUTLIER statement (ARIMA), 236 OUTLIER statement (X12), 2122 PROC DATASOURCE statement, 539 PROC SIMLIN statement, 1516 SOLVE statement (MODEL), 1002 TEST statement (AUTOREG), 353 TYPE=option BLOCKSEASON statement (UCM), 1760 SEASON statement (UCM), 1776 U option ERRORMODEL statement (MODEL), 981 UCL= option OUTPUT statement (AUTOREG), 356 OUTPUT statement (PDLREG), 1359 UCLM= option OUTPUT statement (AUTOREG), 356 OUTPUT statement (PDLREG), 1359
UCM procedure, 1750 syntax, 1750 UCM procedure, PROC UCM statement PLOT option, 1754 UL option ROWS statement (COMPUTAB), 443 UNIFORMEC= option MODEL statement (MDC), 893 UNIFORMPARM= option MODEL statement (MDC), 893 UNITSCALE= option MODEL statement (MDC), 892 UNITVARIANCE= option MODEL statement (MDC), 895 UNREST option MODEL statement (SYSLIN), 1640 UPPERBOUND= option ENDOGENOUS statement (QLIM), 1389 URSQ option MODEL statement (AUTOREG), 348 USE= option FORECAST statement (ESM), 691 USERDEFINED statement X12 procedure, 2131 USSCP option PROC SYSLIN statement, 1637 USSCP2 option PROC SYSLIN statement, 1637 UTILITY statement MDC procedure, 902 VAR option MODEL statement (PANEL), 1277 MODEL statement (TSCSREG), 1736 VAR statement FORECAST procedure, 795 MODEL procedure, 1007, 1151 SPECTRA procedure, 1547 STATESPACE procedure, 1587 SYSLIN procedure, 1646 TIMESERIES procedure, 1698 X11 procedure, 2056 X12 procedure, 2132 VAR= option FORECAST command (TSFS), 2546, 2547 IDENTIFY statement (ARIMA), 231 TSVIEW command (TSFS), 2546, 2547 VARDEF= option FIT statement (MODEL), 987, 1013, 1026 MODEL statement (VARMAX), 1895 Proc ENTROPY, 641 PROC SYSLIN statement, 1637 VARIANCE= option AUTOREG statement (UCM), 1758
CYCLE statement (UCM), 1761 IRREGULAR statement (UCM), 1768 LEVEL statement (UCM), 1772 RANDOMREG statement (UCM), 1774 SLOPE statement (UCM), 1778 SPLINEREG statement (UCM), 1779 SPLINESEASON statement (UCM), 1780 VARIANCE= option BLOCKSEASON statement (UCM), 1760 SEASON statement (UCM), 1777 VARLIST statement, 890 VARMAX procedure, 1882 syntax, 1882 VCOMP= option MODEL statement (PANEL), 1279 VDATA= option FIT statement (MODEL), 989, 1108 W option ARM statement (LOAN), 847 WALD option TEST statement (ENTROPY), 650 TEST statement (MODEL), 1006 TEST statement (PANEL), 1281 TEST statement (QLIM), 1395 WEEK function, 148 WEEKDAY function, 96, 148 WEIGHT statement, 1052 ENTROPY procedure, 650 MODEL procedure, 1007 SYSLIN procedure, 1646 WEIGHT= option PROC FORECAST statement, 794 WEIGHTS statement SPECTRA procedure, 1547 WHEN, 1166 WHERE statement DATASOURCE procedure, 543 WHITE option FIT statement (MODEL), 991 WHITENOISE= option ESTIMATE statement (ARIMA), 233 IDENTIFY statement (ARIMA), 232 WHITETEST option PROC SPECTRA statement, 1546 WILDCARD= option LIBNAME statement (SASEFAME), 2294 WISHART= option SOLVE statement (MODEL), 1004 WORSTCASE option ARM statement (LOAN), 847 X11 procedure, 2040
syntax, 2040 X11 statement X12 procedure, 2132 X12 procedure, 2106 syntax, 2106 XBETA OUTPUT statement (COUNTREG), 493 OUTPUT statement (QLIM), 1393 XBETA= option OUTPUT statement (MDC), 901 XLAG option LAG statement (PANEL), 1275 XLAG statement LAG statement (PANEL), 1275 XLAG= option MODEL statement (VARMAX), 1900 XPX option FIT statement (MODEL), 992, 1043 MODEL statement (PDLREG), 1358 MODEL statement (SYSLIN), 1640 XREF option PROC MODEL statement, 973, 1170 YEAR function, 96, 148 YRAHEADOUT option PROC X11 statement, 2043 YYQ function, 96, 148 ZERO= option COLUMNS statement (COMPUTAB), 441 ROWS statement (COMPUTAB), 443 ZEROMISS option PROC FORECAST statement, 794 ZEROMISS= option FORECAST statement (ESM), 691 INPUT statement (SIMILARITY), 1456 TARGET statement (SIMILARITY), 1462 ZEROMISSING= option ID statement (PROC ESM), 694 ID statement (SIMILARITY), 1454 ZEROMODEL statement COUNTREG procedure, 494 ZEROWEIGHT= option MONTHLY statement (X11), 2051 QUARTERLY statement (X11), 2055 ZGAMMA OUTPUT statement (COUNTREG), 493 ZLAG option LAG statement (PANEL), 1275 ZLAG statement LAG statement (PANEL), 1275
Your Turn
We welcome your feedback. If you have comments about this book, please send them to yourturn@sas.com. Include the full title and page numbers (if applicable). If you have comments about the software, please send them to suggest@sas.com.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. © 2009 SAS Institute Inc. All rights reserved. 518177_1US.0109