
Overview

SAS
Robert N. Rodriguez
SAS software is a comprehensive set of integrated tools and solutions for accessing, managing, and analyzing data. SAS, which was formed as a company in 1976, is a leading developer of statistical software, which is widely used in academic, business, and government organizations. Since the 1980s, SAS has expanded its analytical software to include forecasting and econometrics, data mining, text mining, and operations research. SAS now builds on these components to provide software for business analytics and solutions for industry-specific problems such as customer intelligence, fraud prevention, and risk management. This article describes the evolution of SAS as a company and overviews new directions in its analytical software. An example program illustrates key elements of SAS programming that are useful for statistical analysis. © 2010 John Wiley & Sons, Inc.
WIREs Comp Stat 2011 3 1–11 DOI: 10.1002/wics.131

INTRODUCTION
The term SAS (pronounced "sass") refers to an integrated set of software tools and solutions produced by SAS for making data-driven decisions. The term also refers to SAS as a company and the SAS programming language. This article describes the evolution of SAS as a company and focuses on SAS software for statistical analysis and related analytical areas, including forecasting and econometrics, data mining, text analytics, and operations research. An example program illustrates key features of the language that are commonly used in statistical applications. In addition to the analytical components that are discussed in this article, SAS software provides comprehensive tools for data access, data management and transformation, data summarization, graphics, reporting, and applications development. Using all these components as a foundation, SAS develops business solutions that address complex, industry-specific problems such as customer intelligence, fraud prevention, and risk management. SAS software runs on all major computing platforms and is widely used in universities, research organizations, government agencies, and businesses in markets ranging from pharmaceutical and life sciences to retail and financial services.
Correspondence to: Bob.Rodriguez@sas.com
SAS Institute, SAS Campus Drive, Cary, NC, USA

EVOLUTION OF SAS AS A COMPANY


SAS is the leading provider of software for business analytics, with customers in 118 countries and more than 45,000 business, government, and university sites [1]. Headquartered in Cary, NC, SAS is a privately held company with a long-standing reputation for reinvesting revenue in research and development (23% of revenue in 2009). SAS has received many awards for its workplace environment and employee benefits, and it has been on FORTUNE magazine's annual list of 100 Best Companies to Work For since the list was established in 1998. SAS was ranked first on this list in 2010 [2].

EARLY HISTORY
In 1966, members of the University Statisticians Southern Experiment Stations, a consortium of eight universities, received a grant from the National Institutes of Health (NIH) to develop a statistical software package to analyze agricultural data. This project found a home in the Statistics Department at North Carolina State University, where it was led by Jim Barr and Jim Goodnight [3]. NIH funding was discontinued in 1972, but development continued with support from the consortium. Along with Barr and Goodnight, the core team included Jane Helwig and John Sall. The package they developed ran on IBM mainframe computers and was called the Statistical Analysis System, which resulted in the name SAS as an acronym.

Volume 3, January/February 2011


www.wiley.com/wires/compstats

The early success of SAS software became evident in 1976, when over 300 people attended the first conference for SAS users. Attendees represented pharmaceutical, insurance, and automotive companies as well as universities. One of the strengths of the SAS 76 release was the GLM procedure, a versatile program for analyzing general linear models, which was written by Goodnight. At that time, SAS software consisted of about 300,000 lines of internal code. In 1976, the four cofounders left North Carolina State University to further the development of SAS software and formed SAS Institute Inc. as a private company (the company was later rebranded as SAS, and the name is no longer an acronym). By 1978, the number of employees had grown to 21, and there were 600 customer sites.

RAPID GROWTH
In 1980, the software grew considerably with the introduction of SAS/GRAPH software for business graphics and SAS/ETS software for the analysis of econometrics and time series data. The company moved to the location of its current headquarters in Cary, and its first subsidiary, SAS Software Limited, was established in the United Kingdom. During the 1980s, the SAS MultiVendor Architecture was introduced, and the internal code was rewritten in the C language, allowing the software to run on minicomputers and personal computers in addition to mainframe computers. SAS software added full-screen spreadsheet capabilities, and SAS introduced JMP statistical software for the Macintosh computer in 1989 (see also JMP and Ref 4). The number of SAS employees reached nearly 1500 by 1990, with growth in sales and marketing, technical support, publications, and training, as well as research and development.

During the 1990s, SAS expanded in many new directions with additions for executive information systems, Web-enablement, data warehousing, data mining, applications to customer relationship management, and business solutions that were initially oriented horizontally across industries. At the same time, software for statistical analysis continued to grow and remains one of the core strengths of SAS. The number of employees grew to more than 7000.

SAS is currently the world's largest privately held software company. Jim Goodnight is the President and CEO, and John Sall is the Executive Vice President. SAS has over 11,000 employees with 5600 in the United States. SAS has 400 offices globally, and it has research and development centers in Beijing, China, and Pune, India, in addition to its main research and development center in Cary. The development of statistical and other analytical software continues to be emphasized, and it is carried out by over 200 developers with doctorates in statistics, operations research, applied mathematics, numerical analysis, and computer science.

A key factor in the growth of SAS has been its emphasis on listening to customers, and SAS has fostered a variety of activities for this purpose. The most prominent is an annual international conference for SAS users called SAS Global Forum [5] (renamed from SAS Users Group International in 2007), which is independently organized by users and provides opportunities for users to exchange ideas and interact with SAS staff. In addition, there are six annual regional conferences for SAS users in the United States and numerous local user groups [6]. In recent years, SAS has introduced conferences for executives (The Premier Business Leadership Series) and conferences for analysts in the areas of data mining and forecasting. SAS also conducts an annual survey, the SASware Ballot, which enables customers to indicate preferences for software enhancements.

BUSINESS ANALYTICS AND INDUSTRY SOLUTIONS


A major focus of new development at SAS is business analytics: the combined use of statistical analysis, forecasting, data mining, and optimization to make critical business decisions by analyzing customer and operational data [7,8]. Increasingly, the problems encountered in this arena are characterized by massive amounts of data, requiring the use of high-performance computing techniques which include parallel computing, grid computing, and in-database processing. Much of the data is unstructured, and consequently the application of text analytics is growing.

Since 2000, SAS has introduced a large number of solutions for industry-specific business problems including anti-money laundering, credit scoring, fraud detection, and risk management in banking; campaign management, customer retention, and customer segmentation in communications; demand-driven forecasting and supply chain intelligence in manufacturing; and size optimization and revenue optimization in retail. These solutions are developed by interdisciplinary teams consisting of industry domain specialists, database and interface programmers, and analysts who draw on a variety of methods. For example, the SAS Markdown Optimization retail solution applies a combination of linear mixed models, forecasting methods, and constrained optimization techniques to determine weekly price schedules.

ANALYTICAL SOFTWARE
SAS software is composed of Base SAS together with many specialized components, which are available as integrated add-on products. This section provides an overview of the components and products that are most commonly used for statistical and other analytical applications (a number of these elements are illustrated in the example at the end of this article). Information about all SAS products is available at http://www.sas.com/software/, and complete documentation is available at http://support.sas.com/documentation/index.html. At the time of this writing, SAS 9.2 is the most recent release of SAS software. Information about new releases is available at http://support.sas.com/software/, and updates to statistical and other analytical products are described at http://support.sas.com/rnd/app/.

Foundation Tools
Base SAS provides the main foundational components for all SAS applications. The main features of Base SAS are as follows:
• A fourth-generation language, referred to as the SAS language or the DATA step language, which is designed for data access, management, transformation, and reporting.
• A library of procedures (large precompiled programs) for data manipulation, descriptive statistics, and report writing. Key procedures for sorting, querying, and summarizing data are multithreaded, enabling them to take advantage of symmetric multiprocessing (SMP) hardware.
• A macro facility for modularizing SAS programs.
• The Output Delivery System for reporting and displaying analytical results in a variety of output destinations and formats, such as RTF and PDF.

In addition to Base SAS, three other foundation tools are commonly used in analytical work. SAS Enterprise Guide is a graphical user interface, implemented as a Microsoft Windows client application, which makes many SAS tasks accessible to a wide audience of users without requiring knowledge of SAS programming. SAS/ACCESS software provides access to data stored in third party databases. SAS Grid Computing enables SAS programs and solutions to leverage a centrally managed grid computing environment for faster processing.

Analytical Products
SAS analytical products provide functionality in the following areas:
• statistical analysis
• statistical graphics
• statistical quality improvement
• forecasting and econometrics
• data mining
• text analytics
• operations research

The following sections summarize the functionality in each of these areas.

Statistical Analysis

SAS/STAT software provides a comprehensive set of tools for data analysis and statistical modeling. Over 70 procedures are available for analysis of variance, linear mixed models, nonlinear mixed models, generalized linear mixed models, regression modeling, robust regression, nonparametric regression, partial least squares regression, quantile regression, categorical data analysis, exact inference methods, Bayesian modeling and inference, multivariate analysis, factor analysis, principal components analysis, structural equation models, psychometric analysis, cluster analysis, survival analysis, nonparametric analysis, survey design and analysis, analysis of spatial data, multiple imputation for missing data, interim analysis of clinical trials data, and power and sample size computations. SAS/STAT also includes the Power and Sample Size Application, a point-and-click interface for study design, and a variety of specialized macros, including the %MktEx and %ChoicEff macros for creating efficient experimental designs used in marketing research [9]. Major new procedures in SAS/STAT 9.2 include the GLIMMIX procedure for generalized linear mixed models, the GLMSELECT procedure for modern regression modeling with classification effects, the QUANTREG procedure for quantile regression, and the SEQDESIGN and SEQTEST procedures for group sequential analysis of clinical trials data. Bayesian analysis is available for generalized linear models in the GENMOD procedure, parametric survival analysis in the LIFEREG procedure, and semiparametric survival analysis in the PHREG procedure.

In addition, general-purpose Bayesian modeling capability is available in the MCMC procedure.

SAS/STAT 9.22, available in 2010 [10], is the most recent release of SAS/STAT. It provides functionality for analyzing constructed effects in linear models, including spline effects, and extensive new functionality for postfitting analysis in linear models. The PLM procedure provides the ability to store model fit information and perform postfitting inference without refitting the model. The SURVEYPHREG procedure fits the semiparametric Cox model to sample survey data. Zero-inflated negative binomial models for count data and exact Poisson regression are available in the GENMOD procedure. Bootstrap model averaging is provided in the GLMSELECT procedure. The CALIS procedure for structural equations modeling is completely updated, and the VARIOGRAM, KRIGE2D, and SIM2D procedures for spatial analysis are enhanced.

SAS/IML software provides an extensive, interactive matrix programming language which is used by statisticians, researchers, and analysts to implement novel statistical algorithms and other analytical methods [11,12]. The SAS/IML language provides the ability to read data into vectors and matrices, along with a concise syntax for matrix operations and over 300 built-in functions and subroutines. Base SAS functions can also be called in SAS/IML. The IML procedure implements the SAS/IML language and can be run as part of a SAS program or interactively, in the sense that each statement is executed as it is submitted.

A major recent addition to SAS/IML software is SAS/IML Studio, an advanced computing environment for high-end data analysis that provides facilities for implementing algorithms, exploring data, fitting and comparing statistical models, and creating interactive graphical displays [12,13]. SAS/IML Studio is a client application that runs on a Windows PC and can connect to one or more SAS servers. SAS/IML Studio provides a programming environment (illustrated in Figure 1) for developing and running programs written in the IMLPlus language. IMLPlus extends the SAS/IML language with the added ability to call SAS procedures as functions, create customized, dynamic graphics, and call functions from libraries written in C/C++, FORTRAN, and Java. SAS/IML Studio 3.2 also provides an interface to R, which includes the ability to exchange data and matrices with R as well as execute R code and retrieve results. Multithreaded workspaces enable the user to move seamlessly between writing programs and analyzing data interactively [11–13].

FIGURE 1 | SAS/IML Studio session. [Screenshot: an IMLPlus program (AdjustedMortality.sx) that calls the LOGISTIC procedure, shown with the resulting table of maximum likelihood estimates and linked plots of predicted mortality probability by age, adjusted mortality rate by physician ID, and a bar chart of severity of patient condition.]

Statistical Graphics
Prior to SAS 9.2, creating graphs for statistical analysis was cumbersome because it required additional programming steps after a statistical procedure was run. In SAS 9.2, major new functionality, called ODS Statistical Graphics (or ODS Graphics for short), facilitates the process of creating analysis-specific statistical graphics in over 60 procedures in SAS/STAT, SAS/ETS, SAS/QC, and Base SAS [14]. These procedures now produce graphs as automatically as they produce tables. Figure 2 illustrates convergence diagnostic plots created by the MCMC procedure using ODS Graphics. In addition, new SAS/GRAPH procedures (SGPLOT, SGPANEL, and SGSCATTER) use this functionality to produce plots for exploratory data analysis and customized statistical displays [15,16].

ODS Graphics is an extension of the Output Delivery System (ODS), and many ODS features for tabular output apply equally to graphs. ODS Graphics produces graphs in standard image file formats. The consistent appearance and individual layout of graphs are controlled by ODS styles and templates, respectively. The user can make programmatic changes to styles and templates, or directly modify graphs with the ODS Graphics Editor, a point-and-click interface. For an introduction to ODS Graphics see Ref 15. The Graph Template Language, on which ODS Graphics is based, can be used to create customized displays [16].

Statistical Quality Improvement
SAS/QC software provides a broad range of methods for statistical quality improvement of products, processes, and services. These techniques include basic problem-solving methods (Pareto charts and Ishikawa diagrams), measurement system evaluation, statistical process control methods (Shewhart charts, cumulative sum control charts, moving average charts), process capability analysis, analysis of means, statistical reliability analysis, and design of experiments. SAS/QC also includes the point-and-click ADX Interface for Design and Analysis of Experiments. SAS/QC provides a wide variety of graphical displays, which are created with ODS Graphics in SAS 9.2.

Forecasting and Econometrics
SAS/ETS software provides an extensive set of econometric and time series analysis tools. Procedures are available for discrete choice and qualitative and limited dependent variable analysis, regression with autocorrelated and heteroscedastic errors, simultaneous systems linear regression, linear systems simulation, polynomial distributed lag regression, nonlinear systems regression and simulation, ARIMA (Box-Jenkins) and ARIMAX (Box-Tiao) modeling and forecasting, vector time series analysis, state space modeling and forecasting, spectral analysis, seasonal adjustment, structural time series modeling and forecasting, and time series cross-sectional regression analysis. Procedures are also available for automatic time series forecasting, time series interpolation and frequency conversion, trend and seasonal analysis on transaction databases, access to financial and economic databases, spreadsheet calculations and financial report generation, and loan analysis.

Major new procedures in SAS/ETS 9.2 include the ESM procedure for forecasting by using exponential smoothing models with optimized smoothing weights and the SIMILARITY procedure for similarity analysis of time series data. SAS/ETS 9.22 is the most recent release, available in 2010; it adds the SEVERITY procedure for fitting distributions of the magnitude of events and the TIMEID procedure for determining the time interval of observations in a time series data set.

SAS Forecast Server is a solution that provides large-scale forecasting for high volumes of time series data by automatically selecting the most appropriate forecasting model for each item from an extensible model repository. For series with a hierarchical structure, SAS Forecast Server provides hierarchical forecast reconciliation and disaggregation. SAS Forecast Server includes a graphical user interface, SAS Forecast Studio, and two computational engines, SAS/ETS and SAS High-Performance Forecasting.
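To make the phrase "exponential smoothing models with optimized smoothing weights" concrete, the simplest member of that family can be written out. This is only a sketch of simple (single) exponential smoothing; the symbol $\alpha$ for the smoothing weight and the squared-error criterion used to choose it are assumptions made for illustration, not a statement of how the ESM procedure is implemented.

```latex
\begin{align*}
\hat{y}_{t+1} &= \alpha\, y_t + (1-\alpha)\, \hat{y}_t, \qquad 0 < \alpha < 1, \\
\hat{\alpha} &= \arg\min_{\alpha} \sum_{t} \left( y_t - \hat{y}_t \right)^2 .
\end{align*}
```

Under this reading, an "optimized" weight is the value of $\alpha$ that minimizes the one-step-ahead prediction errors over the observed series; larger values of $\alpha$ make the forecast react more quickly to recent observations.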

Data Mining

SAS Enterprise Miner provides a comprehensive set of data mining methods for predictive and descriptive modeling, which include linear and logistic regression models, decision trees, bagging and boosting, neural networks, memory-based reasoning, clustering, and associations. The graphical user interface organizes the data mining process into five steps: sample, explore, modify, model, and assess. By deploying nodes and building process flow diagrams within this interface, the user can prepare and summarize data, develop and validate predictive models, assess and compare models, and generate scored data sets. Customized tools can be added via an extension node interface.

SAS Enterprise Miner generates scoring code in SAS, C, Java, and PMML (Predictive Model Markup Language). This code can be deployed in various real-time or batch environments, within SAS, on the Web, or directly in relational databases. Results can be passed to SAS business solutions, such as SAS Marketing Automation, SAS Model Manager, and SAS Real-Time Decision Manager for deployment. SAS Enterprise Miner has an architecture that is based on a Java client and SAS server. Processes can be run in parallel for distribution across a grid of servers, or they can be scheduled for batch processing.

SAS Rapid Predictive Modeler was introduced in 2010. This software facilitates and automates the process of creating standard predictive models by business analysts for customer intelligence applications. Models are built using SAS Enterprise Miner functionality, and the analysis can be saved to a SAS Enterprise Miner project. SAS Rapid Predictive Modeler runs as a task in SAS Enterprise Guide or the SAS Add-In for Microsoft Office. Models can be registered to a SAS Metadata Server and used in scoring and reporting processes in applications such as SAS Enterprise Guide, SAS Add-In for Microsoft Office, SAS Data Integration Studio, and SAS Model Manager, which provide a model repository that enables version control of the scoring code, lifecycle management of models, and model performance monitoring. Model scoring can also be carried out directly in several relational database environments via SAS Scoring Accelerator.

FIGURE 2 | Plots for assessing Markov chain convergence. [Diagnostics for alpha: trace of alpha by iteration, autocorrelation by lag, and posterior density.]

Text Analytics
Businesses increasingly use information contained in large quantities of text, such as documents, Web pages, blogs and other social media, call center notes, and claims forms. Four SAS products are available for analyzing textual data.

SAS Text Miner provides text parsing, dimension reduction of term-document matrices by singular value decomposition (SVD), text topic identification, and clustering. Multiple languages are supported. Documents can be classified into predefined categories or grouped by applying clustering techniques to their SVD projections. Interactive visualization enables analysts to explore concepts and relationships between documents. The SVD projections can be combined with other structured data to build predictive models using SAS Enterprise Miner.

SAS Sentiment Analysis collects digital content sources, including mainstream Web sites and social media outlets. It uses statistical techniques and linguistic rules to extract the sentiments expressed in these text collections. Business applications include assessment of customer experience and evaluation of new products.

SAS Content Categorization applies natural language processing and linguistic techniques to automatically categorize large volumes of multilingual content that is acquired or generated, or exists in a repository. It provides the ability to define a hierarchical taxonomy in which related topics are grouped together, and it automatically classifies documents. It also detects and extracts entities, concepts, and events from text and documents.

SAS Ontology Management can build on a taxonomy created by SAS Content Categorization by associating other metadata with particular categories. The resulting semantic terms are used to organize previously disassociated and isolated text repositories. SAS Ontology Management enables collaborative ontology development, integrates existing document repository assets, and identifies relationships between document repositories.
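The SVD-based dimension reduction described above for SAS Text Miner can be sketched in generic notation. The symbols below ($A$ for an $m \times n$ term-document matrix, $k$ for the number of retained dimensions, $d$ for one document's term vector) are assumptions introduced for illustration; the scaling and weighting conventions used inside the product may differ.

```latex
\begin{align*}
A_{m \times n} &\approx U_k \Sigma_k V_k^{\top}, \qquad k \ll \min(m, n), \\
\hat{d} &= U_k^{\top} d \in \mathbb{R}^{k} .
\end{align*}
```

In this sketch, each document is replaced by a short vector $\hat{d}$ of $k$ coordinates. Vectors of this kind are what clustering methods or predictive models would then operate on in place of the full term counts.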

Operations Research
SAS/OR software brings together many of the analytical modeling and solution methods that are collectively referred to as operations research. Areas of operations research addressed by SAS/OR include mathematical optimization, project and resource scheduling, discrete event simulation, and genetic algorithms and constraint programming. Starting in 2006, SAS/OR has added completely new technologies for building and solving optimization models [17]. The OPTMODEL procedure provides an algebraic optimization modeling language, enabling users to build optimization models easily and directly. The OPTMODEL procedure is accompanied by a suite of optimization solvers for linear, mixed-integer, quadratic, and general nonlinear optimization models, providing direct access to all of these solvers and also supporting the creation of customized solution methods. SAS Simulation Studio, added to SAS/OR in SAS 9.2, provides a set of tools and a graphical user interface for creating, executing, and analyzing the results of discrete event simulation models.

EXAMPLE PROGRAM
The example program in this section illustrates various elements of SAS programming, referred to in the preceding section, that are broadly useful for statistical analysis: the DATA step, a SAS/STAT procedure, ODS Graphics, SAS/IML code, and SAS macro variables. The program applies the adaptive lasso algorithm for variable selection to data that are simulated from the model $y = X\beta + N(0, \sigma^2)$, where $X_{n \times p}$ is drawn from a multivariate normal distribution $N(0, V_{p \times p})$ with $V_{i,j} = \rho^{|i-j|}$, where $0 < \rho < 1$. This setup has been studied in investigations of variable selection methods [18–20]. The following statements use the IML procedure to simulate the matrix $X$. The matrix $X$ is saved as a SAS data set named Regressors with variables X1, X2, . . . , X10. For convenience in rerunning the simulation, the first statement sets the values of $n$, $p$, and $\rho$ with macro variables.

%let nObs=100;
%let nVars=10;
%let rho=0.5;
proc iml;
   /* Simulate X */
   /* Correlation matrix with corr[i,j] = rho**|i-j| */
   corr = j(&nVars, &nVars);
   do i = 1 to &nVars;
      do j = 1 to &nVars;
         corr[i,j] = &rho##abs(i-j);
      end;
   end;
   mean = j(1, &nVars, 0);
   call RANDSEED(1);
   x = RANDNORMAL( &nObs, mean, corr );
   /* Save X as SAS data set */
   /* Variable names x1, x2, ..., x&nVars */
   varNames = "x1":("x"+strip(char(&nVars)));
   create Regressors from x[colname=varNames];
   append from x;
   close Regressors;
quit;
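The simulation design that these statements implement can be collected in one place. The following is a sketch of the model as the text describes it; writing the noise term as a separate vector $\varepsilon$ and the rows of $X$ as independent draws are assumptions made for readability.

```latex
\begin{align*}
y &= X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n), \\
X_{n \times p} &\sim N(0, V_{p \times p}) \quad \text{(rows drawn independently)}, \\
V_{i,j} &= \rho^{|i-j|}, \qquad 0 < \rho < 1 .
\end{align*}
```

With the macro values used here ($n = 100$, $p = 10$, $\rho = 0.5$), adjacent regressors have correlation $V_{1,2} = 0.5$, regressors two apart have $V_{1,3} = 0.25$, and the most distant pair has $V_{1,10} = 0.5^{9} \approx 0.002$, so correlation decays geometrically with the distance between column indices.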

The following DATA step combines Regressors with a response whose true coefcients are = (3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0) and to which noise is added. The result is a data set named SimData. data SimData; set Regressors; yTrue = 3*x1 + 1.5*x2 + 2*x5; y = yTrue + 3*rannor(2); run; Note that the DATA step is often used to input raw data. The following statements illustrate how the rst two observations in SimData could be created by inputting values of the variables:

TABLE 1 Model Selected by Adaptive Lasso


Parameter Estimates Parameter Intercept x1 x2 x5 DF 1 1 1 1 Estimate 0.353559 2.808868 1.841914 1.740392

The STOP= and CHOOSE= options specify that the entire lasso solution path is to be generated and that the selected model is the model at the step that yields the minimum value of the Schwartz Bayesian

data SimData; input x1-x10 y; datalines; 0.020 0.886 0.573 0.592 0.974 0.467 0.492 0.594 0.955 2.107 7.272 0.226 -0.147 0.770 0.482 0.100 0.328 0.448 1.112 2.538 0.676 -1.335 run; The adaptive lasso algorithm is a modication of the standard lasso algorithm in which weights are applied to each of the parameters in forming the lasso constraint.20 Simulation studies show that the adaptive lasso tends to perform better than the standard lasso in selecting the correct regressors, particularly in high signal-to-noise ratio cases. The following statements use the GLMSELECT procedure to t an adaptive lasso model to the data in SimData. The rst statement invokes ODS Graphics. criterion (SBC). The solution path and the progression of SBC are shown in Figure 3. This display is requested with the PLOTS= option and is created with ODS Graphics. The parameter estimates of the selected model are shown in Table 1. The selected model contains only the relevant regressors x1, x2, and x5. The frequency with which adaptive lasso selects this set and the stability of the corresponding estimates can be investigated by

ods graphics on; proc glmselect data=SimData plots=coefficients; model y=x1-x10/selection=lasso(adaptive stop=none choose=sbc); run;
TABLE 2 Model Selection Frequency
The GLMSELECT Procedure Model Selection Frequency Times Selected 386 155 53 52 40 35 21 Selection Percentage 38.60 15.50 5.30 5.20 4.00 3.50 2.10 Number of Effects 4 5 6 5 6 5 5 Frequency Score 387.0 155.9 53.77 52.84 40.76 35.82 21.82 Effects in Model Intercept x1 x2 x5 Intercept x1 x2 x5 x10 Intercept x1 x2 x3 x5 x10 Intercept x1 x2 x3 x5 Intercept x1 x2 x4 x5 x10 Intercept x1 x2 x5 x8 Intercept x1 x2 x5 x9

2010 Jo h n Wiley & So n s, In c.

Vo lu me 3, Jan u ary/Febru ary 2011

WIREs Computational Statistics

SAS

Coefficient progression for y 0.5 Standardized coefficient 0.4 0.3 0.2 0.1 0.0 0.1 350 325 SBC 300 275 250 Intercept 1+x1 2+x2 3+x5 4+x10 5+x3 6+x7 Effect sequence 7+x4 8+x6 9+x8 10+x9
Selected step
x10 x6 x4 x3 x1

x2 x5

FIGURE 3 | Coefcient progression for adaptive lasso.

FIGURE 4 | Parameter estimate distributions. [Plot not reproduced. Panel of histograms titled "Parameter estimate distributions for y" (number of samples = 1000), with percent on the vertical axis. Panels and the number of samples contributing to each: Intercept (N = 1000), x1 (N = 1000), x2 (N = 1000), x5 (N = 999), x10 (N = 402), x3 (N = 211).]



drawing subsamples from the data, fitting a model for each sample, and examining how frequently different models are selected. The average model can be used for prediction. The following statements perform this analysis:

proc glmselect data=SimData plots=ParmDistribution seed=1;
   model y=x1-x10 / selection=lasso(adaptive stop=none choose=SBC);
   modelAverage nSamples=1000;
run;

Model selection frequencies are shown in Table 2. The PLOTS= option requests the display in Figure 4, which shows the distribution of the estimates for parameters in the average model. The most frequently selected model is the model that contains just the true underlying regressors. Although this model is selected in 38.6% of the samples, most of the selected models contain at least one irrelevant regressor. This is not surprising because although the true model has just a few large effects, the regressors have nontrivial pairwise correlations.
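The resampling idea can also be sketched by hand, which may help clarify what the MODELAVERAGE statement automates. The following is a minimal sketch under stated assumptions, not the procedure's internal algorithm: it assumes the complete SimData data set is available, draws samples with replacement at 100% of the original size, and refits the adaptive lasso to each sample with BY-group processing. The data set names Boot and BootEst are hypothetical.

```sas
/* Sketch only: a manual version of the resampling analysis.            */
/* Boot and BootEst are hypothetical data set names.                    */

/* Draw 1000 samples with replacement, each the same size as SimData.   */
/* REPS= adds a Replicate variable; OUTHITS writes one row per draw.    */
proc surveyselect data=SimData out=Boot seed=1
                  method=urs samprate=1 reps=1000 outhits;
run;

/* Suppress displayed output and graphs while keeping ODS tables        */
/* available to the ODS OUTPUT statement.                               */
ods graphics off;
ods exclude all;

/* Fit the adaptive lasso separately to each sample and save the        */
/* parameter estimates of every selected model.                         */
proc glmselect data=Boot;
   by Replicate;
   model y=x1-x10 / selection=lasso(adaptive stop=none choose=sbc);
   ods output ParameterEstimates=BootEst;
run;

ods exclude none;

/* Inspect the first rows: one block of estimates per replicate.        */
proc print data=BootEst(obs=20);
run;
```

Counting how often each effect appears in BootEst gives selection frequencies comparable in spirit to those in Table 2, although the counts need not match the MODELAVERAGE results exactly because the random samples differ.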

NOTES
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

REFERENCES
1. http://www.sas.com.
2. http://money.cnn.com/magazines/fortune/bestcompanies/2010/.
3. http://www.sas.com/company/about/history.html.
4. http://www.jmp.com/.
5. http://www.sasglobalforum.org.
6. http://support.sas.com/usergroups/index.html.
7. Davenport TH, Harris JG. Competing on Analytics: The New Science of Winning. Boston: Harvard Business School Press; 2007.
8. Davenport TH, Harris JG, Morison R. Analytics at Work. Boston: Harvard Business School Press; 2010.
9. http://support.sas.com/resources/papers/tnote/tnote marketresearch.html.
10. Stokes M, Rodriguez RN, Cohen R. The Next Generation: SAS/STAT 9.22. In: Proceedings of the SAS Global Forum 2010 Conference. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/resources/papers/proceedings10/264-2010.pdf; 2010.
11. Wicklin R. Rediscovering SAS/IML Software: Modern Data Analysis for the Practicing Statistician. In: Proceedings of the SAS Global Forum 2010 Conference. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/resources/papers/proceedings10/329-2010.pdf; 2010.
12. Wicklin R. Statistical Programming with SAS/IML Software. Cary, NC: SAS Institute Inc.; 2010.
13. Wicklin R. SAS Stat Studio: A Programming Environment for High-End Data Analysts. In: Proceedings of the SAS Global Forum 2008 Conference. Cary, NC: SAS Institute Inc. Available at http://www2.sas.com/proceedings/forum2008/362-2008.pdf; 2008.
14. Rodriguez RN. Getting Started with ODS Statistical Graphics in SAS 9.2, Revised 2009. Available at http://support.sas.com/rnd/app/papers/intodsgraph.pdf.
15. Heath D. Effective Graphics Made Simple Using SAS/GRAPH SG Procedures. In: Proceedings of the SAS Global Forum 2008 Conference. Cary, NC: SAS Institute Inc. Available at http://www2.sas.com/proceedings/forum2008/255-2008.pdf; 2008.
16. Kuhfeld W. Statistical Graphics in SAS: An Introduction to the Graph Template Language and the Statistical Graphics Procedures. Cary, NC: SAS Institute Inc.; 2010.
17. Hughes E, Kearney T. New Features in Optimization with SAS/OR Software. In: Proceedings of the SAS Global Forum 2009 Conference. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/resources/papers/proceedings09/300-2009.pdf; 2009.
18. Breiman L. The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. J Am Stat Assoc 1992, 87:738-754.
19. Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Stat Soc Series B 1996, 58:267-288.
20. Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc 2006, 101:1418-1429.


FURTHER READING
The following books provide introductions to SAS programming and basic uses of SAS software. Complete documentation for SAS software products is available at http://support.sas.com/documentation/index.html.

Burlew MM. SAS Macro Programming Made Easy. 2nd ed. Cary, NC: SAS Institute Inc.; 2006.
Carpenter A. Carpenter's Complete Guide to the SAS Macro Language. 2nd ed. Cary, NC: SAS Institute Inc.; 2004.
Cody R. Learning SAS by Example: A Programmer's Guide. Cary, NC: SAS Institute Inc.; 2007.
Delwiche L, Slaughter S. The Little SAS Book: A Primer. 4th ed. Cary, NC: SAS Institute Inc.; 2008.
Slaughter S, Delwiche L. The Little SAS Book for Enterprise Guide 4.2. Cary, NC: SAS Institute Inc.; 2010.
