Overview
PROC ROBUSTREG is experimental in SAS/STAT Version 9.* Its main purpose is to detect outliers and to provide resistant (stable) results in the presence of outliers. It addresses three types of problems:
- outliers in the y-direction (the response direction)
- multivariate outliers in the x-space (leverage points)
- outliers in both the y-direction and the x-space

* These notes closely follow the SAS documentation for ROBUSTREG. See also the paper "Robust Regression and Outlier Detection with the ROBUSTREG Procedure" by Colin Chen, presented at SUGI 27 in 2002 (http://www2.sas.com/proceedings/sugi27/p265-27.pdf).
Overview
ROBUSTREG supports four methods:

1. M estimation: introduced by Huber in 1973. The simplest method both computationally and theoretically; it only addresses contamination in the response direction.
2. Least Trimmed Squares (LTS): introduced by Rousseeuw in 1984. A so-called high breakdown method. The breakdown value is a measure of the proportion of contamination that an estimation method can withstand and still maintain its robustness. Uses the FAST-LTS algorithm of Rousseeuw and Van Driessen (1998).
3. S estimation: introduced by Rousseeuw and Yohai in 1984. A high breakdown method that is more statistically efficient than LTS.
4. MM estimation: introduced by Yohai in 1987. Combines high breakdown value estimation and M estimation; a high breakdown method that is more statistically efficient than S estimation.
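The breakdown value idea is easiest to see with location estimators. A toy illustration in Python (not SAS, and not part of the SAS documentation) comparing the sample mean, whose breakdown value is 0, with the sample median, whose breakdown value is 0.5:

```python
import numpy as np

x = np.arange(1.0, 11.0)          # clean sample of 10 points; mean = median = 5.5
bad = x.copy()
bad[-4:] = 1e6                    # replace 40% of the points with wild values

# The mean breaks down immediately: 40% contamination moves it by orders of magnitude.
assert np.mean(bad) > 1e5
# The median withstands up to 50% contamination, so 40% leaves it unchanged here.
assert np.median(bad) == np.median(x) == 5.5
```

The same logic carries over to regression: LTS, S, and MM estimators are designed so that a sizable fraction of contaminated observations cannot move the fit arbitrarily.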
Overview M Estimation
Before getting to the SAS code, it's probably worthwhile to review what's involved with the simplest robust estimator, the M estimator. These notes follow some online documentation for the text Applied Regression Analysis, Linear Models, and Related Methods by John Fox. When the error distribution is normal, least squares (LS) is the most efficient regression estimator. However, LS is very sensitive to outliers (aberrant observations in the y-direction), particularly at high leverage points (aberrant observations in the x-direction). Such cases result in heavy-tailed error distributions.
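To make the M-estimation recipe concrete, here is a minimal Python sketch (not SAS, and a simplification of what PROC ROBUSTREG actually does): iteratively reweighted least squares (IRLS) with Huber weights and the MAD as the robust scale estimate. The tuning constant k = 1.345 is the usual default for the Huber weight function.

```python
import numpy as np

def m_estimate(X, y, k=1.345, tol=1e-8, max_iter=200):
    """M estimation of a linear model via iteratively reweighted least squares,
    using Huber weights and the MAD as the robust scale estimate."""
    X1 = np.column_stack([np.ones(len(y)), X])      # prepend an intercept column
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]    # ordinary LS starting values
    for _ in range(max_iter):
        r = y - X1 @ beta
        # robust scale: median absolute deviation, rescaled for normality
        s = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)
        u = np.abs(r / s)
        w = np.minimum(1.0, k / np.maximum(u, 1e-12))   # Huber weights
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Observations with small standardized residuals get weight 1 (so they are treated as in ordinary LS), while gross y-outliers are progressively downweighted; this is why M estimation handles contamination in the response direction but not bad leverage points.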
Leverage Points
Leverage points are outlying points in the x-direction. A leverage point may or may not have an effect on the estimated regression model.
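A rough numerical illustration in Python, using the classical hat-matrix diagnostic (note that PROC ROBUSTREG's LEVERAGE option instead flags leverage points with robust distances based on the MCD estimator; this is only the textbook version of the idea):

```python
import numpy as np

def hat_diagonals(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X', via a QR factorization.
    h_ii measures the leverage of observation i."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

# A point far out in x has high leverage whether or not its y-value is unusual.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 30.0])
X = np.column_stack([np.ones_like(x), x])
h = hat_diagonals(X)   # the observation with x = 30 dominates
```

A common rule of thumb flags observations with h_ii above 2p/n, where p is the number of parameters and n the number of observations.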
[Summary Statistics table for the stack loss data (21 observations): standard deviation and MAD for x1, x2, x3, and y]
Note the response variable (y) has the biggest discrepancy between the two estimates of scale, the standard deviation and MAD.
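The MAD (median absolute deviation, rescaled by 1/0.6745 so it estimates the standard deviation under normality) is the robust scale estimate behind this comparison. A small Python sketch (an illustration, not from the notes) of why the MAD resists contamination while the standard deviation does not:

```python
import numpy as np

def mad(x, c=0.6745):
    """Median absolute deviation, divided by 0.6745 so that it estimates the
    standard deviation when the data are normal."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x))) / c

rng = np.random.default_rng(1)
clean = rng.standard_normal(1000)
dirty = clean.copy()
dirty[:50] += 100.0    # contaminate 5% of the sample with gross outliers
# mad(dirty) stays near mad(clean) = 1, while np.std(dirty) is roughly 20x larger
```

A large gap between the standard deviation and the MAD for a variable, as seen here for y, is itself a warning sign of outliers.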
Observations 4 and 21 are outliers because their robust residuals exceed the cutoff value in absolute value. Four high leverage points are detected, mainly caused by x1. Note that only observation 21 is a bad leverage point, i.e., an aberrant x-value that results in a large residual.
Robust Linear Tests*

Test   Test Statistic   Lambda   DF   ChiSquare   Pr > ChiSq
Rho        0.9378       0.7977    1      1.18       0.2782
Rn2        0.8092                 1      0.81       0.3683

Rho is a robust version of the F test, and Rn2 is a robust version of the Wald test.
proc robustreg method=m(wf=bisquare(c=3.5)) data=stack;
   model y = x1 x2 x3 / diagnostics leverage;
   id x1;
   test x3;
run;
* The constant c, the cutoff value beyond which an observation receives zero weight, is the tuning constant called k earlier in these notes.
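For reference, the bisquare (Tukey biweight) weight function can be sketched in Python (an illustration, not SAS); unlike the Huber weights, bisquare weights reach exactly zero for residuals at or beyond the cutoff c (3.5 in the SAS code above):

```python
import numpy as np

def bisquare_weights(r, c=3.5):
    """Tukey bisquare weight function: (1 - (r/c)^2)^2 for |r| < c, 0 otherwise.
    Standardized residuals beyond the cutoff c get exactly zero weight."""
    u = np.asarray(r, dtype=float) / c
    return np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
```

Because the weights hit zero, the bisquare function rejects gross outliers outright rather than merely downweighting them, which is why it flags more observations than the Huber fit did.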
Diagnostics

Obs    Leverage    Outlier
  1       *           *
  2       *
  3       *           *
  4                   *
 21       *           *
In addition to observations 4 and 21, observations 1 and 3 are now detected as outliers.
We'll first do M estimation:

proc robustreg data=hbk method=m;
   model y = x1 x2 x3 / diagnostics leverage;
   id index;
run;
M estimation (wrongly) identifies observations 11 to 14 as outliers and misses the real outliers, observations 1 to 10.
proc robustreg data=hbk fwls method=lts;
   model y = x1 x2 x3 / diagnostics leverage;
   id index;
run;
[Summary Statistics table for the hbk data (75 observations): location and scale estimates for x1, x2, x3, and y]
Note the large differences between the usual and the robust location and scale estimates.
The option fwls requests that the final weighted least squares method be applied.
LTS Profile

Total Number of Observations            75
Number of Squares Minimized             57
Number of Coefficients                   4
Highest Possible Breakdown Value    0.2533
In this case, the LTS estimate minimizes the sum of the 57 smallest squared residuals. It can still recover the right model even if the remaining 18 observations are contaminated. This corresponds to a breakdown value of about 0.25, which is the default.
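A minimal Python sketch of the idea behind FAST-LTS (random elemental starts refined by concentration steps; this is a simplification of the Rousseeuw and Van Driessen algorithm that PROC ROBUSTREG implements, shown only to make the "sum of the h smallest squares" objective concrete):

```python
import numpy as np

def lts_fit(X, y, h, n_starts=50, n_csteps=20, seed=0):
    """Rough LTS sketch: many random elemental starts, each refined by
    concentration steps (refit on the h observations with the smallest
    squared residuals), keeping the fit with the best LTS objective."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=p, replace=False)             # elemental start
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        for _ in range(n_csteps):                              # C-steps
            keep = np.argsort((y - X @ beta) ** 2)[:h]
            beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()           # LTS criterion
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta
```

Because up to n - h observations never enter the trimmed sum, even bad leverage points cannot drag the fit away from the majority of the data, which is exactly what M estimation cannot guarantee.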
Two robust estimates of the scale parameter are displayed. The weighted scale estimate (Wscale) is a more efficient estimate of the scale parameter.
The final weighted least squares estimates are the least squares estimates computed after deleting the detected outliers. Compare with the earlier M-estimation results.
Since the data, by design, have outliers but no high leverage points, both the M and MM estimation methods are appropriate.
[Parameter Estimates table: Intercept, x1, x2 (DF = 1 each)]
The RMSE estimate of 27.3 greatly overestimates the true error scale.
Diagnostics Summary

Observation Type    Proportion    Cutoff
Outlier             0.1020        3.0000
Diagnostics Summary

Observation Type    Proportion    Cutoff
Outlier             0.1000        3.0000
data b (drop=i);
   do i=1 to 1000;
      x1=rannor(1234);
      x2=rannor(1234);
      e=rannor(1234);
      if i > 600 then y=100 + e;
      else y=10 + 5*x1 + 3*x2 + .5*e;
      output;
   end;
run;
Diagnostics Summary

Observation Type    Proportion    Cutoff
Outlier             0.4000        3.0000
Diagnostics Summary

Observation Type    Proportion    Cutoff
Outlier             0.4000        3.0000
When there are bad leverage points, the M estimates fail to pick up the underlying model no matter what constant c you use. In this case, the other estimators in PROC ROBUSTREG (LTS, S, and MM), which are robust to bad leverage points, will pick up the underlying model. The following statements generate 1000 observations with 1% bad high leverage points.
Note the summary statistics indicate both outliers in the y-direction and leverage points in the x-direction.
Summary Statistics

Variable    Standard Deviation
x1               32.0322
x2               15.8316
y                44.8562
[MM Profile table: total number of observations, number of coefficients, subset size, chi function, K0, breakdown value, and efficiency]

Parameter Estimates (partial)

Variable     Standard Error    95% Confidence Limits
Intercept        0.0216        9.9383  ...
x1               0.0208        4.9896  ...
x2               0.0222        2.9782  ...
Diagnostics Summary

Observation Type    Proportion    Cutoff
Outlier             0.4100        3.0000
Diagnostics Summary

Observation Type    Proportion    Cutoff
Outlier             0.4100        3.0000
The following statements invoke the REG procedure for the OLS analysis:
proc reg data=growth;
   model GDP = LFG GAP EQP NEQ;
run;
The OLS analysis indicates that GAP and EQP have a significant influence on GDP at the 5% level.
It's not obvious from the summary statistics that there may be outliers or leverage points.
The parameter estimates now show that NEQ is also statistically significant.
[Diagnostics table: leverage flags for several observations; outlier flag for observation 60]

The diagnostics show that observation 60 (Zambia) is an outlier. While there are several leverage points in the data, none are serious. In this case, M estimation is appropriate.
The final weighted least squares estimates are identical to those reported in Zaman, Rousseeuw, and Orhan (2001).