Regression
Regression analysis is at least 200 years old
o Most used predictive modeling technique (including logistic regression)
Regression Challenges
Preparation of data: errors, missing values, etc.
o Largest part of typical data analysis (modelers often report 80% of their time)
o Missing values are a huge headache (listwise deletion of rows); a quick sketch follows below
Goal: predict MV (median home value) from the Boston housing predictors CRIM, NOX, AGE, DIS, RM, LSTAT, RAD, CHAS, INDUS, TAX, PT, ZN
[Figure: sample regression output; reported coefficients include +5.247, -0.858, and -0.597]
Regression Formulas
X: matrix of potential predictors (N x K)
y: column of the target or dependent variable (N x 1)
Estimated coefficients: $\hat{\beta} = (X^T X)^{-1} X^T y$
o Standard least-squares formula
Ridge: $\hat{\beta}_{ridge} = (X^T X + rI)^{-1} X^T y$
o Simplest version: a constant r added to the diagonal elements of the $X^T X$ matrix
o r = 0 yields the usual LS solution
o r = $\infty$ yields the degenerate model $\beta = 0$
o Need to find the r that yields the best generalization error
Observe that there is a potentially distinct solution for every value of the penalty term r
o Varying r traces a path of solutions; see the sketch below
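A numpy sketch of both closed-form estimates; the synthetic data and the grid of r values are illustrative, not from the slides:

```python
# Closed-form OLS and ridge estimates; varying r traces a path of solutions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))                       # N x K predictor matrix
y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(100)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)            # (X'X)^{-1} X'y

def ridge(X, y, r):
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + r * np.eye(K), X.T @ y)  # (X'X + rI)^{-1} X'y

for r in [0.0, 1.0, 10.0, 1e6]:                         # r=0 is OLS; huge r -> beta ~ 0
    print(f"r={r:g}", np.round(ridge(X, y, r), 4))
```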
Ridge Regression
Shrinkage of regression coefficients towards zero
If there is zero correlation among all predictors, then shrinkage will be uniform over all coefficients (same percentage); a small demonstration follows below
If predictors are correlated, then while the length of the coefficient vector decreases, some coefficients might increase (in absolute value)
Coefficients are intentionally biased, but this yields both more satisfactory estimates and superior generalization
o Better performance (test MSE) on previously unseen data
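A small numpy demonstration of the uniform-shrinkage claim, assuming orthonormal (hence uncorrelated) synthetic predictors: every ridge coefficient is the OLS coefficient scaled by the same factor 1/(1+r).

```python
# Uniform shrinkage under zero correlation: with Q'Q = I the ridge solution
# is (I + rI)^{-1} Q'y = beta_OLS / (1 + r), the same percentage for all.
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((200, 3)))      # orthonormal predictors
y = Q @ np.array([3.0, -2.0, 1.0]) + 0.1 * rng.standard_normal(200)

b_ols = Q.T @ y                                         # OLS, since Q'Q = I
r = 0.5
b_ridge = np.linalg.solve(Q.T @ Q + r * np.eye(3), Q.T @ y)

print(np.round(b_ridge / b_ols, 6))                     # all ratios = 1/(1+r) = 0.666667
```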
Classical Regression versus Regularized Regression
[Figure: Classical Regression minimizes Mean Squared Error alone; Regularized Regression minimizes Mean Squared Error plus a Model Complexity penalty]
The complexity penalty can be measured as:
o Ridge: sum of squared coefficients
o Lasso: sum of absolute coefficients
o Compact: number of coefficients
o Elasticity 2: $\|\beta\|_2^2$, squared coefficients
o Elasticity 1: $\|\beta\|_1$, absolute values (the LASSO penalty)
o Elasticity 0: $\|\beta\|_0$, count of nonzero $\beta$'s
o General penalty: $\|\beta\|_p$ with $0 \le p \le 2$
LASSO Features
With highly correlated predictors the LASSO will tend to pick just one of them for model inclusion
Dispersion of the $\hat{\beta}$'s is greater than for RIDGE
Unlike AIC and BIC model selection methods, which penalize after the model is built, these penalties influence the $\beta$'s themselves
A convenient trick for estimating models with regularization: $\|\beta\|_p$ can be treated as a weighted average of any two of the major elasticities 0, 1, and 2, e.g. $w\|\beta\|_2^2 + (1-w)\|\beta\|_1$ (see the sketch below)
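A sketch of this penalty family in code; the blend follows the weighted-average trick above, and the function and parameter names are mine:

```python
# Penalty family for regularized regression: the three major elasticities
# and a weighted blend of two of them, e.g. w*||b||_2^2 + (1-w)*||b||_1.
import numpy as np

def penalty(beta, p, w=0.5):
    beta = np.asarray(beta, dtype=float)
    if p == 2:
        return float(np.sum(beta ** 2))       # ridge: sum of squared coefficients
    if p == 1:
        return float(np.sum(np.abs(beta)))    # LASSO: sum of absolute coefficients
    if p == 0:
        return float(np.count_nonzero(beta))  # compact: count of nonzero betas
    # intermediate elasticity as a weighted average of two major ones
    return float(w * np.sum(beta ** 2) + (1 - w) * np.sum(np.abs(beta)))

b = [0.0, 1.5, -2.0]
print(penalty(b, 2), penalty(b, 1), penalty(b, 0), penalty(b, 1.5))
```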
Computational Challenge
For a given regularization (e.g. LASSO), find the optimal penalty strength on the term
Find the best regularization from the family $\|\beta\|_p$
Potentially very many models to fit
GPS Algorithm
Start with NO predictors in the model
Seek the path $\beta(\nu)$ of solutions as a function of penalty strength (writing $\nu$ for the point along the path)
Define $p_j(\nu) = \partial P/\partial \beta_j$: marginal change in Penalty
Define $g_j(\nu) = -\partial R/\partial \beta_j$: marginal change in Loss
Define $\lambda_j(\nu) = g_j(\nu)/p_j(\nu)$: the benefit/cost ratio
Find $\max_j |\lambda_j(\nu)|$ to identify the coefficient to update ($j^*$)
Update $\beta_{j^*}$ in the direction of $\mathrm{sign}(\lambda_{j^*})$
$-\partial R/\partial \beta_j$ requires computing inner products of the current residual with the available predictors
o Easily parallelizable; see the sketch below
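A simplified sketch of the path-seeking loop for the LASSO elasticity only: for $\|\beta\|_1$ the penalty derivative $p_j$ is constant, so the ratio test reduces to picking the largest $|g_j|$. This is plain forward stagewise on synthetic data, not Friedman's full GPS:

```python
# GPS-style path seeker, LASSO elasticity only: nudge the coefficient with
# the best benefit/cost ratio by a tiny step, recording the whole path.
import numpy as np

def lasso_path(X, y, step=0.01, n_steps=2000):
    n, k = X.shape
    beta = np.zeros(k)                     # start with NO predictors in the model
    path = [beta.copy()]
    for _ in range(n_steps):
        resid = y - X @ beta
        g = X.T @ resid                    # -dR/db_j: inner products with the residual
        j = int(np.argmax(np.abs(g)))      # coefficient with max benefit/cost ratio
        beta[j] += step * np.sign(g[j])    # update in the direction of the sign
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))
y = X @ np.array([4.0, 0.0, -2.0, 0.0, 1.0]) + rng.standard_normal(200)
print(np.round(lasso_path(X, y)[-1], 3))   # approaches OLS as the path lengthens
```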
Update an existing variable coefficient
Step sizes are small: initial coefficients for any model are very small and are updated in very small increments
This explains why the Ridge elasticity can have solutions with fewer than all the variables
o Technically ridge does not select variables, it only shrinks
o In practice it can only add one variable per step
[Figure: a current model and candidate next models, with coefficients shown for predictors X1-X8]

Variable   Current Model   Next Model (X4 enters)   Next Model (X3 updated)
X1         0.0             0.0                      0.0
X2         0.0             0.0                      0.0
X3         0.2             0.2                      0.3
X4         0.0             0.1                      0.1
X5         0.4             0.4                      0.4
X6         0.5             0.5                      0.5
X7         0.0             0.0                      0.0
X8         0.0             0.0                      0.0
o If the selected coefficient was zero, a new variable effectively enters into the model
o If the selected coefficient was not zero, the model is simply updated
[Figure: the GPS path of solutions. Starting from the zero solution ($\beta = 0$): a sequence of 1-variable models; a variable is added; a sequence of 2-variable models; a variable is added; a sequence of 3-variable models; a variable is added; and so on up to the final OLS solution.]
[Figure: several GPS paths (Path 1, Path 2, Path 3) running from the zero solution to the OLS solution; each path is traversed in discrete steps and recorded at selected points]
LS versus GPS
[Figure: OLS Regression versus GPS Regression, compared on the Learn and Test samples]
GPS (Generalized Path Seeker) was introduced by Jerome Friedman in 2008 (Fast Sparse Regression and Classification)
Dramatically expands the pool of potential linear models by including different sets of variables in addition to varying the magnitude of coefficients
The optimal model of any desirable size can then be selected based on its performance on the TEST sample
[Figure: GPS solutions displayed at Point 100, Point 150, and Point 190 along the path]
[Figure: the LS solution; coefficients shown include 26.147, +5.247, -0.858, and -0.597]
Along the path followed by GPS, for every elasticity we identify the solution (coefficient vector) best for each performance measure
No attention is paid to model size here, so you might still prefer to select a model from the graphical display
[Figure: two panels plotting MV against LSTAT; the second panel marks the knots of a piecewise-linear fit]
[Figure: six panels (Knot 1 through Knot 6) comparing the True Function with the Data as knots are added one at a time]
MARS Algorithm
Multivariate Adaptive Regression Splines
Introduced by Jerome Friedman in 1991
o (Annals of Statistics 19 (1): 1-67) (earlier discussion papers from 1988)
Forward stage:
o Add pairs of BFs (a direct and mirror pair of basis functions represents a single knot) in a step-wise regression manner
o The process stops once a user-specified upper limit is reached
Backward stage:
o Remove BFs one at a time in a step-wise regression manner
o This creates a sequence of candidate models of declining complexity
Selection stage:
o Select the optimal model based on TEST performance (modern approach)
o Select the optimal model based on the GCV criterion (legacy approach)
A sketch of the direct/mirror basis-function pair follows below.
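A minimal sketch of the direct/mirror (hinge) pair for a single knot; fitting by least squares in these basis functions gives a piecewise-linear fit that bends at the knot. The knot location and the x-range are illustrative:

```python
# MARS basis functions: a direct and mirror hinge pair for a single knot c.
import numpy as np

def direct_bf(x, c):
    return np.maximum(0.0, x - c)          # max(0, x - c)

def mirror_bf(x, c):
    return np.maximum(0.0, c - x)          # max(0, c - x)

x = np.linspace(0.0, 40.0, 200)            # e.g. LSTAT over its observed range
knot = 10.0
B = np.column_stack([np.ones_like(x), direct_bf(x, knot), mirror_bf(x, knot)])
# The forward stage adds such pairs one knot at a time; each new pair is
# another pair of columns in B, fitted by stepwise least squares.
print(B.shape)                             # (200, 3)
```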
[Figure: GPS and MARS results compared side by side]
Running Score (test MSE)
Method       20% random   Parametric Bootstrap   Battery Partition
Regression   27.069       27.97                  23.80
MARS         14.663       15.91                  14.12
GPS Lasso    21.361       21.11                  23.15
Regression Tree
Out-of-the-box results, no tuning of controls
9 regions (terminal nodes)
Test MSE = 17.296
A hedged scikit-learn re-creation of such a tree is sketched below.
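A sketch on synthetic data, assuming scikit-learn as a stand-in (the slides use Salford's CART on the Boston data, so the MSE here will not match 17.296):

```python
# Out-of-the-box regression tree limited to 9 terminal nodes (regions).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 40.0, size=(500, 2))            # stand-ins for e.g. LSTAT, NOX
y = np.where(X[:, 0] < 10, 30.0, 20.0) - 0.2 * X[:, 1] + rng.normal(0, 2, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
tree = DecisionTreeRegressor(max_leaf_nodes=9).fit(X_tr, y_tr)
print("test MSE:", mean_squared_error(y_te, tree.predict(X_te)))
```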
Running Score (test MSE)
Method       20% random   Parametric Bootstrap   Repeated 100 20% Partitions
Regression   27.069       27.97                  23.80
MARS         14.663       15.91                  14.12
GPS Lasso    21.361       21.11                  23.15
CART         17.296       17.26                  20.66
Bagger Mechanism
Generate a reasonable number of bootstrap samples
o Breiman started with numbers like 50, 100, 200 (see the sketch below)
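A minimal sketch of the mechanism, assuming scikit-learn trees as the base learner (tree settings and function names are mine, not from the slides):

```python
# Bagger: fit one tree per bootstrap sample, average the predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, y, n_boot=100, seed=0):              # Breiman-style 50/100/200
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # sample rows with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bag_predict(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)
```

Averaging many deep, high-variance trees is what drives the large drop in test MSE seen in the next table.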
Running Score (test MSE)
Method        20% random   Parametric Bootstrap   Battery Partition
Regression    27.069       27.97                  23.80
MARS          14.663       15.91                  14.12
GPS Lasso     21.361       21.11                  23.15
CART          17.296       17.26                  20.66
Bagged CART   9.545        12.79                  -
Running Score (test MSE)
Method        20% random   Parametric Bootstrap   Battery Partition
Regression    27.069       27.97                  23.80
MARS          14.663       15.91                  14.12
GPS Lasso     21.361       21.11                  23.15
CART          17.296       17.26                  20.66
Bagged CART   9.545        12.79                  -
RF Defaults   8.286        12.84                  -
[Figure: colored terminal-node displays for trees in the ensemble, e.g. Tree 2 and Tree 3]
Every tree produces at least one positive and at least one negative node. Red reflects a relatively large positive node and deep blue reflects a relatively large negative node. The total score for a given record is obtained by finding the relevant terminal node in every tree in the model and summing across all trees, as sketched below.
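A toy sketch of that scoring rule, with each tree reduced to a single split (the trees, variables, and node scores here are made up for illustration):

```python
# Ensemble scoring: find the terminal node in every tree, sum the scores.
def tree_score(record, var, threshold, left_score, right_score):
    return left_score if record[var] <= threshold else right_score

trees = [
    ("LSTAT", 10.0, +2.0, -1.5),     # hypothetical one-split trees
    ("RM",     6.5, -0.8, +1.2),
    ("NOX",    0.6, +0.4, -0.6),
]

record = {"LSTAT": 7.2, "RM": 6.9, "NOX": 0.55}
total = sum(tree_score(record, *t) for t in trees)
print(total)                          # 2.0 + 1.2 + 0.4 = 3.6
```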
Running Score (test MSE)
Method        20% random   Parametric Bootstrap   Battery Partition
Regression    27.069       27.97                  23.80
MARS          14.663       15.91                  14.12
GPS Lasso     21.361       21.11                  23.15
CART          17.296       17.26                  20.66
Bagged CART   9.545        12.79                  -
RF Defaults   8.286        12.84                  -
RF PREDS=6    8.002        12.05
              8.67         11.02
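In scikit-learn terms, the PREDS=6 control (the number of candidate predictors tried at each split) roughly corresponds to max_features; a hedged sketch of the two RF rows:

```python
# Random Forest with library defaults vs. 6 candidate predictors per split.
from sklearn.ensemble import RandomForestRegressor

rf_defaults = RandomForestRegressor(random_state=0)
rf_preds6 = RandomForestRegressor(max_features=6, random_state=0)
# rf_preds6.fit(X_train, y_train)  # then score with mean_squared_error on test
```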
Running Score (test MSE)
Method          20% random   Parametric Bootstrap   Battery Partition
Regression      27.069       27.97                  23.80
MARS            14.663       15.91                  14.12
GPS Lasso       21.361       21.11                  23.15
CART            17.296       17.26                  20.66
Bagged CART     9.545        12.79                  -
RF Defaults     8.286        12.84                  -
RF PREDS=6      8.002        12.05
                8.67         11.02
TreeNet Huber   6.682        7.86                   11.46
TN Additive     9.897        10.48                  -
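TreeNet is Friedman's stochastic gradient boosting; a rough scikit-learn analogue of the last two rows (Huber loss, and an additive variant built from depth-1 trees) might look like this, hedged since exact TreeNet controls differ:

```python
# Gradient boosting with robust Huber loss, and an additive (stump) variant.
from sklearn.ensemble import GradientBoostingRegressor

tn_huber = GradientBoostingRegressor(loss="huber", random_state=0)
tn_additive = GradientBoostingRegressor(loss="huber", max_depth=1,  # stumps:
                                        random_state=0)             # no interactions
# tn_huber.fit(X_train, y_train)  # evaluate with test-sample MSE as above
```

Depth-1 trees admit no variable interactions, which is why the additive run trails the full Huber run in the table above.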
References: MARS
Friedman, J. H. (1991a). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19(1), 1-141 (March).
Friedman, J. H. (1991b). Estimating functions of mixed ordinal and categorical variables using adaptive splines. Department of Statistics, Stanford University, Tech. Report LCS108.
De Veaux, R. D., Psichogios, D. C., and Ungar, L. H. (1993). A comparison of two nonparametric estimation schemes: MARS and neural networks. Computers & Chemical Engineering, 17(8).
What's Next
Visit our website for the full 4-hour video series:
https://www.salford-systems.com/videos/tutorials/the-evolution-of-regression-modeling
o 2 hours of methodology
o 2 hours of hands-on running of examples
o Also other tutorials on CART, TreeNet gradient boosting
o Just let the Unlock Department know you participated in the on-demand webinar series