
Evolution of Regression:
From Classical Least Squares to Regularized Regression to Machine Learning Ensembles

Covering MARS, Generalized PathSeeker, TreeNet Gradient Boosting, and Random Forests

A Brief Overview of the 4-Part Webinar at www.salford-systems.com
May 2013

Dan Steinberg
Mikhail Golovnya
Salford Systems

Full Webinar Outline

Webinar Part 1
o Regression problem: quick overview
o Classical least squares: the starting point
o RIDGE/LASSO/GPS regularized regression
o MARS adaptive non-linear regression splines

Webinar Part 2
o CART regression tree: quick overview
o Random Forests: decision tree ensembles
o TreeNet stochastic gradient boosted trees
o Hybrid TreeNet/GPS (trees and regularized regression)

Regression

Regression analysis is at least 200 years old
o The most used predictive modeling technique (including logistic regression)

American Statistical Association reports 18,900 members
o Bureau of Labor Statistics reports more than 22,000 statisticians in 2008

Many other professionals involved in sophisticated analysis of data are not included in these counts
o Statistical specialists in marketing, economics, psychology, bioinformatics
o Machine learning specialists and data scientists
o Database professionals involved in data analysis
o Web analytics, social media analytics, text analytics

Few of these other researchers will call themselves statisticians
o But many make extensive use of variations of regression

One reason for the popularity of regression: it is effective

Regression Challenges

Preparation of data: errors, missing values, etc.
o The largest part of typical data analysis (modelers often report 80% of their time)
o Missing values are a huge headache (listwise deletion of rows)

Determining which predictors to include in the model
o Textbook examples typically have about 10 predictors available
o In practice, hundreds, thousands, even tens or hundreds of thousands may be available

Transformation or coding of predictors
o Conventional approaches: logarithm, power, inverse, etc.
o Required to obtain a good model

High correlation among predictors
o With increasing numbers of predictors this complication becomes more serious

More Regression Challenges

Obtaining sensible results (correct signs, no wild outcomes)

Detecting and modeling important interactions
o Typically never done because it is too difficult

Wide data: more columns than rows

Lack of external knowledge or theory to guide modeling as more topics are modeled

Boston Housing Data Set

Concerns housing values in the Boston area
o Harrison, D. and D. Rubinfeld. "Hedonic Prices and the Demand for Clean Air." Journal of Environmental Economics and Management, v5, 81-102, 1978

Combined information from 10 separate governmental and educational sources to produce the data set
506 census tracts in the City of Boston for the year 1970

Goal: study the relationship between quality-of-life variables and property values

o MV     median value of owner-occupied homes in tract ($1,000s)
o CRIM   per capita crime rate
o NOX    concentration of nitric oxides (parts per 10 million), a proxy for air pollution generally
o AGE    percent of housing built before 1940
o DIS    weighted distance to centers of employment
o RM     average number of rooms per house
o LSTAT  % lower status of the population (without some high school and male laborers)
o RAD    index of accessibility to radial highways
o CHAS   borders the Charles River (0/1)
o INDUS  percent of acreage in non-retail business
o TAX    property tax rate per $10,000
o PT     pupil-teacher ratio
o ZN     proportion of neighborhood zoned for large lots (>25K sq ft)

Ten Data Sources Organized

US Census (1970)
FBI (1970)
MIT Boston Project
Metropolitan Area Planning Commission (1972)
Voigt, Ivers, and Associates (1965) (Land Use Survey)
US Census Tract Maps
Massachusetts Dept. of Education (1971-1972)
Massachusetts Taxpayers Foundation (1970)
Transportation and Air Shed Simulation Model, Ingram, et al.
Harvard University Dept. of City and Regional Planning (1974)
A. Schnare: An Empirical Analysis of the Dimensions of Neighborhood Quality. Ph.D. Thesis. Harvard. (1974)

An excellent example of creative data blending
Also an excellent example of careful model construction
Authors emphasize the quality (completeness) of their data

Least Squares Regression

LS: ordinary least squares regression
o Discovered by Legendre (1805) and Gauss (1809)
o Used to solve problems in astronomy with pen and paper
o Statistical foundation laid by Fisher in the 1920s
o 1950s: use of electro-mechanical calculators

The model is always of the form
  Response = A + B1*X1 + B2*X2 + B3*X3 + ...

The response surface is a hyper-plane!
o A is the intercept term
o B1, B2, B3, ... are parameter estimates
A usually unique combination of values exists which minimizes the mean squared error of predictions on the learn sample
Experimental approach to model building
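For illustration, a minimal Python/scikit-learn sketch of fitting such a hyper-plane (generic open-source code, not SPM; the synthetic X and y here only stand in for the Boston predictors and MV):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the Boston predictors (X) and the target MV (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
y = 22.0 + X @ rng.normal(size=13) + rng.normal(scale=3.0, size=506)

ols = LinearRegression().fit(X, y)   # estimates A (intercept) and B1..B13 (slopes)
print("intercept A:", ols.intercept_)
print("coefficients B:", ols.coef_)
print("learn-sample MSE:", mean_squared_error(y, ols.predict(X)))
```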


Transformations In Original Paper (For Historical Reference)

o RM, number of rooms in house, entered as RM^2
o NOX raised to a power p, with experiments on its value: NOX^p
o DIS, RAD, LSTAT entered as logarithms of the predictor
o Regression in the paper is run on ln(MV)

Considerable experimentation was undertaken
o No train/test methodology
o Classical regression agrees very closely with the paper on reported coefficients and R^2 = .81 (same without logging MV)
o Converting predictions back from logs yields MSE = 15.77
o Note that this is learn sample only; no testing was performed
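A sketch of the log-target pattern described above (a reconstruction of the recipe, not the paper's code): fit on ln(MV), exponentiate the predictions, then compute MSE on the original scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic positive-valued target standing in for MV
rng = np.random.default_rng(1)
X = rng.normal(size=(506, 13))
y = np.exp(3.0 + 0.1 * X[:, 0] + rng.normal(scale=0.2, size=506))

model = LinearRegression().fit(X, np.log(y))   # regress on ln(MV)
pred_mv = np.exp(model.predict(X))             # convert predictions back from logs
print("learn-sample MSE on the original scale:", mean_squared_error(y, pred_mv))
```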

Classical Regression Results

o 20% random test partition
o Out-of-the-box regression, no attempt to perfect it
o Test MSE = 27.069

BATTERY PARTITION: rerun the 80/20 learn/test split 100 times
o Note that partition sizes are constant
o All three partitions change each cycle
o Mean MSE = 23.80
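The repeated-partition evaluation can be sketched as follows (generic scikit-learn, standing in for SPM's BATTERY PARTITION; synthetic data replaces the Boston file):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(506, 13))                      # stand-in for the Boston data
y = 22.0 + X @ rng.normal(size=13) + rng.normal(scale=3.0, size=506)

mses = []
for seed in range(100):                             # 100 random 80/20 partitions
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    fit = LinearRegression().fit(X_tr, y_tr)
    mses.append(mean_squared_error(y_te, fit.predict(X_te)))
print("mean test MSE over 100 splits:", np.mean(mses))
```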


Least Squares Regression on Raw Boston Data

(Figure: a 3-variable solution; the display shows coefficients of +5.247, -0.597, and -0.858.)

414 records in the learn sample
92 records in the test sample
Good agreement between learn and test:
o LEARN MSE = 27.455
o TEST MSE = 26.147

Used MARS in forward stepwise LS mode to generate this model

Motivation for Regularized Regression

1960s and 1970s: unsatisfactory results when modeling physical processes
o Coefficients changed dramatically with small changes in the data
o Some coefficients were judged to be too large
o Appearance of coefficients with the wrong sign
o Severe with substantial correlations among predictors (multicollinearity)

Solution (1970): Hoerl and Kennard, Ridge Regression
o An earlier version (1962) was intended just for stabilization of coefficients
o Initially poorly received by the statistics profession

Regression Formulas

X: matrix of potential predictors (N x K)
Y: column vector of the target or dependent variable (N x 1)

Standard formula:   β̂ = (X′X)⁻¹ X′y
Ridge:              β̂ = (X′X + rI)⁻¹ X′y

Simplest version: a constant r added to the diagonal elements of the X′X matrix
o r = 0 yields the usual LS solution
o r = ∞ yields the degenerate model β = 0

Need to find the r that yields the best generalization error
Observe that there is a potentially distinct solution for every value of the penalty term r
Varying r traces a path of solutions
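A small numpy sketch of the ridge formula above (assuming standardized predictors, so the intercept is ignored); note how the coefficient vector shrinks as r grows:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(506, 13))                      # stand-in for Boston predictors
y = 22.0 + X @ rng.normal(size=13) + rng.normal(scale=3.0, size=506)

def ridge_beta(X, y, r):
    """Closed-form ridge solution: (X'X + rI)^(-1) X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + r * np.eye(k), X.T @ y)

for r in [0.0, 1.0, 10.0, 100.0]:                   # r = 0 reproduces ordinary LS
    beta = ridge_beta(X, y, r)
    print(f"r = {r:6.1f}   ||beta|| = {np.linalg.norm(beta):.3f}")
```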

Ridge Regression

Shrinkage of regression coefficients towards zero
If there is zero correlation among all predictors, then shrinkage will be uniform over all coefficients (same percentage)
If predictors are correlated, then while the length of the coefficient vector decreases, some coefficients might increase (in absolute value)
Coefficients are intentionally biased, but this yields both more satisfactory estimates and superior generalization
o Better performance (test MSE) on previously unseen data
Coefficients are much less variable even if biased
Coefficients will typically be closer to the truth

Ridge Regression Features

Ridge frequently fixes the wrong-sign problem
Suppose you have K predictors which happen to be exact copies of each other
RIDGE will give each a coefficient equal to 1/K times the coefficient that would be given to just one copy in the model

Ridge Regression vs OLS

(Figure: Ridge Regression versus Classical Regression fits.)

RIDGE TEST MSE = 21.36
Ridge: worse on the training data but much better on the test data
Without test data, cross-validation must be used to determine how much to shrink

Lasso Regularized Regression

Tibshirani (1996): an alternative to RIDGE regression
Least Absolute Shrinkage and Selection Operator
Desire to gain the stability and lower variance of ridge regression while also performing variable selection
Especially in the context of many possible predictors: looking for a simple, stable, low-predictive-variance model
Historical note: the Lasso was inspired by related work (1993) by Leo Breiman (of CART and Random Forests fame), the non-negative garrote
Breiman's simulation studies showed the potential for improved prediction via selection and shrinkage

Regularized Regression - Concepts

LS Regression: minimize the Mean Squared Error
Regularized Regression: minimize the Mean Squared Error + λ × Model Complexity

Measures of model complexity:
o Ridge: sum of squared coefficients
o Lasso: sum of absolute coefficients
o Compact: number of non-zero coefficients

Any regularized regression approach tries to balance model performance and model complexity
λ is the regularization parameter, to be estimated
o λ = ∞ : null model, all coefficients zero (maximum possible penalty)
o λ = 0 : LS solution (no penalty)

Regularized Regression: Penalized Loss Functions

General penalty: |β|^γ with 0 ≤ γ ≤ 2
o COMPACT penalty: |β|⁰ (count of non-zero βs)
o LASSO penalty:   |β|¹ (absolute value)
o RIDGE penalty:   |β|² (squared)

RIDGE does no selection, but Lasso and Compact do select
The power γ on |β| is called the elasticity (γ = 0, 1, 2)
The penalty to be estimated is a constant multiplying one of the above functions of the β vector
Intermediate elasticities can be created: e.g., we could have a 50/50 mix of RIDGE and LASSO, yielding an elasticity of 1.5
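In formula form (notation added here to summarize the slide), the penalized least-squares criterion with elasticity γ is:

```latex
\min_{\beta}\; \frac{1}{N}\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{K}\beta_j x_{ij}\Big)^{2}
\;+\; \lambda \sum_{j=1}^{K}\lvert\beta_j\rvert^{\gamma},
\qquad 0 \le \gamma \le 2
% gamma = 2: RIDGE, gamma = 1: LASSO, gamma = 0: COMPACT (counts non-zero coefficients)
```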

LASSO Features

With highly correlated predictors the LASSO will tend to pick just one of them for model inclusion
Dispersion of the βs is greater than for RIDGE
Unlike AIC and BIC model selection methods, which penalize after the model is built, these penalties influence the βs
A convenient trick for estimating models with |β|^γ regularization: use a weighted average of any two of the major elasticities 0, 1, and 2, e.g.

    w·|β|² + (1 - w)·|β|¹   (the elastic net)

Computational Challenge

For a given regularization (e.g., LASSO), find the optimal penalty λ on the term
Find the best regularization γ from the family |β|^γ
Potentially very many models to fit

Computing Regularized Regressions - 1

Earliest versions of regularized regression required considerable computation, as the penalty parameter is unknown and must be estimated
The Lasso was originally computed by starting with no penalty and gradually increasing the penalty
o So start with ALL variables in the model
o Gradually tighten the noose to squeeze predictors out
o Infeasible for problems with thousands of possible predictors

Need to solve a quadratic programming problem to optimize the Lasso solution for every penalty value

Computing Regularized Regressions - 2

Work by Friedman and others introduced very fast forward-stepping approaches
o Start with the maximum penalty (no predictors)
o Progress forward with a stopping rule
o Dealing with millions of predictors becomes possible

Coordinate gradient descent methods (next slides)
o Will still want a test sample or cross-validation for optimization
o Generalized PathSeeker: full range of regularization from compact to ridge (elasticities from 0 through 2)
o glmnet in R: partial range of regularization from lasso to ridge (elasticities from 1 to 2)
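For comparison, an analogous lasso path can be traced with open-source tools; a scikit-learn sketch (parallel in spirit to glmnet/GPS, not SPM's implementation):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
X = rng.normal(size=(506, 13))                      # stand-in for Boston predictors
y = X[:, :3] @ np.array([5.0, -2.0, 1.0]) + rng.normal(scale=3.0, size=506)

# Trace the lasso solution path over a grid of penalty values (strongest first)
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
for a, c in zip(alphas[::10], coefs.T[::10]):
    print(f"penalty = {a:8.3f}   non-zero coefficients = {int(np.sum(c != 0))}")
```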

GPS Algorithm

Start with NO predictors in the model
Seek the path β(λ) of solutions as a function of penalty strength λ
Define pⱼ(β) = ∂P/∂βⱼ, the marginal change in the Penalty
Define gⱼ(β) = -∂R/∂βⱼ, the marginal change in the Loss
Define λⱼ(β) = gⱼ(β)/pⱼ(β), the benefit/cost ratio
Find max |λⱼ(β)| to identify the coefficient to update (j*)
Update βⱼ* by a small amount in the direction of sign(λⱼ*)
Computing -∂R/∂βⱼ requires computing inner products of the current residual with the available predictors
o Easily parallelizable
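A toy reading of this loop under squared-error loss with a lasso penalty, so that |∂P/∂βⱼ| = 1 for each move (an illustrative sketch only, not Friedman's or SPM's code):

```python
import numpy as np

def path_seek(X, y, n_steps=2000, delta=0.01):
    """Toy GPS-style forward path: repeatedly nudge the single best coefficient."""
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(n_steps):
        resid = y - X @ beta
        g = X.T @ resid / n               # -dR/dbeta_j: residual inner products
        j = np.argmax(np.abs(g))          # best benefit/cost ratio (lasso penalty, |p_j| = 1)
        beta[j] += delta * np.sign(g[j])  # tiny update in the direction of sign(lambda_j*)
    return beta

rng = np.random.default_rng(5)
X = rng.normal(size=(506, 13))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=506)
print(np.round(path_seek(X, y), 2))       # most of the mass lands on the first two coefficients
```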


How to Forward Step

At any stage of model development, choose between:
o Adding a new variable to the model
o Updating the coefficient of an existing variable

Step sizes are small; initial coefficients for any model are very small and are updated in very small increments
This explains why the Ridge elasticity can have solutions with less than all the variables
o Technically ridge does not select variables, it only shrinks
o In practice it can only add one variable per step

Regularized Regression: Practical Algorithm

Example with eight candidate predictors X1-X8. Suppose the current model has non-zero coefficients X3 = 0.2, X5 = 0.4, X6 = 0.5 (all others 0.0). Two kinds of step are possible:
o Introducing a new variable: the next model moves X4 from 0.0 to 0.1, so X4 enters the model
o Updating the existing model: the next model moves X3 from 0.2 to 0.3, a small update of an already active coefficient

Start with the zero-coefficient solution
o Look for the best first step, which moves one coefficient away from zero
Next step: update one of the coefficients by a small amount
o Reduces the Learn Sample MSE
o Increases the Penalty, as the model has become more complex
o If the selected coefficient was zero, a new variable effectively enters the model
o If the selected coefficient was not zero, the model is simply updated

Path Building Process

Zero-coefficient model → sequence of 1-variable models → (a variable is added) → sequence of 2-variable models → (a variable is added) → sequence of 3-variable models → (a variable is added) → ... → final OLS solution (λ = 0)

Variable Selection Strategy

The Elasticity Parameter controls the variable selection strategy along the path (using the LEARN sample only); it can be between 0 and 2, inclusive
o Elasticity = 2: fast approximation of Ridge Regression; introduces variables as quickly as possible and then jointly varies the magnitude of coefficients - lowest degree of compression
o Elasticity = 1: fast approximation of Lasso Regression; introduces variables sparingly, letting the currently active variables develop their coefficients - good degree of compression versus accuracy
o Elasticity = 0: fast approximation of Best Subset Regression; introduces new variables only after the currently active variables are fully developed - excellent degree of compression but may lose accuracy

Points Versus Steps

(Figure: three paths with different numbers of steps run from the zero solution to the OLS solution; a common grid of points is laid over them.)

Each path (elasticity) will have a different number of steps
To facilitate model comparison among different paths, the Point Selection Strategy extracts a fixed collection of models into the points grid
o This eliminates some of the original irregularity among individual paths and facilitates model extraction and comparison

LS versus GPS

OLS Regression (learn sample, predictors X1, X2, X3, X4, X5, X6, ...): a single sequence of linear models
o 1-variable model
o 2-variable model
o 3-variable model

GPS Regression (same predictors, learn and test samples): a large collection of linear models (paths)
o 1-variable models, varying coefficients
o 2-variable models, varying coefficients
o 3-variable models, varying coefficients

GPS (Generalized PathSeeker) was introduced by Jerome Friedman in 2008 (Fast Sparse Regression and Classification)
It dramatically expands the pool of potential linear models by including different sets of variables in addition to varying the magnitude of coefficients
The optimal model of any desirable size can then be selected based on its performance on the TEST sample

Paths Produced by SPM GPS

(Figure: example of 21 paths with different variable selection strategies.)

Path Points on Boston Data

(Figure: path development shown at points 30, 100, 150, and 190.)

Each path uses a different variable selection strategy and separate coefficient updates

GPS on Boston Data

(Figure: the selected 3-variable solution.)

414 records in the learn sample
92 records in the test sample
15% performance improvement on the test sample
o GPS TEST MSE = 22.669
o LS TEST MSE  = 26.147

Sentinel Solutions Detail

Along the path followed by GPS, for every elasticity we identify the solution (coefficient vector) best for each performance measure
No attention is paid to model size here, so you might still prefer to select a model from the graphical display

Regularized Logistic Regression

All the same GPS ideas apply
Specify Logistic Binary Analysis
Specify the optimality criterion
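An analogous setup with open-source tools, shown only as a stand-in for the SPM dialog: an L1-penalized logistic regression for a binary target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(506, 13))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=506) > 0).astype(int)  # binary target

# L1 (lasso-style) penalty performs variable selection; C is the inverse penalty strength
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("non-zero coefficients:", int(np.sum(clf.coef_ != 0)))
```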


How To Select a Best Model

Regularized regression was originally invented to help modelers obtain more intuitively acceptable models
Can think of the process as a search engine generating predictive models
User can decide based on
o Complexity of the model
o Acceptability of coefficients (magnitude, signs, predictors included)

Clearly can be set to automatic mode
The criterion could well be performance on test data

Key Problems with GPS

Still a linear regression!
The response surface is still a global hyper-plane
Incapable of discovering local structure in the data

Solution: develop non-linear algorithms that build the response surface locally, based on the data itself
o By trying all possible data cuts as local boundaries
o By fitting first-order adaptive splines locally
o By exploiting regression trees and their ensembles

From Linear to Non-linear

(Figure: MV plotted against LSTAT; a single global fit on the left versus localized fits joined at knots on the right.)

Classical regression and regularized regression build globally linear models
Further accuracy can be achieved by building locally linear models connected to each other at boundary points called knots
The resulting function is known as a spline
Each separate region of the data is represented by a basis function (BF)

Finding Knots Automatically

(Figure: stage-wise knot placement process on a flat-top function; the true function and its true knots are shown alongside the data, with knots 1-6 placed in sequence.)

MARS Algorithm

Multivariate Adaptive Regression Splines, introduced by Jerome Friedman in 1991
o Annals of Statistics 19 (1): 1-67 (earlier discussion papers from 1988)

Forward stage:
o Add pairs of BFs (a direct and mirror pair of basis functions represents a single knot) in a step-wise regression manner
o The process stops once a user-specified upper limit is reached

Backward stage:
o Remove BFs one at a time in a step-wise regression manner
o This creates a sequence of candidate models of declining complexity

Selection stage:
o Select the optimal model based on TEST performance (modern approach)
o Select the optimal model based on the GCV criterion (legacy approach)
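To make the basis-function idea concrete, here is a small numpy sketch (an illustration, not the MARS engine): a direct/mirror hinge pair at one candidate knot, used as regressors in an ordinary least-squares fit.

```python
import numpy as np

def hinge_pair(x, knot):
    """Direct and mirror basis functions for one knot: max(0, x - knot), max(0, knot - x)."""
    return np.maximum(0.0, x - knot), np.maximum(0.0, knot - x)

rng = np.random.default_rng(7)
lstat = rng.uniform(2, 38, size=506)                        # stand-in for LSTAT
mv = 35 - 1.5 * np.minimum(lstat, 10) - 0.3 * np.maximum(lstat - 10, 0) \
     + rng.normal(scale=2.0, size=506)                      # kinked relationship with MV

bf1, bf2 = hinge_pair(lstat, knot=10.0)                     # candidate knot at LSTAT = 10
design = np.column_stack([np.ones_like(lstat), bf1, bf2])   # intercept + basis functions
coef, *_ = np.linalg.lstsq(design, mv, rcond=None)
print("fitted coefficients (intercept, right hinge, left hinge):", np.round(coef, 2))
```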


MARS on Boston Data: TEST MSE = 14.66

(Figure: the selected 9-BF, 7-variable solution.)

Non-linear Response Surface

MARS automatically determined the transition points between the various local regions
This model provides major insights into the nature of the relationship
Observe that in this model NOX appears linearly

200 Replications of the Learn/Test Partition

(Figure: distribution of TEST MSE across runs for Regression, GPS, and MARS.)

Models were repeated with 200 randomly selected 20% test partitions
GPS shows a marginal performance improvement but a much smaller model
MARS shows a dramatic performance improvement

Combining MARS and GPS

Use MARS as a search engine to break predictors into ranges reflecting differences in the relationship between target and predictors
MARS also handles missing values, with missing value indicators and interactions for conditional use of a predictor (only when not missing)
Allow the MARS model to be large
GPS can then select basis functions and shrink coefficients (see the sketch below)
We will see that this combination of the best of both worlds also applies to ensembles of decision trees
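A rough Python sketch of the hybrid idea, with a simplified hinge expansion standing in for a full MARS forward pass and LassoCV standing in for GPS:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(8)
X = rng.normal(size=(506, 13))
y = 22 + 4 * np.maximum(X[:, 0], 0) - 3 * X[:, 1] + rng.normal(scale=1.0, size=506)

# Step 1 (MARS-like): expand each predictor into hinge basis functions at a few knots
knots = np.quantile(X, [0.25, 0.5, 0.75], axis=0)
basis = [np.maximum(0.0, X - k) for k in knots] + [np.maximum(0.0, k - X) for k in knots]
B = np.hstack(basis)                                  # a large pool of candidate basis functions

# Step 2 (GPS-like): a regularized regression selects and shrinks the basis functions
model = LassoCV(cv=5).fit(B, y)
print("basis functions kept:", int(np.sum(model.coef_ != 0)), "of", B.shape[1])
```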

Running Score: Test Sample MSE

Method                     20% random   Parametric Bootstrap   Battery Partition
Regression                 27.069       27.97                  23.80
MARS Regression Splines    14.663       15.91                  14.12
GPS Lasso/Regularized      21.361       21.11                  23.15

Regression Tree

Out-of-the-box results, no tuning of controls
9 regions (terminal nodes)
Test MSE = 17.296
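An out-of-the-box single regression tree in scikit-learn looks like this (an approximation of CART, not the SPM engine); the number of terminal nodes is capped at 9 to mirror the slide:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(9)
X = rng.normal(size=(506, 13))                      # stand-in for the Boston data
y = 22 + 5 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=3.0, size=506)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
tree = DecisionTreeRegressor(max_leaf_nodes=9).fit(X_tr, y_tr)   # 9 terminal nodes
print("test MSE:", mean_squared_error(y_te, tree.predict(X_te)))
```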


Regression Tree Representation of a Surface

A regression tree is a high-dimensional step function
It should be at a disadvantage relative to other tools: it can never be smooth
But it is always worth checking

Regression Tree Partial Dependency Plot

(Figure: partial dependency plots for LSTAT and NOX.)

Use the model to simulate the impact of a change in a predictor
Here we simulate separately for every training data record and then average
For CART trees the plot is essentially a step function
May only get one knot in the graph if the variable appears only once in the tree

See the appendix to learn how to get these plots
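The averaging described above can be written in a few lines; a generic sketch that works with any fitted model exposing a predict method:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(10)
X = rng.normal(size=(506, 13))
y = 22 + 5 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=3.0, size=506)
tree = DecisionTreeRegressor(max_leaf_nodes=9).fit(X, y)

def partial_dependence(model, X, feature, grid):
    """For each grid value: force the feature to that value for every record,
    predict, and average the predictions over the training data."""
    out = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        out.append(model.predict(X_mod).mean())
    return np.array(out)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 25)
print(np.round(partial_dependence(tree, X, feature=0, grid=grid), 2))
```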

Running Score

Method       20% random   Parametric Bootstrap   Repeated 100 20% Partitions
Regression   27.069       27.97                  23.80
MARS         14.663       15.91                  14.12
GPS Lasso    21.361       21.11                  23.15
CART         17.296       17.26                  20.66

Bagger Mechanism

Generate a reasonable number of bootstrap samples
o Breiman started with numbers like 50, 100, 200

Grow a standard CART tree on each sample
o Use the unpruned tree to make predictions
o Pruned trees yield inferior predictive accuracy for the ensemble

Simple voting for classification
o Majority rule voting for binary classification
o Plurality rule voting for multi-class classification
o Average the predicted target for regression models

Will result in a much smoother range of predictions (see the sketch below)
o A single tree gives the same prediction for all records in a terminal node
o In the bagger, records will have different patterns of terminal node results
o Each record is likely to have a unique score from the ensemble
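A compact sketch of the bagging loop (bootstrap the rows, grow an unpruned tree on each sample, average the predictions); scikit-learn trees stand in for CART here:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(11)
X = rng.normal(size=(506, 13))
y = 22 + 5 * X[:, 0] - 2 * X[:, 1] + X[:, 2] * X[:, 3] + rng.normal(scale=2.0, size=506)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

preds = []
for b in range(100):                                             # 100 bootstrap samples
    rows = rng.integers(0, len(X_tr), size=len(X_tr))            # sample rows with replacement
    tree = DecisionTreeRegressor().fit(X_tr[rows], y_tr[rows])   # unpruned CART-style tree
    preds.append(tree.predict(X_te))
bagged = np.mean(preds, axis=0)                                  # average over the ensemble
print("bagged test MSE:", mean_squared_error(y_te, bagged))
print("single-tree test MSE:",
      mean_squared_error(y_te, DecisionTreeRegressor().fit(X_tr, y_tr).predict(X_te)))
```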



Bagger Partial Dependency Plot

(Figure: partial dependency plots for LSTAT and NOX.)

Averaging over many trees allows for a more complex dependency
Opportunity for many splits of a variable (100 large trees)
Jaggedness may reflect the existence of interactions

Running Score

Method        20% random   Parametric Bootstrap   Battery Partition
Regression    27.069       27.97                  23.80
MARS          14.663       15.91                  14.12
GPS Lasso     21.361       21.11                  23.15
CART          17.296       17.26                  20.66
Bagged CART   9.545        12.79                  -

RandomForests: Bagger on Steroids

Leo Breiman was frustrated by the fact that the bagger did not perform better; he was convinced there was a better way
He observed that the trees generated by bagging across different bootstrap samples were surprisingly similar
How to make them more different?
The bagger induces randomness in how the rows of the data are used for model construction
Why not also introduce randomness in how the columns are used?
Pick a random subset of predictors as candidate predictors - a new random subset for every node
Breiman was inspired by earlier research that experimented with variations on these ideas
Breiman perfected the bagger to make RandomForests
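In scikit-learn terms the extra ingredient is max_features, the size of the random predictor subset considered at each node (shown here as a rough stand-in for the Random Forests engine):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(12)
X = rng.normal(size=(506, 13))
y = 22 + 5 * X[:, 0] - 2 * X[:, 1] + X[:, 2] * X[:, 3] + rng.normal(scale=2.0, size=506)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 500 bootstrap trees; at each node only 6 randomly chosen predictors are candidates
rf = RandomForestRegressor(n_estimators=500, max_features=6, random_state=0).fit(X_tr, y_tr)
print("random forest test MSE:", mean_squared_error(y_te, rf.predict(X_te)))
```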

Running Score

Method        20% random   Parametric Bootstrap   Battery Partition
Regression    27.069       27.97                  23.80
MARS          14.663       15.91                  14.12
GPS Lasso     21.361       21.11                  23.15
CART          17.296       17.26                  20.66
Bagged CART   9.545        12.79                  -
RF Defaults   8.286        12.84                  -

Stochastic Gradient Boosting (TreeNet)

SGB is a revolutionary data mining methodology first introduced by Jerome H. Friedman in 1999
The seminal paper defining SGB was released in 2001
o Google Scholar reports more than 1,600 references to this paper and a further 3,300 references to a companion paper

Extended further by Friedman in major papers in 2004 and 2008 (model compression and rule extraction)
Ongoing development and refinement by Salford Systems
o Latest version released in 2013 as part of SPM 7.0

TreeNet/gradient boosting has emerged as one of the most used learning machines and has been successfully applied across many industries
Friedman's proprietary code is in TreeNet

Trees incrementally revise predictions

Tree 1: the first tree is grown on the original target; it is an intentionally weak model
Tree 2: the second tree is grown on the residuals from the first; its predictions are made to improve the first tree
Tree 3: the third tree is grown on the residuals from the model consisting of the first two trees

Every tree produces at least one positive and at least one negative node. In the figure, red reflects a relatively large positive node and deep blue a relatively large negative node. The total score for a given record is obtained by finding the relevant terminal node in every tree in the model and summing across all trees.
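The residual-fitting loop can be sketched directly; a simplified squared-error version with a shrinkage factor (the real TreeNet/SGB adds row sampling and other refinements):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(13)
X = rng.normal(size=(506, 13))
y = 22 + 5 * X[:, 0] - 2 * X[:, 1] + X[:, 2] * X[:, 3] + rng.normal(scale=2.0, size=506)

learn_rate, n_trees = 0.1, 300
pred = np.full(len(y), y.mean())                    # start from a constant model
for m in range(n_trees):
    resid = y - pred                                # residuals of the current ensemble
    tree = DecisionTreeRegressor(max_leaf_nodes=6).fit(X, resid)   # small, weak tree
    pred += learn_rate * tree.predict(X)            # accept only a small step
print("learn-sample MSE after boosting:", np.mean((y - pred) ** 2))
```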

Gradient Boosting Methodology: Key Points

Trees are usually kept small (2-6 nodes common)
o However, you should experiment with larger trees (12, 20, 30 nodes)
o Sometimes larger trees are surprisingly good

Updates are small (downweighted). Update factors can be as small as .01, .001, .0001
o Do not accept the full learning of a tree (small step size, also GPS style)
o Larger trees should be coupled with slower learn rates

Use random subsets of the training data in each cycle; never train on all the training data in any one cycle
o Typical is to use a random half of the learn data to grow each tree
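These three knobs map directly onto open-source gradient boosting; for example, in scikit-learn (an approximation of TreeNet, not TreeNet itself):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(14)
X = rng.normal(size=(506, 13))
y = 22 + 5 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=2.0, size=506)

gb = GradientBoostingRegressor(
    max_leaf_nodes=6,        # small trees (2-6 terminal nodes are common)
    learning_rate=0.01,      # downweighted updates
    subsample=0.5,           # grow each tree on a random half of the learn data
    n_estimators=2000,       # many trees compensate for the slow learn rate
).fit(X, y)
print("trees in the ensemble:", gb.n_estimators_)
```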


Running Score

Method             20% random   Parametric Bootstrap   Battery Partition
Regression         27.069       27.97                  23.80
MARS               14.663       15.91                  14.12
GPS Lasso          21.361       21.11                  23.15
CART               17.296       17.26                  20.66
Bagged CART        9.545        12.79                  -
RF Defaults        8.286        12.84                  -
RF PREDS=6         8.002        12.05                  -
TreeNet Defaults   7.417        8.67                   11.02

Using cross-validation on the learn partition to determine the optimal number of trees, and then scoring the test partition with that model, gives TreeNet MSE = 8.523

Vary HUBER Threshold: Best MSE = 6.71

Vary the threshold at which we switch from squared errors to absolute errors
The optimum occurs when the 5% largest errors are not squared in the loss computation
This yields the best MSE on test data; sometimes LAD yields the best test sample MSE
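In scikit-learn's gradient boosting the analogous control is loss="huber" with an alpha quantile; alpha=0.95 roughly corresponds to leaving the largest 5% of errors un-squared (an approximation of the TreeNet control, not the same code):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Huber loss: squared error for small residuals, absolute error beyond the
# alpha quantile; with alpha=0.95 the largest 5% of errors are not squared.
gb_huber = GradientBoostingRegressor(loss="huber", alpha=0.95,
                                     max_leaf_nodes=6, learning_rate=0.01,
                                     subsample=0.5, n_estimators=2000)
# Fit on the learn partition and score the test partition as in the previous sketches.
```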

Gradient Boosting Partial Dependency Plots

(Figure: partial dependency plots for LSTAT and NOX.)

Running Score

Method             20% random   Parametric Bootstrap   Battery Partition
Regression         27.069       27.97                  23.80
MARS               14.663       15.91                  14.12
GPS Lasso          21.361       21.11                  23.15
CART               17.296       17.26                  20.66
Bagged CART        9.545        12.79                  -
RF Defaults        8.286        12.84                  -
RF PREDS=6         8.002        12.05                  -
TreeNet Defaults   7.417        8.67                   11.02
TreeNet Huber      6.682        7.86                   11.46
TN Additive        9.897        10.48                  -

If we had used cross-validation to determine the optimal number of trees and then used those trees to score the test partition, the TreeNet Default model MSE = 8.523

References: MARS

Friedman, J. H. (1991a). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19, 1-141 (March).
Friedman, J. H. (1991b). Estimating functions of mixed ordinal and categorical variables using adaptive splines. Department of Statistics, Stanford University, Tech. Report LCS108.
De Veaux, R. D., Psichogios, D. C., and Ungar, L. H. (1993). A Comparison of Two Nonparametric Estimation Schemes: MARS and Neural Networks. Computers & Chemical Engineering, Vol. 17, No. 8.

References: Regularized Regression

Hoerl, A. E. and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, Vol. 12, 55-67.
Friedman, J. H. Fast Sparse Regression and Classification. http://www-stat.stanford.edu/~jhf/ftp/GPSpaper.pdf
Friedman, J. H., and Popescu, B. E. (2003). Importance Sampled Learning Ensembles. Stanford University, Department of Statistics, Technical Report. http://www-stat.stanford.edu/~jhf/ftp/isle.pdf
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58, 267-288.

References: Regression via Trees

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. CRC Press.
Breiman, L. (1996). Bagging Predictors. Machine Learning, 24, 123-140.
Breiman, L. (2001). Random Forests. Machine Learning, 45, 5-32.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Ann. Statist., Vol. 29, No. 5, 1189-1232. http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf
Friedman, J. H., and Popescu, B. E. (2003). Importance Sampled Learning Ensembles. Stanford University, Department of Statistics, Technical Report. http://www-stat.stanford.edu/~jhf/ftp/isle.pdf

What's Next

Visit our website for the full 4-hour video series
https://www.salford-systems.com/videos/tutorials/the-evolution-of-regression-modeling
o 2 hours of methodology
o 2 hours of hands-on running of examples
o Also other tutorials on CART, TreeNet gradient boosting

Download the no-cost 60-day evaluation
o Just let the Unlock Department know you participated in the on-demand webinar series

Contains many capabilities not present in open source renditions
o Largely the source code of the inventor of today's most important data mining methods: Jerome H. Friedman
o We started working with Friedman in 1990, when very few people were interested in his work

Salford Predictive Modeler (SPM)

Download a current version from our website: http://www.salford-systems.com
The version will run without a license key for 10 days
For more time, request a license key from unlock@salford-systems.com
Request a configuration to meet your needs
o Data handling capacity
o Data mining engines made available
