Robert Cooley, Ph.D., VP Technical Operations, Knowledge Extraction Engines (KXEN), Inc.
The key is to automate data mining as much as possible, but not more so
Gregory Piatetsky-Shapiro
When all you have is a hammer, every problem looks like a nail
Abraham Maslow
Learning / Generalization
Overfit Model / Low Robustness (No Training Error, High Test Error)
What is Robustness?
The statistical ability to generalize from a set of training examples
Are the available examples sufficient?
If two models explain the data equally well then the simpler model is to be preferred
If two models explain the data equally well then the model with the smallest VC dimension is to be preferred
1st discovery: VC (Vapnik-Chervonenkis) dimension measures the complexity of a set of mappings.
2nd discovery: The VC dimension can be linked to generalization results (results on new data).
If two models explain the data equally well, then the model with the smallest VC dimension has better generalization performance.
3rd discovery: Don't just observe differences between models, control them.
To build the best model, try to optimize the performance on the training data set AND minimize the VC dimension.
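To make the link between VC dimension and generalization concrete, here is a standard form of Vapnik's bound, added as background (the exact constants vary by formulation and do not come from the original slides): with probability at least $1-\eta$, a model $f$ from a family of VC dimension $h$, trained on $n$ examples, satisfies

$$R(f) \;\le\; R_{\mathrm{emp}}(f) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}$$

The first term is the error on the training data; the second is the confidence interval, which grows with the VC dimension h and shrinks with the number of examples n. Keeping both terms small is exactly the SRM prescription above.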
Reliability:
How well will a model predict future data?
Achieved by minimizing the confidence interval.
[Figure: Risk vs. model complexity under SRM. The total risk is the sum of the training error and the confidence interval; as model complexity grows the error falls while the confidence interval rises, and the best model sits at the minimum of the total-risk curve.]
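As a toy illustration of the curve above (an assumption-laden sketch, not KXEN's procedure): score candidate models by training error plus a VC-style penalty and keep the one at the minimum of that total. The degree-to-complexity mapping and the penalty constants below are made up for demonstration.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

def penalized_risk(degree, n=x.size):
    # training error of a degree-`degree` polynomial fit
    coefs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coefs, x) - y) ** 2)
    h = degree + 1                                   # crude complexity stand-in
    confidence = np.sqrt(h * (np.log(2 * n / h) + 1) / n)
    return train_err + confidence                    # analogue of the total risk

best_degree = min(range(1, 15), key=penalized_risk)
print("degree chosen by penalized risk:", best_degree)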
Consistent Coder
Features of SRM
Adding Variables is Low Cost, Potentially High Benefit
Does not cause over-fitting
More variables can only add to the model quality
Random variables do not harm quality or reliability
Highly correlated variables do not harm the modeling process
Efficient scaling with additional variables
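A minimal numpy sketch of the first and third claims, using ridge regression as a stand-in for a regularized, SRM-style learner (the data, split, and penalty value are illustrative assumptions, not results from the talk): it compares held-out error before and after appending 50 purely random columns.

import numpy as np

rng = np.random.default_rng(1)
n, p_useful, p_noise = 200, 5, 50
X = rng.normal(size=(n, p_useful))
y = X @ rng.normal(size=p_useful) + rng.normal(scale=0.5, size=n)
X_noisy = np.hstack([X, rng.normal(size=(n, p_noise))])   # append random variables

def ridge_test_error(X, y, lam=1.0, split=150):
    # closed-form ridge fit on the first `split` rows, squared error on the rest
    Xtr, ytr, Xte, yte = X[:split], y[:split], X[split:], y[split:]
    w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)
    return np.mean((Xte @ w - yte) ** 2)

print("useful variables only:   ", ridge_test_error(X, y))
print("plus 50 random variables:", ridge_test_error(X_noisy, y))

With the regularization in place, the extra random columns should leave the held-out error close to the original, in line with the claims above.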
Analytic Dataset Creation
(Lack of) Variable Selection
Data Preparation
Model Selection
Model Testing
Kitchen-sink philosophy for variable creation
Don't shoe-horn several pieces of information into a single variable (e.g. RFM)
Break events out into several different time periods
A single analytic data set with hundreds or thousands of columns can serve as input for hundreds of different models
[Figure: Purchase histories for Customer 1 and Customer 2 (products A, B, and C plotted as price over time, Jan through Jun), summarized three ways: RFM aggregation, simple aggregation (per-product totals and total purchase), and relative aggregation (one column per month, capturing transitions between products).]
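A hypothetical pandas sketch of the aggregation styles contrasted above, built from a made-up transaction table (all column names and values are illustrative assumptions): simple aggregation collapses the order of events into per-product totals, while relative aggregation keeps one column per time period so transitions remain visible.

import pandas as pd

tx = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 2],
    "month":    [1, 3, 5, 1, 3, 5],
    "product":  ["A", "B", "C", "C", "B", "A"],
    "price":    [45, 35, 25, 25, 35, 45],
})

# Simple aggregation: one column per product plus a total per customer
simple = tx.pivot_table(index="customer", columns="product",
                        values="price", aggfunc="sum", fill_value=0)
simple["total_purchase"] = simple.sum(axis=1)

# Relative aggregation: one column per month, preserving the order of events
relative = tx.pivot_table(index="customer", columns="month",
                          values="product", aggfunc="first")

print(simple)
print(relative)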
Data Preparation
Different encodings of the same information generally lead to the same predictive quality
No need to check for co-linearities
No need to check for random variables
SRM can be used within the data preparation process to automate binning of values
Binning
Performance
Captures non-linear behavior of continuous variables
Minimizes impact of outliers
Removes noise from large numbers of distinct values
Explainability
Grouped values are easier to display and understand
Speed
Predictive algorithms get faster as the number of distinct values decreases
Binning Strategies
None: Leave each distinct value as a separate input
Standardized: Group values according to preset ranges
Age: [10-19], [20-29], [30-39], ...; Zip Code: [94100-94199], [94200-94299], ...
Target-Based: Use advanced techniques to find the right bins for a given question. DEPENDS ON THE QUESTION!
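A hedged Python sketch of the standardized and target-based strategies (the age ranges, the toy target, and the use of a shallow decision tree as a stand-in for the target-based technique are all assumptions, not KXEN's method):

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
df = pd.DataFrame({"age": rng.integers(10, 80, size=500)})
df["buyer"] = df["age"].between(30, 45).astype(int)         # toy target

# Standardized binning: preset decade-wide ranges, independent of the target
df["age_decade"] = pd.cut(df["age"], bins=range(10, 90, 10), right=False)

# Target-based binning: let a shallow tree pick cut points for this question
tree = DecisionTreeClassifier(max_leaf_nodes=4).fit(df[["age"]], df["buyer"])
df["age_target_bin"] = tree.apply(df[["age"]])               # leaf id = learned bin

print(df.head())

The tree's cut points depend on the target, which is the point of the warning above: target-based bins have to be rebuilt for each new question.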
[Figure: Distribution of a continuous variable X, shown with no binning and with standardized binning.]
Model Testing
It is not always possible to get a robust model from a limited set of data
Determine if a given number of training examples is sufficient
Determine amount of model degradation over time
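One common way to answer the sufficiency question, sketched below on made-up data (a learning curve; the model, thresholds, and sizes are illustrative assumptions, not the procedure from the talk): retrain on increasing subsets and check whether held-out error has stopped improving.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
X_test, y_test = X[1500:], y[1500:]                  # held-out rows

def test_error(n_train):
    # fit a simple linear scorer on the first n_train rows
    Xtr, ytr = X[:n_train], y[:n_train]
    w = np.linalg.lstsq(Xtr, ytr, rcond=None)[0]
    pred = (X_test @ w > 0.5).astype(float)
    return np.mean(pred != y_test)

for n in (50, 100, 200, 400, 800, 1500):
    print(n, "training rows -> test error", round(test_error(n), 3))

If the curve flattens well before the full data set is used, the available examples are probably sufficient; if it is still falling, more data is likely to help.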
SEMMA
Sample, Explore, Modify, Model, Assess
1. Business Understanding
2. Create Analytic Dataset
3. Modeling
4. Data Understanding
5. Deployment
6. Maintenance
Conclusions
SRM is a general principle that can be applied to all phases of data mining
The main benefit of applying SRM is increased automation, not increased predictive quality
In order to take full advantage of the benefits, the overall modeling methodology must be modified
Thank You
Questions?
Contact Information: rob.cooley@kxen.com 651-338-3880
Additional KXEN Information: www.kxen.com sales-us@kxen.com 5/25 NYC Discovery Workshop, 9am to 1pm