Alberto Guillén
03. June 2008
Capgemini is a leading company with long experience in technology services
We are one of the biggest actors in Business Intelligence in Norway
A major demand from our clients is delivering solutions in Microsoft Excel, and we have continuously updated our efforts to adapt to clients’ needs
Business Consultant
Risk Management & Compliance
MSc. Mathematics
MSc. Statistics
Excel is the industry standard for end-user calculations, and it also serves as a front-end interface
Analytical solutions can be created on different
complexity layers beyond basic Excel
Layers shown around Excel: Solver, Analysis ToolPak, third-party add-ins, Data Mining add-in, VBA, statistical programming languages
The average user masters the standard Excel tools
BASIC EXCEL
Solver leverages computational abilities
SOLVER
Third-party add-ins easily provide new functionality
ADD-INS
Cheap
Simple
Easy to use
No development efforts
Several third-party add-ins offer solutions for quantitative analysis and Monte Carlo simulation
ADD-INS
Hundreds of free or cheap add-ins offer solutions in fields such as Risk Management
The Table Analysis Tools add-in brings data mining capabilities
DATA MINING
The Data Mining add-in makes data mining easier for business analysts
DATA MINING
Software
• SQL Server 2005/2008 (Visual Studio BI)
• Excel/Visio add-ins
• DMX
• ADOMD.Net / AMO
Microsoft brings Data Mining to business users for the first time
Microsoft takes a different approach to Data Mining
DATA MINING
“We have a huge database marketing team who do classic customer analysis. These guys were all SAS users, but when they joined Microsoft, they started using our tools. […], they actually use the Excel data mining add-ins to do it. It's not that there's nothing they don't miss, it's that they are able to achieve the same business results using our tools.”
“For a function such as 'Detect Categories,' what the add-in is doing is building a clustering model in the background […], but we don't expect the Excel user to understand that. We just call it 'Build Categories.'”
“We're seeing a lot of interest in the Excel-side data mining, for one thing, but we're also seeing interest in the embed-ability, too. The people who are actually pushing this are from the developer side.”
Microsoft will not compete with traditional DM vendors; it targets other users
Data Mining assists in various business processes
DATA MINING
Data Mining is performed in SQL Server 2005 / 2008
DATA MINING
SQL Server Business Intelligence Development Studio, together with DMX code, is the natural environment
Data Mining is also accessible through Excel 2007
DATA MINING
Both Excel and SQL Server Analysis Services support the full DM cycle:
Data understanding, Data preparation, Modeling, Validation, Deployment
Excel sends DM queries and data directly to SQL Server Analysis Services
Data Mining is an iterative process
DATA MINING
Steps shown in the cycle: Preparing Data, Exploring Data, Building Models, Deploying and Updating
Although the process is illustrated as circular, creating a data mining model is a dynamic and iterative process
Nine Data Mining algorithms are available in Excel
DATA MINING
Decision/Regression Trees
Clustering
Naïve Bayes
Association rules
Sequence clustering
Time series
Neural Networks
Logistic regression
Linear regression
Plug-in algorithms
• Third-party or self-programmed algorithms that implement a set of COM interfaces
Decision and Regression trees find natural splits
DATA MINING
Example:
• Identify potential buyers
Decision trees give decision rules that are easy for the business to understand
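As a rough illustration of what "finding a natural split" can mean, the sketch below searches one numeric attribute for the threshold that minimises the weighted Gini impurity of a buy / no-buy label. The customer data, the attribute (age) and the function names are invented for illustration, and this is not the Microsoft Decision Trees implementation.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, impurity) of the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical customers: age, and whether they bought (1) or not (0).
ages = [22, 25, 31, 38, 45, 52, 58, 63]
bought = [0, 0, 0, 1, 1, 1, 1, 1]
threshold, impurity = best_split(ages, bought)
print("split at age", threshold, "with impurity", impurity)
```

The resulting rule ("age above the threshold: likely buyer") is the kind of statement a business user can read directly.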
Clustering finds homogeneous groups
DATA MINING
Example: Find segments of similar clients
Plot: clients by age and income, with segments labelled by age group (”older”, middle age, young people), number of cars and children
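To make the idea of homogeneous groups concrete, here is a minimal k-means sketch on two attributes (age and income). The client data, the starting centroids and the function name are assumptions made for illustration; the Microsoft Clustering algorithm offers other methods too, and real data would normally be rescaled first so that no attribute dominates the distance.

```python
def kmeans(points, centroids, iterations=20):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        assignment = [
            min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2,
            )
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignment, centroids


# Hypothetical clients: (age in years, income in thousands).
clients = [(23, 30), (25, 32), (27, 35), (45, 70), (48, 75), (50, 72)]
assignment, centroids = kmeans(clients, [clients[0], clients[3]])
print(assignment)
print(centroids)
```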
Naïve Bayes provides probabilities of group membership
DATA MINING
Association rules unveil hidden logic
DATA MINING
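Association rules are usually judged by their support and confidence. The sketch below computes both for a single rule; the product baskets and the function name are invented for illustration and do not come from the add-in.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    support = len(with_both) / n
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical product holdings per customer.
baskets = [
    {"loan", "insurance"},
    {"loan", "insurance", "card"},
    {"loan"},
    {"card"},
]
support, confidence = rule_stats(baskets, {"loan"}, {"insurance"})
print(support, confidence)
```

Support is the share of all customers holding both products; confidence is the share of loan holders who also hold insurance.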
Sequence clustering finds event patterns in time
DATA MINING
Time series forecasts processes in time
DATA MINING
The past patterns that the algorithm discovers can be used to predict values for future time steps.
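A very small instance of this idea is a first-order autoregression: fit x[t] = a + b * x[t-1] by least squares on the history, then iterate the fitted relation forward. The series and the function names below are made up for illustration, and this simple model stands in for, and is not, the Microsoft Time Series algorithm.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = a + b * x[t-1]."""
    x = series[:-1]
    y = series[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def forecast(series, steps):
    """Iterate the fitted relation forward from the last observed value."""
    a, b = fit_ar1(series)
    last = series[-1]
    predictions = []
    for _ in range(steps):
        last = a + b * last
        predictions.append(last)
    return predictions


# Hypothetical history that follows x[t] = 60 + 0.5 * x[t-1] exactly.
history = [100.0, 110.0, 115.0, 117.5, 118.75]
a, b = fit_ar1(history)
future = forecast(history, 2)
print(a, b, future)
```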
Neural networks discover predictive patterns by learning
DATA MINING
Logistic regression predicts binary responses
DATA MINING
Linear regression is of course also available
DATA MINING
Chosen examples vs. real-life problems
DATA MINING
Unfortunately, it is not that easy: data mining is a creative and often ill-defined process. Sometimes data mining gives no answer.
Data Mining: using the Data Mining add-in to forecast Credit Default
DATA MINING
Inputs: Number of payments, Income, Age, Civil status
Output: score (probability of default), ranked into classes A, B, C, D
After training the algorithm, probabilities of default can be predicted for new
applicants
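The train-then-predict step can be sketched with a small logistic regression fitted by gradient descent. Everything specific here is an assumption made for illustration: the two features (income and number of missed payments), the toy applicants, the learning rate and the class cut-offs. The slides do not say which algorithm or thresholds were used in the actual solution.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, row):
    """Probability of default for one applicant (weights[0] is the bias)."""
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], row)))


def train_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the logistic log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            err = predict(weights, row) - y
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad)]
    return weights


def risk_class(p, cutoffs=(0.25, 0.5, 0.75)):
    """Map a default probability to a ranked class; cut-offs are hypothetical."""
    for label, cutoff in zip("ABC", cutoffs):
        if p < cutoff:
            return label
    return "D"


# Hypothetical applicants: (income in units of 100,000, missed payments).
rows = [(0.2, 3), (0.3, 4), (0.25, 2), (0.4, 3),
        (0.8, 0), (0.9, 1), (0.7, 0), (1.0, 1)]
defaulted = [1, 1, 1, 1, 0, 0, 0, 0]

weights = train_logistic(rows, defaulted)
p_risky = predict(weights, (0.3, 3))
p_safe = predict(weights, (0.9, 0))
print(risk_class(p_risky), risk_class(p_safe))
```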
Data Mining: using the Data Mining add-in to forecast Debt Recovery
DATA MINING
Plot: % recovered against period t, training phase
Problem: the k-nearest neighbours algorithm is needed
• Can be implemented as a plug-in algorithm
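For readers unfamiliar with k-nearest neighbours, a minimal regression version is sketched below: the forecast for a new case is the average outcome of the k most similar training cases. The features, the recovery figures and the function name are hypothetical, and an actual plug-in would have to implement the COM interfaces mentioned earlier rather than a plain function like this.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training rows closest to `query`."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), y)
        for row, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical cases: (period t, debt size in thousands) -> % recovered.
train_x = [(1, 10), (2, 10), (3, 10), (10, 50), (11, 50), (12, 50)]
train_y = [5.0, 10.0, 15.0, 60.0, 65.0, 70.0]

estimate = knn_predict(train_x, train_y, (2, 11), k=3)
print(estimate)
```

As with clustering, the features would normally be rescaled first so that one attribute does not dominate the distance.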
Prediction phase (plot annotation): Age = Age + …
VBA for Excel is the main tool for automated solutions
VBA
Easy and quick interaction with the solution through ActiveX Buttons and
Userforms
VBA: Building a statistical tool for analyzing and forecasting Debt Collection
VBA
Problems: implementing statistical algorithms takes a lot of work, and Solver can get slow
VBA is the tool to use to provide end-users with an interactive workstation
There are no limits with statistical programming languages
R
Diagram: the layers around Excel (Solver, Analysis ToolPak, third-party add-ins, Data Mining add-in, VBA), with statistical programming languages attached through a COM server
Connection labels shown: DDE, 1991 OLE, .Net, WCF
R is becoming the standard in the scientific community
R
R Excel add-in
• Background mode
• Small Ribbon toolbar
• Fast
• Code embedded in:
− Worksheet functions
− VBA
− Cells
Histograms are a dangerous tool to approximate empirical pdfs
R
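One well-known reason for caution is that the shape of a histogram depends on where the bins start, even when the data are unchanged. The sketch below shows this on a toy sample and contrasts it with a Gaussian kernel density estimate, which has no bin origin. The data, bin width and bandwidth are arbitrary choices for illustration, and the code is written in Python rather than R only to keep the example self-contained.

```python
import math


def histogram(data, origin, width, bins):
    """Counts per bin for bins [origin + i*width, origin + (i+1)*width)."""
    counts = [0] * bins
    for x in data:
        idx = int((x - origin) // width)
        if 0 <= idx < bins:
            counts[idx] += 1
    return counts


def gaussian_kde(data, bandwidth):
    """Return a function estimating the density with Gaussian kernels."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


sample = [0.9, 1.1, 1.9, 2.1, 2.9, 3.1]

# Same data, same bin width, different bin origin: different picture.
counts_a = histogram(sample, origin=0.0, width=1.0, bins=4)
counts_b = histogram(sample, origin=0.5, width=1.0, bins=4)
print(counts_a, counts_b)

# The kernel estimate gives one smooth curve regardless of any bin origin.
density = gaussian_kde(sample, bandwidth=0.5)
print(density(2.0))
```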
With the R add-in, advanced semiparametric methods are available
R
Statistical programming languages: problem case
R
Plots: series r and olje (oil) against Index
Complex multivariate Monte Carlo models are developed fast in R
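The core step of a multivariate Monte Carlo model is drawing correlated random shocks. A minimal two-variable sketch is given below, using the Cholesky factor of the correlation matrix; the correlation of 0.7, the sample size and the function names are assumptions for illustration and are not taken from the problem case shown on the slide.

```python
import math
import random


def simulate_correlated(n, rho, seed=0):
    """Draw n pairs of standard normal shocks with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Cholesky factor of [[1, rho], [rho, 1]] applied to (z1, z2).
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return pairs


def correlation(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


pairs = simulate_correlated(20000, rho=0.7, seed=1)
estimated = correlation(pairs)
print(estimated)
```

In a full model these shocks would drive the simulated paths of the risk factors; with more than two variables the same idea uses the Cholesky factor of the full correlation matrix.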
Industrial solutions are another alternative
Industry vendor
Diagram: the same layers around Excel (Solver, Analysis ToolPak, third-party add-ins, Data Mining add-in, VBA, statistical programming languages via COM server), with Industrialized Vendors added
Industrial solutions should be chosen only if the area requires it
Industry vendor
The previously presented alternatives for Excel can do their job at end-user level
Further references
Capgemini:
• www.no.capgemini.com (alberto.guillen@capgemini.com)
R:
• http://www.r-project.org
• http://sunsite.univie.ac.at/rcom/