Customer retention
Target key prospects
Profile market segments
Detect fraud
Analyze customer response, and much more
1. Implementation of ERP, CRM & SCM systems has resulted in vast stores of operational data.
2. Emergence of global competition has put pressure on companies to be data-driven, i.e., to make informed decisions based on facts and not hunches.
3. The speed of change in the marketplace demands that the pearls of actionable information be found faster in the ocean of data, for companies to stay one step ahead of the competition.
4. The hardware needed to store and process a ton of data was prohibitively expensive until recently. You would have had to have NASA at your disposal. Today, the technology makes it feasible to apply complex models to ferret out patterns previously left to rot in data jails.
Farmers Insurance
Rule-Based System
If This, Then That
Rules are determined from expert knowledge and programmed in the
software
An HR Application
Screening a large number of resumes for relatively low-level positions with
well-defined and precise skill requirements
- e.g., Call Center Agents
Expert System can weed out applicants who do not meet the requirements
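As a rough illustration of the "If This, Then That" idea, the sketch below screens call center applicants against a fixed rule set. The field names and thresholds (typing speed, experience, shift availability) are hypothetical and chosen only for illustration; a real system would encode rules supplied by hiring experts.

```python
from dataclasses import dataclass

@dataclass
class Applicant:
    name: str
    typing_wpm: int          # hypothetical skill measure
    years_experience: float  # hypothetical requirement
    can_work_shifts: bool    # hypothetical requirement

# Each rule is (description, test). All thresholds are illustrative assumptions.
RULES = [
    ("types at least 40 wpm", lambda a: a.typing_wpm >= 40),
    ("has at least 1 year of experience", lambda a: a.years_experience >= 1),
    ("can work shifts", lambda a: a.can_work_shifts),
]

def screen(applicant):
    """Return (passed, descriptions of the rules the applicant failed)."""
    failed = [desc for desc, test in RULES if not test(applicant)]
    return (not failed, failed)

applicants = [
    Applicant("A", 55, 2.0, True),
    Applicant("B", 30, 3.0, True),
    Applicant("C", 60, 0.5, False),
]

# Keep only applicants who satisfy every rule.
shortlist = [a.name for a in applicants if screen(a)[0]]
print(shortlist)
```

Keeping the rules in a data structure rather than in nested if-statements makes it easy to report which requirement an applicant failed.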
Companies circulated top-line reports, including tables and charts from the retail store audit data. An analyst prepared the cover memo highlighting important news in the data.
Now...
Not feasible to have an army of analysts to sift through the mountain of scanner data. Instead, "CoverStory" automatically writes this memo!
- A model-embedded expert system extracts the news
- Includes a built-in thesaurus to eliminate repetitious wording
Dr. Lakshmi Mohan
Case Example:
Ocean Spray
Cranberries
A $1 billion grower-owned agricultural cooperative
Lean IS staff
Only one marketing professional for analyzing the
tracking data
Scanner data for juices is imposing
- 400 million numbers covering up to 100 data measures, 10,000 products, 125 weeks and 50 geographic markets
- Grows by 10 million new numbers every four weeks
Impact of
CoverStory
Enables a department of one to alert all Ocean
Spray marketing and sales managers to key
problems and opportunities and provide problem-solving information
Being done across 4 business units handling scores
of company products in dozens of markets
representing hundreds of millions of dollars of sales
System is totally integrated into business operations
because it delivers information of competitive value
in running the business
Knowledge Discovery in
Databases
- Steps in KDD process
Data Warehouse -> Selection -> Target Data -> Cleaning -> Transformed Data -> DATA MINING -> Patterns -> Evaluation & Interpretation -> Knowledge
Source: Communications of the ACM, 1996
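The KDD steps listed above can be sketched as a chain of small functions. This is a toy illustration only: the records, field names, and the "mining" step (a simple average per segment) are assumptions made for the example and stand in for far richer data and models.

```python
# Toy "data warehouse": customer records, some with missing values (None).
warehouse = [
    {"customer": "c1", "segment": "urban", "spend": 120.0},
    {"customer": "c2", "segment": "rural", "spend": None},
    {"customer": "c3", "segment": "urban", "spend": 80.0},
    {"customer": "c4", "segment": "rural", "spend": 40.0},
    {"customer": "c5", "segment": None, "spend": 60.0},
]

def select(records, fields):
    """Selection: keep only the fields relevant to the question (target data)."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Cleaning: drop records that have missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Transformation: group spend values by segment."""
    grouped = {}
    for r in records:
        grouped.setdefault(r["segment"], []).append(r["spend"])
    return grouped

def mine(grouped):
    """Data mining (deliberately trivial here): average spend per segment."""
    return {seg: sum(values) / len(values) for seg, values in grouped.items()}

def interpret(patterns):
    """Evaluation and interpretation: name the highest-spending segment."""
    return max(patterns, key=patterns.get)

target = select(warehouse, ["segment", "spend"])
patterns = mine(transform(clean(target)))
knowledge = interpret(patterns)
print(patterns, knowledge)
```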
Data Mining
Models
1. Association
2. Sequential Patterns
3. Classification
- Opera ticket buyers are usually young urban professionals with high income, while country music concert ticket purchasers are typically blue-collar workers
4. Clustering
5. Predictive Models
Example:
If a customer buys corn chips, then 65% of the time the customer also buys cola, unless there is a promotion, in which case the customer buys cola 85% of the time.
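A percentage of this kind is a conditional share: among the baskets that contain the first item, how many also contain the second. The sketch below computes that share overall and separately for promotion and non-promotion baskets. The baskets are made-up illustrative data and do not reproduce the 65% and 85% figures in the example.

```python
def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b["items"]]
    if not with_antecedent:
        return None  # the rule never applies, so no share can be computed
    hits = sum(1 for b in with_antecedent if consequent in b["items"])
    return hits / len(with_antecedent)

# Hypothetical market baskets, each flagged by whether a promotion was running.
baskets = [
    {"items": {"corn chips", "cola"}, "promotion": False},
    {"items": {"corn chips"}, "promotion": False},
    {"items": {"corn chips", "cola"}, "promotion": True},
    {"items": {"corn chips", "cola", "salsa"}, "promotion": True},
    {"items": {"bread", "cola"}, "promotion": False},
    {"items": {"corn chips", "salsa"}, "promotion": False},
]

overall = confidence(baskets, "corn chips", "cola")
promo = confidence([b for b in baskets if b["promotion"]], "corn chips", "cola")
no_promo = confidence([b for b in baskets if not b["promotion"]], "corn chips", "cola")
print(overall, promo, no_promo)
```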
Sequential
Patterns
Example:
If surgical procedure X is performed, then 45% of the time
infection Y occurs within 5 days
But after 5 days, the likelihood of infection Y drops to 4%
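One simple reading of such a sequential pattern is the share of cases in which the follow-up event occurs inside a time window versus after it. The sketch below computes both shares from a list of elapsed days. The data are invented for illustration, and the 5-day window is taken from the example; other readings (for instance, conditioning on no infection in the first 5 days) are equally possible.

```python
def rate_within(days_list, max_days):
    """Share of all cases where the event occurred within `max_days`."""
    hits = sum(1 for d in days_list if d is not None and d <= max_days)
    return hits / len(days_list)

def rate_after(days_list, min_days):
    """Share of all cases where the event occurred later than `min_days`."""
    hits = sum(1 for d in days_list if d is not None and d > min_days)
    return hits / len(days_list)

# Hypothetical data: days from procedure X to infection Y.
# None means no infection was recorded for that case.
days_to_infection = [2, None, 4, None, 9, 1, None, None, 3, None]

early = rate_within(days_to_infection, 5)
late = rate_after(days_to_infection, 5)
print(early, late)
```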
Classification Models
- Most Common Data Mining Model
Examples:
- What are the characteristics of customers who are likely to switch to a
rival telecom service provider?
- Which kinds of promotions have been effective in keeping which types
of customers so that you can target the right promotion to the right
customer?
Clustering Models
You do not know what the clusters will be when you start, or on what attributes the data
will be clustered.
Hence, a user who is knowledgeable in the business needs to interpret the clusters.
Example:
Xerox has developed predictive models using clusters for analyzing usage profile history,
maintenance data, and representations of knowledge from field engineers to predict
photocopy component failure.
An email is sent to the repair staff to schedule maintenance PRIOR to the breakdown
Root Cause Analysis enables a prescription for what to do about a problem
Predictive Models
The resulting model is used to predict the value for new data that does not include the variable to be predicted.
If the customer is rural and her monthly usage is high, then the customer will
probably renew.
If the customer is urban and new feature exploration is high, then the customer
will probably not renew.
We can tell the profile of someone who is about to have a baby by what purchases
they make
We can then compare that profile with those of others who are moving into baby
space to predict needs. For instance, such a customer may be a good target for a
life insurance sales pitch.
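The two renewal rules above can be applied to new customer records whose renewal outcome is not yet known. The sketch below is a minimal illustration; the field names and the "high"/"low" encodings are assumptions, and a customer matched by neither rule simply gets no prediction.

```python
def predict_renewal(customer):
    """Apply the two renewal rules; return None when neither rule fires."""
    if customer["location"] == "rural" and customer["monthly_usage"] == "high":
        return "will probably renew"
    if customer["location"] == "urban" and customer["feature_exploration"] == "high":
        return "will probably not renew"
    return None

# Hypothetical new records: the renewal outcome itself is not in the data.
new_customers = [
    {"id": 1, "location": "rural", "monthly_usage": "high", "feature_exploration": "low"},
    {"id": 2, "location": "urban", "monthly_usage": "low", "feature_exploration": "high"},
    {"id": 3, "location": "urban", "monthly_usage": "high", "feature_exploration": "low"},
]

predictions = {c["id"]: predict_renewal(c) for c in new_customers}
print(predictions)
```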
What is the best possible question to ask at each branch point of the tree?
e.g., The question "Are you over 35?" may not distinguish between churners and those who are not if the split of people over 35 is 40% churners and 60% others. The goal is to get a 90%-10% (or 10%-90%) split in the segment of people over 35 years.
The algorithms look at all possible distinguishing questions, and the sequence of asking them, that could break up the training data set into segments that are nearly homogeneous with respect to the variable to be predicted. They stop growing the tree when the improvement is not substantial enough to warrant asking the question.
CART begins by trying all the questions for grouping the population and
picks the best one that splits the data into two or more organized
segments that decrease the disorder of the original population as much
as possible.
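To make "disorder" concrete, the sketch below uses Gini impurity, a measure commonly associated with CART; treating it as the disorder measure here is an assumption, since the slides do not name one. It first compares a 40%-60% segment with a 90%-10% segment, then scores two candidate questions on a small made-up training set and picks the one that leaves the least disorder.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 0.0 means perfectly pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(rows, question, target="churn"):
    """Size-weighted Gini impurity of the two segments a yes/no question creates."""
    yes = [r[target] for r in rows if question(r)]
    no = [r[target] for r in rows if not question(r)]
    n = len(rows)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)

# A 40%-60% segment is far more disordered than a 90%-10% one.
mixed = gini([True] * 4 + [False] * 6)
nearly_pure = gini([True] * 9 + [False] * 1)

# Hypothetical training data.
rows = [
    {"age": 42, "contract": "monthly", "churn": True},
    {"age": 51, "contract": "annual", "churn": False},
    {"age": 38, "contract": "monthly", "churn": True},
    {"age": 29, "contract": "monthly", "churn": True},
    {"age": 33, "contract": "annual", "churn": False},
    {"age": 45, "contract": "annual", "churn": False},
    {"age": 27, "contract": "annual", "churn": False},
    {"age": 36, "contract": "monthly", "churn": False},
]

# Candidate questions (illustrative); a real algorithm would enumerate many more.
questions = {
    "over 35?": lambda r: r["age"] > 35,
    "monthly contract?": lambda r: r["contract"] == "monthly",
}

scores = {name: split_impurity(rows, q) for name, q in questions.items()}
best = min(scores, key=scores.get)
print(mixed, nearly_pure, scores, best)
```

In this toy data the age question barely reduces the disorder of the full population, while the contract question isolates a segment with no churners at all.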
The algorithm not only discovers the optimally generated tree but also has
the validation of the model on new test data (holdout sample) built in.
The most complex tree rarely fares the best on the holdout sample because
it has been over-fitted to the training data set. The tree is pruned back
based on the performance of the various pruned versions on the test data.
Left side of rule (before THEN): antecedent (can have multiple conditions)
Right side of rule (after THEN): consequent (only ONE condition)
Accuracy High
Coverage High
Coverage Low
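The sketch below shows one common way to quantify a rule with a multi-condition antecedent and a single consequent. The definitions are assumptions, since the slides do not spell them out: coverage is taken as the share of records where the antecedent holds, and accuracy as the share of those records where the consequent also holds. The records and item names are hypothetical.

```python
def rule_stats(records, antecedent, consequent):
    """Return (accuracy, coverage) for the rule IF antecedent THEN consequent."""
    applies = [r for r in records if antecedent(r)]
    coverage = len(applies) / len(records)
    if not applies:
        return None, coverage  # accuracy is undefined if the rule never applies
    accuracy = sum(1 for r in applies if consequent(r)) / len(applies)
    return accuracy, coverage

# Hypothetical purchase records.
records = [
    {"bagels": True, "weekend": True, "cream_cheese": True},
    {"bagels": True, "weekend": True, "cream_cheese": True},
    {"bagels": True, "weekend": True, "cream_cheese": False},
    {"bagels": True, "weekend": False, "cream_cheese": False},
    {"bagels": False, "weekend": True, "cream_cheese": True},
    {"bagels": False, "weekend": False, "cream_cheese": False},
    {"bagels": False, "weekend": True, "cream_cheese": False},
    {"bagels": False, "weekend": False, "cream_cheese": False},
]

# IF bagels AND weekend THEN cream cheese: two conditions, one consequent.
accuracy, coverage = rule_stats(
    records,
    antecedent=lambda r: r["bagels"] and r["weekend"],
    consequent=lambda r: r["cream_cheese"],
)
print(accuracy, coverage)
```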
The acid test of the model is to apply the fitted model to new data not used to calculate the parameters (a and b) of the model: the hold-out or validation data set.
Refine the model, if necessary, to make better predictions:
- Add multiple predictors (multiple regression models)
- Transform predictors by squaring, taking logarithms, etc. (non-linear models)
- Combine predictors by multiplying or taking ratios (e.g., ratio of annual household income to family size)
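The hold-out test can be sketched for a one-predictor model of the form y = a + b*x. The example below fits a and b by ordinary least squares on a training set and then measures the prediction error on data that played no part in the fit. The numbers are hypothetical, and mean absolute error is just one possible error measure.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def mean_abs_error(a, b, xs, ys):
    """Average absolute prediction error of the fitted line on (xs, ys)."""
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: the hold-out points are never used to compute a and b.
train_x, train_y = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.0]
hold_x, hold_y = [6, 7], [12.1, 13.8]

a, b = fit_line(train_x, train_y)
train_err = mean_abs_error(a, b, train_x, train_y)
hold_err = mean_abs_error(a, b, hold_x, hold_y)
print(a, b, train_err, hold_err)
```

A hold-out error much larger than the training error would suggest the model needs refining or has been fitted too closely to the training data.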
Structure of a Neural
Network
A Simple Example
0.47(0.7) + 0.65(0.1) = 0.39 vs. actual value of 0 (No Default)
Link weights (0.7 & 0.1 in the above example) are adjusted to correct for the
deviation between the output of the processing (0.39 in this case) and the
actual value (0 in this case)
Large errors are given greater attention in the correction than small errors
Training loop: adjust weights, then check whether the desired output is achieved. If no, adjust weights again; if yes, stop.
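The weight adjustment can be sketched with the numbers from the example. The update used here is a simple error-proportional correction (the delta rule) on a single unit with no activation function. The update rule, the learning rate, and the stopping tolerance are all assumptions, since the slides do not specify them.

```python
inputs = [0.47, 0.65]   # input values from the example
weights = [0.7, 0.1]    # initial link weights from the example
actual = 0.0            # actual value (No Default)
learning_rate = 0.5     # assumed; not given in the example
tolerance = 0.01        # assumed threshold for "desired output achieved"

def output(ws, xs):
    """Weighted sum of the inputs (no activation function, for simplicity)."""
    return sum(w * x for w, x in zip(ws, xs))

first_output = output(weights, inputs)  # 0.47*0.7 + 0.65*0.1

steps = 0
while abs(actual - output(weights, inputs)) > tolerance and steps < 1000:
    error = actual - output(weights, inputs)
    # The correction is proportional to the error, so large errors
    # move the weights more than small ones.
    weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
    steps += 1

final_output = output(weights, inputs)
print(first_output, final_output, steps)
```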
How long does it take to get useful answers from the data?
I call it the central nervous system for what we are doing with
knowledge management.
Installing SAS Text Miner is a simple process - just needed to load 6 CDs on my workstation
- Depends on the skill and knowledge of the user to properly interrogate text repositories
"We are getting an increasing understanding of what things are possible with text mining. But there is a huge skills problem in this area, which is why it hasn't gotten much traction so far" - Gartner
Proactive or Reactive?
The conventional wisdom has been to just take transactional data and move it to the data warehouse and then to the BI system. But these systems aren't responsive.
Monitoring business activity after the fact is too late to head off a problem such as a missed deadline or the loss of a major customer.
BAM systems pluck the data in real time from the applications where it originates: order entry, accounts receivable, call centers, etc. Output comes in a variety of forms: dashboards, e-mails, pager alerts, ...
To monitor some 50,000 cases per year where the firm has signed contracts
with its clients guaranteeing performance against operational metrics
relating to dozens of milestones in the contracts.
You can actually over-engineer something like this. If you get too many
stakeholders involved, everyone wants their own particular metric. We have
been able to keep it focused and simple.
Vendors
Sends an e-mail to each vendor that was issued an electronic payment during the night.
Directs the vendor to a Website on the extranet where it can get a remittance report
Residents
Sends an e-mail to each resident for whom a water bill was produced, with all the pertinent billing info
Directs the resident to a Website where he may pay his bill online
City Employees
Once-a-day e-mails to certain employees letting them know of all online payments made to the city during the past 24 hours
Whenever a candidate files a contribution report, NoticeCast sends an e-mail to city employees responsible for tracking campaign law compliance
Example:
A BAM system could generate an alert that the estimated date of
a package delivery had slipped.
A CRM system and a BPM system might each subscribe to such
package due-date change alerts, extending the usefulness of
the alerts.
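The subscription idea in this example can be sketched as a small publish/subscribe hub. Everything here is hypothetical: the class name, the alert type string, the payload fields, and the two handlers standing in for a CRM system and a BPM system. It does not represent any particular BAM product's API.

```python
from collections import defaultdict

class AlertBus:
    """Minimal publish/subscribe hub: systems register handlers per alert type."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, alert_type, handler):
        self._subscribers[alert_type].append(handler)

    def publish(self, alert_type, payload):
        """Deliver the alert to every subscriber; return how many were notified."""
        handlers = self._subscribers[alert_type]
        for handler in handlers:
            handler(payload)
        return len(handlers)

# Stand-ins for the actions a CRM system and a BPM system might take.
crm_log, bpm_log = [], []

bus = AlertBus()
bus.subscribe("delivery_date_changed",
              lambda p: crm_log.append(("notify customer", p["package"])))
bus.subscribe("delivery_date_changed",
              lambda p: bpm_log.append(("reschedule", p["package"])))

# The BAM system detects the slipped delivery date and publishes one alert.
notified = bus.publish(
    "delivery_date_changed",
    {"package": "PKG-1", "old_eta": "2024-05-01", "new_eta": "2024-05-03"},
)
print(notified, crm_log, bpm_log)
```

The publisher does not need to know which systems are listening, which is what lets additional subscribers extend the usefulness of an alert.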