12/14/2014

BAFPRED: Fundamentals of Predictive Analytics

Introduction

© Copyright IBM Corporation 2013. All rights reserved.

THE INFORMATION CONTAINED IN THIS PRESENTATION IS FOR INFORMATIONAL PURPOSES ONLY. IBM SHALL
NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS
PRESENTATION OR ANY OTHER DOCUMENTATION.

IBM, the IBM logo, ibm.com, Cognos, SPSS and iLog are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or both. If these and other IBM
trademarked terms are U.S. registered or common law trademarks owned by IBM at the time this information
was published. Trademarks may also be registered or common law trademarks in other countries. A current
list of IBM trademarks is available on the Web at “Copyright and trademark information” at
http://www.ibm.com/legal/copytrade.html. The IBM logo must not be moved, added to or altered in any way.

Other company, product, or service names may be trademarks or service marks of others.

IBM Global Center for Smarter Analytics © 2013 IBM Corporation

Course Overview

Course Description
This course is designed to introduce students to the fundamentals
of predictive analytics. Predictive analytics allows large volumes of data
to be used for prediction, classification and association, making it a
very useful tool for projections, forecasts and correlations.

Course Objectives

The course will enable the student to:
 Identify opportunities for predictive analytics
 Extract and select predictors
 Build predictive models using different predictive modelling algorithms
 Identify appropriate predictive modelling algorithms
 Validate and evaluate the performance of the predictive model
 Utilize analytics tools to select predictors, build and validate predictive models
 Discuss key concepts, theories and algorithms in optimization
 Develop an awareness of the ethical norms as required under policies and applicable laws governing confidentiality and non-disclosure of data/information/documents and proper conduct in the learning process and application of business analytics

Readings

Witten, I. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

Module 1: Introduction to Predictive Analytics

Areas for predictive analytics

 Customer analytics
• Who are our best customers?
• Can we get more like that?
• What/why do they buy?
• Why do they leave?
 Human capital management
• Who are our best employees?
• How do we keep our best employees from leaving?
• Which prospects should we recruit?
 Fraud detection and prevention
• Money laundering
• Network intrusion
• Tax audits & collection
 Crime analysis
 Science
• Genetics
• Drug discovery
• Medical research
• Food authentication
 Industrial process optimization
 Predictive maintenance to predict equipment failure
 Warranty claims
 Product quality assurance
.... and many more

IBM SPSS Modeler

 High performance data mining and text analytics workbench
 Used for the proactive
• Identification of revenue opportunities
• Reduction of costs
• Increase in productivity
• Forecasting
 Allows analytics to be repeated and integrated within business systems

Sample Problem[6]

ID    SEX  STATUS  CHILDREN  STATUS
0     F    M       2         Vol
3     F    S       2         InVol
4     F    M       2         Vol
8     M    S       0         Current
10    M    M       2         Vol
11    F    S       0         InVol
13    F    M       2         Vol
17    M    S       0         Current
20    F    M       2         Vol
23    F    M       1         Vol
42    M    S       2         Vol
4820  F    M       0         Vol

The first STATUS column is marital status; the second is subscription status.

What are the characteristics of those subscribers whose status is Vol (voluntarily terminated subscription)?

They are either:
MARRIED FEMALE
or
MALE with at least 1 CHILD
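The rule read off the sample table can be checked mechanically. The following is a minimal sketch, assuming the twelve rows are typed in by hand as tuples; the function names `matches_rule` and `rule_mismatches` are illustrative and do not come from the course material.

```python
# Rows copied from the sample table: (ID, SEX, MARITAL STATUS, CHILDREN, STATUS).
SUBSCRIBERS = [
    (0, "F", "M", 2, "Vol"),
    (3, "F", "S", 2, "InVol"),
    (4, "F", "M", 2, "Vol"),
    (8, "M", "S", 0, "Current"),
    (10, "M", "M", 2, "Vol"),
    (11, "F", "S", 0, "InVol"),
    (13, "F", "M", 2, "Vol"),
    (17, "M", "S", 0, "Current"),
    (20, "F", "M", 2, "Vol"),
    (23, "F", "M", 1, "Vol"),
    (42, "M", "S", 2, "Vol"),
    (4820, "F", "M", 0, "Vol"),
]


def matches_rule(sex, marital, children):
    """The slide's rule: married female, or male with at least one child."""
    return (sex == "F" and marital == "M") or (sex == "M" and children >= 1)


def rule_mismatches(rows):
    """Return the IDs where the rule disagrees with the recorded status."""
    return [
        row_id
        for row_id, sex, marital, children, status in rows
        if matches_rule(sex, marital, children) != (status == "Vol")
    ]


if __name__ == "__main__":
    # An empty list means the rule describes all twelve rows exactly.
    print(rule_mismatches(SUBSCRIBERS))
```

A rule that fits twelve known rows perfectly says nothing yet about how it behaves on subscribers it has not seen.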

Does this mean that a new subscriber who fits the characteristic will eventually voluntarily terminate their subscription?

MOST PROBABLY NO.

There is NOT ENOUGH DATA to make this conclusion.

Predictive Analytics Sample (cont)[6]

ID    LONGDIST  International  LOCAL     DROPPED  PAY_MTHD  LocalBillType  LongDistanceBillType  AGE  SEX  STATUS  CHILDREN  Est_Income  Car_Owner  STATUS
0     5.2464    7.5151         86.3278   0        CH        FreeLocal      Standard              57   F    M       2         27535.3     Y          Vol
3     0         0              3.94229   0        CC        Budget         Intnl_discount        50   F    S       2         64632.3     N          InVol
4     5.55564   0              9.36347   1        CC        Budget         Intnl_discount        68   F    M       2         81000.9     N          Vol
8     14.0193   5.68043        29.8065   0        CC        Budget         Standard              34   M    S       0         87467.1     Y          Current
10    13.664    2.95642        32.6381   0        CC        FreeLocal      Intnl_discount        60   M    M       2         83220.6     N          Vol
11    0         0              1.41294   0        CC        FreeLocal      Standard              84   F    S       0         50290.7     N          InVol
13    0.281029  0              8.53692   0        CH        Budget         Intnl_discount        28   F    M       2         20850.4     N          Vol
17    1.577     0              19.9808   0        CC        FreeLocal      Standard              52   M    S       0         84112.6     N          Current
20    0.452629  0              73.0122   0        Auto      FreeLocal      Standard              88   F    M       2         73865.9     Y          Vol
23    20.2946   0              76.0518   0        CC        FreeLocal      Standard              76   F    M       1         12309.6     N          Vol
42    8.86499   4.43676        43.6439   0        CH        Budget         Intnl_discount        55   M    S       2         85753.8     N          Vol
4820  0         0              0.660288  0        CH        Budget         Standard              49   F    M       0         68828.4     Y          Vol

Telco Subscriber List

MORE DATA may provide a more accurate ANALYSIS.

But is it practical to do this MANUALLY?

CRISP-DM[8]


Business Understanding[8]
 The most important phase of data mining. It includes determining business objectives, assessing the situation, setting data mining goals and producing a project plan.
• Identify business objectives and success criteria
• Perform a situational assessment (resources, constraints, assumptions, risks, costs, and benefits)
• Determine the goals of the data-mining project and success criteria
• Produce a project plan

Data Understanding[8]

 This phase addresses the need to understand what your data resources are and the characteristics of those resources. Activities include:
• Collecting initial data
• Describing data
• Exploring data
• Verifying data quality

Data Preparation[8]

 After cataloging your data resources, you will need to prepare your data for mining. Preparations include selecting, cleaning, constructing, integrating, and formatting data.
• Extracting data from a data warehouse or data mart
• Linking tables together within a database or in PASW Modeler
• Combining data files from different systems
• Reconciling inconsistent field values
• Identifying missing, incorrect, or extreme data values
• Data selection
• Restructuring data into a form the analysis requires
• Transforming relevant fields (taking differences, ratios, etc.)
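Three of the preparation steps listed above (reconciling inconsistent field values, identifying missing or extreme values, and transforming fields into ratios) can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the `SEX_CODES` mapping, the `local_max` threshold and the choice to fill flagged values with the median are all hypothetical and are not taken from the course or from PASW Modeler.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"id": 1, "sex": "f", "local": 86.3, "longdist": 5.2},
    {"id": 2, "sex": "Female", "local": None, "longdist": 0.0},
    {"id": 3, "sex": "M", "local": 29.8, "longdist": 14.0},
    {"id": 4, "sex": "male", "local": 9000.0, "longdist": 1.6},
]

# Assumed mapping used to reconcile inconsistent codes for the same value.
SEX_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def prepare(records, local_max=1000.0):
    """Return (cleaned_rows, flagged_ids) without modifying the input."""
    # Values that are present and not extreme are used to compute the fill value.
    usable = [
        r["local"]
        for r in records
        if r["local"] is not None and r["local"] <= local_max
    ]
    fill = median(usable)
    cleaned, flagged = [], []
    for r in records:
        row = dict(r)
        # Reconcile inconsistent field values.
        row["sex"] = SEX_CODES[r["sex"].strip().lower()]
        # Identify missing or extreme values and replace them with the median.
        if r["local"] is None or r["local"] > local_max:
            flagged.append(r["id"])
            row["local"] = fill
        # Transform fields: long-distance share of total usage (a ratio).
        total = row["local"] + row["longdist"]
        row["longdist_share"] = row["longdist"] / total if total else 0.0
        cleaned.append(row)
    return cleaned, flagged


if __name__ == "__main__":
    rows, flagged_ids = prepare(RAW)
    print(flagged_ids)
    for row in rows:
        print(row)
```

Whether a flagged value should be filled, dropped or investigated is a project decision; filling with the median is used here only to keep the sketch short.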

Modeling[8]
This is the phase where analysis methods are used to extract information from the data. It involves selecting modeling techniques, generating test designs, and building and then assessing models.

Evaluation[8]
Involves evaluating the data mining results. The key aim is to determine whether there are any critical business issues that have not been sufficiently considered.
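A common test design is to hold back part of the labelled data, build the model on the rest, and assess it on the held-back part. The sketch below assumes a deliberately simple model (predict the majority label for each value of one attribute) and a hypothetical, perfectly separable data set; the names `train_test_split`, `fit_one_attribute` and `accuracy` are illustrative, not part of any IBM tool.

```python
import random
from collections import Counter, defaultdict

# Hypothetical labelled rows: ({predictor: value}, label).
ROWS = (
    [({"bill_type": "Budget"}, "Vol")] * 20
    + [({"bill_type": "FreeLocal"}, "Current")] * 20
)


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_one_attribute(train, attr):
    """Build a model mapping each value of attr to its majority training label."""
    votes = defaultdict(Counter)
    for features, label in train:
        votes[features[attr]][label] += 1
    # Fallback for values never seen in training: the overall majority label.
    default = Counter(label for _, label in train).most_common(1)[0][0]
    table = {value: c.most_common(1)[0][0] for value, c in votes.items()}
    return lambda features: table.get(features[attr], default)


def accuracy(model, rows):
    """Fraction of rows whose predicted label equals the recorded label."""
    return sum(model(features) == label for features, label in rows) / len(rows)


if __name__ == "__main__":
    train, test = train_test_split(ROWS)
    model = fit_one_attribute(train, "bill_type")
    print(len(train), len(test), accuracy(model, test))
```

Real data is rarely this clean; the point of the design is that accuracy is measured on rows the model never saw while it was being built.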

Deployment[8]

This phase can be as simple as generating a report or as complex as implementing a repeatable data-mining process. It is important to make predictions with the model against new data.

Components of Predictive Analytics[9]

Training Phase
• Predictor Extraction
• Predictor Selection
• Modeling
• Testing

Components under CRISP-DM[8]

Components of Predictive Analytics[9]

Deployment Phase
• Predictor Extraction
• Model
• Prediction
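The training-phase and deployment-phase components can be read as two small pipelines that share the same predictor-extraction step. The following is a minimal sketch under stated assumptions: the records, the three predictors and the one-predictor model are hypothetical, chosen only to make each component visible, and the testing component is left out for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical training records and their known labels.
RECORDS = [
    {"children": 2, "sex": "F", "age": 57},
    {"children": 0, "sex": "M", "age": 34},
    {"children": 2, "sex": "M", "age": 60},
    {"children": 0, "sex": "M", "age": 52},
    {"children": 1, "sex": "F", "age": 76},
    {"children": 0, "sex": "F", "age": 30},
]
LABELS = ["Vol", "Current", "Vol", "Current", "Vol", "Current"]


def extract_predictors(record):
    """Predictor extraction: turn a raw record into named predictors."""
    return {
        "has_children": record["children"] > 0,
        "sex": record["sex"],
        "senior": record["age"] >= 60,
    }


def _label_counts(rows, labels, name):
    """Count labels for each value of the predictor called name."""
    counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        counts[row[name]][label] += 1
    return counts


def one_rule_accuracy(rows, labels, name):
    """Training accuracy when predicting the majority label per predictor value."""
    counts = _label_counts(rows, labels, name)
    return sum(c.most_common(1)[0][1] for c in counts.values()) / len(rows)


def train(records, labels):
    """Training phase: extraction, selection, then modeling."""
    rows = [extract_predictors(r) for r in records]
    # Predictor selection: keep the single predictor with the best training accuracy.
    best = max(sorted(rows[0]), key=lambda n: one_rule_accuracy(rows, labels, n))
    # Modeling: a lookup table from predictor value to majority label.
    table = {
        value: c.most_common(1)[0][0]
        for value, c in _label_counts(rows, labels, best).items()
    }
    fallback = Counter(labels).most_common(1)[0][0]
    return {"predictor": best, "table": table, "fallback": fallback}


def predict(model, record):
    """Deployment phase: the same extraction, then the model, then a prediction."""
    row = extract_predictors(record)
    return model["table"].get(row[model["predictor"]], model["fallback"])


if __name__ == "__main__":
    model = train(RECORDS, LABELS)
    print(model["predictor"], predict(model, {"children": 3, "sex": "M", "age": 40}))
```

Keeping extraction in one function used by both phases avoids a model being trained on predictors computed one way and scored on predictors computed another way.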

Data Mining and Machine Learning [1]
The amount of data in the world and in our lives seems to go on increasing.

Lying hidden in all this data is information, potentially useful information, that is rarely made explicit or taken advantage of.

The greater the volume of data, the harder it is for humans to understand.

Data Mining and Machine Learning [1]
Data mining is about solving problems by finding patterns in data that is already present.

Useful patterns allow predictions to be made on new data.

Data Mining and Machine Learning [1]
What are the characteristics of customers who will stay with the Telephone Company as subscribers?

What do you need to answer this question?

Surveys?
Focus Group Discussions?
Or do you use actual data?

Data Mining and Machine Learning [1]

There will be millions of records to learn from, as there are millions of subscribers!

Use data mining!!!

Technique        Usage                                      Algorithms
Classification   Used to predict group membership           Auto Classifiers, Decision Trees,
(or prediction)  (e.g., will this employee leave?) or a     Logistic, SVM, Time Series, etc.
                 number (e.g., how many widgets will I
                 sell?)
Segmentation     Used to classify data points into groups   Auto Clustering, K-means, etc.
                 that are internally homogeneous and
                 externally heterogeneous.
                 Identify cases that are unusual            Anomaly detection
Association      Used to find events that occur together    APRIORI, Carma, Sequence
                 or in a sequence (e.g., market basket)

Use Predictive Analytics!!!

Association model

Goal:
 Identify what products are being sold together
Approach:
 Use a data extract from a transactional system
 Define which fields to use
 Visualize relationship between products
 Generate association model
 Review results
Why?
 Identify next likely purchase
 Create bundles to increase $ value

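To make "products sold together" concrete, the sketch below counts how often pairs of products share a basket and reports support and confidence for each direction of the pair. It is a brute-force pair count, not the APRIORI or Carma algorithms named elsewhere in the course, and the baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, one list of products per basket.
BASKETS = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
    ["bread", "butter", "jam"],
]


def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both products
        if support < min_support:
            continue
        # Confidence: of the baskets with the antecedent, the share that also
        # contain the consequent.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first; ties broken by support, then by name.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


if __name__ == "__main__":
    for antecedent, consequent, support, confidence in pair_rules(BASKETS):
        print(f"{antecedent} -> {consequent}: "
              f"support={support:.2f} confidence={confidence:.2f}")
```

A high-confidence rule such as "jam -> butter" is the kind of result that feeds the "next likely purchase" and bundling uses listed on the slide.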

The importance of text

Because people communicate with words, not numbers, it has become critical to be able to mine text for its meaning and to sort, analyse, and understand it in the same way that data has been tamed. In fact, the two basic types of information complement each other, with data supplying the “what” and text supplying the “why”.
Source IDC: “Text Analytics: Software’s Missing Piece?”

Classification model

Goal:
 Identify who is likely to cancel their contract
Approach:
 Use a data extract from a CRM
 Use open ended comments from call center
 Extract concepts from the text
 Define which fields to use
 Choose the modeling technique
 Automatically generate a model to identify who has cancelled
 Review results
Why?
 Identify customers at risk before they churn
 Unstructured data can provide insight into customers' actions and improve model accuracy
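The "extract concepts from the text" step turns free-text comments into fields that can sit beside the CRM fields as predictors. The sketch below is a keyword-matching stand-in, assuming a hypothetical concept dictionary; it is not how IBM SPSS Modeler's text analytics works internally, and the field names are invented for illustration.

```python
import re

# Hypothetical concept dictionary: concept name -> trigger words.
CONCEPTS = {
    "billing": {"bill", "charge", "charged", "overcharged", "invoice"},
    "coverage": {"signal", "coverage", "dropped"},
    "cancel_intent": {"cancel", "switch", "leave"},
}


def extract_concepts(comment):
    """Return one True/False flag per concept found in the comment."""
    words = set(re.findall(r"[a-z]+", comment.lower()))
    return {name: bool(words & triggers) for name, triggers in CONCEPTS.items()}


def build_row(crm_fields, comment):
    """Combine structured CRM fields with the concept flags from the text."""
    row = dict(crm_fields)
    row.update(extract_concepts(comment))
    return row


if __name__ == "__main__":
    crm = {"id": 42, "pay_method": "CH"}
    print(build_row(crm, "I was overcharged again and I want to cancel."))
```

The resulting row carries both the "what" (structured fields) and a rough version of the "why" (concept flags), which is the complement described in the IDC quotation.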

Data mining and text analytics

Data mining
 Use advanced analytical techniques on data
 Discover key relationships between variables
 Model effect of variables on outcomes
 Determine influence on outcomes
 Predict outcomes
 Apply models to new data

Text analytics
 Extract, analyze and create structure for unstructured data
 Integrate analysis results into operational systems
 Integrate analysis results into Business Intelligence applications
 Integrate analysis results with structured data and use as input for Data Mining
 Improves model accuracy

Deployment

Goal:
 Deploy a predictive model
Approach:
 Use the stream generated in the earlier session
 Pass new data through the stream and "score" the data
 Identify those likely to cancel
 Export an .xls file with the 50 most likely to cancel
Why?
 Extend the reach of analytics in an organization
 Allows analytics at the point of impact rather than being reactive
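The deployment steps (score new data, identify those most likely to cancel, export the top 50) can be sketched outside Modeler as follows. Everything here is an assumption made for illustration: the `score` function is a hypothetical stand-in for the trained stream, the field names are invented, and the export is written as CSV with the standard library rather than as the .xls file mentioned on the slide.

```python
import csv
import io


def score(record):
    """Hypothetical stand-in for the trained stream: a churn score up to 1."""
    value = 0.2
    if record["dropped_calls"] > 0:
        value += 0.4
    if record["complained"]:
        value += 0.3
    return min(value, 1.0)


def top_at_risk(records, n=50):
    """Score every record and return the n highest (score, id) pairs."""
    scored = [(score(r), r["id"]) for r in records]
    # Highest score first; ties broken by id so the result is reproducible.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:n]


def export_csv(rows, handle):
    """Write the selected subscribers to an open text handle as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["id", "churn_score"])
    for churn_score, record_id in rows:
        writer.writerow([record_id, f"{churn_score:.2f}"])


if __name__ == "__main__":
    # Hypothetical new data that has not been seen by the model.
    new_data = [
        {"id": i, "dropped_calls": i % 3, "complained": i % 2 == 0}
        for i in range(120)
    ]
    buffer = io.StringIO()
    export_csv(top_at_risk(new_data), buffer)
    print(buffer.getvalue().splitlines()[:3])
```

Writing to a handle rather than a fixed file name keeps the export step easy to redirect to a file, a report or another system.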

Machine Learning

Machine Learning [1]

What is Learning?
• To get knowledge of by study, experience, or being taught
• To become aware by information or from observation
• To commit to memory
• To be informed of, ascertain
• To receive instruction
• Pertains to algorithms

Data Mining

Data Mining [1]

A practical topic that involves learning in a practical, not a theoretical, sense

Application of the algorithms in Machine Learning

Process of knowledge discovery

Describing Structural Patterns [1]

[Figure: a data table annotated with the following terms]
• Input columns: Predictors, Attributes, Features, Inputs
• Output column: Label, Target, Class, Outputs
• Rows: Learning Instances, Records

Witten, I. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

Data Mining and Machine Learning [1]

Results are highly dependent on the learning instances or records

Categories (Target, Class or Labels) must be evenly distributed in the data

Sample Problem
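Whether the classes are evenly distributed is easy to check before any modeling. The sketch below, offered as one possible check rather than a prescribed one, uses the subscription STATUS values of the twelve subscribers in the earlier sample table; the function names are illustrative.

```python
from collections import Counter

# Subscription STATUS of the twelve subscribers in the earlier sample table.
LABELS = [
    "Vol", "InVol", "Vol", "Current", "Vol", "InVol",
    "Vol", "Current", "Vol", "Vol", "Vol", "Vol",
]


def class_distribution(labels):
    """Return the share of instances that carry each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


if __name__ == "__main__":
    print(class_distribution(LABELS))
    print(imbalance_ratio(LABELS))
```

In this sample, Vol accounts for eight of the twelve instances, so the classes are far from evenly distributed.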

Tennis Problem [1]

[Figure: the tennis data table, annotated with its Predictors, its Target (Class or Label), its 14 Learning Instances, and the Rules derived from it]

What are the rules for deciding whether to play tennis or not?

Witten, I. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.
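The figure for this slide is not reproduced in the text. As a hedged illustration, the sketch below types in the 14-instance weather data set as it appears in the cited book [1] and applies an ordered list of rules of the kind the slide asks for; the data and rules should be checked against the book, and the function name `play_rule` is illustrative.

```python
# (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def play_rule(outlook, temperature, humidity, windy):
    """Ordered rules: the first rule that matches decides the outcome."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above


if __name__ == "__main__":
    errors = [row for row in WEATHER if play_rule(*row[:4]) != row[4]]
    # An empty list means the rules reproduce all 14 learning instances.
    print(errors)
```

Here the four predictor values of each tuple are the predictors, the last value is the target (class or label), and each tuple is one learning instance.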

Contact Lens Problem [1]

[Figure: Contact Lens Rules]

[Figure: Contact Lens Decision Tree]

Witten, I. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

What is an attribute? [1]

Each individual, independent instance that provides the input to machine learning is characterized by its values on a fixed, predefined set of features, or attributes

What is a class or target? [1]

Label assigned to a set of features that classifies the learning instance

Witten, I. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

Tennis Problem [1]

[Figure: the tennis data and its Rules]

Identify the different components

Witten, I. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

Supervised Learning
All learning instances are properly labelled with a class, category or target that can be used to classify the set of features

Unsupervised Learning
Learning instances are not labelled
Machine learning must be used to cluster similar instances
An expert will have to characterize the clusters later

Machine Learning VS Statistics

Machine Learning VS Statistics [1]

Machine Learning: More concerned with formulating the process of generalization as a search through possible hypotheses

Statistics: More concerned with testing hypotheses

Data Mining and Ethics

Data Mining and Ethics [1]

People who use data mining must act responsibly by making themselves aware of the ethical issues that surround their particular applications

Concepts, Instances, and Attributes

Getting to know your data

Getting to know your data [1]

Graphical visualizations of data make it easy to identify outliers

[Figure: a plot with the Outliers marked]

Predictive Analytics Model

Data + Machine Learning Algorithm = Model

A model is an approximation of the real world.

A model can be a set of rules or an equation that represents the real world.
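A box plot marks as outliers the points that fall beyond "fences" placed 1.5 interquartile ranges outside the quartiles. The same check can be done numerically; the sketch below assumes that convention and uses hypothetical values, and the function name `find_outliers` is illustrative.

```python
from statistics import quantiles

# Hypothetical measurements with one value far from the rest.
VALUES = [12, 14, 15, 15, 16, 18, 19, 95]


def find_outliers(values, k=1.5):
    """Return the values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first quartile, median, third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    print(find_outliers(VALUES))
```

A flagged value is a prompt to look at the record, not a verdict that it is wrong; whether to keep, correct or drop it belongs to data preparation.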


Check-up
How are data mining, machine learning and predictive analytics related?

How can predictive analytics be used in business?

What is the most important thing in predictive analytics?


For the Next Session

References
[1] Witten, I. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.
