
Analytics & Data Mining

Overview and General Concepts

Motivation: Necessity is the Mother of Invention
Data explosion problem
Automated data collection tools and mature database
technology lead to tremendous amounts of data stored in
databases, data warehouses and other data repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and Knowledge Discovery
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases

Predictive versus Retrospective Models


Traditional Analysis tools are Retrospective
Data Mining tools are Predictive

Print out last month's expenses by cost centre
Predict and explain next month's demand

List the biggest spenders from the last mailing campaign
Define a concentrated micro-market to reduce future mailing costs and improve success rates

Explain why some customers defect to our competitors
Find some new patterns of customer behaviour we are not aware of yet

Using this model, tell me how well it describes last year's cancellations of our contracts

What is Data Mining?


Search for relationships and global patterns that exist in
large databases but are hidden in the vast amounts of data.
Analyst combines knowledge of data and machine learning
technologies to discover nuggets of knowledge hidden in
the data.
Easier and more effective when the organization has
accumulated as much data as possible, such as with a data
warehouse
A data warehouse is not a prerequisite to data mining

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the centre, drawing on the surrounding disciplines]
Database Systems
Machine Learning
Algorithms
Statistics
Visualization
Other Disciplines

Potential Applications
Database analysis and decision support
Market analysis
target marketing, customer relationship management, market
basket analysis, cross selling, market segmentation

Risk analysis
Forecasting, customer retention, improved underwriting, quality
control, competitive analysis

Fraud detection and management

Other Applications
Text mining (news group, email, documents) and Web analysis.

Intelligent query answering

Market Analysis
A lot of data is available for analysis
Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies

Customer profiling
data mining can tell you what types of customers buy what
products (clustering or classification)

Identifying customer requirements


identifying the best products for different customers
use prediction to find what factors will attract new
customers

Market Analysis
Target marketing
Find clusters of model customers who share the same
characteristics: interest, income level, spending habits, etc.

Determine customer purchasing patterns over time


Conversion of single to a joint bank account: marriage(?),
etc.

Cross-market analysis
Associations/correlations between product sales

Prediction based on the association information

Financial Applications
Finance planning and loan application evaluation
evaluation of loan applications
cross-sectional and time series analysis (trend analysis
etc.)

Credit Risk Modeling


Estimating risk in credit portfolios
Determining credit limits
Risk based pricing of products

Fraud Detection
Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.

Approach
use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances

Examples
auto insurance: detect a group of people who stage
accidents to collect on insurance
money laundering: detect suspicious money transactions
medical insurance: detect professional patients, rings of
doctors and rings of references

Other Applications
Sports
IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain competitive
advantage for New York Knicks and Miami Heat

Astronomy
The Palomar Observatory discovered 22 quasars with the
help of data mining

Internet Web Surf-Aid


IBM Surf-Aid applies data mining algorithms to Web
access logs for market-related pages to discover customer
preference and for analyzing effectiveness of Web
marketing, improving Web site organization, etc.

Knowledge Discovery in Databases

Knowledge Discovery in Databases (KDD) is the
nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data.

Knowledge Discovery and Data Mining


[Figure: the KDD process, from data sources to results]
Databases → Data Cleaning and Data Integration → Data Warehouse →
Selection and Reduction → Task-relevant Data → Data Mining →
Pattern Evaluation

KDD and Decisions


[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
Use Knowledge: Decision Making
Evaluation & Presentation: Visualization Techniques
Data Mining: Data Mining Algorithms
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: Manager, Business Analyst, Data Analyst, DBA

Data Mining
Data Mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories,
by using pattern recognition technologies as well as
statistical and mathematical techniques.

Data Mining Techniques

Memory Based Reasoning (MBR)


[Figure: Scatter Graph of Movies by Age and Income Group. x-axis: Age (25 to 50); y-axis: income group (1 to 4); points labelled by movie seen: Independence Day, Courage Under Fire, Birdcage, Nutty Professor]

MBR: Nearest Neighbour


[Figure: the same scatter graph (Age 25 to 50 against income group 1 to 4) with a new data point and its nearest neighbours marked]
All the data points closest to this point saw Independence Day
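
The nearest-neighbour idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the viewer records, the function name predict_movie, the choice of k and the age scaling factor are all invented here and do not come from the slides.

```python
from collections import Counter
from math import dist

# Hypothetical viewers: (age, income group, movie seen). Invented for illustration.
viewers = [
    (26, 1, "Nutty Professor"),
    (28, 2, "Birdcage"),
    (33, 2, "Birdcage"),
    (36, 3, "Independence Day"),
    (38, 4, "Independence Day"),
    (40, 3, "Independence Day"),
    (46, 2, "Courage Under Fire"),
    (48, 3, "Courage Under Fire"),
]

def predict_movie(age, income_group, k=3, age_scale=5.0):
    """Predict a movie by majority vote among the k nearest known viewers.

    Age is divided by age_scale (an assumed value) so that a few years of
    age do not swamp a one-step difference in income group.
    """
    query = (age / age_scale, income_group)
    ranked = sorted(
        viewers,
        key=lambda v: dist(query, (v[0] / age_scale, v[1])),
    )
    votes = Counter(movie for _, _, movie in ranked[:k])
    return votes.most_common(1)[0][0]

print(predict_movie(38, 3))
```

With these invented records, the three viewers closest to a 38-year-old in income group 3 all saw Independence Day, so that is the prediction. The result depends on how the two axes are scaled, which is a design decision the analyst has to make.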

Link Analysis

Link Analysis: Identifying Fax Numbers

Market Basket Analysis

Association Rules
Usually applied to market baskets but other
applications are possible
Useful Rules contain novel and actionable
information: e.g. On Thursdays grocery customers are
likely to buy diapers and beer together
Trivial Rules contain already known information: e.g.
People who buy maintenance agreements are the ones
who have also bought large appliances
Some novel rules may not be useful: e.g. New
hardware stores most commonly sell toilet rings

What Is Association Mining?


Association rule mining:
Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.

Applications:
Basket data analysis, cross-marketing, catalog design, loss-leader
analysis, clustering, classification, etc.

Examples.
Rule form: Body → Head [support, confidence].
buys(x, diapers) → buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) → grade(x, A) [1%, 75%]
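
The two numbers in brackets can be computed directly from transaction data. A minimal sketch, assuming the usual definitions (support is the fraction of transactions containing both body and head; confidence is the fraction of transactions containing the body that also contain the head). The baskets below are hypothetical and much smaller than any real transaction database.

```python
# Hypothetical market baskets, invented for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(body, head):
    """Of the baskets containing body, the fraction that also contain head."""
    return support(body | head) / support(body)

rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
print(f"diapers -> beer [{rule_support:.0%}, {rule_confidence:.0%}]")
```

Here two of the five baskets contain both items, and two of the three diaper baskets also contain beer, so the rule has 40% support and about 67% confidence.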

Uniforms for the Army


The army has many sets of uniforms: ceremonial, general
parade, daily training / PT, sports, camouflage (land, jungle,
mountains, snow), specialized (field guns, tanks, missiles) etc.
Each uniform set has several elements: helmet, belt, trousers,
socks, shoes / boots, shirts etc.
Each element has several sizes: belt (S / M / L), trousers
(various waist-size and leg-length combinations) etc.
Some elements have different sizes / cuts for men and women.
This requires the army stores to carry an inventory of all these
items. Also, army units have to be prepared to move at short
notice. Whenever a unit moves, a lot of stores / inventory
needs to be moved, requiring both effort and cost.
How many of each size does the army need to procure and
stock so that army personnel are smartly kitted out and yet we
can carry minimum inventory?

Cluster Detection

Cluster Detection
It is the task of discovering structure in data by
segmenting the heterogeneous population
Finding an optimal partition involves the
evaluation of a finite number of possible
partitions that can be formed for a given
number of objects n and a given number of
required clusters m
Computationally complex: even for small
problem sizes (e.g. n = 25, m = 5), the
number of possible partitions evaluates to:
2,436,684,974,110,751
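
The number of ways to split n objects into m non-empty clusters is the Stirling number of the second kind. A minimal sketch of the count, using the standard inclusion-exclusion formula; the function name stirling2 is chosen here for illustration:

```python
from math import comb, factorial

def stirling2(n, m):
    """Number of ways to partition n objects into m non-empty clusters.

    Inclusion-exclusion: S(n, m) = (1 / m!) * sum_j (-1)^j * C(m, j) * (m - j)^n
    """
    total = sum((-1) ** j * comb(m, j) * (m - j) ** n for j in range(m + 1))
    return total // factorial(m)

print(f"{stirling2(25, 5):,}")
```

For 25 objects and 5 clusters this gives 2,436,684,974,110,751, which is why exhaustive evaluation is impractical and heuristics are used instead.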

Heuristic Solutions to Clustering


Hierarchical Algorithms: partition the data into a
nested sequence of partitions. There are two
approaches:
Agglomerative algorithms: start with n clusters (where n is the
number of objects), and iteratively merge pairs of clusters
Divisive algorithms: start by considering all the objects to be in
one cluster and iteratively split one cluster into two at each step
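
The agglomerative approach can be sketched as follows. This is an illustrative sketch rather than a production algorithm: it uses single linkage (distance between the closest members of two clusters), which is one choice among several, and the waist sizes are hypothetical values loosely inspired by the uniform-sizing problem.

```python
def agglomerative(values, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    clusters = [[v] for v in values]  # start with one cluster per object

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > target_clusters:
        best = None  # (distance, index i, index j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into i
        del clusters[j]
    return clusters

# Hypothetical waist sizes in inches, invented for illustration.
waists = [28, 29, 30, 36, 37, 44, 45, 46]
print(agglomerative(waists, 3))
```

On this data the merges end with three groups, [28, 29, 30], [36, 37] and [44, 45, 46], which could be read as small, medium and large. Stopping at a different number of clusters gives the other levels of the nested sequence.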

Classification
[Figure: decision tree that predicts which movie a person saw (Independence Day, Courage Under Fire, Birdcage or Nutty Professor) by branching on income group (1 to 4) and on age thresholds (27 and 41)]

Classification: A Three-Step Process


Model construction: describing a set of predetermined classes
Each example is assumed to belong to a predefined class, as
determined by the class label attribute
The set of examples used for model construction: training set
The model is represented as classification rules, decision trees, or
mathematical formulae

Model Testing: Estimate accuracy of the model


The known label of test sample is compared with the classified
result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
The test set must be independent of the training set; otherwise the
accuracy estimate is over-optimistic and over-fitting goes undetected

Model usage: classifying future or unknown objects

Classification Process: Model Construction

Training Data

NAME   RANK             YEARS   Ind. Chrg.
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Training Data → Classification Algorithms → Classifier (Model)

IF rank = professor
OR years > 6
THEN Ind. Chrg. = yes

Classification Process: Use the Model (Testing & Prediction)

Testing Data → Classifier

NAME      RANK             YEARS   Ind. Chrg.
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data → Classifier
(Jeff, Professor, 4) → Ind. Chrg.?
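
The testing and prediction steps can be sketched by coding the rule from the model construction slide and applying it to the testing data. The rule and the records are taken from the slides; the function and variable names are chosen here for illustration.

```python
def predict(rank, years):
    """Classifier from the slides: IF rank = professor OR years > 6 THEN yes."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, known label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy rate: share of test samples whose predicted label matches the known one.
correct = sum(
    predict(rank, years) == label for _, rank, years, label in testing_data
)
accuracy = correct / len(testing_data)
print(f"accuracy on testing data: {accuracy:.0%}")

# Unseen data: (Jeff, Professor, 4)
print("Jeff:", predict("Professor", 4))
```

The rule misclassifies Merlisa (7 years, but labelled no), so accuracy on the testing data is 75%, even though the same rule fits all six training records. For Jeff the rule answers yes, because his rank is Professor.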

Real Estate Appraisal by Neural Networks