
Data Warehouse Fundamentals

Chapter 9

Data Mining Basics


Instructor: Paul Chen
Topics
1. How Data Mining Evolved?
2. Decision Processing Overview and Tasks
3. Data Mining, What Is It?
4. Data Mining vs. Data Warehousing
5. How Data Mining Works? And Its Applications
6. Data Mining Operations and Associated Techniques
7. The Data Mining Process
8. Data Mining Tools
9. Data Mining Techniques- A Summary
Topic 1: How Data Mining Evolved?

Many businesses have invested heavily in information
technology to help them manage their businesses more
effectively and gain a competitive edge. Increasingly large
amounts of critical business data are being stored
electronically, and this volume is expected to continue to
grow. Data mining technology is helping companies
leverage their existing data more effectively and obtain
insightful information, giving them a competitive edge.
How Data Mining Evolved?

Time Line
1960s: Data Collection
1970s-80s: RDBMS
1990s: OLAP and DW
Late 1990s to Now: Data Mining
Topic 2: Decision Processing
Overview
Decision processing systems, and their underlying
analytical applications, provide business users with the
information they need to track and analyze business
trends, and to explore new business opportunities. As
businesses become increasingly competitive and
complex, effective decision processing systems are
essential for success.
The Next Generation of Business
Intelligence
A decision processing system analyzes business
information captured from operational systems (back-office,
front-office, and e-business applications).
Distribution of business information to business users
is via corporate intranets and extranets.
The flow of data can be thought of as an information
supply chain whose objective is to convert operational
data into useful business information.
The Decision Processing Information Supply Chain
[Figure: information supply chain. Labeled components: Operational
Systems (E-Business Applications, Back-Office Transaction Applications,
Front-Office Applications); External Data; Information Staging Area;
DW & Business Intelligence Tools; Analytic Applications; Collaborative
Office Systems; Business Metrics; Business Decisions.]
Decision Processing: Four Tasks***

Extracting and transforming information

This involves capturing data from operational systems,
transforming it into business information, and loading
it into a data warehouse information store.

Current extract templates on the market are aimed primarily at
capturing data from ERP (Enterprise Resource Planning)
transaction processing systems (for example, SAP Business
Information Warehouse and PeopleSoft BPM data warehouse).

*** Mentioned in chapter 2


Decision Processing: Four Tasks
(Cont'd)

Managing information

This task encompasses the maintenance of business
information in information stores, and how these
information stores are processed by business intelligence
tools and analytic applications.
The cornerstone of decision processing is data
warehousing, and warehouse information stores should
be organized and modeled into relational and
multidimensional database products.
Decision Processing: Four Tasks
(Cont'd)
Analyzing and modeling information
The traditional approach to decision
processing is to build a data warehouse
and supply business users with a set of
business intelligence tools (query,
reporting, OLAP and data mining, for
example) to process information in data
warehouse information stores.
A better approach is to employ turn-key and
web-based analytic application packages
that are designed to provide
comprehensive analyses for the business
area being researched. Key business
metrics (e.g., revenue dollars per sales rep
per day) are useful.
Decision Processing: Four Tasks
(Cont'd)
Distributing information

Business intelligence tools and analytic applications distribute
information and the results of analysis operations to business
users via standard graphical and Web interfaces.
To help users uncover and organize this range of business
information, an enterprise information portal (EIP) is required.
An EIP provides a single point of entry to any piece of
business information, no matter where it resides.
The main components of an EIP are an information assistant
(Web browser interface), an information directory, and a
subscription facility.
Decision Making Under Risk

Decisions are made under three sets of conditions:

Certainty
The decision makers know everything in advance
of making the decision

Uncertainty
The decision makers know nothing about the
probabilities or the consequences of decisions

Risk
Decision-Making Style
Decision-making styles of users are categorized as
either
Analytic or
Heuristic
Analytic and Heuristic Decision Making

Analytical Decision Maker
Learns by analyzing
Uses step-by-step procedure
Values quantitative information and models
Builds mathematical models and algorithms
Seeks optimal solution

Heuristic Decision Maker
Learns by acting
Uses trial and error
Values experiences
Relies on common sense
Seeks completely satisfying solution
Topic 3: Data Mining, What Is It?

Data Mining has been defined as a decision support
process in which a search is made for patterns of
information in data. To detect patterns in data, Data
Mining uses sophisticated statistical analysis and modeling
technologies to uncover useful relationships hidden in
databases. It predicts future trends and behaviors,
allowing businesses to make predictive, knowledge-driven
decisions.
Data Mining, What Is It?
The process of extracting valid, previously unknown,
comprehensible, and actionable information from large
databases and using it to make crucial business
decisions (Simoudis, 1996).

Involves analysis of data and use of software techniques
for finding hidden and unexpected patterns and
relationships in sets of data.
Data Mining, What Is It?
Reveals information that is hidden and unexpected, as
there is little value in finding patterns and relationships
that are already intuitive.

Patterns and relationships are identified by examining
the underlying rules and features in the data.

Tends to work from the data up, and the most accurate
results normally require large volumes of data to
deliver reliable conclusions.
Data Mining, What Is It?
Starts by developing an optimal representation of the
structure of sample data, during which time knowledge
is acquired and extended to larger sets of data.

Data mining can provide huge paybacks for companies
that have made a significant investment in data
warehousing.

A relatively new technology, but already used in a
number of industries.
Topic 4: Data Mining vs. Data
Warehousing
Data Mining does not require that a Data Warehouse be
built. Often, data can be downloaded from the operational
files to flat files that contain the data ready for the data
mining analysis.

Data Mining can be implemented rapidly on existing
software and hardware platforms. Data Mining tools can
analyze massive databases to deliver answers to questions
such as, "Which customers are most likely to respond to
my next promotional mailing, and why?"
Data Mining vs. Data
Warehousing
A major challenge in exploiting data mining is identifying suitable
data to mine.

Data mining requires a single, separate, clean, integrated, and
self-consistent source of data.

A data warehouse is well equipped for providing data for mining.

Data quality and consistency are a prerequisite for mining to
ensure the accuracy of the predictive models. Data warehouses are
populated with clean, consistent data.
Data Mining vs. Data
Warehousing
It is advantageous to mine data from multiple sources to discover as
many interrelationships as possible. Data warehouses contain data
from a number of sources.

Selecting relevant subsets of records and fields for data mining
requires the query capabilities of the data warehouse.

Results of a data mining study are useful if there is some way to
further investigate the uncovered patterns. Data warehouses
provide the capability to go back to the data source.
Topic 5: How Data Mining
Works?
How exactly is Data Mining able to tell you important
things that you didn't know, or what is going to happen
next? The technique used in Data Mining is called Predictive
Modeling, which in a broad sense is a knowledge discovery
process that works via relationships and patterns.

Modeling is the act of building a model in one situation
where you know the answer and then applying it to another
situation where you don't.
Examples of Applications of Data
Mining via relationships and patterns
Retail / Marketing
Identifying buying patterns of customers
Finding associations among customer demographic
characteristics
Predicting response to mailing campaigns
Market basket analysis
Examples of Applications of Data
Mining via relationships and patterns
Banking
Detecting patterns of fraudulent credit card use
Identifying loyal customers
Predicting customers likely to change their credit
card affiliation
Determining credit card spending by customer
groups
Examples of Applications of Data
Mining via relationships and patterns
Insurance
Claims analysis
Predicting which customers will buy new policies.

Medicine
Characterizing patient behaviour to predict surgery
visits
Identifying successful medical therapies for
different illnesses.
Examples of Applications of Data
Mining via relationships and patterns
Customer profiling: characteristics of good customers are
identified with the goals of predicting who will become
one and helping marketers target new prospects.

Targeting specific marketing promotions to existing and
potential customers offers similar benefits.

Market-basket analysis: With Data Mining, companies can
determine which products to stock in which stores, and
even how to place them within a store.
Examples of Applications of Data
Mining via relationships and patterns
Customer Relationship Management: By determining the
characteristics of customers who are likely to leave for a
competitor, a company can take action to retain those
customers, because doing so is usually far less expensive
than acquiring a new customer.

Fraud detection: With Data Mining, companies can
identify potentially fraudulent transactions before they
happen.
Topic 6: Data Mining Operations
and Associated Techniques

In previous foils, predictive modeling in essence includes
other operations shown in the above table.
Descriptive: The dealer sold 200 cars last month.
Operational (OLTP)

Explanatory: For every 1% increase in the interest rate,
auto sales decrease by 5%.
Traditional DW, OLAP

Predictive: Predictions about future buyer behavior.
Data Mining
Level of Modeling vs. Level of Analytical Processing

Descriptive
Simple queries and reports
Normalized tables

Explanatory
What-if processing: analyze what has previously occurred to
bring about the current state of the data
Denormalized tables
Roll-up; drill down

Predictive
Determine if any patterns exist by reviewing data relationships
+ Statistical analysis / artificial intelligence
Classification and value prediction
Predictive Modelling

Similar to the human learning experience:
uses observations to form a model of the important
characteristics of some phenomenon.

Uses generalizations of the real world and the ability to fit
new data into a general framework.

Can analyze a database to determine essential
characteristics (model) about the data set.
Predictive Modelling

Model is developed using a supervised learning
approach, which has two phases: training and testing.

Training builds a model using a large sample of
historical data called a training set.
Testing involves trying out the model on new,
previously unseen data to determine its accuracy
and physical performance characteristics.
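
The two phases can be sketched in a few lines. This is a minimal illustration, not a production method: the data is synthetic, the "model" is a single learned cut-off, and the names `train_threshold` and `accuracy` and the 150/50 split are assumptions made for the example.

```python
import random

def train_threshold(training_set):
    """Training: learn the cut-off that best separates the two classes."""
    best_cut, best_hits = None, -1
    for cut, _ in training_set:
        hits = sum((x > cut) == label for x, label in training_set)
        if hits > best_hits:
            best_cut, best_hits = cut, hits
    return best_cut

def accuracy(cut, records):
    """Testing: fraction of records the cut-off rule labels correctly."""
    return sum((x > cut) == label for x, label in records) / len(records)

# Hypothetical historical data: (years as customer, stayed with the company?)
random.seed(0)
history = [(y, y > 3) for y in (random.uniform(0, 10) for _ in range(200))]

# Hold back records the model never sees during training
training_set, test_set = history[:150], history[150:]
model = train_threshold(training_set)
print(f"cut-off={model:.2f}, test accuracy={accuracy(model, test_set):.2%}")
```

Accuracy on the held-out records, not on the training set, is the figure that indicates how the model is likely to behave on new data.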
Predictive Modelling

Applications of predictive modelling include customer
retention management, credit approval, cross selling,
and direct marketing.

Two techniques associated with predictive modelling:
A. classification
B. value prediction
They are distinguished by the nature of the
variable being predicted.
Statistical Analysis of Actual Sales (dollars
and quantities) Relative to These Signage
Variables: A Predictive Modeling Example
Signage variables: Content, Frequency, Depth, Focus, Scale,
Length, Location

Statistical analysis: correlation, regression, experiment design,
optimization. Now it goes into real-time analysis.
PREDICTIVE MODELING

There are two techniques associated with predictive
modeling: classification and value prediction, which are
distinguished by the nature of the variable being
predicted.
Predictive Modelling - Classification

Used to establish a specific predetermined class for
each record in a database from a finite set of possible
class values.

Two specializations of classification: tree induction and
neural induction.
Example of Classification using
Tree Induction
Customer renting property > 2 years?
  No: Rent property
  Yes: Customer age > 45?
    No: Rent property
    Yes: Buy property
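
The tree in this example can be read as a pair of nested tests. The sketch below hand-codes that reading; the function name `classify_customer` and its parameter names are illustrative, and a tree induction tool would derive the tests from data rather than have them typed in.

```python
def classify_customer(years_renting: float, age: int) -> str:
    """Walk the two-level decision tree from root to leaf."""
    if years_renting <= 2:      # root test: renting for more than 2 years?
        return "Rent property"
    if age <= 45:               # second test: customer older than 45?
        return "Rent property"
    return "Buy property"

print(classify_customer(years_renting=3, age=50))   # Buy property
print(classify_customer(years_renting=3, age=30))   # Rent property
```

Each path from the root to a leaf corresponds to one classification rule.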
Example of Classification using
Neural Induction
Each processing unit (circle) in one layer is connected
to each processing unit in the next layer by a weighted
value, expressing the strength of the relationship. The
network attempts to mirror the way the human brain
works in recognizing patterns by arithmetically
combining all the variables with a given data point.

In this way, it is possible to develop nonlinear
predictive models that learn by studying
combinations of variables and how different
combinations of variables affect different data sets.
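
As a rough sketch of how weighted connections combine the input variables, the code below pushes one record through a tiny network. The sigmoid activation, the bias terms, and every weight value are assumptions chosen for illustration; a real network learns its weights from training data.

```python
from math import exp

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1); this is the non-linear step."""
    return 1 / (1 + exp(-z))

def layer(inputs, weights, biases):
    """Each unit sums its weighted inputs, adds a bias, and applies the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(unit_w, inputs)) + b)
            for unit_w, b in zip(weights, biases)]

# Hypothetical hand-picked weights: two inputs, two hidden units, one output unit
hidden = layer([0.5, 0.8], weights=[[1.0, -2.0], [0.5, 0.5]], biases=[0.0, -0.1])
output = layer(hidden, weights=[[1.5, -1.0]], biases=[0.2])
print(output)
```

The single output value can be read as a score for one class, for example "will buy".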
Predictive Modelling - Value
Prediction
Used to estimate a continuous numeric value that is
associated with a database record.

Uses the traditional statistical techniques of linear
regression and non-linear regression.

Relatively easy to use and understand.
Predictive Modelling - Value
Prediction
Linear regression attempts to fit a straight line through
a plot of the data, such that the line is the best
representation of the average of all observations at that
point in the plot.

The problem is that the technique only works well with
linear data and is sensitive to the presence of outliers
(i.e., data values that do not conform to the expected
norm).
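
The sensitivity to outliers is easy to see with an ordinary least-squares fit. The sketch below uses made-up points; `fit_line` is an illustrative helper, not part of any tool discussed here.

```python
def fit_line(points):
    """Ordinary least squares: return (slope, intercept) of the best-fit line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data lying exactly on y = 2x + 1
clean = [(x, 2 * x + 1) for x in range(10)]
with_outlier = clean[:-1] + [(9, 100)]   # replace the last point with an outlier

print(fit_line(clean))          # slope 2, intercept 1
print(fit_line(with_outlier))   # one bad point pulls the slope well above 2
```

A single deviating record out of ten is enough to change the fitted line substantially.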
Predictive Modelling - Value
Prediction
Although non-linear regression avoids the main
problems of linear regression, it is still not flexible enough
to handle all possible shapes of the data plot.

Statistical measurements are fine for building linear
models that describe predictable data points; however,
most data is not linear in nature.
Predictive Modelling - Value
Prediction
Data mining requires statistical methods that can
accommodate non-linearity, outliers, and non-numeric
data.

Applications of value prediction include credit card
fraud detection and target mailing list identification.
Database Segmentation

Aim is to partition a database into an unknown number
of segments, or clusters, of similar records.

Uses unsupervised learning to discover homogeneous
sub-populations in a database to improve the accuracy
of the profiles.
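
One simple way to find an unknown number of segments is "leader" clustering, sketched below. This is a swapped-in illustration, not the demographic or neural clustering used by commercial tools; the `radius` parameter and the customer records are assumptions.

```python
from math import dist

def segment(records, radius):
    """Leader clustering: a record joins the first segment whose leader is
    within `radius`; otherwise it starts a new segment."""
    segments = []                      # each segment is a list; item 0 is its leader
    for record in records:
        for members in segments:
            if dist(record, members[0]) <= radius:
                members.append(record)
                break
        else:                          # no existing leader is close enough
            segments.append([record])
    return segments

# Hypothetical customers: (age, income in thousands)
customers = [(25, 30), (27, 32), (26, 29), (55, 90), (57, 88), (40, 60)]
for members in segment(customers, radius=5):
    print(members)
```

No class labels are supplied, and the number of segments falls out of the data and the chosen radius.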
Database Segmentation

Less precise than other operations, and thus less sensitive to
redundant and irrelevant features.

Sensitivity can be reduced by ignoring a subset of the
attributes that describe each instance or by assigning a
weighting factor to each variable.

Applications of database segmentation include
customer profiling, direct marketing, and cross selling.
Example of Database Segmentation
using a Scatter plot
Database Segmentation
Associated with demographic or neural clustering
techniques, distinguished by:
Allowable data inputs
Methods used to calculate the distance between
records
Presentation of the resulting segments for analysis.
Example of Database Segmentation
using a Visualization
Link Analysis

Aims to establish links (associations) between records,
or sets of records, in a database.

There are three specializations:
Associations discovery
Sequential pattern discovery
Similar time sequence discovery

Applications include product affinity analysis, direct
marketing, and stock price movement.
Link Analysis - Associations
Discovery
Finds items that imply the presence of other items in
the same event.

Affinities between items are represented by association
rules.
e.g. When a customer rents property for more than 2
years and is more than 25 years old, in 40% of cases
the customer will buy a property. This association happens
in 35% of all customers who rent properties.
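
The two percentages attached to a rule like this are usually called confidence and support. The sketch below computes both under the common definitions (support is the share of all records matching both sides of the rule; confidence is the share of records matching the left side that also match the right). The records are invented and are not meant to reproduce the 40% and 35% figures above.

```python
def rule_metrics(records, body, head):
    """Return (support, confidence) of the rule body -> head."""
    matches_body = [r for r in records if body(r)]
    matches_both = [r for r in matches_body if head(r)]
    support = len(matches_both) / len(records)
    confidence = len(matches_both) / len(matches_body)
    return support, confidence

# Hypothetical renters: (years renting, age, bought a property?)
renters = [(3, 30, True), (4, 41, True), (5, 28, False), (3, 52, False),
           (6, 33, False), (1, 45, False), (2, 23, False), (1, 35, True)]

support, confidence = rule_metrics(
    renters,
    body=lambda r: r[0] > 2 and r[1] > 25,   # rents > 2 years and older than 25
    head=lambda r: r[2],                     # goes on to buy a property
)
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

A rule is normally reported only when both figures exceed thresholds set by the analyst.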
Link Analysis - Sequential Pattern
Discovery
Finds patterns between events such that the presence of
one set of items is followed by another set of items in a
database of events over a period of time.

e.g. Used to understand long-term customer buying
behaviour.
Link Analysis - Similar Time
Sequence Discovery
Finds links between two sets of data that are time-
dependent, and is based on the degree of similarity
between the patterns that both time series demonstrate.
e.g. Within three months of buying property, new
home owners will purchase goods such as cookers,
freezers, and washing machines.
Deviation Detection

Relatively new operation in terms of commercially
available data mining tools.

Often a source of true discovery because it identifies
outliers, which express deviation from some previously
known expectation and norm.
Deviation Detection

Can be performed using statistics and visualization
techniques or as a by-product of data mining.

Applications include fraud detection in the use of credit
cards and insurance claims, quality control, and defects
tracing.
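
A minimal statistical version of deviation detection flags values that lie far from the mean. The threshold of two standard deviations and the spending figures below are assumptions for the sketch; real tools use more robust measures.

```python
from statistics import mean, stdev

def find_deviations(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Hypothetical daily card spend for one customer; the last charge is unusual
spend = [42, 38, 51, 45, 40, 47, 39, 44, 43, 950]
print(find_deviations(spend))   # [950]
```

The flagged value is a candidate for investigation, not proof of fraud.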
A Summary: Data-Driven
Techniques*
Data Visualization
Decision Trees
Clustering
Factor Analysis
Neural Network
Association Rules
Rule Induction

* Based on Sakhr Youness's book Professional Data Warehousing with SQL Server 7.0 and
OLAP Services
Data Visualization
A pie chart showing the sales of a product by region is
sometimes much more effective than presenting the same
data in a text or tabular form.

[Pie chart: sales by region. North 39%, West 21%, East 20%,
South 11%, Northeast 9%.]
Decision Tree
Cluster Analysis
[Figure: three income segments. First segment (high income > 8,000);
second segment (8,000 > middle income > 3,000); third segment
(low income < 3,000). Attributes shown: have children, married,
last car is a used one, own car.]
Factor Analysis
Unlike cluster analysis, factor analysis builds a model from data.
The technique finds underlying factors, also called latent
variables, and provides models for these factors based on
variables in the data. For example, a software company is considering
a survey to find out the nine most perceived attributes of one of
their products. They might group these attributes into categories
such as service for technical support, availability for training, and
a help system.

Factor analysis is also used for grouping together products based on
a similarity of buying patterns, so that vendors may bundle several
products and sell them together at a lower price than the sum of
their individual prices.
Neural Networks
Association Rules

Association models examine the extent to which
values of one field depend on, or are produced by, values of
another field. These models are often referred to as Market Basket
Analysis when they are applied in retail industries to study the
buying patterns of customers, especially in grocery and
retail stores that issue their own credit cards. Charging against
these cards gives the store the chance to associate the purchases of
customers with their identities, which allows it to study
associations, among other things.
Rule Induction

This is a powerful technique that involves a large number of rules
using a set of if-then statements in the pursuit of all possible
patterns in the dataset. For example, if the customer is male, is
between 30 and 40 years of age, and has an income of more than
$20,000 and less than $50,000, then he is likely to be driving a car
that was bought new.
A Summary: Theory-Driven
Techniques
Correlations
T-Tests
Analysis of Variance
Linear Regression
Logistic Regression
Discriminant Analysis
Forecasting Methods
Topic 7: The Data Mining Process

Define the problem.
Select the data.
Prepare the data.
Mine the data.
Deploy the model.
Take business action.
Are you ready for Data Mining?
Define the problem
A successful data mining initiative always starts with
a well-defined project. To ensure that the project produces
incremental value, include an assessment of the status quo
solution and a review of technology, organization, and
business processes.
Select the data

This step involves defining your data source (not every
data source and record is required). The data is usually
extracted from the source system to a separate server.
Prepare the data

This step represents up to 80 percent of the total project
effort. For data mining, the data must reside in one flat
table (each record has many columns). In addition to being
the most time consuming, the step is also the most critical.
The resulting models are only as good as the data used to
create them.
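
As a small illustration of what "one flat table" means, the sketch below collapses two related extracts into one row per customer. The field names (`id`, `customer_id`, `amount`) and the summary columns are hypothetical.

```python
from collections import defaultdict

# Hypothetical extracts from two operational files
customers = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
orders = [{"customer_id": 1, "amount": 20.0},
          {"customer_id": 1, "amount": 35.0},
          {"customer_id": 2, "amount": 80.0}]

def flatten(customers, orders):
    """Build one flat row per customer, with order history summarized as columns."""
    totals, counts = defaultdict(float), defaultdict(int)
    for order in orders:
        totals[order["customer_id"]] += order["amount"]
        counts[order["customer_id"]] += 1
    return [
        {**c, "order_count": counts[c["id"]], "total_spend": totals[c["id"]]}
        for c in customers
    ]

for row in flatten(customers, orders):
    print(row)
```

Deciding which summary columns to derive is a large part of the preparation effort.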
Mine the data

Typically the easiest and shortest phase, this step involves
applying statistical and AI tools to create mathematical
models. Data mining typically occurs on a server separate
from the data warehouse and other corporate systems.
Deploy the Model

Model deployment is the process of implementing the
mathematical models into operational systems to improve
business results.
Take Business Action

Use the deployed model to achieve improved results for the
business problem identified at the beginning of the
process.
Steps to Implement Data Mining
[Figure: flow diagram with the stages Prior Knowledge; Discovery
(patterns, relations, associations, etc.); Information Model;
Validation; Deployment.]
ARE YOU READY FOR DATA
MINING?

Just because you have a data warehouse doesn't mean
you're necessarily ready for data mining. Much of the
work our company does in the data mining arena has
more to do with data mining readiness assessment than
with actually performing data mining.
Metrics you can use to gauge your data
mining readiness
Do you have a staff of experienced knowledge workers?
Do you have the data?
Do you have marketing processes in place that can use this
data?
Do you have a business champion who can embrace the
process and results?
Do you have the technology infrastructure to support
advanced analysis?
Topic 8: Data Mining Tools

Data mining tools are typically classified by the type of
algorithm they use to identify hidden patterns. There are
many different algorithms in use, but the four most
popular are association, sequence, clustering (or
segmentation), and predictive modeling.
Data Mining Tools

There is a growing number of commercial data
mining tools in the marketplace.

Important characteristics of data mining tools include:
Data preparation facilities
Selection of data mining operations
Product scalability and performance
Facilities for visualization of results.
Data Mining vs. OLAP

They are two separate breeds of analysis with
entirely different objectives, not to mention
tools, skill sets, and implementation methods.
Data Mining
With canned reports, ad hoc querying, and
OLAP, the end user defines a hypothesis and
determines which data to examine. With data
mining, the tool identifies the hypothesis, and it
actually tells the user where in the data to start
the exploration process.
Data Mining
Rather than using SQL to filter out values and methodically
reduce the data into a concise answer set, data mining uses
algorithms that exhaustively review the relationships among
data elements to determine if any patterns exist. The whole
purpose of data mining is to yield new business information
that a business person can act on.
OLAP vs. Data Mining Tools
OLAP Tools
Are ad hoc, shrink-wrapped tools that provide an interface to data
Are used when you have specific known questions
Look and feel like a spreadsheet that allows rotation, slicing,
and graphics
Can be deployed to a large number of users

Data Mining Tools
Methods for analyzing multiple data types
-- Regression trees
-- Neural networks
-- Genetic algorithms
Are used when you don't know what the questions are
Usually textual in nature
Usually deployed to a small number of analysts
Data Mining Tools

ASSOCIATION

Association, also frequently referred to as "affinity
analysis," reviews numerous sets of items and looks for
common groupings. An example of association is market
basket analysis, which involves reviewing the products
that consumers purchase in a single trip to the grocery
store.
ASSOCIATION

Finds items that imply the presence of other items
in the same event.

Affinities between items are represented by
association rules.
e.g. When a customer rents property for more than 2
years and is more than 25 years old, in 40% of cases
the customer will buy a property. This association
happens in 35% of all customers who rent properties.
Data Mining Tools

SEQUENCE

Sequential analysis helps data miners identify a set of
order-specific items or events. Association identifies the
existence of patterns or groups of items; sequential
analysis identifies the order of those patterns or groups of
items.
SEQUENCE

Finds patterns between events such that the presence of
one set of items is followed by another set of items in a
database of events over a period of time.
e.g. Used to understand long-term customer buying
behavior.
Link Analysis - Similar Time Sequence
Discovery
Finds links between two sets of data that are time-
dependent, and is based on the degree of similarity
between the patterns that both time series demonstrate.

e.g. Within three months of buying property, new home
owners will purchase goods such as cookers, freezers, and
washing machines.
Data Mining Tools

CLUSTERING

Cluster analysis lets the data miner assemble data into
unforeseen groups containing similar characteristics. Also
known as "segmentation," this type of data
mining is probably the most widely used.
CLUSTERING

Aim is to partition a database into an unknown number of
segments, or clusters, of similar records.

Uses unsupervised learning to discover homogeneous
sub-populations in a database to improve the accuracy of the
profiles.
Data Mining Tools

PREDICTIVE MODELING

As the name implies, predictive modeling involves
developing a model from historical data for predicting a
future event. The power of predictive modeling engines is
that they can use a broad range of data attributes to identify
future behavior. Both cluster analysis and predictive
modeling tools identify distinct groups of items with
common attributes; the difference is that predictive modeling
focuses on the likelihood of a particular outcome for a
particular group.
Topic 9: Data Mining Techniques- A
Summary
Artificial neural networks: Non-linear predictive models that
learn through training and resemble biological neural networks
in structure.

Decision Trees: Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification of a
database.

Genetic Algorithms: Optimization techniques that use processes
such as genetic combination, mutation, and natural selection in a
design based on the concepts of evolution.

Rule induction: The extraction of useful if-then rules from data
based on statistical significance.
Data Mining Techniques- A
Summary
Operation: associated techniques
Predictive modeling: classification; value prediction
Database segmentation: demographic clustering; neural clustering
Link analysis: association discovery; sequential pattern discovery;
similar time sequence discovery
Deviation detection: statistics; visualization
Two Types of Data Mining Modeling-
Verification and Discovery
The verification model utilizes a process that looks in a
database to detect trends and patterns in data that will help
answer some specific questions about the business.

In this mode, the user generates a hypothesis about the
data, issues a query against the data, and examines the
results of the query looking for verification of the
hypothesis; otherwise, the user decides that the hypothesis
is not valid.
Verification Model

In this model, very little information is created in the
extraction process: either the hypothesis is verified or it is
not.

Common tools used in this mode are queries,
multidimensional analysis, and visualization. What they all
have in common is that the user is essentially guiding the
exploration of the data being inspected.
Discovery Model

A more popular model is the Discovery Model, which utilizes
a process that looks in a database to discover and/or
predict future patterns. The discovery model is divided
into two modes: Descriptive and Predictive.
Discovery Model- Descriptive Mode

The Descriptive mode finds hidden patterns without a
predetermined idea or hypothesis about what the patterns
may be. In other words, the Data Mining software or
program takes the initiative in finding what the interesting
patterns are, without the user thinking of the relevant
questions first. In this mode, information is created about
the data with very little or no guidance from the user. The
exploration of the data is done in such a way as to yield as
large a number of useful facts about the data as possible in
the shortest amount of time.
Discovery Model- Predictive Mode

In the Predictive mode patterns discovered from the database are used
to predict the future patterns or trends. Predictive modeling allows the
user to submit records with some unknown field values, and the
system will guess the unknown values based on previous patterns
discovered from the database.
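
One simple way to "guess the unknown values" is to copy them from the most similar fully known record. The nearest-neighbour rule below is a swapped-in illustration, not a description of any particular product; the records and the `predict_missing` helper are hypothetical.

```python
from math import dist

def predict_missing(record, known_records):
    """Guess the unknown last field from the most similar fully known record."""
    nearest = min(known_records, key=lambda k: dist(k[:-1], record))
    return nearest[-1]

# Hypothetical history: (age, income in thousands, responded to mailing?)
history = [(25, 30, "no"), (31, 42, "no"), (48, 85, "yes"), (52, 90, "yes")]
print(predict_missing((50, 88), history))   # yes
```

The submitted record carries only the known fields; the returned value fills in the unknown one.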

In comparing the two models, one can state that Verification can be
very inefficient, time-consuming, and costly, whereas Discovery
modeling can be very efficient and cost-effective, is less dependent
on user input, and increases modeling accuracy.
