Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
& DataMining
1
Overview
❚ Part 1: Data Warehouses
❚ Part 2: OLAP
❚ Part 3: Data Mining
❚ Part 4: Query Processing and
Optimization
2
Part 1: Data Warehouses
3
Data, Data everywhere yet ...
❚ I can’t find the data I need
❙ data is scattered over the network
❙ many versions, subtle differences
[Barry Devlin]
5
Why Data Warehousing?
Which
Whichare
areour
our
lowest/highest
lowest/highestmargin
margin
customers
customers??
Who
Whoare
aremymycustomers
customers
What and
andwhat
whatproducts
Whatisisthe
themost
most products
effective are
arethey
theybuying?
effectivedistribution
distribution buying?
channel?
channel?
What
Whatproduct
productprom- Which
prom- Whichcustomers
customers
-otions
-otionshave
havethe
thebiggest are
biggest aremost
mostlikely
likelyto
togo
go
impact
impactononrevenue? to
revenue? tothe
thecompetition
competition??
What
Whatimpact
impactwill
will
new
newproducts/services
products/services
have
haveon
onrevenue
revenue
and
andmargins?
margins? 6
Decision Support
❚ Used to manage and control business
❚ Data is historical or point-in-time
❚ Optimized for inquiry rather than update
❚ Use of the system is loosely defined and
can be ad-hoc
❚ Used by managers and end-users to
understand the business and make
judgements
7
Evolution of Decision Support
9
Data Warehousing --
It is a process
❚ Technique for assembling and
managing data from various
sources for the purpose of
answering business questions.
Thus making decisions that
were not previous possible
❚ A decision support database
maintained separately from
the organization’s operational
database
10
Traditional RDBMS used for
OLTP
❚ Database Systems have been used
traditionally for OLTP
❙ clerical data processing tasks
❙ detailed, up to date data
❙ structured repetitive tasks
❙ read/update a few records
❙ isolation, recovery and integrity are
critical
❚ Will call these operational systems
11
OLTP vs Data Warehouse
❚ OLTP ❚ Warehouse (DSS)
❙ Application Oriented ❙ Subject Oriented
❙ Used to analyze business
❙ Used to run business
❙ Manager/Analyst
❙ Clerical User
❙ Summarized and refined
❙ Detailed data ❙ Snapshot data
❙ Current up to date ❙ Integrated Data
❙ Isolated Data ❙ Ad-hoc access using large
queries
❙ Repetitive access by
small transactions ❙ Mostly read access (batch
update)
❙ Read/Update access
12
Data Warehouse Architecture
Data Warehouse
Legacy Engine Analyze
Data Query
Purchased
Data
Metadata Repository
13
From the Data Warehouse to
Data Marts
Information
Individually Less
Structured
Departmentally History
Structured Normalized
Detailed
Organizationally More
Structured Data Warehouse
Data
14
Users have different views of
Data
Tourists: Browse
information
OLAP
harvested
by farmers
Farmers: Harvest informatio
from known access paths
16
Old Retail Paradigm
❚ Wal*Mart ❚ Suppliers
❙ Inventory Management ❙ Accept Orders
❙ Merchandise Accounts ❙ Promote Products
Payable ❙ Provide special
❙ Purchasing Incentives
❙ Supplier Promotions: ❙ Monitor and Track
National, Region, Store The Incentives
Level ❙ Bill and Collect
Receivables
❙ Estimate Retailer
Demands
17
New (Just-In-Time) Retail
Paradigm
❚ No more deals
❚ Shelf-Pass Through (POS Application)
❙ One Unit Price
❘ Suppliers paid once a week on ACTUAL items sold
❙ Wal*Mart Manager
❘ Daily Inventory Restock
❘ Suppliers (sometimes SameDay) ship to Wal*Mart
❚ Warehouse-Pass Through
❙ Stock some Large Items
❘ Delivery may come from supplier
❙ Distribution Center
❘ Supplier’s merchandise unloaded directly onto
Wal*Mart Trucks
18
Information as a Strategic
Weapon
❚ Daily Summary of all Sales Information
❚ Regional Analysis of all Stores in a logical
area
❚ Specific Product Sales
❚ Specific Supplies Sales
❚ Trend Analysis, etc.
❚ Wal*Mart uses information when
negotiating with
❙ Suppliers
❙ Advertisers etc.
19
Schema Design
❚ Database organization
❙ must look like business
❙ must be recognizable by business user
❙ approachable by business user
❙ Must be simple
❚ Schema Types
❙ Star Schema
❙ Fact Constellation Schema
❙ Snowflake schema
20
Star Schema
22
Fact Table
❚ Central table
❙ Typical example: individual sales
records
❙ mostly raw numeric items
❙ narrow rows, a few columns at most
❙ large number of rows (millions to a
billion)
❙ Access via dimensions
23
Snowflake schema
❚ Fact Constellation
❙ Multiple fact tables that share many
dimension tables
❙ Booking and Checkout may share many
dimension tables in the hotel industry
Promotion
Hotels
Booking
Checkout
Travel Agents Room Type
Customer 25
Data Granularity in Warehouse
26
Granularity in Warehouse
❚ Solution is to have dual level of
granularity
❙ Store summary data on disks
❘ 95% of DSS processing done against this
data
❙ Store detail on tapes
❘ 5% of DSS processing against this data
27
Levels of Granularity
29
Data Transformation
30
Data Transformation Example
Data Warehouse
encoding
appl A - m,f
appl B - 1,0
appl C - x,y
appl D - male, female
appl A - pipeline - cm
appl B - pipeline - in
unit
appl A - balance
field
appl B - bal
appl C - currbal
appl D - balcurr
31
Data Integrity Problems
32
Data Transformation Terms
❚ Extracting ❚ Enrichment
❚ Conditioning ❚ Scoring
❚ Scrubbing ❚ Loading
❚ Merging ❚ Validating
❚ Householding ❚ Delta Updating
33
Data Transformation Terms
❚ Householding
❙ Identifying all members of a household
(living at the same address)
❙ Ensures only one mail is sent to a
household
❙ Can result in substantial savings: 1
million catalogues at Rs. 50 each costs
Rs. 50 million . A 2% savings would save
Rs. 1 million
34
Refresh
35
When to Refresh?
❚ periodically (e.g., every night, every
week) or after significant events
❚ on every update: not warranted unless
warehouse data require current data (up
to the minute stock quotes)
❚ refresh policy set by administrator based
on user needs and traffic
❚ possibly different policies for different
sources
36
Refresh techniques
❚ Incremental techniques
❙ detect changes on base tables:
replication servers (e.g., Sybase, Oracle,
IBM Data Propagator)
❘ snapshots (Oracle)
❘ transaction shipping (Sybase)
❙ compute changes to derived and
summary tables
❙ maintain transactional correctness for
incremental load
37
How To Detect Changes
38
Querying Data Warehouses
❚ SQL Extensions
❚ Multidimensional modeling of data
❙ OLAP
❙ More on OLAP later …
39
SQL Extensions
40
Reporting Tools
❚ Andyne Computing -- GQL
❚ Brio -- BrioQuery
❚ Business Objects -- Business Objects
❚ Cognos -- Impromptu
❚ Information Builders Inc. -- Focus for
Windows
❚ Oracle -- Discoverer2000
❚ Platinum Technology -- SQL*Assist,
ProReports
❚ PowerSoft -- InfoMaker
❚ SAS Institute -- SAS/Assist
❚ Software AG -- Esperant
❚ Sterling Software -- VISION:Data
41
Decision support tools
Mining
Direct Reporting OLAP tools
Query tools
Essbase Intelligent Miner
Crystal reports
Merge Relational
Clean Data warehouse DBMS+
Summarize e.g. Redbrick
Detailed GIS
transactional data
data Operational data Census
Bombay branch Delhi branch Calcutta branch data
Oracle IMS SAS
42
Deploying Data Warehouses
❚ What business information
keeps you in business today?
What business information
can put you out of business
tomorrow?
❚ What business information
should be a mouse click
away?
❚ What business conditions are
the driving the need for
business information?
43
Cultural Considerations
❚ Not just a technology project
❚ New way of using information
to support daily activities and
decision making
❚ Care must be taken to
prepare organization for
change
❚ Must have organizational
backing and support
44
User Training
❚ Users must have a higher level of IT
proficiency than for operational
systems
❚ Training to help users analyze data
in the warehouse effectively
45
Warehouse Products
❚ Computer Associates -- CA-Ingres
❚ Hewlett-Packard -- Allbase/SQL
❚ Informix -- Informix, Informix XPS
❚ Microsoft -- SQL Server
❚ Oracle -- Oracle7, Oracle Parallel Server
❚ Red Brick -- Red Brick Warehouse
❚ SAS Institute -- SAS
❚ Software AG -- ADABAS
❚ Sybase -- SQL Server, IQ, MPP
46
Part 2: OLAP
47
Nature of OLAP Analysis
❚ Aggregation -- (total sales, percent-to-total)
❚ Comparison -- Budget vs. Expenses
❚ Ranking -- Top 10, quartile analysis
❚ Access to detailed and aggregate data
❚ Complex criteria specification
❚ Visualization
❚ Need interactive response to aggregate
queries
48
Multi-dimensional Data
❚ Measure - sales (actual, plan, variance)
W
eg
S
R
50
Operations
❚ Rollup: summarize data
❙ e.g., given sales data, summarize sales
for last year by product category and
region
❚ Drill down: get more details
❙ e.g., given summarized sales as above,
find breakup of sales by city within each
region, or within the Andhra region
51
More Cube Operations
❚ Slice and dice: select and project
❙ e.g.: Sales of soft-drinks in Andhra
over the last quarter
❚ Pivot: change the view of data
❙ Q1 Q2 Total L S
Total
L
S
22 33 14 55
07
Red
Blue
44 41 59
52
Total Total
15 52
More OLAP Operations
❚ Hypothesis driven search: E.g.
factors affecting defaulters
❙ view defaulting rate on age aggregated
over other dimensions
❙ for particular age segment detail along
profession
❚ Need interactive response to
aggregate queries
❙ => precompute various aggregates
53
MOLAP vs ROLAP
❚ MOLAP: Multidimensional array
OLAP
❚ ROLAP: Relational OLAP
Type Size Colour Amount
Shirt S Blue 10
Shirt L Blue 25
Shirt ALL Blue 35
Shirt S Red 3
Shirt L Red 7
Shirt ALL Red 10
Shirt ALL ALL 45
… … … …
ALL ALL ALL 1290
54
SQL Extensions
❚ Cube operator
❙ group by on all subsets of a set of
attributes (month,city)
❙ redundant scan and sorting of data can
be avoided
❚ Various other non-standard SQL
extensions by vendors
55
OLAP: 3 Tier DSS
Data Warehouse OLAP Engine Decision Support Client
57
Brief History
58
OLAP and Executive Information
Systems
❚ Andyne Computing -- Pablo ❚ Oracle -- Express
❚ Arbor Software -- Essbase
❚ Pilot -- LightShip
❚ Cognos -- PowerPlay
❚ Comshare -- Commander OLAP ❚ Planning Sciences --
❚ Holistic Systems -- Holos Gentium
❚ Information Advantage -- ❚ Platinum Technology
AXSYS, WebOLAP
-- ProdeaBeacon,
❚ Informix -- Metacube Forest & Trees
❚ Microstrategies --DSS/Agent
❚ SAS Institute --
SAS/EIS, OLAP++
❚ Speedware -- Media
59
Microsoft OLAP strategy
60
Part 3: Data Mining
61
Why Data Mining
❚ Credit ratings/targeted marketing:
❙ Given a database of 100,000 names, which persons are
the least likely to default on their credit cards?
❙ Identify likely responders to sales promotions
❚ Fraud detection
❙ Which types of transactions are likely to be fraudulent,
given the demographics and transactional history of a
particular customer?
❚ Customer relationship management:
❙ Which of my customers are likely to be the most loyal,
and which are most likely to leave for a competitor? :
63
Some basic operations
❚ Predictive:
❙ Regression
❙ Classification
❚ Descriptive:
❙ Clustering / similarity matching
❙ Association rules and variants
❙ Deviation detection
64
Classification
❚ Given old data about customers and
payments, predict new applicant’s loan
eligibility.
66
Decision trees
Salary < 1 M
67
Pros and Cons of decision
trees
• Pros • Cons
+ Reasonable training – Cannot handle complicated
time relationship between features
+ Fast application – simple decision boundaries
+ Easy to interpret – problems with lots of
+ Easy to implement missing data
+ Can handle large
number of features
More information:
http://www.stat.wisc.edu/~limt/treeprogs.html
68
Neural network
❚ Set of nodes connected by
directed weighted edges
A more typical NN
Basic NN unit
x1 n x1
x2
w1
o = σ ( ∑ wi xi )
x2
w2 i =1
x3 w3 1 x3 Output nodes
σ ( y) =
1 + e− y Hidden nodes
69
Pros and Cons of Neural
Network
• Pros • Cons
+ Can learn more complicated – Slow training time
class boundaries – Hard to interpret
+ Fast application – Hard to implement:
+ Can handle large number of trial and error for
features choosing number of
nodes
70
Bayesian learning
❚ Assume a probability model on generation of
data.
❚ Apply bayes theorem to find most p ( dlikely
| c j ) p ( cclass
j)
as:
predicted class : c = max p ( c j | d ) = max
cj cj p(d )
71
Clustering
❚ Unsupervised learning when old data with
class labels not available e.g. when
introducing a new product.
❚ Group/cluster existing customers based on
time series of payment history such that
similar customers in same cluster.
❚ Key requirement: Need a good measure of
similarity between instances.
❚ Identify micro-markets and develop
policies for each
72
Association rules
T
❚ Given set T of groups of items Milk, cereal
❚ Example: set of item sets purchased Tea, milk
❚ Goal: find all rules on itemsets of the form
a-->b such that Tea, rice, bread
❙ support of a and b > user threshold s
❙ conditional probability (confidence) of b given
a > user threshold c
❚ Example: Milk --> bread
❚ Purchase of product A --> service B
cereal
73
Variants
74
Prevalent ≠ Interesting
❚ Analysts already
1995 Milk and
know about
cereal sell
prevalent rules
together!
❚ Interesting rules are
those that deviate
from prior
expectation 1998 Milk and
Zzzz...
❚ Mining’s payoff is in cereal sell
finding surprising together!
phenomena
75
What makes a rule surprising?
❚ Does not match ❚ Cannot be trivially
derived from simpler
prior
rules
expectation ❙ Milk 10%, cereal 10%
❙ Correlation between ❙ Milk and cereal 10% …
milk and cereal surprising
remains roughly ❙ Eggs 10%
constant over time ❙ Milk, cereal and eggs
0.1% … surprising!
❙ Expected 1%
76
Application Areas
Industry Application
Finance Credit Card Analysis
Insurance Claims, Fraud Analysis
Telecommunication Call record analysis
Transport Logistics management
Consumer goods promotion analysis
Data Service providersValue added data
Utilities Power usage analysis
77
Data Mining in Use
❚ The US Government uses Data Mining to
track fraud
❚ A Supermarket becomes an information
broker
❚ Basketball teams use it to track game
strategy
❚ Cross Selling
❚ Target Marketing
❚ Holding on to Good Customers
❚ Weeding out Bad Customers
78
Why Now?
❚ Data is being produced
❚ Data is being warehoused
❚ The computing power is available
❚ The computing power is affordable
❚ The competitive pressures are
strong
❚ Commercial products are available
79
Data Mining works with
Warehouse Data
❚ Data Warehousing provides the
Enterprise with a memory
80
Mining market
❚ Around 20 to 30 mining tool vendors
❚ Major players:
❙ Clementine,
❙ IBM’s Intelligent Miner,
❙ SGI’s MineSet,
❙ SAS’s Enterprise Miner.
❚ All pretty much the same set of tools
❚ Many embedded products: fraud detection,
electronic commerce applications
81
OLAP Mining integration
❚ OLAP (On Line Analytical Processing)
❙ Fast interactive exploration of multidim.
aggregates.
❙ Heavy reliance on manual operations for
analysis:
❙ Tedious and error-prone on large
multidimensional data
❚ Ideal platform for vertical integration of mining
but needs to be interactive instead of batch.
82
State of art in mining OLAP
integration
❚ Decision trees [Information discovery, Cognos]
❙ find factors influencing high profits
❚ Clustering [Pilot software]
❙ segment customers to define hierarchy on that dimension
❚ Time series analysis: [Seagate’s Holos]
❙ Query for various shapes along time: eg. spikes, outliers etc
83
Vertical integration: Mining
on the web
84