Source:
1. Ankur Teredesai, Assistant Professor, Dept. of Computer Science, RIT.
2.
3.
IF5031/Intro DWH-DM/Okt/2015
OBJECTIVES
Understand the concept and role of the Data Warehouse (5W)
Understand the concept and role of Data Mining
Data Mining techniques
Understand the difference between DWH and DM
Concepts of OLTP and OLAP
Topics
OLTP vs OLAP
Decision-Support Systems: Overview
Data analysis tasks are simplified by specialized tools and SQL extensions
o Example tasks
For each product category and each region, what were the total sales in the last quarter, and how do they compare with the same quarter last year?
As above, for each product category and each customer category
Important for large businesses that generate data from multiple divisions, possibly at multiple sites
Data may also be purchased externally
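The first example task above can be written as a grouped aggregate. The following is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its column names, and the sample rows are hypothetical and were invented for illustration; they do not come from the slides.

```python
import sqlite3

def quarterly_sales_comparison(conn, year, quarter):
    """Total sales per (category, region) for one quarter and the same quarter a year earlier."""
    sql = """
        SELECT category, region,
               SUM(CASE WHEN year = :y     THEN amount ELSE 0 END) AS this_year,
               SUM(CASE WHEN year = :y - 1 THEN amount ELSE 0 END) AS last_year
        FROM sales
        WHERE quarter = :q AND year IN (:y, :y - 1)
        GROUP BY category, region
        ORDER BY category, region
    """
    return conn.execute(sql, {"y": year, "q": quarter}).fetchall()

def demo_connection():
    """Build a tiny in-memory table (hypothetical data) to run the query against."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (category TEXT, region TEXT, year INTEGER, quarter INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
        ("books", "east", 2015, 3, 120.0),
        ("books", "east", 2015, 3, 80.0),
        ("books", "east", 2014, 3, 150.0),
        ("toys", "west", 2015, 3, 60.0),
        ("toys", "west", 2015, 2, 999.0),  # other quarter, filtered out by the WHERE clause
    ])
    return conn

for row in quarterly_sales_comparison(demo_connection(), 2015, 3):
    print(row)
```

Replacing `region` with a customer-category column would give the second example task.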
Data Warehousing
Data sources often store only current data, not historical data
Corporate decision making requires a unified view of all organizational data, including historical data
A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
o Greatly simplifies querying, permits study of historical trends
o Shifts decision support query load away from transaction processing systems
Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
o relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
o Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
E.g., hotel price: currency, tax, whether breakfast is covered, etc.
Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
o Operational database: current value data.
o Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
o Contains an element of time, explicitly or implicitly
o But the key of operational data may or may not contain a time element.
Data Warehouse: Non-Volatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
o Does not require transaction processing, recovery, and concurrency control mechanisms
o Requires only two operations in data accessing: initial loading of data and access of data.
                     OLTP                                                       OLAP
users                clerk, IT professional                                     knowledge worker
function                                                                        decision support
DB design            application-oriented                                       subject-oriented
data                 current, up-to-date; detailed, flat relational; isolated   historical; summarized, multidimensional; integrated, consolidated
usage                repetitive                                                 ad-hoc
access               read/write; index/hash on prim. key                        lots of scans
unit of work         short, simple transaction                                  complex query
# records accessed   tens                                                       millions
# users              thousands                                                  hundreds
DB size              100MB-GB                                                   100GB-TB
metric               transaction throughput
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
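The lattice can be enumerated mechanically: a k-D cuboid is any choice of k of the dimensions, so n dimensions give 2^n cuboids in total. A minimal sketch, assuming the four dimensions named on the slide; the function name `cuboid_lattice` is hypothetical.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Map each level k to the list of k-D cuboids, each a tuple of dimension names."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}

lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for level, cuboids in lattice.items():
    # The empty tuple is the apex cuboid, which the slide labels "all".
    print(f"{level}-D:", [",".join(c) or "all" for c in cuboids])
```

With four dimensions this yields 1 + 4 + 6 + 4 + 1 = 16 cuboids, matching the levels listed above.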
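[Figure: star schema example. Fact table keys: time_key, item_key, branch_key, location_key. Measures: units_sold, dollars_sold, avg_sales. Dimension tables: time (time_key, day, day_of_the_week, month, quarter, year); item (item_key, item_name, brand, type, supplier_type); branch (branch_key, branch_name, branch_type); location (location_key, street, city, province_or_street, country).]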
[Figure: snowflake schema example. Sales Fact Table keys: time_key, item_key, branch_key, location_key. Measures: units_sold, dollars_sold, avg_sales. Dimension tables: item (item_key, item_name, brand, type, supplier_key); supplier (supplier_key, supplier_type); branch (branch_key, branch_name, branch_type); location (location_key, street, city_key); city (city_key, city, province_or_street, country).]
[Figure: fact constellation schema example. Sales Fact Table keys: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales. Second (shipping) fact table keys: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped. Dimension tables: item (item_key, item_name, brand, type, supplier_type); branch (branch_key, branch_name, branch_type); location (location_key, street, city, province_or_street, country); shipper (shipper_key, shipper_name, location_key, shipper_type).]
Design Issues
When and how to gather data
o Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night)
o Destination-driven architecture: the warehouse periodically requests new information from data sources
o Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive
Usually OK to have slightly out-of-date data at the warehouse
Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
Warehouse Schemas
Dimension values are usually encoded using small integers and mapped to full values via dimension tables
Resultant schema is called a star schema
More complicated schema structures
o Snowflake schema: multiple levels of dimension tables
o Constellation: multiple fact tables
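A star schema can be tried out with Python's built-in sqlite3 module. The sketch below is illustrative only: the column names follow the sales schema used in these slides, but the table is cut down to two dimensions and every row is hypothetical. The fact table stores only small integer keys and measures; names such as the city are recovered by joining with a dimension table.

```python
import sqlite3

def build_star_schema():
    """Create a reduced sales star schema in memory with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
        CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
        -- Fact table: integer keys into the dimension tables, plus measures.
        CREATE TABLE sales (
            item_key INTEGER REFERENCES item(item_key),
            location_key INTEGER REFERENCES location(location_key),
            units_sold INTEGER,
            dollars_sold REAL
        );
        INSERT INTO item VALUES (1, 'laptop', 'Acme'), (2, 'phone', 'Acme');
        INSERT INTO location VALUES (1, 'Bandung', 'Indonesia'), (2, 'Jakarta', 'Indonesia');
        INSERT INTO sales VALUES (1, 1, 2, 2000.0), (1, 2, 1, 1000.0), (2, 2, 3, 900.0);
    """)
    return conn

def dollars_by_city(conn):
    """Join the fact table with the location dimension and total the sales per city."""
    return conn.execute("""
        SELECT l.city, SUM(s.dollars_sold)
        FROM sales AS s JOIN location AS l ON s.location_key = l.location_key
        GROUP BY l.city
        ORDER BY l.city
    """).fetchall()

print(dollars_by_city(build_star_schema()))
```

A snowflake variant would move `city` and `country` into a separate table referenced from `location` by a key, at the cost of one more join per query.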
Introduction to
DATA MINING
[Figure: data mining as a step in knowledge discovery. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
Data Mining
Data mining is the process of semi-automatically analyzing large databases to find useful patterns
Prediction based on past history
o Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history
o Predict if a pattern of phone calling card usage is likely to be fraudulent
Some Patterns
Association rules
o 98% of people who purchase diapers also buy
beer
Classification
o People with age less than 25 and salary > 40k
drive sports cars
Outlier Detection
o Residential customers for telecom company with
businesses at home
[Figure: layered pyramid, top to bottom: Making Decisions; Data Presentation (Visualization Techniques); Data Mining (Information Discovery); Data Exploration (Statistical Analysis, Querying and Reporting); Data Warehouses / Data Marts (OLAP, MDA); Data Sources (Paper, Files, Information Providers, Database Systems, OLTP). Roles shown alongside: End User, Business Analyst, Data Analyst, DBA.]
[Figure: data mining system architecture. Components: Pattern evaluation; Data mining engine; Knowledge-base; Database or data warehouse server; Filtering; Databases; Data Warehouse.]
Relational database
Data warehouse
Transactional database
Advanced database and information repository
o Object-relational database
o Spatial and temporal data
o Time-series data
o Stream data
o Multimedia database
o Heterogeneous and legacy database
o Text databases & WWW
MORE ON DM PATTERNS
(For students to study on their own)
Classification Rules
Classification rules help assign new objects to
classes.
o E.g., given a new automobile insurance applicant, should he or she
be classified as low risk, medium risk or high risk?
Decision Tree
Construction of Decision Trees
Training set: a data sample in which the classification is already known.
Greedy top-down generation of decision trees.
o Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition for the node
o Leaf node:
all (or most) of the items at the node belong to the same class, or
all attributes have been considered, and no further partitioning is possible.
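Best Splits
Pick best attributes and conditions on which to partition
The purity of a set S of training instances can be measured quantitatively in several ways.
o Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = p_i.
When S is split into sets S_1, S_2, ..., S_r, the purity of the resulting split is the size-weighted sum of the purities of the parts:
$$\mathrm{purity}(S_1, S_2, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|} \, \mathrm{purity}(S_i)$$
The information gain of the split is Information-gain(S, {S_1, S_2, ..., S_r}) = purity(S) - purity(S_1, S_2, ..., S_r).
The information content measures the cost of the split:
$$\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\}) = -\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$
$$\text{Information-gain ratio} = \frac{\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\})}{\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\})}$$
The best split is the one that gives the maximum information gain ratio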
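The information-gain ratio can be computed directly from class labels. The sketch below assumes entropy as the purity measure, which is one common choice; the surviving slide text does not say which measure is intended, and all function names are hypothetical. A set is represented as a list of class labels and a split as a list of such lists.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels (0 for a set with a single class)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_purity(subsets):
    """Size-weighted sum of the entropies of the subsets S_1..S_r."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets)

def information_gain(labels, subsets):
    """purity(S) - purity(S_1, ..., S_r), with entropy as the (assumed) measure."""
    return entropy(labels) - split_purity(subsets)

def information_content(subsets):
    """Cost of the split: entropy of the subset sizes themselves."""
    total = sum(len(s) for s in subsets)
    return -sum(len(s) / total * log2(len(s) / total) for s in subsets if s)

def information_gain_ratio(labels, subsets):
    """Gain divided by content; defined here as 0.0 when the split has one non-empty part."""
    content = information_content(subsets)
    return information_gain(labels, subsets) / content if content else 0.0

labels = ["low", "low", "high", "high"]
print(information_gain_ratio(labels, [["low", "low"], ["high", "high"]]))  # perfect split
print(information_gain_ratio(labels, [["low", "high"], ["low", "high"]]))  # uninformative split
```

Dividing by the information content penalizes splits that fragment S into many small parts, which would otherwise look attractive on gain alone.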
Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S);
Procedure Partition(S)
    if (purity(S) > p or |S| < s) then
        return;
    for each attribute A
        evaluate splits on attribute A;
    Use best split found (across all attributes) to partition S into S1, S2, ..., Sr;
    for i = 1, 2, ..., r
        Partition(Si);
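The procedure can be turned into a small runnable sketch. Several choices below are assumptions, not part of the slides: attributes are categorical, a split divides the rows by attribute value, purity is the fraction of rows in the majority class, and the best split is the one with the highest size-weighted purity. The thresholds `p` and `s` play the same role as in the pseudocode, and the training rows are hypothetical.

```python
from collections import Counter, defaultdict

def purity(rows):
    """Fraction of rows in the majority class (1.0 means perfectly pure)."""
    counts = Counter(label for _, label in rows)
    return max(counts.values()) / len(rows)

def split_on(rows, attribute):
    """Group rows by their value for the given attribute."""
    groups = defaultdict(list)
    for features, label in rows:
        groups[features[attribute]].append((features, label))
    return groups

def partition(rows, attributes, p=0.95, s=2):
    """Return a class label (leaf) or (attribute, {value: subtree}) for an internal node."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    if purity(rows) > p or len(rows) < s:
        return majority
    best_attr, best_groups, best_score = None, None, -1.0
    for a in attributes:
        groups = split_on(rows, a)
        if len(groups) < 2:
            continue  # this attribute cannot partition the rows any further
        score = sum(len(g) / len(rows) * purity(g) for g in groups.values())
        if score > best_score:
            best_attr, best_groups, best_score = a, groups, score
    if best_attr is None:
        return majority  # no further partitioning is possible
    return (best_attr, {v: partition(g, attributes, p, s) for v, g in best_groups.items()})

def grow_tree(rows, attributes, p=0.95, s=2):
    return partition(rows, attributes, p, s)

def classify(tree, features):
    """Walk from the root to a leaf; fails on attribute values unseen in training."""
    while isinstance(tree, tuple):
        attribute, children = tree
        tree = children[features[attribute]]
    return tree

training = [  # hypothetical insurance applicants: (features, risk class)
    ({"age": "young", "car": "sports"}, "high"),
    ({"age": "young", "car": "sedan"}, "high"),
    ({"age": "old", "car": "sports"}, "high"),
    ({"age": "old", "car": "sedan"}, "low"),
]
tree = grow_tree(training, ["age", "car"])
print(classify(tree, {"age": "old", "car": "sedan"}))
```

Swapping the scoring line for the information-gain ratio would bring the sketch closer to the split criterion described earlier.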
Regression
Regression deals with the prediction of a value, rather than a class.
o Given values for a set of variables, X1, X2, ..., Xn, we wish to predict the value of a variable Y.
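The simplest case, one input variable and a straight-line fit, can be sketched with ordinary least squares. This is an illustration under that assumption only; the slides do not name a fitting method, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    """Predict Y for a new value of X using a fitted (intercept, slope) pair."""
    intercept, slope = model
    return intercept + slope * x

model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(model, predict(model, 5))
```

With several input variables X1, ..., Xn the same idea applies, but the coefficients are found by solving a linear system rather than by these two closed-form sums.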
Association Rules
Retail shops are often interested in associations between different items that people buy.
o Someone who buys bread is quite likely also to buy milk
o A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
Association rules:
bread => milk
DB-Concepts, OS-Concepts => Networks
o Left hand side: antecedent, right hand side: consequent
o An association rule must have an associated population; the population consists of a set of instances
E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population
Finding Support
Determine support of itemsets in a single pass over the set of transactions
o Large itemsets: sets with a high count at the end of the pass
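Counting support in one pass can be sketched as follows. The sketch assumes that the support of an itemset is the fraction of transactions containing all of its items, and it counts every itemset of a fixed size; the function names, the `min_support` threshold, and the sample transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_support(transactions, size):
    """One pass over the transactions, counting every itemset of the given size."""
    counts = Counter()
    for transaction in transactions:
        # Sort so that the same itemset always maps to the same tuple key.
        for itemset in combinations(sorted(set(transaction)), size):
            counts[itemset] += 1
    return counts

def large_itemsets(transactions, size, min_support):
    """Itemsets whose support (fraction of transactions) is at least min_support."""
    n = len(transactions)
    counts = count_support(transactions, size)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

transactions = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["bread"],
    ["milk", "eggs"],
]
print(large_itemsets(transactions, 2, 0.5))
```

Counting all itemsets of a given size grows quickly with the number of items, so practical algorithms restrict the candidates, for example to itemsets whose subsets were already found to be large.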
Clustering
Clustering: Intuitively, finding clusters of points in the
given data such that similar points lie in the same
cluster
Can be formalized using distance metrics in several
ways
o Group points into k sets (for a given k) such that the average distance of
points from the centroid of their assigned group is minimized
Centroid: point defined by taking average of coordinates in each
dimension.
o Another metric: minimize average distance between every pair of points
in a cluster
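The centroid-based formulation is commonly approximated with Lloyd's k-means heuristic, sketched below. Note the swap: k-means reduces squared distances to centroids and converges to a local optimum, so it is a stand-in for, not an exact solution of, the average-distance objective above. The initial centroids are supplied by the caller, and all names are hypothetical.

```python
from math import dist

def centroid(points):
    """Average of the coordinates in each dimension."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids, clusters = k_means(points, initial_centroids=[(0, 0), (10, 0)])
print(centroids)
```

The result depends on the initial centroids, so in practice the procedure is often run from several starting points.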
Hierarchical Clustering
Example from biological classification
o (the word classification here does not mean a prediction mechanism)
chordata
  mammalia: leopards, humans
  reptilia: snakes, crocodiles
Clustering Algorithms
Clustering algorithms have been designed to
handle very large datasets
E.g., the Birch algorithm
o Main idea: use an in-memory R-tree to store points that are being
clustered
o Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some threshold distance away
o If there are more leaf nodes than fit in memory, merge existing clusters
that are close to each other
o At the end of first pass we get a large number of clusters at the leaves of
the R-tree
Merge clusters to reduce the number of clusters
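The single-pass idea can be illustrated without the tree structure. The sketch below is not the Birch algorithm: it replaces the in-memory tree with a plain list scanned linearly and omits the memory-driven merging step, keeping only the rule of merging a new point into the nearest cluster when it lies within a threshold distance. Each cluster is summarized by its coordinate sums and point count; the names and the `threshold` value are hypothetical.

```python
from math import dist

def incremental_clusters(points, threshold):
    """Single pass over the points; returns a list of (centroid, count) summaries."""
    clusters = []  # each entry is [coordinate_sums, count]
    for p in points:
        best, best_d = None, None
        for c in clusters:
            center = tuple(total / c[1] for total in c[0])
            d = dist(p, center)
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d < threshold:
            # Merge the point into the nearest cluster by updating its summary.
            best[0] = tuple(total + x for total, x in zip(best[0], p))
            best[1] += 1
        else:
            clusters.append([tuple(p), 1])
    return [(tuple(total / n for total in sums), n) for sums, n in clusters]

print(incremental_clusters([(0, 0), (1, 0), (10, 0), (11, 0)], threshold=3))
```

Because only sums and counts are stored, the points themselves never need to be kept in memory, which is what makes this style of clustering suitable for very large datasets.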
Collaborative Filtering
Goal: predict what movies/books/... a person may be interested in, on the basis of
o Past preferences of the person
o Other people with similar past preferences
o The preferences of such people for a new movie/book/...
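One way to combine these three ingredients is a similarity-weighted average of other people's ratings. This is a minimal sketch of one common approach and not a method prescribed by the slides: similarity is taken to be the cosine of the two rating vectors, ratings are assumed to be positive numbers, and the data and all names are hypothetical.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity of two {item: rating} dicts; 0.0 if they share no items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def predict_rating(ratings, person, item):
    """Similarity-weighted average of other people's ratings for the item, or None."""
    weighted_sum = total_weight = 0.0
    for other, their_ratings in ratings.items():
        if other == person or item not in their_ratings:
            continue
        w = cosine_similarity(ratings[person], their_ratings)
        weighted_sum += w * their_ratings[item]
        total_weight += w
    return weighted_sum / total_weight if total_weight else None

ratings = {
    "ann": {"A": 5, "B": 1},
    "bob": {"A": 5, "B": 1, "C": 4},  # tastes similar to ann's
    "cat": {"A": 1, "B": 5, "C": 2},  # tastes unlike ann's
}
print(predict_rating(ratings, "ann", "C"))
```

The prediction for ann lands closer to bob's rating than to cat's, because bob's past preferences match hers more closely.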
End of Chapter
Figure 20.01
Figure 20.02
Figure 20.03
Figure 20.05