Application
KDD process
[Figure: Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
[Figure: layers, bottom to top, with the roles working at them]
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
- Data Warehouses / Data Marts: OLAP, Multi-dimensional Analysis
- Data Exploration: Statistical Analysis, Querying and Reporting
- Roles, bottom to top: DBA, Data Analyst, Business Analyst, End User
Architecture of a typical data mining system
[Figure: Database, Data warehouse -> Data cleaning, Data integration, Filtering -> Database or data warehouse server -> Pattern evaluation, supported by a Knowledge base]
A relational database fragment
Customer (Cust_ID, Name, Address, Age, Income, Credit_info)
Item (item_ID, Name, Brand, Category, Type, Price, Supplier)
Employee (Emp_ID, Name, Department, Group, Salary, Commission)
Purchases (Trans_ID, Cust_id, Emp_id, Date, Time, Pay_method, amount)
Items_sold (Trans_ID, Item_ID, Qty)
Queries
List of items sold in previous quarter
Total sales last month, grouped by salesperson
Number of sales transactions in December
Salesperson with highest amount of sales
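The queries listed above map naturally onto SQL. The following is a minimal sketch using Python's built-in sqlite3 module. The table layout is a cut-down, hypothetical version of the Purchases and Employee tables (the column name sale_date and all rows are invented for illustration), and December of the sample year stands in for "last month".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE purchases (
    trans_id  INTEGER PRIMARY KEY,
    emp_id    INTEGER REFERENCES employee(emp_id),
    sale_date TEXT,   -- ISO date string (assumed format)
    amount    REAL
);
INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO purchases VALUES
    (100, 1, '2023-11-28', 50.0),
    (101, 1, '2023-12-05', 120.0),
    (102, 2, '2023-12-09', 80.0),
    (103, 2, '2023-12-20', 30.0);
""")

# Number of sales transactions in December.
december_count = conn.execute(
    "SELECT COUNT(*) FROM purchases WHERE strftime('%m', sale_date) = '12'"
).fetchone()[0]

# Total sales in the month, grouped by salesperson, largest total first.
totals_by_salesperson = conn.execute("""
    SELECT e.name, SUM(p.amount) AS total
    FROM purchases AS p JOIN employee AS e ON e.emp_id = p.emp_id
    WHERE p.sale_date BETWEEN '2023-12-01' AND '2023-12-31'
    GROUP BY e.name
    ORDER BY total DESC
""").fetchall()

# Salesperson with the highest amount of sales: first row of the ordered result.
top_salesperson = totals_by_salesperson[0][0]
```

Each of these is a fixed, precisely specified question over detailed data, which is what an operational relational database is suited to answer.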
Data warehouse
Integrates data from various sources
Data organized from a historical perspective
Presents different levels of summarized data
Multi-dimensional structure
dimension: attribute
cell: aggregate measures
[Figure: data sources (e.g. Data source in Vancouver) -> Clean, Transform, Integrate, Load -> Data warehouse -> Query and analysis tools -> Clients]
Multi-dimensional data
[Figure: sales cube with dimensions Address (Chicago, New York, Toronto, Vancouver), Time (Q1 to Q4), and Item-types (T1 to T6)]
- Drill down on data for Q1: the Time axis shows months (Jan, Feb, March)
- Roll-up on Address: cities are grouped into countries (USA, Canada)
Data Warehouse
A decision support database that is maintained
separately from the organization's operational
database
Collection of data that is
- subject-oriented
- integrated
- time-variant
- nonvolatile
OLTP vs. OLAP
                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function                                          decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date;          historical;
                    detailed, flat relational;    summarized, multidimensional;
                    isolated                      integrated, consolidated
usage               repetitive                    ad-hoc
access              read/write;
                    index on primary key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput
Multi-dimensional cube
Sales by Item, Time, Location, Supplier
[Figure: SALES cube with dimensions Time, Item, Location, Supplier]
Lattice of cuboids
- 0-D (apex) cuboid: all
- 1-D cuboids: time; item; location; supplier
- 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
- 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
- 4-D (base) cuboid: time,item,location,supplier
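The lattice of cuboids can be enumerated mechanically: every subset of the dimension set is one cuboid. The sketch below does this with itertools.combinations; the function name cuboids is an illustrative choice, and concept hierarchies within a dimension are deliberately ignored.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

def cuboids(dims):
    """Return {k: list of k-dimension tuples}, one tuple per k-D cuboid."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

lattice = cuboids(dimensions)

# With n dimensions and no hierarchies there are 2**n cuboids in total.
total = sum(len(group) for group in lattice.values())

for k, group in lattice.items():
    print(f"{k}-D cuboids:", [",".join(c) or "all" for c in group])
```

The empty tuple at k = 0 is the apex cuboid (everything aggregated), and the single tuple at k = 4 is the base cuboid.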
yearly data (keep all data)
weekly data (up to 7 years)
quarterly data (up to 20 years)
monthly data (up to 15 years)
special event effects (up to 30 years)
- Snowflake schema:
- Fact constellations:
Example of a star schema
Sales Fact Table (time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales)
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_state, country)
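A query against a star schema typically joins the fact table to the dimension tables it needs and aggregates a measure. The following is a rough sketch with sqlite3, using a reduced version of the schema (only some dimension attributes); the table names with a _dim suffix and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2023), (2, 'Q2', 2023);
INSERT INTO item_dim VALUES (1, 'TV', 'Acme');
INSERT INTO location_dim VALUES
    (1, 'Chicago', 'USA'), (2, 'Toronto', 'Canada'), (3, 'New York', 'USA');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 10, 1000.0),
    (1, 1, 3,  5,  500.0),
    (1, 1, 2,  4,  400.0),
    (2, 1, 1,  7,  700.0);
""")

# Star join: fact table joined to two dimensions, measure summed per group.
rows = conn.execute("""
    SELECT l.country, t.quarter, SUM(f.dollars_sold) AS dollars
    FROM sales_fact AS f
    JOIN time_dim     AS t ON t.time_key = f.time_key
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.country, t.quarter
    ORDER BY l.country, t.quarter
""").fetchall()
```

Because each dimension is a single denormalized table, no query needs more than one join per dimension; a snowflake schema would add further joins to reach normalized attributes such as the city's country.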
Example of a snowflake schema
Sales Fact Table (time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city_key)
city (city_key, city, province_or_state, country)
Example of a fact constellation
Sales Fact Table (time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales)
Shipping Fact Table (time_key, item_key, shipper_key, from_location, to_location; Measures: dollars_cost, units_shipped)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_state, country)
shipper (shipper_key, shipper_name, location_key, shipper_type)
Data mart
- Department-wide scope
- Departmental subset of data warehouse
- Star, snowflake schema
Computing Measures
Measure: numerical value at each point in the data cube
Measure types
Distributive: E.g., count(), sum(), min(), max().
Result derived by applying the function to n aggregate values is
the same as that derived by applying the function on all the data
without partitioning.
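The distributive property can be checked on a small example. In this sketch the numbers and the way the data is partitioned are arbitrary; the point is only that combining per-partition aggregates gives the same result as aggregating the whole data set at once.

```python
data = [4, 8, 15, 16, 23, 42]
partitions = [data[0:2], data[2:4], data[4:6]]  # arbitrary split into 3 parts

# sum, min, max: aggregate each partition, then apply the same function
# to the partial results.
sum_of_sums = sum(sum(p) for p in partitions)
min_of_mins = min(min(p) for p in partitions)
max_of_maxes = max(max(p) for p in partitions)

# count: the partial counts are combined by adding them up.
count_from_parts = sum(len(p) for p in partitions)

print(sum_of_sums == sum(data), min_of_mins == min(data),
      max_of_maxes == max(data), count_from_parts == len(data))
```

This is what lets a warehouse compute higher-level cuboids from already aggregated lower-level ones instead of rescanning the detailed data.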
Concept Hierarchy
Example: Location dimension
Levels: all > region > country > city
all
  Europe
    Germany
      Frankfurt
    Spain
  North_America
    Canada
      Vancouver
      Toronto
    Mexico
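One simple way to represent the location hierarchy is a child-to-parent map, which makes it easy to generalize a city to any higher level. This is only a sketch: the names PARENT, LEVELS, and generalize are illustrative, and the sales figures are made up.

```python
from collections import Counter

# child -> parent links: city -> country -> region -> all
PARENT = {
    "Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada",
    "Germany": "Europe", "Spain": "Europe",
    "Canada": "North_America", "Mexico": "North_America",
    "Europe": "all", "North_America": "all",
}
LEVELS = ["city", "country", "region", "all"]

def generalize(city, level):
    """Climb from a city to its ancestor at the requested level."""
    value = city
    for _ in range(LEVELS.index(level)):
        value = PARENT[value]
    return value

# Hypothetical sales per city, aggregated to the region level.
sales_by_city = {"Frankfurt": 10, "Vancouver": 20, "Toronto": 30}
sales_by_region = Counter()
for city, units in sales_by_city.items():
    sales_by_region[generalize(city, "region")] += units
```

Here generalize("Toronto", "country") yields "Canada", and the regional totals are obtained by summing the city figures that share an ancestor.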
Concept hierarchies
Full or partial ordering
- Industry > Category > Product
- Region > Country > City > Office
- Year > Quarter > Month > Day, with Week > Day as a separate branch
Set-grouping hierarchy
- e.g. price: the range ($0..$1000] is subdivided into narrower price ranges at each lower level
OLAP examples
Sales volume as a function of product,
month, and region
Dimensions and their hierarchical summarization paths
- Product: Category > Product
- Location: Region > Country > City > Office
- Time: Year > Quarter > Month > Day
[Figure: sample data cube with a Date dimension (1Qtr, 2Qtr, 3Qtr, 4Qtr) and sum cells]
[Figure: OLAP operations on a sales cube with dimensions Address (Chicago, New York, Toronto, Vancouver), Time (Q1 to Q4), and Item-types (T1 to T6)]
- Roll-up on Address: cities are aggregated into countries (USA, Canada)
- Drill down on data for Q1: quarters are broken down into months (Jan, Feb, March)
- Dice for (location in {Chicago, Toronto} and time in {Q1} and item in {T3, T8})
- Slice for Time in {Q1}
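The roll-up, slice, and dice operations from the figure can be sketched on a tiny cube stored as a dictionary of cells. The cell values are invented, and the function names are illustrative rather than taken from any OLAP product.

```python
from collections import defaultdict

# Base cells: (city, quarter, item_type) -> units sold (made-up figures).
cube = {
    ("Chicago", "Q1", "T1"): 10, ("Chicago", "Q2", "T1"): 20,
    ("Toronto", "Q1", "T1"): 5,  ("Toronto", "Q1", "T3"): 7,
    ("New York", "Q1", "T3"): 8, ("Vancouver", "Q2", "T3"): 4,
}
COUNTRY = {"Chicago": "USA", "New York": "USA",
           "Toronto": "Canada", "Vancouver": "Canada"}

def roll_up_location(cells):
    """Roll up the location dimension from city to country, summing units."""
    out = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        out[(COUNTRY[city], quarter, item)] += units
    return dict(out)

def slice_time(cells, quarter):
    """Slice: fix one value of the time dimension and drop that dimension."""
    return {(c, i): u for (c, q, i), u in cells.items() if q == quarter}

def dice(cells, cities, quarters, items):
    """Dice: keep the sub-cube selected by value sets on all three dimensions."""
    return {k: u for k, u in cells.items()
            if k[0] in cities and k[1] in quarters and k[2] in items}

by_country = roll_up_location(cube)
q1_slice = slice_time(cube, "Q1")
diced = dice(cube, {"Chicago", "Toronto"}, {"Q1"}, {"T3", "T8"})
```

Drill-down is the reverse of roll-up and cannot be derived from the aggregated cells alone; it needs the finer-grained data (for example monthly cells) to be stored or recomputed.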
[Figure: dimensions of a query model, each with its levels]
- Shipping Method: AIR-EXPRESS, TRUCK
- Customer: CONTRACTS, ORDER
- Time
- Product: PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP
- Location: CITY, COUNTRY, REGION
- Organization: SALES PERSON, DISTRICT, DIVISION
- Promotion
Multi-Tiered DW Architecture
[Figure, left to right]
- Data Sources: operational DBs, other sources
- Extract, Transform, Load, Refresh (Monitor & Integrator, Metadata)
- Data Storage: Data Warehouse, Data Marts, serving an OLAP Server
- Front end: Analysis, Query, Reports, Data mining
Data Mart
- a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to
specific, selected groups, e.g. a marketing data mart
- Independent vs. dependent (sourced directly from the warehouse) data mart
Virtual warehouse
- A set of views over operational databases
- Only some of the possible summary views may be
materialized
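The difference between a virtual summary view and a materialized one can be illustrated with sqlite3. SQLite has no materialized views, so in this sketch a snapshot table stands in for one; the orders table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'east', 100.0), (2, 'west', 40.0), (3, 'east', 60.0);

-- Virtual summary: recomputed from the operational table on every query.
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

-- Stand-in for a materialized view: a stored snapshot of the summary.
CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region;

-- A later operational update.
INSERT INTO orders VALUES (4, 'west', 10.0);
""")

live = dict(conn.execute("SELECT region, total FROM sales_by_region"))
snapshot = dict(conn.execute("SELECT region, total FROM sales_by_region_mat"))
```

The view always reflects the current operational data but puts the aggregation load on the operational database at query time; the snapshot is cheap to read but stale until it is refreshed.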
Data Marts
Data warehouse designed to meet the needs
of a specific group of users
Should (but may not) be designed with
corporate standards and accessibility in mind
- incorporate standards for hardware,
software, networking, DBMS, naming
conventions, etc.
- vendors attempt to bypass IT and sell
directly to end-users?
[Figure: Data Marts and an Enterprise Data Warehouse combined into a Multi-Tier Data Warehouse, with Model refinement at each stage]
Metadata Repository
Metadata is the data defining warehouse objects. It includes the
following kinds:
- Description of the structure of the warehouse:
schema, views, dimensions, hierarchies, derived data definitions, data mart
locations and contents
- Operational metadata:
data lineage (history of migrated data and transformation path), currency
of data (active, archived, or purged), monitoring information (warehouse
usage statistics, error reports, audit trails)
- Data related to system performance
- Business data:
business terms and definitions, ownership of data, charging policies
Advanced examples
[Figure: Sales measures (#units, $value) by Supplier and Product; a variant reports %sales]
Ordering
Group sales by contiguous 10-day intervals.
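Grouping sales by contiguous 10-day intervals amounts to mapping each sale date to a bucket number and summing per bucket. The sketch below assumes the intervals are counted from a chosen start date; the function name and the sample rows are invented for illustration.

```python
from collections import defaultdict
from datetime import date

# (sale date, units sold): made-up sample rows.
sales = [
    (date(2024, 1, 1), 5),
    (date(2024, 1, 10), 3),
    (date(2024, 1, 11), 4),
    (date(2024, 1, 25), 2),
]

def group_by_10_days(rows, start):
    """Sum units per contiguous 10-day interval, counted from `start`."""
    totals = defaultdict(int)
    for day, units in rows:
        bucket = (day - start).days // 10  # 0 for days 1-10, 1 for days 11-20, ...
        totals[bucket] += units
    return dict(totals)

buckets = group_by_10_days(sales, date(2024, 1, 1))
```

Intervals like these do not line up with any level of the usual Year > Quarter > Month > Day hierarchy, which is why the grouping has to be computed from the ordering of the dates rather than looked up in a dimension table.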