
Data warehousing

Han, J. and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.

KDD process
[Figure: stages of the KDD process, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation; Application]

Data mining is the process of discovering
interesting knowledge from large amounts
of data stored in databases, data
warehouses and/or other information
repositories.

Data mining and Business Intelligence
Layers, top to bottom:
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, Multi-dimensional Analysis
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Users, top to bottom: End User, Business Analyst, Data Analyst, DBA

Architecture of a typical data mining system
Layers, top to bottom:
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Database or data warehouse server
- Data cleaning, data integration, filtering
- Database, Data warehouse
Knowledge base (alongside pattern evaluation and the data mining engine)

A relational database fragment
Customer (Cust_ID, Name, Address, Age, Income, Credit_info)
Item (Item_ID, Name, Brand, Category, Type, Price, Supplier)
Employee (Emp_ID, Name, Department, Group, Salary, Commission)
Purchases (Trans_ID, Cust_ID, Emp_ID, Date, Time, Pay_method, Amount)
Items_sold (Trans_ID, Item_ID, Qty)

Queries
List of items sold in previous quarter
Total sales last month, grouped by salesperson
Number of sales transactions in December
Salesperson with highest amount of sales
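
Queries of this kind can be written in SQL against the relational fragment. The following is a minimal sketch using Python's built-in sqlite3 module. The rows, dates, and the reduced `purchases` table are invented for illustration; only the column names follow the schema fragment.

```python
import sqlite3

# In-memory database holding a cut-down, hypothetical Purchases table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (trans_id INTEGER PRIMARY KEY, cust_id INTEGER,
                        emp_id INTEGER, date TEXT, amount REAL);
INSERT INTO purchases VALUES
  (1, 10, 100, '2000-12-03', 250.0),
  (2, 11, 101, '2000-12-17', 400.0),
  (3, 10, 100, '2000-11-28', 120.0),
  (4, 12, 101, '2000-12-21', 80.0);
""")

# Number of sales transactions in December.
december_count = conn.execute(
    "SELECT COUNT(*) FROM purchases WHERE strftime('%m', date) = '12'"
).fetchone()[0]

# Total sales grouped by salesperson (over all sample rows).
sales_by_emp = dict(conn.execute(
    "SELECT emp_id, SUM(amount) FROM purchases GROUP BY emp_id"
).fetchall())

# Salesperson with the highest amount of sales.
top_emp = conn.execute(
    "SELECT emp_id FROM purchases GROUP BY emp_id "
    "ORDER BY SUM(amount) DESC LIMIT 1"
).fetchone()[0]

print(december_count, sales_by_emp, top_emp)
```

Each query touches only current, detailed rows, which is the access pattern an operational database is tuned for.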

Data warehouse
Integrates data from various sources
Data organized on a historical perspective
Presents different levels of summarized data
Multi-dimensional structure
dimension: attribute
cell: aggregate measures

Typical data warehouse architecture
[Figure: data sources in Chicago, New York, Toronto and Vancouver feed a Clean, Transform, Integrate, Load step; the result is stored in the data warehouse, which clients reach through query and analysis tools]

Multi-dimensional data
[Figure: a sales cube with dimensions location (Chicago, New York, Toronto, Vancouver), time (Q1, Q2, Q3, Q4) and item type (T1 to T6). Drill down on the data for Q1 gives months (Jan, Feb, March); roll-up on Address gives countries (USA, Canada).]

Data Warehouse
A decision support database that is maintained
separately from the organization's operational
database
Collection of data that is
- subject-oriented
- integrated
- time-variant
- nonvolatile

Data Warehouse Subject-Oriented
Organized around major subjects, such as customer,
product, sales.
Focuses on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.
Provides a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process.

Data Warehouse Integrated
Constructed by integrating multiple, heterogeneous
data sources
- relational databases, flat files, on-line transaction records

Data cleaning and data integration
- Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
e.g., Hotel price: currency, tax, breakfast covered, etc.
- Data is converted when moved to the warehouse.

Data Warehouse Time Variant
The time horizon for a data warehouse is significantly
longer than that of operational systems.
- Operational database: current value data.
- Data warehouse: provides information from a historical
perspective (e.g., past 5-10 years)

Every key structure in the data warehouse contains an
element of time, explicitly or implicitly.

Data Warehouse Non-Volatile


A physically separate store of data transformed from the
operational environment.
Operational update of data does not occur in the data
warehouse environment.
- Does not require transaction processing, recovery, and
concurrency control mechanisms
- Requires only two operations in data accessing:
initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration
- Build wrappers/integrators on top of heterogeneous databases
- Query-driven approach
When a query is posed to a client site, a meta-dictionary is
used to translate the query into queries appropriate for the
individual heterogeneous sites involved, and the results are
integrated into a global answer set
Requires complex information filtering and competes for resources
with local processing

Data warehouse: update-driven, high performance
- Information from heterogeneous sources is integrated in advance
and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS


OLTP (on-line transaction processing)
- Major task of traditional relational DBMS
- Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.

OLAP (on-line analytical processing)


- Major task of data warehouse system
- Data analysis and decision making

Distinct features (OLTP vs. OLAP):


- Users and system orientation: transaction vs. decision support
- Data contents: current, detailed vs. historical, consolidated
- Database design: ER + application vs. star schema + subject
- View: current, local vs. evolutionary, integrated
- Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

                     OLTP                            OLAP
users                clerk, IT professional          knowledge worker
function             day to day operations           decision support
DB design            application-oriented            subject-oriented
data                 current, up-to-date;            historical;
                     detailed, flat relational;      summarized, multidimensional;
                     isolated                        integrated, consolidated
usage                repetitive                      ad-hoc
access               read/write;                     multiple large scans
                     index on primary key
unit of work         short, simple transaction       complex query
# records accessed   tens                            millions
# users              thousands                       hundreds
DB size              100MB-GB                        100GB-TB
metric               transaction throughput          query throughput, response

Why Separate Data Warehouse?
Maintain high performance for both systems
- DBMS tuned for OLTP: access methods, indexing, concurrency
control, recovery
- Warehouse tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.

Different data and function
- missing data: decision support requires historical data which
operational DBs do not typically maintain
- data consolidation: decision support requires consolidation (aggregation,
summarization) of data from heterogeneous sources
- data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled

From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data
model, which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled
and viewed in multiple dimensions
- Dimension tables, such as item (item_name, brand, type), or
time (day, week, month, quarter, year)
- Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables

In data warehousing literature, an n-D base cube is called a base cuboid.
The lattice of cuboids forms a data cube.

Multi-dimensional cube
Sales by Item, Time, Location, Supplier
Time

Item

SALES

Supplier

Location

Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
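
The lattice can be enumerated mechanically: every subset of the dimension set is one cuboid, so n dimensions without concept hierarchies give 2^n cuboids. A minimal sketch follows; the variable names are chosen here for illustration.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Group the cuboids by how many dimensions they keep (0-D apex to n-D base).
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    # The empty combination is the apex cuboid, conventionally named "all".
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D cuboids ({len(cuboids)}): {names}")

total_cuboids = sum(len(c) for c in lattice.values())
print("total:", total_cuboids)
```

For the four dimensions of the sales cube this yields 1 + 4 + 6 + 4 + 1 = 16 cuboids.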

[Figure: levels of summarized data and their retention periods]
yearly data (keep all data)
old monthly data (archived)
retail monthly data (up to 15 years)
old weekly data (archived)
weekly data (up to 7 years)
old detailed data (archived)
quarterly data (up to 20 years)
old quarterly data (archived)
monthly data (up to 15 years)
old monthly data (archived)
special event effects (up to 30 years)
current detailed data (up to 3 years)

Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
- Star schema: A fact table in the middle connected to a set of
dimension tables
- Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to a snowflake
- Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called galaxy
schema or fact constellation

Star Schema Example

Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, avg_sales, dollars_sold

time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
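
A star schema is typically queried by joining the fact table to the dimension tables it references and aggregating the measures. The sketch below uses Python's built-in sqlite3 module with only two of the four dimensions. The table names `time_dim`, `item_dim`, `sales_fact` and all rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds foreign keys to the dimensions plus the measures.
CREATE TABLE sales_fact (time_key INTEGER REFERENCES time_dim,
                         item_key INTEGER REFERENCES item_dim,
                         units_sold INTEGER, dollars_sold REAL);
INSERT INTO time_dim VALUES (1, 'Q1', 2000), (2, 'Q2', 2000);
INSERT INTO item_dim VALUES (1, 'TV-21', 'Acme'), (2, 'PC-XT', 'Bolt');
INSERT INTO sales_fact VALUES (1, 1, 3, 900.0), (1, 2, 1, 1200.0),
                              (2, 1, 2, 600.0), (1, 1, 1, 300.0);
""")

# Star join: one join per dimension, then aggregate the measures.
rows = conn.execute("""
    SELECT t.quarter, i.brand, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN item_dim AS i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.brand
    ORDER BY t.quarter, i.brand
""").fetchall()

for row in rows:
    print(row)
```

Because each dimension is a single table, every additional dimension costs exactly one more join.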

Snowflake Schema example

Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, avg_sales, dollars_sold

time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_state, country

Fact Constellation example

Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
Measures: dollars_cost, units_shipped

time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type

Multiple fact tables, sharing dimensions
Collection of stars - fact constellation or
galaxy schema

Data warehouse vs. data marts


Data warehouse
- Enterprise-wide scope
- Subjects that span the organization
- Fact constellation used to model multiple, interrelated
subjects

Data mart
- Department-wide scope
- Departmental subset of data warehouse
- Star, snowflake schema

Star schema is more efficient and thereby popular

Computing Measures
Measure: numerical value at each point in the data cube

e.g. <time=Q1, location=Chicago, item=xyz>: avg-

Need to be able to efficiently compute measures

Measure types
Distributive: E.g., count(), sum(), min(), max().
The result derived by applying the function to n aggregate values is
the same as that derived by applying the function to all the data
without partitioning.

Algebraic: E.g., avg(), min_N(), standard_deviation().
Can be computed by an algebraic function with M arguments
(where M is a bounded integer), each of which is obtained by
applying a distributive aggregate function.

Holistic: E.g., median(), mode(), rank().
There is no constant bound on the storage size needed to
describe a sub-aggregate: no algebraic function with M
arguments (constant M) characterizes the computation.
Can be difficult to compute efficiently, so approximate computation is often used.
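
The three measure types can be compared on a small partitioned data set. This is a sketch with invented values: sum() combines from partial sums, avg() from two distributive sub-aggregates (sum and count), while combining per-partition medians does not in general give the true median.

```python
from statistics import median

# Invented data, split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of per-partition sums equals the sum over all data.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

# Algebraic: avg() from two distributive sub-aggregates, sum() and count().
partial_counts = [len(p) for p in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: the median of per-partition medians is generally NOT the median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

Here the combined medians give 19.5 while the true median is 15.5, which is why holistic measures cannot be assembled from fixed-size sub-aggregates.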

Concept Hierarchy
Example: Location dimension
Levels: all > region > country > city
all
- Europe
  - Germany: Frankfurt
  - Spain
- North_America
  - Canada: Vancouver, Toronto
  - Mexico
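
A concept hierarchy like this can be stored as child-to-parent links, and climbing those links is what a roll-up does. A minimal sketch follows; the `parent` map mirrors the location example, and the sales figures are invented.

```python
# Child -> parent links for the location hierarchy (city < country < region < all).
parent = {
    "Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada",
    "Germany": "Europe", "Spain": "Europe",
    "Canada": "North_America", "Mexico": "North_America",
    "Europe": "all", "North_America": "all",
}

def ancestors(value):
    """Return the path from a value up to the root 'all'."""
    path = [value]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

# Hypothetical sales per city, rolled up one level to country.
city_sales = {"Frankfurt": 120, "Vancouver": 80, "Toronto": 200}
country_sales = {}
for city, amount in city_sales.items():
    country = parent[city]
    country_sales[country] = country_sales.get(country, 0) + amount

print(ancestors("Toronto"))
print(country_sales)
```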

Concept hierarchies
Full or partial ordering
- Product: Industry > Category > Product
- Location: Region > Country > City > Office
- Time: Year > Quarter > Month > Day, and Year > Week > Day

Set-grouping hierarchy
e.g. price: a tree of price ranges rooted at ($0..$1000]

Multiple hierarchies for an attribute
price: {inexpensive, moderately_priced, expensive}

OLAP examples
Sales volume as a function of product,
month, and region

Dimensions: Product, Location, Time
Hierarchical summarization paths
- Industry > Category > Product
- Region > Country > City > Office
- Year > Quarter > Month > Day

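A Sample Data Cube
[Figure: a 3-D cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum) and Country (U.S.A, Canada, Mexico, sum). One highlighted cell holds the total annual sales of TV in U.S.A.]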

Drill down, Roll up
[Figure: a sales cube with dimensions location (Chicago, New York, Toronto, Vancouver), time (Q1, Q2, Q3, Q4) and item type (T1 to T6). Drill down on the data for Q1 gives months (Jan, Feb, March); roll-up on Address gives countries (USA, Canada).]

Slicing and Dicing
Slice: selection on one dimension
e.g. slice for time in {Q1}
Dice: selection on two or more dimensions
e.g. dice for (location in {Chicago, Toronto}) and (time in {Q1}) and (item in {T3, T8})
[Figure: the sales cube by location (Chicago, New York, Toronto, Vancouver), time (Q1 to Q4) and item type (T1 to T6), shown with the slice and dice results]

Browsing a Data Cube


Visualization
OLAP
Interactive
manipulation

Typical OLAP Operations


Roll up (drill-up): summarize data
- by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
- from higher level summary to lower level summary or detailed
data, or introducing new dimensions
Slice and dice:
- project and select
Pivot (rotate):
- reorient the cube, visualization, 3D to series of 2D planes.
Other operations
- drill across: involving (across) more than one fact table
- drill through: through the bottom level of the cube to its backend relational tables (using SQL)
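
The main operations can be sketched on a tiny cube stored as a dictionary of cells. This is an illustration under stated assumptions, not how an OLAP server implements them: the cell values, the `country_of` map and the function names are invented.

```python
from collections import defaultdict

# Cube cells keyed by (location, quarter, item_type); values are invented.
cube = {
    ("Chicago", "Q1", "T1"): 10, ("Chicago", "Q2", "T1"): 20,
    ("Toronto", "Q1", "T1"): 5, ("Toronto", "Q1", "T3"): 7,
    ("Vancouver", "Q2", "T3"): 9,
}
country_of = {"Chicago": "USA", "New York": "USA",
              "Toronto": "Canada", "Vancouver": "Canada"}

def roll_up_location(cells):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), value in cells.items():
        out[(country_of[city], quarter, item)] += value
    return dict(out)

def slice_(cells, quarter):
    """Slice: select on one dimension (time)."""
    return {k: v for k, v in cells.items() if k[1] == quarter}

def dice(cells, locations, quarters, items):
    """Dice: select on two or more dimensions."""
    return {k: v for k, v in cells.items()
            if k[0] in locations and k[1] in quarters and k[2] in items}

def pivot(cells):
    """Pivot: reorient the axes, here swapping location and item."""
    return {(item, quarter, loc): v for (loc, quarter, item), v in cells.items()}

rolled = roll_up_location(cube)
q1 = slice_(cube, "Q1")
diced = dice(cube, {"Chicago", "Toronto"}, {"Q1"}, {"T3", "T8"})
print(rolled, q1, diced, sep="\n")
```

Drill down is the reverse of `roll_up_location` and needs the detailed cells to be kept, since a country total cannot be split back into cities.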

A Star-Net Query Model
[Figure: one line per dimension, with abstraction levels marked along each line]
Customer Orders: CONTRACTS, ORDER
Shipping Method: AIR-EXPRESS, TRUCK
Time: ANNUALLY, QTRLY, DAILY
Product: PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP
Organization: SALES PERSON, DISTRICT, DIVISION
Location: CITY, COUNTRY, REGION
Customer
Promotion

Data Warehouse Design: Four Views
Top-down view
selection of the relevant information necessary for the data
warehouse based on current and future needs

Data source view
exposes the information being captured, stored, and
managed by operational systems (E/R models, CASE, etc.)

Data warehouse view
fact tables and dimension tables, pre-calculated totals,
counts, etc. Source information, date, time for historical
context

Business query view
perspectives of data in the warehouse from the view of the end user

Data Warehouse Design Process


Top-down, bottom-up approaches or a combination
- Top-down: Starts with overall design and planning (mature)
- Bottom-up: Starts with experiments and prototypes (rapid)

From a software engineering point of view
- Waterfall: structured and systematic analysis at each step
- Spiral: rapid generation of increasingly functional systems, short
turnaround time

Typical data warehouse design process


- Choose a business process to model, e.g., orders, invoices, etc.
- Choose the grain (atomic level of data) of the business process
- Choose the dimensions that will apply to each fact table record
- Choose the measure that will populate each fact table record

Multi-Tiered DW Architecture
[Figure: tiers from left to right]
- Data Sources: operational DBs, other sources
- Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata)
- Data Storage: Data Warehouse, Data Marts
- OLAP Engine: OLAP Server
- Front-End Tools: Analysis, Query, Reports, Data mining

Three Data Warehouse Models


Enterprise warehouse
- collects all of the information about subjects spanning
the entire organization

Data Mart
- a subset of corporate-wide data that is of value to
specific groups of users. Its scope is confined to
specific, selected groups, such as a marketing data mart
Independent vs. dependent (directly from warehouse) data mart

Virtual warehouse
- A set of views over operational databases
- Only some of the possible summary views may be
materialized

Data Marts
Data warehouse designed to meet the needs
of a specific group of users
Should (but may not) be designed with
corporate standards and accessibility in mind
- incorporate standards for hardware,
software, networking, DBMS, naming
conventions, etc.
- vendors attempt to bypass IT and sell
directly to end-users?

Data Warehouse Development: A Recommended Approach
[Figure: define a high-level corporate data model; through model refinement, build data marts and an enterprise data warehouse; these lead to distributed data marts and a multi-tier data warehouse]

OLAP Server Types


Relational OLAP (ROLAP)
- Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middleware to support missing pieces
- Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
- greater scalability

Multidimensional OLAP (MOLAP)


- Array-based multidimensional storage engine (sparse matrix
techniques)
- fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP)


- User flexibility, e.g., low level: relational, high-level: array

Specialized SQL servers


- specialized support for SQL queries over star/snowflake schemas

Metadata Repository
Metadata is the data defining warehouse objects. It has the
following kinds:
- Description of the structure of the warehouse:
schema, view, dimensions, hierarchies, derived data definitions, data mart
locations and contents
- Operational metadata:
data lineage (history of migrated data and transformation path), currency
of data (active, archived, or purged), monitoring information (warehouse
usage statistics, error reports, audit trails)
- The algorithms used for summarization
- The mapping from operational environment to the data
warehouse
- Data related to system performance:
warehouse schema, view and derived data definitions
- Business data:
business terms and definitions, ownership of data, charging policies

Data Warehouse Back-End Tools, Utilities


Data extraction:
- get data from multiple, heterogeneous, and external
sources
Data cleaning:
- detect errors in the data and rectify them when
possible
Data transformation:
- convert data from legacy or host format to warehouse
format
Load
- sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
Refresh
- propagate the updates from the data sources to the
warehouse
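
The back-end steps can be sketched as a small pipeline, reusing the hotel-price example of inconsistent sources. Everything concrete here is an assumption for illustration: the two source layouts, the fixed exchange rate `CAD_TO_USD`, the `TAX_RATE`, and the list that stands in for the warehouse.

```python
# Two hypothetical sources with inconsistent naming, currency and tax handling.
source_ny = [{"hotel": "Plaza", "price_usd": 200.0, "tax_included": True}]
source_toronto = [
    {"Hotel_Name": "Royal", "price_cad": 150.0, "tax_included": False},
    {"Hotel_Name": None, "price_cad": 99.0, "tax_included": True},
]

CAD_TO_USD = 0.75  # assumed fixed rate, for illustration only
TAX_RATE = 0.10    # assumed tax rate, for illustration only

def extract():
    """Extract: gather raw records from every source, tagged with origin."""
    return [("ny", r) for r in source_ny] + [("toronto", r) for r in source_toronto]

def clean(records):
    """Clean: drop records with a missing hotel name."""
    return [(src, r) for src, r in records
            if r.get("hotel") or r.get("Hotel_Name")]

def transform(records):
    """Transform: unify names, convert to USD, make prices tax-inclusive."""
    out = []
    for src, r in records:
        name = r.get("hotel") or r.get("Hotel_Name")
        price = r["price_usd"] if src == "ny" else r["price_cad"] * CAD_TO_USD
        if not r["tax_included"]:
            price *= 1 + TAX_RATE
        out.append({"hotel": name, "price_usd": round(price, 2)})
    return out

warehouse = []  # stands in for a warehouse table

def load(rows):
    """Load: append rows and keep them sorted by hotel name."""
    warehouse.extend(rows)
    warehouse.sort(key=lambda row: row["hotel"])

load(transform(clean(extract())))
print(warehouse)
```

A refresh step would rerun the same pipeline on only the source records changed since the last load.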

Advanced examples

Exploration of Data Cubes


Hypothesis-driven: exploration by user, huge search space
Discovery-driven
- pre-computed measures indicate exceptions, guide user in
data analysis, at all levels of aggregation
- Exception: significantly different from the value anticipated, based
on a statistical model
- Visual cues such as background color are used to reflect the
degree of exception of each cell
- Computation of exception indicator can be included in cube
construction
SelfExp: degree of surprise in cell, relative to values at same levels of aggregation
InExp: degree of surprise somewhere beneath the cell, if we drill down
PathExp: degree of surprise for each drill down path from cell

Example: Discovery-driven exploration

Complex Aggregation at Multiple Granularities
Ex. Total sales in 2000 by Item, Region, Month, with
subtotals
Ex. Grouping by all subsets of {item, region, month}, find
the maximum price in 2000 for each group, and the total
sales generated by all maximum-price-sales
Ex. Among the max-price-sales, find the min and max
shelf life. Find the fraction of the total sales due to cases
that have min shelf life.
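
The first example, totals with subtotals over every subset of {item, region, month}, can be sketched by aggregating once per subset of dimensions. The records are invented, and the marker "ALL" for a rolled-up dimension is a convention chosen here.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical sales records for the year 2000.
sales = [
    {"item": "TV", "region": "East", "month": "Jan", "amount": 100},
    {"item": "TV", "region": "West", "month": "Jan", "amount": 150},
    {"item": "PC", "region": "East", "month": "Feb", "amount": 300},
    {"item": "TV", "region": "East", "month": "Feb", "amount": 50},
]
dims = ("item", "region", "month")

def cube_totals(rows):
    """Total amount for every subset of dims; 'ALL' marks a rolled-up dimension."""
    totals = defaultdict(int)
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            for row in rows:
                key = tuple(row[d] if d in kept else "ALL" for d in dims)
                totals[key] += row["amount"]
    return dict(totals)

totals = cube_totals(sales)
print(totals[("ALL", "ALL", "ALL")])   # grand total
print(totals[("TV", "ALL", "ALL")])    # subtotal for one item
print(totals[("ALL", "East", "Feb")])  # subtotal for one region and month
```

This scans the data once per group-by, which is simple but repeats work; a real cube computation would derive coarser groupings from finer ones.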

Advanced examples
[Figure: Sales (#units, $value) viewed by Supplier and Product]
- Sales volume as a % of total units sold of Product
- Ordering: group sales by contiguous 10-day intervals
- 10-day moving average of Sales, by Product
- Order Products by Sales-$ and group into deciles of decreasing performance
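
These ordering-based aggregates can be sketched in a few lines. The function names, the trailing-window convention for the moving average, and all figures are assumptions for illustration.

```python
def moving_average(values, window):
    """Trailing moving average; the first window-1 positions yield None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def interval_totals(values, size):
    """Totals over contiguous, non-overlapping intervals of `size` days."""
    return [sum(values[i:i + size]) for i in range(0, len(values), size)]

def deciles(sales_by_product):
    """Rank products by sales (descending); decile 1 is best, 10 is worst."""
    ranked = sorted(sales_by_product, key=sales_by_product.get, reverse=True)
    n = len(ranked)
    return {p: (rank * 10) // n + 1 for rank, p in enumerate(ranked)}

daily = list(range(1, 21))  # 20 days of invented daily sales: 1, 2, ..., 20
ma10 = moving_average(daily, 10)
totals10 = interval_totals(daily, 10)
product_sales = {f"P{i}": 100 - i for i in range(20)}  # invented figures
decile_of = deciles(product_sales)

print(ma10[9], ma10[19], totals10, decile_of["P0"], decile_of["P19"])
```

Unlike plain group-by aggregates, all three depend on the order of the rows, which is why they need special support in an OLAP engine.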
