
Data Warehousing & Data Mining

Balaram Singh
Computer Point Educational Ltd.

09/25/10 1
The Architecture of Data

Figure: layers of data, from top to bottom.
• Business rules: what has been learned from data
• Metadata: logical model
• Database schema: physical layout of data
• Summary data: summaries by who, what, when, where, ...
• Operational data: who, what, when, where

Data Warehouse Architecture

Figure: data flows from left to right through four tiers.
• Information sources: operational DBs, semistructured sources
• Data warehouse server: data warehouse and data marts, fed by extract, transform, load, refresh, etc.
• OLAP servers: MOLAP, ROLAP, served from the warehouse
• Clients: analysis, query/reporting, data mining

Data Warehouse vs. Data Marts

What comes first?

From the Data Warehouse to Data Marts

Figure: levels from information (top) to data (bottom).
• Individually structured: less history, normalization and detail
• Departmentally structured
• Organizationally structured (data warehouse): more history, normalization and detail

Data Warehouse and Data Marts

Figure:
• OLAP / data mart: lightly summarized, departmentally structured
• Data warehouse: detailed, atomic data, organizationally structured

Data Marts

Data Integrity Problems
• Same person, different spellings
◦ Agarwal, Agrawal, Aggarwal, etc.
• Multiple ways to denote a company name
◦ Persistent Systems, PSPL, Persistent Pvt. Ltd.
• Use of different names
◦ Mumbai, Bombay
• Different account numbers generated by different applications for the same customer
• Required fields left blank
• Invalid product codes collected at point of sale
◦ manual entry leads to mistakes
◦ “in case of a problem use 9999999”

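The integrity problems listed above can be detected and partly repaired with simple lookup tables. The sketch below is only an illustration: the field names (`surname`, `city`, `product_code`) and the lookup tables are assumptions, not part of the original slides.

```python
# Hypothetical lookup tables; a real project would curate these from its own data.
SURNAME_VARIANTS = {"agarwal": "Agarwal", "agrawal": "Agarwal", "aggarwal": "Agarwal"}
CITY_ALIASES = {"bombay": "Mumbai", "mumbai": "Mumbai"}
SENTINEL_CODES = {"9999999"}  # the "in case of a problem" code from the slide

def clean_record(rec):
    """Return (cleaned_record, problems) for one source row given as a dict."""
    out, problems = dict(rec), []
    # Map spelling variants and aliases to one canonical form.
    surname = str(rec.get("surname", "")).strip()
    out["surname"] = SURNAME_VARIANTS.get(surname.lower(), surname)
    city = str(rec.get("city", "")).strip()
    out["city"] = CITY_ALIASES.get(city.lower(), city)
    # Report required fields that were left blank.
    for field in ("surname", "city", "product_code"):
        if not str(rec.get(field, "")).strip():
            problems.append(f"missing {field}")
    # Report placeholder codes typed in at the point of sale.
    if rec.get("product_code") in SENTINEL_CODES:
        problems.append("sentinel product code")
    return out, problems
```

Records with a non-empty problem list would typically be logged for review rather than loaded silently.
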
The ETL Process
• Capture
• Scrub (data cleansing)
• Transform
• Load and index

ETL = extract, transform, and load

Steps in data reconciliation

Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract

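The difference between the two capture modes can be sketched as follows. The source rows and the `updated_at` change-tracking column are assumptions made for the illustration; real sources may expose changes through logs or triggers instead.

```python
from datetime import datetime

# Hypothetical source rows; `updated_at` is an assumed change-tracking column.
SOURCE = [
    {"id": 1, "name": "bolt", "updated_at": datetime(2010, 9, 1)},
    {"id": 2, "name": "nut", "updated_at": datetime(2010, 9, 20)},
    {"id": 3, "name": "screw", "updated_at": datetime(2010, 9, 24)},
]

def static_extract(rows):
    """Snapshot of the whole chosen subset at this point in time."""
    return [dict(r) for r in rows]

def incremental_extract(rows, last_extract_time):
    """Only the rows that changed after the previous extract."""
    return [dict(r) for r in rows if r["updated_at"] > last_extract_time]

snapshot = static_extract(SOURCE)
changes = incremental_extract(SOURCE, datetime(2010, 9, 15))
```

Here `snapshot` holds all three rows, while `changes` holds only the rows touched after 15 September.
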
Steps in data reconciliation (continued)

Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Steps in data reconciliation (continued)

Transform = convert data from the format of the operational system to the format of the data warehouse

Record-level:
• Selection: data partitioning
• Joining: data combining
• Aggregation: data summarization

Field-level:
• Single-field: from one field to one field
• Multi-field: from many fields to one, or one field to many

Steps in data reconciliation (continued)

Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to the data warehouse

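A minimal sketch of the two load modes, using Python's built-in `sqlite3` module as a stand-in warehouse. The `product` table, the index name and the changed rows are hypothetical; `INSERT OR REPLACE` is SQLite syntax, and other databases spell the same idea differently.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product (id TEXT PRIMARY KEY, name TEXT, price INTEGER)")
con.execute("CREATE INDEX idx_product_name ON product(name)")  # the "index" part of load/index

def refresh(rows):
    """Refresh mode: discard the target data and bulk-rewrite it."""
    con.execute("DELETE FROM product")
    con.executemany("INSERT INTO product VALUES (?, ?, ?)", rows)

def update(changed_rows):
    """Update mode: write only the changed rows (insert, or overwrite by key)."""
    con.executemany("INSERT OR REPLACE INTO product VALUES (?, ?, ?)", changed_rows)

refresh([("p1", "bolt", 10), ("p2", "nut", 5)])
update([("p2", "nut", 6), ("p3", "screw", 2)])  # hypothetical changes in the source
rows = con.execute("SELECT id, name, price FROM product ORDER BY id").fetchall()
```

After the update, `p1` is untouched, `p2` carries the new price and `p3` has been added.
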
Physical Structure of Data
Warehouse
There are three basic architectures for constructing a data warehouse:
• Centralized
• Federated

The data warehouse is distributed for load balancing, scalability and higher availability.

Physical Structure of Data
Warehouse
Client Client Client

Central
Data
Warehouse

Source Source

Centralized architecture

Physical Structure of Data Warehouse

Figure (top to bottom):
• End users
• Local data marts: Marketing, Financial, Distribution
• Logical data warehouse
• Source, Source

Federated architecture

Design Methodology for DW
• Nine-step methodology, proposed by Kimball

Step Activity
1 Choosing the process
2 Choosing the grain
3 Identifying and conforming the dimensions
4 Choosing the facts
5 Storing the precalculations in the fact table
6 Rounding out the dimension tables
7 Choosing the duration of the database
8 Tracking slowly changing dimensions
9 Deciding the query priorities and the query modes

Indexing a Data Warehouse

Index Structures
• Index structures applied in warehouses
◦ inverted lists
◦ bitmap indexes
◦ join indexes
◦ text indexes

Inverted Lists

Figure: an age index points to inverted lists of record ids, which point to the data records.

age index values: 18, 19, 20, 21, 22, 23, 25, 26

inverted lists:
20 → r4, r18, r34, r35
21 → r5, r19, r37, r40

data records:
rId name age
r4 joe 20
r18 fred 20
r19 sally 21
r34 nancy 20
r35 tom 20
r36 pat 25
r5 dave 21
r41 jeff 26
...

Inverted Lists
• Query:
◦ Get people with age = 20 and name = “fred”
• List for age = 20: r4, r18, r34, r35
• List for name = “fred”: r18, r52
• Answer is intersection: r18

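The query above can be sketched with sets standing in for the inverted lists. The records are those from the slides; `r52` (the second “fred”) is not shown in the data, so its age here is an assumption.

```python
from collections import defaultdict

# rId -> (name, age). r52's age is assumed; the slides only list it under name = "fred".
records = {
    "r4": ("joe", 20), "r18": ("fred", 20), "r19": ("sally", 21),
    "r34": ("nancy", 20), "r35": ("tom", 20), "r36": ("pat", 25),
    "r5": ("dave", 21), "r41": ("jeff", 26), "r52": ("fred", 30),
}

def build_inverted_index(records, field):
    """Map each value of the chosen field to the set of record ids holding it."""
    index = defaultdict(set)
    for rid, row in records.items():
        index[row[field]].add(rid)
    return index

by_name = build_inverted_index(records, 0)
by_age = build_inverted_index(records, 1)

# age = 20 AND name = "fred" is the intersection of the two lists.
answer = by_age[20] & by_name["fred"]
```

Only the record ids in `answer` need to be fetched from the base table.
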
Bitmap Indexes
• Bitmap index: an indexing technique that has attracted attention in multi-dimensional database implementation
• Example table:
Customer City Car
c1 Detroit Ford
c2 Chicago Honda
c3 Detroit Honda
c4 Poznan Ford
c5 Paris BMW
c6 Paris Nissan

Bitmap Indexes
• The index consists of bitmaps:
• Index on City:

row Chicago Detroit Paris Poznan
1 0 1 0 0
2 1 0 0 0
3 0 1 0 0
4 0 0 0 1
5 0 0 1 0
6 0 0 1 0

bitmaps

Bitmap Indexes
• Index on Car:

row BMW Ford Honda Nissan
1 0 1 0 0
2 0 0 1 0
3 0 0 1 0
4 0 1 0 0
5 1 0 0 0
6 0 0 0 1

bitmaps

Bitmap Indexes
• Index on a particular column
• Index consists of a number of bit
vectors - bitmaps
• Each value in the indexed column has its own bit vector (bitmap)
• The length of the bit vector is the number of records in the base table
• The i-th bit is set if the i-th row of the base table has that value in the indexed column

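The definition above translates directly into code. This sketch builds one bit string per distinct value for the Customer table from the slides; representing bitmaps as strings of "0" and "1" is a choice made for readability, not how a database would store them.

```python
# Customer table from the slides: (customer, city, car).
table = [
    ("c1", "Detroit", "Ford"), ("c2", "Chicago", "Honda"), ("c3", "Detroit", "Honda"),
    ("c4", "Poznan", "Ford"), ("c5", "Paris", "BMW"), ("c6", "Paris", "Nissan"),
]

def build_bitmap_index(rows, col):
    """Map each distinct value in column `col` to a bit string with one bit per row."""
    values = sorted({row[col] for row in rows})
    return {v: "".join("1" if row[col] == v else "0" for row in rows) for v in values}

city_index = build_bitmap_index(table, 1)
car_index = build_bitmap_index(table, 2)
```

Reading `city_index` column by column reproduces the City index table, and `car_index` the Car index table.
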
Bitmap Index

Figure: an age index points to bitmaps with one bit per data record.

age index values: 18, 19, 20, 21, 22, 23, 25, 26

bitmaps:
age = 20: 1101100000
age = 21: 0010001011

data records:
id name age
1 joe 20
2 fred 20
3 sally 21
4 nancy 20
5 tom 20
6 pat 25
7 dave 21
8 jeff 26
...

Using Bitmap indexes
• Query:
◦ Get people with age = 20 and name = “fred”
• Bitmap for age = 20: 1101100000
• Bitmap for name = “fred”: 0100000001
• Answer is intersection (bitwise AND): 0100000000
• Good if domain cardinality is small
• Bit vectors can be compressed

Using Bitmap indexes
• They allow the use of efficient bit
operations to answer some
queries
• “how many customers from Detroit have
car ‘Ford’”
◦ perform a bit-wise AND of two bitmaps:
answer – c1
• “how many customers have a car
‘Honda’”
◦ count 1’s in the bitmap - answer - 2
• Compression: bit vectors are usually sparse for large databases, so they compress well, at the cost of decompression when they are used

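Both example queries reduce to integer bit operations. In this sketch the bitmaps from the City and Car index slides are written as binary literals, with the leftmost bit standing for row 1 (customer c1); that bit ordering is a convention chosen here.

```python
N_ROWS = 6
detroit = 0b101000  # rows 1 and 3
ford = 0b100100     # rows 1 and 4
honda = 0b011000    # rows 2 and 3

def rows_set(bitmap, n_rows=N_ROWS):
    """Return the 1-based row numbers whose bit is set (leftmost bit = row 1)."""
    return [i + 1 for i in range(n_rows) if (bitmap >> (n_rows - 1 - i)) & 1]

# Customers from Detroit who have a Ford: bitwise AND of the two bitmaps.
detroit_and_ford = detroit & ford
matching_rows = rows_set(detroit_and_ford)

# How many customers have a Honda: count the 1 bits.
honda_count = bin(honda).count("1")
```

`matching_rows` contains row 1 (customer c1) and `honda_count` is 2, matching the answers on the slide.
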
Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE SALE.prodId = PRODUCT.id

sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4

product id name price
p1 bolt 10
p2 nut 5

joinTb prodId name price storeId date amt


p1 bolt 10 c1 1 12
p2 nut 5 c1 1 11
p1 bolt 10 c3 1 50
p2 nut 5 c2 1 8
p1 bolt 10 c1 2 44
p1 bolt 10 c2 2 4
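The joinTb result can be reproduced with Python's built-in `sqlite3` module. The join condition `sale.prodId = product.id` is what pairs each sale with its product; without it, listing both tables in FROM would return every combination of rows (the cross product). Column types are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
con.execute("CREATE TABLE product (id TEXT, name TEXT, price INTEGER)")
con.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])
con.executemany("INSERT INTO product VALUES (?, ?, ?)", [("p1", "bolt", 10), ("p2", "nut", 5)])

# Join: one output row per sale, extended with the matching product's attributes.
join_tb = con.execute(
    "SELECT s.prodId, p.name, p.price, s.storeId, s.date, s.amt "
    "FROM sale AS s JOIN product AS p ON s.prodId = p.id ORDER BY s.rowid"
).fetchall()

# For comparison: the cross product pairs every sale with every product.
cross_count = con.execute("SELECT COUNT(*) FROM sale, product").fetchone()[0]
```

The join returns 6 rows, one per sale, while the cross product has 6 × 2 = 12.
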

Join Indexes

join index
product id name price jIndex
p1 bolt 10 r1,r3,r5,r6
p2 nut 5 r2,r4

sale rId prodId storeId date amt


r1 p1 c1 1 12
r2 p2 c1 1 11
r3 p1 c3 1 50
r4 p2 c2 1 8
r5 p1 c1 2 44
r6 p1 c2 2 4
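A join index like the jIndex column above can be built in one pass over the sale table. The sketch keeps the tables as plain dictionaries; the function name is hypothetical.

```python
from collections import defaultdict

# sale: rId -> (prodId, storeId, date, amt), as on the slide.
sale = {
    "r1": ("p1", "c1", 1, 12), "r2": ("p2", "c1", 1, 11), "r3": ("p1", "c3", 1, 50),
    "r4": ("p2", "c2", 1, 8), "r5": ("p1", "c1", 2, 44), "r6": ("p1", "c2", 2, 4),
}

def build_join_index(sale_rows):
    """For each product id, list the ids of the sale rows that join with it."""
    j_index = defaultdict(list)
    for rid, (prod_id, *_rest) in sale_rows.items():
        j_index[prod_id].append(rid)
    return dict(j_index)

j_index = build_join_index(sale)

# Total amount for product p1, reading only the rows the join index points to.
bolt_total = sum(sale[rid][3] for rid in j_index["p1"])
```

With the index in place, answering a question about one product no longer requires scanning or joining the whole sale table.
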

Join Indexes
• Traditional indexes map the value to a list
of record ids. Join indexes map the tuples
in the join result of two relations to the
source tables.
• In data warehouse cases, join indexes
relate the values of the dimensions of a
star schema to rows in the fact table.
• For a warehouse with a Sales fact table and
dimension city, a join index on city maintains
for each distinct city a list of RIDs of the
tuples recording the sales in the city
• Join indexes can span multiple dimensions

Terms
• Fact Table
• Dimension Table
• Measure

Terms
• The relation that relates the dimensions to the measure of interest is called the fact table (e.g. sale)
• Information about dimensions can be represented as a collection of relations, called the dimension tables (product, customer, store)
• Each dimension can have a set of associated attributes

Conceptual Modeling of Data Warehouses
• Three basic conceptual schemas:
◦ Star schema
◦ Snowflake schema
◦ Fact constellations

Star schema

Star schema: a single object (fact table) in the middle connected to a number of dimension tables

Star schema

sale (fact table): orderId, date, custId, prodId, storeId, qty, amt
product: prodId, name, price
customer: custId, name, address, city
store: storeId, city

Star schema

product prodId name price
p1 bolt 10
p2 nut 5

store storeId city
c1 nyc
c2 sfo
c3 la

sale orderId date custId prodId storeId qty amt
o100 1/7/97 53 p1 c1 1 12
o102 2/7/97 53 p2 c1 2 11
o105 3/8/97 111 p1 c3 5 50

customer custId name address city
53 joe 10 main sfo
81 fred 12 main sfo
111 sally 80 willow la

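A typical star-schema query joins the fact table to the dimensions it needs and aggregates the measure. The sketch below loads the sample data into an in-memory SQLite database; the column types and the particular query (amount by product name and store city) are assumptions chosen for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale (orderId TEXT, date TEXT, custId INTEGER, prodId TEXT,
                   storeId TEXT, qty INTEGER, amt INTEGER);
""")
con.executemany("INSERT INTO product VALUES (?, ?, ?)", [("p1", "bolt", 10), ("p2", "nut", 5)])
con.executemany("INSERT INTO store VALUES (?, ?)", [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
con.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (53, "joe", "10 main", "sfo"), (81, "fred", "12 main", "sfo"), (111, "sally", "80 willow", "la"),
])
con.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("o100", "1/7/97", 53, "p1", "c1", 1, 12),
    ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
    ("o105", "3/8/97", 111, "p1", "c3", 5, 50),
])

# The fact table sits in the middle; each dimension is reached through its key.
amt_by_product_city = con.execute("""
    SELECT p.name, st.city, SUM(s.amt)
    FROM sale AS s
    JOIN product AS p ON s.prodId = p.prodId
    JOIN store AS st ON s.storeId = st.storeId
    GROUP BY p.name, st.city
    ORDER BY p.name, st.city
""").fetchall()
```

Every query of this shape touches the fact table once and each needed dimension once, which is what makes the star layout convenient for ad-hoc analysis.
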
Example of Star Schema

Sales Fact Table: Date, Product, Store, Customer; measurements: unit_sales, dollar_sales, schilling_sales
Date dimension: Date, Month, Year
Product dimension: ProductNo, ProdName, ProdDesc, Category, QOH
Store dimension: StoreID, City, State, Country, Region
Customer dimension: CustId, CustName, CustCity, CustCountry

Dimension Hierarchies
• For each dimension, the set of associated attributes can be structured as a hierarchy
◦ store → sType
◦ store → city → region
◦ customer → city → state → country

Dimension Hierarchies
• Client hierarchy: city → state → region

cities city state region
c1 CA East
c2 NY East
c3 SF West

Dimension Hierarchies

sType tId size location
t1 small downtown
t2 large suburbs

store storeId cityId tId mgr
s5 sfo t1 joe
s7 sfo t2 fred
s9 la t1 nancy

city cityId pop regId
sfo 1M north
la 5M south

region regId name
north cold region
south warm region

Snowflake Schema

Snowflake schema: a refinement of the star schema in which the dimensional hierarchy is represented explicitly by normalizing the dimension tables

Example of Snowflake Schema

Sales Fact Table: Date, Product, Store, Customer; measurements: unit_sales, dollar_sales, schilling_sales
Date (Date, Month) → Month (Month, Year) → Year (Year)
Product (ProductNo, ProdName, ProdDesc, Category, QOH)
Store (StoreID, City) → City (City, State) → State (State, Country) → Country (Country, Region)
Cust (CustId, CustName, CustCity, CustCountry)

Fact constellations

Fact constellations: multiple fact tables share dimension tables

Components of a star schema
• Fact tables contain factual or quantitative data
• Dimension tables contain descriptions about the subjects of the business
• Dimension tables are denormalized to maximize performance
• 1:N relationship between dimension tables and fact tables
• Excellent for ad-hoc queries, but bad for online transaction processing

Star schema example

Fact table provides statistics for sales broken down by product, period and store dimensions

Star schema with sample data

Cube computations

Figure: a cube with dimensions A, B, C and the ALL aggregate.

Group-bys: {ABC} {AB} {AC} {BC} {A} {B} {C} { }

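The list of group-bys is simply every subset of the dimensions, so n dimensions give 2^n of them. A short sketch using the standard library (the function name is hypothetical):

```python
from itertools import combinations

def all_group_bys(dimensions):
    """Every subset of the dimensions, largest first; the empty tuple is the grand total."""
    dims = list(dimensions)
    return [combo for k in range(len(dims), -1, -1) for combo in combinations(dims, k)]

group_bys = all_group_bys("ABC")
```

For A, B, C this yields the eight group-bys listed above, from (A, B, C) down to the empty grouping.
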
Dimension Hierarchies Computation

Figure: lattice of group-bys, with roll-up along the client hierarchy (city → state).
• all
• city; state; product; date
• city, product; city, date; state, product; state, date; product, date
• city, product, date; state, product, date

Multidimensional Data Model

Sales of products may be represented in one dimension (as a fact relation) or in two dimensions, e.g. clients and products

Multidimensional Data Model

Fact relation:
sale Product Client Amt
p1 c1 12
p2 c1 11
p1 c3 50
p2 c2 8

Two-dimensional cube (“-” marks an empty cell):
Product c1 c2 c3
p1 12 - 50
p2 11 8 -

Multidimensional Data Model

Fact relation:
sale Product Client Date Amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4

3-dimensional cube (“-” marks an empty cell):
day 1 c1 c2 c3
p1 12 - 50
p2 11 8 -

day 2 c1 c2 c3
p1 44 4 -
p2 - - -

Multidimensional Data Model and Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(Amt) FROM SALE WHERE Date = 1

sale Product Client Date Amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4

result
81

Multidimensional Data Model and Aggregates
• Add up amounts by day
• In SQL: SELECT Date, sum(Amt) FROM SALE GROUP BY Date

sale Product Client Date Amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4

result Date sum
1 81
2 48

Multidimensional Data Model and Aggregates
• Add up amounts by client, product
• In SQL: SELECT client, product, sum(amt) FROM SALE GROUP BY client, product

Multidimensional Data Model and Aggregates

sale Product Client Date Amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4

result Product Client Sum
p1 c1 56
p1 c2 4
p1 c3 50
p2 c1 11
p2 c2 8

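The three aggregate queries on the sale relation (total for day 1, totals by day, totals by client and product) can be checked against the sample data with an in-memory SQLite database. Column types and the ORDER BY clauses are additions made here so the output order is predictable.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sale (product TEXT, client TEXT, date INTEGER, amt INTEGER)")
con.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Add up amounts for day 1.
day1_total = con.execute("SELECT SUM(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day.
by_date = con.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by client, product.
by_client_product = con.execute(
    "SELECT client, product, SUM(amt) FROM sale "
    "GROUP BY client, product ORDER BY product, client"
).fetchall()
```

The results agree with the result tables on the slides: 81 for day 1, 81 and 48 by day, and five (client, product) totals.
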
Multidimensional Data Model and Aggregates
• In the multidimensional data model, summary information (aggregates) is usually stored together with the measure values

Product c1 c2 c3 Sum
p1 56 4 50 110
p2 11 8 - 19
Sum 67 12 50 129

Aggregates
• Operators: sum, count, max, min, median, avg
• “Having” clause
• Using dimension hierarchy
◦ average by region (within store)
◦ maximum by month (within date)

Cube Aggregation

Example: computing sums

day 1 c1 c2 c3
p1 12 - 50
p2 11 8 -

day 2 c1 c2 c3
p1 44 4 -
p2 - - -

Sum over days:
Product c1 c2 c3
p1 56 4 50
p2 11 8 -

Sum over days and products: c1 67, c2 12, c3 50
Sum over days and clients: p1 110, p2 19
Sum over everything: 129

Cube Operators

The same cube, with “*” standing for a dimension that has been summed out in sale(client, product, date):
• sale(c1,*,*) = 67
• sale(c2,p2,*) = 8
• sale(*,*,*) = 129

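The sale(client, product, date) notation with “*” can be mimicked by a small function over the fact rows. This is only a sketch of the notation's meaning; a real OLAP server would read precomputed cells rather than scan the facts.

```python
# Fact rows from the slides: (product, client, date, amt).
FACTS = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def sale(client="*", product="*", date="*"):
    """Sum of amt over all facts matching the arguments; "*" matches any value."""
    return sum(
        amt for p, c, d, amt in FACTS
        if client in ("*", c) and product in ("*", p) and date in ("*", d)
    )
```

For example, `sale("c1")` sums everything sold to c1, `sale("c2", "p2")` fixes client and product, and `sale()` is the grand total.
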
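Cube

The cube extended with “*” (all) along every dimension (“-” marks an empty cell):

all days c1 c2 c3 *
p1 56 4 50 110
p2 11 8 - 19
* 67 12 50 129

day 2 c1 c2 c3 *
p1 44 4 - 48
p2 - - - -
* 44 4 - 48

day 1 c1 c2 c3 *
p1 12 - 50 62
p2 11 8 - 19
* 23 8 50 81

Example: sale(*,p2,*) = 19
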
Aggregation Using Hierarchies

Hierarchy: customer → region → country

Day 1 rolled up from customer to region:
Product region A region B
p1 12 50
p2 11 8
(customer c1 in Region A; customers c2, c3 in Region B)

Aggregation Using Hierarchies

Hierarchy: client → city → region
Cube dimensions: client, product (Video, Camera, CD), date of sale

city client Video Camera CD
New Orleans c1 10 3 21
New Orleans c2 12 5 9
Poznań c3 11 7 7
Poznań c4 12 11 15

aggregation with respect to city:
city Video Camera CD
NO 22 8 30
PN 23 18 22

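Rolling up along a hierarchy means replacing each member by its parent and adding up the cells that collide. The sketch below uses the client figures from this example; the assignment of the three numeric columns to Video, Camera and CD follows the column header in the figure and should be read as an assumption.

```python
# Child -> parent mapping for one level of the hierarchy (client -> city).
client_city = {"c1": "New Orleans", "c2": "New Orleans", "c3": "Poznań", "c4": "Poznań"}

# Sales per client and product.
sales = {
    "c1": {"Video": 10, "Camera": 3, "CD": 21},
    "c2": {"Video": 12, "Camera": 5, "CD": 9},
    "c3": {"Video": 11, "Camera": 7, "CD": 7},
    "c4": {"Video": 12, "Camera": 11, "CD": 15},
}

def roll_up(cells, parent_of):
    """Aggregate the measures of each child into its parent in the hierarchy."""
    totals = {}
    for child, measures in cells.items():
        bucket = totals.setdefault(parent_of[child], {})
        for product, value in measures.items():
            bucket[product] = bucket.get(product, 0) + value
    return totals

by_city = roll_up(sales, client_city)
```

Applying the same function again with a city → region mapping would give the next level of the roll-up.
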
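A Sample Data Cube

Figure: a three-dimensional cube with a “sum” margin on each axis.
• Date: 1Q, 2Q, 3Q, 4Q, sum
• Product: camera, video, CD, sum
• Country: USA, Canada, Mexico, sum
• The corner cell (All, All, All) holds the overall total
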
Cube Operation
• SELECT date, product, customer, SUM (amount)
FROM SALES
CUBE BY date, product, customer
• Need to compute the following group-bys
◦ (date, product, customer),
◦ (date, product), (date, customer), (product, customer),
◦ (date), (product), (customer),
◦ ()

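What the cube operator computes can be sketched in plain Python: one aggregation per subset of the dimensions. The rows reuse the sale data from the earlier slides, with the client column standing in for customer. CUBE BY is the notation used on the slide; several SQL dialects write the same request as GROUP BY CUBE (...).

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("date", "product", "customer")
SALES = [
    {"date": 1, "product": "p1", "customer": "c1", "amount": 12},
    {"date": 1, "product": "p2", "customer": "c1", "amount": 11},
    {"date": 1, "product": "p1", "customer": "c3", "amount": 50},
    {"date": 1, "product": "p2", "customer": "c2", "amount": 8},
    {"date": 2, "product": "p1", "customer": "c1", "amount": 44},
    {"date": 2, "product": "p1", "customer": "c2", "amount": 4},
]

def cube(rows, dims, measure):
    """Return {group-by: {key tuple: sum}} for every subset of dims."""
    result = {}
    for k in range(len(dims), -1, -1):
        for group in combinations(dims, k):
            sums = defaultdict(int)
            for row in rows:
                sums[tuple(row[d] for d in group)] += row[measure]
            result[group] = dict(sums)
    return result

sales_cube = cube(SALES, DIMS, "amount")
```

This naive version scans the data once per group-by; practical implementations derive coarser group-bys from finer ones instead.
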
Cuboid Lattice
• Data cube can be viewed as a lattice
of cuboids
• The bottom-most cuboid is the base
cube.
• The top most cuboid contains only one
cell.

(A,B,C,D)

(A,B,C) (A,B,D) (A,C,D) (B,C,D)

(A,B) (A,C) (A,D) (B,C) (B,D) (C,D)

(A) (B) (C) (D)

( all )
Cuboid Lattice

Figure: lattice for the dimensions city, product, date, with example cells from the sale data.
• all: 129
• city (c1 67, c2 12, c3 50); product; date
• city, product (p1: 56, 4, 50; p2: 11, 8); city, date; product, date
• city, product, date: the base cuboid (the day 1 and day 2 tables)

Use a greedy algorithm to decide what to materialize.

Operations
• Rollup: summarize data
◦ e.g., given sales data, summarize sales for last year by product category and region
• Drill down: get more details
◦ e.g., given summarized sales as above, find the breakdown of sales by city within each region, or within the Andhra region

More Cube Operations
• Slice and dice: select and project
◦ e.g.: sales of soft drinks in Andhra over the last quarter
• Pivot: change the view of data

Size Q1 Q2 Total
L 22 33 55
S 15 44 59
Total 37 77 114

Color L S Total
Red 14 7 21
Blue 41 52 93
Total 55 59 114

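Slice, dice and pivot can be sketched on small in-memory data. The fact rows reuse the sale relation from the earlier slides and the pivot input reuses the Q1/Q2 table above; the function names are hypothetical.

```python
# Fact rows: (product, client, date, amt).
FACTS = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def slice_by_date(facts, date):
    """Slice: fix one dimension to a single value and drop it from the result."""
    return [(p, c, amt) for p, c, d, amt in facts if d == date]

def dice(facts, products, clients):
    """Dice: keep only the sub-cube with the chosen members on several dimensions."""
    return [row for row in facts if row[0] in products and row[1] in clients]

def pivot(table):
    """Pivot: swap rows and columns of a nested dict table[row][col]."""
    out = {}
    for row_key, cols in table.items():
        for col_key, value in cols.items():
            out.setdefault(col_key, {})[row_key] = value
    return out

day2 = slice_by_date(FACTS, 2)
sub_cube = dice(FACTS, {"p1"}, {"c1", "c2"})
by_quarter = pivot({"L": {"Q1": 22, "Q2": 33}, "S": {"Q1": 15, "Q2": 44}})
```

None of these operations changes the data; each one only changes which part of the cube is shown and how it is laid out.
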
Example

Dimensions: Time, Product, Geography
Attributes: Product (upc, price, …), Geography …
Hierarchies:
• Product → Brand → …
• Day → Week → Quarter
• City → Region → Country

Figure: cube with axes Geography (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread) and Time (M T W Th F S S); visible cells: Juice 10, Milk 34, Coke 56, Cream 32, Soap 12, Bread 56. Arrows show roll-up to region, roll-up to brand and roll-up to week.

Example cell: 56 units of bread sold in LA on M

Slicing a data cube

Summary report

Example of drill-down

Drill-down with
color added

Limitations of SQL
• “A Freshman in Business needs a Ph.D. in SQL”
-- Ralph Kimball

Typical OLAP Queries
• Write a multi-table join to compare sales for each product line YTD this year vs. last year.
• Repeat the above process to find the top 5 product contributors to margin.
• Repeat the above process to find the sales of a product line to new vs. existing customers.
• Repeat the above process to find the customers that have had negative sales growth.

What Is OLAP?
• Online Analytical Processing: term coined by E. F. Codd in a 1994 paper commissioned by Arbor Software*
• Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
• OLAP = Multidimensional Database
• MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
• ROLAP: Relational OLAP (Informix MetaCube, MicroStrategy DSS Agent)

* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

OLAP Is FASMI
Fast Analysis of Shared Multidimensional Information

Multi-dimensional Data
• “Hey…I sold $100M worth of goods”

Dimensions: Product, Region, Time

Figure: cube with axes Region (N, S, W), Product (Juice, Cola, Milk, Cream, Toothpaste, Soap) and Month (1 to 7).

Hierarchical summarization paths:
• Product: Industry → Category → Product
• Region: Country → Region → City → Office
• Time: Year → Quarter → Month → Day, with Week → Day as a parallel branch

Data Cube Lattice
• Cube lattice
◦ ABC
◦ AB AC BC
◦ A B C
◦ none
• Can materialize some group-bys, compute others on demand
• Question: which group-bys to materialize?
• Question: what indices to create?
• Question: how to organize data (chunks, etc.)?

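Computing a group-by on demand does not require going back to the raw facts: any coarser group-by can be derived from a materialized finer one. The sketch below assumes the (product, client) group-by of the sale data has been materialized; the function name is hypothetical, and the approach as written is valid for distributive aggregates such as sum.

```python
# Materialized group-by on (product, client), taken from the sale data on earlier slides.
materialized = {
    ("p1", "c1"): 56, ("p1", "c2"): 4, ("p1", "c3"): 50,
    ("p2", "c1"): 11, ("p2", "c2"): 8,
}

def derive_group_by(view, keep):
    """Derive a coarser group-by by keeping only the key positions listed in `keep`."""
    out = {}
    for key, total in view.items():
        reduced = tuple(key[i] for i in keep)
        out[reduced] = out.get(reduced, 0) + total
    return out

by_product = derive_group_by(materialized, keep=[0])
by_client = derive_group_by(materialized, keep=[1])
grand_total = derive_group_by(materialized, keep=[])
```

The question of which group-bys to materialize is then a trade-off between storage and the cost of deriving the others this way.
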
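A Visual Operation: Pivot (Rotate)

Figure: a cube that is rotated to change which dimensions face the viewer.
• Region: NY, LA, SF
• Product: Juice, Cola, Milk, Cream (visible cells: 10, 47, 30, 12)
• Date: 3/1, 3/2, 3/3, 3/4 (within Month)
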
“Slicing and Dicing”

Figure: the Telecomm slice of a cube.
• Product: Household, Telecomm, Video, Audio
• Regions: India, Far East, Europe
• Sales Channel: Retail, Direct, Special

Exercise (1)
• Suppose the AAA Automobile Co. builds a data warehouse to analyze sales of its cars.
• The measure: price of a car
• We would like to answer the following typical queries:
◦ find total sales by day, week, month and year
◦ find total sales by week, month, ... for each dealer
◦ find total sales by week, month, ... for each car model
◦ find total sales by month for all dealers in a given city, region and state.

Exercise (2)
• Dimensions:
◦ time (day, week, month, quarter, year)
◦ dealer (name, city, state, region, phone)
◦ cars (serialno, model, color, category, …)
• Design the conceptual data warehouse schema
