Data Mining
Balaram Singh
Computer Point Educational Ltd.
09/25/10 1
The Architecture of Data
Data Warehouse
Architecture
[Figure: data warehouse architecture. Information sources (operational DBs, semistructured sources) are extracted, transformed, loaded, and refreshed into the data warehouse and data marts, which serve OLAP servers (MOLAP, ROLAP); clients use these for analysis, query/reporting, and data mining.]
Data Warehouse vs. Data
Marts
From the Data Warehouse
to Data Marts
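[Figure: pyramid from data to information. Organizationally structured data (the data warehouse) at the base, departmentally structured data above it, individually structured data at the top; history, normalization, and detail run from more at the base to less at the top.]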
Data Warehouse and Data
Marts
[Figure: OLAP data marts (lightly summarized, departmentally structured) built on top of the data warehouse (atomic, detailed, organizationally structured).]
Data Marts
Data Integrity Problems
Same person, different spellings
Agarwal, Agrawal, Aggarwal etc...
LTD.
Use of different names
mumbai, bombay
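A minimal sketch of how a scrub step might resolve such variants, assuming small hand-built lookup tables. The names SURNAME_VARIANTS, CITY_SYNONYMS, and canonical are hypothetical, and treating the three spellings as one surname is itself an assumption that real data would need to confirm.

```python
# Hypothetical lookup tables; a real scrub step would load these from reference data.
SURNAME_VARIANTS = {"agarwal": "Agarwal", "agrawal": "Agarwal", "aggarwal": "Agarwal"}
CITY_SYNONYMS = {"mumbai": "Mumbai", "bombay": "Mumbai"}


def canonical(value, table):
    """Return the canonical form of value, or the trimmed value if it is unknown."""
    key = value.strip().lower()
    return table.get(key, value.strip())


# (surname, city) pairs as they might arrive from different source systems
records = [("Agrawal", "bombay"), ("Aggarwal", "Mumbai "), ("Agarwal", "mumbai")]
cleaned = [
    (canonical(name, SURNAME_VARIANTS), canonical(city, CITY_SYNONYMS))
    for name, city in records
]
print(cleaned)
```

Values missing from the tables pass through unchanged, so unknown spellings can still be reviewed by hand.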
The ETL Process
Capture
Scrub or data cleansing
Transform
Load and Index
Steps in data reconciliation
Record-level:
Selection – data partitioning
Joining – data combining
Aggregation – data summarization
Field-level:
single-field – from one field to one field
multi-field – from many fields to one, or one field to many
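The three record-level functions can be sketched in a few lines of Python. The orders and customers data, and the threshold of 10, are made up for illustration.

```python
from collections import defaultdict

# Hypothetical source records
orders = [
    {"order_id": 1, "cust_id": "c1", "amount": 12},
    {"order_id": 2, "cust_id": "c2", "amount": 8},
    {"order_id": 3, "cust_id": "c1", "amount": 44},
]
customers = {"c1": "Detroit", "c2": "Chicago"}  # cust_id -> city

# Selection: partition the data with a predicate
large = [o for o in orders if o["amount"] >= 10]

# Joining: combine each selected order with its customer's city
joined = [{**o, "city": customers[o["cust_id"]]} for o in large]

# Aggregation: summarize the amount per city
totals = defaultdict(int)
for row in joined:
    totals[row["city"]] += row["amount"]
totals = dict(totals)
print(totals)
```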
Steps in data reconciliation (continued)
Physical Structure of Data
Warehouse
There are three basic architectures for
a data warehouse; the centralized and federated
architectures are shown below
Physical Structure of Data
Warehouse
[Figure: centralized architecture. Sources feed a central data warehouse, which serves several clients.]
Physical Structure of Data
Warehouse
[Figure: federated architecture. Sources feed a logical data warehouse; data is distributed to local data marts (Marketing, Financial, Distribution), which serve the end users.]
Design Methodology for DW
Nine-step Methodology – proposed by
Kimball
Step Activity
1 Choosing the process
2 Choosing the grain
3 Identifying and conforming the dimensions
4 Choosing the facts
5 Storing the precalculations in the fact table
6 Rounding out the dimension tables
7 Choosing the duration of the database
8 Tracking slowly changing dimensions
9 Deciding the query priorities and the query modes
Indexing a Data Warehouse
Index Structures
Index structures applied in warehouses
◦ inverted lists
◦ bit map indexes
◦ join indexes
◦ text indexes
Inverted Lists
[Figure: an index on age (18, 19, ...) whose entries point to inverted lists of record ids, which in turn point to the data records.]
Inverted Lists
Query:
◦ Get people with age = 20 and name = “fred”
List for age = 20: r4, r18, r34, r35
List for name = “fred”: r18, r52
Answer is intersection: r18
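The query above can be sketched with Python sets. The record ids and the two lists come from the slide; the other field values (the names for r4, r34, r35 and the age for r52) are made up so the example is complete, and build_inverted_index is a hypothetical helper.

```python
from collections import defaultdict


def build_inverted_index(records, field):
    """Map each value of field to the set of record ids that hold it."""
    index = defaultdict(set)
    for rid, rec in records.items():
        index[rec[field]].add(rid)
    return index


records = {
    "r4": {"name": "joe", "age": 20},     # name is illustrative
    "r18": {"name": "fred", "age": 20},
    "r34": {"name": "sally", "age": 20},  # name is illustrative
    "r35": {"name": "nancy", "age": 20},  # name is illustrative
    "r52": {"name": "fred", "age": 31},   # age is illustrative
}

age_index = build_inverted_index(records, "age")
name_index = build_inverted_index(records, "name")

# A conjunctive query is the intersection of the two inverted lists
answer = age_index[20] & name_index["fred"]
print(sorted(answer))
```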
Bitmap Indexes
Bitmap index: an indexing technique that has attracted
attention in multi-dimensional database implementation
Base table:
Customer  City     Car
c1        Detroit  Ford
c2        Chicago  Honda
c3        Detroit  Honda
c4        Poznan   Ford
c5        Paris    BMW
c6        Paris    Nissan
Bitmap Indexes
The index consists of bitmaps:
Index on City:
rec  Chicago  Detroit  Paris  Poznan
1    0        1        0      0
2    1        0        0      0
3    0        1        0      0
4    0        0        0      1
5    0        0        1      0
6    0        0        1      0
Each column is one bitmap.
Bitmap Indexes
Index on Car:
rec  BMW  Ford  Honda  Nissan
1    0    1     0      0
2    0    0     1      0
3    0    0     1      0
4    0    1     0      0
5    1    0     0      0
6    0    0     0      1
Each column is one bitmap.
Bitmap Indexes
• Index on a particular column
• Index consists of a number of bit
vectors - bitmaps
• Each value in the indexed column has a
bit vector (bitmaps)
• The length of the bit vector is the
number of records in the base table
• The i-th bit is set if the i-th row of the
base table has the value for the
indexed column
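As a rough sketch of the definition above, the following builds one bit vector per distinct value for the Customer/City/Car table. Representing a bit vector as a Python list of 0/1 is a simplification chosen for readability, and build_bitmap_index is a hypothetical helper.

```python
rows = [
    ("c1", "Detroit", "Ford"),
    ("c2", "Chicago", "Honda"),
    ("c3", "Detroit", "Honda"),
    ("c4", "Poznan", "Ford"),
    ("c5", "Paris", "BMW"),
    ("c6", "Paris", "Nissan"),
]


def build_bitmap_index(rows, col):
    """Map each distinct value in column col to a bit vector of length len(rows)."""
    index = {}
    for i, row in enumerate(rows):
        # The i-th bit is set if the i-th row has this value
        index.setdefault(row[col], [0] * len(rows))[i] = 1
    return index


city_index = build_bitmap_index(rows, 1)
car_index = build_bitmap_index(rows, 2)
print(city_index["Detroit"])
print(car_index["Honda"])
```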
Bitmap Index
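Data records:
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26

Bitmaps in the age index (one bit per record, ids 1-8):
age = 20: 11011000
age = 21: 00100010
[Figure: the age index (18, 19, 20, 21, 22, 23, 25, 26, ...) points to bitmaps, whose set bits identify the data records.]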
Using Bitmap indexes
Query:
◦ Get people with age = 20 and name = “fred”
List for age = 20: 1101100000
List for name = “fred”: 0100000001
Answer is intersection: 0100000000
• Good if domain cardinality small
• Bit vectors can be compressed
Using Bitmap indexes
• They allow the use of efficient bit
operations to answer some
queries
• “how many customers from Detroit have
car ‘Ford’”
◦ perform a bit-wise AND of two bitmaps:
answer – c1
• “how many customers have a car
‘Honda’”
◦ count 1’s in the bitmap - answer - 2
• Compression - bit vectors are
usually sparse for large databases,
so they compress well, at the cost of
decompression when they are used
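The two example queries can be sketched directly on the bitmaps of the Customer/City/Car table. Lists of 0/1 stand in for packed bit vectors here; a real system would AND machine words and use a population count.

```python
customers = ["c1", "c2", "c3", "c4", "c5", "c6"]

# Bitmaps taken from the City and Car indexes
detroit = [1, 0, 1, 0, 0, 0]
ford = [1, 0, 0, 1, 0, 0]
honda = [0, 1, 1, 0, 0, 0]


def bitmap_and(a, b):
    """Bit-wise AND of two equal-length bit vectors."""
    return [x & y for x, y in zip(a, b)]


# Customers from Detroit who have a Ford
both = bitmap_and(detroit, ford)
matches = [c for c, bit in zip(customers, both) if bit]

# Number of customers who have a Honda: count the 1s
honda_count = sum(honda)

print(matches, honda_count)
```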
Join
“Combine” SALE, PRODUCT relations
In SQL: SELECT * FROM SALE, PRODUCT
WHERE SALE.prodId = PRODUCT.prodId
Join Indexes
product table with a join index (jIndex):
id   name  price  jIndex
p1   bolt  10     r1,r3,r5,r6
p2   nut   5      r2,r4
Join Indexes
• Traditional indexes map the value to a list
of record ids. Join indexes map the tuples
in the join result of two relations to the
source tables.
• In data warehouse cases, join indexes
relate the values of the dimensions of a
star schema to rows in the fact table.
• For a warehouse with a Sales fact table and
dimension city, a join index on city maintains
for each distinct city a list of RIDs of the
tuples recording the sales in the city
• Join indexes can span multiple dimensions
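A minimal sketch of a join index, using the product rows and record ids from the join index table above. The sale table is reduced to a mapping from record id to product id, which is an assumption made to keep the example small, and build_join_index is a hypothetical helper.

```python
# Dimension table: product id -> attributes
product = {
    "p1": {"name": "bolt", "price": 10},
    "p2": {"name": "nut", "price": 5},
}

# Fact table reduced to: record id -> product id (other columns omitted)
sale = {"r1": "p1", "r2": "p2", "r3": "p1", "r4": "p2", "r5": "p1", "r6": "p1"}


def build_join_index(fact_rows):
    """For each dimension key, list the ids of the fact rows that join with it."""
    jindex = {}
    for rid, prod_id in fact_rows.items():
        jindex.setdefault(prod_id, []).append(rid)
    return jindex


jindex = build_join_index(sale)

# All sale records for bolts, found without scanning the fact table at query time
bolt_sales = jindex["p1"]
print(bolt_sales)
```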
Terms
• Fact Table
• Dimension Table
• Measure
Terms
The relation that relates the dimensions
to the measure of interest is called
the fact table (e.g. sale)
Information about dimensions can be
represented as a collection of
relations – called the dimension tables
(product, customer, store)
Each dimension can have a set of
associated attributes
Conceptual Modeling of
Data Warehouses
Three basic conceptual schemas:
Star schema
Snowflake schema
Fact constellations
Star schema
sale (fact table): orderId, date, custId, prodId, storeId, qty, amt
product: prodId, name, price
customer: custId, name, address, city
store: storeId, city
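One way to try out this star schema is with Python's built-in sqlite3 module. The table and column names follow the slide; the column types, the sample rows, and the query are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE customer (custId TEXT PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
-- Fact table: one foreign key per dimension, plus the measures qty and amt
CREATE TABLE sale (
    orderId TEXT, date INTEGER,
    custId  TEXT REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
-- Illustrative rows only
INSERT INTO product  VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO customer VALUES ('c1', 'joe', '10 main', 'sfo'), ('c2', 'fred', '12 main', 'sfo');
INSERT INTO store    VALUES ('s1', 'nyc'), ('s2', 'sfo');
INSERT INTO sale VALUES
    ('o1', 1, 'c1', 'p1', 's1', 1, 10),
    ('o2', 1, 'c2', 'p2', 's1', 2, 10),
    ('o3', 2, 'c1', 'p1', 's2', 3, 30);
""")

# Typical star join: the fact table joined to two dimensions, then aggregated
rows = conn.execute("""
    SELECT store.city, product.name, SUM(sale.amt)
    FROM sale
    JOIN store   ON sale.storeId = store.storeId
    JOIN product ON sale.prodId  = product.prodId
    GROUP BY store.city, product.name
    ORDER BY store.city, product.name
""").fetchall()
print(rows)
```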
Example of Star Schema
Sales Fact Table: Date, Product, Store, Customer; measurements: unit_sales, dollar_sales, schilling_sales
Date dimension: Date, Month, Year
Product dimension: ProductNo, ProdName, ProdDesc, Category, QOH
Store dimension: StoreID, City, State, Country, Region
Customer dimension: CustId, CustName, CustCity, CustCountry
Dimension Hierarchies
For each dimension, the set of
associated attributes can be
structured as a hierarchy
store -> sType
store -> city -> region
Dimension Hierarchies
Client hierarchy: city -> state -> region
cities table:
city  state  region
c1    CA     East
c2    NY     East
c3    SF     West
Snowflake Schema
Example of Snowflake Schema
Sales Fact Table: Date, Product, Store, Customer; measurements: unit_sales, dollar_sales, schilling_sales
Date (Date, Month) -> Month (Month, Year) -> Year (Year)
Product (ProductNo, ProdName, ProdDesc, Category, QOH)
Store (StoreID, City) -> City (City, State) -> State (State, Country) -> Country (Country, Region)
Cust (CustId, CustName, CustCity, CustCountry)
Fact constellations
Components of a star schema
Fact tables contain factual or quantitative data
Star schema with sample data
Cube computations
Group-bys over the dimensions A, B, C:
{ABC}
{AB} {AC} {BC}
{A} {B} {C}
{ } (ALL)
Dimension Hierarchies
Computation
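[Figure: lattice of group-bys using the hierarchy city -> state: (city, product, date) at the bottom; (state, product) and (state, date) above it; then (state); all at the top.]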
Multidimensional Data
Model
Multidimensional Data
Model
Fact relation Two-dimensional cube
Multidimensional Data
Model
Fact relation 3-dimensional cube
Multidimensional Data Model
and Aggregates
Add up amounts for day 1
In SQL: SELECT sum(Amt) FROM SALE
WHERE Date = 1
Multidimensional Data Model
and Aggregates
Add up amounts by day
In SQL: SELECT Date, sum(Amt)
FROM SALE GROUP BY Date
Multidimensional Data Model
and Aggregates
Add up amounts by client,
product
In SQL: SELECT client, product,
sum(amt) FROM SALE
GROUP BY client, product
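The three aggregate queries can be run as written against a small in-memory table with Python's sqlite3 module. The SALE rows below are the day 1 and day 2 figures used in the cube slides; the column types and the ORDER BY clauses, added to make the output order deterministic, are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALE (Date INTEGER, client TEXT, product TEXT, Amt INTEGER)")
conn.executemany(
    "INSERT INTO SALE VALUES (?, ?, ?, ?)",
    [
        (1, "c1", "p1", 12), (1, "c3", "p1", 50),
        (1, "c1", "p2", 11), (1, "c2", "p2", 8),
        (2, "c1", "p1", 44), (2, "c2", "p1", 4),
    ],
)

# Add up amounts for day 1
day1_total = conn.execute("SELECT sum(Amt) FROM SALE WHERE Date = 1").fetchone()[0]

# Add up amounts by day
by_day = conn.execute(
    "SELECT Date, sum(Amt) FROM SALE GROUP BY Date ORDER BY Date"
).fetchall()

# Add up amounts by client, product
by_client_product = conn.execute(
    "SELECT client, product, sum(Amt) FROM SALE "
    "GROUP BY client, product ORDER BY client, product"
).fetchall()

print(day1_total, by_day, by_client_product)
```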
Multidimensional Data Model
and Aggregates
In the multidimensional data model, summarizing
information (aggregates) is usually stored together
with the measure values
      c1   c2   c3   Sum
p1    56    4   50   110
p2    11    8         19
Sum   67   12   50   129
Aggregates
Operators: sum, count, max, min,
median, avg
“Having” clause
Using dimension hierarchy
◦ average by region (within store)
◦ maximum by month (within date)
Cube Aggregation
Base table (amount by product and customer):
      c1   c2   c3
p1    56    4   50
p2    11    8
Summed over products:
      c1   c2   c3
sum   67   12   50
Summed over customers:
      sum
p1    110
p2     19
Summed over both: 129
Cube Operators
Sales by product and customer, per day:
day 1   c1   c2   c3
p1      12        50
p2      11    8
day 2   c1   c2   c3
p1      44    4
p2
A * marks a dimension that has been summed over, in the order sale(customer, product, day):
sale(c1,*,*) = 67
sale(c2,p2,*) = 8
sale(*,*,*) = 129
Cube
The cube extended with * (all) entries:
day 1   c1   c2   c3    *
p1      12        50   62
p2      11    8        19
*       23    8   50   81
day 2   c1   c2   c3    *
p1      44    4        48
p2
*       44    4        48
day *   c1   c2   c3    *
p1      56    4   50  110
p2      11    8        19
*       67   12   50  129
Example: sale(*,p2,*) = 19
Aggregation Using
Hierarchies
Dimension hierarchy: customer -> region -> country
(customer c1 in Region A;
customers c2, c3 in Region B)
Sales by customer:
day 1   c1   c2   c3
p1      12        50
p2      11    8
day 2   c1   c2   c3
p1      44    4
p2
Day 1 rolled up to region:
      region A   region B
p1    12         50
p2    11         8
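Rolling the day 1 figures up from customer to region can be sketched as follows. The cell values and the customer-to-region mapping come from the slide; roll_up is a hypothetical helper.

```python
from collections import defaultdict

# Day 1 sales: (product, customer) -> amount
day1 = {("p1", "c1"): 12, ("p1", "c3"): 50, ("p2", "c1"): 11, ("p2", "c2"): 8}

# One level of the dimension hierarchy: customer -> region
region_of = {"c1": "A", "c2": "B", "c3": "B"}


def roll_up(cells, parent_of):
    """Replace each customer by its parent in the hierarchy and sum the amounts."""
    out = defaultdict(int)
    for (product, customer), amount in cells.items():
        out[(product, parent_of[customer])] += amount
    return dict(out)


by_region = roll_up(day1, region_of)
print(by_region)
```

Applying the same helper again with a region-to-country mapping would give the next level of the hierarchy.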
Aggregation Using
Hierarchies
Hierarchy: client -> city -> region
Sales by client (one slice along the date of sale):
city          client   Video   Camera   CD
New Orleans   c1       10      3        21
New Orleans   c2       12      5        9
Poznań        c3       11      7        7
Poznań        c4       12      11       15
Aggregation with respect to city:
city   Video   Camera   CD
NO     22      8        30
PN     23      18       22
A Sample Data Cube
[Figure: sample data cube with dimensions Date (1Q, 2Q, 3Q, 4Q, sum), Product (camera, video, CD, sum), and Country (USA, Canada, Mexico, sum); the apex cell is labelled All, All, All.]
Cube Operation
SELECT date, product, customer,
SUM (amount)
FROM SALES
CUBE BY date, product, customer
Need to compute the following group-bys:
◦ (date, product, customer),
◦ (date, product), (date, customer),
(product, customer),
◦ (date), (product), (customer),
◦ () (the grand total)
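A naive sketch of what the cube operation computes: one sum for every subset of the three dimensions, with "*" marking a dimension that has been aggregated away. The rows are the day 1 and day 2 figures from the earlier cube slides, and keys are ordered (date, product, customer) as in the query. This enumerates all 2^3 = 8 group-bys by brute force; it is not how a real engine would evaluate the operator.

```python
from collections import defaultdict
from itertools import combinations

# (date, product, customer, amount)
sales = [
    (1, "p1", "c1", 12), (1, "p1", "c3", 50), (1, "p2", "c1", 11),
    (1, "p2", "c2", 8), (2, "p1", "c1", 44), (2, "p1", "c2", 4),
]


def cube(rows, n_dims=3):
    """Sum the last column for every subset of the first n_dims columns."""
    result = defaultdict(int)
    for k in range(n_dims + 1):
        for kept in combinations(range(n_dims), k):
            for row in rows:
                # Keep the chosen dimensions, replace the rest by '*'
                key = tuple(row[i] if i in kept else "*" for i in range(n_dims))
                result[key] += row[-1]
    return dict(result)


totals = cube(sales)
print(totals[("*", "*", "*")])   # grand total
print(totals[("*", "p2", "*")])  # all sales of p2
print(totals[(1, "*", "*")])     # all sales on day 1
```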
Cuboid Lattice
• Data cube can be viewed as a lattice
of cuboids
• The bottom-most cuboid is the base
cube.
• The topmost cuboid contains only one
cell.
[Figure: lattice of cuboids from the base cuboid (A,B,C,D) up to (all).]
Cuboid Lattice
[Figure: cuboid lattice for the sales cube, from the base cuboid (city, product, date), shown as per-day tables, up to all (129).]
Use a greedy algorithm to decide what to materialize
Operations
Rollup: summarize data
◦ e.g., given sales data, summarize sales for last
year by product category and region
Drill down: get more details
◦ e.g., given summarized sales as above, find
the breakdown of sales by city within each region, or
within the Andhra region
More Cube Operations
Example
Dimensions: Time, Product, Geography
Attributes: Product (upc, price, …), Geography …
Hierarchies:
Product -> Brand -> …
Day -> Week -> Quarter
City -> Region -> Country
[Figure: sales cube with axes Geography (NY, SF, LA), Product (Juice 10, Milk 34, Coke 56, Cream 32, Soap 12, Bread 56), and Time (M T W Th F S S); arrows mark roll-up to region, roll-up to brand, and roll-up to week.]
Example cell: 56 units of bread sold in LA on M
Slicing a data cube
Summary report
Example of drill-down
Drill-down with
color added
Limitations of SQL
“A Freshman in Business needs a Ph.D. in SQL”
-- Ralph Kimball
Typical OLAP Queries
Write a multi-table join to compare sales for each
product line YTD this year vs. last year.
Repeat the above process to find the top 5
product contributors to margin.
Repeat the above process to find the sales of a
product line to new vs. existing customers.
Repeat the above process to find the customers
that have had negative sales growth.
What Is OLAP?
Online Analytical Processing - coined by
E. F. Codd in a 1994 paper commissioned by
Arbor Software*
Generally synonymous with earlier terms such as
Decision Support, Business Intelligence,
Executive Information System
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase,
Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube,
Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
OLAP Is FASMI
Fast
Analysis
Shared
Multidimensional
Information
Multi-dimensional Data
“Hey…I sold $100M worth of goods”
[Figure: cube with a product axis (Juice, Cola, Milk, Cream, Toothpaste, Soap), a region axis, and a time axis (1-7). Dimension hierarchies shown: Product -> Category; Office -> City -> Region; Day -> Week or Month -> Quarter.]
Data Cube Lattice
Cube lattice
ABC
AB AC BC
A B C
none
Can materialize some group-bys, compute others on
demand
Question: which group-bys to materialize?
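To illustrate computing a group-by on demand, the sketch below answers coarser group-bys from one materialized finer group-by instead of going back to the base data. The materialized table is the (client, product) summary from the aggregate slides, and project is a hypothetical helper. Re-summing partial sums like this is valid for sum, count, max, and min, but not for median.

```python
from collections import defaultdict

# Materialized group-by AB: (client, product) -> sum(amt)
ab = {
    ("c1", "p1"): 56, ("c2", "p1"): 4, ("c3", "p1"): 50,
    ("c1", "p2"): 11, ("c2", "p2"): 8,
}


def project(groupby, keep):
    """Derive a coarser group-by by keeping only the key positions in keep."""
    out = defaultdict(int)
    for key, total in groupby.items():
        out[tuple(key[i] for i in keep)] += total
    return dict(out)


by_client = project(ab, (0,))   # group-by A, computed from AB
by_product = project(ab, (1,))  # group-by B, computed from AB
grand_total = project(ab, ())   # "none", computed from AB
print(by_client, by_product, grand_total)
```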
A Visual Operation: Pivot
(Rotate)
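[Figure: pivot (rotate). A cube with axes Region (NY, LA, SF), Month, and Product (Juice 10, Cola 47, Milk 30, Cream 12) is rotated; a second cube shows product groups (Household, Telecomm, Video, Audio) against regions (Europe, Far East, India).]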