
Data Warehousing and Data Mining

CompSci 316
Introduction to Database Systems

Announcements (Tue., Oct. 9)

My office hours this Wed.: 2:30-4pm
Homework #1 graded
Homework #2 sample solution sent
Project milestone #1 due next Thursday

Data integration

Data resides in many distributed, heterogeneous
OLTP (On-Line Transaction Processing) sources
Need to support OLAP (On-Line Analytical
Processing) over an integrated view of the data
Possible approaches to integration
Eager: integrate in advance and store the integrated data
at a central repository called the data warehouse
Lazy: integrate on demand; process queries over
distributed sources (mediated or federated systems)

Implications on database design and optimization?
OLAP databases do not care much about redundancy:
Denormalized tables
Many, many indexes
Precomputed query results

Maintaining a data warehouse

[Figure: data from operational sources (sales, inventory, customer, ...;
NC branch, NY branch, CA branch, ...) is copied into the warehouse]

Eager versus lazy integration

Eager (warehousing)
In advance: before queries
Copy data from sources
Query processing is local to the warehouse
=> Faster
=> Can operate when sources are unavailable
Answer could be stale
=> Need to maintain consistency

Lazy
On demand: at query time
Leave data at sources
Answer is more up-to-date
=> No need to maintain consistency
Sources participate in query processing
=> Slower
=> Interferes with local processing

OLTP versus OLAP

OLTP                         | OLAP
Mostly updates               | Mostly reads
Short, simple transactions   | Long, complex queries
Clerical users               | Analysts, decision makers
Goal: transaction throughput | Goal: fast queries

Announcements (contd.)

Midterm on Thursday, in class
Open-book, open-notes
No communication devices
Solution to the sample midterm was emailed this weekend
Will cover all materials through today
But more focus will be on parts that you have already exercised

The ETL process


Extraction: extract relevant data and/or changes from sources
Transformation: transform data to match the warehouse schema
Loading: integrate data/changes into the warehouse

Approaches
Recomputation
Easy to implement; just take periodic dumps of the sources, say, every night
What if there is no night, e.g., in a global organization?
What if recomputation takes more than a day?
Incremental maintenance
Compute and apply only incremental changes
Fast if changes are small
Not easy to do for complicated transformations
Need to detect incremental changes at the sources
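As a rough illustration of the incremental-maintenance idea, here is a minimal Python sketch that applies a batch of source changes to one materialized aggregate (total qty per product) instead of recomputing it from a full dump; the names and the change format are illustrative assumptions, not from the lecture:

```python
# Materialized aggregate: SUM(qty) GROUP BY PID (illustrative values)
warehouse = {"p1": 9, "p2": 2}

def apply_changes(agg, changes):
    """Apply a batch of incremental changes without recomputation.
    Each change is (PID, delta_qty): an inserted sale contributes
    +qty, a deleted sale contributes -qty."""
    for pid, delta in changes:
        agg[pid] = agg.get(pid, 0) + delta
        if agg[pid] == 0:       # drop groups that became empty
            del agg[pid]
    return agg

apply_changes(warehouse, [("p1", 3), ("p3", 4), ("p2", -2)])
print(warehouse)  # {'p1': 12, 'p3': 4}
```

Note how this sketch only touches the changed groups, which is why incremental maintenance is fast when changes are small.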

Star schema of a data warehouse

Fact table: Sale
Big; constantly growing
Stores measures (often aggregated in queries)

OID | date       | CID | PID | SID | qty | price
100 | 08/23/2012 | c3  | p1  | s1  | 1   | 12
102 | 09/12/2012 | c3  | p2  | s1  | 2   | 17
105 | 09/24/2012 | c5  | p1  | s3  | 5   | 13

Dimension tables: small; updated infrequently

Dimension table: Store
SID | city
s1  | Durham
s2  | Chapel Hill
s3  | RTP

Dimension table: Product
PID | name   | cost
p1  | beer   | 10
p2  | diaper | 16

Dimension table: Customer
CID | name | address        | city
c3  | Amy  | 100 Main St.   | Durham
c4  | Ben  | 102 Main St.   | Durham
c5  | Coy  | 800 Eighth St. | Durham

Data cube

Simplified schema: Sale (CID, PID, SID, qty)

[Figure: a 3-D cube with axes Customer (c3, c4, c5), Product (p1, p2),
and Store (s1, s2, s3); data points:
(c3, p1, s1) = 1, (c3, p2, s1) = 2, (c5, p1, s1) = 3, (c5, p1, s3) = 5]

Completing the cube: plane

Total quantity of sales for each product in each store
SELECT PID, SID, SUM(qty) FROM Sale
GROUP BY PID, SID;
Project all points onto the Product-Store plane:
(ALL, p1, s1) = 4, (ALL, p2, s1) = 2, (ALL, p1, s3) = 5

Completing the cube: axis

Total quantity of sales for each product
SELECT PID, SUM(qty) FROM Sale GROUP BY PID;
Further project points onto the Product axis:
(ALL, p1, ALL) = 9, (ALL, p2, ALL) = 2

Completing the cube: origin

Total quantity of sales
SELECT SUM(qty) FROM Sale;
Further project points onto the origin:
(ALL, ALL, ALL) = 11

CUBE operator

Sale (CID, PID, SID, qty)
Proposed SQL extension:
SELECT SUM(qty) FROM Sale
GROUP BY CUBE CID, PID, SID;
Output contains:
Normal groups produced by GROUP BY, e.g.,
(c1, p1, s1, sum), (c1, p2, s3, sum), etc.
Groups with one or more ALLs, e.g.,
(ALL, p1, s1, sum), (c2, ALL, ALL, sum), (ALL, ALL, ALL, sum), etc.
Can you write a CUBE query using only GROUP BYs?

Gray et al. Data Cube: A Relational Aggregation Operator
Generalizing Group-By, Cross-Tab, and Sub-Total. ICDE 1996
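On the question above: a CUBE over k columns is equivalent to a union of 2^k GROUP BYs, one per subset of the grouping columns. A minimal Python sketch over the four data points from the cube figure (not the course's code, just an illustration):

```python
from itertools import combinations
from collections import defaultdict

# Simplified Sale(CID, PID, SID, qty) points from the slides' cube figure
sales = [("c3", "p1", "s1", 1), ("c3", "p2", "s1", 2),
         ("c5", "p1", "s1", 3), ("c5", "p1", "s3", 5)]

def cube(rows):
    """Compute SUM(qty) for every grouping in the cube: one GROUP BY
    per subset of {CID, PID, SID} (8 groupings), with dropped
    dimensions replaced by ALL -- mirroring GROUP BY CUBE CID, PID, SID."""
    result = defaultdict(int)
    dims = range(3)  # positions of CID, PID, SID in each row
    for keep_size in range(4):
        for keep in combinations(dims, keep_size):
            for row in rows:
                key = tuple(row[d] if d in keep else "ALL" for d in dims)
                result[key] += row[3]
    return dict(result)

c = cube(sales)
print(c[("ALL", "p1", "ALL")])   # 9, matching the Product-axis projection
print(c[("ALL", "ALL", "ALL")])  # 11, matching the origin
```

The triple loop makes the cost visible: every row contributes to 2^k groups, which is exactly why precomputing cube aggregates is attractive.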

Aggregation lattice

Computing GROUP BY and CUBE aggregates is expensive
OLAP queries perform these operations over and over again
=> Idea: precompute and store the aggregates as materialized views

                     GROUP BY ()
GROUP BY CID       GROUP BY PID       GROUP BY SID
GROUP BY CID, PID  GROUP BY CID, SID  GROUP BY PID, SID
                GROUP BY CID, PID, SID

Roll up: move toward GROUP BY (); drill down: move toward GROUP BY CID, PID, SID
A parent can be computed from any child
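The "a parent can be computed from any child" point can be sketched in Python: roll up the (PID, SID) aggregate from the cube figure to a PID-only aggregate by summing over the dropped dimension. Names here are illustrative:

```python
from collections import defaultdict

# Child view: SUM(qty) GROUP BY PID, SID (values from the cube figure)
by_pid_sid = {("p1", "s1"): 4, ("p2", "s1"): 2, ("p1", "s3"): 5}

def roll_up(child, keep):
    """Compute a coarser (parent) aggregate from a finer (child) one
    by summing over the dimensions being dropped. This works for
    SUM and COUNT; AVG would need SUM and COUNT carried separately."""
    parent = defaultdict(int)
    for key, total in child.items():
        parent[tuple(key[i] for i in keep)] += total
    return dict(parent)

# Roll up from (PID, SID) to PID alone: position 0 of the key survives
by_pid = roll_up(by_pid_sid, keep=(0,))
print(by_pid)  # {('p1',): 9, ('p2',): 2}
```

This is why materializing a low node of the lattice helps many queries: any ancestor grouping can be answered from it without touching the base data.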

Materialized views

Maintained automatically as base data changes
No. 1 user-requested feature in PostgreSQL!

Selecting views to materialize

Factors in deciding what to materialize
What is its storage cost?
What is its update cost?
Which queries can benefit from it?
How much can a query benefit from it?
Example
GROUP BY () is small, but not useful to most queries
GROUP BY CID, PID, SID is useful to any query, but too large to be beneficial

Harinarayan et al. Implementing Data Cubes Efficiently. SIGMOD 1996

Other OLAP extensions

Besides extended grouping capabilities (e.g., CUBE),
window operations have also been added to SQL
A window specifies an ordered list of rows related to the current row
A window function computes a value from this list and the current row
Standard aggregates: COUNT, SUM, AVG, MIN, MAX
New functions: RANK, PERCENT_RANK, LAG, LEAD, ...

RANK window function example

SELECT SID, PID, SUM(qty),
       RANK() OVER w
FROM Sale GROUP BY SID, PID
WINDOW w AS
  (PARTITION BY SID
   ORDER BY SUM(qty) DESC);

Input table Sale:
sid     | pid    | cid   | qty
--------+--------+-------+-----
Durham  | beer   | Alice |  10
Durham  | beer   | Bob   |   2
Durham  | chips  | Bob   |   3
Durham  | diaper | Alice |   5
Raleigh | beer   | Alice |   2
Raleigh | diaper | Bob   | 100

Apply WINDOW after processing FROM, WHERE, GROUP BY, HAVING
PARTITION defines the related set and ORDER BY orders it
E.g., for the group row (Durham, beer) with sum 12, the related list
is the Durham group rows ordered by SUM(qty) DESC:
  (Durham, beer, 12), (Durham, diaper, 5), (Durham, chips, 3)

Then, for each row and its related list, evaluate SELECT and return:
sid     | pid    | sum | rank
--------+--------+-----+-----
Durham  | beer   |  12 |    1
Durham  | diaper |   5 |    2
Durham  | chips  |   3 |    3
Raleigh | diaper | 100 |    1
Raleigh | beer   |   2 |    2

Multiple windows

SELECT SID, PID, SUM(qty),
       RANK() OVER w,
       RANK() OVER w1 AS rank1
FROM Sale GROUP BY SID, PID
WINDOW w AS
  (PARTITION BY SID
   ORDER BY SUM(qty) DESC),
w1 AS
  (ORDER BY SUM(qty) DESC)
ORDER BY SID, rank;

No PARTITION means all rows are related to the current one,
so rank1 is the global rank:
sid     | pid    | sum | rank | rank1
--------+--------+-----+------+------
Durham  | beer   |  12 |    1 |     2
Durham  | diaper |   5 |    2 |     3
Durham  | chips  |   3 |    3 |     4
Raleigh | diaper | 100 |    1 |     1
Raleigh | beer   |   2 |    2 |     5
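The two window specifications can be mimicked in Python to check the outputs above. RANK() here is computed as 1 plus the number of related rows that sort strictly ahead of the current row; this is a sketch of the semantics, not how a DBMS implements it:

```python
from collections import defaultdict

# Sale rows (sid, pid, cid, qty) from the example
sale = [("Durham", "beer", "Alice", 10), ("Durham", "beer", "Bob", 2),
        ("Durham", "chips", "Bob", 3), ("Durham", "diaper", "Alice", 5),
        ("Raleigh", "beer", "Alice", 2), ("Raleigh", "diaper", "Bob", 100)]

# GROUP BY SID, PID with SUM(qty)
sums = defaultdict(int)
for sid, pid, _cid, qty in sale:
    sums[(sid, pid)] += qty

def rank(sid, pid):
    """Window w: PARTITION BY SID, ORDER BY SUM(qty) DESC."""
    s = sums[(sid, pid)]
    return 1 + sum(1 for (sid2, _), s2 in sums.items()
                   if sid2 == sid and s2 > s)

def rank1(sid, pid):
    """Window w1: no PARTITION, so every group row is related."""
    s = sums[(sid, pid)]
    return 1 + sum(1 for s2 in sums.values() if s2 > s)

print(rank("Durham", "beer"), rank1("Durham", "beer"))    # 1 2
print(rank("Raleigh", "beer"), rank1("Raleigh", "beer"))  # 2 5
```

Counting only strictly larger sums reproduces SQL RANK's behavior on ties: tied rows share a rank and the next rank is skipped.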

Data mining

Data => knowledge
DBMS meets AI and statistics
Clustering, prediction (classification and regression),
association analysis, outlier analysis, evolution analysis, etc.
Usually complex statistical queries that are difficult to answer
=> often specialized algorithms outside DBMS
We will focus on frequent itemset mining

Mining frequent itemsets

Given: a large database of transactions, each containing a set of items
Example: market baskets
Find: all frequent itemsets
A set of items X is frequent if no less than s% of all transactions contain X
Examples: {diaper, beer}, {scanner, color printer}

TID  | items
T001 | diaper, milk, candy
T002 | milk, egg
T003 | milk, beer
T004 | diaper, milk, egg
T005 | diaper, beer
T006 | milk, beer
T007 | diaper, beer
T008 | diaper, milk, beer, candy
T009 | diaper, milk, beer

First try: naive algorithm

Keep a running count for each possible itemset
For each transaction T, and for each itemset X, if T contains X
then increment the count for X
Return itemsets with large enough counts
Problem: the number of itemsets is huge!
2^n, where n is the number of items
Think: how do we prune the search space?

The Apriori property

All subsets of a frequent itemset must also be frequent,
because any transaction that contains X must also contain subsets of X
=> If we have already verified that X is infrequent, there is no need
to count X's supersets because they must be infrequent too
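A Python sketch of the naive algorithm on the market-basket table above (it only materializes counts for itemsets that actually occur, but the per-transaction work is still exponential in the transaction size); `naive_frequent` and `min_frac` are names assumed for illustration:

```python
from itertools import combinations

# Market-basket transactions T001-T009 from the slides
baskets = [{"diaper", "milk", "candy"}, {"milk", "egg"}, {"milk", "beer"},
           {"diaper", "milk", "egg"}, {"diaper", "beer"}, {"milk", "beer"},
           {"diaper", "beer"}, {"diaper", "milk", "beer", "candy"},
           {"diaper", "milk", "beer"}]

def naive_frequent(transactions, min_frac):
    """Count every non-empty subset of every transaction, then keep
    itemsets contained in at least min_frac of all transactions.
    A transaction with t items spawns 2**t - 1 subsets."""
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), k):
                counts[itemset] = counts.get(itemset, 0) + 1
    threshold = min_frac * len(transactions)
    return {s: c for s, c in counts.items() if c >= threshold}

freq = naive_frequent(baskets, 0.4)
print(freq[("beer", "diaper")])  # 4 of 9 transactions contain {diaper, beer}
```

Even on nine tiny baskets the count dictionary grows quickly, which motivates the Apriori pruning that follows.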

The Apriori algorithm


Multiple passes over the transactions
Pass k finds all frequent k-itemsets (i.e., itemsets of size k)
Use the set of frequent k-itemsets found in pass k to
construct candidate (k+1)-itemsets to be counted in pass k+1
A (k+1)-itemset is a candidate only if all its subsets of
size k are frequent

Example: pass 1

Transactions (min. support 20% of 10 transactions, i.e., count >= 2):
TID  | items
T001 | A, B, E
T002 | B, D
T003 | B, C
T004 | A, B, D
T005 | A, C
T006 | B, C
T007 | A, C
T008 | A, B, C, E
T009 | A, B, C
T010 |

Scan and count each 1-itemset, then check min. support:
Frequent 1-itemsets: {A}, {B}, {C}, {D}, {E}
(Itemset {F} is infrequent)

Example: pass 2

Generate candidate 2-itemsets from the frequent 1-itemsets:
{A,B}, {A,C}, {A,D}, {A,E}, {B,C}, {B,D}, {B,E}, {C,D}, {C,E}, {D,E}
Scan and count, then check min. support:
Frequent 2-itemsets: {A,B}, {A,C}, {A,E}, {B,C}, {B,D}, {B,E}

Example: pass 3

Generate candidate 3-itemsets from the frequent 2-itemsets:
{A,B,C}, {A,B,E}
(every other 3-itemset has some 2-subset that is not frequent)
Scan and count, then check min. support:
Frequent 3-itemsets: {A,B,C} (count 2), {A,B,E} (count 2)

Example: pass 4

Generate candidate 4-itemsets from the frequent 3-itemsets: none
({A,B,C,E} is not a candidate: its subsets {A,C,E} and {B,C,E} are not frequent)
No more itemsets to count!
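The four passes can be sketched in Python. Since T010's items did not survive in the slides, this sketch uses only T001-T009, which reproduces the frequent itemsets shown; treat it as an illustration, not the course's implementation:

```python
from itertools import combinations

# Transactions T001-T009 from the example (T010's items are unknown)
transactions = [{"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"},
                {"A", "C"}, {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"},
                {"A", "B", "C"}]

def apriori(db, min_count):
    """Pass k counts candidate k-itemsets; a (k+1)-itemset is a
    candidate only if all of its size-k subsets are frequent."""
    items = sorted(set().union(*db))
    frequent = {}
    candidates = [frozenset([i]) for i in items]
    k = 1
    while candidates:
        # scan and count, then check min. support
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # generate candidate (k+1)-itemsets from frequent k-itemsets
        keys = sorted(level, key=sorted)
        candidates = sorted(
            {a | b for i, a in enumerate(keys) for b in keys[i + 1:]
             if len(a | b) == k + 1
             and all(frozenset(s) in level
                     for s in combinations(a | b, k))},
            key=sorted)
        k += 1
    return frequent

freq = apriori(transactions, 2)
three = sorted("".join(sorted(s)) for s in freq if len(s) == 3)
print(three)  # ['ABC', 'ABE']
```

The subset check in candidate generation is the Apriori property at work: {A,C,E} is never counted because {C,E} already failed in pass 2.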

Example: final answer

Frequent 1-itemsets: {A}, {B}, {C}, {D}, {E}
Frequent 2-itemsets: {A,B}, {A,C}, {A,E}, {B,C}, {B,D}, {B,E}
Frequent 3-itemsets: {A,B,C} (count 2), {A,B,E} (count 2)

Summary

Data warehousing
Eagerly integrate data from operational sources and store
a redundant copy to support OLAP
OLAP vs. OLTP: different workload => different degree of redundancy
SQL extensions: grouping and windowing
Data mining
Only covered frequent itemset counting
Skipped many other techniques (clustering, classification,
regression, etc.)
One key difference from statistics and machine learning:
massive datasets and I/O-efficient algorithms
