CompSci 316
Introduction to Database Systems

Midterm
On Thursday, in class
Open-book, open-notes
No communication devices
Solution to sample midterm was emailed this weekend
Will cover all materials through today
But more focus will be on parts that you already exercised

Data integration

 OLTP                         | OLAP
------------------------------+---------------------------
 Mostly updates               | Mostly reads
 Short, simple transactions   | Long, complex queries
 Clerical users               | Analysts, decision makers
 Goal: transaction throughput | Goal: fast queries

 Eager (warehousing)                      | Lazy
------------------------------------------+----------------------------------
 In advance: before queries               | On demand: at query time
 Copy data from sources                   | Leave data at sources
 Faster                                   | Slower
 Can operate when sources are unavailable | Interferes with local processing
Approaches to keeping the warehouse up to date
Recomputation
Easy to implement; just take periodic dumps of the sources, say, every night
What if there is no night, e.g., a global organization?
What if recomputation takes more than a day?
Incremental maintenance
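The contrast between the two approaches can be sketched in a few lines. This is only an illustration under stated assumptions: the warehouse is reduced to a per-product summary, and the function names `recompute` and `apply_delta` and the sample rows are hypothetical, not taken from the slides.

```python
def recompute(sales):
    """Recomputation: rebuild the summary from a full dump of the source.

    Returns {pid: (total qty, row count)}.
    """
    summary = {}
    for pid, qty in sales:
        total, count = summary.get(pid, (0, 0))
        summary[pid] = (total + qty, count + 1)
    return summary


def apply_delta(summary, inserted=(), deleted=()):
    """Incremental maintenance: update the summary from the changed rows only."""
    for sign, rows in ((1, inserted), (-1, deleted)):
        for pid, qty in rows:
            total, count = summary.get(pid, (0, 0))
            total, count = total + sign * qty, count + sign
            if count == 0:
                summary.pop(pid, None)  # the group's last row was deleted
            else:
                summary[pid] = (total, count)
    return summary


# Hypothetical source rows: (PID, qty)
old_source = [("p1", 1), ("p2", 2), ("p1", 5)]
inserted, deleted = [("p2", 4)], [("p1", 1)]
new_source = [("p2", 2), ("p1", 5), ("p2", 4)]

maintained = apply_delta(recompute(old_source), inserted, deleted)
recomputed = recompute(new_source)
```

Both paths end at the same summary, but `apply_delta` touches two rows while `recompute` rescans the whole source. The row count is kept next to the sum so that a group can be dropped when its last row is deleted.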
Store (dimension table)
 SID | city
-----+-------------
 s1  | Durham
 s2  | Chapel Hill
 s3  | RTP

Product (dimension table)
 PID | name   | cost
-----+--------+------
 p1  | beer   | 10
 p2  | diaper | 16

Customer (dimension table)
 CID | name | address | city
-----+------+---------+--------
 c3  | Amy  |         | Durham
 c4  | Ben  |         | Durham
 c5  | Coy  |         | Durham

Sale (fact table)
 OID | Date       | CID | PID | SID | qty | price
-----+------------+-----+-----+-----+-----+-------
 100 | 08/23/2012 | c3  | p1  | s1  |     | 12
 102 | 09/12/2012 | c3  | p2  | s1  |     | 17
 105 | 09/24/2012 | c5  | p1  | s3  |     | 13

Fact table: big; constantly growing; stores measures (often aggregated in queries)
Dimension table: small; updated infrequently

Data cube
[Figure: data cube with axes Customer (c3, c4, c5, ALL), Product (p1, p2), and Store (s1, s2, s3, ALL); for example, cell (c3, p1, s1) = 1.]

CUBE operator
(ALL, p1, s1, sum), (c2, ALL, ALL, sum), (ALL, ALL, ALL, sum), etc.
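Tuples such as (ALL, p1, s1, sum) can be produced by letting every fact contribute to each cell in which a dimension is either kept or replaced by ALL. The sketch below assumes a three-dimensional cube over (CID, PID, SID) with qty as the measure. The function name `cube` and the qty values are hypothetical, chosen only so that cell (c3, p1, s1) equals 1.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"


def cube(rows):
    """Sum qty for every combination of concrete-or-ALL dimension values."""
    totals = defaultdict(int)
    for cid, pid, sid, qty in rows:
        # Each fact lands in 2^3 = 8 cells: every dimension is either
        # kept as is or generalized to ALL.
        for cell in product((cid, ALL), (pid, ALL), (sid, ALL)):
            totals[cell] += qty
    return dict(totals)


# Hypothetical facts: (CID, PID, SID, qty)
sales = [
    ("c3", "p1", "s1", 1),
    ("c3", "p2", "s1", 2),
    ("c5", "p1", "s3", 5),
]
result = cube(sales)

print(result[("c3", "p1", "s1")])  # 1
print(result[(ALL, "p1", ALL)])    # 6
print(result[(ALL, ALL, ALL)])     # 8
```

A real system would not enumerate cells per fact like this. The point is only to show what the CUBE result contains.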
Aggregation lattice
GROUP BY ∅
GROUP BY CID          GROUP BY PID          GROUP BY SID
GROUP BY CID, PID     GROUP BY CID, SID     GROUP BY PID, SID
GROUP BY CID, PID, SID
Roll up: move toward GROUP BY ∅
Drill down: move toward GROUP BY CID, PID, SID
A parent can be computed from any child
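The statement that a parent can be computed from any child can be checked on a small example. The sketch below assumes the aggregate is a sum, which can be re-aggregated directly; an average would need the sum and the count carried along. The helper `roll_up` and the sample values are hypothetical.

```python
from collections import defaultdict


def roll_up(child, keep):
    """Aggregate a child GROUP BY result into a parent.

    `child` maps grouping-key tuples to sums; `keep` lists the key
    positions that the parent still groups by.
    """
    parent = defaultdict(int)
    for key, total in child.items():
        parent[tuple(key[i] for i in keep)] += total
    return dict(parent)


# Hypothetical result of GROUP BY CID, PID, SID
by_cid_pid_sid = {
    ("c3", "p1", "s1"): 1,
    ("c3", "p2", "s1"): 2,
    ("c5", "p1", "s3"): 5,
}

by_cid_pid = roll_up(by_cid_pid_sid, keep=(0, 1))  # child 1 of GROUP BY CID
by_cid_sid = roll_up(by_cid_pid_sid, keep=(0, 2))  # child 2 of GROUP BY CID

# GROUP BY CID computed from either child gives the same answer.
via_pid = roll_up(by_cid_pid, keep=(0,))
via_sid = roll_up(by_cid_sid, keep=(0,))
print(via_pid)  # {('c3',): 3, ('c5',): 5}
```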
Materialized views
Example
GROUP BY ∅ is small, but not useful to most queries
GROUP BY CID, PID, SID is useful to any query, but too large to be beneficial
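The trade-off in this example can be illustrated with a toy view picker: a query can be answered from any materialized view whose grouping columns include the query's, and the smallest such view is the cheapest to scan. This is a sketch under stated assumptions, not a view-selection algorithm; the function `pick_view` and the row counts are hypothetical.

```python
def pick_view(views, query_cols):
    """Return the grouping columns of the smallest usable view, or None.

    `views` maps a frozenset of grouping columns to an assumed row count.
    A view is usable if it groups by at least the columns the query needs.
    """
    usable = [(size, cols) for cols, size in views.items()
              if set(query_cols) <= cols]
    if not usable:
        return None
    return min(usable, key=lambda pair: pair[0])[1]


# Hypothetical materialized views with made-up sizes
views = {
    frozenset(): 1,                               # GROUP BY ∅
    frozenset({"CID"}): 1_000,                    # GROUP BY CID
    frozenset({"CID", "PID", "SID"}): 1_000_000,  # GROUP BY CID, PID, SID
}

print(pick_view(views, {"CID"}))  # frozenset({'CID'})
print(pick_view(views, {"PID"}) == frozenset({"CID", "PID", "SID"}))  # True
```

GROUP BY ∅ only serves the query with no grouping columns, while GROUP BY CID, PID, SID serves every query but at the largest scan cost.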
 sid     | pid    | cid   | qty
---------+--------+-------+-----
 Durham  | beer   | Alice |  10
 Durham  | beer   | Bob   |   2
 Durham  | chips  | Bob   |   3
 Durham  | diaper | Alice |   5
 Raleigh | beer   | Alice |   2
 Raleigh | diaper | Bob   | 100

[Figure: the rows grouped into partitions, e.g. (Durham, beer) with Alice 10 and Bob 2, (Durham, diaper) with Alice, and (Durham, chips) with Bob.]

Total qty per (sid, pid), ranked within each sid:

 sid     | pid    | sum | rank
---------+--------+-----+------
 Durham  | beer   |  12 |    1
 Durham  | diaper |   5 |    2
 Durham  | chips  |   3 |    3
 Raleigh | diaper | 100 |    1
 Raleigh | beer   |   2 |    2
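The ranked result can be reproduced step by step: sum qty per (sid, pid), then rank the sums in descending order within each sid. The slides do not show the query, so the sketch below is only one plausible reading of the output. The function name `rank_products` is hypothetical, and ties are assumed to share a rank the way SQL's RANK() does.

```python
from collections import defaultdict

# Rows from the table above: (sid, pid, cid, qty)
rows = [
    ("Durham", "beer", "Alice", 10),
    ("Durham", "beer", "Bob", 2),
    ("Durham", "chips", "Bob", 3),
    ("Durham", "diaper", "Alice", 5),
    ("Raleigh", "beer", "Alice", 2),
    ("Raleigh", "diaper", "Bob", 100),
]


def rank_products(rows):
    """Return (sid, pid, sum, rank) with rank computed per sid by sum desc."""
    sums = defaultdict(int)
    for sid, pid, _cid, qty in rows:
        sums[(sid, pid)] += qty

    by_store = defaultdict(list)  # one "window partition" per sid
    for (sid, pid), total in sums.items():
        by_store[sid].append((pid, total))

    result = []
    for sid in sorted(by_store):
        ordered = sorted(by_store[sid], key=lambda item: -item[1])
        rank, previous = 0, None
        for position, (pid, total) in enumerate(ordered, start=1):
            if total != previous:  # equal sums keep the same rank
                rank = position
            previous = total
            result.append((sid, pid, total, rank))
    return result


for line in rank_products(rows):
    print(line)
# ('Durham', 'beer', 12, 1)
# ('Durham', 'diaper', 5, 2)
# ('Durham', 'chips', 3, 3)
# ('Raleigh', 'diaper', 100, 1)
# ('Raleigh', 'beer', 2, 2)
```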
Multiple windows
 sid     | pid    | cid   | qty
---------+--------+-------+-----
 Durham  | beer   | Alice |  10
 Durham  | beer   | Bob   |   2
 Durham  | chips  | Bob   |   3
 Durham  | diaper | Alice |   5
 Raleigh | beer   | Alice |   2
 Raleigh | diaper | Bob   | 100
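Data mining
Data → knowledge
DBMS meets AI and statistics
Clustering, prediction (classification and regression), association analysis, outlier analysis, evolution analysis, etc.

Given a large database of transactions, each containing a set of items
Example: market baskets
Find frequent itemsets: sets of items contained in at least a minimum fraction of the transactions (min. support)

First try: a naïve algorithm
Number of possible itemsets: 2^m, where m is the number of items

All subsets of a frequent itemset must also be frequent
⇒ If an itemset is not frequent, no superset of it can be frequent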
Transactions (min. support = 20%)

 TID  | items
------+------------
 T001 | A, B, E
 T002 | B, D
 T003 | B, C
 T004 | A, B, D
 T005 | A, C
 T006 | B, C
 T007 | A, C
 T008 | A, B, C, E
 T009 | A, B, C
 T010 |

Example: pass 1
Check min. support
Frequent 1-itemsets: {A}, {B}, {C}, {D}, {E}

Example: pass 2
Generate candidates, then check min. support
Candidate 2-itemsets: {A,B}, {A,C}, {A,D}, {A,E}, {B,C}, {B,D}, {B,E}, {C,D}, {C,E}, {D,E}
Frequent 2-itemsets: {A,B}, {A,C}, {A,E}, {B,C}, {B,D}, {B,E}

Example: pass 3
Generate candidates, then check min. support
Candidate 3-itemsets: {A,B,C}, {A,B,E}
Frequent 3-itemsets: {A,B,C} (count 2), {A,B,E} (count 2)

Example: pass 4
Candidate 4-itemsets: (none)
No more itemsets to count!
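The passes above follow a level-wise pattern: count candidates, keep those meeting min. support, then build the next level's candidates from the survivors. The sketch below is a minimal version of that procedure, commonly known as Apriori, and is not claimed to be the course's reference implementation. Two assumptions: the items of T010 are not given in the listing, so it is modelled as an empty transaction, and min. support of 20% over 10 transactions is passed as a count of 2.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset occurring in >= min_count transactions."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Check min. support: count each candidate with one scan.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)

        # Generate candidates: join frequent k-itemsets into (k+1)-itemsets,
        # then prune any candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
    set(),  # T010: items missing in the source, assumed empty here
]

frequent = apriori(transactions, min_count=2)  # 20% of 10 transactions
three = sorted("".join(sorted(s)) for s in frequent if len(s) == 3)
print(three)  # ['ABC', 'ABE']
```

On this data the only 4-itemset the join produces is {A,B,C,E}. It is pruned because {A,C,E} and {B,C,E} are not frequent, so the loop stops after pass 3, as in the example.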
Summary
Data warehousing
Data mining