
Chapter 5
Mining Association Rules

Arif Djunaidy
e-mail: arif@its-sby.edu
URL: www.its-sby.edu/~arif
Outline
What is association rules mining?
The Apriori algorithm
Iceberg Queries
Methods to improve Apriori's efficiency
Mining frequent patterns without candidate
generation
Interestingness measurements
Multiple-level association rules mining
What Is Association Rules Mining?
Association rule mining:
Finding frequent patterns, associations, correlations, or causal structures
among sets of items or objects in transaction databases, relational
databases, and other information repositories.
Applications:
Basket data analysis, cross-marketing, catalog design, clustering,
classification, etc.
Examples:
buys(x, computer) ⇒ buys(x, software) [2%, 75%]
age(x, mature) ∧ takes(x, DM) ⇒ grade(x, A) [5%, 75%]
Association Rules Mining: Basic Principle
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
Also known as market basket analysis
Market-Basket transactions
TID  Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} ⇒ {Beer}
{Milk, Bread} ⇒ {Eggs, Coke}
{Beer, Bread} ⇒ {Milk}
Implication means co-occurrence,
not causality!
Definition: Frequent Itemset
Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset
An itemset that contains k items
Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2
Support
Fraction of transactions that contain an
itemset
E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater
than or equal to a minsup threshold
TID  Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Definition: Association Rule

Association Rule
An implication expression of the form X ⇒ Y, where X and Y are itemsets
Example: {Milk, Diaper} ⇒ {Beer}

Rule Evaluation Metrics
Support (s)
Fraction of transactions that contain both X and Y
Confidence (c)
Measures how often items in Y appear in transactions that contain X

Example, for the rule {Milk, Diaper} ⇒ {Beer}:
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

TID  Items
1  Bread, Milk
2  Bread, Diaper, Beer, Eggs
3  Milk, Diaper, Beer, Coke
4  Bread, Milk, Diaper, Beer
5  Bread, Milk, Diaper, Coke
Association Rule Mining Task
Given a set of transactions T, the goal of association
rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold

High confidence = strong pattern
High support = occurs often
Less likely to be a random occurrence
Larger potential benefit from acting on the rule
Application 1 (Retail Stores)
Real market baskets
chain stores keep TBs of customer purchase info
Value?
how typical customers navigate stores
positioning tempting items
suggests cross-sell opportunities, e.g., a hamburger sale
while raising the ketchup price

High support needed, or no $$$
Application 2 (Information Retrieval)
Scenario 1
baskets = documents
items = words in documents
frequent word-groups = linked concepts.
Scenario 2
items = sentences
baskets = documents containing sentences
frequent sentence-groups = possible plagiarism
Application 3 (Web Search)
Scenario 1
baskets = web pages
items = outgoing links
pages with similar references → about the same topic
Scenario 2
baskets = web pages
items = incoming links
pages with similar in-links → mirrors, or the same
topic
Mining Association Rules
Example of Rules:

{Milk, Diaper} ⇒ {Beer} (s=0.4, c=0.67)
{Milk, Beer} ⇒ {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} ⇒ {Milk} (s=0.4, c=0.67)
{Beer} ⇒ {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} ⇒ {Milk, Beer} (s=0.4, c=0.5)
{Milk} ⇒ {Diaper, Beer} (s=0.4, c=0.5)
TID  Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Observations:
All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but
can have different confidence
Thus, we may decouple the support and confidence requirements
Mining Association Rules
Goal: find all association rules such that
support ≥ s
confidence ≥ c
Reduction to the frequent itemsets problem:
Find all frequent itemsets X
Given X = {A_1, ..., A_k}, generate all rules X − A_j ⇒ A_j
Confidence = sup(X) / sup(X − A_j)
Support = sup(X)
Exclude rules whose confidence is too low
Observe: X − A_j is also frequent, so its support is already known
Finding all frequent itemsets is the hard part!
Association Rule Mining: A Road Map
Boolean vs. quantitative associations (based on the types of
values handled)
buys(x, WINDOWS 2K) ∧ buys(x, SQLServer) ⇒ buys(x, DBMiner) [0.2%, 50%]
age(x, 30..39) ∧ income(x, 42..48K) ⇒ buys(x, PC) [1%, 75%]
Single-dimension vs. multiple-dimensional associations (see
examples above)
Single-level vs. multiple-level analysis
How are association rules mined from
large databases?
Association rule mining is a two-step process.
1. Find all frequent itemsets:
By definition, each of these itemsets occurs at least as frequently as a
predetermined minimum support count.
2. Generate strong association rules from the frequent
itemsets:
By definition, these rules must satisfy minimum support and minimum
confidence.

Itemset Lattice: An Example
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Given m items, there are 2^m - 1 possible candidate itemsets
Scale of Problem
WalMart
sells m=100,000 items
tracks n=1,000,000,000 baskets
Web
several billion pages
approximately one new word per page
Exponential number of itemsets:
m items → 2^m - 1 possible itemsets
Cannot possibly examine all itemsets for large m
Even itemsets of size 2 may be too many
m = 100,000 → 5 trillion item pairs
Frequent Itemsets in SQL
DBMSs are poorly suited to association rule mining
Star schema
Sales Fact
Transaction ID degenerate dimension
Item dimension
Finding frequent 3-itemsets:

SELECT Fact1.ItemID, Fact2.ItemID, Fact3.ItemID, COUNT(*)
FROM SalesFact Fact1
JOIN SalesFact Fact2
  ON Fact1.TID = Fact2.TID
 AND Fact1.ItemID < Fact2.ItemID
JOIN SalesFact Fact3
  ON Fact1.TID = Fact3.TID
 AND Fact2.ItemID < Fact3.ItemID
GROUP BY Fact1.ItemID, Fact2.ItemID, Fact3.ItemID
HAVING COUNT(*) > 1000;
Finding frequent k-itemsets requires joining k copies of fact table
Joins are non-equijoins
Impossibly expensive!
Association Rules and Data Warehouses
Typical procedure:
Use data warehouse to apply filters
Mine association rules for certain regions, dates
Export all fact rows matching filters to flat file
Sort by transaction ID
Items in same transaction are grouped together
Perform association rule mining on flat file
An alternative:
Database vendors are beginning to add specialized data mining
capabilities
Efficient algorithms for common data mining tasks are built in to the
database system
Decision trees, association rules, clustering, etc.
Not standardized yet
Finding Frequent Pairs
Frequent 2-sets:
already the hard case
focus on pairs for now, later extend to k-sets
Naïve algorithm:
Count all m(m−1)/2 item pairs (m = number of distinct items)
Single pass scanning all baskets
A basket of size b increments b(b−1)/2 counters
Failure?
if memory < m(m−1)/2 counters
m = 100,000 → 5 trillion item pairs
The naïve algorithm is impractical for large m
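For illustration, a sketch of this naive counting pass in Python, using a dictionary of counters rather than the triangular array a memory-conscious implementation would prefer:

from collections import Counter
from itertools import combinations

baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
]

pair_counts = Counter()
for basket in baskets:                             # single pass over baskets
    for pair in combinations(sorted(basket), 2):   # b(b-1)/2 pairs per basket
        pair_counts[pair] += 1

print(pair_counts.most_common(3))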

Pruning Candidate Itemsets
Monotonicity principle:
If an itemset is frequent, then all of its subsets must also
be frequent

The monotonicity principle holds due to the following
property of the support measure:

∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

Contrapositive:
If an itemset is infrequent, then all of its supersets must
also be infrequent
Illustrating the Monotonicity Principle

[Figure: the itemset lattice from before; once one itemset is found to be infrequent, all of its supersets are pruned from the search.]
Mining Frequent Itemsets: the Key Step
The Apriori principle:
Any subset of a frequent itemset must be frequent

Find the frequent itemsets: the sets of items that have
minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {A, B} is a frequent itemset, then both {A} and {B} must be
frequent itemsets
Iteratively find frequent itemsets with cardinality from 1 to k
(k-itemset)
Use the frequent itemsets to generate association rules.
The Apriori Algorithm
Join step: C_k is generated by joining L_{k-1} with itself
Prune step: any (k-1)-itemset that is not frequent cannot be a
subset of a frequent k-itemset

Pseudo-code:
C_k: candidate itemsets of size k
L_k: frequent itemsets of size k

L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
  C_{k+1} = candidates generated from L_k;
  for each transaction t in database do
    increment the count of all candidates in C_{k+1} that are contained in t
  L_{k+1} = candidates in C_{k+1} with min_support
end
return ∪_k L_k;
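A compact Python sketch of this pseudo-code, assuming the transactions fit in memory; running it on the four-transaction database of the next slide reproduces the L1, L2 and L3 shown there:

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    L = {frozenset([i]) for i in items
         if sum(1 for t in transactions if i in t) >= min_sup}
    frequent, k = dict(count(L)), 2
    while L:
        # join step: unions of L_{k-1} itemsets that give k-itemsets
        C = {a | b for a in L for b in L if len(a | b) == k}
        # prune step: drop candidates with an infrequent (k-1)-subset
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        counts = count(C)
        L = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in L})
        k += 1
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, sup in sorted(apriori(db, 2).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), sup)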
The Apriori Algorithm: Example (min_sup = 2)

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D to count C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

Generate C2 from L1, scan D to count: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

Generate C3 from L2, scan D to count: {2 3 5}:2
L3: {2 3 5}:2
Generating Association Rules from Frequent Itemsets

From L_2 = {1, 3}:
1 ⇒ 3: sup(1∪3) = 2, conf(1⇒3) = sup(1∪3)/sup(1) = 2/2 = 100%
3 ⇒ 1: sup(1∪3) = 2, conf(3⇒1) = sup(1∪3)/sup(3) = 2/3 = 67%
From L_3 = {2, 3, 5}:
2∪3 ⇒ 5: sup(2∪3∪5) = 2, conf = sup(2∪3∪5)/sup(2∪3) = 2/2 = 100%
2 ⇒ 3∪5: sup(2∪3∪5) = 2, conf = sup(2∪3∪5)/sup(2) = 2/3 = 67%
2∪5 ⇒ 3: sup(2∪3∪5) = 2, conf = sup(2∪3∪5)/sup(2∪5) = 2/3 = 67%
3∪5 ⇒ 2: sup(2∪3∪5) = 2, conf = sup(2∪3∪5)/sup(3∪5) = 2/2 = 100%
3 ⇒ 2∪5: sup(2∪3∪5) = 2, conf = sup(2∪3∪5)/sup(3) = 2/3 = 67%
5 ⇒ 2∪3: sup(2∪3∪5) = 2, conf = sup(2∪3∪5)/sup(5) = 2/3 = 67%
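The same enumeration can be scripted. This sketch hard-codes the support counts from the example database and prints each rule derived from {2, 3, 5} with its confidence:

from itertools import combinations

sup = {frozenset(s): n for s, n in [
    ((2,), 3), ((3,), 3), ((5,), 3),
    ((2, 3), 2), ((2, 5), 3), ((3, 5), 2),
    ((2, 3, 5), 2),
]}

itemset = frozenset({2, 3, 5})
for r in range(1, len(itemset)):             # every non-empty proper LHS
    for lhs in map(frozenset, combinations(itemset, r)):
        conf = sup[itemset] / sup[lhs]
        print(f"{set(lhs)} -> {set(itemset - lhs)}: conf = {conf:.0%}")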
How to Generate Candidates?
Suppose the items in L_{k-1} are listed in an order
Step 1: self-joining L_{k-1}

insert into C_k
select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

Step 2: pruning

forall itemsets c in C_k do
  forall (k-1)-subsets s of c do
    if (s is not in L_{k-1}) then delete c from C_k
Example of Generating Candidates
L_3 = {abc, abd, acd, ace, bcd}
Self-joining: L_3 * L_3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L_3
C_4 = {abcd}
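A Python sketch of the two steps for exactly this example, keeping itemsets as sorted tuples so the join condition (equal prefixes, smaller last item) is easy to express:

from itertools import combinations

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]

# Step 1: self-join L3 with itself
joined = [p + (q[-1],) for p in L3 for q in L3
          if p[:-1] == q[:-1] and p[-1] < q[-1]]

# Step 2: prune candidates that have a 3-subset not in L3
L3_set = set(L3)
C4 = [c for c in joined
      if all(s in L3_set for s in combinations(c, 3))]

print(joined)  # [('a','b','c','d'), ('a','c','d','e')]
print(C4)      # [('a','b','c','d')]  (acde pruned: ade not in L3)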
Iceberg Queries
Iceberg query: compute an aggregate over one attribute (or a set of
attributes) only for those groups whose aggregate value is above a
certain threshold
Example:
select P.custID, P.itemID, sum(P.qty)
from purchase P
group by P.custID, P.itemID
having sum(P.qty) >= 10
Compute iceberg queries efficiently by Apriori:
First compute lower dimensions
Then compute higher dimensions only when all the lower ones
are above the threshold
Iceberg Queries (Cont.)
Generate cust_list, a list of customers who bought three or
more items in total, for example:
select P.cust_ID
from Purchases P
group by P.cust_ID
having SUM(P.qty) >= 3;

Generate item_list, a list of items that were purchased by
any customer in quantities of three or more, for example:
select P.item_ID
from Purchases P
group by P.item_ID
having SUM(P.qty) >= 3;
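A Python sketch of the overall strategy on toy data (the rows and the threshold of 3 are illustrative): pair counts are accumulated only for customers and items that individually pass the threshold, which is safe because quantities are positive, so a pair's sum can never exceed its customer's or item's total.

from collections import Counter

purchases = [  # (cust_ID, item_ID, qty) rows of the Purchases table
    ("c1", "i1", 2), ("c1", "i2", 4), ("c2", "i1", 5),
    ("c2", "i2", 1), ("c3", "i1", 1),
]
THRESHOLD = 3

cust_qty, item_qty = Counter(), Counter()
for cust, item, qty in purchases:            # lower dimensions first
    cust_qty[cust] += qty
    item_qty[item] += qty
cust_list = {c for c, q in cust_qty.items() if q >= THRESHOLD}
item_list = {i for i, q in item_qty.items() if q >= THRESHOLD}

# A (cust, item) group can reach the threshold only if both of its
# one-dimensional aggregates did, so all other pairs are skipped.
pair_qty = Counter()
for cust, item, qty in purchases:
    if cust in cust_list and item in item_list:
        pair_qty[cust, item] += qty

print({p: q for p, q in pair_qty.items() if q >= THRESHOLD})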

Is Apriori Fast Enough?
Performance Bottlenecks
The core of the Apriori algorithm:
Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
Use database scans and pattern matching to collect counts for the
candidate itemsets
The bottleneck of Apriori: candidate generation
Huge candidate sets:
10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
To discover a frequent pattern of size 100, e.g., {a_1, a_2, ..., a_100}, one
needs to generate 2^100 ≈ 10^30 candidates
Multiple scans of the database:
Needs (n + 1) scans, where n is the length of the longest pattern
Methods to Improve Apriori's Efficiency
Transaction reduction:
A transaction that does not contain any frequent k-itemset is
useless in subsequent scans, because it cannot contain any
frequent (k+1)-itemset. Such a transaction can therefore be
removed from further consideration.
Partitioning:
Any itemset that is potentially frequent in DB must be frequent
in at least one of the partitions of DB
Partitioning
Phase I (1 scan): divide the transactions in D into n partitions and
find the frequent itemsets local to each partition
Combine all local frequent itemsets to form the candidate itemsets
Phase II (1 scan): find the global frequent itemsets in D among the
candidates
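A Python sketch of the two phases on the earlier four-transaction example; the local threshold is scaled down in proportion to the partition size, and the subset enumeration is brute force purely for illustration:

from itertools import combinations

def local_frequent(partition, min_sup):
    # brute force: enumerate every subset of every transaction
    seen = set()
    for t in partition:
        for k in range(1, len(t) + 1):
            seen.update(map(frozenset, combinations(sorted(t), k)))
    return {s for s in seen
            if sum(1 for t in partition if s <= t) >= min_sup}

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_sup, n = 2, 2

# Phase I: split D and mine each partition with a scaled-down threshold
parts = [db[i::n] for i in range(n)]
local_sup = max(1, min_sup * len(parts[0]) // len(db))
candidates = set().union(*(local_frequent(p, local_sup) for p in parts))

# Phase II: one scan of the full database verifies the candidates
frequent = {s for s in candidates
            if sum(1 for t in db if s <= t) >= min_sup}
print(sorted(map(sorted, frequent)))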
Scan once Algorithm (Support count: 3)
Item:          a  b  c  d  e
Transaction 1  1  1  0  1  1
Transaction 2  0  1  1  0  1
Transaction 3  1  1  0  1  1
Transaction 4  1  1  1  0  1
Transaction 5  1  1  1  1  1
Transaction 6  0  1  1  1  0

Table: Boolean relational database D
Scan once Algorithm
Figure: A complete itemset tree for the five items a, b, c, d and e of the
database shown in the table:

Level 0 (C(5,1)): a b c d e
Level 1 (C(5,2)): ab ac ad ae bc bd be cd ce de
Level 2 (C(5,3)): abc abd abe acd ace ade bcd bce bde cde
Level 3 (C(5,4)): abcd abce abde acde bcde
Level 4 (C(5,5)): abcde
[Table: the support count of each candidate itemset, accumulated across transactions T1-T6 of database D in a single scan.]
Mining Frequent Patterns Without
Candidate Generation
Compress a large database into a compact, Frequent-
Pattern tree (FP-tree) structure
highly condensed, but complete for frequent pattern mining
avoid costly database scans
Develop an efficient, FP-tree-based frequent pattern
mining method
A divide-and-conquer methodology: decompose mining tasks
into smaller ones
Avoid candidate generation: sub-database test only!
Construct FP-tree from a Transaction DB
min_support = 0.5 (i.e., support count >= 3 over the 5 transactions)

TID  Items bought               (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree:

{}
+- f:4
|  +- c:3
|  |  +- a:3
|  |     +- m:2
|  |     |  +- p:2
|  |     +- b:1
|  |        +- m:1
|  +- b:1
+- c:1
   +- b:1
      +- p:1
Steps:
1. Scan DB once, find frequent
1-itemset (single item
pattern)
2. Order frequent items in
frequency descending order
3. Scan DB again, construct
FP-tree
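A minimal Python sketch of these three steps, with illustrative class and field names. Ties in item frequency are broken alphabetically here, so the global order comes out c, f, a, b, m, p instead of the slide's f, c, a, b, m, p; the tree shape differs slightly but the mined patterns do not.

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_count):
    # Step 1: one scan to find the frequent single items
    freq = Counter(i for t in transactions for i in t)
    freq = {i: n for i, n in freq.items() if n >= min_count}
    # Step 2: order them by descending frequency (ties: alphabetical)
    order = sorted(freq, key=lambda i: (-freq[i], i))
    # Step 3: second scan inserts each transaction's ordered frequent items
    root, header = Node(None, None), {i: [] for i in order}
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)   # node-link list for this item
            child.count += 1
            node = child
    return root, header

db = ["f a c d g i m p", "a b c f l m o", "b f h j o",
      "b c k s p", "a f c e l p m n"]
root, header = build_fp_tree([set(t.split()) for t in db], min_count=3)
print({i: sum(n.count for n in nodes) for i, nodes in header.items()})
# {'c': 4, 'f': 4, 'a': 3, 'b': 3, 'm': 3, 'p': 3}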
Benefits of the FP-tree Structure
Completeness:
preserves complete information for frequent pattern mining
Compactness
reduces irrelevant information: infrequent items are gone
frequency-descending ordering: more frequent items are more likely
to be shared
never larger than the original database (not counting node-links
and counts)
Mining Frequent Patterns Using FP-tree
General idea (divide-and-conquer)
Recursively grow frequent pattern path using the FP-tree
Method
For each item, construct its conditional pattern-base, and then its
conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path
(single path will generate all the combinations of its sub-paths, each of
which is a frequent pattern)
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each
node in the FP-tree
2) Construct conditional FP-tree from each
conditional pattern-base
3) Recursively mine conditional FP-trees and
grow frequent patterns obtained so far
Step 1: From FP-tree to Conditional
Pattern Base
Starting at the frequent header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item
Accumulate all of transformed prefix paths of that item to form a
conditional pattern base
Conditional pattern bases
item cond. pattern base
c f:3
a fc:3
b fca:1, f:1, c:1
m fca:2, fcab:1
p fcam:2, cb:1
Step 2: Construct Conditional FP-tree
For each pattern-base
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern
base
m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree:

{}
+- f:3
   +- c:3
      +- a:3

All frequent patterns concerning m:
m, fm, cm, am, fcm, fam, cam, fcam
Mining Frequent Patterns by
Creating Conditional Pattern-Bases
Item  Conditional pattern-base    Conditional FP-tree
f     Empty                       Empty
c     {(f:3)}                     {(f:3)}|c
a     {(fc:3)}                    {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}     Empty
m     {(fca:2), (fcab:1)}         {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}          {(c:3)}|p
Single FP-tree Path Generation
Suppose an FP-tree T has a single path P
The complete set of frequent pattern of T can be
generated by enumeration of all the combinations of the
sub-paths of P
Example: the m-conditional FP-tree is the single path
{} - f:3 - c:3 - a:3,
and enumerating the combinations of its sub-paths yields all frequent
patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
Principles of Frequent Pattern Growth
Pattern growth property
Let α be a frequent itemset in DB, B be α's conditional pattern
base, and β be an itemset in B. Then α ∪ β is a frequent
itemset in DB iff β is frequent in B.
"abcdef" is a frequent pattern, if and only if
"abcde" is a frequent pattern, and
f is frequent in the set of transactions containing "abcde"
Why Is Frequent Pattern Growth Fast?
Our performance study shows
FP-growth is an order of magnitude faster than Apriori, and is
also faster than tree-projection
Reasoning
No candidate generation, no candidate test
Use compact data structure
Eliminate repeated database scan
Basic operation is counting and FP-tree building
Interestingness Measurements
Objective measures
Two popular measurements:
support; and
confidence

Subjective measures (Silberschatz & Tuzhilin,
KDD95)
A rule (pattern) is interesting if
it is unexpected (surprising to the user); and/or
actionable (the user can do something with it)
Criticism to Support and Confidence
Example 1: (Aggarwal & Yu, PODS98)
Among 5000 students
3000 play basketball
3750 eat cereal
2000 both play basket ball and eat cereal
play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall
percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although
with lower support and confidence

basketball not basketball sum(row)
cereal 2000 1750 3750
not cereal 1000 250 1250
sum(col.) 3000 2000 5000
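Working through the numbers in the table confirms the criticism; the ratio of the rule's confidence to the overall rate of cereal eaters (its lift) is below 1:

both, basketball, cereal, total = 2000, 3000, 3750, 5000

support = both / total                # 0.40
confidence = both / basketball        # 0.667
lift = confidence / (cereal / total)  # 0.667 / 0.75 = 0.89 < 1
print(support, round(confidence, 3), round(lift, 2))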
Criticism to Support and Confidence (Cont.)

Example 2:
X and Y: positively correlated
X and Z: negatively correlated
Yet the support and confidence of X ⇒ Z dominate
We need a measure of dependent or correlated events:

corr(A,B) = P(A∪B) / (P(A) · P(B))

P(B|A) / P(B) is also called the lift of the rule A ⇒ B

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Rule   Support  Confidence
X⇒Y    25%      50%
X⇒Z    37.50%   75%
Other Interestingness Measures: Interest
Interest (correlation, lift):

P(A∪B) / (P(A) · P(B))

takes both P(A) and P(B) into consideration
P(A∪B) = P(A) · P(B) if A and B are independent events
A and B are negatively correlated if the value is less than 1; otherwise A
and B are positively correlated

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Itemset  Support  Interest
X,Y      25%      2
X,Z      37.50%   0.9
Y,Z      12.50%   0.57
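The interest column can be reproduced directly from the eight transactions above; note X,Z evaluates to about 0.86, which the slide rounds to 0.9:

X = (1, 1, 1, 1, 0, 0, 0, 0)
Y = (1, 1, 0, 0, 0, 0, 0, 0)
Z = (0, 1, 1, 1, 1, 1, 1, 1)

def interest(a, b):
    n = len(a)
    p_ab = sum(x and y for x, y in zip(a, b)) / n   # P(A U B)
    return p_ab / ((sum(a) / n) * (sum(b) / n))

print(round(interest(X, Y), 2))   # 2.0  -> positively correlated
print(round(interest(X, Z), 2))   # 0.86 -> negatively correlated
print(round(interest(Y, Z), 2))   # 0.57 -> negatively correlated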
Multiple-Level Association Rules
Items often form a hierarchy.
Items at the lower levels are
expected to have lower
support.
Rules regarding itemsets at
appropriate levels could be
quite useful.
Transaction database can be
encoded based on dimensions
and levels
We can explore shared multi-
level mining
[Figure: item hierarchy. All splits into Computer and Printer; Computer into Desktop and Laptop (brands such as IBM and Compaq); Printer into B/W and Color (brands such as HP and Sony).]

TID  Items
T1   {111, 121, 211, 221}
T2   {111, 211, 222, 323}
T3   {112, 122, 221, 411}
T4   {111, 121}
T5   {111, 122, 211, 221, 413}
Mining Multi-Level Associations
A top-down, progressive deepening approach:
First find high-level strong rules:
computer ⇒ printer [20%, 60%]
Then find their lower-level, weaker rules:
desktop ⇒ printer [6%, 50%]
Variations in mining multiple-level association rules:
Level-crossed association rules:
desktop ⇒ Sony color printer
Association rules with multiple, alternative hierarchies:
desktop ⇒ color printer
Uniform Support
Multi-level mining with uniform support:

Level 1 (min_sup = 5%): Computer [support = 10%]
Level 2 (min_sup = 5%): Desktop [support = 6%], Laptop [support = 4%]

With a uniform threshold, Laptop (4% < 5%) is pruned at level 2.
Reduced Support
Multi-level mining with reduced support:

Level 1 (min_sup = 5%): Computer [support = 10%]
Level 2 (min_sup = 3%): Desktop [support = 6%], Laptop [support = 4%]

With the reduced threshold, Laptop (4% >= 3%) now survives at level 2.
Multi-Dimensional Association: Concepts
Single-dimensional rules:
buys(X, milk) ⇒ buys(X, bread)

Multi-dimensional rules:
Inter-dimension association rules (no repeated predicates):
age(X, 19-25) ∧ occupation(X, student) ⇒ buys(X, coke)
Hybrid-dimension association rules (repeated predicates):
age(X, 19-25) ∧ buys(X, popcorn) ⇒ buys(X, coke)
Summary
Association rule mining
probably the most significant contribution from the
database community in KDD
A large number of papers have been published
Many interesting issues have been explored
An interesting research direction
Association analysis in other types of data: spatial
data, multimedia data, time series data, etc.
End of Chapter 5
