
Association Rule Mining

• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction

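Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},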

Implication means co-occurrence, not causality!

Transaction data set can be represented in binary form.
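
As a minimal sketch of the binary form, assuming the five market-basket transactions used in these slides (the variable names are illustrative), each transaction becomes a row of 0/1 flags with one column per item:

```python
# Five market-basket transactions (example data assumed from the slides).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# One column per distinct item, in sorted order.
items = sorted(set().union(*transactions))

# 1 if the item is present in the transaction, 0 otherwise.
binary_rows = [[int(item in t) for item in items] for t in transactions]

print(items)
for tid, row in enumerate(binary_rows, start=1):
    print(tid, row)
```

Each row sums to the number of items in that transaction, and each column sums to the support count of the corresponding single item.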


Issues regarding Association Rule Mining
• 1. Discovering patterns from a large transaction data set can be
computationally expensive.

• 2. Some of the discovered patterns are not meaningful, because
they happen just by chance.
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup
threshold
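
The definitions above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names (support_count, support, is_frequent) are hypothetical, and the data is assumed to be the five-transaction market-basket example.

```python
# Example data assumed from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain every item."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(itemset): fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def is_frequent(itemset, transactions, minsup):
    """True if the support is at least the minsup threshold (a fraction)."""
    return support(itemset, transactions) >= minsup

example = {"Milk", "Bread", "Diaper"}
print(support_count(example, transactions))  # 2
print(support(example, transactions))        # 0.4
```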
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y
are disjoint itemsets, i.e. X ∩ Y = Ø
– Example:
{Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s)
• Fraction of transactions that contain both X and Y
– Confidence (c)
• Measures how often items in Y appear in transactions that
contain X

Example: {Milk, Diaper} ⇒ {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
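
As a rough sketch of the two rule metrics, assuming the five-transaction example data and a hypothetical helper named rule_metrics, support and confidence for X → Y could be computed like this:

```python
# Example data assumed from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, transactions):
    """Support count: transactions containing every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y."""
    X, Y = set(X), set(Y)
    if X & Y:
        raise ValueError("X and Y must be disjoint")
    n_x = sigma(X, transactions)
    if n_x == 0:
        raise ValueError("X never occurs, so confidence is undefined")
    n_xy = sigma(X | Y, transactions)
    return n_xy / len(transactions), n_xy / n_x

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(round(s, 2), round(c, 2))  # 0.4 0.67
```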
Why use Support and Confidence?
• A rule with low support may occur simply by chance.
• A low-support rule is also likely to be uninteresting as far as
business interests are concerned.
• So support is used to eliminate such rules.
• Confidence measures the reliability of the inference made by
the rule.
• For a rule X → Y, the higher the confidence, the more likely it
is for Y to be present in transactions that contain X.

Association Rule Mining Task
• Given a set of transactions T, the goal of association rule
mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!

Mining Association Rules
Example of Rules:

{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
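
The observations above can be checked with a short sketch. It assumes the five-transaction example data, and rules_from_itemset is a hypothetical helper that enumerates every binary partition of one itemset:

```python
from itertools import combinations

# Example data assumed from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count of an itemset in the example data."""
    return sum(1 for t in transactions if set(itemset) <= t)

def rules_from_itemset(itemset):
    """All rules X -> Y where X and Y partition the itemset (both non-empty)."""
    items = sorted(itemset)
    n_all = sigma(items)
    rules = []
    for r in range(1, len(items)):
        for lhs in combinations(items, r):
            X = frozenset(lhs)
            Y = frozenset(items) - X
            n_x = sigma(X)
            if n_x == 0:
                continue  # confidence undefined when X never occurs
            rules.append((X, Y, n_all / len(transactions), n_all / n_x))
    return rules

for X, Y, s, c in rules_from_itemset({"Milk", "Diaper", "Beer"}):
    print(sorted(X), "->", sorted(Y), f"s={s:.1f}", f"c={c:.2f}")
```

Every rule shares the same numerator, the support count of the full itemset, which is why the supports coincide while the confidences vary with the antecedent.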
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset

• Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
[Figure: itemset lattice for items A, B, C, D, E]

null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database

Transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

– Match each transaction against every candidate

[Figure: the N transactions (width w) matched against the list of candidates]
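
A minimal sketch of the brute-force approach, assuming the five transactions listed above: it enumerates every non-empty itemset over the six items as a candidate (the empty "null" node of the lattice is skipped) and matches each transaction against every candidate. Variable names are illustrative.

```python
from itertools import combinations

# Example data assumed from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

items = sorted(set().union(*transactions))

# Every non-empty itemset in the lattice is a candidate.
candidates = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
]

comparisons = 0
counts = {}
for c in candidates:
    for t in transactions:
        comparisons += 1          # one candidate-vs-transaction match
        if c <= t:
            counts[c] = counts.get(c, 0) + 1

print(len(candidates), comparisons)
```

Even for six items this needs N × M = 5 × 63 matches, and M doubles with every additional item.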
Computational Complexity
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

$$R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$$

If d = 6, R = 602 rules
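
As a quick numerical check of the rule count, the sketch below evaluates the double sum directly and compares it with the closed form 3^d − 2^(d+1) + 1. The function names are hypothetical.

```python
from math import comb

def rule_count_by_sum(d):
    """Choose k items for the left side, then j of the remaining d-k for the right."""
    return sum(
        comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d)
    )

def rule_count_closed_form(d):
    """Closed form: 3^d - 2^(d+1) + 1."""
    return 3**d - 2 ** (d + 1) + 1

print(rule_count_by_sum(6), rule_count_closed_form(6))  # 602 602
```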


Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– The Apriori principle is used to eliminate some candidate
itemsets without counting their support.
• Reduce the number of transactions (N)

• Reduce the number of comparisons (NM)


– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every
transaction.
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent.
– Suppose {c,d,e} is a frequent itemset
– Then any transaction that contains {c,d,e} must also
contain its subsets like {c,d},{c,e}, {d,e}, {c}, {d}, {e}.
– So, if {c,d,e} is frequent , then all its subsets must also
be frequent.
Conversely, if an itemset, say {a,b}, is infrequent, then all its
supersets will be infrequent.

This strategy of trimming the search space based on the support
measure is called SUPPORT-BASED PRUNING.
• Apriori principle holds due to the following property of the support
measure:

∀X , Y : ( X ⊆ Y ) ⇒ s( X ) ≥ s(Y )
– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
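
The anti-monotone property can be checked exhaustively on a small data set. The sketch below assumes the five-transaction market-basket example and tests s(X) ≥ s(Y) for every non-empty pair X ⊆ Y. It is a sanity check on one data set, not a proof.

```python
from itertools import combinations

# Example data assumed from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

items = sorted(set().union(*transactions))

violations = []
for k in range(1, len(items) + 1):
    for Y in combinations(items, k):
        for r in range(1, k + 1):
            for X in combinations(Y, r):      # every non-empty subset X of Y
                if support(X) < support(Y):
                    violations.append((X, Y))

print(len(violations))  # 0
```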
Illustrating Apriori Principle
Frequent itemset generation
Minimum Support = 3

Items (Candidate 1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (Candidate 2-itemsets)
(No need to generate candidates involving Coke or Eggs)
Itemset           Count
{Bread,Milk}      3
{Bread,Beer}      2
{Bread,Diaper}    3
{Milk,Beer}       2
{Milk,Diaper}     3
{Beer,Diaper}     3

Triplets (3-itemsets)
Itemset               Count
{Bread,Milk,Diaper}   2

If every subset is considered, 6C1 + 6C2 + 6C3 = 41
With support-based pruning, 6 + 6 + 1 = 13
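
The two candidate totals can be reproduced with a short sketch, assuming the five-transaction example data and a minimum support count of 3. Variable names are illustrative.

```python
from itertools import combinations
from math import comb

# Example data assumed from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup_count = 3

def count(itemset):
    """Support count of an itemset."""
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))

# Without pruning: every itemset of size 1, 2 and 3 is a candidate.
without_pruning = sum(comb(len(items), k) for k in (1, 2, 3))

# With support-based pruning: only extend itemsets that are frequent.
c1 = [frozenset([i]) for i in items]
f1 = [c for c in c1 if count(c) >= minsup_count]
c2 = [a | b for a, b in combinations(f1, 2)]
f2 = {c for c in c2 if count(c) >= minsup_count}
frequent_items = sorted(set().union(*f1))
c3 = [
    frozenset(c)
    for c in combinations(frequent_items, 3)
    if all(frozenset(p) in f2 for p in combinations(c, 2))
]
with_pruning = len(c1) + len(c2) + len(c3)

print(without_pruning, with_pruning)  # 41 13
```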
Apriori Algorithm
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Prune candidate itemsets containing subsets of length k that
are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those
that are frequent
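
The method above can be sketched as follows. This is a simplified illustration rather than a tuned implementation: the function name apriori and the use of an absolute minimum support count are assumptions, candidates are formed by joining pairs of frequent k-itemsets, and support is counted by a plain scan of the transactions.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset: support count} for all frequent itemsets."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # k = 1: frequent itemsets of length 1.
    level = {}
    for item in sorted(set().union(*transactions)):
        c = frozenset([item])
        n = count(c)
        if n >= minsup_count:
            level[c] = n

    frequent = {}
    k = 1
    while level:  # repeat until no new frequent itemsets are identified
        frequent.update(level)
        # Generate length-(k+1) candidates from frequent length-k itemsets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent subset of length k.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        # Count support by scanning the data and keep only frequent candidates.
        level = {}
        for c in candidates:
            n = count(c)
            if n >= minsup_count:
                level[c] = n
        k += 1
    return frequent

# Example data assumed from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
result = apriori(transactions, minsup_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```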