1
Minimum Support Threshold
3
Effect of Support Distribution
• Many real data sets have a skewed support distribution
[Figure: support distribution of a retail data set]
4
Effect of Support Distribution
How to set the appropriate minsup threshold?
If minsup is set too high, we could miss itemsets
involving interesting rare items (e.g., expensive
products)
5
Multiple Minimum Support
How to apply multiple minimum supports?
MS(i): minimum support for item i
e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%
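The rule for an itemset's threshold can be sketched in a few lines of Python. The dictionary `MS` and the function name `itemset_min_support` are illustrative names, not part of the slides; the percentages are written as fractions.

```python
# Per-item minimum supports from the slide, written as fractions.
MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}


def itemset_min_support(itemset, ms=MS):
    """Minimum support of an itemset: the smallest MS among its items."""
    return min(ms[item] for item in itemset)


print(itemset_min_support({"Milk", "Broccoli"}))  # 0.001, i.e. 0.1%
```

The rarest item in the itemset sets the threshold, so any itemset containing Broccoli is judged against 0.1%.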
6
Multiple Minimum Support
Item   MS(I)   Sup(I)
A      0.10%   0.25%
B      0.20%   0.26%
C      0.30%   0.29%
D      0.50%   0.05%
E      3%      4.20%

Itemset lattice:
1-itemsets: A, B, C, D, E
2-itemsets: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
3-itemsets: ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE
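As a small illustration, the MS(I) and Sup(I) values listed for items A to E can be compared item by item. This is only a sketch of the single-item check; the names `ITEMS` and `meets_own_threshold` are hypothetical.

```python
# (MS, support) per item, in percent, copied from the slide.
ITEMS = {
    "A": (0.10, 0.25),
    "B": (0.20, 0.26),
    "C": (0.30, 0.29),
    "D": (0.50, 0.05),
    "E": (3.00, 4.20),
}


def meets_own_threshold(items):
    """Map each item to True if its support reaches its own MS."""
    return {name: sup >= ms for name, (ms, sup) in items.items()}


for name, ok in meets_own_threshold(ITEMS).items():
    print(name, "frequent" if ok else "infrequent")
```

With these numbers, A, B, and E reach their own thresholds, while C and D fall short.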
7
Multiple Minimum Support (Liu 1999)
Order the items according to their minimum
support (in ascending order)
e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
Ordering: Broccoli, Salmon, Coke, Milk
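The ordering step amounts to a sort keyed on each item's minimum support. A minimal sketch, with `MS` as an illustrative dictionary holding the slide's values as fractions:

```python
# Per-item minimum supports from the slide, written as fractions.
MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

# Sort item names by ascending minimum support.
ordering = sorted(MS, key=MS.get)
print(ordering)  # ['Broccoli', 'Salmon', 'Coke', 'Milk']
```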
9
Multiple Minimum Support (Liu 1999)
Modifications to Apriori:
In traditional Apriori,
• A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
• The candidate is pruned if it contains any infrequent subset of size k
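The two steps of traditional Apriori can be sketched as follows. The slide does not say how the two itemsets are chosen for merging; this sketch assumes the common strategy of merging itemsets that share their first k-1 items (in sorted order). The function name `generate_candidates` is hypothetical.

```python
from itertools import combinations


def generate_candidates(frequent_k):
    """Merge frequent k-itemsets into (k+1)-candidates, then prune."""
    frequent = {tuple(sorted(s)) for s in frequent_k}
    ordered = sorted(frequent)
    candidates = set()
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            # Assumed merge rule: the two itemsets agree on their first k-1 items.
            if left[:-1] != right[:-1]:
                continue
            candidate = left + (right[-1],)
            k = len(left)
            # Prune the candidate if any of its k-subsets is not frequent.
            if all(sub in frequent for sub in combinations(candidate, k)):
                candidates.add(candidate)
    return candidates


frequent_2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(generate_candidates(frequent_2))  # {('A', 'B', 'C')}
```

In the usage example, BC and BD merge into BCD, but BCD is pruned because its subset CD is not frequent.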
10
Mining Rare Association Rules
11
Rare Association Rule Mining: Motivation
Rare events are events that occur infrequently,
perhaps in the frequency range 0.1% to 10%
12
Detecting Rare Itemsets
Apriori-Inverse
• To discover all rules that satisfy the maximum
...
Fins = ‘1’ 0.17 Yes
Venomous =‘1’ 0.08 Yes
13
Coincidence vs Interesting
10000 transactions
AB
A → B (confidence = 0.95)
14
Probability of Collision
15
Minimum Absolute Support
$$\mathrm{minabssup}(N, a, b, p) = \min\left\{ m \;\middle|\; \sum_{i=0}^{m} P_{cc}(i \mid N, a, b) \ge 1.0 - p \right\}$$
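The threshold can be computed by accumulating Pcc until the sum reaches 1.0 − p. The slides do not define Pcc here, so this sketch assumes it is the hypergeometric probability that A and B co-occur in exactly i of N transactions when A appears in a of them and B in b of them. That assumption and the function names are illustrative only.

```python
from math import comb


def pcc(i, n, a, b):
    """ASSUMED form of Pcc: hypergeometric chance of exactly i co-occurrences."""
    if i < max(0, a + b - n) or i > min(a, b):
        return 0.0
    return comb(a, i) * comb(n - a, b - i) / comb(n, b)


def minabssup(n, a, b, p):
    """Smallest m whose cumulative Pcc(0..m) reaches 1.0 - p."""
    cumulative = 0.0
    for m in range(min(a, b) + 1):
        cumulative += pcc(m, n, a, b)
        if cumulative >= 1.0 - p:
            return m
    # Guard against floating-point shortfall in the cumulative sum.
    return min(a, b)


print(minabssup(10000, 100, 100, 0.001))
```

Under this reading, a co-occurrence count above the returned m is unlikely (at level p) to be a coincidence.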
16
Rare pattern
Given a user-specified minimum support
threshold minsup ∈ [0, 1], X is called a rare
itemset or rare pattern in D if sup(X, D) ≤ minsup.
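The definition translates directly into code. In this sketch the database D is a list of transactions, each a Python set; the toy transactions and the names `support` and `is_rare` are made up for illustration.

```python
def support(itemset, database):
    """Fraction of transactions in the database that contain the itemset."""
    itemset = set(itemset)
    return sum(itemset <= transaction for transaction in database) / len(database)


def is_rare(itemset, database, minsup):
    """Rare pattern per the definition: sup(X, D) <= minsup."""
    return support(itemset, database) <= minsup


# Hypothetical toy database of four transactions.
D = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
print(is_rare({"a", "b"}, D, minsup=0.25))  # True: support is 1/4
print(is_rare({"a"}, D, minsup=0.25))       # False: support is 3/4
```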
17
Roadmap for rare pattern mining
18
Mining Negative Rules
19
Negative vs Rare Patterns
Rare patterns: Very low support but interesting
E.g., buying Rolex watches
Mining: Setting individual-based or special group-based
support threshold for valuable items
Negative patterns
Since it is unlikely that one buys Ford Expedition (an SUV car)
and Toyota Prius (a hybrid car) together, Ford Expedition and
Toyota Prius are likely negatively correlated patterns
Negatively correlated patterns that are infrequent tend
to be more interesting than those that are frequent
20
Negatively Correlated Patterns
Definition 1 (support-based)
If itemsets X and Y are both frequent but rarely occur together, i.e.,
sup(X ∪ Y) < sup(X) * sup(Y)
then X and Y are negatively correlated
Problem: A store sold two needle packages A and B 100 times each, with only one transaction
containing both A and B.
When there are 200 transactions in total, we have
s(A ∪ B) = 0.005, s(A) * s(B) = 0.25, so s(A ∪ B) < s(A) * s(B)
When there are 10^5 transactions, we have
s(A ∪ B) = 1/10^5, s(A) * s(B) = 1/10^3 * 1/10^3, so s(A ∪ B) > s(A) * s(B)
Where is the problem? Null transactions: the support-based definition is
not null-invariant!
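The needle-package arithmetic can be checked with exact fractions. The counts (100 sales each, one joint transaction) are taken from the example; the function name `support_test` is hypothetical.

```python
from fractions import Fraction


def support_test(n_transactions, count_a=100, count_b=100, count_both=1):
    """Return (s(A u B), s(A) * s(B)) for a given total number of transactions."""
    s_ab = Fraction(count_both, n_transactions)
    product = Fraction(count_a, n_transactions) * Fraction(count_b, n_transactions)
    return s_ab, product


for n in (200, 10**5):
    s_ab, product = support_test(n)
    print(n, s_ab < product)  # 200 -> True, 100000 -> False
```

The same sales data flips from "negatively correlated" to "not negatively correlated" only because null transactions (those containing neither A nor B) were added.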
21
Negatively Correlated Patterns
Definition 2 (negative itemset-based)
X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items
and Ā is a set of negative items, |Ā| ≥ 1, and (2) s(X) ≥ μ
Itemset X is negatively correlated, if
22
Mining Sequential Patterns
23
Sequential Patterns
Transaction databases, time-series databases vs.
sequence databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
Customer shopping sequences:
• First buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
Medical treatments, natural disasters (e.g., earthquakes),
science and engineering processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
Program execution sequence data sets
DNA sequences and gene structures
24
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set
of frequent subsequences
A sequence: <(ef)(ab)(df)cb>
• An element may contain a set of items
• Items within an element are unordered, and we list them alphabetically

A sequence database
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
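Both claims can be verified with a short sketch. Each sequence is modelled as a list of elements, each element a Python set of items; this representation and the names `is_subsequence` and `support_count` are illustrative choices, not from the slides.

```python
def is_subsequence(pattern, sequence):
    """True if every pattern element is a subset of a later, distinct sequence element, in order."""
    position = 0
    for element in pattern:
        # Greedily advance to the earliest sequence element containing this pattern element.
        while position < len(sequence) and not element <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def support_count(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


DATABASE = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}

print(is_subsequence([{"a"}, {"b", "c"}, {"d"}, {"c"}], DATABASE[10]))  # True
print(support_count([{"a", "b"}, {"c"}], DATABASE))  # 2
```

Only sequences 10 and 30 contain <(ab)c>, so its support count of 2 meets min_sup = 2.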
25
The Apriori Property of Sequential Patterns
Given support threshold min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
27
Readings
• Ian Witten. Data Mining. Section 6.3.
• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Chapter 6.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
• S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.
• E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03.
• Charu C. Aggarwal and Jiawei Han. Frequent Pattern Mining. Springer Publishing Company, Incorporated, 2014.
28