
Advanced Association Analysis

Minimum Support Threshold

Effect of Support Distribution
• Many real data sets have a skewed support distribution

(Figure: support distribution of a retail data set)
Effect of Support Distribution
How do we set an appropriate minsup threshold?

• If minsup is set too high, we may miss itemsets involving interesting rare items
  (e.g., expensive products)

• If minsup is set too low, mining becomes computationally expensive and the number
  of itemsets becomes very large

• A single minimum support threshold may therefore not be effective
Multiple Minimum Support
How do we apply multiple minimum supports?

• MS(i): minimum support for item i
  e.g., MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
• MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%

Challenge: support is no longer anti-monotone

• Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%
• Then {Milk, Coke} is infrequent (1.5% < MS({Milk, Coke}) = 3%), but
  {Milk, Coke, Broccoli} is frequent (0.5% ≥ MS({Milk, Coke, Broccoli}) = 0.1%)
Multiple Minimum Support
Item   MS(I)    Sup(I)
A      0.10%    0.25%
B      0.20%    0.26%
C      0.30%    0.29%
D      0.50%    0.05%
E      3.00%    4.20%

(Figure: candidate itemset lattice over items A–E, from 2-itemsets AB…DE to 3-itemsets ABC…CDE)
Multiple Minimum Support
AB ABC
Item MS(I) Sup(I)
AC ABD
A
A 0.10% 0.25% AD ABE

B AE ACD
B 0.20% 0.26%
BC ACE
C
C 0.30% 0.29% BD ADE

D BE BCD
D 0.50% 0.05%
CD BCE
E
E 3% 4.20% CE BDE

DE CDE

8
Multiple Minimum Support (Liu 1999)
Order the items according to their minimum support (in ascending order)
• e.g., MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
• Ordering: Broccoli, Salmon, Coke, Milk

Apriori needs to be modified such that (a sketch of this first pass follows below):
• L1 : set of frequent items
• F1 : set of items whose support is ≥ MS(1), where MS(1) = min_i MS(i)
• C2 : candidate itemsets of size 2 are generated from F1 instead of L1
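The F1 / L1 / C2 construction above can be sketched in a few lines of Python. This is a minimal illustration rather than Liu's full MSapriori algorithm; the dictionaries `supports` and `ms` (item → observed support, item → minimum support) and the function name are assumptions made for the example.

```python
from itertools import combinations

def ms_apriori_first_pass(supports, ms):
    """Sketch of the first pass of Apriori with multiple minimum supports (Liu 1999).

    supports: dict item -> observed support (fraction of transactions)
    ms:       dict item -> item-specific minimum support MS(i)
    """
    # Order items by their minimum support, ascending.
    order = sorted(ms, key=lambda i: ms[i])
    ms1 = ms[order[0]]                       # MS(1) = smallest minimum support

    # F1: items whose support passes the smallest minimum support MS(1).
    f1 = [i for i in order if supports[i] >= ms1]
    # L1: frequent items, i.e. items whose support passes their own MS(i).
    l1 = [i for i in order if supports[i] >= ms[i]]

    # C2 is generated from F1 (not L1); pairs are already ordered by ascending MS.
    c2 = list(combinations(f1, 2))
    return f1, l1, c2

# Applied to the item table shown earlier (supports and MS values from the slide):
supports = {"A": 0.0025, "B": 0.0026, "C": 0.0029, "D": 0.0005, "E": 0.042}
ms       = {"A": 0.0010, "B": 0.0020, "C": 0.0030, "D": 0.0050, "E": 0.030}
f1, l1, c2 = ms_apriori_first_pass(supports, ms)
# f1 = ['A', 'B', 'C', 'E']  (D fails MS(1) = 0.10%)
# l1 = ['A', 'B', 'E']       (C fails its own MS(C) = 0.30%)
```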
Multiple Minimum Support (Liu 1999)
Modifications to Apriori:

In traditional Apriori,
• A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
• The candidate is pruned if it contains any infrequent subset of size k

The pruning step has to be modified (see the sketch after this list):
• Prune only if the subset contains the first item
• e.g., Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support)
• {Broccoli, Coke} and {Broccoli, Milk} are frequent, but {Coke, Milk} is infrequent
  – The candidate is not pruned, because {Coke, Milk} does not contain the first
    item, i.e., Broccoli
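A minimal sketch of this modified pruning rule, assuming candidates are tuples of items sorted by ascending minimum support and that the frequent k-itemsets found so far are stored in a set of such tuples; the function name and data layout are illustrative, not from the source.

```python
def should_prune(candidate, frequent_k):
    """Modified Apriori pruning for multiple minimum supports.

    candidate:  tuple of items sorted by ascending minimum support,
                e.g. ("Broccoli", "Coke", "Milk")
    frequent_k: set of frequent (len(candidate) - 1)-itemsets, same ordering
    Returns True if the candidate can be pruned.
    """
    first = candidate[0]
    for drop in range(len(candidate)):
        subset = candidate[:drop] + candidate[drop + 1:]
        # Only subsets that still contain the first (lowest-MS) item are required
        # to be frequent; other subsets are ignored by the pruning test.
        if first in subset and subset not in frequent_k:
            return True
    return False

# Slide example: {Broccoli, Coke} and {Broccoli, Milk} are frequent, {Coke, Milk} is not.
frequent_2 = {("Broccoli", "Coke"), ("Broccoli", "Milk")}
print(should_prune(("Broccoli", "Coke", "Milk"), frequent_2))  # False -> not pruned
```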
Mining Rare Association Rules

Rare Association Rule Mining: Motivation
Rare events are events that occur infrequently,
perhaps in the frequency range of 0.1% to 10%

When they do occur, the consequences can be quite dramatic or negative

Applications:
• Hardware fault detection: faults that are rare but costly
• Medical diagnosis: diseases that are typically rare but deadly
Detecting Rare Itemsets

Apriori-Inverse
• Discovers all rules whose itemsets fall below a maximum support threshold and
  above a minimum absolute support value

Example: UCI Repository Zoo data set, maximum support = 0.20

Itemsets analyzed   Support   Used?
Venomous = '0'      0.92      No
Tail = '1'          0.74      No
...
Fins = '1'          0.17      Yes
Venomous = '1'      0.08      Yes
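A minimal sketch of this selection criterion: keep itemsets whose support is below the maximum support but whose absolute count still reaches a minimum absolute support (the minabssup value discussed later in this section). The function name, the count-based representation, and the example numbers (100 transactions, a threshold of 5) are illustrative assumptions, not taken from the slide.

```python
def select_sporadic(itemset_counts, n_transactions, max_sup, min_abs_sup):
    """Keep itemsets below the maximum support but above a minimum absolute count.

    itemset_counts: dict mapping an itemset (frozenset) to its absolute count
    max_sup:        maximum support as a fraction, e.g. 0.20
    min_abs_sup:    minimum absolute number of occurrences required
    """
    selected = {}
    for itemset, count in itemset_counts.items():
        support = count / n_transactions
        if support < max_sup and count >= min_abs_sup:
            selected[itemset] = support
    return selected

# Zoo-style example, assuming 100 transactions and maximum support 0.20:
counts = {frozenset({"Venomous=0"}): 92, frozenset({"Tail=1"}): 74,
          frozenset({"Fins=1"}): 17, frozenset({"Venomous=1"}): 8}
print(select_sporadic(counts, 100, 0.20, 5))   # keeps Fins=1 and Venomous=1
```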
Coincidence vs Interesting
10,000 transactions
• A appears 9,500 times
• B appears 9,500 times
• A and B appear together 9,000 times

A → B (confidence = 9,000 / 9,500 ≈ 0.95)

Would we consider this rule interesting?
• What if AB appears 9,010 times?
• Under the assumption that A and B are independent, AB is expected to appear
  together about 0.95 × 0.95 × 10,000 = 9,025 times
Probability of Collision

• Under an assumption of independence, the probability that A and B occur together
  exactly c times, given N transactions with A occurring a times and B occurring b
  times, is

  Pcc(c | N, a, b) = C(a, c) × C(N − a, b − c) / C(N, b)

  where C(n, k) denotes the binomial coefficient "n choose k"

• Given N = 1000, A = B = 500, and AB = 250, the probability of A and B occurring
  together exactly 250 times is about 0.05
15
Minimum Absolute Support

• To find the number of collisions for which Pcc is smaller than some value p
  (e.g., 0.0001):

  minabssup(N, a, b, p) = min{ m | Σ_{i=0}^{m} Pcc(i | N, a, b) ≥ 1.0 − p }

• Given N = 1000, A = B = 500, and p = 0.0001, the minabssup value is 274
• Candidate itemsets whose counts exceed the minabssup requirement are retained
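A minimal sketch of this computation under the same hypergeometric model; the use of SciPy's cumulative distribution and the loop structure are my assumptions, and an exact evaluation may differ slightly from the slide's quoted figure depending on how the tail probability is approximated.

```python
from scipy.stats import hypergeom

def minabssup(N, a, b, p):
    """Smallest m such that the cumulative probability of seeing at most m
    collisions of A and B (under independence) reaches 1 - p."""
    for m in range(min(a, b) + 1):
        if hypergeom.cdf(m, N, a, b) >= 1.0 - p:
            return m
    return min(a, b)

# Slide example: N = 1000, a = b = 500, p = 0.0001; the slide reports 274 for this case.
print(minabssup(1000, 500, 500, 0.0001))
```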
Rare pattern
Given a user-specified minimum support threshold minsup ∈ [0, 1], X is called a
rare itemset or rare pattern in D if sup(X, D) ≤ minsup.
Roadmap for rare pattern mining

Mining Negative Rules

Negative vs Rare Patterns
Rare patterns: very low support but interesting
• e.g., buying Rolex watches
• Mining: set individual-based or special group-based support thresholds for valuable items

Negative patterns
• Since it is unlikely that one buys a Ford Expedition (an SUV) and a Toyota Prius
  (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively
  correlated patterns
• Negatively correlated patterns that are infrequent tend to be more interesting
  than those that are frequent
Negatively Correlated Patterns
Definition 1 (support-based)
• If itemsets X and Y are both frequent but rarely occur together, i.e.,
  sup(X ∪ Y) < sup(X) × sup(Y),
  then X and Y are negatively correlated

Problem: a store sold each of two needle packages A and B 100 times, but only one
transaction contained both A and B
• When there are 200 transactions in total:
  s(A ∪ B) = 0.005, s(A) × s(B) = 0.25, so s(A ∪ B) < s(A) × s(B)
• When there are 10^5 transactions:
  s(A ∪ B) = 1/10^5, s(A) × s(B) = 1/10^3 × 1/10^3 = 1/10^6, so s(A ∪ B) > s(A) × s(B)
• Where is the problem? Null transactions: the support-based definition is not
  null-invariant!
Negatively Correlated Patterns
Definition 2 (negative itemset-based)
• X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items and
  Ā is a set of negative items with |Ā| ≥ 1, and (2) s(X) ≥ μ
• Itemset X is negatively correlated if ...
• This definition suffers from a similar null-invariance problem

Definition 3 (Kulczynski measure-based)
• If itemsets X and Y are frequent but (P(X|Y) + P(Y|X)) / 2 < ε, where ε is a
  negative pattern threshold, then X and Y are negatively correlated (a sketch
  follows below)
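A minimal sketch of the Kulczynski-based test in Definition 3, assuming raw co-occurrence counts are available; the function names, the count-based parameterization, and the threshold value used in the call are illustrative assumptions.

```python
def kulczynski(count_xy, count_x, count_y):
    """Kulczynski measure: average of the two conditional probabilities
    P(X|Y) = count(X and Y)/count(Y) and P(Y|X) = count(X and Y)/count(X)."""
    return 0.5 * (count_xy / count_y + count_xy / count_x)

def negatively_correlated(count_xy, count_x, count_y, epsilon):
    """Definition 3: X and Y are negatively correlated if the Kulczynski
    measure falls below the negative pattern threshold epsilon."""
    return kulczynski(count_xy, count_x, count_y) < epsilon

# Needle-package example from the previous slide: A and B each sold 100 times,
# only 1 transaction with both. The measure is 0.01 regardless of the total
# number of transactions, which is what makes it null-invariant.
print(kulczynski(1, 100, 100))                    # 0.01
print(negatively_correlated(1, 100, 100, 0.05))   # True
```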
Mining Sequential Patterns

Sequential Patterns
• Transaction databases and time-series databases vs. sequence databases
• Frequent patterns vs. (frequent) sequential patterns
• Applications of sequential pattern mining:
  – Customer shopping sequences: first buy a computer, then a CD-ROM, and then a
    digital camera, within 3 months
  – Medical treatments, natural disasters (e.g., earthquakes), science and
    engineering processes, stocks and markets, etc.
  – Telephone calling patterns, Weblog click streams
  – Program execution sequence data sets
  – DNA sequences and gene structures
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
• An element may contain a set of items
• Items within an element are unordered, and we list them alphabetically

A sequence database:
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
• Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
  (a support-counting sketch follows below)
• Sequential pattern mining: find the complete set of patterns satisfying the
  minimum support (frequency) threshold
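A minimal sketch of subsequence containment and support counting over this kind of sequence database; the representation (a sequence as a list of item sets) and the function names are assumptions made for illustration.

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` is a subsequence of `sequence`.
    Both are lists of elements, each element a set of items; element order must
    be preserved and every pattern element must be contained in some later
    sequence element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)

# The slide's sequence database, written as lists of item sets:
db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],   # SID 10
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],               # SID 20
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],        # SID 30
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],           # SID 40
]
print(support([{"a", "b"}, {"c"}], db))   # 2 -> <(ab)c> is frequent for min_sup = 2
```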
The Apriori Property of Sequential Patterns

A basic property: Apriori
• If a sequence S is not frequent,
  then none of the super-sequences of S is frequent
• E.g., if <hb> is infrequent, so are <hab> and <(ah)b>

Given support threshold min_sup = 2:
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
Readings
• Data Mining – Ian Witten, Section 6.3
• Introduction to Data Mining – Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Chapter 6
• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
• S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.
• E. Omiecinski. Alternative interest measures for mining associations. TKDE'03.
• C. C. Aggarwal and J. Han. Frequent Pattern Mining. Springer, 2014.
