
Advanced Association Analysis

Minimum Support Threshold

Effect of Support Distribution
• Many real data sets have a skewed support distribution

(Figure: support distribution of a retail data set)
Effect of Support Distribution
How do we set an appropriate minsup threshold?

• If minsup is set too high, we may miss itemsets involving interesting rare items
  (e.g., expensive products)

• If minsup is set too low, mining becomes computationally expensive and the number
  of itemsets becomes very large

• A single minimum support threshold may therefore not be effective
Multiple Minimum Support
How do we apply multiple minimum supports?

• MS(i): minimum support for item i
  e.g., MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
• MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%

Challenge: support is no longer anti-monotone

• Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%
• Then {Milk, Coke} is infrequent (1.5% < MS({Milk, Coke}) = 3%), but
  {Milk, Coke, Broccoli} is frequent (0.5% ≥ MS({Milk, Coke, Broccoli}) = 0.1%)
Multiple Minimum Support
Item   MS(I)    Sup(I)
A      0.10%    0.25%
B      0.20%    0.26%
C      0.30%    0.29%
D      0.50%    0.05%
E      3.00%    4.20%

(Figure: candidate itemset lattice over items A–E, from 2-itemsets AB…DE to 3-itemsets ABC…CDE)
Multiple Minimum Support
AB ABC
Item MS(I) Sup(I)
AC ABD
A
A 0.10% 0.25% AD ABE

B AE ACD
B 0.20% 0.26%
BC ACE
C
C 0.30% 0.29% BD ADE

D BE BCD
D 0.50% 0.05%
CD BCE
E
E 3% 4.20% CE BDE

DE CDE

8
Multiple Minimum Support (Liu 1999)
Order the items according to their minimum support (in ascending order)
• e.g., MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
• Ordering: Broccoli, Salmon, Coke, Milk

Apriori needs to be modified such that (a sketch of this first pass follows below):
• L1 : set of frequent items
• F1 : set of items whose support is ≥ MS(1), where MS(1) = min_i MS(i)
• C2 : candidate itemsets of size 2 are generated from F1 instead of L1
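The F1 / L1 / C2 construction above can be sketched in a few lines of Python. This is a minimal illustration rather than Liu's full MSapriori algorithm; the dictionaries `supports` and `ms` (item → observed support, item → minimum support) and the function name are assumptions made for the example.

```python
from itertools import combinations

def ms_apriori_first_pass(supports, ms):
    """Sketch of the first pass of Apriori with multiple minimum supports (Liu 1999).

    supports: dict item -> observed support (fraction of transactions)
    ms:       dict item -> item-specific minimum support MS(i)
    """
    # Order items by their minimum support, ascending.
    order = sorted(ms, key=lambda i: ms[i])
    ms1 = ms[order[0]]                       # MS(1) = smallest minimum support

    # F1: items whose support passes the smallest minimum support MS(1).
    f1 = [i for i in order if supports[i] >= ms1]
    # L1: frequent items, i.e. items whose support passes their own MS(i).
    l1 = [i for i in order if supports[i] >= ms[i]]

    # C2 is generated from F1 (not L1); pairs are already ordered by ascending MS.
    c2 = list(combinations(f1, 2))
    return f1, l1, c2

# Applied to the item table shown earlier (supports and MS values from the slide):
supports = {"A": 0.0025, "B": 0.0026, "C": 0.0029, "D": 0.0005, "E": 0.042}
ms       = {"A": 0.0010, "B": 0.0020, "C": 0.0030, "D": 0.0050, "E": 0.030}
f1, l1, c2 = ms_apriori_first_pass(supports, ms)
# f1 = ['A', 'B', 'C', 'E']  (D fails MS(1) = 0.10%)
# l1 = ['A', 'B', 'E']       (C fails its own MS(C) = 0.30%)
```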
Multiple Minimum Support (Liu 1999)
Modifications to Apriori:

In traditional Apriori,
• A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
• The candidate is pruned if it contains any infrequent subset of size k

The pruning step has to be modified (see the sketch after this list):
• Prune only if the subset contains the first item
• e.g., Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support)
• {Broccoli, Coke} and {Broccoli, Milk} are frequent, but {Coke, Milk} is infrequent
  – The candidate is not pruned, because {Coke, Milk} does not contain the first
    item, i.e., Broccoli
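A minimal sketch of this modified pruning rule, assuming candidates are tuples of items sorted by ascending minimum support and that the frequent k-itemsets found so far are stored in a set of such tuples; the function name and data layout are illustrative, not from the source.

```python
def should_prune(candidate, frequent_k):
    """Modified Apriori pruning for multiple minimum supports.

    candidate:  tuple of items sorted by ascending minimum support,
                e.g. ("Broccoli", "Coke", "Milk")
    frequent_k: set of frequent (len(candidate) - 1)-itemsets, same ordering
    Returns True if the candidate can be pruned.
    """
    first = candidate[0]
    for drop in range(len(candidate)):
        subset = candidate[:drop] + candidate[drop + 1:]
        # Only subsets that still contain the first (lowest-MS) item are required
        # to be frequent; other subsets are ignored by the pruning test.
        if first in subset and subset not in frequent_k:
            return True
    return False

# Slide example: {Broccoli, Coke} and {Broccoli, Milk} are frequent, {Coke, Milk} is not.
frequent_2 = {("Broccoli", "Coke"), ("Broccoli", "Milk")}
print(should_prune(("Broccoli", "Coke", "Milk"), frequent_2))  # False -> not pruned
```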
Mining Rare Association Rules

Rare Association Rule Mining: Motivation
Rare events are events that occur infrequently,
perhaps in the frequency range of 0.1% to 10%

When they do occur, the consequences can be quite dramatic or negative

Applications:
• Hardware fault detection: faults that are rare but costly
• Medical diagnosis: diseases that are typically rare but deadly
Detecting Rare Itemsets

Apriori-Inverse
• Discovers all rules whose itemsets fall below a maximum support threshold and
  above a minimum absolute support value

Example: UCI Repository Zoo data set, maximum support = 0.20

Itemsets analyzed   Support   Used?
Venomous = '0'      0.92      No
Tail = '1'          0.74      No
...
Fins = '1'          0.17      Yes
Venomous = '1'      0.08      Yes
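A minimal sketch of this selection criterion: keep itemsets whose support is below the maximum support but whose absolute count still reaches a minimum absolute support (the minabssup value discussed later in this section). The function name, the count-based representation, and the example numbers (100 transactions, a threshold of 5) are illustrative assumptions, not taken from the slide.

```python
def select_sporadic(itemset_counts, n_transactions, max_sup, min_abs_sup):
    """Keep itemsets below the maximum support but above a minimum absolute count.

    itemset_counts: dict mapping an itemset (frozenset) to its absolute count
    max_sup:        maximum support as a fraction, e.g. 0.20
    min_abs_sup:    minimum absolute number of occurrences required
    """
    selected = {}
    for itemset, count in itemset_counts.items():
        support = count / n_transactions
        if support < max_sup and count >= min_abs_sup:
            selected[itemset] = support
    return selected

# Zoo-style example, assuming 100 transactions and maximum support 0.20:
counts = {frozenset({"Venomous=0"}): 92, frozenset({"Tail=1"}): 74,
          frozenset({"Fins=1"}): 17, frozenset({"Venomous=1"}): 8}
print(select_sporadic(counts, 100, 0.20, 5))   # keeps Fins=1 and Venomous=1
```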
Coincidence vs Interesting
10,000 transactions
• A appears 9,500 times
• B appears 9,500 times
• A and B appear together 9,000 times

A → B (confidence = 9,000 / 9,500 ≈ 0.95)

Would we consider this rule interesting?
• What if AB appears 9,010 times?
• Under the assumption that A and B are independent, AB is expected to appear
  together about 0.95 × 0.95 × 10,000 = 9,025 times
Probability of Collision

• Under an assumption of independence, the probability that A and B occur together
  exactly c times, given N transactions with A occurring a times and B occurring b
  times, is

  Pcc(c | N, a, b) = C(a, c) × C(N − a, b − c) / C(N, b)

  where C(n, k) denotes the binomial coefficient "n choose k"

• Given N = 1000, A = B = 500, and AB = 250, the probability of A and B occurring
  together exactly 250 times is about 0.05
15
Minimum Absolute Support

• To find the number of collisions for which Pcc is smaller than some value p
  (e.g., 0.0001):

  minabssup(N, a, b, p) = min{ m | Σ_{i=0}^{m} Pcc(i | N, a, b) ≥ 1.0 − p }

• Given N = 1000, A = B = 500, and p = 0.0001, the minabssup value is 274
• Candidate itemsets whose counts exceed the minabssup requirement are retained
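A minimal sketch of this computation under the same hypergeometric model; the use of SciPy's cumulative distribution and the loop structure are my assumptions, and an exact evaluation may differ slightly from the slide's quoted figure depending on how the tail probability is approximated.

```python
from scipy.stats import hypergeom

def minabssup(N, a, b, p):
    """Smallest m such that the cumulative probability of seeing at most m
    collisions of A and B (under independence) reaches 1 - p."""
    for m in range(min(a, b) + 1):
        if hypergeom.cdf(m, N, a, b) >= 1.0 - p:
            return m
    return min(a, b)

# Slide example: N = 1000, a = b = 500, p = 0.0001; the slide reports 274 for this case.
print(minabssup(1000, 500, 500, 0.0001))
```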
Rare pattern
Given a user-specified minimum support threshold minsup ∈ [0, 1], X is called a
rare itemset or rare pattern in D if sup(X, D) ≤ minsup.
Roadmap for rare pattern mining

Mining Negative Rules

Negative vs Rare Patterns
Rare patterns: very low support but interesting
• e.g., buying Rolex watches
• Mining: set individual-based or special group-based support thresholds for valuable items

Negative patterns
• Since it is unlikely that one buys a Ford Expedition (an SUV) and a Toyota Prius
  (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively
  correlated patterns
• Negatively correlated patterns that are infrequent tend to be more interesting
  than those that are frequent
Negatively Correlated Patterns
Definition 1 (support-based)
• If itemsets X and Y are both frequent but rarely occur together, i.e.,
  sup(X ∪ Y) < sup(X) × sup(Y),
  then X and Y are negatively correlated

Problem: a store sold each of two needle packages A and B 100 times, but only one
transaction contained both A and B
• When there are 200 transactions in total:
  s(A ∪ B) = 0.005, s(A) × s(B) = 0.25, so s(A ∪ B) < s(A) × s(B)
• When there are 10^5 transactions:
  s(A ∪ B) = 1/10^5, s(A) × s(B) = 1/10^3 × 1/10^3 = 1/10^6, so s(A ∪ B) > s(A) × s(B)
• Where is the problem? Null transactions: the support-based definition is not
  null-invariant!
Negatively Correlated Patterns
Definition 2 (negative itemset-based)
• X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items and
  Ā is a set of negative items with |Ā| ≥ 1, and (2) s(X) ≥ μ
• Itemset X is negatively correlated if ...
• This definition suffers from a similar null-invariance problem

Definition 3 (Kulczynski measure-based)
• If itemsets X and Y are frequent but (P(X|Y) + P(Y|X)) / 2 < ε, where ε is a
  negative pattern threshold, then X and Y are negatively correlated (a sketch
  follows below)
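A minimal sketch of the Kulczynski-based test in Definition 3, assuming raw co-occurrence counts are available; the function names, the count-based parameterization, and the threshold value used in the call are illustrative assumptions.

```python
def kulczynski(count_xy, count_x, count_y):
    """Kulczynski measure: average of the two conditional probabilities
    P(X|Y) = count(X and Y)/count(Y) and P(Y|X) = count(X and Y)/count(X)."""
    return 0.5 * (count_xy / count_y + count_xy / count_x)

def negatively_correlated(count_xy, count_x, count_y, epsilon):
    """Definition 3: X and Y are negatively correlated if the Kulczynski
    measure falls below the negative pattern threshold epsilon."""
    return kulczynski(count_xy, count_x, count_y) < epsilon

# Needle-package example from the previous slide: A and B each sold 100 times,
# only 1 transaction with both. The measure is 0.01 regardless of the total
# number of transactions, which is what makes it null-invariant.
print(kulczynski(1, 100, 100))                    # 0.01
print(negatively_correlated(1, 100, 100, 0.05))   # True
```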
Mining Sequential Patterns

Sequential Patterns
• Transaction databases and time-series databases vs. sequence databases
• Frequent patterns vs. (frequent) sequential patterns
• Applications of sequential pattern mining:
  – Customer shopping sequences: first buy a computer, then a CD-ROM, and then a
    digital camera, within 3 months
  – Medical treatments, natural disasters (e.g., earthquakes), science and
    engineering processes, stocks and markets, etc.
  – Telephone calling patterns, Weblog click streams
  – Program execution sequence data sets
  – DNA sequences and gene structures
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
• An element may contain a set of items
• Items within an element are unordered, and we list them alphabetically

A sequence database:
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
• Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
  (a support-counting sketch follows below)
• Sequential pattern mining: find the complete set of patterns satisfying the
  minimum support (frequency) threshold
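A minimal sketch of subsequence containment and support counting over this kind of sequence database; the representation (a sequence as a list of item sets) and the function names are assumptions made for illustration.

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` is a subsequence of `sequence`.
    Both are lists of elements, each element a set of items; element order must
    be preserved and every pattern element must be contained in some later
    sequence element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)

# The slide's sequence database, written as lists of item sets:
db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],   # SID 10
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],               # SID 20
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],        # SID 30
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],           # SID 40
]
print(support([{"a", "b"}, {"c"}], db))   # 2 -> <(ab)c> is frequent for min_sup = 2
```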
The Apriori Property of Sequential Patterns

A basic property: Apriori
• If a sequence S is not frequent,
  then none of the super-sequences of S is frequent
• E.g., if <hb> is infrequent, so are <hab> and <(ah)b>

Given support threshold min_sup = 2:
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
Readings
• Data Mining – Ian Witten, Section 6.3
• Introduction to Data Mining – Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Chapter 6
• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
• S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.
• E. Omiecinski. Alternative interest measures for mining associations. TKDE'03.
• C. C. Aggarwal and J. Han. Frequent Pattern Mining. Springer, 2014.
