
Data Mining

(Overview figure: the knowledge discovery process — Databases → Data Cleaning and
Data Integration → Data Warehouse → Selection and Transformation → Task-relevant
Data → Data Mining → Pattern Evaluation.)
Market Basket Analysis
Analysis of customer buying habits by finding associations
and correlations between the different items that
customers place in their "shopping basket"

Example baskets:
Customer 1: milk, eggs, sugar, bread
Customer 2: milk, eggs, cereal, bread
Customer 3: eggs, sugar

Market Basket Analysis
• Given:
• A database of customer transactions (e.g., shopping baskets),
where each transaction is a set of items (e.g., products)

• Find:
• Groups of items which are frequently purchased together

Market Basket Analysis
 Useful:
"On Thursdays, grocery store consumers often
purchase diapers and beer together."

 Trivial:
"Customers who purchase maintenance
agreements are very likely to purchase large
appliances."

 Inexplicable/unexpected:
"When a new hardware store opens, one of
the most sold items is toilet rings."

Market Basket Analysis
 Extract information on purchasing behavior
 "IF buys beer and sausage, THEN also buys mustard with high
probability"
 "IF buys computer, THEN also buys anti-virus software with
high probability"

 Actionable information: can suggest...
 New store layouts and product assortments
 Which products to put on promotion
 Which items to put on sale at reduced prices
 Inventory requirement priorities

 The MBA approach is applicable whenever a customer
purchases multiple items in proximity
 Credit cards
 Services of telecommunication companies
 Banking services
 Medical treatments

Association Rules: Basics
 Association rule mining:
 Finding frequent patterns, associations,
correlations, or causal structures among sets of
items or objects in transaction databases, relational
databases, and other information repositories.
 Comprehensibility: simple to understand
 Utility: provides actionable information
 Efficiency: efficient discovery algorithms exist
 Applications:
 Market basket data analysis, Cross-marketing,
Catalog design, Clustering, Classification, etc.

Association Rules: Basics
 Typical representation formats for association rules:

 diapers ⇒ beer [0.5%, 60%]

 buys:diapers ⇒ buys:beer [0.5%, 60%]

 "IF buys diapers, THEN buys beer in 60% of the cases.
Diapers and beer are bought together in 0.5% of the
rows in the database."

 Other representations (used in Han's book):

 buys(x, "diapers") ⇒ buys(x, "beer") [0.5%, 60%]
 major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]

Association Rules: Basics
diapers ⇒ beer [0.5%, 60%]
"IF buys diapers, THEN buys beer in 60% of the cases,
in 0.5% of the rows"

1. Antecedent, left-hand side (LHS), body: diapers
2. Consequent, right-hand side (RHS), head: beer
3. Support, frequency ("in how big a part of the data the
things on the left- and right-hand sides occur together"): 0.5%
4. Confidence, strength ("if the left-hand side occurs, how
likely the right-hand side is to occur"): 60%
Association Rules: Basics

• Support: denotes the frequency of the rule within the
transactions.
support(A ⇒ B [s, c]) = s = P(A, B) = support({A, B})

• Confidence: denotes the percentage of transactions
containing A which also contain B.
confidence(A ⇒ B [s, c]) = c = P(B|A) = P(A, B) / P(A) =
support({A, B}) / support({A})
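
A minimal sketch of these two measures in Python, assuming each transaction is represented as a set of items (the function names and the toy baskets are illustrative, not from the slides):

def support(transactions, itemset):
    """Fraction of the transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(lhs together with rhs) / support(lhs), an estimate of P(rhs | lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Toy database in the shape of the diapers/beer example
baskets = [{"diapers", "beer", "milk"}, {"diapers", "bread"}, {"beer", "chips"}]
print(support(baskets, {"diapers", "beer"}))        # 0.333... (1 of 3 baskets)
print(confidence(baskets, {"diapers"}, {"beer"}))   # 0.5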

Association Rules: Basics
 Transaction:

Relational format <Tid, item>      Compact format <Tid, itemset>
<1, item1>                         <1, {item1, item2}>
<1, item2>                         <2, {item1, item3}>
<2, item3>                         <3, {item2}>
<2, item1>
<3, item2>

 Item vs. itemset: a single element vs. a set of items
 Support of an itemset I: the number of transactions containing I
 Minimum support σ: threshold for support
 Frequent itemset: an itemset with support ≥ σ
 Minimum confidence: threshold for confidence
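
A small sketch of converting the relational format above into the compact format (names and data are illustrative):

from collections import defaultdict

def to_compact(pairs):
    """Group relational <Tid, item> pairs into compact <Tid, itemset> rows."""
    baskets = defaultdict(set)
    for tid, item in pairs:
        baskets[tid].add(item)
    return dict(baskets)

pairs = [(1, "item1"), (1, "item2"), (2, "item3"), (2, "item1"), (3, "item2")]
print(to_compact(pairs))   # {1: {'item1', 'item2'}, 2: {'item1', 'item3'}, 3: {'item2'}}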
Association Rules: Basics
 Given:
(1) database of transactions,
(2) each transaction is a list of items bought (purchased
by a customer in a visit)

Transactions:
Transaction ID   Items Bought
100              A, B, C
200              A, C
400              A, D
500              B, E, F

Itemset supports:
{A}                  3, or 75%
{B} and {C}          2, or 50%
{D}, {E} and {F}     1, or 25%
{A, C}               2, or 50%
Other item pairs     max 25%

 Find: all rules with minimum support and confidence
 If min. support is 50% and min. confidence is 50%, then A ⇒ C
[50%, 66.6%] and C ⇒ A [50%, 100%]
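
For a database this small, the itemset supports above can be reproduced by brute-force enumeration; a sketch (the Apriori algorithm, introduced next, avoids enumerating every combination):

from itertools import combinations

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
items = sorted(set().union(*db))
min_sup = 0.5   # 50%

for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        sup = sum(set(cand) <= t for t in db) / len(db)
        if sup >= min_sup:
            print(set(cand), sup)
# {'A'} 0.75, {'B'} 0.5, {'C'} 0.5, {'A', 'C'} 0.5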

Apriori Algorithm
Input: Database, D, of transactions; minimum support threshold, min_sup.
Output: L, frequent itemsets in D.

Method:
L1 = find_frequent_1-itemsets(D);
for (k = 2; Lk-1 ≠ ∅; k++) {
    Ck = apriori_gen(Lk-1, min_sup);
    for each transaction t ∈ D {        // scan D for counts
        Ct = subset(Ck, t);             // get the subsets of t that are candidates
        for each candidate c ∈ Ct
            c.count++;
    }
    Lk = {c ∈ Ck | c.count ≥ min_sup}
}
return L = ∪k Lk;
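
A compact, self-contained Python sketch of the algorithm above: itemsets are stored as frozensets, and level-k candidates are generated by joining level-(k-1) itemsets that agree on their first k-2 items and pruning those with an infrequent subset (as apriori_gen does on the following slides). All names are illustrative choices of this sketch.

from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for every itemset in >= min_sup transactions."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}                                   # L1: count the single items
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(L)
    k = 2
    while L:
        # Candidate generation: join (k-1)-itemsets sharing their first k-2 items,
        # then prune candidates that have an infrequent (k-1)-subset
        prev = sorted(tuple(sorted(s)) for s in L)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:k - 2] == prev[j][:k - 2]:
                    cand = frozenset(prev[i]) | frozenset(prev[j])
                    if len(cand) == k and all(frozenset(sub) in L
                                              for sub in combinations(sorted(cand), k - 1)):
                        candidates.add(cand)
        counts = {c: 0 for c in candidates}       # scan D for counts
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(L)
        k += 1
    return result

# Example database from an earlier slide; min support = 2 transactions (50%)
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(db, min_sup=2))
# {frozenset({'A'}): 3, frozenset({'B'}): 2, frozenset({'C'}): 2, frozenset({'A', 'C'}): 2}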
Frequent Sets with Apriori
procedure apriori_gen(Lk-1: frequent (k-1)-itemsets;
                      min_sup: minimum support)
for each itemset l1 ∈ Lk-1
    for each itemset l2 ∈ Lk-1
        if (l1[1] = l2[1]) ^ (l1[2] = l2[2]) ^ … ^ (l1[k-2] = l2[k-2])
           ^ (l1[k-1] < l2[k-1]) then {
            c = l1 ⋈ l2                  // join step: generate candidates
            if has_infrequent_subset(c, Lk-1) then
                delete c;                // prune step: remove unfruitful candidate
            else add c to Ck;
        }
return Ck;

Frequent Sets with Apriori
procedure has_infrequent_subset(c: candidate k-itemset;
                                Lk-1: frequent (k-1)-itemsets);
// use prior knowledge
for each (k-1)-subset s of c
    if s ∉ Lk-1 then
        return TRUE;
return FALSE;
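
A more literal Python rendering of these two procedures, representing (k-1)-itemsets as sorted tuples (the representation and the names are choices of this sketch, not of the slides):

from itertools import combinations

def has_infrequent_subset(c, L_prev):
    """True if some (k-1)-subset of candidate c is not among the frequent (k-1)-itemsets."""
    return any(sub not in L_prev for sub in combinations(c, len(c) - 1))

def apriori_gen(L_prev):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets given as sorted tuples."""
    L_prev = set(L_prev)
    Ck = set()
    for l1 in L_prev:
        for l2 in L_prev:
            # Join step: l1 and l2 agree on the first k-2 items
            # and l1's last item is smaller than l2's last item
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                if not has_infrequent_subset(c, L_prev):   # prune step
                    Ck.add(c)
    return Ck

# The example from the next slide: L3 = {abc, abd, acd, ace, bcd} gives C4 = {abcd}
L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3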

Apriori Candidate Generation

 The Apriori principle:


Any subset of a frequent itemset must be
frequent
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3

 C4={abcd}

Apriori Candidate Generation Join Step
 select two large (k-1)-itemsets that share their first k-2 items
 construct the level-k candidate by appending the last item of the
second selected itemset to the first selected itemset

Apriori Example (1/3) (Min Support = 2 = 50%)

Database D:
TID    Items
100    1 3 4
200    2 3 5
300    1 2 3 5
400    2 5

Scan D → candidate 1-itemsets C1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

Prune (min_sup = 2) → frequent 1-itemsets L1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

Apriori Example (2/3) (Min Support = 2 = 50%)

C2 (candidate 2-itemsets):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D for counts → C2 with counts:
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

Prune (min_sup = 2) → L2:
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

Apriori Example (3/3) (Min Support = 2 = 50%)

C3 (candidate 3-itemsets): {2 3 5}

Scan D for counts and prune (min_sup = 2) → L3:
itemset    sup
{2 3 5}    2
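
A small sketch of the "scan D for counts, then prune" step on this example's level-2 candidates:

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [{1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5}, {3, 5}]
min_sup = 2

# Scan D once, counting how many transactions contain each candidate, then prune
counts = [(c, sum(c <= t for t in D)) for c in C2]
L2 = [(c, n) for c, n in counts if n >= min_sup]
print(L2)   # [({1, 3}, 2), ({2, 3}, 2), ({2, 5}, 3), ({3, 5}, 2)]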

Association Rules from Itemsets
 Pseudo-code:
for every frequent itemset l
    generate all nonempty proper subsets s of l
    for every nonempty proper subset s of l
        output the rule "s ⇒ (l − s)" if
        support(l) / support(s) ≥ min_conf,
        where min_conf is the minimum
        confidence threshold

 Example 1: frequent set L2 = {I1, I3}
 I1 ⇒ I3, support: 50%, confidence = 2/2 = 100%
 I3 ⇒ I1, support: 50%, confidence = 2/3 = 66.67%

Association Rules from Itemsets

Example 2: frequent set L3 = {I2, I3, I5}

 I2 ^ I3 ⇒ I5, support = 50%, confidence = 2/2 = 100%
 I2 ^ I5 ⇒ I3, support = 50%, confidence = 2/3 = 66.67%
 I3 ^ I5 ⇒ I2, support = 50%, confidence = 2/2 = 100%
 I2 ⇒ I3 ^ I5, support = 50%, confidence = 2/3 = 66.67%
 I3 ⇒ I2 ^ I5, support = 50%, confidence = 2/3 = 66.67%
 I5 ⇒ I2 ^ I3, support = 50%, confidence = 2/3 = 66.67%
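
A sketch of the rule-generation pseudo-code applied to this frequent itemset, with the support counts hard-coded from the example database (the helper names are mine):

from itertools import combinations

# Support counts taken from the worked example (database D with 4 transactions)
support = {
    frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}

def rules_from(itemset, min_conf):
    """Yield every rule s => (l - s) from frequent itemset l with confidence >= min_conf."""
    l = frozenset(itemset)
    for k in range(1, len(l)):
        for s in map(frozenset, combinations(sorted(l), k)):
            conf = support[l] / support[s]
            if conf >= min_conf:
                yield sorted(s), sorted(l - s), conf

for lhs, rhs, conf in rules_from({2, 3, 5}, min_conf=0.5):
    print(lhs, "=>", rhs, f"confidence = {conf:.0%}")
# e.g. [2, 3] => [5] confidence = 100%,  [2] => [3, 5] confidence = 67%, ...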

Improving Apriori Performance
 Hash-based itemset counting:
 A k-itemset whose corresponding hashing bucket
count is below the threshold cannot be frequent
 Transaction reduction:
 A transaction that does not contain any frequent k-
itemset is useless in subsequent scans
 Partitioning:
 Any itemset that is potentially frequent in DB must
be frequent in at least one of the partitions of DB
 Sampling:
 Mine on a subset of the given data with a lowered support
threshold, plus a method to determine the completeness
of the result
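
A rough sketch of the hash-based counting idea for 2-itemsets: during the first pass over the data, hash every 2-itemset of every transaction into a small table of buckets; a 2-itemset whose bucket count stays below min_sup cannot be frequent, so it can be dropped from C2 before the second pass. The bucket count and hash function below are arbitrary illustrative choices.

from itertools import combinations

def bucket_counts(transactions, num_buckets=7):
    """First pass: hash every 2-itemset of every transaction into a bucket and count."""
    buckets = [0] * num_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1
    return buckets

def filter_candidates(C2, buckets, min_sup, num_buckets=7):
    """Keep only candidates whose bucket count could still reach min_sup."""
    return [c for c in C2
            if buckets[hash(tuple(sorted(c))) % num_buckets] >= min_sup]

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [{1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5}, {3, 5}]
print(filter_candidates(C2, bucket_counts(D), min_sup=2))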
Association Rule Generation
 Rule 1 to remember:
 Generating Frequent Itemsets is slow (especially
itemsets of size 2)
 Generating Association Rules from frequent
itemsets is fast

 Rule 2 to remember:
 For Frequent Itemsets generation, support
threshold is used
 For Association Rules, confidence threshold is
used

Selecting the Interesting Rules?
 Usually the result set is very large, so one must
select the interesting rules based on:
 Objective measures:
Two popular measures:
 support; and
 confidence

 Subjective measures
(Silberschatz & Tuzhilin, KDD95)
A rule (pattern) is interesting if it is
 unexpected (surprising to the user);
and/or
 actionable (the user can do something with it)

Boolean vs. Quantitative Rules

 Boolean vs. quantitative association rules (based


on the types of values handled)
 Boolean: Rule concerns associations between the
presence or absence of items (e.g. "buys A" or "does not
buy A")
 buys=SQLServer, buys=DMBook  buys=DBMiner
[2%,60%]
 buys(x, "SQLServer") ^ buys(x, "DMBook") buys(x,
"DBMiner") [2%, 60%]

 Quantitative: Rule concerns associations between


quantitative items or attributes
 age=30..39, income=42..48K  buys=PC [1%, 75%]
 age(x, "30..39") ^ income(x, "42..48K") buys(x, "PC")
[1%, 75%]
Quantitative Rules

Quantitative attributes: e.g., age, income, height, weight
Categorical attributes: e.g., color of car

CID   height   weight   income
1     168      75.4     30.5
2     175      80.0     20.3
3     174      70.3     25.8
4     170      65.2     27.0

Problem: too many distinct values for quantitative attributes

Solution: transform quantitative attributes into categorical
ones via discretization
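
A minimal discretization sketch that turns numeric attributes into interval-valued items such as age=30..39 (the bin edges and the customer records are made up for illustration):

def discretize(value, edges):
    """Map a numeric value to an interval label such as '30..39'."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= value < hi:
            return f"{lo}..{hi - 1}"
    return f">={edges[-1]}"

customers = [{"age": 34, "income": 45000}, {"age": 52, "income": 23000}]
age_edges = [20, 30, 40, 50, 60]
income_edges = [0, 20000, 42000, 49000, 80000]

transactions = [
    {f"age={discretize(c['age'], age_edges)}",
     f"income={discretize(c['income'], income_edges)}"}
    for c in customers
]
print(transactions)
# e.g. [{'age=30..39', 'income=42000..48999'}, {'age=50..59', 'income=20000..41999'}]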
Single- vs. Multi-dimensional Rules

 Single-dimensional vs. multi-dimensional


associations

 Single-dimensional: Items or attributes in the rule


refer to only one dimension (e.g., to "buys")
Beer, Chips  Bread [0.4%, 52%]
buys(x, "Beer") ^ buys(x, "Chips") buys(x, "Bread")
[0.4%, 52%]

 Multi-dimensional: Items or attributes in the rule


refer to two or more dimensions (e.g., "buys",
"time_of_transaction", "customer_category")
In the following example: nationality, age, income

Multi-dimensional Rules

CID   nationality   age   income
1     Italian       50    low
2     French        40    high
3     French        30    high
4     Italian       50    medium
5     Italian       45    high
6     French        35    high

RULES:
nationality = French ⇒ income = high [50%, 100%]
income = high ⇒ nationality = French [50%, 75%]
age = 50 ⇒ nationality = Italian [33%, 100%]
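
A common way to handle such rules with an ordinary itemset miner is to encode each (attribute, value) pair as an item; the sketch below checks the first rule above under that encoding (function and variable names are illustrative):

rows = [
    {"nationality": "Italian", "age": 50, "income": "low"},
    {"nationality": "French",  "age": 40, "income": "high"},
    {"nationality": "French",  "age": 30, "income": "high"},
    {"nationality": "Italian", "age": 50, "income": "medium"},
    {"nationality": "Italian", "age": 45, "income": "high"},
    {"nationality": "French",  "age": 35, "income": "high"},
]
# Encode every (attribute, value) pair as an item
transactions = [{f"{k}={v}" for k, v in r.items()} for r in rows]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

lhs, rhs = {"nationality=French"}, {"income=high"}
print(support(lhs | rhs))                   # 0.5 -> 50% support
print(support(lhs | rhs) / support(lhs))    # 1.0 -> 100% confidence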

Multi-Dimensional Association: Concepts
Multi-dimensional rules:

 Inter-dimension association rules
(no repeated predicates)

age(X, "19-25") ^ occupation(X, "student")
⇒ buys(X, "coke")

 Hybrid-dimension association rules
(repeated predicates)

age(X, "19-25") ^ buys(X, "popcorn")
⇒ buys(X, "coke")
Single- vs. Multi-level Rules
 Single-level vs. multi-level associations

 Single-level: associations between items or
attributes from the same level of abstraction
(i.e., from the same level of the hierarchy)
Beer, Chips ⇒ Bread [0.4%, 52%]

 Multi-level: associations between items or
attributes from different levels of abstraction
(i.e., from different levels of the hierarchy)
Beer:Karjala, Chips:Estrella:Barbeque ⇒
Bread [0.1%, 74%]
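
One standard way to enable multi-level mining is to extend each transaction with the ancestors of its items according to an explicit taxonomy, and then run an ordinary frequent-itemset miner; a sketch, with a made-up taxonomy around the Beer:Karjala example:

taxonomy = {                       # child -> parent (illustrative hierarchy)
    "Beer:Karjala": "Beer",
    "Chips:Estrella:Barbeque": "Chips:Estrella",
    "Chips:Estrella": "Chips",
}

def extend_with_ancestors(transaction):
    """Add every ancestor of every item, so rules can mix abstraction levels."""
    extended = set(transaction)
    for item in transaction:
        while item in taxonomy:
            item = taxonomy[item]
            extended.add(item)
    return extended

basket = {"Beer:Karjala", "Chips:Estrella:Barbeque", "Bread"}
print(extend_with_ancestors(basket))
# {'Beer:Karjala', 'Beer', 'Chips:Estrella:Barbeque', 'Chips:Estrella', 'Chips', 'Bread'}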

Multi-level Trees

(Figure: item concept hierarchy, not reproduced in this text version.)
Multiple-Level Association Rules

TID Items
T1 {1110, 1210, 2110, 2210}
T2 {1110, 2110, 2220, 3230}
T3 {1120, 1222, 2210, 4113}
T4 {1110, 1210}
T5 {1110, 1222, 2110, 2210, 4113}
