
Data Mining

(Overview figure: the knowledge discovery process — Databases → Data Cleaning and
Data Integration → Data Warehouse → Selection and Transformation → Task-relevant
Data → Data Mining → Pattern Evaluation.)
Market Basket Analysis
Analysis of customer buying habits by finding associations
and correlations between the different items that
customers place in their "shopping basket"

Example baskets:
Customer 1: milk, eggs, sugar, bread
Customer 2: milk, eggs, cereal, bread
Customer 3: eggs, sugar

Market Basket Analysis
• Given:
• A database of customer transactions (e.g., shopping baskets),
where each transaction is a set of items (e.g., products)

• Find:
• Groups of items which are frequently purchased together

Market Basket Analysis
 Useful:
"On Thursdays, grocery store consumers often
purchase diapers and beer together."

 Trivial:
"Customers who purchase maintenance
agreements are very likely to purchase large
appliances."

 Inexplicable/unexpected:
"When a new hardware store opens, one of
the most sold items is toilet rings."

Market Basket Analysis
 Extract information on purchasing behavior
 "IF buys beer and sausage, THEN also buys mustard with high
probability"
 "IF buys computer, THEN also buys anti-virus software with
high probability"

 Actionable information: can suggest...
 New store layouts and product assortments
 Which products to put on promotion
 Which items to put on sale at reduced prices
 Inventory requirement priorities

 The MBA approach is applicable whenever a customer
purchases multiple items in proximity
 Credit cards
 Services of telecommunication companies
 Banking services
 Medical treatments

Association Rules: Basics
 Association rule mining:
 Finding frequent patterns, associations,
correlations, or causal structures among sets of
items or objects in transaction databases, relational
databases, and other information repositories.
 Comprehensibility: simple to understand
 Utility: provides actionable information
 Efficiency: efficient discovery algorithms exist
 Applications:
 Market basket data analysis, Cross-marketing,
Catalog design, Clustering, Classification, etc.

Association Rules: Basics
 Typical representation formats for association rules:

 diapers ⇒ beer [0.5%, 60%]

 buys:diapers ⇒ buys:beer [0.5%, 60%]

 "IF buys diapers, THEN buys beer in 60% of the cases.
Diapers and beer are bought together in 0.5% of the
rows in the database."

 Other representations (used in Han's book):

 buys(x, "diapers") ⇒ buys(x, "beer") [0.5%, 60%]
 major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]

Association Rules: Basics
diapers ⇒ beer [0.5%, 60%]
"IF buys diapers, THEN buys beer in 60% of the cases,
in 0.5% of the rows"

1. Antecedent, left-hand side (LHS), body: diapers
2. Consequent, right-hand side (RHS), head: beer
3. Support, frequency ("in how big a part of the data the
things on the left- and right-hand sides occur together"): 0.5%
4. Confidence, strength ("if the left-hand side occurs, how
likely the right-hand side is to occur"): 60%
Association Rules: Basics

• Support: denotes the frequency of the rule within the
transactions.
support(A ⇒ B [s, c]) = s = P(A, B) = support({A, B})

• Confidence: denotes the percentage of transactions
containing A which also contain B.
confidence(A ⇒ B [s, c]) = c = P(B|A) = P(A, B) / P(A) =
support({A, B}) / support({A})
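
A minimal sketch of these two measures in Python, assuming each transaction is represented as a set of items (the function names and the toy baskets are illustrative, not from the slides):

def support(transactions, itemset):
    """Fraction of the transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(lhs together with rhs) / support(lhs), an estimate of P(rhs | lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Toy database in the shape of the diapers/beer example
baskets = [{"diapers", "beer", "milk"}, {"diapers", "bread"}, {"beer", "chips"}]
print(support(baskets, {"diapers", "beer"}))        # 0.333... (1 of 3 baskets)
print(confidence(baskets, {"diapers"}, {"beer"}))   # 0.5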

Association Rules: Basics
 Transaction:

Relational format <Tid, item>      Compact format <Tid, itemset>
<1, item1>                         <1, {item1, item2}>
<1, item2>                         <2, {item1, item3}>
<2, item3>                         <3, {item2}>
<2, item1>
<3, item2>

 Item vs. itemset: a single element vs. a set of items
 Support of an itemset I: the number of transactions containing I
 Minimum support σ: threshold for support
 Frequent itemset: an itemset with support ≥ σ
 Minimum confidence: threshold for confidence
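
A small sketch of converting the relational format above into the compact format (names and data are illustrative):

from collections import defaultdict

def to_compact(pairs):
    """Group relational <Tid, item> pairs into compact <Tid, itemset> rows."""
    baskets = defaultdict(set)
    for tid, item in pairs:
        baskets[tid].add(item)
    return dict(baskets)

pairs = [(1, "item1"), (1, "item2"), (2, "item3"), (2, "item1"), (3, "item2")]
print(to_compact(pairs))   # {1: {'item1', 'item2'}, 2: {'item1', 'item3'}, 3: {'item2'}}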
Association Rules: Basics
 Given:
(1) database of transactions,
(2) each transaction is a list of items bought (purchased
by a customer in a visit)

Transactions:
Transaction ID   Items Bought
100              A, B, C
200              A, C
400              A, D
500              B, E, F

Itemset supports:
{A}                  3, or 75%
{B} and {C}          2, or 50%
{D}, {E} and {F}     1, or 25%
{A, C}               2, or 50%
Other item pairs     max 25%

 Find: all rules with minimum support and confidence
 If min. support is 50% and min. confidence is 50%, then A ⇒ C
[50%, 66.6%] and C ⇒ A [50%, 100%]
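
For a database this small, the itemset supports above can be reproduced by brute-force enumeration; a sketch (the Apriori algorithm, introduced next, avoids enumerating every combination):

from itertools import combinations

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
items = sorted(set().union(*db))
min_sup = 0.5   # 50%

for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        sup = sum(set(cand) <= t for t in db) / len(db)
        if sup >= min_sup:
            print(set(cand), sup)
# {'A'} 0.75, {'B'} 0.5, {'C'} 0.5, {'A', 'C'} 0.5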

Apriori Algorithm
Input: Database, D, of transactions; minimum support threshold, min_sup.
Output: L, frequent itemsets in D.

Method:
L1 = find_frequent_1-itemsets(D);
for (k = 2; Lk-1 ≠ ∅; k++) {
    Ck = apriori_gen(Lk-1, min_sup);
    for each transaction t ∈ D {        // scan D for counts
        Ct = subset(Ck, t);             // get the subsets of t that are candidates
        for each candidate c ∈ Ct
            c.count++;
    }
    Lk = {c ∈ Ck | c.count ≥ min_sup}
}
return L = ∪k Lk;
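
A compact, self-contained Python sketch of the algorithm above: itemsets are stored as frozensets, and level-k candidates are generated by joining level-(k-1) itemsets that agree on their first k-2 items and pruning those with an infrequent subset (as apriori_gen does on the following slides). All names are illustrative choices of this sketch.

from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for every itemset in >= min_sup transactions."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}                                   # L1: count the single items
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(L)
    k = 2
    while L:
        # Candidate generation: join (k-1)-itemsets sharing their first k-2 items,
        # then prune candidates that have an infrequent (k-1)-subset
        prev = sorted(tuple(sorted(s)) for s in L)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:k - 2] == prev[j][:k - 2]:
                    cand = frozenset(prev[i]) | frozenset(prev[j])
                    if len(cand) == k and all(frozenset(sub) in L
                                              for sub in combinations(sorted(cand), k - 1)):
                        candidates.add(cand)
        counts = {c: 0 for c in candidates}       # scan D for counts
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(L)
        k += 1
    return result

# Example database from an earlier slide; min support = 2 transactions (50%)
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(db, min_sup=2))
# {frozenset({'A'}): 3, frozenset({'B'}): 2, frozenset({'C'}): 2, frozenset({'A', 'C'}): 2}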
Frequent Sets with Apriori
procedure apriori_gen(Lk-1: frequent (k-1)-itemsets;
                      min_sup: minimum support)
for each itemset l1 ∈ Lk-1
    for each itemset l2 ∈ Lk-1
        if (l1[1] = l2[1]) ^ (l1[2] = l2[2]) ^ … ^ (l1[k-2] = l2[k-2])
           ^ (l1[k-1] < l2[k-1]) then {
            c = l1 ⋈ l2                  // join step: generate candidates
            if has_infrequent_subset(c, Lk-1) then
                delete c;                // prune step: remove unfruitful candidate
            else add c to Ck;
        }
return Ck;

Frequent Sets with Apriori
procedure has_infrequent_subset(c: candidate k-itemset;
                                Lk-1: frequent (k-1)-itemsets);
// use prior knowledge
for each (k-1)-subset s of c
    if s ∉ Lk-1 then
        return TRUE;
return FALSE;
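
A more literal Python rendering of these two procedures, representing (k-1)-itemsets as sorted tuples (the representation and the names are choices of this sketch, not of the slides):

from itertools import combinations

def has_infrequent_subset(c, L_prev):
    """True if some (k-1)-subset of candidate c is not among the frequent (k-1)-itemsets."""
    return any(sub not in L_prev for sub in combinations(c, len(c) - 1))

def apriori_gen(L_prev):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets given as sorted tuples."""
    L_prev = set(L_prev)
    Ck = set()
    for l1 in L_prev:
        for l2 in L_prev:
            # Join step: l1 and l2 agree on the first k-2 items
            # and l1's last item is smaller than l2's last item
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                if not has_infrequent_subset(c, L_prev):   # prune step
                    Ck.add(c)
    return Ck

# The example from the next slide: L3 = {abc, abd, acd, ace, bcd} gives C4 = {abcd}
L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3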

Apriori Candidate Generation

 The Apriori principle:


Any subset of a frequent itemset must be
frequent
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3

 C4={abcd}

Apriori Candidate Generation Join Step
 select two large (k-1)-itemsets that share their first k-2 items
 construct the level-k candidate by appending the last item of the
second selected itemset to the first selected itemset

Apriori Example (1/3) (Min Support = 2 = 50%)

Database D:
TID    Items
100    1 3 4
200    2 3 5
300    1 2 3 5
400    2 5

Scan D → candidate 1-itemsets C1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

Prune (min_sup = 2) → frequent 1-itemsets L1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

Apriori Example (2/3) (Min Support = 2 = 50%)

C2 (candidate 2-itemsets):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D for counts → C2 with counts:
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

Prune (min_sup = 2) → L2:
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

Apriori Example (3/3) (Min Support = 2 = 50%)

C3 (candidate 3-itemsets): {2 3 5}

Scan D for counts and prune (min_sup = 2) → L3:
itemset    sup
{2 3 5}    2
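
A small sketch of the "scan D for counts, then prune" step on this example's level-2 candidates:

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [{1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5}, {3, 5}]
min_sup = 2

# Scan D once, counting how many transactions contain each candidate, then prune
counts = [(c, sum(c <= t for t in D)) for c in C2]
L2 = [(c, n) for c, n in counts if n >= min_sup]
print(L2)   # [({1, 3}, 2), ({2, 3}, 2), ({2, 5}, 3), ({3, 5}, 2)]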

Association Rules from Itemsets
 Pseudo-code:
for every frequent itemset l
    generate all nonempty proper subsets s of l
    for every nonempty proper subset s of l
        output the rule "s ⇒ (l − s)" if
        support(l) / support(s) ≥ min_conf,
        where min_conf is the minimum
        confidence threshold

 Example 1: frequent set L2 = {I1, I3}
 I1 ⇒ I3, support: 50%, confidence = 2/2 = 100%
 I3 ⇒ I1, support: 50%, confidence = 2/3 = 66.67%

Association Rules from Itemsets

Example 2: frequent set L3 = {I2, I3, I5}

 I2 ^ I3 ⇒ I5, support = 50%, confidence = 2/2 = 100%
 I2 ^ I5 ⇒ I3, support = 50%, confidence = 2/3 = 66.67%
 I3 ^ I5 ⇒ I2, support = 50%, confidence = 2/2 = 100%
 I2 ⇒ I3 ^ I5, support = 50%, confidence = 2/3 = 66.67%
 I3 ⇒ I2 ^ I5, support = 50%, confidence = 2/3 = 66.67%
 I5 ⇒ I2 ^ I3, support = 50%, confidence = 2/3 = 66.67%
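
A sketch of the rule-generation pseudo-code applied to this frequent itemset, with the support counts hard-coded from the example database (the helper names are mine):

from itertools import combinations

# Support counts taken from the worked example (database D with 4 transactions)
support = {
    frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}

def rules_from(itemset, min_conf):
    """Yield every rule s => (l - s) from frequent itemset l with confidence >= min_conf."""
    l = frozenset(itemset)
    for k in range(1, len(l)):
        for s in map(frozenset, combinations(sorted(l), k)):
            conf = support[l] / support[s]
            if conf >= min_conf:
                yield sorted(s), sorted(l - s), conf

for lhs, rhs, conf in rules_from({2, 3, 5}, min_conf=0.5):
    print(lhs, "=>", rhs, f"confidence = {conf:.0%}")
# e.g. [2, 3] => [5] confidence = 100%,  [2] => [3, 5] confidence = 67%, ...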

Improving Apriori Performance
 Hash-based itemset counting:
 A k-itemset whose corresponding hashing bucket
count is below the threshold cannot be frequent
 Transaction reduction:
 A transaction that does not contain any frequent k-
itemset is useless in subsequent scans
 Partitioning:
 Any itemset that is potentially frequent in DB must
be frequent in at least one of the partitions of DB
 Sampling:
 Mine on a subset of the given data with a lowered support
threshold, plus a method to determine the completeness
of the result
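
A rough sketch of the hash-based counting idea for 2-itemsets: during the first pass over the data, hash every 2-itemset of every transaction into a small table of buckets; a 2-itemset whose bucket count stays below min_sup cannot be frequent, so it can be dropped from C2 before the second pass. The bucket count and hash function below are arbitrary illustrative choices.

from itertools import combinations

def bucket_counts(transactions, num_buckets=7):
    """First pass: hash every 2-itemset of every transaction into a bucket and count."""
    buckets = [0] * num_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1
    return buckets

def filter_candidates(C2, buckets, min_sup, num_buckets=7):
    """Keep only candidates whose bucket count could still reach min_sup."""
    return [c for c in C2
            if buckets[hash(tuple(sorted(c))) % num_buckets] >= min_sup]

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [{1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5}, {3, 5}]
print(filter_candidates(C2, bucket_counts(D), min_sup=2))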
Association Rule Generation
 Rule 1 to remember:
 Generating Frequent Itemsets is slow (especially
itemsets of size 2)
 Generating Association Rules from frequent
itemsets is fast

 Rule 2 to remember:
 For Frequent Itemsets generation, support
threshold is used
 For Association Rules, confidence threshold is
used

Selecting the Interesting Rules?
 Usually the result set is very large, so one must
select the interesting rules based on:
 Objective measures:
Two popular measures:
 support; and
 confidence

 Subjective measures
(Silberschatz & Tuzhilin, KDD95)
A rule (pattern) is interesting if it is
 unexpected (surprising to the user);
and/or
 actionable (the user can do something with it)

Boolean vs. Quantitative Rules

 Boolean vs. quantitative association rules (based


on the types of values handled)
 Boolean: Rule concerns associations between the
presence or absence of items (e.g. "buys A" or "does not
buy A")
 buys=SQLServer, buys=DMBook  buys=DBMiner
[2%,60%]
 buys(x, "SQLServer") ^ buys(x, "DMBook") buys(x,
"DBMiner") [2%, 60%]

 Quantitative: Rule concerns associations between


quantitative items or attributes
 age=30..39, income=42..48K  buys=PC [1%, 75%]
 age(x, "30..39") ^ income(x, "42..48K") buys(x, "PC")
[1%, 75%]
Quantitative Rules

Quantitative attributes: e.g., age, income, height, weight
Categorical attributes: e.g., color of car

CID   height   weight   income
1     168      75.4     30.5
2     175      80.0     20.3
3     174      70.3     25.8
4     170      65.2     27.0

Problem: too many distinct values for quantitative attributes

Solution: transform quantitative attributes into categorical
ones via discretization
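
A minimal discretization sketch that turns numeric attributes into interval-valued items such as age=30..39 (the bin edges and the customer records are made up for illustration):

def discretize(value, edges):
    """Map a numeric value to an interval label such as '30..39'."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= value < hi:
            return f"{lo}..{hi - 1}"
    return f">={edges[-1]}"

customers = [{"age": 34, "income": 45000}, {"age": 52, "income": 23000}]
age_edges = [20, 30, 40, 50, 60]
income_edges = [0, 20000, 42000, 49000, 80000]

transactions = [
    {f"age={discretize(c['age'], age_edges)}",
     f"income={discretize(c['income'], income_edges)}"}
    for c in customers
]
print(transactions)
# e.g. [{'age=30..39', 'income=42000..48999'}, {'age=50..59', 'income=20000..41999'}]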
Single- vs. Multi-dimensional Rules

 Single-dimensional vs. multi-dimensional


associations

 Single-dimensional: Items or attributes in the rule


refer to only one dimension (e.g., to "buys")
Beer, Chips  Bread [0.4%, 52%]
buys(x, "Beer") ^ buys(x, "Chips") buys(x, "Bread")
[0.4%, 52%]

 Multi-dimensional: Items or attributes in the rule


refer to two or more dimensions (e.g., "buys",
"time_of_transaction", "customer_category")
In the following example: nationality, age, income

Multi-dimensional Rules

CID   nationality   age   income
1     Italian       50    low
2     French        40    high
3     French        30    high
4     Italian       50    medium
5     Italian       45    high
6     French        35    high

RULES:
nationality = French ⇒ income = high [50%, 100%]
income = high ⇒ nationality = French [50%, 75%]
age = 50 ⇒ nationality = Italian [33%, 100%]
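
A common way to handle such rules with an ordinary itemset miner is to encode each (attribute, value) pair as an item; the sketch below checks the first rule above under that encoding (function and variable names are illustrative):

rows = [
    {"nationality": "Italian", "age": 50, "income": "low"},
    {"nationality": "French",  "age": 40, "income": "high"},
    {"nationality": "French",  "age": 30, "income": "high"},
    {"nationality": "Italian", "age": 50, "income": "medium"},
    {"nationality": "Italian", "age": 45, "income": "high"},
    {"nationality": "French",  "age": 35, "income": "high"},
]
# Encode every (attribute, value) pair as an item
transactions = [{f"{k}={v}" for k, v in r.items()} for r in rows]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

lhs, rhs = {"nationality=French"}, {"income=high"}
print(support(lhs | rhs))                   # 0.5 -> 50% support
print(support(lhs | rhs) / support(lhs))    # 1.0 -> 100% confidence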

Multi-Dimensional Association: Concepts
Multi-dimensional rules:

 Inter-dimension association rules
(no repeated predicates)

age(X, "19-25") ^ occupation(X, "student")
⇒ buys(X, "coke")

 Hybrid-dimension association rules
(repeated predicates)

age(X, "19-25") ^ buys(X, "popcorn")
⇒ buys(X, "coke")
Single- vs. Multi-level Rules
 Single-level vs. multi-level associations

 Single-level: associations between items or
attributes from the same level of abstraction
(i.e., from the same level of the hierarchy)
Beer, Chips ⇒ Bread [0.4%, 52%]

 Multi-level: associations between items or
attributes from different levels of abstraction
(i.e., from different levels of the hierarchy)
Beer:Karjala, Chips:Estrella:Barbeque ⇒
Bread [0.1%, 74%]
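
One standard way to enable multi-level mining is to extend each transaction with the ancestors of its items according to an explicit taxonomy, and then run an ordinary frequent-itemset miner; a sketch, with a made-up taxonomy around the Beer:Karjala example:

taxonomy = {                       # child -> parent (illustrative hierarchy)
    "Beer:Karjala": "Beer",
    "Chips:Estrella:Barbeque": "Chips:Estrella",
    "Chips:Estrella": "Chips",
}

def extend_with_ancestors(transaction):
    """Add every ancestor of every item, so rules can mix abstraction levels."""
    extended = set(transaction)
    for item in transaction:
        while item in taxonomy:
            item = taxonomy[item]
            extended.add(item)
    return extended

basket = {"Beer:Karjala", "Chips:Estrella:Barbeque", "Bread"}
print(extend_with_ancestors(basket))
# {'Beer:Karjala', 'Beer', 'Chips:Estrella:Barbeque', 'Chips:Estrella', 'Chips', 'Bread'}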

Multi-level Trees

(Figure: item concept hierarchy, not reproduced in this text version.)
Multiple-Level Association Rules

TID Items
T1 {1110, 1210, 2110, 2210}
T2 {1110, 2110, 2220, 3230}
T3 {1120, 1222, 2210, 4113}
T4 {1110, 1210}
T5 {1110, 1222, 2110, 2210, 4113}
