
Data Mining Techniques

computer ⇒ financial_management_software [support = 2%, confidence = 60%]


Association rules are considered interesting if they satisfy both
a minimum support threshold
a minimum confidence threshold
Users or domain experts can set these thresholds.
Transaction database D:

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Itemset            Support
{I1, I2}           4
{I2, I5}           2
{I1, I3}           4
{I1, I2, I3}       2
{I1, I2, I5}       2
{I1, I2, I3, I5}   1

i-itemset: an itemset containing i items
Apriori

Do i = 1 to n
    generate candidate i-itemsets;
    pruning;
    count support of i-itemsets;
EndDo

Apriori computes the frequent itemsets in several rounds; round i computes all frequent i-itemsets.
Candidate generation: generates candidate i-itemsets (their support is not yet computed).
Pruning: removes candidate itemsets that cannot be frequent, based on the knowledge about infrequent itemsets obtained from previous rounds.
Candidate counting: scans the transaction database to count the support of the candidate itemsets.
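The loop above can be turned into a short Python sketch (illustrative only; the function and variable names are ours, not from the slides). Candidate generation joins frequent (i−1)-itemsets, pruning applies the Apriori property, and counting scans the database once per round:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support_count} for all frequent itemsets."""
    # Round 1: count 1-itemsets directly.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Candidate counting: one scan of the database per round.
        counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(frequent)
        k += 1
    return result

# The transaction database D used in the worked example, with Min Sup = 2.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
freq = apriori(D, min_sup=2)
print(freq[frozenset({"I1", "I2", "I5"})])  # 2
```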
Example (Min Sup = 2)

Transaction database D:

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Generate C1 and scan D for support:
C1 = { {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2 }
All candidates meet Min Sup = 2, so L1 = C1.

Join L1 with itself to get C2, then scan D for support:
C2 = { {I1,I2}: 4, {I1,I3}: 4, {I1,I4}: 1, {I1,I5}: 2, {I2,I3}: 4,
       {I2,I4}: 2, {I2,I5}: 2, {I3,I4}: 0, {I3,I5}: 1, {I4,I5}: 0 }
L2 = { {I1,I2}: 4, {I1,I3}: 4, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2 }

Join L2 with itself to get the C3 candidates:
{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}
Pruning removes every candidate with an infrequent 2-subset, leaving
C3 = { {I1,I2,I3}, {I1,I2,I5} }
Scanning D gives {I1,I2,I3}: 2 and {I1,I2,I5}: 2, so L3 = C3.

C4 = { {I1,I2,I3,I5} } is pruned because its subset {I1,I3,I5} is not frequent, so
L4 = Φ
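As a sanity check, the support counts above can be reproduced by brute-force enumeration of every possible itemset; this is feasible only because there are just five items here (Apriori exists precisely to avoid this):

```python
from itertools import combinations

# Transaction database D from the slide.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2
items = sorted(set().union(*D))

# Count the support of every non-empty itemset and keep the frequent ones.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        sup = sum(1 for t in D if set(combo) <= t)
        if sup >= MIN_SUP:
            frequent[frozenset(combo)] = sup

print(frequent[frozenset({"I1", "I2"})])            # 4
print(frozenset({"I1", "I2", "I3", "I5"}) in frequent)  # False: its support is 1
```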
Apriori Algorithm

The Apriori property: all subsets of a frequent itemset must also be frequent.
Generating Association Rules

For a rule A ⇒ B:

Support(A ⇒ B) = P(A ∪ B)

Confidence(A ⇒ B) = P(B|A) = Support(A ∪ B) / Support(A)

For each frequent itemset l, generate all non-empty subsets of l.
For every non-empty subset s of l, output the rule

s ⇒ (l − s)   if   Support(l) / Support(s) ≥ min_conf

Ex: Generate association rules for l = {I1,I2,I5} and l = {I1,I2,I3}.
Generating Association Rules

Ex: Generate association rules for l = {I1,I2,I5} (support count 2 in the database D above):

{I1,I2} ⇒ {I5}   confidence = 2/4 = 50%
{I1,I5} ⇒ {I2}   confidence = 2/2 = 100%
{I2,I5} ⇒ {I1}   confidence = 2/2 = 100%
{I1} ⇒ {I2,I5}   confidence = 2/6 = 33%
{I2} ⇒ {I1,I5}   confidence = 2/7 = 29%
{I5} ⇒ {I1,I2}   confidence = 2/2 = 100%
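The rule-generation step can be sketched as follows. The support counts are taken from the worked example; min_conf = 70% is an assumed threshold for illustration (the slides do not fix one):

```python
from itertools import combinations

# Support counts from the worked example (database D above).
support = {
    frozenset(s): c for s, c in [
        (("I1",), 6), (("I2",), 7), (("I5",), 2),
        (("I1", "I2"), 4), (("I1", "I5"), 2), (("I2", "I5"), 2),
        (("I1", "I2", "I5"), 2),
    ]
}

def rules_from(l, support, min_conf):
    """Yield (antecedent, consequent, confidence) for rules s => l - s."""
    l = frozenset(l)
    for k in range(1, len(l)):          # every non-empty proper subset of l
        for s in combinations(sorted(l), k):
            s = frozenset(s)
            conf = support[l] / support[s]
            if conf >= min_conf:
                yield s, l - s, conf

rules = list(rules_from({"I1", "I2", "I5"}, support, min_conf=0.7))
for s, rest, conf in rules:
    print(sorted(s), "=>", sorted(rest), f"{conf:.0%}")
# Only the three rules with 100% confidence survive min_conf = 70%.
```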



Suppose we are interested in analyzing transactions with respect to the purchase of computer games and videos.

Computer games ⇒ Videos [support = 40%, confidence = 66%]

10,000 transactions
6,000 contain computer games
7,500 contain videos
4,000 contain both
Min support = 30%
Min confidence = 60%

The overall probability of purchasing videos is 7,500/10,000 = 75%, so purchasing games actually decreases the likelihood of purchasing videos (from 75% to 66%). Games and videos are therefore negatively associated. Confidence can be deceiving, and we could easily make an unwise business decision.

Lift

Consider the rule A ⇒ B. Lift is a simple correlation measure, defined as follows:

lift(A, B) = P(A ∪ B) / (P(A) P(B))

lift(A, B) < 1: A and B are negatively correlated
lift(A, B) > 1: A and B are positively correlated
lift(A, B) = 1: A and B are not correlated

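Applied to the computer games / videos example above, the lift comes out below 1, confirming the negative association that confidence alone hid:

```python
# Lift for the computer games / videos example.
total = 10_000
games, videos, both = 6_000, 7_500, 4_000

p_games = games / total    # P(games)  = 0.60
p_videos = videos / total  # P(videos) = 0.75
p_both = both / total      # P(games and videos together) = 0.40

lift = p_both / (p_games * p_videos)
print(round(lift, 3))  # 0.889 < 1: negatively correlated
```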

Why is finding frequent itemsets a nontrivial problem?

The number of transactions is very large.
Usually the transaction database will not fit in memory.
Multiple passes over the transaction database are required:
reading the database completely on each pass results in a large number of disk reads.
This involves a large number of I/O operations:
a 1 GB database stored on a hard disk with an 8 KB block size
requires roughly 125,000 block reads for a single pass;
10 passes require roughly 1,250,000 block reads;
at 12 ms read time per block, the time spent on I/O is roughly 4.2 hours.
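A quick check of the arithmetic, taking the rounded figures 1 GB = 10^9 bytes and 8 KB = 8,000 bytes:

```python
# I/O estimate: 1 GB database, 8 KB blocks, 12 ms per block read.
db_bytes = 1_000_000_000
block_bytes = 8_000
read_ms = 12

blocks_per_pass = db_bytes // block_bytes  # 125,000 block reads per pass
reads_10_passes = 10 * blocks_per_pass     # 1,250,000 block reads total
hours = reads_10_passes * read_ms / 1000 / 3600
print(blocks_per_pass, reads_10_passes, round(hours, 1))  # 125000 1250000 4.2
```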

The potential number of frequent itemsets is exponential in the number of items


For 100 items {I1, I2, ..., I100}:
100C1 + 100C2 + ... + 100C100 = 2^100 - 1 possible itemsets
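The count of non-empty itemsets can be verified directly:

```python
from math import comb

# Sum of binomial coefficients C(100, 1) + ... + C(100, 100).
n = 100
total = sum(comb(n, k) for k in range(1, n + 1))
print(total == 2**n - 1)  # True
```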

Apriori scans the database several times, depending on the size of the longest frequent itemset.

Several refinements have been proposed that focus on reducing
the number of database scans
the number of candidate itemsets counted in each scan
