
Data Mining Techniques

computer ⇒ financial_management_software [support = 2%, confidence = 60%]


Association rules are considered interesting if they satisfy both
a minimum support threshold
a minimum confidence threshold
Users or domain experts can set these thresholds.
Transaction database D:

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Itemset            Support
{I1, I2}           4
{I2, I5}           2
{I1, I3}           4
{I1, I2, I3}       2
{I1, I2, I5}       2
{I1, I2, I3, I5}   1

i-itemset: an itemset containing i items
Apriori

Do i = 1 to n
    generate candidate i-itemsets;
    pruning;
    count support of i-itemsets;
EndDo

Apriori computes the frequent itemsets in several rounds; round i computes all frequent i-itemsets.
Candidate generation: generates candidate i-itemsets (their support is not yet computed).
Pruning: removes candidate itemsets that cannot be frequent, based on the knowledge about infrequent itemsets obtained from previous rounds.
Candidate counting: scans the transaction database to count the support of the candidate itemsets.
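The loop above can be turned into a short Python sketch (illustrative only; the function and variable names are ours, not from the slides). Candidate generation joins frequent (i−1)-itemsets, pruning applies the Apriori property, and counting scans the database once per round:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support_count} for all frequent itemsets."""
    # Round 1: count 1-itemsets directly.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Candidate counting: one scan of the database per round.
        counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(frequent)
        k += 1
    return result

# The transaction database D used in the worked example, with Min Sup = 2.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
freq = apriori(D, min_sup=2)
print(freq[frozenset({"I1", "I2", "I5"})])  # 2
```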
Example (Min Sup = 2)

Transaction database D:

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Generate C1 and scan D for support:
C1 = { {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2 }
All candidates meet Min Sup = 2, so L1 = C1.

Join L1 with itself to get C2, then scan D for support:
C2 = { {I1,I2}: 4, {I1,I3}: 4, {I1,I4}: 1, {I1,I5}: 2, {I2,I3}: 4,
       {I2,I4}: 2, {I2,I5}: 2, {I3,I4}: 0, {I3,I5}: 1, {I4,I5}: 0 }
L2 = { {I1,I2}: 4, {I1,I3}: 4, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2 }

Join L2 with itself to get the C3 candidates:
{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}
Pruning removes every candidate with an infrequent 2-subset, leaving
C3 = { {I1,I2,I3}, {I1,I2,I5} }
Scanning D gives {I1,I2,I3}: 2 and {I1,I2,I5}: 2, so L3 = C3.

C4 = { {I1,I2,I3,I5} } is pruned because its subset {I1,I3,I5} is not frequent, so
L4 = Φ
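As a sanity check, the support counts above can be reproduced by brute-force enumeration of every possible itemset; this is feasible only because there are just five items here (Apriori exists precisely to avoid this):

```python
from itertools import combinations

# Transaction database D from the slide.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2
items = sorted(set().union(*D))

# Count the support of every non-empty itemset and keep the frequent ones.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        sup = sum(1 for t in D if set(combo) <= t)
        if sup >= MIN_SUP:
            frequent[frozenset(combo)] = sup

print(frequent[frozenset({"I1", "I2"})])            # 4
print(frozenset({"I1", "I2", "I3", "I5"}) in frequent)  # False: its support is 1
```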
Apriori Algorithm

The Apriori property: all subsets of a frequent itemset must also be frequent.
Generating Association Rules

For a rule A ⇒ B:

Support(A ⇒ B) = P(A ∪ B)

Confidence(A ⇒ B) = P(B|A) = Support(A ∪ B) / Support(A)

For each frequent itemset l, generate all non-empty subsets of l.
For every non-empty subset s of l, output the rule

s ⇒ (l − s)   if   Support(l) / Support(s) ≥ min_conf

Ex: Generate association rules for l = {I1,I2,I5} and l = {I1,I2,I3}.
Generating Association Rules

Ex: Generate association rules for l = {I1,I2,I5} (support count 2 in the database D above):

{I1,I2} ⇒ {I5}   confidence = 2/4 = 50%
{I1,I5} ⇒ {I2}   confidence = 2/2 = 100%
{I2,I5} ⇒ {I1}   confidence = 2/2 = 100%
{I1} ⇒ {I2,I5}   confidence = 2/6 = 33%
{I2} ⇒ {I1,I5}   confidence = 2/7 = 29%
{I5} ⇒ {I1,I2}   confidence = 2/2 = 100%
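The rule-generation step can be sketched as follows. The support counts are taken from the worked example; min_conf = 70% is an assumed threshold for illustration (the slides do not fix one):

```python
from itertools import combinations

# Support counts from the worked example (database D above).
support = {
    frozenset(s): c for s, c in [
        (("I1",), 6), (("I2",), 7), (("I5",), 2),
        (("I1", "I2"), 4), (("I1", "I5"), 2), (("I2", "I5"), 2),
        (("I1", "I2", "I5"), 2),
    ]
}

def rules_from(l, support, min_conf):
    """Yield (antecedent, consequent, confidence) for rules s => l - s."""
    l = frozenset(l)
    for k in range(1, len(l)):          # every non-empty proper subset of l
        for s in combinations(sorted(l), k):
            s = frozenset(s)
            conf = support[l] / support[s]
            if conf >= min_conf:
                yield s, l - s, conf

rules = list(rules_from({"I1", "I2", "I5"}, support, min_conf=0.7))
for s, rest, conf in rules:
    print(sorted(s), "=>", sorted(rest), f"{conf:.0%}")
# Only the three rules with 100% confidence survive min_conf = 70%.
```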



Suppose we are interested in analyzing transactions with respect to the purchase of computer games and videos.

Computer games ⇒ Videos [support = 40%, confidence = 66%]

10,000 transactions
6,000 contain computer games
7,500 contain videos
4,000 contain both
Min support = 30%
Min confidence = 60%

The overall probability of purchasing videos is 7,500/10,000 = 75%, so purchasing games actually decreases the likelihood of purchasing videos (from 75% to 66%). Games and videos are therefore negatively associated. Confidence can be deceiving, and we could easily make an unwise business decision.

Lift

Consider the rule A ⇒ B. Lift is a simple correlation measure, defined as follows:

lift(A, B) = P(A ∪ B) / (P(A) P(B))

lift(A, B) < 1: A and B are negatively correlated
lift(A, B) > 1: A and B are positively correlated
lift(A, B) = 1: A and B are not correlated

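Applied to the computer games / videos example above, the lift comes out below 1, confirming the negative association that confidence alone hid:

```python
# Lift for the computer games / videos example.
total = 10_000
games, videos, both = 6_000, 7_500, 4_000

p_games = games / total    # P(games)  = 0.60
p_videos = videos / total  # P(videos) = 0.75
p_both = both / total      # P(games and videos together) = 0.40

lift = p_both / (p_games * p_videos)
print(round(lift, 3))  # 0.889 < 1: negatively correlated
```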

Why is finding frequent itemsets a nontrivial problem?

The number of transactions is very large.
Usually the transaction database will not fit in memory.
Multiple passes over the transaction database are required:
reading the database completely on each pass results in a large number of disk reads.
This involves a large number of I/O operations:
a 1 GB database stored on a hard disk with an 8 KB block size
requires roughly 125,000 block reads for a single pass;
10 passes require roughly 1,250,000 block reads;
at 12 ms read time per block, the time spent on I/O is roughly 4.2 hours.
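A quick check of the arithmetic, taking the rounded figures 1 GB = 10^9 bytes and 8 KB = 8,000 bytes:

```python
# I/O estimate: 1 GB database, 8 KB blocks, 12 ms per block read.
db_bytes = 1_000_000_000
block_bytes = 8_000
read_ms = 12

blocks_per_pass = db_bytes // block_bytes  # 125,000 block reads per pass
reads_10_passes = 10 * blocks_per_pass     # 1,250,000 block reads total
hours = reads_10_passes * read_ms / 1000 / 3600
print(blocks_per_pass, reads_10_passes, round(hours, 1))  # 125000 1250000 4.2
```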

The potential number of frequent itemsets is exponential in the number of items


For 100 items {I1, I2, ..., I100}:
100C1 + 100C2 + ... + 100C100 = 2^100 - 1 possible itemsets
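The count of non-empty itemsets can be verified directly:

```python
from math import comb

# Sum of binomial coefficients C(100, 1) + ... + C(100, 100).
n = 100
total = sum(comb(n, k) for k in range(1, n + 1))
print(total == 2**n - 1)  # True
```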

Apriori scans the database several times, depending on the size of the longest frequent itemset.

Several refinements have been proposed that focus on reducing
the number of database scans
the number of candidate itemsets counted in each scan
