
DIFFERENT MEASURES FOR ASSOCIATION RULE MINING

Prof. Jitendra Agarwal
School of Information Technology
Rajiv Gandhi Technological University
(State Technological University of MP)
Jitendra@rgtu.net

Ms. Varshali Jaiswal
Student of M.Tech (IT)
School of Information Technology
Rajiv Gandhi Technological University
varshalijaiswal@gmail.com

Abstract: Data mining is a method of mining patterns of interest from large amounts of data stored in databases, data warehouses or other repositories, and association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. Discovering association rules is one of the most important tasks in data mining. Generating strong association rules depends on two steps:

• The extraction of association rules by an algorithm, for example the Apriori algorithm or FP-growth.
• The evaluation of the rules by different interestingness measures, for example support/confidence, lift/interest, correlation coefficient, statistical correlation, leverage, conviction, etc.

Association rule mining depends on both steps equally. The classical model of association rule mining is the support-confidence framework, in which confidence serves as the interestingness measure. This classical interestingness measure has some disadvantages. This paper presents different measures (support/confidence, interest/lift, chi-square test for independence, correlation coefficient, statistical correlation) for calculating the strength of association rules. Besides support and confidence, there are other interestingness measures, which include generality, reliability, peculiarity, novelty, surprisingness, utility, and applicability. This paper investigates the different measures for association rule mining.

Keywords- association rules; support/confidence; interest/lift; Chi-square Test for Independence; Correlation Coefficient; Statistical Correlation

Introduction: In the past few years a lot of work has been done in the field of data mining, especially on finding associations between items in a database of customer transactions. Association rule mining is one of the most important and well-researched techniques of data mining [1]. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories. Nowadays, mining association rules from large databases is an active research field of data mining, motivated by many application areas such as telecommunication networks, market and risk management, and inventory control.

Rules measurement and Selection
One challenge for association rule mining is rule measurement and selection. Since data mining methods are mostly applied to large datasets, association mining is very likely to generate numerous rules, from which it is difficult to build a model or summarize useful information. A simple but widely used approach to mitigate this problem is to gradually increase the threshold values of support and confidence until a manageable number of rules is generated. This is an effective way to reduce the number of rules; however, it may cause problems in the results as well. The major concern is that by increasing the minimum support and confidence values, some important information may be filtered out while the remaining rules may be obvious or already known. Data mining is a process involving interpretation and evaluation as well as analysis. For association rule mining, the evaluation is an even more important phase of the process. The mining of association rules is actually a two
step approach: the first step is association rule extraction (e.g. with the Apriori algorithm), and the second step is the evaluation of the rules' interestingness or quality, by a domain expert or using statistical quality measures. Interestingness measures play an important role in data mining.

Interestingness measures are necessary to help select association rule patterns. Each interestingness measure produces different results. The interestingness of discovered association rules is an important and active area within data mining research. The primary problem is the selection of interestingness measures for a given application domain. However, there is no formal agreement on a definition of what makes rules interesting. Association rule algorithms produce thousands of rules, many of which are redundant. In order to filter the rules, the user generally supplies minimum thresholds for support and confidence. Support and confidence are the basic measures of association rule interestingness, and they are also the most common measures of interest. However, rules that meet the minimum thresholds for support and confidence may still not be interesting. This is because rules are often produced that are already known to a user who is familiar with the application domain. The purpose of this paper is to review a few of the interestingness measures for association rules. In this paper we have identified a set of measures proposed in the literature and we have tried to conclude that a single measure alone cannot determine the interestingness of a rule.

This paper is divided into three sections. The first section gives the formal definition and some explanation of each measure. The second section gives the calculation of each measure on our sample data, and the last section contains our recommendations on which measure to use for discovering interesting rules.

Support/Confidence: Support [1] is defined as the percentage of transactions in the data that contain all items in both the antecedent and the consequent of the rule:
S = P(X∩Y) = |X∩Y| / |D|
where |X∩Y| is the number of transactions that contain both X and Y, and |D| is the total number of transactions.
Confidence is an estimate of the conditional probability of Y given X:
C = P(X∩Y) / P(X)
The support of a rule is important since it indicates how frequent the rule is in the transactions. Rules that have very small support are often uninteresting since they do not describe significantly large populations.

A rule that has a very high confidence (i.e., close to 1.0) is very important because it provides an accurate prediction of the association of the items in the rule.
The disadvantage of this framework is that it is not trivial to set good values for the minimum support and confidence thresholds.

A fundamental critique is that the same support threshold is used for rules containing different numbers of items.

Lift/Interest: A few years after the introduction of association rules, researchers [3] started to realize a disadvantage of the confidence measure: it does not take into account the baseline frequency of the consequent. Therefore lift, originally called interest, was introduced by Motwani et al. (1997). It measures the number of times X and Y occur together compared to the expected number of times if they were statistically independent. It is defined as:
I = P(X∩Y) / (P(X) P(Y))
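As an illustration, the definitions of support, confidence and lift given above can be written as a short Python sketch. The toy baskets and the function names are hypothetical and are not taken from the paper's sample data. P(X∩Y) is read, as in the definitions, as the fraction of transactions that contain every item of both X and Y.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy baskets, used only for illustration.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """C = P(X∩Y) / P(X). Assumes X occurs at least once."""
    both = frozenset(x) | frozenset(y)
    return support(both, transactions) / support(x, transactions)


def lift(x: Iterable[str], y: Iterable[str],
         transactions: List[FrozenSet[str]]) -> float:
    """I = P(X∩Y) / (P(X) P(Y)), i.e. confidence divided by P(Y)."""
    return confidence(x, y, transactions) / support(y, transactions)


if __name__ == "__main__":
    s = support({"bread", "milk"}, TRANSACTIONS)       # 3 of 5 baskets
    c = confidence({"bread"}, {"milk"}, TRANSACTIONS)  # 0.6 / 0.8
    i = lift({"bread"}, {"milk"}, TRANSACTIONS)        # 0.75 / 0.8
    print(f"support={s:.2f} confidence={c:.2f} lift={i:.4f}")
```

In this toy data the rule bread ⇒ milk has a fairly high confidence (0.75) but a lift below 1, because milk is itself very frequent.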

Since P(Y) appears in the denominator of the interest measure, the interest can be seen as the confidence divided by the baseline frequency of Y. The interest measure is defined over [0, ∞) and its interpretation is as follows:

If I < 1, then X and Y appear less frequently together in the data than expected under the assumption of independence. X and Y are said to be negatively interdependent.

If I = 1, then X and Y appear as frequently together as expected under the assumption of independence. X and Y are said to be independent of each other.

If I > 1, then X and Y appear more frequently together in the data than expected under the assumption of independence. X and Y are said to be positively interdependent.

Advantage:
The difference between confidence and lift lies in their formulation and the corresponding limitations. Confidence is sensitive to the probability of the consequent (Y). A higher frequency of Y will ensure a higher confidence value even if there is no true relationship between X and Y. But if we increase the threshold of the confidence value to avoid this situation, some important patterns with relatively lower frequency may be lost. In contrast to confidence, lift is not vulnerable to the rare items problem. It focuses on the ratio between the joint probability of two itemsets and their expected joint probability if they were independent. Even itemsets with lower joint frequency can have high lift values.

Disadvantages: The first one is related to the problem of sampling variability. For low absolute support values, the value of the interest measure may fluctuate heavily for small changes in the absolute support of a rule. This problem can be addressed by introducing an empirical Bayes estimate of the interest measure.

The second problem is that the interest measure should not be used to compare the interestingness of itemsets of different size. Indeed, the interest tends to be higher for large itemsets than for small itemsets.

Chi-square Test for Independence: A natural way to express the dependence between the antecedent and the consequent of an association rule X ⇒ Y is the correlation measure based on the chi-square test for independence [3]:

χ² = Σ (Oxy − Exy)² / Exy

Here Oxy is the observed frequency in the contingency table and Exy is the expected frequency (the product of the corresponding row and column totals divided by the grand total). The χ² statistic is therefore a summed normalized squared deviation of the observed values from the expected values. It can then be used to calculate a p-value by comparing the value of the statistic to a chi-square distribution to determine the significance level of the rule. For instance, if the p-value is higher than 0.05 (that is, when the χ² value is less than 3.84 for one degree of freedom), the hypothesis that X and Y are independent cannot be rejected, and therefore the rule X ⇒ Y can be pruned from the results.

Advantages
The advantage of the chi-square measure is that it takes into account all the available information in the data about the occurrence or non-occurrence of combinations of items, whereas the lift/interest measure only measures the co-occurrence of two itemsets, corresponding to the upper left cell in the contingency table.

Disadvantages
First of all, the chi-square test rests on the normal approximation to the Binomial
distribution. This approximation breaks down when the expected values (Exy) are small.

The chi-square test should only be used when all cells in the contingency table have expected values greater than 1 and at least 80% of the cells have expected values greater than 5.

The chi-square test will produce larger values as the data set grows. Therefore, more items will tend to become significantly interdependent if the size of the dataset increases. The reason is that the chi-square value depends on the total number of transactions, whereas the critical cutoff value only depends on the degrees of freedom (which is equal to 1 for binary variables) and the desired significance level. Therefore, whilst comparison of chi-square values within the same data set may be meaningful, it is certainly not advisable to compare chi-square values across different data sets.

Correlation Coefficient: The correlation coefficient [7] (also known as the Φ-coefficient) measures the degree of linear interdependency between a pair of random variables. It is defined as the covariance between the two variables divided by the product of their standard deviations:

ρXY = Cov(X, Y) / (σX σY)

where ρXY = 0 when X and Y are independent, and ρXY ranges over [-1, +1].

Statistical Correlation: To obtain association rules with real correlation, this measure [8] puts forward statistical correlation from the viewpoint of statistics to compensate for the deficiency of the support-confidence framework. Statistical correlation, Scorrelation{X∪Y}, is defined by the following equation:

[equation missing in the source]

If Scorrelation{X∪Y} < 0, the items in the antecedent X and the consequent Y of an association rule are negatively correlated, and the items restrict each other.

If Scorrelation{X∪Y} = 0, the items in the antecedent X and the consequent Y of an association rule are independent, and the items do not influence each other.

If Scorrelation{X∪Y} > 0, the items in the antecedent X and the consequent Y of an association rule are positively correlated to some degree, and the correlation becomes stronger as Scorrelation increases.

Advantages
Scorrelation can enhance the correlation degree of the items in association rules and cut negatively correlated rules.

Example
The sample data (Table 1) for the analysis is taken from a store database of customer transactions. There are six different types of items and a total of ten transactions. In each transaction a 1 represents the presence of an item while a 0 represents the absence of an item from the market basket.

Table 1: Sample Transactions
Tid     Items
        A  B  C  D  E  F
1       1  1  0  1  0  1
2       1  0  1  1  0  1
3       1  0  1  1  0  1
4       0  1  1  1  0  0
5       0  1  0  1  1  0
6       1  0  0  0  1  1
7       1  0  1  0  1  1
8       0  0  1  0  0  0
9       0  1  1  1  0  0
10      1  1  0  1  1  0
TOTAL   6  5  6  7  4  5
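The following is a minimal sketch, assuming the standard 2×2 contingency-table formulas, of how the chi-square statistic and the Φ-coefficient described earlier could be computed for a pair of items from the Table 1 transactions. The helper names (contingency, chi_square, phi) are illustrative and do not come from the paper. For the pair A, D it gives Φ ≈ -0.089, in line with the correlation value reported for that pair; the chi-square value follows the textbook formula and is not claimed to reproduce the chi-square column of Table 3, whose computation is not detailed in the paper.

```python
import math

# Table 1: one row per transaction, columns A..F (1 = item present).
ITEMS = "ABCDEF"
ROWS = [
    (1, 1, 0, 1, 0, 1),
    (1, 0, 1, 1, 0, 1),
    (1, 0, 1, 1, 0, 1),
    (0, 1, 1, 1, 0, 0),
    (0, 1, 0, 1, 1, 0),
    (1, 0, 0, 0, 1, 1),
    (1, 0, 1, 0, 1, 1),
    (0, 0, 1, 0, 0, 0),
    (0, 1, 1, 1, 0, 0),
    (1, 1, 0, 1, 1, 0),
]


def contingency(x, y):
    """Observed 2x2 counts [[x and y, x only], [y only, neither]]."""
    i, j = ITEMS.index(x), ITEMS.index(y)
    table = [[0, 0], [0, 0]]
    for row in ROWS:
        table[1 - row[i]][1 - row[j]] += 1
    return table


def chi_square(table):
    """Sum over cells of (O - E)^2 / E, with E = row total * column total / n.

    Assumes every expected count is non-zero.
    """
    n = sum(map(sum, table))
    row_tot = [sum(r) for r in table]
    col_tot = [table[0][c] + table[1][c] for c in range(2)]
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_tot[r] * col_tot[c] / n
            stat += (table[r][c] - expected) ** 2 / expected
    return stat


def phi(table):
    """Phi coefficient of a 2x2 table: (ad - bc) / sqrt of the product of the margins."""
    (a, b), (c, d) = table
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


if __name__ == "__main__":
    t = contingency("A", "D")
    print(t, round(chi_square(t), 4), round(phi(t), 4))
```

For a 2×2 table the two quantities are linked by χ² = n·Φ², where n is the number of transactions, which is why the chi-square value grows with the size of the data set while Φ does not.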
The frequent itemsets generated from the sample data using the Apriori algorithm [6] are shown in Table 2.
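The frequent itemsets for the sample data can be recovered with a small level-wise search in the spirit of Apriori. This is a simplified sketch, not the implementation used for the paper: the function name frequent_itemsets and the minimum count of 4 transactions (40% of the ten transactions) are assumptions, since the paper does not state the support threshold it used.

```python
from itertools import combinations

# The ten transactions of Table 1, written as item sets.
TRANSACTIONS = [
    {"A", "B", "D", "F"},
    {"A", "C", "D", "F"},
    {"A", "C", "D", "F"},
    {"B", "C", "D"},
    {"B", "D", "E"},
    {"A", "E", "F"},
    {"A", "C", "E", "F"},
    {"C"},
    {"B", "C", "D"},
    {"A", "B", "D", "E"},
]


def frequent_itemsets(transactions, min_count):
    """Return {itemset: count} for all itemsets in at least min_count transactions."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # level-1 candidates
    result = {}
    k = 1
    while current:
        # Count each candidate and keep those meeting the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # Prune step: keep a candidate only if all its k-1 subsets are frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result


if __name__ == "__main__":
    found = frequent_itemsets(TRANSACTIONS, min_count=4)
    n = len(TRANSACTIONS)
    for itemset in sorted(found, key=lambda s: (len(s), sorted(s))):
        if len(itemset) == 2:
            label = ",".join(sorted(itemset))
            print("{" + label + "}", f"{100 * found[itemset] / n:.0f}%")
```

Under the assumed threshold, the only frequent 2-itemsets are {A,D}, {A,F}, {B,D} and {C,D}, and no 3-itemset is frequent. All six single items also meet the threshold; they do not give rise to rules on their own.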

Table 2: Frequent Itemsets
Itemset   Support
{A,D}     40%
{A,F}     50%
{B,D}     50%
{C,D}     40%

All measures are calculated for each rule in Table 2, which is the output of the Apriori algorithm. The results are shown in Table 3.

Table 3: Calculation of different measures on the sample dataset
Rule   Support   Conf.   Lift   Chi-square Test   Correlation   Scorrelation
A→D    0.40      0.66    0.95   5.865             -0.089        -0.0408
D→A    0.40      0.57    0.95   5.865             -0.089        -0.0408
A→F    0.50      0.83    1.66   0.915             +0.8179       +0.522
F→A    0.50      1.00    1.66   0.915             +0.8179       +0.522
B→D    0.50      1.00    1.42   1.713             +0.655        +0.315
D→B    0.50      0.71    1.42   1.713             +0.655        +0.315
C→D    0.40      0.66    0.95   8.613             -0.089        -0.0408
D→C    0.40      0.57    0.95   8.613             -0.089        -0.0408

[Figure: Graph between Different Interestingness Measures. x-axis: Rules; y-axis: Values; series: support, confidence, lift, chi-square test, correlation, statistical correlation]

Conclusions
It is generally accepted that there is no single measure that is perfect and applicable to all problems. Usually different measures are complementary and can be applied in different applications or phases. Tan et al. [2002] conducted research on how to select the right measures for association patterns, and concluded that the best measures should be selected by matching the properties of the existing objective measures against the expectations of domain experts, which leads us to explore the subjective measures of association rules. The following suggestions can be formulated based on the analysis of the different interestingness measures discussed previously with the example:

• Confidence is never the preferred method to compare association rules since it does not account for the baseline frequency of the consequent.
• The lift/interest value corrects for this baseline frequency, but when the support threshold is very low, it may be unstable due to sampling variability. However, when the data set is very large, even a low percentage support threshold will yield rather large absolute support values. In that case, we do not need to worry too much about sampling variability. A drawback of the interest measure is that it cannot be used to compare itemsets or
rules of different size since it tends to overestimate the interestingness for large itemsets.
• When association rules need to be compared between data sets of different sizes, the chi-square test for independence and correlation analysis are not preferred since they are highly dependent on the dataset size. Both measures tend to overestimate the interestingness of itemsets in large datasets.

References:

[1] C.C. Aggarwal and P.S. Yu. A New Framework for Item Set Generation. In: Proceedings of the ACM PODS Symposium on Principles of Database Systems, Seattle, Washington (USA), 18-24, 1998.

[2] A. Agresti. An Introduction to Categorical Data Analysis. Wiley Series in Probability and Statistics, 1996.

[3] T. Brijs, G. Swinnen, K. Vanhoof and G. Wets. The use of association rules for product assortment decisions: a case study. In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, San Diego (USA), August 15-18, 254-260, 1999.

[4] R. Agrawal, T. Imielinski, and A. N. Swami. Mining Association Rules between Sets of Items in Large Databases. In: Proceedings of the 1993 ACM SIGMOD Conference, pp. 207-216, 1993.

[5] J. Han, J. Pei, Y. Yin. Mining frequent patterns without candidate generation. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 1-12, 2000.

[6] R. Agrawal, R. S. Fast Algorithms for Mining Association Rules. Proc. 20th Int. Conf. on Very Large Data Bases, 1994, pp. 487-499.

[7] Jianhua Liu. A New Interestingness Measure of Association Rules. Second International Conference on Genetic and Evolutionary Computing.

[8] Jian Hu and Xiang Yang-Li. Association Rules Mining Based on Statistical Correlation.

[9] A. Silberschatz, A. T. What Makes Patterns Interesting in Knowledge Discovery Systems. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6), pp. 970-974.

[10] T. Brijs, K. Vanhoof, G. Wets. Defining Interestingness for Association Rules. International Journal "Information Theories & Applications", Vol. 10.