
DIFFERENT MEASURES FOR ASSOCIATION RULE MINING

Prof. Jitendra Agarwal
School of Information Technology
Rajiv Gandhi Technological University
(State Technological University of MP)
Jitendra@rgtu.net

Ms. Varshali Jaiswal
Student of M.Tech (IT)
School of Information Technology
Rajiv Gandhi Technological University
varshalijaiswal@gmail.com

Abstract: Data mining is the process of mining patterns of interest from large amounts of data stored in databases, data warehouses or other repositories, and association rules are a popular and well-researched method for discovering interesting relations between variables in large databases. Discovering association rules is one of the most important tasks in data mining. Generating strong association rules depends on two steps:

• The extraction of association rules by some algorithm, for example the Apriori algorithm or FP-growth.

• The evaluation of the rules by some interestingness measure, for example support/confidence, lift/interest, the correlation coefficient, statistical correlation, leverage or conviction.

Association rule mining depends on both steps equally. The classical model of association rule mining is support-confidence, whose interestingness measure is the confidence measure. The classical interestingness measures for association rules have some disadvantages. This paper presents different measures (support/confidence, interest/lift, the Chi-square test for independence, the correlation coefficient and statistical correlation) for calculating the strength of association rules. Besides support and confidence, there are other interestingness measures, which include generality, reliability, peculiarity, novelty, surprisingness, utility and applicability. This paper investigates these different measures for association rule mining.

Keywords- association rules; support/confidence; interest/lift; Chi-square test for independence; correlation coefficient; statistical correlation

Introduction: In the past few years a lot of work has been done in the field of data mining, especially in finding associations between items in a database of customer transactions. Association rule mining is one of the most important and well-researched techniques of data mining [1]. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories. Nowadays, mining association rules from large databases is an active research field of data mining, motivated by many application areas such as telecommunication networks, market and risk management, and inventory control.

Rules measurement and Selection
One challenge for association rule mining is the measurement and selection of rules. Since data mining methods are mostly applied to large datasets, association mining is very likely to generate numerous rules, from which it is difficult to build a model or summarize useful information. A simple but widely used approach to mitigating this problem is to gradually increase the threshold values for support and confidence until a manageable number of rules is generated. This is an effective way to reduce the number of rules, but it may cause problems in the results as well. The major concern is that, by increasing the minimum support and confidence values, some important information may be filtered out while the remaining rules may be obvious or already known. Data mining is a process involving interpretation and evaluation as well as analysis, and for association rule mining the evaluation is an even more important phase of the process. The mining of association rules is actually a two-step approach: the first step is the association rule extraction (e.g. with the Apriori algorithm); the second step is the evaluation of the rules' interestingness or quality, by a domain expert or using statistical quality measures.
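The thresholding strategy described above (gradually raising the minimum support until the number of rules becomes manageable) can be sketched as follows. This is an illustrative sketch on a toy transaction list, not code from the paper; the `support` and `mine_rules` helpers and all threshold values are our own choices.

```python
from itertools import combinations

# Toy transaction database (each transaction is a set of items).
transactions = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "C", "D"},
    {"B", "C", "D"}, {"B", "D", "E"}, {"A", "E", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def mine_rules(transactions, min_support, min_confidence):
    """Brute-force miner: every 2-item rule X -> Y meeting both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for x, y in combinations(items, 2):
        for antecedent, consequent in ((x, y), (y, x)):
            s = support({antecedent, consequent}, transactions)
            if s < min_support:
                continue
            conf = s / support({antecedent}, transactions)
            if conf >= min_confidence:
                rules.append((antecedent, consequent, s, conf))
    return rules

# Gradually raise the support threshold until the rule set is manageable.
min_support, max_rules = 0.1, 4
rules = mine_rules(transactions, min_support, min_confidence=0.7)
while len(rules) > max_rules:
    min_support += 0.1
    rules = mine_rules(transactions, min_support, min_confidence=0.7)

print(f"min_support={min_support:.1f} keeps {len(rules)} rules")
# prints: min_support=0.2 keeps 3 rules
```

As the text warns, each raise of the threshold silently discards rules; here the low-support rules F -> A and F -> E are lost on the first increase.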
The interestingness measures play an important role in data mining. They are necessary to help select association rule patterns, and each interestingness measure produces different results. The interestingness of discovered association rules is an important and active area within data mining research. The primary problem is the selection of interestingness measures for a given application domain; however, there is no formal agreement on a definition of what makes a rule interesting. Association rule algorithms produce thousands of rules, many of which are redundant. In order to filter the rules, the user generally supplies minimum thresholds for support and confidence. Support and confidence are the basic, and most common, measures of association rule interestingness. However, rules that meet minimum thresholds for support and confidence may still not be interesting, because rules are often produced that are already known by a user who is familiar with the application domain. The purpose of this paper is to review a few of the interestingness measures for association rules. We have identified a set of measures proposed in the literature, and we conclude that a single measure alone cannot determine the interestingness of a rule.

This paper is divided into three sections: the first section gives the formal definition and some explanation of each measure; the second section gives the calculation of each measure on our sample data; and the last section contains our recommendation on which measure to use for discovering interesting rules.

Support/Confidence: Support [1] is defined as the percentage of transactions in the data that contain all items in both the antecedent and the consequent of the rule:

S = P(X∩Y) = |X∩Y| / |D|

Confidence is an estimate of the conditional probability of Y given X, i.e. P(X∩Y)/P(X):

C = P(X∩Y) / P(X)

The support of a rule is also important since it indicates how frequent the rule is in the transactions. Rules that have very small support are often uninteresting since they do not describe significantly large populations. A rule that has a very high confidence (i.e., close to 1.0) is very important because it provides an accurate prediction on the association of the items in the rule.

The disadvantage is that it is not trivial to set good values for the minimum support and confidence thresholds. A more fundamental critique is that the same support threshold is used for rules containing different numbers of items.

Lift/Interest: A few years after the introduction of association rules, researchers [3] started to realize the disadvantage of the confidence measure: it does not take into account the baseline frequency of the consequent. Therefore Lift, originally called Interest, was first introduced by Motwani et al. (1997). It measures the number of times X and Y occur together compared to the expected number of times if they were statistically independent. It is presented as:

I = P(X∩Y) / (P(X) P(Y))

Since P(Y) appears in the denominator of the interest measure, the interest can be seen as the confidence divided by the baseline frequency of Y. The interest measure is defined over [0, ∞) and its interpretation is as follows:

If I < 1, then X and Y appear less frequently together in the data than expected under the assumption of conditional independence; X and Y are said to be negatively interdependent.
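The support, confidence and lift formulas above can be computed directly on the transactions of Table 1 (given later in the Example section). This is a minimal sketch; the bit-string encoding of the table and the helper names are ours, not the paper's.

```python
# Table 1 transactions, one bit per item in the order A B C D E F.
rows = [
    "110101", "101101", "101101", "011100", "010110",
    "100011", "101011", "001000", "011100", "110110",
]
items = "ABCDEF"
transactions = [{i for i, bit in zip(items, row) if bit == "1"} for row in rows]
N = len(transactions)

def support(itemset):
    """S = P(X ∩ Y): fraction of transactions containing all the items."""
    return sum(1 for t in transactions if set(itemset) <= t) / N

def confidence(antecedent, consequent):
    """C = P(X ∩ Y) / P(X)."""
    return support(antecedent + consequent) / support(antecedent)

def lift(antecedent, consequent):
    """I = P(X ∩ Y) / (P(X) P(Y))."""
    return support(antecedent + consequent) / (
        support(antecedent) * support(consequent))

print(support("AD"))         # 0.4
print(confidence("A", "D"))  # ≈0.667
print(lift("A", "D"))        # ≈0.952
```

The printed values reproduce the A→D row of Table 3 (support 0.40, confidence 0.66, lift 0.95).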
If I = 1, then X and Y appear as frequently together as expected under the assumption of conditional independence; X and Y are said to be independent of each other.

If I > 1, then X and Y appear more frequently together in the data than expected under the assumption of conditional independence; X and Y are said to be positively interdependent.

Advantage:
The difference between confidence and lift lies in their formulation and the corresponding limitations. Confidence is sensitive to the probability of the consequent (Y): a higher frequency of Y will ensure a higher confidence value even if there is no true relationship between X and Y. But if we increase the confidence threshold to avoid this situation, some important patterns with relatively low frequency may be lost. In contrast to confidence, lift is not vulnerable to the rare-items problem. It focuses on the ratio between the joint probability of the two itemsets and their expected probability if they were independent, so even itemsets that occur together with low frequency can have high lift values.

Disadvantages: The first disadvantage is related to the problem of sampling variability (see the Empirical Bayes estimate). This means that for low absolute support values, the value of the interest measure may fluctuate heavily for small changes in the absolute support of a rule. This problem is solved by introducing an Empirical Bayes estimate of the interest measure.

The second problem is that the interest measure should not be used to compare the interestingness of itemsets of different sizes. Indeed, the interest tends to be higher for large itemsets than for small itemsets.

Chi-square Test for Independence: A natural way to express the dependence between the antecedent and the consequent of an association rule X => Y is the correlation measure based on the Chi-square test for independence [3].

The Chi-square test for independence is calculated as follows, with Oxy the observed frequency in the contingency table and Exy the expected frequency (obtained by multiplying the row total by the column total and dividing by the grand total):

χ² = Σ (Oxy − Exy)² / Exy

The χ² is therefore a summed, normalized squared deviation of the observed values from the expected values. It can be used to calculate the p-value, by comparing the value of the statistic to a chi-square distribution, and thereby to determine the significance level of the rule. For instance, if the p-value is higher than 0.05 (when the χ² value is less than 3.84), we cannot conclude that X and Y are dependent, and therefore the rule X => Y can be pruned from the results.

Advantages
The advantage of the chi-square measure, on the other hand, is that it takes into account all the available information in the data about the occurrence or non-occurrence of combinations of items, whereas the lift/interest measure only measures the co-occurrence of the two itemsets, corresponding to the upper-left cell of the contingency table.

Disadvantages
First of all, the Chi-square test rests on the normal approximation to the Binomial distribution. This approximation breaks down when the expected values (Exy) are small. The Chi-square test should only be used when all cells in the contingency table have expected values greater than 1 and at least 80% of the cells have expected values greater than 5.

The Chi-square test will also produce larger values as the data set grows to infinity. Therefore, more items will tend to become significantly interdependent if the size of the dataset increases.
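The contingency-table computation described above can be sketched as follows for a pair of items. The paper's formula images did not survive extraction, so the standard 2×2 χ² and Φ-coefficient formulas are used here; the function name and the expansion of the table from the counts (n_xy, n_x, n_y, n) are our own.

```python
def chi_square_and_phi(n_xy, n_x, n_y, n):
    """Chi-square statistic and phi coefficient for a 2x2 contingency table.

    n_xy: transactions containing both X and Y
    n_x, n_y: transactions containing X (respectively Y)
    n: total number of transactions
    """
    observed = [
        [n_xy, n_x - n_xy],                  # X present: Y present / Y absent
        [n_y - n_xy, n - n_x - n_y + n_xy],  # X absent:  Y present / Y absent
    ]
    row_totals = [n_x, n - n_x]
    col_totals = [n_y, n - n_y]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Exy = row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Phi coefficient; for binary variables chi2 equals n * phi**2.
    phi = (n * n_xy - n_x * n_y) / (n_x * n_y * (n - n_x) * (n - n_y)) ** 0.5
    return chi2, phi

# Pair (A, D) from Table 1: n_xy = 4, n_x = 6, n_y = 7, n = 10.
chi2, phi = chi_square_and_phi(4, 6, 7, 10)
print(round(phi, 3))  # -0.089, matching the Correla. column of Table 3
print(chi2 < 3.84)    # True: below the 5% cutoff, so A => D could be pruned
```

Note the identity χ² = n·Φ², which is why the conclusions treat the chi-square test and correlation analysis as behaving alike with respect to dataset size.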
The reason is that the Chi-square value depends on the total number of transactions, whereas the critical cutoff value only depends on the degrees of freedom (equal to 1 for binary variables) and the desired significance level. Therefore, whilst comparison of Chi-square values within the same data set may be meaningful, it is certainly not advisable to compare Chi-square values across different data sets.

Correlation Coefficient: The correlation coefficient [7] (also known as the Φ-coefficient) measures the degree of linear interdependency between a pair of random variables. It is defined as the covariance between the two variables divided by the product of their standard deviations:

ρXY = Cov(X, Y) / (σX σY)

Here ρXY = 0 when X and Y are independent, and it ranges over [-1, +1].

Statistical Correlation: To get association rules with real correlation, this measure [8] puts forward statistical correlation from the viewpoint of statistics, to compensate for the deficiency of support-confidence. Statistical correlation (Scorrelation) is defined by the equation given in [8], and is interpreted as follows.

If Scorrelation{X∪Y} < 0, the items in the antecedent X and the consequent Y of the association rule are negatively correlated, and the items restrict each other.

If Scorrelation{X∪Y} = 0, the items in the antecedent X and the consequent Y of the association rule are independent, and the items do not mutually influence each other.

If Scorrelation{X∪Y} > 0, the items in the antecedent X and the consequent Y of the association rule have some degree of correlation, and the correlation grows stronger as Scorrelation increases.

Advantages
Scorrelation can enhance the correlation degree of items in an association rule and cut out negatively correlated rules.

Example
The sample data (Table 1) for the analysis is taken from a store database of customer transactions. There are six different types of items and a total of ten transactions. In each transaction a 1 represents the presence of an item and a 0 represents the absence of an item from the market basket.

Table 1: Sample Transactions

Tid    A  B  C  D  E  F
1      1  1  0  1  0  1
2      1  0  1  1  0  1
3      1  0  1  1  0  1
4      0  1  1  1  0  0
5      0  1  0  1  1  0
6      1  0  0  0  1  1
7      1  0  1  0  1  1
8      0  0  1  0  0  0
9      0  1  1  1  0  0
10     1  1  0  1  1  0
TOTAL  6  5  6  7  4  5

The frequent itemsets generated from the sample data using the Apriori algorithm [6] are shown in Table 2.

Table 2: Frequent itemsets

itemsets  support
{A,D}     40%
{A,F}     50%
{B,D}     50%
{C,D}     40%

All measures are calculated for each rule derived from Table 2, which is the output of the Apriori algorithm. The results are shown in Table 3.
Table 3: Calculation of the different measures on the sample dataset

Rules  Support  Conf.  Lift  Chi-square Test  Correla.  Scorrela.
A→D    0.40     0.66   0.95  5.86             -0.089    -0.0408
D→A    0.40     0.57   0.95  5.86             -0.089    -0.0408
A→F    0.50     0.83   1.66  0.91             +0.8165   +0.522
F→A    0.50     1.00   1.66  0.91             +0.8165   +0.522
B→D    0.50     1.00   1.42  1.71             +0.6535   +0.315
D→B    0.50     0.71   1.42  1.71             +0.6535   +0.315
C→D    0.40     0.66   0.95  8.61             -0.089    -0.0408
D→C    0.40     0.57   0.95  8.61             -0.089    -0.0408

[Figure: Graph between the different interestingness measures, plotting the values of support, confidence, lift, the Chi-square test, correlation and statistical correlation for each rule.]

Conclusions
It is generally accepted that there is no single measure that is perfect and applicable to all problems. Usually different measures are complementary and can be applied in different applications or phases. Tan et al. [2002] conducted research on how to select the right measures for association patterns, and concluded that the best measures should be selected by matching the properties of the existing objective measures against the expectations of domain experts, which leads us to explore the subjective measures of association rules. The following suggestions can be formulated based on the analysis of the different interestingness measures discussed previously with the example:

• Confidence is never the preferred method to compare association rules, since it does not account for the baseline frequency of the consequent.

• The lift/interest value corrects for this baseline frequency, but when the support threshold is very low it may be unstable due to sampling variability. However, when the data set is very large, even a low percentage support threshold will yield rather large absolute support values; in that case we do not need to worry too much about sampling variability. A drawback of the interest measure is that it cannot be used to compare itemsets or rules of different sizes, since it tends to overestimate the interestingness of large itemsets.

• When association rules need to be compared between data sets of different sizes, the Chi-square test for independence and correlation analysis are not preferred, since they are highly dependent on the dataset size. Both measures tend to overestimate the interestingness of itemsets in large datasets.

References:

[1] C.C. Aggarwal and P.S. Yu. A New Framework for Item Set Generation. In: Proceedings of the ACM PODS Symposium on Principles of Database Systems, Seattle, Washington (USA), 18-24, 1998.

[2] A. Agresti. An Introduction to Categorical Data Analysis. Wiley Series in Probability and Statistics, 1996.

[3] T. Brijs, G. Swinnen, K. Vanhoof and G. Wets. The use of association rules for product assortment decisions: a case study. In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, San Diego (USA), August 15-18, pp. 254-260, 1999.
[4] R. Agrawal, T. Imielinski and A.N. Swami. Mining Association Rules between Sets of Items in Large Databases. In: Proceedings of the 1993 ACM SIGMOD Conference, pp. 207-216, 1993.

[5] J. Han, J. Pei and Y. Yin. Mining Frequent Patterns without Candidate Generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 1-12, 2000.

[6] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499, 1994.

[7] Jianhua Liu. A New Interestingness Measure of Association Rules. In: Proceedings of the Second International Conference on Genetic and Evolutionary Computing.

[8] Jian Hu and Xiang Yang-Li. Association Rules Mining Based on Statistical Correlation.

[9] A. Silberschatz and A. Tuzhilin. What Makes Patterns Interesting in Knowledge Discovery Systems. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6), pp. 970-974.

[10] T. Brijs, K. Vanhoof and G. Wets. Defining Interestingness For Association Rules. International Journal "Information Theories & Applications", Vol. 10.
