
DATA MINING FOR

BUSINESS INTELLIGENCE
MAYTAL SAAR-TSECHANSKY

Association Rule Discovery with WEKA


This example illustrates some of the basic elements of association rule mining using
WEKA. The sample data set used for this example is bank-discrete.arff (see Blackboard).
Note: Association rule discovery requires discretization of continuous variables (i.e., all
variables must be nominal/discrete). If your data include continuous variables, you can
discretize the values of any given variable (map values into bins, where each bin will
correspond to a discrete/nominal value in the transformed data set). This task can be
performed during preprocessing. WEKA is a full data mining suite which includes
various preprocessing modules.
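To make the idea of binning concrete, here is a minimal Python sketch of equal-width
discretization. It is only an illustration and not WEKA's own implementation; the function
name equal_width_bins, the bin labels, and the sample ages are assumptions made up for this
example.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a nominal bin label using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything goes into one bin
        else:
            # Clamp so that the maximum value falls into the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"bin{idx + 1}")
    return labels


ages = [18, 25, 33, 41, 52, 67]  # hypothetical continuous attribute
print(equal_width_bins(ages, 3))
```

Each resulting label (bin1, bin2, ...) plays the role of one discrete/nominal value in the
transformed data set.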
Clicking on the "Associate" tab will bring up the interface for the association rule
algorithms. The Apriori algorithm which we will use is the default algorithm selected. To
change the parameters for this run (e.g., support, confidence, etc.) click on the text box
next to the "Choose" button. Note that this box, at any given time, shows the specific
command-line arguments that are to be used for the algorithm (i.e., when a user uses the
command line rather than the GUI). The dialog box for changing the parameters is
depicted below. Here, you can specify the parameters to be used by the algorithm to find
association rules. Clicking on the "More" button shows the synopsis for the different
parameters.

WEKA allows the resulting rules to be sorted according to different metrics such as
confidence and lift. In this example, select lift as the criterion and specify that the
minimum value should be 1.5 (recall that lift is the confidence of the rule
divided by the support of the result/right-hand-side of the rule).
In a simplified form, given a rule L => R, lift is the ratio of the probability that L and R
occur together to the product of the two individual probabilities for L and R, i.e.,
lift = Pr(L,R) / (Pr(L) * Pr(R)).
If this value is 1, then L and R are independent, i.e., having the items listed in R (the
right-hand-side of the rule) does not change the probability of observing the items in L
(the left-hand-side of the rule). The higher the lift value, the more likely it is that the
existence of L and R together in a given transaction is not a random occurrence, but is due
to some correlation between them.
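The lift definition can be checked on a handful of toy transactions. The following is a
minimal Python sketch, not WEKA code; the item names (x=YES, y=YES, z=YES) and the helper
functions support and lift are made up for illustration.

```python
# Eight hypothetical transactions, each a set of attribute=value items.
transactions = [
    {"x=YES", "y=YES"},
    {"x=YES", "y=YES"},
    {"x=YES", "y=YES"},
    {"x=YES"},
    {"y=YES"},
    {"z=YES"},
    {"z=YES"},
    {"z=YES"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def lift(lhs, rhs, transactions):
    """lift = Pr(L,R) / (Pr(L) * Pr(R)) for the rule lhs => rhs."""
    joint = support(set(lhs) | set(rhs), transactions)
    return joint / (support(lhs, transactions) * support(rhs, transactions))


L, R = {"x=YES"}, {"y=YES"}
confidence = support(L | R, transactions) / support(L, transactions)
print(lift(L, R, transactions))             # Pr(L,R) / (Pr(L) * Pr(R))
print(confidence / support(R, transactions))  # confidence / support of R
```

Both printed values agree, which illustrates that the two descriptions of lift (confidence
divided by the support of the right-hand-side, and the ratio of probabilities) are the same
quantity.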
Here we also change the default number of rules (10) to 100; this indicates that the program
will report at most the top 100 rules (in this case sorted in descending order of their lift
values). The upper bound for minimum support is set to 1.0 (100%) and the lower bound
to 0.1 (10%). Apriori in WEKA starts with the upper bound support and incrementally
decreases support (in steps of delta, which by default is set to 0.05 or 5%). The
algorithm halts when either the specified number of rules has been generated, or the lower
bound for min. support is reached.
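The control flow just described can be sketched in a few lines of Python. This is an
illustration of the stopping logic only, under stated assumptions, and is not WEKA's actual
code: search_min_support and the callback count_rules (which stands in for "run rule
generation at this support level and count the rules that pass the metric threshold") are
hypothetical names.

```python
def search_min_support(count_rules, num_rules, upper=1.0, lower=0.1, delta=0.05):
    """Lower the minimum support from `upper` in steps of `delta`.

    Stops when at least `num_rules` rules are found or when the lower
    bound is reached. Returns (min_support, rules_found).
    """
    min_support = upper
    while True:
        found = count_rules(min_support)
        if found >= num_rules or min_support <= lower:
            return min_support, found
        # Step down, never below the lower bound; round to avoid float drift.
        min_support = max(lower, round(min_support - delta, 10))


def toy_count_rules(min_support):
    """Hypothetical stand-in: no rules above 30% support, 120 rules at or below."""
    return 0 if min_support > 0.3 else 120


print(search_min_support(toy_count_rules, num_rules=100))
```

With the toy callback, the search stops as soon as the support level drops to 0.3, because
the requested 100 rules are available there; if the callback never returned enough rules,
the loop would instead stop at the lower bound of 0.1.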
The significance testing option is only applicable in the case of confidence and is by
default not used (-1.0). When used, the significance test would filter out rules with a level
of confidence below that which is specified by the user.
The final selection of parameters for our current run is depicted below:

Now click on the start button to run the algorithm. This results in the following set of
rules:

The panel on the left ("Result list") now shows an item indicating the algorithm used to
generate rules, and the time of the run. One can perform multiple runs with different
parameters. Each run will appear as an item in the Result list panel. Clicking on one of
the results in this list will bring up the details of the run, including the discovered rules in
the right panel. In addition, right-clicking on the result set allows us to save the result
buffer into a (text) file. In this case, we save the output in the file bank-associations.txt.

Note that the rules were discovered based on the specified threshold values for support
and lift. For each rule, the frequency counts for the L and R of each rule are given, as well
as the values for confidence, lift, leverage, and conviction.
Note that leverage and lift measure similar things, except that leverage measures the
difference between the probability of co-occurrence of L and R (see the lift definition
above) and the product of the independent probabilities of L and R, i.e.,
leverage = Pr(L,R) - Pr(L) * Pr(R).
In other words, leverage measures the proportion of observed cases which include both L
and R above the proportion of cases expected to include the items in L and R if L and R
were independent of each other. Thus, for leverage, values above 0 are desirable, whereas
for lift, we want to see values greater than 1. Finally, conviction is similar to lift, but it
measures the effect of the right-hand-side not being true. It also inverts the ratio. So,
conviction is measured as:
conviction = (Pr(L) * Pr(not R)) / Pr(L and not R).
Thus, conviction, in contrast to lift, is not symmetric (and also has no upper bound).
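The leverage and conviction formulas can be tried out on toy data. The sketch below is
plain Python rather than WEKA code; the transactions, the item names (x=YES, y=YES, z=YES),
and the helper functions are assumptions made up for this illustration.

```python
# Ten hypothetical transactions: Pr(L) = 0.4, Pr(R) = 0.6, Pr(L,R) = 0.3.
transactions = (
    [{"x=YES", "y=YES"}] * 3   # L and R together
    + [{"x=YES"}] * 1          # L without R
    + [{"y=YES"}] * 3          # R without L
    + [{"z=YES"}] * 3          # neither
)


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def leverage(lhs, rhs, transactions):
    """leverage = Pr(L,R) - Pr(L) * Pr(R)."""
    joint = support(set(lhs) | set(rhs), transactions)
    return joint - support(lhs, transactions) * support(rhs, transactions)


def conviction(lhs, rhs, transactions):
    """conviction = (Pr(L) * Pr(not R)) / Pr(L and not R)."""
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    p_l_not_r = sum(lhs <= t and not rhs <= t for t in transactions) / n
    if p_l_not_r == 0:
        return float("inf")  # the rule is never violated: no upper bound
    return support(lhs, transactions) * (1 - support(rhs, transactions)) / p_l_not_r


L, R = {"x=YES"}, {"y=YES"}
print(round(leverage(L, R, transactions), 4))    # positive: more co-occurrence than expected
print(round(conviction(L, R, transactions), 4))  # conviction of L => R
print(round(conviction(R, L, transactions), 4))  # conviction of R => L differs
```

Swapping L and R leaves leverage unchanged but changes conviction, which is the asymmetry
noted above; a rule that is never violated gets an infinite conviction.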

In most cases, it is sufficient to focus on a combination of support, confidence, and either
lift or leverage to quantitatively measure the "quality" of the rule. However, the real value
of a rule, in terms of usefulness and actionability, is subjective and depends heavily on
the particular domain and business objectives.

Optional Hands-on Exercise (please do not turn in this exercise)


Your goal for the exercise below is to perform Association Rule discovery on the data set
using the Weka package.
Perform association rule discovery on the bank-discrete.arff data. Experiment with
different parameters so that you get at least 20-30 strong rules (e.g., rules with high lift
and confidence which at the same time have relatively good support). Select the top 5
most "interesting" rules and for each specify the following:
o Explain the pattern and why you believe it is interesting.
o Think of any recommendation(s) based on the discovered rules that might
  help the company to better understand the behavior of its customers or its
  marketing campaign.

Note: The top 5 most interesting rules may not be the top 5 in the result set of the Apriori
algorithm. They are rules that provide non-trivial and actionable knowledge given the
bank's business objectives.
