
Hanoi University of Science and Technology

School of Information and Communication Technology


==================*=================







Knowledge Engineering Report
Subject: Determination of products bought together






Supervisor: Ph.D. Quang Nhat NGUYEN
Group 20: - T Quc Vit (20093262)
- Ng Ngc Thnh (20092421)



















Hanoi Dec 2013
Knowledge Engineering Report Group 20


Determination of Products Bought Together 2
Table of Contents
I. Problem Definition
II. Basic Concepts
II.1. Data Mining
II.2. Association Rule Mining
II.2.1. Basic Terms
II.2.2. Base Theory
II.2.3. Apriori Algorithm
II.2.4. Rule Generation
III. Solution
III.1. System Architecture
III.2. Knowledge Representation
III.3. Implementation
III.3.1. Packages
III.3.2. Class diagram
IV. Summary
IV.1. Achievements
IV.2. Future Work
V. Conclusion
VI. References


I. Problem Definition
For a vendor, knowing which products are frequently purchased together with
or right after other products is very helpful for the business. By exploiting
the transaction history, the vendor can obtain knowledge about the purchase
behavior of its customers. The benefit does not stay with the vendor alone:
it also helps customers buy the sets of products that are relevant to their
needs.

Figure 1 Purposes & Benefits (for the user and the employer)
An influential method for exploiting raw transaction data is Association
Rule Mining. In this report we introduce the basic analysis of this method.
II. Basic Concepts

Figure 2 Overview (data mining, association rules, Apriori algorithm)
II.1. Data Mining
Generally, data mining (sometimes called data or knowledge discovery) is
the process of analyzing data from different perspectives and summarizing it
into useful information: information that can be used to increase revenue,
cut costs, or both. Data mining software is one of a number of analytical tools
for analyzing data. It allows users to analyze data from many different
dimensions or angles, categorize it, and summarize the relationships
identified. Technically, data mining is the process of finding correlations or
patterns among dozens of fields in large relational databases [1]. The overall
goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use [2].
II.2. Association Rule Mining
There are several methods for solving data mining problems; one common
approach is Association Rule Mining. It is a form of frequent pattern mining,
which searches for recurring relationships in a given data set [3]. Its output
is the discovery of associations and correlations among items in large
transactional or relational data sets. The discovered correlation
relationships among items can support business decision making, for example
in catalog design, cross-marketing and especially customer shopping behavior
analysis. In this section we give a detailed explanation and analysis of
Association Rule Mining.
II.2.1. Basic Terms
To describe Association Rule Mining, it is necessary to understand several
terms and be familiar with some notation:
- $I = \{I_1, I_2, \ldots, I_m\}$ is the set of all possible items (products
to be bought).
- $D$ is the set of transactions in the database, where each transaction $T$
is a nonempty set of items with $T \subseteq I$. We assign an identifier TID
to each transaction.
- Let $P$ be a set of items. A transaction $T$ contains $P$ if and only if
$P \subseteq T$.
- An association rule is an implication $P \Rightarrow Q$ where
$P \subset I$, $Q \subset I$ and $P \cap Q = \emptyset$. In other words, it
is a relationship between two disjoint itemsets $P$ and $Q$ in $I$, which
implies that if $P$ occurs in a transaction $T$, then $Q$ also occurs in it
with a certain probability.
- The rule $P \Rightarrow Q$ holds in the transaction set $D$ with support
$s$, where $s$ is the percentage of transactions in $D$ that contain
$P \cup Q$ (that is, every item of both $P$ and $Q$). This value is the
probability $P(P \cup Q)$.
- The rule $P \Rightarrow Q$ has confidence $c$ in the transaction set $D$,
where $c$ is the percentage of transactions in $D$ containing $P$ that also
contain $Q$. That is, it is the conditional probability $P(Q \mid P)$. This
value can be computed from the definition of conditional probability:
$$\mathrm{confidence}(P \Rightarrow Q) = P(Q \mid P) = \frac{\mathrm{support}(P \cup Q)}{\mathrm{support}(P)}$$
This equation makes it much easier to compute the confidence value.
- We have to define two thresholds: minimum support (min_sup) and minimum
confidence (min_conf). By convention, these two values are written as
percentages from 0% to 100% rather than as fractions from 0 to 1.0.
- Rules that satisfy both of the above thresholds are called strong rules.
o The confidence of the rule $P \Rightarrow Q$ can be derived from the
support counts of $P$ and $P \cup Q$. These two values are cheap to obtain,
which reduces the computing time of the process.
- A set of $k$ items is called a $k$-itemset.
- The set of frequent $k$-itemsets ($k$-itemsets that meet minimum support)
is denoted by $L_k$. $C_k$ is the set of candidate $k$-itemsets generated by
joining $L_{k-1}$ with itself.
These are all the concepts that the algorithm relies on. In the next
sections, we will see how the algorithm is built up.
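To make the definitions of support and confidence concrete, here is a minimal Java sketch that computes both on a toy transaction set. The class name, the method names and the four sample transactions are assumptions made for this illustration; they are not taken from the system described later in this report.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration of support and confidence; data and names are hypothetical. */
final class SupportConfidenceDemo {

    /** Percentage of transactions that contain every item of the given itemset. */
    static double support(List<Set<Integer>> transactions, Set<Integer> itemset) {
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return 100.0 * hits / transactions.size();
    }

    /** confidence(P => Q) = support(P u Q) / support(P), returned as a percentage. */
    static double confidence(List<Set<Integer>> transactions, Set<Integer> p, Set<Integer> q) {
        Set<Integer> union = new HashSet<>(p);
        union.addAll(q);
        return 100.0 * support(transactions, union) / support(transactions, p);
    }

    public static void main(String[] args) {
        // Four made-up transactions over the items 1, 2 and 3.
        List<Set<Integer>> d = List.of(
                Set.of(1, 2, 3), Set.of(1, 2), Set.of(1, 3), Set.of(2, 3));
        // {1, 2} appears in 2 of 4 transactions.
        System.out.println("support({1,2}) = " + support(d, Set.of(1, 2)));
        // Item 1 appears in 3 transactions, 2 of which also contain item 2.
        System.out.println("confidence(1 => 2) = " + confidence(d, Set.of(1), Set.of(2)));
    }
}
```

In this toy set the support of {1, 2} is 50% and the confidence of 1 => 2 is about 66.7%, so the rule would count as strong for thresholds such as min_sup = 40 and min_conf = 60. The sketch does not guard against an itemset P that never occurs, in which case the division is undefined.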
II.2.2. Base Theory
Generally, Association Rule Mining has two main steps:
Step 1: Find all frequent itemsets. Each of these itemsets occurs at
least as frequently as a predetermined minimum support count, min_sup.
Step 2: Generate strong association rules from the frequent itemsets.
These rules must satisfy minimum support and minimum confidence.
The figure below gives an idea of how min_sup and min_conf work in
Association Rule Mining:

Figure 3 Work flow of the process (min_sup is used in step 1, frequent itemsets; min_conf in step 2, association rules)
In Step 1, we apply the Apriori algorithm to find the frequent itemsets;
this is the most important component of the application.
Step 2 generates association rules from the frequent itemsets obtained in
Step 1.
The skeleton of the process is therefore not very complicated. So far we have
covered the main idea of the Association Rule Mining process. The next two
sections analyze the two steps of the process in detail.
II.2.3. Apriori Algorithm
The Apriori algorithm is a seminal algorithm proposed by R. Agrawal and R.
Srikant in 1994 for mining frequent itemsets for Boolean association rules [4].
Its name is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties.
Before exploring the algorithm, recall the two notations $L_k$ and $C_k$
from the Basic Terms section.
The pseudo code of the Apriori algorithm is given in List 1.

    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do begin
        C_{k+1} = candidates generated from L_k;
        foreach transaction T in database do
            increment the count of all candidates in C_{k+1} that are contained in T
        L_{k+1} = candidates in C_{k+1} with min_sup
    end
    return ∪_k L_k;

List 1 Apriori pseudo code
The idea is sound, but in practice it faces a serious performance problem.
The candidate generation step can produce far more candidates than necessary.
If it is implemented naively, the number of generated candidates is enormous,
and so is the computation needed to count them.
This is where the Apriori property comes in: all nonempty subsets of a
frequent itemset must also be frequent. This property is based on the
observation that if an itemset $l$ is not frequent and an item $A$ is added
to $l$, then the resulting itemset $l \cup A$ cannot occur more frequently
than $l$. Hence, $l \cup A$ is not frequent either. This property belongs to
a special category of properties called antimonotonicity, in the sense that
if a set cannot pass a test, all of its supersets will fail the same test as
well. It is called antimonotonicity because the property is monotonic in the
context of failing a test.
We use this property in the prune step: any (k-1)-itemset that is not
frequent cannot be a subset of a frequent k-itemset. Therefore, if any
(k-1)-subset of a candidate k-itemset is not in $L_{k-1}$, then the candidate
cannot be frequent either and so can be removed from $C_k$.
II.2.4. Rule Generation
After step 1 with the Apriori algorithm, not much work remains. From the
frequent itemsets, we can generate strong association rules as follows:

    foreach frequent itemset l
        generate all nonempty subsets of l
        foreach nonempty subset x where x ⊂ l
            output the rule x ⇒ (l − x) if support(l) / support(x) ≥ min_conf

List 2 Rule Generation pseudo code
Because the rules are generated from the frequent itemsets, each one
automatically satisfies min_sup. This finally gives us the association rules
we need.
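Step 1, the Apriori search with its prune step from section II.2.3, can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the implementation described in section III: the class and method names are hypothetical, itemsets are plain sets of integer ids, and the threshold is an absolute support count (minCount) rather than a percentage.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative Apriori sketch; all names here are hypothetical. */
final class AprioriSketch {

    /** Returns each frequent itemset with its support count (minCount is an absolute count). */
    static Map<Set<Integer>, Integer> frequentItemsets(List<Set<Integer>> db, int minCount) {
        Map<Set<Integer>, Integer> result = new LinkedHashMap<>();

        // L1: count single items, then keep those that reach minCount.
        Map<Set<Integer>, Integer> counts = new HashMap<>();
        for (Set<Integer> t : db) {
            for (Integer item : t) {
                counts.merge(new TreeSet<>(List.of(item)), 1, Integer::sum);
            }
        }
        Map<Set<Integer>, Integer> current = keepFrequent(counts, minCount);

        while (!current.isEmpty()) {
            result.putAll(current);
            // C(k+1): join Lk with itself and prune with the Apriori property.
            Set<Set<Integer>> candidates = generateCandidates(current.keySet());
            // Scan the database once to count every surviving candidate.
            counts = new HashMap<>();
            for (Set<Integer> t : db) {
                for (Set<Integer> c : candidates) {
                    if (t.containsAll(c)) {
                        counts.merge(c, 1, Integer::sum);
                    }
                }
            }
            current = keepFrequent(counts, minCount); // L(k+1)
        }
        return result;
    }

    /** Keeps only the itemsets whose count reaches minCount. */
    static Map<Set<Integer>, Integer> keepFrequent(Map<Set<Integer>, Integer> counts, int minCount) {
        Map<Set<Integer>, Integer> kept = new HashMap<>();
        counts.forEach((itemset, n) -> {
            if (n >= minCount) {
                kept.put(itemset, n);
            }
        });
        return kept;
    }

    /** Joins pairs of frequent k-itemsets that differ in one item, then prunes. */
    static Set<Set<Integer>> generateCandidates(Set<Set<Integer>> frequentK) {
        Set<Set<Integer>> candidates = new HashSet<>();
        for (Set<Integer> a : frequentK) {
            for (Set<Integer> b : frequentK) {
                Set<Integer> union = new TreeSet<>(a);
                union.addAll(b);
                if (union.size() == a.size() + 1 && allSubsetsFrequent(union, frequentK)) {
                    candidates.add(union);
                }
            }
        }
        return candidates;
    }

    /** Prune test: every k-subset of a (k+1)-candidate must itself be frequent. */
    static boolean allSubsetsFrequent(Set<Integer> candidate, Set<Set<Integer>> frequentK) {
        for (Integer item : candidate) {
            Set<Integer> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!frequentK.contains(subset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(
                Set.of(1, 2, 3), Set.of(1, 2), Set.of(1, 3), Set.of(2, 4));
        // With minCount = 2 this finds {1}, {2}, {3}, {1, 2} and {1, 3}.
        frequentItemsets(db, 2).forEach((set, n) -> System.out.println(set + " -> " + n));
    }
}
```

In the toy run, the candidate {1, 2, 3} is never counted because its subset {2, 3} is not frequent, which is exactly the saving the prune step is meant to provide. The counting loop tests every candidate against every transaction, which is simple but not tuned for a large dataset.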

III. Solution
III.1. System Architecture


Figure 4 System architecture
The system works as follows:
- Accept three input parameters:
o A customer transaction history file
o Min support
o Min confidence
- Using the customer transaction history file and the min support, apply
the Apriori algorithm to find all frequent itemsets that satisfy the
min support.
- Using the set of all frequent itemsets and the min confidence as input,
apply the rule generation process to generate all association rules,
which are used to determine the products that are bought together.
III.2. Knowledge Representation
The dataset we use is a set of transactions from a Belgian store. It was
downloaded from http://fimi.ua.ac.be/data/retail.dat. Each record contains a
set of product ids representing one transaction of a customer (the list of
things that he or she bought together).
Customer history file: Each line of this file is a set of product ids, which
defines one transaction that a customer has made. Product ids are separated
by a space character. There may be empty lines between transactions; we
anticipate this case and handle it in the implementation.
Min support and min confidence can be any double value between 0 and 100.
The association rules are stored in a file as the output. Each line is a
rule of the form A => {B}.
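The input format described above (one transaction per line, ids separated by spaces, possible empty lines) could be parsed along the following lines. This is a hedged sketch: the class name TransactionParser and the method parse are hypothetical, and product ids are assumed to be integers, as they are in retail.dat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical parser for the transaction history format; not the report's own class. */
final class TransactionParser {

    /** Turns each non-empty line into one transaction, i.e. a set of product ids. */
    static List<Set<Integer>> parse(List<String> lines) {
        List<Set<Integer>> transactions = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // tolerate empty lines between transactions
            }
            Set<Integer> items = new TreeSet<>();
            for (String token : line.split("\\s+")) {
                items.add(Integer.parseInt(token));
            }
            transactions.add(items);
        }
        return transactions;
    }

    public static void main(String[] args) {
        // Made-up sample lines, including an empty line that must be skipped.
        List<Set<Integer>> db = parse(List.of("0 1 2", "", "30 31 32", "33 34 35"));
        System.out.println(db.size() + " transactions, first = " + db.get(0));
    }
}
```

To read a real file, the lines could be obtained with java.nio.file.Files.readAllLines before calling parse. Using a set per transaction also collapses any product id that is repeated within a line, which is an assumption of this sketch rather than a documented property of the dataset.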
III.3. Implementation
III.3.1. Packages

Figure 5 Packages
The program is divided into three main packages:
- GUI package: contains all classes that are responsible for
providing the user interface.
- Algorithm package: contains all classes that implement the
algorithms used in this program.
- Program package: contains the class with the main method. The
program is run from here.
In addition, two packages store the input and output files of the system.
The resource package stores the user transaction history file. The output
package stores the output files, which contain all association rules
generated by the system.
III.3.2. Class diagram

Figure 6 Class diagram
The classes ItemSet, FrequentItemSet, Candidate and Rule are simple POJOs
used to store data. All procedures of the system, including the algorithms,
are implemented in the Service class.
We explain the methods implemented in this class by going through each step
of the system.
- Step 1: Find all frequent itemsets
o In this step, we implement the methods that apply the Apriori
algorithm to find the frequent itemsets.

Method: getConfigValue(String path)
Description:
- Accepts the file path as input parameter.
- Reads through the file and calculates the number of transactions and the
number of items in the dataset.

Method: createFirstCandidate()
Description:
- Builds the first candidate list by counting items through the file.
- Returns the set $C_1$.

Method: getFirstFrequentItemSet()
Description:
- From the first candidate list, based on the min support, finds all
candidates whose supports are greater than min_sup.
- The support value is calculated by the function caculateSup().
- Returns the first frequent itemset $L_1$.

Method: getListCandidateFromPreviousFrequentItemSet()
Description:
- Generates the candidate sets $C_{k+1}$ from $L_k$.
- To improve the performance of the algorithm, applies the prune step based
on the downward closure property, that is, subsets of a frequent itemset are
also frequent itemsets.
- The prune step is implemented in the method prune().

Method: apriori()
Description:
- This function is the implementation of the Apriori algorithm, following
the pseudo code introduced in the basic concepts.
- Step 2: Generate strong association rules from the frequent itemsets.
o This step is implemented in the ruleGenerate() method. The method
traverses all frequent itemsets found in step 1, generates all possible
rules, calculates the confidence of each rule and keeps the strong rules,
that is, those that meet the threshold (min confidence).
o The method needs a way to enumerate every possible rule of each frequent
itemset without missing any. We apply the binary string algorithm to solve
that problem.
o The algorithm is as follows:

Input: an itemset taken from the frequent itemsets
Output: the set of possible rules
Process:
n = number of items in the input set (this gives 2^n - 2 possible rules)
a = n-bit binary number with all bits 0
a++;
While (a has fewer than n bits set to 1)
    Output the rule for a: the items at the 0 bits form the left-hand side
    and the items at the 1 bits form the right-hand side of the rule.
    a++;

List 3 Binary string algorithm
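The binary string idea can be sketched in Java with an integer bitmask. This is an illustration only: the class BinaryStringRules and its nested Rule holder are hypothetical names, the sketch assumes fewer than 31 items per itemset so that the mask fits in an int, and it only enumerates the candidate rules, leaving the confidence check against min_conf to the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of rule enumeration with a binary counter. */
final class BinaryStringRules {

    /** Simple holder for one candidate rule lhs => rhs. */
    static final class Rule {
        final List<Integer> lhs;
        final List<Integer> rhs;

        Rule(List<Integer> lhs, List<Integer> rhs) {
            this.lhs = lhs;
            this.rhs = rhs;
        }

        @Override
        public String toString() {
            return lhs + " => " + rhs;
        }
    }

    /** Enumerates the 2^n - 2 rules of an n-item itemset (assumes n < 31). */
    static List<Rule> enumerate(List<Integer> itemset) {
        int n = itemset.size();
        List<Rule> rules = new ArrayList<>();
        // Masks 1 .. 2^n - 2 skip the all-0 and all-1 patterns,
        // either of which would leave one side of the rule empty.
        for (int mask = 1; mask < (1 << n) - 1; mask++) {
            List<Integer> lhs = new ArrayList<>();
            List<Integer> rhs = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) == 0) {
                    lhs.add(itemset.get(i)); // bit 0: left-hand side
                } else {
                    rhs.add(itemset.get(i)); // bit 1: right-hand side
                }
            }
            rules.add(new Rule(lhs, rhs));
        }
        return rules;
    }

    public static void main(String[] args) {
        // A 2-itemset yields two rules, one per direction.
        enumerate(List.of(39, 48)).forEach(System.out::println);
    }
}
```

For the 2-itemset {39, 48} this yields the two rules 48 => 39 and 39 => 48; a 3-itemset yields six. Each generated rule would then be kept only if support(itemset) / support(lhs) reaches min_conf.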

IV. Summary
IV.1. Achievements
After completing the implementation, we can state that our system works
correctly according to the theory.
Here are some screenshots and output of our application:

Figure 7 Main screen of the application

Figure 8 Result panel

FILE STATISTIC
Number of transaction: 88162
Number of item: 16470
------------------------------------------------
FREQUENT ITEM SET
Frequent 1-itemset
[15167 {32}]
[15596 {38}]
[50675 {39}]
[14945 {41}]
[42135 {48}]
Frequent 2-itemset
[29142 {39 48}]
[10345 {38 39}]
[9018 {41 48}]
[11414 {39 41}]
------------------------------------------------
RULES
39 =====> 48 confident: 57.50764676862358
48 =====> 39 confident: 69.16340334638662
38 =====> 39 confident: 66.3311105411644
41 =====> 48 confident: 60.3412512546002
41 =====> 39 confident: 76.37336901973904
------------------------------------------------
Execution time is: 0.969 seconds.

List 4 Application output
Notable points of our system:
- We built the system from scratch.
- The system performs well: the run in List 4 Application output processes
88162 transactions over 16470 items in 0.969 seconds.
IV.2. Future Work
Being limited by time, we know that our work is not a complete solution.
There are several features that we think should be added to the system if we
have a chance to continue developing it:
- An explanation system.
- Integration of the system into a real market system with a real
database, so that the system is trained and updated regularly.
- Improving the efficiency of the system by implementing
several extensions.
V. Conclusion
We have gone through the Association Rule Mining problem, from the theory to
a practical implementation. Association Rule Mining consists of two main
steps: finding the frequent itemsets (sets of items satisfying the min_sup
threshold), then generating strong association rules from the output of the
previous step (rules satisfying the min_conf threshold).
The Apriori algorithm is a seminal algorithm for mining frequent itemsets
for Boolean association rules [4].
Association Rule Mining has proven itself in practice and has strengthened
its position in the data mining field. The benefit of applying this mining
process goes to both employer and customers, which makes it a win-win
arrangement and therefore a lasting success.
VI. References

[1] B. Palace, "Data Mining: What is Data Mining?," June 1996. [Online]. Available:
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm.
[Accessed 11 Dec 2013].
[2] Wikipedia, "Data mining," Wikipedia, [Online]. Available:
http://en.wikipedia.org/wiki/Data_mining. [Accessed 11 Dec 2013].
[3] J. Han, M. Kamber and J. Pei, "Mining Frequent Patterns, Associations, and
Correlations," in Data Mining: Concepts and Techniques, 3rd Edition, Morgan
Kaufmann, 2011, p. 207.
[4] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 1994
Int. Conf. Very Large Data Bases, Santiago, Chile, IBM Almaden Research Center, Sep 1994,
pp. 487-499.
[5] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, 3rd Edition,
Morgan Kaufmann, 2011.
