
Analysis of Association Rule Mining Algorithms with Applications

Aasawari Bagewadikar, Priyanka Botny Srinath, Ronald Bayross, Srujana Paluri, Stephen Woolery
Dr. Ahmed Ezzat
Computer Engineering, Santa Clara University
Santa Clara, California, United States
Abstract: The business world requires critical decision making in various situations and scenarios. Large amounts of information are generated by the products and services industries, and with that information comes raw data that requires tasks like data mining to extract meaningful information. Mining methods like association rule mining help uncover relationships between seemingly unrelated data in a repository. In this paper, we provide an overview of the basic concepts of association rule mining, walk through the existing association rule mining techniques, and analyze them through sample applications.
Index Terms: Associative Rule Mining, Apriori, FP Growth, Opinion Mining, Recommender Systems, Data Stream Mining, Stock and Text Categorization
I. INTRODUCTION
Data mining is an analytical method for gathering
useful information from data. It allows the users to
analyze, categorize, and summarize the
relationships among data using various algorithms.
In this paper, we examine the various association
rule mining algorithms with applications that help
in the analysis.
Many modern business organizations accumulate large amounts of data from their day-to-day activities. This data is stored in databases, data warehouses, and other information storage schemes, and it must be processed to extract useful information. Data mining techniques such as clustering, classification, prediction, and association rule mining are generally used to analyze the data and extract useful information from it.
Association rule mining is a popular and well researched method for discovering interesting relations between variables in large datasets, commonly known as the "Market Basket" problem. It was first introduced by Agrawal, Imielinski, and Swami. This technique aims to extract interesting correlations, associations, or causal structures among sets of items in transaction databases or other data repositories. By analyzing these associations, rules can be generated to classify a given item into an itemset. The effectiveness of each rule can be measured in two ways: support level and confidence. [22]

II. BACKGROUND
Association rules are if/then statements that help to uncover relationships between seemingly unrelated data in a database, relational database, or other information repository. Their application is most often seen with the Market Basket problem, where the items in a customer's basket suggest or imply what else may be included in the basket for a given trip to the market. Once generated, these rules can be used to provide guidance on what a user group may do after a certain sequence of events has occurred. With this guidance, resources can be better utilized to guide customers to likely items, or to plan for certain events given known behavior.
A common example used to explain associative rule mining revolves around the relationship between diapers and beer in grocery store transactions. In this example, large amounts of customer purchase data are collected at a grocery store, a portion of which is shown in the table below. Each row represents a transaction, which has a unique transaction ID (TID). Analysis of this data highlights some relationships, represented in the form of frequent items or rules. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.

Generating these association rules is no simple task, and there are several methods of analyzing a dataset to get good results. These include opinion mining of product reviews, collaborative recommendation, and post-mining of existing association rules in a data stream management system.

In this paper, we examine several of these methods and analyze their benefits and drawbacks. It is important that these methods and algorithms perform their task in an efficient and robust manner, providing timely results and supporting accurate decision-making. Finally, we use the relation between execution time, confidence, and support to determine which algorithm is best fit to generate association rules.
Table 1: Customer purchase history
Analysis of the itemsets in the transactions would imply that there is a relationship between purchasing diapers and beer, which can be formalized in the rule:

{Diapers} ⇒ {Beer}

The rule implies a strong relation between diapers and beer; that is, people who buy diapers also buy beer. Using similar approaches, businesses and retailers can customize marketing approaches to target consumers based on their previous history or currently planned purchases.
The measures of effectiveness for associative rules include support level and confidence. Support level is defined as the percentage of records that contain X ∪ Y out of the total number of records in the database. Confidence is the percentage of records that contain X ∪ Y out of the total number of records that contain X. By setting minimum levels for each of these measures, it is possible to eliminate poorly performing rules and create a final model that accurately classifies the data.
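To make these definitions concrete, the following Python sketch computes both measures for the diapers-and-beer rule; the five transactions are illustrative stand-ins, not the contents of Table 1:

# Minimal sketch: computing support and confidence for a rule X => Y
# over a list of transactions. The data below is illustrative only.

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """support(X union Y) / support(X): how often Y appears when X does."""
    return support(x | y, transactions) / support(x, transactions)

rule_x, rule_y = {"diapers"}, {"beer"}
print(support(rule_x | rule_y, transactions))   # 0.6
print(confidence(rule_x, rule_y, transactions)) # 0.75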

III. OVERVIEW OF ALGORITHMS

The generation of associative rules usually occurs in two steps: first, the analysis of the database and the generation of frequent itemsets; and second, the generation of rules from these itemsets. While all algorithmic approaches include the generation of frequent itemsets, not all will generate and evaluate the associative rules. To better analyze current applications, we reviewed the common algorithms used to generate frequent itemsets.
A. AIS (Agrawal, Imielinski, and Swami) ALGORITHM

This was the first algorithm used to find association rules and detect frequent itemsets. In this algorithm, only one-item-consequent association rules are generated; for example, rules like X ∩ Y ⇒ Z can be generated, but not rules like X ⇒ Y ∩ Z. The algorithm compares frequent itemsets to generate rules, scanning a given database multiple times to find commonalities between frequent itemsets. Items whose support count is less than the minimum support are eliminated from the list of items. Candidate 2-itemsets are generated by extending frequent 1-itemsets with other items in the transaction. During the second pass over the database, the support counts of these candidate 2-itemsets are accumulated and checked against the support threshold. Similarly, candidate (k+1)-itemsets are generated by extending frequent k-itemsets with items in the same transaction. This process iterates until no new frequent itemsets are found. The algorithm focuses on improving the quality of databases together with the functionality necessary to process decision support queries.
B. Set Oriented Mining (SETM) ALGORITHM
This algorithm generates itemsets on-the-fly as the database is scanned, but does not count itemsets until the end of the scan. The itemsets are generated in the same way as in the AIS algorithm. At the end of the pass, the support count of each itemset is determined by aggregating the sequential structure. SETM separates the candidate generation process from counting, and repeatedly modifies the entire database to perform candidate generation, support counting, and removal of infrequent sets.
C. APRIORI ALGORITHM
The Apriori algorithm is used for mining frequent itemsets for Boolean association rules. It is mainly designed to operate on databases containing transactions, such as a collection of transactions in a supermarket or details of frequent uses of websites. The basis of the algorithm is the Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
Steps involved in Apriori:
1. CIk: Candidate itemset of size k
2. FIk: Frequent itemset of size k
3. FI1 = {frequent items};
4. For (k = 1; FIk != null; k++) do begin
5.   CIk+1 = candidates generated from FIk;
6.   For each transaction t in database D do
7.     Increment the count of all candidates in CIk+1 that are contained in t
8.   FIk+1 = candidates in CIk+1 with min_support
9. End
10. Return FIk;

Fig 1: Illustration of frequent itemset generation using the Apriori algorithm

The Apriori algorithm determines the frequent itemsets based on the minimum support of each itemset. In the example above, the candidate-1 frequent items are generated, and any item that does not meet the minimum support level (3 in this example) is removed. The candidate-2 iteration is then based on the candidate-1 data. In a similar fashion, k-itemsets are used to explore (k+1)-itemsets, a process known as candidate generation.
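A compact Python sketch of this generate-and-prune loop follows. It returns only the frequent itemsets (rule generation is a separate step) and simplifies the join and prune details relative to a production implementation:

from itertools import combinations

def apriori(transactions, min_support):
    """Sketch of Apriori frequent-itemset generation (itemsets only).
    `transactions` is a list of sets; `min_support` is an absolute count."""
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count support with one pass over the transactions
        frequent = {c for c in candidates
                    if sum(1 for t in transactions if c <= t) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent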
D. APRIORI TID ALGORITHM
Like the Apriori algorithm, the Apriori TID algorithm uses a generation function to determine itemsets. The difference between the two algorithms is that the Apriori TID algorithm does not refer back to the database for counting support after the first pass. Candidate generation is performed the same way as in the Apriori algorithm, only using TIDs instead of the items/itemsets themselves.
Pseudo code for the Apriori TID algorithm:
1. F1 = {frequent 1-itemsets}
2. k = 2;
3. while (Fk-1 is not empty) do {
4.   Ck = Apriori_generate(Fk-1);
     a. CBk = Counting_base_generate(Ck, CBk-1);
     b. Support_count(Ck, CBk);
5.   Fk = {c ∈ Ck | support(c) ≥ min_support}; }
6. F = union of all Fk;

E. APRIORI HYBRID ALGORITHM


The Apriori and Apriori TID algorithms both use the same candidate generation procedure and therefore count the same itemsets. The Apriori Hybrid algorithm, true to its name, uses Apriori in the initial passes and switches to Apriori TID when it expects that the candidate itemsets at the end of the pass will fit in memory. In this manner, it attempts to optimize the generation of frequent itemsets by using the fastest part of each of its parents.
F. ECLAT ALGORITHM
ECLAT stands for Equivalence Class Clustering and bottom-up Lattice Traversal. This algorithm is also used to perform itemset mining. It uses TID set intersections to compute the support of a candidate itemset, avoiding the generation of subsets that do not exist in the prefix tree. The algorithm can be seen as a more scalable variant of Apriori.

In comparisons of Apriori, FP Growth, and Eclat, the Eclat algorithm is the fastest of the three, with the lowest execution times; execution time decreases further as confidence and support thresholds increase.

Pseudo code for the Eclat algorithm:

Input: Fk = {I1, I2, ..., In} // cluster of frequent k-itemsets
Output: Frequent l-itemsets, l > k
Bottom-Up(Fk) {
1. for all Ii ∈ Fk
2.   Fk+1 = ∅;
3.   for all Ij ∈ Fk, i < j
4.     N = Ii ∩ Ij;
5.     if N.sup ≥ min_sup then
6.       Fk+1 = Fk+1 ∪ N;
7.     end
8.   end
9.   if Fk+1 ≠ ∅ then
10.    Bottom-Up(Fk+1);
11. end }

G. FP GROWTH ALGORITHM
The FP Growth algorithm is a Frequent Pattern Growth algorithm, which constructs a Frequent Pattern (FP) tree from the given data sets. This approach has an advantage over the Apriori approach in that it does not require repeated passes over the transaction set, and the complexity of recreating any given itemset is lower.

To create an FP tree, the transaction database is first scanned to find the frequency of each item. A priority is then assigned to each item based on how frequent it is, with more common items having higher priority. With that complete, each transaction is reordered according to priority, and infrequent items are discarded. Once complete, the final FP-tree can be constructed.

Fig 2: FP Tree Construction [1]

FP Tree Construction Flow:

The algorithm reads the first transaction and creates nodes for its items (a and b in the figure); each node holds an item and a counter. The root node is always null, and from the null node the algorithm creates the nodes for the first transaction, mapping the path for a and b. In a similar way, each subsequent transaction is read and mapped. If the paths of two transactions overlap, the appropriate counters are incremented by one. The dotted lines in the figure represent pointers maintained between nodes of the same item, forming a singly linked list. Compression of the data set depends on the overlap of the paths: if path overlap is high, the tree is more compressed and may use less memory. Once the FP tree construction is done, the frequent item list is extracted.
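The flow above translates into a short tree-building routine; the sketch below (with illustrative class and variable names) counts item frequencies, reorders each transaction by descending frequency, and grows shared prefixes by bumping counters:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fp_tree(transactions, min_support):
    """Sketch of FP-tree construction as described above."""
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    # Discard infrequent items; higher frequency = higher priority
    keep = {i for i, c in freq.items() if c >= min_support}
    root = FPNode(None, None)
    header = defaultdict(list)   # item -> nodes: the linked-list pointers
    for t in transactions:
        ordered = sorted((i for i in t if i in keep),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item in node.children:        # overlapping path: bump counter
                node.children[item].count += 1
            else:                            # new branch off this prefix
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header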
H. RECURSIVE ELIMINATION
Recursive Elimination processes the transactions directly, based on the FP-Growth algorithm but without the prefix tree. The FP-Growth algorithm uses a prefix tree to represent the dataset, which saves a great deal of memory for storing the transactions. The Recursive Elimination algorithm instead recursively deletes the least frequent items from the transaction database. Recursive Elimination performs better when the minimum support is low. The goal of any frequent pattern mining algorithm is to discover all patterns having support greater than the user-defined threshold.

IV. ANALYSIS OF ALGORITHMS

The associative rule mining algorithms do not always return results in a reasonable time. The various associative rule algorithms can be evaluated by computing their efficiency, where increasing efficiency means reducing computational costs. [1] There are four ways to achieve efficiency:
1. By reducing the number of passes over the database
2. By sampling the database
3. By adding extra constraints on the structure of patterns
4. Through parallelization

The following outlines the benefits and drawbacks of each algorithm for generating association rules. The analysis of some of the algorithms is best explained through the sample applications in the next section.
A. AIS ALGORITHM:
Benefits: It was the first algorithm introduced for generating association rules.
Drawbacks: The AIS algorithm makes multiple passes over the database, and it generates and counts too many candidate itemsets that turn out to be infrequent, which requires more space and wastes effort.
B. SETM:
Benefits: The algorithm separates candidate generation from counting by repeatedly modifying the entire database, and it generates its own transaction IDs.
Drawbacks: Candidates are replicated for every transaction in which they occur, resulting in huge intermediate results. The algorithm could use candidate IDs to save space, but then the join could not be carried out as an SQL operation. Worse, these huge relations have to be sorted twice to generate the next larger frequent sets.
C. APRIORI ALGORITHM
Benefits: Uses the large itemset property, is easily parallelizable, and is easy to implement.
Drawbacks: Requires a lot of memory and multiple scans of the database.
D. FP TREE GROWTH ALGORITHM
Benefits: Experimental results show the FP Growth algorithm outperforms the traditional Apriori algorithm in number of rules, CPU time, and minimum support, provided the database is not too large. The data support, speed, and accuracy of the FP Growth algorithm are high compared to contemporary algorithms. The data set is parsed only twice: once for organizing the item list and once for constructing the FP tree.
Drawbacks: The tree structure adds complexity. In the best case, where path overlap in a very large data set is high, the FP tree compresses it to a very small size; if the overlap is low, the tree can be difficult to fit in memory. Even though the algorithm itself is very fast, building the FP tree is expensive and time consuming. Once it is built, however, frequent set extraction is very easy.
E. APRIORI TID ALGORITHM
Benefits: Useful for smaller problems. Apriori performs better than Apriori TID only in the initial passes; given more passes, Apriori TID certainly has better performance than Apriori. Apriori TID can be considered an optimized version of SETM that does not rely on standard database operations and uses apriori-gen for faster candidate generation.
Drawbacks: For smaller values of k, each entry in the counting base table may be larger than the corresponding transaction.
F. APRIORI HYBRID ALGORITHM
Benefits: Applicable wherever Apriori and Apriori TID are used, combining the strengths of both.
Drawbacks: Switching from Apriori to Apriori TID is costly.
G. ECLAT ALGORITHM
Benefits: The ECLAT algorithm scans the database only once. Support is counted in this algorithm, but confidence is not calculated. Best used for free itemsets; it finds frequent itemsets in less time.
Drawbacks: Apriori wins in cases where there are many candidate sets.

H. RECURSIVE ELIMINATION
ALGORITHM
Benefits: Better efficiency than Apriori in all cases.
Drawbacks: Less efficiency than ECLAT in all cases.
Algorithm        Data support
AIS              Less
SETM             Less
Apriori          Limited
Apriori-TID      Often large
Apriori Hybrid   Very large
ECLAT
FP-Growth

Table 2: Comparison of Association Rule Mining algorithms [5]

V. APPLICATIONS
Many applications require the identification and
categorization of frequent itemsets, extending
beyond the associations of purchasing diapers and
beer. Here, we examine several applications of
associative rules and rule learning that represent
some common and some more novel contemporary
approaches.
A. Stock Categorization
Algorithm used: FP Growth Algorithm
In "The Influence of Volume and Volatility on Predicting Shanghai Stock Exchange Trends," Pierrot et al. develop a method of evaluating the effectiveness of association rules at predicting stock price behavior. For stock price data, support measures are not sufficient to characterize the performance of associative rules, as relatively few trades may severely skew the results. Confidence measures may also prove inaccurate, as the seemingly random motion of stock prices may give a rule very high confidence in a certain result without high reliability in reaching that result. Instead, the authors propose two new measures, volatility and volume, to generate associative rules for evaluation.

With the new measures calculated and a set of associative rules generated, the authors then revised the confidence measure to use a weighted average and majority vote scheme to identify the rules that would best predict future trends of a given stock price. This revised confidence measure was used to prune the generated rules and develop a final model to predict stock price trends.
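The paper's exact weighting formula is not reproduced here, so the following is only a plausible sketch of a confidence-weighted majority vote over fired rules; the rule format and trend labels are assumptions for illustration:

def predict_trend(rules, today_features):
    """Hypothetical confidence-weighted majority vote over mined rules.
    Each rule is (antecedent_set, predicted_trend, confidence); a rule
    votes only if its antecedent matches today's features."""
    votes = {"up": 0.0, "down": 0.0}
    for antecedent, trend, confidence in rules:
        if antecedent <= today_features:      # the rule fires today
            votes[trend] += confidence        # weight its vote by confidence
    return max(votes, key=votes.get)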
The authors found that the volatility measure did
not prove to be a good predictor of future stock
price performance, providing accuracy close to
what would be expected with a random prediction.
The volume measure, however, was much more
successful, providing a 5% gain in prediction
accuracy over a similar model not using the
measure.
B. Text Categorization
Algorithm used: FP Growth Algorithm
The authors of "Associative Text Categorization Exploiting Negated Words" attempt to improve associative classification of text documents by adding a negated word (a word not found in the document) to associative rules. This method is not commonly used, as the usual implementation adds negated words to every generated associative rule, quickly creating an unmanageable model. Baralis et al. address this issue by creating a traditional associative classification model, running it on a training set, and then selectively modifying any rule that misclassified a document by adding a negated word that was found in at least two misclassified documents.
This approach has much less computational overhead than generating an associative rule for every possible negated-word/actual-word combination. As a result, the runtime to generate the model is not significantly greater than for the traditional model. It is also not complex to implement, as the negated-word approach uses traditional associative rule generation for text as its basis. To simplify the approach: based on the misclassified sets, the authors generate a list of frequent words. One or more of these words are added to a rule that has failed for a given document, allowing it to then accurately classify the document and improve overall accuracy. Repeating this for all misbehaving rules generates a final model that is very accurate on the training data and has good performance on test data.
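A rough sketch of this refinement loop appears below; the rule and document representations are assumptions for illustration, not the authors' actual data structures:

from collections import Counter

def refine_rules(misclassifications):
    """Illustrative sketch of the negated-word refinement described above.
    `misclassifications` maps each failing rule to the set of documents
    (each a set of words) it misclassified."""
    all_docs = [d for docs in misclassifications.values() for d in docs]
    counts = Counter(w for doc in all_docs for w in doc)
    # Only words appearing in at least two misclassified documents qualify
    frequent = {w for w, n in counts.items() if n >= 2}
    negations = {}
    for rule, docs in misclassifications.items():
        for doc in docs:
            # Negating a word present in the misclassified document stops
            # the rule from firing on documents like it.
            candidates = frequent & doc
            if candidates:
                negations[rule] = sorted(candidates)[0]
                break
    return negations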
Using this method, the authors were able to see a slight improvement over the traditional method for associative text classification. The new method was also comparable to other machine learning approaches, but could not improve on the accuracy of Support Vector Machines. The authors realized an increase in accuracy with their new method over the traditional approach without losing the human-readability of the data that would be lost with other machine learning approaches, providing a basis for further text-based associative rule refinement.
C. Opinion Mining
Algorithm used: Apriori Algorithm
Most people want more information about products before they buy them, and the feedback of previous customers can affect their decisions. With the development of Web 2.0, which emphasizes the participation of users, more and more websites like Amazon, IMDB, and Yelp lead people to write their opinions about products or services they are interested in. These opinions benefit not only future customers but also the manufacturers. By mining these opinions for key phrases, retailers or manufacturers can better market or fine-tune their products for the target audience. In "A Method for Opinion Mining of Product Reviews using Association Rules" [3], Kim et al. proposed a method to generate these key phrases by breaking the problem down into three major tasks: 1) linguistic resource development, 2) sentiment classification, and 3) opinion summarization.
Consider the following opinion for a product:

"I like C camera in general, because the picture quality of C camera is good. It has a solid body and excellent quality. So I think that C camera is good."

From this example, we can extract several phrases, such as "picture quality is good," "solid feel," "excellent quality," and "camera is good," which illustrate the customer's opinion. With that information in hand, we then need to sort the opinions into categories that describe the aspects we wish to highlight. However, the reliability of opinion information extracted from a single product review is low, because opinion information is subjective. Therefore, we need to summarize opinion information from large amounts of review data using opinion mining.
Part-of-Speech (POS) Tagging

The first step is to extract key words or phrases from the given data. POS tagging [3] is the process of assigning a part-of-speech, such as noun, verb, pronoun, adverb, adjective, or another lexical class marker, to each word in a sentence. The input to a tagging algorithm is the string of words of a natural language sentence and a finite list of POS tags. The output is a single best POS tag for each word.
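For instance, a minimal tagging pass with the NLTK library (assuming its punkt tokenizer and averaged_perceptron_tagger models have been downloaded via nltk.download()) might look like this:

import nltk

sentence = ("I like C camera in general, because the picture quality "
            "of C camera is good.")
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('like', 'VBP'), ('C', 'NNP'), ('camera', 'NN'), ...]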

Table 3: Sample tagset

Using this process, we can take a review and convert it into the phrase-tree structure shown below.

Fig 3: Phrase-tree structure

Semantic Orientation (SO) from Association

With the phrase tree generated, each phrase then needs to be classified as good or bad. The authors use Semantic Orientation from Association to create a measure of how positive or negative each phrase may be. The measure builds on the Pointwise Mutual Information (PMI) between two words, word1 and word2:

PMI(word1, word2) = log2( p(word1 & word2) / (p(word1) p(word2)) )

With this in mind, the semantic orientation of a given phrase is calculated from the difference between the strength of its association with a set of positive words and the strength of its association with a set of negative words. It is calculated as follows:

SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")

The reference words "excellent" and "poor" were chosen because they are common adjectives used in reviews. With the common phrases extracted and used to train associative rules on which are positive and which are negative, the rules can then be run on the entire dataset to extract useful information from all reviews.
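In code, the SO computation is just a difference of two PMI values; the sketch below assumes the co-occurrence probabilities have already been estimated from a corpus or from search hit counts, and its dictionary keys are illustrative:

import math

def pmi(p_joint, p_a, p_b):
    """Pointwise mutual information from (estimated) probabilities."""
    return math.log2(p_joint / (p_a * p_b))

def semantic_orientation(p):
    """SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor").
    `p` holds probability estimates keyed by illustrative names."""
    return (pmi(p["phrase,excellent"], p["phrase"], p["excellent"])
            - pmi(p["phrase,poor"], p["phrase"], p["poor"]))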
The ultimate goal of opinion mining is to extract customer opinions from product reviews and to provide useful information for others. This paper [3] proposes a method for opinion mining which extracts features and opinions using part-of-speech tagging on each review sentence and summarizes product reviews by discovering association rules. Through opinion mining, people can easily find others' summarized opinions without reading all the product reviews.
D. Recommender Systems
Algorithm used: User-based and Item-based
collaborative filtering algorithms
In this era of e-commerce, recommendation systems play an important role in business marketing strategy. Although a variety of recommendation techniques have been developed recently, collaborative filtering, which is based on association rule mining, has been the most successful recommendation technique. It has been used in a number of different applications, such as recommending web pages, movies, articles, and products. The goal of collaborative filtering is to suggest new items, or to predict the utility of a certain item, for a particular user based on the user's previous likings and the opinions of other similar users. Then, based on user preferences, users are classified and a classification technique is applied to generate a recommendation scheme for the whole user base.
The two types of collaborative filtering techniques are model based and memory based. A model-based recommendation system extracts information from the dataset and uses it as a "model" to make recommendations without having to use the complete dataset every time. Memory-based recommendation filtering uses the entire dataset to generate recommendations. As such, model-based algorithms are widely accepted as a way to alleviate the scaling problem presented by memory-based algorithms in data-intensive commercial recommendation systems. Though model-based filtering is widely accepted, in most cases it tends to provide lower recommendation accuracy, as only a sample of the original dataset is used. Even with this drawback, the model-based approach remains widely accepted because it also provides improved robustness against profile injection attacks.

There are two primary approaches used in collaborative filtering: user-based and item-based. These approaches are similar, but differ in the primary subject against which relationships are developed.

User-based collaborative filtering was the first of the automated CF methods, also known as k-NN collaborative filtering. The objective of this algorithm is to find other users whose past rating behavior is similar to that of the current user and use their ratings on other items to predict what the current user will like. These users' ratings for the item are weighted by their level of agreement with the current user's ratings to predict his/her preference. Besides the rating matrix R, a user-user CF system requires a similarity function s: U × U → R computing the similarity between two users, and a method for using similarities and ratings to generate predictions. A critical design decision in implementing user-user CF is the choice of similarity function. One can choose from a wide range of functions, such as Pearson correlation, Spearman rank correlation, cosine similarity, mean-squared difference, and many others; however, Pearson correlation has been found to provide the best results.
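A bare-bones user-based predictor with Pearson similarity over co-rated items might be sketched as follows (NumPy-based; the matrix layout and neighborhood size k are illustrative choices, not prescribed by the literature cited here):

import numpy as np

def pearson(u, v):
    """Pearson correlation over items both users rated (NaN = unrated)."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    if mask.sum() < 2:
        return 0.0
    du, dv = u[mask] - u[mask].mean(), v[mask] - v[mask].mean()
    denom = np.sqrt((du**2).sum() * (dv**2).sum())
    return float(du @ dv / denom) if denom else 0.0

def predict(R, user, item, k=5):
    """Predict R[user, item] from the k most similar users who rated it.
    R is a users x items matrix with NaN for missing ratings; this is a
    bare-bones sketch, not a tuned recommender."""
    sims = [(pearson(R[user], R[v]), v) for v in range(len(R))
            if v != user and not np.isnan(R[v, item])]
    top = sorted(sims, reverse=True)[:k]
    num = sum(s * R[v, item] for s, v in top)
    den = sum(abs(s) for s, v in top)
    return num / den if den else np.nan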
User-based collaborative filtering is effective, but it suffers from scalability problems as the user base grows. Searching for the neighbors of a user is an O(|U|) operation (or worse, depending on how similarities are computed), and directly computing most similarity functions against all other users is linear in the total number of ratings. To extend collaborative filtering to large user bases and facilitate deployment on e-commerce sites, it was necessary to develop more scalable algorithms. CF thus moved towards item-based collaborative filtering, which is one of the most widely deployed collaborative filtering techniques in use today. The main difference between user-based and item-based CF is that item-based CF uses similarities between the rating patterns of items rather than of users. If two items tend to be liked and disliked by the same users, then they are similar, and users are expected to have similar preferences for similar items. The item-item prediction process requires an item-item similarity matrix S. This matrix is a standard sparse matrix, with missing values being 0 (no similarity); it differs in this respect from R, where missing values are unknown.
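One common way to build S is cosine similarity between the item columns of the rating matrix; the short sketch below treats unrated cells as zero, which is a simplification rather than the only option:

import numpy as np

def item_similarity(R):
    """Build an item-item cosine similarity matrix S from the rating
    matrix R (users x items, NaN or 0 for no rating), with missing
    similarities left at 0 as described above. A rough sketch only."""
    M = np.nan_to_num(R)                      # unrated -> 0
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0                   # avoid division by zero
    S = (M.T @ M) / np.outer(norms, norms)    # cosine between item columns
    np.fill_diagonal(S, 0.0)                  # ignore self-similarity
    return S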
Item-based CF algorithms produce predictions of superior quality compared to user-based CF algorithms, and many e-commerce sites, including Amazon, make heavy use of item-based CF to show item ratings and suggestions.
E. Rule Mining in Data Streams
Algorithm used: ECLAT algorithm
A data stream contains time series data, whose columns are records and fields. Preserving all elements of a data stream is impossible due to resource restrictions, so the following conditions must be fulfilled by a data stream system. [17]

Condition 1: The data stream must be analyzed by inspecting each data element only once.
Condition 2: Even though new data elements are constantly produced, memory usage must be restricted within finite limits.
Condition 3: Newly created data elements should be processed as fast as possible.
Condition 4: The latest analysis result of the data stream must be available immediately on demand.
Algorithm

To calculate the candidate frequent 1-itemsets, the algorithm assumes that a vertical TID-list database is given; for each item, it simply reads the corresponding TID list from the database and increments the item's support for each entry. The ECLAT algorithm aims to speed up the support computations: compared with algorithms like Apriori and FP-Growth, it does not create candidate itemsets by rescanning. It scans the database only once to create the vertical database, which records, for each item, the list of transactions that contain it. [18]
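The vertical representation makes the search very compact: the support of an extended itemset is simply the size of the intersection of two TID sets. A minimal sketch (with an illustrative toy database) follows:

def eclat(prefix, items, min_support, results):
    """Sketch of Eclat's bottom-up search over vertical TID lists: support
    comes from TID-set intersection, so the database is never rescanned."""
    for i, (item, tids) in enumerate(items):
        itemset = prefix + [item]
        results[frozenset(itemset)] = len(tids)   # record support
        # Try to extend with every later item, intersecting TID lists
        suffix = [(other, tids & other_tids)
                  for other, other_tids in items[i + 1:]
                  if len(tids & other_tids) >= min_support]
        if suffix:
            eclat(itemset, suffix, min_support, results)

# Illustrative vertical database: item -> TIDs of transactions containing it
vertical = {"beer": {2, 3, 4}, "diapers": {2, 3, 4, 5}, "bread": {1, 2, 4, 5}}
items = [(i, t) for i, t in sorted(vertical.items()) if len(t) >= 2]
results = {}
eclat([], items, min_support=2, results=results)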

Stream data is used by numerous applications, such as network traffic monitoring, fraudulent credit card detection, and stock market trend analysis as discussed above. Real-time production systems that create huge quantities of data at exceptional rates have been a constant challenge to the scalability of data mining methods. Network event logs, telephone call records, credit card transaction flows, and sensor and surveillance video streams are some examples of such data streams. Processing and categorization of continuous, high-volume data streams are required by upcoming applications like online photo and video streaming services, economic analysis, real-time manufacturing process control, search engines, spam filters, security, and medical services.
VI. CONCLUSION
Forms of association rules are used in many
applications today to provide recommendation
engines, behavior predictions, transaction analysis,
and item categorization. Many of the most
common retail websites use a recommendation
algorithm that attempts to increase sales by
suggesting items that were commonly bought
along with a selected item. Some stores analyze a customer's transaction data to provide targeted coupon offers based on the customer's purchase history. These applications are all powered by associative rules generated from customer data.
If associative rule generation is broken down into two phases, they are frequent itemset generation, and rule generation and evaluation. Reviewing how the frequent itemsets are generated allows us to examine the efficiency of the algorithms used. There does not appear to have been much development and innovation in this area; many recently published applications use versions of already existing algorithms to generate itemsets. Some attempt to increase the efficiency of an existing algorithm by adding an additional sorting step or hierarchy [1], but it is more common to simply select the Apriori or FP-Growth algorithm to create itemsets. This may be due to a lack of innovation in the space, but, contrarily, there may be no need for a faster or more efficient mining algorithm given current hardware (i.e., memory) limitations.
Regardless, our review found the Apriori and FP-Growth algorithms to be the most commonly used for itemset generation in cases where the dataset was contained in a traditional database format. For more exotic applications, such as mining data streams, we found that newer itemset generation algorithms such as ECLAT were the primary choice. The selection of algorithm was often made with the strengths and weaknesses of the application in mind: more static, research-oriented applications used the Apriori algorithm, real-world applications looking for good performance used an FP-Growth style algorithm, and streaming data applications used the ECLAT algorithm.
The second part of associative rule generation, the
actual creation and evaluation of the rules, is where
most of the innovation in current applications is
ongoing. Here, we saw the introduction of novel
measures to identify useful rules, attempts to create
rules by looking at the items not in an itemset, and
a method of identifying rules based on phrases by
treating the phrase itself as a frequent itemset.
Since we are attempting to uncover previously
unknown relationships in the data, there is no fixed
method or approach to identify the associative
rules that go along with these relationships, leaving
the space open to experimentation.
For most of the applications we reviewed, novel
methods to generate associative rules only did
marginally better than baseline approaches. There
were some unique cases where previously
undiscovered relationships allowed for the creation
of rules with a much higher classification accuracy,
but finding these relationships was difficult and
required some familiarity with the dataset. There
are most likely many more improvements to be
made in the area of rule generation, but each
improvement may only provide a small gain for
most circumstances. Developing a deeper understanding of what the data is and how it is generated would provide a more direct method of creating associative rules that uncover obscure relationships than any benefit those incremental gains would provide.
The most useful and direct method of improving
the performance and accuracy of a set of
associative rules proved to be introducing unique
measures for rule performance. Similar to support
and confidence, these unique measures allowed for
better identification of good rules and elimination
of the bad. After using a traditional approach to
generate the frequent itemsets and all the
associative rules, the application of the unique
measure (in addition to support and confidence)
results in a smaller set of rules that perform better
on the final dataset than in a traditional approach.
This method was also the most accessible, since it allowed traditional, in-place approaches to generate the rules before pruning them.
The large number of methods to generate associative rules hints at how varied and flexible they can be. While the primary methods to generate the frequent itemsets don't widely vary, the way the rules are generated and measured allows for fine-tuning to specific applications. This property makes associative rules broadly applicable and leaves the door open to even more novel (and potentially useful) implementations. It is likely that we don't have a full understanding of some of the largest implementations of associative rules in use, including the rules behind Amazon's product recommendations or Netflix's movie suggestions, but these shape many interactions in our daily activities. With the incredible amount of data being generated but not deeply understood, we can be assured that versions of associative rules will continue to be used to make sense of it all for some time to come.
VII. ACKNOWLEDGEMENT
We would like to thank Professor Ahmed Ezzat for
giving us an opportunity to explore various
algorithms and applications. Our special thanks to
the design lab at Santa Clara University, our
families and friends.

VIII. REFERENCES
[1] M. Agarwal, M. Jailia. "An interactive method for generalized association rule mining using FP-tree," in Proc. of 2nd Bangalore Annual Compute Conference, 2009. http://dl.acm.org.libproxy.scu.edu/citation.cfm?id=1517303.1517314
[2] J.J. Sandvig, B. Mobasher, R. Burke. "Robustness of collaborative recommendation based on association rule mining," in Proc. of 2007 ACM Conference on Recommender Systems, 2007, pp. 105-112. http://dl.acm.org.libproxy.scu.edu/citation.cfm?id=1297231.1297249
[3] W.Y. Kim, J.S. Ryu, K. Kim, U. Kim. "A method for opinion mining of product reviews using association rules," in Proc. of 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human, 2009, pp. 270-274. http://dl.acm.org.libproxy.scu.edu/citation.cfm?id=1655925.1655973
[4] T.A. Kumbhare, S.V. Chobe. "An Overview of Association Rule Mining," International Journal of Computer Science and Information Technologies, Vol. 5, pp. 927-930, 2014. http://www.ijcsit.com/docs/Volume%205/vol5issue01/ijcsit20140501201.pdf
[5] R. Tlili, Y. Slimani. "Executing Association Rule Mining Algorithms under a Grid Computing Environment," in Proc. Parallel and Distributed Systems: Testing, Analysis, and Debugging, 2011, pp. 53-61. http://doi.acm.org.libproxy.scu.edu/10.1145/2002962.2002973

[6] M.G. Kaosar, Z. Xu, X. Yi. "Distributed Association Rule Mining with Minimum Communication Overhead," in Proc. of 8th Australasian Data Mining Conference, 2009, pp. 17-23. http://crpit.com/confpapers/CRPITV101Kaosar.pdf
[7] Y.S. Koh, R. Pears. "Rare Association Rule Mining via Transaction Clustering," in Proc. of 7th Australasian Data Mining Conference, 2008, pp. 87-94. http://crpit.com/confpapers/CRPITV87Koh.pdf
[8] X. Wu, C. Zhang, S. Zhang. "Efficient Mining of Both Positive and Negative Association Rules," ACM Transactions on Information Systems, Vol. 22, No. 3, pp. 381-405, July 2004. http://doi.acm.org.libproxy.scu.edu/10.1145/1010614.1010616
[9] N. Jiang, L. Gruenwald. "Research Issues in Data Stream Association Rule Mining," ACM SIGMOD Record, Vol. 35, No. 1, pp. 14-19, March 2006. http://www.sigmod.org/publications/sigmod-record/0603/p14-article-gruenwald.pdf
[10] H. Thakkar, B. Mozafari, C. Zaniolo. "Continuous Post-Mining of Association Rules in a Data Stream Management System," in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, Y. Zhao et al. (eds.), Information Science Reference, 2009. http://web.cs.ucla.edu/~zaniolo/papers/PM16Thakkar.pdf
[11] M. Thool, P. Voditel. "Association Rule Generation in Streams," International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, Issue 5, pp. 2277-2283, May 2013. http://www.ijarcce.com/upload/2013/may/59-Manisha%20ThoolASSOCIATION%20RULE%20GENERATION%20IN%20STREAMS.pdf
[12] R. Pierrot, L.H. Liu. "The Influence of Volume and Volatility on Predicting Shanghai Stock Exchange Trends," in Fifth International Conference on Fuzzy Systems and Knowledge Discovery, Vol. 1, pp. 470-474, October 2008. http://dx.doi.org.libproxy.scu.edu/10.1109/FSKD.2008.88
[13] E. Baralis, P. Garza. "Associative text categorization exploiting negated words," in Proc. of 2006 ACM Symposium on Applied Computing, 2006, pp. 530-535. http://doi.acm.org.libproxy.scu.edu/10.1145/1141277.1141402
[14] U. Garg, M. Kaur. "ECLAT Algorithm for Frequent Itemsets Generation," International Journal of Computer Systems (ISSN: 2394-1065), Volume 01, Issue 03, December 2014. http://www.ijcsonline.com/IJCS/IJCS_2014_0103002.pdf
[15] S. Sharma, K. Khurana. "A Comparative Analysis of Associative Rules Mining Algorithms," International Journal of Scientific and Research Publications, Volume 3, Issue 5, May 2013, ISSN 2250-3153. http://www.ijsrp.org/research-paper-0513/ijsrp-p17133.pdf
[16] V. Kudhati Madhav. "A New Data Stream Mining Algorithm for Interestingness-Rich Association Rules," Journal of Computer Information Systems, Spring 2013. http://iacis.org/jcis/articles/JCIS53-3-2.pdf
[17] S. Vijayarani, P. Sathya. "Mining Frequent Item Sets over Data Streams using Eclat Algorithm," International Conference on Research Trends in Computer Technologies (ICRTCT - 2013). http://research.ijcaonline.org/icrtct/number4/icrtct1048.pdf
[18] P. Agrawal, S. Kashyap, V.C. Pandey, S.P. Keshri. "A Review Approach on various forms of Apriori with Association Rule Mining," International Journal on Recent and Innovation Trends in Computing and Communication, Vol. 1, Issue 5. http://www.academia.edu/4899172/A_Review_Approach_on_various_form_of_Apriori_with_Association_Rule_Mining
[19] M.D. Ekstrand, J.T. Riedl, J.A. Konstan. "Collaborative Filtering Recommender Systems," in Foundations and Trends in Human-Computer Interaction, Vol. 4, 2010. http://files.grouplens.org/papers/FnT%20CF%20Recsys%20Survey.pdf
[20] B. Sarwar, G. Karypis, J. Konstan, J. Riedl. "Item-Based Collaborative Filtering Recommendation Algorithms," GroupLens Research Group/Army HPC Research Center, Department of Computer Science and Engineering, University of Minnesota, Minneapolis. http://files.grouplens.org/papers/www10_sarwar.pdf
[21] K. Lai, N. Cerpa. "Support vs. Confidence in Association Rule Algorithms," in Proceedings of the OPTIMA 2001 Conference (Conference of the ICHIO, the Chilean Operations Research Society), Curic, Chile, October 10-12, 2001. http://www.academia.edu/648890/Support_vs_Confidence_in_Association_Rule_Algorithms
[22] A. Das, W. Ng, Y. Woon. "Rapid Association Rule Mining," in Proc. of 10th International Conference on Information and Knowledge Management, 2001, pp. 474-481. http://doi.acm.org/10.1145/502585.502665
