on
Multithreaded Apriori Algorithm on Different Multicore Systems.
BACHELOR OF ENGINEERING
SUBMITTED BY
TEJASWI B. GUNJAL [roll no:16]
REVTI K. DIMBAR [roll no:11]
NEHA C. GUPTA [ roll no:33]
Under the guidance of
Prof. Halkarnikar
Department of Computer Engineering
PADMASHREE DR D Y PATIL INSTITUTE OF
ENGINEERING MANAGEMENT AND RESEARCH AKURDI
PUNE-411 044.
A huge amount of data is collected from society through different sources, but it
rarely leads to useful knowledge on its own. An algorithm is required to extract that
knowledge. Apriori is an algorithm for mining databases that shows which items are
related to each other. Databases whose size runs to gigabytes or terabytes need fast
processing, and multi-core processors make this possible: parallelism across cores
reduces execution time and increases performance, whereas serial mining is
time-consuming. To address this, we propose a scheme in which the load is balanced
among the processors. In this paper we implement the Apriori algorithm in both serial
and parallel form, using multithreaded Java as the parallel programming technique, and
compare the two on the basis of execution time for varying support counts.
1.1.1 INTRODUCTION :-
TECHNICAL KEYWORDS
1. Data Mining, e-Commerce, Apriori algorithm, association rules, support,
confidence, retail sector, parallel processing, multicore processing.
Relevance of Work:
In computer science and data mining, Apriori is a classic algorithm for learning
association rules. Apriori is designed to operate on databases containing transactions (for
example, collections of items bought by customers, or details of visits to a website).
As is common in association rule mining, given a set of itemsets (for instance, sets of
retail transactions, each listing individual items purchased), the algorithm attempts to find
subsets which are common to at least a minimum number C of the itemsets. Apriori uses a
"bottom up" approach, where frequent subsets are extended one item at a time (a step known as
candidate generation), and groups of candidates are tested against the data. The algorithm
terminates when no further successful extensions are found.
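As a rough illustration of the support test described above, the following Java sketch counts how many transactions contain a given itemset and compares that count with the minimum number C. The class name `SupportCount`, the helper `setOf`, and the sample baskets are assumptions made for this example (Java 8 or newer is assumed); they are not taken from the project code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class SupportCount {

    // Convenience factory for a small set of items.
    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Counts the transactions that contain every item of the itemset.
    static int support(List<Set<String>> transactions, Set<String> itemset) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample baskets.
        List<Set<String>> db = Arrays.asList(
                setOf("bread", "milk"),
                setOf("bread", "butter", "milk"),
                setOf("butter", "jam"),
                setOf("bread", "butter"));
        Set<String> candidate = setOf("bread", "milk");
        int minSupport = 2; // plays the role of C in the text
        int s = support(db, candidate);
        System.out.println("support = " + s + ", frequent = " + (s >= minSupport));
    }
}
```

An itemset is kept only when its count reaches the threshold, so raising the minimum support count shrinks the set of frequent itemsets.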
Apriori uses breadth-first search and a tree structure to count candidate itemsets
efficiently. It generates candidate itemsets of length k from itemsets of length k-1. Then it
prunes the candidates which have an infrequent sub-pattern. According to the downward closure
lemma, the candidate set contains all frequent k-length itemsets. After that, it scans the
transaction database to determine the frequent itemsets among the candidates. In short, it finds
frequent itemsets using candidate generation, relying on the Apriori property that all nonempty
subsets of a frequent itemset must also be frequent.
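The pruning rule that follows from the Apriori property can be sketched in Java as below. This is only a minimal illustration, assuming itemsets are stored as `Set<String>`; the class name `AprioriPrune`, the method `hasInfrequentSubset`, and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, for illustration only.
class AprioriPrune {

    static Set<String> setOf(String... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    // Returns true when at least one (k-1)-subset of the candidate
    // is missing from the frequent (k-1)-itemsets.
    static boolean hasInfrequentSubset(Set<String> candidate,
                                       Set<Set<String>> frequentPrev) {
        for (String item : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(item); // drop one item to get a (k-1)-subset
            if (!frequentPrev.contains(subset)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up frequent 2-itemsets.
        Set<Set<String>> l2 = new HashSet<>();
        l2.add(setOf("A", "B"));
        l2.add(setOf("A", "C"));
        l2.add(setOf("B", "C"));
        l2.add(setOf("B", "D"));
        System.out.println(hasInfrequentSubset(setOf("A", "B", "C"), l2)); // prints false
        System.out.println(hasInfrequentSubset(setOf("B", "C", "D"), l2)); // prints true
    }
}
```

The candidate {B, C, D} can be discarded without scanning the database, because its subset {C, D} is not frequent.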
1.1.2 ADVANTAGES :-
APPLICATIONS:
It is used in data marts.
It is used in the share market.
It is also used in development centers.
2. LITERATURE SURVEY:
Association rule mining tries to find frequent patterns,
associations, correlations, or causal structures among sets of
items or objects in transaction databases, relational databases,
etc.; that is to say, to find out the relation or dependency of
the occurrence of one item based on the occurrence of other items.
The Apriori algorithm is a basic algorithm for association rule mining.
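The dependency of one item on others is usually quantified by the confidence of a rule X -> Y, commonly defined as support(X and Y together) divided by support(X). The Java sketch below shows that calculation under this standard definition; the class name `RuleConfidence`, its methods, and the sample baskets are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class RuleConfidence {

    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Number of transactions containing all the given items.
    static int support(List<Set<String>> db, Set<String> items) {
        int count = 0;
        for (Set<String> t : db) {
            if (t.containsAll(items)) {
                count++;
            }
        }
        return count;
    }

    // confidence(X -> Y) = support(X union Y) / support(X); 0 if X never occurs.
    static double confidence(List<Set<String>> db,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        int supX = support(db, antecedent);
        return supX == 0 ? 0.0 : (double) support(db, both) / supX;
    }

    public static void main(String[] args) {
        // Made-up sample baskets.
        List<Set<String>> db = Arrays.asList(
                setOf("bread", "milk"),
                setOf("bread", "butter", "milk"),
                setOf("butter", "jam"),
                setOf("bread", "butter"));
        System.out.println("confidence(bread -> milk) = "
                + confidence(db, setOf("bread"), setOf("milk")));
    }
}
```

In the sample data, bread occurs in three baskets and bread with milk in two, so the rule holds in two out of three cases.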
A supermarket wants to implement a bundling sale. It needs to
find the items that are frequently purchased together. This is a
typical market basket analysis problem. This process analyzes customer
buying habits by finding associations between the different items
that customers place in their shopping baskets. The result can
help retailers develop marketing strategies by getting to know
3. OBJECTIVES OF PROJECT:
Objectives :
To generate frequent itemsets using the Apriori algorithm. Our aim is to implement a
mining system using both a serial approach and a parallel approach, through which we
will focus on enhancing the performance of the Apriori algorithm.
Implementation Modules:
1. Authentication Module
2. User Interface Module
3. Serial Approach Module
3.1. Candidate Generation Module (Single Threaded)
3.2. Frequent Item Calculation Module (Single Threaded)
Project Scope
This project is to implement a parallel Apriori
algorithm using new-generation multicore processing
units/processors. We are going to implement the array-based
(bitmap-based) Apriori algorithm. With this
functionality, as specified in the system architecture, the
master node will also interact with other distributed systems
and will execute the mining operations on those systems
in parallel.
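One common way to realise a bitmap-based representation is to keep, for every item, a bitmap with one bit per transaction; the support of an itemset is then the number of bits left after AND-ing the bitmaps of its items. The Java sketch below illustrates that idea with `java.util.BitSet`. It is an assumption about how such a layout could look, not the project's actual data structure, and the names `BitmapSupport`, `buildBitmaps`, and `support` are hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class BitmapSupport {

    // Builds item -> bitmap, where bit i is set if transaction i contains the item.
    static Map<String, BitSet> buildBitmaps(List<List<String>> transactions) {
        Map<String, BitSet> bitmaps = new HashMap<>();
        for (int tid = 0; tid < transactions.size(); tid++) {
            for (String item : transactions.get(tid)) {
                bitmaps.computeIfAbsent(item, k -> new BitSet()).set(tid);
            }
        }
        return bitmaps;
    }

    // Support of an itemset = number of bits set in the AND of its item bitmaps.
    static int support(Map<String, BitSet> bitmaps, List<String> itemset,
                       int numTransactions) {
        BitSet acc = new BitSet(numTransactions);
        acc.set(0, numTransactions); // start with every transaction selected
        for (String item : itemset) {
            BitSet b = bitmaps.get(item);
            if (b == null) {
                return 0; // item never occurs in the database
            }
            acc.and(b);
        }
        return acc.cardinality();
    }

    public static void main(String[] args) {
        // Made-up sample transactions.
        List<List<String>> db = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("a", "c"),
                Arrays.asList("a", "b", "c"));
        Map<String, BitSet> bitmaps = buildBitmaps(db);
        System.out.println(support(bitmaps, Arrays.asList("a", "b"), db.size())); // prints 2
    }
}
```

With this layout the database is read once to build the bitmaps, and later support counts need only bitwise operations.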
System Architecture
4. METHODOLOGY:
Step 1:
The prune step: Ck denotes the set of candidate k-itemsets. The entire database is scanned to find the
count of each candidate in Ck. The count of each itemset in Ck is then compared with a predefined minimum
support count to decide whether that itemset can be placed in the frequent k-itemset Lk [1].
Step 2:
The join step: Lk is natural-joined with itself to get the next candidate (k+1)-itemset Ck+1.
The major step here is the prune step, which requires scanning the entire database to find the count of
each itemset in every candidate k-itemset. If the database is large, finding all the frequent itemsets
in it therefore takes more time [1].
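Because the database scan dominates the running time, it is the natural part to spread over several threads. The Java sketch below shows one possible way to do this: the transactions are split into equal slices, each slice is counted by its own task, and the partial counts are summed. This is a simple static form of load balancing chosen for illustration, not necessarily the scheme used in the project; the class `ParallelCount` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper class, for illustration only.
class ParallelCount {

    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Returns, for each candidate (by index), the number of transactions containing it.
    static int[] countParallel(List<Set<String>> db, List<Set<String>> candidates,
                               int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (db.size() + threads - 1) / threads; // slice size, rounded up
            List<Future<int[]>> parts = new ArrayList<>();
            for (int start = 0; start < db.size(); start += chunk) {
                List<Set<String>> slice =
                        db.subList(start, Math.min(start + chunk, db.size()));
                // Each task counts the candidates within its own slice only.
                parts.add(pool.submit(() -> {
                    int[] local = new int[candidates.size()];
                    for (Set<String> t : slice) {
                        for (int i = 0; i < local.length; i++) {
                            if (t.containsAll(candidates.get(i))) {
                                local[i]++;
                            }
                        }
                    }
                    return local;
                }));
            }
            // Merge the partial counts from all slices.
            int[] total = new int[candidates.size()];
            for (Future<int[]> part : parts) {
                int[] local = part.get();
                for (int i = 0; i < total.length; i++) {
                    total[i] += local[i];
                }
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("counting was interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a counting task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Each task writes only to its own local array, so no locking is needed while counting; the threads meet only when the partial results are merged.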
The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean
association rules. Following are the key concepts:-
Frequent Itemsets: The sets of items which have minimum support (denoted by Li for the ith itemset).
Apriori Property: Any subset of a frequent itemset must be frequent.
Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
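The join operation listed above can be sketched in Java as follows. The sketch assumes the usual convention that every itemset is kept as a sorted list and that two (k-1)-itemsets are joined when they agree on their first k-2 items; the class name `AprioriJoin` and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
class AprioriJoin {

    // Joins the frequent (k-1)-itemsets with themselves to produce candidate k-itemsets.
    // Assumes each itemset is a sorted list and that no itemset appears twice.
    static List<List<String>> join(List<List<String>> frequentPrev) {
        List<List<String>> candidates = new ArrayList<>();
        for (int i = 0; i < frequentPrev.size(); i++) {
            for (int j = i + 1; j < frequentPrev.size(); j++) {
                List<String> a = frequentPrev.get(i);
                List<String> b = frequentPrev.get(j);
                int last = a.size() - 1;
                // Join only itemsets that share the same prefix.
                if (!a.subList(0, last).equals(b.subList(0, last))) {
                    continue;
                }
                String x = a.get(last);
                String y = b.get(last);
                List<String> c = new ArrayList<>(a.subList(0, last));
                // Append the two differing items in sorted order.
                if (x.compareTo(y) < 0) {
                    c.add(x);
                    c.add(y);
                } else {
                    c.add(y);
                    c.add(x);
                }
                candidates.add(c);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Made-up frequent 2-itemsets.
        List<List<String>> l2 = Arrays.asList(
                Arrays.asList("A", "B"),
                Arrays.asList("A", "C"),
                Arrays.asList("B", "C"),
                Arrays.asList("B", "D"));
        System.out.println(join(l2)); // prints [[A, B, C], [B, C, D]]
    }
}
```

The prune step would then remove [B, C, D], since its subset [C, D] is not among the frequent 2-itemsets.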
Ck: Candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
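The pseudocode above can be turned into a small single-threaded Java program along the following lines. This is a minimal sketch under stated assumptions (itemsets kept as sorted lists, Java 8 or newer); the class `SerialApriori` and its method names are hypothetical and do not come from the project.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical class, for illustration only.
class SerialApriori {

    // Returns all frequent itemsets (each a sorted list) with count >= minSupport.
    static List<List<String>> mine(List<Set<String>> db, int minSupport) {
        List<List<String>> allFrequent = new ArrayList<>();
        // L1: count single items; the TreeMap keeps them in sorted order.
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> t : db) {
            for (String item : t) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        List<List<String>> current = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minSupport) {
                current.add(Collections.singletonList(e.getKey()));
            }
        }
        // Loop while Lk is not empty.
        while (!current.isEmpty()) {
            allFrequent.addAll(current);
            List<List<String>> next = new ArrayList<>();
            for (List<String> c : generateCandidates(current)) {
                int count = 0;
                for (Set<String> t : db) { // one database scan per candidate
                    if (t.containsAll(c)) {
                        count++;
                    }
                }
                if (count >= minSupport) {
                    next.add(c);
                }
            }
            current = next;
        }
        return allFrequent;
    }

    // Join Lk with itself, then drop candidates that have an infrequent subset.
    static List<List<String>> generateCandidates(List<List<String>> prev) {
        Set<List<String>> prevSet = new HashSet<>(prev);
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < prev.size(); i++) {
            for (int j = i + 1; j < prev.size(); j++) {
                List<String> a = prev.get(i);
                List<String> b = prev.get(j);
                int last = a.size() - 1;
                if (!a.subList(0, last).equals(b.subList(0, last))) {
                    continue;
                }
                // prev is in sorted order, so b's last item follows a's last item.
                List<String> c = new ArrayList<>(a);
                c.add(b.get(last));
                if (allSubsetsFrequent(c, prevSet)) {
                    out.add(c);
                }
            }
        }
        return out;
    }

    static boolean allSubsetsFrequent(List<String> c, Set<List<String>> prevSet) {
        for (int i = 0; i < c.size(); i++) {
            List<String> sub = new ArrayList<>(c);
            sub.remove(i); // remove by index to form a (k-1)-subset
            if (!prevSet.contains(sub)) {
                return false;
            }
        }
        return true;
    }
}
```

For example, calling `SerialApriori.mine` on the four baskets {bread, milk}, {bread, butter, milk}, {butter, jam}, {bread, butter} with a minimum support count of 2 yields three frequent items and two frequent pairs; the triple is pruned before any counting because {butter, milk} is not frequent.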
HARDWARE REQUIREMENTS:
Hardware
Speed        - 1.1 GHz
RAM          - 1 GB
Hard Disk    - 20 GB
Floppy Drive - 1.44 MB
Key Board
Mouse
Monitor      - SVGA
SOFTWARE REQUIREMENTS:
Operating System : Any (Windows/Linux)
Technology       :
IDE              : MyEclipse
Java Version     : J2SDK 1.5 or later
5. POSSIBLE OUTCOMES:
Approximate Output must be written in this.
6. REFERENCES:
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd edition, Morgan
Kaufmann Publishers, San Francisco, 2006.
2. N. Ricci, S. Guyer, and J. E. Moss, Elephant Tracks: Portable Production Complete and Precise
GC Traces, in ISMM, 2013.
3. S. Blackburn, R. Garner, C. Hoffmann, et al., The DaCapo Benchmarks: Java Benchmarking
Development and Analysis, in OOPSLA, 2006.
4. "The 6 biggest challenges retailer Face today", www.onStepRetail.com, retrieved on June 2011.
5. Berry, M. J. A. and Linoff, G., Data mining techniques for marketing, sales and customer support,
USA: John Wiley and Sons, 1997.
6. Andre Bergmann, "Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction,
and Quality Control". Fayyad, U. M.; Piatetsky-Shapiro, G.; Smyth, P.; and Uthurusamy, R. 1996.
7. Advances in Knowledge Discovery and Data Mining. Menlo Park, Calif.: AAAI Press. Dr. Gary
Parker, vol 7, 2004, Data Mining: Modules in emerging fields, CD-ROM.
8. Jiawei Han and Micheline Kamber (2006), Data Mining: Concepts and Techniques, published by
Morgan Kaufmann, 2nd ed.
9. Literature Review: Data mining, http://nccur.lib.nccu.edu. twlbitstream/1 40.1 I 9/3523I/S/35603
I OS.pdf, retrieved on June 2012
10. H. Mahgoub, "Mining association rules from unstructured documents", in Proc. 3rd Int. Conf. on
Knowledge Mining, ICKM, Prague, Czech Republic, Aug. 25-27, 2006, pp. 167-172. S. Kannan
and R. Bhaskaran, "Association rule pruning based on interestingness measures with clustering",
International Journal of Computer Science Issues, IJCSI, 6(1), 2009, pp. 35-43.
11. M. Ashrafi, D. Taniar, and K. Smith, "A New Approach of Eliminating Redundant Association
Rules", Lecture Notes in Computer Science, Volume 3180, 2004, pp. 465-474.
12. http://wenku.baidu.com/view/972ef7c66137ee06eff91824.html.
13. Data Mining: Concepts and Techniques. J.Han and M.Kamber. 2000.
14. Pattern Recognition and Machine Learning. Christopher M. Bishop. 2006.
15. Data Mining by Dr. Hall (http://www.cse.usf.edu/~hall/dm/)
16. http://en.wikipedia.org/wiki/Apriori_algorithm
Conferences:
Sr. No
1
2
7
8
9
Name of Conference
Date
Location
Pune
Pune
Lavasa,
Pune