Abstract: Mining high utility itemsets from a transactional database refers to the discovery of itemsets with high utility, such as profit. To overcome the problem of producing a large number of candidate itemsets [4],[5], we implement three algorithms: Utility Pattern Growth (UP-Growth)[2], a Chernoff-bound-based algorithm[3], and the Top-K Lossy Counting algorithm[1]. The UP-Growth algorithm facilitates the mining process by maintaining essential information in a UP-Tree (Utility Pattern Tree); high-utility itemsets can be generated from the UP-Tree efficiently with only two scans of the original database [2]. In the Chernoff Bound algorithm, the user specifies a bound to better optimize the dataset and to reduce the number of generated frequent itemsets [3]. In the Top-K Lossy Counting algorithm, the main parameter is the frequency (quantity) of an item; based on the frequency count and support of an item, we identify unpromising items [1]. These algorithms produce small candidate sets, unlike other algorithms that produce large ones, and thus improve mining performance in terms of execution time and space requirements. The objective of this work is to mine the frequent itemsets of a database using the three algorithms above and, based on the results, to judge which algorithm performs best.

1. Introduction.

Data mining of uncertain data has recently become an active area of research. Data mining is the process of revealing nontrivial, previously unknown and potentially useful information from large databases. Discovering useful patterns hidden in a database plays an essential role in several data mining tasks, such as frequent pattern mining, weighted frequent pattern mining, and high utility pattern mining. Among them, frequent pattern mining is a fundamental research topic that has been applied to different kinds of databases, such as transactional databases, streaming databases, and time series databases, and to various application domains, such as bioinformatics, Web click-stream analysis, and mobile environments[2].

In frequent itemset mining, frequent sets play an essential role in many data mining tasks that try to find interesting patterns in databases, such as association rules[7], correlations, sequences, episodes, classifiers and clusters. A set is called frequent if its support is no less than a given absolute minimal support threshold min_sup_abs, with 0 < min_sup_abs ≤ |D|. When working with frequencies of sets instead of their supports, we use a relative minimal frequency threshold min_sup_rel, with 0 < min_sup_rel ≤ 1. The task of discovering all frequent sets is quite challenging: the search space is exponential in the number of items occurring in the database, and the targeted databases tend to be massive, containing millions of transactions. Both characteristics make it a worthwhile effort to seek the most efficient techniques for solving this task[9].

Mining frequent patterns (or itemsets) is probably one of the most important concepts in data mining, and many other data mining tasks and theories stem from it. It should be the beginning of any data mining technical training because, on one hand, it gives a well-shaped idea of what data mining is and, on the other, it is not extremely technical.

The task of discovering all frequent itemsets is quite challenging. The search space is exponential in the number of items occurring in the database, and the support threshold limits the output to a hopefully reasonable subspace. Moreover, such databases can be massive[8], containing millions of transactions, which makes support counting a tough problem. In this paper, we analyze these two aspects in further detail. Throughout the last decade, many researchers have implemented and compared algorithms that try to solve the frequent itemset mining problem as efficiently as possible. Unfortunately, only a small fraction of them make the source code of their algorithms publicly available, so fair empirical evaluations and comparisons[12] of their algorithms become very difficult[6],[2]. Moreover, we experienced that different implementations of the same algorithm can still yield significantly different performance results[6]. As a consequence, several claims presented in some articles[10] were later contradicted in other articles[11]. The original motivation for searching for frequent sets came from the need to analyze so-called supermarket transaction data, that is, to examine customer behavior in terms of the purchased products; frequent sets of products describe how often items are purchased together. This paper presents the implementation of the three algorithms named above and reports the observed results.

2. Utility Pattern Growth.

The UP-Growth algorithm is a weighted association rule mining technique. In this framework, weights of items, such as unit profits of items in transaction databases, are considered. With this concept, even if some items appear infrequently, they might still be found if they have high weights.

Mining high utility itemsets from databases refers to finding the itemsets with high profits. Here, the utility of an itemset expresses its interestingness, importance, or profitability to users. The utility of items in a transaction database consists of two aspects: 1) the importance of distinct items, called external utility, and 2) the importance of items within transactions, called internal utility. The utility of an itemset is defined as the product of its external utility and its internal utility. An itemset is called a high utility itemset if its utility is no less than a user-specified minimum utility threshold; otherwise, it is called a low-utility itemset. Mining high utility itemsets from databases is an important task with a wide range of applications, such as website clickstream analysis, business promotion in chain hypermarkets, cross-marketing in retail stores, online e-commerce management, mobile commerce environment planning, and even finding important patterns in biomedical applications[2],[6]. This task suffers from a large search space, especially when the database contains many long transactions or a low minimum utility threshold is set.

The UP-Growth algorithm identifies high utility itemsets using a compact tree structure called the UP-Tree, which maintains the information of the transaction database related to utility patterns. High utility itemsets are then generated from the UP-Tree efficiently with only two scans of the database[2]. The main framework of this method consists of three steps:
(1) construction of the UP-Tree,
(2) generation of potential high utility itemsets from the UP-Tree by UP-Growth, and
(3) identification of high utility itemsets from the set of potential high utility itemsets.
The UP-Tree is constructed in two scans of the database: the first scan collects and sorts the set of high utility items, and the second constructs the UP-Tree.

Definitions

I. The utility of an item ip in a transaction Td is denoted u(ip, Td) and defined as pr(ip) × q(ip, Td).
II. The utility of an itemset X in Td is denoted u(X, Td) and defined as ∑_{ip ∈ X ∧ X ⊆ Td} u(ip, Td).
III. The utility of an itemset X in D is denoted u(X) and defined as ∑_{X ⊆ Td ∧ Td ∈ D} u(X, Td).
IV. An itemset is called a high utility itemset if its utility is no less than a user-specified minimum utility threshold, denoted min_util. Otherwise, it is called a low-utility itemset.
V. The transaction utility of a transaction Td is denoted TU(Td) and defined as u(Td, Td).
VI. The transaction-weighted utility of an itemset X is the sum of the transaction utilities of all the transactions containing X; it is denoted TWU(X) and defined as ∑_{X ⊆ Td ∧ Td ∈ D} TU(Td).
VII. An itemset X is called a high transaction-weighted utility itemset (HTWUI) if TWU(X) is no less than min_util.
VIII. An item ip is called a promising item if TWU(ip) ≥ min_util; otherwise it is called an unpromising item. Without loss of generality, an item is also called a promising item if its overestimated utility is no less than min_util, and an unpromising item otherwise.

UP-Growth algorithm

Input: A transaction database DB, a profit table p, and a minimum utility threshold min_util.
Output: The set of unpromising items, which are removed before UP-Tree construction.
Steps:
Step 1: for each item Ip in transaction Td compute
Step 2:   u(Ip, Td) = p(Ip) * q(Ip, Td);
          // p(Ip) is given in the profit table and q(Ip, Td) in DB
Step 3: end for
Step 4: for each transaction Td in DB compute
Step 5:   TU(Td) = u(Td, Td);
          // where u(X, Td) = sum of u(Ip, Td) over Ip in X, X a subset of Td
Step 6: end for
Step 7: for each item Ip in DB compute
Step 8:   TWU(Ip) = sum of TU(Td);
          // over all Td in DB containing Ip
Step 9: end for
Step 10: discard all unpromising items
          // items with TWU(Ip) < min_util are unpromising
Step 11: unpromising items are removed from the transactions.
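The pruning pass described by Steps 1-11 can be sketched in a few lines of Python. This is a minimal illustration only, not the paper's implementation: the toy transactions, profit table, and min_util value below are invented for the example.

```python
# Minimal sketch of Steps 1-11: per-item utilities, transaction utilities (TU),
# transaction-weighted utilities (TWU), and removal of unpromising items.
# The database, profit table, and min_util are illustrative values.

# Each transaction maps item -> purchased quantity q(ip, Td).
db = [
    {"A": 1, "B": 2, "C": 1},
    {"B": 4, "C": 3},
    {"A": 1, "C": 2, "D": 1},
]
profit = {"A": 5, "B": 2, "C": 1, "D": 10}  # external utility pr(ip)
min_util = 20                               # minimum utility threshold

def u(item, td):
    """Steps 1-3: u(ip, Td) = pr(ip) * q(ip, Td)."""
    return profit[item] * td[item]

def tu(td):
    """Steps 4-6: TU(Td) = u(Td, Td), the sum of item utilities in Td."""
    return sum(u(item, td) for item in td)

# Steps 7-9: TWU(ip) = sum of TU(Td) over all transactions containing ip.
twu = {}
for td in db:
    t = tu(td)
    for item in td:
        twu[item] = twu.get(item, 0) + t

# Steps 10-11: items with TWU < min_util are unpromising and are
# removed from every transaction before the UP-Tree is built.
unpromising = {item for item, w in twu.items() if w < min_util}
pruned_db = [{i: q for i, q in td.items() if i not in unpromising}
             for td in db]

print(twu)          # {'A': 27, 'B': 21, 'C': 38, 'D': 17}
print(unpromising)  # {'D'}
print(pruned_db)
```

Here TU of the three transactions is 10, 11, and 17, so TWU(D) = 17 < 20 and D is pruned, while A, B, and C survive; the subsequent UP-Tree scans then operate only on the pruned transactions.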
The datasets are available at the UCI repository of machine learning databases. This paper uses a transactional dataset; Table 1 shows some characteristics of the dataset used in the experimental evaluation.

Table 1. Description of Datasets

Transactional Datasets   Number of Items   Number of records/Instances
Dataset-1                8                 200

In this paper, we have used the NetBeans IDE to implement the algorithms stated above and tested their efficiency, i.e., the running time and memory taken by each algorithm on the given dataset. The results are shown in Table 2.

Table 2. Performance Results for Algorithms

Efficiency Type     UP-Growth   Top-K Lossy   Chernoff Bound
Running Time (ms)   0.1870      0.2820        0.3120
Memory Used (kB)    7955.91     8215.85       8235.64

6. Conclusion.

Frequent sets play an essential role in many data mining tasks that try to find interesting patterns in databases. The original motivation for searching for frequent sets came from the need to analyze so-called supermarket transaction data. Mining frequent itemsets from large datasets degrades performance in terms of execution time and space requirements. To overcome this drawback, we have implemented three algorithms: UP-Growth, the Chernoff-bound-based algorithm, and the Top-K Lossy Counting algorithm. The main aim of this paper is to find an efficient algorithm for mining frequent itemsets using the utility of the items. Every algorithm has its own importance, and the choice among them depends on the behavior of the data; on the basis of this research, however, we found that the Utility Pattern Growth algorithm is the simplest of the three, because it takes the least time and space to produce the results.

7. References.

[3] H. Li and N. Zhang, "A Simple but Effective Stream Maximal Frequent Itemset Mining Algorithm," Proc. Seventh Int'l Conf. Computational Intelligence and Security, 2011.
[4] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB), pp. 487-499, 1994.
[5] R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. 11th Int'l Conf. Data Eng., pp. 3-14, Mar. 1995.
[6] R. Chan, Q. Yang, and Y. Shen, "Mining High Utility Itemsets," Proc. IEEE Third Int'l Conf. Data Mining, pp. 19-26, Nov. 2003.
[7] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 10, pp. 1424-1440, Oct. 2004.
[8] B.-E. Shie, H.-F. Hsiao, V.S. Tseng, and P.S. Yu, "Mining High Utility Mobile Sequential Patterns in Mobile Commerce Environments," Proc. 16th Int'l Conf. Database Systems for Advanced Applications (DASFAA '11), vol. 6587/2011, pp. 224-238, 2011.
[9] C.-H. Lee, C.-R. Lin, and M.-S. Chen, "Sliding-Window Filtering: An Efficient Algorithm for Incremental Mining," Proc. Int'l Conf. Information and Knowledge Management, 2001.
[10] J. Xu, X. Lin, and X. Zhou, "Space Efficient Quantile Summary for Constrained Sliding Windows on a Data Stream," Proc. 5th Int'l Conf. Web-Age Information Management, 2004.
[11] J. Yu, Z. Chong, H. Lu, and A. Zhou, "False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams," Proc. VLDB, 2004.
[12] H. Huang, X. Wu, and R. Relue, "Association Analysis with One Scan of Databases," Proc. 2002 IEEE Int'l Conf. Data Mining, 2002.