
Knowledge-Based Systems 33 (2012) 53–64

Contents lists available at SciVerse ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

An efficient mining algorithm for maximal weighted frequent patterns in transactional databases
Unil Yun a, Hyeonil Shin b, Keun Ho Ryu a, EunChul Yoon c,*

a Department of Computer Science, College of Electrical & Computer Engineering, Chungbuk National University, South Korea
b Car Infotainment Research Department, LG Electronics, South Korea
c Department of Electronics Engineering, Konkuk University, South Korea
Abstract
In the field of data mining, there have been many studies on mining frequent patterns due to its broad applications in mining association rules, correlations, sequential patterns, constraint-based frequent patterns, graph patterns, emerging patterns, and many other data mining tasks. We present a new algorithm for mining maximal weighted frequent patterns from a transactional database. Our mining paradigm prunes unimportant patterns and reduces the size of the search space. However, maintaining the anti-monotone property without loss of information must be considered, and thus our algorithm prunes weighted infrequent patterns and uses a prefix-tree with weight-descending order. In a thorough experimental analysis on real data, our algorithm outperformed MAFIA, a previous algorithm that scales exponentially with the longest pattern length; our algorithm is also more efficient and scalable.
© 2012 Elsevier B.V. All rights reserved.

Article history:
Received 18 April 2011
Received in revised form 1 February 2012
Accepted 5 February 2012
Available online 21 February 2012

Keywords:
Data mining
Weighted frequent pattern mining
Maximal frequent pattern mining
Vertical bitmap
Prefix tree

1. Introduction

Data mining is defined as the process of non-trivial extraction of previously unknown and potentially useful information from data stored in databases [9,16,20,23,33]. Data mining is used to find patterns (or itemsets) hidden within data, and associations among the patterns. In particular, frequent pattern mining plays an essential role in many data mining tasks such as mining association rules [1], interesting measures [3,26], correlations [22,34], sequential patterns [8,29,41,42], constraint-based frequent patterns [5,40], graph patterns [35], emerging patterns [11,19,27] and approximate patterns [39]. Mining information and knowledge from very large databases is not easy, since processing large datasets takes a long time and the number of discovered patterns can be significant and redundant. Frequent pattern mining is used to discover the complete set of frequent patterns in a transaction database with respect to a minimum support. Frequent patterns have a well-known anti-monotone property [1]: if a pattern is infrequent, all of its super patterns must be infrequent. Equivalently, a frequent pattern of length n implies (2^n − 2) shorter non-empty frequent patterns. For instance, if pattern {a, b, c, d} is frequent, all subsets of {a, b, c, d} including {a}, {b}, {c}, {a, b}, {a, c}, . . ., and {b, c, d} are also frequent. To avoid
* Corresponding author. Tel.: +822 450 3349; fax: +822 3437 5235.
E-mail addresses: yunei@chungbuk.ac.kr (U. Yun), hyeonil.shin@lge.com (H. Shin), khryu@dblab.chungbuk.ac.kr (K.H. Ryu), ecyoon@konkuk.ac.kr (E. Yoon).
0950-7051/$ - see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.knosys.2012.02.002

mining all frequent patterns, we can mine only the maximal frequent patterns [4,43]. A pattern X is a maximal frequent pattern if X is frequent and every proper super pattern of X is infrequent. The problem of mining maximal frequent patterns is to discover the complete set of maximal frequent patterns. Mining maximal frequent patterns can reduce the size of the search space dramatically. However, while items have different importance in reality, this characteristic is not considered. For this reason, weighted frequent pattern mining algorithms [2,7,31] have been suggested. Weight-based pattern mining is powerful in that it not only reduces the size of the search space but also mines more important patterns. The main focus in weighted frequent pattern mining is on satisfying the anti-monotone property, since this property is generally broken when different weights are applied to different items. Even if a pattern is weighted as infrequent, its super patterns can be weighted as frequent, since super patterns of a low-weight pattern can receive a high weight after other items with higher weights are added. For this reason, previous weighted frequent pattern mining algorithms [36–38] used the maximum weight of the transaction database (or of each conditional database) instead of each item weight to maintain the anti-monotone property. Maximal frequent pattern mining and weighted frequent pattern mining have several applications. For example, using maximal frequent patterns with respect to a series of support thresholds, we can approximate and summarize the support information of all frequent patterns [28]. As another example, using


maximal frequent patterns, we can find positive feature patterns [32] that are mined from edge images. Negative feature patterns are mined from the complements of edge images and used to prune non-face candidate images. These feature patterns can be used to effectively train face detectors [32]. Similarly, we can find emerging patterns [11] that are frequent in positive samples and infrequent in negative samples. These patterns can be used to develop effective classifiers [19,20]. Additionally, maximal frequent patterns can be used to build an effective heart attack prediction system [27]. These applications reflect a key characteristic of maximal frequent patterns: they can serve as a border between frequent and infrequent patterns. Meanwhile, weighted frequent pattern mining also has several applications. First, item weights can be used to calculate the cluster allocation profit, which is the sum of the ratios of total occurrence frequency to the cluster of XML documents, as described in [18]. Second, several efficient recommendation systems [10,13] have been devised based on WFP mining. Chen [10] proposed a recommendation system that increases profit from cross-selling without losing recommendation accuracy; its consideration of product profitability for sellers is based on weighted frequent patterns. Third, alarm weights directly affect the accuracy and validity of the results in alarm correlation in communication networks, as demonstrated in [21]. The proper determination of weights reflects both the objective information of the alarms and the subjective judgment of experts.
More extensions with weight constraints have been developed, such as mining weighted association rules [31], mining weighted association rules without pre-assigned weights [30], mining weighted sequential patterns [41,42], mining weighted closed patterns [36], mining frequent patterns with dynamic weights [2], mining weighted graphs [24], mining weighted sub-trees or sub-structures [25], and mining weighted frequent XML query patterns [15]. As discussed above, weighted frequent pattern mining discovers important patterns, and maximal frequent pattern mining extracts fewer patterns by compressing the frequency information. In this paper, we take a systematic approach to the problem of mining maximal frequent patterns with weight constraints. We propose an efficient algorithm called MWFIM (Maximal Weighted Frequent Itemset Mining) that mines the entire set of maximal weighted frequent patterns from a transaction database. We conduct an extensive experimental characterization of MWFIM against a state-of-the-art maximal pattern mining method, MAFIA [6]. Using some standard machine-learning benchmark datasets, our results indicate that MWFIM outperforms the MAFIA algorithm. The main contributions of this paper are as follows: (1) we define the problem of incorporating maximal frequent pattern mining and weighted frequent pattern mining, (2) we introduce maximal weighted frequent pattern mining, (3) we propose a pruning technique with head weighted support for reducing the size of the search space, and (4) we implement our algorithm, MWFIM, and conduct an extensive experimental study comparing MWFIM with MAFIA. The remainder of this paper is organized as follows. In Section 2, we provide some background information on this subject. In Section 3, we suggest maximal weighted frequent pattern mining and describe the MWFIM algorithm. Section 4 provides our extensive experimental results. Finally, some concluding remarks are presented in Section 5.
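The anti-monotone property described in the introduction can be illustrated with a short sketch (illustrative Python; the helper name is ours, not from the paper):

```python
from itertools import combinations

def proper_nonempty_subsets(pattern):
    """All non-empty proper subsets of a pattern."""
    items = sorted(pattern)
    return [frozenset(c)
            for k in range(1, len(items))   # k < len excludes the full pattern
            for c in combinations(items, k)]

# A frequent pattern of length n implies 2^n - 2 shorter
# non-empty frequent patterns; for {a, b, c, d}, 2^4 - 2 = 14.
subs = proper_nonempty_subsets({"a", "b", "c", "d"})
assert len(subs) == 2 ** 4 - 2
assert frozenset({"b", "c", "d"}) in subs
```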

2. Background

Let I = {i1, i2, i3, . . . , in} be a unique set of items. A transaction database, TDB, is a set of transactions in which each transaction, denoted as a tuple <tid, X>, contains a unique transaction identifier (tid) and a set of items (X). We call X ⊆ I a pattern, and we call X a k-pattern if the cardinality of pattern X is k. The support of a pattern is the number of transactions containing that pattern in the TDB. The problem of frequent pattern mining is how to find the complete set of patterns satisfying a minimum support in a TDB. To prune infrequent patterns, frequent pattern mining usually uses the anti-monotone property [1]: if pattern X is infrequent, all super patterns of X must be infrequent. Using this property, infrequent patterns can be pruned early. Frequent pattern mining has been studied in the area of data mining due to its broad applications in mining association rules [1], interesting measures or correlations [3,22,26,34], sequential patterns [8,29,35,41,42], constraint-based frequent patterns [5], graph patterns [35], emerging patterns [11,19,27] and other data mining tasks. These approaches have focused on enhancing the efficiency of algorithms through techniques for search strategies, data structures, and data formats. Moreover, constraint-based pattern mining approaches [17] have been proposed to reduce the number of uninteresting patterns in terms of user focus. These studies have focused on how concise or otherwise important patterns are found. However, there has been no research on mining maximal frequent patterns with weight constraints.

2.1. Weighted frequent pattern mining

The main focus of weighted frequent pattern (WFP) mining concerns the anti-monotone property, since this property is usually broken when different weights are applied to items and patterns. That is, although pattern X is weighted as infrequent, the super patterns of X can be weighted as frequent. Mining association rules with weighted items (MINWAL) [7] defines a weighted support, which is calculated by multiplying the support of a pattern by that pattern's average weight. In weighted association rule mining (WARM), the problem of breaking the anti-monotone property is solved by using a weighted support and devising a weighted downward closure property. However, the weighted support of pattern {A, B} in WARM is the weight ratio of the transactions containing both A and B to the weight of all transactions, and thus WARM does not consider the support measure used here. Recently, weighted frequent pattern mining algorithms [39,40] based on the pattern-growth approach have been developed. Weight-based pattern mining not only reduces the size of the search space but also finds more important patterns. The weight of an item is a non-negative real number assigned to reflect the importance of each item in the TDB. Given a set of items, I = {i1, i2, i3, . . . , in}, the weight of a pattern P = {p1, p2, . . . , p_length(P)} is formally defined as follows:

Weight(P) = (Σ_{i=1}^{length(P)} Weight(p_i)) / length(P)

The weighted support of a pattern is the value obtained by multiplying the support of the pattern by the weight of the pattern. That is, given pattern P, the weighted support is defined as WSupport(P) = Weight(P) × Support(P). A pattern is called a weighted frequent pattern if its weighted support is no less than the minimum support threshold.

2.2. Maximal frequent pattern mining

Pattern X is a maximal frequent pattern (MFP) if X is frequent and every one of its proper super patterns is infrequent. For instance, if pattern {d, g, h} is frequent and all of its proper super patterns, such as {d, g, h, f}, are infrequent, then {d, g, h} is a maximal frequent pattern. The supports of all frequent patterns are needed to extract association rules. However, it is often impractical to generate an entire set of frequent patterns or closed patterns when very long patterns are present in the data [6]. The set of maximal frequent patterns is the smallest possible expression


of data that can still be used to extract the set of frequent patterns. Once the set FI is generated, the support information can easily be recomputed from the transaction database. Burdick [6] proposed a novel MFP mining algorithm, MAFIA (MAximal Frequent Itemset Algorithm). This algorithm uses a vertical bitmap representation, where the count of a pattern is based on the column in the bitmap. In contrast, the FPmax [12] algorithm uses a horizontal format. It is also a depth-first MFP mining algorithm, but uses an FP-tree structure based on the pattern-growth approach. Weighted frequent pattern mining discovers important patterns, and maximal frequent pattern mining extracts fewer patterns by compressing the frequency information; thus, maximal frequent pattern mining with weight constraints can be used to find fewer but more important frequent patterns. However, several difficult problems remain in this incorporation. We describe these problems and how to handle them in detail in the next section.

3. MWFIM: Maximal Weighted Frequent Itemset Mining

3.1. Maximal weighted frequent pattern

First, we propose maximal frequent pattern mining with weight constraints. We define a new joined paradigm that considers both earlier paradigms; we call it maximal weighted frequent pattern (MWFP) mining. In MWFP mining, weighted frequent patterns are first found, and maximal frequent patterns are then mined from the weighted frequent patterns.

Definition 3.1 (Maximal Weighted Frequent Pattern (MWFP)). A pattern is defined as a maximal weighted frequent pattern if the pattern is weighted frequent and has no weighted frequent proper superset.
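Definition 3.1 and the weighted-support formula of Section 2.1 can be phrased operationally as follows (a minimal sketch; the function names are ours, not from the paper):

```python
def weight(pattern, w):
    """Weight(P): average of the member item weights."""
    return sum(w[i] for i in pattern) / len(pattern)

def support(pattern, tdb):
    """Support(P): number of transactions containing P."""
    return sum(1 for t in tdb if set(pattern) <= t)

def wsupport(pattern, tdb, w):
    """WSupport(P) = Weight(P) * Support(P)."""
    return weight(pattern, w) * support(pattern, tdb)

def maximal_weighted_frequent(wfps):
    """Keep only weighted frequent patterns with no proper
    weighted frequent superset (Definition 3.1)."""
    fs = [frozenset(p) for p in wfps]
    return [p for p in fs if not any(p < q for q in fs)]  # p < q: proper subset

# Tiny illustration (hypothetical data, not Table 1):
w = {"a": 0.8, "b": 0.6}
tdb = [{"a", "b"}, {"a"}, {"a", "b"}]
```

For this toy data, wsupport({"a", "b"}, tdb, w) is 0.7 × 2 = 1.4, and {a, b} is the only maximal pattern among {a} and {a, b}.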

Table 1. Example transaction database and item weights.

(a) Transaction database

TID   Transaction
100   a, b, c, d, f, g
200   a, b, c, d, f
300   b, d, e, h, i
400   a, d, e, g
500   a, b, c, d, f, g
600   e, f, g, h

(b) Weight table

Item   Weight   Support
a      0.7      4
b      0.6      4
c      0.8      3
d      0.65     5
e      0.45     3
f      0.5      4
g      0.4      4
h      0.5      2
i      0.45     1
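Table 1 can be encoded directly and the Support column recomputed from the transactions (a sketch; the data is taken verbatim from the table):

```python
TDB = {
    100: {"a", "b", "c", "d", "f", "g"},
    200: {"a", "b", "c", "d", "f"},
    300: {"b", "d", "e", "h", "i"},
    400: {"a", "d", "e", "g"},
    500: {"a", "b", "c", "d", "f", "g"},
    600: {"e", "f", "g", "h"},
}
WEIGHT = {"a": 0.7, "b": 0.6, "c": 0.8, "d": 0.65, "e": 0.45,
          "f": 0.5, "g": 0.4, "h": 0.5, "i": 0.45}

# Recompute the Support column of Table 1b from Table 1a.
supports = {i: sum(1 for t in TDB.values() if i in t) for i in WEIGHT}
assert supports == {"a": 4, "b": 4, "c": 3, "d": 5, "e": 3,
                    "f": 4, "g": 4, "h": 2, "i": 1}
```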

In this approach, MWFP mining discovers candidate weighted frequent patterns using MaxW (the maximum weight of the database) first, and maximal frequent patterns are then mined. To maintain the anti-monotone property of MWFP mining, MaxW is used to determine whether a pattern is an approximate weighted frequent pattern before checking the supersets (whether the pattern has any supersets weighted as frequent). Finally, some of the maximal frequent but weighted infrequent patterns are pruned, and new maximal frequent patterns are mined as candidate MWFPs. The new maximal frequent patterns are checked again for lossless MWFP mining. The integration of weighted frequent patterns with maximal frequent patterns is straightforward: the weight constraints have to be considered before applying the superset checking. For example, suppose the minimum support threshold is 2, the transaction database in Table 1a is used, and its weight range (0.4–0.8) is used as normalized weights in Table 1b. The set of candidate weighted frequent patterns using the Apriori principle [1] is {{a}, {b}, {c}, {d}, {e}, {f}, {g}, {a, b}, {a, c}, {a, d}, {a, f}, {a, g}, {b, c}, {b, d}, {b, f}, {c, d}, {c, f}, {d, f}, {d, g}, {a, b, c}, {a, b, d}, {a, d, f}, {a, c, d}, {a, c, f}, {a, d, g}, {b, c, d}, {b, c, f}, {c, d, f}, {a, b, c, d}, {a, b, c, f}, {a, c, d, f}, {a, b, d, f}, {b, c, d, f}, {a, b, c, d, f}}, and MaxW is 0.8. The patterns {a, b, c, d, f} and {a, d, g} are maximal since they have no proper supersets among the candidates. However, neither is a real weighted frequent pattern, since each has a real weighted support less than the minimum support (2), and thus they must be pruned by real weighted frequent pattern checking. Consequently, new maximal frequent patterns {{a, b, c, d}, {a, b, c, f}, {a, c, d, f}, {a, b, d, f}, {b, c, d, f}, {a, d}, {a, g}} are mined again. Among these patterns, only {a, b, c, d} is a real weighted frequent pattern, and thus is an MWFP. In contrast, {{a, b, c, f}, {a, b, d, f}, {a, c, d, f}, {b, c, d, f}, {a, d}, {a, g}} are pruned and generate their 1-level subsets. Therefore, new candidate patterns {{a, b, f}, {a, c, f}, {a, d, f}, {b, c, f}, {b, d, f}, {c, d, f}, {g}} are generated

for mining the remaining MWFPs (subsets of the MWFPs are not generated). Pattern {a, c, f} is an MWFP, but the other candidate patterns are not. Finally, the resulting set of MWFPs, {{a, b, c, d}, {a, c, f}}, is generated. As shown in the above example, to maintain the anti-monotone property in MWFP mining, we use the maximum weight (MaxW) when deciding whether an item weighted as infrequent must be pruned. For this reason, some of the candidate patterns can be exactly weighted as infrequent. Thus, we must check whether the candidate patterns are exactly weighted as frequent. A problem occurs when a candidate pattern is not exactly weighted as frequent: if we simply remove the candidate pattern weighted as infrequent, then information on other patterns may be lost. Even if the candidate is weighted as infrequent, weighted frequent patterns that belong to subsets of the candidate can exist, and it may thus be possible to find an MWFP among these subsets. In conclusion, if a generated candidate pattern is exactly weighted as frequent, then we check whether it has a proper weighted frequent superset in MWFP mining. Only if it has no proper weighted frequent superset is the candidate pattern an MWFP that can be inserted into the resulting set of MWFP mining. In contrast, if the candidate is not exactly weighted as frequent, we must repeatedly check whether all the 1-level reduced subsets of the candidate are weighted as frequent. This means that if the candidate pattern of a leaf node is weighted as infrequent when the MWFP mining algorithm traverses the tree in depth-first order for the generation of candidate MWFP patterns, then its parent node must be visited and the node pattern checked. Our MWFP mining search strategies involve this additional handling, and can find all MWFPs of a TDB without a loss of information. Another consideration in mining MWFPs is as follows. While the subsets of an MFP are guaranteed to be frequent, some of the subsets of an MWFP can be weighted as infrequent. For instance, as shown in Fig. 1, with a minimum support of 2 and a list of three

Fig. 1. A weighted frequent pattern {a, b, c} and its subsets.


weighted items (a, b, c), in which {a:0.8, b:0.75, c:0.45}, the pattern {a, b, c} is exactly weighted as frequent since its weighted support is 2. Moreover, it has no proper superset weighted as frequent, and thus it is an MWFP. However, one of its subsets is not exactly weighted as frequent: the weighted support of pattern {a, c} is 1.875, which is less than the minimum support. Therefore, one characteristic of an MFP, namely that the complete set of frequent patterns can be generated from it, is not inherited by an MWFP. However, MWFP mining still has wide applications, such as mining negative and positive patterns for classification.

3.2. Search strategies for MWFP mining

In this section, we present a conceptual framework of the item subset lattice, along with our search strategies for MWFP mining. Assume there is a weight-descending ordering ≤_WD of the items I in a database. If an item i occurs before an item j in the ordering, we denote this as i ≤_WD j. This ordering can be used to enumerate the item subset lattice, or partial ordering over the power set S of items I. We define the partial order ≤ on S1, S2 ∈ S such that S1 ≤ S2 if S1 ⊆ S2. Fig. 2 shows a sample of a complete subset lattice for four items. The top element in the lattice is the empty set (denoted as {} or root), and each lower level k includes all k-patterns. The k-level patterns are sorted in weight-descending order on each level, and all children nodes are associated with the earliest subset in the previous level. The pattern identifying each node will be referred to as the head of the node, while the possible extensions of the node are named the tail. For example, consider node P in Fig. 3. The head of P is {a, b}, and its tail is the set {c, d}. The search space used in mining maximal weighted frequent patterns is a prefix-tree consisting only of weighted frequent items. Specifically, approximately weighted frequent items are formed to maintain the anti-monotone property.
Using prefix weights, we can find such items without any loss of information. A prefix-tree is a subset lattice, as shown in Fig. 2. The tail contains all items weighted as approximately frequent, whose weights are no larger than any element item weight of the head. To prune infrequent items, the exact weighted support of these items is not needed. Instead, we multiply the weight of the item's head by its support. By Lemma 1, this approximate weighted support, W, is always at least as large as each exact weighted support. In addition, W is the maximum weighted support of the head's subsets, and thus the anti-monotone property is always maintained. To mine the maximal weighted frequent patterns from the prefix-tree, we traverse the tree in depth-first order. At each node P, each element in the tail of the node is generated and counted as a 1-level extension. If the weighted support of {P's head} ∪ {1-extension} is less than the minimum support threshold, any super pattern in the sub-tree rooted at {P's head} ∪ {1-extension} will be weighted as infrequent.
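The head-weight upper bound described above can be sketched as follows (illustrative Python over the Table 1 data; the helper name is ours):

```python
def head_upper_bound(head, extension, tdb, w, min_sup):
    """Approximate check for a 1-level extension: multiply the
    HEAD's weight (an upper bound on the extension's weight, by
    Lemma 1) by the extension's support. This over-estimates the
    exact weighted support, so no real weighted frequent pattern
    is ever pruned."""
    pattern = set(head) | {extension}
    sup = sum(1 for t in tdb if pattern <= t)
    approx_w = sum(w[i] for i in head) / len(head)   # head weight only
    return approx_w * sup >= min_sup

TDB = [{"a","b","c","d","f","g"}, {"a","b","c","d","f"},
       {"b","d","e","h","i"}, {"a","d","e","g"},
       {"a","b","c","d","f","g"}, {"e","f","g","h"}]
W = {"a": 0.7, "b": 0.6, "c": 0.8, "d": 0.65, "e": 0.45,
     "f": 0.5, "g": 0.4, "h": 0.5, "i": 0.45}

# With head {c} (weight 0.8) and min_sup 2, items a, d, b, f pass
# the approximate check, while e and g do not -- matching the tail
# {a, d, b, f} used in the example of Section 3.3.
tail = [i for i in ["a", "d", "b", "f", "e", "g"]
        if head_upper_bound({"c"}, i, TDB, W, 2)]
assert tail == ["a", "d", "b", "f"]
```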

Fig. 3. An example of a prex-tree with weight-descending order.

Lemma 1. In a prefix-tree with weight-descending order, the weight of a pattern P containing only head items is always equal to or larger than the weight of any super pattern containing P.

Proof. Let items a1, a2, a3, . . . , an, ak be sorted in weight-descending order, so that w1 ≥ w2 ≥ w3 ≥ . . . ≥ wn ≥ wk > 0. Since w1 ≥ (w1 + w2)/2 ⇔ 2w1 ≥ w1 + w2 ⇔ w1 ≥ w2, the weight of pattern {a1} is always equal to or larger than that of pattern {a1, a2}. Similarly, the weight of pattern {a1, a2} is always equal to or larger than that of pattern {a1, a2, a3}, since (w1 + w2)/2 ≥ (w1 + w2 + w3)/3 ⇔ 3(w1 + w2) ≥ 2(w1 + w2 + w3) ⇔ w1 + w2 ≥ 2w3. In general, the weight of pattern {a1, a2, a3, . . . , an} is (w1 + w2 + w3 + . . . + wn)/n, and it is always equal to or larger than that of pattern {a1, a2, a3, . . . , an, ak}:

(w1 + w2 + w3 + . . . + wn)/n ≥ (w1 + w2 + w3 + . . . + wn + wk)/(n + 1)
⇔ (n + 1)(w1 + w2 + w3 + . . . + wn) ≥ n(w1 + w2 + w3 + . . . + wn + wk)
⇔ w1 + w2 + w3 + . . . + wn ≥ n·wk,

and w1 + w2 + w3 + . . . + wn is always equal to or larger than n·wk since each wi ≥ wk. □

Fig. 3 shows items in weight-descending order and their prefix-tree. By Lemma 1, the weight of pattern {a, b} is always equal to or larger than that of its super patterns, {a, b, c} and {a, b, d}. As shown in Fig. 3, the weight of pattern {a, b} is 0.75, and that of {a, b, c} is (0.8 + 0.7 + 0.6)/3 = 0.7. The weight of {a, b, d} is (0.8 + 0.7 + 0.5)/3 ≈ 0.667, which is also equal to or less than that of its parent node pattern, {a, b}.

Lemma 2. In a weight-descending prefix-tree, if a node pattern is weighted as infrequent, then all of its child node patterns are also weighted as infrequent. Thus, in a depth-first traversal, if a pattern is exactly weighted as infrequent, then its sub-tree traversal is stopped and the remaining child nodes are not considered.
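Lemma 1 can be spot-checked numerically: with items sorted in weight-descending order, the average weight of each longer prefix never increases (weights as in Fig. 3; helper name is ours):

```python
def prefix_average_weights(weights):
    """Average weight of the first k items, for k = 1..n."""
    avgs, total = [], 0.0
    for k, w in enumerate(weights, start=1):
        total += w
        avgs.append(total / k)
    return avgs

avgs = prefix_average_weights([0.8, 0.7, 0.6, 0.5])  # items a, b, c, d
# 0.8, 0.75, 0.7, 0.65: non-increasing, as Lemma 1 guarantees.
assert all(x >= y for x, y in zip(avgs, avgs[1:]))
```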

Fig. 2. An example of a subset lattice for four items.


Fig. 4. An example weighted frequent pattern and its prex-tree.

Proof. Each child node pattern is always a super pattern of its parent node pattern in a prefix-tree. Based on Lemma 1, the weight of each child node pattern is always equal to or less than that of its parent node pattern. In addition, the support of a parent pattern is always equal to or larger than that of its super patterns, since the support can only decrease as patterns grow longer. Thus, all child node patterns in a prefix-tree with weight-descending order must have weighted supports less than or equal to the weighted support of their parent node pattern. □

As shown in Fig. 4, the support of pattern {a, b} is 5, and thus the weighted support of pattern {a, b} is 0.75 × 5 = 3.75. If the minimum support threshold (min_sup) is 3, pattern {a, b} is weighted as frequent. However, the weighted support of {a, b, c}, 0.7 × 4 = 2.8, is not larger than min_sup, and thus {a, b, c} is weighted as infrequent. The weighted support of {a, b, d} is 0.667 × 3 ≈ 2, and therefore {a, b, d} is also weighted as infrequent. It is not necessary to check whether pattern {a, b, c, d} is weighted as frequent since it is a child node of the weighted infrequent pattern {a, b, c}.

Lemma 3. In a weight-descending prefix-tree, if the head of a node N is weighted as infrequent, then node N must be a leaf node.

Proof. The pattern of N is a subset of the pattern of any child node of N (denoted as C). Even if the pattern of N is weighted as infrequent, some of its supersets could in principle be weighted as frequent. However, if all items are sorted in weight-descending order, the weighted support of C is always less than or equal to the weighted support of N's head, based on Lemma 1. Thus, if the pattern of N is weighted as infrequent, C must also be weighted as infrequent. Consequently, all child nodes of N are always weighted as infrequent, and a sub-tree rooted at N cannot contain any weighted frequent patterns. For this reason, node N, whose pattern is weighted as infrequent, is a leaf node.
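The pruning decision of Lemma 2 on the Fig. 4 numbers can be replayed directly (weights and supports as given in the text; a sketch):

```python
MIN_SUP = 3
# (average weight, support) pairs as read from Fig. 4.
nodes = {
    ("a", "b"):      (0.75, 5),
    ("a", "b", "c"): (0.7, 4),
    ("a", "b", "d"): ((0.8 + 0.7 + 0.5) / 3, 3),
}
weighted_frequent = {p: w * s >= MIN_SUP for p, (w, s) in nodes.items()}
assert weighted_frequent[("a", "b")]            # 3.75 >= 3
assert not weighted_frequent[("a", "b", "c")]   # 2.8 < 3
assert not weighted_frequent[("a", "b", "d")]   # ~2.0 < 3
# By Lemma 2, {a, b, c, d} need not be checked at all:
# it is a child of the weighted infrequent node {a, b, c}.
```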
Reaching leaf node P in the depth-first traversal of the tree, we obtain a candidate for the result set of maximal weighted frequent patterns; however, only a real weighted frequent pattern can be a

candidate. Meanwhile, a weighted frequent superset of P may have already been discovered. Therefore, we need to check whether a superset of candidate pattern P is already contained in the result set. Only the weighted frequent patterns whose supersets are not weighted as frequent can be added to the result set. The largest possible frequent pattern contained in the sub-tree rooted at P is H ∪ T (Head union Tail) of P. As shown in Fig. 3, since the head of P is {a, b} and the tail is {c, d}, the H ∪ T of P is {a, b, c, d}. If {a, b, c, d} is discovered to be weighted frequent, it is not necessary to traverse any subsets of H ∪ T. Thus, we can prune the entire sub-tree rooted at node P. The items of tails are always sorted in descending order of their weights, and the item order is always fixed. Thus, there is no need to reorder items for any sub-tree, and we can omit the reordering time. To discover the exact result set of MWFPs, we have to check all MWFPs to ensure that no superset of any pattern has already been discovered before adding the pattern to the MWFPs. A progressive focusing technique [14] was introduced to improve the superset checking performance without excessive accesses to the entire MWFP set. The basic idea is as follows. If the entire MWFP set is large, then at any given node only fragments of the MWFP set are possible supersets of the pattern at that node. Thus, we use a local MWFP set, which is the subset of the entire MWFP set that is relevant at the node, to check the supersets effectively. In our MWFP mining, the local MWFP set for the root is initialized as the null set. Suppose that we are examining node K and are about to traverse Kn, where Kn = K ∪ {i} (i is an item of K's tail). The local MWFP set for Kn contains all of the patterns in the local MWFP set for K with the added condition that they also contain the item used to extend K when forming Kn.
After the sub-tree traversal of Kn, the local MWFP set containing the MWFPs of Kn is inserted into the global MWFP set. Consequently, candidate patterns no longer require superset checks against the global MWFP set. Instead, the local MWFP set consists of all supersets of the current node. Thus, if the local MWFP set of a candidate node is empty, then the global MWFP set contains no superset of the candidate. On the contrary, if the local MWFP set is not empty, then a superset will be found in the global MWFP set. Our MWFP mining framework in weight-descending order is as follows. First, a candidate pattern is checked to determine whether its real weighted support is no less than the minimum support threshold (min_sup). If the pattern is genuinely weighted as frequent, its supersets (extensions) have the possibility of being genuinely weighted frequent patterns. Therefore, the combinations of the pattern and each approximately weighted frequent item of the pattern's tail make up its 1-level extensions. Finally, the real weighted frequent pattern of a leaf node has to be checked to determine whether it has a weighted frequent superset.
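The progressive-focusing step can be sketched as a simple narrowing of the local MWFP set when the search moves from node K to K ∪ {i} (a sketch; the helper name is ours):

```python
def narrow_local_set(local_mwfps, item):
    """Local MWFP set for the child K U {item}: keep only the
    already-found MWFPs that contain the extension item, since
    only these can be supersets of patterns in the child's
    sub-tree."""
    return [p for p in local_mwfps if item in p]

found = [frozenset({"a", "b", "c", "d"}), frozenset({"a", "c", "f"})]
local = narrow_local_set(found, "f")
assert local == [frozenset({"a", "c", "f"})]
# An empty local set means the candidate has no weighted frequent
# superset, so the global set need not be searched at all.
assert narrow_local_set(local, "g") == []
```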

Fig. 5. Prex-tree for mining maximal weighted frequent patterns.


Fig. 6. An example of a vertical bitmap representation and AND-operation applied to the bits.
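The vertical bitmap representation and AND-operation of Fig. 6 can be imitated with machine words, one bit per transaction; a sketch over the Table 1 data:

```python
TDB = [{"a","b","c","d","f","g"}, {"a","b","c","d","f"},
       {"b","d","e","h","i"}, {"a","d","e","g"},
       {"a","b","c","d","f","g"}, {"e","f","g","h"}]

def bitmap(item):
    """Vertical bitmap of an item: bit t is set iff
    transaction t contains the item."""
    bits = 0
    for t, trans in enumerate(TDB):
        if item in trans:
            bits |= 1 << t
    return bits

# The support of a pattern is the popcount of the AND-ed bitmaps.
ab = bitmap("a") & bitmap("b")
assert bin(ab).count("1") == 3   # {a, b} occurs in TIDs 100, 200, 500
```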

3.3. Example of MWFP mining

In this section, we show an example of mining maximal weighted frequent patterns (MWFPs). Fig. 5 shows an example prefix-tree for MWFP mining of the transaction database (TDB) shown in Table 1. Suppose that the minimum support threshold (min_sup) is 2; the weight list is ⟨a:0.7, b:0.6, c:0.8, d:0.65, e:0.45, f:0.5, g:0.4, h:0.5, i:0.45⟩. We extend each level of the node pattern and traverse the prefix-tree in depth-first order until no more weighted frequent patterns can be generated. In our approach, we consider important maximal frequent patterns with weight constraints, so MWFIM can remove weighted infrequent items such as items h and i. As a result, the prefix-tree of MWFIM does not store weighted infrequent items, so memory usage can be reduced. In this example, when the head of a node is {}, the sorted weighted frequent item list in weight-descending order, with a maximum weight of the TDB of 0.8, is ⟨c, a, d, b, f, e, g⟩. This list becomes the tail of the root. The real weighted support of the first item of the tail, c, is larger than min_sup, and thus the 1-level extensions of node {c} can be generated as its child nodes. When {c} is a head, the available tail is {a, d, b, f}. This tail can be generated since each item is weighted as approximately frequent using the weight information of the head, {c}. For generating 1-level extensions through the union of head {c} and each item of the tail, {c} and {a} of the tail first make {c, a}, which is a child node of node {c}. The real weighted support of {c, a} is larger than min_sup, and thus {c, a} is a real weighted frequent pattern. Thus, if the tail of head {c, a} is not {}, we can traverse the child nodes of {c, a}. To find the tail, the approximate weighted support of {c, a} ∪ {i} must be calculated for each item i with a lower order than the member items of {c, a}.
Maintaining the anti-monotone property, the tail of {c, a} is {d, b, f}, and thus we can generate a child node of {c, a}, which is {c, a, d}. This pattern, {c, a, d}, is a real weighted frequent pattern, and thus we can find the tail of the next head, {c, a, d}, which is {b, f}. The 1-level extension of {c, a, d, b} can be generated since {c, a, d, b} is a real weighted frequent pattern. This extension is {c, a, d, b, f} since the tail of {c, a, d, b} is {f}. However, {c, a, d, b, f} is not genuinely weighted as frequent (0.65 × 3 = 1.95 < min_sup), and thus no more extensions are possible. Pattern {c, a, d, b, f} is just a leaf node. If a leaf node pattern has real weighted support and passes a superset check to determine whether it has any real weighted frequent supersets, it can be an MWFP. Nevertheless, {c, a, d, b, f} is not an MWFP, and thus we have to find MWFPs from the 4-level subsets of {c, a, d, b, f}. The candidates are {c, a, d, b}, {c, a, d, f}, {c, a, b, f}, {c, d, b, f}, and {a, d, b, f}. One of these five candidates, {c, a, d, b}, is the parent node of {c, a, d, b, f}, as shown in Fig. 5, and thus {c, a, d, b} is checked to determine whether it is an MWFP when none of its child nodes are MWFPs. The other candidates do not need additional checks here since they are traversed through their own sub-trees later. Returning to the parent node {c, a, d, b}, we find that {c, a, d, b} is genuinely weighted as frequent and has no weighted frequent supersets.

Thus, {c, a, d, b} is an MWFP, and is inserted into the result set. For the next search, the other child nodes of {c, a, d, b}'s parent node need to be traversed in depth-first order. When {c, a, d} is set as a head, only pattern {c, a, d, f} can be generated as its 1-level extension. However, {c, a, d, f} is not genuinely weighted frequent (0.6625 × 3 = 1.9875 < min_sup), and thus {c, a, d, f} becomes a leaf node. The depth-first traversal of the sub-tree of {c, a, d} has ended, and our search returns to the root of sub-tree {c, a, d}. A child node of {c, a, d} is an MWFP, and therefore {c, a, d} cannot be an MWFP. We must traverse the sibling nodes of {c, a, d} for the next step. The next sibling node of {c, a, d}, {c, a, b}, is genuinely weighted frequent, and thus its child nodes can be generated. When the head of the node is {c, a, b}, its tail is {f}, and thus a new node, {c, a, b, f}, is extended as a child node of {c, a, b}. This is a leaf node (no further extensions are possible) since it is not genuinely weighted frequent (0.65 × 3 = 1.95 < min_sup). Since the sub-tree rooted at {c, a, b} has no MWFPs, {c, a, b} can be a candidate MWFP. Thus, we need to check whether pattern {c, a, b} has any supersets that are weighted frequent. We already discovered an MWFP, {c, a, d, b}, that is a superset of {c, a, b}. Therefore, {c, a, b} cannot pass the superset check and is not an MWFP. Returning to our traversal, we arrive at {c, a, f}, the next sibling node of {c, a, b}, and check whether 1-level extensions of {c, a, f} are possible. Indeed, {c, a, f} is genuinely weighted frequent, but it has no tail and thus becomes a leaf node. No further extensions are possible. Thus, we have to check whether any of its supersets are weighted frequent. No supersets of {c, a, f} are found in the result set, which contains the already discovered MWFPs, and thus {c, a, f} is an MWFP. This pattern is inserted into the result set.
Next, {c, a, f} has no remaining sibling nodes, and we return to its parent node. The sub-tree rooted at parent node {c, a} already has MWFPs, and {c, a} is therefore a subset of these MWFPs. Thus, we can skip the superset checking step for {c, a}. Next, we traverse the sub-tree rooted at {c, d}, which is the next sibling of {c, a}. Node {c, d} has the tail {b, f}, and the combination {c, d} ∪ {b} makes the first child node of {c, d}. The generated pattern, {c, d, b}, is a real weighted frequent pattern, and thus we can generate a new candidate as the 1-level extension of {c, d, b} using its tail. The only 1-level extension, {c, d, b, f}, is not a genuinely weighted frequent pattern (0.6375 × 3 = 1.9125 < min_sup), and therefore we stop the traversal of this sub-tree and return to its parent node, {c, d, b}. Node {c, d, b} is a subset of {c, a, d, b}, which is an already discovered MWFP. The remaining sibling node, {c, d, f}, is not genuinely weighted frequent (0.65 × 3 = 1.95 < min_sup), and thus we check its remaining sibling nodes. However, it has none, and therefore we return to its parent node, {c, d}. However, {c, d} is not an MWFP since it is a subset of an already discovered MWFP. Next, {c, b} is the right-sibling node of {c, d}, and thus we extend its sub-tree and search it in depth-first order. Node {c, b} is genuinely weighted frequent, but {c, b, f} is not. Returning to the root of the sub-tree, only {c, b} can be a candidate. However, {c, b} is also a subset of an already discovered MWFP, and thus it cannot be inserted into the result set. We traverse the next sibling node, {c, f}, but {c, f} is not genuinely weighted frequent. Thus, {c, f} cannot be an MWFP. Furthermore, it has no right-sibling node, so we return to its parent node, {c}. Without checking supersets, we know that pattern {c} cannot be an MWFP since the sub-tree rooted at {c} has one or more MWFPs.
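The maximality test repeated throughout this walkthrough can be sketched as follows (a hypothetical helper mirroring the ExistSubset check of the MWFIM pseudocode; the result set contents are those of the example):

```python
# A candidate head is discarded when it is a subset of (i.e. has a superset
# among) the already-discovered MWFPs in the result set.
def exist_subset(head, mwfp_set):
    """True if head is a subset of some already-discovered MWFP."""
    h = set(head)
    return any(h <= p for p in mwfp_set)

mwfps = [{'c', 'a', 'd', 'b'}]                 # result set so far in the example
print(exist_subset({'c', 'a', 'b'}, mwfps))    # True  -> {c, a, b} is not maximal
print(exist_subset({'c', 'a', 'f'}, mwfps))    # False -> {c, a, f} becomes an MWFP
```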
We traverse the remaining search spaces for the right-sibling nodes of {c}. The remainders are {a}, {d}, {b}, {f}, {e}, and {g}. By traversing the prefix-tree rooted at {}, all of the MWFPs are discovered.

3.4. Vertical bitmap representation

Our MWFIM mining algorithm adopts the vertical bitmap representation used in MAFIA [6]. Each bitmap stands for a pattern in

Table 2
Characteristics of benchmark datasets.

Data sets       Size (M)   # of trans   # of items   A.(M.) trans size
Pumsb           15.9       49,046       2113         74(74)
Accidents       33.8       340,183      572          45(45)
Retail          3.97       88,162       16,470       13(50)
BMS-Webview1    0.97       59,602       497          2.5(267)


3.5. MWFIM: MWFP mining algorithm

We will now present the MWFIM algorithm.

ALGORITHM [MWFIM]: Maximal Weighted Frequent Itemset Mining
Input: (1) A transaction database: TDB;
       (2) Weights of the items within the weight range: MinW–MaxW;
       (3) Minimum support threshold: min_sup
Output: The complete set of maximal weighted frequent patterns

Begin
1. Let MWFP be the set of maximal weighted frequent patterns. Initialize MWFP := {};
2. Scan TDB once to find the global weighted frequent items, i.e., items with support × MaxW ≥ min_sup;
3. Sort the items in weight-descending order;
4. Scan the TDB again and build vertical bitmaps to store the weighted frequent candidate items of the transactions in the TDB;
5. Call MWFIM(root, MWFP, false);

Procedure MWFIM (Current node C, MWFP, Boolean isHUT)
1:  isAdded = false;
2:  allWF = true;
3:  HeadWS = C.Head.Weight × C.Head.Support;
4:  If (HeadWS ≥ min_sup)
5:    For each item i in all the remaining items
        // remaining items that have lower orders than the items of C.Head
6:      If (C.Head.Weight × Support(C.Head ∪ {i}) ≥ min_sup), then insert i into C.Tail;
7:      Else allWF = false;
8:  For each item i in C.Tail
9:    If (i is the first item in C.Tail) isHUT = true;
10:   Else isHUT = false;
11:   extended_C = C ∪ {i};
12:   isAdded_local = MWFIM(extended_C, MWFP, isHUT);
13:   If (isAdded_local) isAdded = true;
14:   If (isHUT and allWF = true) return isAdded;
15: If (C.Tail == {})
16:   If (not ExistSubset(C.Head, MWFP)) { // C.Head is not a subset of a weighted frequent pattern
17:     Insert C.Head into MWFP;
18:     isAdded = true; }
19:   Else isAdded = false;
20: return isAdded;

Table 3
Parameter settings for scalability test.

(a) T10I4Dx datasets
Data sets      |T|   |I|   |L|    # of items   # of trans (K)   Size (MB)
T10I4D100K     10    4     2000   1000         100              3.92
T10I4D200K     10    4     2000   1000         200              8.05
T10I4D400K     10    4     2000   1000         400              15.71
T10I4D600K     10    4     2000   1000         600              23.56
T10I4D800K     10    4     2000   1000         800              31.42
T10I4D1000K    10    4     2000   1000         1000             39.27
T10I4D2000K    10    4     2000   1000         2000             78.55
T10I4D3000K    10    4     2000   1000         3000             117.83
T10I4D4000K    10    4     2000   1000         4000             157.11
T10I4D5000K    10    4     2000   1000         5000             196.39

(b) TaLbNc datasets
Data sets                |T|   |I|   |L|    # of items   # of trans (K)   Size (MB)
T10.L1000.N10000D100K    10    4     1000   10,000       100              5.08
T20.L2000.N20000D100K    20    4     2000   20,000       100              10.99
T30.L3000.N30000D100K    30    4     3000   30,000       100              16.82
T40.L4000.N40000D100K    40    4     4000   40,000       100              22.66
T10.L1000.N10000D1000K   10    4     1000   10,000       1000             50.82
T20.L2000.N20000D1000K   20    4     2000   20,000       1000             109.9
T30.L3000.N30000D1000K   30    4     3000   30,000       1000             168.28
T40.L4000.N40000D1000K   40    4     4000   40,000       1000             226.63

the database, and the bit in each bitmap represents whether a given transaction contains the corresponding pattern. Initially, each bitmap corresponds to a 1-level pattern, or a single item. The patterns whose supports are counted in the transaction database become recursively longer, and the vertical bitmap representation combines naturally with this pattern extension. For instance, the bitmap for pattern {a, b} can be generated easily by performing an AND-operation on all of the bits in the bitmaps for {a} and {b}. Next, to count the number of transactions that contain {a, b}, we only need to count the number of set bits in the {a, b} bitmap, which equals the number of transactions that have {a, b}, as shown in Fig. 6b. In short, the bitmap representation is ideal for both candidate pattern generation and support counting. In previous algorithms [6,14], the bitmap representation is applied to store all the items of the transactions in the TDB. In the MWFIM algorithm, in contrast, the bitmap structure is used to keep only the weighted frequent items of the transactions in the TDB. Thus, memory usage can be reduced. For example, suppose that the minimum support (min_sup) is 2. Given the transaction database in Table 1a and the weight table in Table 1b, the frequent item list is ⟨a:4, b:4, c:3, d:5, e:3, f:4, g:4, h:2, i:1⟩ and the weight list of the items is ⟨a:0.7, b:0.6, c:0.8, d:0.65, e:0.45, f:0.5, g:0.4, h:0.5, i:0.45⟩. As shown in the vertical bitmaps of Fig. 6a, our MWFIM algorithm does not need to store items h and i in the vertical bitmaps because the items' maximum weighted supports (MaxW × support) are less than the minimum support (2), and no pattern including item h or i can be a weighted frequent pattern.
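The AND-and-count mechanics described above can be sketched as follows (an illustrative Python sketch using integers as bitmaps; the bit patterns below are invented for illustration and are not those of Table 1):

```python
# Vertical bitmaps as Python ints: bit k is 1 iff transaction k contains the
# item. The bit patterns here are illustrative only.
bitmap = {
    'a': 0b101101,   # a occurs in transactions 0, 2, 3, 5
    'b': 0b100111,   # b occurs in transactions 0, 1, 2, 5
}

ab = bitmap['a'] & bitmap['b']    # pattern extension: bitmap({a,b}) = AND
support_ab = bin(ab).count('1')   # support counting: population count
print(support_ab)                 # 3 transactions (0, 2, 5) contain both
```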

In the MWFIM algorithm, the TDB is scanned once, and weighted frequent items are found and sorted in weight-descending order. To store all of the transactions and their corresponding items, vertical bitmaps are generated. Next, the MWFIM algorithm calls the recursive MWFIM procedure (Current node C, MWFP, Boolean isHUT). In the procedure, a flag, isAdded, is set to true only if some child node of the current node C is an MWFP. Otherwise, it is set to false. When the flag is true, without checking for supersets, we know that the head of C is not a maximal weighted frequent pattern. When the procedure call has ended, it returns its isAdded flag. Another flag of the procedure, allWF, is set to true when all of the remaining items that have lower orders than the items of C.Head are weighted frequent. The other flag, isHUT, means that the current candidate is H ∪ T (head union tail) when isHUT is true. If isHUT and allWF



Fig. 7. Runtime (Pumsb dataset).

Fig. 11. Runtime (retail dataset).

Fig. 8. Number of patterns (Pumsb dataset).

Fig. 12. Number of patterns (retail dataset).

Fig. 9. Runtime (accidents dataset).

Fig. 13. Runtime (BMS-Webview1 dataset).

Fig. 10. Number of patterns (accidents dataset).

are true, we do not have to traverse any subsets of H ∪ T (C.Head ∪ C.Tail) in line 14. The weighted support of C.Head is used to check the genuine weighted support of the current candidate pattern in line 4. Line 6 prunes weighted infrequent patterns using the maximum weight, which is the weight of C.Head. The MWFIM algorithm adopts depth-first traversal of the prefix-tree. If C.Tail is not empty, C is extended as extended_C, and the MWFIM procedure (extended_C, MWFP, isHUT) is called recursively in line 12. However, if C.Tail is empty, then C.Head is a candidate pattern, and thus the procedure checks whether C.Head has a weighted frequent superset in line 16. If it does not, C.Head is a maximal weighted frequent pattern and is inserted into the MWFP set.
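The procedure just described can be sketched compactly in Python (a simplified, hypothetical rendering rather than the authors' Visual C++ implementation; pattern weight is taken as the average of its item weights, as in the example of Section 3.3, and the toy database, weights, and min_sup below are invented for illustration):

```python
from statistics import mean

def mwfim(head, tail, mwfps, supp, weight, min_sup):
    """Depth-first MWFP search (simplified). head/tail: item lists in
    weight-descending order; supp(pattern) -> support count; mwfps: result
    list of sets. Returns True if an MWFP was added below this node."""
    added = False
    for i, item in enumerate(tail):
        ext = head + [item]
        hw = mean(weight[x] for x in ext)       # pattern weight = avg weight
        if hw * supp(ext) < min_sup:            # not real weighted frequent
            continue
        # Build the new tail with the head weight as an upper bound, which
        # keeps the pruning anti-monotone (cf. line 6 of the pseudocode).
        new_tail = [y for y in tail[i + 1:] if hw * supp(ext + [y]) >= min_sup]
        if mwfim(ext, new_tail, mwfps, supp, weight, min_sup):
            added = True
        elif not any(set(ext) <= p for p in mwfps):  # ExistSubset check
            mwfps.append(set(ext))                   # ext is maximal
            added = True
    return added

# Toy database (invented, not Table 1), four items, min_sup = 2.
db = [{'c', 'a', 'd'}, {'c', 'a', 'd', 'b'}, {'c', 'a', 'b'}, {'a', 'b'}]
w = {'c': 0.8, 'a': 0.7, 'd': 0.65, 'b': 0.6}
supp = lambda pat: sum(set(pat) <= t for t in db)

result = []
mwfim([], ['c', 'a', 'd', 'b'], result, supp, w, min_sup=2)
print(result)  # the only MWFP here is {c, a}
```

On this toy database, {c, a} is real weighted frequent (0.75 × 3 = 2.25 ≥ 2) while every superset of it falls below min_sup, so {c, a} is the single maximal weighted frequent pattern returned.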



Fig. 14. Number of patterns (BMS-Webview1 dataset).

Fig. 17. Memory usage (Pumsb dataset).

4. Performance evaluation

Using real and synthetic datasets, we report our experimental results on the performance of the MWFIM algorithm as compared to the state-of-the-art maximal pattern mining algorithm, MAFIA [6]. MWFIM and MAFIA both use vertical bitmaps for candidate pattern generation and support counting. However, MWFIM is the first maximal weighted frequent pattern mining algorithm. The main purposes of this experiment are to demonstrate how effectively non-maximal weighted patterns can be pruned, and to show the effectiveness of maximal weighted frequent patterns. Additionally, we run a scalability test and analyze the memory usage and the quality of the patterns found by MWFIM.

4.1. Test environment and datasets

In our experiments, we used four real datasets and several synthetic datasets. Table 2 shows the characteristics of the real datasets (Pumsb, Accidents, Retail, and BMS-Webview1). The Pumsb dataset includes census data for population and housing. The Accidents dataset contains anonymized traffic accident data. It is quite dense, and therefore a large number of frequent patterns will be mined, even for very high minimum support values. The Retail dataset is sparse and contains market-basket data from a retail supermarket store. The BMS-Webview1 dataset contains several months of sparse click-stream data from an e-commerce website. These four real datasets can be obtained from the Frequent Itemset Mining

Fig. 15. Runtime (T10I4Dx dataset).

Fig. 16. Runtime (TaLbNc dataset).



4.2. Experimental results on execution time

We analyze the evaluation results for the Accidents, Pumsb, Retail, and BMS-Webview1 datasets in Figs. 7–14. The normalized weights of the items are between 0.3 and 0.6, 0.5 and 0.8, 0.3 and 0.6, and 0.4 and 0.8, respectively. Figs. 7–14 show that MWFIM runs faster and generates fewer patterns than the MAFIA algorithm in all cases. Specifically, fewer patterns are found as the minimum support is increased. Fig. 7 compares the results for the Pumsb dataset and shows that MWFIM outperforms MAFIA in all cases. Likewise, Fig. 9 shows that MWFIM is faster than MAFIA on the Accidents dataset, which is a dense TDB. Figs. 8, 10 and 12 show that MAFIA mines a very large number of patterns. For example, on the Pumsb dataset, the numbers of patterns found by MAFIA are 108,804 with a minimum support of 54% and 146,882 with a minimum support of 52%. For the Accidents dataset, which is another dense dataset, MWFIM is faster than MAFIA, as shown in Fig. 9, and generates fewer patterns, as shown in Fig. 10. In particular, Fig. 10 shows that the number of unimportant patterns is considerably reduced by MWFIM. In Figs. 11–14, we provide the evaluation results for two sparse datasets, the Retail and BMS-Webview1 datasets. With these datasets, our experiment shows that MWFIM gives the best performance in terms of the number of patterns and runtime for sparse datasets. In conclusion, our experiments show that the number of patterns found by MWFIM is several orders of magnitude smaller than the number of patterns discovered by MAFIA. Moreover, the maximal weighted frequent patterns mined by MWFIM are more important and fewer than the maximal frequent patterns of MAFIA, since the weight constraints reflect which items are more important even if they have low frequency. Therefore, we conclude that MWFP mining can prune much larger search spaces than MFP mining when fitting parameters are set.
Meanwhile, the runtime increases as the minimum support becomes lower.

4.3. Scalability test

The T10I4Dx datasets are used to test scalability with respect to the number of transactions, and the TaLbNc datasets are used to check scalability with respect to the number of attributes. In these experiments, MWFIM scales much better than the MAFIA algorithm. First, we ran a scalability test on MWFIM with the number of transactions ranging from 100 K to 5000 K. The minimum support is set to 0.1% for 100–500 K transactions and to 0.3% for 1000–5000 K transactions. The normalized weights of the items are set between 0.3 and 0.6. Fig. 15 shows that the slope of the MWFIM curve is lower than that of MAFIA, and that MWFIM is also faster. Second, we also compare MWFIM with MAFIA as the number of attributes grows from 10 K to 40 K. In this test, the number of transactions is increased from 100 K to 1000 K, with the minimum support set to 0.5% for 100 K transactions and 0.8% for 1000 K transactions. The normalized weights of the items are set between 0.3 and 0.6. Fig. 16 shows that MWFIM is much more scalable than MAFIA in terms of the number of attributes. In comparison with the MAFIA algorithm, MWFIM not only runs faster but also scales better.

4.4. Memory consumption

In this experiment, we checked the memory usage of MWFIM and MAFIA using the four real datasets. Figs. 17–22 show that MWFIM uses less memory than MAFIA. MWFIM pushes normalized weights deeply into the mining process, and therefore weighted infrequent patterns are not considered in subsequent mining steps. Memory usage is generally proportional to the number of result patterns. In addition, unimportant patterns

Fig. 18. Memory usage (accidents dataset).

Fig. 19. Memory usage (retail dataset).

Fig. 20. Memory usage (BMS-Webview1 dataset).

(FIMI) dataset repository (http://fimi.cs.helsinki.fi/data/). These datasets do not include weight values for their items, and therefore a random generation function is used to generate their weights. Table 3 summarizes the parameter settings, where |T| is the average size of a transaction, |I| is the average size of the maximal potentially large itemsets, |L| is the maximum number of potential frequent patterns, and N is the number of items. As shown in Table 3a and b, we use the synthetic T10I4Dx and TaLbNc datasets. The T10I4Dx datasets contain from 100 K to 5000 K transactions, and the TaLbNc datasets have from 10 K to 40 K items with 100 K or 1000 K transactions. These synthetic datasets were generated with the IBM dataset generator. Our MWFIM algorithm was written in Visual C++. Experiments were performed on a processor operating at 2.40 GHz with 2048 MB of memory on the Microsoft Windows 7 operating system.
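The random weight assignment mentioned above can be sketched as follows (a minimal sketch; the item names, range, and seed are illustrative, and the paper does not specify the exact generator used):

```python
# Assigning uniform random weights to items, as done for the benchmark
# datasets, which carry no weights of their own.
import random

def assign_weights(items, min_w, max_w, seed=None):
    """Give each item a uniform random weight in the normalized range
    [min_w, max_w], e.g. 0.3-0.6 as in the scalability tests."""
    rng = random.Random(seed)
    return {item: round(rng.uniform(min_w, max_w), 2) for item in items}

weights = assign_weights(['a', 'b', 'c', 'd'], 0.3, 0.6, seed=1)
print(weights)  # every weight lies within [0.3, 0.6]
```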



Fig. 21. Memory usage (T10I4Dx dataset).

Fig. 22. Memory usage (TaLbNc dataset).

are not generated, and more important patterns are discovered by considering the weighted support. That is, MWFIM builds smaller prefix trees and bitmap structures because it considers only weighted frequent patterns and removes weighted infrequent items and patterns. For these reasons, MWFIM uses less memory than MAFIA. In Fig. 17, MWFIM uses much less memory space than MAFIA since MWFIM mines far fewer result patterns on the Pumsb dataset, as shown in Fig. 8. In Figs. 21 and 22, we performed an additional scalability test for memory usage. As can be seen from this test, both MWFIM and MAFIA show linear scalability. However, MWFIM scales much better in terms of the number of transactions (100–5000 K) and items (10–40 K).

5. Conclusions

In this paper, we proposed maximal weighted frequent pattern mining, in which a vertical bitmap representation is used for the transaction database. With the proposed MWFIM algorithm, we demonstrated the importance of discovering important patterns among very large sets of resulting maximal patterns, and we defined a maximal weighted frequent pattern. Normalized weights are used according to the importance of the items. Based on our framework, the anti-monotone property is efficiently applied in maximal weighted frequent pattern mining to prune the search space. Our performance tests show that the MWFIM algorithm is more efficient and scalable than MAFIA. In addition, the number of patterns found by MWFIM is several orders of magnitude smaller than the number of patterns discovered by the MAFIA algorithm; by using weights according to the importance of the items, MWFIM detects more important patterns by considering the weighted support of the patterns instead of the support itself. The main contribution of this paper is to incorporate weight constraints into maximal frequent pattern mining. As future work, the suggested techniques for mining maximal weighted frequent patterns can be combined with other structures, such as the FP-tree or tries, to improve performance. In addition, maximal weighted frequent pattern mining techniques can be applied to mine maximal sequential patterns with weight constraints and extended to mine maximal weighted frequent patterns with approximate bounds.

Acknowledgements

This research was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF No. 2012-0003740 and 2012-0000478).

References
[1] R. Agrawal, T. Imielinski, R. Srikant, Mining association rules between sets of items in large databases, in: Proc. ACM SIGMOD, May 1993.
[2] C.F. Ahmed, S.K. Tanbeer, et al., Handling dynamic weights in weighted frequent pattern mining, IEICE Transactions on Information and Systems (November) (2008) 2578–2588.
[3] M. Baena-Garcia, R. Morales-Bueno, Mining interesting measures for string pattern mining, Knowledge-Based Systems 25 (1) (2012) 45–50.
[4] R.J. Bayardo, Efficiently mining long patterns from databases, in: Proceedings of 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, WA, pp. 85–93.


[5] F. Bonchi, C. Lucchese, Pushing tougher constraints in frequent pattern mining, PAKDD, May 2005.
[6] D. Burdick, M. Calimlim, J. Flannick, J. Gehrke, T. Yiu, MAFIA: a maximal frequent itemset algorithm, IEEE Transactions on Knowledge and Data Engineering 17 (11) (2005) 1490–1504.
[7] C.H. Cai, A.W. Fu, C.H. Cheng, W.W. Kwong, Mining association rules with weighted items, in: Proceedings of International Database Engineering and Applications Symposium, IDEAS'98, Cardiff, Wales, UK, 1998, pp. 68–77.
[8] J. Chang, Mining weighted sequential patterns in a sequence database with a time-interval weight, Knowledge-Based Systems 24 (1) (2011) 1–9.
[9] M.S. Chen, J. Han, P.S. Yu, Data mining: an overview from a database perspective, IEEE Transactions on Knowledge and Data Engineering 8 (1996) 866–883.
[10] L. Chen, F. Hsu, M. Chen, Y. Hsu, Developing recommender systems with the consideration of product profitability for sellers, Information Sciences 178 (4) (2008) 1032–1048.
[11] G. Dong, J. Li, Efficient mining of emerging patterns: discovering trends and differences, in: Proceedings of 1999 International Conference on Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, 1999, pp. 43–52.
[12] G. Grahne, J. Zhu, Fast algorithms for frequent itemset mining using FP-trees, IEEE Transactions on Knowledge and Data Engineering 17 (10) (2005) 1347–1362.
[13] J. Ge, Y. Qiu, Z. Chen, Cooperative recommendation system based on ontology construction, in: 7th Int'l Conference on Grid and Cooperative Computing, October 2008, pp. 691–694.
[14] K. Gouda, M.J. Zaki, Efficiently mining maximal frequent itemsets, in: Proc. IEEE Int'l Conf. Data Mining, 2001, pp. 163–170.
[15] M.S. Gu, J.H. Hwang, et al., Mining the weighted frequent XML query pattern, in: IEEE International Workshop on Semantic Computing and Applications, July 2008.
[16] J. Han, M. Kamber, Data Mining: Concepts and Techniques, second ed., Morgan Kaufmann, 2005.
[17] J. Han, J. Pei, Y. Yin, R. Mao, Mining frequent patterns without candidate generation: a frequent-pattern tree approach, Data Mining and Knowledge Discovery 8 (1) (2004) 53–87.
[18] J.H. Hwang, K.H. Ryu, A weighted common structure based clustering technique for XML documents, Systems and Software 83 (7) (2010) 1267–1274.
[19] J. Li, G. Dong, K. Ramamohanarao, L. Wong, DeEPs: a new instance-based lazy discovery and classification system, Machine Learning 54 (2) (2004) 99–124.
[20] A.H. Lim, C.S. Lee, Processing online analytics with classification and association rule mining, Knowledge-Based Systems 23 (3) (2010) 248–255.
[21] T. Li, X. Li, H. Xiao, An effective algorithm for mining weighted association rules in telecommunication networks, in: Int'l Conference on Computational Intelligence and Security Workshops, 2007, pp. 425–428.
[22] Y.C. Li, J.S. Yeh, C.C. Chang, Isolated items discarding strategy for discovering high utility itemsets, Data & Knowledge Engineering 64 (2008) 198–217.
[23] H. Mannila, H. Toivonen, Levelwise search and borders of theories in knowledge discovery, Data Mining and Knowledge Discovery 1 (3) (1997) 241–258.
[24] M. McGlohon, L. Akoglu, C. Faloutsos, Weighted graphs and disconnected components: patterns and a generator, KDD, 2008.
[25] S. Nowozin, K. Tsuda, Weighted substructure mining for image analysis, in: IEEE Conference on Computer Vision and Pattern Recognition, June 2007.
[26] E.R. Omiecinski, Alternative interest measures for mining associations in databases, IEEE Transactions on Knowledge and Data Engineering (2003).
[27] S.B. Patil, Y.S. Kumaraswamy, Intelligent and effective heart attack prediction system using data mining and artificial neural network, European Journal of Scientific Research 31 (4) (2009) 642–656.
[28] J. Pei, G. Dong, W. Zou, J. Han, Mining condensed frequent-pattern bases, Knowledge and Information Systems 6 (5) (2004) 570–594.
[29] J. Pei, J. Han, et al., Mining sequential patterns by pattern-growth: the PrefixSpan approach, IEEE Transactions on Knowledge and Data Engineering (October) (2004).
[30] K. Sun, F. Bai, Mining weighted association rules without pre-assigned weights, IEEE Transactions on Knowledge and Data Engineering 20 (4) (2008).
[31] F. Tao, Weighted association rule mining using weighted support and significance framework, ACM SIGKDD, August 2003.
[32] W. Tsao, A.J.T. Lee, Y. Liu, T. Chang, H. Lin, A data mining approach to face detection, Pattern Recognition 43 (2010) 1039–1049.
[33] Y. Wu, Y. Chen, R. Chang, Mining negative generalized knowledge from relational databases, Knowledge-Based Systems 24 (1) (2011) 134–145.
[34] H. Xiong, S. Shekhar, P.N. Tan, V. Kumar, Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs, ACM SIGKDD, August 2004.
[35] X. Yan, J. Han, gSpan: graph-based substructure pattern mining, IEEE ICDM'02, December 2002.
[36] U. Yun, Mining lossless closed frequent patterns with weight constraints, Knowledge-Based Systems 20 (2007) 86–97.
[37] U. Yun, Efficient mining of weighted interesting patterns with a strong weight and/or support affinity, Information Sciences 177 (17) (2007) 3477–3499.
[38] U. Yun, On pushing weight constraints deeply into frequent itemset mining, Intelligent Data Analysis 13 (2) (2009).
[39] U. Yun, K. Ryu, Approximate weighted frequent pattern mining with/without noisy environments, Knowledge-Based Systems 24 (1) (2011) 73–82.
[40] U. Yun, An efficient mining of weighted frequent patterns with length decreasing support constraints, Knowledge-Based Systems 21 (8) (2008) 741–752.
[41] U. Yun, K. Ryu, Weighted approximate sequential pattern mining within tolerance factors, Intelligent Data Analysis 15 (4) (2011) 551–569.
[42] U. Yun, K. Ryu, Discovering important sequential patterns with length-decreasing weighted support constraints, International Journal of Information Technology and Decision Making 9 (4) (2010) 575–599.
[43] X. Zeng, J. Pei, K. Wang, J. Li, PADS: a simple yet effective pattern-aware dynamic search method for fast maximal frequent pattern mining, Knowledge and Information Systems 20 (3) (2009) 375–391.
