
MINING THE MOST K-FREQUENT ITEMSETS WITH TS-TREE


Savo Tomović and Predrag Stanišić
Faculty of Mathematics and Science, University of Montenegro, Podgorica, Montenegro

ABSTRACT

In this paper we present the TS-Growth algorithm, which combines a pattern-growth approach (Han et al, 2000) with Rymon's set enumeration framework (Rymon, 1992) for mining the k most frequent itemsets. The top-k mining concept has been proposed because it is difficult to predict how many frequent itemsets will be mined with a specified minimum support; instead, a top-k algorithm mines the k most frequent itemsets ordered by their support values. TS-Growth uses a compact data structure called the TS-tree to store candidate itemsets and extracts the k most frequent itemsets directly from this structure. Because the tree stores itemsets from the input dataset together with their supports, we call it a Total Support Tree, or TS-tree. The algorithm requires just two database scans.

KEYWORDS

Top-k mining, frequent itemset mining, association analysis, FP-Growth algorithm

1. INTRODUCTION
Association rules have received a lot of attention in data mining due to their many applications in marketing, advertising, inventory control and other areas. The motivation for discovering association rules came from the need to analyze large amounts of supermarket basket data. A record in such data typically consists of a unique transaction identifier and the items bought in that transaction. Items can be products bought in supermarkets or on-line shops, pieces of car equipment, telecommunication company services, etc. A typical supermarket may well have several thousand items on its shelves, so the number of subsets of the set of items is immense. Even though a single purchase involves only a small subset of this set, the number of such subsets is very large. For example, even if we assume that no customer has more than five items in his shopping cart, there are

$$\sum_{i=1}^{5} \binom{10000}{i}$$

possible contents of this cart, which corresponds to the number of subsets with no more than five items of a set that has 10,000 items, and this is indeed a large number! The supermarket is interested in identifying associations between item sets; for example, it may want to know how many of the customers who bought bread and cheese also bought butter. This knowledge is important because, if it turns out that many of the customers who bought bread and cheese also bought butter, the supermarket can place butter physically close to bread and cheese in order to stimulate the sales of butter. Such a piece of knowledge is especially interesting when there is a substantial number of customers who buy all three items and a large fraction of those who buy bread and cheese also buy butter. For example, the association rule bread ∧ cheese ⇒ butter [support = 20%, confidence = 85%] represents two facts: 1. 20% of all transactions under analysis contain bread, cheese and butter; 2. 85% of the customers who purchased bread and cheese also purchased butter.


The result of association analysis is strong association rules: rules satisfying a minimal support and a minimal confidence threshold. The minimal support and the minimal confidence are input parameters of association analysis. The problem of mining association rules can be decomposed into two sub-problems (Agrawal et al, 1993):

1. Discovering frequent (large) itemsets, i.e. itemsets whose support is greater than the minimal support;
2. Generating rules, i.e. deriving high-confidence (strong) rules from the large itemsets. For each large itemset $l$ one finds all non-empty subsets of $l$; for each such subset $a$ one generates the rule $a \Rightarrow l \setminus a$ if $\mathit{suppcount}(l) / \mathit{suppcount}(a) \geq$ minimal confidence.

We do not consider the second sub-problem in this paper, because the overall performance of mining association rules is determined by the first step. Efficient algorithms for the second sub-problem are presented in (Han and Kamber, 2001) and (Tan et al, 2006).

Many algorithms have been proposed for frequent itemset mining. In many cases they generate a large number of frequent itemsets, often thousands or even millions. It is nearly impossible for end users to comprehend or validate such a large number of frequent itemsets, which limits the usefulness of the data mining results.

In (Stanisic and Tomovic, 2008a) and (Stanisic and Tomovic, 2008b) we proposed the Apriori Multiple algorithm for mining all frequent itemsets. The algorithm uses the candidate-generation-and-test approach, which is the basis of Apriori (Agrawal and Srikant, 1994). The main idea behind Apriori Multiple is to add a parameter multiple_num which determines the length of the algorithm's iterations. If $k_{Max} < \mathit{multiple\_num}$, where $k_{Max}$ is the length of the longest frequent itemset, Apriori Multiple finishes in just two database scans, while the original Apriori (Agrawal and Srikant, 1994) requires $k_{Max} + 1$ database scans.

In this paper we further develop the Apriori Multiple method and present the efficient TS-Growth algorithm for mining the k most frequent itemsets. TS-Growth uses a pattern-growth approach and eliminates the candidate generation phase. Top-k mining algorithms (Fu et al, 2000), (Han et al, 2002), (Hirate et al, 2004) are important because the user does not have to specify a minimum support, which is usually difficult to choose: if the minimum support is too small, a large number of frequent itemsets are generated and many of them are not interesting; if it is too large, interesting itemsets may be missed because they do not have sufficiently high support. Using top-k mining algorithms, users can mine the k most frequent itemsets in descending order of support without specifying a minimum support threshold. Among several modifications of the original top-k definition, we follow the approach from (Hirate et al, 2004).

The remainder of this paper is organized as follows. Section 2 defines the basic concepts of association analysis. Section 3 presents the TS-Growth algorithm. Finally, section 4 contains experimental results.
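For concreteness, here is the rule-generation condition written out for the bread/cheese/butter example above (a worked illustration using the numbers quoted in the introduction):

$$\mathit{confidence}(\{bread, cheese\} \Rightarrow \{butter\}) = \frac{\mathit{suppcount}_T(\{bread, cheese, butter\})}{\mathit{suppcount}_T(\{bread, cheese\})} = 0.85,$$

so the rule is reported as strong whenever the minimal confidence threshold does not exceed 85%.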

2. TERM DEFINITIONS
This section contains definitions that are necessary for the remaining text. We primarily use notions from (Simovici and Djeraba, 2008), which describes mathematical tools for data mining. Suppose that I is a finite set; we refer to the elements of I as items.

Definition 1. A transaction data set on I is a function $T: \{1, \ldots, n\} \rightarrow \mathcal{P}(I)$. The set T(k) is the kth transaction of T. The numbers 1, ..., n are the transaction identifiers (TIDs).

Given a transaction data set T on the set I, we would like to determine those subsets of I that occur often enough as values of T.
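For concreteness, a transaction data set in the sense of Definition 1 can be modelled in C as follows (a minimal sketch, not the paper's code; the names Transaction and Dataset are ours, and items are assumed to be encoded as integer ids):

typedef struct {
    const int *items;   /* item ids of one transaction, sorted ascending */
    int        len;     /* number of items in the transaction            */
} Transaction;

typedef struct {
    Transaction *trans; /* trans[t-1] plays the role of T(t) for TID t   */
    int          n;     /* number of transactions                        */
} Dataset;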


Definition 2. Let $T: \{1, \ldots, n\} \rightarrow \mathcal{P}(I)$ be a transaction data set on a set of items I. The support count of a subset K of the set of items I in T is the number $\mathit{suppcount}_T(K) = |\{ t \mid 1 \leq t \leq n \text{ and } K \subseteq T(t) \}|$. The support of an item set K (in the following text we simply write itemset K) is the number $\mathit{support}_T(K) = \mathit{suppcount}_T(K) / n$.

The following rather straightforward statement is fundamental for the study of frequent itemsets. The proof is presented in order to introduce the anti-monotone property.

Theorem 1. Let $T: \{1, \ldots, n\} \rightarrow \mathcal{P}(I)$ be a transaction data set on a set of items I. If K and K′ are two itemsets, then $K' \subseteq K$ implies $\mathit{support}_T(K) \leq \mathit{support}_T(K')$.

Proof. The theorem states that support has the anti-monotone property: the support of an itemset never exceeds the support of any of its subsets. For the proof, it is sufficient to note that every transaction that contains K also contains K′. The statement of the theorem follows immediately. This theorem is used in the Apriori algorithm for candidate generation, and we will explain this in the next section.

Definition 3. The top-k frequent itemsets are the itemsets having support at least S, where S is the support of the kth itemset in the list of all itemsets sorted by descending support values.
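As an illustration of Definition 2 (our sketch; the function names are not from the paper), the support count of an itemset can be computed in one pass over the transactions, using the Transaction type sketched above:

#include <stdbool.h>

/* Does transaction t contain every item of K? Both item lists are sorted. */
static bool contains(const Transaction *t, const int *K, int klen) {
    int i = 0, j = 0;
    while (i < klen && j < t->len) {
        if      (K[i] == t->items[j]) { i++; j++; }
        else if (K[i] >  t->items[j]) j++;     /* skip items of t not in K */
        else return false;                     /* K[i] cannot occur later  */
    }
    return i == klen;
}

/* suppcount_T(K) of Definition 2; support_T(K) = suppcount(...) / (double)n. */
int suppcount(const Transaction *T, int n, const int *K, int klen) {
    int count = 0;
    for (int t = 0; t < n; t++)
        if (contains(&T[t], K, klen)) count++;
    return count;
}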

3. TS-GROWTH ALGORITHM
The top-k mining concept is important for enhancing the usability of data mining in real applications. TS-Growth is a top-k mining algorithm: it generates itemsets in descending order of support without a minimum support threshold, using only a user-defined threshold k. It finishes in two database scans, which makes it comparable to existing top-k mining algorithms.

Definition 4. The threshold border_sup is the support of the kth 1-itemset in descending support order.

This means that there are at least k 1-itemsets with support not lower than border_sup. Users do not have to be concerned with border_sup: it is an internal threshold and its value is set automatically. TS-Growth uses as primitives only the 1-itemsets whose support is not lower than border_sup, which reduces the number of candidates generated in the following phases. The following lemma proves the correctness of this method.

Lemma 1. If the support of a 1-itemset X is lower than border_sup, X cannot be used to generate any of the k most frequent itemsets.

Proof. Let X be any 1-itemset whose support is lower than border_sup and let Y be any itemset. According to Theorem 1, $\mathit{support}(X \cup Y) \leq \mathit{support}(X) < \mathit{border\_sup}$. Hence the support of any itemset that includes a 1-itemset whose support is lower than border_sup is itself lower than border_sup. By Definition 4, the number of itemsets whose support is not lower than border_sup is greater than or equal to k. Thus, it suffices to consider only the 1-itemsets whose support is not lower than border_sup.

A further modification of Apriori Multiple is to use the TS-tree (Total Support Tree) structure to store all candidate itemsets and their supports in one place. In Apriori Multiple, as in the original Apriori, a hash tree was used to store the candidate j-itemsets for each j. The TS-tree is based on Rymon's set enumeration framework (Rymon, 1992). In order to explain the method, we first define the TS-tree data structure.

Definition 5. Let S be a set and let $d: S \rightarrow \mathbb{N}$ be an injective function. The number d(x) is the index of $x \in S$. For $x \in S$, the view of x is $\mathit{view}(d, x) = \{ s \in S \mid d(s) > d(x) \}$.

Definition 6. Let $T: \{1, \ldots, n\} \rightarrow \mathcal{P}(I)$ be a transaction dataset on a set of items I. The TS-tree corresponding to T is a tree with the following properties: 1. initially, the TS-tree contains only the root node, labelled by the NULL symbol; 2. each node N contains a label from the set I along with a counter that records the number of transactions mapped onto the path from the root to N; each path represents one candidate: the candidate {i1, ..., in} is mapped onto the path NULL → i1 → ... → in, and the counter of the node in on that path is the support count of the itemset {i1, ..., in}; 3. the children of a node N are the nodes from view(d, N), stored in lexicographic order.
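A TS-tree node in the sense of Definition 6 might be represented as follows (a sketch; the field names, the parent pointer, and the fixed fan-out bound are our simplifying assumptions, not part of the paper):

#include <stdlib.h>

#define MAX_FANOUT 64   /* assumed upper bound on |TS-list| */

typedef struct TSNode {
    int  label;                        /* item id; -1 marks the NULL root        */
    long support;                      /* #transactions mapped onto root..this   */
    struct TSNode *parent;             /* convenient for reading a path back     */
    struct TSNode *child[MAX_FANOUT];  /* children in lexicographic label order  */
    int  nchild;
} TSNode;

/* Return the child of R labelled p, or NULL if there is none. */
TSNode *find_child(TSNode *R, int p) {
    for (int i = 0; i < R->nchild; i++)
        if (R->child[i]->label == p) return R->child[i];
    return NULL;
}

/* Create a child of R with the given label and initial support,
   keeping the children sorted by label; no error handling in this sketch. */
TSNode *add_child(TSNode *R, int p, long support) {
    TSNode *N = calloc(1, sizeof *N);
    N->label = p; N->support = support; N->parent = R;
    int i = R->nchild++;
    while (i > 0 && R->child[i-1]->label > p) { R->child[i] = R->child[i-1]; i--; }
    R->child[i] = N;
    return N;
}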


The TS-tree is a compressed representation of the candidate itemsets: candidates are mapped onto paths in the TS-tree, and since different candidates can have several items in common, their paths overlap. The more the paths overlap with one another, the more compression the TS-tree achieves. We will present an efficient procedure that maps each transaction onto the TS-tree and increments the supports of exactly the candidates contained in that transaction; a single database scan is then sufficient to count the supports of all candidates in the TS-tree.

To output the k most frequent itemsets from the TS-tree, a Pruning Array with a boundary_sup threshold is used. The Pruning Array is a modification of the Reduction Array (Hirate et al, 2004).

Definition 7. The threshold boundary_sup is the support of the kth itemset stored in the Pruning Array.

Initially, boundary_sup is set to 0, but its value increases after k itemsets have been generated from the TS-tree. The boundary_sup threshold is set automatically. During the traversal, itemsets are stored in the Pruning Array in descending order of their supports. The algorithm does not visit all nodes of the TS-tree; pruning of the candidate tree is based on the boundary_sup threshold. Before visiting a node N, the algorithm compares the support of N (the support of the candidate corresponding to the path from the root to N) with boundary_sup. If N.support < boundary_sup, the algorithm terminates itemset generation on that path, i.e. it prunes the subtree rooted at N; otherwise it continues generating itemsets from that path. The reason the traversal stops at a node whose support is lower than boundary_sup is the following: by Theorem 1, no itemset generated below N can have support higher than N.support, so if N.support < boundary_sup, no itemset whose prefix is the path to N has support higher than boundary_sup. TS-Growth outputs the k most frequent itemsets in descending order of support. The algorithm is presented in Table 1.
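For illustration (the numbers here are ours, not from the experiments): suppose k = 3 and the Pruning Array currently holds itemsets with supports 90, 70 and 60, so boundary_sup = 60. When the traversal reaches a node N with N.support = 40, the whole subtree rooted at N is pruned, because by Theorem 1 no itemset whose prefix is the path to N can have support above 40, and therefore none can enter the first k positions of the Pruning Array.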
Table 1. TS-Growth algorithm

Algorithm: TS-Growth
Input: a transaction dataset T, k (the number of frequent itemsets)
Output: the k most frequent itemsets (in descending order of support)
Method:
1. scan the transaction dataset T and count the support of all 1-itemsets
2. set border_sup to the support of the kth 1-itemset (in descending order)
3. generate TS-list, which contains the 1-itemsets with support not lower than border_sup, in lexicographic order
4. create the root R of a TS-tree and label it NULL
5. FOR EACH c IN TS-list
       create a new node N as a child of R with label c and support 0
   END FOR
6. FOR EACH transaction t IN T DO
       call traverse_TS-tree(R, t)
       /* maps the candidates contained in t onto paths in the TS-tree and increments their supports */
   END FOR
7. call mine_TS-tree(R, k)

Let us briefly outline the most important steps. Step 1 counts the support of all items from I. The internal threshold border_sup is set in the second step. Then the algorithm creates TS-list, which contains the 1-itemsets whose support is not lower than border_sup; the itemsets from TS-list are the primitives for the TS-tree construction (see Lemma 1).


Instead of a candidate generation phase, TS-Growth stores all candidates in the TS-tree during the second (and last) database scan. In step 4 the algorithm initializes the TS-tree: it creates the root of the tree and labels it NULL (see Definition 6). In step 5 the first level of the TS-tree is generated; it contains the 1-itemsets from TS-list (see Lemma 1). In step 6 the algorithm performs the second database scan, during which it completes the TS-tree and simultaneously counts the supports of the candidates contained in the input dataset; each candidate is represented by one path in the TS-tree. For each transaction t from T the algorithm calls traverse_TS-tree(R, t), which maps t onto paths in the TS-tree with root node R and increments the supports of the candidates represented by these paths (the candidates contained in t). Function traverse_TS-tree is presented in Table 2.
Table 2. Function traverse_TS-tree from TS-Growth algorithm

function traverse_TS-tree(TS-tree R, transaction t)
1. let the items in t be [p|P]
2. IF p is in TS-list THEN
       IF R has a child N with label p THEN
           increment N's support
       ELSE
           create a new node N as a child of R with label p and support 1
       END IF
       IF P ≠ ∅ THEN
           call traverse_TS-tree(N, P)   /* candidates that extend the path with p */
       END IF
   END IF
3. IF P ≠ ∅ THEN
       call traverse_TS-tree(R, P)       /* candidates that do not contain p */
   END IF

Finally, in step 7 the algorithm calls the function mine_TS-tree(R, k), which extracts the k most frequent itemsets from the TS-tree with root node R. It is sufficient to traverse the tree just once, level by level. During the traversal, frequent itemsets are stored in the Pruning Array in descending order of their supports. The algorithm does not visit all nodes of the TS-tree; pruning of the candidate tree is based on the boundary_sup threshold, as explained above. Function mine_TS-tree is presented in Table 3.
Table 3. Function mine_TS-tree from TS-Growth algorithm

function mine_TS-tree(TS-tree R, integer k)
1. set boundary_sup to 0
2. next = R
3. INSERT_QUEUE(Q, next)
4. WHILE NOT EMPTY_QUEUE(Q) DO
       next = DELETE_QUEUE(Q)
       IF next.support >= boundary_sup THEN
           create a new itemset f containing as items the node labels from the root to the node next
           set the support of f to next.support
           insert f into the Pruning Array, keeping descending order of supports
           IF the Pruning Array contains at least k elements THEN
               set boundary_sup to the support of the kth element in the Pruning Array
           END IF
           FOR EACH n IN the children of next DO
               INSERT_QUEUE(Q, n)
           END FOR
       END IF
   END WHILE
5. print the first k elements of the Pruning Array
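Under the same assumptions as the sketches above, Tables 2 and 3 might translate into C roughly as follows. This is an illustrative sketch, not the paper's implementation: in_ts_list is an assumed membership test for TS-list, the queue and the Pruning Array are fixed-size arrays, and the NULL root (which represents the empty itemset) is skipped by enqueuing its children directly.

#include <stdio.h>

extern int in_ts_list(int p);    /* assumed: is item p in TS-list? */

/* Table 2: map the transaction tail items[0..len-1] onto the tree under R. */
void traverse_ts_tree(TSNode *R, const int *items, int len) {
    if (len == 0) return;
    int p = items[0];                              /* t = [p|P] */
    if (in_ts_list(p)) {
        TSNode *N = find_child(R, p);
        if (N) N->support++;                       /* existing path gains one transaction */
        else   N = add_child(R, p, 1);             /* new node created with support 1 */
        traverse_ts_tree(N, items + 1, len - 1);   /* candidates that extend the path with p */
    }
    traverse_ts_tree(R, items + 1, len - 1);       /* candidates that skip p */
}

/* Print the itemset on the path root..N by following parent links. */
static void print_path(const TSNode *N) {
    if (N->parent) { print_path(N->parent); printf("%d ", N->label); }
}

#define CAP (1 << 16)   /* assumed capacity of the queue and the Pruning Array */

/* Table 3: level-order traversal with boundary_sup pruning. */
void mine_ts_tree(TSNode *R, int k) {
    static TSNode *queue[CAP]; int head = 0, tail = 0;
    static TSNode *pa[CAP];    int pa_len = 0;     /* Pruning Array */
    long boundary_sup = 0;
    for (int c = 0; c < R->nchild; c++)            /* start from the root's children */
        queue[tail++] = R->child[c];
    while (head < tail) {
        TSNode *next = queue[head++];
        if (next->support < boundary_sup) continue;   /* prune the subtree rooted at next */
        int i = pa_len++;                          /* insert, keeping descending support order */
        while (i > 0 && pa[i-1]->support < next->support) { pa[i] = pa[i-1]; i--; }
        pa[i] = next;
        if (pa_len >= k) boundary_sup = pa[k-1]->support;
        for (int c = 0; c < next->nchild; c++)
            queue[tail++] = next->child[c];
    }
    for (int i = 0; i < k && i < pa_len; i++) {    /* output the k most frequent itemsets */
        print_path(pa[i]);
        printf("(support %ld)\n", pa[i]->support);
    }
}

Note that traverse_ts_tree explores both branches for every item, so a transaction of length m can touch up to 2^m paths; this subset enumeration is inherent to Table 2, while the boundary_sup pruning only reduces the work of the output phase.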


4. EXPERIMENTAL RESULTS
We implemented the TS-Growth algorithm in C in order to evaluate its performance experimentally. The experiments were performed on a PC with an Intel(R) Pentium(R) Dual 1.86 GHz processor. In the following experiments, the length of the frequent itemsets is controlled by an additional parameter m: we aim to find the top k frequent itemsets that contain at most m items.
[Bar chart: execution time in seconds for k = 10, 20, 40, 100 at m = 4; series: Top-k Apriori Multiple and Top-k FP-growth]

Figure 1. Fixed m = 4, varying k = 10, 20, 40, 100, using D100k.T10.M1k

[Bar chart: execution time in seconds for k = 10, 20, 40, 100 at m = 6; series: Top-k Apriori Multiple and Top-k FP-growth]

Figure 2. Fixed m = 6, varying k = 10, 20, 40, 100, using D100k.T10.M1k

We label the datasets used in the experiments by Dx.Ty.Mz, where D refers to the number of transactions, T to the average number of items per transaction, and M to the number of different items. Thus the dataset Dx.Ty.Mz contains x transactions with y items on average, over z different items. In the following experiments we measured total execution time, i.e. the period between input and output, rather than the CPU time measured in some of the literature. We compare the performance of TS-Growth with that of Top-k FP-Growth (Hirate et al, 2004), which we implemented in C to the best of our knowledge of that algorithm. In the first two experiments we fixed the parameter m. Figures 1 and 2 show that TS-Growth performs better in all test cases, because it has good pruning power and does not need to construct conditional sub-trees recursively. In the experiments of figures 3 and 4 we fixed k and varied m. Again, TS-Growth outperforms Top-k FP-Growth in all test cases.


Finally, we varied the number of transactions. Figure 5 shows that TS-Growth performs much better for all dataset sizes.
[Bar chart: execution time in seconds (up to 1.2 s) for m = 1, ..., 7 at k = 20; series: Top-k Apriori Multiple and Top-k FP-growth]

Figure 3. Fixed k = 20, varying m = 1, 2, 3, 4, 5, 6, 7, using D100k.T10.M1k

[Bar chart: execution time in seconds for m = 1, ..., 7 at k = 100; series: Top-k Apriori Multiple and Top-k FP-growth]

Figure 4. Fixed k = 100, varying m = 1, 2, 3, 4, 5, 6, 7, using D100k.T10.M1k

[Line chart: execution time in seconds (up to 1.2 s) versus the number of transactions, from 20,000 to 100,000; series: Top-k Apriori Multiple and Top-k FP-growth]

Figure 5. Varying the number of transactions D (k = 20, m = 5, M = 1k, T = 10)


5. CONCLUSION
We have presented and implemented an algorithm for mining the k most interesting itemsets without a minimum support threshold. The algorithm performs especially well for small k, and it outperforms the Top-k FP-growth algorithm (Hirate et al, 2004), the best-known algorithm proposed for the same problem.

REFERENCES
Agrawal R. et al, 1993. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD Conference, Washington DC, USA, pp. 207-216.
Agrawal R., Srikant R., 1994. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th VLDB Conference, Santiago, Chile, pp. 487-499.
Fu A. et al, 2000. Mining N-most Interesting Itemsets. In Proceedings of ISMIS 2000, Charlotte, NC, USA, pp. 59-67.
Han J., Kamber M., 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, USA.
Han J. et al, 2000. Mining Frequent Patterns without Candidate Generation. In Proceedings of the ACM SIGMOD Conference on Management of Data, Dallas, Texas, USA, pp. 1-12.
Han J. et al, 2002. Mining Top-k Frequent Closed Patterns without Minimum Support. In Proceedings of the IEEE ICDM Conference on Data Mining, Maebashi City, Japan, pp. 211-219.
Hirate Y. et al, 2004. TF2P-growth: An Efficient Algorithm for Mining Frequent Patterns without any Thresholds. IEEE ICDM 2004 Workshop on Alternative Techniques for Data Mining and Knowledge Discovery, Brighton, UK.
Rymon R., 1992. Search through Systematic Set Enumeration. In Proceedings of the 3rd International Conference on Principles of Knowledge Representation and Reasoning, pp. 539-550.
Simovici D. A., Djeraba C., 2008. Mathematical Tools for Data Mining. Springer-Verlag London Limited, London, UK.
Stanišić P., Tomović S., 2008a. Apriori Multiple Algorithm for Mining Association Rules. Information Technology and Control, Vol. 37, No. 4, pp. 311-320.
Stanišić P., Tomović S., 2008b. Mining Association Rules from Transactional Databases and Apriori Multiple Algorithm. In Proceedings of the IADIS International Conference WWW/Internet 2008, Freiburg, Germany, pp. 227-234.
Tan P. et al, 2006. Introduction to Data Mining. Addison Wesley, Boston, USA.
