Expert Systems with Applications 41 (2014) 5105–5114
Journal homepage: www.elsevier.com/locate/eswa

Novel techniques and an efficient algorithm for closed pattern mining

András Király a,*, Asta Laiho b, János Abonyi a, Attila Gyenesei b
a University of Pannonia, Department of Process Engineering, P.O. Box 158, Veszprem H-8200, Hungary
b The Finnish Microarray and Sequencing Centre, Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, Tykistökatu 6A, 20520 Turku, Finland
Keywords: Biclustering; Closed frequent itemset mining; Clustering visualization; Data mining algorithm; Pattern detection

Abstract
In this paper we show that frequent closed itemset mining and biclustering, the two most prominent application fields in pattern discovery, can be reduced to the same problem when dealing with binary (0–1) data. FCPMiner, a new powerful pattern mining method, is then introduced to mine such data efficiently. The uniqueness of the proposed method is its extendibility to non-binary data. The mining method is coupled with a novel visualization technique and a pattern aggregation method to detect the most meaningful, non-overlapping patterns. The proposed methods are rigorously tested on both synthetic and real data sets.
© 2014 Elsevier Ltd. All rights reserved.
1. Introduction
Very large datasets have lately become increasingly common in many application areas, making it impossible to inspect data manually when looking for interesting patterns and knowledge (Tan, Steinbach, & Kumar, 2006). Data mining as a field aims at developing computational methods and tools that can be used for automating knowledge extraction to aid decision making or to understand general trends within data collections (Kantardzic, 2002). Pattern discovery, an interesting subfield of data mining, is motivated by the huge amounts of electronic data that many organizations produce on their functions or collect during experiments within various research fields. For instance, supermarkets store electronic copies of millions of receipts, banks and credit card companies maintain extensive transaction histories, and biomedical and bioinformatics research groups collect data from various biological experiments. The goal of pattern discovery in general is to analyze these large datasets to identify informative patterns and motifs (Berry & Linoff, 1997).
Frequent pattern discovery is an actively researched data mining technique with a wide range of applications (Han, Cheng, Xin, & Yan, 2007). Market basket data analysis in web-shops, web search analysis, frequent symptom set mining in health-care and proper product placement in supermarkets are all well-known applications of association rule mining, where pattern matching is used to predict the behavior of customers or patients. Using adjacency matrices, strongly connected components can be revealed in social networks, and with the help of collaborative filtering automated suggestions can be made for users, taking into account the duality between users and items. In the field of genetics, the huge amount of data from gene expression analysis can be handled by efficient pattern mining algorithms to uncover local motifs and interesting genetic pathways which are not apparent otherwise.

* Corresponding author. Tel.: +36 88624770; fax: +36 88623171.
E-mail address: kiralya@fmt.uni-pannon.hu (A. Király).
http://dx.doi.org/10.1016/j.eswa.2014.02.029
0957-4174/© 2014 Elsevier Ltd. All rights reserved.

Frequent itemset mining methods usually generate a huge number of patterns or association rules, so the result of the data mining is not directly applicable. One possible solution to get a compact and informative set of itemsets is closed itemset mining (Agrawal, Imielinski, & Swami, 1993; Brin, Motwani, Ullman, & Tsur, 1997). During the last decade many frequent closed itemset mining methods have been proposed in the literature. The problem was first introduced by Pasquier, Bastide, Taouil, and Lakhal (1999), together with the first algorithm for mining closed itemsets, called A-Close. Other closed itemset mining algorithms include CLOSET (Pei, Han, & Mao, 2000), CHARM (Zaki & Hsiao, 1999), CLOSET+ (Wang, Han, & Pei, 2003), FPClose (Grahne & Zhu, 2003), AFOPT (Liu, Lu, Lou, & Yu, 2003), DCI_Closed (Lucchese, Orlando, & Perego, 2006), DBV-Miner (Rodríguez-González, Martínez-Trinidad, Carrasco-Ochoa, & Ruiz-Shulcloper, 2013), DBV-Miner (Vo, Hong, & Le, 2012) and ClaSP (Gomariz, Campos, Marin, & Goethals, 2013), just to mention the most applied ones. Very recent publications are presented in Zhou, Cule, and Goethals (2013), Riondato and Upfal (2013), and Cule, Goethals, and Hendrickx (2013), and methods for the approximation of true frequent itemsets can be found in Riondato and Vandin (2014) and Riondato and Upfal (2013). For comprehensive reviews of the efficient algorithms, see Fimi (2003), Fimi (2004) and Duneja and Sachan (2012).
Based on a critical analysis of the literature, we think that, besides contributions aiming at performance improvements, there is a place for a novel approach that is not only accurate and effective but also gives more insight into the hidden structure of the itemsets. Therefore we are looking for a novel method that is intuitive, supports visualization and allows further aggregation of the found patterns.
Our main idea is that the problem of finding closed frequent itemsets can be considered as mining biclusters in binary data, so the favorable properties of the biclustering representation will ensure the required interpretability.
The fields of frequent closed itemset mining and biclustering were developed independently. Biclustering has been introduced to complement and expand the capabilities of the standard clustering methods by allowing objects to belong to multiple or none of the resulting clusters purely based on their similarities. This property makes biclustering a powerful approach especially when it is applied to data with a large number of objects. During recent years, many biclustering algorithms have been developed, especially for the analysis of gene expression data. The concept of biclustering was first introduced in Hartigan (1972), and applied to gene expression data by Cheng and Church (2000). Many other such algorithms have been published since, the most frequently referenced ones including BiMAX (Prelic et al., 2006), QUBIC (Li, Ma, Tang, Paterson, & Xu, 2009), BiBit (Rodriguez-Baena, Perez-Pulido, & Aguilar-Ruiz, 2011), Signature Algorithm (Ihmels, Bergmann, & Barkai, 2004), xMotif (Murali & Kasif, 2003), OPSM (Ben-Dor, Chor, Karp, & Yakhini, 2003) and many others (e.g. Abdullah & Hussain (2006), Uitert, Meuleman, & Wessels (2008) and Cheng, Law, & Siu (2013)). For comprehensive reviews, see Busygin, Prokopyev, and Pardalos (2008), Kriegel, Kröger, and Zimek (2009), Bozdag, Kumar, and Çatalyürek (2010), Eren, Deveci, Küçüktunç, and Çatalyürek (2013) and Freitas, Ayadi, Elloumi, Oliveira, and Hao (2013). Biclustering has been mainly used for gene expression data analysis with the aim of discovering gene expression patterns, or so-called biclusters (Madeira & Oliveira, 2004). In this case, instead of using the original real-valued data, the input is discretized in order to reduce the dimensionality and enable reasonable processing times. Typically the data is transformed into a binary matrix containing only 0 and 1 values prior to applying a mining algorithm.
One of our main goals is to show that frequent closed itemset mining and biclustering of binary data can be transformed to the same problem and therefore, all existing methods for mining such patterns in binary data can in fact be applied to both fields. This finding might also help researchers to identify new research directions. Indeed, as a first step in this direction we will extend the problem of mining patterns in binary data by introducing a novel and efficient method that is able to discover previously hidden patterns with more than two value categories.
Since data mining often produces a large number of small, frequent and partially overlapping patterns, the interpretation of the results is a considerable challenge. Therefore, we introduce a novel visualization that rearranges the original data matrix based on the discovered closed patterns, and a pattern aggregation method allowing the rapid identification of the most meaningful pattern clusters.
The contributions of this work will be discussed in the following subsections as novel methods and algorithms:
• After introducing and defining the problems of biclustering and frequent closed itemset mining on binary data, we show that they can be reduced to the same problem (Sections 2.1–2.3). We prove this both theoretically (Section 2.3) and experimentally (Section 4) on various synthetic and real data sets.
• We extend the problem of mining frequent closed patterns (FCP) to more than two value categories. This is especially important when FCP mining methods are applied to large biological data sets, such as the gene expression data that are commonly obtained nowadays by microarray and next-generation sequencing instruments (see Section 3.1).
• We propose a novel algorithm, called FCPMiner, to mine FCPs efficiently. We rigorously test FCPMiner on various real and synthetic data sets and show that FCPMiner is a powerful method for mining FCPs (see Section 3.1.1).
• We introduce a novel visualization as well as a pattern aggregation method to enable the quick identification of the most relevant FCPs (see Sections 3.2 and 3.3).
• We implemented our algorithms in the Java programming language, so they can be run on any operating system (Windows, Linux, Mac OS). The implemented methods are freely available on the project website (footnote 3).
2. Problem formulation
2.1. Biclustering
In this paper we follow the formulation given in Prelic et al. (2006) to define the problem of mining biclusters in gene expression data. According to common practice in the field, bicluster mining is restricted to a binary matrix, i.e. gene expression values are transformed to 1 (expressed) or 0 (not expressed) using an expression cutoff (Li et al., 2009; Prelic et al., 2006). Let $E \in \{0,1\}^{n \times m}$ be an expression matrix representing the outcome of m experiments for n genes. A cell $e_{ij}$ contains 1 whenever gene i is expressed in condition j and 0 otherwise. A bicluster $(G, C)$ corresponds to a subset of genes $G \subseteq \{1, \dots, n\}$ that jointly respond across a subset of samples $C \subseteq \{1, \dots, m\}$. Therefore, the bicluster $(G, C)$ is a submatrix of E in which all elements are equal to 1 (biclusters in a small data set are depicted in Fig. 1 by bold numbers). Using the above definition, every cell $e_{ij}$ with value 1 represents a bicluster by itself. However, such patterns are usually redundant as they are entirely contained in other patterns. Thus, the definition of the inclusion-maximal bicluster (IMB) was introduced to discover all biclusters not entirely contained in any other bicluster (Prelic et al., 2006): the pair $(G, C) \in 2^{\{1,\dots,n\}} \times 2^{\{1,\dots,m\}}$ is an IMB if and only if

$$\forall i \in G,\ j \in C:\ e_{ij} = 1$$

and

$$\nexists\, (G', C') \in 2^{\{1,\dots,n\}} \times 2^{\{1,\dots,m\}} \ \text{where}\ \forall i' \in G',\ j' \in C':\ e_{i'j'} = 1 \ \text{and}\ G \subseteq G' \wedge C \subseteq C' \wedge (G', C') \neq (G, C).$$
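The IMB definition can be checked directly on a toy matrix. The following is a minimal brute-force sketch, not the BiMAX algorithm: the function names and the 3 × 3 example matrix are illustrative assumptions, indices are 0-based, and the enumeration is exponential, so it is only meant for very small inputs. It relies on the observation that an all-ones submatrix is inclusion-maximal exactly when no single outside row or column can be added to it.

```python
from itertools import combinations


def nonempty_subsets(n):
    """Yield every non-empty subset of {0, ..., n-1} as a frozenset."""
    for size in range(1, n + 1):
        for combo in combinations(range(n), size):
            yield frozenset(combo)


def all_ones(E, rows, cols):
    """True if every cell of the submatrix (rows, cols) equals 1."""
    return all(E[i][j] == 1 for i in rows for j in cols)


def inclusion_maximal_biclusters(E):
    """Enumerate all IMBs of a small binary matrix by brute force."""
    n, m = len(E), len(E[0])
    result = set()
    for G in nonempty_subsets(n):
        for C in nonempty_subsets(m):
            if not all_ones(E, G, C):
                continue
            # Maximality: no outside row or column keeps the block all-ones.
            can_add_row = any(all_ones(E, [i], C) for i in range(n) if i not in G)
            can_add_col = any(all_ones(E, G, [j]) for j in range(m) if j not in C)
            if not can_add_row and not can_add_col:
                result.add((G, C))
    return result


# Hypothetical toy expression matrix (rows = genes, columns = conditions).
E = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
imbs = inclusion_maximal_biclusters(E)
```

For this toy matrix the single cells equal to 1 are biclusters but not IMBs, because each of them is contained in a larger all-ones block.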
2.2. Frequent closed itemset mining
One of the earliest and most important concepts in data mining is mining frequent itemsets in large transactional datasets (Lucchese, Orlando, & Perego, 2010). Such a dataset can be considered as a matrix with transactions as rows and items as columns. If an item appears in a transaction it is denoted by 1, otherwise by 0. The general goal of frequent itemset mining is to identify all itemsets that are contained in at least as many transactions as required, referred to as the minimum support threshold, min_sup. By definition, all subsets of a frequent itemset are frequent. Therefore, it is also important to provide a minimal representation of all frequent itemsets without losing their support information. Such itemsets are called frequent closed itemsets and can be defined as follows. Let $\sigma(x) = |\{t_i : x \subseteq t_i,\ t_i \in T\}|$ denote the support count of itemset x. An itemset x is closed if none of its immediate supersets has exactly the same support count as x. Conversely, the itemset x is not closed if at least one of its immediate supersets has the same support count as x. Obviously, $\forall x, y:\ \sigma(x) \ge \sigma(y)$ if $x \subseteq y$. Finally, an itemset is a frequent closed itemset (FCI) if it is closed and frequent.

Fig. 1. A simple example illustrating how the proposed method works. The process flow is marked by solid arrows, while the recursive steps are highlighted by dashed arrows. A bold cross sign indicates that the investigated pattern is not closed, or that it does not satisfy the minimum support conditions for rows or columns. The discovered frequent closed patterns are surrounded by solid rectangles.

2.3. Connection between biclustering and frequent closed itemset mining
We show here that biclustering and frequent closed itemset mining can in fact be reduced to the same problem when working on a binary data matrix of size $n \times m$. The set of transactions T in frequent itemset mining can be considered as the set of genes G in biclustering, and the set of items I as the set of samples C. The min_sup threshold in frequent itemset mining corresponds to the min_rows constraint in biclustering. There is no constraint in frequent itemset mining that corresponds to min_cols in biclustering, but this can be overcome by setting min_cols to 1 in the biclustering or by filtering the final result of frequent itemset mining, i.e. the itemsets with fewer than min_cols items are removed to match the constraints.
Next, the correspondence between the closeness of frequent itemsets and the inclusion-maximality of biclusters needs to be verified. For this, let us assume that FCI mining with min_sup = k and mining of IMBs with min_rows = k and min_cols = 1 produce different results. This is only possible if (a) $\exists\, x_c$ which is an FCI but not an IMB, or if (b) $\exists\, (G_b, C_b)$ which is an IMB but not an FCI. We will prove (a) here; the proof of (b) is almost the same. Let $x_c$ be an FCI with $|x_c| = p$ and $\sigma(x_c) = q \ge k$. Then $\nexists\, y,\ x_c \subset y$, where $\sigma(x_c) \le \sigma(y)$. Let $|y| = p'$ and $\sigma(y) = q'$. Since a bicluster is simply a subset of rows and a subset of columns, $x_c$ corresponds to a bicluster $(G, C)$ of size $q \times p$ and, similarly, y corresponds to a bicluster $(G', C')$ of size $q' \times p'$. By our assumption on $x_c$, $\nexists$ a bicluster $(G', C')$ with $C' \supset C$ and $G' \subseteq G$ whose support is not smaller than that of $(G, C)$, which yields that $\nexists\, (G', C') \supset (G, C)$. Therefore, by the definition of IMB, $x_c$ is not only an FCI but also an IMB, which contradicts (a).
From now on, we will use the general terms closed pattern (CP) and frequent closed pattern (FCP) within this paper.

3. Biclustering based closed pattern mining techniques

3.1. Mining closed patterns
In this section we first propose a new method for mining closed patterns (i.e. frequent closed itemsets and biclusters) for data matrices with up to three values: −1, 0, 1. This is an extension of the special binary case and is therefore applicable to both data types. The benefit of this kind of general approach has been presented in Gyenesei, Wagner, Barkow-Oesterreicher, Stolte, and Schlapbach (2007) using gene expression data. The key benefit of the generalized method is the gained ability to make a distinction between up-regulated and down-regulated genes and thus discover previously hidden closed patterns (Cano, García, López, & Blanco, 2009; Gyenesei et al., 2007; Wu, Huang, Horng, & Huang, 2010). At the end of the section we will show how traditional methods developed only for binary data could also be applied to (−1, 0, 1) data at the cost of performing a few additional matrix transformation steps and post-filtering to remove duplicated patterns.
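Before turning to the algorithm, the support-based definition of a frequent closed itemset from Section 2.2 can be made concrete on a toy binary matrix. This is a minimal brute-force sketch under stated assumptions: the function names and the example matrix are hypothetical, indices are 0-based, and the code is not FCPMiner or any of the cited miners. Each closed itemset is returned together with its supporting rows, i.e. in the (rows, columns) form of a bicluster used in Section 2.3.

```python
from itertools import combinations


def support_rows(E, itemset):
    """Rows (transactions) that contain every item of the itemset."""
    return frozenset(
        i for i, row in enumerate(E) if all(row[j] == 1 for j in itemset)
    )


def frequent_closed_itemsets(E, min_sup):
    """Brute-force FCIs of a small binary matrix as (rows, items) pairs."""
    m = len(E[0])
    found = set()
    for size in range(1, m + 1):
        for combo in combinations(range(m), size):
            x = frozenset(combo)
            rows = support_rows(E, x)
            if len(rows) < min_sup:
                continue  # not frequent
            # Closed: every immediate superset has strictly smaller support.
            closed = all(
                len(support_rows(E, x | {j})) < len(rows)
                for j in range(m)
                if j not in x
            )
            if closed:
                found.add((rows, x))
    return found


# Hypothetical toy data: rows are transactions (genes), columns are items.
E = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
fcis = frequent_closed_itemsets(E, min_sup=2)
```

With min_sup set to 1, the four pairs returned for this matrix are all-ones blocks that cannot be extended by any row or column, which is the inclusion-maximality condition of Section 2.1.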
3.1.1. An efficient algorithm for closed pattern mining
The proposed method consists of two procedures and one function to discover all frequent closed patterns:
• FCPMain is the main procedure (Algorithm 1). First the procedure takes the three input parameters (input data matrix and minimum support thresholds) before encoding the input matrix (A) into a smaller data structure (B) by taking only non-zero matrix values as follows:

$$B = \{b_i\}, \quad \text{where}\ b_i = \{j : A(i, j) \neq 0\} \qquad (1)$$

  Note that the transformation in Eq. 1 corresponds to the classical representation of transaction databases defined within the problem of mining frequent itemsets, where $b_i$ represents the ith transaction. The procedure then independently calls the recursive miner procedure FCPMiner for each row of B. Note that the parameter vector missingRows stores the indices of those rows which are not examined in the actual call (they have been checked before). This is important as closeness will be checked based on these indices by the IsClosed function.
• FCPMiner is the heart of the method, recursively building up the frequent closed patterns (Algorithm 3). This is done by taking the consecutive rows one by one and recording only those column indices that show the same changing tendency (same or exactly the opposite). Then the closeness of the candidate pattern is checked before the method calls itself with the updated parameters. Finally, the newly discovered patterns are added to the output set of frequent closed patterns.
• IsClosed is a simple function to check whether adding a new row index to the candidate pattern would result in a closed pattern (Algorithm 2). This is done by checking whether there is a row in missingRows that contains the same column indices with the same changing tendency as in the pattern under examination. If no such row can be found then the pattern is already a closed one.
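The encoding of Eq. 1 can be sketched in a few lines. The function name and the example matrix below are illustrative assumptions, and column indices are 0-based here, whereas the paper counts from 1.

```python
def encode_nonzero_columns(A):
    """Eq. 1 as a sketch: for each row i, keep the set b_i of column
    indices j with A[i][j] != 0 (0-based indices)."""
    return [{j for j, value in enumerate(row) if value != 0} for row in A]


# Hypothetical {-1, 0, 1} input matrix.
A = [
    [1, 0, -1],
    [0, 0, 1],
]
B = encode_nonzero_columns(A)  # one "transaction" of column indices per row
```

Only the positions of the non-zero cells are stored in B, so the signs still have to be read from A itself, which matches the pseudocode keeping both A and B as global data.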
Algorithm 1. FCPMain: Main procedure for mining closed patterns
Require: A: input discrete matrix
    minrows: minimum number of rows in a frequent closed pattern
    mincols: minimum number of columns in a frequent closed pattern
Ensure: Y: list of all closed frequent patterns
global A; Y ← {}; minrows; mincols; B
MissingRows ← {}
Transform A (n × m) into data structure B
for every row R_i ∈ B where i = 1 … n − minrows do
    if i > 1 then
        MissingRows ← MissingRows ∪ {i − 1}
    end if
    if |R_i| ≥ mincols then
        if i = 1 or IsClosed(MissingRows, R_i, i) then
            FCPMiner(MissingRows, {i}, R_i)
        end if
    end if
end for
return Y
Algorithm 2. IsClosed function
Require: missingRows: indices of previously examined rows (omitted)
    actualCols: current column indices under examination
    actualRow: actual row index under examination
Ensure: boolean: is this candidate frequent pattern closed?
global A
for every index i in missingRows do
    if A[i,j] · A[k,j] = 1 ∀j ∈ actualCols, ∀k ∈ actualRow or
       A[i,j] · A[k,j] = −1 ∀j ∈ actualCols, ∀k ∈ actualRow then
        return false
    end if
end for
return true
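The closeness check can be sketched in Python as follows. This is an illustration that follows the prose description of IsClosed (a candidate is closed only if no previously examined row shows the same or exactly the opposite tendency on all of its columns); the function signature, the 0-based indices and the example matrix are assumptions, not the authors' Java implementation.

```python
def is_closed(A, missing_rows, actual_cols, actual_row):
    """Return False if some previously examined row matches the candidate
    row on every column of actual_cols with a constant sign product."""
    for i in missing_rows:
        products = {A[i][j] * A[actual_row][j] for j in actual_cols}
        # {1}: identical tendency everywhere, {-1}: exactly opposite everywhere.
        if products == {1} or products == {-1}:
            return False
    return True


# Hypothetical {-1, 0, 1} matrix: row 2 is the exact opposite of row 0.
A = [
    [1, -1, 1],
    [1, -1, 0],
    [-1, 1, -1],
]
```

A product of 0 (one of the two cells is zero) breaks the match, so a row with a missing value on any candidate column cannot make the candidate non-closed.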
Algorithm 3. FCPMiner procedure
Require: missingRows: indices of previously examined rows (omitted)
    candidateRows: set of row indices in a candidate closed frequent pattern
    actualCols: actual column indices under examination
global A; Y; minrows; mincols; B
for every row index i in {B's row indices} \ candidateRows do
    actIndices ← actualCols ∩ B_i
    change(+1) ← {j}, where A[i,j] · A[candidateRows(1),j] = 1, j ∈ actIndices
    change(−1) ← {j}, where A[i,j] · A[candidateRows(1),j] = −1, j ∈ actIndices
    if |actualCols| = |change(+1)| or |actualCols| = |change(−1)| then
        candidateRows ← candidateRows ∪ {i}
    else
        if |change(+1)| ≥ mincols then
            if IsClosed(missingRows, change(+1), i) then
                FCPMiner(missingRows, candidateRows ∪ {i}, change(+1))
            end if
        end if
        if |change(−1)| ≥ mincols then
            if IsClosed(missingRows, change(−1), i) then
                FCPMiner(missingRows, candidateRows ∪ {i}, change(−1))
            end if
        end if
        missingRows ← missingRows ∪ {i}
    end if
end for
if |candidateRows| ≥ minrows then
    Y ← Y ∪ {(candidateRows, actualCols)}
end if
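The core step of FCPMiner, splitting the current columns by the change between the reference row and a new row, can be sketched as below. This covers only that single step, not the full recursive miner; the function name, argument order and example data are assumptions, and indices are 0-based.

```python
def split_by_change(A, ref_row, new_row, cols):
    """Split cols into those where new_row has the same sign as ref_row
    (product 1) and those where it has the opposite sign (product -1).
    Columns where either value is 0 fall into neither set."""
    same = {j for j in cols if A[new_row][j] * A[ref_row][j] == 1}
    opposite = {j for j in cols if A[new_row][j] * A[ref_row][j] == -1}
    return same, opposite


# Hypothetical two-row example over the candidate columns {0, 1, 2}.
A = [
    [1, -1, 1, 0],
    [1, 1, -1, 1],
]
same, opposite = split_by_change(A, ref_row=0, new_row=1, cols={0, 1, 2})
```

If one of the two sets equals the full set of current columns, the new row simply joins the candidate pattern; otherwise each sufficiently large set would seed a recursive call on the reduced column set, as in the pseudocode.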
3.1.2. A simple example
Fig. 1 shows a simple example to illustrate how the proposed method works. Here the minimum support thresholds have been set to 2 for both rows and columns. The method starts by transforming the input matrix into a smaller data structure by taking only non-zero matrix values. Then the recursive miner procedure, FCPMiner, is called for each row (Steps 1, 9, 12). Subsequent row indexes are added to the candidate pattern as long as the calculated changes between the first row and the added row are identical or opposite for all column values, i.e. 1 or −1. For example, in Step 2, the change between the values of column indexes 1, 2, 3, 4 and 6 is always 1 and therefore, row 2 ($r_2$) is added to the first row ($r_1$) with column indexes 1, 2, 3, 4, 6. This pattern is a valid frequent closed pattern as it is not contained in any other closed pattern. In Step 3, a new recursion is initiated for $r_1, r_2, r_3$ because only a subset of columns, 2, 3, 4, gives the same change (−1) between the first and the third row. This pattern is also a valid frequent closed pattern. The same applies to the patterns at Steps 5 and 10. During the mining process there are many candidate patterns that are not added to the result list of valid frequent closed patterns. The patterns at Steps 6, 7, 8, 11, 12, 13 are not closed as they are subsets of other valid frequent closed patterns. For example, the candidate pattern at Step 7 (with row indexes 1 and 3) is not closed as it is part of the closed pattern discovered at Step 4. The IsClosed function ensures that all candidate patterns of this kind are excluded.
Fig. 2. Visual representation of the input transformation for a simple input matrix (a) and an example for aggregation of 2 pure patterns (b).
3.1.3. Time complexity

To evaluate the amount of time necessary to execute the proposed method, we consider the worst-case scenario. Let m denote the number of rows, n the number of columns and $m_r$ the minimal number of rows in a bicluster. To transform the input data into matrix B, we need $O(m \cdot n)$ time. Then the FCPMiner method is invoked $m - m_r$ times at the root level (see Steps 1, 9, 12 in Fig. 1). Note that the first call will require most of the time. Therefore, the time required for this procedure call can be used as an upper bound for the remaining ones. Since our method is recursive, the Master theorem (Leiserson, Rivest, Stein, & Cormen, 2001) can be used to compute the complexity of the recursion, which is $O(n \log n)$. As we have $m - m_r$ calls at the root level, the complexity of the method is $(m - m_r) \cdot O(n \log n)$.
3.1.4. Transformation of {−1, 0, 1} data to binary data
In this section we will show that {−1, 0, 1} data can be transformed to binary data (with some limitations as discussed below) and thus, all earlier methods developed for mining patterns within binary data could be applied to the transformed data.
The transformation process is presented in the upper part of Fig. 2 through a simple example. The original input matrix A is transformed into a four times bigger matrix B using the following steps:

$$B = \begin{bmatrix} A^{1} & A^{-1} \\ A^{-1} & A^{1} \end{bmatrix}$$

Here $A^{1}$ and $A^{-1}$ denote the binary matrices that mark the 1 and −1 entries of A, respectively.
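The block construction can be sketched with plain Python lists. The function name and the example matrix are illustrative assumptions; the layout of the four blocks follows the description of B as the rows $[A^{1}\ A^{-1}]$ stacked on $[A^{-1}\ A^{1}]$.

```python
def to_binary_blocks(A):
    """Build the 2n x 2m binary matrix [[A1, A-1], [A-1, A1]] from a
    {-1, 0, 1} matrix A, where A1 marks the 1 entries and A-1 the -1 entries."""
    pos = [[1 if v == 1 else 0 for v in row] for row in A]
    neg = [[1 if v == -1 else 0 for v in row] for row in A]
    top = [p + q for p, q in zip(pos, neg)]      # [A1 | A-1]
    bottom = [q + p for p, q in zip(pos, neg)]   # [A-1 | A1]
    return top + bottom


# Hypothetical 2 x 2 input matrix.
A = [
    [1, -1],
    [0, 1],
]
B = to_binary_blocks(A)
```

Every non-zero cell of A becomes exactly two 1s in B, which is why patterns found in B by a binary miner appear in duplicate and need to be filtered afterwards.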
In this construction, $A = \{a_{i,j}\}$, $a_{i,j} \in \{-1, 0, 1\}\ \forall i, j$, is the initial data matrix, and $A^{1}$ and $A^{-1}$ are derived from A as follows:

$$A^{1} = \{a^{1}_{i,j}\},\ a^{1}_{i,j} \in \{0, 1\}\ \forall i, j: \qquad a^{1}_{i,j} = \begin{cases} 1 & \text{if } a_{i,j} = 1 \\ 0 & \text{otherwise} \end{cases}$$

$$A^{-1} = \{a^{-1}_{i,j}\},\ a^{-1}_{i,j} \in \{0, 1\}\ \forall i, j: \qquad a^{-1}_{i,j} = \begin{cases} 1 & \text{if } a_{i,j} = -1 \\ 0 & \text{otherwise} \end{cases}$$

Using this representation, closed patterns are discovered twice in the transformed matrix, as patterns containing only 1s are present in the $A^{1}$ matrices and patterns containing only −1s are present in the $A^{-1}$ matrices. Moreover, patterns with oppositely changing values also appear twice in matrix B, in $[A^{1}\ A^{-1}]$ and in $[A^{-1}\ A^{1}]$, respectively. Therefore, all types of closed patterns present in A also exist in the transformed matrix, but all of them twice, and thus a post-processing step is needed to eliminate the duplicated patterns.
Although the presented transformation allows applying previous methods to {1, 0, −1} data, it is more convenient to use our proposed method, FCPMiner, where the overhead caused by processing duplicated patterns and having to perform post-filtering can be avoided.

3.2. Closed pattern based data visualization

Although various closed pattern mining methods have been introduced recently, none of them provide a visualization technique for the thousands of scattered subsets of the original data. Individual visualization projects can be found in Santamaría, Therón, and Quintales (2008) and Heinrich, Seifert, Burch, and Weiskopf (2011). Here we present a novel technique for visualizing the original data matrix by reordering the rows and columns based on the discovered closed patterns. The visualization can be of use for evaluating the effectiveness of the pattern detection and can help to interpret the pattern structure at a general level. To illustrate the problem we use a tiny synthetic data set of size 10 by 10 (Fig. 3). Algorithm 4 describes the procedure for reordering the matrix data.
The method takes the original data matrix and the closed patterns as input. At first, we generate sparse matrices $A'_r$ and $A'_c$ using the following formula (line 1):

$$A'_r(i, j) = \text{number of rows in } C_j \text{ where } C_j \text{ contains column } i$$
$$A'_c(i, j) = \text{number of columns in } C_j \text{ where } C_j \text{ contains row } i \qquad (4)$$

In steps 2 and 3 the distance tables are generated for the rows and columns of A, respectively. Thus the algorithm computes distances between each pair of rows, $DT_{C_r}$, and columns, $DT_{C_c}$, based on the sparse matrices $A'_r$ and $A'_c$, respectively. We used the Tanimoto distance to calculate similarity measures. If the samples X and Y are bitmaps, Tanimoto's similarity ratio $T_S$ is

$$T_S(X, Y) = \frac{\sum_i (X_i \wedge Y_i)}{\sum_i (X_i \vee Y_i)}$$

where $\wedge$, $\vee$ are the bitwise and, or operators, respectively. According to this definition, the distance between, for example, rows i and j is calculated as follows, using our sparse matrices:

$$DT_{C_r}(i, j) = DT_{C_r}(j, i) = \frac{\prod \{A'_r(i), A'_r(j)\}}{\sum \{A'_r(i), A'_r(j)\}}$$

Using this type of similarity, the overlap of two identified patterns can be measured, since the product of the matrices $A'_r$ contains non-zero elements only at overlapping places. Using Tanimoto's similarity measure in this way provides a robust metric, where zero elements are also taken into account during the calculation of distances.
Finally, dendrograms for rows ($DG_r$) and columns ($DG_c$) are calculated based on the distance tables, and the rearranged dataset $A''$ is generated using the sequence of the dendrograms' leaves.
Algorithm 4. Closed pattern based data visualization
Require: A (m × n): initial input data
    C (p × 1): set of closed patterns produced by the pattern mining algorithm
Ensure: A'': rearranged input data based on the patterns
1: Generate sparse matrices A'_r and A'_c
2: Compute distance table for rows using Tanimoto distance: DT_Cr
3: Compute distance table for columns using Tanimoto distance: DT_Cc
4: Create dendrogram for rows based on DT_Cr: DG_r
5: Create dendrogram for columns based on DT_Cc: DG_c
6: return rearranged input data A'' based on the dendrograms
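The bitmap form of Tanimoto's similarity ratio used in steps 2 and 3 can be sketched as follows. The function names are hypothetical, returning 0.0 for two all-zero bitmaps is an assumption (the ratio is undefined in that case), and turning the similarity into a distance as 1 − T_S is a common convention assumed here rather than something stated in the text.

```python
def tanimoto_similarity(x, y):
    """Tanimoto similarity of two equal-length 0/1 sequences:
    (number of positions set in both) / (number set in at least one)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0  # assumption: 0.0 for empty bitmaps


def distance_table(vectors):
    """Symmetric table of pairwise distances, assuming distance = 1 - similarity."""
    n = len(vectors)
    return [
        [1.0 - tanimoto_similarity(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical membership bitmaps: one row per data row, one bit per pattern.
rows = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
table = distance_table(rows)
```

Such a table could then be passed to any hierarchical clustering routine to obtain the dendrogram leaf order used for rearranging the matrix.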
3.3. Method for the aggregation of closed patterns

Frequent closed pattern mining methods typically discover large numbers of highly similar, significantly overlapping patterns. Grouping similar patterns together can often be useful for providing a more comprehensive view of the results, thus allowing rapid detection of the most meaningful patterns. To address this, we present here a novel technique which, to our knowledge, is the first published closed pattern aggregation method.
Let minCons be a consistency parameter that represents the proportion of non-zero elements in a pattern. The mathematical formulation of the problem is then given as follows. The input matrix A can be expressed as $A = \{a_{i,j}\}$, $a_{i,j} \in \{0, 1\}$, $\forall i, j$, or, with the set of rows X and the set of columns Y, as $A = (X, Y)$. The set of all FCPs is $B = \{B_k\} = \{(I_k, J_k)\}$, where $B_k \subseteq A$, $\forall k$, and $I_k \subseteq X$, $J_k \subseteq Y$, $\forall k$. The set of all aggregated patterns can be expressed as $C = \{C_l\}$, $\forall l$, where $C_l = \bigcup B_r$, $B_r \in B$ for some r.
The following limitations are stated:
• each element of B is included in only one element of C, i.e. the elements of C are disjoint with respect to the elements of B;
• $\forall l:\ \dfrac{|C_l|_0}{|C_l|} \le 1 - minCons$, where the operator $|\cdot|_0$ denotes the number of zeros in the matrix, while $|\cdot|$ denotes the number of all elements in the corresponding matrix.
The consistency ratio of an aggregated pattern can be determined by counting the non-zero elements in the original matrix A, which is a computationally expensive process. Therefore, we introduce a simple estimation for the upper bound of the number of zeros using the following expression:

$$C_p = B_{r_1} \cup B_{r_2} = (I_{r_1} \cup I_{r_2},\ J_{r_1} \cup J_{r_2})$$

where $B_{r_1}, B_{r_2} \in B$; $I_{r_1}, I_{r_2} \in I$; $J_{r_1}, J_{r_2} \in J$, for some $p, r_1, r_2$, and the upper bound is

$$\max |C_p|_0 = \left(|I_{r_1}| + |I_{r_2}| - 2\,|I_{r_1} \cap I_{r_2}|\right) \left(|J_{r_1}| + |J_{r_2}| - 2\,|J_{r_1} \cap J_{r_2}|\right) \qquad (7)$$

The bottom part of Fig. 2 illustrates on a simple example how the upper bound of zeros is calculated after merging two patterns. Applying the proposed upper bound estimation for the zero elements in the aggregated pattern yields the following:

$$\max |C_p|_0 = \left(|\{g_1, g_2\}| + |\{g_2, g_3\}| - 2\,|\{g_2\}|\right) \left(|\{c_1, c_2, c_3, c_4, c_6\}| + |\{c_2, c_3, c_4, c_5\}| - 2\,|\{c_2, c_3, c_4\}|\right)$$
$$\max |C_p|_0 = (2 + 2 - 2 \cdot 1)(5 + 4 - 2 \cdot 3) = 2 \cdot 3 = 6$$

While the actual number of zeros is 2, the upper bound estimate is 6. Although there is a clear difference between the actual and estimated values, for large data sets the tradeoff between the accuracy and the computational cost can be justified.
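The estimate of Eq. 7 only needs the row and column index sets of the two patterns, which the following sketch makes explicit. The function name is hypothetical, and the gene and column labels are plain strings chosen to mirror the example above.

```python
def zero_upper_bound(rows1, cols1, rows2, cols2):
    """Upper bound of Eq. 7 on the number of zeros after merging two
    patterns given by their row and column index sets."""
    row_term = len(rows1) + len(rows2) - 2 * len(rows1 & rows2)
    col_term = len(cols1) + len(cols2) - 2 * len(cols1 & cols2)
    return row_term * col_term


# The two patterns of the worked example.
bound = zero_upper_bound(
    {"g1", "g2"}, {"c1", "c2", "c3", "c4", "c6"},
    {"g2", "g3"}, {"c2", "c3", "c4", "c5"},
)
```

The bound is computed from set sizes alone, without reading any cell of the original matrix, which is where the saving over counting the zeros directly comes from.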
To further illustrate the effectiveness of our pattern aggregation method, we apply the commonly used VAT visualization method (Huband, Bezdek, & Hathaway, 2005) (Fig. 4).
The similarity between patterns is indicated on a gray color scale where darker colors signify stronger similarity. Based on our aggregation method, we can generate a dendrogram reflecting the sequence of merging. Using this dendrogram the distance table can then be rearranged to produce a visual representation (Fig. 4). The visualization clearly illustrates the effectiveness of the proposed aggregation technique in detecting large, strongly correlated frequent closed patterns.
In this section, we presented a full data-analysis methodology to mine pure closed frequent patterns with our novel algorithm, to aggregate them into larger biclusters that allow zeros, and to visualize the result through the rearranged data matrix. Thanks to multi-threading, our algorithm performs very efficiently on modern computers, and because it is built on the Java platform it runs on all popular operating systems. As we validated on several test cases, our method finds all the existing closed patterns in binary and {-1, 0, 1}-type data and provides a handy output format conforming to the conventions of other data mining methods. Although the algorithm is highly scalable vertically, excessive horizontal scaling can reduce the computing efficiency radically. As we saw earlier, the presented aggregation technique is fast and easily interpretable; however, it uses a very basic metric, and the visualization technique could become unhelpful on huge matrices.
Fig. 3. Visualization of rearranged data matrix based on the pattern mining results.
Fig. 4. Visualization illustrating the efficiency of the pattern merging algorithm.
4. Experimental results
In this section we compare our proposed closed pattern mining method with a biclustering-based method (BiMAX; Prelic et al., 2006) and a frequent closed itemset mining-based method (DCI_Closed; Lucchese et al., 2006), both of which are able to discover all frequent closed patterns in binary data. These algorithms previously served as highly recognized reference methods for their application fields (Prelic et al., 2006). Note that all methods developed for frequent closed itemset mining produce the same patterns as DCI_Closed. Using several synthetic and real biological data sets, we show that (1) all three methods discover the same closed patterns in binary data, which experimentally confirms our claim that both biclustering and frequent closed itemset mining methods discover the same patterns; (2) our pattern discovery method outperforms the other methods; and (3) it is the only method that is able to discover previously hidden and potentially biologically relevant closed patterns by using the extended {-1, 0, 1} data.
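The claim that frequent closed itemset mining and biclustering find the same patterns in binary data can be checked by brute force on a toy matrix. The sketch below is only an illustration under stated assumptions: it enumerates all subsets exhaustively (so it is exponential and unrelated to how FCPMiner, BiMAX or DCI_Closed actually search), and the helper names and the example matrix are hypothetical.

```python
from itertools import combinations


def rows_with(matrix, cols):
    """Rows that contain a 1 in every given column."""
    return frozenset(r for r, row in enumerate(matrix) if all(row[c] for c in cols))


def cols_shared(matrix, rows):
    """Columns that contain a 1 in every given row."""
    return frozenset(c for c in range(len(matrix[0])) if all(matrix[r][c] for r in rows))


def nonempty_subsets(n):
    """All non-empty index subsets of range(n)."""
    for k in range(1, n + 1):
        yield from combinations(range(n), k)


def closed_itemsets(matrix):
    """Itemset view: column sets equal to the closure of their supporting rows."""
    found = set()
    for cols in nonempty_subsets(len(matrix[0])):
        rows = rows_with(matrix, cols)
        if rows and cols_shared(matrix, rows) == frozenset(cols):
            found.add((rows, frozenset(cols)))
    return found


def maximal_biclusters(matrix):
    """Bicluster view: all-ones submatrices that cannot be extended by a row or column."""
    found = set()
    for rows in nonempty_subsets(len(matrix)):
        cols = cols_shared(matrix, rows)
        if cols and rows_with(matrix, cols) == frozenset(rows):
            found.add((frozenset(rows), cols))
    return found


matrix = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]
itemset_view = closed_itemsets(matrix)
bicluster_view = maximal_biclusters(matrix)
```

Both functions return the same five (row set, column set) pairs for this matrix, which is the equivalence the experiments in this section test at scale.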
4.1. Comparison and computational efficiency of the closed pattern mining methods
To compare the three mining methods and demonstrate their computational efficiency, we applied them to several real and generated synthetic data sets. Each method was run on a standard desktop computer (2.66 GHz Intel Core i5 CPU and 16 GB memory) running the Windows operating system to compare the running times. Real data come from various biological studies previously used as reference data in biclustering research (Gyenesei et al., 2007; Király, Abonyi, Laiho, & Gyenesei, 2012; Li et al., 2009). For the comparison of computational efficiency, all biological data sets were binarized. For both the fold-change data (stem cell data sets) and the absolute expression data (Leukemia, Compendium, Yeast-80), a fold-change cut-off of 2 was used. Synthetic data were generated with both our own generator and the IBM Quest Synthetic Data Generator tool (Pitman, 2011). Results are shown in Table 1 (synthetic data) and Table 2 (real data). All three methods were able to discover all closed patterns for all synthetic and real data sets. The tables also show that FCPMiner outperforms the other two methods and provides the best running times in the large majority of the cases, especially when the numbers of rows and columns are higher.

Table 1
Computational results using synthetic data sets. r: number of rows; c: number of columns; d: density (proportion of ones) [%]; sc: minimum support count during the search (min_cols in pattern mining); sr: minimum row count during pattern mining (min_rows); cf: number of identified closed patterns; cff: number of closed patterns after filtering; b: number of patterns found by the corresponding algorithm; t: running time [s].

| Data | r | c | d | sc | sr | BiMAX b | BiMAX t | DCI_Closed cf | DCI_Closed cff | DCI_Closed t | FCPMiner b | FCPMiner t |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 | 50 | 50 | 10 | 2 | 2 | 78 | 1 | 119 | 78 | 0.016 | 78 | 0.001 |
| S2 | 50 | 50 | 20 | 4 | 2 | 140 | 1 | 189 | 140 | 0.024 | 140 | 0.016 |
| S3 | 50 | 50 | 50 | 15 | 2 | 238 | 1 | 288 | 238 | 0.033 | 238 | 0.438 |
| S4 | 100 | 100 | 10 | 3 | 2 | 337 | 2 | 436 | 337 | 0.041 | 337 | 0.041 |
| S5 | 100 | 100 | 20 | 7 | 2 | 488 | 2 | 588 | 488 | 0.028 | 488 | 0.015 |
| S6 | 100 | 100 | 50 | 30 | 2 | 694 | 3 | 794 | 694 | 0.034 | 694 | 0.488 |
| S7 | 300 | 300 | 10 | 8 | 2 | 437 | 5 | 737 | 437 | 0.041 | 437 | 0.031 |
| S8 | 300 | 300 | 20 | 22 | 2 | 156 | 52 | 456 | 156 | 0.085 | 156 | 0.047 |
| S9 | 300 | 300 | 50 | 90 | 2 | 1038 | >600 | 1338 | 1038 | 0.241 | 1038 | 0.318 |
| S10 | 700 | 700 | 10 | 15 | 2 | 1318 | 195 | 2018 | 1318 | 0.365 | 1318 | 0.266 |
| S11 | 700 | 700 | 20 | 45 | 2 | 375 | >300 | 1075 | 375 | 0.720 | 375 | 0.499 |
| S12 | 700 | 700 | 50 | 210 | 2 | 283 | >300 | 983 | 283 | 2.631 | 283 | 1.857 |
| S13 | 1000 | 1000 | 10 | 20 | 2 | 1496 | >600 | 2496 | 1496 | 0.916 | 1496 | 0.671 |
| S14 | 1000 | 1000 | 20 | 60 | 2 | 714 | >600 | 1714 | 714 | 2.182 | 714 | 1.451 |
| S15 | 1000 | 1000 | 50 | 290 | 2 | 1030 | >600 | 2030 | 1030 | 8.110 | 1030 | 6.238 |
| IBM 1 | 100 | 100 | 9.04 | 4 | 4 | 6 | 1 | 452 | 6 | 0.070 | 6 | 0.004 |
| IBM 2 | 1000 | 100 | 9.32 | 4 | 6 | 15 | 1 | 19974 | 15 | 0.142 | 15 | 0.061 |
| IBM 3 | 10000 | 100 | 8.94 | 4 | 10 | NA | NA | 426508 | 7 | 1.517 | 7 | 1.099 |
| IBM 4 | 100000 | 100 | 8.99 | 4 | 12 | NA | NA | 8572510 | 16 | 38.909 | 16 | 24.147 |
| IBM 5 | 100 | 100 | 7.78 | 6 | 6 | 101 | 0.8 | 350 | 101 | 0.015 | 101 | 0.001 |
| IBM 6 | 100 | 1000 | 7.14 | 12 | 20 | 216 | 26 | 1649889 | 216 | 25.648 | 216 | 20.668 |

Table 2
Comparison to DCI_Closed. r: number of rows; c: number of columns; d: density (proportion of ones) [%]; sc: minimum support count during the search (min_cols in pattern mining); sr: minimum row count during pattern mining (min_rows); cf: number of identified closed patterns; cff: number of closed patterns after filtering; b: number of patterns found by the corresponding algorithm; t: running time [s].

| Problem | r | c | d | sc | sr | DCI_Closed cf | DCI_Closed cff | DCI_Closed t | FCPMiner b | FCPMiner t |
|---|---|---|---|---|---|---|---|---|---|---|
| Compendium | 6316 | 300 | 1.2 | 50 | 2 | 2715 | 2594 | 0.157 | 2594 | 0.124 |
| StemCell-27 | 45276 | 27 | 5.8 | 200 | 2 | 7999 | 7972 | 0.521 | 7972 | 0.325 |
| Leukemia | 12533 | 72 | 19.3 | 400 | 2 | 3715 | 3643 | 0.823 | 3643 | 0.787 |
| StemCell-9 | 1840 | 9 | 15.5 | 2 | 2 | 186 | 177 | 0.032 | 177 | 0.001 |
| Yeast-80 | 6221 | 80 | 6.8 | 80 | 2 | 3348 | 3285 | 0.094 | 3285 | 0.055 |

Fig. 5. Three examples of corresponding patterns discovered by FCPMiner and binary FCP mining methods. If the erroneous genes are removed from the patterns discovered by BiMAX, the sets of patterns produced by the two methods become equivalent.
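The table captions distinguish cf (closed patterns identified) from cff (closed patterns after filtering). A plausible reading, stated here as an assumption, is that an itemset miner enforces only its minimum support, so the remaining size threshold is applied as a post-processing step. The sketch below shows such a filter; the pattern representation as (row set, column set) pairs and the function name `filter_patterns` are hypothetical.

```python
def filter_patterns(patterns, min_rows, min_cols):
    """Keep only patterns that satisfy both size thresholds."""
    return [
        (rows, cols)
        for rows, cols in patterns
        if len(rows) >= min_rows and len(cols) >= min_cols
    ]


# Hypothetical mined patterns: (set of row indices, set of column indices).
mined = [
    ({0, 1, 2}, {0, 1}),   # large enough in both dimensions
    ({3}, {0, 1, 2}),      # only one row
    ({0, 1}, {2}),         # only one column
]
kept = filter_patterns(mined, min_rows=2, min_cols=2)
cf, cff = len(mined), len(kept)
```

Here cf is 3 and cff is 1, mirroring how the unfiltered and filtered pattern counts are reported side by side for DCI_Closed.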
4.2. Biological relevance of closed pattern mining on {-1, 0, 1} data
Here we illustrate the potential of our closed pattern mining method when applied to {-1, 0, 1} data. The real data set used in this section comes from a study of the effects of Tet1-knockdown on gene expression in mouse embryonic stem cell and trophoblast stem cell conditions. The data have previously been analyzed using our standard analysis pipeline, and the results have been published in Koh et al. (2011) (GEO reference: GSE26900). The input data for closed pattern mining were created based on the differentially expressed genes between different biological sample groups. Accordingly, the expression values were discretized as 1s signifying up-regulation, -1s signifying down-regulation and 0s signifying no change. For more information on preparing the input data for the mining as well as detailed data analysis, see Király et al. (2012). Here it is important to note that methods developed only for binary data do not take the direction of gene regulation into account and therefore transform the discretized values to 1s denoting both up- and down-regulation and to 0s denoting no change.
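The effect of discarding the direction of regulation can be seen on a three-gene toy example. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that a valid pattern requires every gene to carry the same sign in each of its columns, which may be simpler than the pattern definition FCPMiner actually uses.

```python
def binarize(matrix):
    """Collapse {-1, 0, 1} values to {0, 1}, discarding the direction of regulation."""
    return [[abs(v) for v in row] for row in matrix]


def rows_matching(matrix, template):
    """Indices of rows that equal the template value in every column it fixes."""
    return [
        i for i, row in enumerate(matrix)
        if all(row[c] == v for c, v in template.items())
    ]


signed = [
    [1, 1, 0],    # gene 0: up-regulated in samples 0 and 1
    [1, 1, 0],    # gene 1: same profile as gene 0
    [1, -1, 0],   # gene 2: up in sample 0 but down in sample 1
]
binary = binarize(signed)

template = {0: 1, 1: 1}                          # "regulated" in samples 0 and 1
binary_rows = rows_matching(binary, template)    # -> [0, 1, 2]
signed_rows = rows_matching(signed, template)    # -> [0, 1]
erroneous = sorted(set(binary_rows) - set(signed_rows))  # -> [2]
```

After binarization, gene 2 is indistinguishable from genes 0 and 1 and joins their pattern, although its regulation profile differs; this is the kind of erroneous gene referred to in this section.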
While FCPMiner identified all 115 valid frequent closed patterns, BiMAX and DCI_Closed, which were developed only for binary data, found 128 patterns. When inspecting these patterns more closely, we find that 70% of them are invalid, i.e. they contain erroneous genes with uncorrelated regulation profiles due to the binarization. Examples are shown in Fig. 5.
A common way to compare different biclustering methods is to run functional enrichment analysis for the resulting gene regulation patterns (BiMAX (Prelic et al., 2006), QUBIC (Li et al., 2009), BiBit (Rodriguez-Baena et al., 2011), Signature Algorithm (Ihmels et al., 2004)). This approach takes advantage of databases that group genes into pathways and functional categories according to known biological associations. An overrepresentation analysis can then be carried out to detect patterns containing more genes within specific functional categories than expected by chance alone, thus giving insight into the underlying biological mechanisms within the studied experimental setup. The different pattern mining methods can therefore be compared by looking at the patterns detected at certain functional enrichment significance levels (p-values) for each method (Li et al., 2009; Prelic et al., 2006; Rodriguez-Baena et al., 2011). Applying this standard procedure, the discovered patterns were analyzed with respect to the enrichment of functional Gene Ontology (GO) categories (Harris et al., 2004) and KEGG pathways (Kanehisa, Goto, Kawashima, Okuno, & Hattori, 2004) using overrepresentation analysis with a hypergeometric test (Rice, 2007) to calculate an enrichment p-value for each category and pathway. Supplementary tables (Tables S1 and S2) show the numbers of patterns detected at different significance levels (p-values) in GO categories and KEGG pathways for the different methods.
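The enrichment p-value of an overrepresentation analysis is the upper tail of a hypergeometric distribution. The following is a minimal sketch of that calculation using only the Python standard library; the function name, the parameter names and the toy numbers are assumptions for illustration, not values from this study.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` belong to the category."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes overall, 4 in the category, a pattern of 3 genes
# of which 2 fall into the category.
p = enrichment_p_value(population=10, annotated=4, sample=3, hits=2)
```

For these toy numbers the tail sums to (6 * 6 + 4 * 1) / 120 = 1/3, so the pattern would not count as enriched; real analyses would use the gene universe of the array and typically correct for multiple testing.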
After examining the results in detail, we have identified several closed patterns that were discovered only with FCPMiner. For example, the first panel on the left side of Fig. 5 shows an FCP reported as significant by FCPMiner within a GO category at p-value level 5E-12 but missed at this significance level by the other methods due to binarization and the resulting inclusion of erroneous genes. The remaining panels show patterns for KEGG that were discovered by FCPMiner and missed by the other methods at the p-value significance level 5E-6. Patterns with the calculated GO categories and KEGG pathways and the corresponding p-values are given in the supplementary data.
5. Conclusion
Owing to their significant economic and scientific importance, several attempts have been made in recent years to design efficient frequent itemset mining algorithms. Meanwhile, biclustering has also become an important field of development, mainly motivated by gene expression data analysis. We realized that there is a need for an interpretable and intuitive approach that is able to find all of the frequent closed itemsets in huge transactional databases. We found that frequent closed itemset mining and biclustering can be transformed to the same problem of mining frequent closed patterns. We have shown the analogy of the two techniques both theoretically and by demonstration on various synthetic and real data sets, where the methods provided the same patterns.
To tackle the long-known poor applicability of pattern mining methods to very large data sets and the restriction to handling only binary data, we have introduced a novel efficient algorithm (FCPMiner) to identify frequent closed patterns in both binary and extended {-1, 0, 1} data. The nature of our representation and the algorithm ensures that we miss no existing inclusion-maximal bicluster, contrary to other widely applied biclustering methods such as BiBit or QUBIC. The ability to handle three value categories can be extremely valuable in many application fields, for example in gene expression data analysis, as has been shown in this work. Applying pattern mining methods to very large data sets also easily results in large numbers of detected patterns, many of them partially overlapping, which poses a considerable challenge for the interpretation of the results. To address this, we have introduced a novel visualization technique that rearranges the original data matrix based on the discovered closed patterns, and a pattern aggregation method allowing rapid detection of the most meaningful patterns. The developed methods can be applied to various application fields ranging from supermarket transaction data to banking and credit card transaction histories to biomedical data collections.
Since a significant part of the data appearing in data mining problems is presented in non-binary format, an evident improvement of our method is to make it capable of handling integer or real numbers without any transformation, i.e. extending the domain of our approach. Data are evolving continuously, especially in web log analysis; therefore, handling temporality and applying our method in sequence or temporal data mining to reveal common patterns in different time segments can be an important improvement of our method. Although our algorithm can make use of the power of multi-core systems, further speed-up through additional heuristics could make our tool capable of analyzing huge amounts of data on-line. Handling streaming data is also a challenge for further research.
6. Availability
All tools proposed, developed and applied in this paper are available on the project website (footnote 1), including the Java implementation of the algorithms, a C# variant of IBM Quest, and a Java program for conversion between the IBM data format, the binary format and the DCI_Closed format, and for filtering.
Acknowledgements
This publication/research has been supported by the European Union and the Hungarian Republic through the Project TÁMOP-4.2.2.C-11/1/KONV-2012-0004 "National Research Center for Development and Market Introduction of Advanced Information and Communication Technologies". The research of Janos Abonyi was realized in the frames of TÁMOP 4.2.4. A/2-11-1-2012-0001 "National Excellence Program: Elaborating and operating an inland student and researcher personal support system". The project was subsidized by the European Union and co-financed by the European Social Fund.
Appendix A. Supplementary data
Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.eswa.2014.02.029.
References
Abdullah, A., & Hussain, A. (2006). A new biclustering technique based on crossing minimization. Neurocomputing, 69(16), 1882-1896.
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. ACM SIGMOD record (Vol. 22, pp. 207-216). ACM.
Ben-Dor, A., Chor, B., Karp, R., & Yakhini, Z. (2003). Discovering local structure in gene expression data: The order-preserving submatrix problem. Journal of Computational Biology, 10(3-4), 373-384.
Berry, M. J., & Linoff, G. S. (1997). Data mining techniques: For marketing, sales, and customer support.
Bozdag, D., Kumar, A. S., & Çatalyürek, Ü. V. (2010). Comparative analysis of biclustering algorithms. In Proceedings of the first ACM international conference on bioinformatics and computational biology (pp. 265-274). ACM.
Brin, S., Motwani, R., Ullman, J. D., & Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. Proceeding of the 1997 ACM-SIGMOD international conference on management of data (SIGMOD'97) (Vol. 26, pp. 255-264). Tucson, AZ: ACM.
Busygin, S., Prokopyev, O., & Pardalos, P. M. (2008). Biclustering in data mining. Computers & Operations Research, 35(9), 2964-2987.
Cano, C., García, F., López, F., & Blanco, A. (2009). Intelligent system for the analysis of microarray data using principal components and estimation of distribution algorithms. Expert Systems with Applications, 36(3, Part 1), 4654-4663.
Cheng, Y., & Church, G. M. (2000). Biclustering of expression data. In Eighth international conference on intelligent systems for molecular biology (ISMB'00) (pp. 93-103).
Cheng, K., Law, N., & Siu, W. (2013). Use of biclustering for missing value imputation in gene expression data. Artificial Intelligence Research, 2(2), p96.
Cule, B., Goethals, B., & Hendrickx, T. (2013). Mining interesting itemsets in graph datasets. In J. Pei, V. Tseng, L. Cao, H. Motoda, & G. Xu (Eds.), Advances in knowledge discovery and data mining. Lecture notes in computer science (Vol. 7818, pp. 237-248). Berlin, Heidelberg: Springer.
Duneja, E., & Sachan, A. (2012). A survey on frequent itemset mining with association rules. International Journal of Computer Applications, 46(23), 18-24.
Footnote 1: http://pr.mk.uni-pannon.hu:80/Research/FCPMiner/.
Eren, K., Deveci, M., Küçüktunç, O., & Çatalyürek, Ü. V. (2013). A comparative analysis of biclustering algorithms for gene expression data. Briefings in Bioinformatics, 14(3), 279-292.
Fimi'03: Workshop on frequent itemset mining implementations (2003). In B. Goethals, & M. J. Zaki (Eds.), IEEE international conference on data mining workshop on frequent itemset mining implementations, Melbourne, Florida, USA.
Fimi'04: Workshop on frequent itemset mining implementations (2004). In R. Bayardo, B. Goethals, & M. J. Zaki (Eds.), IEEE international conference on data mining workshop on frequent itemset mining implementations, Brighton, UK.
Freitas, A. V., Ayadi, W., Elloumi, M., Oliveira, J., & Hao, J.-K. (2013). Biological knowledge discovery handbook: Preprocessing, mining and postprocessing of biological data (Vol. 23). John Wiley & Sons [Chap. Survey on biclustering of gene expression data].
Gomariz, A., Campos, M., Marin, R., & Goethals, B. (2013). Clasp: An efficient algorithm for mining frequent closed sequences. In Advances in knowledge discovery and data mining (pp. 50-61). Springer.
Grahne, G., & Zhu, J. (2003). Efficiently using prefix-trees in mining frequent itemsets. In FIMI'03 workshop on frequent itemset mining implementations (pp. 123-132).
Gyenesei, A., Wagner, U., Barkow-Oesterreicher, S., Stolte, E., & Schlapbach, R. (2007). Mining co-regulated gene profiles for the detection of functional associations in gene expression data. Bioinformatics, 23(15), 1927-1935.
Han, J., Cheng, H., Xin, D., & Yan, X. (2007). Frequent pattern mining: Current status and future directions. Data Mining and Knowledge Discovery, 15, 55-86. http://dx.doi.org/10.1007/s10618-006-0059-1.
Harris, M., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., et al. (2004). The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 32(Database issue), D258-D261.
Hartigan, J. A. (1972). Direct clustering of a data matrix. Journal of the American Statistical Association (JASA), 67(337), 123-129.
Heinrich, J., Seifert, R., Burch, M., & Weiskopf, D. (2011). Bicluster viewer: A visualization tool for analyzing gene expression data. In Advances in visual computing (pp. 641-652). Springer.
Huband, J., Bezdek, J., & Hathaway, R. (2005). bigVAT: Visual assessment of cluster tendency for large data sets. Pattern Recognition, 38(11), 1875-1886.
Ihmels, J., Bergmann, S., & Barkai, N. (2004). Defining transcription modules using large-scale gene expression data. Bioinformatics, 20(13), 1993-2003.
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., & Hattori, M. (2004). The KEGG resource for deciphering the genome. Nucleic Acids Research, 32(suppl 1), D277-D280.
Kantardzic, M. (2002). Data mining: Concepts, models, methods, and algorithms. Wiley-IEEE Press.
Király, A., Abonyi, J., Laiho, A., & Gyenesei, A. (2012). Biclustering of high-throughput gene expression data with bicluster miner. In 2012 IEEE 12th international conference on data mining workshops (ICDMW) (pp. 131-138).
Koh, K. P., Yabuuchi, A., Rao, S., Huang, Y., Cunniff, K., Nardone, J., et al. (2011). Tet1 and Tet2 regulate 5-hydroxymethylcytosine production and cell lineage specification in mouse embryonic stem cells. Cell Stem Cell, 8, 200-213.
Kriegel, H.-P., Kröger, P., & Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1), 1.
Leiserson, C. E., Rivest, R. L., Stein, C., & Cormen, T. H. (2001). Introduction to algorithms. The MIT Press.
Li, G., Ma, Q., Tang, H., Paterson, A. H., & Xu, Y. (2009). QUBIC: A qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Research, 37(15), e101.
Liu, G., Lu, H., Lou, W., & Yu, J. X. (2003). On computing, storing and querying frequent patterns. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. KDD '03 (pp. 607-612). New York, NY, USA: ACM Press. http://dx.doi.org/10.1145/956750.956827.
Lucchese, C., Orlando, S., & Perego, R. (2006). Fast and memory efficient mining of frequent closed itemsets.
Lucchese, C., Orlando, S., & Perego, R. (2010). Mining top-k patterns from binary datasets in presence of noise. In Proceedings of the 10th SIAM international conference on data mining (SDM), Columbus, OH (pp. 165-176).
Madeira, S. C., & Oliveira, A. L. (2004). Biclustering algorithms for biological data analysis: A survey. IEEE Transactions on Computational Biology and Bioinformatics, 24-45.
Murali, T. M., & Kasif, S. (2003). Extracting conserved gene expression motifs from gene expression data. In Pacific symposium on biocomputing (pp. 77-88).
Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999). Discovering frequent closed itemsets for association rules. In Proceedings of the 7th international conference on database theory. ICDT '99 (pp. 398-416). London, UK: Springer-Verlag. <http://dl.acm.org/citation.cfm?id=645503.656256>.
Pei, J., Han, J., & Mao, R. (2000). Closet: An efficient algorithm for mining frequent closed itemsets. In ACM SIGMOD workshop on research issues in data mining and knowledge discovery, no. 2 (pp. 21-30).
Pitman, A. (2011). Market-basket synthetic data generator. <http://synthdatagen.codeplex.com/>.
Prelic, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., et al. (2006). A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9), 1122-1129.
Rice, J. A. (2007). Mathematical statistics and data analysis (3rd ed.). Duxbury Press.
Riondato, M., & Upfal, E. (2013). Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. In Machine learning and knowledge discovery in databases (pp. 25-41). Springer.
Riondato, M., & Vandin, F. (2014). Finding the true frequent itemsets. In SIAM international conference on data mining.
Rodriguez-Baena, D. S., Perez-Pulido, A. J., & Aguilar-Ruiz, J. S. (2011). A biclustering algorithm for extracting bit-patterns from binary datasets. Bioinformatics, 27(19), 2738-2745.
Rodríguez-González, A. Y., Martínez-Trinidad, J. F., Carrasco-Ochoa, J. A., & Ruiz-Shulcloper, J. (2013). Mining frequent patterns and association rules using similarities. Expert Systems with Applications, 40(17), 6823-6836.
Santamaría, R., Therón, R., & Quintales, L. (2008). BicOverlapper: A tool for bicluster visualization. Bioinformatics, 24(9), 1212-1213.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston: Pearson, Addison Wesley.
Uitert, M. v., Meuleman, W., & Wessels, L. (2008). Biclustering sparse binary genomic data. Journal of Computational Biology, 15(10), 1329-1345.
Vo, B., Hong, T.-P., & Le, B. (2012). DBV-Miner: A dynamic bit-vector approach for fast mining frequent closed itemsets. Expert Systems with Applications, 39(8), 7196-7206.
Wang, J., Han, J., & Pei, J. (2003). Closet+: Searching for the best strategies for mining frequent closed itemsets. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. KDD '03 (pp. 236-245). New York, NY, USA: ACM Press. http://dx.doi.org/10.1145/956750.956779.
Wu, L.-C., Huang, J.-L., Horng, J.-T., & Huang, H.-D. (2010). An expert system to identify co-regulated gene groups from time-lagged gene clusters using cell cycle expression data. Expert Systems with Applications, 37(3), 2202-2213.
Zaki, M. J., & Hsiao, C.-J. (1999). Charm: An efficient algorithm for closed association rule mining. In 2nd SIAM international conference on data mining (pp. 457-473). Citeseer.
Zhou, C., Cule, B., & Goethals, B. (2013). Itemset based sequence classification. In Machine learning and knowledge discovery in databases (pp. 353-368). Springer.
