University of Pannonia, Department of Process Engineering, P.O. Box 158, Veszprém H-8200, Hungary
The Finnish Microarray and Sequencing Centre, Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, Tykistökatu 6A, 20520 Turku, Finland
Article info
Keywords:
Biclustering
Closed frequent itemset mining
Clustering visualization
Data mining algorithm
Pattern detection
Abstract
In this paper we show that frequent closed itemset mining and biclustering, the two most prominent application fields in pattern discovery, can be reduced to the same problem when dealing with binary (0–1) data. FCPMiner, a new powerful pattern mining method, is then introduced to mine such data efficiently. The uniqueness of the proposed method is its extendibility to non-binary data. The mining method is coupled with a novel visualization technique and a pattern aggregation method to detect the most meaningful, non-overlapping patterns. The proposed methods are rigorously tested on both synthetic and real data sets.
© 2014 Elsevier Ltd. All rights reserved.
1. Introduction
Very large datasets have lately become increasingly common in many application areas, making it impossible to inspect data manually when looking for interesting patterns and knowledge (Tan, Steinbach, & Kumar, 2006). Data mining as a field aims at developing computational methods and tools that can be used for automating knowledge extraction to aid decision making or to understand general trends within data collections (Kantardzic, 2002). Pattern discovery, an interesting subfield of data mining, is motivated by the huge amounts of electronic data that many organizations produce on their functions or collect during experiments within various research fields. For instance, supermarkets store electronic copies of millions of receipts, banks and credit card companies maintain extensive transaction histories, and biomedical and bioinformatics research groups collect data from various biological experiments. The goal of pattern discovery in general is to analyze these large datasets to identify informative patterns and motifs (Berry & Linoff, 1997).
Frequent pattern discovery is an actively researched data mining technique with a wide range of applications (Han, Cheng, Xin, & Yan, 2007). Market basket data analysis in web-shops, web search analysis, frequent symptom set mining in health-care and product placement in supermarkets are all well-known applications of association rule mining, where pattern matching is used to predict the behavior of customers or patients. Using adjacency matrices, strongly connected components can be revealed in social networks, and with the help of collaborative filtering automated suggestions can be made for users, taking into account the duality between users and items. In the field of genetics, the huge amount of data from gene expression analysis can be handled by efficient pattern mining algorithms to uncover local motifs and interesting genetic pathways which are not apparent otherwise.

Corresponding author. Tel.: +36 88624770; fax: +36 88623171.
E-mail address: kiralya@fmt.uni-pannon.hu (A. Király).
http://dx.doi.org/10.1016/j.eswa.2014.02.029
0957-4174/© 2014 Elsevier Ltd. All rights reserved.
Frequent itemset mining methods usually generate a huge number of patterns or association rules, so the result of the data mining is not directly applicable. One possible way to obtain a compact and informative set of itemsets is closed itemset mining (Agrawal, Imielinski, & Swami, 1993; Brin, Motwani, Ullman, & Tsur, 1997). During the last decade many frequent closed itemset mining methods have been proposed in the literature. The problem was first introduced by Pasquier, Bastide, Taouil, and Lakhal (1999), together with the first algorithm for mining closed itemsets, called A-Close. Other closed itemset mining algorithms include CLOSET (Pei, Han, & Mao, 2000), CHARM (Zaki & Hsiao, 1999), CLOSET+ (Wang, Han, & Pei, 2003), FPClose (Grahne & Zhu, 2003), AFOPT (Liu, Lu, Lou, & Yu, 2003), DCI_Closed (Lucchese, Orlando, & Perego, 2006), DBV-Miner (Rodríguez-González, Martínez-Trinidad, Carrasco-Ochoa, & Ruiz-Shulcloper, 2013), DBV-Miner (Vo, Hong, & Le, 2012) and ClaSP (Gomariz, Campos, Marin, & Goethals, 2013), to mention just the most widely applied ones. Very recent publications are Zhou, Cule, and Goethals (2013), Riondato and Upfal (2013) and Cule, Goethals, and Hendrickx (2013), and methods for approximating the true frequent itemsets can be found in Riondato and Vandin (2014) and Riondato and Upfal (2013). For comprehensive reviews of efficient algorithms, see Fimi (2003), Fimi (2004) and Duneja and Sachan (2012).
We prove this both theoretically (Section 2.3) and experimentally (Section 4) on various synthetic and real data sets.
• We extend the problem of mining frequent closed patterns (FCP) to more than two value categories. This is especially important when FCP mining methods are applied to big biological data, such as the gene expression data that are commonly obtained nowadays by microarray and next-generation sequencing instruments (see Section 3.1).
• We propose a novel algorithm, called FCPMiner, to mine FCPs efficiently. We rigorously test FCPMiner on various real and synthetic data sets and show that FCPMiner is a powerful method for mining FCPs (see Section 3.1.1).
• We introduce a novel visualization as well as a pattern aggregation method to enable the quick identification of the most relevant FCPs (see Sections 3.2 and 3.3).
• We implemented our algorithms in the Java programming language, so they can be run on any operating system (Windows, Linux, Mac OS). The implemented methods are freely available on the project website.³
2. Problem formulation
2.1. Biclustering
In this paper we follow the formulation given in Prelic et al. (2006) to define the problem of mining biclusters in gene expression data. According to common practice in the field, bicluster mining is restricted to a binary matrix, i.e. gene expression values are transformed to 1 (expressed) or 0 (not expressed) using an expression cutoff (Li et al., 2009; Prelic et al., 2006). Let $E \in \{0, 1\}^{n \times m}$ be an expression matrix, where E represents the set of m experiments for n genes. A cell $e_{ij}$ contains 1 whenever gene i is expressed in condition j and 0 otherwise. A bicluster $(G, C)$ corresponds to a subset of genes $G \subseteq \{1, \ldots, n\}$ that jointly respond across a subset of samples $C \subseteq \{1, \ldots, m\}$. Therefore, the bicluster $(G, C)$ is a submatrix of E in which all elements are equal to 1 (biclusters in a small data set are depicted in Fig. 1 by bold numbers). Using the above definition, every single cell $e_{ij}$ equal to 1 represents a bicluster. However, such patterns are usually redundant as they are entirely contained in other patterns. Thus, the definition of inclusion-maximal bicluster (IMB) was introduced to discover all biclusters not entirely contained in any other bicluster (Prelic et al., 2006): the pair $(G, C) \in 2^{\{1,\ldots,n\}} \times 2^{\{1,\ldots,m\}}$ is an IMB if and only if

$$\forall i \in G,\ j \in C : e_{ij} = 1$$

and

$$\nexists\, (G', C') \in 2^{\{1,\ldots,n\}} \times 2^{\{1,\ldots,m\}} \ \text{where}\ \forall i' \in G',\ j' \in C' : e_{i'j'} = 1 \ \text{and}\ G \subseteq G' \wedge C \subseteq C' \wedge (G', C') \neq (G, C).$$
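The IMB definition above can be checked directly on a small binary matrix. The following is a minimal sketch, not part of the original method: the function names `is_bicluster` and `is_inclusion_maximal` and the 0-based indexing are assumptions made for illustration. It relies on the observation that any strictly larger all-ones submatrix containing $(G, C)$ also contains an extension of $(G, C)$ by a single row or a single column, so testing one-step extensions is sufficient.

```python
from itertools import product


def is_bicluster(E, G, C):
    """True if G and C are non-empty and e_ij == 1 for every i in G, j in C."""
    return bool(G) and bool(C) and all(E[i][j] == 1 for i, j in product(G, C))


def is_inclusion_maximal(E, G, C):
    """True if (G, C) is a bicluster that cannot be enlarged.

    Hypothetical helper for illustration; rows and columns are 0-based.
    """
    if not is_bicluster(E, G, C):
        return False
    n, m = len(E), len(E[0])
    # Try to add one gene (row) or one condition (column) at a time.
    extra_row = any(is_bicluster(E, G | {i}, C) for i in range(n) if i not in G)
    extra_col = any(is_bicluster(E, G, C | {j}) for j in range(m) if j not in C)
    return not (extra_row or extra_col)


# Small example matrix (3 genes x 3 conditions), chosen for illustration only.
E = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
print(is_inclusion_maximal(E, {0, 1}, {0, 1}))  # maximal: rows 0-1, columns 0-1
print(is_inclusion_maximal(E, {0}, {0, 1}))     # not maximal: row 1 can be added
```

The check is quadratic in the matrix size per candidate and is meant only to make the definition concrete, not to enumerate biclusters efficiently.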
2.2. Frequent closed itemset mining
One of the earliest and most important concepts in data mining is mining frequent itemsets in large transactional datasets (Lucchese, Orlando, & Perego, 2010). Such a dataset can be considered as a matrix with transactions as rows and items as columns. If an item appears in a transaction it is denoted by 1, otherwise by 0. The general goal of frequent itemset mining is to identify all itemsets that are contained in at least as many transactions as required by the minimum support threshold, min_sup. By definition, all subsets of a frequent itemset are frequent. Therefore, it is also important to provide a minimal representation of all frequent itemsets without losing their support information. Such itemsets are called frequent closed itemsets and can be defined as follows. Let $\sigma(x) = |\{t_i : x \subseteq t_i,\ t_i \in T\}|$ denote the support count of itemset x. An itemset x is closed if none of its immediate supersets has exactly the same support count as x. Conversely, the itemset x is not closed if at least one of its immediate supersets has the same support count as x. Obviously, $x, y : \sigma(x) \geq$ …
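To make the support count and the closedness condition concrete, here is a small brute-force sketch. It is an illustration only and not one of the algorithms cited in the text: the names `support`, `is_closed_itemset` and `frequent_closed_itemsets` are assumptions, transactions are modelled as Python sets of item indices, and the enumeration is exponential in the number of items.

```python
from itertools import combinations


def support(itemset, transactions):
    """sigma(x): number of transactions containing every item of x."""
    return sum(1 for t in transactions if itemset <= t)


def is_closed_itemset(itemset, transactions):
    """True if no immediate superset (one extra item) has the same support."""
    all_items = set().union(*transactions)
    s = support(itemset, transactions)
    return all(
        support(itemset | {item}, transactions) < s
        for item in all_items - itemset
    )


def frequent_closed_itemsets(transactions, min_sup):
    """Enumerate all non-empty itemsets and keep the frequent closed ones."""
    items = sorted(set().union(*transactions))
    found = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            x = frozenset(combo)
            s = support(x, transactions)
            if s >= min_sup and is_closed_itemset(x, transactions):
                found[x] = s
    return found


# Each transaction lists the items (column indices) it contains.
T = [{0, 1}, {0, 1, 2}, {1, 2}]
for itemset, s in frequent_closed_itemsets(T, min_sup=2).items():
    print(sorted(itemset), s)
```

Read as a binary matrix with transactions as rows, each closed itemset together with the transactions that contain it forms an all-ones submatrix that cannot be extended, which is the correspondence with inclusion-maximal biclusters that the paper builds on.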
Fig. 1. A simple example illustrating how the proposed method works. While the process flow is marked by solid arrows, the recursive steps are highlighted by dashed arrows. A bold cross sign indicates that the investigated pattern is not closed, or that it does not satisfy the minimum support conditions for rows or columns. The discovered frequent closed patterns are surrounded by solid rectangles.
$$B = \{b_i\}, \quad \text{where } b_i = \{j : A(i, j) \neq 0\}$$
procedure FCPMiner for each row of B. Note that the parameter vector missingRows stores the indices of those rows which are not examined in the current call (they have been checked before). This is important because closedness is checked against these indices by the IsClosed function.
The FCPMiner procedure is the heart of the method: it recursively builds up the frequent closed patterns (Algorithm 3). It takes the consecutive rows one by one and records only those column indices that show the same changing tendency (the same or exactly the opposite). The closedness of the candidate pattern is then checked before the method calls itself with the updated parameters. Finally, the newly discovered patterns are added to the output set of frequent closed patterns.
IsClosed is a simple function that checks whether adding a new row index to the candidate pattern would result in a closed pattern (Algorithm 2). It does so by checking whether there is a row in missingRows that contains the same column indices with the same changing tendency as the pattern under examination. If no such row can be found, the pattern is closed.
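The recursive idea described above can be sketched for the binary case as follows. This is a simplified illustration under stated assumptions, not the authors' Algorithms 2 and 3 (which are not reproduced in this text): it handles only 0/1 data, so the "changing tendency" comparison reduces to set intersection; the names `is_closed` and `mine_closed_patterns` and the parameters `min_rows` and `min_cols` (assumed to be at least 1) are chosen for the example; and the list `skipped` plays the role that the text assigns to missingRows.

```python
def is_closed(sparse_rows, cols, candidate_rows):
    """True if none of candidate_rows contains every column index in cols."""
    return not any(cols <= sparse_rows[r] for r in candidate_rows)


def mine_closed_patterns(matrix, min_rows=1, min_cols=1):
    """Row-enumeration sketch for binary data; returns (rows, cols) tuples."""
    # Sparse row representation: column indices of the ones in each row.
    sparse_rows = [
        frozenset(j for j, v in enumerate(row) if v == 1) for row in matrix
    ]
    n = len(sparse_rows)
    patterns = []

    def extend(rows, cols, start):
        for r in range(start, n):
            new_cols = sparse_rows[r] if cols is None else cols & sparse_rows[r]
            if len(new_cols) < min_cols:
                continue  # adding further rows can only shrink the column set
            new_rows = rows + (r,)
            # Rows passed over earlier in this branch (assumed analogue of
            # missingRows): if one of them fits, nothing closed lies below.
            skipped = [s for s in range(r) if s not in new_rows]
            if not is_closed(sparse_rows, new_cols, skipped):
                continue
            # Report the pattern only if no later row fits either.
            if len(new_rows) >= min_rows and is_closed(
                sparse_rows, new_cols, range(r + 1, n)
            ):
                patterns.append((new_rows, tuple(sorted(new_cols))))
            extend(new_rows, new_cols, r + 1)

    extend((), None, 0)
    return patterns


A = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
for rows, cols in mine_closed_patterns(A, min_rows=2, min_cols=2):
    print(rows, cols)
```

Because rows are always added in increasing index order and a branch is abandoned as soon as a skipped row fits the current column set, each closed pattern is reached along exactly one path, so no post-filtering of duplicates is needed in this sketch.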
Fig. 2. Visual representation of the input transformation for a simple input matrix (a) and an example for aggregation of 2 pure patterns (b).

$$A^{1} = \{a^{1}_{i,j}\}, \qquad a^{1}_{i,j} \in \{0, 1\}\ \ \forall i, j$$

proposed method, FCPMiner, where the overhead caused by processing duplicated patterns and having to perform post-filtering can be avoided.

$$A'_r(i, j) = \begin{cases} 1 & \text{if } a_{i,j} = 1 \\ 0 & \text{otherwise} \end{cases}$$
In steps 2 and 3 the distance tables are generated for the rows and columns of A, respectively. Thus the algorithm computes distances between each pair of rows, $DT^{C}_{r}$, and each pair of columns, $DT^{C}_{c}$, based on the sparse matrices $A'_r$ and $A'_c$, respectively. We used the Tanimoto distance to calculate similarity measures. If the samples X and Y are bitmaps, Tanimoto's similarity ratio $T_S$ is

$$T_S(X, Y) = \frac{\sum_i (X_i \wedge Y_i)}{\sum_i (X_i \vee Y_i)}$$

where $\wedge$ and $\vee$ are the bitwise and and or operators, respectively. According to this definition, the distance between, for example, rows i and j is calculated as follows, using our sparse matrices:

$$DT^{C}_{r}(i, j) = DT^{C}_{r}(j, i) = \frac{\prod \{A'_r(i), A'_r(j)\}}{\sum \{A'_r(i), A'_r(j)\}}$$
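As a small worked illustration of the Tanimoto ratio for bitmaps, the sketch below computes $T_S$ for two 0/1 sequences. The function names are assumptions for this example, as are two conventions that the text does not state: two all-zero bitmaps are treated as identical (similarity 1), and a distance is derived as one minus the similarity.

```python
def tanimoto_similarity(x, y):
    """T_S(X, Y) = sum_i (X_i AND Y_i) / sum_i (X_i OR Y_i) for 0/1 sequences."""
    both = sum(a & b for a, b in zip(x, y))
    either = sum(a | b for a, b in zip(x, y))
    # Assumption: two all-zero bitmaps count as identical.
    return both / either if either else 1.0


def tanimoto_distance(x, y):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - tanimoto_similarity(x, y)


x = [1, 1, 0, 1]
y = [1, 0, 0, 1]
# Two positions are set in both bitmaps, three in at least one: T_S = 2/3.
print(tanimoto_similarity(x, y))
print(tanimoto_distance(x, y))
```

For bitmaps this ratio coincides with the Jaccard index of the corresponding index sets.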
ity measure in this way provides a robust metric, where zero elements are also taken into account during the calculation of distances.
Finally, dendrograms for rows ($DG_r$) and columns ($DG_c$) are calculated based on the distance tables, and the rearranged dataset $A''$ is generated using the sequence of the dendrogram leaves.
Algorithm 4. Closed pattern based data visualization
Require: $A_{m \times n}$: initial input data
         $C_{p \times 1}$: set of closed patterns produced by the pattern mining algorithm
Ensure: $A''$: rearranged input data based on the patterns
1: Generate sparse matrices $A'_r$ and $A'_c$
2: Compute distance table for rows using Tanimoto distance: $DT^{C}_{r}$
3: Compute distance table for columns using Tanimoto distance: $DT^{C}_{c}$
4: Create dendrogram for rows based on $DT^{C}_{r}$: $DG_r$
5: Create dendrogram for columns based on $DT^{C}_{c}$: $DG_c$
6: return Rearranged input data $A''$ based on the dendrograms
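The steps of Algorithm 4 can be sketched in a few lines. This is a rough illustration under several assumptions, not the authors' implementation: the binary descriptions of rows and columns (standing in for the sparse matrices) are passed in as Python sets, the dendrogram is built with plain single-linkage agglomerative clustering because the linkage used in the paper is not stated here, and the names `leaf_order` and `rearrange` are hypothetical.

```python
from itertools import combinations


def tanimoto_distance(x, y):
    """1 - |x & y| / |x | y| for two sets; two empty sets are at distance 0."""
    union = x | y
    if not union:
        return 0.0
    return 1.0 - len(x & y) / len(union)


def leaf_order(bitmaps):
    """Single-linkage agglomerative clustering; returns the leaf sequence.

    Each cluster is kept as an ordered list of leaves, and the two closest
    clusters are concatenated until a single cluster remains.
    """
    clusters = [[k] for k in range(len(bitmaps))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(
                tanimoto_distance(bitmaps[i], bitmaps[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0] if clusters else []


def rearrange(matrix, row_bitmaps, col_bitmaps):
    """Reorder rows and columns of matrix by the two leaf sequences."""
    row_order = leaf_order(row_bitmaps)
    col_order = leaf_order(col_bitmaps)
    rearranged = [[matrix[i][j] for j in col_order] for i in row_order]
    return rearranged, row_order, col_order


A = [[1, 0], [0, 1], [1, 0], [0, 1]]
rows = [{0}, {1}, {0}, {1}]  # assumed binary description of each row
cols = [{0}, {1}]            # assumed binary description of each column
rearranged, row_order, col_order = rearrange(A, rows, cols)
print(row_order, col_order)
print(rearranged)
```

Rows with identical descriptions end up adjacent in the leaf sequence, which is what makes the patterns visible as contiguous blocks in the rearranged matrix.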
Fig. 3. Visualization of rearranged data matrix based on the pattern mining results.
4. Experimental results
In this section we compare our proposed closed pattern mining method with a biclustering-based method (BiMAX (Prelic et al., 2006)) and a frequent closed itemset mining-based method (DCI_Closed (Lucchese et al., 2006)), both of which are able to discover all frequent closed patterns in binary data. These algorithms previously served as highly recognized reference methods for their application fields (Prelic et al., 2006). Note that all methods developed for frequent closed itemset mining produce the same patterns as DCI_Closed. Using several synthetic and real biological data sets, we show that (1) all three methods discover the same closed patterns in binary data, which experimentally supports our claim that biclustering and frequent closed itemset mining methods discover the same patterns; (2) our pattern discovery method outperforms the other methods; and (3) it is the only method that is able to discover previously hidden and potentially biologically relevant closed patterns by using the extended {-1, 0, 1} data.
Fig. 4. Visualization illustrating the efficiency of the pattern merging algorithm.
Table 1
Computational results using synthetic data sets. r: number of rows; c: number of columns; d: density (proportion of ones) [%]; sc: minimum support count during the search (min_cols in pattern mining); sr: minimum row count during pattern mining (min_rows); cf: number of identified closed patterns; cff: number of closed patterns after filtering; b: number of patterns found by the corresponding algorithm; t: running time [s].

| Data | r | c | d [%] | sc | sr | BiMAX b | BiMAX t [s] | DCI_Closed cf | DCI_Closed cff | DCI_Closed t [s] | FCPMiner b | FCPMiner t [s] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 | 50 | 50 | 10 | 2 | 2 | 78 | 1 | 119 | 78 | 0.016 | 78 | 0.001 |
| S2 | 50 | 50 | 20 | 4 | 2 | 140 | 1 | 189 | 140 | 0.024 | 140 | 0.016 |
| S3 | 50 | 50 | 50 | 15 | 2 | 238 | 1 | 288 | 238 | 0.033 | 238 | 0.438 |
| S4 | 100 | 100 | 10 | 3 | 2 | 337 | 2 | 436 | 337 | 0.041 | 337 | 0.041 |
| S5 | 100 | 100 | 20 | 7 | 2 | 488 | 2 | 588 | 488 | 0.028 | 488 | 0.015 |
| S6 | 100 | 100 | 50 | 30 | 2 | 694 | 3 | 794 | 694 | 0.034 | 694 | 0.488 |
| S7 | 300 | 300 | 10 | 8 | 2 | 437 | 5 | 737 | 437 | 0.041 | 437 | 0.031 |
| S8 | 300 | 300 | 20 | 22 | 2 | 156 | 52 | 456 | 156 | 0.085 | 156 | 0.047 |
| S9 | 300 | 300 | 50 | 90 | 2 | 1038 | >600 | 1338 | 1038 | 0.241 | 1038 | 0.318 |
| S10 | 700 | 700 | 10 | 15 | 2 | 1318 | 195 | 2018 | 1318 | 0.365 | 1318 | 0.266 |
| S11 | 700 | 700 | 20 | 45 | 2 | 375 | >300 | 1075 | 375 | 0.720 | 375 | 0.499 |
| S12 | 700 | 700 | 50 | 210 | 2 | 283 | >300 | 983 | 283 | 2.631 | 283 | 1.857 |
| S13 | 1000 | 1000 | 10 | 20 | 2 | 1496 | >600 | 2496 | 1496 | 0.916 | 1496 | 0.671 |
| S14 | 1000 | 1000 | 20 | 60 | 2 | 714 | >600 | 1714 | 714 | 2.182 | 714 | 1.451 |
| S15 | 1000 | 1000 | 50 | 290 | 2 | 1030 | >600 | 2030 | 1030 | 8.110 | 1030 | 6.238 |
| IBM 1 | 100 | 100 | 9.04 | 4 | 4 | 6 | 1 | 452 | 6 | 0.070 | 6 | 0.004 |
| IBM 2 | 1000 | 100 | 9.32 | 4 | 6 | 15 | 1 | 19974 | 15 | 0.142 | 15 | 0.061 |
| IBM 3 | 10000 | 100 | 8.94 | 4 | 10 | NA | NA | 426508 | 7 | 1.517 | 7 | 1.099 |
| IBM 4 | 100000 | 100 | 8.99 | 4 | 12 | NA | NA | 8572510 | 16 | 38.909 | 16 | 24.147 |
| IBM 5 | 100 | 100 | 7.78 | 6 | 6 | 101 | 0.8 | 350 | 101 | 0.015 | 101 | 0.001 |
| IBM 6 | 100 | 1000 | 7.14 | 12 | 20 | 216 | 26 | 1649889 | 216 | 25.648 | 216 | 20.668 |
Table 2
Comparison to DCI_Closed. r: number of rows; c: number of columns; d: density (proportion of ones) [%]; sc: minimum support count during the search (min_cols in pattern mining); sr: minimum row count during pattern mining (min_rows); cf: number of identified closed patterns; cff: number of closed patterns after filtering; b: number of patterns found by the corresponding algorithm; t: running time [s].

| Problem | r | c | d [%] | sc | sr | DCI_Closed cf | DCI_Closed cff | DCI_Closed t [s] | FCPMiner b | FCPMiner t [s] |
|---|---|---|---|---|---|---|---|---|---|---|
| Compendium | 6316 | 300 | 1.2 | 50 | 2 | 2715 | 2594 | 0.157 | 2594 | 0.124 |
| StemCell-27 | 45276 | 27 | 5.8 | 200 | 2 | 7999 | 7972 | 0.521 | 7972 | 0.325 |
| Leukemia | 12533 | 72 | 19.3 | 400 | 2 | 3715 | 3643 | 0.823 | 3643 | 0.787 |
| StemCell-9 | 1840 | 9 | 15.5 | 2 | 2 | 186 | 177 | 0.032 | 177 | 0.001 |
| Yeast-80 | 6221 | 80 | 6.8 | 80 | 2 | 3348 | 3285 | 0.094 | 3285 | 0.055 |
Fig. 5. Three examples of corresponding patterns discovered by FCPMiner and binary FCP mining methods. If the erroneous genes are removed from the patterns discovered
by BiMAX the sets of patterns produced by the two methods become equivalent.
5. Conclusion
Owing to their significant economic and scientific importance, several attempts have been made in recent years to design efficient frequent itemset mining algorithms. Meanwhile, biclustering has also become an important field of development, mainly motivated by gene expression data analysis. We realized there is a need for an interpretable and intuitive approach that is able to find all of the frequent closed itemsets in huge transactional databases. We found that frequent closed itemset mining and biclustering can be transformed to the same problem of mining frequent closed patterns. We have proved the analogy of the two techniques both theoretically and by demonstration on various synthetic and real data sets, where it was shown that the methods provided the same patterns.
To tackle the long-known poor applicability of pattern mining methods to very large data sets and the restriction to handling only binary data, we have introduced a novel efficient algorithm (FCPMiner) to identify frequent closed patterns in both binary and extended {-1,0,1} data. The nature of our representation and the algorithm ensures that we miss no existing inclusion-maximal bicluster, contrary to other widely applied biclustering methods such as BiBit or QUBIC. The ability to handle three value categories can be extremely valuable within many application fields, for example in gene expression data analysis, as has been shown in this work. Applying pattern mining methods to very large data sets also easily results in large numbers of detected patterns, many of them partially overlapping, which poses a considerable challenge for the interpretation of the results. To address this we have introduced a novel visualization technique that rearranges the original data matrix based on the discovered closed patterns, and a pattern aggregation method allowing rapid detection of the most meaningful patterns. The developed methods can be applied to various application fields ranging from supermarket transaction data to banking and credit card transaction histories to biomedical data collections.
Since a significant part of the data appearing in data mining problems is presented in non-binary format, an evident improvement of our method is to make it capable of handling integer or real numbers without any transformation, i.e. extending the domain of our approach. Data are evolving continuously, especially in web log
http://pr.mk.uni-pannon.hu:80/Research/FCPMiner/.
Eren, K., Deveci, M., Küçüktunç, O., & Çatalyürek, Ü. V. (2013). A comparative analysis of biclustering algorithms for gene expression data. Briefings in Bioinformatics, 14(3), 279–292.
Fimi03: Workshop on frequent itemset mining implementations (2003). In B. Goethals, & M. J. Zaki (Eds.), IEEE international conference on data mining workshop on frequent itemset mining implementations, Melbourne, Florida, USA.
Fimi04: Workshop on frequent itemset mining implementations (2004). In R. Bayardo, B. Goethals, & M. J. Zaki (Eds.), IEEE international conference on data mining workshop on frequent itemset mining implementations, Brighton, UK.
Freitas, A. V., Ayadi, W., Elloumi, M., Oliveira, J., & Hao, J.-K. (2013). Biological knowledge discovery handbook: Preprocessing, mining and postprocessing of biological data (Vol. 23). John Wiley & Sons [Chap. Survey on biclustering of gene expression data].
Gomariz, A., Campos, M., Marin, R., & Goethals, B. (2013). ClaSP: An efficient algorithm for mining frequent closed sequences. In Advances in knowledge discovery and data mining (pp. 50–61). Springer.
Grahne, G., & Zhu, J. (2003). Efficiently using prefix-trees in mining frequent itemsets. In FIMI03 workshop on frequent itemset mining implementations (pp. 123–132).
Gyenesei, A., Wagner, U., Barkow-Oesterreicher, S., Stolte, E., & Schlapbach, R. (2007). Mining co-regulated gene profiles for the detection of functional associations in gene expression data. Bioinformatics, 23(15), 1927–1935.
Han, J., Cheng, H., Xin, D., & Yan, X. (2007). Frequent pattern mining: Current status and future directions. Data Mining and Knowledge Discovery, 15, 55–86. http://dx.doi.org/10.1007/s10618-006-0059-1.
Harris, M., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., et al. (2004). The gene ontology (GO) database and informatics resource. Nucleic Acids Research, 32(Database issue), D258–D261.
Hartigan, J. A. (1972). Direct clustering of a data matrix. Journal of the American Statistical Association (JASA), 67(337), 123–129.
Heinrich, J., Seifert, R., Burch, M., & Weiskopf, D. (2011). Bicluster viewer: A visualization tool for analyzing gene expression data. In Advances in visual computing (pp. 641–652). Springer.
Huband, J., Bezdek, J., & Hathaway, R. (2005). bigVAT: Visual assessment of cluster tendency for large data sets. Pattern Recognition, 38(11), 1875–1886.
Ihmels, J., Bergmann, S., & Barkai, N. (2004). Defining transcription modules using large-scale gene expression data. Bioinformatics, 20(13), 1993–2003.
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., & Hattori, M. (2004). The KEGG resource for deciphering the genome. Nucleic Acids Research, 32(suppl 1), D277–D280.
Kantardzic, M. (2002). Data mining: Concepts, models, methods, and algorithms. Wiley-IEEE Press.
Király, A., Abonyi, J., Laiho, A., & Gyenesei, A. (2012). Biclustering of high-throughput gene expression data with bicluster miner. In 2012 IEEE 12th international conference on data mining workshops (ICDMW) (pp. 131–138).
Koh, K. P., Yabuuchi, A., Rao, S., Huang, Y., Cunniff, K., Nardone, J., et al. (2011). Tet1 and Tet2 regulate 5-hydroxymethylcytosine production and cell lineage specification in mouse embryonic stem cells. Cell Stem Cell, 8, 200–213.
Kriegel, H.-P., Kröger, P., & Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1), 1.
Leiserson, C. E., Rivest, R. L., Stein, C., & Cormen, T. H. (2001). Introduction to algorithms. The MIT Press.
Li, G., Ma, Q., Tang, H., Paterson, A. H., & Xu, Y. (2009). QUBIC: A qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Research, 37(15), e101.
Liu, G., Lu, H., Lou, W., & Yu, J. X. (2003). On computing, storing and
querying frequent patterns. In Proceedings of the ninth ACM SIGKDD