Abstract—This paper presents the IMine index, a general and compact structure which provides tight integration of item set extraction in a relational DBMS. Since no constraint is enforced during the index creation phase, IMine provides a complete representation of the original database. To reduce the I/O cost, data accessed together during the same extraction phase are clustered on the same disk block. The IMine index structure can be efficiently exploited by different item set extraction algorithms. In particular, IMine data access methods currently support the FP-growth and LCM v.2 algorithms, but they can straightforwardly support the enforcement of various constraint categories. The IMine index has been integrated into the PostgreSQL DBMS and exploits its physical-level access methods. Experiments, run on both sparse and dense data distributions, show the efficiency of the proposed index and its linear scalability, even for large data sets. Item set mining supported by the IMine index shows performance always comparable with, and often (especially for low supports) better than, that of state-of-the-art algorithms accessing data on flat file.
1 INTRODUCTION
Each subpath is normalized to p node support. For example, the first prefix path, once normalized to [p:3], is [p:3 → d:3 → h:3 → e:3 → b:3].

2.2.2 Loading the Support-Based Projected Database
The support-based projection of relation R contains all transactions in R intersected with the items which are frequent with respect to a given support threshold (MinSup). The I-Tree paths completely represent the database transactions. Items are sorted by decreasing support along the paths. Thus, the support-based projection of R is given by the I-Tree subpaths between the I-Tree roots and the first node with an infrequent item.

The Load_Support_Projected_DB data access method loads the support-projected database (see Fig. 4). It is based on the function Load_SubTree, which reads a node subtree by means of a top-down depth-first I-Tree visit exploiting both the node child and brother pointers. First, frequent items are stored in set I_MinSup. Item support is read from the appropriate I-Btree leaf (function get_freq_items in line 1). Then, function Load_SubTree is invoked on each I-Tree root to read its subtree (line 5). Starting from a root node, the I-Tree is visited depth-first by following the node child pointer (lines 12-14). The visit ends when a node with an infrequent item (lines 8-10) or a node with no children (lines 15-16) is reached. The read subpath is added to the in-memory representation of the projected database, denoted as D_MinSup. Then, the search backtracks to the most recent node whose exploration is not finished, i.e., a node with (at least) one brother node. By following the brother pointer in the node, the brother node is read and the visit of the I-Tree is restarted from there (lines 17-19). Since the I-Tree roots

2.3 IMine Physical Organization
The physical organization of the IMine index is designed to minimize the cost of reading the data needed for the current extraction process. The I-Btree allows a selective access to the I-Tree paths of interest. Hence, the I/O cost is mainly given by the number of disk blocks read to load the required I-Tree paths.

When visiting the I-Tree, nodes are read from table T_ITree by using their exact physical location. However, fetching a given record requires loading the entire disk block where the record is stored. On the other hand, once the block is in the DBMS buffer cache, reading the other nodes in the block does not entail additional I/O cost. Hence, to reduce the I/O cost, correlated index parts, i.e., parts that are accessed together during the extraction task, should be clustered into the same disk block. The I-Tree physical organization is based on the following correlation types:

- Intratransaction correlation. Extraction algorithms consider together items occurring in the same transaction. Items appearing in a transaction are thus intrinsically correlated. To minimize the number of read blocks, each I-Tree path should be partitioned in a limited number of blocks.
- Intertransaction correlation. Transactions with some items in common will be accessed together when item sets including the common items are extracted. Hence, they should be stored in the same blocks. In the I-Tree, the common prefix of different transactions is represented by a single path. To further increase transaction clustering, subpaths with a given percentage of items in common should be stored in the same disk block.
BARALIS ET AL.: IMINE: INDEX SUPPORT FOR ITEM SET MINING 497
We devised a physical organization for the I-Tree which exploits both correlation types. For intratransaction correlation, I-Tree paths are partitioned into three layers, based on node access frequency (Section 2.3.1). Nodes in each layer correspond to items within the same support range. Path correlation is then analyzed among the subpaths in each layer (Section 2.3.2). A general discussion of the I/O performed by IMine is presented in Section 2.3.3.

2.3.1 I-Tree Layers
The I-Tree is partitioned in three layers based on the node access frequency during the extraction processes. The frequency in accessing a node (and thus the subpath including it) depends on the interaction of three factors: 1) the node level in the I-Tree, i.e., its distance from the root, 2) the number of paths including it, represented by the node support, and 3) the support of its item. When an item has very low support, it will be very rarely accessed, because it will be uninteresting for most support thresholds. Nodes located in lower levels of the I-Tree are associated to items with low support. The three layers are shown in Fig. 2a for the example I-Tree.

Top layer. This layer includes nodes that are very frequently accessed during the mining process. These nodes are located in the upper levels of the I-Tree. They correspond to items with high support, which are distributed over few nodes with high node support. These items can be characterized by considering the average support of their nodes. As a first attempt, items whose average node support is larger than or equal to a given threshold Kavg may be assigned to the Top layer. Unfortunately, average node support is not monotone with respect to the item ordering on I-Tree paths (i.e., by descending item support and ascending lexicographical order). For example, items {b, e, h} and {i}, which, respectively, precede and follow item a on the I-Tree paths, all have an average node support higher than that of a. Hence, by directly selecting items based on average node support, tree paths could be unintentionally broken by omitting items that do not meet the Kavg threshold. To prevent this effect, items are selected according to the item ordering on the I-Tree paths. First, the number of items (#Items) with average node support ≥ Kavg is computed. Next, the first #Items most frequent items occurring in the I-Tree are selected. Items are chosen in the same order they are entered in the I-Tree paths. The nodes containing the selected items are all stored in the Top layer. The pseudocode for Top layer item selection is provided in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.180. Kavg ranges between 1 and the total number of transactions. For Kavg = 1, the Top layer includes all items. Experiments run on different data sets (see Section 4) allowed us to select Kavg = 1.2. The first node in each path whose item is not assigned to the Top layer is the root of the paths in the Middle part of the tree.

In the example in Fig. 2a, let Kavg be 2.5 (the default value is inappropriate for such a small data set). Six items out of 22 have average node support ≥ Kavg, i.e., {b, e, h, d, i, p}. Hence, the first six items occurring in the I-Tree paths are selected based on the item ordering, i.e., {b, e, h, a, d, i}. Nodes associated to these items are in the Top layer. We observe that item a has average node support equal to 2, and items {b, e, h} and {i}, preceding and following a on the I-Tree paths, have average node support 2.5. If item selection were based on Kavg, item a would be omitted and paths including it would be broken (e.g., subpath [b:10, a:3, i:3]). Finally, items p, d, and i have the same support. However, only items d and i are selected since they precede p in lexicographical order.

Middle layer. This layer includes nodes that are quite frequently accessed during the mining process. These nodes are typically located in the central part of the tree. They correspond to items with relatively high support, but not yet dispersed on a large number of nodes with very low node support. We include in the Middle layer nodes with (node) support larger than 1. Unitary support nodes are rather rarely accessed and should be excluded from the Middle layer.

Bottom layer. This layer includes the nodes corresponding to rather low support items, which are rarely accessed during the mining process. Nodes in this layer are analyzed only when mining frequent item sets for very low support thresholds. The Bottom layer is characterized by a huge number of paths which are (possibly long) chains of nodes with unitary support. These subpaths represent (a portion of) a single transaction and are thus read only a few times. A large number of low support items is included in this layer.

2.3.2 I-Tree Path Correlation
Correlation among the subpaths within each layer is analyzed to optimize the index storage on disk. Two paths are correlated when a given percentage of items is common to both paths. Searching for optimal correlation is computationally expensive since all pairs of paths should be checked. As an alternative, we propose a heuristic technique to detect correlation with reduced computation cost. The technique is based on an "asymmetric" definition of correlation. A reference path, named the pivot, is selected. Then, the correlation of the other paths with the pivot is analyzed. The pivot and its correlated paths are stored in the same disk block.

Since each node may be shared by many paths, redundancy in storing the paths might be introduced. To prevent this effect, paths are partitioned into nonoverlapping parts, named tracks. Each node (even if shared among several paths) belongs to a single track. Correlation between track pairs is then analyzed.

Tracks are computed separately in each layer. Each layer is bound by two borders, named the upper and lower border, which contain, respectively, the root and the tail nodes of the subpaths in the layer. For a given layer, track computation starts from the nodes in its lower border. Each node in the (lower) border is the tail node of a different track. Nodes are considered based on their order in the lower border. For each tail node, its prefix path is visited bottom-up. The visit ends when a node already included in a previously computed track, or included in the upper border of the layer, is reached. All visited nodes are assigned to the new track. The pseudocode for track computation is provided in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.180.

As an example, Fig. 2a shows how paths in the Top and Middle layers are partitioned in tracks (tracks are represented as dashed boxes). The top layer upper border
498 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 4, APRIL 2009
support, which is already known when the I-Tree creation begins. A reference item support threshold Sw is identified. All items with support lower than Sw should appear only in I-Tree nodes with unitary node support. The subpaths starting with one of these items are directly written on disk.

Threshold Sw is computed by means of the average value Sup of item support. We have observed experimentally that most items with support lower than Sup occur only in nodes with unitary node support. Sup is usually rather low (e.g., 0.02 percent) and the standard deviation is very high (i.e., at least three orders of magnitude larger than Sup). Threshold Sw is computed as Sw = KSup · Sup, where 0 ≤ KSup ≤ 1. When KSup = 1, then Sw = Sup, while when KSup = 0, no I-Tree part is written on disk before the I-Tree creation ends. We experimentally set KSup = 0.05 (see Section 4.5) to deal with data sets whose I-Tree does not fit in main memory.

This approach allows us to build the complete IMine index for large databases without enforcing any constraint, while also reducing index creation time. A slight overhead in terms of required disk space may be introduced when the data distribution is dense. In this case, some nodes may actually have support higher than one. Thus, a single index subpath is instead represented by two different paths. The I-Tree compactness may be reduced and some overhead in data access may be introduced. Experiments showed that this slight overhead does not affect data loading performance during the extraction phase.

3 ITEM SET MINING
Several algorithms have been proposed for item set extraction. These algorithms differ mainly in the adopted main memory data structures and in the strategy to visit the search space. The IMine index can support all these different extraction strategies. Since the IMine index is a disk-resident data structure, the process is structured in two sequential steps: 1) the needed index data is loaded and 2) item set extraction takes place on the loaded data. The data access methods presented in Section 2.2 allow effectively loading the data needed for the current extraction phase. Once data are in memory, the appropriate algorithm for item set extraction can be applied. In Section 3.1, frequent item set extraction by means of two representative state-of-the-art approaches, i.e., FP-growth [3] and LCM v.2 [14], is described. Section 3.2 discusses how the IMine index supports constraint enforcement.

3.1 Frequent Item Set Extraction
This section describes how frequent item set extraction takes place on the IMine index. We present two approaches, denoted as the FP-based and LCM-based algorithms, which are adaptations of the FP-growth algorithm [3] and the LCM v.2 algorithm [14], respectively.

FP-based algorithm. The FP-growth algorithm [3] stores the data in a prefix-tree structure called the FP-tree. First, it computes item support. Then, for each transaction, it stores in the FP-tree its subset including frequent items. Items are considered one by one. For each item, extraction takes place on the frequent-item projected database, which is generated from the original FP-tree and represented in an FP-tree-based structure.

The FP-based algorithm selects frequent items by means of the get_freq_items function. For each item, the corresponding projected database is loaded from the IMine index by means of the Load_Freq_Item_Projected_DB access method (Section 2.2.1). Then, the original FP-growth algorithm [3] is run. With respect to [3], the FP-based approach reduces memory occupation by loading in memory only the projection exploited in the current extraction phase. Hence, more memory space is available for the extraction process. Data access overhead has been further reduced by exploiting correlation. Items considered in sequence may occur in correlated index paths. When the next item is considered, its paths are already available in memory and do not have to be read again.

LCM-based algorithm. The LCM v.2 algorithm [14] loads in memory the support-based projection of the original database. First, it reads the transactions to count item support. Then, for each transaction, it loads the subset including frequent items. Data are represented in memory by means of an array-based data structure, on which the extraction takes place.

In the LCM-based algorithm, the database projection is read from the IMine index by means of the Load_Support_Projected_DB access method (Section 2.2.2). Data are stored in the appropriate array-based structure, on which the original LCM v.2 algorithm [14] is run. Since I-Tree paths concisely represent transactions, reading the database projection from the IMine index instead of from the original database is more effective in large databases.

3.2 Enforcing Constraints
Constraint specification allows the (human) analyst to better focus on interesting item sets for the considered analysis task. A significant research effort [10], [11], [12], [13], [15] has been devoted to the exploration of techniques to push constraint enforcement into the extraction process, thus allowing an early pruning of the search space and a more efficient extraction process. Constraints have been classified as antimonotonic, monotonic, succinct, and convertible [15]. The latter constraints (e.g., avg, sum) are neither antimonotonic nor monotonic, but they can be converted into monotonic or antimonotonic constraints by an appropriate item ordering.

Constraint enforcement in the FP-growth algorithm is discussed in [15]. This approach can be straightforwardly supported by the IMine index. More specifically, the items of interest are selected by accessing the I-Btree. For each item α, the transactions including it are read by means of the Load_Item_Projected_DB access method. Once data are in memory, the α-projected database is built by including all items which follow α in the item ordering required by constraint enforcement. Different constraint classes may require different item orderings, which enable early pruning for the considered constraint class. For example, for convertible constraints, the ordering exploited to convert the constraint is enforced. Since the Load_Item_Projected_DB access method loads in memory the entire transaction set including α, the required order can always be enforced. The appropriate extraction algorithm performs the extraction by recursive projections of the α-projected database. Constraints are enforced during the iteration steps.
TABLE 1
Data Set Characteristics and Corresponding Indices
The approach in [15] can be easily extended to deal with multiple constraints. Constraints may belong to the same category (e.g., all antimonotonic) or to different categories (e.g., antimonotonic and monotonic). The IMine index provides a flexible information storage structure which can easily support this extension as well.

4 EXPERIMENTAL RESULTS
We validated our approach by means of a large set of experiments addressing the following issues:

1. performance of the IMine index creation, in terms of both creation time and index size,
2. performance of frequent item set extraction, in terms of execution time, memory usage, and I/O access time,2
3. effect of the DBMS buffer cache size on hit rate,3
4. effect of the index layered organization,
5. effect of direct writing, and
6. scalability of the approach.

2. Because of space constraints, I/O access time is analyzed in Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.180.
3. Because of space constraints, this effect is discussed in Appendix C, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.180.

We ran the experiments for both dense and sparse data distributions. We report experiments on six representative data sets whose characteristics (i.e., transaction and item cardinality, average transaction size (AvgTrSz), and data set size) are in Table 1. Connect and Pumsb [22] are dense and medium-size data sets. Kosarak [22] is a large and sparse data set including click-stream data. T10I200P20D2M is a dense and large synthetic data set, while T15I100P20C1D5M and T20I100P15C1D7M are quite sparse and large synthetic data sets. Synthetic data sets are generated by means of the IBM generator [23]. For all data sets, the index has been generated without enforcing any support threshold.

Both the index creation procedure and the item set extraction algorithms are coded into the PostgreSQL v. 7.3.4 open source DBMS [16]. They have been developed in ANSI C. Experiments have been performed on a 2,800-MHz Pentium IV PC with 2-Gbyte main memory running Linux kernel v. 2.7.81. The buffer cache of the PostgreSQL DBMS has been set to the default size of 64 blocks (block size is 8 Kbytes). All reported execution times are real times, including both system and user time, obtained from the Unix time command as in [22].

4.1 Index Creation and Structure
Table 1 reports both the I-Tree and the I-Btree size for the six data sets. The overall IMine index size is obtained by summing both contributions. The IMine indices have been created with the default value Kavg = 1.2. Furthermore, the Connect, Pumsb, Kosarak, and T10I200P20D2M data sets have been created with KSup = 0, while the large synthetic data sets with KSup = 0.05. These parameters are further discussed in Sections 4.4 and 4.5.

The adopted I-Tree representation is more suitable for dense data distributions, for which it provides good data compression. In dense data sets (e.g., T10I200P20D2M), where data are highly correlated, the I-Tree structure is more compact. In sparse data sets (e.g., Kosarak), where data are weakly correlated, data compression is low and storing the I-Tree requires more disk blocks. For example, with respect to the original database, the I-Tree for T10I200P20D2M has a lower size, for T15I100P20C1D5M it is quite equivalent, and for Kosarak and T20I100P15C1D7M it is larger.4 The I-Btree contains pointers to all nodes in the I-Tree. Hence, its size is proportional to the total number of I-Tree nodes.

4. The Pumsb data set, which is characterized by a highly dense upper part of the tree and a rather sparse bottom part, is an exception. Its I-Tree size is larger due to the overhead yielded by storing IMine control information for the bottom part of the tree.

The experiments in Section 4.2 show that the index size does not negatively affect the extraction performance, because the additional information IMine includes with respect to the original database (e.g., node pointers) supports effective data access. Furthermore, since the index is stored on disk, only the data needed for the current extraction process are actually loaded in memory.

Table 1 also shows the index creation time, which is mainly due to path correlation analysis and the storage of the index paths on disk. The first factor depends on the number of I-Tree paths. The latter, which is dominant, is proportional to the number of I-Tree nodes. Both contributions increase for larger and sparser data sets, for which the I-Tree size increases (see Section 4.4).

4.2 Frequent Item Set Extraction Performance
The IMine structure is independent of the extraction algorithm. To validate its generality, we compared the FP-based and LCM-based algorithms with three very effective state-of-the-art algorithms accessing data on flat file, i.e., Prefix-tree [17], LCM v.2 [14] (the FIMI '03 and FIMI '04 best implementation algorithms), and FP-growth [3]. We analyzed the runtime with various supports. Since our objective is to
Fig. 6. Frequent item set extraction time for the FP-based algorithm. (a) Connect. (b) Pumsb. (c) Kosarak. (d) T10I200P20D2M.
(e) T15I100P20C1D5M. (f) T20I100P15C1D7M.
compare the performance of the approaches on the extraction phase, the time for writing the generated item sets is not included. This time is comparable for all approaches.

Fig. 6 compares the FP-based algorithm with the Prefix-tree [17] and FP-growth [3] algorithms on flat file, all characterized by a similar extraction approach. For real data sets (Connect, Pumsb, and Kosarak), differences in CPU time between the FP-based and Prefix-Tree algorithms are not visible for high supports, while for low supports the FP-based approach always performs better than Prefix-Tree. For large synthetic data sets, the CPU time required by the FP-based algorithm is always smaller than that required by Prefix-Tree. The FP-based algorithm adopts exactly the same extraction approach as FP-growth. For low supports, it is significantly faster than FP-growth. FP-growth did not correctly terminate the extraction task on the Kosarak data set in the considered support range.

The FP-based algorithm, albeit implemented inside a relational DBMS, yields performance always comparable with, and sometimes better than, the other algorithms. This performance is mainly due to the optimized storage of correlated data in few blocks, which supports selective data loading in memory. Hence, I/O costs are limited and significantly more memory space is available for the extraction task. This effect is particularly relevant for low supports, because representing a large portion of the data set in memory may significantly reduce the space for the extraction task, hence causing more memory swaps.

As shown in Fig. 7, the LCM-based approach provides an extraction time comparable to LCM v.2 on flat file. For large data sets, it performs better than LCM v.2. Since I-Tree paths compactly represent the transactions, reading the needed data through the index requires a lower number of I/O operations with respect to accessing the flat file representation of the data set. This benefit increases when the data set is larger and more correlated.

4.3 Memory Consumption
Fig. 8 reports the peak main memory required by the two IMine-based algorithms during item set extraction. For both algorithms, the PostgreSQL buffer cache (with default size 512 Kbytes) is included in the memory held by the process. The FP-based and LCM-based algorithms are compared with the Prefix-Tree and LCM v.2 algorithms, respectively. The Kosarak (sparse) and Pumsb (dense) data sets are discussed as representative examples.

On the Kosarak data set, the peak main memory is always significantly higher for Prefix-Tree than for the FP-based algorithm, while on the Pumsb data set the difference between the two approaches is less relevant. The FP-based approach selectively loads in memory only the data required for the current extraction phase. Instead, the Prefix-Tree algorithm stores in memory all the data needed for the whole extraction process. Memory consumption for Prefix-Tree increases when larger data sets or lower supports are considered, because more data are stored in memory. The slightly higher memory consumption shown on the Pumsb data set by the FP-based algorithm is due to the buffer cache size.

Since the LCM-based and LCM v.2 algorithms store in memory all data needed during the extraction process, both algorithms show the same peak main memory behavior. Also in this case, the slight difference between the two algorithms is due to the PostgreSQL buffer cache.

We also evaluated the average and total memory allocated during the extraction process (not reported here for lack of space). For both IMine-based algorithms, average
Fig. 7. Frequent item set extraction time for the LCM-based algorithm. (a) Connect. (b) Pumsb. (c) Kosarak. (d) T10I200P20D2M.
(e) T15I100P20C1D5M. (f) T20I100P15C1D7M.
and peak memory are close. Thus, memory consumption is rather constant in time. The difference in both average and total memory is more significant between the FP-based and Prefix-Tree algorithms. Since the buffer cache size is kept constant, FP-based memory requirements are rather stable with decreasing support thresholds, while for Prefix-Tree they increase significantly. FP-growth needs at least five times more memory than the other approaches,5 because it keeps in memory the whole frequent subset of the data set during the entire mining process.
Fig. 10. Effect of layer partitioning. (a) On hit rate. (b) On extraction time. (c) On index creation time.
4.6 Scalability
The scalability of the FP-based and LCM-based algorithms
has been analyzed by varying 1) the number of transac-
tions and 2) the pattern length. All data sets were
generated by means of the IBM generator [23]. All IMine Fig. 13. Scalability on P10DxM data sets (x is the # of transactions
indices have been created with the values Kavg ¼ 1:2 and expressed in millions). (a) FP-based algorithm. (b) LCM-based
KSup ¼ 0:05. The experiments in this section have been algorithm.
performed on a dual-processor quad-core Intel Xeon
Processor E5430 with 8-Gbyte main memory running 5 RELATED WORK
Linux kernel v. 2.6.17-10-generic.6 The wide diffusion of data warehouses caused an increased
To analyze the scalability of the IMine-based algorithms interest in coupling the item set extraction task with
with respect to the number of transactions we considered relational DBMSs. Different levels of coupling were
large databases of size ranging from 10M (million) proposed, ranging from loose coupling, given by SQL
(database size 1.8 Gbytes) to 40M (8 Gbytes) transactions statements exploiting a traditional cursor interface [24], to
with an average length of 12 items. These data sets have tight coupling, which proposed various kinds of optimiza-
100,000 items, pattern length 10, and correlation grade 0.9. tions in managing the interaction with the DBMS [25], [26].
Fig. 13 plots the extraction time for various supports. Both A parallel effort was devoted to the definition of expressive
algorithms scale well even for very large data sets. Since the query languages, such as DMQL [27] or MINE RULE [28],
number of extracted item sets grows significantly for very to specify mining requests. These languages often proposed
low supports (i.e., 0.1 percent), the process becomes SQL extensions that support the specification of various
computationally more expensive. However, the overall CPU time is still low, less than 1,000 seconds for the lowest considered support.

To analyze the scalability with respect to the pattern length, we considered transactional databases with 10M transactions and pattern lengths ranging from 10 (database size 1.8 Gbytes) to 20 (3 Gbytes), with 100,000 items and correlation grade 0.9. Fig. 14 plots the extraction time for the IMine-based algorithms with different support thresholds. By increasing the pattern length, the data set becomes denser.7 Hence, the I-Tree compactness increases. The extraction time is linear for a broad support range. As expected, correlation is beneficial for long patterns. For very low supports, the extraction time increases, but still to a limited extent (less than 600 seconds for the lowest support).

6. The current implementation of the IMine index does not exploit multithreading processors.
7. This behavior has been observed empirically. It is not monotone and not documented in [23].

types of constraints on the knowledge to be extracted (see [29] for a review of the most important languages).

Chaudhuri et al. [30] go one step further toward a tighter integration by proposing techniques to optimize queries that include mining predicates. Knowledge of the mining model is exploited to derive, from mining predicates, simpler predicate expressions that can be exploited for access path selection like traditional database predicates.

Different disk-based data structures for frequent item set extraction from flat files have been proposed [7], [8], [9]. Ramesh et al. [9] propose B+tree-based indices to access data stored by means of either a vertical or a horizontal data representation. Item set extraction is performed by means of both APRIORI [1] and Eclat [31]. However, performance is usually worse than, or at best comparable to, flat file mining. In [32], an index structure based on signature files is proposed to drive the generation of candidate frequent item sets. However, to check the actual candidate frequency, the data set needs to be further accessed. More
BARALIS ET AL.: IMINE: INDEX SUPPORT FOR ITEM SET MINING 505
approach would be to incrementally update the index when new data become available. Since no support threshold is enforced during the index creation phase, the incremental update is feasible without accessing the original transactional database.

REFERENCES

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), Sept. 1994.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD '93, May 1993.
[3] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD, 2000.
[4] H. Mannila, H. Toivonen, and A.I. Verkamo, "Efficient Algorithms for Discovering Association Rules," Proc. AAAI Workshop Knowledge Discovery in Databases (KDD '94), pp. 181-192, 1994.
[5] A. Savasere, E. Omiecinski, and S.B. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st Int'l Conf. Very Large Data Bases (VLDB '95), pp. 432-444, 1995.
[6] H. Toivonen, "Sampling Large Databases for Association Rules," Proc. 22nd Int'l Conf. Very Large Data Bases (VLDB '96), pp. 134-145, 1996.
[7] M. El-Hajj and O.R. Zaiane, "Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD), 2003.
[8] G. Grahne and J. Zhu, "Mining Frequent Itemsets from Secondary Memory," Proc. IEEE Int'l Conf. Data Mining (ICDM '04), pp. 91-98, 2004.
[9] G. Ramesh, W. Maniatty, and M. Zaki, "Indexing and Data Access Methods for Database Mining," Proc. ACM SIGMOD Workshop Data Mining and Knowledge Discovery (DMKD), 2002.
[10] Y.-L. Cheung, "Mining Frequent Itemsets without Support Threshold: With and without Item Constraints," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 9, pp. 1052-1069, Sept. 2004.
[11] G. Cong and B. Liu, "Speed-Up Iterative Frequent Itemset Mining with Constraint Changes," Proc. IEEE Int'l Conf. Data Mining (ICDM '02), pp. 107-114, 2002.
[12] C.K.-S. Leung, L.V.S. Lakshmanan, and R.T. Ng, "Exploiting Succinct Constraints Using FP-Trees," SIGKDD Explorations Newsletter, vol. 4, no. 1, pp. 40-49, 2002.
[13] R. Srikant, Q. Vu, and R. Agrawal, "Mining Association Rules with Item Constraints," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining (KDD '97), pp. 67-73, 1997.
[14] T. Uno, M. Kiyomi, and H. Arimura, "LCM ver. 2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets," Proc. IEEE ICDM Workshop Frequent Itemset Mining Implementations (FIMI), 2004.
[15] J. Pei, J. Han, and L.V.S. Lakshmanan, "Pushing Convertible Constraints in Frequent Itemset Mining," Data Mining and Knowledge Discovery, vol. 8, no. 3, pp. 227-252, 2004.
[16] PostgreSQL, http://www.postgresql.org, 2008.
[17] G. Grahne and J. Zhu, "Efficiently Using Prefix-Trees in Mining Frequent Itemsets," Proc. IEEE ICDM Workshop Frequent Itemset Mining Implementations (FIMI '03), Nov. 2003.
[18] G. Moerkotte, "Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing," Proc. 24th Int'l Conf. Very Large Data Bases (VLDB '98), pp. 476-487, 1998.
[19] K. Chakrabarti and S. Mehrotra, "The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces," Proc. 15th Int'l Conf. Data Eng. (ICDE '99), pp. 440-447, 1999.
[20] A. Pietracaprina and D. Zandolin, "Mining Frequent Itemsets Using Patricia Tries," Proc. IEEE ICDM Workshop Frequent Itemset Mining Implementations (FIMI), 2003.
[21] R. Bayer and E.M. McCreight, "Organization and Maintenance of Large Ordered Indices," Acta Informatica, vol. 1, pp. 173-189, 1972.
[22] FIMI, http://fimi.cs.helsinki.fi/, 2008.
[23] R. Agrawal, T. Imielinski, and A. Swami, "Database Mining: A Performance Perspective," IEEE Trans. Knowledge and Data Eng., vol. 5, no. 6, Dec. 1993.
[24] S. Nestorov and S. Tsur, "Integrating Data Mining with Relational DBMS: A Tightly-Coupled Approach," Proc. Fourth Workshop Next Generation Information Technologies and Systems (NGITS), 1999.
[25] R. Agrawal and K. Shim, "Developing Tightly-Coupled Data Mining Applications on a Relational Database System," Proc. Second Int'l Conf. Knowledge Discovery in Databases and Data Mining (KDD), 1996.
[26] S. Sarawagi, S. Thomas, and R. Agrawal, "Integrating Mining with Relational Database Systems: Alternatives and Implications," Proc. ACM SIGMOD, 1998.
[27] J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane, "DMQL: A Data Mining Query Language for Relational Databases," Proc. ACM SIGMOD Workshop Data Mining and Knowledge Discovery (DMKD), 1996.
[28] R. Meo, G. Psaila, and S. Ceri, "A New SQL-Like Operator for Mining Association Rules," Proc. 22nd Int'l Conf. Very Large Data Bases (VLDB), 1996.
[29] M. Botta, J.-F. Boulicaut, C. Masson, and R. Meo, "A Comparison between Query Languages for the Extraction of Association Rules," Proc. Fourth Int'l Conf. Data Warehousing and Knowledge Discovery (DaWaK), 2002.
[30] S. Chaudhuri, V. Narasayya, and S. Sarawagi, "Efficient Evaluation of Queries with Mining Predicates," Proc. 18th Int'l Conf. Data Eng. (ICDE), 2002.
[31] M. Zaki, "Scalable Algorithms for Association Mining," IEEE Trans. Knowledge and Data Eng., vol. 12, no. 3, pp. 372-390, May/June 2000.
[32] B. Lan, B. Ooi, and K.-L. Tan, "Efficient Indexing Structures for Mining Frequent Patterns," Proc. 18th Int'l Conf. Data Eng. (ICDE), 2002.
[33] E. Baralis, T. Cerquitelli, and S. Chiusano, "Index Support for Frequent Itemset Mining in a Relational DBMS," Proc. 21st Int'l Conf. Data Eng. (ICDE), 2005.
[34] G. Liu, H. Lu, W. Lou, and J.X. Yu, "On Computing, Storing and Querying Frequent Patterns," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD), 2003.

Elena Baralis received the master's degree in electrical engineering and the PhD degree in computer engineering from Politecnico di Torino, Torino, Italy. She has been a full professor in the Dipartimento di Automatica e Informatica, Politecnico di Torino, since January 2005. Her current research interests are in the field of databases, in particular data mining, sensor databases, and bioinformatics. She has published numerous papers in journals and conference proceedings. She has managed several Italian and EU research projects.

Tania Cerquitelli received the master's degree in computer engineering and the PhD degree from the Politecnico di Torino, Torino, Italy, and the master's degree in computer science from the Universidad De Las Américas Puebla. She has been a postdoctoral researcher in computer engineering in the Dipartimento di Automatica e Informatica, Politecnico di Torino, since January 2007. Her research interests include the integration of data mining algorithms into database system kernels and sensor databases.

Silvia Chiusano received the master's degree and the PhD degree in computer engineering from Politecnico di Torino, Torino, Italy. She has been an assistant professor in the Dipartimento di Automatica e Informatica, Politecnico di Torino, since January 2004. Her current research interests include the areas of data mining and database systems, in particular the integration of data mining techniques into relational DBMSs and the classification of structured and sequential data.