
Journal Pre-proof

EFFICIENT METHODS FOR MINING WEIGHTED CLICKSTREAM PATTERNS

Huy M. Huynh, Loan T.T. Nguyen, Bay Vo, Anh Nguyen, Vincent S. Tseng

PII: S0957-4174(19)30710-9
DOI: https://doi.org/10.1016/j.eswa.2019.112993
Reference: ESWA 112993

To appear in: Expert Systems With Applications

Received date: 19 April 2019
Revised date: 29 September 2019
Accepted date: 29 September 2019

Please cite this article as: Huy M. Huynh, Loan T.T. Nguyen, Bay Vo, Anh Nguyen, Vincent S. Tseng, EFFICIENT METHODS FOR MINING WEIGHTED CLICKSTREAM PATTERNS, Expert Systems With Applications (2019), doi: https://doi.org/10.1016/j.eswa.2019.112993

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier Ltd.


Highlights

• We propose a weight measure for mining frequent weighted clickstream patterns.
• We extend CM-SPADE to develop an effective algorithm.
• We propose a pruning heuristic for mining frequent weighted clickstream patterns.
• We present an optimised data structure to mine in large databases.

EFFICIENT METHODS FOR MINING WEIGHTED CLICKSTREAM PATTERNS

Huy M. Huynh 1, Loan T.T. Nguyen 2, Bay Vo 3, Anh Nguyen 4, Vincent S. Tseng 5

1 Institute of Research and Development, Duy Tan University, Da Nang 550000, Viet Nam
huy.hm88@gmail.com
2 School of Computer Science and Engineering, International University - VNU-HCM, Ho Chi Minh City, Vietnam
nttloan@hcmiu.edu.vn
3 Faculty of Information Technology, Ho Chi Minh City University of Technology (HUTECH), Ho Chi Minh City, Vietnam
vd.bay@hutech.edu.vn
4 Faculty of Computer Science and Management, Wroclaw University of Science and Technology, Wroclaw, Poland
nguyenanhvn9@gmail.com
5 Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan
vtseng@cs.nctu.edu.tw

Abstract: Pattern mining has been an attractive topic for many researchers since its first introduction. Clickstream mining, a specific form of sequential pattern mining, has been shown to be important in the age of the Internet. However, most previous works have simply exploited and applied existing sequential pattern algorithms to the mining of clickstream patterns, and few have studied clickstreams with weights, which also have a wide range of applications. In this paper, we address this problem by proposing an approach based on the average weight measure for clickstream pattern mining, and by adapting a previous state-of-the-art algorithm to deal with the problem of weighted clickstream pattern mining. Following this, we propose an improved method named Compact-SPADE to improve both runtime efficiency and memory consumption. Through various tests on both real-life and synthetic databases, we show that our proposed algorithms outperform state-of-the-art alternatives in terms of efficiency, memory requirements and scalability.
Keywords: data mining, weighted clickstream pattern mining, sequential pattern mining.

1. Introduction

Pattern mining is an important problem in various fields of study, such as bioinformatics,


web log analysis, security and text analysis, and is considered an important task in knowledge
discovery and data mining. The first variant of the pattern mining problem was initially proposed
and solved by Agrawal, Imieliński, and Swami (1993) in relation to a problem involving
association rule mining (ARM). This problem originally involved discovering common and
helpful relations (rules) between products bought from a store (i.e. items in market baskets),
based on the customers' recorded transactions. ARM comprises two steps. The first is a
problem that was initially referred to as mining large itemsets (Agrawal, Imieliński, & Swami,
1993), and the second is the generation of association rules based on the large itemsets mined in
the first step. Large itemset mining was subsequently renamed frequent itemset mining (FIM),
and this was identified as a separate problem that attracted many researchers. In brief, FIM
involves discovering only helpful patterns of itemsets rather than their associated rules, i.e.
common items that customers often purchase together. FIM does not consider chronological
(temporal) order or sequential relationships, and this can lead to some important patterns being
overlooked or the discovery of useless patterns. Agrawal and Srikant (1995) therefore extended
the FIM problem by taking into account chronological orders and sequential relationships, and
this is called sequential pattern mining (SPM).
A clickstream pattern is a specific type of sequential pattern that has recently attracted the interest of many researchers, since it closely mirrors recorded user-machine interactions. It is a strict sequence of events in which no more than one event occurs at a time, although a given event can appear repeatedly in the same sequence. For example, as a user navigates a series of websites, each user clickstream in the database records the user's series of clicks. Each website's URL represents an element in a clickstream, one following another, and no two links should be clicked by the user at the same time. Records of Internet users' behaviors are very valuable to various e-commerce site holders. Each frequent clickstream pattern (i.e. a portion of a user clickstream that appears with high frequency throughout the database) can be analyzed to show how users interact with websites, offering great benefits to the site holders. However, clickstreams are not limited to a series of clicks on website links; other data types, such as DNA sequences and user actions on computers (e.g. deleting files, opening folders, renaming files or key presses), also fall into this category. This problem's applications lie largely in the security and web log domains, and this has led to the emergence of clickstream pattern mining.
Initially, researchers treated all items in databases equally, and most of the algorithms that have been proposed use the anti-monotonicity property (i.e. downward closure) as their foundation, meaning that frequent patterns must not contain infrequent subpatterns. Subsequently, researchers noticed that items should in fact be considered as having different levels of importance, and weights have been added to items to help distinguish useful patterns. Adding weights can complicate the mining process, as they cannot preserve anti-monotonicity without additional restrictions. The anti-monotonicity property is important, as it allows us to discard redundant search spaces and prevents the algorithms' runtime from increasing dramatically. For example, Yun and Leggett (2006) originally proposed weights for SPM, in which a pattern's weight was the average value of the weights of the pattern's items, multiplied by the pattern's support count. This approach resulted in a violation of anti-monotonicity, and to deal with this, the authors had to propose weight ranges in order to maintain the property; however, this resulted in an alteration of the original item weights given by users.

In our approach, each user clickstream has its own average weight, and these weights are taken into account when computing a pattern's weighted support. A clickstream pattern's weighted support is the total weight of all user clickstreams containing the pattern, divided by the database weight (more details are given in Section 3). The difference between our usage of weights and those of previous methods is that ours takes user clickstream weights into consideration, preserving both anti-monotonicity and the original item weights without additional restrictions or modifications.

Table 1. An example of a horizontal clickstream database

CID User clickstream


1 a,c,c,d,f
2 d,c,b,a
3 f,b,f,c,b
4 e,a,c,b,c,b,f
5 a,b,c,e

In this paper, we concentrate on solving the problem of weighted clickstream pattern mining (WCPM), an extension of SPM with weights. Our contributions are as follows:

1. We propose the use of an average weight measure as an alternative approach for mining frequent weighted clickstream patterns. This new approach preserves the original actions' weights and simplifies the mining process while maintaining anti-monotonicity.
2. We extend a previous state-of-the-art algorithm, CM-SPADE, to develop an effective algorithm, and propose a pruning heuristic that is integrated with the average weight formula for WCPM.
3. We present an optimized data structure to adapt WCPM to large databases.
4. We evaluate our algorithms through various tests using both synthetic and real-life databases.

The rest of the paper is organized as follows. In Section 2, we describe related work. In Section 3, we introduce several concepts, describe the use of the proposed average weight and define the WCPM problem in detail. In Section 4, we present our baseline method, which extends CM-SPADE (Fournier-Viger et al., 2014), to solve WCPM. In Section 5, we propose an improved data structure. In Section 6, we present a heuristic technique based on upper bound constraints to support temporal join and candidate pruning. In Section 7, we walk through a running example. The experimental results and a discussion are presented in Section 8. The final section presents our conclusions and some directions for future work.

2. Related work

The initial FIM problem was solved by Agrawal, Imieliński, and Swami (1993), and a year
later, Agrawal and Srikant proposed an improved breadth-first search algorithm named Apriori
(Agrawal & Srikant, 1994). This algorithm was based on the so-called Apriori constraint, which
uses anti-monotonicity or the downward closure property to greatly prune the search space. Since
then, the Apriori constraint has been used as a foundation for many other algorithms.
Agrawal and Srikant (1995) were the first to address the SPM problem in terms of
chronological order. The authors proposed an SPM version of the Apriori algorithm, which was
called AprioriAll. Many algorithms were proposed as SPM started to gain more attention from
researchers, and each of these can be categorized into one of three families: horizontal, vertical
or projected.

The first of these families uses a horizontal database format (see Table 1) in which each row is assigned information about a sequence ID and a list of itemsets. The most popular algorithms in this family are AprioriAll and GSP (Agrawal & Srikant, 1995), an improved version of AprioriAll with the aim of further reducing the number of redundant candidates.

a          b          c
CID Order  CID Order  CID Order
1   1      1   ∅      1   2,3
2   4      2   3      2   2
3   ∅      3   2,5    3   4
4   2      4   4,6    4   3,5
5   1      5   2      5   3

d          e          f
CID Order  CID Order  CID Order
1   4      1   ∅      1   5
2   ∅      2   ∅      2   ∅
3   ∅      3   ∅      3   1,3
4   ∅      4   1      4   7
5   ∅      5   4      5   ∅

Figure 1. A vertical clickstream database

The second group uses a vertical database format (Figure 1), in which each item or pattern has an individual data structure indicating the sequences in which the item or pattern appears and its positions in those sequences. The more popular algorithms in this family include SPADE (Zaki, 2001), SPAM (Ayres, Flannick, Gehrke, & Yiu, 2002), BitSPADE (Aseervatham, Osmani, & Viennet, 2006), PRISM (Gouda, Hassaan, & Zaki, 2010), and more recently CM-SPADE and CM-SPAM (Fournier-Viger et al., 2014). SPADE (Zaki, 2001) is one of the most efficient algorithms for SPM in the vertical family (Fournier-Viger et al., 2014); it is based on the use of equivalence classes and the decomposition of sublattices to divide the entire lattice into separate pieces, each of which can then be fitted into computer memory for processing. Ayres et al. (2002) proposed the SPAM algorithm, which uses bit manipulations to encode a given vertical database as vertical bitmaps, and introduced a depth-first search strategy to generate candidate patterns by extending a node with either an item or a new itemset. Inspired by both SPADE and SPAM, Aseervatham et al. (2006) proposed BitSPADE, which combines the best features of both: it extends SPADE by incorporating the idea of vertical bitmap databases from SPAM to form semi-vertical bitmap databases. Like SPAM, PRISM (Gouda et al., 2010) also uses a special version of vertical databases, based on primal block encoding. Recently, Fournier-Viger et al. (2014) improved SPAM and SPADE via the integration of a CMAP (i.e. a co-occurrence map), which stores co-occurrence information across a given database to prevent redundant candidates from being generated. CM-SPADE and CM-SPAM were reported to offer significant performance improvements over previous state-of-the-art methods.
The projected family can be considered a sub-branch of the horizontal family, and recursively reduces a given database into multiple smaller databases satisfying certain conditions. This process of database reduction is called projection, and the resulting databases are called projected or conditional databases. One difference between horizontal and projected methods is that projected methods always generate candidates that actually exist in the database, while their counterparts do not. The more popular algorithms in this family are FreeSpan (Han et al., 2000) and PrefixSpan (Pei et al., 2001). FreeSpan uses a frequent item matrix to keep track of frequent items and reduces the database gradually, by projecting the current database into smaller and smaller sets of sequences that contain only frequent patterns, thus eliminating the redundant sequences. With each projection, the frequent patterns grow in size, while the projected databases become smaller and database scans become faster. PrefixSpan is an improved version of FreeSpan that uses a prefix projection method rather than frequent pattern projection. PrefixSpan has been developed into more advanced algorithms to deal with various kinds of pattern mining. For example, Zhao, Yan, and Ng (2014) extended PrefixSpan to work on databases with high levels of uncertainty. Projected methods are efficient in terms of runtime, but one of their more significant drawbacks is high memory consumption, since information about the projected databases must be kept in memory.
Yun and Leggett (2006) were the first to introduce weight constraints to SPM. However, to maintain the anti-monotonicity property in their algorithm, the authors had to alter the original weights of the items. Ahmed, Tanbeer, and Jeong (2010) proposed the use of a maximum sequence weight as a restriction to preserve the anti-monotonicity property. Extending this idea, Patel, Modi, and Kalpdrum (2016) modified the weight formula of Yun and Leggett (2006) and combined it with a time interval to shift the importance towards events that occur within short time intervals.
Typical works on weights for FIM include Lee et al. (2016a), Yun (2007), Yun, Lee, and Lee (2016), and Yun, Lee, and Ryu (2014). Yun (2007) used maximum and minimum weight ranges to maintain the anti-monotonicity property for the problem of mining weighted interesting patterns. Based on the work in Grahne and Zhu (2005), Yun et al. (2014) proposed the WMFP-tree and WMFP-array and combined them with weights to mine maximal frequent patterns in data streams; the anti-monotonicity property was preserved using weight ranges. Meanwhile, Lee et al. (2016a) and Yun et al. (2016) used maximum weight constraints to preserve anti-monotonicity.
Weights have also been added to other variants of FIM, such as the erasable itemset mining problem (Lee, Yun, & Ryang, 2015; Lee et al., 2016b; Yun & Lee, 2016), as general erasable itemset mining approaches only take the products' profits (itemsets' profits) into account. Lee et al. (2015) proposed an average weight formula to consider the weights of individual items in the products to improve the reliability of discovered product patterns. Yun and Lee (2016) extended the problem to erasable itemset mining in product data streams, with the WEPS-Tree data structure used to hold additional task-specific information. Lee et al. (2016b) proposed an interesting method for mining erasable itemsets in incremental databases. Alongside an average weight formula being integrated into the incremental databases, the authors proposed two data structures, IWEI-tree and OP-List, which accommodate the added weights for incremental databases. The tree data structure is also used to avoid the problem caused by discarding 1-patterns early, and the list is used to optimize the algorithms' runtime. All these works on erasable itemset mining used some form of maximum weight constraint to preserve the anti-monotonicity property.
Unlike the abovementioned works, our proposed approach preserves anti-monotonicity through the use of average weight formulae, rather than weight ranges or maximal weights, for mining clickstream patterns. Our proposed usage of weights is somewhat similar to that in some earlier works (Lee, Yun, & Ryu, 2017; Vo, Coenen, & Le, 2013), although these studies involve the itemset mining problem, whereas our approach is applied to the clickstream mining problem.
In some cases, a user wants to mine a compact set of patterns, and to limit those patterns that are not helpful to the user's needs, researchers have begun to find more ways to restrict or alter the requirements of frequent patterns. One interesting approach that serves the same purpose as using weights is to apply multiple constraints. There have been many popular works using this approach for both FIM and SPM (Fournier-Viger et al., 2014; Fowkes & Sutton, 2016; Gan et al., 2019; Le et al., 2018; Lin & Lee, 2005; Pei, Han, & Wang, 2007; Van, Vo, & Le, 2018). Other similar approaches to pattern mining select the top-k frequent patterns rather than relying on a minimum support threshold (Kieu et al., 2017; Krishnamoorthy, 2019; Petitjean et al., 2016); alternatively, instead of mining normal patterns, closed patterns are mined (Fumarola et al., 2016; Le et al., 2017; Tran, Le, & Vo, 2015). There have also been attempts to add weights to these alternative patterns (e.g. closed patterns), such as the work in Yun, Pyun, and Yoon (2015), although this is more difficult. The reason is that adding weights to such alternative patterns can cause information loss (i.e. patterns being incorrectly discarded during the searching process). Yun et al. (2015) thus presented a detailed analysis of the cause of this issue, and proposed the CW-Span algorithm to deal with it.
Clickstream pattern mining has become important due to its wide range of applications (e.g. web log analysis, intrusion detection); however, most previous works only mine non-weighted clickstream patterns. For example, Ting et al. (2005) used general SPM algorithms to discover users' unexpected clickstream patterns to support and improve website design. Setiawan and Yahya (2018) applied sequential rules discovered from event logs to analyze human behaviors during the production process in software factories. Dalmas (2017) proposed TWINCLE to mine patterns in rich event log databases to help optimize organizational processes. Van, Yoshitaka, and Le (2018) proposed the MWAPC and EMWAPC algorithms to mine web access clickstream patterns with a super-pattern constraint. In the security domain, Pramono (2015) used the association rules generated from clickstream patterns of users' activities on websites to identify and prevent malicious behaviors.

3. Problem statement

In this section, we present some definitions in detail and describe the use of an average
weight for weighted clickstream pattern mining.
Definition 3.1: Clickstreams. We consider a set of distinct items (i.e. symbols) I = {i1, i2, ..., im} representing various actions of the same type (e.g. mouse clicks on specific links), and a set of positive real-valued weights W = {w(i) | i ∈ I} corresponding to each action (e.g. Table 2). A clickstream is a sequence of actions that takes place in chronological order and is denoted by S = ⟨a1, a2, ..., an⟩, where ak ∈ I and 1 ≤ k ≤ n (k is an integer that identifies the chronological position in S). That is, if an action ai is shown before aj in S, ai is assumed to happen before aj, and this is denoted by ai ≺ aj. A clickstream that contains n actions (n ≥ 1) is called an n-clickstream, and an action can occur more than once in the same clickstream at various positions.

For example, the clickstream S = ⟨a, c, c, d, f⟩ is a 5-clickstream with four unique action items a, c, d and f, and the action item c appears twice, at chronological positions 2 and 3 (assuming the order always starts at 1).

Table 2. Weights of actions for the example database

Action Weight
a 0.5
b 0.9
c 0.2
d 0.4
e 0.7
f 0.6

Definition 3.2: Sub- and superclickstreams. Let Sα = ⟨α1, α2, ..., αm⟩ and Sβ = ⟨β1, β2, ..., βn⟩ (m ≤ n) be two clickstreams. Sα is a subclickstream of Sβ, or Sβ is a superclickstream of Sα, and this is denoted by Sα ⊑ Sβ (Sβ contains Sα), if and only if there exists at least one one-to-one order-preserving function f mapping all elements in Sα to elements in Sβ, such that αi = f(αi) (each element of Sα is mapped to an equal element of Sβ) and if αi ≺ αj then f(αi) ≺ f(αj).

For example, both clickstreams ⟨c[2], d[4]⟩ and ⟨c[3], d[4]⟩ are subclickstreams of ⟨a[1], c[2], c[3], d[4], f[5]⟩ in Table 1. The numbers in square brackets are used here to denote the actions' chronological orders or positions in the clickstream; they can be omitted if the orders are not needed. In particular, ⟨c, d⟩ is a subclickstream that can be mapped to either ⟨c[2], d[4]⟩ or ⟨c[3], d[4]⟩ in ⟨a, c, c, d, f⟩, i.e. there exist two functions f1 and f2 satisfying the aforementioned requirements. However, ⟨d, c⟩ is not a subclickstream of ⟨a, c, c, d, f⟩.
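The containment test of Definition 3.2 can be made concrete with a minimal Java sketch (an illustration only, not part of our algorithms; the class and method names, and the use of char arrays for actions, are assumptions):

// Hypothetical sketch: checks whether 'sub' is a subclickstream of 'sup'
// via a single greedy left-to-right scan (an order-preserving mapping
// exists if and only if the greedy scan matches every action of 'sub').
public final class Clickstreams {
    public static boolean isSubclickstream(char[] sub, char[] sup) {
        int i = 0; // next action of 'sub' to match
        for (int j = 0; j < sup.length && i < sub.length; j++) {
            if (sup[j] == sub[i]) {
                i++; // matched sub[i] at position j; continue scanning forward
            }
        }
        return i == sub.length; // all actions matched in chronological order
    }

    public static void main(String[] args) {
        char[] s1 = {'a', 'c', 'c', 'd', 'f'}; // user clickstream 1 in Table 1
        System.out.println(isSubclickstream(new char[]{'c', 'd'}, s1)); // true
        System.out.println(isSubclickstream(new char[]{'d', 'c'}, s1)); // false
    }
}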
Definition 3.3: User clickstream, clickstream pattern and clickstream database. A user clickstream is one that is generated by users via various actions, such as browsing the Internet or navigating folders on a computer. A collection of user clickstreams is called a clickstream database, and is denoted by SDB. Each user clickstream in the database is associated with a unique clickstream identification number (called a cid). A clickstream database is usually in a horizontal format (Table 1) and can be converted to a vertical format (Figure 1). A clickstream pattern is a subclickstream of at least one user clickstream in the database, while a clickstream pattern candidate is a clickstream that may or may not appear in the database. In other words, P is a clickstream pattern if ∃S ∈ SDB: P ⊑ S. A clickstream pattern is called a frequent clickstream pattern if it satisfies a certain condition imposed by users (e.g. a sufficient frequency for the SPM problem, or a sufficient weight in the weighted clickstream pattern mining problem).

For example, in the example clickstream database in Table 1, ⟨a, c, c, d, f⟩ is a user clickstream with cid = 1, and ⟨c, d⟩ is considered a clickstream pattern.
Definition 3.4: Weight of a user clickstream. The weight of a user clickstream S = ⟨a1, a2, ..., an⟩ is the average weight of all its actions (each action can appear repeatedly in S) and is defined as follows:

w(S) = (∑_{k=1}^{n} w(ak)) / n

For example, considering user clickstream 1, ⟨a, c, c, d, f⟩, its weight is w(S1) = (0.5 + 0.2 + 0.2 + 0.4 + 0.6) / 5 = 0.38 (Table 3).

Table 3. Weights of user clickstreams in the example database

CID    Weight
1      0.38
2      0.5
3      0.64
4      0.57
5      0.58
Total  2.67
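As a concrete illustration of Definition 3.4 (and Definition 3.5 below), the following Java sketch computes the weights in Table 3 (our own illustration; the class and method names, and the use of Character actions, are assumptions):

import java.util.List;
import java.util.Map;

// Hypothetical sketch: w(S) = (sum of the weights of S's actions, counted
// with repetition) / |S|, and W(SDB) = the sum of w(S) over all S in SDB.
public final class Weights {
    static double clickstreamWeight(List<Character> s, Map<Character, Double> w) {
        double sum = 0.0;
        for (char a : s) sum += w.get(a);
        return sum / s.size();
    }

    static double databaseWeight(List<List<Character>> sdb, Map<Character, Double> w) {
        double total = 0.0;
        for (List<Character> s : sdb) total += clickstreamWeight(s, w);
        return total;
    }

    public static void main(String[] args) {
        Map<Character, Double> w = Map.of('a', 0.5, 'b', 0.9, 'c', 0.2,
                                          'd', 0.4, 'e', 0.7, 'f', 0.6); // Table 2
        List<Character> s1 = List.of('a', 'c', 'c', 'd', 'f');           // cid 1
        System.out.println(clickstreamWeight(s1, w)); // ≈ 0.38, as in Table 3
    }
}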

Definition 3.5: Weight of a database. Given a clickstream database SDB, the total weight of the database is the sum of the weights of its user clickstreams and is denoted by W(SDB). Specifically, it is computed as follows:

W(SDB) = ∑_{S ∈ SDB} w(S)

Definition 3.6: Weighted support of a clickstream pattern. The weighted support of a clickstream pattern P is the sum of the weights of the user clickstreams in which P appears, divided by the database weight. It is defined as follows:

ws(P) = (∑_{S ∈ SDB, P ⊑ S} w(S)) / W(SDB)

Property 3.1. Let Σ be the set of all possible patterns in the given database and 2^SDB the power set of SDB, which includes all subsets of SDB. The function ρ: Σ → 2^SDB that maps each element (i.e. a clickstream pattern) in Σ to an element (i.e. a set of user clickstreams) in 2^SDB satisfies the property of anti-monotonicity, and is represented as follows:

ρ(P) = {S ∈ SDB | P ⊑ S}, and for all P, Q ∈ Σ: P ⊑ Q ⇒ ρ(Q) ⊆ ρ(P)

In other words, ρ is a function that returns the set of user clickstreams containing the specified pattern.
Property 3.2. The weighted support measure defined above (Definition 3.6) has the property of anti-monotonicity. In other words, for two clickstream patterns P and Q in Σ: P ⊑ Q ⇒ ws(Q) ≤ ws(P).

Proof. We have:

P ⊑ Q ⇒ ρ(Q) ⊆ ρ(P) (Property 3.1)
⇒ ∑_{S ∈ ρ(Q)} w(S) ≤ ∑_{S ∈ ρ(P)} w(S)
⇒ (∑_{S ∈ ρ(Q)} w(S)) / W(SDB) ≤ (∑_{S ∈ ρ(P)} w(S)) / W(SDB)
⇒ ws(Q) ≤ ws(P). ∎

For example, for the two clickstream patterns ⟨b,c⟩ and ⟨b,c,b⟩, the sequence sets containing ⟨b,c⟩ and ⟨b,c,b⟩ are {3, 4, 5} and {3, 4}, respectively, and ws(⟨b,c⟩) = (0.64 + 0.57 + 0.58) / 2.67 ≈ 0.67 and ws(⟨b,c,b⟩) = (0.64 + 0.57) / 2.67 ≈ 0.45. We can see that ws(⟨b,c,b⟩) ≤ ws(⟨b,c⟩).
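A brute-force rendering of Definition 3.6 in Java may help make the measure concrete (our own sketch; the real algorithms in Sections 4 and 5 never rescan the database like this):

// Hypothetical sketch: ws(P) = (summed weight of the user clickstreams
// containing P) / W(SDB), using the greedy containment test of Definition 3.2.
public final class WeightedSupport {
    static double ws(char[] pattern, char[][] sdb, double[] clickstreamWeights,
                     double databaseWeight) {
        double sum = 0.0;
        for (int i = 0; i < sdb.length; i++) {
            if (contains(sdb[i], pattern)) sum += clickstreamWeights[i];
        }
        return sum / databaseWeight;
    }

    static boolean contains(char[] sup, char[] sub) {
        int i = 0;
        for (int j = 0; j < sup.length && i < sub.length; j++)
            if (sup[j] == sub[i]) i++;
        return i == sub.length;
    }
}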
Property 3.3. All subpatterns (subclickstreams) of a frequent weighted clickstream pattern are frequent. In other words, we have:

∀P ⊑ Q: ws(Q) ≥ ω ⇒ ws(P) ≥ ω

For example, consider the frequent clickstream pattern ⟨b,c,b⟩ with ws(⟨b,c,b⟩) ≈ 0.45 and minimum weighted support ω = 0.4. The following subpatterns of ⟨b,c,b⟩ are also frequent patterns: ⟨b⟩, ⟨c⟩, ⟨b,c⟩, ⟨c,b⟩ and ⟨b,b⟩. Their corresponding weighted supports are 0.86, 1, 0.67, 0.64 and 0.45.
Definition 3.7: Problem definition. The problem of weighted clickstream pattern mining involves finding all clickstream patterns with sufficient weighted support in a given clickstream database SDB. To be more precise, a clickstream P is counted as a frequent weighted clickstream pattern (hereafter referred to as a frequent pattern unless the precise name is needed) if ws(P) ≥ ω; i.e. P is a weighted frequent pattern if its weighted support is greater than or equal to the minimum weighted support threshold ω. For example, Table 4 depicts some frequent weighted patterns, provided that ω = 0.4.

Table 4. Some frequent weighted patterns with minimum weighted support ω = 0.4

Pattern  Weighted support
a        0.76
b        0.86
c        1
e        0.43
f        0.6
b,c      0.67
b,c,b    0.45

4. Mining frequent weighted clickstream patterns with average weights

In this section, we present our baseline algorithm CM-WSPADE, which extends CM-SPADE (Fournier-Viger et al., 2014) by incorporating the average weights proposed in Section 3. CM-WSPADE's overall mining process is shown in Figure 2. The general idea is that it begins by looking for 1-clickstream candidate patterns and computes their weighted supports using our average weight formula. The algorithm discards any 1-clickstream candidate patterns with weighted support lower than the given weighted threshold. The remaining 1-clickstream patterns are frequent weighted ones, and are used to form the next 2-clickstream candidate patterns; in the same way, any 2-clickstream candidate patterns with weighted support lower than the given weight threshold are discarded. The process repeats until no new candidate patterns can be formed. Additionally, CM-WSPADE uses a depth-first search rather than a breadth-first search to increase memory efficiency. The reason for this is that CM-WSPADE uses a lattice decomposition approach, and lattices constructed from real datasets usually have a maximum width (i.e. the maximum number of nodes at the same level) that is several times larger than the maximum height (i.e. the maximum level of nodes). Thus, the memory allocation required for a breadth-first search would be higher than for a depth-first search.

Figure 2. The overall mining process of CM-WSPADE

This section is divided into four subsections. In Section 4.1, we describe the proposed WIBList data structure for storing the necessary information. In Section 4.2, we present a method for generating candidate patterns and populating the WIBList data from two different frequent weighted patterns. In Section 4.3, we present WCMAP, a method for avoiding unnecessary work. Finally, in Section 4.4, we present the CM-WSPADE algorithm.

4.1. WIBList

To accommodate the mining of weighted sequential patterns with average weights, we present the WIBList (weighted ID bitmap list), which extends the vertical IDList data structure used in Aseervatham et al. (2006) and Fournier-Viger et al. (2014). A WIBList contains the necessary information of a frequent pattern and an additional element called ws to keep track of the weighted support. The WIBList is also used to populate the next candidate patterns' WIBLists, as described further in Section 4.2. The WIBLists for frequent 1-patterns are illustrated in Figure 3. Specifically, a WIBList includes the following elements:

• P: a clickstream pattern.
• T: a set of tuples (cid, b), where cid is the identification number of a user clickstream containing P and b is a bitmap representing all the locations of P in that user clickstream; i.e., the positions of the true bits (bits that have a value of '1') in the bitmap are the positions of P. In addition, each bitmap is constructed from multiple bit blocks of fixed length, and the set of tuples is implemented using a hash table.
• ws: the weighted support of P.
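A rough Java rendering of this structure is given below (our own sketch; the field names mirror the description above, and the use of java.util.BitSet for the position bitmaps is an assumption):

import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a WIBList: the pattern P, the hash table T mapping
// each cid to the bitmap of P's positions in that clickstream, and ws.
public final class WIBList {
    final List<Character> pattern;      // P
    final Map<Integer, BitSet> tuples;  // T: cid -> bitmap of positions of P
    double weightedSupport;             // ws

    WIBList(List<Character> pattern) {
        this.pattern = pattern;
        this.tuples = new HashMap<>();
    }

    // Record that the pattern occurs at 'position' in clickstream 'cid'.
    void addOccurrence(int cid, int position) {
        tuples.computeIfAbsent(cid, k -> new BitSet()).set(position);
    }
}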

Figure 3. WIBLists for frequent 1-patterns in which each bit block has a fixed length of four bits

4.2. Candidate generation and temporal join

Since our algorithm extends CM-SPADE, our pattern candidates are also obtained by finding the join (i.e. minimal common supersequences) of two patterns with the same length and the same (k-1)-prefix, i.e. two patterns in the same equivalence class. Clickstream candidate pattern generation involves two main steps: (i) generating pattern candidates by finding the join of two frequent pattern parents; and (ii) populating the candidates' WIBLists by carrying out temporal joins of the parents' WIBLists. Each step is described in detail below.
Each step is described in detail below.
Generating pattern candidates. Let Pα and Pβ be two frequent weighted k-patterns, let X be a (k-1)-prefix of Pα and Pβ, and let a be the last action of Pα and b be the last action of Pβ. We then have Pα = ⟨X, a⟩ and Pβ = ⟨X, b⟩. In other words, Pα and Pβ are in the same equivalence class (i.e. they share the same (k-1)-prefix X).

Assuming a ≠ b, the set of (k+1)-candidates (i.e. the join of Pα and Pβ) contains two patterns ⟨X, a, b⟩ and ⟨X, b, a⟩; i.e. Pα ∨ Pβ = {⟨X, a, b⟩, ⟨X, b, a⟩}. According to the equivalence class definition, ⟨X, a, b⟩ belongs to equivalence class [Pα] and ⟨X, b, a⟩ belongs to equivalence class [Pβ]. The new candidates need to have their weighted supports checked before they can be considered frequent patterns. One special case is that if a = b, then only one pattern candidate, ⟨X, a, a⟩, will be generated. A short sketch of this generation rule is given below.
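The sketch (ours; names are illustrative, not the paper's implementation):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: two k-patterns <X,a> and <X,b> sharing the
// (k-1)-prefix X yield the candidates <X,a,b> and <X,b,a>; only one
// candidate <X,a,a> is produced when a == b.
public final class CandidateGeneration {
    static List<List<Character>> join(List<Character> pAlpha, List<Character> pBeta) {
        char a = pAlpha.get(pAlpha.size() - 1); // last action of <X,a>
        char b = pBeta.get(pBeta.size() - 1);   // last action of <X,b>
        List<List<Character>> candidates = new ArrayList<>();
        candidates.add(extend(pAlpha, b));            // <X,a,b>
        if (a != b) candidates.add(extend(pBeta, a)); // <X,b,a>
        return candidates;
    }

    private static List<Character> extend(List<Character> p, char last) {
        List<Character> c = new ArrayList<>(p);
        c.add(last);
        return c;
    }
}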
Populating candidates' WIBLists. Let Pα and Pβ be two frequent patterns with WIBLists Lα = (Pα, Tα, wsα) and Lβ = (Pβ, Tβ, wsβ), respectively. Let Lαβ = (Pαβ, Tαβ, wsαβ) be the WIBList for the candidate Pαβ = ⟨Pα, b⟩. The construction of the new candidate's WIBList (from the temporal join of WIBLists Lα and Lβ) is described in Algorithm 1.

Algorithm 1. Pseudocode for temporal join of WIBLists

Input: WIBLists Lα = (Pα, Tα, wsα) and Lβ = (Pβ, Tβ, wsβ) of two patterns in the same equivalence class
Output: the WIBList Lαβ = (Pαβ, Tαβ, wsαβ) of the candidate Pαβ = ⟨Pα, b⟩

1. Pαβ ← ⟨Pα, b⟩
2. Tαβ ← ∅
3. Cα ← all cids in Tα; in other words, these are the identification numbers of user clickstreams containing Pα
4. Cβ ← all cids in Tβ; in other words, these are the identification numbers of user clickstreams containing Pβ
5. C ← Cα ∩ Cβ; these are the identification numbers of user clickstreams containing both Pα and Pβ
6. FOR each cid in C DO
7.   bα ← the bitmap in Tα that represents the user clickstream associated with cid
8.   bβ ← the bitmap in Tβ that represents the user clickstream associated with cid
9.   pα ← the location of the first '1' bit in bα
10.  pβ ← the location of the first '1' bit in bβ such that pβ > pα
11.  IF pβ exists THEN
12.    bαβ ← a copy of bβ
13.    all bits before pβ in bαβ are set to '0'
14.    Tαβ ← Tαβ ∪ {(cid, bαβ)}
15. Cαβ ← all cids in Tαβ; in other words, these are the identification numbers of user clickstreams containing Pαβ
16. wsαβ ← (∑_{cid ∈ Cαβ} w(Scid)) / W(SDB)

Example. Let ⟨c,b⟩ and ⟨c,f⟩ be two frequent patterns with WIBLists L(⟨c,b⟩) and L(⟨c,f⟩), as shown in Figure 4. The possible candidates are ⟨c,b,f⟩ and ⟨c,f,b⟩. The tuple set T(⟨c,b,f⟩) is constructed as follows:

Figure 4. WIBLists for ⟨c,b⟩ and ⟨c,f⟩

1. Let C(⟨c,b⟩) = {2, 3, 4} be the set of cids in T(⟨c,b⟩).
2. Let C(⟨c,f⟩) = {1, 4} be the set of cids in T(⟨c,f⟩).
3. Find the intersection C = C(⟨c,b⟩) ∩ C(⟨c,f⟩) = {4}.
4. Carry out a temporal join for each bitmap in the intersection set (in this case, the only bitmap associated with cid "4").
5. bα ← the bitmap in T(⟨c,b⟩) that represents the user clickstream associated with cid "4" (true bits at positions 4 and 6).
6. bβ ← the bitmap in T(⟨c,f⟩) that represents the user clickstream associated with cid "4" (a true bit at position 7).
7. Set pα to the first location of a true bit in bα, i.e. pα = 4.
8. Find pβ, the first location of a true bit in bβ that is greater than pα, i.e. pβ = 7.
9. Since pβ exists, let bαβ be a copy of bβ.
10. Set all bits prior to pβ in bαβ to '0' (in our case, these are already '0').
11. Set T(⟨c,b,f⟩) to {(4, bαβ)}.

The tuple set T(⟨c,f,b⟩) can be constructed using the same steps. The corresponding weighted supports for ⟨c,b,f⟩ and ⟨c,f,b⟩ are 0.21 and 0; with, for example, ω = 0.2, only ⟨c,b,f⟩ is frequent with sufficient weighted support.
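The per-clickstream step of Algorithm 1 (lines 7-14) can be rendered compactly over java.util.BitSet (our own sketch, assuming the WIBList layout sketched in Section 4.1):

import java.util.BitSet;

// Hypothetical sketch: given the position bitmaps of <X,a> and <X,b> in one
// user clickstream, keep the positions of b that occur strictly after the
// first occurrence of <X,a>.
public final class TemporalJoin {
    static BitSet joinOne(BitSet bAlpha, BitSet bBeta) {
        int pAlpha = bAlpha.nextSetBit(0);        // first '1' bit in bAlpha
        if (pAlpha < 0) return null;              // pattern absent in this clickstream
        int pBeta = bBeta.nextSetBit(pAlpha + 1); // first '1' bit after pAlpha
        if (pBeta < 0) return null;               // no valid extension here
        BitSet result = (BitSet) bBeta.clone();
        result.clear(0, pBeta);                   // zero all bits before pBeta
        return result;
    }
}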

4.3. WCMAP (Weighted Co-Occurrence Map)

SPADE (Zaki, 2001) and SPAM (Ayres et al., 2002) generate new candidates by appending the last items of the other patterns in the same equivalence class; however, this leads to many redundant candidates and a large workload overhead. According to the anti-monotonicity constraint, all subsequences of a frequent weighted pattern must also be frequent. By checking the weighted support of the 2-pattern formed by the last two actions of a new pattern candidate, and upon finding that it is infrequent, the newly formed pattern candidate can be pruned immediately. Fournier-Viger et al. (2014) proposed CMAP for SPM, which uses pre-computed 2-patterns (formed from frequent 1-patterns) appearing in the database to quickly prune infrequent (k+1)-pattern candidates based on this property. We extend the work in Fournier-Viger et al. (2014) by adding our proposed average weight to CMAP to adapt it for weighted clickstream pattern mining.

Algorithm 2. WCMAP creation pseudocode

Input: the clickstream database SDB and the set F1 of frequent weighted 1-patterns
Output: the WCMAP M

WCMAP-Creation (SDB, F1)
1. M ← ∅
2. FOR each user clickstream S in SDB DO
3.   Z ← the set of all distinct 2-clickstream patterns that exist in S
4.   FOR each 2-clickstream pattern z in Z DO
5.     IF both actions of z exist in F1 THEN
6.       x ← the first action of z
7.       y ← the second action of z
8.       M[x][y] ← M[x][y] + w(S)
9. DIVIDE every entry of M by W(SDB), so that each entry stores a weighted support

Since two events cannot occur at the same time in clickstreams, only an s-extension CMAP is needed. Here, we refer to our s-extension CMAP for weighted clickstreams as WCMAP (unless the full name is needed). The data structure for WCMAP involves hashmaps within hashmaps: the first map's keys are the first actions of the 2-patterns, and its values are nested hashmaps that have the second actions as keys and the weighted supports as values. For example, Table 5 shows an excerpt from WCMAP with minimum weighted support ω = 0.4 for 2-patterns starting with c and f.

Example of using WCMAP. Let Pα = ⟨b,c⟩ and Pβ = ⟨b,f⟩ be two weighted frequent patterns with ws(⟨b,c⟩) = 0.67 and ws(⟨b,f⟩) = 0.45. The candidates generated from these are ⟨b,c,f⟩ and ⟨b,f,c⟩. The normal way to determine whether or not the candidates are frequent is to construct T(⟨b,c,f⟩) and T(⟨b,f,c⟩) and compare ws(⟨b,c,f⟩) and ws(⟨b,f,c⟩) with ω. However, the new candidates can be discarded immediately, since the 2-patterns formed by the last actions of the candidates (i.e., ⟨c,f⟩ and ⟨f,c⟩) have weighted supports in the WCMAP that are lower than ω (i.e. 0.34 and 0.24; see Table 5).

Table 5. Excerpt from WCMAP with ω = 0.4

First action  Second action  Weighted support
…             …              …
c             a              0.19
              b              0.64
              c              0.34
              e              0.22
              f              0.34
…             …              …
f             b              0.24
              c              0.24
              f              0.24
…             …              …
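A minimal Java sketch of the WCMAP structure and its pruning check (our own illustration; the method names are assumptions):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of WCMAP: hashmaps within hashmaps, mapping a first
// action to (second action -> weighted support of the 2-pattern).
public final class WCMAP {
    private final Map<Character, Map<Character, Double>> map = new HashMap<>();

    void put(char first, char second, double ws) {
        map.computeIfAbsent(first, k -> new HashMap<>()).put(second, ws);
    }

    // A candidate ending in <first, second> can be discarded immediately
    // if the 2-pattern's weighted support is missing or below omega.
    boolean canPrune(char first, char second, double omega) {
        Map<Character, Double> inner = map.get(first);
        if (inner == null) return true;
        Double ws = inner.get(second);
        return ws == null || ws < omega;
    }
}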

4.4. CM-WSPADE algorithm

In this section, we present the CM-WSPADE algorithm in detail. CM-WSPADE first scans the given database to find all frequent weighted 1-patterns satisfying ω and then assigns them to equivalence class [∅] (line 2). All patterns in [∅] are 1-pattern atoms sharing an empty prefix; in other words, they are the seeds used to generate the next 2-patterns. Nodes more than one level above a root are not considered atoms of the lattice with that root. The algorithm then generates the WCMAP based on the discovered 1-pattern atoms (line 4).

CM-WSPADE traverses the lattice recursively in a depth-first manner, commencing at node ∅ (line 5) (level zero). Assuming that Pα is the first 1-pattern atom of [∅], we move to node Pα (level one). However, no child nodes have yet been discovered for node Pα, and in order to traverse further, Pα must have its next level of nodes fully expanded. The next set of 2-pattern nodes (level two) of Pα is created by finding the candidate patterns. These candidates are the minimal common supersequences (the join Pα ∨ Pβ) of Pα and every 1-pattern atom Pβ in [∅] (lines 14 and 20), including a self-join. The WCMAP is used here as a condition to quickly filter out candidates violating the anti-monotonicity constraint (lines 13 and 19). After this, any of the 2-pattern candidates that satisfy ω (i.e. that are frequent) are kept as atom nodes. The atom nodes are then assigned to their corresponding equivalence class ([Pα] or [Pβ]), depending on which 1-prefix they have (lines 18 and 24). This can be seen as connecting 2-pattern atom nodes to their corresponding 1-pattern node.

The algorithm continues to traverse the lattice by moving to the first 2-pattern atom node of [Pα]. In this way, the atom node then becomes the root node of a level-two sublattice. The algorithm then repeats the same steps for its first atom, which is expanded further to the next levels of nodes until no new atom nodes are created. The process then moves back to the second atom node of [∅], and so on, until the last atom of [∅] is traversed. This process is recursively repeated until the whole lattice has been enumerated.

Algorithm 3. CM-WSPADE algorithm

CM-WSPADE (ω, SDB)
Input: minimum weighted support threshold ω and the clickstream database SDB
Output: the set FP containing all weighted frequent clickstream patterns in SDB satisfying ω

1. FP ← ∅, which is a global variable (i.e., shared across methods)
2. SCAN SDB to determine the equivalence class [∅] containing all frequent weighted 1-patterns satisfying ω and their respective WIBLists
3. FP ← FP ∪ [∅]
4. POPULATE M, which is a global WCMAP, from [∅]
5. CALL Node-Expand with parameters [∅] and ω

Node-Expand([X], ω)
Input: minimum weighted support threshold ω and an equivalence class [X]

6. FOR each pattern atom Pα in equivalence class [X] DO
7.   IF equivalence class [Pα] was not initialized THEN
8.     [Pα] ← ∅
9.   FOR each pattern atom Pβ with Pβ ≥ Pα in equivalence class [X] DO
10.    IF equivalence class [Pβ] was not initialized THEN
11.      [Pβ] ← ∅
12.    a ← last action of Pα; b ← last action of Pβ
13.    IF the value of M[a][b] ≥ ω THEN
14.      Pαβ ← ⟨Pα, b⟩
15.      POPULATE L(Pαβ) using L(Pα) and L(Pβ)
16.      IF ws(Pαβ) ≥ ω THEN
17.        FP ← FP ∪ {Pαβ}
18.        [Pα] ← [Pα] ∪ {Pαβ}
19.    IF Pα ≠ Pβ AND the value of M[b][a] ≥ ω THEN
20.      Pβα ← ⟨Pβ, a⟩
21.      POPULATE L(Pβα) using L(Pβ) and L(Pα)
22.      IF ws(Pβα) ≥ ω THEN
23.        FP ← FP ∪ {Pβα}
24.        [Pβ] ← [Pβ] ∪ {Pβα}
25.  CALL Node-Expand with [Pα] and ω

5. Compact-SPADE

WIBLists use hash tables, meaning that CM-WSPADE suffers reduced performance due to frequent collisions and possibly unused allocated memory. In this section, we describe an optimized data structure called the Weighted ID-Compact Value List (WICList) to handle this issue. The version of the CM-WSPADE algorithm that uses WICLists rather than WIBLists is called Compact-SPADE. The overall mining process is similar to that of CM-WSPADE and is shown in Figure 5.
Definition 5.1: Compact values. Let cid and p be the identification number of a user clickstream and a position in that user clickstream, respectively, in the database SDB. A compact value is a single number v that carries and preserves the information about both cid and p, provided that there exists some factor that can help to retrieve the original information.

For example, given a user clickstream with cid = 62 and a position with the value p = 2, then by appending the position to the cid we obtain a new number, 622, that represents both the cid and the position. However, to retrieve the separate information of the user clickstream's cid and the position, we need to know how many digits either the position or the cid takes up in the compact value. Assuming the maximum value that a position can take is limited to one digit (i.e. a range from 0 to 9), then we can safely retrieve the information about the user clickstream cids and the positions from a given compact value. In this case, if the compact value is 622, then we can retrieve the user clickstream's cid of 62 and the position of 2. The factor, denoted by θ, that can help retrieve cid and p is the maximum number of digits used for the position value.

Figure 5. The overall mining process of Compact-SPADE

If represented in the decimal system, the direct use of compact values may not be optimal for computation in some circumstances. For example, for v = 622 and θ = 1, we can infer that cid = 62 and p = 2. However, computers need to use the decimal modulus operator to iterate over each digit, up to θ digits, to obtain the value. This process can be computationally expensive if it is repeated many times. This problem can be fixed if we shift the value representation from decimal to binary, since all the operations can then be carried out using bitwise operators; these are very fast, as they are natively supported by computers.

Let x be a decimal (i.e. denary) integer number; then B(x) returns the bitmap representing x. For example, given x = 5, then B(x) = 101, and vice versa. For example, for cid = 62 and p = 2, we have B(62) = 111110 and B(2) = 10. Appending B(2) to B(62) gives 11111010, so the compact value is v = 250 (here with θ = 2, i.e. two bits reserved for the position).

Let << θ be an operator that shifts the lowest bits to higher bit positions by θ bits (i.e. by appending θ zero bits), and let | be the OR operator of two numbers. θ is the minimum number of binary digits that the longest user clickstream can take up. A compact value can be built based on the following function:

v(cid, p) = (B(cid) << θ) | B(p)

Definition 5.2: Retrieving functions. Let v be a compact value, >> θ the operator that shifts higher bits to lower bit positions by θ bits, and ⊕ the XOR operator for two bitmaps. The cid and p can be respectively retrieved from v by the following functions:

cid(v) = v >> θ
p(v) = v ⊕ (cid(v) << θ)
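In Java, packing and unpacking compact values reduces to a few bitwise operations (our own sketch; long is used so that large cids fit):

// Hypothetical sketch of Definitions 5.1 and 5.2: pack (cid, p) into one
// integer, where theta is the number of bits reserved for the position.
public final class CompactValues {
    static long pack(long cid, long p, int theta) {
        return (cid << theta) | p;              // v = (B(cid) << theta) | B(p)
    }

    static long cid(long v, int theta) {
        return v >>> theta;                     // shift the position bits away
    }

    static long position(long v, int theta) {
        return v ^ (cid(v, theta) << theta);    // XOR removes the cid bits
    }

    public static void main(String[] args) {
        int theta = 3;                          // positions up to 7 fit in 3 bits
        long v = pack(4, 6, theta);             // 38, cf. Figure 6
        System.out.println(cid(v, theta));      // 4
        System.out.println(position(v, theta)); // 6
    }
}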

A WICList contains the same three kinds of elements as a WIBList, i.e. P, T and ws. However, T is now a set of compact values in ascending order, rather than a set of tuples, and is implemented using a contiguous list of integer values. Since one compact value can only represent one position of a pattern in a user clickstream, we need more than one compact value to represent all the positions of a pattern in the user clickstream, and those compact values are listed in ascending order.

Figure 6 shows several WIBLists on the left and their corresponding WICLists on the right. The arrows connect the tuples of the WIBLists to their compact values in the WICLists. For example, for the tuple of pattern ⟨b⟩ with cid = 4, the corresponding compact values are 36 and 38, since the positions of ⟨b⟩ in the user clickstream with cid = 4 are 4 and 6 (and (4 << 3) | 4 = 36, (4 << 3) | 6 = 38).

Figure 6. WICLists for some frequent 1-patterns with θ = 3 (since the longest user clickstream does not exceed 7)

Let Pα and Pβ be two frequent patterns with WICLists Lα = (Pα, Vα, wsα) and Lβ = (Pβ, Vβ, wsβ), respectively. Let Lαβ = (Pαβ, Vαβ, wsαβ) be the WICList for a new pattern candidate Pαβ = ⟨Pα, b⟩. The construction of the new Lαβ (from the temporal join of Vα and Vβ) is described in Algorithm 4.

Algorithm 4. Pseudocode for temporal join of WICLists

Input: WICLists Lα = (Pα, Vα, wsα) and Lβ = (Pβ, Vβ, wsβ)
Output: the WICList Lαβ = (Pαβ, Vαβ, wsαβ) of the candidate Pαβ = ⟨Pα, b⟩

1. Pαβ ← ⟨Pα, b⟩
2. Vαβ ← ∅
3. prev ← -1, used to track whether the process has moved to a different user clickstream
4. C ← ∅, which is the set of cids of the user clickstreams that are superclickstreams of Pαβ
5. vα ← the first value in Vα
6. vβ ← the first value in Vβ
7. WHILE vα has not passed the last value in Vα AND vβ has not passed the last value in Vβ DO
8.   cα ← cid(vα); cβ ← cid(vβ)
9.   IF cα = cβ THEN
10.    pα ← p(vα), the position of the first occurrence of Pα in this user clickstream
11.    WHILE cid(vβ) = cα AND p(vβ) ≤ pα DO
12.      vβ ← the next value in Vβ
13.    WHILE cid(vβ) = cα DO
14.      Vαβ ← Vαβ ∪ {vβ}
15.      IF cα ≠ prev THEN C ← C ∪ {cα}; prev ← cα
16.      vβ ← the next value in Vβ
17.    WHILE cid(vα) = cα DO
18.      vα ← the next compact value in Vα
19.  ELSE IF cα > cβ THEN
20.    vβ ← the next value in Vβ
21.  ELSE IF cα < cβ THEN
22.    vα ← the next value in Vα
23. wsαβ ← (∑_{c ∈ C} w(Sc)) / W(SDB)
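For comparison with Algorithm 1, a Java sketch of this merge-style join over two ascending arrays of compact values might look as follows (ours; it assumes the pack/unpack helpers from the previous sketch):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Algorithm 4: for each cid present in both lists,
// keep the compact values of the second pattern whose positions fall after
// the first occurrence of the first pattern in that clickstream.
public final class WICListJoin {
    static List<Long> temporalJoin(long[] vA, long[] vB, int theta, List<Integer> cids) {
        List<Long> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < vA.length && j < vB.length) {
            long cA = vA[i] >>> theta, cB = vB[j] >>> theta;
            if (cA == cB) {
                long pA = vA[i] ^ (cA << theta); // first position of P_alpha here
                // skip positions of P_beta at or before pA within this cid
                while (j < vB.length && vB[j] >>> theta == cA
                        && (vB[j] ^ (cA << theta)) <= pA) j++;
                boolean matched = false;
                while (j < vB.length && vB[j] >>> theta == cA) {
                    out.add(vB[j]);              // valid extension position
                    matched = true;
                    j++;
                }
                if (matched) cids.add((int) cA); // this cid supports the candidate
                while (i < vA.length && vA[i] >>> theta == cA) i++; // next cid in A
            } else if (cA > cB) {
                j++;
            } else {
                i++;
            }
        }
        return out;
    }
}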

6. Weighted upper bound constraint

In this section, we introduce a pruning heuristic called the weighted upper bound constraint (WUBC) for predicting and discarding unnecessary temporal joins.

Let Pα = ⟨X, a⟩ and Pβ = ⟨X, b⟩ be two weighted frequent patterns, a and b the last actions of Pα and Pβ, respectively, and ∩ the operator denoting the normal join (intersection) of two sets. The join between ρ(Pα) and ρ(Pβ) then results in a set consisting of the user clickstreams in which both Pα and Pβ appear.

Theorem 6.1. The weighted support of the user clickstream set produced by a normal join between the two user clickstream sets ρ(Pα) and ρ(Pβ) is greater than or equal to the weighted support of either new pattern candidate, ⟨Pα, b⟩ or ⟨Pβ, a⟩. In other words, we have:

ws(ρ(Pα) ∩ ρ(Pβ)) ≥ ws(⟨Pα, b⟩)

and

ws(ρ(Pα) ∩ ρ(Pβ)) ≥ ws(⟨Pβ, a⟩)

Proof. Let S be a user clickstream that contains both Pα and Pβ (i.e. Pα ⊑ S and Pβ ⊑ S), and let fα and fβ be the one-to-one order-preserving functions that map each element in Pα or Pβ to the elements in S (as described in Section 3). The joint set ρ(Pα) ∩ ρ(Pβ) is the set of user clickstreams containing both Pα and Pβ. The set can be represented as follows:

ρ(Pα) ∩ ρ(Pβ) = {S ∈ SDB | Pα ⊑ S ∧ Pβ ⊑ S}

Likewise, ρ(⟨Pα, b⟩) and ρ(⟨Pβ, a⟩) can be formulated as follows:

ρ(⟨Pα, b⟩) = {S ∈ SDB | Pα ⊑ S ∧ Pβ ⊑ S ∧ fα(a) ≺ fβ(b)}¹
ρ(⟨Pβ, a⟩) = {S ∈ SDB | Pα ⊑ S ∧ Pβ ⊑ S ∧ fβ(b) ≺ fα(a)}

Therefore, we have:

ρ(⟨Pα, b⟩) ⊆ ρ(Pα) ∩ ρ(Pβ) ⇒ ∑_{S ∈ ρ(⟨Pα, b⟩)} w(S) ≤ ∑_{S ∈ ρ(Pα) ∩ ρ(Pβ)} w(S) ⇒ ws(⟨Pα, b⟩) ≤ ws(ρ(Pα) ∩ ρ(Pβ))

and, by the same argument, ws(⟨Pβ, a⟩) ≤ ws(ρ(Pα) ∩ ρ(Pβ)). ∎

¹ As two actions cannot occur at the same time in a clickstream, one must happen before or after the other; thus there will not be any case in which fα(a) = fβ(b).

Since populating the WIBLists (and WICLists) of new candidates from their parents' WIBLists (and WICLists) is very costly when the lists are large, this pruning technique can be used as a gateway before populating new WIBLists (and WICLists) to eliminate unnecessary work. In other words, if the regular join (between two different user clickstream sets containing two different patterns) induces a weighted support lower than ω, it is certain that the new candidates (generated from the two patterns) will have weighted supports lower than ω. The population of new WIBLists (likewise WICLists) is therefore unnecessary in this case.

Let tw be the time required for executing WUBC on two patterns Pα and Pβ, and let t1 and t2 be the times required for populating the WIBLists (or WICLists) of pattern candidates Pαβ and Pβα, respectively. In theory, if WUBC successfully discards both Pαβ and Pβα, then the required time is only tw. The time saved is t1 + t2 - tw if there are two candidates, or t1 - tw if there is only one candidate. However, if WUBC fails to filter the candidates, then the runtime is penalized and extended by tw for executing WUBC. In the worst case, the whole system's runtime can increase significantly if WUBC fails in every attempt or succeeds only a small number of times.
Example of using WUBC. Let ⟨e⟩ and ⟨f⟩ be two weighted frequent 1-patterns (belonging to the same equivalence class [∅]) with weighted supports ws(⟨e⟩) = 0.43 and ws(⟨f⟩) = 0.6, and let the minimum weighted support be ω = 0.4. The user clickstream sets for ⟨e⟩ and ⟨f⟩ are ρ(⟨e⟩) = {4, 5} and ρ(⟨f⟩) = {1, 3, 4}. The normal intersection of the sets ρ(⟨e⟩) and ρ(⟨f⟩) results in the set {4}, whose weighted support is equal to 0.21 and lower than ω. Therefore, the two new pattern candidates ⟨e,f⟩ and ⟨f,e⟩ should be infrequent; this is in fact the case, because their weighted supports are 0.21 and 0, and both are less than ω.
Implementation and integration. We use fixed-size bitmaps with a number of bits equal to the number of user clickstreams in the database. The position of each true bit matches the cid of a user clickstream, and each bitmap is associated with a WIBList or WICList. In this way, we can quickly determine the intersection of two user clickstream sets to calculate the intersection set's weighted support. This heuristic can be integrated into the proposed algorithms and applied before WCMAP to provide a further filtering layer to discard irrelevant candidates (i.e. just before the WCMAP check at line 13 in the CM-WSPADE algorithm).
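A minimal Java sketch of this check (ours; it assumes each pattern keeps a fixed-size java.util.BitSet of the cids of its supporting clickstreams):

import java.util.BitSet;

// Hypothetical sketch of WUBC: intersect the two cid bitmaps and sum the
// corresponding clickstream weights; if the resulting upper bound on the
// candidates' weighted support is below omega, skip the temporal joins.
public final class WUBC {
    static boolean canPrune(BitSet cidsAlpha, BitSet cidsBeta,
                            double[] clickstreamWeights, double databaseWeight,
                            double omega) {
        BitSet common = (BitSet) cidsAlpha.clone();
        common.and(cidsBeta); // cids of clickstreams containing both patterns
        double sum = 0.0;
        for (int cid = common.nextSetBit(0); cid >= 0; cid = common.nextSetBit(cid + 1)) {
            sum += clickstreamWeights[cid];
        }
        return (sum / databaseWeight) < omega; // true: discard both candidates
    }
}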

7. A running example

Using the example database in Table 1 with minimum weighted support ω = 0.4, part of the derived lattice is shown in Figure 7. We commence by finding the equivalence class [∅] containing the frequent 1-patterns, their respective WICLists and the WCMAP M. [∅] consists of five frequent 1-patterns ⟨a⟩, ⟨b⟩, ⟨c⟩, ⟨e⟩ and ⟨f⟩ as its atoms (the order of these is arbitrary), with respective weighted supports 0.76, 0.86, 1, 0.43 and 0.6. We then expand node ⟨a⟩, as it is the first atom of [∅]. In order to do this, we first have to find the joins of ⟨a⟩ with the other atoms in [∅] to create 2-pattern candidate sets for equivalence class [⟨a⟩].

The first join would be between ⟨a⟩ and itself, which would create a candidate set with a single candidate ⟨a,a⟩. However, we need to check whether this candidate pattern is feasible for the next temporal join step, by first using WUBC and then WCMAP if WUBC fails to discard the candidate. Because ρ(⟨a⟩) ∩ ρ(⟨a⟩) = ρ(⟨a⟩), the weighted support of the intersection set is equal to 0.76 > ω. Thus, WUBC fails to discard ⟨a,a⟩ and we proceed to use WCMAP. The last action of ⟨a⟩ is a, so we have to look at the value of M[a][a]. However, since M[a][a] does not exist, the candidate is deemed infrequent and is ruled out of the upcoming steps.

We move on to find the join between ⟨a⟩ and ⟨b⟩. The candidate set would be {⟨a,b⟩, ⟨b,a⟩}. WUBC fails to discard the candidates, because the weighted support of the intersection set ρ(⟨a⟩) ∩ ρ(⟨b⟩) = {2, 4, 5} is equal to 0.62 > ω. However, M[a][b] = 0.43 ≥ ω and M[b][a] = 0.19 < ω. Thus, the pattern candidate ⟨b,a⟩ is ruled out and only ⟨a,b⟩ is considered a feasible candidate for the next steps. After populating the WICList for ⟨a,b⟩, the pattern is put into equivalence class [⟨a⟩] (i.e. it is connected to the atom node ⟨a⟩), since its 1-prefix is ⟨a⟩.

By repeating the previous steps of joining ⟨a⟩ with ⟨c⟩, ⟨e⟩ and ⟨f⟩, we obtain the equivalence class [⟨a⟩] consisting of the two atoms ⟨a,b⟩ and ⟨a,c⟩. We then expand the node ⟨a,b⟩ by joining it with itself and with ⟨a,c⟩. The only atom of equivalence class [⟨a,b⟩] is ⟨a,b,c⟩, so we only have to join ⟨a,b,c⟩ with itself. After determining that there is no possible node to be expanded, by ruling out ⟨a,b,c,c⟩ using M[c][c] < ω, we return to expanding the node ⟨a,c⟩, as this is the second atom in the equivalence class [⟨a⟩].

Finally, after traversing the whole lattice, the frequent weighted pattern set with respect to the example database and ω = 0.4 is {⟨a⟩, ⟨b⟩, ⟨c⟩, ⟨e⟩, ⟨f⟩, ⟨a,b⟩, ⟨a,c⟩, ⟨b,b⟩, ⟨b,c⟩, ⟨b,f⟩, ⟨c,b⟩, ⟨a,b,c⟩, ⟨b,c,b⟩}.

Figure 7. Part of the traversed lattice for the example database

8. Experimental results

All of the experiments in this section were performed on a 64-bit computer running Windows 8.1. The hardware specifications were a fourth-generation Intel Core i7-4702MQ 2.20GHz with four physical cores, hyper-threading technology and 16 GB of RAM. All the source code was implemented in Java, developed from the SPMF open-source package (Fournier-Viger et al., 2016) at http://www.philippe-fournier-viger.com/spmf/, and run on the 64-bit JDK 8. The Java virtual machine was set to a maximum of 10 GB of heap memory (i.e. -Xmx10000m -Xms10000m). The output of frequent patterns was turned off in all the tests to minimize the HDD's I/O impact. The CPU's turbo boost technology was also turned off to minimize fluctuations in the algorithm runtimes and to give more stable results.

Table 6. Summary of testing databases

Database    User clickstream count  Distinct actions  Average user clickstream length
BIBLE       36,369                  13,905            21.6
FIFA        20,450                  2,990             34.74
SIGN        730                     310               51.99
Kosarak     990,002                 41,270            8.1
Chainstore  1,112,949               46,086            7.2
KDD         1,000,000               135               16

Six databases were used to evaluate our algorithms' performance (Table 6). Five databases (BIBLE, FIFA, SIGN, Chainstore, and KDD) can be obtained from the SPMF website (http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php), and Kosarak can be obtained at http://fimi.ua.ac.be/data/. Three of these (FIFA, BIBLE, and SIGN) are considered small to average-sized, and the rest (Kosarak, Chainstore, and KDD) are categorized as large databases. FIFA and Kosarak are web clickstream databases obtained from the processing logs of actual sites. BIBLE is a real dataset that contains sequences of words, and SIGN is a real dataset that contains sequences of sign language statements; although the latter are not clickstreams, they were selected because their format is the same as that of clickstreams. Chainstore and KDD were originally in itemset format, but we converted them into clickstream format in order to test our proposed methods on large databases.

Initially, these databases did not have weights associated with them, and we therefore assigned weights based on a page rank score (i.e. values ranging from one to 100).

Table 7 summarizes the minimum weighted support thresholds for the algorithms, and the
number of frequent weighted patterns discovered at each minimum threshold. It also shows the
numbers of temporal joins.
Table 7. Numbers of temporal joins and frequent weighted patterns discovered

Database    ω (%)   W. Freq. Patterns  Joins without WUBC  Joins with WUBC
BIBLE       0.4     162,459            1,370,721           416,661
            0.3     304,905            2,701,393           786,996
            0.2     747,390            6,948,959           1,934,967
            0.1     3,663,386          35,170,713          9,498,890
            0.09    4,784,080          45,180,647          12,371,401
FIFA        11      26,358             274,655             124,563
            10      42,189             473,386             221,285
            9       70,908             832,412             388,538
            8       121,823            1,422,550           676,391
            7       223,577            2,552,585           1,233,492
SIGN        5       950,510            5,516,066           1,386,780
            4       1,851,822          11,167,692          2,722,151
            3       4,370,796          27,323,118          6,441,772
            2       14,253,724         92,687,434          21,043,882
            1       102,720,995        670,465,506         151,328,614
Kosarak     0.4     2,348              3,883               2,146
            0.3     4,505              8,753               4,211
            0.2     16,979             33,536              16,390
            0.1     505,406            811,898             504,200
            0.09    1,215,402          2,050,169           1,214,059
Chainstore  0.1     1,403              596                 267
            0.05    4,013              4,435               1,125
            0.01    35,135             311,622             22,954
            0.005   94,889             1,570,741           -
            0.001   1,425,257          58,176,199          -
KDD         50      1,671              1,922               1,656
            40      33,032             33,047              33,015
            30      33,069             33,111              33,052
            20      112,569            114,322             112,543
            10      196,127            196,500             196,092

8.1. Algorithm evaluation and the impact of WUBC

In this section, we describe experiments that were carried out to measure both the runtime and the maximum memory usage of our proposed algorithms. As mentioned above, CM-WSPADE is our weighted version of CM-SPADE, a state-of-the-art algorithm for mining non-weighted sequential patterns. CM-WSPADE was created by integrating our proposed average weight formula into CM-SPADE. We therefore use CM-WSPADE as the baseline algorithm for benchmarking and comparison with our optimized algorithm, Compact-SPADE.

WSpan (Yun & Leggett, 2006) requires some specific parameters, such as weight ranges, and alterations to the original action weights. WSpan runs faster or slower depending on how these weight ranges and alterations are chosen, and the resulting sets of weighted frequent clickstream patterns also vary. This means that the results of our algorithms and WSpan are different. Furthermore, the parameter settings of WSpan may give this approach significant advantages, as it can produce much smaller sets of weighted frequent patterns and can run much faster. It would therefore be unfair to compare our algorithms with WSpan, and we choose to use CM-WSPADE as the baseline benchmark instead.

We also evaluated WUBC's impact in terms of runtime and memory consumption by integrating it into both CM-WSPADE and Compact-SPADE. We use CM-WSPADE + WUBC to denote the CM-WSPADE algorithm with WUBC, and similarly for Compact-SPADE + WUBC. In Figure 8 and Figure 9, the graphs on the left-hand side are for the small to average-sized databases, while the ones on the right are for the large databases.
Performance of CM-WSPADE and Compact-SPADE. The experimental results presented in Figure 8 and Figure 9 indicate that the Compact-SPADE algorithm outperformed CM-WSPADE in terms of both runtime and memory consumption. Compact-SPADE generally ran two to four times faster and used less memory than its counterpart. In particular, Compact-SPADE ran four times faster on the Kosarak dataset than CM-WSPADE. Clear evidence of the more efficient memory consumption of Compact-SPADE is seen for the Chainstore and KDD databases in Figure 9. Compact-SPADE could run at a much lower minimum weighted support (i.e. down to the lowest tested values of ω on Chainstore and KDD), whereas CM-WSPADE exceeded the maximum allowed memory and was unable to function. This indicates that the implementation of WICList was more efficient than WIBList regarding both runtime and memory.
Analysis of CM-WSPADE and Compact-SPADE. Even though CM-WSPADE and Compact-SPADE share the same mining process, the differences between the IDList data structures used and the temporal join methods for those IDLists make Compact-SPADE perform better than CM-WSPADE.

For CM-WSPADE, a WIBList's tuple set is a hashtable whose keys are cids and whose values are the corresponding bitmaps. Under ideal conditions and with a perfect hash function, it requires O(m + n) space, where m is the number of sequences and n is the total number of 32-bit blocks used in all the bitmaps of the WIBList. However, in practice, the available hashtable data structure in Java requires more space. Additionally, inserting new tuples into a WIBList can cause collisions, and when the hashtables are full they need to expand and rehash; when this happens, the runtime increases. The bitmap representation of position lists also has a disadvantage: it can require a whole bitmap just to represent a single position. For example, considering a position list with the single position <120>, we need a bitmap with 128 bits to represent that position, which is four times the space required for a single integer value. It also means that iterating over the bitmap in this case requires a longer runtime just to extract a single position.

Regarding Compact-SPADE, a WICList requires O(q) space, where q is the total number of position values in the WICList; it does not require extra space as WIBLists do. Provided that there are two patterns Pα and Pβ whose WICLists have p and q elements, respectively, populating the new WICList only requires O(p + q) runtime. In addition, WICLists do not suffer from the collision and rehashing problems of WIBLists, thus making Compact-SPADE perform better than CM-WSPADE.
Impact of WUBC. As WUBC adds an extra layer of complexity to the algorithms, it may potentially degrade performance under certain circumstances. In other words, WUBC may help improve performance if it can discard numerous unnecessary joins; however, failing to discard temporal joins will result in penalties in terms of runtime and memory, since the algorithms then have to run both WUBC and the temporal join processes in order to determine which candidates are frequent.

On the medium-sized databases (i.e. BIBLE, FIFA and SIGN), WUBC improved the performance of both CM-WSPADE and Compact-SPADE by reducing runtime in exchange for acceptable increases in memory footprint. However, for the large databases, performance was reduced and the runtime and memory consumption increased. The reasons for this are explained below.
Why does WUBC fail for large databases? While the sizes of WICLists and WIBLists may shrink as the algorithms progress, the sizes of the bitmaps used in WUBC are never reduced and remain proportional to the number of user clickstreams in the database. For example, in a database with one million user clickstreams, WIBLists with support values of 100 and 1,000 both have bitmaps of the same size (one million bits) for WUBC. Consequently, the cost of executing WUBC on the bitmaps of a given database is constant: the bitmaps are fixed in size, and WUBC must iterate over each whole bitmap. Moreover, the larger the testing database, the longer it takes to iterate over a whole bitmap. If the WICLists and WIBLists are small enough, carrying out the temporal joins directly and skipping WUBC is better than executing WUBC. This is what happened on the large testing databases, in which most of the patterns were found at very low values of ω: many of their WICLists and WIBLists contained very few records, meaning that the bitmaps were very large and yet very sparse (i.e. containing few true bits and an abundance of zero bits).
Furthermore, on the Chainstore database, integrating WUBC caused both algorithms to stop running below a certain value of ω, as the maximum permitted memory allocated for the bitmaps was exceeded.
On the KDD database, even for a high value of ω and dense bitmap representations for
WUBC, the use of integrated WUBC still resulted in a slower runtime and greater memory
consumption, since almost no temporal joins could be discarded by WUBC (Table 7). This
greatly increased the runtime, since the algorithm needs to execute WUBC on large bitmaps
followed by temporal joins in order to determine frequent patterns.
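To see why this cost is unavoidable with fixed-size bitmaps, consider the following sketch, which assumes, purely for illustration, that WUBC's pre-check boils down to intersecting two presence bitmaps with one bit per user clickstream; this is our simplification, not the authors' implementation.

```java
// Illustrative sketch (not the authors' implementation): if WUBC's pre-check
// reduces to intersecting two fixed-size presence bitmaps, its cost is
// proportional to the database size no matter how sparse the bitmaps are.
final class FixedBitmapCheck {
    // One bit per user clickstream in the database.
    static int intersectionCount(long[] a, long[] b) {
        int count = 0;
        for (int k = 0; k < a.length; k++)      // always ~|D|/64 iterations
            count += Long.bitCount(a[k] & b[k]);
        return count;
    }

    public static void main(String[] args) {
        int dbSize = 1_000_000;                  // one million clickstreams
        long[] x = new long[(dbSize + 63) / 64]; // 15,625 words each
        long[] y = new long[(dbSize + 63) / 64];
        x[0] = 1L; y[0] = 1L;                    // a single shared bit set
        // Even with one set bit, the whole bitmap must still be scanned.
        System.out.println(intersectionCount(x, y)); // prints 1
    }
}
```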
Figure 8. Runtimes for various values of minimum weighted support (panels: BIBLE, Kosarak, FIFA, Chainstore, SIGN and KDD; x-axis: ω (%); y-axis: runtime in seconds)
Figure 9. Maximum memory consumption for various values of minimum weighted support (panels: BIBLE, Kosarak, FIFA, Chainstore, SIGN and KDD; x-axis: ω (%); y-axis: maximum memory in MB)
The results show that our current implementation of WUBC was effective on small to medium-sized databases but degraded performance on large ones. These limitations largely arose from our use of fixed-size contiguous bitmaps, which scale poorly to large databases. We believe that the performance of WUBC on large databases can be improved by using dynamic or compressed bitmaps whose sizes are proportional to the sizes of the WIBLists and WICLists rather than to the size of the database.
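As a rough sketch of what such a dynamic representation could look like (one possible design, assumed here for illustration rather than taken from any existing implementation), the indices of the set bits can be stored directly, so that the intersection cost tracks the number of records instead of the database size:

```java
import java.util.Arrays;

// Sketch of the "dynamic bitmap" idea suggested above; this is our assumed
// design, not a finished replacement. Only the indices of set bits are
// stored, so intersection cost depends on the number of set bits rather
// than on the database size.
final class SparseBitSet {
    private final int[] setBits;            // sorted indices of true bits
    SparseBitSet(int... bits) { setBits = bits.clone(); Arrays.sort(setBits); }

    static int intersectionCount(SparseBitSet a, SparseBitSet b) {
        int i = 0, j = 0, count = 0;
        while (i < a.setBits.length && j < b.setBits.length) {
            if (a.setBits[i] < b.setBits[j]) i++;
            else if (a.setBits[i] > b.setBits[j]) j++;
            else { count++; i++; j++; }
        }
        return count;                        // O(|a| + |b|), independent of |D|
    }

    public static void main(String[] args) {
        SparseBitSet x = new SparseBitSet(3, 999_999);
        SparseBitSet y = new SparseBitSet(3, 42);
        System.out.println(intersectionCount(x, y)); // prints 1
    }
}
```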
8.2. Scalability of the proposed algorithms

To study the scalability of our proposed algorithms in terms of speed and memory and their
ability to handle large databases, we tested them on large databases containing millions of user
clickstreams and low values of minimum weighted support. Following previous work, we used
synthetic databases generated by the IBM AssocGen program (Agrawal & Srikant, 1995) that
imitates real-world transactions of customers in stores, where customers buy sequences of
itemsets. These generated databases are general sequential databases (i.e. there may be multiple
items in an itemset) and do not contain weights. Thus, we modified the source program to
generate clickstream databases to mimic customers surfing in an online web store, and then used
the method described earlier in this section to add weights to the generated databases.

Table 8. Parameters used in the synthetic databases

Symbol  Meaning
D       Database size, i.e. the number of user clickstreams (default unit: thousand)
C       Average number of actions (i.e. events) per user clickstream
S       Average length of maximal clickstreams
N       Number of potentially unique actions (default unit: thousand)
NS      Number of potentially maximal clickstreams (default unit: thousand)

The steps involved in generating synthetic clickstream databases can be described as follows. A pool of N actions is generated, and a pool of NS maximal clickstreams with average size S is then created by choosing and assigning actions from the action pool to each maximal clickstream. Next, user clickstreams with average size C are created by picking and assigning clickstreams from the maximal clickstream pool to each user clickstream. The process continues until D user clickstreams have been created for the database.
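The following sketch outlines these steps in Java. The parameter values and length distributions are illustrative assumptions, not the exact behaviour of the modified AssocGen program; only the overall pool-and-pick structure follows the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simplified sketch of the generation process described above; parameter
// names follow Table 8, while the concrete values and structural details
// (length distributions, seeding) are assumptions for illustration.
final class ClickstreamGenerator {
    public static void main(String[] args) {
        int D = 1_000, C = 8, S = 4, N = 10_000, NS = 5_000;
        Random rnd = new Random(42);

        // 1) The action pool is simply the ids 0..N-1.

        // 2) Build a pool of NS maximal clickstreams of average length ~S.
        List<int[]> maximal = new ArrayList<>();
        for (int m = 0; m < NS; m++) {
            int len = S / 2 + 1 + rnd.nextInt(S);     // average close to S
            int[] cs = new int[len];
            for (int k = 0; k < len; k++) cs[k] = rnd.nextInt(N);
            maximal.add(cs);
        }

        // 3) Build D user clickstreams of average length ~C by concatenating
        //    picks from the maximal-clickstream pool.
        List<List<Integer>> database = new ArrayList<>();
        for (int u = 0; u < D; u++) {
            List<Integer> user = new ArrayList<>();
            int target = C / 2 + 1 + rnd.nextInt(C);  // average close to C
            while (user.size() < target)
                for (int a : maximal.get(rnd.nextInt(NS))) user.add(a);
            database.add(user);
        }
        System.out.println("generated " + database.size() + " user clickstreams");
    }
}
```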
The parameters for the synthetic databases are shown in Table 8. In the following experiments, all parameters of the generated databases are held fixed while the database size D is increased.

Figure 10. Scalability of Compact-SPADE at a fixed ω as the database size D increases (panels: runtime in seconds, maximum memory in MB, and the corresponding growth rates, all plotted against database size in thousands)

Since CM-WSPADE exceeded the available memory and was unable to run properly on the large synthetic databases at the tested values of ω, we report only the results for Compact-SPADE. The experimental results in Figure 10 show that Compact-SPADE was capable of running on very large databases containing millions of user clickstreams at low values of ω. By normalizing the runtime and peak memory consumption relative to the results on the smallest database, we can see that Compact-SPADE grows linearly in both runtime and peak memory usage, although the growth rate of the maximum memory appears more stable than that of the runtime. The reason is that when the database grows linearly in size while the other factors are fixed, the sizes of the WICLists also grow linearly. The algorithms spend a large portion of their runtime processing those WICLists (i.e. iterating over and populating them), so WICLists that grow linearly also cause runtime and memory consumption to grow linearly.

9. Conclusions and future work

Although WCPM has various potential applications, there remains a lack of comprehensive
studies on this issue. In this paper, we present two algorithms, called CM-WSPADE and
Compact-SPADE, to tackle this problem. We show via comprehensive experiments that both
CM-WSPADE and Compact-SPADE are effective for WCPM; however, Compact-SPADE
outperformed CM-WSPADE on all testing databases and scaled linearly as the databases grew.
Our proposed pruning heuristic, WUBC, worked well mainly on small to medium-sized databases. We believe its performance on large databases can be improved with the use of dynamic bitmaps, which can shrink in size as the algorithms progress.
In future work, we plan to develop methods to improve WUBC based on dynamic bitmaps
and to focus on parallelizing Compact-SPADE for greater performance gain. We also intend to
adapt our algorithms for quantitative databases.
Acknowledgements
This research is funded by Vietnam National Foundation for Science and Technology
Development (NAFOSTED) under grant number: 02/2019/TN.

References
Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining association rules between sets of items
in large databases. Proceedings of the ACM SIGMOD International Conference on
Management of Data, 207–216.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of
the International Conference on Very Large Data Bases (VLDB), 487–499.

Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. Proceedings of the International
Conference on Data Engineering (ICDE), 3–14.

Ahmed, C. F., Tanbeer, S. K., & Jeong, B. S. (2010). A novel approach for mining high-utility
sequential patterns in sequence databases. ETRI Journal, 32(5), 676–686.

Aseervatham, S., Osmani, A., & Viennet, E. (2006). BitSPADE: A lattice-based sequential pattern mining algorithm using bitmap representation. Proceedings of the International Conference on Data Mining (ICDM), 792–797.

Ayres, J., Flannick, J., Gehrke, J., & Yiu, T. (2002). Sequential pattern mining using a bitmap
representation. Proceedings of the ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD), 429–435.

Dalmas, B., Fournier-Viger, P., & Norre, S. (2017). TWINCLE: A constrained sequential rule
mining algorithm for event logs. Procedia Computer Science, 112, 205–214.

Fournier-Viger, P., Gomariz, A., Campos, M., & Thomas, R. (2014). Fast vertical mining of
sequential patterns using co-occurrence information. Proceedings of the Pacific-Asia
Conference on Knowledge Discovery and Data Mining (PAKDD), 40–52.

Fournier-Viger, P., Lin, J. C. W., Gomariz, A., Gueniche, T., Soltani, A., Deng, Z., & Lam, H. T.
(2016). The SPMF open-source data mining library version 2. Proceedings of the Joint
European Conference on Machine Learning and Knowledge Discovery in Databases
(PKDD), 36–40.

Fournier-Viger, P., Wu, C. W., Gomariz, A., & Tseng, V. S. (2014). VMSP: Efficient vertical
mining of maximal sequential patterns. Proceedings of the Canadian Conference on
Artificial Intelligence, 83–94.

Fowkes, J., & Sutton, C. (2016). A subsequence interleaving model for sequential pattern mining. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 835–844.

Fumarola, F., Lanotte, P. F., Ceci, M., & Malerba, D. (2016). CloFAST: Closed sequential
pattern mining using sparse and vertical id-lists. Knowledge and Information Systems, 48(2),
429–463.

Gan, W., Lin, J. C.-W., Fournier-Viger, P., Chao, H.-C., & Yu, P. S. (2019). HUOPM: High-utility occupancy pattern mining. IEEE Transactions on Cybernetics (in press).

Gouda, K., Hassaan, M., & Zaki, M. J. (2010). Prism: An effective approach for frequent
sequence mining via prime-block encoding. Journal of Computer and System Sciences,
76(1), 88–102.

Grahne, G., & Zhu, J. (2005). Fast algorithms for frequent itemset mining using FP-trees. IEEE Transactions on Knowledge and Data Engineering, 17(10), 1347–1362.

Han, J., Pei, J., Mortazavi-Asl, B., & Chen, Q. (2000). FreeSpan: Frequent pattern-projected sequential pattern mining. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 355–359.

Kieu, T., Vo, B., Le, T., Deng, Z. H., & Le, B. (2017). Mining top-k co-occurrence items with
sequential pattern. Expert Systems with Applications, 85, 123–133.

Krishnamoorthy, S. (2019). Mining top-k high utility itemsets with effective threshold raising
strategies. Expert Systems with Applications, 117, 148–165.

Le, B., Duong, H., Truong, T., & Fournier-Viger, P. (2017). FCloSM, FGenSM: Two efficient
algorithms for mining frequent closed and generator sequences using the local pruning
strategy. Knowledge and Information Systems, 53(1), 71–107.

Le, T., Nguyen, A., Huynh, B., Vo, B., & Pedrycz, W. (2018). Mining constrained inter-sequence patterns: A novel approach to cope with item constraints. Applied Intelligence, 48(5), 1327–1343.

Lee, G., Yun, U., & Ryu, K. (2017). Mining frequent weighted itemsets without storing transaction ids and generating candidates. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 25(1), 111–144.

Lee, G., Yun, U., & Ryang, H. (2015). Mining weighted erasable patterns by using
underestimated constraint-based pruning technique. Journal of Intelligent and Fuzzy
Systems, 28(3), 1145–1157.

Lee, G., Yun, U., Ryang, H., & Kim, D. (2016a). Approximate maximal frequent pattern mining
with weight conditions and error tolerance. International Journal of Pattern Recognition
and Artificial Intelligence, 30(6), 1650012.

Lee, G., Yun, U., Ryang, H., & Kim, D. (2016b). Erasable itemset mining over incremental
databases with weight conditions. Engineering Applications of Artificial Intelligence, 52,
213–234.

Lin, M.-Y., & Lee, S.-Y. (2005). Efficient mining of sequential patterns with time constraints by
delimited pattern growth. Knowledge and Information Systems, 7(4), 499–514.

Patel, M., Modi, N., & Passi, K. (2016). An effective approach for mining weighted sequential
patterns. Proceedings of the International Conference on Smart Trends for Information
Technology and Computer Communications, 904–915.

Pei, J., Han, J., Chen, Q., Hsu, M.-C., Mortazavi-Asl, B., Pinto, H., & Dayal, U. (2001).
PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth.
Proceedings of the International Conference on Data Engineering (ICDE), 215–224.

Pei, J., Han, J., & Wang, W. (2007). Constraint-based sequential pattern mining: The pattern-
growth methods. Journal of Intelligent Information Systems, 28(2), 133–160.

Petitjean, F., Li, T., Tatti, N., & Webb, G. I. (2016). Skopus: Mining top-k sequential patterns
under leverage. Data Mining and Knowledge Discovery, 30(5), 1086–1111.

Pramono, Y. W. T. (2015). Anomaly-based intrusion detection and prevention system on website usage using rule-growth sequential pattern analysis: Case study: Statistics of Indonesia (BPS) website. Proceedings of the International Conference on Advanced Informatics: Concept, Theory and Application, 203–208.

Setiawan, F., & Yahya, B. N. (2018). Improved behavior model based on sequential rule mining.
Applied Soft Computing Journal, 68, 944–960.

Ting, I. H., Kimble, C., & Kudenko, D. (2005). UBB mining: Finding unexpected browsing behaviour in clickstream data to improve a website's design. Proceedings of the ACM International Conference on Web Intelligence, 179–185.

Tran, M. T., Le, B., & Vo, B. (2015). Combination of dynamic bit vectors and transaction
information for mining frequent closed sequences efficiently. Engineering Applications of
Artificial Intelligence, 38, 183–189.

Van, T., Vo, B., & Le, B. (2018). Mining sequential patterns with itemset constraints. Knowledge
and Information Systems, 57(2), 311–330.

Van, T., Yoshitaka, A., & Le, B. (2018). Mining web access patterns with super-pattern
constraint. Applied Intelligence, 48(11), 3902–3914.

Vo, B., Coenen, F., & Le, B. (2013). A new method for mining frequent weighted itemsets based
on WIT-trees. Expert Systems with Applications, 40(4), 1256–1264.

Yun, U. (2007). Efficient mining of weighted interesting patterns with a strong weight and/or
support affinity. Information Sciences, 177(17), 3477–3499.

Yun, U., & Lee, G. (2016). Sliding window based weighted erasable stream pattern mining for
stream data applications. Future Generation Computer Systems, 59, 1–20.

Yun, U., Lee, G., & Lee, K. M. (2016). Efficient representative pattern mining based on weight
and maximality conditions. Expert Systems, 33(5), 439–462.

Yun, U., Lee, G., & Ryu, K. H. (2014). Mining maximal frequent patterns by considering weight
conditions over data streams. Knowledge-Based Systems, 55, 49–65.

Yun, U., & Leggett, J. J. (2006). WSpan: Weighted sequential pattern mining in large sequence databases. Proceedings of the International IEEE Conference on Intelligent Systems, 512–517.

Yun, U., Pyun, G., & Yoon, E. (2015). Efficient mining of robust closed weighted sequential
patterns without information loss. International Journal on Artificial Intelligence Tools,
24(01), 1550007.

Zaki, M. J. (2001). SPADE: An efficient algorithm for mining frequent sequences. Machine
Learning, 42(1–2), 31–60.

Zhao, Z., Yan, D., & Ng, W. (2014). Mining probabilistically frequent sequential patterns in
large uncertain databases. IEEE Transactions on Knowledge and Data Engineering, 26(5),
1171–1184.
