
Expert Systems with Applications 36 (2009) 10016–10024

Contents lists available at ScienceDirect

Expert Systems with Applications


journal homepage: www.elsevier.com/locate/eswa

An improved approach to find membership functions and multiple minimum supports in fuzzy data mining ☆

Chun-Hao Chen a, Tzung-Pei Hong b,c,*, Vincent S. Tseng a

a Department of Computer Science and Information Engineering, National Cheng-Kung University, Tainan 701, Taiwan, ROC
b Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC
c Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan, ROC

Keywords: Data mining; Fuzzy set; Genetic algorithm; Genetic-fuzzy mining; Multiple minimum supports; Membership functions

Abstract

Fuzzy mining approaches have recently been discussed for deriving fuzzy knowledge. Since items may have their own characteristics, different minimum supports and membership functions may be specified for different items. In the past, we proposed a genetic-fuzzy data-mining algorithm for extracting minimum supports and membership functions for items from quantitative transactions. In that paper, minimum supports and membership functions of all items are encoded in a chromosome such that it may be not easy to converge. In this paper, an enhanced approach is proposed, which processes the items in a divide-and-conquer strategy. The approach is called divide-and-conquer genetic-fuzzy mining algorithm for items with Multiple Minimum Supports (DGFMMS), and is designed for finding minimum supports, membership functions, and fuzzy association rules. Possible solutions are evaluated by their requirement satisfaction divided by their suitability of derived membership functions. The proposed GA framework maintains multiple populations, each for one item's minimum support and membership functions. The final best minimum supports and membership functions in all the populations are then gathered together to be used for mining fuzzy association rules. Experimental results also show the effectiveness of the proposed approach.

© 2009 Elsevier Ltd. All rights reserved.

☆ This is a modified and expanded version of the paper "A divide-and-conquer genetic-fuzzy mining approach for items with multiple minimum supports," The IEEE International Conference on Fuzzy Systems, pp. 1231–1235, 2008.
* Corresponding author. Address: No. 700, Kaohsiung University Road, Kaohsiung City 811, Taiwan, ROC.
E-mail addresses: chchen@idb.csie.ncku.edu.tw (C.-H. Chen), tphong@nuk.edu.tw (T.-P. Hong), tsengsm@mail.ncku.edu.tw (V.S. Tseng).
0957-4174/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2009.01.067

1. Introduction

Data mining is commonly used for inducing association rules from transaction data. An association rule is an expression X → Y, where X is a set of items and Y is a single item. It means that, in the set of transactions, if all the items in X exist in a transaction, then Y is also in the transaction with a high probability (Agrawal & Srikant, 1994). Most previous studies focused on binary-valued transaction data. Transaction data in real-world applications, however, usually consist of quantitative values. Designing a sophisticated data-mining algorithm able to deal with various types of data presents a challenge to workers in this research field.

Fuzzy set theory has been used in intelligent systems for a long time because of its simplicity and similarity to human reasoning (Chen, Mikulcic, & Kraft, 2000; Siler & James, 2004; Zhang & Liu, 2006). The theory has been applied in fields such as manufacturing, engineering, diagnosis, and economics, among others (Heng et al., 2006; Ishibuchi & Yamamoto, 2005; Liang, Wu, & Wu, 2002). Several fuzzy learning algorithms for inducing rules from given sets of data have been designed and used to good effect in specific domains (Casillas, Cordon, del Jesus, & Herrera, 2005; Hong et al., 2001; Rasmani & Shen, 2004).

Most of the previous approaches set a single minimum support threshold for all the items or itemsets and identify the relationships among binary transactions. In real applications, different items may have different criteria to judge their importance and quantitative data may exist. We can thus divide the fuzzy data mining approaches into two kinds, namely single-minimum-support fuzzy-mining (SSFM) and multiple-minimum-support fuzzy-mining (MSFM) problems. Several mining approaches (Chan & Au, 1997; Hong, Kuo, & Chi, 1999, 2001; Kuok, Fu, & Wong, 1998; Yue, Tsang, Yeung, & Shi, 2000) have been proposed for the SSFM problem. Chan and Au proposed an F-APACS algorithm to mine fuzzy association rules (Chan & Au, 1997). They first transformed quantitative attribute values into linguistic terms and then used the adjusted difference analysis to find interesting associations among attributes. Kuok et al. (1998) proposed a fuzzy mining approach to handle numerical data in databases and derived fuzzy association rules. At nearly the same time, Hong et al. (1999) proposed a fuzzy mining algorithm to mine fuzzy rules from quantitative transaction data. Basically, these fuzzy mining algorithms first used membership functions to transform each quantitative value into a fuzzy set in linguistic terms and then used a fuzzy mining process to find fuzzy association rules. Yue et al. (2000) then extended the above concept to find fuzzy association rules with weighted items from transaction data. They adopted Kohonen self-organized mapping to derive fuzzy sets for numerical attributes. As to the MSFM problem, Lee, Hong, and Lin (2004) proposed a mining algorithm which used multiple minimum supports to mine fuzzy association rules. They assumed that items had different minimum supports and the minimum support for an itemset was set as the maximum of the minimum supports of the items contained in the itemset. Under this constraint, the characteristic of level-by-level processing was kept, such that the original Apriori algorithm could easily be extended to finding large itemsets.

In the above approaches, the membership functions were assumed to be known in advance. Although many approaches for learning membership functions were proposed (Cordón, Herrera, & Villar, 2001; Roubos & Setnes, 2001; Setnes & Roubos, 2000; Wang, Hong, & Tseng, 1998, 2000), most of them were used for classification or control problems. For fuzzy mining problems, Kaya et al. (2003) proposed a GA-based approach to derive a predefined number of membership functions for getting a maximum profit within a user-specified interval of minimum supports. Hong, Chen, Wu, and Lee (2006) also proposed a genetic-fuzzy data-mining algorithm for extracting both association rules and membership functions from quantitative transactions. It maintained a population of sets of membership functions and used the genetic algorithm to automatically derive the resulting one. Its fitness function considered the number of large 1-itemsets and the suitability of membership functions. The suitability measure was used to reduce the occurrence of bad types of membership functions.

The above mentioned approaches, however, were mainly proposed for the SSFM problem. As to the MSFM problem, we proposed in the past a genetic-fuzzy mining algorithm for items with multiple minimum supports (called the GFMMS algorithm) for solving it (Chen, Hong, Tseng, & Lee, 2009). The minimum supports and sets of membership functions of all the items were encoded into a chromosome. Each chromosome was then evaluated by the criteria of requirement satisfaction and suitability of membership functions. Since the chromosome was quite long in this way, much processing time was spent to learn globally near-optimal solutions.

Recently, the divide-and-conquer strategy has been used in the evolutionary computation community to very good effect. Many algorithms based on it have been proposed in different applications (Au, Chan, & Yao, 2003; Darwen & Yao, 1997; Khare, Yao, Sendhoff, Jin, & Wersing, 2005; Yao, 2003). In this paper, we thus propose an enhanced GFMMS algorithm, namely the divide-and-conquer genetic-fuzzy mining algorithm for items with Multiple Minimum Supports (DGFMMS), that can divide-and-conquer the derivation process of the minimum supports and membership functions of different items. The proposed algorithm maintains multiple populations, each for one item's minimum support and membership functions. The fitness of each set of membership functions is evaluated by the requirement satisfaction, which is used to reflect the closeness of the derived strength of fuzzy regions of large 1-itemsets for a chromosome to its Required Strength of Fuzzy regions (RSF), and by the suitability of the derived membership functions. The final best sets of membership functions in all the populations are then gathered together to be used for mining fuzzy association rules. Experiments were also made to show the effectiveness of the proposed approach.

The remaining parts of this paper are organized as follows. The proposed divide-and-conquer genetic-fuzzy mining framework for items with multiple minimum supports is introduced in Section 2. The adjustment process of minimum supports and membership functions is explained in Section 3. The details of the proposed algorithm for mining multiple minimum supports, membership functions and fuzzy association rules are described in Section 4. An example to illustrate the proposed algorithm is given in Section 5. Experiments to demonstrate the performance of the proposed algorithm are stated in Section 6. Conclusions and future works are given in Section 7.

2. The proposed framework

In this paper, the GA and the fuzzy concepts are used together to discover suitable minimum supports, membership functions and useful fuzzy association rules from quantitative transactions. A GA-based framework with the divide-and-conquer strategy is proposed for searching for minimum supports and sets of membership functions suitable for the mining problems. The final best minimum supports and membership functions for items in all the populations are then gathered together to be used for mining fuzzy association rules. The proposed framework is shown in Fig. 1.

The proposed framework in Fig. 1 is divided into two phases: the phase of mining minimum supports and membership functions and the phase of mining fuzzy association rules. Assume the number of items is m. In the first phase, the clustering approach is first used for deriving initialization information, which is used for getting better initial populations as in Chen et al. (2009). The initialization information includes an appropriate number of linguistic terms, a range of possible minimum supports, and a set of initial membership functions of each item. The framework then maintains m populations of minimum supports and membership functions, with each population for an item $I_j$ ($1 \le j \le m$). Each chromosome in a population represents a possible minimum support and membership functions for that item. The chromosomes in the same population are of the same length. Each chromosome is evaluated by the requirement satisfaction and the suitability of membership functions, which are defined later. The evaluation results are then utilized to choose appropriate chromosomes for mating. The offspring then undergo recursive evolution until a good minimum support and membership functions (the chromosome with the highest fitness value) have been obtained. Next, in the phase of mining fuzzy association rules, the obtained minimum supports and membership functions of all the items are gathered together and used to mine the interesting fuzzy association rules from the given quantitative database (Lee et al., 2004). The details are described in the next section.

3. The proposed divide-and-conquer genetic-fuzzy mining approach

3.1. Chromosome representation

It is important to encode minimum supports and membership functions as a string representation for GAs to be applied to our problem. Several possible encoding approaches were described in the past (Cordón et al., 2001; Parodi et al., 1993; Wang et al., 1998; Wang, Hong, & Tseng, 2000). In this paper, each individual consists of two parts, respectively for a minimum support and a set of membership functions. The first part encodes the minimum support of a certain item by the real-number schema. Thus, the minimum support of an item $I_j$ is encoded with a real number $\alpha_j$. The second part handles the set of membership functions for an item. It also adopts the real-number schema. Assume the membership functions are triangular. Three parameters are thus used to represent a membership function. Fig. 2 shows an example for item $I_j$, where $R_{jk}$ denotes the membership function of the kth linguistic term for $I_j$ and $c_{jkp}$ indicates the pth parameter of fuzzy region
$R_{jk}$. The inequality condition of the center values of membership functions is $c_{j12} \le c_{j22} \le \ldots \le c_{jl2}$. For each membership function, the inequality condition of the three parameters is $c_{jk1} < c_{jk2} < c_{jk3}$. The membership functions of item $I_j$ can thus be represented as a string of $c_{j11} c_{j12} c_{j13}\, c_{j21} c_{j22} c_{j23} \ldots c_{jl1} c_{jl2} c_{jl3}$, where $c_{jl3} = \infty$. A chromosome is encoded as a real-number string rather than a bit string. All the chromosomes in the same population have the same string length. Below, an example is given to demonstrate the process of encoding membership functions.

[Figure: flowchart with two phases. Inputs: percentage on RSF, number of clusters, minimum confidence, and the transaction database. Phase 1 (mining minimum supports and membership functions): the clustering approach derives initialization information for the item set; Population 1 to Population m, one per item, each hold chromosomes 1 to q and evolve through the genetic MF mining process, yielding the MF set, minimum support and fuzzy large 1-itemsets of each item. Phase 2 (mining fuzzy association rules): the final minimum supports, final membership functions and final large 1-itemsets feed fuzzy mining, which outputs the fuzzy association rules.]
Fig. 1. The proposed divide-and-conquer genetic-fuzzy mining framework for items with multiple minimum supports.

[Figure: triangular membership functions $R_{j1}, \ldots, R_{jk}, \ldots, R_{jl}$ over the Quantity axis, with membership value up to 1 and parameters $c_{jk1}$, $c_{jk2}$, $c_{jk3}$ marked on the axis.]
Fig. 2. Membership functions of item $I_j$.

Example 1. Assume there are four items in a transaction database: milk, bread, cookies and beverage. Also assume a possible set of membership functions and its minimum support for item milk is shown in Fig. 3.

[Figure: item milk with $\alpha_1 = 0.25$; membership functions Low, Middle and High over the Quantity axis, with ticks at 0, 4, 7 and 11.]
Fig. 3. An example of a minimum support and membership functions for the item milk.

There are three linguistic terms, Low, Middle and High, for this item. According to the proposed encoding scheme, the chromosome for representing the minimum support and the set of membership functions in Fig. 3 is encoded in Fig. 4.

MS_milk   MF_milk
α1        R11 (Low)            R12 (Middle)         R13 (High)
          c111  c112  c113     c121  c122  c123     c131  c132  c133
0.25      2     4     6        3     7     11       8     11    ∞
Fig. 4. The chromosome representation for the minimum support and the set of membership functions in Fig. 3.

In this example, the minimum support for milk is encoded as a real number 0.25. The membership function of Low for milk is encoded as (2, 4, 6) according to Fig. 4. Similarly, the membership
functions of Middle and High for milk are respectively encoded as (3, 7, 11) and (8, 11, ∞). The membership functions are then the concatenation of the three tuples.

3.2. Initial population

A genetic algorithm requires a population of feasible solutions to be initialized and updated during the evolution process. As mentioned above, each individual within the population is a minimum support and a set of triangular membership functions. Each membership function corresponds to a linguistic term of an item. In this paper, the initial set of chromosomes can be generated according to the initialization information as used in Chen et al. (2009). It includes an appropriate number of linguistic terms, a range of possible minimum supports and a set of membership functions of each item.

3.3. The required strength of fuzzy regions

In this paper, the minimum supports of items may be different. It is hard to assign the values. As an alternative, the values can be determined according to the required number of rules. It is, however, very time-consuming to obtain the rules for each chromosome. Usually, a larger number of 1-itemsets will result in a larger number of all itemsets with a higher probability, which will thus usually imply more interesting association rules. The evaluation by 1-itemsets is much faster than that by all itemsets or by interesting fuzzy association rules. Using the number of large 1-itemsets can thus achieve a good trade-off between execution time and rule interestingness (Hong et al., 2006).

A criterion should thus be specified to reflect the user preference on the derived knowledge. In our previous paper (Chen et al., 2009), the Required Number of Large 1-itemsets (RNL) was proposed for this purpose. Given a user-defined percentage p, the RNL value for each item could be derived from the number of its linguistic terms and p. In this paper, we consider that the fuzzy value of each fuzzy region (linguistic term) directly reflects the importance of that region. The Required Strength of Fuzzy regions (RSF) is then defined here. It is the strength of fuzzy regions that a user wants to get from an item and can be defined as follows:

$$RSF_{I_j} = \sum_{i=1}^{n} \sum_{k=1}^{l_j} f_{jk}^{(i)} \times p,$$

where $RSF_{I_j}$ is the RSF value of item $I_j$, $n$ is the number of transactions, $l_j$ is the number of linguistic terms of item $I_j$, $f_{jk}^{(i)}$ is the fuzzy membership value of the kth fuzzy region of item $I_j$ in the ith transaction, and $p$ is the predefined percentage to reflect users' preference on the strength of fuzzy regions. A minimum support with which the strength of large fuzzy regions for an item is close to its RSF value is thought of as a good one for that item. For example, assume there are three linguistic terms for an item and the total fuzzy membership value of the item over all the transactions is 4.91. If the predefined percentage p is set at 80%, the RSF value is thus 4.91 × 0.8, which is 3.928. RSF is thus used in the fitness function described in the next section to evaluate the goodness of a chromosome.

3.4. Fitness and selection

In order to develop a good minimum support and a set of membership functions from an initial population, the genetic algorithm selects good parent chromosomes for mating in a probabilistic way. An evaluation function is thus needed to evaluate the derived minimum support and membership functions. The fitness function of a chromosome $C_q$ is defined as follows:

$$f(C_q) = \frac{RS(C_q)}{Suitability(C_q)},$$

where $RS(C_q)$ is the requirement satisfaction, defined as the closeness of the derived strength of fuzzy regions of large 1-itemsets for chromosome $C_q$ to its RSF, and $Suitability(C_q)$ represents the suitability of the membership functions for $C_q$. $RS(C_q)$ is defined as follows:

$$RS(C_q) = \begin{cases} \dfrac{\sum_{X \in L_1} fuzzyValue(X)}{RSF}, & \text{if } \sum_{X \in L_1} fuzzyValue(X) \le RSF, \\[2ex] \dfrac{RSF}{\sum_{X \in L_1} fuzzyValue(X)}, & \text{if } \sum_{X \in L_1} fuzzyValue(X) > RSF, \end{cases}$$

where RSF is the required strength of fuzzy regions for item $I_j$ and $fuzzyValue(X)$ is the fuzzy membership value of the large 1-itemset $X$ from the given transaction database. $RS(C_q)$ is used to reflect the closeness degree between the derived strength of fuzzy regions of large 1-itemsets and the required strength of fuzzy regions.

$Suitability(C_q)$ represents the shape suitability of the membership functions from $C_q$ and is defined as follows:

$$Suitability(C_q) = overlap\_factor(C_q) + coverage\_factor(C_q),$$

where $overlap\_factor(C_q)$ represents the overlapping factor of the membership functions for an item $I_j$ in the chromosome $C_q$ and $coverage\_factor(C_q)$ represents the coverage ratio of the membership functions for item $I_j$. The $overlap\_factor(C_q)$ is defined as follows:

$$overlap\_factor(C_q) = \sum_{k<i} \left[ \max\left( \frac{overlap(R_{jk}, R_{ji})}{\min\left( (c_{jk3} - c_{jk1})/2,\ (c_{ji3} - c_{ji1})/2 \right)},\ 1 \right) - 1 \right],$$

where $overlap(R_{jk}, R_{ji})$ is the overlap length of $R_{jk}$ and $R_{ji}$. $coverage\_factor(C_q)$ is defined as:

$$coverage\_factor(C_q) = \frac{1}{range(R_{j1}, \ldots, R_{jl}) / \max(I_j)},$$

where $range(R_{j1}, R_{j2}, \ldots, R_{jl})$ is the coverage range of the membership functions, $l$ is the number of membership functions for $I_j$, and $\max(I_j)$ is the maximum quantity of $I_j$ in the transactions. The suitability factor used in the fitness function can reduce the occurrence of the two bad kinds of membership functions shown in Fig. 5, where the first one is too redundant and the second one is too separate. The overlap factor is designed for avoiding the first bad case, and the coverage factor is for the second one.

3.5. Genetic operators

Genetic operators are important to the success of specific GA applications. Two genetic operators, the max-min-arithmetical (MMA) crossover proposed in Herrera, Lozano, and Verdegay (1997) and the one-point mutation, are used in the genetic fuzzy mining framework. The MMA crossover operator generates four candidate chromosomes from two parent chromosomes. The best two chromosomes of the four candidates are then chosen as the offspring. The one-point mutation operator adds a random value x to the minimum support $\alpha_j$ in a chromosome. The newly derived minimum support is thus changed to $\alpha_j + x$. A new fuzzy membership function is also created by adding a random value e to the center or to the spread of an existing linguistic term, say $R_{jk}$. Assume that $c_{jkp}$ represents a parameter of $R_{jk}$. The parameter of the newly derived membership function may be changed to $c_{jkp} + e$ by the mutation operation. Mutation at the fuzzy membership function may, however, disrupt the order of the resulting fuzzy membership functions. These fuzzy membership functions then need rearrangement according to their center values.
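The fitness evaluation of Sections 3.3 and 3.4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, a region is treated as large when its count is at least α × n (the comparison used in the worked example of Section 5), and the coverage range is read as the length of [0, max quantity] covered by at least one membership function. With the milk chromosome of Fig. 4 and the milk quantities of Table 1, the sketch gives region counts of 0.5, 2.75 and about 1.67, a requirement satisfaction of about 0.70 and a suitability of about 1.7, which agrees with the worked example up to rounding.

```python
import math


def triangular(x, a, b, c):
    """Membership of x in the triangle (a, b, c); c = inf gives a right shoulder."""
    if x <= a:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    if math.isinf(c):
        return 1.0
    return (c - x) / (c - b) if x < c else 0.0


def decode(chromosome):
    """Split [alpha, c11, c12, c13, c21, ...] into (alpha, [(c_k1, c_k2, c_k3), ...])."""
    alpha, params = chromosome[0], chromosome[1:]
    return alpha, [tuple(params[i:i + 3]) for i in range(0, len(params), 3)]


def region_counts(quantities, regions):
    """Scalar cardinality of every fuzzy region over the item's quantities."""
    return [sum(triangular(v, *r) for v in quantities) for r in regions]


def required_strength(counts, p):
    """RSF: total fuzzy membership of the item multiplied by the percentage p."""
    return sum(counts) * p


def requirement_satisfaction(counts, alpha, n, rsf):
    """RS: closeness of the strength of the large regions to RSF."""
    # Assumption: a region is large when count >= alpha * n, as in the worked example.
    strength = sum(c for c in counts if c >= alpha * n)
    if strength == 0 or rsf == 0:
        return 0.0
    return strength / rsf if strength <= rsf else rsf / strength


def covered_length(regions, max_quantity):
    """Length of [0, max_quantity] covered by some region (assumed reading of 'range')."""
    spans = sorted((max(a, 0.0), min(c, max_quantity)) for a, _, c in regions)
    total, (lo, hi) = 0.0, spans[0]
    for s_lo, s_hi in spans[1:]:
        if s_lo > hi:            # gap: close the current covered stretch
            total += hi - lo
            lo, hi = s_lo, s_hi
        else:                    # overlap: extend the current stretch
            hi = max(hi, s_hi)
    return total + (hi - lo)


def suitability(regions, max_quantity):
    """overlap_factor + coverage_factor for one item's membership functions."""
    overlap_factor = 0.0
    for k in range(len(regions)):
        for i in range(k + 1, len(regions)):
            (a1, _, c1), (a2, _, c2) = regions[k], regions[i]
            overlap = max(0.0, min(c1, c2) - max(a1, a2))
            half_span = min((c1 - a1) / 2, (c2 - a2) / 2)
            overlap_factor += max(overlap / half_span, 1.0) - 1.0
    coverage_factor = 1.0 / (covered_length(regions, max_quantity) / max_quantity)
    return overlap_factor + coverage_factor


# Chromosome C1 of the milk population and the milk quantities of Table 1.
alpha, regions = decode([0.25, 2, 4, 6, 3, 7, 11, 8, 11, math.inf])
milk = [6, 7, 2, 3, 6, 10, 11]       # seven of the n = 10 transactions contain milk
counts = region_counts(milk, regions)
rsf = required_strength(counts, p=0.8)
rs = requirement_satisfaction(counts, alpha, n=10, rsf=rsf)
print(counts, rsf, rs, rs / suitability(regions, max_quantity=11))
```

The sketch evaluates one item at a time, which mirrors the divide-and-conquer design: each population only needs the quantities of its own item to score its chromosomes.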
[Figure: two panels, (a) and (b), each showing Low, Middle and High membership functions over the Quantity axis; panel (a) has ticks at 0, 5, 8 and 9, and panel (b) has ticks at 0, 5, 20 and 25.]
Fig. 5. Two bad kinds of membership functions.

4. The proposed genetic-fuzzy mining algorithm

According to the above description, the proposed divide-and-conquer genetic-fuzzy mining algorithm for mining minimum supports, membership functions and fuzzy association rules is described below.

The proposed divide-and-conquer genetic-fuzzy mining algorithm for items with multiple minimum supports:

Input: A body of n quantitative transactions, a set of m items, a parameter k for k-means clustering, a population size P, a crossover rate Pc, a mutation rate Pm, a crossover parameter d, a percentage p of required strength of fuzzy regions, a break threshold, an interval threshold, and a confidence threshold λ.
Output: A set of fuzzy association rules with its associated set of minimum supports and membership functions.
Step 1: Generate m populations according to the initialization information derived by the clustering procedure stated in Chen et al. (2009), each for an item; each individual in a population represents a possible minimum support and set of membership functions for that item.
Step 2: Calculate the fitness value of each chromosome in each population by the following substeps:
  Substep 2.1: For each transaction datum $D_i$, i = 1 to n, and for each item $I_j$, j = 1 to m, transform the quantitative value $v_j^{(i)}$ into a fuzzy set $f_{jk}^{(i)}$ represented as:

  $$\left( \frac{f_{j1}^{(i)}}{R_{j1}} + \frac{f_{j2}^{(i)}}{R_{j2}} + \cdots + \frac{f_{jl}^{(i)}}{R_{jl}} \right),$$

  using the corresponding membership functions represented by the chromosome, where $R_{jk}$ is the kth fuzzy region (term) of item $I_j$, $f_{jk}^{(i)}$ is $v_j^{(i)}$'s fuzzy membership value in region $R_{jk}$, and $l$ (= $|I_j|$) is the number of linguistic terms for $I_j$.
  Substep 2.2: For each item region $R_{jk}$, $1 \le j \le m$, calculate its scalar cardinality on the transactions as follows:

  $$count_{jk} = \sum_{i=1}^{n} f_{jk}^{(i)}.$$

  Substep 2.3: For each $R_{jk}$, $1 \le j \le m$ and $1 \le k \le l$, check whether its $count_{jk}$ is larger than or equal to the minimum support represented in the chromosome. If $R_{jk}$ satisfies the above condition, put it in the set of large 1-itemsets ($L_1$). That is:

  $$L_1 = \{ R_{jk} \mid count_{jk} \ge \alpha_j,\ 1 \le j \le m \text{ and } 1 \le k \le l \}.$$

  Substep 2.4: Set the fitness value of each chromosome $C_q$ as the requirement satisfaction ($RS(C_q)$) divided by $Suitability(C_q)$. That is:

  $$f(C_q) = \frac{RS(C_q)}{Suitability(C_q)}.$$

Step 3: Execute the crossover operation on the population.
Step 4: Execute the mutation operation on the population.
Step 5: Use the selection operation to choose individuals in each population for the next generation. Any selection operation, such as the elitism selection strategy or the roulette selection strategy, may be used here.
Step 6: If the termination criterion is not satisfied, go to Step 2; otherwise, do the next step.
Step 7: Get the sets of minimum supports and membership functions, each of which has the highest fitness value in its population.
Step 8: Mine fuzzy association rules from the given database using the sets of minimum supports and membership functions. The fuzzy mining algorithm proposed in Lee et al. (2004) is then adopted to achieve this purpose.

5. An example

In this section, a simple example is given to illustrate the proposed genetic-fuzzy mining algorithm for finding minimum supports, membership functions and fuzzy association rules. Assume there are four items in a transaction database: milk, bread, cookies and beverage. The data set includes the ten transactions shown in Table 1. The proposed algorithm processes it as follows:

Table 1
The ten transactions in this example.
TID   Items
T1    (milk, 6); (bread, 4); (cookies, 7); (beverage, 7)
T2    (milk, 7); (bread, 7); (cookies, 12)
T3    (bread, 8); (cookies, 12); (beverage, 6)
T4    (milk, 2); (bread, 3)
T5    (milk, 3); (bread, 8)
T6    (milk, 6); (beverage, 6)
T7    (milk, 10); (cookies, 6)
T8    (milk, 11); (bread, 11)
T9    (beverage, 11)
T10   (beverage, 10)

Step 1: Four populations are generated as the initial ones by the clustering procedure (Chen et al., 2009), each for one item. Assume the population size is 10 in this example. Each population then includes 10 individuals. Each individual in the first population consists of the minimum support and membership functions for item milk. Similarly, an individual in the other three populations consists of the minimum support and membership functions respectively for bread, cookies, and beverage. Assume the 10 individuals generated are shown in Table 2.
Step 2: The fitness value of each chromosome is calculated by the following substeps.
  Substep 2.1: The quantitative value of each transaction datum is transformed into a fuzzy set according to the membership functions in each chromosome. Take the first item in transaction T6 using the membership functions in
C.-H. Chen et al. / Expert Systems with Applications 36 (2009) 10016–10024 10021

Table 2
The ten chromosomes in each of the four populations.

Population1(milk) Population2(bread)
C1 0.25, 2.0, 4.0, 6.0, 3.0, 7.0, 11.0, 8.0, 11.0, 1; C1 0.07, 3.0, 4.0, 5.0, 5.0, 8.0, 11.0, 7.0, 11.0, 1;
C2 0.257, 0.58, 4.0, 7.41, 4.32, 6.0, 7.67, 2.57, 10.0, 1; C2 0.06, 0.18, 3.0, 5.81, 3.28, 8.0, 12.71, 8.57, 10.0, 1;
C3 0.2, 0.029, 2.0, 3.97, 0.51, 8.0, 15.48, 0.33, 11.0, 1; C3 0.3, 0.32, 2.0, 3.67, 5.94, 7.0, 8.05, 4.14, 11.0, 1;
C4 0.33, 0.10, 3.0, 5.89, 5.62, 7.0, 8.37, 2.07, 10.0, 1; C4 0.09, 1.87, 3.0, 4.12, 2.0, 8.0, 13.99, 1.80, 10.0, 1;
C5 0.12, 0.24, 2.0, 3.75, 3.29, 8.0, 12.70, 4.33, 11.0, 1; C5 0.18, 1.11, 3.0, 4.88, 0.32, 6.0, 11.67, 2.13, 11.0, 1;
C6 0.28, 1.70, 4.0, 6.29, 2.53, 6.0, 9.46, 1.28, 11.0, 1; C6 0, 0.21, 2.0, 3.78, 2.38, 7.0, 11.61, 4.14, 10.0, 1;
C7 0.23, 0.99, 3.0, 5.00, 1.58, 7.0, 12.41, 1.35, 11.0, 1; C7 0.27, 0.99, 2.0, 3.0, 0.74, 6.0, 11.25, 8.36, 10.0, 1;
C8 0.33, 0.50, 3.0, 5.49, 1.67, 8.0, 14.32, 4.20, 11.0, 1; C8 0.06, 2.92, 4.0, 5.07, 5.18, 7.0, 8.81, 1.15, 10.0, 1;
C9 0.27, 0.68, 3.0, 5.31, 2.72, 7.0, 11.27, 0.58, 11.0, 1; C9 0.02, 1.39, 3.0, 4.60, 2.49, 6.0, 9.50, 6.91, 10.0, 1;
C10 0.26, 0.39, 3.0, 5.60, 2.40, 8.0, 13.59, 2.44, 11.0, 1; C10 0.17, 1.12, 3.0, 4.87, 0.19, 6.0, 11.80, 5.01, 11.0, 1;
Population3(cookies) Population4(beverage)
C1 0.16, 4.0, 7.0, 10.0, 3.0, 10.0, 1; C1 0.17, 5.0, 6.0, 7.0, 5.0, 12.0, 1;
C2 0.01, 1.96, 6.0, 10.03, 8.37, 12.0, 1; C2 0.12, 5.35, 7.0, 8.64, 9.75, 12.0, 1;
C3 0.16, 1.90, 6.0, 10.09, 6.26, 12.0, 1; C3 0.2, 0.42, 6.0, 11.57, 0.79, 11.0, 1;
C4 0.09, 3.07, 7.0, 10.92, 9.78, 12.0, 1; C4 0.05, 5.99, 7.0, 8.0, 0.22, 10.0, 1;
C5 0.06, 0.034, 7.0, 13.96, 3.0, 10.0, 1; C5 0.22, 4.87, 7.0, 9.12, 6.38, 12.0, 1;
C6 0.08, 1.66, 6.0, 10.33, 1.75, 10.0, 1; C6 0.16, 3.0, 6.0, 8.99, 1.07, 11.0, 1;
C7 0.08, 5.09, 7.0, 8.90, 4.07, 12.0, 1; C7 0.01, 0.61, 7.0, 13.38, 7.99, 11.0, 1;
C8 0.01, 0.46, 6.0, 11.53, 9.55, 12.0, 1; C8 0.11, 1.63, 7.0, 12.36, 1.99, 12.0, 1;
C9 0.16, 0.10, 6.0, 11.89, 7.11, 12.0, 1; C9 0.1, 3.59, 7.0, 10.40, 8.53, 12.0, 1;
C10 0.13, 3.44, 6.0, 8.55, 5.69, 11.0, 1; C10 0.15, 0.98, 6.0, 11.01, 3.49, 12.0, 1.

milk α1 = 0.25 Table 3


The fuzzy sets transformed from the data in Table 2.
Low Middle High
1
TID Fuzzy Set TID Fuzzy Set
   
T1 0:75 T6 0:75
milk:Middle milk:Middle
   
1:00 0:25 0:666
0 4 7 11 Quantity T2
milk:Middle
T7 þ
milk:Middle milk:High
 
1:0
Fig. 6. The membership functions for milk in C1. T3 Null T8
  milk:High
0:0
T4 T9 Null
milk:Low
 
0:5
T5 T10 Null
milk:Low

chromosome C1of Population1 as an example. The member-


ship functions for milk in C1 are represented as (2.0, 4.0,
6.0, 3.0, 7.0, 11.0, 8.0, 11.0, 1;), which are shown in Fig. 6. Table 4
The counts of the fuzzy regions.
Its minimum support is 0.25. The amount ‘‘6” of item milk
0:75
is then converted into the fuzzy set ðmilk:MiddleÞ using the mem- Item Count
bership functions for milk in C1. The results for all the items milk.Low 0.5
are shown in Table 3, where the notation item.term is called a milk.Middle 2.75
fuzzy region. milk.High 1.66

Substep 2.2: The scalar cardinality of each fuzzy region in the


transactions is calculated as the count value. Take the fuzzy
region milk.Middle as an example. Its scalar cardinality = Table 5
The fitness values of the chromosomes in Population1.
(0.75 + 1.0 + 0 + 0 + 0 + 0.75 + 0.25 + 0 + 0 + 0) = 2.75. The
counts for all the fuzzy regions are shown in Table 4. Chromosome f Chromosome f
Substep 2.3: The count of any fuzzy region is checked against C1 0.411 C6 0.244
the minimum support in a chromosome. Take C1 as an exam- C2 0.167 C7 0.266
ple. Its minimum support is 0.25. Since the count values of C3 0.262 C8 0.261
C4 0.289 C9 0.309
milk.Middle is larger than 2.5 (= 0.25 * 10), these items are
C5 0.444 C10 0.403
put in L1.
Substep 2.4: Assume the percentage p of the required
strength of fuzzy regions is set at 0.8. The RSF values of the Table 6
item milkis 3.928 (= (0.5 + 2.75 + 1.66) * 0.8). The require- The fitness value of the four candidate offspring.

ment satisfaction of milkis thus 0.70 (=2.75/3.928). Since Chromosome f chromosome f


the requirement satisfaction of C1 in Population1 is 0.70 C tþ1 0.458 C tþ1 0.633
1 3
and its suitability is calculated as 1.7, the fitness value of C tþ1
2 0.474 C tþ1
4 0.436
C1 is thus 0.70/1.7 (=0.411). The fitness values of all the chro-
mosomes are shown in Table 5.
Step 3: The crossover operation is executed on the population. Assume d is set at 0.35 and the crossover rate is set at 0.8. Take the crossover of the two chromosomes, C1 and C2, as an example. According to the MMA crossover operator, the following four candidate chromosomes are generated from C1 and C5.

(1) C1^(t+1): 0.25, 2.0, 4.0, 6.0, 3.29, 8.0, 12.70, 8.0, 11.0, ∞;
(2) C2^(t+1): 0.12, 0.24, 2.0, 3.75, 3.0, 7.0, 11.0, 4.33, 11.0, ∞;
(3) C3^(t+1): 0.16, 0.85, 2.7, 4.54, 3.18, 7.65, 12.11, 5.61, 11.0, ∞;
(4) C4^(t+1): 0.20, 1.38, 3.3, 5.21, 3.10, 7.35, 11.59, 6.71, 11.0, ∞.

The fitness values of the above four candidates are then evaluated, with the results shown in Table 6. The best two chromosomes are chosen. Thus, chromosomes C2^(t+1) and C3^(t+1) are chosen.
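The four candidates above are consistent with the max-min-arithmetical (MMA) crossover of Herrera et al. (1997): the element-wise maximum, the element-wise minimum, and two weighted averages with weight d. The sketch below illustrates this under one loudly stated assumption: the genes of the second parent are not listed in the text, so `parent_b` is inferred here from the max and min candidates together with C1. The function and variable names are illustrative.

```python
import math

def mma_crossover(parent_a, parent_b, d):
    """Return the four max-min-arithmetical candidates of two parents."""
    pairs = list(zip(parent_a, parent_b))
    c_max = [max(a, b) for a, b in pairs]
    c_min = [min(a, b) for a, b in pairs]
    c_mix_a = [d * a + (1 - d) * b for a, b in pairs]
    c_mix_b = [(1 - d) * a + d * b for a, b in pairs]
    return [c_max, c_min, c_mix_a, c_mix_b]

# C1: minimum support followed by the nine membership-function values.
parent_a = [0.25, 2.0, 4.0, 6.0, 3.0, 7.0, 11.0, 8.0, 11.0, math.inf]
# Second parent: NOT given in the text; inferred from candidates (1) and (2).
parent_b = [0.12, 0.24, 2.0, 3.75, 3.29, 8.0, 12.70, 4.33, 11.0, math.inf]

candidates = mma_crossover(parent_a, parent_b, d=0.35)
for number, genes in enumerate(candidates, start=1):
    print(number, [round(g, 2) for g in genes])

# Fitness values of the four candidates as listed in Table 6.
table6 = {1: 0.458, 2: 0.474, 3: 0.633, 4: 0.436}
best_two = sorted(table6, key=table6.get, reverse=True)[:2]
print(best_two)  # [3, 2]
```

The weighted candidates agree with (3) and (4) to within rounding in the second decimal place, and ranking by the Table 6 fitness values selects candidates 3 and 2.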

Step 4: The mutation operation is executed on the population to generate possible offspring. The operation is the same as the traditional one except that rearrangement may need to be done.

Steps 5–8: The elitist selection operation is executed to generate ten chromosomes as the next population. The same procedure is then executed until the termination criterion is satisfied. The best chromosome (with the highest fitness value) is output as the minimum supports and membership functions for deriving fuzzy rules. After the minimum supports and membership functions are derived, the fuzzy mining method proposed in Lee et al. (2004) is then used to mine fuzzy association rules.

6. Experimental results

In this section, the experiments made to show the performance of the proposed approach are described. They were implemented in Java on a personal computer with an Intel Pentium IV 3.20 GHz CPU and 512 MB of RAM. The initial population size P is set at 50, the cluster number k is set at 10, the crossover rate pc is set at 0.8, and the mutation rate pm is set at 0.001. The parameter d of the crossover operator is set at 0.35 according to Herrera et al. (1997). The percentage of the required strength of fuzzy regions is set at 0.8 for the uniform dataset and 1.0 for the exponential dataset.

In the following subsections, we first give a description of the experimental datasets. The effectiveness of the proposed approach is then illustrated. Comparisons of the proposed approach (DGFMMS) and the previous approach (GFMMS) (Chen et al., 2009) are then made to show the efficiency of the proposed algorithm.

6.1. Description of the experimental datasets

Two simulated datasets with 64 items and 10000 transactions were used in the experiments. One dataset followed a uniform distribution and the other followed an exponential distribution. The factors for the two datasets included the transaction length, the purchased items and their quantities. In the experiments, the number (transaction length) of purchased items in a transaction

[Figure 7: membership function plots. Values marked on the x-axes are listed per panel.]
(a) Initial MFs. Item1 (α1 = 0.066): 0.26, 0.86, 2.0, 2.47, 3.18, 8.0, 9.0, 13.52. Item2 (α2 = 0.056): 0.44, 2.13, 4.0, 5.0, 6.16, 7.55, 7.86, 10.0.
(b) Final MFs. Item1 (α1 = 0.024): 0.18, 1.80, 2.97, 3.41, 5.95, 6.24, 9.0, 9.89. Item2 (α2 = 0.021): 0.16, 1.91, 2.25, 3.43, 5.13, 6.44, 8.33, 10.20.

Fig. 7. The initial and the final minimum supports and membership functions of some items for the uniform dataset.

[Figure 8: membership function plots. Values marked on the x-axes are listed per panel.]
(a) Initial MFs. Item1 (α1 = 0.013): 0, 1.0, 2.0, 5.09, 11.0. Item2 (α2 = 0.021): 2.56, 5.0, 6.9, 7.43, 9.0.
(b) Final MFs. Item1 (α1 = 0.029): 0.22, 4.4, 7.17, 8.57, 9.7. Item2 (α2 = 0.03): 0.05, 3.99, 6.65, 7.89, 10.11.

Fig. 8. The initial and the final minimum supports and membership functions of some items for the exponential dataset.

was randomly generated in a uniform distribution of the range (Au et al., 2003; Lee et al., 2004) for both datasets. The purchased items in each transaction were then selected from the 64 items in a uniform distribution of the range (Au et al., 2003) for the uniform dataset and in an exponential distribution with the rate parameter set at 16 for the exponential dataset. Their quantities were then assigned from a uniform distribution of the range (Au et al., 2003; Hong et al., 2001) for the uniform dataset and from an exponential distribution with the rate parameter set at 5 for the exponential dataset. The simulation process was terminated when the dataset size was reached. An item could not be generated twice in a transaction.

6.2. The performance of the proposed approach

After 500 generations, the final membership functions were visibly better than the original ones in both datasets. For example, the initial minimum supports and membership functions of two items among the 64 items for the uniform dataset are shown in Fig. 7(a). The membership functions have the two bad types of shapes according to the definition in the previous section. After 500 generations, the final minimum supports and membership functions for the same items are shown in Fig. 7(b). It is easily seen that the membership functions in Fig. 7(b) are better than those in Fig. 7(a). The two bad kinds of membership functions are improved in the final results.

Similar results were also obtained for the exponential dataset. The initial minimum supports and membership functions of two items among the 64 items for the exponential dataset are shown in Fig. 8(a) and the final results are shown in Fig. 8(b).

Next, experiments were made to compare the fitness convergence of the proposed approach with that of the previous one (Chen et al., 2009). The results on the two datasets are shown in Figs. 9 and 10.

From Figs. 9 and 10, we can observe that the proposed approach not only converges faster than the previous one, but also reaches better results on the two datasets. The proposed approach is thus effective.

7. Conclusion and future works

In this paper, we have proposed a divide-and-conquer genetic-fuzzy mining algorithm for items with Multiple Minimum Supports (DGFMMS) to extract multiple minimum supports, membership functions and fuzzy association rules from quantitative transactions. It maintains multiple populations, each for one item's minimum support and membership functions. The fitness value of a chromosome is evaluated by its requirement satisfaction and its suitability. The proposed approach has two advantages. The first one is that since the proposed approach executes the derivation process in a divide-and-conquer way, its fitness convergence is fast. Experimental results also show that the proposed approach (DGFMMS) converges faster than the previous one (GFMMS). The second advantage is that the proposed approach (DGFMMS) can achieve similar or better results than the previous one (GFMMS), which the experimental results also show. In the future, we will continue to enhance the genetic-fuzzy mining framework for more complex problems.
[Figure 9: line plot titled "Uniform Distribution". x-axis: Generation (0 to 500); y-axis: Average Fitness Values (0 to 70). Series: The Proposed Approach (p = 0.8), The Previous Approach (p = 0.8).]

Fig. 9. The comparison results of the proposed approach and the previous approach for the uniform dataset.

[Figure 10: line plot titled "Exponential Distribution". x-axis: Generation (0 to 500); y-axis: Average Fitness Values (0 to 70). Series: The Proposed Approach (p = 1.0), The Previous Approach (p = 1.0).]

Fig. 10. The comparison results of the proposed approach and the previous approach for the exponential dataset.

Acknowledgement

This research was supported by the National Science Council of the Republic of China under contract NSC 96-2213-E-390-003.

References

Agrawal, R., & Srikant, R. (1994). Fast algorithm for mining association rules. In The international conference on very large databases (pp. 487–499).
Au, W. H., Chan, Keith. C. C., & Yao, X. (2003). A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation, 7(6), 532–545.
Casillas, J., Cordón, O., del Jesus, M. J., & Herrera, F. (2005). Genetic tuning of fuzzy rule deep structures preserving interpretability and its interaction with fuzzy rule set reduction. IEEE Transactions on Fuzzy Systems, 13(1), 13–29.
Chan, C. C., & Au, W. H. (1997). Mining fuzzy association rules. In The conference on information and knowledge management, Las Vegas (pp. 209–215).
Chen, C. H., Hong, T. P., Tseng, Vincent S., & Lee, C. S. (2009). A genetic-fuzzy mining approach for items with multiple minimum supports. Soft Computing, 13(5), 521–533.
Chen, J., Mikulcic, A., & Kraft, D. H. (2000). An integrated approach to information retrieval with fuzzy clustering and fuzzy inferencing. In O. Pons, M. A. Vila, & J. Kacprzyk (Eds.), Knowledge management in fuzzy databases. Heidelberg, Germany: Springer Physica-Verlag.
Cordón, O., Herrera, F., & Villar, P. (2001). Generating the knowledge base of a fuzzy rule-based system by the genetic learning of the data base. IEEE Transactions on Fuzzy Systems, 9(4), 667–674.
Darwen, P. J., & Yao, X. (1997). Speciation as automatic categorical modularization. IEEE Transactions on Evolutionary Computation, 1(2), 101–108.
Heng, P. A., Wong, T. T., Rong, Y., Chui, Y. P., Xie, Y. M., Leung, K. S., et al. (2006). Intelligent inferencing and haptic simulation for Chinese acupuncture learning and training. IEEE Transactions on Information Technology in Biomedicine, 10(1), 28–41.
Herrera, F., Lozano, M., & Verdegay, J. L. (1997). Fuzzy connectives based crossover operators to model genetic algorithms population diversity. Fuzzy Sets and Systems, 92(1), 21–30.
Hong, T. P., Chen, C. H., Wu, Y. L., & Lee, Y. C. (2006). A GA-based fuzzy mining approach to achieve a trade-off between number of rules and suitability of membership functions. Soft Computing, 10(11), 1091–1101.
Hong, T. P., Kuo, C. S., & Chi, S. C. (1999). A data mining algorithm for transaction data with quantitative values. In The eighth international fuzzy systems association world congress (pp. 874–878).

Hong, T. P., & Lee, Y. C. (2001). Mining coverage-based fuzzy rules by evolutional computation. In The IEEE international conference on data mining (pp. 218–224).
Hong, T. P., Kuo, C. S., & Chi, S. C. (2001). Trade-off between time complexity and number of rules for fuzzy mining from quantitative data. International Journal of Uncertainty, Fuzziness and Knowledge-based Systems, 9(5), 587–604.
Ishibuchi, H., & Yamamoto, T. (2005). Rule weight specification in fuzzy rule-based classification systems. IEEE Transactions on Fuzzy Systems, 13(4), 428–435.
Kaya, M., & Alhajj, R. (2003). A clustering algorithm with genetically optimized membership functions for fuzzy association rules mining. In The IEEE international conference on fuzzy systems (pp. 881–886).
Khare, V. R., Yao, X., Sendhoff, B., Jin, Y., & Wersing, H. (2005). Co-evolutionary modular neural networks for automatic problem decomposition. In The 2005 IEEE congress on evolutionary computation (Vol. 3, pp. 2691–2698).
Kuok, C., Fu, A., & Wong, M. (1998). Mining fuzzy association rules in databases. SIGMOD Record, 27(1), 41–46.
Lee, Y. C., Hong, T. P., & Lin, W. Y. (2004). Mining fuzzy association rules with multiple minimum supports using maximum constraints. Lecture notes in computer science (Vol. 3214, pp. 1283–1290). Springer-Verlag.
Liang, H., Wu, Z., & Wu, Q. (2002). A fuzzy based supply chain management decision support system. The World Congress on Intelligent Control and Automation, 4, 2617–2621.
Parodi, A., & Bonelli, P. (1993). A new approach of fuzzy classifier systems. In The fifth international conference on genetic algorithms (pp. 223–230). Los Altos, CA: Morgan Kaufmann.
Rasmani, K. A., & Shen, Q. (2004). Modifying weighted fuzzy subsethood-based rule models with fuzzy quantifiers. In The IEEE international conference on fuzzy systems (Vol. 3, pp. 1679–1684).
Roubos, H., & Setnes, M. (2001). Compact and transparent fuzzy models and classifiers through iterative complexity reduction. IEEE Transactions on Fuzzy Systems, 9(4), 516–524.
Setnes, M., & Roubos, H. (2000). GA-fuzzy modeling and classification: Complexity and performance. IEEE Transactions on Fuzzy Systems, 8(5), 509–522.
Siler, William, & James, J. (2004). Fuzzy expert systems and fuzzy reasoning. John Wiley and Sons.
Wang, C. H., Hong, T. P., & Tseng, S. S. (1998). Integrating fuzzy knowledge by genetic algorithms. IEEE Transactions on Evolutionary Computation, 2(4), 138–149.
Wang, C. H., Hong, T. P., & Tseng, S. S. (2000). Integrating membership functions and fuzzy rule sets from multiple knowledge sources. Fuzzy Sets and Systems, 112, 141–154.
Yao, X. (2003). Adaptive divide-and-conquer using populations and ensembles. In The 2003 international conference on machine learning and application (pp. 13–20).
Yue, S., Tsang, E., Yeung, D., & Shi, D. (2000). Mining fuzzy association rules with weighted items. In The IEEE international conference on systems, man and cybernetics (pp. 1906–1911).
Zhang, H., & Liu, D. (2006). Fuzzy modeling and fuzzy control. Springer-Verlag.
