feature extraction, and we propose an approach based on Boolean reasoning for new feature extraction from data tables with symbolic (nominal, qualitative) attributes. The new features are of the form $a \in V$, where $V \subseteq V_a$ and $V_a$ is the set of values of attribute $a$. We emphasize that Boolean reasoning is also a good framework for complexity analysis of approximate solutions of the discussed problems.
1 Introduction
"Feature Extraction" and "Feature Selection" are important problems in Machine Learning and Data Mining (see e.g. [3, 4, 6]). In previous papers we have considered problems like: the short reduct finding problem [16], the rule induction problem [17], the optimal discretization problem [12], and the linear feature (hyperplane) searching problem [13]. Our solutions to these problems are based on the Boolean reasoning schema [2]. In this paper we discuss the problem of searching for new features in a data table with symbolic (qualitative) attribute values. This problem, called the symbolic value partition problem, differs from the discretization problem: we do not assume any pre-defined order on attribute values. Once again, we apply the rough set method and Boolean reasoning to construct heuristics searching for relevant features of the form $a \in V \subseteq V_a$, generated by partitions of the symbolic values of conditional attributes into a small number of value sets. We also point out that Boolean reasoning can be used as a tool to measure the complexity of an approximate solution of a given problem. As a complexity measure of a given problem we propose the complexity of the Boolean function corresponding to that problem (represented by the number of variables, number of clauses, etc.). It is known that for some NP-hard problems it is easier to construct efficient heuristics than for others. The symbolic value partition problem is in this sense harder than the optimal discretization problem.
2 Preliminaries
We consider the Boolean algebra over $B = \{0, 1\}$ and $n$-variable Boolean functions $f : B^n \to B$, where $n \geq 1$.
For any sequence $a = (a[1], \ldots, a[n]) \in B^n$ and any vector of Boolean variables $x = (x_1, \ldots, x_n)$ we define the minterm $m_a$ and the maxterm $s_a$ by
$$m_a(x) = x_1^{a[1]} \wedge x_2^{a[2]} \wedge \ldots \wedge x_n^{a[n]} \quad \text{and} \quad s_a(x) = x_1^{\neg a[1]} \vee x_2^{\neg a[2]} \vee \ldots \vee x_n^{\neg a[n]},$$
where $x^1 = x$ and $x^0 = \neg x$.

Theorem 1. (see [18]) $f(x) = \bigvee_{a \in f^{-1}(1)} m_a(x) = \bigwedge_{b \in f^{-1}(0)} s_b(x)$.
These two representations are called the disjunctive (DNF) and conjunctive (CNF) normal forms of the function $f$, respectively. Let $u = (u_1, \ldots, u_n), v = (v_1, \ldots, v_n) \in \{0, 1\}^n$. We use the coordinate-wise ordering, i.e. $u \leq v$ if and only if $u_i \leq v_i$ for all $i$. A Boolean function $f$ is called monotone iff $u \leq v$ implies $f(u) \leq f(v)$. One can show that a Boolean function is monotone if and only if it can be defined without negation [18]. Given a set of variables $S \subseteq \{x_1, \ldots, x_n\}$ we define the monomial $m_S$ by $m_S(x) = \bigwedge_{x_i \in S} x_i$. The set $S$ of variables is called an implicant of the monotone Boolean function $f$ if and only if $m_S^{-1}(1) \subseteq f^{-1}(1)$. The set $S$ of variables is called a prime implicant of a monotone Boolean function $f$ if $S$ is an implicant of $f$ and no proper subset of $S$ is an implicant of $f$. We use the following properties of two problems related to monotone Boolean functions [2]:

Theorem 2. [12] For a given monotone Boolean function $f$ of $n$ variables in CNF and an integer $k$, the decision problem of checking whether there exists a prime implicant of $f$ with at most $k$ variables is NP-complete. The problem of searching for a minimal prime implicant of $f$ is NP-hard.

An information system [15] is a pair $A = (U, A)$, where $U$ is a non-empty, finite set called the universe and $A$ is a non-empty, finite set of attributes, i.e. $a : U \to V_a$ for $a \in A$, where $V_a$ is called the value set of $a$. Elements of $U$ are called objects. Any information system $A = (U, A)$ and a non-empty set $B \subseteq A$ define a $B$-information function by $Inf_B(x) = \{(a, a(x)) : a \in B\}$ for $x \in U$. The set $\{Inf_A(x) : x \in U\}$ is called the $A$-information set and denoted by $INF(A)$. Any information system of the form $A = (U, A \cup \{d\})$ is called a decision table, where $d \notin A$ is called the decision and the elements of $A$ are called conditions. Let $V_d = \{1, \ldots, r(d)\}$. The decision $d$ determines the partition $\{C_1, \ldots, C_{r(d)}\}$ of the universe $U$, where $C_k = \{x \in U : d(x) = k\}$ for $1 \leq k \leq r(d)$. The set $C_k$ is called the $k$-th decision class of $A$. With any subset of attributes $B \subseteq A$, an equivalence relation called the $B$-indiscernibility relation [15], denoted by $IND(B)$, is defined by $IND(B) = \{(x, y) \in U \times U : \forall_{a \in B}\, (a(x) = a(y))\}$. Objects $x, y$ satisfying the relation $IND(B)$ are indiscernible by attributes from $B$. By $[x]_{IND(B)}$ we denote the equivalence class of $IND(B)$ defined by $x$. A minimal subset $B$ of $A$ such that $IND(A) = IND(B)$ is called a reduct of $A$.
If $A = (U, A \cup \{d\})$ is a decision table and $B \subseteq A$ then we define a function $\partial_B : U \to 2^{\{1, \ldots, r(d)\}}$, called the generalized decision in $A$, by
$$\partial_B(x) = \{i : \exists_{x' \in U}\, [(x'\, IND(B)\, x) \wedge (d(x') = i)]\} = d([x]_{IND(B)}).$$
A decision table $A$ is called consistent (deterministic) if $card(\partial_A(x)) = 1$ for any $x \in U$; otherwise $A$ is inconsistent (non-deterministic).
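The indiscernibility classes and the generalized decision $\partial_B$ can be computed directly from the definition. The following sketch (with a hypothetical dictionary encoding of objects) groups objects by their $B$-information vectors and collects the decisions occurring in each class:

```python
from collections import defaultdict

def generalized_decision(U, B, d):
    """Group objects by their B-information vectors and collect the
    decision values seen in each indiscernibility class.
    U: list of objects, each a dict attribute -> value (illustrative encoding).
    B: list of condition attributes; d: name of the decision attribute."""
    classes = defaultdict(set)
    for x in U:
        info_B = tuple(x[a] for a in B)   # Inf_B(x)
        classes[info_B].add(x[d])         # decisions occurring in [x]_IND(B)
    return classes

# A toy table: consistent w.r.t. {a, b} but not w.r.t. {a} alone.
U = [{"a": 1, "b": 0, "d": 0},
     {"a": 1, "b": 1, "d": 1},
     {"a": 2, "b": 0, "d": 1}]

print(generalized_decision(U, ["a", "b"], "d"))  # every class has one decision
print(generalized_decision(U, ["a"], "d"))       # the class a=1 collects {0, 1}
```

The table is consistent with respect to $B$ exactly when every class maps to a singleton decision set.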
The discernibility function $f_A$ [16] is a Boolean function of $k$ Boolean variables $\bar{a}_1, \ldots, \bar{a}_k$ corresponding to the attributes $a_1, \ldots, a_k$, respectively, and defined by
$$f_A(\bar{a}_1, \ldots, \bar{a}_k) =_{df} \bigwedge_{c_{ij} \neq \emptyset} \bigvee \bar{c}_{ij}, \quad \text{where } \bar{c}_{ij} = \{\bar{a} : a \in c_{ij}\}.$$
The set of all prime implicants of $f_A$ determines the set of all reducts of $A$ [16]. In the sequel, to simplify the notation, we omit the star superscripts. Observe that the Boolean function $f_A$ consists of $k$ variables and $O(n^2)$ clauses. A subset $B$ of the set $A$ of attributes of a decision table $A = (U, A \cup \{d\})$ is a relative reduct of $A$ iff $B$ is a minimal set with respect to the following property: $\partial_B = \partial_A$. The set of all relative reducts in $A$ is denoted by $RED(A, d)$.
Theorem 3. [16] The decision problem of checking whether a given decision table has a reduct of length $< k$ is NP-complete. The problem of searching for a reduct of minimal length is NP-hard.
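The clauses of the discernibility function can be generated by comparing every pair of objects with different decisions; each clause lists the attributes on which the pair differs. A minimal sketch (the table encoding is hypothetical):

```python
from itertools import combinations

def discernibility_clauses(U, A, d):
    """For every pair of objects with different decisions, emit the set of
    condition attributes on which they differ; the conjunction of these
    sets (each read as a disjunction) is the CNF of the discernibility
    function f_A."""
    clauses = set()
    for x, y in combinations(U, 2):
        if x[d] != y[d]:
            c = frozenset(a for a in A if x[a] != y[a])
            if c:                      # empty sets carry no constraint
                clauses.add(c)
    return clauses

U = [{"a": 1, "b": 0, "d": 0},
     {"a": 1, "b": 1, "d": 1},
     {"a": 2, "b": 0, "d": 1}]
for clause in discernibility_clauses(U, ["a", "b"], "d"):
    print(sorted(clause))
```

Any hitting set of these clauses discerns all pairs of objects from different decision classes; minimal hitting sets correspond to the prime implicants of $f_A$.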
3.2 Discretization making

Let $A = (U, A \cup \{d\})$ be a decision table where $U = \{x_1, x_2, \ldots, x_n\}$, $A = \{a_1, \ldots, a_k\}$ and $d : U \to \{1, \ldots, r\}$. We assume $V_a = [l_a, r_a) \subset \Re$ to be a real interval for any $a \in A$ and $A$ to be a consistent decision table. Any pair $(a, c)$ where $a \in A$ and $c \in \Re$ will be called a cut on $V_a$. Let $P_a$ be a partition of $V_a$ (for $a \in A$) into subintervals, i.e. $P_a = \{[c_0^a, c_1^a), [c_1^a, c_2^a), \ldots, [c_{k_a}^a, c_{k_a+1}^a)\}$ for some integer $k_a$, where $l_a = c_0^a < c_1^a < c_2^a < \ldots < c_{k_a}^a < c_{k_a+1}^a = r_a$ and $V_a = [c_0^a, c_1^a) \cup [c_1^a, c_2^a) \cup \ldots \cup [c_{k_a}^a, c_{k_a+1}^a)$. Hence any partition $P_a$ is uniquely defined and often identified with the set of cuts $\{(a, c_1^a), (a, c_2^a), \ldots, (a, c_{k_a}^a)\} \subset A \times \Re$. Any set of cuts $P = \bigcup_{a \in A} P_a$ defines from $A = (U, A \cup \{d\})$ a new decision table $A^P = (U, A^P \cup \{d\})$ called the $P$-discretization of $A$, where $A^P = \{a^P : a \in A\}$ and $a^P(x) = i \Leftrightarrow a(x) \in [c_i^a, c_{i+1}^a)$ for $x \in U$ and $i \in \{0, \ldots, k_a\}$.
Two sets of cuts $P', P$ are equivalent, i.e. $P' \equiv_A P$, iff $A^P = A^{P'}$. The equivalence relation $\equiv_A$ has a finite number of equivalence classes. In the sequel we will not discern between equivalent families of partitions. We say that the set of cuts $P$ is $A$-consistent if $\partial_A = \partial_{A^P}$, where $\partial_A$ and $\partial_{A^P}$ are the generalized decisions of $A$ and $A^P$, respectively. The $A$-consistent set of cuts $P^{irr}$ is $A$-irreducible if $P$ is not $A$-consistent for any $P \subsetneq P^{irr}$. The $A$-consistent set of cuts $P^{opt}$ is $A$-optimal if $card(P^{opt}) \leq card(P)$ for any $A$-consistent set of cuts $P$.
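The $P$-discretization itself is a simple interval lookup: $a^P(x)$ is the index $i$ such that $a(x) \in [c_i^a, c_{i+1}^a)$. A minimal sketch, with hypothetical cut values:

```python
import bisect

def discretize(value, cuts):
    """Map a real value to the index of its interval [c_i, c_{i+1})
    determined by the sorted cut points; bisect_right puts a value
    equal to a cut into the interval that starts at that cut."""
    return bisect.bisect_right(cuts, value)

cuts_a = [0.5, 1.5]                 # partitions V_a into three intervals
print(discretize(0.2, cuts_a))      # interval [l_a, 0.5) -> index 0
print(discretize(0.5, cuts_a))      # interval [0.5, 1.5) -> index 1
print(discretize(2.0, cuts_a))      # interval [1.5, r_a) -> index 2
```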
Theorem 4. [12] The decision problem of checking whether for a given decision table $A$ and an integer $k$ there exists an irreducible set of cuts $P$ in $A$ such that $card(P) < k$ is NP-complete. The problem of searching for an optimal set of cuts $P$ in a given decision table $A$ is NP-hard.

Let us define a new decision table $A^* = (U^*, A^* \cup \{d^*\})$ where:
- $U^* = \{(u, v) \in U^2 : d(u) \neq d(v)\} \cup \{\bot\}$;
- $A^* = \{\bar{c} : c$ is a cut on $A\}$, with $\bar{c}(\bot) = 0$ and $\bar{c}((u, v)) = 1$ if $c$ discerns $u, v$, and $0$ otherwise;
- $d^*(\bot) = 0$ and $d^*((u, v)) = 1$ for $(u, v) \in U^*$.

It has been shown [12] that any relative reduct of $A^*$ is an irreducible set of cuts for $A$ and any minimal relative reduct of $A^*$ is an optimal set of cuts for $A$. The Boolean function corresponding to the minimal relative reduct problem of $A^*$ has $O(kn)$ variables and $O(n^2)$ clauses.
3.3 Discretization defined by Hyperplanes

Let $A = (U, A \cup \{d\})$ be a decision table, where $U = \{u_1, \ldots, u_n\}$, $A = \{f_1, \ldots, f_k\}$ and $d : U \to \{1, \ldots, m\}$, and let $C_i = \{u \in U : d(u) = i\}$ for $i = 1, \ldots, m$. Assuming that objects $u_i \in U$ are described by conditional attributes, we can characterize them as points $P_i = (f_1(u_i), \ldots, f_k(u_i))$ in the $k$-dimensional affine space $\Re^k$. Any hyperplane $H$ in $\Re^k$ is fully characterized by a $(k+1)$-tuple of real numbers $(a, a_1, a_2, \ldots, a_k)$:
$$H = \{(x_1, x_2, \ldots, x_k) \in \Re^k : a_1 x_1 + \cdots + a_k x_k + a = 0\}.$$
The hyperplane $H$ splits $C_i$ into two subclasses defined by:
$$C_i^{U,H} = \{u \in C_i : a_1 f_1(u) + \cdots + a_k f_k(u) + a \geq 0\},$$
$$C_i^{L,H} = \{u \in C_i : a_1 f_1(u) + \cdots + a_k f_k(u) + a < 0\}.$$
We consider the discretization problems as before, but instead of the cut set we take as the search space for new features the set of characteristic functions of half-spaces defined by hyperplanes over the attribute set $A$. It is easy to observe that the problem of searching for an optimal set of hyperplane cuts is NP-hard. Observe that the Boolean function corresponding to this problem has $O(n^k)$ variables (hyperplanes) and $O(n^2)$ clauses. Hence the problem of searching for a sub-optimal set of oblique hyperplanes is harder than the problem of searching for a sub-optimal set of (parallel/orthogonal to axes) cuts.
4 Approximate Algorithms
4.1 Johnson strategy
In general, the Johnson greedy algorithm searching for a shortest prime implicant of a given Boolean function of $k$ variables $V = \{x_1, \ldots, x_k\}$ in CNF:
$$f = (x_{1,1} \vee x_{1,2} \vee \ldots \vee x_{1,i_1}) \wedge \ldots \wedge (x_{N,1} \vee x_{N,2} \vee \ldots \vee x_{N,i_N}),$$
where $x_{i,j} \in V$, is described as follows:

Johnson strategy: Greedy algorithm
Step 1: Choose the variable $x \in V$ most frequently occurring in $f$.
Step 2: Remove from $f$ all clauses containing the variable $x$.
Step 3: If $f \neq 0$ then go to Step 1, else go to Step 4.
Step 4: From the set of chosen variables remove superfluous variables.

The obtained set of variables is returned as the result of the algorithm. In general, the time complexity of the presented algorithm depends on the time complexity of Step 1. If there are $k$ variables and $N$ clauses, Step 1 takes $O(kN)$ computing steps. In some particular cases one can reduce the time complexity of this algorithm, but usually the number of variables and the number of clauses determine the complexity of the problem.
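The four steps above can be sketched as a greedy cover over the CNF clauses (the clause set in the usage example is hypothetical):

```python
from collections import Counter

def johnson(clauses):
    """Greedy cover: repeatedly pick the variable occurring in the most
    remaining clauses, then drop the clauses it satisfies (Steps 1-3).
    A final backward pass removes superfluous variables (Step 4)."""
    clauses = [set(c) for c in clauses]
    chosen = []
    remaining = [c for c in clauses if c]
    while remaining:
        counts = Counter(v for c in remaining for v in c)
        x = counts.most_common(1)[0][0]          # Step 1
        chosen.append(x)
        remaining = [c for c in remaining if x not in c]  # Step 2
    # Step 4: a variable is superfluous if the rest still hits every clause.
    for x in list(chosen):
        rest = set(chosen) - {x}
        if all(c & rest for c in clauses if c):
            chosen.remove(x)
    return chosen

cnf = [{"x1", "x2"}, {"x1", "x3"}, {"x2", "x3"}]
print(johnson(cnf))   # any two distinct variables hit all three clauses
```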
A straightforward realization of this algorithm requires $O(kn^2)$ memory space and $O(kn^3)$ steps to determine one cut, so it is not feasible in practice. The MD-heuristic presented in [14] determines the best cut in $O(kn)$ steps using only $O(kn)$ space.
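The idea of choosing the cut that discerns the most pairs of objects from different decision classes can be sketched for a single attribute as follows (a simplified illustration, not the MD-heuristic of [14] itself):

```python
from collections import Counter

def pairs_discerned(values, labels, cut):
    """Number of object pairs with different decisions separated by the cut:
    (all left-right pairs) minus (left-right pairs within the same class)."""
    left = Counter(l for v, l in zip(values, labels) if v < cut)
    right = Counter(l for v, l in zip(values, labels) if v >= cut)
    total = sum(left.values()) * sum(right.values())
    same = sum(left[c] * right[c] for c in set(left) | set(right))
    return total - same

def best_cut(values, labels):
    """Scan candidate cuts (midpoints of consecutive distinct values)."""
    vs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(vs, vs[1:])]
    return max(candidates, key=lambda c: pairs_discerned(values, labels, c))

values = [0.0, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 1]
print(best_cut(values, labels))   # 1.5 separates the two classes perfectly
```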
The quality of a hyperplane $H$ is measured by the function
$$award(H) = \sum_{i \neq j} L_i R_j,$$
where $L_i = card(C_i^{L,H})$ and $R_i = card(C_i^{U,H})$. If $award(H) > award(H')$ then the number of pairs of objects from different decision classes discerned by the hyperplane $H$ is greater than the corresponding number defined by the hyperplane $H'$. This function has been applied in the MD-heuristic to measure the number of discernible pairs of objects. Since the problems of searching for the hyperplanes are hard, some heuristics are used to solve them [13]. We also use the penalty function:
$$penalty(H) = \sum_{i=1}^{r} L_i R_i$$
or more advanced functions to measure the quality of oblique hyperplanes:
$$power_1(H) = \frac{w_1 \cdot award(H)}{penalty(H) + w_2}, \qquad power_2(H) = w_1 \cdot award(H) - w_2 \cdot penalty(H).$$
There are numerous methods of searching for optimal hyperplanes (see e.g. [7]) based on various heuristics like "simulated annealing" [7] or "randomized induction". In [13] we proposed a general method (based on a genetic strategy) of searching for an optimal set of hyperplanes.
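Under the award and penalty formulas as reconstructed above, evaluating a hyperplane reduces to counting, per decision class, the objects on each side. A sketch with hypothetical one-dimensional data:

```python
from collections import Counter

def split_counts(points, labels, w, w0):
    """Count, per decision class, the points on the upper side (R_i) and
    lower side (L_i) of the hyperplane w . x + w0 = 0."""
    upper, lower = Counter(), Counter()
    for x, c in zip(points, labels):
        s = sum(wi * xi for wi, xi in zip(w, x)) + w0
        (upper if s >= 0 else lower)[c] += 1
    return lower, upper

def award(points, labels, w, w0):
    L, R = split_counts(points, labels, w, w0)
    return sum(L[i] * R[j] for i in L for j in R if i != j)

def penalty(points, labels, w, w0):
    L, R = split_counts(points, labels, w, w0)
    return sum(L[i] * R[i] for i in set(L) | set(R))

points = [(0.0,), (1.0,), (2.0,), (3.0,)]
labels = [1, 1, 2, 2]
# The hyperplane x - 1.5 = 0 separates the two classes perfectly:
print(award(points, labels, (1.0,), -1.5))    # 4 mixed pairs discerned
print(penalty(points, labels, (1.0,), -1.5))  # 0 same-class pairs split
```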
In the case of symbolic value attributes (i.e. without any pre-assumed order on the value sets of attributes) the problem of searching for partitions of value sets into a "small" number of subsets is, in a sense, more complicated than for continuous attributes. Once again, we apply the Boolean reasoning approach to construct a partition of symbolic value sets into a small number of subsets. Let $A = (U, A \cup \{d\})$ be a decision table where $A = \{a_i : U \to V_{a_i}\}$ and $V_{a_i} = \{v_1^{a_i}, v_2^{a_i}, \ldots, v_{n_i}^{a_i}\}$ for $i \in \{1, \ldots, k\}$. Any function $P_{a_i} : V_{a_i} \to \{1, \ldots, m_i\}$ (where $2 \leq m_i \leq n_i$) is called a partition of $V_{a_i}$. The rank of $P_{a_i}$ is the value $rank(P_{a_i}) = card(P_{a_i}(V_{a_i}))$. The function $P_{a_i}$ defines a new partition attribute $b_i = P_{a_i} \circ a_i$, i.e. $b_i(u) = P_{a_i}(a_i(u))$ for any object $u \in U$. The family of partitions $\{P_a\}_{a \in B}$ is $B$-consistent iff
$$\forall_{u, u' \in U}\; [(d(u) \neq d(u')) \wedge (Inf_B(u) \neq Inf_B(u'))] \Rightarrow \exists_{a \in B}\; P_a(a(u)) \neq P_a(a(u')). \quad (2)$$
SYMBOLIC VALUE PARTITION PROBLEM: For a given decision table $A = (U, A \cup \{d\})$ and a set of nominal attributes $B \subseteq A$, search for a minimal $B$-consistent family of partitions (i.e. a $B$-consistent family $\{P_a\}_{a \in B}$ with the minimal value of $\sum_{a \in B} rank(P_a)$).
This concept is useful when we want to reduce the value sets of attributes with large cardinalities. The discretization problem can be derived from the partition problem by adding the monotonicity condition for the family $\{P_a\}_{a \in A}$:
$$\forall_{v_1, v_2 \in V_a}\; [v_1 \leq v_2 \Rightarrow P_a(v_1) \leq P_a(v_2)].$$
We propose two approaches for solving this problem, namely the local partition method and the global partition method. The former approach is based on grouping the values of each attribute independently, whereas the latter approach is based on grouping attribute values simultaneously for all attributes.
Theorem 5. If $P_a$ is $a$-consistent then $P_a \preceq UNI_a$. The equivalence relation $UNI_a$ defines a minimal $a$-consistent partition on $V_a$.
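Applying a partition $P_a$ simply replaces each symbolic value by the number of its block; $rank(P_a)$ is the number of blocks actually used. A sketch with a hypothetical grouping of four nominal values into two blocks:

```python
def apply_partition(column, P):
    """Replace each symbolic value by its block number under the
    partition P, yielding the new attribute b = P_a o a."""
    return [P[v] for v in column]

# Hypothetical partition: {a1, a3} -> block 1, {a2, a4} -> block 2.
P_a = {"a1": 1, "a3": 1, "a2": 2, "a4": 2}
col = ["a1", "a2", "a4", "a3"]
print(apply_partition(col, P_a))   # [1, 2, 2, 1]
print(len(set(P_a.values())))      # rank(P_a) = 2
```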
We consider the discernibility matrix [16] of the decision table $A$: $M(A) = [m_{i,j}]_{i,j=1}^{n}$, where $m_{i,j}$ is the set of all attributes having different values on the objects $u_i, u_j$, i.e. $m_{i,j} = \{a \in A : a(u_i) \neq a(u_j)\}$. Observe that if we want to discern between objects $u_i$ and $u_j$ we have to keep one of the attributes from $m_{i,j}$. For the needs of our problem we would like to have a more relevant formulation: to discern objects $u_i, u_j$ we have to discern, for some $a \in m_{i,j}$, between the values of the value pair $(a(u_i), a(u_j))$. Hence instead of the cuts used for continuous values (defined by pairs $(a_i, c_j)$), one can discern objects by triples $(a_i, v_{i_1}^{a_i}, v_{i_2}^{a_i})$ called chains, where $a_i \in A$ for $i = 1, \ldots, k$ and $i_1, i_2 \in \{1, \ldots, n_i\}$. One can build a new decision table $A^+ = (U^+, A^+ \cup \{d^+\})$ (analogously to the table $A^*$, see Section 3.2) assuming $U^+ = U^*$, $d^+ = d^*$ and $A^+ = \{(a, v_1, v_2) : (a \in A) \wedge (v_1, v_2 \in V_a)\}$. Again one can apply to $A^+$ e.g. the Johnson heuristic to search for a minimal set of chains discerning all pairs of objects from different decision classes. One can see that our problem can be solved by efficient heuristics for graph coloring. The "graph $k$-colorability" problem is formulated as follows:
input: a graph $G = (V, E)$ and a positive integer $k \leq |V|$;
output: 1 if $G$ is $k$-colorable (i.e. if there exists a function $f : V \to \{1, \ldots, k\}$ such that $f(v) \neq f(v')$ whenever $(v, v') \in E$), and 0 otherwise.
This problem is solvable in polynomial time for $k = 2$, but is NP-complete for all $k \geq 3$. However, similarly to discretization, one can apply some efficient heuristic searching for an optimal graph coloring, determining optimal partitions of attribute value sets. For any attribute $a_i$ occurring in a semi-minimal set $X$ of chains returned by the above heuristic we construct a graph $\Gamma_{a_i} = \langle V_{a_i}, E_{a_i} \rangle$, where $E_{a_i}$ is equal to the set of all chains in $X$ of the attribute $a_i$. Any coloring of all the graphs $\Gamma_{a_i}$ defines an $A$-consistent partition of the value sets. Hence heuristics searching for a minimal graph coloring also return sub-optimal partitions of attribute value sets. One can see that this time the constructed Boolean formula has $O(knl^2)$ variables and $O(n^2)$ clauses, where $l$ is the maximal value of $card(V_a)$ for $a \in A$. Let us also note that once prime implicants have been constructed, a heuristic for graph coloring should be applied to generate new features.
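A first-fit greedy coloring of the graphs $\Gamma_{a_i}$ suffices in practice; the endpoints of every chain must receive different colors. A sketch with a hypothetical chain set for one attribute:

```python
from collections import defaultdict

def greedy_coloring(vertices, edges):
    """First-fit coloring: give each vertex the smallest color not used
    by an already-colored neighbour; colors become partition blocks."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for v in vertices:
        used = {color[n] for n in adj[v] if n in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color

# Hypothetical chains of attribute a kept by the covering heuristic:
chains_a = [("a1", "a2"), ("a2", "a3"), ("a1", "a4")]
coloring = greedy_coloring(["a1", "a2", "a3", "a4"], chains_a)
print(coloring)   # endpoints of every chain receive different colors
```

The resulting color classes are exactly the blocks of the partition $P_a$ of the value set $V_a$.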
5.3 Example
Let us consider the decision table presented in Figure 1 and a reduced form of its discernibility matrix. First, from the Boolean function $f_A$ with Boolean variables of the form $a_{v_2}^{v_1}$ (corresponding to the chain $(a, v_1, v_2)$ described in Section 5.2) we find a shortest prime implicant, which can be represented by graphs (Figure 2). Next we apply a heuristic to color the vertices of those graphs, as shown in Figure 2. The colors correspond to the partitions:
Figure 1 shows the decision table $A$ over the conditional attributes $a$, $b$ and the decision $d$, with the attribute values

  a:  u1=a1, u2=a1, u3=a2, u4=a3, u5=a1, u6=a2, u7=a2, u8=a4, u9=a3, u10=a2
  b:  u1=b1, u2=b2, u3=b3, u4=b1, u5=b4, u6=b2, u7=b1, u8=b2, u9=b4, u10=b5

together with a reduced form of the discernibility matrix $M(A)$ and the resulting table over the partition attributes $a^{P_a}$, $b^{P_b}$.
Figure 2 shows the graphs $\Gamma_a$ (with vertices $a_1, a_2, a_3, a_4$) and $\Gamma_b$ (with vertices $b_1, \ldots, b_5$) together with a coloring of their vertices.
6 Conclusions
We have presented applications of Boolean reasoning methods to different problems: minimal reduct finding, optimal discretization, searching for best hyperplanes, and minimal partition. These examples show the power of this tool in searching for new features. In our system for data analysis we have implemented efficient heuristics based on these methods. Tests show that they are very efficient with respect to time complexity. They also ensure a high quality of recognition of new, unseen cases ([13, 14]). The heuristics for symbolic value partition allow us to obtain a more compressed form of the decision algorithm. Hence, by the minimum description length principle, one can expect that they will return decision algorithms with a high quality of classification of unseen objects.

Acknowledgement: This work was supported by the State Committee for Scientific Research (grant KBN 8T11C01011).
References
[1] Almuallim H., Dietterich T.G. (1994). Learning Boolean Concepts in the Presence of Many Irrelevant Features. Artificial Intelligence, 69(1-2), pp. 279-305.
[2] Brown F.M. (1990). Boolean Reasoning. Kluwer, Dordrecht.
[3] Catlett J. (1991). On changing continuous attributes into ordered discrete attributes. In: Y. Kodratoff (ed.), Machine Learning - EWSL-91, Proc. of the European Working Session on Learning, Porto, Portugal, March 1991, LNAI, pp. 164-178.
[4] Chmielewski M.R., Grzymala-Busse J.W. (1994). Global Discretization of Attributes as Preprocessing for Machine Learning. Proc. of the III International Workshop on RSSC'94, November 1994, pp. 294-301.
[5] Dougherty J., Kohavi R., Sahami M. (1995). Supervised and Unsupervised Discretization of Continuous Features. Proc. of the Twelfth International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, pp. 194-202.
[6] Fayyad U.M., Irani K.B. (1992). The attribute selection problem in decision tree generation. Proc. of AAAI-92, July 1992, San Jose, CA, MIT Press, pp. 104-110.
[7] Heath D., Kasif S., Salzberg S. (1993). Induction of Oblique Decision Trees. Proc. of the 13th International Joint Conference on AI, Chambery, France, pp. 1002-1007.
[8] Holte R.C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning 11, pp. 63-90.
[9] John G., Kohavi R., Pfleger K. (1994). Irrelevant features and the subset selection problem. Proceedings of the Twelfth International Conference on Machine Learning, Morgan Kaufmann, pp. 121-129.
[10] Kerber R. (1992). ChiMerge: Discretization of numeric attributes. Proc. of the Tenth National Conference on Artificial Intelligence, MIT Press, pp. 123-128.
[11] Kodratoff Y., Michalski R. (1990). Machine Learning: An Artificial Intelligence Approach, vol. 3, Morgan Kaufmann.
[12] Nguyen H.S., Skowron A. (1995). Quantization of real value attributes: Rough set and Boolean reasoning approaches. Proc. of the Second Joint Annual Conference on Information Sciences, Wrightsville Beach, NC, USA, pp. 34-37.
[13] Nguyen H.S., Nguyen S.H., Skowron A. (1996). Searching for Features defined by Hyperplanes. In: Z.W. Ras, M. Michalewicz (eds.), Proc. of the IX International Symposium on Methodologies for Information Systems ISMIS'96, June 1996, Zakopane, Poland, Lecture Notes in AI 1079, Springer-Verlag, Berlin, pp. 366-375.
[14] Nguyen S.H., Nguyen H.S. (1996). Some Efficient Algorithms for Rough Set Methods. Proc. of the Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Granada, Spain, pp. 1451-1456.
[15] Pawlak Z. (1991). Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer, Dordrecht.
[16] Skowron A., Rauszer C. (1992). The Discernibility Matrices and Functions in Information Systems. In: R. Slowinski (ed.), Intelligent Decision Support - Handbook of Applications and Advances of the Rough Sets Theory, Kluwer, Dordrecht, pp. 331-362.
[17] Skowron A., Polkowski L. Synthesis of Decision Systems from Data Tables. In: T.Y. Lin, N. Cercone (eds.), Rough Sets and Data Mining: Analysis of Imprecise Data. Kluwer, Dordrecht, pp. 259-300.
[18] Wegener I. (1987). The Complexity of Boolean Functions. John Wiley & Sons, Stuttgart.
This article was processed using the LaTeX macro package with LLNCS style.