
J Intell Inf Syst (2012) 39:559-576
DOI 10.1007/s10844-012-0203-x

Improving database performance with a mixed fragmentation design


Narasimhaiah Gorla · Vincent Ng · Dik Man Law

Received: 1 April 2007 / Revised: 16 November 2011 / Accepted: 22 March 2012 / Published online: 12 April 2012
© Springer Science+Business Media, LLC 2012

Abstract The performance of database operations can be enhanced with an efficient storage structure design using attribute partitioning and/or tuple clustering. Previous research deals mostly with attribute partitioning. We address here the combined problem of attribute partitioning and tuple clustering. We propose a novel approach to this mixed fragmentation problem by applying a genetic algorithm iteratively to the attribute partitioning and tuple clustering sub-problems. We compared our results with attribute-only partitioning and with random search solutions, obtaining database access cost reductions of up to 70% and 67%, respectively. We also analyzed, through experimentation, the effect of varying genetic parameters on the optimal solution.

Keywords Mixed fragmentation · Attribute partitioning · Tuple clustering · Genetic algorithms · Database performance · Data mining

1 Introduction
The data transfer between CPU and secondary storage is clearly the main bottleneck for database performance. The fewer the accesses to and from the disk, the more efficient the execution of database transactions. Fragmentation or partitioning is a storage structure design technique that improves database performance. It is based on the principle that data segments that are frequently accessed together will require fewer disk accesses if they are physically stored together. Data segments can take the

An earlier version of this paper was presented at the 2003 ACM Symposium on Applied Computing (SAC).
N. Gorla (B)
American University of Sharjah, PO Box 26666, Sharjah, UAE
e-mail: ngorla@aus.edu
V. Ng · D. M. Law
Hong Kong Polytechnic University, Hong Kong, China


form of attributes and tuples in relational databases, which leads to two types of storage structure problems: attribute partitioning and tuple clustering. A brief review of earlier research in these areas follows. Attribute partitioning, or vertical fragmentation, in relational databases involves the decomposition of a table into vertical fragments in which subsets of attributes are stored. Several techniques of attribute partitioning have been proposed in the past. In Eisner and Severance's work (1976), a file is partitioned into a primary file and a secondary file, between which the attributes are distributed. Hoffer and Severance (1975) measured the affinity between pairs of attributes and clustered them according to their pair-wise affinity using the bond energy algorithm (BEA). The BEA approach was developed further in Niamir (1978), which minimized a cost function using an attribute affinity matrix: pairs of objects within a cluster carrying a large measure of similarity, and pairs across cluster boundaries carrying a small measure of similarity, were identified as clusters. However, the creation of fragments was left to the subjectivity of the designer. Navathe et al. (1984) extended the results of Hoffer and Severance's study and proposed a two-phase approach for vertical partitioning to reduce the subjective element in defining clusters. Cornell and Yu (1990) iteratively used an optimal binary partitioning algorithm to obtain more partitions. Song and Gorla (2000) extended Cornell and Yu's study by proposing a combined solution of access path selection and vertical partitioning using the genetic algorithm. Another approach to attribute partitioning was proposed by Chu and Ieong (1993), using the transaction rather than the attribute as the unit of manipulation. Ailamaki et al. (2001) proposed the Partition Attributes Across (PAX) model to improve cache performance, while Ramamurthy et al.
(2002) proposed the fractured mirrors partitioning scheme based on the Decomposition Storage Model and the N-ary Storage Model. Cheng et al. (2002) used a genetic search-based clustering algorithm, based on the traveling-salesman problem, to obtain vertical partitions in distributed databases. Attribute partitioning is a generic technique and has also been applied in object-oriented databases (Fung et al. 2002; Gorla 2001; Baiao et al. 2004), distributed databases (March and Rho 1995; Ozsu and Valduriez 1996; Tamhanker and Ram 1998; Cheng et al. 2002), and data warehouse design (Ezeife 2001; Furtado et al. 2005) to improve data access performance. More recently, attribute partitioning in relational databases has been studied in the context of referential integrity constraints (Gorla 2007) and by using data-mining techniques such as hierarchical adaptive clustering (Serban and Campan 2008) and association rules (Gorla and Betty 2008). The other storage structure design aimed at improving database performance is tuple clustering. By storing tuples that need to be accessed together in fewer physical blocks, the number of disk accesses decreases. Knuth (1973) and Rivest (1976) described heuristics for clustering tuples according to their access frequencies; these heuristics are limited in scope, as they assume no blocking of tuples into pages and that a sequential scan is the only option for searching a relation. Horizontal partitioning, a concept similar to tuple clustering, has been popular in distributed database design. For example, Ceri et al. (1983) presented algorithms that divide the problem into smaller subproblems to provide horizontal partitioning solutions in distributed databases. Tamhanker and Ram (1998) developed an integrated methodology for horizontal fragmentation and allocation of the fragments in a distributed database environment. Recently, horizontal fragmentation has been applied in XML


data warehouses using the data-mining technique of a k-means-based fragmentation approach (Mahboubi and Darmont 2008). These studies have considered attribute partitioning and tuple clustering independently of each other, and few have recognized the interdependency of the two design solutions. Using exhaustive enumeration, Gorla and Quinn (1991) showed that the optimal attribute partitioning is not the same for different tuple orderings, and the optimal tuple ordering is not the same for different attribute partitionings. These authors obtained a combined-optimal solution of attribute partitioning and tuple ordering. However, their solution could only handle small database problems because of the intractable nature of exhaustive enumeration; it took several hours of computational time even to solve a small database problem, so the method cannot be used for large database problems. The objective of this research is to provide a method for solving the combined attribute partitioning and tuple clustering problem. The methodology uses the Genetic Algorithm (GA), a heuristic approach, through which larger database problems can be solved. This paper extends the previous research of Gorla and Quinn (1991) by proposing a GA-based solution to the mixed fragmentation problem in relational databases (Ng et al. 2003). The organization of the paper is as follows: Section 2 formulates the mixed fragmentation problem and an evaluation cost function, Section 3 describes a GA-based procedure for the mixed fragmentation problem, and Section 4 illustrates the procedure. Section 5 applies the procedure to a database problem presented by previous researchers. Section 6 discusses the effect of genetic operators on the optimal solution, and Section 7 presents conclusions and future research directions.

2 Problem formulation
The mixed fragmentation problem can be defined as determining the assignment of attributes to partitions and the assignment of tuples to blocks so that the number of disk accesses needed to execute all transactions is minimized. The number of disk accesses required to execute the transactions is used as the cost function to evaluate a mixed fragmentation scheme. The mixed fragmentation problem can be formulated as a 0-1 programming problem (Gorla and Quinn 1991; Ng et al. 2003):

Minimize  Σ_q f_q Σ_j Σ_i X_qi Y_qj    (1)

where
f_q    frequency of query q
X_qi   1 if query q needs block i (i.e., if Σ_t R_qt W_it >= 1), 0 otherwise
Y_qj   1 if query q needs partition j (i.e., if Σ_a P_qa Q_ja >= 1), 0 otherwise
R_qt   1 if query q uses tuple t, 0 otherwise
P_qa   1 if query q uses attribute a, 0 otherwise


W_it   1 if tuple t belongs to block i, 0 otherwise
Q_ja   1 if attribute a belongs to partition j, 0 otherwise

In the above formulation, X_qi * Y_qj is 1 disk access if query q uses partition j and block i. This value is summed over all partitions and all blocks for the query and multiplied by the frequency of the query; the sum over all queries is the total number of disk accesses. The decision variables are the set of attributes in each partition j and the set of tuples in each block i. Based on the above formulation for mixed fragmentation, the following cost formula is obtained to evaluate a scheme with attribute partitioning scheme P and tuple clustering scheme T:
C_PT = Σ_{q=1}^{Q} f_q (1 + (m_pq - 1) W) Σ_{j=1}^{m_p} Σ_{i=1}^{b_pj} B X_PTqij    (2)

where
Q         total number of queries
f_q       frequency of query q
L_pj      length of partition j in attribute partitioning scheme P
L_TID     length of the tuple identifier
B         page block size
m_p       number of partitions in attribute partitioning scheme P
m_pq      number of partitions used by query q in attribute partitioning scheme P
b_pj      number of blocks required for partition j, = ceil(CR / floor(B / (L_pj + L_TID)))
X_PTqij   1 if any record in block i and partition j is retrieved by query q under attribute partitioning scheme P and tuple clustering scheme T, 0 otherwise
W         penalty cost factor
CR        cardinality of the relation

The above expression gives the database access cost as the total number of disk accesses required to satisfy all queries under a mixed fragmentation scheme with attribute partitioning P and tuple clustering T. The differences between (1) and (2) are the inclusion of a penalty function to account for the cost a query incurs in accessing multiple fragments, and the multiplication of the objective function by B to obtain the amount of data transferred from secondary storage, which is representative of database access cost. A penalty cost (W = 10%) is built into the model for queries requiring visits to multiple fragments. Overheads may be incurred in opening and closing a subfile, initializing the page map table for the subfile, or allocating buffer areas in primary memory for the subfile. The penalty cost is proportional to the number of additional fragments accessed by each transaction.
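As a concrete illustration, formula (2) and the block count b_pj can be sketched as follows (our own Python, not part of the paper; the inputs are hypothetical, and the per-query counts of (partition, block) pairs with X_PTqij = 1 are assumed to be supplied directly rather than derived from a tuple-to-block map):

```python
import math

# Sketch of cost formula (2). Variable names mirror the paper's symbols
# (f_q, m_pq, W, B, b_pj, L_pj, L_TID, CR); all input values are made up.

def blocks_needed(L_pj, B, L_TID, CR):
    """b_pj = ceil(CR / floor(B / (L_pj + L_TID)))."""
    return math.ceil(CR / (B // (L_pj + L_TID)))

def cost_PT(freqs, m_pq, accessed, B, W):
    """accessed[q] = number of (partition j, block i) pairs with
    X_PTqij = 1 for query q; m_pq[q] = partitions used by query q."""
    return sum(f * (1 + (m - 1) * W) * B * a
               for f, m, a in zip(freqs, m_pq, accessed))

# A 1000-tuple relation: a 60-byte partition plus a 4-byte tuple identifier
# in 256-byte blocks gives 4 tuples per block, hence 250 blocks.
print(blocks_needed(L_pj=60, B=256, L_TID=4, CR=1000))   # 250

# One query of frequency 30 touching 2 partitions and 3 (partition, block)
# pairs, with W = 0.1: 30 * 1.1 * 256 * 3 = 25344 units of data transfer.
print(cost_PT([30], [2], [3], B=256, W=0.1))
```

The penalty factor (1 + (m_pq - 1) W) is 1 for a single-fragment query and grows linearly with each additional fragment accessed, as described above.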

3 Genetic algorithm procedure
The Genetic Algorithm (GA) approach is known as an efficient search method to solve difficult optimization problems, and has been used to solve database design problems (Gorla 2001; Cheng et al. 2002; Du et al. 2006; Gorla and Song 2010).


Alternative solution methods, such as traditional hill climbing (HC), simulated annealing (SA), and the chemotaxis algorithm (CA), present disadvantages (Li and Jiang 2000; Tam 1992). HC results in local optimization; SA produces better solutions than HC but takes more computational time; CA has difficulty getting out of a local optimum and consequently does not produce a good global optimum. GA builds on existing solutions and generates newer and better solutions, thus arriving at a good global optimum quickly.

3.1 An overview of the genetic algorithm
This section provides an overview of the steps involved in a genetic algorithm (Song and Gorla 2000). The genetic algorithm starts with an initial population chosen at random. Each member of a population represents a candidate solution to the problem and is called a chromosome. Each solution (chromosome) is evaluated to yield a performance measure called fitness. The population evolves from one generation to the next through the application of three major types of genetic operators: selection, crossover, and mutation. During a selection operation, members of the population (parents) are selected in pairs to produce new possible solutions (offspring). Parents are selected according to their fitness levels. The crossover operator helps generate offspring that inherit properties from both parents. An example crossover operator for two binary strings (parents) is to cut randomly between any two bits of the strings and create the new strings by interchanging the tails. The offspring are evaluated according to the evaluation function and placed in the next population, possibly replacing weaker members of the last generation. The mutation operator allows further variation in the offspring; in the case of binary strings, the mutation operator simply changes the state of a bit. Mutation explores the search space, i.e., it is capable of escaping from a local search region.
This process of selection-crossover-mutation is repeated from one generation to the next. Goldberg (1989) gives details concerning genetic algorithms and their operators. Thus, any genetic algorithm must have the following components: representation (a genetic representation of a solution to the problem), initialization (a way to create an initial population of solutions), evaluation (a function that evaluates solutions), genetic operators (selection, crossover and mutation), and iteration of the genetic algorithm. The genetic algorithm is shown in Fig. 1.

3.2 Attribute partitioning
The genetic algorithm for attribute partitioning follows a procedure similar to the one outlined in Section 3. Though genetic algorithms generally use binary strings for chromosome representations (Chambers 1995), we use an integer string A of dimension m (m = number of attributes) as our chromosome for attribute partitioning, where the ith element Ai is a number representing the cluster to which the ith attribute is allocated. Binary strings are not suitable to represent attribute clusters because there can be more than one cluster. Prior genetic-algorithm-based studies on attribute partitioning have also used non-binary string representations for chromosomes (Cheng et al. 2002). The other parameters for the algorithm are detailed below: i) The initial population is generated based on a random number between 0 and 1 multiplied by the number of fragments minus one. ii) The evaluation or fitness

Fig. 1 Genetic algorithm for attribute partitioning


function is the total number of disk accesses required by all transactions, as shown in (2). iii) Chromosomes from the current generation are selected as parents with a probability proportional to their fitness, using Roulette Selection (Goldberg 1989). Since the best-fit chromosome is placed at the beginning of the population, the roulette wheel is started from the position following the last selection in order to select the second parent. iv) A new offspring is created from the selected parents by applying a single-point crossover operator. v) After the crossover operator, a mutation operator is applied. vi) The above operations continue until a specified number of generations is reached, or there is no further improvement of the best-fit chromosome over consecutive generations.

3.3 Tuple clustering
The genetic algorithm for tuple clustering is similar to that for attribute partitioning, except for differences in the chromosome representation and the crossover operators. An order-based representation (Goldberg 1989; Chambers 1995) is used as the chromosome for the tuple clustering problem, denoting the tuple order in the file. The other variation is in the crossover operators used in the reproduction phase. Because the solution space for the tuple clustering problem is huge, owing to the large number of tuples and possible tuple orders, the crossover operator is critical to the efficiency and convergence of the genetic algorithm. Therefore, we have tried three crossover operators (Goldberg 1989; Franti et al. 1997): the Partially Matched Crossover (PMX) operator, the Order Crossover (OX) operator, and the Cycle Crossover (CX) operator. These operators are described below.

a) Partially Matched Crossover (PMX) operator
Under PMX, two crossing sites are randomly selected, which define a matching section that is used to effect a crossing through position-by-position exchange


operations; each offspring string will contain ordering information partially determined by each parent. For example,

A = 2 3 | 4 8 7 | 5 1 6
B = 4 3 | 5 6 1 | 2 7 8

First, map string B to string A using the middle-segment numbers: 4 <-> 5, 8 <-> 6, 7 <-> 1; i.e., exchange 4 and 5, 8 and 6, and 7 and 1 in string A. Thus A is converted into the offspring A' = 2 3 | 5 6 1 | 4 7 8. Next, map string A to string B in a similar way; the offspring B' becomes 5 3 | 4 8 7 | 2 1 6.

b) Order Crossover (OX) operator
The OX operator works in a way similar to PMX. Instead of using point-by-point exchanges, OX uses a sliding motion to fill the holes left by transferring the mapped positions. While PMX tends to retain absolute position, OX tends to retain relative position. Using the previous example, to generate an offspring for string A, the elements 5, 6, 1 leave holes (marked by H) in the string:

A = 2 3 | 4 8 7 | H H H

The holes are then moved into the middle with a sliding motion that starts after the second crossing site, and the middle segment is copied from the other parent:

A = 8 7 | H H H | 2 3 4
A' = 8 7 | 5 6 1 | 2 3 4

Performing this operation on string B, we get:

B = H 3 | 5 6 1 | 2 H H
B = 1 2 | H H H | 3 5 6
B' = 1 2 | 4 8 7 | 3 5 6

c) Cycle Crossover (CX) operator
The CX operator performs a recombination under the constraint that each member of the sequence comes from the member of one of the parents in the matching position. Using the previous example, we first choose a member from the first parent: A' = 2 - - - - - - -. Next we take 4 from string B and place it at its position in A, so A' becomes 2 - 4 - - - - -. The 4 in turn selects 5 from string B (A' = 2 - 4 - - 5 - -), and the operation is repeated until we eventually return to the original member. Finally, the remaining positions are filled from the other string:

A' = 2 3 4 6 1 5 7 8
B' = 4 3 5 8 7 2 1 6
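As a sketch (our own Python, not the paper's Java implementation), the three operators can be written compactly; the demonstration values reproduce the paper's worked example, with crossing sites after positions 2 and 5 of the eight-element strings:

```python
# Crossover operators for order-based chromosomes. Cut points c1, c2
# delimit the middle segment parent[c1:c2].

def pmx(parent, other, c1, c2):
    """Partially Matched Crossover: apply the position-wise mapping
    defined by the two middle segments as swaps inside `parent`."""
    child = list(parent)
    for x, y in zip(parent[c1:c2], other[c1:c2]):
        i, j = child.index(x), child.index(y)
        child[i], child[j] = child[j], child[i]
    return child

def ox(parent, other, c1, c2):
    """Order Crossover: the child takes `other`'s middle segment; the
    remaining genes of `parent` keep their relative order and slide into
    the free slots starting after the second crossing site (wrapping)."""
    n = len(parent)
    middle = other[c1:c2]
    rest = [g for g in parent if g not in middle]
    child = [None] * n
    child[c1:c2] = middle
    for k, g in enumerate(rest):
        child[(c2 + k) % n] = g
    return child

def cx(parent, other):
    """Cycle Crossover: positions on the cycle through position 0 keep
    `parent`'s genes; all other positions come from `other`."""
    cycle, pos = set(), 0
    while pos not in cycle:
        cycle.add(pos)
        pos = parent.index(other[pos])
    return [parent[i] if i in cycle else other[i]
            for i in range(len(parent))]

A = [2, 3, 4, 8, 7, 5, 1, 6]
B = [4, 3, 5, 6, 1, 2, 7, 8]
print(pmx(A, B, 2, 5))   # [2, 3, 5, 6, 1, 4, 7, 8]
print(ox(B, A, 2, 5))    # [1, 2, 4, 8, 7, 3, 5, 6]
print(cx(A, B))          # [2, 3, 4, 6, 1, 5, 7, 8]
```

All three outputs match the offspring derived by hand in the worked example above.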

Fig. 2 Genetic algorithm for mixed fragmentation


3.4 Mixed fragmentation
The proposed algorithm for mixed fragmentation (see Fig. 2) starts with a tuple clustering scheme T(0) in which the original tuple order is used. The genetic algorithm for attribute partitioning is applied, and the partitioning scheme is evaluated using the fitness function. After a near optimal attribute partitioning scheme A(1) is obtained, the genetic algorithm for tuple clustering is applied for the given best attribute partitioning scheme A(1). This way, we obtain a near optimal tuple clustering scheme T(1) based on the optimal attribute partitioning scheme A(1). In general, given a tuple clustering scheme T(k-1), the genetic algorithm for attribute partitioning is applied to obtain a near optimal solution A(k). Next, given the current optimal attribute partitioning scheme A(k), the genetic algorithm is applied again to determine a near optimal tuple clustering solution T(k). These two genetic algorithms will be applied alternately until the specified termination criteria are met.
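The alternating scheme of Fig. 2 can be sketched as follows (our own Python; `ga_attribute_partitioning`, `ga_tuple_clustering`, and `cost` are placeholders for the two genetic algorithms and the fitness function (2)):

```python
def mixed_fragmentation(ga_attribute_partitioning, ga_tuple_clustering, cost,
                        initial_tuple_order, max_cycles=30, stall_cycles=5):
    """Alternate the two GAs until max_cycles (cf. Ym) is reached or the
    best cost has not improved for stall_cycles (cf. Yc) cycles."""
    T = initial_tuple_order                  # T(0): original tuple order
    A, best, stalled = None, float("inf"), 0
    for _ in range(max_cycles):
        A = ga_attribute_partitioning(T)     # A(k) given T(k-1)
        T = ga_tuple_clustering(A)           # T(k) given A(k)
        c = cost(A, T)
        if c < best:
            best, stalled = c, 0
        else:
            stalled += 1
            if stalled >= stall_cycles:      # no improvement: terminate
                break
    return A, T, best
```

The two termination thresholds play the roles of the maximum-cycles and consecutive-cycles parameters (Ym and Yc) listed later in Table 3.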

4 Procedure illustration
The input to our genetic algorithms is a transaction profile, which includes information about the attribute usage pattern, the tuple usage pattern, and the length (in bytes) of each attribute. This transaction profile can be presented using the Attribute Usage Matrix (AUM) (Table 1) and the Tuple Usage Matrix (TUM) (Table 2). The AUM in Table 1 shows the attributes accessed by each transaction; for example, transaction 1 accesses attributes 1, 2, 3, and 9 with a frequency of 30. The AUM for this procedure has been drawn from Song and Gorla (2000). The TUM in Table 2

Table 1 Attribute usage matrix

Attribute   T1   T2   T3   T4   T5   Attr. length
1           30   20    0    0    0    7
2           30    0    0    0   50    7
3           30   20   50   20   50   10
4            0   20    0   20   50   25
5            0   20    0    0    0    5
6            0    0   50    0   50    4
7            0    0   50    0   50    8
8            0    0   50    0    0   10
9           30    0    0   20    0    6
10           0    0   50    0    0    6

Table 2 Tuple usage matrix

Transaction   Tuple: 1  2  3  4  5  6  7  8  9  10
1                    0  1  1  1  1  1  0  1  1  0
2                    0  0  0  0  0  0  1  0  0  0
3                    1  0  0  0  0  0  0  0  0  0
4                    0  0  0  1  0  0  0  0  0  0
5                    0  0  0  1  1  0  0  0  0  1


shows the set of tuples accessed by each transaction; for example, transaction 5 accesses tuples 4, 5, and 10. The above transaction profile (Tables 1 and 2) indicates that there are 5 transactions accessing a relation with 10 attributes and 10 tuples. The length of each attribute in bytes is presented in the last column of the AUM (Table 1). The other parameters for the formulae and the genetic algorithm are summarized in Table 3. The PMX operator is chosen for illustration. The page block size has been intentionally set larger than the tuple length plus the tuple identifier length, so that in at least one partition each block will contain at least two tuples of the partitioned relation; hence, the benefit of tuple clustering can be observed even on such a small relation. Moreover, we assume that all required tuples are retrieved through an index scan, although in practice such a small relation would be retrieved through a sequential scan. For attribute partitioning, we choose the smallest number of fragments (two) to keep the calculation simple for illustration.
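For reference, the transaction profile of Tables 1 and 2 can be transcribed directly (our own Python encoding; the helper names are ours, not the paper's):

```python
# AUM: attribute -> (access frequency per transaction 1-5, length in bytes).
AUM = {
    1: ([30, 20, 0, 0, 0], 7),    2: ([30, 0, 0, 0, 50], 7),
    3: ([30, 20, 50, 20, 50], 10), 4: ([0, 20, 0, 20, 50], 25),
    5: ([0, 20, 0, 0, 0], 5),     6: ([0, 0, 50, 0, 50], 4),
    7: ([0, 0, 50, 0, 50], 8),    8: ([0, 0, 50, 0, 0], 10),
    9: ([30, 0, 0, 20, 0], 6),    10: ([0, 0, 50, 0, 0], 6),
}

# TUM: one row per transaction 1-5, one 0/1 entry per tuple 1-10.
TUM = [
    [0, 1, 1, 1, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
]

def attributes_used(txn):
    """Attributes accessed by transaction txn (1-based)."""
    return [a for a, (freqs, _) in sorted(AUM.items()) if freqs[txn - 1] > 0]

def tuples_used(txn):
    """Tuples accessed by transaction txn (1-based)."""
    return [t + 1 for t, used in enumerate(TUM[txn - 1]) if used]

print(attributes_used(1))   # transaction 1 accesses attributes 1, 2, 3, 9
print(tuples_used(5))       # transaction 5 accesses tuples 4, 5, 10
```

The two printed examples match the accesses described in the text above.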

Table 3 Parameters used in genetic algorithm

Parameter                              Value
Length of tuple identifier (LTID)      4 bytes
Page block size (B)                    256 bytes
Penalty cost factor (W)                0.1
Maximum cycles (Ym)                    30
Consecutive cycles (Yc)                5
Attribute partitioning:
  Number of partitions                 2
  Population (NA)                      10
  Mutation rate                        10%
  Maximum generations (GAm)            100
  Consecutive generations (GAc)        20
Tuple clustering:
  Population (NT)                      10
  Crossover operator                   PMX
  Mutation rate                        30%
  Swap %                               30%
  Maximum generations (GTm)            100
  Consecutive generations (GTc)        20

Table 4 Initial population for attribute partitioning

Chromosome#   Chromosome                      Total cost
1             1, 1, 1, 1, 1, 1, 1, 1, 0, 1    462
2             0, 1, 1, 0, 1, 0, 0, 0, 0, 0    550
3             1, 1, 1, 0, 0, 1, 0, 0, 0, 0    638
4             0, 0, 1, 0, 0, 1, 0, 1, 1, 0    638
5             1, 1, 0, 0, 0, 1, 0, 1, 1, 1    638
6             0, 0, 1, 1, 0, 1, 1, 0, 1, 1    614
7             0, 1, 1, 0, 0, 0, 1, 1, 0, 0    638
8*            0, 1, 0, 0, 1, 0, 0, 0, 1, 0    435
9             0, 1, 1, 1, 0, 0, 0, 1, 0, 1    638
10            1, 0, 1, 0, 0, 0, 0, 0, 1, 0    550

4.1 Attribute partitioning

4.1.1 Initial population
The initial population is filled with randomly generated chromosomes. Table 4 shows the initial population with the total cost and fitness of each chromosome. Chromosome #8 turns out to be the best solution in the initial population (Table 4).

4.1.2 Crossover and mutation
After the initial population is formed, the genetic algorithm selects 2 chromosomes to mate using the Roulette Selection method. A description of the procedure for generating a new offspring is given below. For illustration purposes only, assume that chromosome #7 and chromosome #8 are selected for mating with a crossover site at 7:

Chromosome #7: 0, 1, 1, 0, 0, 0, 1, 1, 0, 0
Chromosome #8: 0, 1, 0, 0, 1, 0, 0, 0, 1, 0

The new offspring becomes: 0, 1, 1, 0, 0, 0, 1, 0, 1, 0. Mutation is just a flip of a gene at a certain position. If the third gene is selected for mutation, then it becomes: 0, 1, 0, 0, 0, 0, 1, 0, 1, 0.
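The two steps of this illustration can be sketched as follows (our own Python; the function names are ours):

```python
# Single-point crossover and bit-flip mutation on binary chromosomes,
# reproducing the illustration: crossover site 7, mutation of the third gene.

def single_point_crossover(p1, p2, site):
    """Take the first `site` genes from p1 and the rest from p2."""
    return p1[:site] + p2[site:]

def flip_mutation(chrom, position):
    """Flip the gene at the given 1-based position."""
    child = list(chrom)
    child[position - 1] = 1 - child[position - 1]
    return child

c7 = [0, 1, 1, 0, 0, 0, 1, 1, 0, 0]   # chromosome #7
c8 = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]   # chromosome #8
offspring = single_point_crossover(c7, c8, 7)
print(offspring)                      # [0, 1, 1, 0, 0, 0, 1, 0, 1, 0]
mutated = flip_mutation(offspring, 3)
print(mutated)                        # [0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
```

Both printed strings match the offspring and mutated chromosome given in the text.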

Table 5 Final population for the first iteration of attribute partitioning

Chromosome#   Chromosome                      Total cost
1             0, 1, 1, 1, 0, 1, 1, 1, 1, 1    346
2             0, 1, 1, 1, 0, 1, 1, 1, 0, 1    370
3             0, 1, 1, 1, 0, 1, 1, 1, 0, 0    430
4*            0, 1, 1, 1, 0, 1, 1, 1, 1, 1    346
5             0, 1, 1, 1, 0, 1, 1, 1, 1, 1    346
6             0, 1, 1, 1, 0, 1, 1, 1, 1, 1    346
7             0, 1, 1, 1, 0, 1, 1, 1, 0, 1    370
8             0, 0, 1, 1, 0, 1, 1, 1, 0, 0    638
9             0, 1, 1, 1, 0, 1, 1, 1, 0, 0    430
10            0, 1, 1, 1, 0, 1, 1, 1, 1, 1    346

Table 6 Initial population for tuple clustering

Chromosome#   Chromosome                      Total cost
1             0, 1, 2, 3, 4, 5, 6, 7, 8, 9    346
2             2, 1, 9, 4, 7, 3, 0, 8, 5, 6    346
3             1, 3, 9, 0, 5, 7, 8, 2, 4, 6    346
4             1, 2, 9, 7, 3, 4, 8, 5, 0, 6    346
5             0, 4, 1, 5, 2, 6, 9, 3, 7, 8    379
6             3, 5, 1, 4, 8, 6, 2, 0, 9, 7    429
7             4, 2, 5, 7, 1, 3, 0, 8, 9, 6    396
8             4, 5, 6, 8, 7, 2, 0, 1, 3, 9    396
9             3, 9, 4, 5, 1, 6, 2, 8, 0, 7    329
10            4, 9, 5, 8, 3, 2, 1, 0, 7, 6    346

4.1.3 Final population
The attribute partitioning genetic algorithm stops at the 24th generation, when there has been no further reduction in cost over 20 consecutive generations. The solution obtained is {0, 1, 1, 1, 0, 1, 1, 1, 1, 1} with a cost of 346, which is very close to the optimal cost of 340 obtained by exhaustive enumeration (without tuple clustering consideration). The final population for the first cycle of attribute partitioning is shown in Table 5.

4.2 Tuple clustering
The initial population is filled with randomly generated chromosomes (Table 6). The tuple clustering genetic algorithm stops at the 21st generation, when there has been no further reduction in cost over 20 consecutive generations. The solution obtained is {3, 4, 9, 7, 8, 5, 2, 6, 1, 0} with a cost of 296 (Table 7).

4.3 Mixed fragmentation
The two applications of the genetic algorithm detailed above are repeated until there is no reduction in cost over 5 consecutive cycles. The final solution is:

Attribute partitioning scheme: {0, 0, 0, 0, 0, 0, 0, 1, 0, 1}
Tuple clustering scheme: {2, 0, 1, 4, 9, 3, 7, 8, 5, 6}
Cost: 290

Table 7 Second population for tuple clustering

Chromosome#   Chromosome                      Total cost
1             3, 9, 4, 5, 1, 6, 2, 8, 0, 7    329
2             4, 8, 1, 5, 2, 6, 9, 3, 7, 0    346
3             3, 9, 4, 0, 1, 7, 2, 8, 5, 6    296
4             6, 5, 9, 4, 7, 3, 0, 2, 1, 8    379
5             4, 3, 9, 0, 5, 7, 8, 2, 1, 6    296
6             0, 1, 2, 3, 4, 5, 6, 7, 9, 8    378
7             4, 9, 5, 7, 1, 3, 0, 2, 8, 6    346
8             3, 5, 6, 8, 7, 1, 2, 0, 9, 4    429
9             4, 5, 8, 9, 1, 2, 0, 3, 7, 6    396
10            5, 9, 2, 3, 4, 8, 1, 0, 7, 6    346


5 Performance evaluation
We present experimental results of the genetic algorithm approach to the mixed fragmentation problem using the transaction profile of the case study in Cornell and Yu (1990). Validating the attribute partitioning and tuple clustering schemes against optimal solutions is difficult because of computational complexity: exhaustive enumeration produces all possible partitions of a set of attributes for all possible tuple orderings (Gorla and Quinn 1991), and a comparison with exhaustive enumeration becomes infeasible when the number of attributes exceeds 10. Hence, we test the performance and convergence of the genetic algorithms by comparing them with the attribute-partitioning-only solution and with a random search solution. In random search, candidate solutions are generated randomly; the attribute partitioning and tuple clustering steps are repeated in the same way, the only difference being that solutions are drawn at random rather than evolved by a genetic algorithm. To make a reasonable comparison, the number of solutions generated is about the same as the number of chromosomes tested in the genetic algorithm. We estimate this total number as:

(N_A × G_Ac + N_T × G_Tc) × Y_c × 2

where
N_A    population for attribute partitioning
G_Ac   consecutive generations allowed for attribute partitioning
N_T    population for tuple clustering
G_Tc   consecutive generations allowed for tuple clustering
Y_c    consecutive cycles allowed

The programs for our partitioning algorithms are written in Java and can be run in any browser supporting Java 1.1.4 or above. All experiments were run on a Celeron 433 PC with 128 MB of memory.

5.1 Database and transaction profiles
For the case in question, the relation has 10 attributes and 6 types of transactions accessing it (Table 8). The original cardinality is 100,000. However, in order to reduce computational complexity, we assume a blocking factor of 100, which results in 1000 blocks to be clustered; the problem then reduces to clustering 1000 blocks rather than 100,000 tuples. The first five transaction types in the AUM (Table 8) are simple select queries with one restrict attribute that is also the scan attribute. The selectivity of the scan attribute in each transaction is specified in the selectivity row. The type 6 transaction represents a join with another relation.

5.2 Performance comparison
Table 9 shows the database access costs with mixed fragmentation using the genetic algorithm procedure and its comparison with the attribute-partitioning-only solution (without tuple clustering) and the random search method. The cost reduction with mixed fragmentation compared to attribute partitioning only is: (cost of attribute partitioning only - cost of mixed fragmentation) / cost of attribute partitioning

Table 8 Attribute usage matrix for case study

Attribute   T1  T2  T3  T4  T5  T6   Length      Attribute   T1  T2  T3  T4  T5  T6   Length
1            3   5   0   0   2   0        8      11           0   0   0   2   0   0        4
2            0   5   0   0   0   0        8      12           0   0   0   2   0   0        8
3            0   5   0   0   0   0        8      13           0   0   0   2   0   0        6
4            3   0   0   0   0   0        8      14           0   0   0   2   0   0        5
5            0   5   0   0   0   0        4      15           0   0   0   2   0   0        3
6            3   0   0   0   0   0        8      16           0   0   0   0   0  50       30
7            3   0   0   2   0   0        8      17           0   0   0   0   0  50       12
8            0   0  10   0   0   0       12      18           0   0   0   0   0  50        8
9            0   0  10   0   0   0       20      19           0   0   0   0   0  50        6
10           0   0  10   0   0   0       22      20           0   0   0   0   2  50        6

Selectivity: 0.01  0.01  0.01  0.01  0.01  0.01

only. The cost reduction percentage compared to the random search method is computed as: (cost of mixed fragmentation with random search - cost of mixed fragmentation with genetic algorithm) / cost of mixed fragmentation with random search. With a two-fragment solution, the optimal cost for the attribute-partitioning-only method is 541.8, while the mixed fragmentation costs are 210.4 (using the genetic algorithm) and 375.2 (using the random search method). The three methods produced solutions that are quite different, as can be seen from the bit strings. The best solution with the attribute-partitioning-only method using GA is obtained when the number of fragments is 3; the corresponding attribute partitioning solution is {[15,16,17,18,19,20], [8,9,10], [1,2,3,4,5,6,7,11,12,13,14]}, with a cost of 506.6. The best solution for random search is obtained when the number of fragments is 2, with a minimum cost of 375.2. The best solution with mixed fragmentation is obtained when the number of fragments is 5; the corresponding attribute partitioning solution is {[16,17,18,19,20], [1,2,3,4,5,6,7,11,13], [12,15], [14], [8,9,10]}, with a cost of 154.2. Thus, considering attribute partitioning and tuple/block clustering simultaneously produced different attribute partitioning schemes with much lower costs than the attribute-partitioning-only method. The cost savings of mixed fragmentation over the attribute-partitioning-only solution ranged from 51% to 70% across all partitioning schemes. We observe that the attribute-partitioning-only method by itself achieved a cost reduction of 24% to 28% over the unpartitioned solution (cost = 715). By also applying tuple clustering, we obtained cost savings ranging from 66% to 78% over the unpartitioned and unclustered solution.
Our results are consistent with those of Gorla and Quinn (1991): the mixed fragmentation solution outperformed the attribute-only solution in all partitioning cases examined. Our mixed fragmentation solutions with GA also outperformed the random search solutions, with reductions ranging from 26% to 67%. The best solution among all partitioning schemes is the one with 5 fragments. As the number of fragments increases beyond 5, the additional overhead cost of accessing a tuple across different fragments outweighs the cost reduction owing to the removal of irrelevant attributes in fragments.
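The two reduction measures defined above can be checked directly (our own Python; shown here for the five-fragment comparison against attribute partitioning only and the two-fragment comparison against random search):

```python
# Cost reduction percentage: how much cheaper `improved` is than `baseline`.
def reduction_pct(baseline, improved):
    return 100.0 * (baseline - improved) / baseline

# Five fragments: attribute partitioning only (511.6) vs. mixed
# fragmentation with GA (154.2).
print(round(reduction_pct(511.6, 154.2), 1))   # 69.9

# Two fragments: random search (375.2) vs. GA (210.4) mixed fragmentation.
print(round(reduction_pct(375.2, 210.4), 1))   # 43.9
```

These figures illustrate the roughly 70% best-case saving over attribute-only partitioning cited above.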

Table 9 Performance comparison of mixed fragmentation (the 20-digit strings assign each of the 20 attributes to a fragment)

| Fragments | GA (attribute partitioning only): solution; cost; avg cost; avg # tested | GA (mixed fragmentation): solution; cost; avg cost; avg # tested | Random (mixed fragmentation): solution; cost; avg cost; avg # tested | %Reduction over attribute partition (best / avg) | %Reduction over random (best / avg) |
|---|---|---|---|---|---|
| 1 | unpartitioned; 715; -; - | 00000111111111100000; 242.2; 332.14; 277710 | -; 446; 446.5; 150000 | 66.2 / - | 45.7 / 25.6 |
| 2 | 11111111111111100000; 541.8; 597.22; 37372 | 33313312221110133333; 210.4; 255.65; 194721 | 11111111110111000000; 375.2; 397.6; 150000 | 61.3 / 57.1 | 43.9 / 35.7 |
| 3 | 22222221112222000000; 506.6; 593.46; 35134 | 11111113330010122222; 195.3; 234.47; 233961 | 11111001111201122222; 397.2; 433.96; 150000 | 61.5 / 60.5 | 50.8 / 45.9 |
| 4 | 00000001112000333333; 511.8; 553.88; 36648 | 11111114441213200000; 190.8; 259.096; 341940 | 33023331112232211111; 420.6; 466.03; 150000 | 62.7 / 53.2 | 54.6 / 44.4 |
| 5 | 44434330002222111111; 511.6; 553.92; 36096 | 31121325554444400000; 154.2; 228.47; 207726 | 03322101113242244444; 464.8; 501.76; 150000 | 69.9 / 58.8 | 66.8 / 54.5 |
| 6 | 05515112223333444444; 511.9; 545.26; 38868 | 20040446665555133333; 191.6; 239.83; 248694 | 11143120441213255555; 454.6; 528.61; 150000 | 62.5 / 56.0 | 57.9 / 54.6 |
| 7 | 30040452226666111111; 513.1; 562.85; 47148 | 51161760002444633333; 206.4; 247.45; 238128 | 21131446660530411111; 425.7; 529.22; 150000 | 59.8 / 56.1 | 51.5 / 53.2 |
| 8 | 75545412223336000000; 521.1; 552.39; 45108 | -; 215.1; 270.35; 230310 | 16106123557471511111; 499.5; 574.86; 150000 | 58.7 / 51.1 | 56.9 / 52.9 |

Table 10 Performance comparison of crossover operators

| Fragments | PMX cost | PMX time | OX cost | OX time | CX cost | CX time |
|---|---|---|---|---|---|---|
| 1 | 362.6 | 33.49 s | 336.6 | 33.5 s | 307 | 47.18 s |
| 2 | 318.34 | 37.74 s | 326.82 | 39.4 s | 286.7 | 41.4 s |
| 3 | 309.56 | 38.8 s | 347.91 | 48.57 s | 287.6 | 55.2 s |
| 4 | 307.2 | 43.1 s | 340.68 | 44.7 s | 258.6 | 54.3 s |
| 5 | 314.68 | 43.3 s | 334.86 | 46.3 s | 304.62 | 47.4 s |

We also note that the optimal solution of the attribute partitioning only method (number of fragments = 3) obtained by our algorithm is the same as that reported by Cornell and Yu, with a cost of 506.6. By applying tuple/block clustering in addition, we obtained a different optimal solution with a lower cost, representing a cost saving of 69% over the solution of Cornell and Yu. In the original case, the cardinality is 100,000. To see whether the attribute partitioning result would differ when the cardinality increases dramatically, we randomly generated a tuple usage pattern of 100,000 tuples using the same method as before and applied the attribute partitioning genetic algorithm, keeping the other parameters the same. As expected, the result was the same as that obtained for 1,000 tuples.

6 Analysis of genetic parameters

We evaluated the performance of the genetic algorithm under different parameter settings using the transaction profile of the above case study. The consecutive-generation and maximum-generation limits for tuple clustering are set at 50 and 500, respectively. The consecutive-cycle limit is set at 5.

6.1 Effect of crossover operator

The cost with each crossover operator, and the time required to execute, as a function of the number of fragments, are given in Table 10. The results are averaged over five test runs. The CX operator outperformed the other operators for all partitioning schemes, because it produces the greatest genetic diversity between generations while retaining a sufficient amount of genetic variation. However, the CX operator is more computationally expensive and required more time than the other two operators. The performance of the OX operator is slightly better than that of the PMX operator, but the time required is slightly longer. These results demonstrate the following property of a genetic algorithm for mixed fragmentation: a successful implementation should direct the search efficiently while retaining enough genetic variation in the population. The PMX and OX operators reduced the genetic variation, so the algorithm converged too quickly.

6.2 Effect of mutation rate

Mutation is observed to have only a small effect on performance. We varied the mutation rate for attribute partitioning from 5% to 60% using the CX crossover operator, with the other parameters the same as before (Table 11). The algorithm produced four fragments as optimal. The results indicate no significant change in cost or time required as the mutation rate varies from 5% to 60%. The results for the mutation rate in tuple clustering are similar to those for attribute partitioning (Table 12).

Table 11 Performance varying mutation rate for attribute partitioning

| Mutation rate | Cost | Time required |
|---|---|---|
| 5% | 292.96 | 67.4 s |
| 10% | 275.14 | 60.5 s |
| 20% | 288.18 | 61.3 s |
| 30% | 272.96 | 65.8 s |
| 40% | 304.32 | 63.27 s |
| 50% | 297.48 | 57.3 s |
| 60% | 298.44 | 61.61 s |

Table 12 Performance varying mutation rate for tuple clustering

| Mutation rate | Cost | Time required |
|---|---|---|
| 5% | 294.28 | 52.89 s |
| 10% | 306.26 | 58.51 s |
| 20% | 292.16 | 62.4 s |
| 30% | 275.14 | 60.5 s |
| 40% | 309.46 | 61.7 s |
| 50% | 329.26 | 58.14 s |
| 60% | 308.64 | 56.37 s |
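Of the three crossover operators compared, CX is the least commonly implemented. The following is a minimal sketch of cycle crossover on a permutation-encoded chromosome, using the alternating-cycles variant; it is illustrative only (the chromosome encoding used in our experiments differs), not the experimental code:

```python
def cycle_crossover(p1, p2):
    # Cycle crossover (CX): partition the positions into cycles, then copy
    # whole cycles alternately from each parent, so that every gene ends up
    # in a position it occupied in one of the two parents.
    n = len(p1)
    pos_in_p1 = {gene: i for i, gene in enumerate(p1)}
    child1, child2 = [None] * n, [None] * n
    visited = [False] * n
    cycle = 0
    for start in range(n):
        if visited[start]:
            continue
        i = start
        while not visited[i]:
            visited[i] = True
            if cycle % 2 == 0:               # even cycles: keep parent order
                child1[i], child2[i] = p1[i], p2[i]
            else:                            # odd cycles: swap parents
                child1[i], child2[i] = p2[i], p1[i]
            i = pos_in_p1[p2[i]]             # follow the cycle
        cycle += 1
    return child1, child2

p1 = [1, 2, 3, 4, 5, 6, 7, 8]
p2 = [8, 5, 2, 1, 3, 6, 4, 7]
c1, c2 = cycle_crossover(p1, p2)
print(c1)  # -> [1, 5, 2, 4, 3, 6, 7, 8]
```

Because genes never leave the set of positions they held in the parents, both children are guaranteed to remain valid permutations, which is why CX preserves more structure per generation than random repair-based operators.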

7 Future research

We proposed a novel approach to the mixed fragmentation problem, applying a genetic algorithm iteratively to the attribute partitioning and tuple clustering sub-problems. We compared our results to attribute-only partitioning, a random search solution, and previous research, and showed significant improvement. We also performed a sensitivity analysis of the effects of crossover operators and mutation rate on database performance. Compared to the attribute partitioning only method (as in previous research), the mixed fragmentation design method produced database cost savings of up to 69%. It should be noted that in the experiment we used a block size of 100 tuples in order to limit the number of blocks for tuple/block clustering. This was necessary to keep the chromosome length reasonable so that the genetic algorithm could be applied successfully, since the number of tuples can be very large. However, in the case of federated database systems, horizontal partitioning is already done and disk accesses are reduced by clustering tuples into small partitions that are placed on disk blocks. Thus, in federated databases, there would be no need to re-cluster these partitions into blocks, and the chromosome size would remain manageable.1
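As a quick back-of-the-envelope illustration of the chromosome-length argument (cardinality and block size taken from the case study; the one-gene-per-block mapping is how block clustering is encoded):

```python
# Why block size bounds the chromosome length in tuple/block clustering.
num_tuples = 100_000   # relation cardinality in the case study
block_size = 100       # tuples per block, as used in the experiments

# With one gene per block instead of one gene per tuple, the clustering
# chromosome shrinks from 100,000 genes to 1,000.
num_blocks = num_tuples // block_size
print(num_blocks)  # -> 1000
```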

1 The authors are thankful to the reviewer who provided this feedback.

The present research can be extended in the following ways. First, by integrating access methods (index selection, clustered and unclustered indexes) with the present method, the combined problem of mixed fragmentation and access-method selection can be solved using a genetic algorithm. Second, by considering the access parameters of disk storage (Gorla and Boe 1990), a more accurate cost function can be obtained, which can lead to a better mixed fragmentation solution. Third, the present research can be extended to a multi-relation environment, wherein the impact of transactions enforcing referential integrity can be considered and the optimal mixed fragmentation scheme derived.

References
Ailamaki, A., DeWitt, D. J., Hill, M. D., & Skounakis, M. (2001). Weaving relations for cache performance. In Proceedings of the 27th VLDB conference.
Baiao, F., Mattoso, M., & Zaverucha, G. (2004). A distribution design methodology for object DBMS. Journal of Distributed and Parallel Databases, 16(6), 45-90.
Ceri, S., Navathe, S., & Wiederhold, G. (1983). Distribution design of logical database schemas. IEEE Transactions on Software Engineering, 9(4), 487-504.
Chambers, L. (1995). Practical handbook of genetic algorithms (Vol. 1). CRC Press.
Cheng, C. H., Lee, W. K., & Wong, K. F. (2002). A genetic algorithm-based clustering approach for database partitioning. IEEE Transactions on Systems, Man, and Cybernetics, 32(3), 215-230.
Chu, W. W., & Ieong, I. T. (1993). A transaction-based approach to vertical partitioning for relational database systems. IEEE Transactions on Software Engineering, 19(8), 804-812.
Cornell, D. W., & Yu, P. S. (1990). An effective approach to vertical partitioning for physical design of relational databases. IEEE Transactions on Software Engineering, 16(2), 248-258.
Du, J., Alhajj, R., & Barker, K. (2006). Genetic algorithms based approach to database vertical partitioning. Journal of Intelligent Information Systems, 26(2), 167-183.
Eisner, M. J., & Severance, D. G. (1976). Mathematical techniques for efficient record segmentation in large shared databases. Journal of the ACM, 23(4), 619-635.
Ezeife, C. I. (2001). Selecting and materializing horizontally partitioned warehouse views. Data and Knowledge Engineering, 36, 185-210.
Franti, P., Kivijarvi, J., Kaukoranta, T., & Nevalainen, O. (1997). Genetic algorithms for large-scale clustering problems. The Computer Journal, 40(9), 547-554.
Fung, C. W., Karlapalem, K., & Li, Q. (2002). An evaluation of vertical class partitioning for query processing in object-oriented databases. IEEE Transactions on Knowledge and Data Engineering, 14(5), 1095-1118.
Furtado, C., Lima, A. A. B., Pacitti, E., Valduriez, P., & Mattoso, M. (2005). Physical and virtual partitioning in OLAP database cluster. In 17th international symposium on computer architecture and high performance computing (pp. 143-150).
Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Addison-Wesley.
Gorla, N. (2001). An object-oriented database design for improved performance. Data and Knowledge Engineering, 37(2), 117-138.
Gorla, N. (2007). A methodology for vertically partitioning in a multi-relation database environment. Journal of Computer Science & Technology, 7(3), 217-227.
Gorla, N., & Betty, P. W. Y. (2008). Vertical fragmentation in databases using data-mining technique. International Journal of Data Warehousing & Mining, 4(3), 35-53.
Gorla, N., & Boe, W. (1990). Database performance of fragmented databases in mainframe, mini, and micro computer systems. Data and Knowledge Engineering, 5(1), 1-19.
Gorla, N., & Quinn, W. (1991). Combined optimal tuple ordering and attribute partitioning in storage schema design. Information and Software Technology, 33(5), 335-339.
Gorla, N., & Song, S. K. (2010). Subquery allocations in distributed databases using genetic algorithms. Journal of Computer Science & Technology, 10(1), 31-37.
Hoffer, J. A., & Severance, D. G. (1975). The use of cluster analysis in physical design. In Proceedings of the first international conference on very large data bases.
Knuth, D. (1973). Sorting and searching. The art of computer programming (Vol. 3). Addison-Wesley.
Li, B., & Jiang, W. (2000). A novel stochastic optimization algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 30(1), 193-198.
Mahboubi, H., & Darmont, J. (2008). Data mining-based fragmentation of XML data warehouses. In DOLAP '08.
March, S. T., & Rho, S. (1995). Allocating data and operations to nodes in distributed database design. IEEE Transactions on Knowledge and Data Engineering, 7(2), 305-317.
Navathe, S., Ceri, S., Wiederhold, G., & Dou, J. (1984). Vertical partitioning algorithms for database design. ACM Transactions on Database Systems, 9(4), 680-710.
Ng, V., Law, D. M., Gorla, N., & Chan, C. K. (2003). Applying genetic algorithms in database partitioning. In 2003 ACM symposium on applied computing (SAC).
Niamir, B. (1978). Attribute partitioning in a self-adaptive relational database system. PhD dissertation, MIT Laboratory for Computer Science, Jan 1978.
Ozsu, M., & Valduriez, P. (1996). Principles of distributed database systems. Prentice Hall.
Ramamurthy, R., DeWitt, D. J., & Su, Q. (2002). A case for fractured mirrors. In Proceedings of the 28th VLDB conference.
Rivest, R. (1976). On self-organizing sequential search heuristics. Communications of the ACM, 19(2), 63-67.
Serban, G., & Campan, A. (2008). Hierarchical adaptive clustering. Informatica, 19(1), 101-112.
Song, S. K., & Gorla, N. (2000). A genetic algorithm for vertical fragmentation and access path selection. The Computer Journal, 45(1), 81-93.
Tam, K. Y. (1992). Genetic algorithms, function optimization, and facility layout design. European Journal of Operational Research, 63(2), 322-346.
Tamhanker, A. J., & Ram, S. (1998). Database fragmentation and allocation: An integrated methodology and case study. IEEE Transactions on Systems, Man, and Cybernetics, 28(3), 288-305.
