www.elsevier.com/locate/infsof
Received 12 March 2005; received in revised form 4 November 2005; accepted 4 December 2005
Available online 3 March 2006
Abstract
OLAP is a category of database technology that allows analysts to gain insight into the aggregation of data by enabling them to gain access to a
variety of different views of the information contained in a database. It is very important to provide analysts with guaranteed error bounds for
approximate results to aggregation queries in enterprise applications such as decision support systems. We propose a general method of providing
tight error bounds for approximate results to OLAP range-sum queries. We perform an extensive experiment on diverse data sets and examine the
effectiveness of the proposed method for various data cube dimensions and query sizes.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Information system; Database; OLAP; Online aggregation; Decision support; Approximate query answering
guaranteed error bounds and improves the quality of the answer, until some constraint (time and/or error bound) is reached. The MRA-tree produces a better answer to the query in a shorter period of time than the pCube. However, the bounding technique used in this approach is the same as that used in the pCube. Furthermore, in applications where frequent updates are commonplace, this approach incurs a high update cost. Recently, we proposed the D-tree [3] to manage updates efficiently in the dynamic OLAP environment. The basic precept of this approach is that changes in the data cube are stored in the D-tree and managed separately from the data cube. This drastically reduces the update cost at run-time. In addition, by taking advantage of the hierarchical structure of the D-tree, we proposed a hybrid method of providing either an approximate result or a precise one, in such a way as to reduce the overall cost of a query. However, we did not discuss the technique that is used to provide error bounds for the approximate results obtained in the hybrid method.

In this paper, we present a general method of providing tight error bounds for approximate results to OLAP range-sum queries. The proposed algorithm is very effective and directly applicable to various approximation techniques that use a tree structure such as the pCube and MRA-tree. We conduct an extensive experiment on diverse data sets, and examine the effectiveness of the proposed method for various data cube dimensions and query sizes. The experimental results show that the proposed method provides tighter error bounds than the MRA-tree.

2. Tree-based index structure

The aggregation functions that are used for range queries are SUM, AVERAGE, COUNT, MIN, MAX, etc. Among these functions, the SUM function is the most popular and important. So, in this paper, we concentrate on range-sum queries.

2.1. The MRA-tree

The multi-resolution aggregate (MRA) tree is a modified multi-dimensional index structure which stores data points of the form ⟨loc, values⟩, where loc ∈ Rspace and values ∈ Dv. Non-leaf nodes contain entries of the form ⟨ptr, region, aggregates⟩ for each of their child nodes, where ptr is a pointer to a child node, region is the space covered by that node and aggregates is a tuple ⟨SUM⟩ of aggregate information over all data points covered by that node. A leaf node contains the data points covered by the region of the parent node. The authors of this approach provide a progressive algorithm for approximate aggregate queries. The algorithm maintains two sets of tree nodes $N_c$ and $N_p$, where $N_c$ is a set of nodes completely contained in the query and $N_p$ is a set of nodes that either enclose or partially overlap the query. It is assumed that we find $sum_{Q,X}$ for the query region $R_Q$, that is, the sum of the values of the attribute, X, for all points contained in $R_Q$. $\widehat{sum}_{Q,X}$ is the estimate for query Q of aggregate type SUM over attribute X (value stored at each data point). The approximate answer $\widehat{sum}_{Q,X}$ is defined as follows:

$$\widehat{sum}_{Q,X} = sum_{c,X} + \sum_{N \in N_p} P_N \times sum_{N,X}$$

where $sum_{c,X}$ is the sum of the aggregate values of the nodes in $N_c$ and $P_N$ is the percentage of overlap of node N with query Q. Since all of the nodes in $N_c$ are contained in the query, $sum_{c,X}$ has an exact value. Therefore, it is clear that the lower and upper bounds of $\widehat{sum}_{Q,X}$ are $sum_{c,X}$ and $sum_{c,X} + sum_{p,X}$, respectively, where $sum_{p,X}$ is, by analogy, the sum of the aggregate values of the nodes in $N_p$. However, since the MRA-tree provides the user with only an approximate answer with loose error bounds, we cannot expect him or her to use it as the basis for critical decisions. Moreover, the MRA-tree has a prohibitive update cost in applications in which frequent updates are commonplace and run concurrently with queries [7]. The value update in the MRA-tree consists of changing the value of data point $P = \langle loc, val \rangle$ to $P' = \langle loc, val' \rangle$. The update operation consists of two steps, that is, the deletion of the previous data point, $P$, and the insertion of the new data point $P'$. The problem is that the cost of searching the MRA-tree in order to delete $P$ is very high.

2.2. The D-tree

In this section, we introduce an approximation technique using the D-tree. The D-tree is a modified version of the R*-tree [2], which is designed to store the updated values of a data cube and to support efficient query processing. The process of constructing the D-tree is the same as that of the R*-tree. Initially, the D-tree has only a directory node (called the root node). Whenever a data cube cell is updated, the difference ($\Delta$) between the new and old values of the data cube cell and its spatial position are stored in the D-tree. We define the D-tree formally as follows:

Definition 2.1. (the D-tree)

1. Each directory node contains $(L_1, L_2, \ldots, L_n)$, where $L_i$ is the tuple pertaining to the ith child node, $C_i$, and has the form $(\sum\Delta, M, cp_i, MBR_i)$. $\sum\Delta$ is the sum of the $\sum\Delta$ values of the child nodes ($\Delta$ values) of $C_i$, where $C_i$ is a directory node (data node). $cp_i$ is the address of $C_i$ and $MBR_i$ is the minimum bounding rectangle (MBR) enclosing all entries in $C_i$. $M$ has the form $(m_1, m_2, \ldots, m_d)$, where $d$ is the dimension and $m_j$ is the weighted mean position of the jth dimension of $MBR_i$, which is defined as follows:

Let $f(k_1, k_2, \ldots, k_d)$ be the value of an updated position $(k_1, k_2, \ldots, k_d)$ in $MBR_i$ with $1 \le k_j \le n_j$, where $n_j$ is the number of cells in the jth dimension of $MBR_i$. For $1 \le m \le n_j$, let

$$G_j(m) = \sum_{k_1=1}^{n_1} \cdots \sum_{k_j=1}^{m} \cdots \sum_{k_d=1}^{n_d} f(k_1, \ldots, k_j, \ldots, k_d).$$

Let $a$ be the largest lower bound such that $G_j(a) \le G_j(n_j)/2$, and $b$ be the smallest upper bound such that $G_j(b) \ge G_j(n_j)/2$. Then $m_j = (a + b)/2$. That is, $m_j$ divides the hyperspace $MBR_i$ such that each subspace has about a half of the sum of updated values in $MBR_i$.

2. Each data node is situated at level 0 and contains $(D_1, D_2, \ldots, D_n)$, where $D_i$ is the tuple pertaining to the ith data entry
S.-J. Chun et al. / Information and Software Technology 48 (2006) 869–875
Fig. 1. Data cube cells changed from a two-dimensional data cube and the D-tree corresponding to the changed cells.
and has the form $(P_i, \Delta_i)$. $P_i$ is the position index and $\Delta_i$ is the difference of the changed cell.

Example 2.2. As shown in Fig. 1a, we assume that 12 data cube cells have been changed in the data cube. The value in each data cube cell is the difference ($\Delta$) between the new and old values of the changed data cube cell. The $\Delta$ value and its spatial position are stored in the D-tree as shown in Fig. 1b.

Ho et al. [6] presented an elegant algorithm for computing range-sum queries in data cubes, which we call the prefix sum approach. Their approach uses an additional cube, called a prefix sum cube (PC), to store the cumulative sum of the data.

1. Inclusive MBRs. $MBR_i$ ($i = 1, \ldots, m$), where $m$ is the number of MBRs included in the query MBR, denoted by $MBR_Q$.
2. Intersecting MBRs. $MBR_j$ ($j = m+1, \ldots, n$), where $n - m$ is the number of MBRs intersecting with $MBR_Q$.

Let $(\sum\Delta)_i$ ($i = 1, \ldots, m$) be the $\sum\Delta$ value of the ith inclusive MBR, and $(\sum\Delta)_j$ ($j = m+1, \ldots, n$) be the $\sum\Delta$ value of the jth intersecting MBR. The answer to the range-sum query at the level k of the D-tree can be approximated by the following equation:

$$\widehat{sum}^{Q} = sum^{Q}_{PC} + \widehat{sum}^{Q}_{\Delta} = sum^{Q}_{PC} + \sum_{i=1}^{m} \Big(\sum\Delta\Big)_i + \sum_{j=m+1}^{n} \frac{Vol(MBR_j \cap MBR_Q)}{Vol(MBR_j)} \times \Big(\sum\Delta\Big)_j$$

The error bound ratio (EBR) at level $l$ is

$$EBR(l) = \frac{UB^{Q}(l) - LB^{Q}(l)}{Max(1, |sum^{Q}|)}, \quad l \ge 0$$

Note that the level of data nodes (i.e. leaf nodes) is 0. Thus, $EBR(0) = 0$. The result of a range-sum query is large in almost all cases. We treat a value of $|sum^{Q}|$ below 1 as 1 to avoid the effect of the very unusual case in which $sum^{Q}$ is close to 0.
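The approximation equation and the error bound ratio above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the `Entry` class, the box representation (inclusive cell ranges per dimension), and the function names are hypothetical and not taken from the paper, and the prefix sum result $sum^{Q}_{PC}$ is simply passed in as a number.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Hypothetical representation: one inclusive (low, high) cell range per dimension.
Box = Tuple[Tuple[int, int], ...]


@dataclass
class Entry:
    """One directory entry at the chosen tree level (illustrative layout only)."""
    mbr: Box          # minimum bounding rectangle of the child node (assumed non-empty)
    delta_sum: float  # sum of the delta values stored below this entry


def volume(box: Box) -> int:
    """Number of cells in a box; 0 if the box is empty in any dimension."""
    cells = 1
    for low, high in box:
        cells *= max(0, high - low + 1)
    return cells


def intersect(a: Box, b: Box) -> Box:
    """Per-dimension intersection; an empty result has low > high somewhere."""
    return tuple((max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b))


def approximate_range_sum(sum_pc: float, entries: Sequence[Entry], query: Box) -> float:
    """Prefix sum result plus each entry's delta sum weighted by its overlap with the query."""
    estimate = sum_pc
    for entry in entries:
        overlap = volume(intersect(entry.mbr, query))
        # Inclusive MBRs get weight 1, intersecting MBRs a fraction, disjoint MBRs 0.
        estimate += entry.delta_sum * overlap / volume(entry.mbr)
    return estimate


def error_bound_ratio(upper: float, lower: float, exact_sum: float) -> float:
    """EBR = (UB - LB) / Max(1, |sum|), so sums close to 0 do not inflate the ratio."""
    return (upper - lower) / max(1.0, abs(exact_sum))


# Made-up 2-D data: one inclusive MBR, one half-overlapping MBR, one disjoint MBR.
entries = [
    Entry(((0, 1), (0, 1)), 8.0),
    Entry(((2, 5), (0, 1)), 6.0),
    Entry(((6, 7), (6, 7)), 5.0),
]
query = ((0, 3), (0, 3))
print(approximate_range_sum(10.0, entries, query))  # 21.0
```

Treating inclusive MBRs as the special case of overlap fraction 1 lets a single loop cover both sums of the equation; a real implementation would instead descend the tree and stop at the requested level.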
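Lemma 2.4. (Progressively refined error bounds)
The error bounds are progressively refined as the level of the tree decreases.

Proof. Let $l$ and $m$ be levels of the tree such that $l \le m$. Since $UB^{Q}(l) - UB^{Q}(m) \le 0$ and $LB^{Q}(m) - LB^{Q}(l) \le 0$, it follows that

$$EBR(l) - EBR(m) = \frac{\big(UB^{Q}(l) - UB^{Q}(m)\big) + \big(LB^{Q}(m) - LB^{Q}(l)\big)}{Max(1, |sum^{Q}|)} \le 0.$$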
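The refinement stated in the lemma can be observed on a toy example. The sketch below is an assumption-laden illustration, not the paper's algorithm: it uses a hypothetical one-dimensional array of non-negative cell values, builds each tree level by grouping cells into blocks of width 8, 4, 2 and 1, and applies the simple MRA-tree style bounds from Section 2.1 (lower bound: fully contained nodes only; upper bound: contained plus all partially overlapping nodes).

```python
from typing import Dict, List, Tuple

# Hypothetical node layout: (low, high, total) over the half-open cell range [low, high).
Node = Tuple[int, int, float]


def nodes_at_width(cells: List[float], width: int) -> List[Node]:
    """Group consecutive cells into blocks of the given width, one node per block."""
    return [(i, min(i + width, len(cells)), float(sum(cells[i:i + width])))
            for i in range(0, len(cells), width)]


def naive_bounds(nodes: List[Node], q_low: int, q_high: int) -> Tuple[float, float]:
    """Bounds for the query [q_low, q_high); valid only for non-negative values."""
    lower = upper = 0.0
    for low, high, total in nodes:
        if q_low <= low and high <= q_high:    # node fully contained in the query
            lower += total
            upper += total
        elif low < q_high and q_low < high:    # node partially overlaps the query
            upper += total
    return lower, upper


cells = [3, 0, 5, 1, 2, 4, 0, 6]         # made-up, non-negative cell values
q_low, q_high = 2, 7                      # query covers cells 2..6
exact = float(sum(cells[q_low:q_high]))   # exact range sum, 12.0

ebr_by_level: Dict[int, float] = {}
for level, width in [(3, 8), (2, 4), (1, 2), (0, 1)]:
    lower, upper = naive_bounds(nodes_at_width(cells, width), q_low, q_high)
    ebr_by_level[level] = (upper - lower) / max(1.0, abs(exact))

print(ebr_by_level)  # {3: 1.75, 2: 1.75, 1: 0.5, 0: 0.0}
```

The ratio never increases as the level decreases and reaches 0 at level 0, where every node is a single cell and no node can partially overlap the query.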
Fig. 4. An example of tight bound technique in a two-dimension data cube with quadtree-based index structure (pCube).

Case 1. $UB^{Q}_{MBR_T}(l) = \sum\Delta^{+}/2$ and $LB^{Q}_{MBR_T}(l) = \sum\Delta^{-}/2$, if $\exists j$ such that $Vol(MBR_T \cap MBR_Q) \subset phs_j$ $\wedge$ $\exists k$ such that $Vol(MBR_T \cap MBR_Q) \subset nhs_k$

Case 2. $UB^{Q}_{MBR_T}(l) = \sum\Delta^{+}/2 + \sum\Delta^{-}/2$ and $LB^{Q}_{MBR_T}(l) = \sum\Delta^{-}$, if $\exists j$ such that $Vol(MBR_T \cap MBR_Q) \subset phs_j$ $\wedge$ $\exists k$ such that $Vol(MBR_T \cap MBR_Q) \supset nhs_k$

Case 3. $UB^{Q}_{MBR_T}(l) = \sum\Delta^{+}$ and $LB^{Q}_{MBR_T}(l) = \sum\Delta^{+}/2 + \sum\Delta^{-}/2$, if $\exists j$ such that $Vol(MBR_T \cap MBR_Q) \supset phs_j$ $\wedge$ $\exists k$ such that $Vol(MBR_T \cap MBR_Q) \subset nhs_k$

Case 4. $UB^{Q}_{MBR_T}(l) = \sum\Delta^{+} + \sum\Delta^{-}/2$ and $LB^{Q}_{MBR_T}(l) = \sum\Delta^{-} + \sum\Delta^{+}/2$, if $\exists j$ such that $Vol(MBR_T \cap MBR_Q) \supset phs_j$ $\wedge$ $\exists k$ such that $Vol(MBR_T \cap MBR_Q) \supset nhs_k$

The tight bound technique can be applied to various kinds of tree-based index structures such as the MRA-tree, the D-tree, and the quadtree (pCube). Therefore, it is enough to show an example using the quadtree-based index structure (pCube) in Fig. 4.

As shown in Fig. 4, when a range-sum query (shaded box) is given, we can obtain the approximation value and error bounds for the naive bound technique and the tight bound technique at tree level 1 as follows:

Naive bound technique:
Approximation: 8 + 14/2 + 6/2 + 12/4 = 21
Upper bound: 8 + 14 + 6 + 12 = 40
Lower bound: 8 + 0 + 0 + 0 = 8

Tight bound technique:
Approximation: 8 + 14/2 + 6/2 + 12/4 = 21
Upper bound: 8 + 14/2 + 6/2 + 12/2 = 27
Lower bound: 8 + 14/2 + 6 + 12/2 = 21

In the example, the EBRs of the naive and tight bound techniques at tree level 1 are 1.46 and 0.27, respectively.

4. Experiments

4.1. Test data sets

In order to evaluate the effectiveness of the proposed method, we conducted an extensive experiment on diverse data sets that are generated synthetically with various dimensions. Our experiment focuses on showing the
Table 1
Test data sets used in the experiment

Method     d   Cube volume (V)      Number of cells stored in the tree
MRA-tree   3   512×512×512          400,000
MRA-tree   4   128×128×128×128      400,000
MRA-tree   5   64×64×64×64×64       400,000
D-tree     3   512×512×512          40,000
D-tree     4   128×128×128×128      40,000
D-tree     5   64×64×64×64×64       40,000

Data distributions: uniform/zipf
Query size (query volume as a fraction of V): large (0.1), medium (0.05), small (0.01)
Fig. 5. The error bound ratio for a uniform distribution in the D-tree when the dimensionality is three and the query sizes are small, medium and large.
Fig. 6. The error bound ratio in the MRA-tree and the D-tree for a uniform distribution when the dimension = 3 and the query size = large.
Fig. 7. The error bound ratio in the MRA-tree and the D-tree for a zipf distribution when the dimension = 3 and the query size = large.
Fig. 9. The error bound ratio for a zipf distribution in the D-tree when the query size is large and the dimensionalities are 3–5.