have to perform larger and larger comparisons. We are saying, however, that it is a solution we are stuck with for the foreseeable future, and because differentials are such inherently expensive operations it is critical that we perform them as efficiently as possible. In this paper we will present very efficient differential algorithms; they perform so well because they exploit the particular semantics of the problem.

1.1 Problem Formulation

We view a source snapshot as a file containing a set of distinct records. The file is of the form {R1, R2, ..., Rn}, where Ri denotes the records. Each Ri is of the form < K, B >, where K is the key and B is the rest of the record representing one or more fields. Without loss of generality, we refer to B as a single field in the rest of the paper. (In [LGM95] we extend the algorithms presented in this paper to the case where records do not have unique keys.)

For the snapshot differential problem we have two snapshots, F1 and F2 (the later snapshot). Our goal is to produce a file Fout that also has the form {R1, R2, ..., Rn}, where each modification record Ri has one of the following three forms.

1. < Update, Ki, Bj >
2. < Delete, Ki >
3. < Insert, Ki, Bi >

The first form is produced when a record < Ki, Bi > in file F1 is updated to < Ki, Bj > in file F2. The second form is produced when a record < Ki, Bi > in F1 does not appear in F2. Lastly, the third form is produced when a record < Ki, Bi > in F2 was not present in F1. We refer to the first form as updates, the second as deletes and the third as inserts. The first field is only necessary for distinguishing between updates and inserts; it is included for clarity in the case of deletes.

Conceptually, we have represented snapshots as sets because the physical location of a record within a snapshot file may change from one snapshot to another. That is, records with matching keys are not expected to be in the same physical position in F1 and F2. This is because the source is free to reorganize its storage between snapshots. Also, insertions and deletions may change physical record positions in the snapshot.

The snapshot differential can be performed at the source itself. That is, a snapshot is taken periodically and stored at the source site. A daemon process then performs the snapshot differential periodically and sends the detected modifications to the warehouse. The snapshot differential can also be performed at an intermediate site. That is, the source sends the full snapshots to an intermediate site where the snapshot differential process is performed. In any case, the exact procedure for sending these modifications to the data warehouse is implementation dependent.

It is important to realize that there is no unique set of modifications that captures the difference between two snapshots. At one extreme, a deletion can be reported for each record in F1 and an insertion can be reported for each record in F2. Obviously, this can be wasteful. We capture this notion of wasted messages by defining useless delete-insert pairs and useless insert-delete pairs. A useless insert-delete pair is a message sequence composed of < Insert, Ki, Bi > followed (not necessarily immediately) by < Delete, Ki >, produced when the two snapshots both have a record < Ki, Bi > or when the earlier snapshot has < Ki, Bj > and the later one has < Ki, Bi >. A useless insert-delete pair introduces a correctness problem. When the insert is processed at the warehouse, it will most likely be ignored since a record with the same key already exists. Thus, when the delete is processed, the record with the key Ki will be deleted from the warehouse. On the other hand, a useless delete-insert pair does not compromise the correctness of the warehouse. However, it introduces overhead in processing messages since either no modifications were needed (when the two snapshots both have a record < Ki, Bi >) or the modification could have been reported more succinctly by < Update, Ki, Bi >.

Since useless pairs are not an effective way of reporting changes, one may be tempted to require snapshot differential algorithms to generate no useless pairs. However, strictly forbidding useless delete-insert pairs turns out to be counterproductive! Allowing the generation of "some" useless delete-insert pairs gives the differential algorithm significant flexibility and leads to solutions that can be very efficient in some cases. We return to these issues later when we quantify the savings of "flexible" differential algorithms over algorithms that do not allow useless delete-insert pairs. Thus, in this paper we do allow useless delete-insert pairs, with the ultimate goal of keeping their numbers relatively small. However, we do want to avoid useless insert-delete pairs since they may compromise correctness. Useless insert-delete pairs can be eliminated by batching the deletes together and sending the deletes first to the warehouse for processing. In essence, we have transformed the insert-delete pairs into delete-insert pairs. This method also amortizes the overhead cost of sending the modifications over the network.
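The three modification forms, and the delete-first batching that rules out useless insert-delete pairs, can be sketched in a few lines of Python. This is a hypothetical in-memory helper for illustration only, not one of the paper's disk-based algorithms; the function name and the dict-based snapshot representation are our assumptions.

```python
def snapshot_differential(f1, f2):
    """Compute modification records between two snapshots.

    f1, f2: dicts mapping key K -> field B (the earlier and later
    snapshots). Deletes are batched and emitted before inserts, so no
    useless insert-delete pair can reach the warehouse.
    """
    deletes, updates, inserts = [], [], []
    for k, b in f1.items():
        if k not in f2:
            deletes.append(("Delete", k))
        elif f2[k] != b:
            updates.append(("Update", k, f2[k]))
    for k, b in f2.items():
        if k not in f1:
            inserts.append(("Insert", k, b))
    # Deletes first: any insert-delete pair becomes a delete-insert pair.
    return deletes + updates + inserts
```

Note that emitting all deletes first is exactly the batching transformation described above: correctness is preserved even if some pairs are useless.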
We assume for the rest of the paper that all useless insert-delete pairs are eliminated.

1.2 Differences with Joins

The snapshot differential problem is closely related to the problem of performing a join between two relations. In particular, if we outerjoin F1 and F2 on their common K attribute on the condition that their B attributes differ, we can obtain the update records required for the differential problem.

Outerjoin is so closely related to the differential problem that the traditional, ad hoc, join algorithms ([ME92], [HC94]) can be adapted to our needs. Indeed, in Section 3 we show these modifications. However, given the particular semantics and intended application of the differential algorithms, we can go beyond the ad hoc solutions and obtain new and more efficient algorithms. The three main ideas we exploit are as follows:

- Some useless delete-insert pairs are acceptable. Traditional outerjoin algorithms do not have useless delete-insert pairs. The extra flexibility we have allows algorithms that are "sloppy" but efficient in matching records.

- For some data warehousing applications, it may be acceptable to miss a few of the modifications. Thus, for differentials we can use probabilistic algorithms that may miss some differences (with arbitrarily low probability), but that can be much more efficient. Again, traditional algorithms are not allowed any "errors," must be very conservative, and must pay the price.

- Snapshot differentials are an on-going process running at a source. This makes it possible to save some of the information used in one differential to improve the next iteration.

2 Related Work

Snapshots were first introduced in [AL80]. Snapshots were then used in the R* project at IBM Research in San Jose [Loh85]. The data warehouse snapshot can be updated by maintaining a log of the modifications to the database. This approach was defined to be a differential refresh strategy in [KR87]. If snapshots were sent periodically, this was called the full refresh strategy. In this paper we only consider the case where the source strategy is full refresh. [Lea86] also presented a method for refreshing a snapshot that minimizes the number of messages sent when refreshing a snapshot. The method requires annotating the base tables with two columns for a tuple address and a timestamp. We cannot adopt this method in data warehousing since the sources are autonomous.

Reference [CRGMW96] investigates algorithms to find differences in hierarchical structures (e.g., documents, CAD designs). Our focus here is on simpler, record structured differences, and on dealing with very large snapshots that may not fit in memory.

There has been recent complementary work on copy detection of files and documents ([MW94], [BDGM95], [SGM95]). The snapshot differential problem is concerned with detecting the specific differences of two files, as opposed to measuring how different two files are. Also related are [BGMF88] and [FWJ86], which propose methods for finding differing pages in files. However, these methods can only detect a few modifications.

The snapshot differential problem is also related to text comparison, for example, as implemented by UNIX diff and DOS comp. However, the text comparison problem is concerned with a sequence of records, while the snapshot differential problem is concerned with a set of records. Reference [HT77] outlines an algorithm that finds the longest common subsequence of the lines of the text, which is used in the UNIX diff. Report [LGM95] takes a closer look at how this algorithm can be adapted to solve the snapshot differential problem, although the solution is not as efficient as the ones presented here.

The methods for solving the snapshot differential problem proposed here are based on ad hoc joins, which have been well studied; [ME92] and [Sha86] are good surveys on join processing. The snapshot differential algorithms proposed here are used in the data warehousing system WHIPS. An overview of the system is presented in [HGMW+95]. After the modifications of multiple sources are detected, the modifications are integrated using methods discussed in [ZGMHW95].

Note that knowledge of the semantics of the information maintained at the warehouse may help make change detection simpler. An outline of these special cases is in report [LGM95].

3 Using Compression

In this section we first describe existing, ad hoc, join algorithms, but we do not cover all the known variations and optimizations of these algorithms. We believe that many of these further optimizations can also be applied to the snapshot differential algorithms we present.

After extending the ad hoc algorithms to handle the differential problem, we study record compression techniques to optimize them.
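The outerjoin view of the problem can be sketched as a single merge pass over two snapshots already sorted by key, as the sort merge algorithms of Section 3 assume. This is an illustrative in-memory sketch (the function name and the list-of-pairs representation are our assumptions), not the paper's disk-based algorithm.

```python
def sort_merge_diff(f1, f2):
    """Outerjoin-style merge of two snapshots sorted by key.

    f1, f2: lists of (K, B) pairs in ascending key order. A record
    whose key is smaller than the other side's current key cannot match
    anything there, so deletes and inserts fall out of the merge at no
    extra cost.
    """
    out, i, j = [], 0, 0
    while i < len(f1) or j < len(f2):
        if j == len(f2) or (i < len(f1) and f1[i][0] < f2[j][0]):
            out.append(("Delete", f1[i][0]))       # only in F1
            i += 1
        elif i == len(f1) or f2[j][0] < f1[i][0]:
            out.append(("Insert",) + f2[j])        # only in F2
            j += 1
        else:
            if f1[i][1] != f2[j][1]:               # keys equal: update
                out.append(("Update",) + f2[j])    # only if B changed
            i += 1
            j += 1
    return out
```

Unlike the batching discussed in Section 1.1, this sketch emits messages in key order; a real implementation would batch the deletes first.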
In the sections below, we denote the size of a file F as |F| blocks and the size of main memory as |M| blocks. We also exclude the cost of writing the output file in our cost analysis since it is the same for all of the algorithms.

3.1 Outer Join Algorithms

The basic sort merge join first sorts the two input files. It then scans the files once and any pair of records that satisfy the join condition are produced as output. The algorithm can be adapted to perform an outerjoin by identifying the records that do not join with any records in the other file. This can be done with no extra cost when two records are being matched: the record with the smaller key is guaranteed to have no matching records.

Since differentials are an on-going process running at a source, it is possible to save the sorted file of the previous snapshot. Thus, the algorithm only needs to sort the second file, F2. This can be done using the multiway merge-sort algorithm. This algorithm constructs runs, which are sequences of blocks with sorted records. After a series of passes, the file is partitioned into progressively longer runs. The algorithm terminates when there is only one run left. In general, it takes 2 * |F| * log_|M| |F| IO operations to sort a file with size |F| ([Ull88]). However, if there is enough main memory (|M| > sqrt(|F|)), the sorting can be done in 4 * |F| IO operations (sorting is done in two passes). The second phase of the algorithm, which involves scanning and merging the two sorted files, entails |F1| + |F2| IO operations, for a total of |F1| + 5 * |F2| IO operations.

The IO cost can be reduced further by just producing the sorted runs (denoted as F2 runs) in the first phase. The first step of the algorithm produces the sorted F2 runs, at a cost of only 2 * |F2| IOs. (File F1 has already been sorted at this point.) The sorted F2 file, needed for the next run of the algorithm(1), can then be produced while matching F2 runs with F1. In producing the sorted F2 file, we read into memory one block from each run in F2 runs (if the block is not already in memory), and select the record with the smallest K value. The merge process then costs 2 * |F2| + |F1| IOs. The total cost incurred is |F1| + 4 * |F2| IOs.

(1) Actually the sorted F2 runs will suffice. Producing the F2 runs reduces the IO cost but requires more memory during the matching phase.

Another method that we discuss here is the partitioned hash join algorithm. In the partitioned hash join algorithm, the input files are partitioned into buckets by computing a hash function on the join attribute. Records are matched by considering each pair of corresponding buckets. First, one of the buckets is read into memory and an in-memory hash table is built (assuming the bucket fits in memory). The second bucket is then read and a probe into the in-memory hash table is made for each record, in an attempt to find a matching record in the first bucket. A more detailed discussion of the partitioned hash algorithm is found in [LGM96], where we show that the IO cost incurred is |F1| + 3 * |F2|.

3.2 Compression Techniques

Our compression algorithms reduce the sizes of records and the required IO. Compression can be performed in varying degrees. For instance, compression may be performed on the records of a file by compressing the whole record into n bits. A block or a group of blocks can also be compressed into n bits. There are also numerous ways to perform compression, such as computing the check sum of the data or hashing the data to obtain an integer. Compression can also be lossy or lossless. In the latter case, the compression function guarantees that two different uncompressed values are mapped into different compressed values. Lossy compression functions do not have this guarantee but have the potential of achieving higher compression factors. Henceforth, we assume that we are using a lossy compression function. We ignore the details of the compression function and simply refer to it as compress(x).

There are a number of benefits from processing compressed data. First of all, the compressed intermediate files, such as the buckets for the partitioned hash join, are smaller. Thus, there will be fewer IOs when reading the intermediate files. Moreover, the compressed file may be small enough to fit in memory. Even if not, some of the join algorithms may still benefit. For example, the compressed file may result in buckets that fit in memory, which improves the matching phase of the partitioned hash join algorithm.

Compression is not without its disadvantages. As mentioned earlier, a lossy compression function may map two different records into the same compressed value. This means that the snapshot differential algorithm may not be able to detect all the modifications to a snapshot. We now show that this can occur with a probability of 2^-n, where n is the number of bits for the compressed value. Assume that we are compressing an object (which may be the B field, or the entire record, or an entire block, etc.) of b bits (b > n). There are then 2^b possible values for this object. Since there are only 2^n values that the compressed object can attain, there are 2^b/2^n original values mapped to each compressed value.
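Such a lossy compress(x) might be built from any well-mixed hash. The sketch below is our own assumption about how one could be implemented (the hash choice and bit widths are not from the paper): it maps a field to an n-bit integer, so equal inputs always compress equally, while distinct inputs collide with probability about 2^-n.

```python
import hashlib

def compress(value, n_bits=32):
    """Lossy compression of a field to an n-bit integer.

    Equal inputs always compress equally; distinct inputs collide
    with probability about 2**-n_bits.
    """
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << n_bits)
```

Comparing compress(b1) != compress(b2) then proves b1 != b2, while equality of compressed values is only probable equality of the originals.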
Thus, for each given original value, the probability that another value maps to the same compressed value is (2^b/2^n - 1)/2^b, which is approximately 2^-n for large values of b. For sufficiently large values of n, this probability can be made very small. The expression 2^-n, henceforth denoted as E, gives the probability that a single comparison is erroneous. For example, if the B field of the record < K, B > is compressed into a 32-bit integer, the probability that a single comparison (of two B fields) is erroneous is 2^-32, or approximately 2.3 * 10^-10. However, as we compare more records, the likelihood that a modification is missed increases. To put this probability of error into perspective, let us suppose we perform a differential on two 256 MB snapshots daily. We now proceed to compute how many days we expect to pass before a record modification is missed. We first compute the probability (denoted as p_day) that there is no error in comparing two given snapshots. Let us suppose that the record size is 150 bytes, which means that there are approximately 1,789,570 records for each file.

    p_day = (1 - E)^rec(F) = (1 - 2.3 * 10^-10)^1,789,570    (1)

Using this probability, we can compute the expected number of days before an error occurs.

    N_good days = (1 - p_day) * SUM_{1<=i} i * p_day^i = p_day / (1 - p_day)    (2)

This comes out to be 2,430 days, or about 6.7 years! We believe that for some types of warehousing applications, such as data mining, this error rate will be acceptable.

It is evident from the equations above that as the number of records increases, the expected number of days before an error occurs goes down. However, as the number of bits used for compressing is increased, the expected number of years before an error occurs can be made comfortably large, even for large files.

For the algorithms we will present here, we consider two ways of compressing the records. For both compression formats, we do not compress the key, and we denote the compressed B field as b. The first format simply compresses a record < K, B > into < K, b >. For the second format, the only difference is that a pointer is appended, forming the record < K, b, p >. The pointer p points to the corresponding disk resident uncompressed record. The use of the pointer will be explained when we describe the algorithms. We use u to represent the ratio of the size of the original record to that of the compressed record. So, if an uncompressed file is size |F|, the compressed size will be |F|/u blocks long.

Algorithm 3.1
Input: f1 sorted, F2
Output: Fout (the snapshot differential), f2 sorted
Method
(1)  F2 runs <- SortIntoRuns(F2)
(2)  r1 <- read the next record from f1 sorted
(3)  r2 <- read the next record from F2 runs;
     f2 sorted <- Output(< r2.K, compress(r2.B) >)
(4)  while ((r1 != NULL) or (r2 != NULL))
(5)    if ((r1 = NULL) or (r1.K > r2.K)) then
(6)      Fout <- Output(< Insert, r2.K, r2.B >)
(7)      r2 <- read the next record from F2 runs;
         f2 sorted <- Output(< r2.K, compress(r2.B) >)
(8)    else if ((r2 = NULL) or (r1.K < r2.K)) then
(9)      Fout <- Output(< Delete, r1.K >)
(10)     r1 <- read the next record from f1 sorted
(11)   else if (r1.K = r2.K) then
(12)     if (r1.b != compress(r2.B)) then
(13)       Fout <- Output(< Update, r2.K, r2.B >)
(14)     r1 <- read the next record from f1 sorted
(15)     r2 <- read the next record from F2 runs;
         f2 sorted <- Output(< r2.K, compress(r2.B) >)

Figure 1: Sort Merge Outerjoin Enhanced with the < K, b > Compression Format

3.3 Outerjoin Algorithms with Compression

We now augment the sort merge outerjoin with compression (shown in Figure 1). The algorithm differs from the standard sort merge algorithm in that it reads a compressed sorted F1 file (denoted as f1 sorted, with a size of |F1|/u). Also, when detecting the updates in step (12), the compressed versions of the B field are compared. Lastly, steps (3), (7) and (15) now first compress the B field before producing an Output into f2 sorted.

The sorting phase of the algorithm incurs 2 * |F2| IOs. The matching phase (steps (4) onwards) incurs |F2| + |f1| IOs since the two files are scanned once. Lastly, the sorted f2 sorted must be produced for the next differential, which costs |f2| IOs. The total cost is then |f1| + 3 * |F2| + |f2| IOs.

Greater improvements may be achieved by compressing not only the first snapshot but also the second snapshot before the files are matched. When the second snapshot arrives, it is read into memory and compressed sorted runs are written out. In essence, the uncompressed F2 file is read only once. The problem introduced by compressing the second snapshot is that when insertions and updates are detected, the original uncompressed record must be obtained from F2. In order to find the original (uncompressed) record, a pointer to the record must be saved in the compressed record.
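The expected-lifetime computation of equations (1) and (2) earlier in this section can be checked numerically. This is a back-of-the-envelope sketch using the constants from the text (E approximately 2.3 * 10^-10 for a 32-bit compress, and about 1,789,570 records in a 256 MB snapshot of 150-byte records).

```python
# Numerical check of equations (1) and (2).
E = 2.3e-10            # per-comparison error probability (32-bit compress)
rec_f = 1_789_570      # records per 256 MB snapshot at 150 bytes/record

p_day = (1.0 - E) ** rec_f            # eq. (1): error-free daily differential
n_good_days = p_day / (1.0 - p_day)   # eq. (2): mean of a geometric distribution
print(round(n_good_days), round(n_good_days / 365, 1))
```

The result lands within rounding of the 2,430 days (roughly 6.7 years) quoted in the text.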
Thus, for this algorithm, the < K, b, p > compression format must be used. The full algorithm is shown in Figure 2. Step (5a) (step (12a)) shows that when an insertion (update) is detected, the pointer p of the current record is used to obtain the original record in order to produce the correct output.

Algorithm 3.2
Input: f1 sorted, F2
Output: Fout (the snapshot differential), f2 sorted
Method
(1)  f2 runs <- SortIntoRuns(Compress(F2))
(2)  r1 <- read the next record from f1 sorted
(3)  r2 <- read the next record from f2 runs;
     f2 sorted <- Output(< r2.K, r2.b, r2.p >)
(4)  while ((r1 != NULL) or (r2 != NULL))
(5)    if ((r1 = NULL) or (r1.K > r2.K)) then
(5a)     rfull <- read tuple in F2 with address r2.p
(6)      Fout <- Output(< Insert, r2.K, rfull.B >)
(7)      r2 <- read the next record from f2 runs;
         f2 sorted <- Output(< r2.K, r2.b, r2.p >)
(8)    else if ((r2 = NULL) or (r1.K < r2.K)) then
(9)      Fout <- Output(< Delete, r1.K >)
(10)     r1 <- read the next record from f1 sorted
(11)   else if (r1.K = r2.K) then
(12)     if (r1.b != r2.b) then
(12a)      rfull <- read tuple in F2 with address r2.p
(13)       Fout <- Output(< Update, r2.K, rfull.B >)
(14)     r1 <- read the next record from f1 sorted
(15)     r2 <- read the next record from f2 runs;
         f2 sorted <- Output(< r2.K, r2.b, r2.p >)

Figure 2: Sort Merge Outerjoin Enhanced with the < K, b, p > Compression Format

Figure 3: The aging buffer (a hash table with an embedded age queue)

Algorithm 3.3
Input: F1, F2, n (number of blocks in the input buffer)
Output: Fout (the snapshot differential)
Method
(1)  Input Buffer1 <- Read n blocks from F1
(2)  Input Buffer2 <- Read n blocks from F2
(3)  while ((Input Buffer1 != EMPTY) or (Input Buffer2 != EMPTY))
(4)    Match Input Buffer1 against Input Buffer2
(5)    Match Input Buffer1 against Aging Buffer2
(6)    Match Input Buffer2 against Aging Buffer1
(7)    Put contents of Input Buffer1 to Aging Buffer1
(8)    Put contents of Input Buffer2 to Aging Buffer2
(9)    Input Buffer1 <- Read n blocks from F1
(10)   Input Buffer2 <- Read n blocks from F2
(11) Report records in Aging Buffer1 as deletes
(12) Report records in Aging Buffer2 as inserts

Figure 4: Window Algorithm

The partitioned hash outerjoin is augmented with compression in a very similar manner to the sort merge outerjoin. We show in [LGM96] that the overall cost is reduced to |f1| + 3 * |F2| + |f2| IOs if the buckets are compressed after the matching phase. If the buckets are compressed before the matching phase, we also show in [LGM96] that the overall cost is |f1| + |F2| + 2 * |f2| + I + U IOs.

The performance gains can be even greater if the compression factor u is high enough such that all of the buckets of F1 fit in memory. In this case, all the buckets for F1 are simply read into memory (|f1| IOs). The file F2 is then scanned, and for each record in F2 read, the in-memory buckets are probed. The compressed buckets for F2 can also be constructed for the next differential during this probe. The overall cost of this algorithm is only |f1| + |F2| + |f2| IOs.
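When the compressed F1 buckets all fit in memory, the probe phase collapses to lookups in a single hash table, as described above. The sketch below is an illustrative in-memory rendering of that case (the function name and the callable compress argument are our assumptions, not the paper's interface).

```python
def phc_in_memory_diff(f1, f2, compress):
    """Fits-in-memory partitioned hash sketch.

    f1, f2: lists of (K, B) pairs; compress: a callable standing in for
    compress(x). The compressed F1 records are held in one in-memory
    table keyed on K, and F2 is scanned once, probing the table per
    record. Keys absent from the table are inserts; keys never probed
    are deletes. (Sketch limitation: compress must not return None.)
    """
    table = {k: compress(b) for k, b in f1}      # compressed F1 in memory
    out = []
    for k, b in f2:
        cb = table.pop(k, None)
        if cb is None:
            out.append(("Insert", k, b))
        elif cb != compress(b):                  # compare compressed B fields
            out.append(("Update", k, b))
    out.extend(("Delete", k) for k in table)     # unmatched F1 records
    return out
```

Because only compressed B fields are compared, an update can be missed with the 2^-n probability analyzed in Section 3.2.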
... of increasing main memory capacity, by maintaining a moving window of records in memory for each snapshot. Only the records within the window are compared, in the hope that the matching records occur within the window. Unmatched records are reported as either an insert or a delete, which can lead to useless delete-insert pairs. As discussed in Section 1, a small number of these may be tolerable.

For the window algorithm, we divide available memory into four distinct parts as shown in Figure 3. Each snapshot has its own input buffer (input buffer i is for Fi) and aging buffer. The input buffer is simply the buffer used in transferring blocks from disk. The aging buffer is essentially the moving window mentioned above.

The algorithm is shown in Figure 4 and we now proceed to explain each step. Steps (1) and (2) simply read a constant number of input blocks of records from file F1 and file F2 to fill input buffer 1 and input buffer 2, respectively. This process will be done repeatedly by steps (9) and (10). Before the input buffers are refilled, the algorithm guarantees that they are empty. Steps (4) through (6) are concerned with matching the records of the two snapshots. In step (4), the matching is performed in a nested loop fashion. This is not expensive since the input buffers are relatively small. The matched records can produce updates if the B fields differ. The slots that these matching records occupy in the buffer are also marked as free. In step (5), the remaining records in input buffer 1 are matched against aging buffer 2. Since the aging buffers are much larger, the aging buffers are actually hash tables to make the matching more efficient. For each remaining record in input buffer 1, the hash table that is aging buffer 2 is probed for a match. As in step (4), an update may be produced by this matching. The slots of the matching records are also marked as free. Step (6) is analogous to step (5), but this time matching input buffer 2 and aging buffer 1. Steps (7) and (8) clear both input buffers by forcing the unmatched records in the input buffers into their respective aging buffers. The same hash function used in steps (4) and (5) is used to determine which bucket the record is placed into. Since new records are forced into the aging buffer, some of the old records in the aging buffer may be displaced. These displaced records constitute the deletes (inserts) if the records are displaced from aging buffer 1 (aging buffer 2). The displacement of old records is explained further below. The steps are then repeated until both snapshots are processed. At that point, any remaining records in the aging buffers are output as inserts or deletes.

    Name    Description              Default
    M       Memory Size              32 MB
    B       Block Size               16 K
    F       File Size                256 or 1024 MB
    R       Record Size              150 bytes
    rec(F)  # of Records             1,789,569 or 7,158,279
    r       Compressed Record Size   10 or 14 bytes
    IO      Number of IOs            N/A
    x       Intermediate File Size   N/A
    E       Prob. of Error           N/A

Figure 5: List of Variables

In the hash table that constitutes the aging buffer there is an embedded "aging" queue, with the head of the queue being the oldest record in the buffer, and the tail being the youngest. Figure 3 illustrates the aging buffer. Each entry in the hash table has a timestamp associated with it, for illustration purposes only. The figure shows that the oldest record is at the head of the queue. Whenever new records are forced into the aging buffer, the new records are placed at the tail of the queue. If the aging buffer is full, the record at the head of the queue is displaced as a new record is enqueued at the tail. This action produces a delete (insert) if the buffer in question is aging buffer 1 (aging buffer 2).

Since files are read once, the IO cost for the window algorithm is only |F1| + |F2| regardless of memory size, snapshot size and number of updates and inserts. Thus the window algorithm achieves the optimal IO performance if compression is not considered. However, the window algorithm can produce useless delete-insert pairs in steps (6) and (7) of the algorithm. Intuitively, the number of useless delete-insert pairs produced depends on how physically different the two snapshots are.
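The matching and aging steps of Algorithm 3.3 can be sketched in Python. In this simplified rendering (all our assumptions, not the paper's implementation), n counts records rather than blocks, an insertion-ordered dict stands in for the hash table with its embedded age queue, and aging_cap plays the role of the aging buffer size.

```python
from collections import OrderedDict

def window_diff(f1, f2, n=2, aging_cap=4):
    """Window algorithm sketch. f1, f2: lists of (K, B) pairs in file
    order. Evictions from aging buffer 1 become deletes and from aging
    buffer 2 become inserts, so a small aging_cap can yield useless
    delete-insert pairs, as discussed in the text."""
    out = []
    aging1, aging2 = OrderedDict(), OrderedDict()  # key -> B, oldest first
    i = j = 0
    while i < len(f1) or j < len(f2):
        in1, in2 = dict(f1[i:i + n]), dict(f2[j:j + n])  # steps (1)-(2)/(9)-(10)
        i, j = i + n, j + n
        for k in list(in1):            # steps (4)-(5): match input buffer 1
            other = in2 if k in in2 else (aging2 if k in aging2 else None)
            if other is not None:
                if other[k] != in1[k]:
                    out.append(("Update", k, other[k]))
                other.pop(k)           # mark matched slots free
                in1.pop(k)
        for k in list(in2):            # step (6): match input buffer 2
            if k in aging1:
                if aging1[k] != in2[k]:
                    out.append(("Update", k, in2[k]))
                aging1.pop(k)
                in2.pop(k)
        for src, aging, side in ((in1, aging1, 1), (in2, aging2, 2)):
            for k, b in src.items():   # steps (7)-(8): age unmatched records
                aging[k] = b
                if len(aging) > aging_cap:
                    old_k, old_b = aging.popitem(last=False)  # displace oldest
                    out.append(("Delete", old_k) if side == 1
                               else ("Insert", old_k, old_b))
    for k in aging1:                   # step (11)
        out.append(("Delete", k))
    for k, b in aging2.items():        # step (12)
        out.append(("Insert", k, b))
    return out
```

With a generous aging_cap, displaced matching records still meet inside the window and no extra messages appear; shrinking aging_cap makes displaced pairs miss each other and surface as extra delete/insert messages.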
To quantify this difference, we define the distance of two snapshots. We want the distance measure to be symmetric and independent of the size of the file. The equation below exhibits the desired properties.

    distance = SUM_{R1 in F1, R2 in F2, match(R1,R2)} |pos(R1) - pos(R2)|
               / (max(rec(F1), rec(F2))^2 / 2)    (3)

The function pos returns the physical position of a record in a snapshot. The boolean function match is true when records R1 and R2 have matching keys. The function rec returns the number of records of a snapshot file. Thus, this equation sums up the absolute value of the difference in position of the matching records and normalizes it by the maximum distance for the given snapshot file sizes. The maximum distance between two snapshots is attained ...

Figure 8: IO Cost and Compression Factor (series: PH, PHC1, PHC2, Window, SMC1, SMC2)

5.1 Analytical IO Comparison

We have outlined in the previous section algorithms that can compute a snapshot differential: performing a sort merge outerjoin (SM), performing a partitioned hash outerjoin (PH), performing a sort merge outerjoin with two kinds of record compression (SMC1, SMC2), performing a partitioned hash outerjoin with two kinds of record compression (PHC1, PHC2) and using the window algorithm (W). SMC1 denotes sort merge outerjoin with a record compression format of < K, b > (similarly for PHC1); SMC2 uses the record compression format < K, b, p > (similarly for PHC2). In this section, we will illustrate and compare the algorithms in terms of IO cost, size of intermediate files, and the probability of error. Due to space limitations, this is not a comprehensive study, but simply an illustration of potential differences between the algorithms in a few realistic scenarios.

Figure 5 shows the variables used in comparing the algorithms. We assume that the snapshots have the same number of records. The number of records (rec(F)) is calculated using F/R, where R is the record size (150 bytes). The compressed record size is 10 bytes for the < K, b > format and 14 bytes for the < K, b, p > format. This leads to compression factors of 15 and 10, respectively.

Figure 6 shows a summary of the results computed for the various algorithms. The two columns labeled IO256 and IO1024 show the IO cost incurred in processing 256 MB and 1024 MB snapshots for the different algorithms. Using the sort merge outerjoin as a baseline, we can see that the partitioned hash outerjoin (PH) reduces the IO cost by 20%. Compression using the < K, b > record format achieves a 37% reduction in IO cost over sort merge using SMC1, and a 50% reduction using SMC2. For the 256 MB file, the compressed file fits in memory, which enables the PHC1 and PHC2 algorithms to build a complete in-memory hash table, as explained in Section 3.3. The reduction in IO cost for these two algorithms, in this case, surpasses even that of the window algorithm.

However, when the larger file is considered, the compressed file no longer fits in the 32 MB memory. Thus the PHC1 and PHC2 algorithms achieve more modest reductions in this case (37% and 52% respectively). Other than these two algorithms, the reductions achieved by the other algorithms are unchanged even with the larger file.

Figure 7 compares the algorithms when the size of the snapshots is varied. The values of other parameters are unchanged. Note that we have not plotted SMC1 and SMC2 since their plots are almost indistinguishable ...
    Algorithm  IO256 (%savings)  IO1024 (%savings)   x256 (MB)  x1024 (MB)  Probability of Error (E)
    SM         81,920            327,680             16,384     65,536      0
    SMC1       51,336 (37%)      205,346 (37%)       16,384     65,536      2.3 * 10^-10
    SMC2       40,833 (50%)      163,333 (50%)       1,639      6,554       2.3 * 10^-10
    PH         65,536 (20%)      262,144 (20%)       16,384     65,536      0
    PHC1       18,568 (77%)      205,346 (37%)       16,384     65,536      2.3 * 10^-10
    PHC2       19,660 (76%)      156,779 (52%)       1,639      6,554       2.3 * 10^-10
    W          32,768 (60%)      131,072 (60%)       0          0           0

Figure 6: Summary of Results for the Various Algorithms
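Three of the Figure 6 entries can be re-derived directly from the cost formulas quoted earlier in the text. This is a back-of-the-envelope check of our own; the block arithmetic uses the Figure 5 defaults.

```python
# 256 MB snapshots of 16 KB blocks, per Figure 5.
blocks = (256 * 1024) // 16     # |F1| = |F2| = 16,384 blocks

sm = blocks + 4 * blocks        # sort merge: |F1| + 4 * |F2|
ph = blocks + 3 * blocks        # partitioned hash: |F1| + 3 * |F2|
w = blocks + blocks             # window algorithm: |F1| + |F2|
print(sm, ph, w)                # matches Figure 6: 81920 65536 32768
```

The compressed variants depend on the compression factor and, for the pointer format, on the number of detected inserts and updates, so they are not reproduced here.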
(Plot: Effect of Distance on the Number of Extra Messages; series F = 50 MB and F = 75 MB)

    Snapshot Parameters                Default Values
    B     Size of B field              150 bytes
    R     Size of Record               156 bytes
          Number of Records            650,000
    F     File Size                    100 MB
          disp_avg                     50,000 records
    U     Number of Updates            20% of rec(F)

    Window Parameters                  Default Values
    AB    Aging Buffer Size            8 MB
    IB    Input Block Size             16 K

Figure 10: List of Parameters
[Figure 12: dist_crit and disp_crit MB]

[Figure 13: Effect of the Memory Size on the Number of Extra Messages]

[Figure 14: Comparison of the Total Times; elapsed time (s) vs. Size of Snapshot (MB), with curves for Window, Sort Merge, and Read]

…insertions nor deletions in the synthetic snapshot pair, we can define a critical average record displacement (denoted as disp_crit) which is related to dist_crit as shown in equation (5).

dist_crit = [ Σ_{R1 ∈ F1, R2 ∈ F2, match(R1, R2)} |pos(R1) − pos(R2)| ] / (rec(F)²/2)    (4)

dist_crit = (rec(F) · disp_crit) / (rec(F) · rec(F)/2) = 2 · disp_crit / rec(F)    (5)

disp_crit MB = disp_crit · R = dist_crit · (rec(F)/2) · R    (6)

Using the size of the record, we can translate the dist_crit into a critical average physical displacement (denoted as disp_crit MB, which is in terms of MB) using equation (6). Figure 12 shows the result of the calculations for the different snapshot pairs. The dist_crit of the snapshot pairs are estimated from Figure 11. This table shows, for example, that the window algorithm can tolerate an average physical displacement of about 11.2 MB given an aging buffer size of only 8 MB to compare 100 MB snapshots. Thus, if a system designer knows that the records can only be displaced within, say, a page (which is normally smaller than 11.2 MB), then the designer can be assured that the window algorithm will not produce excessive amounts of extra messages.

In the next experiment, we focus on the 100 MB snapshots. Using the parameters listed in Figure 10, we varied the size of the aging buffer from 1.0 MB to 16 MB. The disp_avg was set at 50,000, with a resulting distance of 0.34, which is well above the dist_crit. Figure 13 shows that once the size of the aging buffer is at least 12.8 MB, no extra messages are produced. This is to be expected since we showed previously (Figure 12) that the tolerable disp_crit MB for the 100 MB file is 11.2 MB. Using the same snapshot pair, we also varied the input block size from 8 K to 80 K. The variation had no effect on the number of extra messages and we do not show the graph here. Again, this is to be expected, since the size of the aging buffer is much larger than the size of the input block. Thus, even if the input block size is varied, the window size stays the same. We also varied the record size and this showed no effect on the number of extra messages produced.

Lastly we compared the CPU time and the clock time (which includes the IO time) that the window algorithm consumes to that of the sort merge outerjoin based algorithm. We ran the experiments on a DEC Alpha 3000/400 workstation running UNIX. We used the UNIX sort utility in the implementation of the sort merge outerjoin. (UNIX sort may not be the most efficient, but we believe it is adequate for the comparisons we wish to perform here.) We used the same input block size for both the window and the sort merge outerjoin algorithms (16 K). The disp_avg of the two snapshots was set so that the resulting distance was 0.05 (within the dist_crit for all file sizes). The analysis in the previous section illustrated that the window algorithm incurs fewer IO operations than the sort merge outerjoin algorithm. Our experiments showed that the window algorithm is also significantly less CPU intensive than the sort merge based algorithm (e.g., 80s compared to 250s for a 75 MB file). As expected then, Figure 14 shows that the window algorithm outperforms the
sort merge outerjoin in terms of clock time. Figure 14 also shows the time to simply read the files into memory, without actually processing them. Since no differential algorithm can avoid this IO overhead, we see that the Window algorithm has a relatively low CPU processing overhead.

We have defined the snapshot differential problem and discussed its importance in data warehousing. All of our proposed algorithms are relatively simple, but we view this as essential for dealing efficiently with large files. In summary, we have the following results:

• By augmenting the outerjoin algorithms with record compression, we have shown that very significant savings in IO cost can be attained.

• We have introduced the window algorithm, which works extremely well if the snapshots are not too different. Under this scenario, this algorithm outperforms the join based algorithms and its running time is comparable to simply reading the snapshots once.

We have incorporated the window and the sort merge outerjoin algorithms into the initial WHIPS prototype. The production version of the algorithm takes as input a "format definition" that describes the record format of the snapshots and identifies the key field(s). The format allows for complex value fields, but the window algorithm will consider the entire record as a single field. We also plan to implement a post-processor that filters out useless delete-insert pairs before they are sent to the warehouse. The differential algorithm and the warehouse itself are implemented within the Corba distributed object framework, using ILU, an implementation from Xerox PARC [CJS+94]. For our system demonstrations, we use the window algorithm to extract modifications from a legacy source that handles financial account information at Stanford.

References

[AL80] M.E. Adiba and B.G. Lindsay. Database snapshots. In VLDB, October 1980.

[BDGM95] S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In SIGMOD, May 1995.

[BGMF88] D. Barbara, H. Garcia-Molina, and B. Feijoo. Exploiting symmetries for low-cost comparison of file copies. In Proceedings of the Int. Conference on Distributed Computing Systems, June 1988.

[CJS+94] A. Courtney, W. Janssen, D. Severson, M. Spreitzer, and F. Wymore. Inter-language unification, release 1.5. Technical Report ISTL-CSA-94-01-01, Xerox PARC, May 1994.

[CRGMW96] S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in hierarchically structured information. In SIGMOD, June 1996.

[FWJ86] W.K. Fuchs, K. Wu, and J. Abraham. Low-cost comparison and diagnosis of large remotely located files. In Proceedings of the Fifth Symposium on Reliability in Distributed Software and Database Systems, January 1986.

[HC94] L. Haas and M. Carey. SEEKing the truth about ad hoc join costs. Technical report, IBM Almaden Research Center, 1994.

[HGMW+95] J. Hammer, H. Garcia-Molina, J. Widom, W. Labio, and Y. Zhuge. The Stanford Data Warehousing Project. IEEE Data Engineering Bulletin, June 1995.

[HT77] J.W. Hunt and T.G. Szymanski. A fast algorithm for computing longest common subsequences. CACM, 20(5), 1977.

[IC94] W.H. Inmon and E. Conklin. Loading data into the warehouse. Tech Topic, 1(11), 1994.

[KR87] B. Kahler and O. Risnes. Extending logging for database snapshot refresh. In VLDB, September 1987.

[Lea86] B.G. Lindsay et al. A snapshot differential refresh algorithm. In SIGMOD, May 1986.

[LGM95] W.J. Labio and H. Garcia-Molina. Comparing very large database snapshots. Technical Report STAN-CS-TN-95-27, Computer Science Department, Stanford University, June 1995.

[LGM96] W.J. Labio and H. Garcia-Molina. Efficient snapshot differential algorithms for data warehousing. Technical Report STAN-CS-TN-96-36, Computer Science Department, Stanford University, June 1996.

[Loh85] G.M. Lohman. Query processing in R*. In Query Processing in Database Systems, Berlin, West Germany, March 1985.

[ME92] P. Mishra and M. Eich. Join processing in relational databases. ACM Computing Surveys, 24(1), 1992.

[MW94] U. Manber and S. Wu. Glimpse: A tool to search through entire file systems. In Proceedings of the Winter USENIX Conference, January 1994.

[SGM95] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd Int. Conference on the Theory and Practice of Digital Libraries, Austin, Texas, June 1995.

[Sha86] L. Shapiro. Join processing in database systems with large main memories. ACM Transactions on Database Systems, 11(3), 1986.

[Squ95] C. Squire. Data extraction and transformation for the data warehouse. In SIGMOD, May 1995.

[Ull89] J.D. Ullman. Principles of Database and Knowledge-Base Systems. Computer Science Press, Rockville, MD, 1989.

[ZGMHW95] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom. View maintenance in a warehousing environment. In SIGMOD, May 1995.