
Distributed Database Systems

COP5711
Mini Project Suggestion 1 (1/4)
[Figure: a mobile sensor and its sensing range, with one mobile object within range and one undetected mobile object outside it.]
Mini Project Suggestion 1 (2/4)
Group Mobility Model
All mobile objects start in the same region
Mobile objects belong to k different mobile groups
Clients in each group have a group mobility
pattern, in which
the group leader follows the random way-point mobility
model, and
other group members move with their leader with
random local movement

Mini Project Suggestion 1 (3/4)
Two objectives
1. Provide full sensing coverage for all mobile objects
2. Prevent network partition among the mobile
sensors
[Figure: a network partition among the mobile sensors, labeled "Avoid this".]
Mini Project Suggestion 1 (4/4)
Presentation:
1. Explain your techniques
2. Present plots to discuss your simulation results
3. Demo - real-time visualization of the simulation
Documentation:
Submit a report on the distributed algorithms,
diagrams for illustration, and plots for simulation
results.
What is a Distributed Database System ?

A distributed database is a collection of databases which are distributed
over different computers of a computer network.
Each site has autonomous processing capability and can perform local
applications.

Each site also participates in the execution of at least one global
application which requires accessing data at several sites.
Multiprocessor Database Computers
[Figure: an application (front-end) computer connected to an interface processor (IFP), which is connected to several access processors (ACPs). Callout: cannot run an application by itself.]
What is missing here is the existence of local applications: the integration of the
system has reached the point where none of the computers (i.e., IFPs & ACPs)
is capable of executing an application by itself.
Why Distributed Databases ?
1. Local Autonomy: permits setting and enforcing local policies regarding
the use of local data (suitable for organizations that are inherently
decentralized).
2. Improved Performance: Regularly used data is kept close to its users,
and distributed systems offer inherent parallelism.
3. Improved Reliability/Availability:
Data replication can be used to obtain higher reliability and
availability.
The autonomous processing capability of the different sites
ensures a graceful degradation property.

4. Incremental Growth: supports a smooth incremental growth with a
minimum degree of impact on the already existing sites.
5. Shareability: allows preexisting sites to share data.
6. Reduced Communication Overhead: The fact that many applications
are local clearly reduces the communication overhead with respect to
centralized databases.
Disadvantages of DDBSs

Cost: replication of effort (manpower).
Security: More difficult to control
Complexity:
The possible duplication is mainly due to reliability and
efficiency considerations. Data redundancy, however,
complicates update operations.
If some sites fail while an update is being executed, the
system must make sure that the effects will be
reflected on the data residing at the failing sites as
soon as the system can recover from the failure.
The synchronization of transactions on multiple sites is
considerably harder than for a centralized system.
Distributed DBMS Architecture
Network Transparency
The user should be protected from the operational details of the network.
It is desirable to hide even the existence of the network, if possible.
Location transparency: The command used is independent of the system on which the data is stored.
Naming transparency: A unique name is provided for each object in the database.
Replication & Fragmentation Transparency
The user is unaware of the replication of fragments.
Queries are specified on the relations
(rather than the fragments).
[Figure: relation R is divided into fragments R1, R2, R3, and R4; copies of the fragments (for example, copies 1 and 2 of R1 and of R2) are stored at Site A, Site B, and Site C.]
ANSI/SPARC Architecture
[Figure: three external views (external schema) are defined on top of one conceptual view (conceptual schema), which maps to one internal view (internal schema).]
Internal view: deals with the physical definition and organization of data.
Conceptual view: an abstract definition of the database. It is the real-world
view of the enterprise being modeled in the database.
External view: an individual user's view of the database.
A Taxonomy of Distributed Data Systems
Distributed data systems
  Homogeneous
  Heterogeneous (Multidatabase)
    Unfederated (no local users)
    Federated
      Loosely coupled (interoperable DB systems using export schema)
      Tightly coupled (with global schema)

A distributed database can be defined as a logically integrated collection of shared data which is physically distributed across the nodes of a computer network.
Architecture of a Homogeneous DDBMS
[Figure: global user views 1 to n are defined on the global schema, which maps through the fragmentation schema and the allocation schema to local conceptual schemas 1 to n; each local conceptual schema maps to a local internal schema and a local DB.]
A homogeneous DDBMS resembles a centralized DB, but instead of storing all the data at one site, the data is distributed across a number of sites in a network.
Fragmentation Schema & Allocation Schema

Fragmentation Schema: describes how the global
relations are divided into fragments.

Allocation Schema: specifies at which sites each
fragment is stored.

Example: Fragmentation of global relation R.
[Figure: relation R divided into the fragments A, B, C, D, and E.]
To materialize R, the following operations are required:
$R = (A \bowtie B) \cup (C \bowtie D) \cup E$
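The materialization above can be sketched in a few lines of Python. This is only an illustration under assumptions: the attribute names (`id`, `x`, `y`), the sample rows, and the helper functions `join` and `union` are hypothetical, and the fragments A/B and C/D are assumed to be vertical fragments sharing the key `id`.

```python
def join(left, right, key):
    """Join two lists of dict rows on a shared key attribute."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def union(*fragments):
    """Union of horizontal fragments, dropping duplicate rows."""
    seen, result = set(), []
    for fragment in fragments:
        for row in fragment:
            marker = tuple(sorted(row.items()))
            if marker not in seen:
                seen.add(marker)
                result.append(row)
    return result


# Hypothetical fragments of a relation R(id, x, y).
A = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]
B = [{"id": 1, "y": 10}, {"id": 2, "y": 20}]
C = [{"id": 3, "x": "c"}]
D = [{"id": 3, "y": 30}]
E = [{"id": 4, "x": "d", "y": 40}]

# R = (A join B) union (C join D) union E
R = union(join(A, B, "id"), join(C, D, "id"), E)
```

The joins undo the vertical splits and the union undoes the horizontal split, which is the order of operations the fragmentation schema has to record.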
Homogeneous vs. Heterogeneous
Homogeneous DDBMS
No local users
Most systems do not have
local schemas (i.e., every user
uses the same schema)
Heterogeneous DDBMS
There are both local and
global users
Multidatabase systems are
split into:
Tightly Coupled Systems:
have a global schema
Loosely Coupled Systems:
do not have a global
schema.
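[Figure: a multidatabase management system sits on top of four DBMSs, each managing its own database (Database 1 to Database 4). A global user goes through the multidatabase management system, while local users access an individual DBMS directly.]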
Schema Architecture of a Tightly-Coupled System
[Figure: global user views 1 to n are defined on the global conceptual schema. Each site contributes a local participation schema and an auxiliary schema above its local conceptual schema, local internal schema, and local DB. Local user views are defined on each local conceptual schema.]
An individual node's participation in the multidatabase (MDB) is defined by means of a participation schema.
Auxiliary Schema (1)
The auxiliary schema describes the rules which govern the mappings between the local and global levels.
Rules for unit conversion: may be required when one site expresses distance in kilometers and another in miles.
Rules for handling null values: may be necessary where one site stores additional information which is not stored at another site.
Example: One site stores the name, home address, and telephone number of its employees, whereas another stores just names and addresses.
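As a rough illustration of such mapping rules, the sketch below converts local rows to a global format. Everything in it is assumed for the example: the site names, the attribute names (`dist_miles`, `distance_km`, `telephone`), and the rule table `AUXILIARY_RULES` are hypothetical and not part of any particular system.

```python
KM_PER_MILE = 1.609344  # international mile, exact by definition

# Hypothetical auxiliary rules for two sites.
AUXILIARY_RULES = {
    "site1": {"rename": {}, "distance_factor": 1.0},  # already in kilometers
    "site2": {"rename": {"dist_miles": "distance_km"},
              "distance_factor": KM_PER_MILE},
}
GLOBAL_ATTRIBUTES = ("name", "address", "telephone", "distance_km")


def to_global(site, local_row):
    """Map one local row to the global format using the site's rules."""
    rules = AUXILIARY_RULES[site]
    # Naming rule: rename local attributes to their global names.
    row = {rules["rename"].get(k, k): v for k, v in local_row.items()}
    # Unit conversion rule: express every distance in kilometers.
    if row.get("distance_km") is not None:
        row["distance_km"] = row["distance_km"] * rules["distance_factor"]
    # Null-value rule: attributes a site does not store become None.
    return {attr: row.get(attr) for attr in GLOBAL_ATTRIBUTES}


# site2 stores no telephone number and measures distance in miles.
example = to_global("site2", {"name": "Vu, K.", "address": "Orlando",
                              "dist_miles": 10})
```

Keeping the rules in a per-site table mirrors the idea that each participating database has its own auxiliary schema.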
Auxiliary Schema (2)
Rules for naming conflicts: naming conflicts occur when:
semantically identical data items are named differently
DNAME Department name (at Site 1)
DEPTNAME Department name (at Site 2)
semantically different data items are named identically.
NAME Department name (at Site 1)
NAME Manager name (at Site 2)
Rules for handling data representation conflicts: Such
conflicts occur when semantically identical data items
are represented differently in different data sources.
Example: Data represented as a character string in one
database may be represented as a real number in the other
database.
Auxiliary Schema (3)
Rules for handling data scaling conflicts: Such
conflicts occur when semantically identical
data items are stored in different databases
using different units of measure.
Example: Large, New, Good, etc.
These problems are called
domain mismatch problems.
Loosely-Coupled Systems
(Interoperable Database Systems)
[Figure: global user views 1 to 3 are defined directly over local conceptual schemas 1 to n, with no global schema in between. Each local conceptual schema maps to a local internal schema and a local DB; local user views 1 and 2 are defined on the local conceptual schemas.]
Loosely-Coupled Systems
[Figure: the same architecture with export schemas. Each local conceptual schema (1 to n) publishes an export schema (export schemas 1 to n), and global user views 1 to m are defined over the export schemas. Local user views 1 and 2 remain defined on the local conceptual schemas.]
Integration of Heterogeneous Data Models
1. Provide bidirectional translators between all pairs of models.
Advantage: supports multiple models at the global level; there is no need to learn another data model and language.
Disadvantage: requires n(n-1) translators, where n is the number of different models.
2. Adopt a single model (called the canonical model) at the global level and map all the local models onto this model.
Advantage: requires only 2n translators.
Disadvantage: translations must go through the global model.
(The second approach is more widely used.)
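The two translator counts can be compared with a short calculation. The function names below are hypothetical and exist only for this illustration.

```python
def pairwise_translators(n):
    """One translator per ordered pair of distinct models: n(n-1)."""
    return n * (n - 1)


def canonical_translators(n):
    """One translator to and one from the canonical model per local model: 2n."""
    return 2 * n


# Print both counts for a few numbers of data models.
for n in (2, 3, 5, 10):
    print(n, pairwise_translators(n), canonical_translators(n))
```

The two counts coincide at n = 3; beyond that the pairwise approach grows quadratically while the canonical approach grows linearly.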
Distributed Database Design
Top-Down Approach: The database system is
being designed from scratch.

Issues: fragmentation & allocation

Bottom-up Approach: Integrating existing
databases into one database

Issues: Design of the export and global
schemas.
TOP-DOWN DESIGN PROCESS
1. Requirements Analysis produces the System Requirements (Objectives).
2. Conceptual Design (entity analysis + functional analysis) and View Design (defining the interfaces for end users) are linked by view integration. They produce the Global Conceptual Schema, the Access Information, and the External Schema Definitions.
3. Distribution Design (fragmentation & allocation) produces the Local Conceptual Schemas.
4. Physical Design (maps the local conceptual schemas to physical storage devices) produces the Physical Schema.
Design Consideration (1)
The organization of distributed systems can be
investigated along three dimensions:

Level of sharing

1. No sharing: Each application and its data
execute at one site.
2. Data sharing: Programs are replicated at all
sites, but data files are not.
3. Data + Program Sharing: Both data and
programs may be shared.
Access Pattern
1. Static: Access patterns do not change.
2. Dynamic: Access patterns change over
time.
Level of Knowledge
1. No information
2. Partial information: Access patterns may
deviate from the predictions.
3. Complete information: Access patterns
can reasonably be predicted.
Design Consideration (2)
Fragmentation Alternatives
Relation J:
JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Paris

Horizontal Partitioning
Fragment J1:
JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York
Fragment J2:
JNO  JNAME          BUDGET   LOC
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Paris

Vertical Partitioning
Fragment (JNO, BUDGET):
JNO  BUDGET
J1   150,000
J2   135,000
J3   250,000
J4   350,000
Fragment (JNO, JNAME, LOC):
JNO  JNAME          LOC
J1   Instrumental   Montreal
J2   Database Dev.  New York
J3   CAD/CAM        New York
J4   Maintenance    Paris
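Both alternatives can be sketched on relation J. The helper names are hypothetical, and the selection predicate BUDGET <= 200,000 is an assumption chosen here because it separates the first two tuples from the last two; the slides do not state which predicate produced the horizontal fragments.

```python
J = [
    {"JNO": "J1", "JNAME": "Instrumental", "BUDGET": 150_000, "LOC": "Montreal"},
    {"JNO": "J2", "JNAME": "Database Dev.", "BUDGET": 135_000, "LOC": "New York"},
    {"JNO": "J3", "JNAME": "CAD/CAM", "BUDGET": 250_000, "LOC": "New York"},
    {"JNO": "J4", "JNAME": "Maintenance", "BUDGET": 350_000, "LOC": "Paris"},
]


def horizontal_fragment(relation, predicate):
    """Selection: keep whole tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def vertical_fragment(relation, key, attributes):
    """Projection: keep the key plus the listed attributes of every tuple."""
    return [{name: row[name] for name in (key, *attributes)} for row in relation]


def rejoin(left, right, key):
    """Rebuild tuples from two vertical fragments that share the key."""
    by_key = {row[key]: row for row in right}
    return [{**row, **by_key[row[key]]} for row in left]


low = horizontal_fragment(J, lambda row: row["BUDGET"] <= 200_000)
high = horizontal_fragment(J, lambda row: row["BUDGET"] > 200_000)
budgets = vertical_fragment(J, "JNO", ["BUDGET"])
details = vertical_fragment(J, "JNO", ["JNAME", "LOC"])
```

Horizontal fragments are recombined by union (`low + high`), while vertical fragments are recombined by a join on the key, which is why every vertical fragment repeats JNO.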
Why fragment at all?
Reasons:
Interquery concurrency
Intraquery concurrency

Disadvantages:
Vertical fragmentation may incur overhead.
Attributes participating in a dependency may be
allocated to different sites.
Integrity checking is more costly.
Degree of Fragmentation
Application views are usually subsets of
relations. Hence, it is only natural to
consider subsets of relations as
distribution units.

The appropriate degree of fragmentation
is dependent on the applications.
Correctness Rules
Vertical Partitioning
Lossless
decomposition
Dependency
preservation

Horizontal Partitioning
Disjoint fragments


Allocation Alternatives

Partitioning: No replication
Partial Replication: Some
fragments are replicated
Full Replication: Database
exists in its entirety at
each site
Notations
S (TITLE, SAL)
E (ENO, ENAME, TITLE)
J (JNO, JNAME, BUDGET, LOC)
G (ENO, JNO, RESP, DUR)
[Figure: the four relations connected by the links L1, L2, and L3; L1 leads from S to E.]
L1: 1-to-many relationship
S: Owner(L1), source relation
E: Member(L1), target relation
Simple Predicates
Given a relation $R(A_1, A_2, \ldots, A_n)$ where $A_i$ has domain $D_i$, a simple predicate $p_j$ defined on R has the form
$p_j : A_i \;\theta\; \mathit{Value}$
where $\theta \in \{=, <, \neq, \leq, >, \geq\}$ and $\mathit{Value} \in D_i$.
Example:
Relation J:
JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Orlando
Simple predicates:
p1: JNAME = "Maintenance"
p2: BUDGET < 200,000
Note: A simple predicate defines a data fragment.
MINTERM PREDICATE
Given a set of simple predicates for relation R,
$P = \{p_1, p_2, \ldots, p_m\}$,
the set of minterm predicates
$M = \{m_1, m_2, \ldots, m_n\}$
is defined as
$M = \{\, m_i \mid m_i = \bigwedge_{p_j \in P} p_j^{*} \,\}$
where $p_j^{*} = p_j$ or $p_j^{*} = \neg p_j$.
TITLE         SAL
Elect. Eng.   40,000
Syst. Analy.  54,000
Mech. Eng.    32,000
Programmer    42,000
Possible simple predicates:
p1: TITLE = "Elect. Eng."
p2: TITLE = "Syst. Analy."
p3: TITLE = "Mech. Eng."
p4: TITLE = "Programmer"
p5: SAL <= 35,000
p6: SAL > 35,000
Some corresponding minterm predicates:
$m_1: \mathrm{TITLE} = \text{"Elect. Eng."} \wedge \mathrm{SAL} \leq 35{,}000$
$m_2: \mathrm{TITLE} = \text{"Elect. Eng."} \wedge \mathrm{SAL} > 35{,}000$
A minterm predicate defines a data fragment.
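The step from simple predicates to minterm fragments can be sketched as follows. The sketch reuses relation J and the two simple predicates from the earlier example (JNAME = "Maintenance" and BUDGET < 200,000); the function names and the dictionary representation of a minterm are assumptions made for illustration.

```python
from itertools import product

J = [
    {"JNO": "J1", "JNAME": "Instrumental", "BUDGET": 150_000, "LOC": "Montreal"},
    {"JNO": "J2", "JNAME": "Database Dev.", "BUDGET": 135_000, "LOC": "New York"},
    {"JNO": "J3", "JNAME": "CAD/CAM", "BUDGET": 250_000, "LOC": "New York"},
    {"JNO": "J4", "JNAME": "Maintenance", "BUDGET": 350_000, "LOC": "Orlando"},
]

SIMPLE_PREDICATES = {
    "p1": lambda row: row["JNAME"] == "Maintenance",
    "p2": lambda row: row["BUDGET"] < 200_000,
}


def minterms(simple_predicates):
    """Yield (signs, test) for every conjunction of plain or negated predicates."""
    names = list(simple_predicates)
    for signs in product((True, False), repeat=len(names)):
        def test(row, signs=signs):
            # A False sign means the predicate appears negated in the minterm.
            return all(simple_predicates[name](row) == sign
                       for name, sign in zip(names, signs))
        yield dict(zip(names, signs)), test


def minterm_fragments(relation, simple_predicates):
    """Return the non-empty fragments, one per satisfiable minterm."""
    fragments = []
    for signs, test in minterms(simple_predicates):
        rows = [row for row in relation if test(row)]
        if rows:
            fragments.append((signs, rows))
    return fragments


fragments = minterm_fragments(J, SIMPLE_PREDICATES)
```

With m simple predicates there are 2^m candidate minterms, so in practice many of them are eliminated before any fragment is materialized.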
Primary Horizontal Fragmentation
A primary horizontal fragmentation is defined by a selection
operation on the owner relations of a database schema.
E (ENO, ENAME, TITLE)
J (JNO, JNAME, BUDGET, LOC)
G (ENO, JNO, RESP, DUR)
[Figure: links L2 and L3 between the relations.]
Owner(L3) = J
A possible fragmentation of J is defined as follows:
$J_1 = \sigma_{\mathrm{BUDGET} \leq 200{,}000}(J)$
$J_2 = \sigma_{\mathrm{BUDGET} > 200{,}000}(J)$

Horizontal Fragments


Thus, a horizontal fragment $R_i$ of relation R consists of all the tuples of R that satisfy a minterm predicate $m_i$.

There are as many horizontal fragments
(also called minterm fragments) as there are
minterm predicates.
Completeness (1)
A set of simple predicates Pr is said to be complete if and only
if there is an equal probability of access by every application
to any two tuples belonging to any minterm fragment that is
defined according to Pr.
[Figure: three columns labeled Simple Predicates, Minterm Fragments (F1, F2, F3), and Applications. Caption: Complete; the fragments look homogeneous.]
Completeness (2)
[Figure: the same three columns (Simple Predicates, Minterm Fragments, Applications), now with additional accesses labeled p4 and p5 that divide fragment F3 into F31 and F32. Caption: The set of simple predicates is incomplete.]
Completeness (3)
[Figure: the same three columns after an additional simple predicate (A5 > k5) has been added to the set. Caption: Now complete!]
Completeness (4)
A set of simple predicate Pr is said to be complete if and only
if there is an equal probability of access by every application
to any two tuples belonging to any minterm fragment that is
defined according to Pr.
Case 1: The only application that accesses J wants to access the tuples according to the location.

The set of simple predicates
Pr = {LOC="Montreal", LOC="New York", LOC="Orlando"}
is complete because each tuple of each fragment has the same probability of being accessed.

$J_1 = \sigma_{\mathrm{LOC} = \text{"Montreal"}}(J)$
$J_2 = \sigma_{\mathrm{LOC} = \text{"New York"}}(J)$
$J_3 = \sigma_{\mathrm{LOC} = \text{"Orlando"}}(J)$
[Figure: J is split into J1 (LOC=Montreal), J2 (LOC=New York), and J3 (LOC=Orlando).]
Completeness (5)
Example:
Fragment J1:
JNO  JNAME          BUDGET   LOC
001  Instrumental   150,000  Montreal
Fragment J2:
JNO  JNAME          BUDGET   LOC
004  GUI            135,000  New York
007  CAD/CAM        250,000  New York
Fragment J3:
JNO  JNAME          BUDGET   LOC
003  Database Dev.  310,000  Orlando
Case 2: There is a second application which accesses only those
project tuples where the budget is less than $200,000.
Since tuple 004 is accessed more frequently than tuple
007, Pr = {LOC=Montreal, LOC=New York, LOC=Orlando} is not complete.
To make the set complete, we need to add
(BUDGET< 200,000) to Pr.
[Figure: fragmentation tree. J is split into J1 (LOC=Montreal), J2 (LOC=New York), and J3 (LOC=Orlando); each of these is split again into a BUDGET<=200,000 fragment and a BUDGET>200,000 fragment (J11/J12, J21/J22, J31/J32).]
Completeness (6)
[Figure label: Small-budget applications]
Note: Completeness is a desirable property because a complete set defines fragments that are not only logically uniform, in that they all satisfy the minterm predicate, but also statistically homogeneous.
Redundant Fragmentation
Fragments 1 and 2 have the same characteristics.
The fragmentation is unnecessary.
[Figure: a logically uniform and statistically homogeneous fragment divided into Fragment 1 and Fragment 2.]
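Minimality
Relevant:
Let $m_i$ and $m_j$ be two almost identical minterm predicates:
$m_i = p_1 \wedge p_2 \wedge p_3$, defining fragment $f_i$
$m_j = p_1 \wedge \neg p_2 \wedge p_3$, defining fragment $f_j$
$p_2$ is relevant if and only if
$\frac{acc(m_i)}{card(f_i)} \neq \frac{acc(m_j)}{card(f_j)}$
where acc is the access frequency and card is the cardinality.
[Figure: a fragment f is narrowed by p1 and p3 and then divided by p2 into f_i and f_j, which an application A accesses with probabilities Prob1 and Prob2.]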
That is, there should be at least one application that accesses $f_i$ and $f_j$ differently;
i.e., the simple predicate $p_i$ should be relevant in determining a fragmentation.
Minimal: If all the predicates of a set Pr are relevant, Pr is minimal.
A Complete and Minimal Example
Two applications:
1. One application accesses the tuples according
to location.
2. Another application accesses only those project
tuples where the budget is less than $200,000.


Case 1: Pr={Loc=Montreal, Loc=New York, Loc=Orlando,
BUDGET<=200,000,BUDGET>200,000} is
complete and minimal.

Case 2: If, however, we were to add the predicate
JNAME= Instrumentation to Pr, the resulting
set would not be minimal since the new predicate
is not relevant with respect to the applications.
[Figure: fragmentation tree for J. J is split by location into J1 (LOC=Montreal), J2 (LOC=New York), and J3 (LOC=Orlando); each is split into a BUDGET<=200,000 fragment and a BUDGET>200,000 fragment (J11/J12, J21/J22, J31/J32), and this split is marked Relevant. A further split of J12 into J121 (JNAME = Instrument) and J122 (JNAME != Instrument) is marked Irrelevant.]
[ JNAME = Instrument ] is not relevant.
Application Information
Qualitative Information
The fundamental qualitative information consists of the predicates used in user queries (i.e., WHERE clauses in SQL).
80/20 rule: 20% of user queries account for 80% of the total data access.
One should investigate the more important queries.
Quantitative Information
Minterm selectivity sel($m_i$): the number of tuples that would be accessed by a query specified according to a given minterm predicate.
Access frequency acc($q_i$): the access frequency of queries in a given period.
Qualitative information guides the fragmentation activity.
Quantitative information guides the allocation activity.
Determine the set of meaningful minterm predicates
Applications:
Take the salary and determine a raise accordingly.
The employee records are managed in two places, one handling the
records of those with salary less than or equal to $30,000 and the other
handling the records of those who earn more than $30,000.
Pr={p1: SAL<=30,000, p2: SAL>30,000} is complete and minimal.

The minterm predicates:
$m_1: (\mathrm{SAL} \leq 30{,}000) \wedge (\mathrm{SAL} > 30{,}000)$
$m_2: (\mathrm{SAL} \leq 30{,}000) \wedge \neg(\mathrm{SAL} > 30{,}000)$
$m_3: \neg(\mathrm{SAL} \leq 30{,}000) \wedge (\mathrm{SAL} > 30{,}000)$
$m_4: \neg(\mathrm{SAL} \leq 30{,}000) \wedge \neg(\mathrm{SAL} > 30{,}000)$
Implications:
$i_1: (\mathrm{SAL} \leq 30{,}000) \Rightarrow \neg(\mathrm{SAL} > 30{,}000)$
$i_2: \neg(\mathrm{SAL} \leq 30{,}000) \Rightarrow (\mathrm{SAL} > 30{,}000)$
$i_3: (\mathrm{SAL} > 30{,}000) \Rightarrow \neg(\mathrm{SAL} \leq 30{,}000)$
$i_4: \neg(\mathrm{SAL} > 30{,}000) \Rightarrow (\mathrm{SAL} \leq 30{,}000)$
By $i_1$, $m_1$ is contradictory.
By $i_2$, $m_4$ is contradictory.

Therefore, we are left with
$M = \{m_2, m_3\}$
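The elimination of contradictory minterms can be sketched for the salary example with p1: SAL<=30,000 and p2: SAL>30,000. The representation is an assumption made for illustration: a minterm is a dictionary mapping each simple predicate to True (plain) or False (negated), and an implication is a pair of such partial dictionaries.

```python
from itertools import product


def all_minterms(names):
    """Every assignment of plain (True) or negated (False) to the predicates."""
    return [dict(zip(names, signs))
            for signs in product((True, False), repeat=len(names))]


def violates(minterm, implication):
    """A minterm is contradictory if it asserts the antecedent but not the consequent."""
    antecedent, consequent = implication
    asserted = all(minterm[name] == sign for name, sign in antecedent.items())
    followed = all(minterm[name] == sign for name, sign in consequent.items())
    return asserted and not followed


def prune(names, implications):
    """Keep only the minterms that violate no implication."""
    return [m for m in all_minterms(names)
            if not any(violates(m, imp) for imp in implications)]


# Implications between p1 and p2, each written as (antecedent, consequent).
IMPLICATIONS = [
    ({"p1": True}, {"p2": False}),
    ({"p1": False}, {"p2": True}),
    ({"p2": True}, {"p1": False}),
    ({"p2": False}, {"p1": True}),
]

remaining = prune(["p1", "p2"], IMPLICATIONS)
```

Only the two minterms in which exactly one of p1 and p2 holds survive, which corresponds to the two salary ranges handled at the two places.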
Invalid Implications
Relation J:
JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Orlando
Simple predicates:
p1: LOC = "Montreal"
p2: LOC = "New York"
p3: LOC = "Orlando"
p4: BUDGET <= 200,000
p5: BUDGET > 200,000
VALID Implications:
$i_1: p_1 \Rightarrow \neg p_2 \wedge \neg p_3$
$i_2: p_2 \Rightarrow \neg p_1 \wedge \neg p_3$
$i_3: p_3 \Rightarrow \neg p_1 \wedge \neg p_2$
$i_4: p_4 \Rightarrow \neg p_5$
$i_5: p_5 \Rightarrow \neg p_4$
$i_6: \neg p_4 \Rightarrow p_5$
$i_7: \neg p_5 \Rightarrow p_4$
INVALID Implications:
$i_8: \mathrm{LOC} = \text{"Montreal"} \Rightarrow \neg(\mathrm{BUDGET} > 200{,}000)$
$i_9: \mathrm{LOC} = \text{"Orlando"} \Rightarrow \neg(\mathrm{BUDGET} \leq 200{,}000)$
Implications should be defined according to the semantics of the database, not according to the current values.
Compute Complete & Minimal Set
Rule: a relation or fragment is partitioned into at least two parts which are
accessed differently by at least one application.
Relevant: a simple predicate which satisfies the above rule is relevant.
Repeat until the predicate set is complete:
    Find a simple predicate $p_i$ that is relevant
    Determine minterm fragments $f_i$ and $f_j$ according to $p_i$
    Accept $p_i$, $f_i$, and $f_j$
    Remove any $p_k$ and $f_k$ from the acceptance list if $p_k$ becomes irrelevant /* the list is minimal */
Determine the set of minterm predicates M (using the acceptance list)
Determine the set of implications I (among the acceptance list)
For each $m_i$ in M, remove $m_i$ if it is contradictory according to I
Derived Horizontal Fragmentation
Derived fragmentation is used to facilitate the
join between fragments.


In some cases, the horizontal fragmentation of a
relation cannot be based on a property of its own
attributes, but is derived from the horizontal
fragmentation of another relation.
PAY (TITLE, SAL)
EMP (ENO, ENAME, TITLE)
$PAY_1 = \sigma_{\mathrm{TITLE} = \text{"Assistant Professor"}}(PAY)$
$PAY_2 = \sigma_{\mathrm{TITLE} = \text{"Associate Professor"}}(PAY)$
$PAY_3 = \sigma_{\mathrm{TITLE} = \text{"Full Professor"}}(PAY)$
Not using derived fragmentation: one can divide EMP into EMP1
and EMP2 based on TITLE and divide PAY into PAY1, PAY2, PAY3
based on SAL. To join EMP and PAY, we have the following
scenarios.
[Figure: the EMP fragments may each have to be joined with PAY1, PAY2, and PAY3. Callout: More communication overhead!]
Benefits of Derived Fragmentation
Primary Fragmentation: PAY1, PAY2, PAY3
Using Derived Fragmentation:
$EMP_1 = EMP \ltimes PAY_1$
$EMP_2 = EMP \ltimes PAY_2$
$EMP_3 = EMP \ltimes PAY_3$
EMPi and PAYi can be allocated to the same site.
[Figure: the pairs (EMP1, PAY1), (EMP2, PAY2), and (EMP3, PAY3).]
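A derived fragmentation of EMP by semijoin with the PAY fragments might look like the sketch below. The tuples are invented for illustration (the employee names and salaries are hypothetical), and `select` and `semijoin` are hypothetical helper functions rather than operators of any particular DBMS.

```python
# Hypothetical sample data for PAY(TITLE, SAL) and EMP(ENO, ENAME, TITLE).
PAY = [
    {"TITLE": "Assistant Professor", "SAL": 60_000},
    {"TITLE": "Associate Professor", "SAL": 75_000},
    {"TITLE": "Full Professor", "SAL": 95_000},
]
EMP = [
    {"ENO": "E1", "ENAME": "A. Lee", "TITLE": "Full Professor"},
    {"ENO": "E2", "ENAME": "B. Cruz", "TITLE": "Assistant Professor"},
    {"ENO": "E3", "ENAME": "C. Shah", "TITLE": "Associate Professor"},
    {"ENO": "E4", "ENAME": "D. Kim", "TITLE": "Assistant Professor"},
]


def select(relation, predicate):
    """Selection used for the primary fragmentation of the owner relation."""
    return [row for row in relation if predicate(row)]


def semijoin(left, right, attribute):
    """Tuples of `left` that have a join partner in `right` on `attribute`."""
    values = {row[attribute] for row in right}
    return [row for row in left if row[attribute] in values]


TITLES = ["Assistant Professor", "Associate Professor", "Full Professor"]
# Primary fragmentation of PAY by TITLE.
pay_fragments = [select(PAY, lambda row, t=t: row["TITLE"] == t) for t in TITLES]
# Derived fragmentation: EMP_i = EMP semijoin PAY_i.
emp_fragments = [semijoin(EMP, pay_i, "TITLE") for pay_i in pay_fragments]
```

Because every tuple of an EMP fragment joins only with the PAY fragment of the same index, placing the two at the same site lets the join run locally.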
Chain Relationships
R1 (R1PK, ...)
R2 (R2PK, R1FK, ...)
R3 (R3PK, R2FK, ...)
. . .
Design the primary fragmentation for R1.
Derive the derived fragmentation for Rk as follows:
$R_k = R_k \ltimes_{R_k.\mathrm{FK} = R_{k-1}.\mathrm{PK}} R_{k-1}$
for $2 \leq k \leq n$, in that order.
Derived Fragmentation
How do we fragment EMP_PROJ ?
Semi-Join with EMP, or
Semi-Join with PROJ
Criterion: Support the more frequent join operation.
EMP (ENO, ENAME, TITLE) PROJ (PNO, PNAME, BUDGET)


EMP_PROJ (ENO, PNO, RESP, DUR)
Join might
be required
VERTICAL FRAGMENTATION
Purpose: Identify fragments Ri such that
many applications can be executed using
just one fragment.
Advantage: When many applications which
use R1 and many applications which use R2
are issued at different sites, fragmenting
R avoids communication overhead.
Vertical partitioning is more complicated than horizontal partitioning:
Vertical Partitioning: The number of possible fragments is equal to $m^m$, where m is the number of nonprimary key attributes.
Horizontal Partitioning: $2^n$ possible minterm predicates can be defined, where n is the number of simple predicates in the complete and minimal set Pr.
[Figure: relation R with attributes A1 to A7 is split into vertical fragments R1 and R2, placed at Site 1 and Site 2.]
Vertical Fragmentation Approaches
Greedy Heuristic Approaches:
Split Approach: Global relations are progressively split into fragments.
Grouping Approach: Attributes are progressively aggregated to constitute fragments.
Correctness:
Each attribute of R belongs to at least one fragment.
Each fragment includes either a key of R or a tuple identifier.
Vertical Clustering - Replication
Example: EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM)
Bad Fragmentation: NAME not available in EMP2
1. EMP1(ENUM,NAME,TAX,SAL)
2. EMP2(ENUM,MGRNUM,DNUM)
Good Fragmentation:
1. EMP1(ENUM, NAME, TAX, SAL)
2. EMP2(ENUM, NAME, MGRNUM, DNUM)
In evaluating the convenience of vertical
clustering, it is important that overlapping
attributes are not heavily updated.
[Figure: the two fragments, one used by administrative applications at Site 1 and the other by applications at all sites. Callout: NAME is relatively stable.]
Split Approach
The split approach involves three steps:
1. Obtain the attribute affinity matrix.
2. Use a clustering algorithm to group some attributes together based on the attribute affinity matrix. This algorithm produces a clustered affinity matrix.
3. Use a partitioning algorithm to partition the attributes such that sets of attributes are accessed solely, or for the most part, by distinct sets of applications.
Splitting is considered only for attributes that do not participate in the primary key.
PROJ (PNO, PNAME, BUDGET, LOC), with the attributes denoted A1, A2, A3, A4.

q1: SELECT BUDGET
    FROM PROJ
    WHERE PNO=Value;

q2: SELECT PNAME, BUDGET
    FROM PROJ;

q3: SELECT PNAME
    FROM PROJ
    WHERE LOC=Value;

q4: SELECT SUM(BUDGET)
    FROM PROJ
    WHERE LOC=Value;

Attribute Usage Matrix
use($q_i$, $A_j$) = 1 if $A_j$ is referenced by $q_i$, 0 otherwise

      A1  A2  A3  A4
q1     1   1   0   0
q2     1   0   1   0
q3     0   1   1   0
q4     0   1   0   1
Attribute Affinity Measure
[Figure: relation R with attributes $A_i$, $A_j$, and $A_k$; queries $q_i$ and $q_k$ are issued from Site m, Site n, and Site s, each with its own $ref_s(q_k)$ and $acc_s(q_k)$.]
$ref_s(q_k)$: number of accesses to attributes $(A_i, A_j)$ for each execution of $q_k$ at site s.
$acc_s(q_k)$: application access frequency of $q_k$ at site s.
$aff(A_i, A_j) = \sum_{k \,\mid\, use(q_k, A_i) = 1 \,\wedge\, use(q_k, A_j) = 1} \; \sum_{s} ref_s(q_k) \cdot acc_s(q_k)$
The outer sum ranges over each query $q_k$ that uses both $A_i$ and $A_j$; the inner sum gives the popularity of such an $A_i$-$A_j$ pair at all sites. The result measures the popularity of using $A_i$ and $A_j$ together.
[Figure: the attribute affinity matrix, with rows and columns A1 to A4.]
Attribute Affinity Matrix Example
Each entry, such as aff($A_2$, $A_3$), is computed as
$aff(A_i, A_j) = \sum_{k \,\mid\, use(q_k, A_i) = 1 \,\wedge\, use(q_k, A_j) = 1} \; \sum_{s} ref_s(q_k) \cdot acc_s(q_k)$
where $ref_s(q_k)$ is the number of accesses to attributes $(A_i, A_j)$ for each execution of $q_k$ at site s, and $acc_s(q_k)$ is the application access frequency of $q_k$ at site s.

Attribute Usage Matrix
      A1  A2  A3  A4
q1     1   1   0   0
q2     1   0   1   0
q3     0   1   1   0
q4     0   1   0   1

Attribute Affinity Matrix (AA)
      A1  A2  A3  A4
A1    78   3  75   0
A2     3  53   5  45
A3    75   5  80   0
A4     0  45   0  45
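The affinity computation can be sketched as below. The usage matrix is the one printed above, row by row. The slides do not list the individual $ref_s(q_k)$ and $acc_s(q_k)$ values, so the per-row totals in `TOTAL_ACCESSES` are an assumption: they stand for the sum over all sites of ref times acc, and were chosen so that the result equals the AA matrix shown above.

```python
# Attribute usage matrix as printed above (rows: queries, columns: A1..A4).
USE = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]

# Assumed totals: sum over all sites of ref_s(q_k) * acc_s(q_k) for each row.
TOTAL_ACCESSES = [3, 75, 5, 45]


def affinity_matrix(use, total_accesses):
    """aff(Ai, Aj): total accesses of the queries that use both Ai and Aj."""
    n = len(use[0])
    aa = [[0] * n for _ in range(n)]
    for row, weight in zip(use, total_accesses):
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aa[i][j] += weight
    return aa


AA = affinity_matrix(USE, TOTAL_ACCESSES)
```

The matrix is symmetric by construction, and each diagonal entry is the total access count of all queries that touch that attribute.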
Next Step - Determine clustered affinity (CA) matrix
Clustered Affinity Matrix
Step 1: Initialize CA
Copy the first 2 columns of the attribute affinity matrix (AA) into the clustered affinity matrix (CA).
[Figure: the AA matrix next to the partially filled CA matrix.]
Clustered Affinity Matrix
Step 2: Determine Location for A3
There are 3 possible positions for A3: [A3, A1, A2], [A1, A3, A2], and [A1, A2, A3].
[Figure: the AA matrix next to the CA matrix, with the three candidate positions for A3 marked.]
Clustered Affinity Matrix
Step 2: Determine the order for A3
$bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x)\, aff(A_z, A_y)$
Contribution:
$cont(A_i, A_k, A_j) = 2\,bond(A_i, A_k) + 2\,bond(A_k, A_j) - 2\,bond(A_i, A_j)$
Cont(A0,A3,A1) = 8820, Cont(A1,A3,A2) = 10150, Cont(A2,A3,A4) = 1780
Since Cont(A1,A3,A2) is the greatest, [A1,A3,A2] is the best order.
Note: aff(A0,Ai) = aff(Ai,A0) = aff(A5,Ai) = aff(Ai,A5) = 0 by definition.
[Figure: the AA matrix next to the CA matrix with A3 placed between A1 and A2.]
Clustered Affinity Matrix
Step 2: Determine the order for A4
Since Cont(A3,A2,A4) is the biggest, [A3,A2,A4] is the best order.

Clustered Affinity Matrix (CA), columns in the order A1, A3, A2, A4:
      A1  A3  A2  A4
A1    78  75   3   0
A2     3   5  53  45
A3    75  80   5   0
A4     0   0  45  45
Clustered Affinity Matrix
Step 3: Re-order the Rows
The rows are organized in the same order as the columns.

Clustered Affinity Matrix (CA)
      A1  A3  A2  A4
A1    78  75   3   0
A3    75  80   5   0
A2     3   5  53  45
A4     0   0  45  45
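Steps 1 to 3 can be sketched as a small program. This clustering procedure is known in the literature as the bond energy algorithm; the code is an illustrative sketch, not the slides' own implementation. It takes the AA matrix shown earlier as input, treats the empty boundary columns (A0 and A5 in the note above) as `None` with a bond of 0, and its function names are hypothetical.

```python
AA = [
    [78, 3, 75, 0],
    [3, 53, 5, 45],
    [75, 5, 80, 0],
    [0, 45, 0, 45],
]


def bond(aa, x, y):
    """bond(Ax, Ay): sum over z of aff(Az, Ax) * aff(Az, Ay); 0 at a boundary."""
    if x is None or y is None:
        return 0
    return sum(row[x] * row[y] for row in aa)


def cont(aa, i, k, j):
    """Contribution of placing column k between columns i and j."""
    return 2 * bond(aa, i, k) + 2 * bond(aa, k, j) - 2 * bond(aa, i, j)


def cluster_order(aa):
    """Start with the first two columns, then insert each remaining column
    at the position with the largest contribution."""
    order = [0, 1]
    for k in range(2, len(aa)):
        padded = [None] + order + [None]
        best = max(range(len(order) + 1),
                   key=lambda pos: cont(aa, padded[pos], k, padded[pos + 1]))
        order.insert(best, k)
    return order


def reorder(aa, order):
    """Arrange both the columns and the rows in the given order."""
    return [[aa[r][c] for c in order] for r in order]


order = cluster_order(AA)   # zero-based column indices
CA = reorder(AA, order)
```

The returned indices are zero-based, so index 0 stands for A1, index 2 for A3, and so on. The absolute contribution values depend on the affinity matrix supplied; the sketch reports only the resulting order and the clustered matrix.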
Partitioning
Find the sets of attributes that are accessed, for the most part, by distinct sets of applications.

We look for dividing points along the diagonal of the clustered affinity matrix such that
total accesses to only one fragment are maximized, while
total accesses to more than one fragment are minimized.
Cluster 1: A1 & A3
Cluster 2: A2 & A4
Two vertical fragments: PROJ1(A1, A3) and PROJ2(A2, A4)
[Figure: the CA matrix before and after the rows are re-ordered, with callouts: A4 and A3 are usually not accessed together; A4 and A2 are often accessed together; bad grouping since A1 and A2 are never accessed together.]
MIXED FRAGMENTATION
Apply horizontal fragmentation to vertical fragments.
Apply vertical fragmentation to horizontal fragments.

Example: Applications about work at each department reference tuples
of employees in the departments located around the site with 80%
probability.
EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM)
[Figure: vertical fragmentation splits EMP into (ENUM, NAME, TAX, SAL) and (ENUM, NAME, MGRNUM, DNUM); horizontal fragmentation (local work) then splits by site: Jacksonville, Orlando, and Miami.]
ALLOCATION
Notations
i: fragment index
j: site index
k: application index
$f_{kj}$: the frequency of application k at site j
$r_{ki}$: the number of retrieval references of application k to fragment i
$u_{ki}$: the number of update references of application k to fragment i
$n_{ki} = r_{ki} + u_{ki}$
[Figure: application k, running at site j with frequency $f_{kj}$, issues $r_{ki}$ retrieval references and $u_{ki}$ update references to fragment i.]
Allocation of Horizontal Fragments (1)
No replication: Best Fit Strategy
The number of local references of $R_i$ at site j is
$B_{ij} = \sum_{k} f_{kj}\, n_{ki}$
where the sum ranges over all applications k at site j, $f_{kj}$ is the frequency of application k, $n_{ki}$ is the number of accesses by k, and $B_{ij}$ is the benefit to site j.
$R_i$ is allocated at site j* such that $B_{ij*}$ is maximum.
Advantage: A fragment is allocated to a site that needs it most.
Disadvantage: It disregards the mutual effect of placing a fragment at a given site if a related fragment is also at that site.
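A minimal sketch of the best fit strategy follows. The matrices `f` and `n` hold invented numbers for two applications, two sites, and two fragments; the function name `best_fit` and the zero-based indices are assumptions made for illustration.

```python
def best_fit(f, n):
    """Allocate each fragment to the single site with the most local references.

    f[k][j]: frequency of application k at site j
    n[k][i]: number of references (retrievals + updates) of application k
             to fragment i
    Returns {fragment index: site index}.
    """
    apps, sites, fragments = len(f), len(f[0]), len(n[0])
    allocation = {}
    for i in range(fragments):
        # B_ij = sum over k of f_kj * n_ki
        benefit = [sum(f[k][j] * n[k][i] for k in range(apps))
                   for j in range(sites)]
        allocation[i] = max(range(sites), key=benefit.__getitem__)
    return allocation


# Hypothetical workload: 2 applications, 2 sites, 2 fragments.
f = [[10, 0],
     [2, 8]]
n = [[5, 0],
     [1, 4]]
allocation = best_fit(f, n)
```

Each fragment is considered in isolation, which is exactly the limitation noted above: the placement of related fragments is not taken into account.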
Allocation of Horizontal Fragments (2)
All beneficial sites approach (replication)
$B_{ij} = \sum_{k} f_{kj}\, r_{ki} \;-\; c \sum_{k} \sum_{j' \neq j} f_{kj'}\, u_{ki}$
The first term is the savings due to retrieval references; the second term is the cost of update references from other sites.
$R_i$ is allocated at all sites j* such that $B_{ij*} > 0$.
When all $B_{ij}$ are negative, a single copy of $R_i$ is placed at the site such that $B_{ij*}$ is maximum.
[Figure: fragment i and site j.]
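The all beneficial sites rule can be sketched in the same style. The numbers are invented, the function name is hypothetical, and `c` is simply the constant from the formula, assumed here to weight update cost against retrieval savings.

```python
def all_beneficial_sites(f, r, u, c):
    """Replicate each fragment at every site where the benefit is positive.

    f[k][j]: frequency of application k at site j
    r[k][i], u[k][i]: retrieval and update references of application k
                      to fragment i
    c: constant weighting the update term of the formula (assumed)
    Returns {fragment index: list of site indices}.
    """
    apps, sites, fragments = len(f), len(f[0]), len(r[0])
    allocation = {}
    for i in range(fragments):
        benefit = []
        for j in range(sites):
            savings = sum(f[k][j] * r[k][i] for k in range(apps))
            update_cost = sum(f[k][other] * u[k][i]
                              for k in range(apps)
                              for other in range(sites) if other != j)
            benefit.append(savings - c * update_cost)
        chosen = [j for j in range(sites) if benefit[j] > 0]
        if not chosen:
            # No site is beneficial: keep a single copy at the best site.
            chosen = [max(range(sites), key=benefit.__getitem__)]
        allocation[i] = chosen
    return allocation


# Hypothetical workload: 2 applications, 2 sites, 2 fragments.
f = [[10, 0], [2, 8]]
r = [[5, 0], [1, 4]]
u = [[0, 0], [0, 1]]
replicated = all_beneficial_sites(f, r, u, c=1)
```

A fragment that is read from several sites but rarely updated ends up replicated, while an update-heavy fragment collapses to a single copy.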
Allocation of Horizontal Fragments (3)
Another Replication Approach:
$d_i$: the degree of redundancy of $R_i$.
$F_i$: the reliability and availability benefit of having $R_i$ fully replicated.
$\beta(d_i)$: the reliability and availability benefit when the fragment has $d_i$ copies.
$\beta(d_i) = (1 - 2^{1 - d_i})\, F_i$, so that $\beta(1) = 0$, $\beta(2) = \frac{F_i}{2}$, $\beta(3) = \frac{3 F_i}{4}$, ...
The benefit of introducing a new copy of $R_i$ at site j:
$B_{ij} = \sum_{k} f_{kj}\, r_{ki} \;-\; c \sum_{k} \sum_{j' \neq j} f_{kj'}\, u_{ki} \;+\; \beta(d_i)$
The first two terms are the same as in the all beneficial sites approach; the last term also takes into account the benefit of availability.
[Figure: plot of $\beta(d_i)$ against $d_i$, with $F_i$ marked on the vertical axis.]
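The availability term can be evaluated directly. The function name and the sample value chosen for the full-replication benefit are assumptions for illustration.

```python
def availability_benefit(copies, full_benefit):
    """beta(d) = (1 - 2**(1 - d)) * F for a fragment with d copies."""
    return (1 - 2 ** (1 - copies)) * full_benefit


FULL_BENEFIT = 8  # hypothetical value of F_i
benefits = [availability_benefit(d, FULL_BENEFIT) for d in (1, 2, 3, 4)]
```

Each extra copy adds half as much benefit as the previous one, so the term approaches the full-replication benefit without reaching it.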

Allocation of Vertical Fragments
This formula can be used within an exhaustive splitting
algorithm by trying all possible combinations of sites s and t.
[Equation: the benefit $B_{ist}$ of splitting fragment $R_i$ into $R_s$ and $R_t$, expressed through the frequencies f and reference counts n of the application groups $A_s$, $A_t$, $A_1$, $A_2$, $A_3$, and $A_4$ to $A_n$.]
[Figure: fragment $R_i$ is split into $R_s$ and $R_t$. Applications of type $A_s$ run at site $PS_s$; application type $A_1$ runs at site $PS_r$ and accesses only $R_s$. The remaining application types and sites ($A_t$, $A_2$, $A_3$, $A_4$ to $A_n$; $PS_t$, $PS_4$ to $PS_n$) are also shown.]
Should we allocate fragment $R_s$ to site $PS_s$, and fragment $R_t$ to site $PS_t$?

SUMMARY
Design of a distributed DB consists of four phases:
Phase 1: Global schema design (same as in centralized DB
design)
Phase 2: Fragmentation
Horizontal Fragmentation
Primary: Determine a complete and minimal set of predicates
Derived: Use semijoin
Vertical Fragmentation
Identify fragments such that many applications can be executed
using just one fragment.
Phase 3: Allocation
The primary goal is to minimize the number of remote accesses.
Phase 4: Physical schema design (same as in centralized DB
design).
Database Integration
Bottom-up Design
Overview
The design process in multidatabase systems is bottom-up.
The individual databases already exist.
Designing the global conceptual schema (GCS) involves integrating these local databases into a multidatabase.
Database integration can occur in two steps: Schema Translation and Schema Integration.
[Figure: Database 1, Database 2, and Database 3 are each passed through a translator (Translator 1 to Translator 3), producing the intermediate schemas InS1, InS2, and InS3 in canonical representation; the INTEGRATOR combines them into the GCS.]
Network Data Model (Review)
There are two basic data structures in the network
model: records and sets.
Record type: a group of records of the same type.
Set type: indicates a many-to-one relationship in the direction of the arrow.
Example: the set type Employs has owner record type DEPARTMENT (DEPT-NAME, BUDGET, MANAGER) and member record type EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY).
Representation of set instances:
[Figure: the DEPARTMENT owner record "Database" linked to its EMPLOYEE member records "Jones, L.", "Patel, J.", and "Vu, K."]
Example: Three Local Databases
Database 1 (Relational Model):
S (TITLE, SAL)

E (ENO, ENAME, TITLE)

J (JNO, JNAME, BUDGET, LOC, CNAME)

G (ENO, JNO, RESP, DUR)
Database 2 (Network Model):
DEPARTMENT (DEPT_NAME, BUDGET, MANAGER)
EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY)
Work (dummy record type)
Set types: Employs (between DEPARTMENT and Work) and Worksin (between EMPLOYEE and Work)
Example: Three Local Databases
Database 3 (ER Model):
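ENGINEER (Engineer No., Engineer Name, Title, Salary)
PROJECT (Project No., Project Name, Budget, Location)
CLIENT (Client Name, Address)
WORKS IN: 1-to-many relationship between ENGINEER and PROJECT, with attributes Responsibility and Duration
CONTRACTED BY: N-to-1 relationship between PROJECT and CLIENT, with attribute Contract Date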
Schema Translation: Relational to ER
S (TITLE, SAL)

E (ENO, ENAME, TITLE)

J (JNO, JNAME, BUDGET, LOC, CNAME)

G (ENO, JNO, RESP, DUR)
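[Figure: first translation. Entities E (ENO, ENAME, TITLE), S (TITLE, SAL), and J (JNO, JNAME, BUDGET, LOC, CNAME). E and S are linked by the 1-to-many relationship PAY; E and J are linked by a many-to-many (N:M) relationship with attributes RESP and DUR.]
[Figure: simplified translation. Salary is treated as an attribute of the engineer entity, leaving E (ENO, ENAME, TITLE, SAL) and J linked by the N:M relationship with attributes RESP and DUR.]
Relationships may be identified from the foreign keys defined for each relation.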
Schema Translation: Network to ER
Map each record type in the network schema to an entity
and each set type to a relationship.
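The network model represents many-to-many relationships with dummy records, which need to be recognized during mapping.
[Figure: in the network schema, the dummy record type WORK is a member of both set types, Employs (owned by DEPARTMENT) and Works-in (owned by EMPLOYEE). A direct mapping yields the entities DEPARTMENT, WORK, and EMPLOYEE joined by the relationships EMPLOYS and WORKS-IN. Recognizing WORK as a dummy record collapses this into DEPARTMENT and EMPLOYEE joined by a single N:M relationship EMPLOYS.]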
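The mapping rule can be sketched in a few lines of Python. This is only an illustration: the dictionary layout (`set_types` mapping a set name to an owner/member pair) and the function name are assumptions, and the collapsed relationship is named after the dummy record here, whereas the slide names it EMPLOYS.

```python
def network_to_er(record_types, set_types, dummy_records):
    """Map a network schema to ER entities and relationships.

    set_types maps a set name to (owner record type, member record type).
    A dummy record that is the member of two sets becomes one M:N relationship
    between the two owners; every other set becomes a 1:N relationship.
    """
    entities = [r for r in record_types if r not in dummy_records]
    relationships = []
    for dummy in sorted(dummy_records):
        owners = sorted(owner for owner, member in set_types.values() if member == dummy)
        relationships.append((dummy, tuple(owners), "M:N"))
    for name, (owner, member) in set_types.items():
        if owner not in dummy_records and member not in dummy_records:
            relationships.append((name, (owner, member), "1:N"))
    return entities, relationships


entities, relationships = network_to_er(
    record_types=["DEPARTMENT", "EMPLOYEE", "WORK"],
    set_types={"Employs": ("DEPARTMENT", "WORK"), "Works-in": ("EMPLOYEE", "WORK")},
    dummy_records={"WORK"},
)
```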
Schema Integration
Schema integration follows the translation
process and generates the GCS by
integrating the intermediate schemas.
Identify the components of a database which
are related to one another.
Two components can be related as (1) equivalent, (2)
one contained in the other one, (3) overlapped, or (4)
disjoint.
Select the best representation for the GCS.
Integrate the components of each
intermediate schema.
Integration Methodologies
[Figure: taxonomy of integration processes. Binary: ladder or balanced. N-ary: one-shot or iterative.]
Binary: decreases the potential integration complexity and leads toward automation techniques.

One-shot: there is no implied priority for the integration order of schemas, and trade-offs can be made among all schemas rather than among a few.
Integration Process
Schema integration occurs in a sequence of four steps:
Preintegration: establish the rules of the integration process before actual integration occurs.
Comparison: identify naming and structural conflicts.
Conformation: resolve naming and structural conflicts.
Merging and restructuring: merge all schemas into a single database schema, then restructure it to create the best integrated schema.
Schema Integration: Preintegration
1. An integration method (binary or n-ary) must be
selected and the schema integration order defined.
The order implicitly defines priorities.
2. Candidate keys in each schema are identified to
enable the integrator to determine dependencies
implied by the schemas.
3. The mapping or transformation rules should be
described before integration begins.
e.g., mapping from degrees Celsius in one schema to
degrees Fahrenheit in another.
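Preintegration Example: InS1
[Figure: ER schema InS1. ENGINEER (Engineer No., Engineer Name, Title, Salary) and PROJECT (Project No., Project Name, Budget, Location) are linked by the 1-to-many relationship WORKS IN (Responsibility, Duration); PROJECT and CLIENT (Client Name, Address) are linked by the N-to-1 relationship CONTRACTED BY (Contract Date).]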
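Preintegration Example: InS2 & InS3
InS2: EMPLOYEE (E#, Name, Title, Address, Salary) and DEPARTMENT (Dept-name, Budget, Manager), linked by the 1:N relationship EMPLOYS.
InS3: ENGINEER (Eno, Ename, Title, Sal) and J (JNO, Jname, Budget, Loc, Cname), linked by an M:N relationship (labeled EMPLOYS in the figure) with attributes Dur and Resp.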
Keys & Integration Order
[Figure: integration method, showing the order in which InS1, InS2, and InS3 are integrated.]
KEYS
InS1: Engineer No. in ENGINEER, Project No. in PROJECT, Client name in CLIENT
InS2: E# in EMPLOYEE, Dept-name in DEPARTMENT
InS3: Eno in E, Jno in J
Schema Comparison:
Naming Conflict (1)
Synonyms: two identical entities that have different names.

InS1                  InS3
ENGINEER              E
  Engineer No           Eno
  Engineer Name         Ename
  Salary                Sal
WORKSIN               G
  Responsibility        Resp
  Duration              Dur
PROJECT               J
  Project No            Jno
  Project Name          Jname
  Location              Loc
Schema Comparison:
Naming Conflict (2)
Homonyms: two different entities that have identical names.
In InS1, ENGINEER.Title refers to the title of engineers.
In InS2, EMPLOYEE.Title refers to the title of all employees.
domain (EMPLOYEE.Title) >> domain (ENGINEER.Title)
Schema Comparison Relation
between Schemas
Two schemas can be related in four
possible ways:
They can be identical to one another.
One can be a subset of the other.
Some components from one may occur in the other
while retaining some unique features.
They could be completely different with no
overlap.
An attribute in one schema may represent
the same information as an entity in
another one
Schema Comparison Example
InS3 is a subset of InS2.
Some parts of InS1 (about engineers) and InS3 (about engineers) occur in InS2 (about employees).
[Figure: ENGINEER is linked by an IS-A relationship to EMPLOYEE (E#, Name, Title, Salary, Address), which is related to DEPARTMENT through EMPLOYS.]
Schema Comparison Structural
Conflicts (1)
Type conflicts: occur when the same object is
represented by an attribute in one schema and by an
entity in another schema.
The client of a project is modeled as an entity in InS1; however, the client is included as an attribute of the J entity in InS3.
[Figure: in InS3, J has the attributes JNO, Jname, Budget, Loc, and Cname. In InS1, PROJECT is linked to the entity CLIENT (Client Name, Address) by the N:1 relationship CONTRACTED BY (Contract Date).]
Schema Comparison Structural
Conflicts (2)
Dependency conflicts: occur when different relationship modes (e.g., 1-to-many versus many-to-many) are used to represent the same thing in different schemas.
[Figure: in InS1, ENGINEER (Engineer No., Engineer Name, Title, Salary) and PROJECT are linked by WORKS IN, which is 1-to-many. In InS3, ENGINEER (Eno, Ename, Title, Sal) and J are linked by a relationship with attributes Dur and Resp, which is many-to-many.]
Schema Comparison: Structural
Conflicts (3)
Key conflicts: occur when different candidate keys
are available and different primary keys are
selected in different schemas
Behavioral conflicts: are implied by the modeling
mechanism,
e.g., deletion of the last employee causes the dissolution
of the department.
Conformation: Naming Conflicts
Naming conflicts are resolved simply by renaming the conflicting names.

Synonyms: rename the schema of InS3 to conform to the naming of InS1.

InS3        renamed to (as in InS1)
E           ENGINEER
  Eno         Engineer No
  Ename       Engineer Name
  Sal         Salary
G           WORKSIN
  Resp        Responsibility
  Dur         Duration
J           PROJECT
  Jno         Project No
  Jname       Project Name
  Loc         Location

Homonyms: prefix each attribute by the name of the entity to which it belongs, e.g., ENGINEER.Title and EMPLOYEE.Title, and prefix each entity by the name of the schema to which it belongs, e.g., InS1.ENGINEER and InS2.EMPLOYEE.
[Figure: InS3 after renaming, with the entities ENGINEER and PROJECT linked by WORKS IN; Client Name is still an attribute of PROJECT.]
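Both renaming rules are mechanical enough to sketch in Python. The snippet below is an illustration only: schemas are modeled as plain dictionaries from entity name to attribute list, and the helper names `rename_schema` and `qualify` are hypothetical.

```python
# Synonym table taken from the InS3 -> InS1 renaming above.
SYNONYMS = {
    "E": "ENGINEER", "Eno": "Engineer No", "Ename": "Engineer Name", "Sal": "Salary",
    "G": "WORKSIN", "Resp": "Responsibility", "Dur": "Duration",
    "J": "PROJECT", "Jno": "Project No", "Jname": "Project Name", "Loc": "Location",
}


def rename_schema(schema, synonyms):
    """Resolve synonyms: rename entities and attributes, leaving unknown names unchanged."""
    return {
        synonyms.get(entity, entity): [synonyms.get(attr, attr) for attr in attrs]
        for entity, attrs in schema.items()
    }


def qualify(schema_name, schema):
    """Resolve homonyms: prefix attributes with their entity and entities with their schema."""
    return [f"{schema_name}.{entity}.{attr}" for entity, attrs in schema.items() for attr in attrs]


renamed = rename_schema({"E": ["Eno", "Ename", "Sal"]}, SYNONYMS)
qualified = qualify("InS2", {"EMPLOYEE": ["Title"]})
```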
Resolving Structural Conflicts
Transform entities, attributes, and relationships into one another.
Example: transform the attribute Client Name in InS3 into an entity C to make InS3 conform to the representation of InS1.
[Figure: in the new InS3, ENGINEER (Engineer No., Engineer Name, Title, Salary) and PROJECT (Project No., Project Name, Budget, Location) are linked by the M:N relationship WORKS IN (Duration, Responsibility), and PROJECT is linked to the new entity C (Client Name) by the M:N relationship C-P.]
Schema Integration:
Merging & Restructuring
Merging requires that the information contained in the
participating schemas be retained in the integrated
schema.

[Figure: integration order. InS1 (engineers) and InS3 (engineers) are integrated first, using InS3 as the final schema since it is more general in terms of the C-P relationship (i.e., many-to-many). The result is then merged with InS2 (employees) using the IS-A relationship.]
Integrate InS1 & InS3
[Figure: InS1 links ENGINEER and PROJECT by a 1-to-many WORKS IN and links PROJECT to CLIENT (Client Name, Address) by an N:1 CONTRACTED BY (Contract Date). InS3 links ENGINEER and PROJECT by an M:N WORKS IN and links PROJECT to C (Client Name) by an M:N CONTRACTED BY. InS3 is more general.]

Merging & Restructuring Example
Final Result:
[Figure: the InS1/InS3 part (ENGINEER; WORKS IN with Responsibility and Duration; PROJECT with Project No., Project Name, Budget, and Location; CONTRACTED BY; CLIENT with Client name and Address) combined with the InS2 part (EMPLOYEE with E#, Name, Title, SAL, and Address; EMPLOYS; DEPARTMENT with Dept-name, Budget, and Manager).]
Unfortunately, the conformation and restructuring stages are an art rather than a science.
Query Processing in
Multidatabase Systems
Query Processing in Three Steps
1. Global query is
decomposed into local
queries
[Figure: Local Schemas 1-3 are mapped by Translators 1-3 to InS1, InS2, and InS3, which the INTEGRATOR combines into the GCS (schema integration). The global query Q1 on the GCS is decomposed into the local queries Q1,1, Q1,2, and Q1,3.]
Query Processing in Three Steps
2. Each local query is
translated into
queries over the
corresponding local
database system
[Figure: same architecture; each local query Q1,1, Q1,2, Q1,3 is passed through its translator to the corresponding local schema.]
Query Processing in Three Steps
3. Results of the local
queries are combined
into the answer

[Figure: same architecture; the results of Q1,1, Q1,2, and Q1,3 are combined into the final answer.]
Query Processing in Three Steps
1. Global query is
decomposed into local
queries
2. Each local query is
translated into
queries over the
corresponding local
database system
3. Results of the local
queries are combined
into the answer
Outline
Overview of major query processing
components in multidatabase systems:
Query Decomposition
Query Translation
Global Query Optimization
Techniques for each of the above
components
Query Decomposition
Query Decomposition
Overview
[Figure: the global query enters query decomposition & global optimization, which produces the subqueries SQ1, SQ2, ..., SQn and the postprocessing queries PQ1, ..., PQn. Query translator i turns SQi into TQi, which runs on DBi.]
SQi: export-schema subquery in the global query language
TQi: target query (local subquery) in the local query language
PQi: postprocessing query used to combine the results returned by subqueries to form the answer
Assumptions
We use the object-oriented data model to
present a query modification algorithm
To simplify the discussion, we assume that
there are only two export schemas:
ES1             ES2
Emp1: SSN       Emp2: SSN
      Name            Name
      Salary          Salary
      Age             Rank
Definitions
type: Given a class C, the type
of C denoted by type(C ), is the
set of attributes defined for C
and their corresponding
domains.
world: the world of C, denoted
by world(C ), is the set of real-
world objects described by C.
extension: the extension of C,
denoted by extension(C ), is the
set of instances contained in C.
[Figure: a class, shown with its type, its extension, and its world.]
Review: Outerjoin
The outerjoin of relations $R_1$ and $R_2$ ($R_1 \bowtie_o R_2$) is the union of three components:
the join of $R_1$ and $R_2$,
dangling tuples of $R_1$ padded with null values, and
dangling tuples of $R_2$ padded with null values.
Outerjoin Example
Emp1
OID  SSN   Name   Salary  Age
3    6789  Smith  90,000  40
4    4321  Chang  62,000  30
5    8642  Patel  75,000  35

Emp2
OID  SSN   Name   Salary  Rank
1    2222  Ahad   98,000  S. Mgr.
2    7531  Wang   95,000  S. Mgr.
3    6789  Smith  25,000  Mgr.

EmpO (outerjoin of Emp1 and Emp2)
OID  SSN   Name   Salary        Age   Rank
1    2222  Ahad   98,000        null  S. Mgr.
2    7531  Wang   95,000        null  S. Mgr.
3    6789  Smith  Inconsistent  40    Mgr.
4    4321  Chang  62,000        30    null
5    8642  Patel  75,000        35    null

Dangling tuples: OIDs 4 and 5 in Emp1, and OIDs 1 and 2 in Emp2.
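The example can be reproduced with a small Python sketch of the outerjoin on OID. The representation (lists of dictionaries, `None` for null, and an `INCONSISTENT` marker for conflicting values) is an assumption made for illustration, not part of the slides.

```python
INCONSISTENT = "Inconsistent"

emp1 = [
    {"OID": 3, "SSN": 6789, "Name": "Smith", "Salary": 90000, "Age": 40},
    {"OID": 4, "SSN": 4321, "Name": "Chang", "Salary": 62000, "Age": 30},
    {"OID": 5, "SSN": 8642, "Name": "Patel", "Salary": 75000, "Age": 35},
]
emp2 = [
    {"OID": 1, "SSN": 2222, "Name": "Ahad", "Salary": 98000, "Rank": "S. Mgr."},
    {"OID": 2, "SSN": 7531, "Name": "Wang", "Salary": 95000, "Rank": "S. Mgr."},
    {"OID": 3, "SSN": 6789, "Name": "Smith", "Salary": 25000, "Rank": "Mgr."},
]


def outerjoin(r1, r2, key):
    """Equi-outerjoin on `key`: matched tuples are merged, dangling tuples are padded with None."""
    attrs = list(dict.fromkeys(a for t in r1 + r2 for a in t))  # ordered union of attribute names
    by_key1 = {t[key]: t for t in r1}
    by_key2 = {t[key]: t for t in r2}
    result = []
    for k in sorted(by_key1.keys() | by_key2.keys()):
        t1, t2 = by_key1.get(k, {}), by_key2.get(k, {})
        row = {}
        for a in attrs:
            v1, v2 = t1.get(a), t2.get(a)
            if v1 is not None and v2 is not None and v1 != v2:
                row[a] = INCONSISTENT  # both sides supply a value and they disagree
            else:
                row[a] = v1 if v1 is not None else v2
        result.append(row)
    return result


emp_o = outerjoin(emp1, emp2, "OID")
```

Smith (OID 3) is the only matched tuple, and its two salaries disagree, which is the inconsistency the following slides resolve with aggregate functions.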
Schema Integration - Outerjoin
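Two classes C1 and C2 can be integrated by equi-outerjoining the two classes on the OID to form a new class C.
$$\mathrm{extension}(C) = \mathrm{extension}(C1) \bowtie_o \mathrm{extension}(C2)$$
$$\mathrm{type}(C) = \mathrm{type}(C1) \cup \mathrm{type}(C2)$$
$$\mathrm{world}(C) = \mathrm{world}(C1) \cup \mathrm{world}(C2)$$
[Figure: classes C1 and C2 combined into class C.]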
Schema Integration - Generalization
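Two classes C1 and C2 can be integrated by generalizing the two classes to form the superclass C.
$$\mathrm{type}(C) = \mathrm{type}(C1) \cap \mathrm{type}(C2)$$
$$\mathrm{extension}(C) = \pi_{\mathrm{type}(C)}\,[\mathrm{extension}(C1) \cup_o \mathrm{extension}(C2)]$$
$$\mathrm{world}(C) = \mathrm{world}(C1) \cup \mathrm{world}(C2)$$
Here $\cup_o$ denotes the outer union.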
Generalization Example
Emp1: SSN       Emp2: SSN       EmpG: SSN
      Name            Name            Name
      Salary          Salary          Salary
      Age             Rank
Emp1 and Emp2 will also appear in the global schema, since not all information in Emp1 and Emp2 is retained in EmpG.
[Figure: EmpG (SSN, Name, Salary) is the generalization; Emp1 (adding Age) and Emp2 (adding Rank) are the more specific classes beneath it.]
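A minimal Python sketch of generalization follows: it takes the outer union of the two extensions and projects it onto the common attributes. The data layout is assumed, and keeping the first value seen for an overlapping OID is an arbitrary choice made here; the slides resolve such overlaps with aggregate functions instead.

```python
emp1 = [
    {"OID": 3, "SSN": 6789, "Name": "Smith", "Salary": 90000, "Age": 40},
    {"OID": 4, "SSN": 4321, "Name": "Chang", "Salary": 62000, "Age": 30},
]
emp2 = [
    {"OID": 1, "SSN": 2222, "Name": "Ahad", "Salary": 98000, "Rank": "S. Mgr."},
    {"OID": 3, "SSN": 6789, "Name": "Smith", "Salary": 25000, "Rank": "Mgr."},
]


def generalize(c1, c2, key):
    """Superclass extension: outer union of c1 and c2 projected onto type(c1) & type(c2)."""
    common = [a for a in c1[0] if a in c2[0]]  # attributes shared by both classes
    seen, result = set(), []
    for t in c1 + c2:
        if t[key] not in seen:  # one tuple per real-world object
            seen.add(t[key])
            result.append({a: t[a] for a in common})
    return result


emp_g = generalize(emp1, emp2, "OID")
```

Age and Rank disappear from EmpG, which is why Emp1 and Emp2 must stay in the global schema.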
Inconsistency Resolution
The schema integration techniques
work as long as there is no data
inconsistency
If data inconsistency occurs,
aggregate functions may be used to
resolve the problem.
Export Schemas                  Integrated Schema
Emp1: SSN     Emp2: SSN         EmpG: SSN      or   EmpO: SSN
      Name          Name              Name               Name
      Salary        Salary            Salary             Salary
      Age           Rank                                 Age
                                                         Rank
Aggregate Functions - Examples:
EmpG.Name   = Emp1.Name, if EmpG is in world(Emp1)
            = Emp2.Name, if EmpG is in world(Emp2) − world(Emp1)
EmpG.Salary = Emp1.Salary, if EmpG is in world(Emp1) − world(Emp2)
            = Emp2.Salary, if EmpG is in world(Emp2) − world(Emp1)
            = Sum(Emp1.Salary, Emp2.Salary), if EmpG is in world(Emp1) ∩ world(Emp2)
EmpO.Age    = Emp1.Age, if EmpO is in world(Emp1)
            = Null, if EmpO is in world(Emp2) − world(Emp1)
EmpO.Rank   = Emp2.Rank, if EmpO is in world(Emp2)
            = Null, if EmpO is in world(Emp1) − world(Emp2)
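The aggregate functions translate directly into code. In this hedged Python sketch, each function receives the Emp1 and Emp2 tuples of one real-world object, with `None` standing for "not in that world"; the function names are illustrative.

```python
def resolve_salary(e1, e2):
    """Salary: the only available value, or the sum when the object is in both worlds."""
    if e2 is None:
        return e1["Salary"]
    if e1 is None:
        return e2["Salary"]
    return e1["Salary"] + e2["Salary"]


def resolve_age(e1, e2):
    """Age exists only in Emp1; it is null for objects outside world(Emp1)."""
    return e1["Age"] if e1 is not None else None


def resolve_rank(e1, e2):
    """Rank exists only in Emp2; it is null for objects outside world(Emp2)."""
    return e2["Rank"] if e2 is not None else None


smith1 = {"SSN": 6789, "Name": "Smith", "Salary": 90000, "Age": 40}
smith2 = {"SSN": 6789, "Name": "Smith", "Salary": 25000, "Rank": "Mgr."}
chang1 = {"SSN": 4321, "Name": "Chang", "Salary": 62000, "Age": 30}

smith_salary = resolve_salary(smith1, smith2)
chang_rank = resolve_rank(chang1, None)
```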

Inconsistency Resolution Example
[Figure: Venn diagram of world(Emp1) and world(Emp2), showing the three regions world(Emp1) − world(Emp2), world(Emp1) ∩ world(Emp2), and world(Emp2) − world(Emp1); the regions apply to both generalization and outerjoin.]
Query Modification
Step 1: Determine Number of Subqueries
Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35

Obtain a partition of world(EmpO) based on the aggregate function used to resolve the data inconsistency. Assume outerjoin is used for schema integration.

Option 1 (based on Salary)
part. 1: world(Emp1) − world(Emp2)
part. 2: world(Emp2) − world(Emp1)
part. 3: world(Emp1) ∩ world(Emp2)

Inconsistency Function:
EmpO.Salary = Emp1.Salary, if EmpO is in world(Emp1) − world(Emp2)
            = Emp2.Salary, if EmpO is in world(Emp2) − world(Emp1)
            = Sum(Emp1.Salary, Emp2.Salary), if EmpO is in world(Emp1) ∩ world(Emp2)
Query Modification
Step 1: Determine Number of Subqueries
Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35

Obtain a partition of world(EmpO) based on the aggregate function used to resolve the data inconsistency.

Option 2 (based on Age)
part. 1: world(Emp1)
part. 2: world(Emp2) − world(Emp1)

Inconsistency Function:
EmpO.Age = Emp1.Age, if EmpO is in world(Emp1)
         = Null, if EmpO is in world(Emp2) − world(Emp1)
Query Modification
Step 1: Determine Number of Subqueries
Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35

Obtain a partition of world(EmpO) based on the aggregate function used to resolve the data inconsistency.

Option 1 (based on Salary)
part. 1: world(Emp1) − world(Emp2)
part. 2: world(Emp2) − world(Emp1)
part. 3: world(Emp1) ∩ world(Emp2)

Option 2 (based on Age)
part. 1: world(Emp1)
part. 2: world(Emp2) − world(Emp1)

We use Option 1 since it is the finest partition among all the partitions.
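Choosing the finest partition can be illustrated by computing the common refinement of the candidate partitions. The Python sketch below models each world as a set of OIDs from the running example; this set-based representation and the function name are assumptions made for illustration.

```python
world_emp1 = {3, 4, 5}
world_emp2 = {1, 2, 3}

# Option 1 (based on Salary) and Option 2 (based on Age), as lists of blocks.
option1 = [world_emp1 - world_emp2, world_emp2 - world_emp1, world_emp1 & world_emp2]
option2 = [world_emp1, world_emp2 - world_emp1]


def common_refinement(p, q):
    """Coarsest partition that refines both p and q: all non-empty pairwise intersections."""
    return [a & b for a in p for b in q if a & b]


refined = common_refinement(option1, option2)
```

Here the refinement equals Option 1, so Option 1 already is the finest partition. When neither option refines the other, the refinement is a new, finer partition.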
Query Modification
Another Example





[Figure: two partitions of world(Emp1) and world(Emp2), Option 1 and Option 2, neither of which is the finest. Use a finer partition (Option 3) obtained by combining them.]
Query Modification
Step 2: Query Decomposition
Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND
EmpO.Age > 35

Partition:
part. 1: world(Emp1) − world(Emp2)
part. 2: world(Emp2) − world(Emp1)
part. 3: world(Emp1) ∩ world(Emp2)

Inconsistency functions:
EmpO.Age    = Emp1.Age, if EmpO is in world(Emp1)
            = Null, if EmpO is in world(Emp2) − world(Emp1)
EmpO.Salary = Emp1.Salary, if EmpO is in world(Emp1) − world(Emp2)
            = Emp2.Salary, if EmpO is in world(Emp2) − world(Emp1)
            = Sum(Emp1.Salary, Emp2.Salary), if EmpO is in world(Emp1) ∩ world(Emp2)

Query Decomposition: obtain a query for each subset in the chosen partition.

part. 1: Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000 AND
Emp1.Age > 35 AND
Emp1.SSN NOT IN
(Select Emp2.SSN
From Emp2)

part. 2: This subquery is discarded because EmpO.Age is Null.

part. 3: Select Emp1.Name, Emp2.Rank
From Emp1, Emp2
Where Sum(Emp1.Salary, Emp2.Salary) > 80,000 AND
Emp1.Age > 35 AND
Emp1.SSN = Emp2.SSN
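To check the decomposition against the running example, the two surviving subqueries can be evaluated in Python over the Emp1 and Emp2 tuples of the outerjoin example. The list-of-dictionaries representation and the function names `part1` and `part3` are illustrative assumptions.

```python
emp1 = [
    {"SSN": 6789, "Name": "Smith", "Salary": 90000, "Age": 40},
    {"SSN": 4321, "Name": "Chang", "Salary": 62000, "Age": 30},
    {"SSN": 8642, "Name": "Patel", "Salary": 75000, "Age": 35},
]
emp2 = [
    {"SSN": 2222, "Name": "Ahad", "Salary": 98000, "Rank": "S. Mgr."},
    {"SSN": 7531, "Name": "Wang", "Salary": 95000, "Rank": "S. Mgr."},
    {"SSN": 6789, "Name": "Smith", "Salary": 25000, "Rank": "Mgr."},
]


def part1(emp1, emp2):
    """Objects only in world(Emp1): Salary and Age come from Emp1, Rank is null."""
    ssn2 = {e["SSN"] for e in emp2}
    return [(e["Name"], None) for e in emp1
            if e["Salary"] > 80000 and e["Age"] > 35 and e["SSN"] not in ssn2]


def part3(emp1, emp2):
    """Objects in both worlds: Salary is the sum, Age from Emp1, Rank from Emp2."""
    by_ssn = {e["SSN"]: e for e in emp2}
    return [(e["Name"], by_ssn[e["SSN"]]["Rank"]) for e in emp1
            if e["SSN"] in by_ssn
            and e["Salary"] + by_ssn[e["SSN"]]["Salary"] > 80000
            and e["Age"] > 35]


# part. 2 is discarded: Age is null there, so "Age > 35" can never hold.
answer = part1(emp1, emp2) + part3(emp1, emp2)
```

Only Smith qualifies, through part. 3, because the combined salary of 115,000 exceeds the threshold.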
Query Modification
Step 3: Further Decomposition
STEP 3: Some resulting queries may still reference data from more than one database. They need to be further decomposed into subqueries and possibly also postprocessing queries.

Before STEP 3:
Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000 and
Emp1.Age > 35 and
Emp1.SSN NOT IN
(Select Emp2.SSN
From Emp2)

After STEP 3:
Insert INTO X
Select Emp2.SSN
From Emp2

Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000 and
Emp1.Age > 35 and
Emp1.SSN NOT IN X
Query Modification
Step 4: Query Optimization
STEP 4: It may be desirable to reduce
the number of subqueries by
combining subqueries for the same
database.
Query Translation
Query Translation (1)
IF Global Query Language ≠ Local Query Language
THEN a translator is needed:
Export Schema Subquery → Translator → Local Query Language
Query Translation (2)
IF the source query language has a higher
expressive power THEN EITHER
Some source queries cannot be translated; or
they must be translated using both
the syntax of the target query language, and
some facilities of a high-level programming language.
Example: A recursive OODB query may not be
translated into a relational query using SQL
alone.
Translation Techniques (1)
CASE 1: A single target query is generated
IF the target database system has a query
optimizer
THEN the query optimizer can be used
to optimize the translated query
ELSE the translator has to consider the
performance issues
Translation Techniques (2)
CASE 2: A set of target queries is needed.
It might pay to have the minimum number of
queries
It minimizes the number of invocations of the target
system
It may also reduce the cost of combining the partial
results
It might pay for a set to contain target queries
that can be well coordinated
The results or intermediate results of the queries
processed earlier can be used to reduce the cost of
processing the remaining queries
Relational-to-OO Translation
Equivalent Relational Schema:
Auto (Auto-OID, Color, Company-OID)
Company (Company-OID, Name, Profit, City-OID, People-OID)
People (People-OID, Name, Age, City-OID, Auto-OID)
City (City-OID, Name, State)
OODB Schema:
Auto
OID
Color
Manufacturer
Company
OID
Name
Profit
Headquarter
President
People
OID
Name
Hometown
Automobile
Age
City
OID
Name
State
Relational-to-OO Example (1)
Global Query:
Select Auto1.*
From Auto Auto1, Auto Auto2,
Company, People,
City City1, City City2
Where Auto1.Company-OID = Company.Company-OID AND
Company.People-OID = People.People-OID AND
People.Age = 52 AND
People.Auto-OID = Auto2.Auto-OID AND
Auto2.Color = red AND
People.City-OID = City1.City-OID AND
City1.Name = City2.Name AND
Company.City-OID = City2.City-OID
Meaning: find all cars made by a manufacturer whose President is 52 years old, owns a red car, and lives in the same city as the manufacturer's headquarters.
[Figure: relational predicate graph with nodes Auto1, Company, People (Age = 52), Auto2 (Color = red), City1, and City2, and six numbered edges for the join predicates, e.g. (1) Company-OID, (2) People-OID, (3) Auto-OID.]
Relational-to-OO Example (2)
OO Query:
Where Auto.Manufacturer.President.Age = 52 AND
Auto.Manufacturer.President.Automobile.Color = red AND
Auto.Manufacturer.Headquarter.Name =
Auto.Manufacturer.President.Hometown.Name
[Figure: OO predicate graph over Auto1, Company, People (Age = 52), Auto2 (Color = red), City1, and City2, with edges Company-OID, People-OID, and Auto-OID; Predicates 1, 2, and 3 mark the three conditions of the OO query.]
Global Query Optimization
Global Query Optimization (1)
A query obtained by the query modification
process may still reference data from
more than one database.
Example: part. 3 (i.e., world(Emp1) ∩ world(Emp2)) on page 126
Select Emp1.Name, Emp2.Rank
From Emp1, Emp2 /* access two databases */
Where sum(Emp1.Salary, Emp2.Salary) > 80,000 AND
Emp1.Age > 35 AND
Emp1.SSN = Emp2.SSN
Some global strategy is needed to process such queries
Global Query Optimization (2)
With Emp1 stored at Site 1 and Emp2 at Site 2, the result of this query can be formed in three ways:
ship Emp2 to Site 1 and form the result there,
ship Emp1 to Site 2 and form the result there, or
ship both Emp1 and Emp2 to a third site, Site 3, and form the result there.
Some global strategy is needed to process such queries.
Data Inconsistency
If C is integrated from C1 and C2 with no data inconsistency on attribute A, then
$$\sigma_{A\ op\ a}(C) = \sigma_{A\ op\ a}(C1) \cup \sigma_{A\ op\ a}(C2)$$
If A has data inconsistency, then the above equality may no longer hold.
Example: Consider the select operation $\sigma_{EmpO.Salary > 100,000}(EmpO)$ on EmpO:

OID  SSN   Name   Salary        Age   Rank
1    2222  Ahad   98,000        null  S. Mgr.
2    7531  Wang   95,000        null  S. Mgr.
3    6789  Smith  Inconsistent  40    Mgr.
4    4321  Chang  62,000        30    null
5    8642  Patel  75,000        35    null

The correct answer should have the record for Smith, because Smith does have a combined salary greater than 100,000. However, distributing the selection as above returns an empty set.
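The failure is easy to reproduce. The Python sketch below compares the distributed selection with the selection applied after resolving salaries by Sum; the SSN-to-salary dictionaries are an assumed encoding of the running example.

```python
emp1 = {6789: 90000, 4321: 62000, 8642: 75000}   # SSN -> Salary in Emp1
emp2 = {2222: 98000, 7531: 95000, 6789: 25000}   # SSN -> Salary in Emp2
threshold = 100000

# Distributed form: apply the selection to each source, then take the union.
distributed = ({ssn for ssn, s in emp1.items() if s > threshold}
               | {ssn for ssn, s in emp2.items() if s > threshold})

# Correct form: resolve the inconsistency first (Sum), then apply the selection.
resolved = {ssn: emp1.get(ssn, 0) + emp2.get(ssn, 0) for ssn in emp1.keys() | emp2.keys()}
correct = {ssn for ssn, s in resolved.items() if s > threshold}
```

The distributed form finds nobody, while the resolved form finds Smith (SSN 6789) with 115,000.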
Data Inconsistency - Optimization
Express an outerjoin (or a generalization) as outer-unions as follows:
$$C1 \bowtie_o C2 = C1\text{-}O \cup_o C2\text{-}O \cup_o (C1\text{-}C \bowtie_{OID} C2\text{-}C)$$
C1-O: those tuples of C1 that have no matching tuples in C2 (private part)
C1-C: those tuples of C1 that have matching tuples in C2 (overlap part)
$$\sigma_{A\ op\ a}(C1 \bowtie_o C2) = \sigma_{A\ op\ a}(C1\text{-}O) \cup_o \sigma_{A\ op\ a}(C2\text{-}O) \cup_o \sigma_{A\ op\ a}(C1\text{-}C \bowtie_{OID} C2\text{-}C)$$
Can we improve the last term?
Distribution of Selections (1)

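$$\sigma_{A\ op\ a}(C1 \bowtie_o C2) = \sigma_{A\ op\ a}(C1\text{-}O) \cup_o \sigma_{A\ op\ a}(C2\text{-}O) \cup_o \sigma_{A\ op\ a}(C1\text{-}C \bowtie_{OID} C2\text{-}C)$$
When can we distribute $\sigma$ over $\bowtie$ in the last term? The join is an expensive operation, and attribute A is defined by an aggregate function (see page 125).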
Distribution of Selection (2)
Four cases were identified when all arguments of the aggregate function $f$ (used for resolving conflicts) are non-negative.
1. $f(A_1, A_2)\ op\ a \equiv A_1\ op\ a \text{ AND } A_2\ op\ a$:
$$\sigma_{A\ op\ a}(C1\text{-}C \bowtie C2\text{-}C) = \sigma_{A\ op\ a}(C1\text{-}C) \bowtie \sigma_{A\ op\ a}(C2\text{-}C)$$
Example: max(Emp1-C.Salary, Emp2-C.Salary) < 30K ≡ Emp1-C.Salary < 30K AND Emp2-C.Salary < 30K
2. $f(A_1, A_2)\ op\ a \equiv f(A_1\ op\ a, A_2\ op\ a)\ op\ a$:
$$\sigma_{A\ op\ a}(C1\text{-}C \bowtie C2\text{-}C) = \sigma_{A\ op\ a}(\sigma_{A_1\ op\ a}(C1\text{-}C) \bowtie \sigma_{A_2\ op\ a}(C2\text{-}C))$$
Example: sum(Emp1-C.Salary, Emp2-C.Salary) < 30K ≡ sum(Emp1-C.Salary < 30K, Emp2-C.Salary < 30K) < 30K
Distribution of Selection (3)
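3. $f(A_1, A_2)\ op\ a \equiv f(A_1\ op'\ a, A_2\ op'\ a)\ op\ a$, where the inner operator $op'$ differs from $op$:
$$\sigma_{A\ op\ a}(C1\text{-}C \bowtie C2\text{-}C) = \sigma_{A\ op\ a}(\sigma_{A_1\ op'\ a}(C1\text{-}C) \bowtie \sigma_{A_2\ op'\ a}(C2\text{-}C))$$
Example: sum(Emp1-C.Salary, Emp2-C.Salary) = 30K ≡ sum(Emp1-C.Salary ≤ 30K, Emp2-C.Salary ≤ 30K) = 30K
4. No improvement is possible:
Example: sum(Emp1-C.Salary, Emp2-C.Salary) > 30K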
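Distribution Rules for $\sigma$ over $\bowtie$ in $\sigma_{A\ op\ a}(C1\text{-}C \bowtie C2\text{-}C)$
Each entry is the number of the case that applies to aggregate function f and operator op.

f \ op        >   ≥   <   ≤   =   ≠   in  not in
sum(A1, A2)   4   4   2   2   3   4   4   4
avg(A1, A2)   4   4   2   2   3   4   4   4
max(A1, A2)   4   4   1   1   3   4   4   4
min(A1, A2)   1   1   4   4   3   4   4   4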
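A few table entries can be sanity-checked by brute force. The Python sketch below tests, over small non-negative integers, whether a condition on the aggregate is equivalent to the same condition on both arguments (case 1), or whether pre-filtering both arguments with some operator never loses a qualifying pair (cases 2 and 3). The helper names are hypothetical, and the check is an illustration over a finite range, not a proof.

```python
from itertools import product
import operator

VALUES = range(0, 6)       # non-negative attribute values
THRESHOLDS = range(0, 6)   # candidate constants a


def case1_holds(f, op):
    """f(a1, a2) op a  <=>  (a1 op a) AND (a2 op a) for all tested values."""
    return all(op(f(a1, a2), a) == (op(a1, a) and op(a2, a))
               for a in THRESHOLDS for a1, a2 in product(VALUES, repeat=2))


def prefilter_is_safe(f, op, pre_op):
    """Pre-filtering both arguments with pre_op keeps every pair that satisfies f(a1, a2) op a."""
    return all(op(f(a1, a2), a) == (pre_op(a1, a) and pre_op(a2, a) and op(f(a1, a2), a))
               for a in THRESHOLDS for a1, a2 in product(VALUES, repeat=2))


add = operator.add
checks = {
    "max <  (case 1)": case1_holds(max, operator.lt),
    "min >  (case 1)": case1_holds(min, operator.gt),
    "sum <  (case 2)": prefilter_is_safe(add, operator.lt, operator.lt),
    "sum =  (case 3)": prefilter_is_safe(add, operator.eq, operator.le),
    "sum >  (case 4)": prefilter_is_safe(add, operator.gt, operator.gt),
}
```

The first four checks succeed, while the last one fails, matching the table: for sum with >, pushing the selection below the join would discard qualifying pairs.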
Problems in Global Query
Optimization (1)
Important information about local entity sets that is
needed to determine global query processing plans
may not be provided by the local database systems.
Examples: cardinalities, availability of fast access paths
Techniques:
Sampling queries may be designed to collect statistics
about the local databases.
A monitoring system can be used to collect the
completion time for subqueries. This can be used to
better estimate subsequent subqueries.
Problems in Global Query
Optimization (2)
Different query processing algorithms may have been
used in different local database systems.
Cooperation across different systems is difficult.
Example: Semijoin may not be supported on some local systems.
Data transmission between different local database
systems may not be fully supported.
Examples:
A local database system may not allow update operations
For many nonrelational systems, the instances of one entity
set are more likely to be clustered with the instances of
other entity sets. Such clustering makes it very expensive to
extract data for one entity set.
Need more sophisticated decomposition algorithms.
