A distributed database management system (DDBMS) is a software system that permits the management of a distributed database and makes the distribution transparent to the users. A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network. Sometimes "distributed database system" is used to refer jointly to the distributed database and the distributed DBMS.

Distributed database management systems are amongst the most important and successful software developments of this decade. They enable computing power and data to be placed within the user environment, close to the point of user activity. The performance of a DDBMS is deeply tied to the query processing strategies that govern data transmission between nodes over the network. This thesis studies the optimization of query processing strategies in a distributed database environment. With the objective of minimum communication cost, we have developed a mathematical model to find a join-semijoin program for processing a given equi-join query in distributed homogeneous relational databases. Rules for estimating the size of derived relations are proposed. The distributed query processing problem is formulated as a dynamic network problem. We also extend this model to consider both communication cost and local processing cost, and further extend it to query processing in a distributed heterogeneous database environment. A heterogeneous database communication system is proposed to integrate heterogeneous database management systems so that they can combine and share information. The use of a database communication system for heterogeneous DBMSs makes the overall system transparent to users from an operational point of view. Problems of schema translation and query translation for query processing in this environment are also studied.
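
To make the join-semijoin idea concrete, the following is a minimal SQL sketch of a semijoin program for a single equi-join, with one relation per site; the relation and attribute names (Student at site 1, Enrolment at site 2, join attribute studentId) are hypothetical, and each site's work is simulated with ordinary SQL statements:

    -- Step 1 (at site 2): project out only the join attribute and ship
    -- this small column to site 1 instead of the whole Enrolment relation.
    CREATE TABLE enrol_ids AS
      SELECT DISTINCT studentId FROM Enrolment;

    -- Step 2 (at site 1): reduce Student to the tuples that can actually
    -- join, i.e. compute the semijoin of Student with Enrolment.
    CREATE TABLE student_reduced AS
      SELECT s.* FROM Student s
      WHERE s.studentId IN (SELECT studentId FROM enrol_ids);

    -- Step 3 (at site 2): ship the reduced Student relation back and
    -- perform the final join locally.
    SELECT *
    FROM student_reduced r
    JOIN Enrolment e ON e.studentId = r.studentId;

The communication cost is then roughly the size of the projected join column plus the size of the reduced relation, rather than the size of the whole Student relation; the model above searches over such programs for the cheapest one.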
Distributed Databases
What is a DDBMS?

• A Distributed DB is a related collection of data just like a normal DB, but physically distributed over a
network.

• The data is split into “fragments” on separate machines, each running a local DBMS.

• A Distributed DBMS is the software that manages distribution of data and processing in a fashion that is
invisible to users.

• Local DBMSs can access local data autonomously, or access remote data through the DDBMS and other
local DBMSs.

• Apps that only use local data are called “Local Applications”

• Apps that access remote data are called “Global Applications”

• To be a DDBMS, every local DBMS must participate in at least one Global App.

• Homogeneous DDBMSs use the same DBMS on the same platform on each site

• Heterogeneous DDBMSs use different DBMSs/platforms and require gateways or other middleware to
convert queries/data models between sites

Advantages of a DDBMS:

• Realistically reflects decentralised organisational structures

• If well designed can strike an optimal balance between local speed and global access

• More failure-proof than a single DBMS

• Shared processing and I/O leads to better performance

• Scalability

Disadvantages of a DDBMS:

• Increased complexity and higher cost

• Networking compromises security

• Maintaining validity, consistency and integrity becomes more complicated

• It’s a new technology with few standards/best practices

Note:

• A DDBMS is not the same as “distributed processing”, which is a centralised DB accessible through a network
(rather than the data itself being distributed)

• A DDBMS is not always the same as a “parallel DBMS” which is a single DBMS using multiple
processors/multiple disks

DDBMS Components:

• Global External Schema – user views; provides logical data independence

• Global Conceptual Schema – logical description of the entire DB, including entities and relationships, constraints, domains, security etc.; provides physical data independence

• Fragmentation and Allocation Schema – describes how the data is partitioned and where it is stored

• 3-Tier Schemas for each Local DBMS – as normal, but instead of an external schema, the top level is a mapping schema used to communicate with the fragmentation/allocation schema above

• Local DBMS (LDBMS) – normal DBMS controlling local data

• Data Communications (DC) – network software

• Global System Catalogue – same as a normal system catalogue, plus fragmentation/allocation information

• Distributed DBMS (DDBMS) – main functional unit; transaction management, backup/recovery etc.

When designing a DDBMS we seek to maximise:

• Locality of reference (letting local apps hit local data)

• Reliability and Accessibility (by strategically replicating the data)

• System Performance (by avoiding over/under utilisation of resources)

When designing a DDBMS we seek to minimise:

• Storage costs (these affect the replication strategy)

• Communication Costs (taking into account consistency maintenance)

Design Strategies with which to approach the problem:

• Centralised Data Storage – this is not a DDB at all. No replication so no additional storage costs.

• Fragmented Data Storage – no replication, but the data is split up and distributed. If done correctly, locality of
reference and performance are very high, and storage and communication costs are low. Reliability and
accessibility are only moderate.

• Fully Replicated Storage – every site hosts a full copy of the DB. Storage costs are very high; read
transactions see improved performance and low communication costs, but write transactions see poor
performance and high communication costs.

• Partially Replicated Storage – a combination of the above methods, and the most sensible choice if done
right: high locality of reference, high reliability and accessibility, good all-round performance, acceptable
storage costs and low communication costs.

Fragmentation:

• There are a few reasons why it makes sense to fragment data:

• By keeping data not required by local apps separate, we improve security

• By maximising locality of reference, we improve efficiency

• Since most apps only use subsets of relations, it makes sense to break the relations into subsets
for storage across the network.

• Done right, we open the door for parallel processing

• There are drawbacks, like increased integrity administration and performance hits for poorly fragmented
data sets

• For a fragmentation effort to be viable: it must be complete (every item in a relation must appear in at least
one fragment of that relation), functional dependencies must be preserved, and (other than primary keys)
fragments should be disjoint.
• Types of Fragmentation (a short SQL sketch follows this list):

• Horizontal – break up relation into subsets of the tuples in that relation, based on a restriction on
one or more attributes. E.g. – we could break up a table with student info into one subset for
undergrads and one subset for postgrads.

• Vertical – breaking up a relation into subsets of attributes. E.g. – breaking up a hypothetical student
table into grade/course related columns and contact/personal related columns.

• Mixed – fragments the data multiple times, in different ways. We could do our postgrad/undergrad
split and then our grades/course split to each of the fragments

• Derived – fragmenting a relation to correspond with the fragmentation of another relation upon
which the first relation depends in some way.
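
To ground these types, here is a hedged SQL sketch of the horizontal and vertical examples above, assuming a hypothetical Student(studentId, name, email, level, grade) relation:

    -- Horizontal fragmentation: restrict on the "level" attribute.
    CREATE TABLE Student_UG AS
      SELECT * FROM Student WHERE level = 'undergrad';
    CREATE TABLE Student_PG AS
      SELECT * FROM Student WHERE level = 'postgrad';
    -- Completeness/reconstruction: the fragments recombine by union:
    --   SELECT * FROM Student_UG UNION SELECT * FROM Student_PG;

    -- Vertical fragmentation: project subsets of attributes, repeating
    -- the primary key in each fragment (the allowed overlap) so the
    -- relation can be rebuilt by a join.
    CREATE TABLE Student_Academic AS
      SELECT studentId, grade FROM Student;
    CREATE TABLE Student_Contact AS
      SELECT studentId, name, email FROM Student;
    -- Reconstruction: join the fragments on the shared primary key:
    --   SELECT * FROM Student_Academic JOIN Student_Contact USING (studentId);

Mixed fragmentation would simply apply the vertical split to each of Student_UG and Student_PG in turn.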

Transparency:

• Distribution Transparency allows users to ignore the physical fragmentation of data, to varying degrees (the query forms are collected in a sketch after this section):

• Frag. Transparency is high level transparency where a user could write “SELECT * FROM Student
WHERE year = 2” without needing to specify what fragment of the Student relation contains the
data, nor where that fragment is stored.

• Location Transparency – mid level transparency where a user would need to write “SELECT *
FROM S14 WHERE year = 2” where S14 is the relevant fragment of the Student relation, but still
wouldn’t need to say where the fragment is stored.

• Local Mapping Transparency – low level transparency where a user would need to write “SELECT
* FROM S14 AT SITE 7 WHERE year = 2” where S14 is the relevant fragment of the Student
relation and SITE 7 is where the fragment is physically located.

• Distribution transparency is supported by a database name server which aliases unique database
object identifiers with user friendly names.

• Transaction transparency ensures integrity and consistency vis-à-vis multi-site transactions,
concurrent users and DB failure.

• Local transactions and remote single-site transactions are handled without additional difficulty, but
multi-site transactions must be broken into subtransactions (one for each site) while preserving the
independence, atomicity and durability guarantees of a centralised DBMS.

• Performance Transparency simply means that a DDBMS performs at the same level as a normal DBMS.

• This puts a lot of burden on the distributed query processor, which decides what fragment to hit,
which copy (if replicated), and which location to use, as well as calculating I/O time, CPU time and
communication costs.
• DBMS Transparency means that a heterogeneous DDBMS will behave like a homogeneous
DDBMS.
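
Putting the three distribution-transparency levels side by side, with the fragment name S14 and SITE 7 taken from the examples above (the AT SITE clause is illustrative notation from these notes, not standard SQL):

    -- Fragmentation transparency: query the global relation; the DDBMS
    -- works out the fragments and sites itself.
    SELECT * FROM Student WHERE year = 2;

    -- Location transparency: name the fragment, but not the site.
    SELECT * FROM S14 WHERE year = 2;

    -- Local mapping transparency: name both fragment and site.
    SELECT * FROM S14 AT SITE 7 WHERE year = 2;

    -- Internally, the fragmentation-transparent form expands to a union
    -- over every horizontal fragment that could hold qualifying tuples,
    -- e.g. (the second fragment name S15 is hypothetical):
    SELECT * FROM S14 WHERE year = 2
    UNION
    SELECT * FROM S15 WHERE year = 2;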

Replication:

• Advantages – better access and reliability, improved performance

• Disadvantages – storage and consistency


• Replication Options (both are sketched after this list):

• Synchronous Updates – all copy updates are part of one transaction's commit phase – a lot of
administrative overhead, communication cost and opportunity for failure, but often necessary

• Asynchronous Updates – periodic updates of all copies based on one master copy – violates the
idea of data independence, but can be useful in situations where the cost of synchronous updates is
unwarranted.
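
A minimal sketch of the two options for a relation replicated at two sites, with all copy names hypothetical; in a real DDBMS the synchronous case would run the per-site updates under two-phase commit rather than in one local transaction:

    -- Synchronous: every copy changes inside one transaction's commit
    -- phase; the commit succeeds only if all copies can commit.
    BEGIN;
    UPDATE Student_copy_site1 SET email = 'x@y.z' WHERE studentId = 42;
    UPDATE Student_copy_site2 SET email = 'x@y.z' WHERE studentId = 42;
    COMMIT;

    -- Asynchronous: only the master changes now; each replica is
    -- refreshed later from the master on a schedule.
    UPDATE Student_master SET email = 'x@y.z' WHERE studentId = 42;
    -- ...periodic refresh job at each slave site:
    DELETE FROM Student_replica;
    INSERT INTO Student_replica SELECT * FROM Student_master;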

• What we expect from replication management:

• Copy data, either synchronously or asynchronously, from one DB to another

• Scalability and mappability (in heterogeneous environments)

• Replication of procedures, indexes, schema etc.

• Tools for DBAs to manage replication

• Replicated data ownership models:

• Master/slave: publish-and-subscribe model – authoritative changes are made only at the master site
and published to the slaves asynchronously

• Workflow: like M/S, but Master status moves from site to site depending on the task at hand

• Update-Anywhere: Symmetric, shared write authority for all replicas.

• Synchronisation: replicas can be kept up to date using regularly scheduled “snapshots” of the master
data, or database triggers (when X happens, do Y); a trigger-based sketch follows
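
As a sketch of the trigger option (MySQL-style syntax; all names hypothetical): each write to the master is recorded in a change log, and a refresh job later ships just those rows to the replicas instead of re-copying the whole relation:

    -- Change log populated by a trigger on the master copy.
    CREATE TABLE student_changes (
      studentId  INT,
      changed_at TIMESTAMP,
      operation  CHAR(1)           -- 'I', 'U' or 'D'
    );

    CREATE TRIGGER log_student_update
    AFTER UPDATE ON Student_master
    FOR EACH ROW
      INSERT INTO student_changes VALUES (NEW.studentId, NOW(), 'U');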

