
Self Organised-Distributed Database Systems

[Using Swarm Intelligence]


Vamsi Krishnam Raju Chiluvuri
Department of Computer Science
California State University
Fullerton, CA
vamsyraju@gmail.com

ABSTRACT
This paper discusses distributed database systems, swarm intelligence, and how swarm intelligence techniques can be used in distributed database systems.

General Terms
Distributed database, swarm intelligence

Keywords
Distributed database, swarm intelligence, replication, concurrency control

1. INTRODUCTION
Distributed databases are currently a major focus of research in the area of databases. Early databases in the seventies and early eighties moved towards centralization, which resulted in large monolithic databases. Since then, the trend has reversed towards decentralization and autonomy of processing. With the advances in distributed processing and distributed computing, the database research community performed considerable work to address issues of data distribution, distributed query, and transaction processing. A few prototypes have been implemented; however, a full-scale, comprehensive distributed database management system (D-DBMS) that implements the functionality and techniques proposed in distributed database (D-DB) research never emerged as a commercially viable product. Most major vendors redirected their efforts to developing systems based on client-server architecture or towards developing technology for accessing distributed heterogeneous data sources.

Swarm intelligence is a relatively new field in computer science. It deals with collective behavior in decentralized, self-organized systems. The theory behind swarm intelligence comes from biological swarm behavior. In nature, simple creatures like ants and bees follow some basic rules when living in a colony. Each creature acts only upon the tiny bit of information available to it. No single individual sees the overall picture. The creatures act as peers with little hierarchical chain of command and usually communicate only basic information about the environment. No individual instructs others, yet together they are able to function well and achieve some impressive results. Swarm intelligence can be used to find good solutions to many classic computer science problems, such as the Traveling Salesman Problem.

In this paper, we discuss distributed database systems and swarm intelligence in detail. The paper is divided into five sections. Section 2 describes distributed database systems. Section 3 gives an overview of swarm intelligence. In Section 4 we discuss how swarm intelligence techniques can be used in distributed database systems. Section 5 concludes the paper with a summary and future work.

2. DISTRIBUTED DATABASE SYSTEMS
A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system is then defined as the software system that permits the management of the distributed database and makes the distribution transparent to the users. [11] The terms DDBMS and DDBS are often used interchangeably. Assumptions regarding the system that underlie these definitions are: [8]

• Data is stored at a number of sites. Each site is assumed to logically consist of a single processor. Even if some sites are multiprocessor machines, the distributed DBMS is not concerned with the storage and management of data on this parallel machine.
• The processors at these sites are interconnected by a computer network rather than a multiprocessor configuration. The important point here is the emphasis on loose interconnection between processors which have their own operating systems and operate independently. Even though multiprocessor architectures are quite similar to loosely interconnected distributed systems, they have different issues to deal with (e.g., task allocation, migration, load balancing).
• The DDB is a database, not some collection of files that can be individually stored at each node of a computer network. This is the distinction between a DDB and a collection of files managed by a distributed file system.
• The system has the full functionality of a DBMS. It should provide functions such as query processing and structured organization of data.

Figure 1 shows an example of a simple distributed database system. We can see six sites with their own databases, which are connected through a common communication network.

Figure 1: Simple Distributed database system

2.1 Advantages of Distributed Databases:
• Many real world organizations can be represented more naturally.
• It greatly increases scalability. For example, the size of the DB can be increased incrementally.
• It protects vital data by replicating it.
• It increases availability and reliability because there is no single point of failure.
• It increases the performance of the system because local processing speeds up query processing.
• It provides local autonomy because the local organization is fully responsible for the accuracy and safety of its own portion of the DB.

2.2 Distributed Database Issues:
There are three important issues with distributed database systems: replication, concurrency control, and query optimization.

2.2.1 Replication
Replication plays a major role in increasing the performance of a distributed database. In a distributed database, data is distributed across many sites and every site has its own local database, but the database management system operates on all the databases simultaneously. To improve system reliability and performance, data items are therefore stored redundantly at multiple sites. Another advantage is fault tolerance: when a node crashes, the replica at another site can be used. Replication increases scalability too; for example, clusters can be used instead of mainframes. With the ever increasing number of mobile users and data warehouses, replication is almost a must.

There are many approaches to replication, most of them solving certain classes of problems. Figure 2 shows the different approaches on a scale. The classes of technologies on the right-hand side are appropriate for supporting operational systems that need real-time transaction processing. The classes of technologies on the left-hand side are well suited for supporting decision making, browsing, and research on LAN-based PCs or other platforms. [9]

Figure 2: Replication approaches

Replication Problems

1. How to replicate data?
It can be done synchronously or asynchronously, depending on where the updates take place.

2. When to replicate data?
It depends on the data transfer patterns in the network. For example, a particular site may use a particular piece of data more often, so a replica near that site will be very useful.

3. Where to update?
There are two ways it can be done.
• The first approach is Update Everywhere, i.e., execute update transactions at every site. The advantage of this approach is simplicity, but it is costly.
• The second approach is Update Only Primary Copies. Primary copies are master copies, whereas secondary (slave) copies are read-only. Figure 3 shows how this is done. W(X) is the update operation. In the Update Everywhere approach all the replicas are updated. In the Primary Copy approach all the read-only replicas are left alone.

Figure 3: Approaches for submitting updates (Update Everywhere: W(X) is applied at every replica; Primary Copy: W(X) is applied at the primary copy and the other copies are read-only)

4. When to update?
There are two approaches for this.
• The first approach is Eager. It means updating within the boundaries of the transaction, i.e., transactions usually terminate with a two-phase protocol.
• The second approach is Lazy. It means updating the replicas only after the transaction commits. The disadvantage of this approach is that it can lead to inconsistency. Figure 4 shows the two approaches in detail. In the Eager approach, updates are propagated before the commit occurs, whereas in the Lazy approach updates are propagated only after the commit.

Figure 4: Approaches for propagating updates (Eager: 1) T: W(X), 2) propagate W(X), 3) commit; Lazy: 1) T: W(X), 2) commit, 3) propagate W(X))

2.3 Concurrency Control
Concurrency control is one of the correctness criteria for replicated databases. A replicated database system that achieves replication and concurrency control has the same input/output behavior as a centralized, one-copy database system that executes user requests one at a time. [13] The concurrency control problem is exacerbated in a distributed database because:

• Users may access data stored on different computers in a distributed system.
• A concurrency control mechanism at one computer cannot instantaneously know about interactions at other computers.

Concurrency control has been actively investigated for the past several years, and the problem for non-distributed DBMSs is well understood. The two-phase locking protocol is accepted as a standard solution. [2]

2.3.1 Two-phase locking protocol:
Two-phase locking, as described in [1], is a process used to gain ownership of shared resources without creating the possibility of deadlock. The technique is extremely simple, and breaks up the
modification of shared data into "two phases", which is what gives the process its name. [1]

There are actually three activities that take place in the "two phase" update algorithm:

1. Lock Acquisition
2. Modification of Data
3. Release Locks

The modification of data and the subsequent release of the locks that protected the data are generally grouped together and called the second phase.

In this scheme, deadlock is prevented in distributed systems by releasing all the resources a process has acquired whenever it is not possible to obtain all the required resources without waiting for another process to finish using a lock. This means that no process is ever in a state where it is holding some shared resources while waiting for another process to release a shared resource which it requires, so deadlock cannot occur due to resource contention. The resource (or lock) acquisition phase of a "two phase" shared data access protocol is usually implemented as a loop within which all the locks required to access the shared data are acquired one by one. If any lock is not acquired on the first attempt, the algorithm gives up all the locks it had previously been able to get and starts trying to get all the locks again.

2.4 Query Optimization
Queries in distributed database systems often cannot be answered by a single local unit. An aggregate of data spanning different databases in a network is needed. There are often many ways to do this; the goal is to find the best one. [4]

2.4.1 Centralized Vs Distributed Query Processing:
In centralized query processing, the number of I/O operations and the CPU usage needed to process the query are the main concerns. In distributed query processing, the amount of data transmitted between the sites is also an important concern. Two new operators, send and receive, are included in distributed query processing. These operators are used for transferring data between sites. The other important difference is the heterogeneity of data formats and data models in distributed databases. In distributed databases the data is replicated in various locations to increase performance. This leads to more complex problems when processing queries. The use of resource vectors, interconnect matrices, and caching in a distributed environment can make a huge difference.

2.4.2 Query Processing Objectives:
The cost function of a distributed query is

I/O cost + CPU cost + Communication cost.

The minimization of this function is the main objective of distributed query processing. These costs may have different weights in different distributed environments. For example, the communication cost will dominate in a wide area network, whereas in a local area network it may be negligible.

3. SWARM INTELLIGENCE
We face dynamic optimization problems in almost all fields. Even with today's ever increasing computing power, some of these problems are still hard to solve. In most cases, solving these problems does not mean finding the exact extrema, but finding something that is as close as possible. Recently, scientists have been turning to insects like ants and bees for ideas to solve such problems. This form of artificial intelligence, based on the collective behavior of decentralized, self-organized systems, is called swarm intelligence.

A single ant or bee isn't smart, but their colonies are. The study of swarm intelligence is providing insights that can help humans manage complex systems, from truck routing to military robots. [9]

Following a trail of insects as they work together to accomplish a task offers unique possibilities for problem solving. [15]

Swarm intelligence algorithms can be divided into two classes:

• Pheromone based navigational algorithms inspired by biological ant-colony behavior.
• Non-pheromone based navigational algorithms inspired by biological bee-colony behavior.

3.1 Pheromone based Algorithms
Let us look at the collective behavior of ants. The objectives of ants are very simple: finding food and building a nest. To achieve these, every single ant follows a simple set of rules. No one is in charge; no one knows the complete picture. Despite this, they achieve some extraordinary solutions to problems like finding the shortest path to food, allocating workers to different tasks, and defending their nests from predators. Now let us look at one of the simple day-to-day tasks of an ant, finding the shortest path from the nest to food, which is analogous to shortest-path problems like the traveling salesperson problem. [5] Whenever an ant brings food to the nest, it leaves a trail of a chemical called pheromone. As more and more ants go through this trail, the track gets reinforced; if no ant uses the trail for some time, the chemical slowly evaporates (deleting the trail, in computer science terms). Suppose half of the ants choose the long path and the other half choose the shorter path. The shorter path will have a more intense pheromone trail than the longer one, because ants complete the shorter path in less time, so it is reinforced more often, while the pheromone on the longer path has more time to evaporate between visits. In the next rounds the ants prefer the path with the more intense pheromone. So after a while all the ants choose the shorter path. This whole process is shown in Figures 5, 6 and 7.

Figure 5: At the start
Figure 6: After some time
Figure 7: By the end

3.2 Non-Pheromone based Algorithms:
These are also called bee colony algorithms. Bee colony is the area of swarm intelligence which studies (1) the behavior of bees and similar behavior in other insects, and (2) the applicability of the underlying principles. Non-pheromone based algorithms are: [7]

• Significantly more efficient when finding and collecting food, i.e., they use fewer iterations to complete the task.
• More scalable, i.e., they require less computation time to complete the task, even though in small worlds pheromone-based algorithms are faster on a time-per-iteration measure.
• Less adaptive than pheromone-based algorithms.

Honeybee foraging behavior consists of two behaviors:

– Recruit behavior
– Navigation behavior

In order to recruit members of the colony for food sources, honeybees inform their nest mates of the distance and direction of these food sources by means of a waggling dance performed on the vertical combs in the hive. [6] This dance is actually the bee language. It consists of a series of alternating left and right loops, interspersed with a segment in which the bee waggles her abdomen from side to side. The duration of the waggle phase is a measure of the distance to the food, and the angle between the sun and the axis of the waggle segment on the vertical comb represents the azimuthal angle between the sun and the direction in which the recruit should fly to find the target. [7] Figure 8 shows this process.

Figure 8: Distance and direction by waggling dance

The advertisement for a food source can be adopted by other members of the colony. The decision mechanism by which a potential recruit adopts an advertised food source location is not completely known. It is considered that recruitment amongst bees is always a function of the quality of the food sources.
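
The pheromone mechanism of Section 3.1 (reinforcement by passing ants, gradual evaporation, preference for the stronger trail) can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not an algorithm taken from the cited work: the function name and all parameters are hypothetical, ants are modelled by their expected share per path rather than individually, and the deposit per round is scaled by 1 / path length to model that ants on a shorter path complete more trips in the same time.

```python
def simulate_two_paths(short_len=1.0, long_len=2.0, rounds=50,
                       evaporation=0.1, deposit=1.0):
    """Return the share of ants on the short path at the start of each round.

    Hypothetical mean-field model: instead of simulating individual ants,
    each path receives the expected fraction of the colony.
    """
    pheromone = {"short": 1.0, "long": 1.0}    # both trails equal at the start
    length = {"short": short_len, "long": long_len}
    shares = []
    for _ in range(rounds):
        total = pheromone["short"] + pheromone["long"]
        # Ants choose a path in proportion to its pheromone intensity.
        share = {path: pheromone[path] / total for path in pheromone}
        shares.append(share["short"])
        for path in pheromone:
            # Evaporation acts at the same rate on both trails.
            pheromone[path] *= 1.0 - evaporation
            # Assumption: a shorter path is traversed more often per round,
            # so it receives proportionally more new pheromone.
            pheromone[path] += share[path] * deposit / length[path]
    return shares


if __name__ == "__main__":
    history = simulate_two_paths()
    print(f"start: {history[0]:.2f}, end: {history[-1]:.2f}")
```

With the default values the colony starts evenly split and the share on the short path grows every round, which mirrors the progression from Figure 5 to Figure 7. If both paths have the same length, neither trail gains an advantage in this model.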

4. METHODOLOGY
4.1 Agent Based Systems:
Swarm intelligence has been widely used in various distributed environments, for example in telecommunications to solve routing problems [10] and load balancing [14]. It has also been used in distributed pattern detection and classification [3] and in robotics [12]. In this paper we address the problem of replication management using bio-inspired algorithms. The interesting feature of these bio-inspired algorithms is that they solve complex problems by following a simple set of rules at the level of the individual.

These biological swarm systems can be emulated quite naturally in any distributed system through the multi-agent paradigm. The main advantages of using such systems are:

1. Self organization:
All decisions are made based on local information.

2. Adaptability:
They can adapt themselves to a dynamic environment.

3. Stigmergy:
It is the mechanism of spontaneous, indirect coordination between agents or actions, where the trace left in the environment by an action stimulates the performance of a subsequent action, by the same or a different agent.

4.2 Replication Issues:
Issue 1: In a distributed database system, if data items are far away in the network, then the system's performance will be affected, because remote access adds network load, which causes a decline in overall performance. So a system should have replicas of its most accessed data items nearby.

Issue 2: Overdoing replication is also a major problem, because over-replication consumes memory resources. So when a system no longer uses a replica, it should be deleted.

Issue 3: In distributed database systems a query will often operate on more than one data item. If such operations are very common on a set of data items that are not on a single site, then replicating a merged copy of those data items can considerably increase the performance of the system.

Issue 4: Maintaining consistency is the main issue when it comes to replication. Whenever a copy is updated, all the replicas should remain consistent.

4.3 Agent
An agent is a piece of code. It is the mobile agent we use in our swarm approach to distributed database systems. There are two types of agents:

1. Local Agent
2. Global Agent

The local agent makes replicas. The global agent is responsible for maintaining consistency among the replicas. So the local agent deals with issues 1, 2 and 3, whereas the global agent deals with issue 4 described above.

4.3.1 Local Agent
A local agent is generated whenever a query is executed. Every agent has an ID which is unique for each source, i.e., all the agents from the same source have the same ID. Every time a data item is accessed, the local agent checks whether it has to be replicated. The condition on which this check is based is discussed below. For now, assume that the agent knows it.

There are two possible cases:

• Yes, it has to be replicated.
• No, leave it as it is.

If it has to be replicated, then a replica of the data item is generated and transferred to the source site. Otherwise no action takes place.

Figure 9 shows this process in a simple distributed database.

Figure 9: Working of a local agent (1. Query executed, local agent is generated; 2. Data item is accessed, condition checked; 3. Data item is replicated)
How does the agent know whether to replicate or not?

Every time a data item is accessed, the agent leaves a time stamp there, together with its agent ID. So when the data item is accessed, the agent generates a new time stamp and then searches for an earlier time stamp with its agent ID. It then computes the difference between the two timestamps and compares it with X, which is a constant. If the difference is less than X, it means the data item was accessed recently. Then it increments a count stored there with its agent ID. After this it checks the count: if the count is greater than Y (which is another constant), then it replicates. Otherwise it only increments the count.

This count is reset after some interval of time.
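
The decision rule described above can be sketched in a few lines. This is a minimal sketch, not the author's implementation: the names `AccessNote`, `DataItem`, `should_replicate` and `reset_counts` are hypothetical, timestamps are plain numbers, and two details that the text leaves open are filled in as labelled assumptions (the first access by a source only leaves a timestamp, and an access that is not recent leaves the count unchanged).

```python
from dataclasses import dataclass, field


@dataclass
class AccessNote:
    """What agents from one source leave behind on a data item."""
    last_seen: float   # timestamp left by the previous agent with this ID
    count: int = 0     # number of recent accesses recorded for this ID


@dataclass
class DataItem:
    notes: dict = field(default_factory=dict)   # agent_id -> AccessNote


def should_replicate(item, agent_id, now, x, y):
    """Apply the X/Y rule for one access; return True if a replica is due."""
    note = item.notes.get(agent_id)
    if note is None:
        # Assumption: the first access only leaves a timestamp.
        item.notes[agent_id] = AccessNote(last_seen=now)
        return False
    recent = (now - note.last_seen) < x   # compare timestamp difference with X
    note.last_seen = now                  # leave the new timestamp
    if not recent:
        # Assumption: a non-recent access does not change the count.
        return False
    note.count += 1
    return note.count > y                 # compare count with Y


def reset_counts(item):
    """Periodic reset of all counts, as mentioned in the text."""
    for note in item.notes.values():
        note.count = 0


if __name__ == "__main__":
    item = DataItem()
    for t in (0, 1, 2, 3):
        print(t, should_replicate(item, "site-A", now=t, x=10, y=2))
```

With X = 10 and Y = 2, four closely spaced accesses from the same source are needed before the rule asks for a replica: the first only leaves a timestamp, and the count must exceed Y.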

How to calculate X and Y?

The value of X is the constant with which we compare the timestamp difference. This value depends on the activity in the distributed database. If the value of X is too low, it decreases the number of replicas; if the value is too high, it increases the number of replicas, as shown in the graph in Figure 10.

The value of Y is the constant with which we compare the count. This value depends on the structure of the distributed database. If the value of Y is too low, it increases the number of replicas; if it is too high, it decreases the number of replicas, as shown in the graph in Figure 10.

Figure 10: Graphs showing the effect of constants X and Y on the number of replicas (panels: value of X versus number of replicas; value of Y versus number of replicas)

So using the two constants X and Y we can fine tune the replication management to suit the requirements of the system.

Deleting old replicas:

Replicas which are no longer used should be deleted, so that memory is not wasted. Every site takes care of the replicas which are stored in it. It checks the usage of replicas at certain intervals of time; if it sees that a replica is no longer used, it deletes it. This check also uses the timestamps of the recently visited agents.
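
The periodic clean-up can be sketched as a simple scan over the timestamps left by visiting agents. The text does not say how long a replica may stay unused before it counts as "no longer used", so the threshold `max_idle` below is an assumption, as are the function name and the dictionary layout.

```python
def purge_stale_replicas(replicas, now, max_idle):
    """Delete replicas not visited by any agent within max_idle time units.

    replicas maps a replica ID to the timestamp of the most recent agent
    visit (hypothetical layout). Returns the IDs that were deleted.
    """
    stale = [replica_id for replica_id, last_visit in replicas.items()
             if now - last_visit > max_idle]
    for replica_id in stale:
        del replicas[replica_id]
    return stale


if __name__ == "__main__":
    site_replicas = {"r1": 100.0, "r2": 10.0, "r3": 95.0}
    print(purge_stale_replicas(site_replicas, now=110.0, max_idle=50.0))
    print(sorted(site_replicas))
```

Each site would run such a check on its own replicas at its chosen interval, which keeps the decision local, in line with the self-organization property listed in Section 4.1.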

4.3.2 Global Agent

A global agent is generated whenever an update transaction occurs on a replica or the master copy. It maintains consistency by propagating updates to all replicas. This process works in conjunction with locking protocols such as the two-phase locking protocol. Figure 11 shows this process in a simple distributed database.
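
One way a global agent could combine update propagation with the all-or-nothing lock acquisition described in Section 2.3.1 is sketched below. This is an in-process illustration under stated assumptions, not the paper's implementation: replicas are plain Python objects guarded by `threading.Lock` rather than remote sites, and the names `Replica`, `try_lock_all`, `propagate_update` and the `max_attempts` limit are hypothetical.

```python
import threading


class Replica:
    """Stand-in for a copy of the data held at one site (hypothetical)."""

    def __init__(self, site):
        self.site = site
        self.data = {}
        self.lock = threading.Lock()


def try_lock_all(replicas):
    """Phase 1: take every lock without waiting, or give up all of them."""
    held = []
    for replica in replicas:
        if replica.lock.acquire(blocking=False):
            held.append(replica)
        else:
            # A lock is busy: release everything acquired so far.
            for other in held:
                other.lock.release()
            return False
    return True


def propagate_update(replicas, key, value, max_attempts=100):
    """Apply one update to all replicas; return False if locks stay busy."""
    for _ in range(max_attempts):
        if try_lock_all(replicas):
            try:
                # Phase 2: modify the data, then release the locks.
                for replica in replicas:
                    replica.data[key] = value
            finally:
                for replica in replicas:
                    replica.lock.release()
            return True
    return False


if __name__ == "__main__":
    sites = [Replica(name) for name in ("A", "B", "C")]
    print(propagate_update(sites, "x", 42), [r.data for r in sites])
```

The retry loop here simply tries again immediately; a real system would probably wait or back off between attempts, and would have to handle site failures, which this sketch ignores.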
Figure 11: Global Agent (1. Query executed; 2. Replica updated; 3. Global agent generated; 4. Global agent maintains consistency among replicas)

5. CONCLUSIONS AND FUTURE WORK
In this paper we started with a discussion of replication management, its problems, and why it is important. We followed it with a discussion of swarm intelligence basics. Then we discussed the implementation of some swarm methods for replication management. We discussed the local agents and global agents and their functions. We also discussed the effect of the values of X and Y on the number of replicas.
There is a lot of scope for future work in this area. An implementation of this approach is yet to be tried. The constants X and Y need to be formulated based on the properties of the system. A program that can generate these constants automatically would be very useful.

6. REFERENCES
[1] Arnold, P. Two phase locking protocol, March 2009. Good read with overview of two phase locking protocol.
[2] Bernstein, P. A., and Goodman, N. Concurrency control in distributed database systems. ACM 13 (1981), 37.
[3] Brueckner, S., and Parunak, H. V. D. Swarming agents for distributed pattern detection and classification. pages (forthcoming).
[4] Cem Evrendilek, Asuman Dogac, F. O. Multidatabase query optimization. Kluwer 39 (1997), 27.
[5] Dorigo, M., and Gambardella. Ant colonies for the traveling salesman problem. ACM 43 (1997), 13.
[6] Frisch, K. The dance language and orientation of bees. ACM 5 (1967), 14.
[7] Karl Tuyls, A. N. A bee algorithm for multi-agent systems. ACM 7 (2000), 5.
[8] M. Tamer Özsu, P. V. Distributed database systems: Where are we now? IEEE 18 (2001), 19.
[9] Miller. National Geographic swarm theory. Web site, July 2007.
[10] Ns, T. I. Swarm intelligence and problem solving in.
[11] Ozsu, M. T., V. Principles of distributed database systems. Paperback, 1991.
[12] Pettinaro, G. C., Kwee, I. W., Gambardella, L. M., Mondada, F., and Louis Deneubourg, J. Swarm robotics: A different approach to service robotics. 71–76.
[13] Philip A. Bernstein, N. G. An algorithm for concurrency control and recovery in replicated distributed databases. ACM 9 (1984), 20.
[14] Schoonderwoerd, R., Holland, O., Bruten, J., and Rothkrantz, L. Ant-based load balancing in telecommunications networks.
[15] Tarasewich, P., and McMullen, P. R. Swarm intelligence, power in numbers. ACM 45, No 8 (Aug 2002), 6.

APPENDIX
A. HEADINGS IN APPENDICES
A.1 Introduction
A.2 Distributed Database Systems
A.2.1 Advantages of Distributed Databases
A.2.2 Distributed Database Issues
Replication
Concurrency Control
Query Optimization
A.3 Swarm Intelligence
A.3.1 Pheromone based Algorithms
A.3.2 Non-Pheromone based Algorithms
A.4 Methodology
A.4.1 Agent Based Systems
A.4.2 Replication Issues
A.4.3 Agent
A.5 Conclusion and Future Work
A.6 References
