
Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms

Authors: Ozalp Babaoglu and Keith Marzullo


Distributed Systems: 526 U1580
Professor: Ching-Chi Hsu

1
Introduction

• Many problems in distributed computing can be cast as executing some notification or reaction when the state of the system satisfies a particular condition
• Global Predicate Evaluation (GPE): establishing the truth of a Boolean expression whose variables may refer to the global system state
• A global state constructed by an observer may not be consistent
• Asynchronous system:
  ◦ No bounds on the relative speeds of processes or on message delays
  ◦ Impossible to maintain synchronized local clocks
  ◦ Communication remains the only possible mechanism for synchronization
  ◦ Channels are reliable but may deliver messages out of order

2
Outline

• Two classes of solutions to the GPE problem:
  ◦ A reactive architecture: each process, when executing an event, notifies the monitor $p_0$ by sending it a message describing the event
  ◦ A snapshot architecture: the monitor $p_0$ sends each process a "state enquiry" message

3
Definitions (1)

• Distributed system: a collection of sequential processes $p_1, p_2, \ldots, p_n$ connected by unidirectional communication channels
• Events: the activity of each sequential process consists of events, which are either internal events or communication events, $send(m)$ or $receive(m)$, with another process
• Local history of process $p_i$: $h_i = e_i^1 e_i^2 \ldots$
• Global history: $H = h_1 \cup h_2 \cup \cdots \cup h_n$
• Cause-effect relation $\to$:
  ◦ If $e_i^k, e_i^l \in h_i$ and $k < l$, then $e_i^k \to e_i^l$
  ◦ If $e_i = send(m)$ and $e_j = receive(m)$, then $e_i \to e_j$
  ◦ If $e \to e'$ and $e' \to e''$, then $e \to e''$
• Concurrent events, $e \parallel e'$: neither $e \to e'$ nor $e' \to e$

4
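Definitions (2)

• Distributed computation: a partially ordered set defined by the pair $(H, \to)$
• Space-time diagram: a graphical representation of a distributed computation

[Figure: space-time diagram of a computation with three processes; $p_1$ executes $e_1^1, \ldots, e_1^6$, $p_2$ executes $e_2^1, e_2^2, e_2^3$, and $p_3$ executes $e_3^1, \ldots, e_3^6$]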

5
Definitions (3)

• The local state of $p_i$ immediately after executing event $e_i^k$ is denoted by $\sigma_i^k$
• Global state: $\Sigma = (\sigma_1, \ldots, \sigma_n)$
• A cut $C = (c_1, \ldots, c_n)$ is a subset of the global history $H$ that contains an initial prefix of each of the local histories, i.e. $C = h_1^{c_1} \cup \cdots \cup h_n^{c_n}$
• A run $R$ is a total ordering of all events in $H$ that is consistent with each local history
  ◦ Example: pp6
• Note that a single distributed computation may have many runs

6
Example

• Inconsistent cut and phantom deadlock

[Figure: the space-time diagram of $p_1$, $p_2$ and $p_3$ annotated with req and resp messages and with two cuts, $C$ and $C'$]

7
Consistency

• A cut $C$ is consistent if, for all events $e$ and $e'$: $(e \in C) \land (e' \to e) \Rightarrow e' \in C$
• A consistent global state is one that corresponds to a consistent cut
• A run $R$ is consistent if, for all events $e$ and $e'$: $(e \to e') \Rightarrow e$ appears before $e'$ in $R$
  ◦ Example: pp6
• If a run is consistent, then all the global states in the sequence it generates are consistent as well
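
The consistency condition can be sketched in code. This is a minimal illustration under assumptions that are not in the slides: a cut is a tuple of per-process prefix lengths, each message is described by a hypothetical `Message` record, and `is_consistent_cut` is an illustrative name. Because a cut already contains a prefix of every local history, it is enough to check that every receive event inside the cut has its send event inside the cut too.

```python
from typing import NamedTuple, Sequence


class Message(NamedTuple):
    """Hypothetical record of one message (names are illustrative)."""
    sender: int    # 0-based index of the sending process
    send_idx: int  # 1-based position of the send event in the sender's history
    receiver: int  # 0-based index of the receiving process
    recv_idx: int  # 1-based position of the receive event in the receiver's history


def is_consistent_cut(cut: Sequence[int], messages: Sequence[Message]) -> bool:
    """Return True if no message is received inside the cut but sent outside it."""
    for m in messages:
        received = m.recv_idx <= cut[m.receiver]
        sent = m.send_idx <= cut[m.sender]
        if received and not sent:
            return False
    return True


# One message: the first event of process 0 sends it, the first event of process 1 receives it.
msgs = [Message(sender=0, send_idx=1, receiver=1, recv_idx=1)]
print(is_consistent_cut((1, 1), msgs))  # True: send and receive are both included
print(is_consistent_cut((0, 1), msgs))  # False: the receive is included without its send
```

A cut that excludes a receive but includes its send, such as `(1, 0)` here, is still consistent: the message is simply in transit.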

8
Observing Distributed Computations

• The monitor $p_0$ assumes a passive role: it does not send any messages of its own
• The application processes notify $p_0$ by sending it a message whenever they execute an event
• The monitor $p_0$ constructs an observation of the underlying distributed computation from the events in the order their notifications arrive
• Due to the variability of message delays, an observation can correspond to a consistent run, an inconsistent run or no run at all
  ◦ $O_1 = e_2^1 e_1^1 e_3^1 e_3^2 e_3^4 e_1^2 e_2^2 e_3^3 e_1^3 e_1^4 e_3^5 \ldots$: not a run
  ◦ $O_2 = e_1^1 e_3^1 e_2^1 e_3^2 e_1^2 e_3^3 e_3^4 e_1^3 e_2^2 e_3^5 e_3^6 \ldots$: inconsistent run
  ◦ $O_3 = e_3^1 e_2^1 e_1^1 e_1^2 e_3^2 e_3^3 e_1^3 e_3^4 e_1^4 e_2^2 e_1^5 \ldots$: consistent run
• Message order can be restored by defining a delivery rule that decides when received messages are presented to the application process
9
FIFO delivery

• First-In-First-Out (FIFO) delivery: for all messages $m$ and $m'$ from $p_i$ to $p_j$,
  $send_i(m) \to send_i(m') \Rightarrow deliver_j(m) \to deliver_j(m')$
• FIFO delivery can be implemented by adding sequence numbers to messages
• While FIFO delivery is sufficient to guarantee that observations correspond to runs, it is not sufficient to guarantee that observations are consistent
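
A minimal sketch of FIFO delivery with sequence numbers follows. The class name `FifoReceiver`, the choice to start sequence numbers at 1, and the string process names are assumptions made for illustration; the slides only state that sequence numbers suffice.

```python
class FifoReceiver:
    """Buffers messages and delivers them per sender in sequence-number order."""

    def __init__(self):
        self.next_seq = {}  # sender -> next sequence number expected from it
        self.pending = {}   # sender -> {sequence number: payload} not yet delivered

    def receive(self, sender, seq, payload):
        """Accept one message and return the payloads that become deliverable."""
        self.pending.setdefault(sender, {})[seq] = payload
        expected = self.next_seq.get(sender, 1)  # assumption: numbering starts at 1
        delivered = []
        # Deliver the longest gap-free run starting at the expected number.
        while expected in self.pending[sender]:
            delivered.append(self.pending[sender].pop(expected))
            expected += 1
        self.next_seq[sender] = expected
        return delivered


r = FifoReceiver()
out = []
out += r.receive("p1", 2, "b")  # arrives early, held back
out += r.receive("p1", 1, "a")  # releases "a" and then "b"
out += r.receive("p1", 3, "c")
print(out)  # ['a', 'b', 'c']
```

The receiver only orders messages from the same sender, which is why FIFO delivery alone cannot rule out inconsistent observations across senders.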

10
Observing Distributed Computations
with Real-Time Clocks

• Environment:
  ◦ Message delays are bounded by $\delta$
  ◦ Channels are FIFO
  ◦ A global real-time clock exists
  ◦ Each message carries as its timestamp $RC(e)$, the value of the global real-time clock when event $e$ occurred
• DR1:
  ◦ At time $t$, deliver all received messages with timestamps up to $t - \delta$ in increasing timestamp order
• The observation is consistent iff the following is satisfied:
  ◦ Clock condition: $e \to e' \Rightarrow RC(e) < RC(e')$

11
Observing Distributed Computations
with Logical Clocks

• Environment:
  ◦ Channels are FIFO
  ◦ Communication is asynchronous
  ◦ Logical clocks are implemented
  ◦ Each message carries as its timestamp $LC(e)$, the value of the logical clock when event $e$ occurred
• DR2:
  ◦ Deliver all received messages that are stable at $p_0$ in increasing timestamp order
• Note: a message $m$ is stable at $p$ if no future message with timestamp smaller than $TS(m)$ can be received by $p$
  ◦ Given FIFO channels, $m$ is stable at $p_0$ once $p_0$ has received at least one message with timestamp greater than $TS(m)$ from every other process
12
Logical Clocks

• Each process $p_i$ maintains a local variable $LC_i$
• When a new event $e_i$ occurs, $p_i$ sets $LC_i$ to
  ◦ $LC_i + 1$ if $e_i$ is an internal or send event
  ◦ $\max\{LC_i, TS(m)\} + 1$ if $e_i = receive(m)$

[Figure: space-time diagram annotated with logical clock values; $p_1$: 1, 2, 4, 5, 6, 7; $p_2$: 1, 5, 6; $p_3$: 1, 2, 3, 4, 5, 7]
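
The update rule for logical clocks can be sketched as follows. The class name `LamportClock` and its method names are illustrative assumptions; the two methods correspond to the two cases of the rule (internal or send event, and receive event).

```python
class LamportClock:
    """Logical clock LC_i of a single process, starting at 0."""

    def __init__(self):
        self.value = 0

    def tick(self):
        """Internal or send event: LC_i becomes LC_i + 1; returns the new timestamp."""
        self.value += 1
        return self.value

    def receive(self, ts):
        """Receive event for a message with timestamp ts: LC_i becomes max(LC_i, ts) + 1."""
        self.value = max(self.value, ts) + 1
        return self.value


sender, receiver = LamportClock(), LamportClock()
t_send = sender.tick()             # send event, timestamp 1
receiver.tick()                    # two local events at the receiver
receiver.tick()
t_recv = receiver.receive(t_send)  # max(2, 1) + 1
print(t_send, t_recv)              # 1 3
```

In every case the receive timestamp exceeds the send timestamp, which is what makes the clock condition hold for message edges.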

13
Observing Distributed Computations
with Causal Delivery

• Causal Delivery (CD): for all messages $m$ and $m'$ delivered to the same process $p_k$,
  $send_i(m) \to send_j(m') \Rightarrow deliver_k(m) \to deliver_k(m')$
• If $p_0$ uses a delivery rule satisfying CD, then all of its observations will be consistent

14
Efficient Delivering

• To implement causal delivery, what is really needed is an effective procedure for deciding the following:
  ◦ Given events $e$, $e'$ that are causally related and their clock values, does there exist some other event $e''$ such that $e \to e'' \to e'$?
• Given $RC(e) < RC(e')$ (or $LC(e) < LC(e')$), it may be that $e \to e'$ or that $e \parallel e'$; all that is certain is $\neg(e' \to e)$
• These observations suggest a timing mechanism $TC$ whereby causal precedence relations between events can be deduced from their timestamps
• Strong Clock Condition: $e \to e' \equiv TC(e) < TC(e')$

15
Causal History (1)

• Causal history of event $e$: $\theta(e) = \{e' \in H \mid e' \to e\} \cup \{e\}$
• That is, $\theta(e)$ is the smallest consistent cut that includes $e$

[Figure: space-time diagram of $p_1$, $p_2$ and $p_3$ highlighting the causal history of event $e_1^4$]
16
Causal Histories (2)

• Maintaining causal histories
  ◦ Each process $p_i$ initializes a local variable $\theta_i$ to $\emptyset$
  ◦ Each message $m$ carries a timestamp $TS(m)$, which is the causal history of its send event
• Scheme
  ◦ If $e_i$ is an internal or send event, then $\theta_i = \{e_i\} \cup$ the causal history of the previous local event
  ◦ If $e_i$ is the receipt of message $m$ by process $p_i$ from $p_j$, then $\theta_i = \{e_i\} \cup$ the causal history of the previous local event of $p_i$ $\cup$ the causal history of the corresponding send event at $p_j$
• The strong clock condition is satisfied if clock comparison is interpreted as set inclusion
  ◦ $e \to e' \equiv \theta(e) \subset \theta(e')$ or, if $e \neq e'$, $e \to e' \equiv e \in \theta(e')$
• Problem: causal histories grow rapidly

17
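Vector Clocks

• The causal history of an event can be represented as a fixed-dimensional vector $VC(e)[1..n]$ rather than a set, where
  ◦ $VC(e)[i] = k$ iff $\theta_i(e) = h_i^k$, for $i = 1, 2, \ldots, n$, with $\theta_i(e) = \theta(e) \cap h_i$ the projection of $\theta(e)$ on $p_i$

[Figure: space-time diagram annotated with vector clock values]
  p1: (1,0,0) (2,1,0) (3,1,3) (4,1,3) (5,1,3) (6,1,3)
  p2: (0,1,0) (1,2,4) (4,3,4)
  p3: (0,0,1) (1,0,2) (1,0,3) (1,0,4) (1,0,5) (1,0,6)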

18
Maintaining Vector Clocks

• Each process $p_i$ maintains a local vector $VC_i[1..n]$
• Each message $m$ carries a timestamp $TS(m)$, which is the vector clock value $VC(e)$ of its send event $e$
• Scheme
  ◦ If $e_i$ is an internal or send event:
    $VC_i[i] = VC_i[i] + 1$, and $VC(e_i) = VC_i$
  ◦ If $e_i = receive(m)$:
    $VC_i = \max\{VC_i, TS(m)\}$ (componentwise), then
    $VC_i[i] = VC_i[i] + 1$, and $VC(e_i) = VC_i$
• For $j \neq i$, $VC(e_i)[j]$ is the number of events of $p_j$ that causally precede event $e_i$ of $p_i$
• $V < V' \equiv (V \neq V') \land (\forall k: 1 \le k \le n: V[k] \le V'[k])$

19
Properties of Vector Clocks

• Strong Clock Condition: $e \to e' \equiv VC(e) < VC(e')$
• Simple Strong Clock Condition: for events $e_i$ of $p_i$ and $e_j$ of $p_j$ with $i \neq j$, $e_i \to e_j \equiv VC(e_i)[i] \le VC(e_j)[i]$
• Concurrent: $e_i \parallel e_j \equiv (VC(e_i)[i] > VC(e_j)[i]) \land (VC(e_j)[j] > VC(e_i)[j])$
• Pairwise Inconsistent: for $i \neq j$, $e_i$ and $e_j$ are pairwise inconsistent iff $(VC(e_i)[i] < VC(e_j)[i]) \lor (VC(e_j)[j] < VC(e_i)[j])$
• Consistent Cut: the cut $(c_1, c_2, \ldots, c_n)$ is consistent iff $\forall i, j: 1 \le i, j \le n: VC(e_i^{c_i})[i] \ge VC(e_j^{c_j})[i]$
• Counting: the number of events that causally precede $e$ is given by $\#(e) = \sum_{j=1}^{n} VC(e)[j] - 1$
• Weak Gap-Detection: given $e_i$ and $e_j$, if $VC(e_i)[k] < VC(e_j)[k]$ for some $k \neq j$, then there exists an event $e_k$ such that $\neg(e_k \to e_i) \land (e_k \to e_j)$
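
The vector clock update rule and a few of its properties can be sketched as below. The class and function names are illustrative assumptions, processes are indexed from 0, and the message pattern in the usage part is inferred from the vector values in the vector clock figure rather than stated in the slides.

```python
from typing import List, Sequence


class VectorClockProcess:
    """Process p_i (0-based index i) among n processes, holding VC_i."""

    def __init__(self, i: int, n: int):
        self.i = i
        self.vc = [0] * n

    def local_or_send(self) -> List[int]:
        """Internal or send event: increment own component; returns VC(e_i)."""
        self.vc[self.i] += 1
        return list(self.vc)

    def receive(self, ts: Sequence[int]) -> List[int]:
        """Receive event: componentwise max with TS(m), then increment own component."""
        self.vc = [max(a, b) for a, b in zip(self.vc, ts)]
        self.vc[self.i] += 1
        return list(self.vc)


def less_than(v: Sequence[int], w: Sequence[int]) -> bool:
    """V < V': the vectors differ and every component of V is <= that of V'."""
    return list(v) != list(w) and all(a <= b for a, b in zip(v, w))


def concurrent(v: Sequence[int], w: Sequence[int]) -> bool:
    """Timestamps of two distinct events are concurrent if neither is less than the other."""
    return list(v) != list(w) and not less_than(v, w) and not less_than(w, v)


def preceding_count(v: Sequence[int]) -> int:
    """Counting property: number of events that causally precede the event."""
    return sum(v) - 1


p1, p2, p3 = (VectorClockProcess(i, 3) for i in range(3))
e11 = p1.local_or_send()   # [1, 0, 0]
e21 = p2.local_or_send()   # [0, 1, 0]
e12 = p1.receive(e21)      # [2, 1, 0]
e31 = p3.local_or_send()   # [0, 0, 1]
e32 = p3.receive(e11)      # [1, 0, 2]
e33 = p3.local_or_send()   # [1, 0, 3]
e13 = p1.receive(e33)      # [3, 1, 3]
print(less_than(e11, e32), concurrent(e21, e32), preceding_count(e13))  # True True 6
```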
20
Implementing Causal Delivery
with Vector Clocks

• Babaoglu & Marzullo
  ◦ The monitor $p_0$ maintains an array $D[1..n]$, where $D[i]$ contains $TS(m_i)[i]$ and $m_i$ is the last message delivered from process $p_i$
• DR3:
  ◦ Deliver message $m$ from process $p_j$ as soon as both of the following are satisfied:
    $D[j] = TS(m)[j] - 1$ (guarantees FIFO order)
    $D[k] \ge TS(m)[k], \forall k \neq j$ (guarantees the causal relation)
• DR4:
  ◦ The monitor $p_0$ maintains a counter $D$
  ◦ Deliver message $m$ of event $e_i$ as soon as $D = \#(e_i) - 1$

21
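Causal Delivery with Vector Clocks
Examples

[Figure: causal delivery at the monitor $p_0$ with two application processes; $D$ is initially $[0,0]$]
  p0: (1,0) (1,1) (1,2) (2,2) (3,2)
  p1: (0,0) (1,0) (2,2) (3,2)
  p2: (0,0) (1,1) (1,2)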
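
Delivery rule DR3 can be sketched as a small monitor that buffers notifications until both conditions hold. The class name `CausalMonitor`, the 0-based process indices and the arrival order used below are assumptions for illustration; the timestamps are the ones from the example figure.

```python
class CausalMonitor:
    """Monitor p0 applying DR3 to notifications timestamped with vector clocks."""

    def __init__(self, n):
        self.D = [0] * n      # D[j]: own component of the last message delivered from p_j
        self.buffer = []      # received but not yet delivered: (sender, timestamp)
        self.delivered = []   # delivery order seen by the monitor

    def _deliverable(self, j, ts):
        fifo_ok = self.D[j] == ts[j] - 1
        causal_ok = all(self.D[k] >= ts[k] for k in range(len(ts)) if k != j)
        return fifo_ok and causal_ok

    def receive(self, j, ts):
        """Buffer a message from p_j, then deliver everything that has become deliverable."""
        self.buffer.append((j, tuple(ts)))
        progress = True
        while progress:
            progress = False
            for item in list(self.buffer):  # iterate over a copy while removing
                sender, stamp = item
                if self._deliverable(sender, stamp):
                    self.buffer.remove(item)
                    self.D[sender] = stamp[sender]
                    self.delivered.append(item)
                    progress = True


monitor = CausalMonitor(2)
# Assumed arrival order, deliberately scrambled (index 0 is p1, index 1 is p2).
for sender, ts in [(0, (2, 2)), (1, (1, 2)), (1, (1, 1)), (0, (1, 0)), (0, (3, 2))]:
    monitor.receive(sender, ts)
print([ts for _, ts in monitor.delivered])
# [(1, 0), (1, 1), (1, 2), (2, 2), (3, 2)]
```

Despite the scrambled arrivals, the delivery order matches the sequence shown for $p_0$ in the figure.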

22
Distributed Snapshots

• In this strategy, $p_0$ requests the states of the other processes and then combines them into a global state
• Definitions:
  ◦ Channel state: for each channel from $p_i$ to $p_j$, $\chi_{i,j}$ is the set difference between the messages sent by $p_i$ as recorded in $\sigma_i$ and the messages received by $p_j$ as recorded in $\sigma_j$
  ◦ Incoming channels of process $p_i$: $IN_i$
  ◦ Outgoing channels of process $p_i$: $OUT_i$
• Snapshot protocols
  ◦ Chandy and Lamport [1985]
  ◦ Morgan [1985]

23
Snapshot Protocol 1

• Assumptions:
  ◦ A global real-time clock $RC$ exists
  ◦ Each message carries a timestamp
  ◦ Message delays are bounded
• Global clock algorithm
  ◦ $p_0$ sends [take snapshot at $t_{ss}$] to all processes
  ◦ When clock $RC$ reads $t_{ss}$, each process $p_i$ does the following:
    records its local state $\sigma_i$,
    sends an empty message over all of its outgoing channels,
    and starts recording all messages received over each of its incoming channels
  ◦ The first time $p_i$ receives a message from $p_j$ with timestamp greater than or equal to $t_{ss}$, $p_i$ stops recording messages for that channel

24
Snapshot Protocol 2

• Assumptions:
  ◦ Bounded message delays
  ◦ Channels are FIFO
• Chandy & Lamport
  ◦ $p_0$ sends [take snapshot] to itself
  ◦ For each process $p_i$ receiving [take snapshot]:
    If it is the first time, $p_i$
      records its local state $\sigma_i$,
      sends [take snapshot] over each of its outgoing channels,
      and starts recording messages from its other incoming channels
    If it is not the first time, $p_i$
      stops recording messages from that incoming channel

25
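Chandy & Lamport (1985)

[Figure: space-time diagram of the monitor $p_0$ and two processes; $p_1$ executes $e_1^1, \ldots, e_1^6$ and records its state at $e_1^*$; $p_2$ executes $e_2^1, \ldots, e_2^5$ and records its state at $e_2^*$]

• Real computation: $R = e_2^1 e_1^1 e_1^2 e_1^3 e_2^2 e_1^4 e_2^3 e_2^4 e_1^5 e_2^5 e_1^6$
• In terms of global states: $\Sigma^{00} \Sigma^{01} \Sigma^{11} \Sigma^{21} \Sigma^{31} \Sigma^{32} \Sigma^{42} \Sigma^{43} \Sigma^{44} \Sigma^{54} \Sigma^{55} \Sigma^{65}$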
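
The marker-based snapshot protocol can be sketched with a small hand-scheduled simulation. Everything concrete here is an assumption made for illustration: the two-process money-transfer scenario, the `Process` class, and the use of `collections.deque` objects as FIFO channels. The initiator calls `start_snapshot()` directly, standing in for $p_0$ sending [take snapshot] to itself.

```python
from collections import deque

MARKER = "take snapshot"


class Process:
    def __init__(self, pid, balance, peers, channels):
        self.pid, self.balance = pid, balance
        self.peers = peers            # ids of all other processes
        self.channels = channels      # shared dict: (src, dst) -> FIFO deque
        self.recorded_state = None    # local state once recorded
        self.channel_state = {}       # src -> messages recorded for channel (src, pid)
        self.recording = set()        # incoming channels still being recorded

    def send(self, dst, amount):
        self.balance -= amount
        self.channels[(self.pid, dst)].append(amount)

    def start_snapshot(self, first_from=None):
        """Record the local state, relay markers, and record the other incoming channels."""
        self.recorded_state = self.balance
        for q in self.peers:
            self.channels[(self.pid, q)].append(MARKER)
            self.channel_state[q] = []
        self.recording = {q for q in self.peers if q != first_from}

    def receive(self, src):
        msg = self.channels[(src, self.pid)].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self.start_snapshot(first_from=src)  # first marker seen
            else:
                self.recording.discard(src)          # later marker closes that channel
        else:
            self.balance += msg
            if src in self.recording:
                self.channel_state[src].append(msg)  # message was in transit


channels = {(1, 2): deque(), (2, 1): deque()}
p1 = Process(1, 100, [2], channels)
p2 = Process(2, 50, [1], channels)
p1.send(2, 10)
p2.send(1, 20)
p1.start_snapshot()   # p1 records 90 and sends a marker behind its transfer
p2.receive(1)         # transfer of 10
p2.receive(1)         # marker: p2 records 40
p1.receive(2)         # transfer of 20, recorded as in transit
p1.receive(2)         # marker: p1 stops recording
total = (p1.recorded_state + p2.recorded_state
         + sum(p1.channel_state[2]) + sum(p2.channel_state[1]))
print(p1.recorded_state, p2.recorded_state, p1.channel_state[2], total)  # 90 40 [20] 150
```

The recorded total equals the initial total of 150, which is the kind of invariant a consistent snapshot preserves.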

26
Properties of Snapshots

• Definitions
  ◦ $\Sigma^a$: the global state in which the snapshot protocol is initiated
  ◦ $\Sigma^f$: the global state in which the protocol terminates
  ◦ $\Sigma^s$: the global state constructed by the protocol
  ◦ $e_i^*$ denotes the event at which $p_i$ receives [take snapshot] for the first time, causing $p_i$ to record its state
  ◦ Let $t_i$ be the time at which $e_i^*$ occurs
  ◦ $e_i$ is a prerecording event if $e_i \to e_i^*$; otherwise it is a postrecording event
• Properties
  ◦ There exists a run $R'$ such that $\Sigma^a \leadsto_{R'} \Sigma^s \leadsto_{R'} \Sigma^f$
  ◦ That is to say, $\Sigma^s$ could have happened

27
Argumentation (1)

• Chandy & Lamport (1985)
  ◦ Consider any adjacent (postrecording, prerecording) pair of events $(e, e')$ in $R$; then $\neg(e \to e')$
  ◦ Swapping all such pairs results in another consistent run $R'$
    swap $(e_1^3, e_2^2)$: $r_1 = e_2^1 e_1^1 e_1^2 e_2^2 e_1^3 e_1^4 e_2^3 e_2^4 e_1^5 e_2^5 e_1^6$
    swap $(e_1^4, e_2^3)$: $r_2 = e_2^1 e_1^1 e_1^2 e_2^2 e_1^3 e_2^3 e_1^4 e_2^4 e_1^5 e_2^5 e_1^6$
    swap $(e_1^3, e_2^3)$: $R' = e_2^1 e_1^1 e_1^2 e_2^2 e_2^3 e_1^3 e_1^4 e_2^4 e_1^5 e_2^5 e_1^6$
  ◦ The global state after executing the last prerecording event ($e_2^3$) in $R'$ is $\Sigma^s$ ($= \Sigma^{23}$), the constructed global state
  ◦ Had the computation followed this run, $\Sigma^s$ could have happened
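
The swapping argument can be replayed mechanically. In this sketch, events are (process, index) pairs and `reorder_run` is an illustrative name; the code repeatedly swaps adjacent (postrecording, prerecording) pairs and does not itself verify $\neg(e \to e')$, which the argument above guarantees for such pairs.

```python
def reorder_run(run, is_prerecording):
    """Swap adjacent (postrecording, prerecording) pairs until none remain.

    Returns the reordered run and the list of swaps in the order performed.
    """
    run = list(run)
    swaps = []
    changed = True
    while changed:
        changed = False
        for k in range(len(run) - 1):
            e, e_next = run[k], run[k + 1]
            if not is_prerecording(e) and is_prerecording(e_next):
                run[k], run[k + 1] = e_next, e
                swaps.append((e, e_next))
                changed = True
    return run, swaps


# The run R from the example; (2, 1) stands for the first event of p2.
R = [(2, 1), (1, 1), (1, 2), (1, 3), (2, 2), (1, 4),
     (2, 3), (2, 4), (1, 5), (2, 5), (1, 6)]
recorded = {1: 2, 2: 3}  # the constructed state: p1 after 2 events, p2 after 3


def is_pre(event):
    process, index = event
    return index <= recorded[process]


R_prime, swaps = reorder_run(R, is_pre)
print(swaps)
# [((1, 3), (2, 2)), ((1, 4), (2, 3)), ((1, 3), (2, 3))]
print(R_prime[:5])
# [(2, 1), (1, 1), (1, 2), (2, 2), (2, 3)]
```

All prerecording events end up in front, so the state reached after the fifth event of the reordered run is exactly the recorded one.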

28
Argumentation (2)

• Lai & Yang (1987)
  ◦ Let $GSN(t_i : p_i \in P)$ be a snapshot taken between $\tau_1$ and $\tau_2$ during the computation $R$
  ◦ Let $\delta = \tau_2 - \tau_1$, and construct $R'$ as follows:
  ◦ $R'$ is the same as $R$ except that every postrecording event in $R$ is postponed by $\delta$ units of time, that is
    $R'(t) = R(t)$ if $R(t)$ is an event at $p_i$ and $t \le t_i$
    $R'(t) = R(t - \delta)$ if $R(t - \delta)$ is an event at $p_i$ and $t - \delta > t_i$
    $R'(t)$ contains no event otherwise
• Example

29
Properties of Global Predicates

• Stable predicates
  ◦ Many system properties one wishes to detect have the characteristic that once they become true, they remain true
  ◦ If $\Phi$ is a stable predicate, then since $\Sigma^a \leadsto \Sigma^s \leadsto \Sigma^f$:
    ($\Phi$ is true in $\Sigma^s$) $\Rightarrow$ ($\Phi$ is true in $\Sigma^f$)
    ($\Phi$ is false in $\Sigma^s$) $\Rightarrow$ ($\Phi$ is false in $\Sigma^a$)
• Nonstable predicates
  ◦ The condition encoded by the predicate may not persist long enough for it to be true when the predicate is evaluated
  ◦ If a predicate $\Phi$ is found to be true by the monitor, we do not know whether $\Phi$ ever held during the actual run

30
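Nonstable Predicates

• Two problems
  ◦ The condition encoded by the predicate may not persist long enough for it to be true when the predicate is evaluated
  ◦ If a predicate $\Phi$ is found to be true by the monitor, we do not know whether $\Phi$ ever held during the actual run
• The predicate may have held even if it is not detected, and even if it is detected it may never have held
• Extended nonstable global predicates apply to the entire distributed computation rather than to a single global state:
  ◦ Possibly($\Phi$): there exists a consistent observation of the computation in which $\Phi$ holds in some global state
  ◦ Definitely($\Phi$): in every consistent observation of the computation, $\Phi$ holds in some global state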

31
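Detecting Possibly and Definitely $\Phi$

• $\Sigma_{min}(\sigma_i^k)$: the global state with the smallest level in the lattice containing $\sigma_i^k$
• $\Sigma_{max}(\sigma_i^k)$: the global state with the largest level in the lattice containing $\sigma_i^k$
• Examples: $\Sigma_{min}(\sigma_1^3) = \Sigma^{31}$, $\Sigma_{max}(\sigma_1^3) = \Sigma^{33}$
• $\Sigma_{min}(\sigma_i^k) = (\sigma_1^{c_1}, \sigma_2^{c_2}, \ldots, \sigma_n^{c_n})$: $\forall j: VC(\sigma_j^{c_j})[j] = VC(\sigma_i^k)[j]$
• $\Sigma_{max}(\sigma_i^k) = (\sigma_1^{c_1}, \sigma_2^{c_2}, \ldots, \sigma_n^{c_n})$: $\forall j: VC(\sigma_j^{c_j})[i] \le VC(\sigma_i^k)[i]$ and $((\sigma_j^{c_j} = \sigma_j^f) \lor (VC(\sigma_j^{c_j+1})[i] > VC(\sigma_i^k)[i]))$
• The minimum level containing $\sigma_i^k$ is the sum of the components of the vector timestamp $VC(\sigma_i^k)$
• An algorithm for detecting Definitely($\Phi$): $O(k^n)$, where $k$ is the maximum number of events a monitored process has executed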
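
A brute-force sketch of the two detection problems is given below. It is not the algorithm from the paper: it enumerates global states directly, which visits up to $O(k^n)$ of them. The input format is an assumption: `vcs[j][c]` is the vector clock of $p_j$ after `c` events, with `vcs[j][0]` the zero vector, and a predicate takes a cut `(c_1, ..., c_n)`. Consistency uses the vector clock test $c_i \ge VC(e_j^{c_j})[i]$ for all $i, j$.

```python
from itertools import product


def consistent(cut, vcs):
    """Consistent-cut test: c_i >= VC of p_j's last included event, component i."""
    n = len(cut)
    return all(cut[i] >= vcs[j][cut[j]][i] for i in range(n) for j in range(n))


def possibly(phi, vcs):
    """True if some consistent global state satisfies phi."""
    ranges = [range(len(v)) for v in vcs]
    return any(consistent(c, vcs) and phi(c) for c in product(*ranges))


def definitely(phi, vcs):
    """True if every path through the lattice of consistent states meets a phi-state."""
    n = len(vcs)
    start = tuple([0] * n)
    final = tuple(len(v) - 1 for v in vcs)
    if phi(start):
        return True
    frontier, seen = [start], {start}
    while frontier:  # explore only states where phi is false
        cut = frontier.pop()
        if cut == final:
            return False  # found a complete run that avoids phi
        for i in range(n):
            if cut[i] < final[i]:
                nxt = cut[:i] + (cut[i] + 1,) + cut[i + 1:]
                if nxt not in seen and consistent(nxt, vcs) and not phi(nxt):
                    seen.add(nxt)
                    frontier.append(nxt)
    return True


# Assumed two-process computation: p1's first event sends a message that
# p2's second event receives.
vcs = [[(0, 0), (1, 0), (2, 0)],
       [(0, 0), (0, 1), (1, 2)]]
print(possibly(lambda c: c == (0, 2), vcs))    # False: (0, 2) is an inconsistent cut
print(possibly(lambda c: c == (2, 0), vcs))    # True
print(definitely(lambda c: c == (2, 0), vcs))  # False: some runs never pass through it
print(definitely(lambda c: c[0] == 1, vcs))    # True: p1 must pass through its first state
```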

32
Example

33
