
A Proposal for a Method to Achieve Deterministic Latency in Distributed Systems

Ben Trapani
Introduction
Software engineers developing applications for cloud infrastructures often make the mistake of architecting their parallel applications in a task-parallel manner. This paper describes the issue of tail latency as it relates to task-parallel architectures and proposes data-parallel architectures as a superior alternative. The paper outlines a concrete meta-procedure for converting an arbitrary input procedure into a fully data-parallel execution tree, and then describes a second meta-procedure which configures a set of compute nodes according to the properties of the generated execution tree and manages the execution of the data-parallel tree.
Motivation
Distributed computing environments are characterized by high tail latency, which poses
an acute cost to sectors dependent on real-time distributed computation such as financial
services, navigation, and artificial intelligence. High tail latency is not a direct result of the
underlying hardware configuration and OS performance, as many engineers in the field believe.
Rather, high tail latency results from the poorly structured, task-parallel nature of most applications executed on top of distributed systems. The poor architecture of such applications
results in highly variable latency for system calls, which leads many developers to blame
hardware configurations and OS performance. The core problem underlying tail latency in
distributed systems is unnecessary synchronization between tasks operating on the same data,
much of which is enforced through system calls (Dean 2). This synchronization can be avoided
by converting all task-parallel distributed applications into data-parallel applications.
This paper proposes a two-step approach to achieve deterministic latency in
distributed systems: a procedure which fully converts an input procedure into a data-parallel
execution tree, and a procedure which configures a compute cluster to ensure deterministic
runtime of the generated execution tree.
Benefits of Data-Parallelism over Task-Parallelism
Task-parallel architectures typically rely on a messaging system between worker nodes to
ensure that data is processed in the correct order and that operations on the same data are atomic.
Synchronization primitives such as semaphores, condition variables, and mutexes are built on top of a messaging system between different processes. Peter A. Dinda et al. of Carnegie Mellon
University provide an outline for a compiler which enables different tasks distributed across a set
of networked computers to remain in sync and to run efficiently. The main problems addressed
in Dinda's paper concern non-deterministic performance of the system and other synchronization issues such as deadlocks and race conditions. Task-parallel applications typically suffer from these problems, even when generated automatically by a compiler, as illustrated by Dinda et al.'s research (Dinda 11-13). These problems stem from contention over
shared data, which is not strictly avoided in task-parallel architectures.
A single procedure stands to benefit from parallelism if and only if operations on
independent data exist. Jeff A. Stuart and John D. Owens of the University of California, Davis
describe an architecture allowing a generic task-parallel application utilizing MPI to be simulated
on a data-parallel compute unit. They create a messaging system built on top of a graphics
processing unit (GPU) to allow shaders to communicate with each other, thereby allowing
separate tasks to be mapped to separate shader units as tasks are mapped to CPUs in traditional
systems. The task-parallel architectures with resource contention performed worse than their
equivalents executed on a single CPU. The task-parallel architectures which did not contend for
resources as frequently (i.e. sent fewer messages) performed more efficiently on the GPU-based
system (Stuart 9). The fact that Stuart et al. were able to design a system in which a generic task-parallel architecture, executed on a data-parallel compute unit, achieves a performance increase whenever the architecture contains a significant number of independent operations suggests, in a practical sense, that a single procedure stands to benefit from parallelism if and only if operations on independent data exist. A more general proof follows.
Assume that no independent operations exist within a procedure. This implies that every
operation depends on a previous operation. A concrete example is a procedure which accepts a list of integers and adds the value of the previous integer to each successive integer from left to right. Since the value of the ith integer in the list depends on the (i-1)th value, the (i-1)th value depends on the (i-2)th value, and so on until we reach the left-hand side, it is never possible to execute two additions simultaneously while maintaining the original functionality. We have shown that if there
are no independent operations in a procedure, parallelizing the procedure does not yield a
performance increase, since every potentially parallel operation needs to wait for every previous
operation.
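For concreteness, here is a minimal Python sketch (added purely for illustration; it does not appear in any cited work) of such an inherently sequential procedure:

def running_sums(values):
    # Each element is replaced by itself plus the element to its left.
    # The ith result depends on the (i-1)th result, so the loop body
    # cannot be executed in parallel without changing the output.
    for i in range(1, len(values)):
        values[i] += values[i - 1]
    return values

# Example: running_sums([1, 2, 3, 4]) returns [1, 3, 6, 10].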
Assume now that at least two independent operations exist within a procedure. A
concrete example is a function which accepts two integers, increments them, sums the
increments, and returns the result. Since the two initial increments do not depend on each other, they can be executed in parallel, thereby yielding a performance benefit over the naïve approach, which increments them sequentially in an arbitrary order. The final sum operation will wait for both increments to complete before executing. Since each increment has an identical computation cost, the final sum operation will not block for long on either individual increment.
With this example, we have shown that if there exist at least two independent operations in a
procedure, parallelizing the procedure yields a performance improvement.
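The same example can be sketched in Python (again added only for illustration; for operations this small the coordination overhead would dominate in practice, so the sketch only shows the dependency structure):

from concurrent.futures import ThreadPoolExecutor

def increment(x):
    # Independent operation: it touches only its own argument.
    return x + 1

def sum_of_increments(a, b):
    # The two increments share no data, so they may run in parallel;
    # only the final addition waits for both results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(increment, a)
        future_b = pool.submit(increment, b)
        return future_a.result() + future_b.result()

# Example: sum_of_increments(3, 4) returns 9.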
Since the absence of any independent operations in a procedure implies that the
procedure cannot benefit from parallelization and since the existence of at least two independent
operations in a procedure implies that the procedure will benefit from parallelization, it is
possible to conclude that a single procedure stands to benefit from parallelism if and only if
independent operations exist within the procedure. This conclusion implies that there is an
optimal parallel architecture for a given procedure: an architecture which builds a dependency
tree out of each operation in the original procedure, and executes each branch of the resulting
tree in parallel.
Building the Execution Tree
The execution tree consists of a set of connected transformations such that each transformation accepts as inputs the outputs of its parent transformations and waits to execute until all input dependencies are satisfied. The final output of a transformation is passed along to all children of the given transformation. The described execution tree bears many similarities to the map-reduce execution tree, which is used primarily to define data-transformation pipelines for efficiently analyzing large data sets. Muneto Yamamoto et al. effectively leverage the map-reduce
execution tree pattern in tandem with a configured compute cluster to analyze a collection of
images (Yamamoto 3-5). In the following two sections, Yamamoto et al.'s research is generalized to provide deterministic, high-performance distributed execution of general input procedures.
The input procedure can be regarded as a single transformation that accepts the program's input data set and yields the program's output data set. Here is pseudo-code for converting a generic input procedure into an execution tree (an illustrative code sketch follows the list):
1. Format the input procedure as a single transformation object.
2. Examine the transformation object for independent operations.
3. Move independent operations out of the parent transformation, and annotate data
dependencies in parent transformation.
4. Repeat steps 2-4 for every newly generated transformation until no more transformations
can be generated.
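A rough Python sketch of how these four steps could be organized follows; it is purely illustrative, and the Transformation class and the find_independent_blocks analysis hook are hypothetical stand-ins, not constructs from any cited work:

from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based hashing, so nodes can serve as dict keys
class Transformation:
    operations: list                               # ordered operations in this node
    inputs: set = field(default_factory=set)       # data locations this node reads
    outputs: set = field(default_factory=set)      # data locations this node writes
    children: list = field(default_factory=list)   # transformations consuming our outputs

def build_execution_tree(procedure, find_independent_blocks):
    # Step 1: wrap the entire procedure as a single transformation.
    root = Transformation(operations=list(procedure.operations),
                          inputs=set(procedure.inputs),
                          outputs=set(procedure.outputs))
    unexamined = [root]
    # Step 4: repeat until no new transformations can be generated.
    while unexamined:
        parent = unexamined.pop()
        # Step 2: find blocks of operations that do not share data (hypothetical hook).
        for block in find_independent_blocks(parent):
            # Step 3: move the block out of the parent and record its data dependencies.
            # (In the full meta-procedure the new child is attached to whichever
            # transformation produces the data it needs, which may not be this parent.)
            child = Transformation(operations=block.operations,
                                   inputs=set(block.required_data),
                                   outputs=set(block.produced_data))
            parent.operations = [op for op in parent.operations
                                 if op not in block.operations]
            parent.children.append(child)
            unexamined.append(child)
    return root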
The recursion in the meta-procedure makes it a good candidate for parallel execution as well, which is ideal given its typical usage in cloud infrastructures. The generated execution tree consists of a tree of dependent transformations, which can be executed from the top of the tree down. Children of each transformation can be executed in parallel, since the data they depend on is available and they do not share any other data. Any procedure which contains
independent operations will have an execution tree with a width of at least two at one level in the
tree, which is consistent with the proof given in the first section.
To implement step 1 of the meta-procedure, all that is required is to mark the inputs to the
program as inputs to the transformation and mark outputs of the program as the outputs of the
transformation. Implementing step 2 is the hardest task in the meta-procedure, but it can be accomplished by identifying independent blocks within the transformation. To identify an
independent block, build a dependency tree between memory locations in the input
transformation. Building the dependency tree between memory locations can be done from the
top down by matching write locations to subsequent read calls to the same location, and then
binding the result of the read and any operations on the read data to the root write operation. The
roots of the resulting dependency tree represent the roots of the independent blocks. Reconstruct the independent blocks by traversing down the tree and emitting the operations in the order in which they were originally programmed. These new independent blocks represent the core of the new
set of independent transformations. Step 3 can be accomplished by marking any non-generated
data referenced in an independent block as a dependency, and adding the newly generated
transformation as a child of the transformation yielding the needed data. Step 3 does not require
children of a newly generated block to be identified by the parent, since children will identify
their parents after identifying the data dependencies in their core independent blocks. Step 4 is a
simple recursion on each unexamined node in the current tree. When no unexamined nodes exist
in the tree, the execution tree is complete.
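The write-to-read matching described above can be sketched as follows (illustrative only; it assumes each operation declares the memory locations it reads and writes, and it ignores write-after-read and write-after-write hazards for simplicity):

def group_independent_blocks(operations):
    # operations: ordered list of objects with .reads and .writes, each a set of
    # memory locations (an assumed representation, not defined in the paper).
    owner = {}          # memory location -> id of the block that last wrote it
    block_of = {}       # operation index -> block id
    next_block = 0
    for i, op in enumerate(operations):
        # Blocks this operation depends on, via earlier writes that it reads.
        parents = {owner[loc] for loc in op.reads if loc in owner}
        if not parents:
            block_id = next_block       # no earlier write feeds this op: a new root
            next_block += 1
        else:
            # Operations connected through shared data collapse into one block.
            block_id = min(parents)
            for idx in list(block_of):
                if block_of[idx] in parents:
                    block_of[idx] = block_id
            for loc in list(owner):
                if owner[loc] in parents:
                    owner[loc] = block_id
        block_of[i] = block_id
        for loc in op.writes:
            owner[loc] = block_id
    # Re-emit each block's operations in their original program order.
    blocks = {}
    for i, op in enumerate(operations):
        blocks.setdefault(block_of[i], []).append(op)
    return list(blocks.values())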
Configuring the Hardware and Executing the Tree
Once the execution tree described in the previous section has been obtained, configuring
the hardware to guarantee deterministic latency is straightforward: reserve as many compute nodes as there are transformations at the widest level of the execution tree. A compute node in this paper is assumed
to have CPU and memory units which are separate from the rest of the cluster. Each compute
node is assumed to have the ability to transfer data to every other compute node in its cluster.
Data transfers between nodes are not assumed to take a deterministic amount of time. Executing
a procedure on the compute node is assumed to take a nearly-deterministic amount of time. All
compute nodes in a cluster are assumed to be identical.
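A short sketch (illustrative, reusing the hypothetical Transformation class from the earlier sketch and assuming each child has a single parent) of how the number of nodes to reserve could be computed:

def nodes_to_reserve(root):
    # Breadth-first walk of the execution tree; the widest level determines how
    # many compute nodes must be reserved for fully parallel execution.
    widest, level = 0, [root]
    while level:
        widest = max(widest, len(level))
        level = [child for node in level for child in node.children]
    return widest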

When executing the tree, the first step is to map the root transformation and data onto a
compute node. Then execute the transformation. After the transformation completes, map all of
the children to different compute nodes in the cluster and execute each transformation. After
each one of these transformations completes, instantiate its children in shared memory if the children do not already exist and mark each child's dependency on the current transformation as filled. If all of a child's dependencies are met, place the child transformation on the current
compute node and exit. Repeat this process until the leaves of the tree are reached. The outputs
of the leaves of the tree are the results of the program. Once every leaf node completes, write the
results to the standard output device and continue.
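A simplified Python sketch of this execution order follows (illustrative only; run_on_node is a hypothetical callable that executes one transformation on a reserved compute node given its parents' results, and the sketch runs ready transformations in level-synchronized batches and ignores the data-locality detail of keeping a child on its parent's node):

from concurrent.futures import ThreadPoolExecutor

def execute_tree(root, num_nodes, run_on_node):
    # Count how many parent dependencies each transformation is waiting on.
    remaining = {}
    def count_dependencies(node):
        for child in node.children:
            first_visit = child not in remaining
            remaining[child] = remaining.get(child, 0) + 1
            if first_visit:
                count_dependencies(child)
    count_dependencies(root)

    results = {}
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        ready = [root]
        while ready:
            # Every transformation whose dependencies are filled runs in parallel,
            # at most one per reserved compute node.
            futures = {t: pool.submit(run_on_node, t, results) for t in ready}
            ready = []
            for t, fut in futures.items():
                results[t] = fut.result()
                for child in t.children:
                    remaining[child] -= 1
                    if remaining[child] == 0:   # all of the child's dependencies are met
                        ready.append(child)
    return results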
Configuring and utilizing the hardware compute cluster as described guarantees that
multiple executions of the same input procedure will be performed in nearly the same amount of
time on the same set of compute nodes. This is true for two reasons: the amount of time each transformation takes is nearly constant from execution to execution, and there is never a point in time at
which a transformation is available and is not executed. Since nodes do not need to communicate
with each other throughout the process of executing the tree and per-node executions are
assumed to take near-deterministic amounts of time, the overall time taken to execute the tree
will be nearly deterministic from execution to execution of the same input procedure.
Andrew Brook's paper "Low-Latency Distributed Applications in Finance" provides a concrete method for determining the latency of different components of a process. Brook recommends that engineers working on systems requiring deterministic latency create a
dynamic profiling library which analyzes process response latency and optimizes resource
allocations to processes in order to achieve more deterministic execution time (Brook 4-5). The
system managing the execution of the input execution tree could leverage Brook's method to
quantify and cache the relative compute cost of different transformations in repeated input
execution trees. With these cached performance metrics, the assumption that all compute nodes
in the cluster are equal in compute power can be relaxed. Since the collection of transformation
latency metrics allows the proposed hardware manager to determine the relative cost of
transformations in a tree and the relative compute power of nodes in the cluster is provided via
the OS, the hardware manager can effectively map transformations to compute nodes by matching
relatively expensive transformations with relatively efficient compute nodes.
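The matching itself could be as simple as the following sketch (illustrative; cost and speed stand for the cached per-transformation latency metrics and the OS-reported per-node compute power mentioned above):

def assign_transformations(transformations, nodes, cost, speed):
    # transformations and nodes are identifiers; cost[t] is the cached relative
    # compute cost of transformation t, speed[n] the relative power of node n.
    # Greedy matching: the most expensive transformation goes to the fastest node.
    by_cost = sorted(transformations, key=lambda t: cost[t], reverse=True)
    by_speed = sorted(nodes, key=lambda n: speed[n], reverse=True)
    return {t: by_speed[i % len(by_speed)] for i, t in enumerate(by_cost)}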
One significant challenge arises when multiple clients are sharing a compute cluster. As
per the assumptions made when configuring the hardware, an input procedure is given complete
access to the hardware cluster until it completes. One method to support multiple clients consists
of queuing incoming procedures in a separate compute node that accepts all incoming procedures
and hands them off to the rest of the cluster as the current procedure running on the cluster
terminates. However, this scheduling policy may lead to many short tasks being blocked by a
larger, potentially less urgent task. The context switching needed by modern CPU scheduling
techniques introduces additional overhead, and sharing compute nodes among multiple processes
inherently introduces nondeterministic execution times. One method to address cluster sharing
effectively consists of scaling the cluster horizontally at the expense of per-core performance,
which would allow more procedures to be executed in parallel at the expense of typical per-procedure performance. Another option is to create a marketplace in which users buy a latency
guarantee for their procedures. The cloud vendor could automatically tweak prices based on
available resources and market demand to ensure that all purchased latency guarantees are
satisfied, and could therefore prioritize incoming procedures in a manner which satisfies all
latency requirements.

Conclusion
The two procedures described in this paper provide software engineers with a complete
template which guarantees efficient and deterministic execution time for input procedures to
cloud compute clusters. The template cluster manager first constructs an execution tree from the
input procedure, and then manages the mapping of the transformations in the execution tree onto
nodes in the compute cluster. Both portions of the meta-procedure depend only on near-deterministic per-core performance and OS calls which provide per-core performance
information, both of which are present in modern systems. Future work includes a concrete
implementation of the hardware manager on a small compute cluster, in addition to further
research into scheduling techniques that avoid context switching and keep compute cluster costs
as low as possible.

References:
Brook, A. (2015). Low-Latency Distributed Applications in Finance. Communications Of The
ACM, 58(7), 42-50. doi:10.1145/2747303.
Dean, J., & Barroso, L. A. (2013). The Tail at Scale. Communications of the ACM, 56(2), 74-80.
Dinda, P. A., O'Hallaron, D. R., Subhlok, J., Webb, J. A., & Yang, B. (n.d.). Language and Run-time Support for Network Parallel Computing. Retrieved February 16, 2016, from https://pdfs.semanticscholar.org/446f/637e30e033136abe3abe48650e433f3c516a.pdf
Stuart, J. A., & Owens, J. D. (n.d.). Message Passing on Data-Parallel Architectures (Tech. Rep.).
Yamamoto, M., & Kaneko, K. (2012). Parallel Image Database Processing with MapReduce and Performance Evaluation in Pseudo Distributed Mode. Retrieved February 16, 2016, from http://academicjournals.org/ojs2/index.php/ijecs/article/viewFile/1092/124
