Abstract—We present techniques to process large scale-free graphs in distributed memory. Our aim is to scale to trillions of edges, and our research is targeted at leadership class supercomputers and clusters with local non-volatile memory, e.g., NAND Flash.

We apply an edge list partitioning technique, designed to accommodate high-degree vertices (hubs) that create scaling challenges when processing scale-free graphs. In addition to partitioning hubs, we use ghost vertices to represent the hubs to reduce communication hotspots.

We present a scaling study with three important graph algorithms: Breadth-First Search (BFS), K-Core decomposition, and Triangle Counting. We also demonstrate scalability on BG/P Intrepid by comparing to the best known Graph500 results [1]. We show results on two clusters with local NVRAM storage that are capable of traversing trillion-edge scale-free graphs. By leveraging node-local NAND Flash, our approach can process thirty-two times larger datasets with only a 39% performance degradation in Traversed Edges Per Second (TEPS).

Keywords—parallel algorithms; graph algorithms; big data; distributed computing.

I. INTRODUCTION

Efficiently storing and processing massive graph datasets is a challenging problem in data-intensive computing as researchers seek to leverage Big Data to answer next-generation scientific questions. Graphs are used in a wide range of fields including Computer Science, Biology, Chemistry, and the Social Sciences. These graphs may represent complex relationships between individuals, proteins, chemical compounds, etc.

Graph datasets from many important domains can be classified as scale-free, meaning the vertex degree distribution asymptotically follows a power law. The power law distribution of vertex degree causes significant imbalances when processing in distributed memory. The growth of high-degree vertices, also known as hubs, poses significant challenges for balancing distributed storage and communication.

In this work, we identify key challenges to storing and traversing massive scale-free graphs using distributed external memory. We have developed a novel graph partitioning strategy that evenly partitions scale-free graphs for distributed memory and overcomes the balance issues created by vertices with large out-degree. The growth of high-degree vertices also contributes to communication hotspots for hub vertices with large in-degree. To prevent these hotspots we selectively use ghost vertices to represent hub vertices with high in-degree, and show that only a small number of ghosts are required to provide a significant performance improvement.

We tackle three important graph algorithms and demonstrate scalability to large scale-free graphs: Breadth-First Search (BFS), K-Core decomposition, and Triangle Counting. The data-intensive community has identified BFS as a key challenge for the field and established it as the first kernel of the Graph500 benchmark [2]. The Graph500 was designed to be an architecture benchmark used to rank supercomputers by their ability to traverse massive scale-free graph datasets. It is backed by a steering committee of over 50 international HPC experts from academia, industry, and national laboratories. While designed to be an architecture benchmark, recent work has shown that it is equally an algorithmic challenge [3].

Our aim is to scale to trillions of edges, and our research is targeted at leadership class supercomputers and clusters with local non-volatile memory (NVRAM). These environments are required to accommodate the data storage requirements of large graphs. NVRAM has significantly slower access speeds than main memory (DRAM); however, our previous work demonstrated that it can be a powerful storage medium for graph applications [4], [5].

A. Summary of our contributions

• We identify key challenges for traversing scale-free graphs in distributed memory: high-degree hub vertices, and dense all-to-all processor-processor communication;
• We present a novel edge list partitioning strategy for large scale-free graphs in distributed memory, and show its application to three important graph algorithms: BFS, K-Core, and Triangle Counting;
• Using ghost vertices to represent high-degree hubs, we scale our approach on leadership class supercomputers up to 131K cores on BG/P Intrepid and compare it to the best known Graph500 results;
• We show that by leveraging node-local NAND Flash, our approach can process thirty-two times larger datasets with only a 39% performance degradation in TEPS over a DRAM-only solution.
[Figure 1. Graph500 Hub Growth: edge count of hubs with deg(v) >= 1,000 and deg(v) >= 10,000, and the maximum degree, as graph size increases.]

II. BACKGROUND

A. Graphs, Properties & Algorithms

In a graph, relationships are stored using vertices and edges; a vertex may represent an object or concept, and the relationships between them are represented by edges. The power of graph data structures lies in the ability to encode ...
III. SCALE-FREE GRAPHS IN DISTRIBUTED MEMORY

Parallel processing of unstructured scale-free and small-world graphs, G(V, E), containing a set V of vertices and a set E of edges, using p processors can lead to significant scaling challenges. These challenges include high-degree hub vertices and dense processor-processor communication, and stem from the unstructured locality and power-law degree distributions of the graph data.

A. High Degree Hubs

Underlying many of the challenges is the growth of hub vertices as the size of the graph increases. Hub vertices have degrees significantly above average. The hub growth for Graph500 scale-free graphs is shown in Figure 1. While the average degree is held constant at 16, the number of edges belonging to hubs of degree greater than 1,000 or 10,000 continues to grow as graph size increases. The maximum-degree hub also continues to grow, and by a graph size of 2^30 vertices, the maximum-degree hub has already crossed 10 million edges.

Graphs that contain vertices with above-average degree may have large communication imbalances. Specifically, when a graph is 1D partitioned into a set of partitions P, and

    max_{p_i ∈ P} ( Σ_{v ∈ V_{p_i}} degree(v) ) > |E| / |P|,

then at least one processor will process more than its fair share of visitors.

1) Edge List Partitioning: Partitioning the graph data amongst p distributed processes becomes challenging in the presence of hub vertices. The simplest partitioning is 1D, where each partition receives an equal number of vertices and their associated adjacency lists. In 1D, the adjacency list of a vertex is assigned to a single partition. This simple partitioning leads to significant data imbalance, shown in Figure 2, because a single hub's adjacency list can exceed the average edge count per partition. Data imbalance is computed from the distribution of edges per partition. Recent work has advocated the use of 2D partitioning [3], [10], [11], where each partition receives a 2D block of the adjacency matrix. In effect, this partitions a hub's adjacency list across O(√p) partitions and significantly improves data balance, also shown in Figure 2. 2D partitioning is discussed further in Section VIII-A.

To maintain a balance of edges across p partitions with ranks 0 to p−1, we designed a partitioning based on a sorted edge list. In this work, the graph's edge list is first sorted by the edges' source vertices, then evenly distributed. This causes many of the adjacency lists (including hubs') to be partitioned across multiple consecutive partitions. Our edge list partitioned graph supports the following partition-related operations:

• min_owner(v) returns the minimum partition rank that contains source vertex v;
• max_owner(v) returns the maximum partition rank that contains source vertex v.

These operations can be performed in constant time by preserving the rank owner information with the identifier v, or by an O(lg(p)) binary search. We choose to store the owner information as part of the identifier v. The underlying storage of each edge list partition is flexible; we choose to store each local partition as a compressed sparse row (CSR) structure.
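As a concrete illustration of these lookups, the following C++ sketch is an illustrative assumption, not the paper's implementation: it packs the min_owner rank into the upper bits of a 64-bit vertex identifier, and also shows the O(lg(p)) binary-search alternative over per-rank boundaries of the sorted edge list. The bit widths and array names are hypothetical.

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical identifier layout: high bits carry the min_owner rank, low
// bits carry a local offset, so ownership is recovered without any lookup.
constexpr int kOwnerBits = 20;                 // supports ~1M partition ranks
constexpr int kLocalBits = 64 - kOwnerBits;

inline uint64_t make_vertex_id(uint32_t min_owner_rank, uint64_t local_id) {
  return (static_cast<uint64_t>(min_owner_rank) << kLocalBits) | local_id;
}

// O(1): the owner is stored as part of the identifier v.
inline uint32_t min_owner(uint64_t v) {
  return static_cast<uint32_t>(v >> kLocalBits);
}

// O(lg p) alternative: binary search over per-rank boundaries of the sorted
// edge list. last_source[r] / first_source[r] are the largest / smallest
// source vertices whose edges were assigned to rank r; both arrays are sorted.
inline uint32_t min_owner_search(uint64_t src,
                                 const std::vector<uint64_t>& last_source) {
  auto it = std::lower_bound(last_source.begin(), last_source.end(), src);
  return static_cast<uint32_t>(it - last_source.begin());
}

inline uint32_t max_owner_search(uint64_t src,
                                 const std::vector<uint64_t>& first_source) {
  auto it = std::upper_bound(first_source.begin(), first_source.end(), src);
  return static_cast<uint32_t>(it - first_source.begin()) - 1;
}

Embedding the owner in the identifier trades a few identifier bits for avoiding any lookup structure, which is why the owner is kept as part of v in this design.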
An example of edge list partitioning is illustrated in Figure 3. In this example, vertices 2 and 5 have adjacency lists that span multiple partitions: min_owner(2) = 0, max_owner(2) = 2, min_owner(5) = 2, max_owner(5) = 3.

Requiring the edge list to be globally sorted is an additional step that is not needed by 1D or 2D graph partitioning. This is not an onerous requirement, because there are numerous distributed-memory and external-memory sorting algorithms, and in many graph file formats the edge list is already sorted.

Each partition that contains v also contains the algorithm state for v (e.g., BFS level). This means that state is replicated for vertices whose adjacency lists span multiple partitions. The min_owner partition is the master partition, with all others acting as replicas. The algorithms for controlling access to each replica are discussed in Section V. The global number of partitioned adjacency lists is bounded by O(p), where each partition contains at most two split adjacency lists.

2) Ghost Vertices: The number of incoming edges to hub vertices in scale-free graphs can grow very large, significantly larger than the total number of edges per partition. To mitigate the communication hotspots created by hubs, we selectively use ghost information. Ghosts can be used to filter excess visitors, reducing the communication hotspots created by high in-degree hubs. Ghosts are discussed further in Section IV-B.

B. Dense processor-processor communication

Effectively partitioning many scale-free graphs is difficult, and often not feasible. Without good graph separators, parallel algorithms will require significant communication. Specifically, when the parallel partitioned graph contains Θ(|E|^α) cut edges, where 0 < α ≤ 1, a polynomial number of graph edges will require communication between processors. Additionally, dense communication occurs when Θ(p^(1+α)) pairs of processors share cut edges, in the worst case creating all-to-all communication.
We apply communication routing and aggregation through a synthetic network to reduce dense communication requirements. For dense communication patterns, where every process needs to send messages to all p other processes, we route the messages through a topology that partitions the communication. We have experimented with 2D and 3D routing topologies. Figure 4 illustrates a 2D routing topology that reduces the number of communicating channels a process requires to O(√p). This reduction in the number of communicating pairs comes at the expense of message latency, because messages require two hops to reach their destination. In addition to reducing the number of communicating pairs, 2D routing increases the amount of message aggregation possible by O(√p).

Scaling to hundreds of thousands of cores requires additional reductions in communication channels. Our experiments on BG/P use a 3D routing topology that is very similar to the 2D topology illustrated in Figure 4, and is designed to mirror the BG/P 3D torus interconnect topology.

Figure 4. Illustration of 2D communicator routing of 16 ranks (send ranks vs. receive ranks, with messages aggregated at per-block message routers). As an example, when Rank 11 sends to Rank 5, the message is first aggregated and routed through Rank 9.
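To make the routing rule concrete, the short C++ sketch below computes the intermediate router for a (source, destination) pair on a square, row-major grid of ranks. It is an illustrative assumption of how the routing in Figure 4 could be computed (the first hop stays in the sender's row and moves to the destination's column); it reproduces the caption's example of Rank 11 reaching Rank 5 through Rank 9, but it is not taken from the paper's implementation.

#include <cassert>
#include <cmath>

// Sketch: route a message on a cols x cols grid of ranks laid out row-major.
// The first hop goes to the rank in the sender's row that sits in the
// destination's column; the second hop delivers down that column, so each
// rank only needs channels to its own row and column, i.e., O(sqrt(p)).
inline int router_rank(int src, int dst, int cols) {
  int src_row = src / cols;   // sender's row in the 2D layout
  int dst_col = dst % cols;   // destination's column
  return src_row * cols + dst_col;
}

int main() {
  const int p = 16;                              // 16 ranks as in Figure 4
  const int cols = static_cast<int>(std::sqrt(p));
  // Figure 4's example: Rank 11 -> Rank 9 -> Rank 5.
  assert(router_rank(11, 5, cols) == 9);
  return 0;
}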
IV. DISTRIBUTED ASYNCHRONOUS VISITOR QUEUE

The driver of our graph traversal is the distributed asynchronous visitor queue; it provides the parallelism and creates a data-driven flow of computation. Traversal algorithms are created using a visitor abstraction, which allows an algorithm designer to define vertex-centric procedures to execute on traversed vertices, with the ability to pass visitor state to other vertices. The visitor pattern is discussed in Section IV-A. Each asynchronous traversal begins with an initial set of visitors, which may create additional visitors dynamically depending on the algorithm and graph topology. All visitors are asynchronously transmitted, scheduled, and executed. When the visitors execute on a vertex, they are guaranteed exclusive access to the vertex's data. The traversal completes when all visitors have completed and the distributed queue is globally empty.

A. Visitor Abstraction

In our previous work [4], we used an asynchronous visitor pattern to compute Breadth-First Search, Single Source Shortest Path, and Connected Components in shared and external memory. We used multi-threaded prioritized visitor queues to perform the asynchronous traversal.

In this work, we build on the asynchronous visitor pattern with modifications to handle edge list partitioning and ghost vertices. The required visitor procedures and state are outlined in Table I.

Table I
VISITOR PROCEDURES AND STATE

  Required       Description
  pre_visit( )   Performs a preliminary evaluation of the state and returns true if the visit should proceed; this can be applied to ghost vertices.
  visit( )       Main visitor procedure.
  operator<( )   Less-than comparison used to locally prioritize the visitors in a min-heap priority queue.
  vertex         Stored state representing the vertex to be visited.
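As a small illustrative note on the operator<() entry of Table I (an assumption about one way to realize it in C++, not the paper's code): std::priority_queue is a max-heap by default, so the min-heap ordering described above can be obtained by inverting the visitor's operator< in the queue's comparator.

#include <cstdint>
#include <queue>
#include <vector>

// Toy visitor carrying a BFS-style level; lower levels should run first,
// so operator< orders visitors by level.
struct toy_visitor {
  uint64_t vertex;
  uint32_t level;
  bool operator<(const toy_visitor& other) const { return level < other.level; }
};

// priority_queue keeps the element that compares "largest" on top; inverting
// operator< in the comparator therefore yields the min-heap of Table I.
struct min_heap_order {
  bool operator()(const toy_visitor& a, const toy_visitor& b) const {
    return b < a;
  }
};

using local_visitor_queue =
    std::priority_queue<toy_visitor, std::vector<toy_visitor>, min_heap_order>;

int main() {
  local_visitor_queue q;
  q.push({7, 3});   // vertex 7 at level 3
  q.push({2, 1});   // vertex 2 at level 1
  // q.top() is now the level-1 visitor: lower levels are dequeued first.
  return 0;
}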
B. Ghost Vertices

Ghost information replicates distributed state to avoid communication. By replicating the state of vertices with large in-degree, the communication hotspot associated with that vertex can be reduced. The ideal case is to reduce the communication from hundreds of millions of messages down to O(p), where each partition only requires communicating once per hub vertex.

The ghost information is never globally synchronized, and represents only the local partition's view of remote hubs. Each partition locally identifies high-degree vertices among its edges' targets to become ghost vertices. Ghosts cannot be used for every algorithm, so each algorithm must explicitly declare ghost usage.

We investigate the useful number of ghosts in our experimental study in Section VII-E2. The number of ghosts required for scale-free graphs is small, because the number of high-degree vertices is small.

Ghosts can only be used as an imprecise filter for algorithms such as BFS, because the ghosts are not globally synchronized. Algorithms that require precise counts of events, such as k-core (discussed in Section VI-B), cannot use ghosts.
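The local, unsynchronized nature of ghost selection can be sketched as follows. This C++ fragment is an illustrative assumption of how a partition might pick its ghosts (count how often each target vertex appears in the local edge list and keep the most frequent ones); the text above states only that each partition identifies high-degree targets locally, so the helper name and the exact selection rule here are hypothetical.

#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch: choose up to num_ghosts ghost candidates for one partition by
// counting how often each target vertex appears among the partition's edges.
// Frequent targets are the locally visible hubs.
std::vector<uint64_t> select_local_ghosts(
    const std::vector<std::pair<uint64_t, uint64_t>>& local_edges,
    std::size_t num_ghosts) {
  std::unordered_map<uint64_t, uint64_t> target_count;
  for (const auto& edge : local_edges) {
    ++target_count[edge.second];          // edge.second is the target vertex
  }
  std::vector<std::pair<uint64_t, uint64_t>> by_count(target_count.begin(),
                                                      target_count.end());
  std::sort(by_count.begin(), by_count.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; });
  std::vector<uint64_t> ghosts;
  for (std::size_t i = 0; i < by_count.size() && i < num_ghosts; ++i) {
    ghosts.push_back(by_count[i].first);  // keep the most frequent targets
  }
  return ghosts;
}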
C. Visitor Queue Interface

The visitor queue has the following functionality that may be used by a visitor or initiating algorithm:

• push(visitor) pushes a visitor into the visitor queue.
• do_traversal() initializes and runs the asynchronous traversal to completion. This is used by the initiating algorithm.

When an algorithm requires dynamic creation of new visitors, they are pushed into the visitor queue using the push(visitor) procedure. When an algorithm begins, an initial set of visitors is pushed into the queue, then the do_traversal() procedure is invoked and runs the asynchronous traversal to completion.

D. Example Traversal

To illustrate how an asynchronous traversal algorithm works, we discuss Breadth-First Search (BFS) at a high level. The details of BFS are discussed in Section VI-A. The visitor is shown in Algorithm 2, and the initiating procedure is shown in Algorithm 3.

BFS begins by setting an initial path length of ∞ for every vertex (Alg. 3, line 4). A visitor is queued for the source vertex with path length 0, then the traversal begins (Alg. 3, lines 9-10). As the visitors proceed, if they contain a lower path length than is currently known for the vertex, they update the path information and queue new visitors for the outgoing edges (Alg. 2, lines 13-16).
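To make the initiation step concrete, here is a minimal C++ sketch of an initiating BFS procedure in the spirit of the description above. It is not the paper's Algorithm 3 (which is not reproduced in this excerpt); the graph and queue types, member names, and stub bodies are assumptions used only for illustration.

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-ins for the framework's graph and visitor queue.
struct bfs_visitor { uint64_t vertex; uint32_t length; uint64_t parent; };

template <typename Visitor>
struct visitor_queue {
  void push(const Visitor&) { /* send to min_owner via the mailbox (Alg. 1) */ }
  void do_traversal()      { /* loop until globally empty (Alg. 1) */ }
};

struct graph {
  struct vertex_data { uint32_t length; uint64_t parent; };
  std::vector<vertex_data> vertices;
  std::size_t num_local_vertices() const { return vertices.size(); }
  vertex_data& local_vertex(std::size_t i) { return vertices[i]; }
};

// Sketch of an initiating procedure: set every local path length to
// "infinity", seed the source with length 0, then run the traversal.
void run_bfs(graph& g, visitor_queue<bfs_visitor>& vq, uint64_t source) {
  for (std::size_t i = 0; i < g.num_local_vertices(); ++i) {
    g.local_vertex(i).length = std::numeric_limits<uint32_t>::max();
  }
  // The source acts as its own parent in this sketch.
  vq.push(bfs_visitor{source, 0, source});
  vq.do_traversal();
}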
V. VISITOR QUEUE DESIGN DETAILS

In this section, we discuss the design of the distributed visitor queue. The visitor queue has the following functionality, which will be discussed in detail in the following subsections and is shown in Algorithm 1:

• push(visitor) pushes a visitor into the distributed queue;
• check_mailbox() receives visitors from the mailbox and queues them locally;
• global_empty() returns true if globally empty;
• do_traversal() runs the asynchronous traversal.

mailbox: The communication occurs through a mailbox abstraction with the following functionality:

• send(rank, data) sends data to rank, using the routing and aggregation network;
• receive() receives messages from any sender.

ghost information: Our graph has the following ghost-related operations used by the distributed visitor queue:

• has_local_ghost(v) returns true if local ghost information is stored for v;
• local_ghost(v) returns the ghost information stored for v.

Algorithm 1 Distributed Visitor Queue
 1: state: g — input graph
 2: state: mb — mailbox communication
 3: state: my_rank — local partition rank
 4: state: local_queue — local visitor priority queue
 5: procedure PUSH(visitor)
 6:   vertex = visitor.vertex
 7:   master_partition = g.min_owner(vertex)
 8:   if g.has_local_ghost(vertex) then
 9:     ghost = g.local_ghost(vertex)
10:     if visitor.pre_visit(ghost) then        ▷ pre_visit the locally stored ghost
11:       mb.send(master_partition, visitor)
12:     end if
13:   else
14:     mb.send(master_partition, visitor)
15:   end if
16: end procedure
17: procedure CHECK_MAILBOX
18:   for all visitor ∈ mb.receive() do
19:     vertex = visitor.vertex
20:     if visitor.pre_visit(g[vertex]) then
21:       local_queue.push(visitor)
22:       if my_rank < g.max_rank(vertex) then
23:         mb.send(my_rank + 1, visitor)       ▷ forwards to next replica
24:       end if
25:     end if
26:   end for
27: end procedure
28: procedure GLOBAL_EMPTY                      ▷ quiescence detection [12]
29:   return true if globally empty, else false
30: end procedure
31: procedure DO_TRAVERSAL(source_visitor)
32:   push(source_visitor)
33:   while !global_empty() do
34:     check_mailbox()
35:     if !local_queue.empty() then
36:       next_visitor = local_queue.pop()
37:       next_visitor.visit(g, this)
38:     end if
39:   end while
40: end procedure

push(visitor): This function pushes newly created visitors into the distributed visitor queue; the details are shown in Algorithm 1 (line 5). If ghost information about the visitor's vertex is stored locally, it is pre_visited (line 8). If the pre_visit returns true, or no local ghost information is found, then the visitor is sent to min_owner using the mailbox (lines 11, 14). This functionality abstracts the knowledge of ghost vertex information away from the visitor function. If local ghost information is found, the framework will apply the visitor to the ghost. The ghosts act as local filters, reducing unnecessary visitors sent to hub vertices.

check_mailbox(): This function checks for incoming visitors from the mailbox, and forwards visitors to potential edge-list-partitioned replicas. The details are shown in Algorithm 1 (line 17). For all visitors received from the mailbox (line 18), if the visitor's pre_visit returns true, the visitor is queued locally (line 21). Additionally, if the vertex is owned by ranks larger than the current rank, the visitor is forwarded to the next replica (line 22). This forwarding chains together vertices whose adjacency lists span multiple edge list partitions. The replicas are kept loosely consistent because visitors are first sent to the master and then forwarded to the chain of replicas in an ordered manner.
global_empty(): This function checks whether all global visitor queues are empty, returning true if all queues are empty, and is used for termination detection. It is implemented using a simple O(lg(p)) quiescence detection algorithm based on visitor counting [12]. The algorithm performs an asynchronous reduction of the global visitor send and receive counts using non-blocking point-to-point MPI communication. It is important to note that the check for non-termination is an asynchronous event, and it only becomes synchronous after the visitor queues are already empty.
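As a rough illustration of count-based quiescence detection, the sketch below reduces per-rank sent/received/queued visitor counts with a non-blocking collective and reports quiescence when no rank has queued work and every sent visitor has been received. It is an assumption-laden simplification: the implementation described above uses an O(lg(p)) reduction built from non-blocking point-to-point messages, and a robust detector must also confirm that the counts are stable across consecutive rounds.

#include <cstdint>
#include <mpi.h>

// Sketch: one round of count-based quiescence detection. A production
// detector repeats the check until the totals are stable across two rounds
// (see [12]); that refinement is omitted here.
bool probably_globally_empty(uint64_t local_sent, uint64_t local_received,
                             uint64_t local_queued, MPI_Comm comm) {
  uint64_t local_counts[3]  = {local_sent, local_received, local_queued};
  uint64_t global_counts[3] = {0, 0, 0};
  MPI_Request request;
  MPI_Iallreduce(local_counts, global_counts, 3, MPI_UINT64_T, MPI_SUM,
                 comm, &request);
  int done = 0;
  while (!done) {
    // The reduction completes asynchronously; a real visitor queue keeps
    // processing mailbox traffic while polling here.
    MPI_Test(&request, &done, MPI_STATUS_IGNORE);
  }
  // Quiescent only if no rank has queued visitors and no visitor is in flight.
  return global_counts[2] == 0 && global_counts[0] == global_counts[1];
}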
do_traversal(): This is the driving loop behind the asynchronous traversal process and is shown in Algorithm 1 (line 31). The procedure assumes that a set of initial visitors has been previously queued, then the main traversal loops until all visitors globally have been processed (line 33). During the loop, it checks the mailbox for incoming visitors (line 34), and processes visitors already queued locally.

Algorithm 2 BFS Visitor
 1: visitor state: vertex — vertex to be visited
 2: visitor state: length — BFS length
 3: visitor state: parent — BFS parent
 4: procedure PRE_VISIT(vertex_data)
 5:   if length < vertex_data.length then
 6:     vertex_data.length ← length
 7:     vertex_data.parent ← parent
 8:     return true
 9:   end if
10:   return false
11: end procedure
12: procedure VISIT(graph, visitor_queue)
13:   if length == graph[vertex].length then
14:     for all vi ∈ out_edges(g, vertex) do
15:       new_vis ← bfs_visitor(vi, length + 1, vertex)
16:       visitor_queue.push(new_vis)
17:     end for
18:   end if
19: end procedure
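For readers who prefer code to pseudocode, the following C++ sketch mirrors Algorithm 2 and the interface of Table I. The vertex_data, toy_graph, and queue types are illustrative assumptions, not the paper's actual classes.

#include <cstdint>
#include <vector>

// Illustrative stand-ins for the framework types used by the visitor.
struct vertex_data { uint32_t length; uint64_t parent; };
struct toy_graph {
  std::vector<vertex_data> data;
  std::vector<std::vector<uint64_t>> adj;          // out-edges per vertex
  vertex_data& operator[](uint64_t v) { return data[v]; }
  const std::vector<uint64_t>& out_edges(uint64_t v) const { return adj[v]; }
};

struct bfs_visitor {
  uint64_t vertex;   // vertex to be visited (Table I: vertex)
  uint32_t length;   // BFS length
  uint64_t parent;   // BFS parent

  // Table I: pre_visit() — may run against a ghost or the master copy.
  bool pre_visit(vertex_data& vd) const {
    if (length < vd.length) { vd.length = length; vd.parent = parent; return true; }
    return false;
  }

  // Table I: visit() — expand out-edges when this visitor still holds the
  // best known length for the vertex (Algorithm 2, lines 13-17).
  template <typename Queue>
  void visit(toy_graph& g, Queue& vq) const {
    if (length == g[vertex].length) {
      for (uint64_t vi : g.out_edges(vertex)) {
        vq.push(bfs_visitor{vi, length + 1, vertex});
      }
    }
  }

  // Table I: operator<() — lower BFS length gets higher local priority.
  bool operator<(const bfs_visitor& other) const { return length < other.length; }
};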
Figure 5. Weak scaling of Asynchronous BFS on BG/P Intrepid with all edge weights equal to 1, compared to Intrepid BFS performance from the Graph500 list. There are 2^18 vertices per core, with the largest scale graph having 2^35.

Figure 7. Weak scaling of triangle counting on BG/P using Small World graphs. Performance shown with different small-world rewire probabilities (0%, 10%, 20%, 30%). There are 2^18 vertices and 2^22 undirected edges per core; at 4,096 cores, the graph has 2^30 vertices and 2^34 edges.

[Figure: weak scaling of K-Core (k = 64) on BG/P with RMAT graphs, 512 to 4,096 processing cores.]

[Figure: weak scaling of asynchronous BFS in distributed external memory on Hyperion-DIT (MTEPS vs. number of compute nodes, 8 cores per node).]

... in Section VII-D, the performance of triangle counting is dependent on the maximum vertex degree of the graph. For this weak scaling study, we use small world graphs to isolate the effects of hub growth that would occur with PA or RMAT graphs.

C. Scalability of Distributed External Memory BFS

We demonstrate the distributed external memory scalability of our BFS algorithm on the Hyperion-DIT cluster at the ...

Figure 9. Effects of increasing external memory usage on 64 compute nodes of Hyperion-DIT. At 2^36 vertices, which is 32x larger data than DRAM-only, NVRAM performance is only 39% slower than DRAM graph storage.

Figure 10. Effects of diameter on BFS performance. Small World graph model with varying random rewire probabilities (shown above each point). BFS level depth is used for the x-axis. Computational resources fixed at 4,096 cores of BG/P. Graph size is fixed at 2^30 vertices and 2^34 undirected edges.

Figure 11. Effects of vertex degree on Triangle Counting performance. Preferential Attachment graph model with varying random rewire probabilities (shown above each data point). Maximum vertex degree is used for the x-axis. Computational resources fixed at 4,096 cores of BG/P. Graph size is fixed at 2^28 vertices and 2^32 undirected edges.
... twice the size as experiments on BG/P Intrepid. We expect our approach to continue scaling as larger HPC clusters with ...

Figure 13. Experiment showing the percent improvement of ghost vertices vs. no ghost vertices (x-axis: number of ghosts per partition, 1 to 256). Results from 4,096 cores of BG/P on a graph with 2^30 vertices.

... 1D suffers slowdowns from the partition imbalance.

2) Use of Ghost Vertices: We experimented with the effects on BFS performance of choosing different sizes k on a graph with 2^30 vertices on 4,096 cores of BG/P. The results of this experiment are shown in Figure 13, where the percent improvement of k ghosts per partition vs. no ghosts is shown. Using a single ghost shows more than a 12% improvement, and 512 ghosts shows a 19.5% improvement. This improvement is highly dependent on the graph, and as hubs grow to larger sizes it may have a larger effect. For scale-free graphs, when degree(v) > p, there is an opportunity for ghosts to have a positive effect, because at least one partition will have multiple edges to v. All other BFS experiments in this work use 256 ghost vertices per partition.

... For the sparse Graph500 datasets with an average degree of 16, this may occur for as few as 256 partitions and is independent of graph size. Second, under weak scaling, the amount of algorithm state (e.g., BFS level) stored per partition scales with O(V/p), where V is the total number of vertices. This will ultimately hit a scaling wall where the amount of local algorithm state per partition exceeds the capacity of the compute node. Finally, with respect to our desire to use semi-external memory, where the vertex set is stored in memory and the edge set is stored in external memory, hypersparse partitions are poor candidates for semi-external memory techniques, because the in-memory requirements (proportional to the number of vertices) are larger than the external memory requirements (proportional to the number of edges). Our edge list partitioning does not create hypersparse partitions (assuming the original graph is not hypersparse) as 2D partitioning can.

Recent work related to our routed mailbox has been explored by Willcock [20]. Active messages are routed through a synthetic hypercube network to improve dense communication scalability. A key difference from our work is that their approach has been designed for the Bulk Synchronous Parallel (BSP) model and is not suitable for asynchronous graph traversals.

The STAPL Graph Library [21] provides a framework that abstracts the user from data distribution and parallelism, and allows asynchronous algorithms to be expressed.