[Figure 3 residue: (a) block diagram of the 2-D flattened butterfly (nodes N0 through N7, routers R0 through R3); (b) its layout; (c) bar charts of average and worst-case latency (nsec) comparing the concentrated mesh ("Conc mesh") and the flattened butterfly.]
3 On-Chip Flattened Butterfly

3.1 Topology Description

The flattened butterfly topology [15] is a cost-efficient topology for use with high-radix routers. The flattened butterfly is derived by combining (or flattening) the routers in each row of a conventional butterfly topology while preserving the inter-router connections. The flattened butterfly is similar to a generalized hypercube [4]; however, by providing concentration in the routers, the flattened butterfly significantly reduces the wiring complexity of the topology, allowing it to scale more efficiently.

To map a 64-node on-chip network onto the flattened butterfly topology, we collapse a 3-stage radix-4 butterfly network (4-ary 3-fly) to produce the flattened butterfly shown in Figure 3(a). The resulting flattened butterfly has 2 dimensions and uses radix-10 routers. With four processor nodes attached to each router, the routers have a concentration factor of 4. The remaining 6 router ports are used for inter-router connections: 3 ports are used for the dimension 1 connections, and 3 ports are used for the dimension 2 connections. Routers are placed as shown in Figure 3(b) to embed the topology in a planar VLSI layout, with each router placed in the middle of its 4 processing nodes. Routers connected in dimension 1 are aligned horizontally, while routers connected in dimension 2 are aligned vertically; thus, the routers within a row are fully connected, as are the routers within a column.

The wire delay associated with the Manhattan distance between a packet's source and its destination provides a lower bound on the latency required to traverse an on-chip network. When minimal routing is used, processors in this flattened butterfly network are separated by only 2 hops, which is a significant improvement over the hop count of a 2-D mesh. The flattened butterfly attempts to approach the wire delay bound by reducing the number of intermediate routers, resulting in not only lower latency but also lower energy consumption. However, the wires connecting distant routers in the flattened butterfly network are necessarily longer than those found in the mesh. The adverse impact of long wires on performance is readily reduced by optimally inserting repeaters and pipeline registers to preserve the channel bandwidth while tolerating channel traversal times that may be several cycles. The longer channels also require deeper buffers to cover the credit round-trip latency in order to maintain full throughput.

1 The latency calculations were based on Intel TeraFlop [25] parameters (tr = 1.25 ns, b = 16 GB/s, L = 320 bits) and an estimated wire delay for 65 nm (tw = 250 ps per mm).
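To make the parameter arithmetic above concrete, here is a minimal sketch (the helper name is ours, not from the paper) computing the node count, router radix, and minimal-routing hop bound of a 2-D flattened butterfly:

```python
def flattened_butterfly_2d(c, k):
    """Parameters of a 2-D flattened butterfly.

    c: concentration factor (processor nodes per router)
    k: routers per dimension (a k x k grid of routers)
    """
    num_nodes = c * k * k
    # Each router serves c local nodes and connects to the k - 1
    # other routers in its row and the k - 1 other routers in its column.
    radix = c + 2 * (k - 1)
    # Minimal routing corrects at most one dimension per hop,
    # so any two routers are at most 2 hops apart.
    max_router_hops = 2
    return num_nodes, radix, max_router_hops

# The 64-node configuration from Section 3.1: c = 4, k = 4
# yields 64 nodes, radix-10 routers, and a 2-hop router diameter.
print(flattened_butterfly_2d(4, 4))  # (64, 10, 2)
```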
3.2 Routing and Deadlock

Both minimal and non-minimal routing algorithms can be implemented on the flattened butterfly topology. A limited number of virtual channels (VCs) [8] may be needed to prevent deadlock within the network when certain routing algorithms are used. Additional VCs may be required for purposes such as separating traffic into different classes or avoiding deadlock in the client protocols. Dimension-
Figure 4. Routing paths in a 2-D on-chip flattened butterfly. (a) All of the traffic from nodes attached to R1 is sent to nodes attached to R2. The minimal path route is shown in (b) and the two non-minimal paths are shown in (c) and (d). For simplicity, the processing nodes attached to these routers are not shown and only the first row of routers is shown.

[Figure residue: router block diagram showing the bypass channels, input muxes, and output muxes, with connections to/from R0, R2, and R3.]
pass channels) or direct inputs (e.g. inputs/outputs to/from the local router). The input muxes receive packets destined for the local router that would otherwise bypass the local router en route to the intermediate node selected by the routing algorithm, as illustrated in Figure 4(c). Thus, each input mux receives both the direct inputs that the packet would have used if the non-minimal path was taken and the inputs from the bypass channels. The output muxes are used by packets that would have initially been routed away from their destinations before being routed back over the local router, as illustrated in Figure 4(d). The inputs to the output muxes are the direct outputs from the local router and the bypass channel inputs, the path the packet would have taken if the non-minimal routing path had been used. The addition of these muxes does not eliminate the need for non-minimal routing for load-balancing purposes. Instead, the muxes reduce the distance traveled by packets to improve energy efficiency and reduce latency.

2 Using the analogy of cars and highways, the long wires introduced correspond to adding highways. The input muxes correspond to adding additional exit ramps to the highways, while the output muxes correspond to adding entrance ramps to get on the highway.
4.2 Mux Arbiter

The arbiters that control the bypass muxes are critical to the proper utilization of the bypass channels. A simple round-robin arbiter could be implemented at the muxes. While this type of arbiter leads to a locally fair arbitration at the mux, it does not guarantee global fairness [11]. To provide global fairness, we implement a yield arbiter that yields to the primary input, i.e. the input that would have used the channel bandwidth at the output of the mux if the bypass channels were not connected to the local router. Accordingly, the direct input is given priority at an input mux, while the bypass channel is given priority at an output mux. Thus, if the primary input is idle, the arbiter grants access to the non-primary input.

To prevent starvation of the non-primary inputs, a control packet is sent along the non-minimal path originally selected by the routing algorithm. This control packet contains only routing information and a marker bit to identify the packet as a control packet. The control packet is routed at the intermediate node as though it were a regular packet, which results in it eventually arriving at the primary input of the muxes at the destination router. When the control packet arrives, the non-primary input is granted access to the mux output bandwidth. This policy guarantees that a packet waiting at the non-primary input of a bypass mux will eventually be granted access to the bypass mux. In the worst-case (i.e. highly congested) environment, the latency of the non-minimally routed packets will be identical to that of a flattened butterfly that does not directly utilize the bypass channels. However, there will still be energy savings because the flit does not traverse the non-minimal physical distance; instead, only the control flit, which is much smaller than a data packet, travels the full physical distance of the non-minimal route.
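A minimal sketch of this grant rule (names and structure are illustrative, not from the paper): yield to the primary input, grant the non-primary input when the primary input is idle, and treat an arriving control packet as the waiting packet's claim on the output:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    is_control: bool = False  # marker bit distinguishing control packets

def bypass_mux_grant(primary, nonprimary):
    """One arbitration decision of the yield arbiter (sketch).

    primary:    packet on the direct (primary) input, or None if idle
    nonprimary: packet waiting on the bypass (non-primary) input, or None
    """
    if primary is not None:
        # A control packet arriving on the primary input stands in for
        # the packet waiting at the non-primary input, so the waiting
        # packet is granted access; this bounds its waiting time.
        if primary.is_control and nonprimary is not None:
            return "nonprimary"
        return "primary"      # otherwise, yield to the primary input
    if nonprimary is not None:
        return "nonprimary"   # primary input idle: grant the bypass input
    return None
```

In the congested worst case, the control packet paces the waiting packet, which matches the latency argument above.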
4.3 Switch Architecture

With minimal routing, the crossbar switch can be simplified because it need not be fully connected. Non-minimal routing increases the complexity of the switch because some packets might need to be routed twice within a dimension, which requires more connections within the crossbar. However, by using the bypass channels efficiently, non-minimal routing can be implemented using a switch of lesser complexity, one that approaches the complexity of a flattened butterfly that only supports minimal routing. If the bypass channels are utilized, non-minimal routing does not require sending full packets through intermediate routers, and as a result, the connectivity within the switch approaches that of a switch that only supports minimal routing.
4.4 Flow Control and Routing

Buffers are required at the non-primary inputs of the bypass muxes for flow control, as illustrated in Figure 6. Thus, with non-minimal routing, credits for the bypass buffers are needed before a packet can depart a router. The control packets that are generated can be buffered in the input buffers. However, since buffers are an expensive resource in on-chip networks, instead of having the short control packets occupy the input buffers, minor modifications can be made to the input buffers to handle the control packets. Once a control packet arrives, its destination is stored in a separate buffer (shown as the DEST field in Figure 6) and the control bits of the input buffer are updated such that the control packet can be properly processed in the router. These modifications should introduce little overhead because the datapaths of on-chip networks are typically

Figure 6. Modification of the input buffers to support flow control with bypass channels. Each entry holds the flit Data plus additional bits: V, a valid bit; CNT, a count of control packets; and DEST, the control packet content, which contains a destination. Once a flit in the input buffer is processed, if its V bit is set, CNT control packets are processed in the router before the next flit in the input buffer is processed.
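A sketch of the processing rule that Figure 6 describes (field and function names are assumptions): each input-buffer entry carries the flit plus the V, CNT, and DEST bookkeeping, and CNT control packets are drained after any flit whose V bit is set:

```python
from dataclasses import dataclass, field

@dataclass
class BufferEntry:
    data: object                 # the flit itself
    v: bool = False              # valid bit: control packets follow this flit
    cnt: int = 0                 # number of control packets to process
    dest: list = field(default_factory=list)  # their destinations (DEST)

def drain_input_buffer(entries, process_flit, process_control):
    """Process flits, interleaving control packets as Figure 6 describes."""
    for entry in entries:
        process_flit(entry.data)
        if entry.v:
            # CNT control packets are handled before the next flit.
            for dest in entry.dest[:entry.cnt]:
                process_control(dest)
```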
5 Evaluation

We compare the performance of the following topologies in this section:

1. conventional 2-D mesh (MESH)
2. concentrated mesh (CMESH)
3. flattened butterfly with minimal routing (FBFLY-MIN)
4. flattened butterfly with non-minimal routing (FBFLY-NONMIN)
5. flattened butterfly with non-minimal routing and bypass channels (FBFLY-BYP)

The routing algorithm used with each topology is listed in Table 1.

Topology       Routing
FBFLY-MIN      randomized-dimension
FBFLY-NONMIN   UGAL [23]
FBFLY-BYP      UGAL [23]
CMESH          O1Turn with express channels [3]
MESH           O1Turn [22]

Table 1. Routing algorithms used in the simulation comparison.

[Figure residue: latency (cycles) versus offered load (fraction of capacity), 0 to 0.5, for CMESH, FBFLY-MIN, FBFLY-NONMIN, and FBFLY-BYP.]
Figure 9. Node completion time variance for the different topologies: (a) mesh, (b) CMESH, and (c) flattened butterfly. [Histograms of the number of nodes versus completion time (cycles).]
two topologies that utilize concentration to reduce the cost of the network. By effectively utilizing non-minimal routing and the smaller diameter of the topology, FBFLY can provide up to a 50% increase in throughput compared to CMESH while providing lower zero-load latency. Although the MESH can provide higher throughput for some traffic patterns, it has been previously shown that CMESH results in a more cost- and energy-efficient topology compared to the MESH [3].

[Figure 8 residue: latency normalized to the mesh network (0 to 0.8) for the traffic patterns UR, bitcomp, bitrev, transpose, tornado, and randperm on MESH, CMESH, FBFLY-MIN, FBFLY-NONMIN, and FBFLY-BYP.]

Figure 8. Latency comparison of alternative topologies across different synthetic traffic patterns.

5.1.2 Synthetic Batch Traffic

In addition to the throughput measurement, we use a batch experiment to model the memory coherence traffic of a shared memory multiprocessor. Each processor executes a fixed number of remote memory operations (e.g. remote cache line read/write requests) during the simulation and we record the time required for all operations to complete. Read requests and write acknowledgements are mapped into 64-bit messages, while read replies and write requests are mapped into 576-bit messages. Each processor may have up to four outstanding memory operations. The synthetic traffic patterns used are uniform random (UR), bit complement, transpose, tornado, a random permutation, and bit reverse [11].

Figure 8 shows the performance comparison for the batch experiment; latency is normalized to the mesh network. CMESH reduces latency further, compared to the MESH. By using FBFLY-NONMIN, the latency can actually increase because of the extra latency incurred with the non-minimal routing. However, FBFLY-BYP provides the benefit of non-minimal routing while reducing the latency, as all packets take the minimal physical path; this results in approximately a 28% latency reduction compared to the MESH network.

In addition to the latency required to complete the batch job, we also plot the variance of the completion time of the 64 nodes. A histogram that collects the completion time for each processing node is shown in Figure 9. With the flattened butterfly, because of the lower diameter, the completion time has a much tighter distribution and smaller variance across all of the nodes. Less variance can reduce the global synchronization time in chip multiprocessor systems. The CMESH, because it is not a symmetric topology, leads to an unbalanced distribution of completion times.
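For reference, the synthetic patterns named above can be written down directly from their standard definitions in [11]; a sketch for a 64-node network (6-bit addresses; the flat tornado mapping is one common convention, and the permutation is regenerated per run):

```python
import random

N_BITS = 6          # 64-node network
N = 1 << N_BITS

def bit_complement(s):
    return s ^ (N - 1)

def bit_reverse(s):
    out = 0
    for i in range(N_BITS):
        if s & (1 << i):
            out |= 1 << (N_BITS - 1 - i)
    return out

def transpose(s):
    # swap the high and low halves of the address bits
    half = N_BITS // 2
    return ((s & ((1 << half) - 1)) << half) | (s >> half)

def tornado(s):
    # send roughly halfway around the network
    return (s + N // 2 - 1) % N

def uniform_random(s):
    return random.randrange(N)

# A fixed random permutation: node s always sends to PERM[s].
PERM = random.sample(range(N), N)
def random_permutation(s):
    return PERM[s]
```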
5.1.3 Multiprocessor Traces

Network traces were collected from an instrumented version of a 64-processor directory-based Transactional Coherence and Consistency (TCC) multiprocessor simulator [6]. In addition to capturing the traffic patterns and message distributions, we record detailed protocol information so that we can infer dependencies between messages and identify sequences of interdependent communication and computation phases at each processor node. This improves the accuracy of the performance measured when the traces are replayed through a cycle-accurate network simulator. When capturing the traces, we use an idealized interconnection network model which provides instantaneous message delivery to avoid adversely biasing the traces. The traffic injection process uses the recorded protocol information to reconstruct the state of each processor node. Essentially, each processor is modelled as being in one of two states: an active computing state, during which it progresses towards its next communication event; or an idle communicating state, in which
[Figure 10 residue: bar chart of latency normalized to the mesh network (0.5 to 1.0) for the benchmarks barnes, ocean, equake, and tomcatv on the mesh, CMESH, FBFLY-MIN, FBFLY-NONMIN, and FBFLY-BYP topologies.]

Figure 10. Performance comparison from SPLASH benchmark traces generated from a distributed TCC simulator.

[Figure 11 residue: power (W, 0 to 12) broken down into memory, crossbar, and channel components for MESH, CMESH, FBFLY-MIN, FBFLY-NONMIN, and FBFLY-BYP.]

Figure 11. Power consumption comparison of alternative topologies on UR traffic.
Figure 12. Layout of 64-node on-chip networks, illustrating the connections for the top two rows of nodes and routers for (a) a conventional 2-D mesh network, (b) a 2-D flattened butterfly, and (c) a generalized hypercube. Because of the complexity, only the channels connected to R0 are shown for the generalized hypercube.

Figure 13. Different methods to scale the on-chip flattened butterfly by (a) increasing the concentration factor, (b) increasing the dimension of the flattened butterfly, and (c) using a hybrid approach to scaling.
neighboring routers in the middle by a factor of 4, the use of concentration allows the two rows of wire bandwidth to be combined, resulting in a reduction of bandwidth per channel by only a factor of 2. However, the generalized hypercube (Figure 12(c)) topology would increase the number of channels in the bisection of the network by a factor of 16, which would adversely impact the serialization latency and the overall latency of the network.
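As a rough check of these factors, one can count the channels crossing a vertical bisection for 64 nodes; the 8x8 layout and the counting convention below are assumptions consistent with Figure 12, not taken from the paper:

```python
def mesh_bisection(rows):
    # In a mesh, one channel per node row crosses the vertical cut.
    return rows

def fully_connected_rows_bisection(routers_per_row, rows):
    # With fully connected rows, every router in the left half of a
    # row has a channel to every router in the right half.
    half = routers_per_row // 2
    return rows * half * half

mesh  = mesh_bisection(8)                      # 8x8 mesh of nodes: 8
fbfly = fully_connected_rows_bisection(4, 4)   # 4x4 routers (c = 4): 16
ghc   = fully_connected_rows_bisection(8, 8)   # 8x8 routers (c = 1): 128

# Twice the mesh's channel count for the flattened butterfly (halving
# per-channel bandwidth), but 16x for the generalized hypercube.
print(fbfly // mesh, ghc // mesh)  # 2 16
```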
6.2 Scaling On-Chip Flattened Butterfly

The flattened butterfly in on-chip networks can be scaled to accommodate more nodes in different ways. One method of scaling the topology is to increase the concentration factor. For example, the concentration factor can be increased from 4 to 8 to increase the number of nodes in the network from 64 to 128, as shown in Figure 13(a). This further increases the radix of the router to radix-14. With this approach, the bandwidth of the inter-router channels needs to be properly adjusted such that there is sufficient bandwidth to support the terminal bandwidth.

Another scaling methodology is to increase the dimension of the flattened butterfly. For example, the dimension can be increased from a 2-D flattened butterfly to a 3-D flattened butterfly to provide an on-chip network with up to 256 nodes, as shown in Figure 13(b). To scale to a larger number of nodes, both the concentration factor and the number of dimensions can be increased.

However, as mentioned in Section 6.1, as the number of channels crossing the bisection increases, the reduction in bandwidth per channel and the increased serialization latency become problematic. To overcome this, a hybrid approach can be used to scale the on-chip flattened butterfly. One such possibility is shown in Figure 13(c), where the 2-D flattened butterfly is used locally and the clusters of 2-D flattened butterflies are connected with a mesh network. This reduces the number of channels crossing the bisection and minimizes the impact of narrower channels at the expense of a slight increase in the average hop count (compared to a pure flattened butterfly).
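Generalizing the 2-D sketch from Section 3.1, the scaling options can be compared with an n-dimensional version (the 3-D parameters k = 4 and c = 4 are assumptions consistent with the 256-node count quoted above):

```python
def flattened_butterfly(c, k, n):
    """Node count and router radix of an n-D flattened butterfly (sketch).

    c: concentration factor, k: routers per dimension, n: dimensions.
    """
    num_nodes = c * k ** n
    radix = c + n * (k - 1)
    return num_nodes, radix

print(flattened_butterfly(4, 4, 2))  # baseline:     (64, 10)
print(flattened_butterfly(8, 4, 2))  # Figure 13(a): (128, 14), radix-14
print(flattened_butterfly(4, 4, 3))  # Figure 13(b): (256, 13), 3-D
```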
6.3 Future Technologies

The use of high-radix routers in on-chip networks introduces long wires. The analysis in this work assumed optimally repeated wires to mitigate the impact of long wires and utilized pipelined buffers for multicycle wire delays. However, many evolving technologies will impact communication in future on-chip networks, and the longer wires in the on-chip flattened butterfly topology are well suited to exploit these technologies. For example, on-chip optical signalling [17] and on-chip high-speed signalling [13] attempt to provide signal propagation velocities that are close to the speed of light while providing higher bandwidth for on-chip communications. With a conventional 2-D mesh topology, all of the wires are relatively short and, because of the overhead involved in these technologies, low-radix topologies cannot exploit their benefits. However, the on-chip flattened butterfly topology contains both short wires and long wires, so it can take advantage of cheap electrical wires for the short channels while exploiting these new technologies for the long channels.

Figure 14. Block diagram of packet blocking in (a) wormhole flow control, (b) virtual channel flow control, and (c) the flattened butterfly.

6.4 Use of Virtual Channels

Virtual channels (VCs) were originally used to break deadlocks [9] and were also proposed to increase network performance by providing multiple virtual lanes for a single physical channel [8]. When multiple output VCs can be assigned to an input VC, a VC allocation is required, which can significantly impact the latency and the area of a router. For example, in the TRIPS on-chip network, the VC allocation consumes 27% of the cycle time [12], and an area analysis shows that VC allocation can occupy up to 35% of the total area for an on-chip network router [20].

In Figure 14(a), an example of how blocking can occur in a conventional network with wormhole flow control is shown. By utilizing virtual channel flow control (Figure 14(b)), buffers are partitioned into multiple VCs, which allow packets to pass blocked packets. However, with the

[Figure residue: latency (cycles) curves for 1 VC, 2 VC, 4 VC, and 8 VC configurations.]