Sei sulla pagina 1di 12

1

TRIDENT: A load-balancing Clos-network Packet


Switch with Queues between Input and Central
Stages and In-Order Forwarding
Oladele Theophilus Sule and Roberto Rojas-Cessa, Senior Member, IEEE

Abstract—We propose a three-stage load balancing packet depends on the response time of the fabric and reconfiguration
switch and its configuration scheme. The input- and central-
arXiv:1907.00736v2 [cs.NI] 27 Aug 2019

time.
stage switches are bufferless crossbars, and the output-stage Clos-network switches can be categorized based on whether
switches are buffered crossbars. We call this switch ThRee-
stage Clos-network swItch with queues at the middle stage and a stage performs space- (S) or memory-based (M) switching
DEtermiNisTic scheduling (TRIDENT), and the switch is cell into SSS (or S3 ) [2], [3], MSM [4]–[8], MMM [9]–[13], SMM
based. The proposed configuration scheme uses predetermined [14], and SSM [15], [16], among the most popular ones. Com-
and periodic interconnection patterns in the input and central pared to the other categories, S3 switches require the smallest
modules to load-balance and route traffic, therefore; it has low amount of hardware, but their configuration complexity is
configuration complexity. The operation of the switch includes a
mechanism applied at input and output modules to forward cells high. Despite having a reduced configuration time, MMM
in sequence. TRIDENT achieves 100% throughput under uniform switches, must deal with internal blocking and the multiplicity
and nonuniform admissible traffic with independent and identical of input-output paths associated with diverse queuing delays
distributions (i.i.d.). The switch achieves this high performance [9], [17]. In general, switches with buffers in either the central
using a low-complexity architecture while performing in-sequence or output stage are prone to forwarding packets out of sequence
forwarding and no central-stage expansion or memory speedup.
Our discussion includes throughput analysis, where we describe because of variable queue lengths, making in-sequence trans-
the operations the configuration mechanism performs on the mission mechanisms or re-sequencing a required feature.
traffic traversing the switch, and proof of in-sequence forwarding. Traffic load balancing is a technique that improves the
We present a simulation analysis as a practical demonstration performance of switching and reduces the configuration com-
of the switch performance under uniform and nonuniform i.i.d. plexity [18]. Such a technique is especially attractive for its
traffic.
application to Clos-network switches as these suffer from
Index Terms—Clos-network switch, load-balancing switch, in- high configuration complexity. A large number of network
order forwarding, high performance switching, packet scheduling, applications such as those used in network virtualization
packet switching, matrix analysis.
and data center network, adopt load balancing techniques to
obtain high performance [19]–[21]. Load balancing finds its
I. Introduction application in wireless networks [22]–[24].
Predetermined and periodic permutations scheduling mech-
Clos networks are very attractive for building large-size anism may be used for load-balancing and routing to achieve
switches [1]. Most Clos-network switches adopt three stages, high switching performance [9], [25], [26]. A switch using
where each stage uses switch modules as building blocks. The a deterministic and periodic schedule may require queues
modules of the first, second, and third stages are called input, between the load-balancing and routing stages. These queues
central, and output modules, and they are denoted as IM, CM, store the cells while they wait for forwarding. These queues en-
and OM, respectively. Overall, Clos-network switches require able multiple interconnection paths between the load-balancing
fewer switching units (crosspoint elements), than a single- stage and the other stages of the switch, but they also make
stage switch of equivalent size, and thus may require less these switches prone to forwarding cells out of sequence [18].
building hardware. The hardware reduction of a Clos-network Re-sequencing [27] and out-of-sequence prevention mecha-
switch often increases its configuration complexity. In general, nisms [28], [29], as they become switch components, may
a Clos-network switch requires configuring its modules before affect the switching performance and increase complexity.
forwarding packets through them. The issues above raise the question, can a load-balancing
We consider for the remainder of this paper that the pro- Clos-network switch attain high switching performance, low
posed packet switch is cell-based; this is, upon arrival in an configuration complexity, and in-sequence cell forwarding
input port of a switch, packets of variable size are segmented without resorting to memory speedup nor switch expansion?
into fixed-size cells and re-assembled at the output port, after We answer this question affirmatively in this paper by
being switched through the switch. The smallest size of a cell proposing a load-balancing Clos-network switch that has
buffers placed between the IMs and CMs. Furthermore, we
O.T. Sule and R. Rojas-Cessa are with the Department of Electrical and use OMs implemented with buffered crossbars with per-flow
Computer Engineering, New Jersey Institute of Technology, Newark, NJ
07102. Email: {ots5, rojas}@njit.edu. queues. The switch is called ThRee-stage Clos swItch with
(Corresponding author: Oladele Theophilus Sule) queues at the middle stage and DEtermiNisTic scheduling
2

(TRIDENT). This switch uses predetermined and periodic poration of the in-sequence mechanism enables preserving this
interconnection patterns for the configuration of IMs and CMs. delay under nonuniform traffic, as Section V shows.
The incoming traffic is load-balanced by IMs and routed by The switch has virtual input-module output port queues
CMs and OMs. The result is a switch that attains high through- (VIMOQs) between the IMs and CMs to store cells coming
put under admissible traffic with independent and identical from I M(i) and destined to OP( j, d), and each queue is
distribution (i.i.d.) and uses a configuration scheme with O(1) denoted as V I MOQ(r, i, j, d). Each output of an IM is denoted
complexity. The switch also adopts an in-sequence forwarding as LI (i, r). Each output of a VIMOQ is connected to a CM.
mechanism at the input ports and output modules to keep cells Each input and output of a CM are denoted as IC (r, p) and
in sequence. LC (r, j), respectively. Each OP has N k crosspoint buffers, each
The motivation for adopting this configuration method is denoted as CB(r, j, d, i, s) and designated for the traffic from
its simplicity and low complexity. For instance, TRIDENT each IP traversing different CMs to an OP. A flow control
reduces the amount of hardware needed by another load mechanism operates between a CB and VIMOQs to avoid
balancing switch [26] and it also reduces the complexity buffer overflow and underflow [31].
of the in-sequence mechanism. The configuration approach Cells are sent from IPs through the IMs for load balancing
used by TRIDENT also provides full utilization of the switch and then queued at VIMOQs before they are forwarded to their
fabric and requires a small configuration time because of its destined OMs through the CMs.
predeterministic and periodic pattern. Our solution overcomes
TABLE I
the required module or port matching, which are complex and Notations used in the description of the TRIDENT switch
time consuming, as required by other schemes.
We analyze the performance of the proposed switch by
modeling the effect of each stage on the traffic passing through
Term Description
the switch. In addition, we study the performance of the
N Number of input/output ports.
switch through traffic analysis and by computer simulation. n Number of input/output ports for each IM
We show that the switch attains 100% throughput under several and OM.
admissible traffic models, including traffic with uniform and m Number of CMs.
nonuniform distributions, and demonstrate that the switch for- k Number of IMs and OMs, where k = N n .
I P(i, s) Input port s of I M(i), where 0 ≤ i ≤ k −
wards cells to the output ports in sequence. This high switching 1, 0 ≤ s ≤ n − 1.
performance is achieved without resorting to speedup nor I M(i) Input module i.
switch expansion. C M(r) Central Input Module r, where 0 ≤ r ≤
m − 1.
The remainder of this paper is organized as follows: Section L I (i, r) Output link of I M(i) connected to C M(r).
II introduces the TRIDENT switch. Section III presents the IC (r, p) Input port p of C M(r).
throughput analysis of the proposed switch. Section V presents LC (r, j) Output link of C M(r) connected to
OM(j).
a proof of the in-sequence forwarding property of TRIDENT.
V I MOQ(r, i, j, d) VIMOQ at input of CMs that stores cells
Section VI presents a simulation study on the performance of from I M(i) destined to OP(j, d).
the proposed switch. Section VII presents our conclusions. C B(r, j, d, i, s) Crosspoint buffer at OM(j) that stores cells
from I P(i, s) going through C M(r) and
destined to OP(j, d).
II. Switch Architecture OP(j, d) Output port d at OM(j).

The TRIDENT switch has N inputs and N outputs, each


denoted as I P(i, s) and OP( j, d), respectively, where 0 ≤
i, j ≤ k − 1, 0 ≤ s, d ≤ n − 1, and N = nk. Figure A. Module Configuration
1 shows the architecture of TRIDENT. This switch has k The IMs are configured based on a predetermined sequence
n × m IMs, m k × k CMs, and k m × n OMs. Table I lists of k disjoint permutations, where one permutation is applied
the notations used in the description of TRIDENT. In the each time slot. We call a permutation disjoint from the set of
remainder of this paper, we set n = k = m for symmetry and permutations if the input-output pair interconnection is unique
cost-effectiveness. The IMs and CMs are bufferless crossbars in one and only one of the k permutations. Cells at the inputs
while the OMs are buffered ones. In order to preserve the of IMs are forwarded to the outputs of the IMs determined by
staggered symmetry and in-order delivery [30], this switch the configuration at that time slot. A cell is then stored in the
uses a fixed and predetermined configuration sequence, and VIMOQ corresponding to its destination OP.
a reverse desynchronized configuration scheme in CMs. The Similar to the IMs, CMs are configured based on a predeter-
staggered symmetry and in-order delivery refers to the fact mined sequence of k disjoint permutations. Unlike IMs, CMs
that at time slot t, I P(i, s) connects to CM(r) which connects follow a desynchronized configuration; a different permutation
to OM( j). Then at the next time slot (t + 1), I P(i, s) connects is used each time slots, and the configuration follows a cycle
to CM((r + 1) mod m), which also connects to OM( j). This but in counter clock manner to that of the IM. The Head-
property enables us to easily represent the configuration of of-Line (HoL) cell at the VIMOQ destined to OP( j, d) is
IMs and CMs as a predetermined compound permutation that forwarded to its destination when the input of the CM is
repeats every k time slots. This property also ensures that cells connected to the input of the destined OM( j). Else, the HoL
experience similar delay under uniform traffic, and the incor- cell waits until the required configuration takes place. The
3

CM(r) CB(j,d,r,i,s)
IM(i) VIMOQ(r,i,j,d) IC(r,p) OM(j)
LI(i,r) LC(r,j) OP(j,d)
IP(i,s)
CM(0) OM(0)
IM(0)
.. 0
.
IC(0,0) LC(0,0) .. .
OP(0,0)
IP(0,0)
.
LI(0,0)
. . . . .
. .
N-1
. . . .
. . . . . ..
.
IP(0,n-1) .. 0
I (0,k-1)
C . OP(0,n-1)
. N-1
. . .
. . .
. . .
IM(k-1) CM(m-1) OM(k-1)
.. 0
.
IC(m-1,0) ..
.
OP(k-1,0)
IP(k-1,0)
. . . . . .
. .
N-1
. . . .
. .L (k-1,k-1) .. 0
.
I (m-1,k-1)
.L (m-1,k-1).
C ...
.
OP(k-1,n-1)
IP(k-1,n-1) I C
. N-1

Fig. 1. TRIDENT switch.

forwarded cell is queued at the CB of its destination OP once Because the output port arbiter selects the older cell based on
it arrives in the OM. the order of arrival to the switch, this selection prevents out-
The configurations of the bufferless IMs and CMs are as of-sequence forwarding. We discuss this property in Section
follows. At time slot t, IM input I P(i, s) is interconnected to V. Furthermore, the round-robin schedule ensures fair service
IM output LI (i, r), as follows: for different flows. If there is no HoL cell with the expected
value for a particular flow, the arbiter moves to the next flow.
r = (s + t) mod m (1)
and each CM input IC (r, p) is interconnected to output LC (r, j) C. Analysis of Crosspoint Buffer Size
as follows: In this section, we show that no CB queue in the switch
j = (p − t + r) mod k. (2) receives more than one cell in a time slot and those who
receive cells at a rate of 1/k N are served at rates of 1/k N.
The use of CBs at an OP allows forwarding a cell from of a Let us consider a scenario where all the IPs in the switch only
VIMOQ to its destined output without requiring port matching have traffic for one OP. The largest admissible arrival rate at
[15]. an IP is:
Table II shows an example of the configuration of the IMs 1
and CMs of a 9×9 TRIDENT switch. Because k = 3, the λi,s, j,d = (3)
N
example shows the configuration of three consecutive time The input load, λi,s, j,d , gets load-balanced to VIMOQs at a
slots. In this table, we use w → x to denote an interconnection rate of m1 . The aggregate traffic arrival rate at a VIMOQ from
between w and x. Figure 2 shows the configuration of the an IM, RV , is:
modules. n
1 Õ n
RV = λi,s, j,d = (4)
B. Arbitration at Output Ports m i=0 mN
Each output port has a round-robin arbiter to keep track of because m = n = k, therefore,
the next flow to serve, and N flow pointers to keep track of 1
RV = (5)
the next cell to serve for each flow. Here, a flow is the set of N
cells from I P(i, s) destined to OP( j, d). An output port arbiter The aggregate traffic rate at a CM for an OP is:
selects the flow to serve in a round-robin fashion. For this k
selection, the output arbiter selects the HoL cell of a CB if
Õ 1 1
RC M = = (6)
the cell’s order matches the expected cell order for that flow. N k
4

TABLE II
Example of configuration of modules in a 9 × 9 TRIDENT switch.

Configuration
Time slot I M(0) C M(0) I M(1) C M(1) I M(2) C M(2)
I P(0, 0) → L I (0, 0) Ic (0, 0) → LC (0, 0) I P(1, 0) → L I (1, 0) Ic (1, 0) → LC (1, 1) I P(2, 0) → L I (2, 0) Ic (2, 0) → LC (2, 2)
t=0 I P(0, 1) → L I (0, 1) Ic (0, 1) → LC (0, 1) I P(1, 1) → L I (1, 1) Ic (1, 1) → LC (1, 2) I P(2, 1) → L I (2, 1) Ic (2, 1) → LC (2, 0)
I P(0, 2) → L I (0, 2) Ic (0, 2) → LC (0, 2) I P(1, 2) → L I (1, 2) Ic (1, 2) → LC (1, 0) I P(2, 2) → L I (2, 2) Ic (2, 2) → LC (2, 1)
I P(0, 0) → L I (0, 1) Ic (0, 0) → LC (0, 2) I P(1, 0) → L I (1, 1) Ic (1, 0) → LC (1, 0) I P(2, 0) → L I (2, 1) Ic (2, 0) → LC (2, 1)
t=1 I P(0, 1) → L I (0, 2) Ic (0, 1) → LC (0, 0) I P(1, 1) → L I (1, 2) Ic (1, 1) → LC (1, 1) I P(2, 1) → L I (2, 2) Ic (2, 1) → LC (2, 2)
I P(0, 2) → L I (0, 0) Ic (0, 2) → LC (0, 1) I P(1, 2) → L I (1, 0) Ic (1, 2) → LC (1, 2) I P(2, 2) → L I (2, 0) Ic (2, 2) → LC (2, 0)
I P(0, 0) → L I (0, 2) Ic (0, 0) → LC (0, 1) I P(1, 0) → L I (1, 2) Ic (1, 0) → LC (1, 2) I P(2, 0) → L I (2, 2) Ic (2, 0) → LC (2, 0)
t=2 I P(0, 1) → L I (0, 0) Ic (0, 1) → LC (0, 2) I P(1, 1) → L I (1, 0) Ic (1, 1) → LC (1, 0) I P(2, 1) → L I (2, 0) Ic (2, 1) → LC (2, 1)
I P(0, 2) → L I (0, 1) Ic (0, 2) → LC (0, 0) I P(1, 2) → L I (1, 1) Ic (1, 2) → LC (1, 1) I P(2, 2) → L I (2, 1) Ic (2, 2) → LC (2, 2)

TABLE IV
The traffic arrival rate to a CB, RC , is the aggregate traffic Time slots of cells departure from VIMOQs in example of the
from an IP through a CM or: in-sequence forwarding mechanism.

1 1
RC = RC M = (7)
N kN
Cell departure time slots from VIMOQs
Therefore, RC ≤ SC for admissible traffic, which implies tx t x+1 t x+2 t x+3 t x+4 t x+5 t x+6
that the crosspoint buffer size at OMs does not impact the c1,1
c2,2 c2,1
performance of the switch because the queue size does not
grow with the input load. TABLE V
Time slots of cells departure from CBs in example of the in-sequence
forwarding mechanism.
D. In-sequence Cell Forwarding Mechanism
The proposed in-sequence forwarding mechanism of TRI-
DENT is based on tagging cells of a flow at the inputs with Cell departure time slots from CBs
tx t x+1 t x+2 t x+3 t x+4 t x+5 t x+6 t x+7 t x+8
their arriving sequence number, and forwarding cells from the c1,1
crosspoint buffers to the output port in the same sequence c2,1 c2,2
they arrived in the input. The policy used for keeping cells
in-sequence is as follows: When a cell of a flow arrives in the
input port, the input port arbiter appends the arrival order to Figure 3 shows a single flow A with two cells, A3 and A4 ,
the cell (for the corresponding flow). After being forwarded arriving at timeslots, t3 and t4 , respectively. Let us assume
through LI (i, r), the cell is stored at the VIMOQ for the that no cell of this flow has transited the switch. The cell
destination OP. When the CM configuration permits, the cell that arrives at t3 is appended a tag of 1 (i.e., the order of
is forwarded to the destined OM and stored at the queue for arrival) and the cell that arrives at t4 is appended a tag of 2.
traffic from the IP to the destined OP traversing that CM. An Both cells are load balanced and forwarded to different virtual
OP arbiter selects cells of a flow in the order they arrived input module output queues (VIMOQs). As shown in Step
in the switch by using the arrival order carried by each cell. 2 of Figure 3, A31 is forwarded to a queue with cells from
As an example of this operation, Table III shows the arrival other flows, while A42 , the younger cell, is forwarded to an
times of cell c1,1 , c2,1 , and c2,2 , where c y,tx denotes flow y empty queue. Therefore, A42 arrives at the output port (OP)
and arrival time tx to the VIMOQs. Cell c2,1 is queued behind before A31 (Step 3). Because the pointer of flow A at this OP
c1,1 , and c2,2 is placed in an empty VIMOQ. Table IV shows has not received any cell for this flow, it currently points to
the time slots when the cells are forwarded from the VIMOQ. tag 1. Hence A42 remains at the CB until A31 arrives and is
For example, when c2,2 leaves the VIMOQ before c2,1 . Table forwarded out the OP. Thereafter, flow A pointer at this OP is
V shows the time slots when the cells are forwarded to the updated to 2 and A42 is forwarded out the OP.
destination OP after the output-port arbitration is performed.
III. Throughput Analysis
TABLE III In this section, we analyze the performance of the proposed
Time slots of cell arrival to VIMOQs in example of the in-sequence
forwarding mechanism. TRIDENT switch.
Let us denote the traffic coming to the IMs, CMs, OMs,
OPs, and the traffic leaving TRIDENT as R1 , R2 , R3 , R4 and
Cell arrival time R5 , respectively. Here, R1 and R2 , and R3 are N × N matrices,
tx t x+1 t x+2 R4 comprises N N × 1 column vectors, and R5 comprises N
c1,1 scalars. Figure 1 shows these traffic points set at each stage of
c2,1 c2,2 TRIDENT with the corresponding labels at the bottom of the
figure.
5

Flow A pointer
1
1 t4 t3 2 t4 t3 t2 t1 t0 t4k . . . t4+k 3
. A4 A3 A31 Z21 Y11 X01 A31
A42 A31
IP A42 A42
OP
VIMOQ(r,i,j,d) VIMOQ CB

IM(0) CM(0) OM(0) Flow A pointer Flow A pointer


IP(0,0) LI(0,0) ... IC(0,0) LC(0,0) OP(0,0) 2 3
.. .. 4 t4k+1 t4k+2 5
IP(0,1) ... IC(0,1) . . OP(0,1)
.. .. A42
. . OP(0,2) A31
IP(0,2) ... IC(0,2) A42
OP OP
CB CB

IM(1) CM(1) OM(1)


... OP(1,0)
Fig. 3. Example of TRIDENT In-sequence Mechanism.
IP(1,0) IC(1,0)
.. ..
IP(1,1) LI(1,1) ... IC(1,1) LC(1,1) . . OP(1,1)
.. ..
IP(1,2) IC(1,2) . . OP(1,2)
...
The traffic from input ports to the IM stage, R1 , is defined
as:
IM(2) CM(2) OM(2)
IP(2,0)
...
IC(2,0) ....
OP(2,0) R1 = [λu,v ] (8)
IP(2,1) ... IC(2,1) .. OP(2,1)
where, λu,v is the arrival rate of traffic from input u to output
....
IP(2,2) LI(2,2) ... IC(2,2) LC(2,2) .. OP(2,2)

v, and
(a) Time slot 0
u = ik + s (9)

VIMOQ(r,i,j,d)
IM(0) CM(0) OM(0) v = jm + d (10)
IP(0,0) LI(0,0) .. IC(0,0) LC(0,0) OP(0,0)
..
IP(0,1) .. IC(0,1) .
..
OP(0,1) where 0 ≤ u, v ≤ N − 1.
IC(0,2) . OP(0,2)
IP(0,2) .. In the following analysis, we consider admissible traffic,
which is defined as:
IM(1) CM(1) OM(1)
N −1 N −1
IP(1,0) ... IC(1,0) OP(1,0) Õ Õ
IP(1,1) LI(1,1) .. IC(1,1) LC(1,1)
..
. OP(1,1) λu,v ≤ 1, λu,v ≤ 1 (11)
..
IP(1,2) ... IC(1,2) . OP(1,2) u=0 v=0
and as i.i.d. traffic.
IM(2) CM(2) OM(2) The IM stage of TRIDENT balances the traffic load coming
..
IP(2,0) IC(2,0) ..
OP(2,0) from the input ports to the VIMOQs. Specifically, the permu-
IP(2,1) .. IC(2,1) . OP(2,1)
..
. OP(2,2)
tations used to configure the IMs forwards the traffic from an
IP(2,2) LI(2,2) .. IC(2,2) LC(2,2)
input to k different CMs, and then to the VIMOQs connected
to these CMs in k consecutive time slots.
(b) Time slot 1 R2 is the traffic directed towards CMs and it is derived
from R1 and the permutations of IMs. The configuration of
VIMOQ(r,i,j,d)
the IM stage at time slot t that connects I P(i, s) to LI (i, r) are
represented as an N × N permutation matrix, Π(t) = [πu,v ],
IM(0) CM(0) OM(0)
IP(0,0) LI(0,0) ... IC(0,0) LC(0,0) OP(0,0)
where r( is determined from (1) and the matrix element:
IP(0,1) ...
..
. OP(0,1) 1 for any u, υ = r k + i
πu,υ =
IC(0,1)
..
IC(0,2) . OP(0,2)
IP(0,2) .. 0 elsewhere.
IM(1) The configuration of the IM stage can be represented as a
CM(1) OM(1)
.. OP(1,0)
compound permutation matrix, P1 , which is the sum of the
IP(1,0) IC(1,0) ..
IP(1,1) LI(1,1) .. IC(1,1) LC(1,1)
..
. OP(1,1) IM permutations over k time slots as follows,
IP(1,2) IC(1,2) . OP(1,2) k
... Õ
P1 = Π(t)

IM(2) CM(2) OM(2) Because the configuration is repeated every k time slots, the
IP(2,0)
..
IC(2,0)
..
OP(2,0)
traffic load from the same input going to each VIMOQ is k1
IP(2,1) ... IC(2,1) . OP(2,1)
IP(2,2) LI(2,2) IC(2,2) LC(2,2)
..
. OP(2,2) of the traffic load of R1 . Therefore, a row of R2 is the sum
...
of the row elements of R1 at the non zero positions of P1 ,
normalized by k. This is:
(c) Time slot 2 1
R2 = ((R1 ∗ 1) ◦ P1 ) (12)
Fig. 2. Configuration example of a 9 × 9 TRIDENT switch modules. k
where 1 denotes an N × N unit matrix and ◦ denotes
element/position wise multiplication. There are k non-zero
elements in each row of R2 . Here, R2 is the aggregate traffic
6

in all the VIMOQs destined to all OPs. This matrix can be From (13), the traffic matrix at VIMOQs destined for the
further decomposed into k N × N submatrices, R2 ( j, d), each different OMs are:
of which is the aggregate traffic at VIMOQs designated for
OP( j, d). λ0,0 + λ0,1 0 λ0,0 + λ0,1 0 
1 λ + λ1,1 λ1,0 + λ1,1
 
0 0
R2 (0) =  1,0

k Õ
k−1
λ2,0 + λ2,1 λ2,0 + λ2,1 

0 0
Õ
R2 = R2 ( j, d) (13) 2
d=0

 0 λ3,0 + λ3,1 0 λ3,0 + λ3,1 

where j is obtained from (10) ∀ d and d is also obtained λ0,2 + λ0,3 0 λ0,2 + λ0,3 0 
from (10) but for the different j. The configuration of the CM
1 λ + λ1,3 λ1,2 + λ1,3
 
0 0
R2 (1) =  1,2

stage at time slot t that connects Ic (r, p) to LC(r, j) may be
λ2,2 + λ2,3 λ2,2 + λ2,3 

2 0 0
represented as an N × N permutation matrix, Φ(t) = [φu,v ],  0 λ3,2 + λ3,3 0 λ3,2 + λ3,3 
where (j is determined from (2) and the matrix element: 
1 for any u, v = j k + r The rows of R2 (v) represent the traffic from IPs, and the
φu,v =
0 elsewhere. columns represent V I MOQ(r, i, j, d) at IC (r, p). The com-
Similarly, the switching process at the CM stage is rep- pound permutation matrix for the CM stage for this switch
resented by a compound permutation matrix P2 , which is the is:
sum of k permutations used at the CM stage over k time slots.
1 0 1 0
Here, 
k−1 1 0 1 0
P2 = 
Õ
P2 = Φ(t)
0 1 0 1
t=0 0 1 0 1
The traffic destined to OP( j, d) at OM( j), R3 ( j, d), is: 
From (14), the traffic forwarded to an OP is:
R3 ( j, d) = R2 ( j, d) ◦ P2 (14) λ0,0 0 λ0,0 0 
λ λ1,0

0 0 
R3 (0, 0) = 12  1,0

The aggregate traffic at CBs of an OP for the different IPs,
R4 (v), is obtained from the multiplication of R3 ( j, d) with a  0 λ2,0 0 λ2,0 
® or:
vector of all ones, 1,
 0 λ3,0 0 λ3,0 
λ0,1 λ0,1

0 0 
λ λ1,1

R4 (v) = R3 ( j, d) ∗ 1® (15) R3 (0, 1) = 12  1,1
 0 0 
 0 λ2,1 λ2,1 
Each row of R4 (v) is the aggregate traffic at the CBs from λ3,1 λ3,1 

 0 0
each IP. The traffic leaving an OP, R5 (v), is: λ0,2 λ0,2

0 0 
λ λ1,2

0 0 
R3 (1, 0) = 12  1,2

R5 (v) = (1)
® T ∗ R4 (v) (16)
 0 λ2,2 0 λ2,2 
Therefore, R5 (v) is the sum of the traffic leaving OP(v).
 0 λ3,2 0 λ3,2 
λ0,3 λ0,3

The following example shows the operations performed on 0 0 
λ λ1,3

traffic coming to a 4×4 (k = 2) TRIDENT switch. Let the 0 0 
R3 (1, 1) = 12  1,3

input traffic matrix be  0 λ2,3 0 λ2,3 
 0 λ3,3 0 λ3,3 
λ0,0 λ0,1 λ0,2 λ0,3 

The rows of R3 ( j, d) represent the traffic from V I MOQ(r, i, j)
λ λ1,1 λ1,2 λ1,3 

at IC (r, p) and the columns represent LC (r, j).
R1 =  1,0
λ2,0 λ2,1 λ2,2 λ2,3  The traffic forwarded from CBs allocated for the different IPs
λ0,0 
λ3,0 λ3,1 λ3,2 λ3,3 
λ1,0 
  
to the corresponding OP is obtained from (15): R4 (0) =  ,
λ2,0 

Then, R2 is generated from the arriving traffic and the config- 
uration of IM. The compound permutation matrix for the IM λ3,0 
λ0,1 
 
stage for this switch is:
λ 
 
1 0 1 0 R4 (1) =  1,1 
λ2,1 

1 0 1 0
P1 =  λ3,1 
0 1 0 1 λ0,2  λ0,3 
 
0 1 0 1 λ1,2  λ1,3 
   
R4 (2) =  R4 (3) = 

,
λ2,2  λ2,3 
   
Using (12), we get 
λ3,2  λ3,3 
 3i=0 λ0i λ0i
Í Í3
0 0
   
Íi=0
 The rows of R4 (v) represent the traffic from I P(i, s). Using
λ i=0 λ1i
Í3 3

0 0 (16), we obtain the sum of the traffic leaving the OP, or:
R2 = 1/2  i=0 1i
 

λ2i λ2i  R5 (0) = 3i=0 λi0 , R5 (1) = 3i=0 λi1 , R5 (2) = 3i=0 λi2 ,
Í3 Í3 Í Í Í
0 0
 Íi=0 Íi=0
R5 (3) = i=0 λi3
Í3
i=0 λ3i i=0 λ3i 
 0 3 3
 0 
7

As raised from the example, one may wonder if TRIDENT Then, the service matrix of VIMOQs is:
achieves 100% throughput. This property of TRIDENT is
Dµ (t) = [dµv (t)] (21)
discussed as follows:
From R4 (0) to R4 (3) above, we can deduce that R4 is equal and representing R2 as the aggregate traffic arrival to VIMOQs
to the input traffic R1 , or, in general: or:
t
R4 (v) = R1 (v) ∀ v (17) Õ
R2 = A2 (γ) (22)
Also, because R2 and R4 (v) meet the admissibility condition γ=0
in (11), and R5 (v) does not exceed the traffic rate for any Substituting (21) and (22) into (20) gives:
OP(v), the aggregated traffic loads at each VIMOQ, CB, and
1
OP do not exceed the capacity of each output link. From the Nµ (t) = Nµ (0) + R2 − P1 (23)
admissibility of R2 and R4 (v), and (17), we can infer that the N
input traffic is fully forwarded to the output ports.
As discussed in Section II-B, an output arbiter selects a 1
R2 − P1 ≤  < ∞ (24)
flow in a round-robin fashion and a cell of that flow based N
on the arrival order. If a cell of a flow is not selected, the We recall from section III.A that R2 is admissible, and by
OP arbiter moves to the next flow. This arbitration scheme substituting P1 and R2 into (24), shows that  is finite. We can
ensures fairness and that the cells forwarded to the OP are conclude from (23) and (24), that the occupancy of VIMOQ
also forwarded out of the OP. Hence, from R5 (0) to R5 (3), we is weakly stable. 
can infer that R5 (v) is equal to R4 (v), or: We now prove the stability of CBs. The queue occupancy
matrix of CBs at time slot t can be represented as:
R5 (v) = (1)
® T ∗ R4 (v) ∀ v (18)
Nc (t) = Nc (t − 1) + Ac (t) − Dc (t) (25)
From (17) and (18), we can conclude that TRIDENT achieves
100% throughput under admissible i.i.d. traffic. We present where Ac (t) is the aggregate traffic arrival matrix at time slot
the proof in Section IV. t to CBs, and Dc (t) is the service rate matrix of CBs at time
slot t. Solving (25) recursively as before yields:
t
Õ t
Õ
IV. 100% Throughput Nc (t) = Nc (0) + Ac (γ) − Dc (γ) (26)
γ=0 γ=0
In this section we prove that TRIDENT achieves 100%
throughput by using the analysis under admissible i.i.d traffic. Because a CB is serviced at least once every N k time slots.
The service rate of the CB at OP(v) at time slot t, dcv (t) is:
Theorem 1: TRIDENT achieves 100% throughput under 1
admissible i.i.d traffic. ≤ dcv (t) ≤ 1
Nk
Proof: Here, we proof that TRIDENT achieves 100% through-
and service matrix of CBs is:
put. This is achieved by showing that VIMOQs and CBs
are weakly stable under i.i.d. traffic. Because a stable switch Dc (t) = [dcv (t)] (27)
achieves 100% throughput under admissible i.i.d traffic [32].
The aggregate traffic arrival to CBs, R4 , or:
A switch is considered stable under a traffic distribution if
the queue length is bounded. The queues are considered to t
Õ
be weakly stable if the queue occupancy drift from its initial R4 = Ac (γ) (28)
state is finite  ∀ t as limt→∞ . Let us represent the queue γ=0

occupancy of VIMOQs at time slot t, Nµ (t) as: Let us assume the worst case scenario, where the CB is service
only once in N k timeslots or dcv (t) = N1k ∀ v in (27).
Nµ (t) = Nµ (t − 1) + Aµ (t) − Dµ (t) (19)
Substituting (27) and (28) into (26) gives:
where Aµ (t) is the aggregate traffic arrival matrix at time slot t 1 ®
to VIMOQs and Dµ (t) is the service rate matrix of VIMOQs Nc (t) = Nc (0) + R4 − ∗1 (29)
Nk
at time slot t. Solving (19) with an initial condition Nµ (0), where
recursively yields:
1 ®
t t R4 − ∗1 ≤  < ∞ (30)
Õ Õ Nk
Nµ (t) = Nµ (0) + Aµ (γ) − Dµ (γ) (20)
γ=0 γ=0
Because R4 is admissible, as discussed in Section III.A,
substituting R4 into (30) shows that  is finite. We can
Because a VIMOQ is serviced at least once every N time slots, conclude from (29) and (30), that the occupancy of CB is
the service rate of a VIMOQ at a CM for OP(v) at time slot also weakly stable.
t, dµv (t) is: 
1 This completes the proof of Theorem 1.
dµv (t) = ∀ µ and v 
N
8

V. Analysis of In-Sequence Service other flows) while other VIMOQs to where the subsequent
In this section, we demonstrate that the TRIDENT switch cells of the same flow are sent are empty. This scenario would
forwards cells in sequence to the OPs through the proposed have the largest probability to delay the first cell of the flow
in-sequence forwarding mechanism. Table VI lists the terms and, therefore; to forward the subsequent cells of the flow out
used in the in-sequence analysis of the proposed TRIDENT of sequence. Also, let us consider that the flow pointer at the
switch. Here, c y,τ (i, s, j, d) denotes the τth cell of traffic flow output ports initially points to the cell arrival order L yθ , where
y, which comprises cells going from I P(i, s) to OP( j, d). y is the flow id and θ is the cell’s order of arrival.
In addition, ta y, τ denotes the arrival time of c y,τ , and qVy, τ Also, let us assume that the cells arrive at LI (i, r) one or
and qC y, τ denote the queuing delays experienced by c y,τ at more time slots before the configuration of the CM allows
V I MOQ(r, i, j, d) and CB(r, j, d, i, s), respectively. The depar- forwarding a cell to its destined OM. Thus, a cell may depart in
ture times of c y,τ from the corresponding VIMOQ and CB the following or a few time slots after its arrival. This cell then
are denoted as dVy, τ and dC y, τ , respectively. We consider may wait up to k − 1 time slots for a favorable interconnection
admissible traffic in this analysis. to take place at the CM before being forwarded to the destined
Here, we claim that TRIDENT forwards cells in sequence OM. In the remainder of the discussion, we show that the
to the output ports, through the following theorem. arriving cells are forwarded to the destination OP in the same
order they arrive in the IP.
Theorem 2: For any two cells c y,τ (i, s, j, d) and Given flow y, the arrival time of the first cell c y,τ is:
c y,τ0 (i, s, j, d), where τ < τ 0, c y,τ (i, s, j, d) departs the
destined output port before c y,τ0 (i, s, j, d). ta y, τ = tx (31)
Upon arriving in the IP, c y,τ is tagged with L y0 and forwarded
TABLE VI
Notations for in-sequence analysis. to the VIMOQ. Based on the backlog condition, c y,τ is placed
behind γ cells from other flows upon arriving at the VIMOQ.
Therefore, the VIMOQ occupancy, NVy, τ , is:
c y, τ The τth cell of flow y from I P(i, s) to OP(j, d). NVy, τ = γ (32)
t a y, τ Arrival time of c y, τ at I P(i, s).
NV y, τ The number of cells at V I MOQ(r, i, j, d) upon the arrival Using (32) the queuing delay of c y,τ at the VIMOQ is:
of c y, τ .
q H y, τ The residual queuing delay of the HoL cell at qVy, τ = qH y, τ + (γ − 1)k + k (33)
V I MOQ(r, i, j, d) upon the arrival of c y, τ .
qV y, τ Queuing delay of c y, τ at V I MOQ(r, i, j, d). where qH y, τ is the time it takes the HoL cell to depart the
dV y, τ Departure time of c y, τ from V I MOQ(r, i, j, d) at IC (r, p). VIMOQ and (γ −1)k is the delay generated by the other (γ −1)
NC y, τ The number of cells at C B(r, j, d, i, s) upon the arrival of
c y, τ . cells ahead of c y,τ in the VIMOQ. The extra k time slots are
qC y, τ Queuing delay of c y, τ at C B(r, j, d, i, s) of OP(j, d). the delay c y,τ experiences as it waits for the configuration
dC y, τ Departure time of c y, τ from C B(r, j, d, i, s). pattern to repeat after the last cell ahead of it is forwarded to
the OM.
Lemma 1: For any flow traversing the TRIDENT switch, Using (31) and (33), the departure time of c y,τ from the
an older cell is always placed ahead of a younger cell from VIMOQ is:
the same flow in the same crosspoint buffer. dVy, τ = ta y, τ + qH y, τ + γk (34)
Proof: From the architecture and configuration of the switch
an IP connects to a CM once every k time slots. If a younger When c y,τ arrives at the output module it is stored at the
cell arrives at the OM before an older cell then the younger corresponding output buffer before being forwarded to the
cell was forwarded through a different CM from the one the output port.
older cell was buffered. Also, two cells of the same flow may Let us now consider the next arriving cell from flow y,
be queued in the same CB if and only if the younger cell c y,τ+θ , where 0 < θ < k. The time of arrival of c y,τ+θ is:
arrived at the VIMOQ k time slots later than the older cell,
and therefore, the younger cell would be lined up in a queue ta y, τ+θ = tx + θ (35)
position behind the position of the older cell. Upon arrival, c y,τ+θ would have L yθ appended to it and
 forwarded to the VIMOQ. Based on the traffic scenario, c y,τ+θ
would be forwarded to an empty VIMOQ. The queuing delay
at the V I MOQ for c y,τ+θ is:
Lemma 2: For any number of flows traversing the TRI-
DENT switch, cells from the same flow are cleared from the qVy, τ+θ = β (36)
OP in the same order they arrived at the IP. where β is the number of time slots before the configuration
Proof: Let us consider a traffic scenario where multiple pattern enables forwarding c y,τ+θ to the destined OM. Using
flows are traversing the switch. We focus on one flow with (34), (35), and (36), the departure time of c y,τ+θ from the
cells arriving back to back. Let us also consider as an initial VIMOQ is:
condition that all CBs are empty, and the VIMOQ to where the
first cell of the flow is being sent has backlogged cells (from dVy, τ+θ = tx + θ + β (37)
9

100000
At the output port, the pointers all initially pointed to

Average delay (time slots)


L y0 based on the initial condition. Therefore, irrespective of 10000

dVy, τ+θ < dVy, τ , for θ + β < qH y, τ + γk, c y,τ+θ remains stored 1000
at the output buffer until c y,τ is cleared from the output port, 100
because the pointer points to L y0 . Because CBs are empty as 10
initial condition, the CB occupancy, NC y, τ , upon c y,τ arrival
1
is:

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9
0.95
0.05

0.15

0.25

0.35

0.45

0.55

0.65

0.75

0.85

0.99
0.1
NC y, τ = 0 (38) 0.01
Input load
and the occupancy of the CB, NC y, τ+θ , upon c y,τ+θ arrival is MMM MMeM OQ TRIDENT SMM

NC y, τ+θ = 0 (39)
Fig. 4. Average queueing delay under uniform traffic for N =64.
Using (38), the queuing delay, qC y, τ , at the CB for c y,τ is:
qC y, τ = 0 (40) select these switches for comparison because they achieve the
From (34), (37), and (39), the queuing delay, qC y, τ+θ , at the highest performance among Clos-network switches, despite
CB for c y,τ+θ is: been categorized as different architectures. We considered
switches with size N = {64, 256}. For performance analysis,
qC y, τ+θ = qH y, τ + γk − β (41) queues are assumed long to avoid cell losses and to identify
From (31), (34), and (40), the departure time of c y,τ from the average cell delay.
OP, dC y, τ , is: Table VII shows a comparison between the architectures of
OQ, SMM, MMM, MMe M, and TRIDENT.
dC y, τ = tx + 1 + qH y, τ + γk (42)
From (35), (37), and (41), the departure time of c y,τ+θ from A. Uniform Traffic
the OP, dC y, τ+θ , is:
Uniform distribution is mostly considered to be benign and
dC y, τ+θ = tx + 1 + θ + qH y, τ + γk (43) the average rate for each output port λi,s, j,d = N1 . where
I P(i, s) is the source IP and OP( j, d) is the destination OP.
Using (42) and (43), Hence, a packet arriving at the IP has an equal probability of
dC y, τ+θ − dC y, τ = θ (44) being destined to any OP. Figures 4 and 5 show the average
under uniform traffic with Bernoulli arrivals for N = 64
The difference between the departure times of any two cells and N = 256, respectively. The finite and moderate average
of a flow from the CB is a function of θ, which is the arrival queuing delay indicated by the results shows that TRIDENT
time difference between any two cells. Therefore, cells of a achieves 100% throughput under this traffic pattern. This
flow are forwarded to the OP in the same order they arrived. throughput is the result of the efficient load-balancing process
 in the IM stage. However, such high performance is expected
for uniformly distributed input traffic.
This completes the proof of Theorem 2. TRIDENT switch experiences a slightly higher average
 delay than the OQ switch. This delay is the result of cells
being queued in the VIMOQs until a configuration occurs
that enables forwarding the cells to their destined output
VI. Performance Analysis modules. Due to the amount of memory required by MMe M to
We evaluated the performance of TRIDENT through com- implement the extended set of queues, our simulator can only
puter simulation under uniform traffic model and com- simulate small MMe M switches for queueing analysis, so we
pared with that of an output-queued (OQ), Space-Memory- simulated the switches under this traffic pattern for N = 64.
Memory (SMM), and a Memory-Memory-Memory Clos- This figure also shows that TRIDENT achieves a lower average
network (MMM) switch. We also evaluated the performance delay than the MMM switch.
of TRIDENT through computer simulation under nonuniform Uniform bursty traffic is modeled as an ON-OFF Markov
traffic model and compared with that of an output-queued modulated process, with an average duration of the ON period
(OQ), space-Memory-Memory (SMM), Memory-Memory- set as the average burst length, l, with l = {10, 30} cells.
Memory Clos-network (MMM), and MMM switch with ex- Figures 6 and 7 show the average delay under uniform traffic
tended memory (MMe M) switches. The SMM switch uses with bursty arrivals for average burst length of 10 and 30
desynchronized static round robin at IMs and select celss cells, respectively. The results show that TRIDENT achieves
from the buffers at CMs and OMs. The MMM switch se- 100% throughput under bursty uniform traffic and it is not
lects cells from the buffers in the previous stage modules affected by the burst length, while the MMM switch has a
using forwarding arbitration schemes and is prone to serving throughput of 0.8 and 0.75 for an average burst length of 10
cells out of sequence. Considering that most load-balancing and 30 cells, respectively. Therefore, TRIDENT achieves a
switches based on Clos networks deliver low performance, we performance closer to that of the OQ switch.
10

TABLE VII
Switches used in performance comparison to TRIDENT.

Architecture OQ SMM MMM MM e M TRIDENT


Scalability Non scalable Scalable Scalable Scalable Scalable
Packet order preserved Yes No No No Yes
Speedup N 1 1 1 1
Configuration scheme N/A Desynchronized static Select cells from buffers Select cells from the Prederministic and peri-
round robin at IM and in the previous stage mod- buffers in the previous odic
select cells from the ules stage modules
buffers at CMs and OMs
On-line complexity for crossbar connections O(1) O(N) O(N) O(N) O(1)
Internal blocking Non blocking Blocking Blocking Non blocking Non blocking
Total number of VOQs per IM N/A Nn Nn Nn 0
Total number of Virtual central module queues per IM N/A 0 mn nN 0
Total number of virtual output (module or port) queues per CM N/A k2 k2 mN kN
Total number of queues per OM N/A mn mn nN N 2k
Total number of queues per OP N 0 0 0 0

10000 100000

Average delay (time slots)


Average delay (time slots)

1000 10000

100 1000

10 100

1 10

0.1 1

0.5
0.1

0.2

0.3

0.4

0.6

0.7

0.8

0.9
0.05

0.15

0.25

0.35

0.45

0.55

0.65

0.75

0.85

0.95
0.99
0.01
Input load Input load
OQ TRIDENT MMM SMM TRIDENT OQ MMM SMM

Fig. 5. Average queueing delay under uniform traffic for N =256.


Fig. 7. Average queuing delay under uniform bursty traffic with average burst
length l=30.
10000
Average delay (time slots)

1000
the offset in delay for a light load.
100

10
B. Nonuniform traffic
We also evaluated the performance of TRIDENT, MMM,
1
MMe M, and OQ switches under nonuniform traffic. We
0.1 adopted the unbalanced traffic model [31], [33] as a nonuni-
Input load form traffic pattern. The nonuniform traffic can be modeled us-
TRIDENT OQ MMM SMM ing an unbalanced probability ω to indicate the load variances
for different flows. Consider input port I P(i, s) and output port
Fig. 6. Average queuing delay under uniform bursty traffic with average burst OP( j, d) of the TRIDENT switch, the traffic load is determined
length l=10.
by

1−ω
The uniform distribution of the traffic and the load-  ρ(ω + N ), if i = j and s = d,



ρi,s, j,d =

balancing stage helps to attain this low queueing delay and (45)
1−ω
high throughput. Figures 4, 5, 6, and 7 show that the queueing ρ , otherwise


delay difference between TRIDENT and the OQ switch is not  N
significant. Figures 4, 5, 6, and 7 also show that TRIDENT where ρ is the input load for input I P(i, s) and ω is the unbal-
outperforms the SMM switch for all tested traffic patterns at anced probability. When ω=0, the input traffic is uniformly
high input load. Because the SMM switch uses load-balancing distributed and when ω=1, the input traffic is completely
at the bufferless IMs which enables it to attain high perfor- directional; traffic from I P(i, s) is destined for OP( j, d).
mance similar to TRIDENT at low input load, but at high input Figure 8 shows the throughput of TRIDENT, SMM, MMM,
load the configuration complexity at CMs and OMs affects its and MMe M switches. The figure shows that TRIDENT switch
performance. In addition to the high configuration complexity attains 100% throughput under this traffic pattern for all values
required for the SMM switch as compared to TRIDENT, it of ω, matching the performance of SMM and MMe M and
also forwards cells out-of-sequence while TRIDENT forwards outperforming that of MMM. These three buffered switches
cells in-sequence. These figures also show that the effective are known to achieve high throughput at the expense of out-
load balancing reduces the average delay and also eliminates of-sequence forwarding.
11

100
1

Average delay (time slots)


0.95
0.9 10
Throughput

0.85
0.8
1
0.75
0.7
0.65 0.1

0.6
0.01
Input load
Unbalanced Probability, w
TRIDENT MMM MMeM SMM TRIDENT OQ SMM

Fig. 8. Throughput under unbalanced traffic for 0 ≤ w ≤ 1.0 and N =256. Fig. 10. Average queuing delay under unbalanced traffic with w = 0.6 for
N =256.

10000 100

Average time delay (time slots)


Average delay (time slots)

1000
10
100

10 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1
0.1
0.1

0.01 0.01
Input load
Input load
TRIDENT MMeM MMM OQ SMM TRIDENT short-queue TRIDENT long-queue TRIDENT infinte-queue

Fig. 9. Average queuing delay under unbalanced traffic with w = 0.6 for Fig. 11. Average queuing delay under unbalanced traffic with w = 0.6 for
N =64. N =256.

We also tested the average queueing delay of TRIDENT for three TRIDENT switches with CB sizes of k 2 , N 2 , and ∞,
under this nonuniform traffic. It has been shown that many respectively. Figure 11 shows that the size of the crosspoint
switches do not achieve high throughput when ω is around 0.6 buffer does not impact the switch performance. The TRIDENT
[33]. Therefore, we measured the average delay of TRIDENT switches, each with different crosspoint buffer size, attains
under this unbalanced probability, as Figures 9 and 10 show 100% throughput for hotpsot per port traffic model. Which
for N = 64 and N = 256, respectively, and compared it also indicates that the size of the CB does not impact the
with MMM, SMM, MMe M, and OQ switches. One should performance of the switch as shown in the analysis above.
note that due to the limited scalability of MMM and MMe M, where TRIDENT short-queue has a crosspoint buffer size of
the comparison of TRIDENT for N = 256 under this traffic k 2 , TRIDENT short-queue has a crosspoint size of N 2 , and
conditions only includes SMM and OQ switches. Figure 10 TRIDENT infinite-queue has an infinite crosspoint buffer size.
shows that the delay of TRIDENT is lower than the delay
achieved by SMM under high input loads.
As Figure 9 for N = 64 shows, the average delay of VII. Conclusions
TRIDENT is lower than the delay achieved by SMM, MMM, We have introduced a three-stage load-balancing packet
and MMe M under high input loads while also achieving a switch that has virtual output module queues between the
comparable delay of an OQ switch. The small performance input and central stages, and a low-complexity scheme for
difference between TRIDENT and OQ is similar for N = 256, configuration and forwarding cells in sequence for this switch.
as Figure 10 shows. These results are achieved because the We call this switch TRIDENT. To effectively perform load bal-
load-balancing stage of TRIDENT distributes the traffic uni- ancing TRIDENT has virtual output module queues between
formly throughout the switch. Therefore, the queuing delay is the IM and CM stages. Here, IMs and CMs are bufferless
similar to that observed under uniform traffic. These results modules, while the OMs are buffered ones. All the bufferless
also show that high switching performance of TRIDENT is modules of TRIDENT follow a predetermined configuration
not affected by the in-sequence mechanism of the switch and while the OM selects the cell of a flow to be forwarded to an
the load-balancing effect is more noticeable under nonuniform output port based on the cell’s arrival order and uses round-
traffic. robin scheduling to select the flow to be served. Because of
In addition to the analysis in II-C, we also tested the the buffers at crosspoints of OMs, the switch rescinds port
impact of the CB size through computer simulations. Where matching, and the configuration complexity of the switch is
we tested and measured the average delay under unbalanced minimum, making it comparable to that of MMM switches.
traffic and throughput under hot-spot per port traffic models, We introduce an in-sequence mechanism that operates at
12

the outputs based on arrival order inserted at the inputs of [18] C.-S. Chang, D.-S. Lee, and Y.-S. Jou, “Load balanced Birkhoff–von
TRIDENT to avoid out-of-sequence forwarding caused by the Neumann switches, part i: one-stage buffering,” Computer Communica-
tions, vol. 25, no. 6, pp. 611–622, 2002.
central buffers. We modeled and analyzed the operations of [19] L. Shi, B. Liu, C. Sun, Z. Yin, L. N. Bhuyan, and H. J. Chao,
each of the stages and how they affect the incoming traffic to “Load-balancing multipath switching system with flow slice,” IEEE
obtain the loads seen by the output ports. We show that for Transactions on Computers, vol. 61, no. 3, pp. 350–365, March 2012.
[20] S. M. Irteza, H. M. Bashir, T. Anwar, I. A. Qazi, and F. R. Dogar, “Load
admissible independent and identically distributed traffic, the balancing over symmetric virtual topologies,” in IEEE INFOCOM 2017
switch achieves 100% throughput. This high performance is - IEEE Conference on Computer Communications, May 2017, pp. 1–9.
achieved without resorting to speedup nor switch expansion. In [21] A. Dixit, P. Prakash, Y. C. Hu, and R. R. Kompella, “On the impact
of packet spraying in data center networks,” in INFOCOM, 2013
addition, we analyzed the operation of the forwarding mech- Proceedings IEEE. IEEE, 2013, pp. 2130–2138.
anism and demonstrated that it forwards cells in sequence. [22] Z. Wang, E. Bulut, and B. K. Szymanski, “Energy efficient collision
We showed, through computer simulation, that for all tested aware multipath routing for wireless sensor networks,” in 2009 IEEE
International Conference on Communications. IEEE, 2009, pp. 1–5.
traffic, the switch achieves 100% throughput for uniform and [23] L. Le, “Multipath routing design for wireless mesh networks,” in
nonuniform traffic distributions. 2011 IEEE Global Telecommunications Conference-GLOBECOM 2011.
IEEE, 2011, pp. 1–6.
References [24] M. Ploumidis, N. Pappas, and A. Traganitis, “Flow allocation for
maximum throughput and bounded delay on multiple disjoint paths
[1] C. Clos, “A study of non-blocking switching networks,” Bell System for random access wireless multihop networks,” IEEE Transactions on
Technical Journal, vol. 32, no. 2, pp. 406–424, 1953. Vehicular Technology, vol. 66, no. 1, pp. 720–733, 2017.
[2] T. T. Lee and C. H. Lam, “Path switching-a quasi-static routing scheme [25] M. Zhang, Z. Qiu, and Y. Gao, “Space-memory-memory Clos-network
for large-scale ATM packet switches,” IEEE Journal on Selected Areas switches with in-sequence service,” Communications, IET, vol. 8, no. 16,
in Communications, vol. 15, no. 5, pp. 914–924, 1997. pp. 2825–2833, 2014.
[3] H. J. Chao, Z. Jing, and S. Y. Liew, “Matching algorithms for three-stage [26] O. T. Sule, R. Rojas-Cessa, Z. Dong, and C.-B. Lin, “A split-central-
bufferless Clos network switches,” Communications Magazine, IEEE, buffered load-balancing clos-network switch with in-order forwarding,”
vol. 41, no. 10, pp. 46–54, 2003. IEEE/ACM Transactions on Networking, 2018.
[4] F. M. Chiussi, J. G. Kneuer, and V. P. Kumar, “Low-cost scalable switch- [27] C.-S. Chang, D.-S. Lee, and C.-M. Lien, “Load balanced Birkhoff-von
ing solutions for broadband networking: the ATLANTA architecture and Neumann switches, part ii: Multi-stage buffering.”
chipset,” IEEE Communications Magazine, vol. 35, no. 12, pp. 44–53, [28] I. Keslassy and N. McKeown, “Maintaining packet order in two-stage
1997. switches,” in INFOCOM 2002. Twenty-First Annual Joint Conference of
[5] E. Oki, N. Kitsuwan, and R. Rojas-Cessa, “Analysis of space-space- the IEEE Computer and Communications Societies. Proceedings. IEEE,
space Clos-network packet switch,” in Computer Communications and vol. 2. IEEE, 2002, pp. 1032–1041.
Networks, 2009. ICCCN 2009. Proceedings of 18th International Con- [29] R. Rojas-Cessa, Interconnections for Computer Communications and
ference on. IEEE, 2009, pp. 1–6. Packet Networks. CRC Press, 2016.
[6] J. Kleban and U. Suszynska, “Static dispatching with internal back- [30] B. Hu and K. L. Yeung, “On joint sequence design for feedback-
pressure scheme for SMM Clos-network switches,” in Computers and based two-stage switch architecture,” in High Performance Switching
Communications (ISCC), 2013 IEEE Symposium on. IEEE, 2013, pp. and Routing, 2008. HSPR 2008. International Conference on. IEEE,
000 654–000 658. 2008, pp. 110–115.
[7] J. Kleban, M. Sobieraj, and S. Weclewski, “The modified MSM Clos [31] R. Rojas-Cessa, E. Oki, Z. Jing, and H. J. Chao, “CIXB-1: combined
switching fabric with efficient packet dispatching scheme,” in High input-one-cell-crosspoint buffered switch,” in High Performance Switch-
Performance Switching and Routing, 2007. HPSR’07. Workshop on. ing and Routing, 2001 IEEE Workshop on, 2001, pp. 324–329.
IEEE, 2007, pp. 1–6. [32] A. Mekkittikul and N. McKeown, “A practical scheduling algorithm to
[8] R. Rojas-Cessa, E. Oki, and H. J. Chao, “Maximum weight matching achieve 100% throughput in input-queued switches,” in INFOCOM’98.
dispatching scheme in buffered Clos-network packet switches,” in Com- Seventeenth Annual Joint Conference of the IEEE Computer and Com-
munications, 2004 IEEE International Conference on, vol. 2. IEEE, munications Societies. Proceedings. IEEE, vol. 2. IEEE, 1998, pp.
2004, pp. 1075–1079. 792–799.
[9] H. J. Chao, J. Park, S. Artan, S. Jiang, and G. Zhang, “Trueway: a [33] R. Rojas-Cessa, E. Oki, and H. J. Chao, “CIXOB-k: Combined input-
highly scalable multi-plane multi-stage buffered packet switch,” in High crosspoint-output buffered packet switch,” in Global Telecommunica-
Performance Switching and Routing, 2005. HPSR. 2005 Workshop on. tions Conference, 2001. GLOBECOM’01. IEEE, vol. 4. IEEE, 2001,
IEEE, 2005, pp. 246–253. pp. 2654–2660.
[10] N. Chrysos and M. Katevenis, “Scheduling in non-blocking buffered
three-stage switching fabrics.” in INFOCOM, vol. 6, 2006, pp. 1–13.
[11] Z. Dong and R. Rojas-Cessa, “Non-blocking memory-memory-memory
Clos-network packet switch,” in 34th IEEE Sarnoff Symposium. IEEE,
2011, pp. 1–5.
[12] Y. Xia, M. Hamdi, and H. J. Chao, “A practical large-capacity three-
stage buffered Clos-network switch architecture,” IEEE Transactions on
Parallel and Distributed Systems, vol. 27, no. 2, pp. 317–328, 2016.
[13] F. Hassen and L. Mhamdi, “High-capacity Clos-network switch for data
center networks,” in IEEE International Conference on Communications
2017. IEEE, 2017.
[14] X. Li, Z. Zhou, and M. Hamdi, “Space-memory-memory architecture
for clos-network packet switches,” in Communications, 2005. ICC 2005.
2005 IEEE International Conference on, vol. 2. IEEE, 2005, pp. 1031–
1035.
[15] C.-B. Lin and R. Rojas-Cessa, “Minimizing scheduling complexity
with a Clos-network space-space-memory (SSM) packet switch,” in
High Performance Switching and Routing (HPSR), 2013 IEEE 14th
International Conference on. IEEE, 2013, pp. 15–20.
[16] R. Rojas-Cessa and C.-B. Lin, “Scalable two-stage Clos-network switch
and module-first matching,” in 2006 Workshop on High Performance
Switching and Routing. IEEE, 2006, pp. 6–pp.
[17] Z. Dong and R. Rojas-Cessa, “MCS: buffered Clos-network switch with
in-sequence packet forwarding,” in Sarnoff Symposium (SARNOFF),
2012 35th IEEE. IEEE, 2012, pp. 1–6.

Potrebbero piacerti anche