
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 5, MAY 2010

Implementation of a Self-Motivated Arbitration Scheme for the Multilayer AHB Busmatrix

Soo Yun Hwang, Dong Soo Kang, Hyeong Jun Park, and Kyoung Son Jhang, Member, IEEE

Abstract—The multilayer advanced high-performance bus (ML-AHB) busmatrix employs slave-side arbitration. Slave-side arbitration is different from master-side arbitration in terms of request and grant signals since, in the former, the master merely starts a burst transaction and waits for the slave response to proceed to the next transfer. Therefore, in the former, the unit of arbitration can be a transaction or a transfer. However, the ML-AHB busmatrix of ARM offers only transfer-based fixed-priority and round-robin arbitration schemes. In this paper, we propose the design and implementation of a flexible arbiter for the ML-AHB busmatrix to support three priority policies (fixed priority, round robin, and dynamic priority) and three data multiplexing modes (transfer, transaction, and desired transfer length). In total, there are nine possible arbitration schemes. The proposed arbiter, which is self-motivated (SM), selects one of the nine possible arbitration schemes based upon the priority-level notifications and the desired transfer length from the masters so that arbitration leads to the maximum performance. Experimental results show that, although the area overhead of the proposed SM arbitration scheme is 9%–25% larger than those of the other arbitration schemes, our arbiter improves the throughput by 14%–62% compared to other schemes.

Index Terms—Multilayer AHB (ML-AHB) busmatrix, on-chip bus, self-motivated (SM) arbitration scheme, slave-side arbitration, system-on-a-chip (SoC).

I. INTRODUCTION

THE ON-CHIP bus plays a key role in system-on-a-chip (SoC) design by enabling the efficient integration of heterogeneous system components such as CPUs, DSPs, application-specific cores, memories, and custom logic [1]. Recently, as the level of design complexity has become higher, SoC designs require a system bus with high bandwidth to perform multiple operations in parallel [2]. To solve the bandwidth problems, several types of high-performance on-chip buses have been proposed, such as the multilayer AHB (ML-AHB) busmatrix from ARM [3], the PLB crossbar switch from IBM [4], and CONMAX from Silicore [5]. Among them, the ML-AHB busmatrix has been widely used in many SoC designs. This is because of the simplicity of the AMBA bus of ARM, which attracts many IP designers [6], and the suitability of the AMBA bus architecture for low-power embedded systems [7].

The ML-AHB busmatrix is an interconnection scheme based on the AMBA AHB protocol, which enables parallel access paths between multiple masters and slaves in a system. This is achieved by using a more complex interconnection matrix and gives the benefit of both increased overall bus bandwidth and a more flexible system structure [3]. In particular, the ML-AHB busmatrix uses slave-side arbitration. Slave-side arbitration is different from master-side arbitration in terms of request and grant signals since, in the former, the master merely starts a burst transaction and waits for the slave response to proceed to the next transfer. Therefore, the unit of arbitration can be a transaction or a transfer [8]. The transaction-based arbiter multiplexes the data transfer based on the burst transaction, and the transfer-based arbiter switches the data transfer based on a single transfer. However, the ML-AHB busmatrix of ARM presents only transfer-based arbitration schemes, i.e., transfer-based fixed-priority and round-robin arbitration schemes. This limitation on the arbitration scheme may lead to degradation of the system performance because the best arbitration scheme usually depends on the application requirements, and recent applications are becoming more complex and diverse. By implementing an efficient arbitration scheme, the system performance can be tuned to better suit applications [9].

For high-performance on-chip buses, several studies related to the arbitration scheme have been proposed, such as table-lookup-based crossbar arbitration [10], two-level time-division multiplexing (TDM) scheduling [11], token-ring mechanism [12], dynamic bus distribution algorithm [13], and LOTTERYBUS [14]. However, these approaches employ master-side arbitration. Therefore, they can only control the priority policy and also present some limitations when handling the transfer-based arbitration scheme since master-side arbitration uses a centralized arbiter. In contrast, slave-side arbitration can deal with the transfer-based arbitration scheme as well as the transaction-based arbitration scheme.

In this paper, we propose a flexible arbiter based on the self-motivated (SM) arbitration scheme for the ML-AHB busmatrix. Our SM arbitration scheme has the following advantages: 1) it can adjust the processed data unit; 2) it changes the priority policies during runtime; and 3) it is easy to tune the arbitration scheme according to the characteristics of the target application. Hence, our arbiter is able to not only deal with the transfer-based fixed-priority, round-robin, and dynamic-priority arbitration schemes but also manage the transaction-based fixed-priority, round-robin, and dynamic-priority arbitration schemes. Furthermore, our arbiter provides the desired-transfer-length-based fixed-priority, round-robin, and dynamic-priority arbitration schemes. In addition, the proposed SM arbiter selects one of the nine possible arbitration schemes based on the priority-level notifications and the desired transfer length from the masters to ensure that the arbitration leads to the maximum performance.

In Section II, we briefly explain the arbitration schemes for the ML-AHB busmatrix of ARM, while Section III describes an implementation method for our flexible arbiter based upon the SM arbitration scheme for the ML-AHB busmatrix. We then present experimental results in Section IV and concluding remarks in Section V.

Manuscript received July 15, 2008; revised December 13, 2008. First published July 28, 2009; current version published April 23, 2010. This work was supported in part by the STSAT-3 Program, Grant of the Ministry of Education, Science and Technology of South Korea, and by the IT R&D program of MKE/IITA [2006-S001-03, Development of Adaptive Radio Access and Transmission Technologies for the 4th Generation Mobile Communications].
S. Y. Hwang and H. J. Park are with the High-Speed User Equipment Modem Research Team, Department of Mobile Convergence Research, Electronics and Telecommunications Research Institute, Daejeon 305-700, Korea (e-mail: syhwang@etri.re.kr; parkhj@etri.re.kr).
D. S. Kang and K. S. Jhang are with the Digital System Laboratory, Department of Computer Engineering, Chungnam National University, Daejeon 305-764, Korea (e-mail: atom@cnu.ac.kr; sun@cnu.ac.kr).
Digital Object Identifier 10.1109/TVLSI.2009.2015665

1063-8210/$26.00 © 2009 IEEE

Fig. 1. Overall structure of the ML-AHB busmatrix of ARM [3].

II. ARBITRATION SCHEMES FOR THE ML-AHB BUSMATRIX OF ARM

The ML-AHB busmatrix of ARM consists of the input stage, decoder, and output stage, including an arbiter [3]. Fig. 1 shows the overall structure of the ML-AHB busmatrix of ARM. The input stage is responsible for holding the address and control information when a transfer to a slave cannot commence immediately. The decoder determines which slave a transfer is destined for. The output stage selects which of the various master input ports is routed to the slave. Each output stage has an arbiter. The arbiter determines which input stage has to perform a transfer to the slave and decides which master currently has the highest priority. The ML-AHB busmatrix employs slave-side arbitration, in which the arbiters are located in front of each slave port, as shown in Fig. 1; the master simply starts a transaction and waits for the slave response to proceed to the next transfer. Therefore, the unit of arbitration can be a transaction or a transfer. However, the ML-AHB busmatrix of ARM furnishes only transfer-based arbitration schemes, specifically transfer-based fixed-priority and round-robin arbitration schemes. The transfer-based fixed-priority (round-robin) arbiter multiplexes the data transfer based on a single transfer in a fixed-priority (round-robin) fashion.

III. SM ARBITRATION SCHEME FOR THE ML-AHB BUSMATRIX

To implement the SM arbitration scheme, we assume that the masters can change their priority level and can issue the desired transfer length to the arbiters. This assumption should be valid because the system developer generally recognizes the features of the target applications [15]. For example, some masters in embedded systems are required to complete their jobs within given timing constraints so that the system-level timing constraints are satisfied. The computation time of each master is predictable, but it is not easy to foresee the data transfer time since the on-chip bus is usually shared by several masters. Previous works solved this issue by minimizing the latencies of several latency-critical masters, but a side effect of these methods is that they can increase the latencies of other masters; hence, they may violate the given timing constraints [16]. Unlike existing works, our scheme can keep the latency close to its given constraint by adjusting the priority level and transfer length of the masters.

Fig. 2 shows an example. In this example, the service latencies (latency-limit times) of M1, M2, and M3 are 4, 8, and 2 cycles (T14, T8, and T10), respectively. The requests of the three masters are all initiated at T0, and M3 is the most latency-sensitive master. Fig. 2(a) shows an arbitration scheme that does not use latency constraints for arbitration. The masters are selected in ascending order, so M2 and M3 violate their latency constraints; only M1 meets its constraint. Fig. 2(b) shows the scheduling of a typical latency-minimizing arbiter. It minimizes the latency of the most latency-sensitive module, namely, M3, causing M2 to violate its constraint. Although neither of these two arbitration schemes can meet the latency constraints for all three masters, in the SM arbitration shown in Fig. 2(c), all masters use the bus with no violations by configuring the priority levels (transfer lengths) of M1, M2, and M3 as the lowest, highest, and intermediate priorities (4, 8, and 2), respectively.

Fig. 2. Arbitration scheme examples in an embedded system. (a) Arbitration scheme with no consideration of the latency constraint. (b) Arbitration scheme minimizing latency. (c) SM arbitration scheme.
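The Fig. 2 example can be checked with a few lines of Python. This is only a sketch under stated assumptions: the bus serves one master at a time, back to back, starting at T0; each master holds the bus for its service latency; and the order used for case (b) after M3 is a guess, since the figure itself is not reproduced here. The names `SERVICE`, `LIMIT`, and `violations` are illustrative and do not come from the paper.

```python
# Service latency (cycles) and latency-limit time (absolute cycle) per master,
# taken from the Fig. 2 example in the text.
SERVICE = {"M1": 4, "M2": 8, "M3": 2}
LIMIT = {"M1": 14, "M2": 8, "M3": 10}

def violations(order):
    """Serve masters back to back from T0; return those finishing after their limit."""
    t = 0
    missed = []
    for master in order:
        t += SERVICE[master]          # completion time of this master
        if t > LIMIT[master]:
            missed.append(master)
    return missed

case_a = violations(["M1", "M2", "M3"])  # (a) ascending order, constraints ignored
case_b = violations(["M3", "M1", "M2"])  # (b) M3 first; remaining order is assumed
case_c = violations(["M2", "M3", "M1"])  # (c) SM: M2 highest, M3 intermediate, M1 lowest

print(case_a, case_b, case_c)
```

Under these assumptions the ascending order misses the limits of M2 and M3, the latency-minimizing order misses the limit of M2, and the SM ordering meets all three, which agrees with the description of Fig. 2.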
We use part of the 32-b address bus of the masters to inform the arbiters of the priority level and the desired transfer length of the masters. Fig. 3 shows the decoding information for our address bus. In Fig. 3, S_Number indicates the target slave number, P_Level means the priority level of a master, T_Length denotes the desired transfer length of a master, and Offset_Add specifies the internal address of the target slave. Each of S_Number and P_Level consists of 3 b because the maximum number of master–slave sets is 8 × 8 [3]. Also, T_Length is composed of 4 b because the maximum burst length is 16 [3]. Although we used 7 b for P_Level and T_Length in the 32-b address bus to notify the arbiters of the priority level and the desired transfer length of a master, we consider the remaining bits adequate to express the internal address of a slave because the range of Offset_Add is from 0 to $2^{22}-1$. Through the aforementioned assumption, the priority level and transfer length can then be changed by the SM demand of each master.

Fig. 3. Decoding information of the 32-b address bus.

Fig. 4 shows the internal structure of our arbiter based upon the SM arbitration scheme. In Fig. 4, the NoPort signal means that none of the masters must be selected and that the address and control signals to the shared slave must be driven to an inactive state, while Master No. indicates the currently selected master number generated by the controller for the SM arbitration scheme. In general, our arbiter consists of an RR block, a P block, two multiplexers, a counter, a controller, and two flip-flops. MUX_1 and MUX_2 are used to select the arbitration scheme and the desired transfer length of a master, respectively. The counter tracks the transfer length, and the two flip-flops are inserted to keep the arbitration logic off the critical path. The RR block (P block) performs the round-robin (priority-based) arbitration scheme.

Fig. 4. Internal structure of our arbiter.

Fig. 5 shows the internal process of an RR block. Initially, we create the up- and down-mask vectors (Up_Mask and Dn_Mask) based on the number of the currently selected master, as shown in Fig. 5.
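Before continuing with the RR block, the address-field layout of Fig. 3 can be made concrete with a small sketch. The field widths (3, 3, 4, and 22 b) follow the text; the bit positions are an assumption, since Fig. 3 is not reproduced here: the fields are taken to be packed from the most significant bit in the order S_Number, P_Level, T_Length, Offset_Add. `DecodedAddress`, `decode_address`, and `encode_address` are hypothetical helper names.

```python
from typing import NamedTuple

class DecodedAddress(NamedTuple):
    s_number: int    # target slave number, 3 b
    p_level: int     # priority level of the master, 3 b
    t_length: int    # desired transfer length field, 4 b (raw field value)
    offset_add: int  # internal address of the target slave, 22 b

# Assumed packing (MSB first):
#   S_Number[31:29] | P_Level[28:26] | T_Length[25:22] | Offset_Add[21:0]
def decode_address(haddr: int) -> DecodedAddress:
    haddr &= 0xFFFFFFFF
    return DecodedAddress(
        s_number=(haddr >> 29) & 0x7,
        p_level=(haddr >> 26) & 0x7,
        t_length=(haddr >> 22) & 0xF,
        offset_add=haddr & 0x3FFFFF,
    )

def encode_address(s_number: int, p_level: int, t_length: int, offset_add: int) -> int:
    return (
        ((s_number & 0x7) << 29)
        | ((p_level & 0x7) << 26)
        | ((t_length & 0xF) << 22)
        | (offset_add & 0x3FFFFF)
    )

example = decode_address(encode_address(1, 2, 8, 0x100))
print(example)
```

How a 4-b field represents a burst length of 16 is not stated in the text, so the sketch returns the raw field value and leaves its interpretation open.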
We then generate the up- and down-masked vectors through a bitwise AND operation between each mask vector and the requested master vector. After generating the up- and down-masked vectors, we examine whether each masked vector is zero. If the up-masked vector is zero, the down-masked vector is passed as the input parameter of the round-robin function; if it is not zero, the up-masked vector is passed instead. A master for the next transfer is chosen by the round-robin function, and the current master is updated after 1 clock cycle. The RR block operates by repeating the arbitration procedure shown in Fig. 5.

Fig. 5. Internal process of the RR block.

Fig. 6 shows the VHDL code of the round-robin function at the behavioral level. In Fig. 6, a master for the next transfer is selected through the for-statement in line 6, with the priority level of the least significant bit in Masked_Vector being the highest. If we modify the range of Masked_Vector in line 6 to "0 to Masked_Vector'left," then the priority level of the most significant bit in Masked_Vector becomes the highest.

Fig. 6. VHDL code of the round-robin function.

Fig. 7 shows the internal procedure of the P block. First, we create the highest priority vector (V) through the round-robin function of Fig. 6. After generating the highest priority vector (V), the priority-level vectors and the highest priority vector (V) are passed as the input parameters of the priority function. The master with the highest priority is chosen by the priority function, while the current master is updated after 1 clock cycle. Fig. 8 shows the VHDL code of the priority function at the behavioral level. In Fig. 8, the master with the highest priority is selected through the for-statement in line 7.
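The RR block procedure described above (build the masks, AND them with the request vector, fall back to the down-masked vector when the up-masked one is zero, then pick the lowest set bit) can be sketched in Python. This is a behavioral illustration, not the VHDL of Fig. 6. The exact mask definitions are an assumption because Fig. 5 is not reproduced here: `up_mask` is taken to cover the masters above the current one and `dn_mask` the current master and those below it.

```python
def round_robin(masked_vector: int, n_masters: int):
    """Return the index of the lowest set bit (LSB has the highest priority), or None."""
    for i in range(n_masters):
        if (masked_vector >> i) & 1:
            return i
    return None

def rr_block(requests: int, current: int, n_masters: int):
    """Choose the next master given the request bit vector and the current master."""
    all_ones = (1 << n_masters) - 1
    dn_mask = (1 << (current + 1)) - 1   # assumed: bits 0..current
    up_mask = all_ones & ~dn_mask        # assumed: bits current+1..n_masters-1
    up_masked = requests & up_mask
    dn_masked = requests & dn_mask
    # Prefer requesters above the current master; wrap around otherwise.
    chosen = up_masked if up_masked != 0 else dn_masked
    return round_robin(chosen, n_masters)

# Masters 0, 1, and 3 are requesting (bit i set means master i requests).
print(rr_block(0b1011, current=1, n_masters=4))
print(rr_block(0b1011, current=3, n_masters=4))
```

With masters 0, 1, and 3 requesting, the sketch moves from master 1 to master 3 and then wraps from master 3 back to master 0, which is the rotation the masks are meant to produce.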

Fig. 7. Internal procedure of the P block.

Fig. 8. VHDL code of the priority function.
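As a rough behavioral counterpart to the priority function of Fig. 8, the following Python sketch picks the requesting master with the highest priority. It does not reproduce the VHDL, which is not shown here. Two assumptions are made: a smaller P_Level number means a higher priority (the convention the paper states for Fig. 9), and ties go to the lowest master index, which is a simplification of the paper's highest priority vector (V). `priority_select` is a hypothetical name.

```python
def priority_select(requests: int, p_levels):
    """Return the requesting master with the smallest P_Level number, or None.

    requests: bit vector, bit i set when master i is requesting.
    p_levels: list of priority levels, one per master (smaller = higher priority).
    """
    best = None
    for i, level in enumerate(p_levels):
        requesting = (requests >> i) & 1
        if requesting and (best is None or level < p_levels[best]):
            best = i            # strictly better level wins; ties keep the lower index
    return best

levels = [1, 2, 0, 3]                       # illustrative values, not taken from a figure
print(priority_select(0b1111, levels))      # all four masters request
print(priority_select(0b1011, levels))      # master 2 is not requesting
```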

A controller compares the priority levels of the requesting masters. If the masters have equal priorities, the controller selects the round-robin arbitration scheme (RR block); in other cases, it chooses the priority arbitration scheme (P block). The controller also makes the final decision on the master for the next transfer based on the transfer length of the selected master. The control process consists of the following three steps.

1) If HMASTLOCK is asserted, the same master remains selected.
2) If HMASTLOCK is not asserted and there is no currently selected master, the following hold.
   a) If no master is requesting access, the NoPort signal is asserted.
   b) Otherwise, a new master for the next transfer is initially selected. If the masters have equal priorities, the round-robin arbitration scheme is selected; otherwise, the priority arbitration scheme is chosen. In addition, the counter is updated based on the transfer length of the selected master.
3) If none of the previous statements applies, the following hold.
   a) If the counter has expired, the following hold.
      i) If there are no requesting masters, the NoPort signal is updated based on the HSEL signal of the currently selected master. If the HSEL signal is "1," the same master remains selected, and the NoPort signal is deasserted. Otherwise, the NoPort signal is asserted.
      ii) Otherwise, a master for the next transfer is selected based on the priority levels of the requesting masters. Also, the counter is updated.
   b) If the counter has not expired and the HSEL signal of the current master is "1," the same master remains selected, and the counter is decreased.
   c) If the currently selected master completes a transaction before the counter has expired, the following hold.
      i) If there are no requesting masters, the NoPort signal is asserted.
      ii) Otherwise, a master for the next transfer is chosen based on the priority levels of the requesting masters, and the counter is updated.

The SM arbitration scheme is achieved through iteration of the aforementioned steps. Combining the priority level and the desired transfer length of the masters allows our arbiter to handle the transfer-based fixed-priority, round-robin, and dynamic-priority arbitration schemes (abbreviated as the FT, RT, and DT arbitration schemes, respectively), as well as the transaction-based fixed-priority, round-robin, and dynamic-priority arbitration schemes (abbreviated as the FR, RR, and DR arbitration schemes, respectively). Moreover, our arbiter can also deal with the desired-transfer-length-based fixed-priority, round-robin, and dynamic-priority arbitration schemes (abbreviated as the FL, RL, and DL arbitration schemes, respectively). The transfer-based (transaction-based) arbiter switches the data transfer based upon a single transfer (burst transaction), and the desired-transfer-length-based arbiter multiplexes the data transfer based on the transfer length assigned by the masters.

Fig. 9 shows the configurations for the fixed-priority arbitration schemes. In this figure, the smaller the priority level number, the higher the priority level. In the fixed-priority arbitration schemes, each master has a static priority. In transfer-based arbitration, the transfer length is allocated as 1, indicating a single transfer; in transaction-based arbitration, the transfer length is equal to the HBURST signal, which refers to the transaction type (transfer …). In addition, the transfer length for the desired-transfer-length-based arbitration is allotted by the demand of each master (for example, let …, …, …, and …). The arbitration results of Fig. 9 are as follows ("#" indicates the transfer number).
1) FT arbitration scheme: M2(#0), M2(#1), M2(#2), M1(#0), M1(#1), M1(#2), M1(#3), M1(#4), M0(#0), M0(#1), M0(#2), M0(#3), M0(#4), M0(#5), M0(#6), M0(#7), M1(#5), M1(#6), M1(#7), M2(#3), M2(#4), M2(#5), M2(#6), M2(#7), M3(#0), M3(#1), M3(#2), M3(#3), M3(#4), M3(#5), M3(#6), M3(#7).
2) FR arbitration scheme: M2(#0), M2(#1), M2(#2), M2(#3), M2(#4), M2(#5), M2(#6), M2(#7), M0(#0), M0(#1), M0(#2), M0(#3), M0(#4), M0(#5), M0(#6), M0(#7), M1(#0), M1(#1), M1(#2), M1(#3), M1(#4), M1(#5), M1(#6), M1(#7), M3(#0), M3(#1), M3(#2), M3(#3), M3(#4), M3(#5), M3(#6), M3(#7).
3) FL arbitration scheme: M2(#0), M2(#1), M2(#2), M2(#3), M2(#4), M2(#5), M2(#6), M2(#7), M0(#0), M0(#1), M0(#2), M0(#3), M0(#4), M0(#5), M0(#6), M0(#7), M1(#0), M1(#1), M1(#2), M1(#3), M1(#4), M1(#5), M1(#6), M1(#7), M3(#0), M3(#1), M3(#2), M3(#3), M3(#4), M3(#5), M3(#6), M3(#7).

In this case, the result of transaction-based arbitration is equal to that of desired-transfer-length-based arbitration because the priority levels of all the masters are fixed.

Fig. 9. Configurations for the fixed-priority arbitration schemes.

Fig. 10 shows the combinations for the round-robin arbitration schemes. In these schemes, the masters have equal priorities, with the transfer length being assigned as 1 in transfer-based arbitration and 8 in transaction-based arbitration. Also, in desired-transfer-length-based arbitration, the transfer length is assigned by the demand of each master (for example, let …, …, …, and …). The arbitration results of Fig. 10 are as follows.

1) RT arbitration scheme: M0(#0), M1(#0), M2(#0), M3(#0), M0(#1), M1(#1), M2(#1), M3(#1), M0(#2), M1(#2), M2(#2), M3(#2), M0(#3), M1(#3), M2(#3), M3(#3), M0(#4), M1(#4), M2(#4), M3(#4), M0(#5), M1(#5), M2(#5), M3(#5), M0(#6), M1(#6), M2(#6), M3(#6), M0(#7), M1(#7), M2(#7), M3(#7).


2) RR arbitration scheme: M0(#0), M0(#1), M0(#2), M0(#3), M0(#4), M0(#5), M0(#6), M0(#7), M1(#0), M1(#1), M1(#2), M1(#3), M1(#4), M1(#5), M1(#6), M1(#7), M2(#0), M2(#1), M2(#2), M2(#3), M2(#4), M2(#5), M2(#6), M2(#7), M3(#0), M3(#1), M3(#2), M3(#3), M3(#4), M3(#5), M3(#6), M3(#7).
3) RL arbitration scheme: M0(#0), M0(#1), M1(#0), M1(#1), M1(#2), M1(#3), M1(#4), M1(#5), M1(#6), M1(#7), M2(#0), M2(#1), M2(#2), M2(#3), M2(#4), M2(#5), M3(#0), M3(#1), M3(#2), M3(#3), M0(#2), M0(#3), M2(#6), M2(#7), M3(#4), M3(#5), M3(#6), M3(#7), M0(#4), M0(#5), M0(#6), M0(#7).

Fig. 11 shows the configurations for the dynamic-priority arbitration schemes. In the dynamic-priority arbitration schemes, the priority of the masters can be changed by the SM demand of each master. Furthermore, the transfer length is assigned as 1 in transfer-based arbitration and 4 in transaction-based arbitration. Also, the transfer length for desired-transfer-length-based arbitration is assigned as shown in Fig. 11. The arbitration results of Fig. 11 are as follows.

1) DT arbitration scheme: M2(#0), M3(#0), M3(#1), M3(#2), M3(#3), M1(#0), M1(#1), M1(#2), M1(#3), M0(#0), M0(#1), M0(#2), M0(#3), M2(#1), M2(#2), M2(#3), M3(#0), M3(#1), M0(#0), M0(#1), M0(#2), M2(#0), M2(#1), M2(#2), M2(#3), M0(#3), M1(#0), M1(#1), M1(#2), M1(#3), M3(#2), M3(#3).

Fig. 10. Configurations for the round-robin arbitration schemes.

2) DR arbitration scheme: M2(#0), M2(#1), M2(#2), M2(#3), M3(#0), M3(#1), M3(#2), M3(#3), M1(#0), M1(#1), M1(#2), M1(#3), M0(#0), M0(#1), M0(#2), M0(#3), M3(#0), M3(#1), M3(#2), M3(#3), M0(#0), M0(#1), M0(#2), M0(#3), M2(#0), M2(#1), M2(#2), M2(#3), M1(#0), M1(#1), M1(#2), M1(#3).
3) DL arbitration scheme: M2(#0), M2(#1), M2(#2), M3(#0), M3(#1), M3(#2), M3(#3), M1(#0), M1(#1), M1(#2), M1(#3), M0(#0), M0(#1), M0(#2), M0(#3), M2(#3), M3(#0), M3(#1), M0(#0), M0(#1), M0(#2), M0(#3), M2(#0), M2(#1), M2(#2), M2(#3), M1(#0), M1(#1), M1(#2), M1(#3), M3(#2), M3(#3).
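The interplay of the counter and the three data multiplexing modes can be illustrated with a small behavioral simulation in Python. It is a sketch under simplifying assumptions, not the controller of Fig. 4: all masters request one 8-beat burst at T0, there are no locked transfers or slave wait states, and the arbiter reselects only when the counter expires or the burst completes. The desired transfer lengths 2, 8, 6, and 4 for M0 to M3 are inferred from the RL sequence listed above rather than read from Fig. 10. `simulate` and `fmt` are hypothetical names.

```python
def simulate(burst_len, quantum, p_level):
    """Return the served order as a list of (master, transfer number) pairs.

    burst_len: transfers each master needs (all masters request at T0).
    quantum:   per-master transfer length loaded into the counter on selection.
    p_level:   per-master priority level (smaller number = higher priority).
    """
    n = len(quantum)
    remaining = [burst_len] * n
    current = n - 1                       # so that round robin starts at master 0
    order = []
    while any(remaining):
        requesting = [m for m in range(n) if remaining[m] > 0]
        if len({p_level[m] for m in requesting}) == 1:
            # Equal priorities: round robin, first requester after the current master.
            nxt = min(requesting, key=lambda m: (m - current - 1) % n)
        else:
            # Different priorities: the smallest level number wins.
            nxt = min(requesting, key=lambda m: p_level[m])
        # The counter is loaded with the transfer length, capped by the rest of the burst.
        for _ in range(min(quantum[nxt], remaining[nxt])):
            order.append((nxt, burst_len - remaining[nxt]))
            remaining[nxt] -= 1
        current = nxt
    return order

def fmt(order):
    return ", ".join(f"M{m}(#{k})" for m, k in order)

EQUAL = [0, 0, 0, 0]
rt = simulate(8, [1, 1, 1, 1], EQUAL)   # transfer-based round robin
rr = simulate(8, [8, 8, 8, 8], EQUAL)   # transaction-based round robin
rl = simulate(8, [2, 8, 6, 4], EQUAL)   # desired-transfer-length-based (lengths inferred)
print(fmt(rl))
```

Under these assumptions the three runs produce the same orderings as the RT, RR, and RL results listed for Fig. 10. The dynamic-priority results of Fig. 11 depend on request timing and priority changes that are not modeled here.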

IV. IMPLEMENTATION RESULTS AND PERFORMANCE ANALYSIS

A. Implementation Results

We implemented different slave-side arbitration schemes for the ML-AHB busmatrix. Each arbitration-scheme-based busmatrix was implemented with synthesizable RTL VHDL targeting a XILINX FPGA (XC2VP100-6ff1704). The XILINX design tool (ISE 7.1i) was used to measure the total area. The implemented arbitration schemes are as follows:

• FT, FR, RT, RR, DT, DR, and SM arbitration schemes.

The ML-AHB busmatrix of ARM provides only two arbitration schemes: the FT and RT arbitration schemes. Thus, we compared the FT- and RT-based busmatrixes of ARM with our corresponding busmatrixes in terms of area to show the credibility of our implementation. Fig. 12 shows the comparison results. The total areas of our FT- and RT-based busmatrixes are 21% and 13% smaller on average, respectively, than those of the FT- and RT-based busmatrixes of ARM. One reason is that we adapted the bit masking mechanism [17] to our busmatrixes to reduce the area of the arbiter, while ARM used multiple priority encoders, a multiplexer, and a demultiplexer to implement the arbiters of the busmatrixes.

Table I charts the synthesis results of our ML-AHB busmatrixes with the different arbitration schemes. The total area of the SM-based busmatrix is 9%–25% larger than those of the other busmatrixes. This may be because our SM-based busmatrix additionally requires a comparator to compare the priorities of the masters and counters to calculate the transfer length. Although our SM-based busmatrix occupies more area than the other busmatrixes, our arbiter is able to deal with varied arbitration schemes such as the FT, FR, RT, RR, DT, and DR arbitration schemes.

B. Performance Analysis

We utilized a ModelSim II simulator to measure the performance of the ML-AHB busmatrixes with the different arbitration schemes and demonstrate the efficiency of our flexible SM arbitration scheme.


Fig. 11. Configurations for the dynamic-priority arbitration schemes.

1) Simulation Environments: Fig. 13 shows our simulation environment. In our simulation environment, the clock frequency of all components is 100 MHz (a period of 10 ns). The implemented ML-AHB busmatrix has a 32-b address bus, a 32-b write data bus, a 32-b read data bus, a 15-b control bus, and a 3-b response bus. The simulation environment consists of an implemented part and a virtual part. The former corresponds to the ML-AHB busmatrixes with different arbitration schemes and consists of four masters and two slaves. Specifically, we considered only two target slaves, a situation in which conflicts happen frequently, and all masters access these slaves so that the performance analysis focuses on the arbitration schemes of each busmatrix. The virtual part is composed of AHB masters and AHB slaves. The AHB master generates the transactions, all of which have the same length, that of an 8-beat incrementing burst. The AHB slave responds to the transfers of the masters. Both the AHB masters and slaves are fully compatible with the AMBA AHB protocol [3]. For a more realistic model of a SoC design, we modeled the AHB masters after the features of the processor and DMA with VHDL at the behavioral level. For the AHB slaves, we used the real SRAM, SDRAM, and SDRAM controller RTL models used in many applications. We also constructed the protocol checker and performance monitor modules with VHDL and the foreign language interface (FLI C module) to ensure the reliability of our performance simulations.

Prior to the simulation, the workloads should be determined as they affect the simulation results. However, determining the appropriate workloads of real applications is difficult because these can only be obtained when all applications with real input data are specifically modeled. Instead, the workloads for performance simulation are obtained through synthetic workload generation [18] with the following parameters.

1) The distribution of transactions. This indicates the proportion of the total transactions for which each master is responsible.
2) The ratio of the nonbus transaction time to the total transaction time per AHB master, where the total transaction time consists of a nonbus transaction (internal transaction of the master) time and a bus transaction (external transaction of the master through the busmatrix) time.
3) The latency time of the slave accessed by each master.

These parameters determine the delay of components in the virtual part. Through synthetic workload generation, various possible situations are investigated in which the ML-AHB busmatrixes with each arbitration scheme can be utilized well. In this regard, we found three useful categories of experiments to identify the effects of the following factors:

1) job length of the masters;
2) latency time of the slaves;
3) both the job length of the masters and the latency time of the slaves.

Fig. 12. Comparison of our busmatrixes with those of ARM in total area.

TABLE I
SYNTHESIS RESULTS OF THE ML-AHB BUSMATRIXES WITH DIFFERENT ARBITRATION SCHEMES (NUMBER OF FPGA SLICES)

The dynamic-priority-based arbitration scheme has the ad- vantage for throughput when there are few masters with long job lengths in a system; in other cases, the round-robin-based arbitration scheme can get higher throughput than other arbi- tration schemes [19]. In addition, the arbitration scheme with transaction-based multiplexing performs better than the same arbitration scheme with single-transfer-based switching in ap- plications with frequent access to long-latency slaves such as SDRAM [19]. The slave for the first category is the SRAM-type AHB slave (AHB slave0 in Fig. 13) without latency for access, while the slave for the second category is the SDRAM-type AHB slave

TABLE II

D ISTRIBUTION OF T RANSACTIONS: T OTAL T RANSACTION/N ONBUS

T RANSACTION

RANSACTIONS : T OTAL T RANSACTION /N ONBUS T RANSACTION In Table II, the transactions of

In Table II, the transactions of the masters with long job lengths are generally bus transactions. In this case, the masters follow the characteristics of the DMA master in that the DMA aspect is used for data transmissions through buses in many applications.

TABLE III

D ISTRIBUTION OF T RANSACTIONS: T OTAL T RANSACTION/N ONBUS

T RANSACTION

RANSACTIONS : T OTAL T RANSACTION /N ONBUS T RANSACTION In Table III, the transactions of

In Table III, the transactions of the masters with long job lengths are mostly nonbus transactions. In other words, the majority of the masters have internal jobs. In this case, the masters are processor-type AHB masters in the sense that the processor usually has many internal jobs for calculation compared with external jobs of setting the control registers for the slave module.

(AHB slave1 in Fig. 13) with a long latency time for access. The slave for the third category can be AHB slave0 or AHB slave1. In particular, the target addresses are generated by a uniformly distributed random number function over AHB slave0 and AHB slave1. Therefore, each master communicates with either slave with the same probability in the third category. Tables II and III tabulate the simulation parameters.

We performed a number of performance simulations at various job lengths and observed no difference in the results at specific job lengths. The specific job length was 4800, so we fixed the job length for the performance analysis at 4800. In addition, this job length exhibits the features of each arbitration scheme very well.

Fig. 13. Simulation environment for performance analysis.

2) Simulation Results: Fig. 14 shows the simulation results of the first category. In this paper, throughput is defined as

$$\text{Throughput} = \frac{N \times L \times W}{T}$$

where $N$ is the total number of transactions, $L$ indicates the number of transfers per transaction, $W$ denotes the data bit width, and $T$ means the completion time of the data transmission. Note that $N$, $L$, and $W$ are all fixed in the three categories because the job length is fixed at 4800. However, the simulation results differ from each other since the distribution of transactions (total transaction/nonbus transaction) differs, as shown in Tables II and III. In addition, the total transaction time consists of internal- and external-job times. For example, we can assume the job schedule of a master as follows:

In the aforementioned job schedule, Job0 is performed regardless of the arbitration scheme because an internal job does not use the busmatrix. However, Job2 can be performed only after the completion of Job1, and Job1 is strongly related to the busmatrix arbitration scheme since Job1 is an external job. In other words, there is a close dependence between internal and external jobs. The external-job time is a critical factor that decides the completion time of the data transmission, which is defined as the maximum total transaction time over all masters. We employ the aforementioned scheme for more realistic experimentation.

Fig. 14. Simulation results for the first category. (a) Simulation result for Table II. (b) Simulation result for Table III.

Based on the performance simulations of the first category, we observed that the overall system performance depends on the number of masters with a long job length and the data unit processed by the arbiter, regardless of the masters' transaction type (bus or nonbus transaction). In the type1 and type5 cases, where only one of the masters had a long job length, the SM-based busmatrix had the highest throughput. This is because master0, which had the longest job length, issued the highest priority level together with the desired transfer length to the arbiter. The arbiter, in turn, processed the data transfer, focusing on the demands of master0. Accordingly, master0 could finish its transactions more rapidly. Although the transactions of the other masters were somewhat delayed, the total transaction end time of the SM-based busmatrix was the shortest in type1 and type5. In particular, master0 could quickly complete the internal job (nonbus transaction) in type5 due to the prompt processing of the arbiter for the master0 bus transaction. The SM-based busmatrix likewise showed the highest performance in the type2 and type6 cases, where there were two masters with long job lengths. The reasons are similar to those of type1 and type5. Clearly, when there are few masters with long job lengths in a bus system, the SM-based busmatrix configured as the DL arbitration scheme shows the maximum performance. However, the SM-based busmatrix configured as the RL arbitration scheme had the highest throughput in the type3, type4, type7, and type8 cases, where there were many masters with long job lengths. In other words, when there are many masters with long job lengths or the job lengths of all masters are similar or the same, the SM-based busmatrix organized as the RL arbitration scheme shows the highest performance. In most cases, the fixed-priority scheme has the lowest throughput because of starvation.

Fig. 15. Simulation results for the second category. (a) Simulation result for Table II. (b) Simulation result for Table III. (c) Simulation result for Table II. (d) Simulation result for Table III. (e) Simulation result for Table II. (f) Simulation result for Table III.

TABLE IV

CONFIGURATIONS OF THE SM-BASED ARBITRATION SCHEME FOR MAXIMUM PERFORMANCE

We also identified the effect of the data switching unit on the overall system performance. The data multiplexing units can be ordered by system throughput as follows:

desired-transfer-length-based > transaction-based > transfer-based
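One way to see why the multiplexing unit matters is to count how often the arbiter is allowed to switch between masters for the same amount of data. The sketch below is a toy calculation, not a model of the busmatrix: the function `arbitration_points` and the values for the total number of transfers, the burst size, and the desired transfer length are assumptions chosen only for illustration.

```python
import math


def arbitration_points(total_transfers: int, unit_transfers: int) -> int:
    """Number of points at which the arbiter may hand the slave to another master."""
    return math.ceil(total_transfers / unit_transfers)


# Assumed example values (not from the paper).
TOTAL_TRANSFERS = 64   # transfers one master wants to send
BURST_SIZE = 4         # transfers per burst transaction
DESIRED_LENGTH = 16    # desired transfer length announced by the master

points = {
    "transfer": arbitration_points(TOTAL_TRANSFERS, 1),
    "transaction": arbitration_points(TOTAL_TRANSFERS, BURST_SIZE),
    "desired transfer length": arbitration_points(TOTAL_TRANSFERS, DESIRED_LENGTH),
}

if __name__ == "__main__":
    for unit, count in points.items():
        print(f"{unit:>24}: {count} switch points")
```

With these assumed numbers, a larger multiplexing unit gives fewer switch points for the same data, which is consistent with the ordering reported for the first category; the actual throughput gain depends on the slave latency and the mix of masters.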

The desired-transfer-length-based and transfer-based arbitrations have the highest and lowest throughput, respectively, because the data multiplexing occurs in units of the desired transfer length or of a single transfer. Based on the performance simulations for the first category, we observed that the SM-based arbitration scheme improved the throughput by 14%–47% compared with the other arbitration schemes. In addition, the maximum bandwidth on the busmatrix of the first category was 8.4 Gb/s, with our SM-based arbitration utilizing about 82% of the bandwidth.

Fig. 15 shows the simulation results of the second category. Using the performance simulations in this category, we identified that the latency time of the slave has an effect on the overall system performance. Based on the simulation results of the arbitration schemes other than our SM-based one, we observed that an arbitration scheme with transaction-based multiplexing displays a higher performance than the same arbitration scheme with transfer-based switching in an application with frequent access to long-latency devices or memories such as SDRAM. The improvements of the arbitration scheme with transaction-based rather than transfer-based multiplexing are, on average, 19%, 24%, and 27% when the latency times of the SDRAM are 1, 2, and 3 clock cycles, respectively. More specifically, the differences between the transfer- and transaction-based arbitrations are largest in the round-robin arbitration schemes because the data switching occurs in units of a single transfer in the RT arbitration scheme; furthermore, the latency increases as the data multiplexing increases. In fact, the improvements of the RR arbitration scheme over the RT arbitration scheme are about 26%, 42%, and 51% when the latency times of the SDRAM are 1, 2, and 3 clock cycles, respectively. Based on the previous results and the outcome of the first category, we can configure our SM-based arbitration scheme to obtain the maximum throughput as follows:

1) type1, type2, type5, and type6: DR arbitration scheme;
2) type3, type4, type7, and type8: RR arbitration scheme.

The performance simulations for the second category show that the SM-based arbitration scheme enhances the throughput by 26%–62% compared with the other arbitration schemes. Also, the maximum bandwidth on the busmatrix of the second category is 7.12 Gb/s, while our SM-based arbitration used about 79% of the bandwidth.

Fig. 16. Simulation results for the third category. (a) Simulation result for Table II. (b) Simulation result for Table III. (c) Simulation result for Table II. (d) Simulation result for Table III. (e) Simulation result for Table II. (f) Simulation result for Table III.

By virtue of the results of the first and second categories, we can predict the optimal configurations of the SM-based arbitration scheme for the highest performance, as shown in Table IV. For SRAM-type slaves without latency for access, the desired-transfer-length-based arbitration schemes are suitable for the highest throughput; the transaction-based arbitration schemes are appropriate for SDRAM-type slaves with a long latency time for access. In addition, when there are few masters with long job lengths in a bus system, such as in type1, type2, type5, and type6, the SM-based busmatrix configured as a dynamic-priority arbitration scheme has the maximum performance. In comparison, the SM-based busmatrix configured as the round-robin arbitration scheme obtains the highest throughput, provided that there are many masters with long job lengths or that the job lengths of all masters are similar or the same in a bus system, such as in type3, type4, type7, and type8.

On the other hand, based on the performance simulations for the third category, we confirm that the configurations of Table IV have the maximum performance among the arbitration schemes. Fig. 16 shows the simulation results of the third category. These results indicate that the total throughput of the third category improves by about 67% compared to the first and second categories because the number of accessible target slaves is two and some slave operations are performed in parallel. Based on the results of the performance simulations for the third category, we observe that the SM-based arbitration scheme configured as in Table IV improves the throughput by 19%–47% compared to the other arbitration schemes. Moreover, the maximum bandwidth on the busmatrix of the third category is 15.52 Gb/s, and our SM-based arbitration utilized about 69% of the bandwidth.
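The bandwidth figures reported for the three categories can be turned into approximate achieved throughput with simple arithmetic. The sketch below multiplies each reported maximum bandwidth by the reported utilization; because the utilizations are stated as "about", the results are approximate. The helper `throughput_bps` restates the throughput definition (transferred bits divided by completion time), and the inputs used to exercise it are hypothetical, not measured values.

```python
def throughput_bps(n_transactions, transfers_per_transaction,
                   data_width_bits, completion_time_s):
    """Throughput = transactions x transfers per transaction x bit width / time."""
    bits = n_transactions * transfers_per_transaction * data_width_bits
    return bits / completion_time_s


# (maximum bandwidth in Gb/s, approximate utilization) as reported per category.
REPORTED = {
    "first": (8.4, 0.82),
    "second": (7.12, 0.79),
    "third": (15.52, 0.69),
}

# Approximate achieved throughput in Gb/s, rounded to two decimals.
effective_gbps = {
    category: round(max_bw * utilization, 2)
    for category, (max_bw, utilization) in REPORTED.items()
}

if __name__ == "__main__":
    for category, gbps in effective_gbps.items():
        print(f"{category} category: about {gbps} Gb/s")
    # Hypothetical inputs: 1000 transactions of 4 transfers, 32-bit data, 20 us.
    print(throughput_bps(1000, 4, 32, 20e-6) / 1e9, "Gb/s (hypothetical)")
```

Under these figures, the third category achieves the largest absolute throughput even though its utilization is the lowest, since two slaves can be served in parallel.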

V. CONCLUSION

In this paper, we proposed a flexible arbiter based on the SM arbitration scheme for the ML-AHB busmatrix. Our arbiter supports three priority policies (fixed priority, round robin, and dynamic priority) and three approaches to data multiplexing (transfer, transaction, and desired transfer length); in other words, there are nine possible arbitration schemes. In addition, the proposed SM arbiter selects one of the nine possible arbitration schemes based on the priority-level notifications and the desired transfer length from the masters so that the arbitration leads to the maximum performance. Experimental results show that, although the area of the proposed SM arbitration scheme is 9%–25% larger than those of the other arbitration schemes, our arbiter improves the throughput by 14%–62% compared with the other schemes. We therefore expect our SM arbitration scheme to be well suited to application-specific systems because it is easy to tune the arbitration scheme according to the features of the target system. For future work, the configurations of the SM arbitration scheme with the maximum throughput need to be found automatically at runtime. We are likewise looking at the applicability of the proposed arbitration scheme to AMBA AXI (ver. 3.0).¹
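The tuning guidance reported in the experiments can be written down as a small static lookup. This is only a sketch of the reported configurations, not the arbiter's hardware logic or a runtime selection mechanism: the function `select_scheme`, the `SlaveType` enum, and the `few_threshold` parameter are hypothetical. The threshold of two long-job masters is an assumption based on the type1 and type2 cases (one and two long-job masters) being described as "few".

```python
from enum import Enum


class SlaveType(Enum):
    SRAM = "sram"     # no access latency
    SDRAM = "sdram"   # long access latency


def select_scheme(slave_type: SlaveType, long_job_masters: int,
                  few_threshold: int = 2) -> str:
    """Return a two-letter scheme name: priority policy + multiplexing mode.

    Priority: 'D' (dynamic priority) when a few masters have long jobs,
              'R' (round robin) when many do or when job lengths are similar.
    Mode:     'L' (desired transfer length) for SRAM-type slaves,
              'R' (transaction) for SDRAM-type slaves.
    """
    few = 0 < long_job_masters <= few_threshold   # assumed meaning of "few"
    priority = "D" if few else "R"
    mode = "L" if slave_type is SlaveType.SRAM else "R"
    return priority + mode


if __name__ == "__main__":
    print(select_scheme(SlaveType.SRAM, 1))    # one long-job master, SRAM
    print(select_scheme(SlaveType.SDRAM, 4))   # many long-job masters, SDRAM
```

A runtime version, as suggested for future work, would need to estimate the number of long-job masters and the slave latency while the system is running; that is outside the scope of this sketch.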

¹ http://www.arm.com/products/solutions/axi_spec.html, accessed Feb. 2008

REFERENCES

[1] M. Drinic, D. Kirovski, S. Megerian, and M. Potkonjak, “Latency-guided on-chip bus-network design,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 12, pp. 2663–2673, Dec. 2006.

[2] S. Y. Hwang, K. S. Jhang, H. J. Park, Y. H. Bae, and H. J. Cho, “An ameliorated design method of ML-AHB busmatrix,” ETRI J., vol. 28, no. 3, pp. 397–400, Jun. 2006.

[3] ARM, “AHB Example AMBA System,” 2001 [Online]. Available: http://www.arm.com/products/solutions/AMBA_Spec.html

[4] IBM, New York, “32-bit Processor Local Bus Architecture Specification,” 2001.

[5] R. Usselmann, “WISHBONE interconnect matrix IP core,” OpenCores, 2002. [Online]. Available: http://www.opencores.org/?do=project=wb_conmax

[6] N.-J. Kim and H.-J. Lee, “Design of AMBA wrappers for multiple-clock operations,” in Proc. Int. Conf. ICCCAS, Jun. 2004, vol. 2, pp. 1438–1442.

[7] D. Flynn, “AMBA: Enabling reusable on-chip designs,” IEEE Micro, vol. 17, no. 4, pp. 20–27, Jul./Aug. 1997.

[8] S. Y. Hwang, H.-J. Park, and K.-S. Jhang, “Performance analysis of slave-side arbitration schemes for the multi-layer AHB busmatrix,” J. KISS, Comput. Syst. Theory, vol. 34, no. 5, pp. 257–266, Jun. 2007.

[9] S. S. Kallakuri and A. Doboli, “Customization of arbitration policies and buffer space distribution using continuous-time Markov decision processes,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 2, pp. 240–245, Feb. 2007.

[10] D. Seo and M. Thottethodi, “Table-lookup based crossbar arbitration for minimal-routed, 2D mesh and torus networks,” in Proc. Int. Conf. IPDPS, Mar. 2007, pp. 1–10.

[11] K. Lahiri, A. Raghunathan, and S. Dey, “Performance analysis of systems with multi-channel communication architectures,” in Proc. Int. Conf. VLSI Design, Jan. 2000, pp. 530–537.

[12] J. Turner and N. Yamanaka, “Architectural choices in large scale ATM switches,” IEICE Trans. Commun., vol. E-81B, no. 2, pp. 120–137, Feb. 1998.

[13] C. H. Pyoun, C. H. Lin, H. S. Kim, and J. W. Chong, “The efficient bus arbitration scheme in SoC environment,” in Proc. Int. Conf. SoC Real-Time Appl., Jul. 2003, pp. 311–315.

[14] K. Lahiri, A. Raghunathan, and G. Lakshminarayana, “The LOTTERYBUS on-chip communication architecture,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 6, pp. 596–608, Jun. 2006.

[15] J. H. Han, M. Y. Lee, B. Younghwan, and C. Hanjin, “Application specific processor design for H.264 decoder with a configurable embedded processor,” ETRI J., vol. 27, no. 5, pp. 491–496, Oct. 2005.

[16] M. Jun, K. Bang, H.-J. Lee, N. Chang, and E.-Y. Chung, “Slack-based bus arbitration scheme for soft real-time constrained embedded systems,” in Proc. Int. Conf. ASP-DAC, Jan. 2007, pp. 159–164.

[17] S. Y. Hwang, H. J. Park, and K. S. Jhang, An Efficient Implementation Method of Arbiter for the ML-AHB Busmatrix. Berlin, Germany: Springer-Verlag, May 2007, vol. 4523, LNCS, pp. 229–240.

[18] E.-G. Jeong, J.-G. Lee, K.-S. Jhang, J.-A. Lee, and D. Har, “Asynchronous layered interface of multimedia SoCs for multiple outstanding transactions,” J. VLSI Signal Process. Syst., vol. 46, no. 2/3, pp. 133–151, Mar. 2007.

[19] S. Y. Hwang, H. J. Park, and K. S. Jhang, “An implementation and performance analysis of slave-side arbitration schemes for the ML-AHB busmatrix,” in Proc. Int. Conf. ACM Symp. Appl. Comput., Mar. 2007, vol. 2, pp. 1545–1551.


Soo Yun Hwang was born in Seoul, Korea, in 1976. He received the B.S. degree in computer engineering from Hannam University, Daejeon, Korea, in 2002 and the M.S. and Ph.D. degrees in computer engineering from Chungnam National University, Daejeon, in 2004 and 2008, respectively. His Ph.D. degree work concerned enhancements in the architecture and arbitration schemes of the multilayer AHB busmatrix. He joined the Electronics and Telecommunications Research Institute, Daejeon, in 2006, where he is currently a Senior Member of the Engineering Staff working in the High-Speed User Equipment Modem Research Team, Department of Mobile Convergence Research. His research interests include CAD for VLSI, system-on-a-chip (SoC) design methodology, on-chip communication architecture in SoC, and high-speed user equipment modem designs.


Dong Soo Kang received the B.S. and M.S. degrees from the Department of Computer Engineering, Chungnam National University, Daejeon, Korea, in 2005 and 2007, respectively, where he is currently working toward the Ph.D. degree in the Digital System Laboratory. His research interests include satellite onboard computers, memory-aware compilers and architectures, multimedia system designs, and processor architectures.


Hyeong Jun Park received the B.S. degree in electronic engineering from Hanyang University, Seoul, Korea, in 1987 and the M.S. degree in electronic engineering from Chungnam National University, Daejeon, Korea, in 2001, where he is currently working toward the Ph.D. degree. He joined the Electronics and Telecommunications Research Institute, Daejeon, in 1987, where he is currently the Head of the High-Speed User Equipment Modem Research Team, Department of Mobile Convergence Research. His research interests include system-on-a-chip design methodology, mobile communication architecture, and high-speed user equipment modem designs.


Kyoung Son Jhang (M’89) was born in Oggu-Gun, Korea, in 1964. He received the B.S., M.S., and Ph.D. degrees in computer engineering from Seoul National University, Seoul, Korea, in 1986, 1988, and 1995, respectively. In 1996, he joined Hannam University, Daejeon, Korea, as a Faculty Member. He then moved to Chungnam National University, Daejeon, where he is currently a Professor teaching systems programming and digital hardware design in the Department of Computer Engineering. His current major interests include fault-tolerant hardware designs, electronic design automation, and digital system design.