Visiting address: Kyrkogatan 15
Telefax: 036-12 00 65
Abstract
This thesis describes a design flow for a Network on Chip (NoC), which could be a solution for communication in future Systems on Chip (SoC). The time span in which this is expected to become commercial is 5 to 10 years. Because little information is available on the performance of various NoC configurations, one important purpose of the design phase is to make a system-level model that can be used for performance simulations. The system-level model was built in the programming language SDL, and a discrete event simulator for the SDL model was used for the simulations. The NoC is designed as a packet-switched network, with micro-routers placed in a two-dimensional m*n mesh, in this case 4*4, which gives 16 micro-routers. Every router has a connection for a resource, which could be, for instance, a processor, a memory or an FPGA. Another objective is to make a prototype of a NoC in an FPGA. For that purpose VHDL has been used to describe the circuit at a synthesizable level of abstraction. It is concluded that it is useful and relatively easy to use SDL for performance simulations of a NoC and to use these to draw conclusions on design questions. For example, the simulation results showed that increasing the output buffer of a switch from 2 to 3 packets has only a marginal effect on performance. Once the behaviour and structure are described in SDL, the model also serves as a template for the design in VHDL. A small working NoC prototype has been built on an FPGA and tested using the serial port of a PC.
Acknowledgements
We would like to thank Professor Shashi Kumar for his invaluable guidance in this new NoC world. Alf Johansson, programme coordinator, has been very helpful, and for that we are very grateful. Magnus wants to thank his apartment friends for putting up with the unwashed plates. Rickard sends special thanks to his wife and daughter and promises to spend more time at home.
Summary (Sammanfattning)
This document describes a design project concerning Network on Chip (NoC), which is a possible solution for communication in future Systems on Chip (SoC). This solution is not yet on the market, but it is intended to be commercially usable in 5 to 10 years. The design is built as a packet-switched network, with 16 micro-routers placed in a two-dimensional 4*4 matrix. Every router has a connection to a resource, which can be used for, for example, a processor, a memory or an FPGA. Because few performance measurements of different NoC configurations are available, simulations are an important part of the design phase. The programming language SDL has been used to make these simulations. After that, a VHDL description of the circuit was made in order to implement it in an FPGA. It is concluded that SDL is relatively easy to use for performance measurements of a NoC, and that these measurements can then be used for design trade-offs. For example, the results show that it does not pay off to increase the size of the output buffers from 2 to 3 packets, since this has only a marginal effect on performance. Once the behaviour and structure are described in SDL, the model also serves as a support for the design in VHDL. A small NoC prototype has been implemented in an FPGA and tested via the serial port of a PC.
Key words
Core Based Design, FPGA, Network on Chip (NoC), On Chip Communication, Packet Switched Network, SDL, System on Chip (SoC), VHDL
List of Contents
1 Introduction
  1.1 System on Chip
  1.2 Network on Chip
    1.2.1 NoC Evaluation Tools
  1.3 SDL as a modelling platform
  1.4 Objectives of the project
  1.5 Outline
2 Theoretical Background
  2.1 Communication Networks
    2.1.1 Communication Techniques
    2.1.2 OSI Model
    2.1.3 Routers
    2.1.4 Buffers
    2.1.5 Topologies
  2.2 Network on Chip Concepts
    2.2.1 Survey of Network on Chip Ideas
  2.3 System Level Design
    2.3.1 SDL
3 NoC: Design Decisions
  3.1 Design Methodology
  3.2 Network configuration
  3.3 Route and Switch function
  3.4 Connections between Nodes
    3.4.1 Drop of Packets
    3.4.2 Physical issues
    3.4.3 Data-link layer connection
    3.4.4 Network layer connection
    3.4.5 Transport layer
  3.5 Packet structures
  3.6 Buffers
  3.7 Routing Algorithm
  3.8 RNI function
  3.9 Resource
  3.10 Connection to Environment
4 NoC: Modelling in SDL
  4.1 Requirements of Model
  4.2 System Structure
    4.2.1 Design Blocks
    4.2.2 Parameterised Mesh size
    4.2.3 Micro-Router
    4.2.4 RNI
    4.2.5 Resource
    4.2.6 Description of Common Types
  4.3 Design Tool
    4.3.1 Simulation Tool
  4.4 Simulation Set-up
  4.5 Simulation Results
    4.5.1 Simulations with equal delay of Switch and Buffer
    4.5.2 Simulations with unequal delay of Switch and Buffer
  4.6 Chapter Discussion
5 NoC: Hardware Design
  5.1 Model Requirements
  5.2 Design Structure
    5.2.1 Micro-Router
    5.2.2 RNI
    5.2.3 Resource
  5.3 Design and Simulation Tool
  5.4 Simulation Results
    5.4.1 Simulated values
    5.4.2 Simulated and Implemented values
  5.5 Chapter Discussion
6 NoC: Prototyping on FPGA
  6.1 Prototype Board
  6.2 Functional Description
    6.2.1 Communication
    6.2.2 Resources
    6.2.3 I/O-ports
  6.3 Technology Mapping tool
  6.4 Implementation Result
  6.5 Chapter Discussion
7 Results
  7.1 SDL Modelling and Simulation of NoC
  7.2 Designing NoC using VHDL
  7.3 Implementation of NoC prototype in FPGA
8 Conclusions
9 Vocabulary
List of Figures
FIGURE 1-1. RESOURCES IN A NOC
FIGURE 2-1. LAYERS IN THE OSI-MODEL
FIGURE 2-2. NETWORK TOPOLOGIES
FIGURE 2-3. A SIMPLE SDL SYSTEM
FIGURE 3-1. DESIGN REFINEMENT
FIGURE 3-2. BLOCKS IN MICRO-ROUTER
FIGURE 3-3. LAYERS IN THE NOC
FIGURE 3-4. PACKET STRUCTURES IN DIFFERENT LAYERS
FIGURE 3-5. RNI INTERFACE
FIGURE 3-6. RESOURCE
FIGURE 3-7. CONNECTIONS TO ENVIRONMENT
FIGURE 4-1. A NODE IN THE NOC
FIGURE 4-2. INTERNAL BLOCKS OF A NODE
FIGURE 4-3. SDL BLOCKS IN THE NETWORK LAYER
FIGURE 4-4. NETWORK OVERVIEW
FIGURE 4-5. TABLE OF RESOURCE CONFIGURATION
FIGURE 4-6. TRANSFER STATISTICS FOR 1 CONTINUOUS AND 15 BURSTY RESOURCES
FIGURE 4-7. TRANSFER MEAN TIME FOR 1 CONTINUOUS AND 15 BURSTY RESOURCES
FIGURE 4-8. SPREADING FACTOR WITH 1 CONTINUOUS AND 15 BURSTY RESOURCES
FIGURE 4-9. SIMULATION SET-UP FOR 16 VS. 14 BURSTY RESOURCES
FIGURE 4-10. NUMBER OF TRANSFERRED PACKETS, 16 VS. 14 BURSTY RESOURCES
FIGURE 4-11. NUMBER OF CANCELLED PACKETS, 16 VS. 14 BURSTY RESOURCES
FIGURE 4-12. NUMBER OF DROPPED PACKETS, 16 VS. 14 BURSTY RESOURCES
FIGURE 4-13. TRANSFER MEANTIME, 16 VS. 14 BURSTY RESOURCES
FIGURE 4-14. SIMULATION RESULTS WITH DIFFERENT BURST LENGTH
FIGURE 4-15. TRANSFER STATISTICS FOR 1 CONTINUOUS AND 15 BURSTY RESOURCES
FIGURE 4-16. TRANSFER MEAN TIME FOR 1 CONTINUOUS AND 15 BURSTY RESOURCES
FIGURE 4-17. NUMBER OF TRANSFERRED PACKETS, 16 VS. 14 BURSTY RESOURCES
FIGURE 4-18. NUMBER OF CANCELLED PACKETS, 14 VS. 16 BURSTY RESOURCES
FIGURE 4-19. NUMBER OF DROPPED PACKETS, 14 VS. 16 BURSTY RESOURCES
FIGURE 4-20. TRANSFER MEANTIME, 14 VS. 16 BURSTY RESOURCES
FIGURE 4-21. SIMULATION RESULTS WITH DIFFERENT BURST LENGTH
FIGURE 5-1. VHDL BLOCK MODEL OF NOC AT NODE LEVEL
FIGURE 5-2. VHDL MODEL OF NOC AT NETWORK LAYER
FIGURE 5-3. VHDL BLOCK MODEL OF RNI AND A RESOURCE
FIGURE 5-4. THE COMMUNICATION PROCESS BETWEEN A SWITCH AND BUFFERS
FIGURE 6-1. OVERVIEW OF NETWORK ON CHIP PROTOTYPE IN FPGA
FIGURE 6-2. BITS IN IMPLEMENTATION
FIGURE 6-3. IMPLEMENTATION RESULT
FIGURE 7-1. DESCRIPTION OF COMMUNICATION IN NOC-PROTOTYPE
Introduction
1.2.1 NoC Evaluation Tools
Today the development of NoC is focused on finding a suitable network configuration. The ideas have to be tested, and therefore tools for evaluating a NoC design have to be considered. If a NoC design is to claim some degree of efficiency, the claim has to be supported by performance simulations. There are several network simulators available; for example, NS-2 [15] has been used for this purpose. NS-2 was, however, not designed for NoC, and its configuration possibilities seem unable to meet the requirements of a NoC model. Another idea is to use an ordinary high-level programming language, like C++, to build a simulator. Such a simulator can of course be made as accurate as the developer wants, but it takes a lot of time to develop. The idea of this project is to use a system-level description language to build a model that meets the requirements of a NoC simulator, so that the results are valid for a specific design.
SDL supports hierarchical division of the system using blocks, which can be used to describe functional or physical units. The behaviour of the system can be specified using concurrent processes. The structural properties of the language make it possible to use a model to simplify the lower levels of the system design.
1.5 Outline
In this chapter the reader is introduced to the NoC concept and its motivation. The purpose of the project is also defined here. Chapter 2, entitled Theoretical Background, describes theories about networks in general and some different ideas on how a NoC could be designed. The purpose is to help the reader understand the area and to show that there are many different aspects to consider when designing a network. The theories presented there are also the basis on which the design is built, and after reading that chapter the reader should be able to understand the design decisions and limitations of the model. In chapter 3, entitled NoC: Design Decisions, the overall decisions regarding all the stages of the design are presented and motivated. Chapter 4, entitled NoC: Modelling in SDL, describes the system-level design and functionality of the NoC in SDL. Chapter 5, entitled NoC: Hardware Design, discusses the VHDL design of various components of a NoC system. In chapter 6, entitled NoC: Prototyping on FPGA, we discuss issues of implementing a small NoC prototype on a programmable platform such as an FPGA. Chapters 4 to 6 also present partial results and conclusions about NoC from the various design phases. Chapter 7, entitled Results, presents the overall results and summarises the results obtained in the previous chapters. In chapter 8, entitled Conclusions, some important thoughts about the project and the results are discussed, and proposals for future work are given.
Theoretical Background
This chapter describes the theoretical background and the programming language on which the project is based. It also gives a brief description of what other researchers have presented in the NoC area.
[Figure 2-1. Layers in the OSI model. Two layer stacks, each with the layers Application, Presentation, Session, Transport, Network, Data Link and Physical.]
2.1.3 Routers
In a switched network there is a need to find a route through several switches. Therefore the switches that are cross-points in the network also implement a routing function, and because of this functionality they are called routers. The routing algorithm is a very important part of the router, since its task is to route every packet in the right direction. Some routing algorithms are able to tell which route is the fastest, not only which route is the shortest.
The two main kinds of routing algorithms are static and adaptive. In static routing there is one fixed path, or possibly a few fixed paths, between sender and receiver. The routing changes very slowly over time, if at all, and a change is often the result of human intervention. Adaptive routing, also called dynamic routing, alters the route of packets dynamically, for example according to network traffic or changes in the topology.
A global routing algorithm has complete information about connectivity and link costs in the network. The algorithm can thereby compute the least-cost path between source and receiver. The calculation itself can be run at one site or at multiple sites. A decentralized routing algorithm calculates the least-cost path by having each node communicate with its neighbours. In the beginning a node only knows the costs of its own directly attached links; then, through an iterative process of communication between nodes, the least-cost path to a destination is calculated.
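The iterative, decentralized calculation described above can be illustrated with a small distance-vector sketch in Python. This is only an illustration under stated assumptions: the four-node graph, the link costs and the function name `distance_vector` are made up for the example and do not come from the thesis.

```python
# Decentralized least-cost computation in the distance-vector style.
# The graph and its link costs are hypothetical, for illustration only.
INF = float("inf")

def distance_vector(links):
    """links: dict node -> dict neighbour -> link cost (symmetric)."""
    nodes = list(links)
    # Each node starts out knowing only the cost of its own attached links.
    dist = {u: {v: (0 if u == v else links[u].get(v, INF)) for v in nodes}
            for u in nodes}
    changed = True
    while changed:                      # repeat until no table changes
        changed = False
        for u in nodes:
            for n, cost_un in links[u].items():   # u reads neighbour n's table
                for dest in nodes:
                    candidate = cost_un + dist[n][dest]
                    if candidate < dist[u][dest]:
                        dist[u][dest] = candidate
                        changed = True
    return dist

links = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 2, "D": 4},
    "C": {"A": 5, "B": 2, "D": 1},
    "D": {"B": 4, "C": 1},
}
tables = distance_vector(links)
print(tables["A"]["D"])  # least cost from A to D via B and C
```

In this example node A initially sees D as unreachable, and only learns the cost of the path A-B-C-D after its neighbours' tables have been exchanged a few times.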
2.1.4 Buffers
If buffers are added to a switch in order to store packets at times when the network is overloaded, the probability that packets will be dropped decreases. Some switches use only a single output buffer and multiple input buffers, which can cause the problem called head-of-line blocking [6], often seen in such switches. This fault appears when the first message in the FIFO queue of an input buffer cannot be sent because its desired output is not available. The next packet cannot pass, since it has to wait for the packet first in line to be sent. Multiple output buffers with single input buffers do not suffer from this kind of problem. Their main drawback, however, is that packets may be rejected if the rate of transmission to the router is higher than the router can handle. Another method is to use shared memory in the switch. The problem with this type is that it may result in a slower system, since memory access requires some amount of synchronization and organization.
2.1.5 Topologies
Network topology refers to the shape of the network. The topology determines how the different nodes in a network are connected to each other and how they communicate.
[Figure 2-2. Network topologies. Panel labels include Full Mesh, Star and Bus.]
Mesh topology comes in two types, full mesh and partial mesh. Full mesh means that a node is connected to every other node in the network. This is a very costly method and is mostly used to connect buses. Partial mesh means that a node does not have to be directly connected to all other nodes. This type of mesh is not as costly as full mesh, but the disadvantage is less redundancy.
2D-array is a type of mesh in which nodes form a two-dimensional grid where each node is connected to its four adjacent routers. The routers at the edges have only two or three connections, since they do not have more adjacent routers. The number of nodes is C x R, where C is the number of columns and R is the number of rows.
Torus is a topology similar to the 2D-array, in which nodes form a regular cyclic two-dimensional grid. Here all routers have four connections, since a torus is basically a mesh with wrap-around at the edges.
Star topology uses a central hub to which all resources are connected. All communication between resources then passes through the central hub.
Ring topology means that the resources are connected to each other in a ring. Every resource is connected to its two neighbours, and communication with other resources has to pass through the neighbours.
Bus topology means that several resources use the same communication channel. In an ordinary local area network this can result in collisions, caused by two resources sending a packet at the same time. Collisions can be avoided by letting each resource send its packets in a time slot that is unique to that resource.
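To make the difference between the 2D-array and the torus concrete, the following Python sketch computes the adjacent routers of a node in both topologies. The function names and the (row, column) addressing starting at 0 are assumptions made for this example.

```python
def mesh_neighbours(row, col, rows, cols):
    """Adjacent routers of (row, col) in a 2D-array (no wrap-around)."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]

def torus_neighbours(row, col, rows, cols):
    """Adjacent routers of (row, col) in a torus (edges wrap around)."""
    return [((row - 1) % rows, col), ((row + 1) % rows, col),
            (row, (col - 1) % cols), (row, (col + 1) % cols)]

R, C = 4, 4   # a 4 x 4 grid has C x R = 16 nodes
mesh_degrees = sorted({len(mesh_neighbours(r, c, R, C))
                       for r in range(R) for c in range(C)})
print(R * C, mesh_degrees)            # corner, edge and inner routers differ
print(torus_neighbours(0, 0, R, C))   # a corner router still has four links
```

In the 2D-array a corner router has two connections and an edge router three, while in the torus the wrap-around links give every router four.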
The general network is called CLICHÉ (Chip-Level Integration of Communicating Heterogeneous Elements). For more special purposes, where performance is of greater importance, the network may have to support the concept of regions. These regions will not necessarily have the same structure and communication mechanisms as the rest of the network.
Figure 2-3. A simple SDL system.
In the upper left is the system specification, which in this case contains a block called Ware_Machine and channels that connect it to the environment. The channel CoinInput can carry a signal called Coin, which is of the type CoinType that is declared as a new type. The interior of block Ware_Machine is shown in the upper left, and in this case it contains one process called Ware_Machine_Process.
In this process the behaviour is described as an EFSM, shown in the lower part of the picture. Let us follow what happens if a coin is inserted in the CoinInput. The EFSM makes a transition, leaves the state Wait_for_Coin and checks CoinSort in the decision box. If it is anything other than a TenCrown, it puts Coin on CoinOutput and returns to Wait_for_Coin. If CoinSort is TenCrown, a Task with the ANY operator randomly chooses one of the wares declared in WareType: a key ring, a plastic snake or a small doll. The chosen ware is assigned to WareSort and sent with the signal Ware on Ware_Output. After this the EFSM returns to Wait_for_Coin.
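The behaviour of Ware_Machine_Process can be mirrored in a few lines of Python. This is a minimal sketch and not SDL: the class name, the string values used for the wares and the coin name "FiveCrown" are hypothetical, and `random.choice` only stands in for the ANY operator.

```python
import random

# Stand-in for WareType; the string values are hypothetical names.
WARES = ["KeyRing", "PlasticSnake", "SmallDoll"]

class WareMachineProcess:
    """EFSM with the single state Wait_for_Coin, as in Figure 2-3."""

    def __init__(self, rng=None):
        self.state = "Wait_for_Coin"
        self.rng = rng or random.Random()

    def coin_input(self, coin_sort):
        """Handle signal Coin on CoinInput; return (channel, signal, value)."""
        assert self.state == "Wait_for_Coin"
        if coin_sort != "TenCrown":
            # Decision box: any other coin is returned on CoinOutput.
            output = ("CoinOutput", "Coin", coin_sort)
        else:
            ware_sort = self.rng.choice(WARES)   # plays the role of ANY
            output = ("Ware_Output", "Ware", ware_sort)
        self.state = "Wait_for_Coin"             # back to the idle state
        return output

machine = WareMachineProcess()
print(machine.coin_input("FiveCrown"))   # the coin comes back
print(machine.coin_input("TenCrown"))    # one randomly chosen ware
```

Each call corresponds to one transition of the EFSM: a signal is consumed, at most one signal is emitted, and the process returns to Wait_for_Coin.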
Before and during the project it is necessary to make several design decisions, since detailed specifications for the project are not defined. The reason for this is that there is not enough information available to make a detailed specification before the start.
[Figure 3-2. Blocks in micro-router: a Route-Control block and four In/Out Buffer blocks.]
3.4.1 Drop of Packets
The packet-transferring components in the design are connected in such a way that a sender cannot send a packet unless it gets a ready-to-receive signal from the receiver. A receiver does not give this signal if it is full. No packets will therefore be dropped between these components. Drops are allowed in the micro-router when it has been impossible to route a packet for a certain amount of time, in order to prevent deadlock or other unwanted behaviour. This strategy is chosen because, when the network is heavily loaded, the resources are then unable to send packets that would most likely be dropped. This brings the information closer to the source, and the decision on how to react is made by the resource.
3.4.2 Physical issues
In [1] the proposed bus width is set to 300 wires, including address and control wires. This is based on the assumption that each side of the router is physically large enough for this number of I/Os. Once a bus width is decided, there must also be a decision on the direction in which the wires are used. There could be a bus where all wires are used by both routers to send and receive data. Another way is to use half of the bus as output on one router and input on the other, and vice versa. This means that less data is transmitted in one direction, but communication becomes easier and it is possible to send and receive at the same time. For simulation and implementation in this project it should not be necessary to use such a wide bus, as it would only slow simulation down and take up a lot of space in the FPGA. It is however useful to have a bus width that is easily controlled, and it should be possible to set the width to any size. The physical layer defines the actual physical connections that are the basis for a network. Since the implementation is intended for an existing FPGA with a fixed structure, the advantage of modelling the physical layer in SDL is not clear. It is therefore not included in the design.
3.4.3 Data-link layer connection
The data-link layer deals with how the transfer of data between two micro-routers can be made reliable. Data is grouped into frames that may consist of, for example, a header, payload and checksum. In this design the payload is a packet from the network layer. One data-link frame is transferred in parallel to the neighbouring node. Every frame is checked for errors, and retransmission is ordered if it is incorrect. The transfer of bits on the bus has to be synchronised in some way. The method chosen to realise this is a simple handshaking protocol. For example, data is present on the bus on a write signal from the sender and is acknowledged by the receiver with another signal.
3.4.4 Network layer connection
This layer concerns the transfer of data between arbitrary nodes in the network. This is done by the routing function in the micro-routers. A message at this layer may consist of several packets, each carrying the address of its destination. The purpose of the network layer is to make sure that the packets reach their desired address in a way that is decided by the routing algorithm. The network layer is considered unreliable, because there is no confirmation that a message has reached its destination. As mentioned in the beginning of this chapter, it should be possible to simulate the design without the data-link layer. In this case the connection is made directly between the buffers in the opposite nodes. Figure 3-3 gives a graphical explanation.
[Figure 3-3. Layers in the NoC. Two layer stacks, each with the layers Transport, Network, Data Link and Physical; a horizontal Network Connection joins the two Network layers.]
The vertical connection between the network and data-link layers is in this case broken. The transfer of data on the network layer is done in the same manner as if it took place within the same node, i.e. as a transfer between the network and data-link layers. In Figure 3-3 this is shown as the horizontal network connection.
3.4.5 Transport layer
The purpose of the transport layer is to provide a reliable transfer of messages between the resources that use the network. For example, if a packet is dropped in the network, this is detected and the transport layer takes the necessary actions. The micro-routers do not implement the transport layer, since they only operate up to the network layer. In the project design the resources are very simplified and transmit transport layer packets. As a result, the RNI does not perform transport layer services but provides the service of the network layer to these packets.
Figure 3-4. Packet structures in different layers.
In the transport layer it is necessary to be able to identify the Destination Process ID (DPID) and the Source Process ID (SPID). Every message that a process sends has a Message Sequence Number. If a message is too large to fit into one packet, it is divided into several packets, and thus every packet has a Packet Sequence Number (PSN). In a real situation the payload in the transport layer is a packet from a higher-level service, but in this model it is sent as test dummy bits. The network layer adds the destination RID with the actual row address and column address of the node in the network where the DPID resides. The payload of this layer is a packet from the transport layer. To make sure that a packet with errors does not stay around in the network, there is the possibility of a Hop Counter (HC) that is counted up for each router the packet passes. The data-link layer frames consist of a payload in the form of a packet from the network layer and an error check field. Example sizes in bits of the fields in each layer are shown in Figure 3-4. In the SDL model there is no reason to limit the size of the fields, and therefore the type integer is used. The size of each field is limited only in VHDL, because a fixed size must be set in order to implement it.
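The nesting of the three packet structures can be sketched with Python dataclasses. This is only an illustration: the field names, the abbreviation `msn`, the PSN numbering from 0 and the XOR-based error check are assumptions made for the example, and no field sizes from Figure 3-4 are reproduced.

```python
from dataclasses import dataclass

@dataclass
class TransportPacket:
    dpid: int        # Destination Process ID
    spid: int        # Source Process ID
    msn: int         # Message Sequence Number (abbreviation assumed)
    psn: int         # Packet Sequence Number
    payload: bytes   # test dummy bits in the model

@dataclass
class NetworkPacket:
    dest_row: int    # row address of the destination node
    dest_col: int    # column address of the destination node
    hop_counter: int # counted up for each router passed
    payload: TransportPacket

@dataclass
class DataLinkFrame:
    payload: NetworkPacket
    error_check: int

def split_message(dpid, spid, msn, message, max_payload):
    """Divide a message into packets; PSN numbering from 0 is an assumption."""
    return [TransportPacket(dpid, spid, msn, psn, message[i:i + max_payload])
            for psn, i in enumerate(range(0, len(message), max_payload))]

def parity(data):
    """Toy error-check value: XOR of all payload bytes (an assumption)."""
    value = 0
    for byte in data:
        value ^= byte
    return value

def encapsulate(packet, dest_row, dest_col):
    """Wrap a transport packet in a network packet and a data-link frame."""
    network_packet = NetworkPacket(dest_row, dest_col, 0, packet)
    return DataLinkFrame(network_packet, parity(packet.payload))

packets = split_message(dpid=7, spid=3, msn=1, message=b"abcdefghij", max_payload=4)
frame = encapsulate(packets[0], dest_row=2, dest_col=3)
print(len(packets), frame.payload.dest_row, frame.payload.dest_col)
```

Each layer only adds its own fields around the packet from the layer above, which is the property the figure is meant to show.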
3.6 Buffers
Buffers make it possible to store packets for a while, without dropping them, if they cannot be transmitted further at the moment. A major design issue is how large the buffers in a design can be while remaining cost effective. The larger the buffer, the higher the cost in terms of chip area. To balance these conflicting demands, it is decided to use a maximum size of 4 buffers for our evaluation purposes.
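A bounded buffer combined with the ready-to-receive rule from Section 3.4.1 can be sketched as follows. The class and function names are hypothetical, and the default capacity of 4 is only an assumption for the example; the sketch shows the flow-control idea, not the SDL or VHDL implementation.

```python
from collections import deque

class PacketBuffer:
    """FIFO buffer that only accepts packets while it is not full."""

    def __init__(self, capacity=4):      # capacity value is an assumption
        self.capacity = capacity
        self.queue = deque()

    def ready_to_receive(self):
        return len(self.queue) < self.capacity

    def put(self, packet):
        if not self.ready_to_receive():
            raise RuntimeError("sender must wait for ready-to-receive")
        self.queue.append(packet)

    def get(self):
        return self.queue.popleft()

def transfer(sender, receiver):
    """Move one packet if the sender has one and the receiver signals ready."""
    if sender.queue and receiver.ready_to_receive():
        receiver.put(sender.get())
        return True
    return False

upstream = PacketBuffer(capacity=4)
downstream = PacketBuffer(capacity=2)
for name in ("p1", "p2", "p3"):
    upstream.put(name)
results = [transfer(upstream, downstream) for _ in range(3)]
print(results, list(upstream.queue), list(downstream.queue))
```

Because the third transfer is refused rather than forced, the packet stays in the upstream buffer and nothing is dropped between the two components.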
3.9 Resource
To get good simulation results, the resources in the model should resemble the behaviour of real resources. There can be different kinds of resources in a real system, such as DSPs, general-purpose processors or memory. In VHDL the resource will be made very simple due to lack of time.
There are many ways to look at NoC, but since it is basically a communication network, it is a good idea to divide it into structures using the layered network model. The hierarchical and regular structure of a NoC architecture is quite suitable for simulation using SDL.
Figure 4-1. A node in the NoC
The internal structure of the Node block is shown in Figure 4-2. The IO_Switch is a connection that is used only during the initialisation of the simulation, in order to get all the created routers connected appropriately. Once this is done the connection is no longer used.
4.2.3 Micro-Router
The block called Router is a micro-router designed with functional blocks that separate the network layer and the data-link layer. Figure 4-3 shows the network layer block, consisting of one switch/routing block, 4 in/out buffers, and 1 RNI in/out buffer. This block has two optional modes; it can communicate with the neighbouring routers either directly at the network layer or through the data-link layer.

Switch and Route unit
The block SWITCH_CONTROL contains the processes Switch and Route_Control. Although they behave sequentially in this design, they have been divided into two processes in order to make it possible to run them independently of each other. This is used, for example, when the Route_Control is updating the out-buffer states. The Switch receives packets from the in-buffers and sends information about each packet's desired destination to the Route_Control.
The Route_Control has information about the state of the out-buffers and makes a decision about the route for the packet. The first choice is to send it north or south according to the destination. If that is not possible it investigates west or east. If none of the preferred buffers is available it sends the packet to any other free buffer. If there is no free buffer the Switch is informed about this. When there are free buffers the Switch receives a message telling it in which buffer to put the packet. If there was no free buffer at that time, the Switch starts a counter and sends a routing request for another packet. The counter is reset when a packet is routed to an out-buffer. If the routing request has failed and the counter has reached a timeout value (16 in our case) the packet is dropped.

Buffers
The buffers use the BUFFER_TYPE block and handle all communication in one direction (North, South, West, or East). To make it possible to set a different configuration for the buffer connecting to the RNI, this buffer is of a special type called RNI_BUFFER_TYPE. These blocks are also able to communicate with the neighbouring router or RNI, both directly on the network layer and via the data-link layer.

Data-link layer
The data-link layer is placed in a separate block that connects to the buffers and performs the services of the data-link layer. The type used is D_IO_TYPE.

4.2.4 RNI
The RNI contains the units that perform the actions of the Resource Network Interface. It encapsulates the block type N_LAYER with processes that perform the network layer services in the RNI. This block operates in two optional modes; it can communicate with the neighbouring router either directly at the network layer or through the data-link layer. It contains one RNI service block and one in/out-buffer towards the micro-router. It is also possible to use a data-link layer block.
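Returning to the Route_Control of Section 4.2.3, its preference order can be sketched in Python as follows. This is an illustration under stated assumptions and not the SDL process itself: positions are (row, column) pairs with rows growing southwards and columns growing eastwards, a packet that has reached its destination node prefers the RNI buffer, and the fallback scans the free buffers in the fixed order N, S, W, E. The function name and the direction labels are hypothetical.

```python
def choose_out_buffer(current, destination, free):
    """Pick an out-buffer for a packet, or None if no buffer is free.

    current, destination: (row, column) positions in the mesh.
    free: set of labels ("N", "S", "W", "E", "RNI") whose out-buffer has space.
    """
    row, col = current
    dest_row, dest_col = destination

    preferred = []
    if (row, col) == (dest_row, dest_col):
        preferred.append("RNI")   # assumption: deliver to the local resource
    # First choice: north or south according to the destination.
    if dest_row < row:
        preferred.append("N")
    elif dest_row > row:
        preferred.append("S")
    # Second choice: west or east.
    if dest_col < col:
        preferred.append("W")
    elif dest_col > col:
        preferred.append("E")

    for direction in preferred:
        if direction in free:
            return direction
    # No preferred buffer available: take any other free buffer.
    for direction in ("N", "S", "W", "E"):
        if direction in free:
            return direction
    return None  # the Switch is informed that no buffer is free
```

The last branch is what sends packets along non-optimal routes under load, which is the effect measured later as spreading.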
RNI service
The block RNI_SERVICE is responsible for the transformation of data between the resource and the network. In this design the resource sends transport layer (T_Layer) packets. In the process RNI_OUT, network layer information is attached to the T_Layer packets, which are then sent out on the network. It is possible to set a certain delay for the operations in this process. In the process RNI_IN the N_Layer packets are unpacked to transport layer packets and sent to the resource. It is also possible to set a certain delay for the operations at this stage.

Buffer
The N_BUFFER_TYPE is used to handle the communication with the network router on the network layer or via the data-link layer. It is similar to the BUFFER_TYPE in the router, except that it is a type of its own to allow separate configuration options.

Data-link layer
In the block D_LAYER there is one sub-block of the type D_IO_TYPE.
4.2.5 Resource
The block RESOURCE is a container for the type of resource that is connected. It is modelled using two processes, one for sending packets and one for receiving packets.

Packet Sender
The process Sender simulates the behaviour of the resource. In this model the resource simulates the transport layer service, which means that data is sent and received as transport layer packets. Between the RNI and the resource there is only a transport layer connection. Adding lower layers here would not give any benefit for evaluating the model. Since the resource and the RNI sit in the same node there is no reason to implement the network layer between them.

Different behaviours can be set with procedures for individual resources. We provide the possibility to choose between a bursty and a continuous base behaviour. Bursty behaviour means that a certain number of packets, called a burst, is sent back to back at maximum rate, followed by a delay, called the burst gap, before the process repeats itself. The number of packets in a burst is randomly selected according to the Poisson distribution. The burst gap is calculated from a random delay that is uniformly distributed between a minimum and a maximum value. The length of the burst gap is weighted with the number of packets in the burst, so that a burst with many packets is followed by a longer burst gap. The destination addresses are selected randomly within the limits of the network, but addresses situated next to the sender have double the probability of being chosen, since this is the most likely scenario for communication. With the continuous behaviour it is possible to set the delay between two transmitted packets, which gives the output frequency of packets.

Packet Receiver
In the Receiver process the time of arrival of the packet is noted and the packet information is written to a file. It is then possible to compare the logged information from the receiver with other values.
We can get a lot of information from these logs, for instance the number of packets sent and received, the transfer time and many other interesting figures.

4.2.6 Description of Common Types
The following is a description of some common types defined to model the NoC.

Block type BUFFER_TYPE
Process DATA_IN_TYPE receives data from the data-link layer (D_N_TYPE) or from the neighbouring router's buffer (DATA_OUT_TYPE) and buffers it. When SWITCH_TYPE is ready to pass the data on, it is removed from the buffer. DATA_OUT_TYPE receives data from SWITCH_TYPE and buffers it. The data is then passed on either to the neighbouring router (DATA_IN_TYPE) or to the data-link layer (D_N_TYPE) and is thereafter removed from the buffer.

Block type D_IO_TYPE
This block contains the processes that perform the data-link layer services. The processes in this block are not active during network layer simulation.
The process N_D_TYPE consumes network layer data packets from the network layer and frames them with data-link layer information and an error check. After that it passes them to the data link when the receiver is ready to receive. If an error is introduced in the transmission, the failing message is retransmitted. Process D_N_TYPE receives data-link layer data packets from the data link, removes the data-link layer framing and checks for errors. After that it passes them to the network. If there is an error in the transmission, a signal is sent to the sender.
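The framing and retransmission behaviour of N_D_TYPE and D_N_TYPE can be sketched roughly as below. The thesis does not state which error check is used, so the one-byte sum in `error_check` is purely a stand-in. The function names, the `channel` callback and the `max_tries` limit are likewise hypothetical.

```python
def error_check(payload: bytes) -> int:
    """Stand-in error check: sum of the payload bytes modulo 256 (assumption)."""
    return sum(payload) % 256


def frame(payload: bytes) -> bytes:
    """N_D_TYPE side: append the error check field to a network layer packet."""
    return payload + bytes([error_check(payload)])


def unframe(received: bytes):
    """D_N_TYPE side: return the payload, or None to signal an error to the sender."""
    payload, check = received[:-1], received[-1]
    return payload if error_check(payload) == check else None


def send_with_retransmission(payload: bytes, channel, max_tries: int = 3):
    """Send a frame over `channel`, retransmitting when the receiver reports an error.

    Returns (delivered payload or None, number of attempts used).
    """
    for attempt in range(1, max_tries + 1):
        delivered = unframe(channel(frame(payload)))
        if delivered is not None:
            return delivered, attempt
    return None, max_tries
```

A channel that corrupts only the first transmission would lead to one retransmission, after which the payload is delivered unchanged.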
Burstiness - Same average packet rate but a change in burst gap and packets/burst.
Network clock speed - Change the delays of the network components.
Network load - Change the number of sending resources.

Network Size
It was decided that a 4*4 node network, as in Figure 4-4, should be used for the simulations. This size was chosen because a smaller network would not test the routing thoroughly and a larger one would take too long to simulate and analyse. For a SoC it also seems like a size that could be realistic in the near future. After studying some simulations the following set-ups were used for the experiments:
1. A mixed set-up according to Figure 4-5 with 1 continuous and 15 bursty resources. The positions of the resources are the same for all simulations in order to make a fair comparison.
2. All sending resources set up as bursty according to the Poisson distribution.
3. Fewer sending resources set up as bursty.
The last two give a comparison of how a change in load, in number of resources, affects performance.
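The bursty sender behaviour of Section 4.2.5, which is used in all three set-ups, can be approximated with the following Python sketch. Several details are assumptions, since the thesis does not give them. The gap is scaled linearly by burst size over mean burst size. A burst holds at least one packet. "Next to the sender" is read as the direct mesh neighbours. A resource never addresses itself. All function names are hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson-distributed integer using Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return k
        k += 1


def pick_destination(rng, sender, rows=4, cols=4):
    """Random destination; direct neighbours of the sender get double weight."""
    nodes = [(r, c) for r in range(1, rows + 1) for c in range(1, cols + 1)
             if (r, c) != sender]
    weights = [2 if abs(r - sender[0]) + abs(c - sender[1]) == 1 else 1
               for r, c in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


def next_burst(rng, mean_packets, gap_min, gap_max):
    """Return (packets in the burst, burst gap that follows it)."""
    packets = max(1, poisson(rng, mean_packets))   # assumption: no empty bursts
    base_gap = rng.uniform(gap_min, gap_max)
    # Assumed weighting: bigger bursts are followed by proportionally longer gaps.
    return packets, base_gap * packets / mean_packets
```

Scaling the gap with the burst size keeps the long-run packet rate roughly constant, which is what allows burstiness to be changed without changing the average load.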
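[Figure 4-4: the 4*4 node network used for the simulations, with nodes labelled by position (for example 1,4 to 4,4).]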
Figure 4-5 shows what kind of communication the resources in Figure 4-4 have.
Position  Resource  Destination  Behaviour   Max Rate/Packet (MHz)  Mean Rate (MHz)  Mean Burst Gap (us)  Mean Nr of packets/Burst
1,1       0         3,1          Burst       384                    100              59,2                 8
1,2       1         1,1          Continuous  64                     64
1,2       1         2,2          Continuous  64                     64
1,2       1         4,4          Continuous  64                     64
1,2       1         3,3          Continuous  64                     64
1,2       1         2,3          Continuous  64                     64
1,2       1         1,4          Continuous  64                     64
1,3       2         2,4          Burst       384                    100              59,2                 8
1,4       3         1,3          Burst       384                    100              59,2                 8
2,1       4         3,1          Burst       384                    100              59,2                 8
2,2       5         4,1          Burst       384                    100              59,2                 8
2,3       6         2,1          Burst       384                    100              59,2                 8
2,4       7         2,1          Burst       384                    100              59,2                 8
3,1       8         4,3          Burst       384                    100              59,2                 8
3,2       9         4,3          Burst       384                    100              59,2                 8
3,3       10        3,2          Burst       384                    100              59,2                 8
3,4       11        3,2          Burst       384                    100              59,2                 8
4,1       12        3,4          Burst       384                    100              59,2                 8
4,2       13        4,3          Burst       384                    100              59,2                 8
4,3       14        4,1          Burst       384                    100              59,2                 8
4,4       15        4,2          Burst       384                    100              59,2                 8
Simulation Period The amount of time that the simulations should last was set to 10 000 time-units (clock cycles). Tests up to 200 000 time- units showed that 10 000 time-units were enough to produce reliable data. Since the time to deal with the data increased rapidly, 10 000 time-units were chosen.
4.5.1.1 Simulations with 1 Continuous and 15 Bursty Resources
This simulation is based on the resource behaviour shown in Figure 4-5. When evaluating what happens to the sent packets we have divided them into three categories, namely Transferred, Cancelled and Dropped. Transferred packets are those that have been sent on time, that is, when the resource wants to put them on the network, and have reached their destination. Cancelled packets are those that have not been sent on time because the buffers are full. These packets are dropped by the transmitting resource. Dropped packets are those that are sent on time but are later dropped by a micro-router because the required buffers are full.

When looking at the results it can be seen that a network that works with 0,5 time units per clock cycle gets no dropped packets and only a few cancelled packets. The network is so fast that the buffer configuration has little or no influence on the performance. When the speed of the network is lowered to 0,6 time units per clock cycle, we start to see the effects of the buffers. The configurations with only one output buffer begin to drop packets and also to cancel many packets. When more than one output buffer is used the network works quite satisfactorily, with no drops and only a few cancelled packets. The highest throughput is reached by the [in-buffer:out-buffer] configuration 02:02, because it results in the largest number of transferred packets. The difference in performance between 01:02 and 02:02 is not very large, so it is probably not worth the extra cost in resources. The reason that the network works better with more than one output buffer is that more packets can be routed along an optimal route, which results in a lower number of routings along their path.

From what we have seen so far, 02:02 is the best configuration. The reason for this is that the advantages of both input and output buffers are combined. The input buffers give the router the possibility of a high throughput, because there are always messages in the buffers that are ready to be routed, and the output buffers make it possible to route the packets in the correct direction. If we look at the results for clk = 0,7 and clk = 0,8 we can once again see that the results from simulations with more than one output buffer are superior to those with only one output buffer. At clk = 0,8 the buffer configurations 01:03 and 02:02 get almost the same results; this can be connected to the higher spreading that 02:02 causes when the load is high, see Figure 4-8.
Number of dropped / cancelled packets per buffer configuration (In:Out)

          01:01          01:02     01:03     02:01        02:02     03:01
Clk=0,5   0 / 66         0 / 19    0 / 13    0 / 17       0 / 2     0 / 3
Clk=0,6   953 / 2671     0 / 140   0 / 118   2903 / 4167  0 / 35    2411 / 3078
Clk=0,7   3170 / 7381    0 / 517   0 / 442   3763 / 6788  0 / 194   4008 / 6246
Clk=0,8   4593 / 11050   0 / 1300  0 / 1084  4195 / 8929  0 / 1096  5027 / 9127
Figure 4-6. Transfer statistics for 1 continuous and 15 bursty resources
The mean time to transfer packets, for different network speeds and buffer configurations, once again shows the configurations with several output buffers to be superior. The configuration 02:02 is slightly faster than 01:02 and 01:03 on networks working at 0,5-0,6 time units. When the period time is increased further, the configurations with one input buffer do not increase their mean transfer time as much as 02:02 does. The reason for this may be that the probability that the wanted output buffer is full is lower, since more time passes between two routings from the same input buffer. Once a packet is routed, a new packet has to be transferred into the input buffer, and during this time space can become available in the desired output buffer. This results in less spreading of the packets, see Figure 4-8, and a lower number of switchings for each packet, which in the end gives a lower transfer time for the packets.
Mean transfer time (time units)

         Clk=0,5  Clk=0,6  Clk=0,7  Clk=0,8
01:01    18       41       81       230
01:02    14       19       26
01:03    14       19       26
02:01    18       58       97
02:02    14       19       27
03:01    17       59       113      236
Figure 4-7. Transfer mean time for 1 continuous and 15 bursty resources
Figure 4-8 is intended to show how the packets travel within the NoC. We call this measurement the spreading factor. It is defined as the percentage of packet switchings that are made by switches outside the packet's optimal path. For example, a 60 % spreading means that out of one hundred packet switchings, 60 are made by switches that lie outside the packets' optimal paths and 40 are made by switches that lie on the optimal paths. If a packet does not travel the shortest possible path the number of switchings increases rapidly, which may cause drops if there is heavy traffic.
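As a worked illustration of the spreading factor, the following Python sketch computes it from a log of switchings. It assumes that a switch counts as lying on a packet's optimal path when it is on some shortest mesh route between source and destination. The thesis does not spell out this criterion, and the function names are hypothetical.

```python
def manhattan(a, b):
    """Hop distance between two (row, column) positions in the mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def on_optimal_path(source, destination, switch):
    """True if `switch` lies on some shortest route from source to destination."""
    return (manhattan(source, switch) + manhattan(switch, destination)
            == manhattan(source, destination))


def spreading_factor(on_path_flags):
    """Percentage of switchings made by switches outside the optimal path.

    on_path_flags: one boolean per switching, True when it was on the optimal path.
    """
    flags = list(on_path_flags)
    if not flags:
        return 0.0
    off_path = sum(1 for flag in flags if not flag)
    return 100.0 * off_path / len(flags)
```

With 40 switchings on the optimal path and 60 outside it, `spreading_factor` returns 60.0, matching the example in the text.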
As can be seen from the figure below, the buffer configuration affects the percentage of switchings caused by misdirected packets, which shows the importance of evaluating what buffer configuration to use. The importance of more than one output buffer can be seen especially clearly. The buffer configuration 01:02 has a slightly higher spreading than 02:02 and 01:03 at the higher network frequencies (clk = 0,5 and 0,7); this can be explained by the fact that this configuration has only three buffers compared to four in the other two. The probability that all buffers are full is higher when there are fewer buffers, and the router then has to find an alternative route. When the network clock is set to 0,8 the configuration 02:02 suddenly begins to spread the packets more than 01:02, which seems rather odd because more buffers should give a better result in most situations. The reason for this is that the router is now able to route the second packet in the input queue before there is any space in the desired output buffer, and this causes the router to send the packet along a non-optimal route.
Spreading

         Clk=0,8  Clk=0,7  Clk=0,5
01:01    48%      34%      11%
01:02    8%       5%       2%
01:03    5%       3%       1%
02:01    38%      33%      9%
02:02    10%      3%       1%
03:01    41%      31%      9%
4.5.1.2 Simulation with 16 bursty resources versus 14 bursty and 2 inactive resources
Simulation Set-up: These simulations are based on the set-up table in Figure 4-9, but with some differences between the two models. In the simulation model All Burst the set-up of the resources is exactly as in the table. The simulation model All Burst (-2) is also based on this table, but the sending parts of the resources at 1,2 and 3,3 are turned off.
Position  Destination  Behaviour
1,1       3,1          Burst
1,2       1,1          Burst
1,3       2,4          Burst
1,4       1,3          Burst
2,1       3,1          Burst
2,2       4,1          Burst
2,3       2,1          Burst
2,4       2,1          Burst
3,1       4,3          Burst
3,2       4,3          Burst
3,3       3,2          Burst
3,4       3,2          Burst
4,1       3,4          Burst
4,2       4,3          Burst
4,3       4,1          Burst
4,4       4,2          Burst

The remaining set-up values are identical for all 16 resources: 384; 100; 59,2; 75%; 2,60E-09; 2; 7,396; 13; 8.
Figure 4-9. Simulation set-up for 16 vs. 14 bursty resources
When comparing the number of packets transferred we can once again see that the network that runs with a period of 0,5 time units is so fast that the buffers have very little influence on the behaviour of the network. There are no drops and the number of cancelled packets is also very small at this speed. When the period time is increased to 0,7 time units we can see that the influence of the buffers starts to increase. The simulations made with more than one output buffer perform significantly better than the ones with only one output buffer. The 02:02 configuration is slightly better than 01:02 and 01:03 at this speed in the All Burst model, but when the period is increased to 0,9 the configuration 02:02 falls back and 01:02 and 01:03 are the best. In the All Burst (-2) model the 02:02 configuration stays the best in all simulations.
[Plot: number of transferred packets per buffer configuration (In:Out) at network periods 0,5, 0,7 and 0,9; one panel for 16 Bursty Resources (All Burst) and one for 14 Bursty Resources (All Burst-2).]
The number of cancelled messages clearly shows which configurations are capable of maintaining the intended throughput. All buffer configurations are able to route without drops with a period of 0,5 time units. When the period is increased to 0,7 the configurations with multiple output buffers start to show their advantages. All three of them manage to distribute the packets with only a few cancelled packets. The configuration with two input and two output buffers is the one that causes the smallest number of cancelled packets. When the period time is increased further to 0,9, the configuration 02:02 is no longer the best option in the All Burst simulation. It seems that the configuration 02:02 is the best when the load is low, and that when the load increases the configurations 01:02 and 01:03 become the best, the latter being slightly better. The reason for this can once again be connected to the higher spreading that 02:02 causes when the load is high.
[Plot: number of cancelled packets per buffer configuration (In:Out) at network periods 0,5, 0,7 and 0,9; one panel for 16 Bursty Resources (All Burst) and one for 14 Bursty Resources (All Burst-2).]
Figure 4-11. Number of cancelled packets, 16 vs. 14 bursty resources
Once again we can see the importance of the output buffers. The configuration 02:02 starts to drop heavily when the load increases, whereas 01:02 and 01:03 are the only configurations that do not drop packets in either of these two simulations.
[Plot: number of dropped packets per buffer configuration (In:Out) at network periods 0,5, 0,7 and 0,9; one panel for 16 Bursty Resources (All Burst) and one for 14 Bursty Resources (All Burst-2).]
When looking at the mean transfer time it is observed that all the set-ups with one output buffer increase their transfer time much faster than the other set-ups. The set-ups with more than one output buffer have almost the same performance as long as the load is not too high. When the load gets higher the 02:02 set-up falls back and the set-ups 01:02 and 01:03 are without doubt the fastest. That 02:02 suddenly changes from the fastest of the three set-ups with multiple output buffers to the slowest as the load increases seems a little strange. One theory about why this happens is that when the load on the network is low the two input buffers help the router to keep a high throughput. When the load increases there is not enough time to make space in the output buffer, and the packets are sent along a non-optimal path. This would explain why the transfer time for 02:02 suddenly becomes longer than the time for 01:02 and 01:03.
[Plot: mean transfer time for packets versus network speed (period time) for each buffer configuration; one panel for 16 Bursty Resources (All Burst) and one for 14 Bursty Resources (All Burst-2).]
4.5.1.3 Simulations with different burst length
To investigate the importance of buffers when sending messages in bursts, we decided to examine what happens when the mean burst length is increased. The mean number of packets per burst was changed from 8 to 16. The expected result was that a longer mean burst length would benefit from a larger number of buffers. From the figure below it is possible to see that the results did not confirm this at all. The number of transferred packets is very similar in the two cases, but when the network speed is set to 0,7 the simulations with a mean burst length of 16 are much better in two cases. It is when only one output buffer is used that the number of cancelled packets is significantly lower. A longer burst length also gives a longer time between the bursts, because the total number of packets sent must not increase. For this reason, the probability that some other resource is sending packets that degrade the performance of a transmission is lower when longer bursts are sent.
[Plots: bar charts per buffer configuration (In:Out) at network periods 0,5, 0,7 and 0,9 for the two burst lengths, including the number of dropped packets.]
4.5.2 Simulations with unequal delay of Switch and Buffer
The following simulations show the results when the switch and buffer delays are set to the delays that were obtained in VHDL. The main difference in the delay configurations is that the switch is a little slower than the buffers. The minimum period of the micro-router is 5 clock cycles. For example, if the network clock period is 0,5 ns, this results in a maximum bandwidth of $1/(5 \cdot 0{,}5 \cdot 10^{-9}\,\mathrm{s}) = 400$ Mpackets/s. The set-ups are identical to those of the previous simulations.

4.5.2.1 Simulations with 1 Continuous and 15 Bursty Resources
The results show that a network clock of 0,3 time units gives no dropped packets and very few cancelled packets. We can see that the configurations with multiple in-buffers seem to transfer the most packets. At a network speed of 0,4 time units per clock cycle, we can see small effects of the out-buffers. The configurations with several in-buffers begin to drop packets and also to cancel many packets. It can be seen that the out-buffers do not have the large effect that was seen in the previous simulation. The switch cannot take advantage of the larger out-buffers since it is slower and thus has difficulty filling them up. It is perhaps surprising that the in-buffers perform so badly, but the routing algorithm can explain it. At high traffic they make it possible to route packets in a non-optimal direction, thus creating a lot of extra traffic.
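The bandwidth figure quoted in Section 4.5.2 follows directly from the micro-router period and the clock period. A minimal Python sketch of the arithmetic, with a hypothetical function name, is:

```python
def max_bandwidth(cycles_per_packet, clock_period_s):
    """Maximum packet rate in packets/s for a router that needs
    `cycles_per_packet` clock cycles of `clock_period_s` seconds per packet."""
    return 1.0 / (cycles_per_packet * clock_period_s)


# 5 clock cycles per packet at a 0,5 ns clock period, expressed in Mpackets/s.
rate_mpackets = max_bandwidth(5, 0.5e-9) / 1e6
```

Here `rate_mpackets` evaluates to approximately 400, in line with the value stated above. Doubling the clock period halves the achievable packet rate.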
Number of dropped / cancelled packets per buffer configuration (In:Out)

          01:01      01:02     01:03     02:01        02:02        03:01
Clk=0,3   0 / 82     0 / 60    0 / 52    0 / 23       0 / 16       0 / 6
Clk=0,4   0 / 1302   0 / 905   0 / 710   3065 / 2020  3237 / 4152  3243 / 3593
Figure 4-15. Transfer statistics for 1 continuous and 15 bursty resources
The mean time to transfer packets, for different network speeds and buffer configurations, shows that adding more buffers only increases the time. The configuration 01:01 is slightly faster than 01:02 and 01:03. Here the configurations with multiple input buffers perform worse at every clock setting.
Figure 4-16. Transfer mean time for 1 continuous and 15 bursty resources
4.5.2.2 Simulation with 16 bursty resources versus 14 bursty and 2 inactive resources
When comparing the number of packets transferred we can once again see that the network that runs with a period of 0,3-0,4 time units is so fast that the buffers have very little influence on the behaviour of the network. There are no drops and the number of cancelled packets is also very small at these speeds. When the period time is increased to 0,5 we can see that the influence of the buffers starts to increase. The simulations made with more than one output buffer are slightly faster than the ones with only one output buffer. In the simulations of the 14 bursty model with multiple out-buffers, the effects of the buffers appear more distinctly.
[Plot: number of transferred packets per buffer configuration (In:Out); one panel for 14 Bursty Resources, 8 pkts per burst, at periods 0,4, 0,5 and 0,6, and one for 16 Bursty Resources, 8 pkts per burst, at periods 0,3, 0,4 and 0,5.]
Figure 4-17. Number of transferred packets, 16 vs. 14 bursty resources
All buffer configurations are able to route without drops with a period of 0,3 time units. When the period is increased to 0,5 the configurations with multiple output buffers start to show their advantages, especially in the case with 14 resources. When the period time is increased further to 0,6, the configurations 01:02 and 01:03 cause the fewest cancelled packets, the latter being slightly better.
[Plot: number of cancelled packets per buffer configuration (In:Out); one panel at periods 0,4, 0,5 and 0,6, and one for 16 Bursty Resources, 8 pkts per burst, at periods 0,3, 0,4 and 0,5.]
Figure 4-18. Number of cancelled packets, 14 vs. 16 bursty resources
We can see the importance of the output buffers: there are no drops with the configurations 01:02 and 01:03 up to a clock of 0,5. The configuration 03:01 starts to drop heavily when the load increases.
[Plot: number of dropped packets per buffer configuration (In:Out); one panel at periods 0,4, 0,5 and 0,6, and one at periods 0,3, 0,4 and 0,5.]
Figure 4-19. Number of dropped packets, 14 vs. 16 bursty resources
In the mean transfer time diagram it is observed that all the set-ups with several in-buffers increase their transfer time faster than the other set-ups. The set-ups with more than one output buffer have almost the same performance as long as the load is not too high. The set-up with a single in-buffer and a single out-buffer (01:01) seems to have the fastest transfer time at all clock settings.
[Plot: transfer time versus network speed (period time) for each buffer configuration; one panel for 14 Bursty Resources, 8 pkts per burst, at periods 0,4, 0,5 and 0,6, and one for All Burst (16 Bursty Resources, 8 pkts per burst) at periods 0,3, 0,4 and 0,5.]
4.5.2.3 Simulations with different burst length
The number of transferred packets seems to be influenced more when the bursts are 8 packets long. The explanation for this is probably the same as in the simulation with equal delays.
[Plots: number of transferred packets and number of cancelled packets per buffer configuration (In:Out) at network periods 0,3, 0,4 and 0,5, for 16 Bursty Resources with 8 and with 16 pkts per burst, together with a further chart whose title is not recoverable.]
One must not forget that using SDL may increase the total design time for a project. For example, it might have been faster to make a VHDL design directly at a higher abstraction level instead of using SDL. The benefit of SDL, however, is that the simulations can be performed using one tool. With VHDL it may be necessary to introduce other programs, such as Matlab or user-written functions in C, for generating simulation stimuli.

From the different simulations that have been made, it is possible to see that one result is constant throughout all simulations: the advantage that comes from using at least two output buffers. The throughput of the system increases, especially in the case of equal delays of buffers and switches. When the output buffer is increased further to three buffer spaces there is only a small extra gain in performance, and in most cases it will not be worth the extra cost of extending the buffer size. It also seems that adding more input buffers only makes the network performance worse. One surprising result, shown in both delay configurations, was that a longer burst length did not benefit more from larger buffers than a shorter burst length did. The configuration that we would recommend for a general-purpose system is one with one input buffer and two output buffers. If a NoC is designed for a special purpose, simulations should be made with different configurations to see which configuration is the most suitable.
It was decided to make the design in VHDL. It was not intended to make several designs in VHDL, and thus the design is described directly in Register Transfer Level (RTL) code. This makes it possible to transfer the design automatically to a netlist using a synthesis tool. The tool that was used supports graphical description of synchronous Finite State Machines (FSM), and this has been used to specify the behaviour and to generate code.
[Block diagram: NoC top level with node instances such as NOC_Node_Transceiver0_1 (I1) and NOC_Node_Transceiver1_1 (I3). Each node has Clk and Reset inputs, rx/Tx lines, RouterID, Send_pid and Dest_PID ports, and Indata/Outdata ports towards North, South, West and East, each with Wr, Enable and RTR signals.]
5.2.1 Micro-Router
The Micro_Router in Figure 5-2 has the same component structure as the blocks in SDL.

Switch and Control unit
The objective of the Switch is to check whether any messages are waiting in the input buffers and, if so, to switch them to an appropriate output buffer in a safe way. Handshaking is used for packet transactions. When the Switch is ready to handle a packet it sets its RTR signal to 1; after reset, all RTR signals are therefore 1. An input buffer that holds a packet sets its WR signal to 1. The Switch then tries to route this packet according to the same algorithm as in the SDL design. A free output buffer has its RTR signal at 1, and in that case the Switch asserts its WR signal towards that buffer. At the same time the Switch sets the RTR towards the input buffer to 0 to indicate that the packet has been taken. It then waits until the WR from the input buffer and the RTR from the output buffer are both 0 before it sets its WR towards the output buffer to 0, after which it is ready to handle a new packet.
A counter is rolled forward so that the Switch does not start looking for the next packet in the buffer from which it has just picked one up. This gives the behaviour a degree of fairness. This was not needed in the SDL design, since the signals there are events that are queued in the order they arrive at a process. In VHDL we cannot determine which signal came first if several arrive within the same clock period.
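The fairness counter described above can be illustrated with a small behavioural sketch in Python. It is not the thesis's VHDL; the class name `RoundRobinScanner`, the method `select` and the choice of five inputs (four directions plus the RNI) are assumptions made for this example only.

```python
from typing import Optional, Sequence


class RoundRobinScanner:
    """Hypothetical model of the Switch's scan counter (not the original VHDL)."""

    def __init__(self, num_inputs: int) -> None:
        self.num_inputs = num_inputs
        self.start = 0  # index of the input buffer where the next scan begins

    def select(self, wr: Sequence[bool]) -> Optional[int]:
        """Return the index of the first input buffer with WR set, or None."""
        for offset in range(self.num_inputs):
            idx = (self.start + offset) % self.num_inputs
            if wr[idx]:
                # Roll the counter forward past the buffer that was just served,
                # so the next scan does not begin at the same buffer.
                self.start = (idx + 1) % self.num_inputs
                return idx
        return None


if __name__ == "__main__":
    scanner = RoundRobinScanner(5)
    # Inputs 0, 1 and 4 keep requesting; they are served in turn.
    requests = [True, True, False, False, True]
    print([scanner.select(requests) for _ in range(4)])  # [0, 1, 4, 0]
```

Because the scan always starts one position after the buffer served last, a buffer that requests continuously cannot starve the others, which is the degree of fairness the text refers to.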
[Figure 5-2 block diagram. Recoverable labels: instances NOC_Router_Buffer (I2) and Switch (I5); ports RouterID, Clk and Reset; In and Out ports for North, East, West, South, RNI, Switch and Datalink, each with WR, RTR and, on the Datalink side, Enable signals.]
Figure 5-2. VHDL model of NoC at network layer

Buffers
The buffers in the North, South, West and East directions are of the same type. A special type is used for the buffer towards the RNI so that it can be configured separately. When the buffer is ready to receive a packet, In_RTR is set to 1. When a packet is ready to enter the buffer, the In_WR signal is set to 1. When In_WR is 1, In_RTR is set low to indicate that the buffer is handling the packet. The buffer then checks whether Out_RTR is set, and if so it sets Out_WR to 1 for immediate transfer of the packet. If Out_RTR is low, the packet is stored in the buffer. When the buffer is not full and the In_WR signal has gone low, the buffer sets In_RTR high again.
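As an illustration of the buffer handshake, the following Python sketch models one buffer cycle by cycle. It is a simplified behavioural model, not the synthesized FSM. The class `HandshakeBuffer`, the `step` method and the integer packet payload are hypothetical, and the rule that Out_WR is released once the receiver has dropped its RTR is an assumption borrowed from the Switch description.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class HandshakeBuffer:
    """Hypothetical cycle-based model of one router buffer."""

    capacity: int = 2
    in_rtr: bool = True            # In_RTR: buffer is ready to receive
    out_wr: bool = False           # Out_WR: buffer offers a packet downstream
    out_data: Optional[int] = None
    stored: Deque[int] = field(default_factory=deque)

    def step(self, in_wr: bool, in_data: Optional[int], out_rtr: bool) -> None:
        """Advance one clock cycle using the sampled In_WR, data and Out_RTR."""
        # Assumption: release Out_WR once the receiver has dropped its RTR.
        if self.out_wr and not out_rtr:
            self.out_wr = False
            self.out_data = None

        # A packet arrives: pull In_RTR low while the buffer handles it.
        if in_wr and self.in_rtr:
            self.in_rtr = False
            self.stored.append(in_data)

        # Forward the oldest packet when the receiver is ready; if the
        # receiver was already ready this happens in the arrival cycle.
        if self.stored and out_rtr and not self.out_wr:
            self.out_data = self.stored.popleft()
            self.out_wr = True

        # In_WR has gone low and there is room: raise In_RTR again.
        if not in_wr and not self.in_rtr and len(self.stored) < self.capacity:
            self.in_rtr = True


if __name__ == "__main__":
    buf = HandshakeBuffer(capacity=1)
    buf.step(in_wr=True, in_data=7, out_rtr=False)    # packet stored
    buf.step(in_wr=False, in_data=None, out_rtr=False)
    print(buf.in_rtr)                                 # buffer is full
    buf.step(in_wr=False, in_data=None, out_rtr=True)
    print(buf.out_wr, buf.out_data, buf.in_rtr)
```

With a capacity of one, In_RTR stays low until the stored packet has been forwarded, which shows how a full buffer holds back the sender.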
5.2.2 RNI
For simplicity, the service and buffer capabilities of the RNI are built together in one FSM. The difference from a pure buffer is that the FSM, for instance when it acts as an out-buffer, looks up a table and adds the network layer information while the packet from the resource is in the buffer.
[Figure: VHDL block diagram. Recoverable labels: instances NOC_RNI (I1) and UART_Rec (I0); ports Clk, Reset, rx, Data_In, Data_Out, Res_In, Res_Out, Rout_In and Rout_Out with their WR, RTR and Enable signals; nets D_Indata_RNI and D_Outdata_RNI.]
5.2.3 Resource
In the design there are three types of resources, designed for the requirements of the implementation. The first is a UART receiver, which can receive messages at 9600 bit/s and transmit them towards the RNI as a simple transport layer message. The second is the send ID resource, which receives a number 0-9, forwards it and then sends its own ID that number of times to a specific node in the network. The last type is the UART transmitter, which receives packets from the RNI and sends them at 9600 bit/s onto a serial channel.
[Figure: timing diagram comparing the SDL timing with the VHDL simulation. Recoverable labels: Pkt2, 3, 4, 5; Outbuf_WR; Inbuf_RTR to Outbuf; Inbuf_WR to Switch; Switch_RTR to Inbuf; Switch_WR to next Outbuf; Next Outbuf_RTR to Switch.]
5.4.2 Simulated and Implemented values
The fast design used in the previous section would not work in the FPGA. It was decided to clock more of the signals, which makes the design slower but safer in its behaviour. The values that work in the FPGA are shown below.

Buffers:
  Forwarding one packet: 2 clk
  Turn-around time to receive a new packet: 3 clk
Switch:
  Load packet for address evaluation and send: 4 clk
  Be ready to load a new packet: 4 clk

The number of clock cycles needed to transport a packet increases, so both the maximum bandwidth and the minimum delay get worse.

Example, F = 1 GHz:
  Max BW Buffers = 1/(5 * 1*10^-9) = 200 MPackets/s
  Max BW Switch = 1/(10 * 1*10^-9) = 100 MPackets/s
  Max BW overall = 100 MPackets/s

The minimum delay for one packet is calculated as follows.
  In-buffer delay = 2 ns
  Switch delay = 4 ns
  Out-buffer delay = 2 ns
  Total min delay = 8 ns
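The bandwidth and delay figures can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function and constant names are hypothetical, and the switch period of 10 clock cycles is taken as it appears in the bandwidth expression rather than derived from the listed cycle counts.

```python
def max_packet_rate(cycles_per_packet: int, clock_hz: float) -> float:
    """Packets per second when one packet occupies the unit for the given cycles."""
    return clock_hz / cycles_per_packet


def delay_ns(cycles: int, clock_hz: float) -> float:
    """Time in nanoseconds spent on the given number of clock cycles."""
    return cycles * 1e9 / clock_hz


CLOCK_HZ = 1e9              # F = 1 GHz, as in the example
BUFFER_PERIOD_CLK = 2 + 3   # forward one packet + turn-around time
SWITCH_PERIOD_CLK = 10      # assumption: value used in the expression in the text

buffer_rate = max_packet_rate(BUFFER_PERIOD_CLK, CLOCK_HZ)
switch_rate = max_packet_rate(SWITCH_PERIOD_CLK, CLOCK_HZ)
overall_rate = min(buffer_rate, switch_rate)  # the slowest unit limits throughput

# Minimum path: in-buffer forward (2 clk), switch (4 clk), out-buffer forward (2 clk).
min_delay = delay_ns(2 + 4 + 2, CLOCK_HZ)

if __name__ == "__main__":
    print(f"Buffers: {buffer_rate / 1e6:.0f} MPackets/s")
    print(f"Switch:  {switch_rate / 1e6:.0f} MPackets/s")
    print(f"Overall: {overall_rate / 1e6:.0f} MPackets/s")
    print(f"Minimum delay: {min_delay:.0f} ns")
```

Changing `CLOCK_HZ` shows how the same cycle counts scale to the clock frequency of a particular FPGA or ASIC.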
5.5 Chapter Discussion
Although it was not possible to make a working FPGA implementation with the faster values, it can be concluded that these are the more realistic ones for an ASIC implementation. The reason is that the possibilities available to a VLSI designer vastly exceed those available in this project. One could of course make manual adjustments in the technology mapping, but this is too time-consuming and has therefore been left out.
[Figure: prototype block diagram. Recoverable labels: PC; RNI; RS-232; Resource 0 (Rx, RS-232 interface); Micro Router (1:0); Resource 2 (Tx, RS-232 interface); Resource 3 (store, modify, forward).]

[Figure: packet fields. Message Sequence Number; Packet Sequence Number; Destination Process Identification; Source Process Identification; Hop Counter; Resource Identification; Check bit.]

[Figure 6-3 table values, row and column layout lost in extraction: 20 32 | 41 105 | 24 41 58 | 45 | 312 574 | 824 1659 | 4 5 | 5 10 | 370 751 | 6245 9 15509 5]
Figure 6-3. Implementation Result

The micro-router is synthesized with two output buffers and one input buffer in each of the 5 directions. Summing these components manually gives:
  Slices: 32*5 + 20*5 + 105 = 365
  Flip-flops: 24*5 + 41*5 + 45 = 370
  Gates: 312*5 + 574*5 + 1659 = 6089

The implementation is done with only one buffer in each direction because of the limited space. Here there are only 3 in/out-buffers, since it is only a 2*2 network and unconnected buffers at the boundary are not added to the design.
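The manual sum can be checked with a short Python sketch. The attribution of each figure to the one-space input buffer, the two-space output buffer and the switch is inferred from the sums in the text and is an assumption, as are all names used here.

```python
# Per-component synthesis figures as used in the manual sum.
# Assumption: the smaller buffer figures belong to the one-space input buffer,
# the larger ones to the two-space output buffer, and the rest to the switch.
INPUT_BUFFER = {"slices": 20, "flip_flops": 24, "gates": 312}
OUTPUT_BUFFER = {"slices": 32, "flip_flops": 41, "gates": 574}
SWITCH = {"slices": 105, "flip_flops": 45, "gates": 1659}


def router_cost(directions: int = 5) -> dict:
    """Sum one input and one output buffer per direction plus a single switch."""
    return {
        key: directions * (INPUT_BUFFER[key] + OUTPUT_BUFFER[key]) + SWITCH[key]
        for key in SWITCH
    }


if __name__ == "__main__":
    print(router_cost())  # {'slices': 365, 'flip_flops': 370, 'gates': 6089}
```

Calling `router_cost(3)` gives the corresponding estimate for a router with only three connected directions under the same assumed buffer configuration.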
Results
The objective of this thesis has been divided into three parts. The first and most important objective was to evaluate the SDL language for modelling and simulation of a NoC. The results of this part are described in section 7.1. The second objective was to make a model of a NoC using VHDL, based on the description that was made in SDL. The results are described in sections 7.2 and 7.3. The third and last objective was to implement a small NoC in an FPGA; this is described in section 7.4.
Conclusions
In this project we have developed a generic model of a NoC architecture using SDL. We have also designed a small NoC in VHDL and prototyped it in a 100k-gate FPGA. Our project demonstrates the feasibility of the NoC concept.

Developing something like a NoC is very interesting since it is a relatively unexplored area. We have tried to carry out a complete design flow, which has forced us to keep things simple. Looking at each step in the design, we see possibilities to improve the models at all levels of abstraction. Although most effort has been spent on the SDL model and the simulations, there is still a lot left to do. For example, since the model can be used for fast simulations, it may be worthwhile to test other routing algorithms.

When we were planning the NoC prototype we first thought that it would be possible to implement a larger network, but the buffers took a large amount of space. For this reason the network size was set to two by two. This is a little too small to draw any real conclusions from, but we thought it would be fun to get a working network prototype into an FPGA anyhow. There was also very little space left for the resources, so we had to make them very simple. It would be very interesting to build a larger network with more advanced resources, such as processors and RAM, to see how a NoC could be used.

One thing that seems important to include in the network design is a mechanism that gives hard real-time properties to messages between resources. Some researchers have proposed that a circuit switched network could be the solution for this [16]. A circuit switched network makes it possible to transfer data very fast as soon as the data path has been locked. Another way to improve real-time properties is to give packets in the NoC different priorities. Packets with high priority can then pass packets with lower priority, for example in the buffers. It is also possible for the router to give high-priority packets advantages: if there is a packet with lower priority in the output buffer, it can be dropped to make space. In this way it is possible to implement some sort of hard real-time properties in a packet-switched network.

Similar behaviour can be seen in all simulations made on routers with only one output buffer, and this is the reason why we recommend at least two output buffers. Increasing the number of buffers to three output buffers, or to two input and two output buffers, gives only a very small extra benefit. In some simulations the number of transferred messages is even lower and the transmission times are longer. Because of this, and because of the limited resources on a chip, we propose the configuration with one input buffer and two output buffers for the NoC.
References
[1] Shashi Kumar, Axel Jantsch, Juha-Pekka Soininen, Martti Forsell, Mikael Millberg, Johnny Öberg, Kari Tiensyrjä, and Ahmed Hemani, A network on chip architecture and design methodology, In Proceedings of IEEE Computer Society Annual Symposium on VLSI, April 2002.
[2] Edwin Rijpkema, Kees Goossens, and Paul Wielage, A Router Architecture for Networks on Silicon, Proceedings of Progress, 2nd Workshop on Embedded Systems, 2001.
[3] Luca Benini, Giovanni De Micheli, Networks on Chips: A New SoC Paradigm, IEEE Computer Society, 2002.
[4] Christer Bohm, Circuit Switching for High Performance Integrated Services Networks, Royal Institute of Technology, Department of Teleinformatics, June 1996.
[5] James F. Kurose, Computer Networking, Addison Wesley Longman 2001, ISBN 0-201-47711-4.
[6] Dhiman Deb Chowdhury, High Speed LAN Technology Handbook, Springer 2000, ISBN 9-783540-665977.
[7] Gary N. Higginbottom, Performance Evaluation of Communication Networks, Artech House 1998, ISBN 0-89006-870-4.
[8] Roland Airiau et al., Circuit Synthesis with VHDL, Kluwer Academic Publishers 1999, ISBN 0-7923-9429-1.
[9] Stefan Sjöholm et al., VHDL för konstruktion, Studentlitteratur 1999, ISBN 91-44-01250-0.
[10] Douglas L. Perry, VHDL, McGraw-Hill Inc. 1994, ISBN 0-07-049434-7.
[11] A. Olsen et al., Systems Engineering Using SDL-92, North-Holland 1997, ISBN 0-444-89872-7.
[12] Jan Ellsberger et al., SDL: Formal Object-oriented Language for Communicating Systems, Prentice Hall 1997, ISBN 0-13-621384-7.
[13] Paul Wielage and Kees Goossens, Networks on Silicon: Blessing or Nightmare?, Philips Research Laboratories, Eindhoven, The Netherlands.
[15] Yi-Ran Sun, Simulation and Performance Evaluation for Networks on Chip, M Sc Thesis, KTH, Sweden.
[16] Dake Liu et al., SoCBUS: The solution of high communication bandwidth on chip and short TTM, Proc of the Real-Time and Embedded Computing Conference, Gothenburg, Sweden, Sep 2002.
9 Vocabulary

Burst: With bursty behaviour it is meant that a certain number of packets, called a burst, are sent one after another at maximum rate, followed by a delay called the burst gap before the process repeats itself. In this project the number of packets between the delays is randomly selected according to the Poisson distribution.

Continuous: Continuous behaviour is, in this project, when it is possible to set the delay between two transmitted packets, which gives the output frequency.

FPGA: Field Programmable Gate Array. A programmable logic device that uses static RAM to store the configuration. It has a high logic capacity, but the device has to be reprogrammed after a power shutdown.

NoC: Network on Chip. NoC is one proposed solution for communication in future SoC design. The main idea is that resources on the chip communicate with each other through a network.

RNI: Resource Network Interface. RNI is an interface that adapts the interface of the resource to the network.

Router: In a switched network there is a need to find a route through several switches. Therefore the switches that are cross-points in the network also implement a routing function. They are called routers because of this functionality.

SoC: System on Chip. Multiple stand-alone VLSI designs are stitched together on a chip to provide one functional system.

VHDL: Very high speed integrated circuit Hardware Description Language. Initially developed for the US Department of Defence in order to describe digital circuits. Now also widely used for design and synthesis of these circuits.

VLSI: Very Large Scale Integration. Term for the process used when manufacturing chips containing several hundred thousand up to a million transistors.