
Abstract

Multiprocessor architectures and platforms have been introduced to extend the applicability of Moore's law.

They depend on concurrency and synchronization in both software and hardware to enhance design productivity and system performance. These platforms will also have to incorporate highly scalable, reusable, predictable, cost- and energy-efficient architectures. With the rapidly approaching billion-transistor era, some of the main problems in deep sub-micron technologies, which are characterized by gate lengths in the range of 60-90 nm, will arise from non-scalable wire delays, errors in signal integrity and unsynchronized communications. These problems may be overcome by the use of a Network-on-Chip (NOC) architecture. Recent remarkable advances in nanoscale silicon photonic integrated circuitry compatible with CMOS fabrication have generated new opportunities for leveraging the unique capabilities of optical technologies in the on-chip communications infrastructure. Based on these nanophotonic building blocks, we consider a photonic Network-on-Chip architecture designed to exploit the enormous transmission bandwidths, low latencies, and low power dissipation enabled by data exchange in the optical domain. The novel architectural approach employs a broadband photonic circuit-switched network driven in a distributed fashion by an electronic overlay control network, which is also used for the independent exchange of short messages.

Chapter 1 Basics of NOC
Introduction
Chip design has four distinct aspects: computation, memory, communication, and I/O. As processing power has increased and data-intensive applications have emerged, the challenge of the communication aspect in single-chip systems, Systems-on-Chip (SoC), has attracted increasing attention. This survey treats a prominent concept for communication in SoC known as Network-on-Chip (NoC). As will become clear in the following, NoC does not constitute an explicitly new alternative for intra-chip communication but is rather a concept which presents a unification of on-chip communication solutions. Multi-core processors have become the trend today. According to the International Technology Roadmap for Semiconductors (ITRS), hundreds to thousands of cores will be integrated on a die in the next decade [1]. This trend leads to rapidly increasing energy consumption for processors, so an upper limit on the total processor chip energy consumption is expected. As communication consumes a large part of the total energy [2], energy efficiency is becoming the most critical design facet for the scaling of Networks-on-Chip (NoC). Unlike electronic networks, photonic Networks-on-Chip offer natural solutions to improve energy efficiency. In contrast with traditional electronic components, photonic devices do not consume energy per bit or per unit of distance: at the chip scale, the energy consumption of a photonic link is completely independent of the transmission distance. Moreover, the bit-rate transparency property facilitates high data-rate transmission. The natural way for a photonic Network-on-Chip to replace electronic components with photonic components is to use active photonic switches. Although it has no buffering or computing capability, an active photonic switch can perform message exchange by changing the on/off states of its photonic switching elements (PSEs).
Several on-chip networks with active photonic switches have been proposed that leverage silicon photonics for future multi-core processors. Shacham et al. [3] present the first hybrid design, which employs a circuit-switched photonic network driven by an electronic overlay control network. Before messages are transmitted over the optical network, short control messages are routed in the electronic network to reserve an optical path. The optical network is used to transmit bulk messages, while the electronic network, with the same topology, is used for distributed control and short message exchange. Because it is still difficult to realize high-radix non-blocking active photonic switches, these networks are built from 4- or 5-port photonic switches. In large-scale networks, in the worst case, control messages must traverse many hops to set up an optical link between two communicating nodes. Every on-path electronic router not only stores control messages, but also analyzes and forwards them. Consequently, the path-setup procedure takes a great amount of dynamic and static energy, which leads to high power consumption for data packet transmission.
1.1 NOC vs. Traditional Buses

Traditional on-chip communication structures have already encountered many limitations in today's VLSI designs. Many of these limitations will become even more problematic as semiconductor technology advances into newer generations. These limitations are either associated with the scaling-down of the device feature size, or are inevitable with the scaling-up of design complexity. In particular, the following issues will become bottlenecks in the future communication-centric SoC design scheme:
Throughput Limitation: Traditional on-chip communication structures (i.e., buses) cannot scale up as the number of components increases. When multiple dataflows are transmitted concurrently, they compete for the same communication resources.
Energy Consumption: As VLSI device features continuously shrink, interconnect wires have become one of the major contributors to system energy consumption. The buses used in many of today's SoC designs are notoriously energy-inefficient, because every bit transmitted is propagated throughout the bus to every terminal.
Signal Integrity: Energy considerations will impose small logic swings and power supplies, most likely below 1 Volt. Smaller device feature sizes will also produce denser wires (i.e., 7 or more layers of routing wires) connecting highly compacted transistors. Therefore, future VLSI systems will become more vulnerable to various forms of electrical noise, such as cross-talk, electro-magnetic interference (EMI) and radiation-induced charge injection (soft errors). An additional source of errors is contention in shared-medium networks. Contention resolution is fundamentally a nondeterministic process, because it requires synchronization of a distributed system, and for this reason it can be seen as an additional noise source. Because of these effects, the mere transmission of digital values on wires will be inherently unreliable.
Signal Latency: The propagation delay on wires will gradually dominate signal latency as the wire feature size shrinks. In fact, wire delay has already become a big challenge in today's VLSI systems, because the delay is determined by the physical distribution of the components, which is hard to predict in the early stages of the design flow. A more predictable communication scheme is of great importance in future SoC designs.
Global Synchronization: Propagation delay on global wires, spanning a significant fraction of the chip size, will pose another challenge for future SoCs. As the wire size continues to shrink, the signal propagation delay will eventually exceed the clock period, so signals on global wires will have to be pipelined; hence the need for latency-insensitive design is critical. The most likely synchronization paradigm for future chips is globally-asynchronous locally-synchronous (GALS), with many different clocks.

1.2 Router architecture
For embedded systems such as handheld devices, cost is a major driving force for the success of the product, and therefore for the underlying architecture as well. Along with being cost-effective, handheld systems are required to be small and to consume significantly less power than desktop systems. Under such considerations, there is a clear tradeoff in the design of a routing protocol. A complex routing protocol would

further complicate the design of the router, consuming more power and area without being cost-effective. A simpler routing protocol does better in terms of cost and power consumption, but is less effective in routing traffic across the system. NOC architectures are based on packet-switched networks, which has led to new and efficient principles for the design of routers for NOC [10]. Assume that a router for a mesh topology has four inputs and four outputs from/to other routers, and another input and output from/to the Network Interface (NI). Routers can implement various functionalities, from simple switching to intelligent routing. Since embedded systems are constrained in area and power consumption but still need high data rates, routers must be designed with hardware usage in mind. For circuit-switched networks, routers may be designed with no buffering. For packet-switched networks, some amount of buffering is needed to support large data transfers. Various designs and implementations of router architectures based on different routing strategies have been proposed in the literature: a circuit-switched router architecture for NOC in [12], and a packet-switched router architecture in [13].
1.2.1 Routing algorithms
There are various ways of routing in NoC. The classification depends on the number of destinations, the type of controller, the implementation, and the adaptability. In unicast routing, packets have a single destination; in multicast routing, packets have multiple destinations. Unicast routing can be further classified into four classes: centralized routing, source routing, distributed routing and multiphase routing. Routing algorithms can also be classified by their implementation: lookup table or Finite State Machine (FSM). Lookup-table routing algorithms are the more popular implementation; they are realized in software, with a lookup table stored in every node.
We can change the routing algorithm by replacing the entries of the lookup table. FSM-based routing algorithms may be implemented either in software or in hardware. Routing algorithms may further be classified by their adaptability. Deterministic routing always follows a deterministic path through the network; examples of such routing algorithms are XY routing, North-first, South-first, East-first, and West-first. Adaptive routing algorithms need more information about the network in order to avoid congested paths. These algorithms are obviously more complex to implement and are thus more expensive in area, cost and power consumption; therefore, the right QoS (Quality-of-Service) metric must be considered before employing them. Routing algorithms can also be fault-tolerant, for example through backtracking. In progressive algorithms, a channel is reserved before a flit is forwarded. Routing algorithms that send packets/flits only in a direction that brings them nearer to the destination are referred to as profitable algorithms; a misrouting algorithm may forward a packet/flit away from the destination as well. Based on the number of available routing paths, routing algorithms can finally be classified as complete or partial routing algorithms.
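As an illustration of deterministic routing, dimension-ordered XY routing on a 2D mesh can be sketched as follows (a minimal sketch; the coordinate convention and function name are illustrative, not taken from any particular NOC implementation):

```python
# Deterministic XY routing on a 2D mesh: route along X until the column
# matches, then along Y. Nodes are (x, y) coordinate pairs.
def xy_route(src, dst):
    """Return the list of (x, y) hops from src to dst, X dimension first."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                      # exhaust the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then move in Y
        y += 1 if dy > y else -1
        path.append((x, y))
    return path
```

Because every packet exhausts the X dimension before turning into Y, XY routing on a mesh is deadlock-free without virtual channels, which is part of its appeal for area-constrained routers.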

1.3 Switching techniques

Switching techniques can be broadly classified into two types: circuit switching and packet switching. Circuit-switched networks reserve a physical path before transmitting the data, whereas packet-switched networks transmit data packets without prior reservation. Packets are transferred to their destination through multiple routers along the routing path in a hop-by-hop manner: each router keeps forwarding incoming packets to the next router until the packet reaches its final destination. The switching technique decides when a router forwards an incoming packet to the adjacent router.
1. Store-and-forward (SAF) switching: Flits are sent only when the receiving buffer can hold the entire packet; a router forwards a packet only when enough space is available downstream. The main drawback of SAF switching is the large channel buffer size required, which makes it poorly suited to NOC.
2. Wormhole switching: Packets are divided into flits and forwarded flit by flit. This reduces hop latency because the header flit is processed before the next flit arrives. The drawback is performance degradation due to chains of blocked packets.
3. Virtual cut-through switching: A packet is forwarded to the next router if there is enough space there to hold the entire packet. The packet is divided into flits as well as phits. It has the same buffer requirement as SAF.
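The buffering trade-off between these techniques shows up directly in the zero-load latency formulas. The following sketch assumes one flit crosses one link per cycle and there is no contention (an idealization, not a model of any specific router):

```python
# Zero-load packet latency in cycles, assuming one flit crosses one
# link per cycle and no contention along the path.
def saf_latency(hops, flits_per_packet):
    """Store-and-forward: each router receives the whole packet before forwarding."""
    return hops * flits_per_packet

def wormhole_latency(hops, flits_per_packet):
    """Wormhole: the header pipelines through the routers; body flits follow."""
    return hops + flits_per_packet - 1
```

For an 8-flit packet over 4 hops this gives 32 cycles for store-and-forward versus 11 for wormhole, which is why wormhole and virtual cut-through dominate in NOC designs.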

1.4 Network Topologies
Because of different performance requirements and cost metrics, many different multi-processor network topologies are designed for specific applications. MPSoC networks can be categorized as direct networks and indirect networks. In direct-network MPSoCs, node processors are connected directly with each other by the network, and each node performs dataflow routing as well as arbitration. In indirect-network MPSoCs, node processors are connected by one (or more) intermediate node switches, and the switching nodes perform the routing and arbitration functions. Indirect networks are therefore also often referred to as multistage interconnect networks (MIN).
1.4.1 Direct Network Topologies
Orthogonal Topology: Nodes in orthogonal networks are connected in k-ary n-dimensional mesh (k-ary n-mesh) or k-ary n-dimensional torus (k-ary n-cube) formations, as shown in Fig. 1.3. Because of the simple connection and easy routing provided by adjacency, mesh and torus networks are widely used in parallel computing platforms. Orthogonal networks are highly regular; therefore, the interconnect length between nodes is expected to be uniform, to ensure the performance uniformity of the node processors.
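The adjacency of a k-ary n-mesh or k-ary n-cube is simple enough to compute directly, which is one reason routing on these topologies is easy. A small sketch (the node representation and `torus` flag are illustrative conventions):

```python
# Neighbors of a node in a k-ary n-dimensional mesh or torus.
# A node is a tuple of n coordinates, each in the range [0, k).
def neighbors(node, k, torus=False):
    out = []
    for dim in range(len(node)):
        for step in (-1, 1):
            c = node[dim] + step
            if torus:
                c %= k                      # torus: wrap-around links
            elif not 0 <= c < k:
                continue                    # mesh: no link past the edge
            out.append(node[:dim] + (c,) + node[dim + 1:])
    return out
```

An interior mesh node has 2n neighbors, and in a torus every node has 2n neighbors, which is what makes the torus edge-symmetric.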

Figure 1.3: Mesh and Torus Networks
Cube-Connected-Cycles Topology: The cube-connected-cycles (CCC) topology is proposed as an alternative to orthogonal topologies to reduce the degree of each node, as shown in Fig. 1.4. Each node has 3 degrees of connectivity, as compared to 2n degrees in mesh and torus networks. CCC networks have a hierarchical structure: the three nodes at each corner of the cube form a local ring.

Figure 1.4: Cube-connected-cycles Networks
Octagon Topology: The Octagon network was proposed as an on-chip communication architecture for network processors. In this architecture, eight processors are connected by an octagonal ring and three diameters. The delay between any two node processors is no more than two stages (through one intermediate node) within the local ring. The Octagon network is scalable: if one node processor is used as a bridge node, more Octagons can be cascaded together, as shown in Fig. 1.5.

Figure 1.5: The Octagon Networks
1.4.2 Indirect Network Topologies
Crossbar Switch Fabrics: An N × N crossbar network connects N input ports with N output ports. Any of the N input ports can be connected to any of the N output ports by a node switch at the corresponding cross point.
Fully-Connected Network: An N × N fully-connected network uses MUXes to aggregate every input to the output (Fig. 1.7). Each MUX is controlled by an arbiter that determines which input should be directed to the output. Similar to the crossbar network (a fully-connected switch network is also often referred to as a crossbar), in a fully-connected switch network each source-destination connection has its own dedicated data path.
Fat-tree Topology: Unlike the butterfly network, a fat-tree network provides multiple data paths from a source node to a destination node. As shown in Fig. 1.11, the fat-tree network can be regarded as an expanded n-ary tree network with multiple root nodes. The network delays depend on the depth of the tree. The SPIN network is one design example that uses a 4-ary fat-tree topology for MPSoC on-chip communication.
1.5 Problems on Routing
Problems with oblivious routing typically arise when the network starts to block traffic; the only solution to these problems is to wait for the amount of traffic to reduce and try again. Deadlock, livelock and starvation are potential problems in both oblivious and adaptive routing.

1.5.1. Deadlock.

Routing is in deadlock when two packets are each waiting for the other to be routed forward. Both packets hold some resources, and both are waiting for the other to release them. Since routers do not release their resources before they acquire the new ones, the routing is locked.
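This circular wait can be made concrete as a cycle in a wait-for graph, where each blocked packet points to the packet holding the resource it needs. A minimal detection sketch (packet names are illustrative):

```python
# Deadlock = a cycle in the wait-for graph. waits_for maps each blocked
# packet to the packet currently holding the resource it is waiting for.
def has_deadlock(waits_for):
    for start in waits_for:
        seen, node = set(), start
        while node in waits_for:        # follow the chain of waits
            if node in seen:
                return True             # came back around: circular wait
            seen.add(node)
            node = waits_for[node]
    return False
```

Two packets each waiting for the other, {"A": "B", "B": "A"}, form the smallest such cycle. Real routers normally avoid the situation up front, e.g. with turn restrictions such as XY routing, rather than detecting it at run time.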

1.5.2. Livelock. Livelock occurs when a packet keeps spinning around its destination without ever reaching it. This problem exists in non-minimal routing algorithms. Livelock must be prevented to guarantee packet throughput. There are a couple of remedies to avoid livelock. A time-to-live (TTL) counter counts how long a packet has traveled in the network; when the counter reaches some predetermined value, the packet is removed from the network. Another remedy is to give packets a priority based on their age: the oldest packet eventually gets the highest priority and will be routed forward.
1.5.3. Starvation. Using different priorities can cause a situation where some packets with lower priorities never reach their destinations. This occurs when the packets with higher priorities reserve the resources all the time. Starvation can be avoided by using a fair routing algorithm, or by reserving some bandwidth only for low-priority packets.
1.6 Flow control techniques
Flow control, also known as channel buffer management, is broadly classified into two types: buffered and bufferless flow control. Buffered flow control can be further classified into credit-based flow control, ACK/NACK flow control, STALL/GO flow control, T-Error flow control and handshaking-signal-based flow control.
1. In credit-based flow control, an upstream node keeps count of the data transfers; the available free slots are termed credits. Once a transmitted data packet is either consumed or further transmitted, a credit is sent back.
2. In handshaking-signal-based flow control, a valid signal is sent whenever a sender transmits any flit. The receiver acknowledges by asserting a valid signal after consuming the data flit.
3. In the ACK/NACK protocol, a copy of a data flit is kept in a buffer until an ACK signal is received. On assertion of ACK, the flit is deleted from the buffer; if instead a NACK signal is asserted, the flit is scheduled for retransmission.
4.
In the STALL/GO scheme, two wires are used for flow control between each pair of sender and receiver. When there is empty buffer space, a GO signal is asserted.
1.7 The flow control scheme
1.7.1 Statistical traffic modeling
On-chip processor traffic: on-chip processor communications are mostly issued by the caches (the instruction cache, the data cache, or both), and not by the processor itself. The behaviour of these components implies that processor traffic is the aggregation of several types of communication, described hereafter. The traffic from the caches to the NOC can be segregated into three categories:
1. Reads: read transactions have the size of a cache line. The time between two reads corresponds to a period during which the processor computes only on cached data. Two flows must be distinguished: the instruction flow (binaries from instruction memory) and the data flow (operands from data memory).
2. Writes: write transactions can have various sizes depending on the cache writing policy: write-through (one word at a time) or write-back (one line at a time). If a write buffer is present, the size is variable, as the buffer is periodically emptied.
3. Other requests: requests to the non-cached memory regions have a size of one word, as do atomic reads/writes. If a cache coherency protocol is implemented, additional messages are also sent among processors.

1.7.2 On-chip traffic formalism
The traffic produced by a processor is modeled as a sequence of transactions composed of flits (flow transfer units), each corresponding to one bus word. The kth transaction is a 5-tuple T(k) = (a(k), b(k), c(k), d(k), e(k)), whose elements are the target address, the command (read or write), the size of the transaction, the delay, and the inter-request time, respectively. The latency L(k) of the kth transaction is defined as the number of cycles between the start of the kth request and the arrival of the associated response.
Statistical traffic models:
1. Replay: first record the complete transaction sequence T(k) and simply replay it as-is. This provides very accurate simulation, but has the disadvantage that the recorded simulation trace might be very large, and thus hard to load and store.
2. Independent random vector: the elements of the vector are modeled as sample paths of independent stochastic processes, and the statistical behaviour of each element is described separately.
3. Random vector: the vector is modeled by a statistical analysis of each element as well as of the correlations between each pair of elements.
4. Hybrid approach: some constraints are introduced on top of the stochastic modeling.
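The 5-tuple and the replay model above can be sketched directly (the field names are paraphrases of a(k)…e(k), and the trace contents are made up for illustration):

```python
from collections import namedtuple

# One transaction T(k) = (address, command, size, delay, inter-request time).
Transaction = namedtuple(
    "Transaction", ["address", "command", "size", "delay", "inter_req_time"])

def replay(trace):
    """Replay model: re-issue the recorded transaction sequence unchanged."""
    for t in trace:
        yield t

# A tiny recorded trace (illustrative values only).
trace = [Transaction(0x1000, "read", 8, 2, 5),
         Transaction(0x2000, "write", 4, 1, 3)]
```

The stochastic models (2-4) would replace `replay` with generators that draw each field, or the whole vector, from fitted distributions instead of a stored trace.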

Chapter 2 NOC Architectures
2.1 NoC architectures
2.1.1 Ethereal NOC
Generally, a NOC has two kinds of components: routers and network interfaces [1]. In the Ethereal NOC, contention-free routing is used, which offers guaranteed services in throughput and latency. Guaranteeing communication in a NOC requires resource reservation; this is achieved with connections, which are opened to reserve resources and closed to release them. The Ethereal NOC therefore requires configuration and programming.
Ethereal NOC router: The Ethereal router uses a slot table, which provides certain advantages: it avoids contention on links, switches data to the correct output, and divides the bandwidth of each link between connections. Every slot table T has S time slots (rows) and N router outputs (columns), and all routers have fixed-duration slots. Wormhole routing is used, and link-level flow control between the routers avoids queue overflow.
Ethereal programming models: Two programming models are introduced: distributed and centralized.
Distributed programming model: It works like asynchronous transfer mode (ATM). Three best-effort (BE) packets are used for configuration: SetUp, TearDown and AckSetup. The SetUp packet creates a connection between source and destination; it contains the source data, the path to the destination and the slot numbers. When a SetUp packet successfully arrives at the destination, an AckSetup is returned. When the connection fails, the SetUp packet is dropped, a TearDown packet is used to remove the partial connection, and upstream TearDown packets return to the source (it is assumed that every path is reversible). The source has successfully created a connection when it receives an AckSetup; otherwise it receives a TearDown. This model uses slot tables to avoid contention in each router.
Centralized programming model: In the distributed programming model, network interfaces send SetUp packets to determine which slots they use for a connection.
Here, no slot tables are used in the routers; however, the network interfaces still require slot tables to determine when data enters the router network. A central party (called the root process) directly programs the network interfaces with the correct slots.
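The contention-free guarantee of the slot table can be sketched as a simple admission check over the S × N table (the table layout and reservation policy below are a simplification, not the actual Ethereal data structures):

```python
# Sketch of a slot-table reservation: table[slot][output] holds a
# connection id or None. A connection is admitted only if every
# (slot, output) pair it needs is still free.
def reserve(table, slots, output, conn="conn"):
    if any(table[s][output] is not None for s in slots):
        return False                    # would contend with an existing connection
    for s in slots:
        table[s][output] = conn
    return True

S, N = 4, 3                             # S time slots, N router outputs
table = [[None] * N for _ in range(S)]
```

Because two connections can never hold the same (slot, output) entry, data admitted according to the table never contends inside the router, which is exactly what yields the throughput and latency guarantees.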

QNOC: QNOC is an asynchronous router architecture providing different levels of service [2]. The architecture is based on a regular mesh topology and uses wormhole packet routing. Packets are forwarded using static X-Y coordinate-based routing. It does not provide any support for error-correction logic; all links and data transfers are assumed to be reliable. Packets are forwarded based on the number of credits remaining in the next router. QNOC identifies four different service levels (SLs) based on on-chip communication requirements: Signaling, Real-Time, Read/Write (RD/WR) and Block Transfer, with Signaling having the top priority and Block Transfer the least, in the order listed. A priority-based round-robin scheduling criterion is employed for the transmission of flits. The cost functions for the QNOC implementation were calculated from an estimate of the area occupied by its components. Other performance parameters were also provided, such as clock rate, end-to-end delay (packet latency), and power consumption under different traffic loads.
SPIN NOC: The Scalable Programmable Integrated Network (SPIN) on chip is based on a fat-tree topology. It addresses design decisions such as the nature of the links, the packet structure and the network protocol. It is based on two kinds of components: traffic generators, which send requests, and the target components they address. SPIN is a packet-switching on-chip micronetwork which uses wormhole switching, adaptive routing and credit-based flow control. In a full 4-ary fat-tree topology, there are as many fathers as children on all nodes (routers). Links are bidirectional and full-duplex, with two unidirectional channels. In SPIN, packets are defined as sequences of 32-bit data words, with the header fitting in the first word. An 8-bit field in the header identifies the destination terminal, allowing the network to scale up to 256 terminals. Routing in SPIN is adaptive and distributed. The basic building block of the SPIN network is the RSPIN router. It includes eight ports, each with a pair of input and output channels compliant with the SPIN link.
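QNOC's priority-based round-robin arbitration can be sketched as strict priority between the four service levels and round-robin among the inputs within a level (the queue layout and names are illustrative, not QNOC's actual hardware interfaces):

```python
# Strict priority across QNoC's service levels, round-robin within one.
SERVICE_LEVELS = ["Signaling", "Real-Time", "RD/WR", "Block Transfer"]

def next_flit(queues, rr_state):
    """queues: level -> list of per-input flit queues; rr_state: level -> last served input."""
    for level in SERVICE_LEVELS:            # highest-priority level first
        inputs = queues[level]
        n = len(inputs)
        for i in range(1, n + 1):           # rotate through the inputs fairly
            idx = (rr_state[level] + i) % n
            if inputs[idx]:
                rr_state[level] = idx
                return level, inputs[idx].pop(0)
    return None                             # nothing to send

queues = {lvl: [[], []] for lvl in SERVICE_LEVELS}
queues["Signaling"][1].append("s1")         # illustrative pending flits
queues["RD/WR"][0].append("d1")
rr_state = {lvl: 0 for lvl in SERVICE_LEVELS}
```

Note that a strict-priority scheme like this is exactly where the starvation problem of Section 1.5.3 can appear for Block Transfer traffic under heavy Signaling load.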
XPIPES NOC The researchers of XPIPES [27], [28], [29] have generated a framework, XPIPES Compiler, which automatically instantiates customized NOC macros (switches, network interfaces and links) from the developed parametrizable building blocks implemented in SystemC. A static routing protocol called street sign routing along with wormhole switching are employed for on-chip communication. XPIPES uses pipelined links, similar to a shift register in operation, achieved by partitioning the wires into segments for the actual flit transfer. Each output module is deeply pipelined. The CRC (Cyclic Redundancy Check) decoders for error detection work in parallel with the switch operation. The first pipeline stage checks the headers of incoming packets on the different input ports to determine the correctness of the packet paths, the second pipeline stage resolves contention based on a round-robin policy. Arbitration is carried out when the tail flit of the preceding packet is received. A negative acknowledgement (NACK) for flits of non-selected packets is generated. The following arbitration stage keeps the status of the virtual channel registers and determines whether flits can be stored into the registers or not. The fifth stage is the actual buffering stage, and the ACK/NACK response at this stage indicates whether a flit has been successfully stored or not. The following stage takes care of forward flow control. Finally, a last arbitration stage multiplexes the virtual channels on the physical output link on a flit-by-flit basis. The XPIPES network interface uses the standardized OCP interface to network cores. Static routing information is accessed by the header builder. It is then passed to the

flit builder circuit in the form of a number of hops (NumSB) and an actual direction bit (LutWord) along with the datastream, if BusyBuilder is not asserted high. This flit is then passed to the NOC through the output buffer stage. The response path includes receiving information through Synchro, which reads only the useful information and passes it to the core through Receive Response. XPIPES implements error-control logic based on the retransmission of data packets upon a negative acknowledgement.
2.2 NOC Issues and Challenges
To enhance system productivity, it is very important that an architect be able to abstract, represent and address most of the design issues and concerns at a high level of abstraction. System-level design affords one the opportunity to review several different software-hardware architectures that meet the functional specifications equally well, and to quickly trade off among different QoS metrics such as latency, power, cost, size and ease of integration. Similarly, there are several issues related to NOC, such as the nature of the NOC link, link length, serial vs. parallel links, bus vs. packet-based switching, and leakage currents. In this section, we discuss these issues.
Serial vs. Parallel Link: The transportation of data packets among the various cores in a NOC can be performed over either a serial or a parallel link. Parallel links make use of a buffer-based architecture and can be operated at a relatively lower clock rate in order to reduce power dissipation. However, parallel links incur a high silicon cost due to inter-wire spacing, shielding and repeaters. This can be minimized up to a certain limit by employing multiple metal layers. On the other hand, serial links allow savings in wire area, reduce signal interference and noise, and further eliminate the need for buffers.
However, serial links need serializer and deserializer circuits to convert the data into the right format to be transported over the link and back to the cores. Serial links offer the advantages of a simpler layout and simpler timing verification. Serial links sometimes suffer from inter-symbol interference (ISI) between successive signals when operating at high clock rates. Nevertheless, such drawbacks can be addressed by encoding and by asynchronous communication protocols.
Interconnect Optimization: Communication in a NOC is based on modules connected via a network of routers, with links between the routers that comprise long interconnects. It is therefore very important to optimize the interconnects in order to achieve the required system performance. Timing optimization of global wires is typically performed by repeater insertion. Repeaters result in a significant increase in cost, area, and power consumption. Recent studies indicate that in the near future, inverters operating as repeaters [58] will use a large portion of chip resources. Thus, there is a need for optimizing power on the NOC. Techniques for reducing dynamic power consumption include the approaches discussed in [59], [60], [61]. Encoding is another effective way of reducing dynamic power consumption [62]. To make NOC architectures more effective, innovative ways will have to be introduced to minimize the power consumed by the on-chip repeaters.
Leakage Power Consumption

The leakage current, which was negligible relative to the dynamic switching current at larger transistor sizes (of 1 micron or more), is expected to dominate the current drain at sub-100 nm technologies. In a NOC, the link utilization rates vary and in many cases are very low, reaching a few percentage points. Networks are designed to operate at low link utilization in order to meet worst case scenario requirements, and thus having a higher link capacity helps reduce packet collisions. However, even when NOC links are idle they still will consume power in repeaters, due to the dominance of this leakage current at small feature sizes. Thus, new techniques will have to evolve which will help reduce the leakage power consumption to make the NOC architecture more effective.

Chapter 3 Basics of Photonic NOC
3.1 Introduction
Current research shows that the insertion of photonics into the on-chip global interconnect structures of CMPs can potentially leverage the unique advantages of optical communication and capitalize on the capacity, bit-rate transparency, and fundamentally low energy consumption that have made photonics ubiquitous in long-haul transmission systems.

A nanophotonic Network-on-Chip (NoC) would dramatically disrupt the current trend of power scaling in multiprocessor architectures by offering significant power savings in both global on-chip and off-chip communications over comparable electronic networks. Photonic NoCs can deliver a dramatic reduction in the power expended on intra-chip global communications. Photonic NoCs essentially change the power scaling rules: as a result of the low loss in optical waveguides, once a photonic path is established, the data are transmitted end-to-end without the need for repeating, regeneration or buffering. In electronic NoCs, on the other hand, a message is buffered, regenerated and then transmitted on the inter-router links multiple times en route to its destination. Furthermore, the switching and regenerating elements in CMOS consume dynamic power that grows with the data rate. The power consumption of optical switching elements, conversely, is independent of the bit rate, so high-bandwidth messages do not consume additional dynamic power. To capitalize on these advantages, optical interconnection network design must address two architectural challenges: the lack of efficient optical buffering technologies and the limited processing capabilities of optics. Electronic interconnection networks rely heavily on memory elements to perform essential contention-resolution functions and to store data while control information is processed; header and address processing are also easily performed in the electronic domain. The photonic opportunity can be realized only after overcoming the inherent restrictions of optical technologies, i.e. limited buffering and signal-processing capabilities.

3.2 Optical NoC: Design Considerations

The design should exploit the optical advantages:
- Bit-rate transparency: transmission/switching power is independent of bandwidth.
- Low loss: power is independent of distance.
- Bandwidth: exploit WDM for maximum effective bandwidth across the network; (over)provision maximized bandwidth per port and maximize effective communications bandwidth.
- Seamless optical I/O to external memory with the same bandwidth.

The design must also address the optical challenges:
- No optical buffering.
- No optical signal processing.
- Network routing and flow control must be managed in electronics (distributed vs. central control).
- Latency of path provisioning through the electronic control layer.

3.3 Hybrid Model: Generally, a PNOC architecture employs a hybrid design, combining an optical network for bulk message transmission with an electronic network of the same topology for distributed control and short message exchange. While photonic technology offers unique advantages in terms of energy and bandwidth, it lacks buffering and processing capabilities; the hybrid approach deals with these problems by employing two layers: 1. A photonic interconnection network, comprised of silicon broadband photonic switches interconnected by waveguides, is used to transmit high-bandwidth messages. 2. An electronic control network, topologically identical to the photonic network, is used to control the photonic network and for the exchange of short control messages. Every photonic message transmitted is preceded by an electronic control packet (a path-setup packet) which is routed in the electronic network, acquiring and setting up a photonic path for the message. Buffering of messages is impossible in the photonic network, as there are no photonic equivalents of storage elements (e.g. flip-flops, registers, RAM). Hence buffering, if necessary, takes place only for electronic packets during the path set-up phase. The photonic messages are transmitted without buffering once the path has been acquired. This approach has many similarities with optical circuit switching, a technique used to establish long-lasting connections between nodes in the optical Internet core. The main advantage of using photonic paths relies on a property of the photonic medium known as bit-rate transparency: photonic switches switch on and off once per message, so their energy dissipation does not depend on the bit rate. This property facilitates the transmission of very high-bandwidth messages while avoiding the power cost that is typically associated with them in traditional electronic networks.
Another attractive feature of optical communications results from the low loss in the optical waveguides: at the chip scale, the power dissipation on a photonic link is completely independent of the transmission distance. Energy dissipation remains essentially the same whether a message travels between two cores that are 2 mm or 2 cm apart. The photonic network is comprised of broadband 2×2 photonic switching elements which are capable of switching wavelength-parallel messages (i.e. each message is simultaneously encoded on several wavelengths) as a single unit, with a sub-ns switching time. The switches are arranged as a two-dimensional matrix and organized in groups of four. Each group is controlled by an electronic circuit, termed an electronic router, to construct a 4×4 switch. This structure lends itself conveniently to the construction of planar 2D topologies such as a mesh or a torus. Torus networks offer a lower network diameter than meshes, at the expense of longer links [7]. Since the photonic switching elements have small area and power consumption, many of them can be used to provision the network with additional paths on which circuits can be created, thus reducing the contention manifested as path-setup latency. Electronic/optical and optical/electronic conversions are necessary for the exchange of photonic messages on the network. Each node therefore includes a network gateway serving as a photonic network interface. Network gateways should also include some circuitry for clock synchronization and recovery and serialization/de-serialization; when traditional approaches are used, this circuitry can be expensive both in terms of power and latency. 3.4 Packet life time on the photonic NOC Here we describe the typical chain of events in the transmission of a message between two terminals. Consider a write operation from a processor in node A to a memory address located at node B (both arbitrary nodes connected through the photonic NOC). Before the message is ready, a path-setup packet is sent on the electronic control network. This packet includes information on the destination address of node B, and perhaps additional information such as priority, flow id, or other fields. The control packet is routed in the electronic network, reserving the photonic switches along the path for the photonic message which will follow it.
When the path-setup packet reaches the destination node B, the photonic path is reserved and ready to route the message. Since the photonic path is completely bidirectional, a short light pulse can then be transmitted onto the waveguide in the opposite direction, signaling to the source that the path is open. The photonic message transmission then begins, and the message follows the path from switch to switch until it reaches its destination. Since special hardware and additional complexity are required to transmit and extract the counter-directional light pulses, an alternative approach can be used: transmitting the message when the path is assumed to be ready, according to the maximum expected path-reservation latency. After the message transmission is completed, a path-teardown packet is finally sent to release the path for use by other messages. Once the photonic message has been received and checked for errors, a small acknowledgement packet may be sent on the electronic control network to support guaranteed-delivery protocols.
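The chain of events above can be sketched as a small simulation. All class and function names here (ToyNetwork, Switch, transmit_message) are illustrative assumptions for a minimal model of a straight-line path, not part of any proposed hardware or protocol:

```python
# Minimal sketch of the path-setup / transmit / teardown sequence.
# Names and structure are illustrative assumptions, not a real design.

class Switch:
    """One photonic switching element that can be reserved for a path."""
    def __init__(self):
        self.reserved = False

    def try_reserve(self):
        if self.reserved:
            return False
        self.reserved = True
        return True

    def release(self):
        self.reserved = False


class ToyNetwork:
    """A toy straight-line chain of photonic switches."""
    def __init__(self, n_switches):
        self.switches = [Switch() for _ in range(n_switches)]
        self.delivered = []

    def route(self, src, dst):
        # The switches a message must traverse between src and dst.
        return self.switches[min(src, dst):max(src, dst) + 1]

    def deliver_optical(self, payload, path):
        # Stand-in for the actual photonic transmission.
        self.delivered.append(payload)


def transmit_message(net, src, dst, payload):
    """Electronic path-setup, photonic transmission, then teardown."""
    path = net.route(src, dst)
    reserved = []
    for sw in path:                      # path-setup phase
        if not sw.try_reserve():
            for s in reserved:           # path-blocked: backtrack and free
                s.release()
            return False
        reserved.append(sw)
    net.deliver_optical(payload, path)   # message follows the open path
    for sw in path:                      # path-teardown packet
        sw.release()
    return True
```

If any switch along the route is already reserved, the sketch backtracks and releases the partial reservation, mirroring the path-blocked behaviour described in the text.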

In the case of congestion, when a path-setup packet is dropped in a router, a path-blocked packet is sent in the reverse direction, backtracking the path traveled by the path-setup packet. The path-blocked packet releases the reserved switches and notifies the node attempting transmission that its request was not served. Topology: Here we use a folded torus topology as a base and augment it with access points for the gateways. The access points are designed with two goals in mind: (1) to facilitate injection and ejection without interference with the through traffic on the torus and (2) to avoid blocking between injected and ejected traffic, which may be caused by the switches' internal blocking. Injection-ejection blocking can be detrimental to performance and may also cause deadlocks. The access points are designed such that gateways are directly connected to the switch. To avoid internal blocking, a set of injection-ejection rules must be followed: injected messages make a turn at the gateway switch, according to their destination, and then enter the torus network through an injection switch. Messages are ejected from the torus network when they arrive at the ejection switch associated with their final destination. The ejection switches are located on the network in the same row as the gateway switch, and this is where the ejecting messages turn. Finally, ejected messages pass through the gateway switch without making turns.

Dealing with Deadlock When dimension-order routing is used, no channel-dependency cycles are formed between dimensions, so deadlock involving messages traveling in different dimensions cannot occur. Virtual-channel flow control has been shown to be successful in eliminating intra-dimension deadlocks and making dimension-order routing deadlock-free. Messages that make narrow turns and messages that pass straight through do not block other messages and cannot be blocked, and U-turns are forbidden. The injection-ejection rules include the separation of injection and ejection into different switches, so that turns that may block other messages cannot occur in the same switch.
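Dimension-order routing itself is easy to sketch. The following minimal XY-routing function (an assumed illustration, not taken from the source) shows why inter-dimension cycles cannot form: every route finishes its X hops before starting its Y hops, so no message ever turns from Y back to X:

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing on a 2D mesh/torus sketch: exhaust
    the X dimension before moving in Y, so no X->Y->X turns (and hence
    no inter-dimension channel-dependency cycles) can form.
    src and dst are (x, y) coordinates; returns the list of hops."""
    x, y = src
    dx, dy = dst
    hops = []
    while x != dx:                 # travel in X first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                 # then travel in Y
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops
```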

3.5 SWITCH: Gateway switch: injected messages are required to make a turn towards the injection switches. Ejected messages arrive from the ejection switch and pass straight through. Therefore, blocking cannot happen. Injection switch: messages already travelling on the torus network do not turn onto the injection paths, so no blocking interactions exist between them and the injected messages.

Ejection switch: messages may arrive only from the torus network, and they either turn for ejection or continue straight through. Since no messages arrive from the gateway switch, no blocking interactions can happen.

In a PNOC, the intra-dimensional deadlock problem can be solved using path-setup timeouts. When a path-setup packet is sent, the gateway sets a timer to a pre-defined time. When the timer expires, a terminate-on-timeout packet is sent following the path-setup packet. The timeout packet follows the path acquired by the path-setup packet until it reaches the router where it is blocked. At that router, the path-setup packet is removed from the queue and a path-blocked packet is sent on the reverse path, notifying the routers that the packet was terminated and the path should be freed. If a deadlock has occurred, the system recovers from it at that point. While this method suffers from some inefficiency, because paths and gateway injection ports are blocked for some time until they are terminated without transmitting, it guarantees deadlock recovery. In another possible scenario, the path-setup packet is not deadlocked but merely delayed, and it reaches its destination while the timeout packet is still en route. In this case the timeout packet reaches the destination gateway, where it is ignored and discarded, and the path is acquired as if the timeout had not expired. Message size To maintain network efficiency as well as flexibility and link utilization, the message duration should be handled carefully. If messages are too large, link utilization and latency are compromised, as messages are queued in the gateway for a long time while other long messages are transmitted. On the other hand, if messages are too small, the relative overhead of the path-setup latency becomes too large and efficiency is degraded. In this study the optimal size is defined with respect to the overhead of the path-setup process, under the assumption that this overhead is constant across all messages.
We define the overhead ratio as p = (Tpath-reservation - Tmessage-duration) / Tmessage-duration, where Tpath-reservation is the time between the transmission of the path-setup packet and the transmission of the path-teardown packet, and Tmessage-duration is the time during which actual transmission takes place, corresponding to the size of the message. The smaller the value of p, the higher the network efficiency. The optimal message size is the smallest size whose overhead ratio does not exceed a chosen bound. Increasing path diversity One of the advantages of a packet-switching network lies in the statistical multiplexing of packets across channels and its extensive usage of buffers. These allow for the distribution of load across space and time. In a photonic circuit-switched network, there is no statistical multiplexing and buffering is impractical. Additional paths, however, can be provisioned, over which the load can be distributed using either random load-balancing techniques or adaptive algorithms that use current information on the network load.
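Under one plausible reading of the overhead-ratio definition, taking p as the idle portion of the reservation time relative to the transmission time, the optimal message size calculation can be sketched as follows (the function names and the assumption Tpath-reservation = Tsetup + Tmessage-duration, with a constant setup overhead Tsetup, are ours):

```python
def overhead_ratio(t_path_reservation, t_message_duration):
    """Overhead ratio p: reservation time not spent transmitting,
    relative to the actual transmission time. Smaller p means higher
    network efficiency."""
    return (t_path_reservation - t_message_duration) / t_message_duration


def optimal_message_duration(t_setup, p_max):
    """Smallest message duration whose overhead ratio does not exceed
    p_max, assuming a constant path-setup overhead t_setup, so that
    t_path_reservation = t_setup + t_message_duration."""
    # p = t_setup / t_msg <= p_max  =>  t_msg >= t_setup / p_max
    return t_setup / p_max
```

For example, with a 20-cycle setup overhead and a target overhead ratio of 0.2, messages should last at least 100 cycles.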

The topology chosen for the proposed network, a torus, can be easily augmented with additional parallel paths that provide path diversity and facilitate this distribution of the load. The performance metric used to evaluate the improvement gained by adding the paths is again the path-setup overhead ratio, which is derived directly from the path-setup latency. 3.6 RECONFIGURABLE PHOTONIC NOC It is known that memory references exhibit locality in space and time. As such, the numerous packets flowing through the NoC will seemingly organize into intensive traffic bursts between communicating pairs. Detailed simulations have shown that those burst patterns exist over a wide range of time scales, and can be up to several milliseconds in length [6]. From this observation originated the idea of a photonic NoC where the optical paths serve as shortcuts to boost the performance of an underlying base network [7]. Those direct, reconfigurable connections improve the performance of the accompanying electrical NoC in two ways: first, they decrease congestion by providing temporary high-throughput data channels where needed, and second, they provide low-latency direct links between the most intensively communicating partners. These proposed photonic links could rely on the same technology as the photonic NoC proposed by Petracca et al. [3], which is based on an array of non-blocking 4×4 microring switches. By changing the state of the switches, the topology of the interconnect can be altered. However, in contrast to [3], this approach does not set up a dedicated channel for each packet, but slowly reconfigures the topology in accordance with emerging hot-spots. For the allocation of the photonic shortcuts, a heuristic is used that tries to provide a direct link for most of the network traffic that is expected during the span of a reconfiguration interval (Treconf).
After each interval, a new optimum topology is computed using the traffic pattern measured in the previous interval. The length of Treconf must be chosen as short as possible, to be able to follow the dynamics of the evolving traffic patterns, but long enough to amortize the cost of calculating the optimized topologies and of link downtime during reconfiguration. In our case, a Treconf of 1 s turned out to be a good compromise. A reconfigurable photonic NoC can provide a good trade-off between network performance and power consumption, while being compatible with the short-message character of shared-memory CMPs.
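A minimal sketch of such a reconfiguration heuristic follows, under the simplifying assumption that the available shortcuts are simply granted to the heaviest source/destination pairs measured in the previous interval (this greedy rule is our illustration, not the exact heuristic of [7]):

```python
def allocate_shortcuts(traffic, n_shortcuts):
    """Greedy sketch of shortcut allocation: at the end of each
    reconfiguration interval, give the available photonic shortcuts to
    the (src, dst) pairs that carried the most traffic in the previous
    interval. `traffic` maps (src, dst) pairs to traffic volumes."""
    ranked = sorted(traffic.items(), key=lambda kv: kv[1], reverse=True)
    return [pair for pair, _ in ranked[:n_shortcuts]]
```

In a full system this selection would be recomputed once per Treconf, with the measured traffic matrix reset between intervals.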

Chapter 4 State of the Art of PNOC Photonic NOC for DMA Communication in Chip Multiprocessors: In this paper, the authors eliminate internally blocking switches by designing a non-blocking photonic switch, and estimate the optical loss budget and area requirements of a practical NOC implementation based on the new switches. They also address one of the key challenges of photonics: the latency associated with setting up photonic paths. By reducing the buffering depth, the path-setup latency is significantly reduced and throughput is improved. 4.1 UC-PHOTON: A hybrid photonic NOC proposed to cope with emerging multiple use-case applications and maximize performance per watt. UC-PHOTON is comprised of one or more photonic ring paths coupled to a traditional 2D electrical mesh NoC architecture. The photonic paths offload global communication from the electrical network, improving packet latency and reducing communication power dissipation. UC-PHOTON supports dynamic reconfiguration of the electrical and photonic networks. This enables runtime adaptation to changing traffic patterns, which allows network resources to be optimized for even lower power dissipation. The paper makes two novel contributions. First, it extends the previously proposed low-cost photonic ring topology [39] and explores novel multi-ring topologies to improve performance scalability for emerging CMPs with hundreds of cores. Second, it explores runtime reconfiguration of both the electrical and photonic networks to significantly improve communication performance and reduce power dissipation for multiple use-case applications.
There are two types of routers used in UC-PHOTON: (i) regular electrical mesh routers that have 5 I/O ports (N, S, E, W, local core), with the exception of the boundary routers that have fewer ports, and (ii) gateway interface routers that have six I/O ports (N, S, E, W, local core, photonic link) and are responsible for sending/receiving flits to/from photonic interconnects in the photonic layer. If multiple requests contend for access to the photonic waveguide at a gateway interface, then the request with the furthest distance to the destination is given preference. i) DVS/DFS: Dynamic supply voltage and clock frequency scaling (DVS/DFS) is one of the most widely used runtime optimization techniques to reduce power dissipation. In this approach, NoC link and router frequencies are dynamically adapted to meet performance requirements while consuming the minimum power. ii) Clock gating: Clock gating is the most effective solution for optimizing dynamic power, and is supported by most commercial synthesis and optimization tools. iii) Adaptive TDMA slot allocation: The TDMA slot allocation in a router for different traffic flows controls the bandwidth and also the average latency of packets for the flows in a NoC. Since different use-cases have different bandwidth and latency

requirements, changing the TDMA slot allocation during a use-case transition is a way to adapt to the new use-case. 4.2 Corona: Corona is a nanophotonically connected 3D many-core NUMA system that meets the future bandwidth demands of data-intensive applications at acceptable power levels. Corona comprises 256 general-purpose cores, organized in 64 four-core clusters, interconnected by an all-optical, high-bandwidth DWDM crossbar. The crossbar enables a cache-coherent design with near-uniform on-stack and memory communication latencies. The paper considers the implications for many-core processors. A complete nanophotonic network requires waveguides to carry signals, light sources that provide the optical carrier, modulators that encode the data onto the carrier, photodiodes to detect the data, and injection switches that route signals through the network. Cluster Architecture Each core has private L1 instruction and data caches, and all four cores share a unified L2 cache. A hub routes message traffic between the L2, directory, memory controller, network interface, optical bus, and optical crossbar. 4.3 Aurora Aurora is a thermally resilient photonic NoC architecture design that supports reliable, low bit-error-rate (BER) on-chip communications in the presence of large temperature variations. The proposed architecture leverages solutions at both the device and architecture layers that synergistically provide significant improvements. To compensate for small temperature variations, the design varies the bias current through the ring resonators. For larger temperature variations, architecture-level techniques re-route messages away from hot regions, and through cooler regions, to their destinations, thereby lowering the BER. Architecture: Here the photonic NoC is implemented as a layer of optical devices on top of a silicon chip.
A 2D folded torus hybrid NoC topology is used, as it is compatible with a tiled CMP chip, allows the use of low-radix switches, and minimizes crosstalk. In 3D packaging, the photonic network is usually implemented on top of the core layer. It therefore experiences larger non-uniform temperature variations, depending on the temperature of the cores below. Since the photonic layer consists of thousands of ring resonators, the operation of the photonic network will be drastically compromised, as BER increases with temperature variation. The average BER is used as an indicator to provide a measure of how temperature variations affect the operation of the simulated photonic network. 4.4 BLOCON: Here the authors propose BLOCON (Bufferless Photonic Clos Network) to exploit silicon photonics. They propose a scheduling algorithm named Sustained and Informed Dual

Round-Robin Matching (SIDRRM) to solve the output contention problem, and a path allocation scheme named Distributed and Informed Path Allocation (DIPA) to solve the Clos network routing problem. BLOCON is a bufferless Clos network which applies wormhole routing, and can be viewed as an input-queued switch without reassembly queues or virtual channels (VCs) in the output ports. In BLOCON, buffers exist only in the PEs and the first stage of the switch modules (SMs). The absence of multi-stage buffered SMs lets BLOCON enjoy high throughput with a proper scheduling algorithm. The zero-load latency of BLOCON is low compared to other NoC architectures, because a packet only needs to travel through a buffered SM, a waveguide, a crossbar, and two electrical links between any source and destination PE. Since the Clos network has a high bisection bandwidth and a large number of routes between any source and destination, BLOCON has very stable delay and power performance under different kinds of traffic patterns. BLOCON can be viewed as an input-queued switch with the input buffers residing in the PEs and IMs. When a packet leaves the input port of an IM, the packet will not experience any queuing delay. BLOCON is designed around the virtual output queue (VOQ) buffer structure. Fig. 4 shows the buffer structure of a 4×4 BLOCON, in which there are three VOQs in each PE and a FIFO in each input port of the IMs. In BLOCON, there are no reassembly queues or VCs on the output port side. A packet has to be transferred continually, without interruption, between an IM FIFO and the destination PE. To send a packet, a PE has to resolve the output contention, making sure that the packet has the exclusive right to enter its destination PE. After winning the output contention, the input/output ports are locked by this packet. The packet then leaves the PE and enters the input port FIFO in an IM. The IM will allocate a MUX waveguide for the packet before the packet leaves the FIFO.
After the packet leaves the FIFO, the input/output ports are unlocked. 4.5 ESPN The Energy-Star Photonic Network (ESPN) architecture optimizes energy utilization via a two-pronged approach: (1) by enabling dynamic resource provisioning, ESPN adapts photonic network resources based on runtime traffic characteristics, and (2) by utilizing all-optical adaptive routing, ESPN improves energy efficiency by intelligently exploiting existing network resources without introducing high-latency, power-hungry auxiliary routing mechanisms. ESPN consists of one multi-processor chip and two laser source chips connected by off-chip optical fibers and electrical wires on the PCB. The multi-processor chip consists of three vertically stacked dies using 3D packaging technology [10]. The processor-and-caches die contains processor cores, private L1/L2 caches and electrical routers. The control die, which operates as the interface between the processor-and-caches die and the optical die, integrates driving circuits, sense amplifiers, and control circuits for the optical components (e.g. the ON/OFF switches of the turn resonators and the modulators/photo-detectors).

4.6 Olympic: An all-optical NoC architecture using a hierarchical topology made up of replicated and cascaded simple photonic building blocks (rings). Local rings connect tiles within clusters directly, and a global ring glues the local ones together and enables inter-cluster communications. The all-optical approach allows a low-energy solution, which is very important for future embedded CMPs. Olympic is a tiled architecture in which every tile has private L1 caches and a slice of the shared L2 cache and directory. It employs a hierarchical clustered network topology in which clusters are interconnected through a global photonic ring and tiles inside each cluster are connected through a local photonic ring. Tiles in the same cluster can communicate using only their local ring. A tile wishing to communicate with another in a different cluster needs to send the message to its local hub, which receives it, converts it into electronics, and then sends it again in optics, through the global ring, to the hub of the cluster containing the destination tile. The destination hub converts the message again (O/E/O) and sends it to the final tile through its local ring. Both the local rings and the global ring use WDM to exploit bit parallelism, and their separation allows up to M + 1 simultaneous and independent communications when there are M local rings and one global ring. 4.7 OPAL: This paper proposes a multi-layer hybrid photonic NoC fabric (OPAL) for 3D ICs. The proposed hybrid photonic 3D NoC combines low-cost photonic rings on multiple photonic layers with a 3D mesh NoC in the active layers to significantly reduce on-chip communication power dissipation and packet latency. OPAL also supports dynamic reconfiguration to adapt to changing runtime traffic requirements and to uncover further opportunities for reductions in power dissipation. OPAL employs multiple active layers and multiple photonic layers with photonic ring paths in a stack.
The active layers consist of cores interconnected using a 3D electrical mesh NoC. The photonic layers consist of ring-shaped waveguides. Gateway interface routers provide the connectivity between the electrical layer and the modulators and photodetectors in the photonic layer. An E2P3 OPAL configuration has two active electrical (E) layers and three photonic (P) layers. Each E layer has a dedicated P layer with photonic rings for intra-layer global transfers between cores in the same layer. For every two E layers, a dedicated P layer exists that facilitates inter-layer (e.g. E1 to E2) global transfers. Vertical TSVs are used for transfers between E1 and E2 in the electrical 3D mesh NoC, as well as to transfer data between photonic layers and active layers. Higher-complexity OPAL configurations can be created by reusing this basic E2P3 configuration. For instance, for a four-active-layer 3D IC, an E4P7 OPAL configuration is created by stacking two E2P3 stacks and adding a dedicated P layer for inter-E2P3 photonic communication.

CHAPTER 5: IMPLEMENTATION OF DIGITAL ADAPTIVE CDMA TECHNIQUES TO NOCs 5.1. CDMA history: CDMA is based around a form of transmission known as Direct Sequence Spread Spectrum (DSSS). The history of CDMA can be traced back to the 1940s, when this form of transmission was first envisaged. As electronics technology improved, it started to be used for covert military transmissions, in view of the facts that the transmissions look like noise, are difficult to decipher without knowledge of the right codes, and are furthermore difficult to jam. With the revolution in cellular telecommunications that occurred in the 1980s, a then little-known company named Qualcomm, working on DSSS transmissions, started to look at this as the basis for a cellular telecommunications multiple access scheme: CDMA, code division multiple access. The concept of CDMA had to be proven in the field, and accordingly Qualcomm was joined by the US network operators Nynex and Ameritech to develop the first experimental CDMA system. Later the team was expanded as Motorola and AT&T (now Lucent) joined to bring their resources to speed development. As a result, it was possible to start writing a specification for CDMA in 1990. With the support of the Cellular Telecommunications Industry Association (CTIA) and the Telecommunications Industry Association (TIA), a standards group was set up. This group then published the standard for the first CDMA system in the form of IS-95, resulting in the formal publication of IS-95A in 1995. The first CDMA system was launched in September 1995 by Hutchison Telephone Co. Ltd. in Hong Kong, and SK Telecom in Korea soon followed, along with networks in the USA. This was only one cellular telecommunications system, although it was the first; its development led on to the CDMA2000 series of standards.
The use of CDMA did not stop with CDMA2000, as it became necessary to evolve the GSM standard so that it could carry data and provide significant improvements in terms of spectrum-use efficiency. Accordingly CDMA, in the form of Wideband CDMA (WCDMA), was adopted for this standard. 5.2. KEY ELEMENTS OF CDMA: CDMA is a form of spread spectrum transmission technology. It has a number of distinguishing features that are key to spread spectrum transmission technologies: Use of wide bandwidth: CDMA, like other spread spectrum technologies, uses a wider bandwidth than would otherwise be needed for the transmission of the data. This results in a number of advantages, including increased immunity to interference or jamming, and multiple-user access. Spreading codes used: In order to achieve the increased bandwidth, the data is spread by use of a code which is independent of the data.

Level of security: In order to receive the data, the receiver must have knowledge of the spreading code; without this it is not possible to decipher the transmitted data, and this gives a measure of security. Multiple access: The use of spreading codes which are independent for each user, along with synchronous reception, allows multiple users to access the same channel simultaneously. 5.3. CDMA TECHNOLOGY ADVANTAGES: The use of CDMA offers several advantages, and it is for this reason that CDMA technology has been adopted for many 3G cellular telecommunications systems. Improvement in capacity: One of the chief claims for CDMA is that it gives significant improvements in network capacity. Original expectations of some of the proponents of CDMA technology were for very significant improvements: an 18-fold increase in capacity compared to AMPS (the 1G technology used in the USA) and a 6-fold increase compared to US TDMA (a 2G technology used in the USA); similar increases were also claimed over GSM.

In reality the original expectations were not fulfilled, although increases by a factor of about two were seen compared to US TDMA and GSM. This in itself was a significant improvement. Improvement in handover / handoff: Using CDMA it is possible for a terminal to communicate with two base stations at once. As a result, the old link only needs to be broken when the new one is firmly established. This provides significant improvements in the reliability of handover / handoff from one base station to another. 5.4. WHY CDMA IN NOCs: NOC was proposed to resolve the on-chip communication problem, because as integration density becomes higher and higher, communication between components becomes a complicated issue. Data transfer among different on-chip components can be roughly divided into two categories: 1. Circuit-switched networks. 2. Packet-switched networks. The SoCBUS architecture, a mesh on-chip network, is an example of a circuit-switched network, whereas the Ethereal NoC and the Proteo NoC are examples of the packet-switched category. The Ethereal NoC applies combined guaranteed-service and best-effort routers to transfer data packets in the network. In the Proteo NoC, the components in the system are connected through network nodes and hubs. The network topology and data links in the Proteo NoC can be customized and optimized for a specific application.

Circuit-switched networks have two big disadvantages, in scalability and parallelism. A packet-switched network can overcome the shortcomings of the circuit-switched network by dividing data streams into packets and routing the packets to their destinations node by node. But a packet-switched network also has the problem of varying data-transfer latency, since packets with the same source and destination may move through different routes in the network. In order to eliminate the variance of data-transfer latency and the complexity incurred by routing issues in a PTP-connected NoC, an on-chip network which applies a code-division multiple access (CDMA) technique is introduced. It has great bandwidth efficiency and multiple-access capability. In CDMA each node is separated by a unique orthogonal code; when these codes are used to transmit and receive data, several nodes can transmit and receive data at the same time through the same common channel. 5.5. BASIC DEFINITIONS IN CDMA: Spread spectrum is a communication technique that spreads a narrowband communication signal over a wide range of frequencies for transmission, then de-spreads it into the original data bandwidth at the receiver. 5.5.1. Narrow Band vs Spread Spectrum Narrow band: 1. Uses only enough frequency spectrum to carry the signal. 2. High peak power. 3. Easily jammed. Spread spectrum: 1. The bandwidth is much wider than required to send the signal. 2. Low peak power. 3. Hard to detect. 4. Hard to intercept. 5. Difficult to jam.

In CDMA a locally generated code runs at a much higher rate than the data to be transmitted. Data for transmission is combined via bitwise XOR (exclusive OR) with the faster code. The figure shows how a spread spectrum signal is generated. The data signal with pulse duration Tb is XORed with the code signal with pulse duration Tc. Therefore, the bandwidth of the data signal is 1/Tb and the bandwidth of the spread spectrum signal is 1/Tc. Since Tc is much smaller than Tb, the bandwidth of the spread spectrum signal is much larger than the bandwidth of the original signal. The ratio Tb/Tc is called the spreading factor or processing gain and determines to a certain extent the upper limit of the total number of users supported simultaneously by a base station.
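The spreading and de-spreading operations can be sketched in a few lines of Python. The chip sequence, data bits, and spreading factor of 8 here are illustrative choices, not values from the text:

```python
# Spread each data bit by XORing it with every chip of a faster code.
def spread(bits, code):
    return [b ^ c for b in bits for c in code]

# De-spreading XORs the received chips with the same code; on a clean
# channel every chip in a bit's window then equals the original bit.
def despread(chips, code):
    S = len(code)
    out = []
    for i in range(0, len(chips), S):
        window = [chips[i + j] ^ code[j] for j in range(S)]
        out.append(1 if sum(window) > S // 2 else 0)
    return out

code = [0, 1, 1, 0, 1, 0, 0, 1]   # 8-chip code -> spreading factor 8
data = [1, 0, 1]

tx = spread(data, code)
assert len(tx) == len(data) * len(code)  # bandwidth expands by Tb/Tc
assert despread(tx, code) == data
```

The chip stream is eight times longer than the data stream, which is exactly the Tb/Tc bandwidth expansion described above.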

5.5.2. Orthogonality: Mathematically it can be described as follows. It is common to use the following inner product for two functions f and g:

⟨f, g⟩ = ∫ f(x) g(x) w(x) dx

Here we introduce a nonnegative weight function w(x) in the definition of this inner product. We say that those functions are orthogonal if that inner product is zero:

⟨f, g⟩ = 0

The dot product of two vectors a = [a1, a2, ..., an] and b = [b1, b2, ..., bn] is defined as:

a · b = Σi ai bi = a1 b1 + a2 b2 + ... + an bn

where Σ denotes summation and n is the dimension of the vector space. In dimension 2, the dot product of vectors [a, b] and [c, d] is ac + bd. Similarly, in dimension 3, the dot product of vectors [a, b, c] and [d, e, f] is ad + be + cf. For example, the dot product of the two three-dimensional vectors [1, 3, 5] and [4, 2, 1] is:

[1, 3, 5] · [4, 2, 1] = (1)(4) + (3)(2) + (5)(1) = 15

If vectors a and b are orthogonal, then:

a · b = 0

Each user is associated with a different code, say v. A 1 bit is represented by transmitting a positive code, v, and a 0 bit is represented by a negative code, −v. For example, if v = (1, −1) and the data that the user wishes to transmit is (1, 0, 1, 1), then the transmitted symbols would be (v, −v, v, v) = (v0, v1, −v0, −v1, v0, v1, v0, v1) = (1, −1, −1, 1, 1, −1, 1, −1). For the purposes of this article, we call this constructed vector the transmitted vector.
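This ±v construction can be checked directly in Python. The sketch below uses the balanced two-chip code v = (1, −1) as an illustrative choice; decoding correlates each two-chip segment with v and maps a positive correlation to a 1 bit:

```python
# Encode: a 1 bit sends +v, a 0 bit sends -v.
def encode(bits, v):
    out = []
    for b in bits:
        out += [x if b == 1 else -x for x in v]
    return out

# Decode: correlate each len(v)-chip segment with v; the sign of the
# correlation recovers the bit.
def decode(tx, v):
    n = len(v)
    bits = []
    for i in range(0, len(tx), n):
        corr = sum(tx[i + j] * v[j] for j in range(n))
        bits.append(1 if corr > 0 else 0)
    return bits

v = (1, -1)
tx = encode([1, 0, 1, 1], v)
assert tx == [1, -1, -1, 1, 1, -1, 1, -1]   # the transmitted vector
assert decode(tx, v) == [1, 0, 1, 1]
```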

5.5.3. CDMA code types: There are several types of codes that can be used within a CDMA system for providing the spreading function. PN codes: Pseudo-random number codes (pseudo-noise or PN codes) can be generated very easily. These codes will sum to zero over a period of time. Although the sequence is deterministic because of the limited length of the linear shift register used to generate it, they provide a PN code that can be used within a CDMA system to provide the spreading code required. They are used within many systems as there is a very large number of them available. A feature of PN codes is that time-shifted versions of the same PN code are almost orthogonal, and can therefore be used as virtually orthogonal codes within a CDMA system. Truly orthogonal codes: Two codes are said to be orthogonal if, when they are multiplied together and the result is added over a period of time, they sum to zero. For example, the codes 1 −1 −1 1 and 1 −1 1 −1 when multiplied together give 1 1 −1 −1, which sums to zero. An example of an orthogonal code set is the Walsh codes used within the IS95 / CDMA2000 system. 5.5.3.1. Walsh Table: To generate chip sequences, we use a Walsh table, which is a two-dimensional table (i.e. a matrix) with an equal number of rows and columns, the entries of which are +1 or −1, and the property that the dot product of any two distinct rows (or columns) is zero. The Walsh matrix was proposed by Joseph Leonard Walsh in 1923. Each row of a Walsh matrix corresponds to a Walsh function. In the Walsh table, each row is a sequence of chips. H(1) for a one-chip sequence has one row and one column. We can choose −1 or +1 for the chip in this trivial table. According to Walsh, if we know the table for k−1 sequences, H(2^(k−1)), we can create the table for k sequences, H(2^k), as shown below.

H(2^k) = | H(2^(k-1))   H(2^(k-1)) |
         | H(2^(k-1))  -H(2^(k-1)) |

and in general

H(2^k) = H(2) ⊗ H(2^(k-1))

for 2 ≤ k ≤ N, where ⊗ denotes the Kronecker product.

5.6. DIGITAL CDMA TO NOC: 5.6.1. BASIC MECHANISM: At the transmitter side each data bit from the different senders is XORed with the sender's unique orthogonal and balanced spreading code, and then the encoded data from all senders are added and sent through a single line. At the receiver end this sum data is received through a demultiplexer which uses the corresponding spreading code of the sender as its select signal to receive that particular data. Depending on that spreading code, the sum data is accumulated into a positive part accumulator (if the spreading chip is 0) or a negative part accumulator (if the spreading chip is 1). Then, by comparing the contents of these two accumulators, the corresponding decoded data bit is generated.

Fig. 5.1: CDMA technique principle.

5.6.2.DIGITAL CDMA ENCODER:-

Fig. 5.2: Digital CDMA encoding scheme.

Each bit of the S-bit encoded data generated by the XOR operations is called a data chip. Then, the data chips which come from different senders are added together arithmetically according to their bit positions in the S-bit sequences. Namely, all the first data chips from different senders are added together, all the second data chips from different senders are added together, and so on. Therefore, after the add operations, we get S sum values for the S-bit encoded data. Finally, binary equivalents of the S sum values are transferred to the receiving end. An example of encoding two data bits from two senders is illustrated in Fig. 5.3 in order to illustrate the proposed encoding scheme in more detail. Fig. 5.3(a) illustrates two original data bits from different senders and two 8-bit spreading codes. The top two figures in Fig. 5.3(b) illustrate the results after data encoding (XOR operations) for the original data bits. The bottom figure in Fig. 5.3(b) presents the eight sum values after the add operations. Then the binary equivalent of each sum value will be transferred to the receiving end. In this case, two binary bits are enough to represent the three possible different decimal sum values, 0, 1, and 2. For example, if the decimal sum value 2 needs to be transferred, we need to transfer the two binary digits 10.

Fig. 5.3: Data encoding example.

5.6.3. DIGITAL CDMA DECODER:-

Fig. 5.4: Digital CDMA decoding scheme. The digital decoding scheme applied in the CDMA NoC is depicted in Fig. 5.4. The decoding scheme accumulates the received sum values into two separate parts, a positive part and a negative part, according to the bit values of the spreading code used for decoding. For instance, as illustrated in Fig. 5.4, the first received sum value will be put into the positive accumulator if the first bit of the spreading code for decoding is 0; otherwise, it will be put into the negative accumulator. The same selection and accumulation operations are also performed on the other received sum values. The principle of this decoding scheme can be explained as follows. If the original data bit to be transferred is 1, after the XOR operations in the encoding scheme illustrated in Fig. 5.2, it can only contribute a nonzero value to the sums of data chips when a bit of the spreading code is 0. Similarly, a 0-valued original data bit can only contribute a nonzero value to the sums of data chips when a bit of the spreading code is 1. Therefore, after accumulating the sum values according to the bit values of the spreading code, either the positive part or the negative part is larger than the other, provided the spreading codes are orthogonal and balanced. Hence, the original data bit can be decoded by comparing the values in the two accumulators. Namely, if the value in the positive accumulator is larger than the value in the negative accumulator, the original data bit is 1; otherwise, the original data bit is 0. 5.6.4. SPREADING CODE SELECTION: The Walsh code has the required orthogonality and balance properties. An S-bit (S = 2^N, where integer N > 1) Walsh code family has S−1 different sequences that are both orthogonal and balanced.
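The accumulator-based decoding just described can be sketched and checked end to end. The two 8-bit Walsh codes below are illustrative; decoding either sender's bit from the shared sum values works because the codes are orthogonal and balanced:

```python
# Decode one sender's bit from the per-chip sum values: sums at positions
# where the sender's code chip is 0 go to the positive accumulator, the
# rest to the negative accumulator; the larger accumulator gives the bit.
def decode(sums, code):
    pos = sum(s for s, c in zip(sums, code) if c == 0)
    neg = sum(s for s, c in zip(sums, code) if c == 1)
    return 1 if pos > neg else 0

def encode(bit, code):
    return [bit ^ c for c in code]

code_a = [0, 1, 0, 1, 0, 1, 0, 1]   # illustrative orthogonal, balanced
code_b = [0, 0, 1, 1, 0, 0, 1, 1]   # Walsh codes in 0/1 form

# Full round trip for every combination of the two senders' data bits.
for bit_a in (0, 1):
    for bit_b in (0, 1):
        sums = [x + y for x, y in zip(encode(bit_a, code_a),
                                      encode(bit_b, code_b))]
        assert decode(sums, code_a) == bit_a
        assert decode(sums, code_b) == bit_b
```

The interference from the other sender splits evenly between the two accumulators, so the comparison always recovers the intended bit.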

5.6.5. SPREADING CODE PROTOCOL SELECTION: There are several spreading code protocols:
1) Common Code Protocol (C protocol): All users in the network use the same spreading code to encode their data packets to be transferred.
2) Receiver-Based Protocol (R protocol): Each user in the network is assigned a unique spreading code used by the other users who want to send data to that user.
3) Transmitter-Based Protocol (T protocol): The unique spreading code allocated to each user is used by the user himself to transfer data to others.
4) Common-Transmitter-Based Protocol (C-T protocol): The destination address portion of a data packet is encoded using the C protocol, whereas the data portion of a packet is encoded using the T protocol.
5) Receiver-Transmitter-Based Protocol (R-T protocol): The same as the C-T protocol except that the destination address portion of a data packet is encoded using the R protocol.
6) Transmitter-Receiver-Based Protocol (T-R protocol): Two unique spreading codes are assigned to each user in the network, and a user generates a new spreading code from its two assigned codes for its data encoding.
Among these, the T and T-R protocols are conflict-free. Between them, the T-R protocol has a more complicated decoding scheme and requires a larger number of spreading codes (two per user) compared to the T protocol, so the T protocol is preferred. But in this case a Network Arbiter is required to tell the receiver who the sender is, so that it can select the proper spreading code for decoding. This combination is therefore called the A-T protocol.

5.6.6.CDMA NOC STRUCTURE:-

Fig. 5.5: CDMA NoC structure. The proposed CDMA NoC is a packet switched network that consists of Network Node, CDMA Transmitter, and Network Arbiter blocks. The functional IP blocks (functional hosts) are connected to the CDMA NoC through individual Network Node blocks. Because the different functional hosts may work at different clock frequencies, coordinating the data transfers among different clock domains would be a problem. A globally-asynchronous locally-synchronous (GALS) scheme has been proposed as a solution for this problem. Applying the GALS scheme to the CDMA NoC means that the communications between each functional host and its network node use the local clock frequency, while the communications between network nodes through the CDMA network are asynchronous.

5.6.6.1. NETWORK NODE: The functional sub-blocks of the Network Node are: 1) Node IF, 2) Tx/Rx Packet Buffer, 3) Packet Sender, and 4) Packet Receiver.

Fig. 5.6: Block diagram of the network node in CDMA NoC.
1) Node IF: This block receives data from the Network IF block of a functional host through the applied VCI or OCP standard. It then assembles the received data into packet format and sends the packet to the Tx Packet Buffer, or disassembles the received packet from the Rx Packet Buffer and sends the extracted data to the functional host.
2) Tx/Rx Packet Buffer: These two blocks are buffers that consist of asynchronous first-in first-out (FIFO) queues. The Tx Packet Buffer stores the data packets from the Node IF block and then delivers the packets to the Packet Sender block. The Rx Packet Buffer stores and delivers the received packets from the Packet Receiver to the Node IF.
3) Packet Sender: If the Tx Packet Buffer is not empty, the Packet Sender fetches a data packet from the buffer by an asynchronous handshake protocol. It then extracts the destination information from the fetched packet and sends the destination address to the Network Arbiter. After the Packet Sender gets the grant signal from the arbiter, it starts to send the data packet to the CDMA Transmitter.
4) Packet Receiver: After system reset, this block waits for the sender information from the Network Arbiter in order to select the proper spreading code for decoding. Once the spreading code for decoding is ready, the receiver sends an acknowledge signal back to the Network Arbiter, waits to receive and decode the data from the CDMA Transmitter, and then sends the decoded data to the Rx Packet Buffer in packet format.
5.6.6.2. NETWORK ARBITER: The Network Arbiter block is the core component implementing the A-T spreading code protocol. Under the A-T spreading code protocol, no sender node can start to send data packets to the CDMA Transmitter until it gets the grant signal from the Network Arbiter. The Network Arbiter takes charge of informing the requested receiver node to prepare the proper spreading code for decoding and sending a grant signal back to the

sender node. In the case that more than one sender node requests to send data to the same receiver node, simultaneously or at different times, the arbiter applies a round-robin arbitration scheme or the first-come first-served principle, respectively, to guarantee that only one sender sends data to a specific receiver at a time. However, if different sender nodes request to send data to different receiver nodes, these requests do not block each other and are handled in parallel by the Network Arbiter. The Network Arbiter in the CDMA NoC is different from the arbiter used in a conventional bus: the Network Arbiter here is only used to set up spreading codes for receiving, and it handles requests to different receivers in parallel, whereas a conventional bus arbiter allocates the usage of the common communication medium among the users in a time-division manner. 5.6.6.3. CDMA TRANSMITTER:-
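The arbitration policy described above can be sketched as follows. The class and method names are hypothetical, but the behavior matches the text: requests for different receivers are granted in parallel, while multiple requests for the same receiver are serialized in arrival order:

```python
from collections import defaultdict, deque

# Hypothetical network-arbiter sketch: one FCFS queue per receiver.
class NetworkArbiter:
    def __init__(self):
        self.queues = defaultdict(deque)   # receiver -> waiting senders
        self.busy = {}                     # receiver -> sender holding grant

    def request(self, sender, receiver):
        self.queues[receiver].append(sender)

    def grant_all(self):
        # Requests to different receivers never block each other.
        grants = {}
        for rx, q in self.queues.items():
            if rx not in self.busy and q:
                sender = q.popleft()       # FCFS / round-robin order
                self.busy[rx] = sender
                grants[rx] = sender        # receiver now loads sender's code
        return grants

    def release(self, receiver):
        self.busy.pop(receiver, None)

arb = NetworkArbiter()
arb.request("A", "R1"); arb.request("B", "R1"); arb.request("C", "R2")
grants = arb.grant_all()
assert grants == {"R1": "A", "R2": "C"}   # parallel grants; B waits for R1
arb.release("R1")
assert arb.grant_all() == {"R1": "B"}
```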

Fig. 5.7: Bit-synchronous transfer scheme. The CDMA Transmitter block takes care of receiving data packets from the network nodes and encoding the data to be transferred with the corresponding unique spreading code of the sender node. Although this block is realized using asynchronous circuits, it applies a bit-synchronous transfer scheme. This means that the data from different nodes are encoded and transmitted synchronously in terms of data bits rather than any clock signals. In Fig. 5.7 the principle of the bit-synchronous transfer scheme is illustrated by a situation in which network nodes A and B send data packets to the CDMA Transmitter simultaneously and node C sends a data packet later than A and B. In this situation, the data packet from node A is encoded and transmitted together with the data packet from node B synchronously in terms of each data bit. When the data packet from node C arrives at a later time point, the transmitter handles the data bits of packet C together with the data bits of packets A and B from the next start point of the time slot for the bit encoding and transmitting processes. The dashed-line frame at the head of packet C in Fig. 5.7 illustrates the waiting duration when packet C arrives in the middle of the time slot for handling the previous data bit. The time slot for handling a data bit is formed by a four-phase handshake process. The bit-synchronous transfer scheme avoids the interference that phase offsets among the orthogonal spreading codes would cause if the data bits from different nodes were encoded and transmitted asynchronously with respect to each other. Because the nodes in the network can request data transfer randomly and independently of each other, the CDMA Transmitter applies the first-come, first-served mechanism to ensure that data encoding and transmission are performed as soon as there is a data transfer request. 5.7. PROPOSED TOPOLOGY COMBINING PHOTONIC AND CDMA NOC TECHNOLOGY: 5.7.1. Architecture of the CDMA Transceiver: The transmitter and a single receiver block are shown in Figure 5.8 and Figure 5.9, respectively. At the transmitter side data bits from different senders are XORed with their corresponding unique orthogonal and balanced spreading codes. Then the encoded data from all senders are added and sent through a single line. At the receiver end this sum data is received through a demultiplexer which uses the corresponding spreading code of the sender as its select signal to receive that particular data. Depending on that spreading code, the sum data is accumulated into the positive part accumulator (if the spreading chip is 0) or the negative part accumulator (if the spreading chip is 1). Then, by comparing the contents of these two accumulators, the corresponding decoded data bit is generated. We have chosen the Walsh code for the CDMA PN code generation. The Walsh code has the required orthogonality and balance properties. An S-bit (S = 2^N, where integer N > 1) Walsh code family has S−1 different sequences that are both orthogonal and balanced. Here we use the A-T (arbitrated transmission) protocol.

Fig. 5.8: Schematic of CDMA transmitter

Fig. 5.9: Schematic of the CDMA receiver

5.7.2. The Photonic Network Topology: The proposed topology is shown in Figure 5.10. A laser source injects light into a waveguide through a grating. The Tx rings act as modulators of that light, while the Rx rings act as receivers and demodulators. The waveguide traverses the routing plane in a snake-like fashion, and at the very end it bends 180 degrees and traverses its previous path backwards in parallel. For each cluster, we allocate N−1 wavelengths for transmission, where N is the number of clusters. Thus there are in total N(N−1) wavelengths in the waveguide, and the total number of rings is 2N(N−1). For example, in the example shown above, there are 6 clusters, so there are 6 × 5 = 30 wavelength channels and 60 rings. If 16-bit PN codes are used for the CDMA transceivers, then there can be a maximum of 15 cores per cluster. So if the number of clusters is 6, then 6 × 15 = 90 cores can communicate simultaneously over the photonic interconnect. All the previous schemes, to our knowledge, need many more photonic components and clever design of the topology and other parameters to achieve the above connectivity.
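The resource counts in this paragraph follow directly from the number of clusters N and the code length S, and can be sanity-checked:

```python
# Resource counts for the proposed topology: N clusters, S-bit Walsh codes.
def topology_counts(n_clusters, code_bits):
    wavelengths = n_clusters * (n_clusters - 1)   # N(N-1) channels
    rings = 2 * wavelengths                       # one Tx + one Rx ring each
    cores_per_cluster = code_bits - 1             # S-1 usable Walsh codes
    max_concurrent = n_clusters * cores_per_cluster
    return wavelengths, rings, cores_per_cluster, max_concurrent

# The example in the text: 6 clusters, 16-bit PN codes.
assert topology_counts(6, 16) == (30, 60, 15, 90)
```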

Fig. 5.10: The proposed photonic topology

5.7.2.1. Photonic Components Utilized: In our design the principal photonic components used are micro-ring resonators, used as transmitters and receivers, and a fiber-to-waveguide coupler grating. Along with the above-mentioned components, waveguide bends are also used in order to complete the snake-like structure of our proposed topology.
i. Waveguide-to-fiber coupler: As silicon photonics enters mainstream technology, we need methods to seamlessly transfer light between the optical fibers of global-scale telecommunications networks and the on-chip waveguides used for signal routing and processing in local computing networks. Connecting these components directly results in high loss from their unequal sizes. Therefore, we need a coupler, which acts as an intermediary device to reduce loss through mode and index matching, and provides alignment tolerance. Other factors that come into play include the coupler's polarization dependence, the complexity of its design and fabrication process, and whether or not that process can be integrated with that of other optical components. Therefore, a good coupler design should not only perform well in these areas but also be feasible to fabricate [31].
ii. Micro-ring resonator: The simplest configuration of a micro-ring resonator device consists of one straight waveguide and one ring resonator. This device can be used as an all-pass filter for applications such as a phase equalizer, dispersion compensator, or optical delay line. Micro-ring resonators as optical filters have attracted much attention due to their high wavelength selectivity in combination with small size. They also have low on-chip insertion loss [29, 30].

5.7.2.2. The Routing Plane: A schematic of the routing plane is shown in Figure 5.11. The routers in the lower plane send/receive data to/from the upper plane by means of TSVs (through-silicon vias). The data of the lower logical plane must be stored in FIFOs before they are dispatched to their destinations. The CDMA Tx sends data simultaneously on all wavelengths. There is a facility to enable or disable each wavelength's driver and send data to the corresponding modulator.

Fig. 5.11: The Routing Plane containing the photonic devices.

As we are using the A-T protocol, transmitters of different clusters are not distinguished by using different PN codes; they are distinguished by using different wavelengths for transmission. So, for receiving from the cores of three different clusters, we need only one micro-ring and CDMA receiver pair.

Chapter 7 Future Work

Chapter 8 Conclusion

Our approach has been mainly to reduce the component count while at the same time maintaining the high performance of the NoC, with significant improvements in loss reduction. Due to the use of the adaptive CDMA architecture, the design gains a considerable degree of reconfigurability. Reuse of codes, flexible allocation of code bit length, and on-demand allocation of codes (and hence resources) within the topology increase the design's reconfigurability a great deal. Finally, we cannot ignore that the difficulty with parallelism lies not only in hardware; it is that too few important application programs have been rewritten to complete tasks sooner on multiprocessors. It is difficult to write software that uses multiple processors to complete one task faster, and the problem gets worse as the number of processors increases. So the utilization of parallel programming brings further challenges in scheduling, load balancing, time spent on synchronization, and the overhead of communication between the parties.

References
