
Dr.Y.Narasimha Murthy Ph.D yayavaram@yahoo.



INTRODUCTION: The Xilinx Programmable Gate Array, known as a Logic Cell Array (LCA), is a high-density CMOS IC that combines user programmability with the flexibility of a gate array architecture and the economy and testability of standard products. Xilinx reprogrammable architectures are used because of their flexibility, low price in small quantities, testability and short development time. Most design changes can be implemented by reprogramming the LCA, so a design can go directly from schematic capture to a production board.

The programmable logic blocks in the Xilinx family of FPGAs are called Configurable Logic Blocks (CLBs). The Xilinx architecture uses CLBs, I/O blocks, a switch matrix and an external memory chip to realize a logic function. The external memory stores the interconnection information, so the device can be reprogrammed simply by changing the configuration data stored in that memory.

XILINX Logic Cell Array: This architecture was introduced by Xilinx in 1985 for its FPGA devices and is proprietary to Xilinx. The LCA architecture consists of three major components: (i) Configurable Logic Blocks (CLBs), (ii) Input/Output Blocks (IOBs) and (iii) Programmable Interconnect. In addition, configuration memory holds the configuration program bits that control the configuration of the CLBs, IOBs and interconnect.

The LCA architecture consists of an interior matrix of logic blocks and a surrounding ring of I/O interface blocks. Interconnect resources occupy the channels between the rows and columns of logic blocks and between the logic blocks and the I/O blocks. Like a microprocessor, the LCA is a program-driven logic device. The functions of the LCA's configurable logic blocks and I/O blocks, and their interconnection, are controlled by a configuration program stored in on-chip memory. The configuration program is loaded automatically from an external memory on power-up or on command, or is programmed by a microprocessor as part of system initialization.


As the diagram below shows, the configuration memory consists of a distributed array of static memory cells. During configuration a cell is written through the data line, and during a read-back operation it is read through the same data line. During normal operation the pass transistor is off and continuous configuration control is provided.

There are five methods for loading configuration program data into the configuration memory: two load the data serially and three load it in a byte-wide parallel manner.

LCA performance is determined by the speed of the logic, the storage elements and the programmable interconnect. It is specified by the maximum toggle rate of a logic block storage element configured as a toggle flip-flop. In typical applications, system clock rates are one-third to one-half of the maximum flip-flop toggle rate.

The core of the LCA is a matrix of identical Configurable Logic Blocks (CLBs). Each CLB contains programmable combinational logic and storage registers. The combinational logic section of the block can implement any Boolean function of its input variables. The registers can be loaded from the combinational logic or directly from a CLB input, and the register outputs can be fed back as inputs to the combinational logic through an internal feedback path.
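As a quick numerical illustration of the clock-rate rule of thumb given above, the following sketch converts a maximum toggle rate into the typical system clock range. The function name and the 120 MHz figure are hypothetical and are not taken from any data sheet.

```python
def typical_clock_range_mhz(max_toggle_mhz):
    """Return (low, high) typical system clock rates in MHz.

    Uses the rule of thumb from the text: one-third to one-half
    of the maximum flip-flop toggle rate.
    """
    return max_toggle_mhz / 3, max_toggle_mhz / 2


# Hypothetical device with a 120 MHz maximum toggle rate.
low, high = typical_clock_range_mhz(120.0)
print(f"typical system clock: {low:.0f} to {high:.0f} MHz")
```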


The periphery of the Logic Cell Array is made up of user-programmable input/output blocks (IOBs). Each block can be programmed independently to be an input, an output or a bidirectional pin with three-state control. Inputs can be programmed to recognize either TTL or CMOS thresholds. Each IOB also includes flip-flops that can be used to buffer inputs and outputs.

The flexibility of the LCA comes from resources that permit program control of the interconnection of any two points on the chip. The interconnection resources include a two-layer metal network of lines that run horizontally and vertically in the rows and columns between the CLBs. Programmable switches connect the inputs and outputs of IOBs and CLBs to the nearest metal lines. Cross-point switches and interchanges at the intersections of rows and columns allow signals to be switched from one path to another. Long lines run the entire length or breadth of the chip, bypassing the interchanges, to distribute critical signals with minimum delay or skew.

Configurable Logic Block (CLB): The core of the FPGA is a matrix of identical Configurable Logic Blocks (CLBs). Each CLB contains a combinational logic array, program-controlled data multiplexers and flip-flops. The CLB also contains RAM memory cells and can be programmed to realize any function of five variables or any two functions of four variables. The functions are stored in truth-table form, so the number of gates required to realize a function is not important. In the figure below, each trapezoidal block represents a multiplexer that can be programmed to select one of its inputs. The block diagram of the CLB is shown below.
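To make the truth-table idea concrete, here is a minimal sketch of a lookup table in Python. The class name, the method names and the example function are illustrative assumptions; the only point taken from the text is that the function is stored as one bit per input combination, so its gate-level complexity does not matter.

```python
from itertools import product


class LookupTable:
    """K-input lookup table: one stored output bit per input combination."""

    def __init__(self, k, func):
        self.k = k
        # Evaluate func once for every input pattern and store the result.
        # product() yields patterns in counting order, first input as MSB.
        self.table = [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]

    def read(self, *bits):
        # The inputs form the memory address (first input is the MSB).
        address = 0
        for b in bits:
            address = address * 2 + (1 if b else 0)
        return self.table[address]


# Hypothetical five-variable function: (a AND b) OR (c AND d AND NOT e).
lut = LookupTable(5, lambda a, b, c, d, e: (a and b) or (c and d and not e))
print(len(lut.table), lut.read(1, 1, 0, 0, 0), lut.read(0, 0, 1, 1, 1))
```

Once the table is filled, reading an output is a single memory access, whatever expression was used to fill it.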

The array of CLBs provides the functional elements from which the user's logic is constructed. The logic blocks are arranged in a matrix within the perimeter of IOBs. For example, the XC3020A has 64 such blocks arranged in 8 rows and 8 columns. The development system is used to compile the configuration data that is loaded into the internal configuration memory to define the operation and interconnection of each block. User definition of CLBs and their interconnecting networks may be done by automatic translation from a schematic-capture logic diagram or, optionally, by installing library or user macros.

Each CLB has a combinatorial logic section, two flip-flops and an internal control section. There are five logic inputs (A, B, C, D and E), a common clock input (K), an asynchronous direct RESET input (RD) and an enable clock (EC). All may be driven from the interconnect resources adjacent to the blocks. Each CLB also has two outputs (X and Y), which may drive interconnect networks.

Data input for either flip-flop within a CLB is supplied from the function F or G outputs of the combinatorial logic, or from the block input DI. Both flip-flops in each CLB share the asynchronous RD which, when enabled and High, is dominant over clocked inputs. All flip-flops are reset by the active-Low chip input RESET, or during the configuration process. The flip-flops share the enable clock (EC) which, when Low, recirculates the flip-flops' present states and inhibits response to the data-in or combinatorial function inputs on a CLB. The user may enable these control inputs and select their sources. The user may also select the clock net input (K), as well as its active sense within each CLB. This programmable inversion eliminates the need to route both phases of a clock signal throughout the device.

The combinatorial logic portion of the CLB uses a 32 by 1 look-up table to implement Boolean functions. Variables selected from the five logic inputs and the two internal block flip-flops are used as table address inputs. The combinatorial propagation delay through the network is independent of the logic function generated and is spike-free for single-input variable changes. Partial functions of six or seven variables are implemented by using the input variable E to select dynamically between two functions of four different variables. For the two functions of four variables each, the independent results (F and G) may be used as data inputs to either flip-flop or to either logic block output. For the single function of five variables and for the merged functions of six or seven variables, the F and G outputs are identical.
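The flip-flop control rules described above (RD dominates, EC Low holds the present state) can be summarized in a small behavioural sketch. This is a simplification under stated assumptions: RD is asynchronous in the real device, whereas this hypothetical function only evaluates the state at an active clock edge, and the function and argument names are illustrative.

```python
def clb_flip_flop_next(q, d, ec, rd):
    """Next state of one CLB flip-flop, evaluated at an active clock edge.

    q  : present state (0 or 1)
    d  : selected data input (F, G or DI)
    ec : enable clock, active High
    rd : direct reset, active High when enabled
    """
    if rd:        # reset is dominant over clocked inputs
        return 0
    if not ec:    # EC Low recirculates the present state
        return q
    return d      # otherwise the data input is loaded


print(clb_flip_flop_next(q=1, d=0, ec=0, rd=0))
```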
Symmetry of the F and G functions and of the flip-flops allows the CLB outputs to be interchanged to optimize the routing efficiency of the networks interconnecting the CLBs and IOBs.

Input/Output Blocks (IOBs): The periphery of the Logic Cell Array is made up of user-programmable input/output blocks (IOBs). Each block can be programmed independently to be an input, an output or a bidirectional pin with three-state control. Each user-configurable IOB thus provides an interface between an external package pin of the device and the internal user logic. The IOB includes both registered and direct input paths. Each IOB also provides a programmable 3-state output buffer, which may be driven by a registered or direct output signal. Configuration options allow each IOB an inversion, a controlled slew rate and a high-impedance pull-up. Each input circuit also provides input clamping diodes for electrostatic protection, and circuits to inhibit latch-up produced by input currents.

Each IOB also includes input and output storage elements and I/O options selected by configuration memory cells. A choice of two clocks is available on each die edge. The polarity of each clock line (not of each flip-flop or latch) is programmable. A clock line that triggers the flip-flop on the rising edge is an active-Low latch enable (latch transparent) signal, and vice versa. Passive pull-up can be enabled only on inputs, not on outputs. All user inputs are programmed for TTL or CMOS thresholds.

The input-buffer portion of each IOB provides threshold detection to translate external signals applied to the package pin into internal logic levels. The global input-buffer threshold of the IOBs can be programmed to be compatible with either TTL or CMOS levels. The buffered input signal drives the data input of a storage element, which may be configured as either a flip-flop or a latch. The clocking polarity (rising or falling edge-triggered flip-flop, High or Low transparent latch) is programmable for each of the two clock lines on each of the four die edges. Note that a clock line driving a rising edge-triggered flip-flop makes any latch driven by the same line on the same edge Low-level transparent, and vice versa (falling edge, High transparent). All Xilinx primitives in the supported schematic-entry packages, however, are positive edge-triggered flip-flops or High transparent latches. When one clock line must drive flip-flops as well as latches, the difference in clocking polarities must be compensated with an additional inverter, either in the flip-flop clock input or in the latch-enable input. I/O storage elements are reset during configuration or by the active-Low chip RESET input. Both direct input (from IOB pin I) and registered input (from IOB pin Q) signals are available for interconnect.

Programmable interconnection resources in the Field Programmable Gate Array provide routing paths to connect inputs and outputs of the IOBs and CLBs into logic networks. Interconnections between blocks are composed of a two-layer grid of metal segments. Specially designed pass transistors, each controlled by a configuration bit, form programmable interconnect points (PIPs) and switching matrices used to implement the necessary connections between selected metal segments and block pins. The figure below is an example of a routed net. The development system provides automatic routing of these interconnections. Interactive routing is also available for design optimization. The inputs of the CLBs or IOBs are multiplexers that can be programmed to select an input network from the adjacent interconnect segments. Since the switch connections to block inputs are unidirectional, as are block outputs, they are usable only for block input connection and not for routing. The figure below illustrates routing access to logic block input variables, control inputs and block outputs.


Three types of metal resources are provided to fulfill various network interconnect requirements:
(i) General Purpose Interconnect
(ii) Direct Connection
(iii) Long lines (multiplexed busses and wide AND gates)

General Purpose Interconnect: It consists of a grid of five horizontal and five vertical metal segments located between the rows and columns of logic blocks and IOBs. Each segment is the height or width of a logic block. Switching matrices join the ends of these segments and allow programmed interconnections between the metal grid segments of adjoining rows and columns. The switches of an unprogrammed device are all non-conducting. The connections through the switch matrix may be established by the automatic routing or by selecting the desired pairs of matrix pins to be connected or disconnected.

Special buffers within the general interconnect areas provide periodic signal isolation and restoration for improved performance of lengthy nets. The interconnect buffers can propagate signals in either direction on a given general interconnect segment. These bidirectional (bidi) buffers are found adjacent to the switching matrices, above and to the right. The other PIPs adjacent to the matrices provide access to or from Long lines. The development system automatically defines the buffer direction based on the location of the interconnection network source. The delay calculator of the development system automatically calculates and displays the block, interconnect and buffer delays for any selected path. Generation of the simulation netlist with a worst-case delay model is provided.

Direct Interconnect: Direct interconnect provides the most efficient implementation of networks between adjacent CLBs or I/O blocks. Signals routed from block to block using the direct interconnect exhibit minimum interconnect propagation delay and use no general interconnect resources.

For each CLB, the X output may be connected directly to the B input of the CLB immediately to its right and to the C input of the CLB to its left. The Y output can use direct interconnect to drive the D input of the block immediately above and the A input of the block below. Direct interconnect should be used to maximize the speed of high-performance portions of logic. Where logic blocks are adjacent to IOBs, direct connect is provided alternately to the IOB inputs (I) and outputs (O) on all four edges of the die. The right edge provides additional direct connects from CLB outputs to adjacent IOBs.

Long lines: The Long lines bypass the switch matrices and are intended primarily for signals that must travel a long distance, or that must have minimum skew among multiple destinations. Long lines run vertically and horizontally the height or width of the interconnect area. Each interconnection column has three vertical Long lines, and each interconnection row has two horizontal Long lines.
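Returning to the direct interconnect rule stated above (X drives B on the right and C on the left, Y drives D above and A below), the following sketch lists the CLB-to-CLB direct connections for a block at a given grid position. The 8 by 8 default matches the XC3020A mentioned earlier; the coordinate convention (row 0 at the top) and the function name are assumptions, and the direct connections to edge IOBs are not modelled.

```python
def direct_connections(row, col, rows=8, cols=8):
    """List (output, (target_row, target_col), target_input) direct connections.

    Row 0 is assumed to be the top row, column 0 the leftmost column.
    """
    candidates = [
        ("X", (row, col + 1), "B"),  # CLB immediately to the right
        ("X", (row, col - 1), "C"),  # CLB immediately to the left
        ("Y", (row - 1, col), "D"),  # CLB immediately above
        ("Y", (row + 1, col), "A"),  # CLB immediately below
    ]
    # Keep only targets that lie inside the CLB array.
    return [
        (out, (r, c), inp)
        for out, (r, c), inp in candidates
        if r in range(rows) and c in range(cols)
    ]


print(direct_connections(0, 0))
print(len(direct_connections(3, 3)))
```

A corner block has only two CLB neighbours, while an interior block has all four.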

Vertical and Horizontal Long Lines


Two additional Long lines are located adjacent to the outer sets of switching matrices. Long lines can be driven by a logic block or IOB output on a column-by-column basis. This capability provides a common low-skew control or clock line within each column of logic blocks. Isolation buffers are provided at each input to a Long line and are enabled automatically by the development system when a connection is made.

Technology Mapping for FPGA: An FPGA consists of a regular array of logic blocks that implement combinational and sequential logic functions, and a user-programmable routing network that provides connections between the logic blocks. In conventional ASIC implementation technologies such as Mask Programmed Gate Arrays (MPGAs) and Standard Cells, the connections between logic blocks are implemented by metallization at a fabrication facility. In an FPGA the connections are implemented in the field using the user-programmable routing network. This reduces manufacturing turn-around times drastically, from weeks to minutes, and reduces prototype costs. The limitations are the density and performance penalties associated with user-programmable routing. The programmable connections, which consist of metal wire segments connected by programmable switches, occupy greater area and incur greater delay than simple metal wires. To reduce the density penalty, FPGA architectures employ highly functional logic blocks, such as lookup tables, that reduce the total number of logic blocks and hence the number of programmable connections needed to implement a given application. These complex logic blocks also reduce the performance penalty by reducing the number of logic blocks and programmable connections on the critical paths in the circuit. The high functionality of FPGA logic blocks presents new challenges for logic synthesis. Technology mapping addresses these challenges for FPGAs that use lookup tables to implement combinational logic.
That is, technology mapping is the process of transforming a technology-independent Boolean network into a technology-dependent network. For example, a K-input lookup table (LUT) is a digital memory that can implement any Boolean function of K variables. The K inputs are used to address a 2^K by 1 bit memory that stores the truth table of the Boolean function. Lookup tables have been shown to be an area-efficient method of implementing combinational functions, and the delays of LUT-based FPGAs compare favourably with the delays of FPGAs using other types of logic blocks. The goal of technology mapping is to reduce area, delay or a combination of both.

Technology mapping is the logic synthesis task that is directly concerned with selecting the circuit elements used to implement the optimized circuit. Previous approaches to technology mapping have focused on using circuit elements from a limited set of simple gates. Such approaches are inappropriate for complex logic blocks, where each logic block can implement a large number of functions. A K-input lookup table can implement 2^(2^K) different functions (65,536 for K = 4). For values of K greater than 3 the number of different functions becomes too large for conventional technology mapping. Therefore new approaches to technology mapping are required for LUT-based FPGAs.

Library-Based Technology Mapping: In library-based mapping, gates or components are selected from a technology library to implement a circuit; hence it is also referred to as library binding. This method generates a technology mapping for a given Boolean network using a characterized cell library, with the objective of cost optimization or delay optimization. Standard Cells and Mask Programmed Gate Arrays implement combinational functions using a limited set of simple gates, and for such ASIC technologies library-based technology mapping is very useful. In this methodology the set of available circuit elements is represented as a library of functions, and the construction of the optimized circuit is divided into three subproblems: (i) Decomposition, (ii) Matching and (iii) Covering.

The original network is first decomposed into a canonical representation that uses limited fan-in NAND nodes. This decomposition guarantees that no node in the network is too large to be implemented by a library element, provided the library includes NAND gates that reach the fan-in limit. After decomposition the network is partitioned into a forest of trees. The optimal subcircuit covering each tree is constructed, and finally the circuit covering the entire network is assembled from these subcircuits.

To form the forest of trees, the decomposed network is partitioned at fan-out nodes into a set of single-output subnetworks. Each of these subnetworks is either a tree or a leaf DAG (Directed Acyclic Graph). A leaf DAG is a multi-input, single-output DAG in which only the input nodes have fan-out greater than one. Each leaf DAG is converted into a tree by creating a unique instance of every input node for each of its multiple fan-out edges.

The optimal circuit implementing each tree is constructed using a dynamic programming traversal that proceeds from the leaf nodes to the root node. For every node in the tree, an optimal circuit implementing the subtree extending from the node to the leaf nodes is constructed. This circuit consists of a library element that matches a subfunction rooted at the node, together with previously constructed circuits implementing its inputs. The cost of the circuit is calculated from the cost of the matched library element and the cost of the circuits implementing its inputs. To find the lowest-cost circuit, the DAGON technology mapper first finds all library elements that match subfunctions rooted at the node. The cost of the circuit using each of these candidate library elements is then calculated, and the lowest-cost circuit is retained. The set of matching library elements is found by searching through the library and using tree matching to determine whether each library element matches a subfunction rooted at the node.

As an example, consider the library shown in figure (a) below and the circuit shown in figure (b). The circuit elements are standard cells and their costs are given in terms of cell area. The costs of the INV, NAND-2 and AOI-21 cells are 2, 3 and 4 respectively. In figure (b) the only library element matching at node E is the NAND-2, so the cost of the optimal circuit implementing node E is 3. At node C the only matching library element is also the NAND-2. The cost of the NAND-2 is 3 and the cost of the optimal circuit implementing its input E is also 3. Therefore the cumulative cost of the optimal circuit implementing node C is 6.
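The dynamic programming traversal described above can be sketched in a few lines of Python. The figures are not reproduced in this text, so the subject tree below is an assumption reconstructed from the node names and costs quoted in the example (A is an inverter driven by NAND node B, whose inputs are NAND node C and inverter D, and C is driven by NAND node E). The tuple encoding, the input names p, q, r, s and the function names are all hypothetical; this is an illustration of the covering idea, not the actual DAGON implementation.

```python
from functools import lru_cache

# Library patterns over NAND/INV nodes; strings are pattern inputs.
# AOI-21 computes NOT(a*b + c) = INV(NAND(NAND(a, b), INV(c))).
LIBRARY = {
    "INV": (("INV", "a"), 2),
    "NAND-2": (("NAND", "a", "b"), 3),
    "AOI-21": (("INV", ("NAND", ("NAND", "a", "b"), ("INV", "c"))), 4),
}


def match(pattern, node):
    """Return the subtrees bound to the pattern inputs, or None if no match."""
    if isinstance(pattern, str):      # a pattern input binds to any subtree
        return [node]
    if isinstance(node, str) or pattern[0] != node[0] or len(pattern) != len(node):
        return None
    kids = node[1:]
    # NAND is commutative, so try both input orders.
    orders = [kids, kids[::-1]] if pattern[0] == "NAND" else [kids]
    for order in orders:
        bound = []
        for p, n in zip(pattern[1:], order):
            m = match(p, n)
            if m is None:
                break
            bound += m
        else:
            return bound
    return None


@lru_cache(maxsize=None)
def best_cost(node):
    """Lowest cost of a circuit implementing the subtree rooted at node."""
    if isinstance(node, str):         # primary input: nothing to implement
        return 0
    costs = []
    for pattern, cost in LIBRARY.values():
        inputs = match(pattern, node)
        if inputs is not None:
            costs.append(cost + sum(best_cost(i) for i in inputs))
    return min(costs)


# Assumed subject tree (primary inputs p, q, r, s are hypothetical names).
E = ("NAND", "p", "q")
C = ("NAND", E, "r")
D = ("INV", "s")
B = ("NAND", C, D)
A = ("INV", B)
print(best_cost(E), best_cost(C), best_cost(A))
```

Under these assumptions the sketch reproduces the costs of 3 at node E and 6 at node C, and at the root it prefers the AOI-21 cover (cost 7) over the INV cover (cost 13).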


Finally the algorithm reaches node A. For node A there are two matching library elements: the INV, as used in figure (b), and the AOI-21, as used in figure (c). The circuit constructed using the INV matching A includes a NAND-2 implementing node B, a NAND-2 implementing node C, an INV implementing node D and a NAND-2 implementing node E. The cumulative cost of this circuit is 13. The circuit constructed using the AOI-21 matching A includes a NAND-2 implementing node E. The cumulative cost of this circuit is 7. The circuit using the AOI-21 is therefore the optimal circuit implementing node A.

The major obstacle to applying library-based technology mapping to LUT circuits is the large number of different functions that a K-input LUT can implement. The function implemented by a K-input LUT is determined by the values stored in its 2^K memory bits. Since each bit can independently be either 0 or 1, there are 2^(2^K) different Boolean functions of K variables. For values of K greater than 3 the library required to represent a K-input LUT becomes very large. The size of the library can be reduced by noting that some patterns are equivalent after a permutation of inputs. The inversion of outputs or inputs, which is trivially accomplished with a LUT, can also produce equivalent patterns. Another alternative is to use a partial library tuned to take advantage of the network structure likely to be produced by technology-independent logic optimization. The limitation of this approach is that it precludes some opportunities for optimization of the final circuit.

LUT-based Technology Mapping: These limitations of earlier technology mapping approaches, in particular the very large library needed to represent a K-input LUT, paved the way for technology mapping methods that deal specifically with LUT circuits. The first LUT-based technology mappers appeared in the 1990s and were later improved to optimize the delay of LUT circuits by minimizing the number of levels of LUTs in the final circuit.

In LUT-based FPGAs (for example Xilinx FPGAs) the building blocks are LUTs and flip-flops. The basic programmable logic block is a K-input lookup table (K-LUT), which can implement any Boolean function of up to K variables. Technology mapping for LUT-based FPGA designs covers a general Boolean network with K-LUTs to obtain a functionally equivalent K-LUT network. The main objectives in LUT mapping are:
(i) Cost-optimal mapping, i.e. minimizing the number of LUTs and minimizing the number of CLBs.
(ii) Delay-optimal mapping, i.e. minimizing the number of LUT levels and minimizing the delays (including routing delays).
(iii) Maximizing the routability of the mapping schemes.

LUT-based technology mapping can be implemented using two types of algorithms: (a) the area algorithm and (b) the delay algorithm.

The Area Algorithm: A circuit can be implemented by a given FPGA only if the number of logic blocks in the circuit does not exceed the available number of logic blocks, and the required connections between the logic blocks do not exceed the capacity of the routing network. The area algorithm minimizes the total number of K-input LUTs in the circuit implementing a given network. Minimizing the number of LUTs allows larger networks to be implemented by the fixed number of logic blocks available in a given LUT-based FPGA.

In the area algorithm, the original network is first partitioned into a forest of trees and each tree is then separately mapped into a circuit of K-input LUTs. The final circuit is assembled from the circuits implementing the trees. The main principle of the area algorithm is that it addresses the decomposition and matching problems simultaneously, using a bin packing approximation algorithm.

The correct decomposition of network nodes can reduce the number of LUTs required to implement the network. For example, consider the circuit of 5-input LUTs shown in figure (a) below. The shaded OR node is not decomposed, and four 5-input LUTs are required to implement the network. However, if the OR node is decomposed into two nodes, as shown in figure (b), only two LUTs are required. The main problem is to find the decomposition of every node in the network that minimizes the number of LUTs in the final circuit.
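The bin packing view of the area algorithm can be illustrated with a small sketch. The text only says that a bin packing approximation is used; the first-fit-decreasing heuristic below is an assumption chosen because it is a common approximation, and the function name and example sizes are hypothetical. Each item is the number of LUT inputs needed by one group of fan-in signals at a node, and each bin is one K-input LUT. A complete mapper would also have to connect the resulting LUTs into a tree, which consumes further inputs and is not modelled here.

```python
def pack_into_luts(input_counts, k):
    """First-fit-decreasing packing of fan-in groups into K-input LUTs.

    input_counts: number of LUT inputs each fan-in group needs.
    Returns a list of bins; each bin lists the group sizes sharing one LUT.
    """
    bins = []
    for n in sorted(input_counts, reverse=True):
        if n > k:
            raise ValueError("a group needing more than K inputs cannot fit in one LUT")
        for b in bins:
            if k >= sum(b) + n:   # first LUT with enough unused inputs
                b.append(n)
                break
        else:
            bins.append([n])      # no existing LUT fits: open a new one
    return bins


# Hypothetical fan-in groups at one node, packed into 5-input LUTs.
print(pack_into_luts([2, 3, 1, 2, 2], k=5))
```

Fewer bins means fewer LUTs at that node, which is the quantity the area algorithm tries to minimize.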

The Delay Algorithm: Unlike the area algorithm, which decomposes nodes to reduce the total number of LUTs, the delay algorithm decomposes nodes to minimize the number of levels in the final circuit. For example, consider the circuit of 5-input LUTs shown in figure (a). In this figure the number in the lower right-hand corner of a LUT indicates its depth, which is the maximum number of LUTs along any path from a primary input to the output of the LUT. The LUTs preceding the AND nodes are not shown in this figure. In figure (a) the shaded OR node is not decomposed and 5 levels of LUTs are required to implement the network. However, if the OR node is decomposed into the two nodes shown in figure (b), only 4 levels of LUTs are required.

The delay algorithm, like the area algorithm, first partitions the original network into a forest of trees, maps each tree separately into a circuit of K-input LUTs, and then assembles the circuit implementing the entire network from the circuits implementing the trees. The trees are mapped in breadth-first order, proceeding from the primary inputs toward the primary outputs. This ensures that when a tree is mapped, the trees implementing its leaf nodes have already been mapped. The overall strategy of the delay algorithm is to minimize the number of levels of LUTs by minimizing the depth of every path in the final circuit. This can result in a circuit that contains a large number of LUTs.

MULTIPLEXER BASED TECHNOLOGY MAPPING: Multiplexer-based technology mapping is used for Actel FPGAs, whose logic blocks are MUX based; it is also relevant to Xilinx Virtex-6 devices, whose LUT-based slices include dedicated multiplexers. In Actel FPGAs the multiplexers are small, which suits the objectives of area optimization and minimum delay.

Circuits usually contain a large number of multiplexers (MUXes). This is especially true of circuits that are synthesized automatically from high-level descriptions. MUXes exist in the data paths of circuits, where they are used to route operands to operators. Also, control logic is frequently specified as a CASE statement in HDL descriptions, and MUXes arise from a direct translation of CASE statements into a logic-level description. Cell libraries, too, contain various choices of MUXes. Cell implementations make use of the fact that a pass-gate implementation of a MUX is both faster and smaller. In MUX-based FPGAs such as Actel's, MUXes are naturally present in the virtual library. Thus a method for mapping MUXes in the unmapped network to those in the library is desirable.

The significance of multiplexer synthesis is mainly due to multiplexer-based FPGAs such as the ACT family from Actel, where the basic building block consists of multiplexers. Each basic building block of the ACT family allows the implementation of a multiplexer (a) and, in the case of the ACT 1 family, the implementation of three hierarchical multiplexers (b), which is denoted by act0. The ACT 2 family allows only a restricted realization of three hierarchical multiplexers, as can be seen in Fig. (b).

Basic building block of the ACT family: (a) ACT 1 family; (b) ACT 2 family.

The main objective of MUX-based technology mapping is to describe a combinational circuit in terms of Boolean equations and to realize it using the minimum number of basic blocks of the target MUX-based architecture, while minimizing the delay on the critical path. In this algorithm an appropriate base function, a library of cells and a set of pattern graphs are selected. As an example, let us select a 2-to-1 multiplexer as the base function.
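One simple way to express a Boolean function in terms of a 2-to-1 multiplexer base function is Shannon expansion, sketched below. This is a standard textbook technique used here only for illustration; it is not claimed to be the mapping algorithm of the cited references, and it does not produce a minimum-size result (for instance, it uses only constants as data inputs). The tuple encoding and function names are assumptions.

```python
def build_mux_tree(table, var=0):
    """Shannon-expand a truth table into nested 2-to-1 multiplexers.

    table: list of 0/1 outputs, first variable as the MSB of the index.
    Returns a constant 0/1 or a tuple ("MUX", select_var, data0, data1).
    """
    if len(table) == 1:
        return table[0]
    half = len(table) // 2
    low = build_mux_tree(table[:half], var + 1)    # cofactor with variable = 0
    high = build_mux_tree(table[half:], var + 1)   # cofactor with variable = 1
    if low == high:               # function does not depend on this variable
        return low
    return ("MUX", var, low, high)


def evaluate(tree, inputs):
    """Evaluate the mux tree for a tuple of input values."""
    while isinstance(tree, tuple):
        _, var, low, high = tree
        tree = high if inputs[var] else low
    return tree


def count_muxes(tree):
    """Number of 2-to-1 multiplexers in the tree."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + count_muxes(tree[2]) + count_muxes(tree[3])


xor_tree = build_mux_tree([0, 1, 1, 0])   # two-input XOR
print(xor_tree, count_muxes(xor_tree))
```

A technology mapper would then try to cover such a network of base functions with as few logic blocks of the target architecture as possible.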


The above figure shows two MUX structures, STRUCT and STRUCT1. Four pattern graphs are constructed for STRUCT1, as shown in the figure below. If a function is realizable by one STRUCT1 block, it uses either all the multiplexers, two of them, or just one. The pattern graphs are in one-to-one correspondence with these possibilities, so only a very small set of patterns is needed to capture all possible functions realizable by one STRUCT1 block. From the figure it is clear that the pattern graph uses all the multiplexers.

The introduction of the OR gate at the select input of MUX3 increases the number of functions realized by the block. From an algorithmic point of view it creates some problems, but a modification of the algorithm is considered to handle the OR structure.

The advantage of MUX-based technology mapping is that it generates optimal mappings, which are often much better than those produced by conventional heuristic techniques. Moderately large circuits can be mapped optimally in a small amount of time. Very large circuits can be mapped near-optimally by partitioning the circuits and mapping each partition individually.
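The three-multiplexer block with an OR gate at the select input of the output multiplexer (MUX3 in the text above) can be modelled behaviourally as follows. The figure is not reproduced here, so the signal names (a0, a1, sa, b0, b1, sb, s0, s1) and the exact wiring are assumptions based on the usual drawing of such a block; treat this as a sketch rather than a description of a specific device.

```python
def mux2(sel, d0, d1):
    """2-to-1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def mux_block(a0, a1, sa, b0, b1, sb, s0, s1):
    """Two first-level muxes feeding MUX3, whose select is (s0 OR s1)."""
    first = mux2(sa, a0, a1)
    second = mux2(sb, b0, b1)
    return mux2(s0 | s1, first, second)


def and2(a, b):
    """Two-input AND realized with one block (only the first mux is used)."""
    return mux_block(a0=0, a1=b, sa=a, b0=0, b1=0, sb=0, s0=0, s1=0)


def or3(a, b, c):
    """Three-input OR using the OR gate on the MUX3 select input."""
    return mux_block(a0=0, a1=1, sa=c, b0=1, b1=1, sb=0, s0=a, s1=b)


print([and2(a, b) for a in (0, 1) for b in (0, 1)])
```

The and2 case uses a single multiplexer of the block, while or3 relies on the OR gate, which is why the gate enlarges the set of functions one block can realize.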



References:
(i) Technology Mapping for Lookup Table Based Field Programmable Gate Arrays, Robert J. Francis.
(ii) Technology Mapping for Field-Programmable Gate Arrays Using Integer Programming, Amit Chowdhary and John P. Hayes.
(iii) Experiences with XILINX Programmable Gate Arrays, J. Molendijk and U. Wehrle.