603 Paper

FPGA Placement Methodologies: A Survey
Xiaoyu Shi Department of Computing Science, University of Alberta, Edmonton, Canada xshi@cs.ualberta.ca
Abstract
The Field Programmable Gate Array (FPGA), a programmable integrated circuit, has gained great popularity in circuit design since its first introduction in 1984. Placement in an FPGA decides the physical locations and interconnections of the logic blocks in the circuit design, and it has now become the bottleneck of circuit performance. In this survey paper, we review the classic placement methods that have been used for the past two decades along with some modern placement techniques from the last 5-10 years. In particular, we focus on four categories of placement methods: simulated annealing, min-cut, quadratic and parallel approaches. The methodology of each algorithm is presented, with an emphasis on comparing performance and evaluating advantages and disadvantages.
Introduction
The popularity of Field Programmable Gate Arrays (FPGAs) for implementing integrated circuits has increased dramatically in recent years. The prime advantages of FPGAs are their fast manufacturing turnaround time, low start-up costs and ease of design, which together involve less financial risk [8]. However, new challenges have emerged since FPGAs reached the million-gate level. Design and development with FPGAs suffer from long placement times, even though turnaround time is crucial [4]. The placement problem has become the bottleneck of circuit performance in FPGAs. For the next generation of Computer Aided Design (CAD) tools for FPGAs, fast and high-quality placement methods are critical.

In this survey, we use the island style FPGA model [6]. The generic structure of an island style FPGA consists of four main parts. Configurable Logic Blocks (CLBs) are the basic logic blocks and implement the logic functions of the circuit. Input/Output Blocks (IOBs) connect the FPGA to external devices. The connection block connects a CLB to the routing channels, while the switch block connects the routing channels to each other [6].

In the placement step, the netlist of logic blocks is placed onto the FPGA [6]. The goal of placement is to put each block in a location such that the objective function is minimized. There are three common optimization criteria for placement: time-driven, wire-length-driven and path-driven. Time-driven placement attempts to minimize the delay in the circuit, while wire-length-driven placement aims to minimize the total wire used. Path-driven placement focuses on the logic blocks on the critical path of the circuit, so that both timing and wire length can be optimized. In the following sections, the paper reviews four categories of FPGA placement methods, compares their experimental results and analyzes their performance.
Simulated Annealing Placement
Simulated annealing placement mimics the annealing process used to gradually cool molten metal into high-quality metal objects. An initial placement is created by randomly placing the logic blocks in the circuit. A large number of block swaps are then made to gradually reduce the cost. In this section, the well-known Versatile Place and Route (VPR) tool, which uses simulated annealing, is reviewed [9].
2.1. Overview of Simulated Annealing

An annealing process lets molecules cool down in a manner controlled by temperature so that they find their best fit in the system. The simulated annealing algorithm is based on random movements of logic blocks, each of which is called a move [6]. A cost function is defined to evaluate the quality of the placement, and the following linear congestion cost function provides the best results in a reasonable computation time [9]:

$$\mathrm{Cost} = \sum_{n=1}^{N_{nets}} q(n) \left[ \frac{bb_x(n)}{C_{av,x}(n)} + \frac{bb_y(n)}{C_{av,y}(n)} \right]$$
The cost is a summation over the bounding boxes of all nets in the circuit. For the blocks $(x_i, y_i)$ in one net, the coordinates of the bounding box are defined as $(x_{min}, x_{max}, y_{min}, y_{max})$, and $bb_x(n)$ and $bb_y(n)$ are the horizontal and vertical spans of the bounding box of net $n$. A bounding box of one net is illustrated in Figure 1, where the dashed line is the bounding box of a net consisting of four blocks. $C_{av,x}(n)$ and $C_{av,y}(n)$ are the average channel capacities in the x and y directions over the bounding box of net $n$ [6]. In addition, a compensation factor $q(n)$ corrects for the wire-length underestimation introduced by the bounding box method.

Figure 1. Bounding Box of One Net.

Simulated annealing starts with a random placement of each logic block in the circuit. After the initial placement, a certain number of moves are performed at each temperature, and each move is checked to see whether it reduces the cost. If the cost decreases, the move is always accepted. If the cost increases, there is still a probability that the move is accepted. This probability is given by $e^{-\Delta C / T}$, where $\Delta C$ is the change in the cost function that the move causes and $T$ is the temperature. This hill-climbing ability allows simulated annealing to escape local minima and thus to approach a global optimum [9].

A good annealing schedule is essential to the final results. With the motivation of increasing the amount of time spent at temperatures where a significant fraction of moves are being accepted, the following temperature update schedule is used in VPR:

$$T_{new} = \alpha \cdot T_{old}$$

where $\alpha$ is defined as shown in Table 1 and $R_{accept}$ is the fraction of moves that were accepted at the old temperature.

Table 1. Temperature Update Schedule

  $R_{accept}$                        $\alpha$
  $R_{accept} > 0.96$                 0.5
  $0.8 < R_{accept} \le 0.96$         0.9
  $0.15 < R_{accept} \le 0.8$         0.95
  $R_{accept} \le 0.15$               0.8

Even with a good annealing schedule, millions of block swaps are evaluated at each temperature. The most time-consuming and computationally intensive part is calculating the cost change caused by each swap, so it is crucial to make this part as fast as possible. VPR also uses some heuristics to speed up this process, such as incremental net bounding box updates and a changing distance limit on the range of moves [6].
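The pieces described above (bounding box cost, acceptance rule and the temperature update of Table 1) can be put together in a minimal sketch. This is an illustration under simplifying assumptions, not VPR's implementation: $q(n)$ and the channel capacities are set to 1, the bounding box span is taken as max minus min, every move is a swap of two blocks, and the cost is recomputed from scratch instead of incrementally. The function names and the parameters `t_start`, `t_stop` and `moves_per_temp` are hypothetical.

```python
import math
import random

def bounding_box_cost(nets, pos, q=None, c_av=1.0):
    """Linear congestion cost: sum over nets of q(n) * (bb_x/C_av + bb_y/C_av).
    Assumption: uniform channel capacity c_av and q(n) = 1 unless given."""
    total = 0.0
    for n, blocks in enumerate(nets):
        xs = [pos[b][0] for b in blocks]
        ys = [pos[b][1] for b in blocks]
        bb_x = max(xs) - min(xs)          # horizontal span of the net
        bb_y = max(ys) - min(ys)          # vertical span of the net
        qn = 1.0 if q is None else q[n]
        total += qn * (bb_x / c_av + bb_y / c_av)
    return total

def alpha(r_accept):
    """Temperature multiplier from Table 1."""
    if r_accept > 0.96:
        return 0.5
    if r_accept > 0.8:
        return 0.9
    if r_accept > 0.15:
        return 0.95
    return 0.8

def accept(delta_c, temperature, rng):
    """Always accept improvements; accept uphill moves with prob. exp(-dC/T)."""
    if delta_c <= 0:
        return True
    return rng.random() < math.exp(-delta_c / temperature)

def anneal(nets, pos, t_start=10.0, t_stop=0.01, moves_per_temp=200, seed=0):
    """Swap-based annealing loop (hypothetical parameters, full cost recompute)."""
    rng = random.Random(seed)
    pos = dict(pos)
    blocks = sorted(pos)
    cost = bounding_box_cost(nets, pos)
    t = t_start
    while t > t_stop:
        accepted = 0
        for _ in range(moves_per_temp):
            a, b = rng.sample(blocks, 2)
            pos[a], pos[b] = pos[b], pos[a]          # propose a swap
            new_cost = bounding_box_cost(nets, pos)
            if accept(new_cost - cost, t, rng):
                cost = new_cost
                accepted += 1
            else:
                pos[a], pos[b] = pos[b], pos[a]      # undo the rejected swap
        t *= alpha(accepted / moves_per_temp)        # T_new = alpha * T_old
    return pos, cost

# Toy netlist: two nets over four blocks on a small grid.
nets = [["a", "b"], ["b", "c", "d"]]
start = {"a": (0, 0), "b": (3, 0), "c": (0, 2), "d": (1, 1)}
final_pos, final_cost = anneal(nets, start)
```

Recomputing the full cost for every move is exactly the expense that the incremental bounding box update is meant to avoid; the sketch keeps the full recomputation only for brevity.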
2.2. Pros and Cons

The simulated annealing placer has several advantages. First, it outperforms the other placers wherever direct comparisons can be made [9]. The FPGA CAD tool VPR, which uses the simulated annealing method, has become the state-of-the-art tool in this field. Second, the simulated annealing placer has an open cost function, which can be defined as wire-length-driven, time-driven or path-driven. The cost function can also be a linear combination of these types, though it is hard to decide the weights. Third, simulated annealing can reach the global minimum because of its hill-climbing ability. However, simulated annealing is very slow because evaluating each move is computationally expensive and time consuming. Besides, due to the inherently sequential nature of simulated annealing, it is very hard to parallelize on multi-core CPUs or clusters.

Quadratic Placement
The quadratic placement method uses the squared wire length as the objective function. It minimizes the cost by solving systems of linear equations [12]. Although quadratic placement only considers the squared wire length, it can finish the placement process efficiently with almost no loss of quality. As a result, it is widely used in VLSI placement [12].

3.1. Overview of Quadratic Placement

The input for a quadratic placer is a hyper-graph netlist, and the process tries to minimize the total weighted squared distance between every two nodes. The cost is computed according to the formula:

$$\Phi(x, y) = \frac{1}{2} \sum_{i=1, j=1}^{m} W_{i,j} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right]$$

Here $x$ and $y$ are the coordinates of the logic blocks in the netlist, and $W_{i,j}$ is the weight between nodes $(x_i, y_i)$ and $(x_j, y_j)$. Since the input is a hyper-graph, and two nodes can be connected by more than one net, the hyper-graph must first be converted into a graph; there are two models for this conversion [7]. The objective function can be rewritten in matrix notation:

$$\Phi(x, y) = \frac{1}{2} x^T Q x + d_x^T x + \frac{1}{2} y^T Q y + d_y^T y + \mathrm{const}$$

where $Q$ is an $n \times n$ symmetric matrix and $d_x$, $d_y$ are $n$-dimensional vectors. Because the x and y terms have the same form and do not interact, the objective function can be separated into the x dimension and the y dimension. With only one dimension considered, the function becomes:

$$\Phi(x) = \frac{1}{2} x^T Q x + d_x^T x + \mathrm{const}$$
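As a small worked illustration of the one-dimensional objective: its gradient is $Qx + d_x$, so the minimizer is the solution of the linear system $Qx + d_x = 0$. The sketch below builds $Q$ and $d_x$ (up to a constant scale factor, which does not change the minimizer) for a toy chain of three movable blocks between two fixed pads and solves the system with a plain conjugate gradient loop. Conjugate gradient is used here only as one example of a nonstationary iterative method; the solver actually used in [12] may differ, and the helper names `build_system` and `conjugate_gradient` are hypothetical.

```python
def build_system(n, edges, pads):
    """Assemble Q and d for n movable blocks (one dimension).
    edges: (i, j, w) two-pin connections between movable blocks.
    pads:  (i, pad_x, w) connections from block i to a fixed pad at pad_x."""
    Q = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for i, j, w in edges:
        Q[i][i] += w
        Q[j][j] += w
        Q[i][j] -= w
        Q[j][i] -= w
    for i, pad_x, w in pads:
        Q[i][i] += w          # fixed pads only contribute to the diagonal
        d[i] -= w * pad_x     # and to the linear term
    return Q, d

def conjugate_gradient(Q, d, tol=1e-12, max_iter=1000):
    """Solve Q x + d = 0 for symmetric positive definite Q."""
    n = len(d)
    x = [0.0] * n
    r = [-di for di in d]                 # residual of Q x = -d at x = 0
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs < tol:                      # squared residual norm small enough
            break
        Qp = [sum(Q[i][j] * p[j] for j in range(n)) for i in range(n)]
        step = rs / sum(p[i] * Qp[i] for i in range(n))
        x = [x[i] + step * p[i] for i in range(n)]
        r = [r[i] - step * Qp[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

# Chain: pad(x=0) - block0 - block1 - block2 - pad(x=4), all weights 1.
Q, d = build_system(3, edges=[(0, 1, 1.0), (1, 2, 1.0)],
                    pads=[(0, 0.0, 1.0), (2, 4.0, 1.0)])
x = conjugate_gradient(Q, d)   # approximately [1.0, 2.0, 3.0]
```

The blocks end up evenly spaced between the pads, which is the expected minimum of the squared wire length for a uniformly weighted chain. The y coordinates would be obtained the same way from $Q$ and $d_y$.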
In order to find the minimum, let $\nabla \Phi(x) = 0$, which results in the following matrix equation:

$$Qx + d_x = 0$$

Solving this linear system gives the placement that minimizes the total squared wire length; it can be solved using nonstationary iterative methods. The algorithm proposed in [12] can be divided into three stages. In the first stage, a good initial placement is obtained by repeatedly building up, modifying and solving linear equations. This stage is performed until no significant improvement can be achieved. In stage two, instead of building and solving linear equations, nodes are moved directly to reduce the total wire length, since stage one has already given a reasonably good initial placement. Stage two is much faster than stage one, so more iterations can be performed to get a better refinement. Finally, simulated annealing at low temperature can be used to further refine the placement.

3.2. Pros and Cons

The main advantage of the quadratic placement technique is that it significantly improves the run time with almost no loss of quality compared to VPR. According to the results shown in [12], across the 20 MCNC benchmark circuits [5], the quadratic placer QPF runs 5.8 times faster than VPR on average, while the wire length obtained by QPF is only 1.9% more than that of VPR. A better numerical method for solving the linear equations might reduce the run time further. However, since the squared wire length is the only factor considered in the objective function, timing cannot be reflected in quadratic placement.

Min-Cut Placement
Partitioning-based placement algorithms are fast and hence scalable for large Application Specific Integrated Circuit (ASIC) placement, and they have also been applied to FPGAs. One recent partitioning-based placement method, named min-cut placement, recursively applies bipartitioning to map the netlist of a circuit onto the FPGA layout region. It minimizes the number of nets that are cut while keeping highly connected logic blocks in the same partition [3].

4.1. Overview of Min-Cut Placement

Delay optimization is very important in circuit design. Effective delay minimization on large circuits is possible only by accounting for performance as early as possible in the design flow. Min-cut placement targets delay minimization at the placement stage, which is an early step in the design process. The min-cut placer employs the fundamental divide-and-conquer method. A circuit is recursively bi-partitioned in a breadth-first manner, as shown in Figure 2.

Figure 2. Bi-partitioning Process of Min-Cut.

The cut direction (horizontal or vertical) is decided based on the criticality of the nets crossing the four borders, so that the total number of cuts is minimized [3]. This recursive process is repeated until each partition contains only a few blocks; it groups highly connected blocks together in order to decrease the placement cost. The goal of min-cut is to find a partition that cuts the fewest wires in the net. All the edges in the net are weighted with timing criticality, as well as terminal alignment of critical nets [3].

The algorithm can be divided into three stages. In the first stage, min-cut uses the state-of-the-art multilevel partitioner hMetis [11] as its partitioning engine. During the partitioning process, a tight connection is maintained between the circuit graph and the placement, which represents the coordinates of all blocks on the FPGA fabric. Recursive partitioning is done until each leaf partition has only a few blocks. In some cases, a leaf partition might contain more blocks than it can accommodate, so overlaps must be removed. In stage two, overlaps are removed using a greedy technique, which moves blocks to the closest, best aligned partition. Finally, the placement is refined using a low-temperature simulated annealing method to further minimize the delay.
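The first stage (recursive bi-partitioning onto the layout region) can be sketched as follows. This is only a toy illustration under stated assumptions: an exhaustive balanced bipartition stands in for hMetis and is usable only for a handful of blocks, net weights stand in for timing criticality, and the cut direction simply splits the longer side of the region rather than using the criticality-based rule of [3]. All names (`cut_weight`, `best_bipartition`, `min_cut_place`) are hypothetical.

```python
from itertools import combinations

def cut_weight(nets, left, right):
    """Total weight of nets that have blocks on both sides of the cut."""
    total = 0.0
    for blocks, weight in nets:
        if any(b in left for b in blocks) and any(b in right for b in blocks):
            total += weight
    return total

def best_bipartition(blocks, nets):
    """Exhaustive balanced bipartition; a stand-in for a real partitioner."""
    ordered = sorted(blocks)
    half = len(ordered) // 2
    best = None
    for combo in combinations(ordered, half):
        left = set(combo)
        right = set(ordered) - left
        w = cut_weight(nets, left, right)
        if best is None or w < best[0]:
            best = (w, left, right)
    return best[1], best[2]

def min_cut_place(blocks, nets, region, placement=None):
    """Recursively bisect blocks and region; region = (x0, y0, x1, y1), half-open."""
    if placement is None:
        placement = {}
    x0, y0, x1, y1 = region
    if len(blocks) <= 1 or (x1 - x0) * (y1 - y0) <= 1:
        for b in sorted(blocks):
            placement[b] = (x0, y0)   # several blocks here would overlap
        return placement
    left, right = best_bipartition(blocks, nets)
    if x1 - x0 >= y1 - y0:            # assumption: cut across the longer side
        xm = (x0 + x1) // 2
        min_cut_place(left, nets, (x0, y0, xm, y1), placement)
        min_cut_place(right, nets, (xm, y0, x1, y1), placement)
    else:
        ym = (y0 + y1) // 2
        min_cut_place(left, nets, (x0, y0, x1, ym), placement)
        min_cut_place(right, nets, (x0, ym, x1, y1), placement)
    return placement

# Four blocks on a 2x2 grid; the heavier nets are the more timing-critical ones.
nets = [(("a", "b"), 3.0), (("c", "d"), 2.0), (("b", "c"), 1.0)]
placement = min_cut_place({"a", "b", "c", "d"}, nets, (0, 0, 2, 2))
```

In this example the first cut separates {a, b} from {c, d}, so only the least critical net is cut. When a leaf region receives more blocks than it has locations, the sketch leaves them overlapping, which is the situation the greedy overlap removal of stage two is meant to resolve.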
4.2. Pros and Cons

The advantage of the min-cut placement technique is that it minimizes the delay at the placement stage, which lays the foundation for a better performing circuit. Besides, the run times reported in [3] show an average 3-4x speedup compared to VPR on the 20 MCNC benchmarks, with a slight degradation in quality. However, the results of min-cut depend on how well the partitioning is performed. Current research is focused on finding heuristics that partition the circuit better. Also, the min-cut placer may not be able to reach the global minimum because of the greedy strategies it uses.
Parallel Placement
As the scale of modern FPGAs has reached millions of logic blocks, more efficient and scalable FPGA placement algorithms are needed. Parallelization is an appealing way to provide fast placement, thanks to the rapid development of multi-core CPUs in recent years. The parallel approaches that we review are based on simulated annealing, since it outperforms the other methods and its main drawback is the time-consuming evaluation of moves. We divide modern simulated annealing based parallel FPGA placers into three categories: the parallel move approach, the area based approach and the deterministic parallel approach.

5.1. Parallel Move Approach

Since there are a large number of moves at each temperature, the parallel move approach tries to accelerate the simulated annealing process by performing several moves at the same time. There are three possible outcomes of each move: (i) two blocks are swapped, (ii) a block is moved to an empty location, or (iii) the move is rejected. Moves can be done in parallel only if they do not move the same block or move to the same location. Figure 3 shows a simple example of parallel moves. Move 1 and Move 2 can be done in parallel since they are totally independent, while Move 2 and Move 3 cannot, because they are trying to move block 3 to different locations at the same time.

Figure 3. Parallel Moves Approach.

However, ensuring the above only guarantees that there are no move collisions; net cost collisions might still happen. As shown in Figure 4, block 1 and block 3 belong to the same net. When move 1 and move 2 are done in parallel, the resulting bounding box of move 1 is the bounding box of blocks 2 and 3, while the resulting bounding box of move 2 is the bounding box of blocks 1 and 4. Two moves that move blocks of the same net may evaluate the bounding box incorrectly, as neither move can take into account the fact that the other move is changing the bounding box.

Figure 4. Net Cost Collision.

Generally, there are two ways to deal with move collisions and net cost collisions. (i) Ignore the errors in the cost function. This is the easiest way to deal with collisions, but it reduces the accuracy of the cost, which interferes with the acceptance of moves and adversely affects the results. (ii) Find disjoint moves that not only move different blocks but also belong to different nets. These over-restricted moves result in a smaller swap space, and the synchronization overhead tends to overwhelm the gain from parallelism.

Both of these methods show negative speedups [1], because the overhead of synchronization outweighs the advantages of parallelization. Nevertheless, the idea of parallelizing the moves has inspired many other parallel FPGA placement methods.
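The two kinds of collision can be expressed as simple set tests. The sketch below is a hypothetical formulation rather than the data structures of [1]: a move records the blocks it relocates and the grid locations it touches, and `nets_of` maps each block to the nets it belongs to. The example moves are invented to mirror the situations described for Figures 3 and 4.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Move:
    blocks: frozenset      # blocks relocated by the move (two for a swap)
    locations: frozenset   # source and destination grid locations

def move_collision(a, b):
    """Two moves collide if they touch the same block or the same location."""
    return bool(a.blocks & b.blocks) or bool(a.locations & b.locations)

def net_collision(a, b, nets_of):
    """Two moves collide on net cost if any of their blocks share a net."""
    nets_a = set().union(*(nets_of[blk] for blk in a.blocks))
    nets_b = set().union(*(nets_of[blk] for blk in b.blocks))
    return bool(nets_a & nets_b)

def independent(a, b, nets_of):
    """Safe to evaluate in parallel: no move collision and no net collision."""
    return not move_collision(a, b) and not net_collision(a, b, nets_of)

# Hypothetical example: blocks 1 and 3 share net "n1", blocks 2 and 4 share "n2".
nets_of = {1: {"n1"}, 2: {"n2"}, 3: {"n1"}, 4: {"n2"}}
move1 = Move(frozenset({1}), frozenset({(0, 0), (2, 0)}))
move2 = Move(frozenset({3}), frozenset({(1, 1), (3, 1)}))
move3 = Move(frozenset({3}), frozenset({(1, 1), (0, 3)}))
move4 = Move(frozenset({2}), frozenset({(4, 4), (5, 5)}))

print(move_collision(move2, move3))          # both try to move block 3
print(net_collision(move1, move2, nets_of))  # blocks 1 and 3 share a net
print(independent(move1, move4, nets_of))    # different blocks, locations, nets
```

Requiring `independent` for every pair of concurrent moves corresponds to option (ii) above, and it shows why that option shrinks the swap space: a single shared net is enough to serialize two moves.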
5.2. Area Based Approach

The area based approach is motivated by the collisions illustrated in the parallel move approach. It partitions the area of the FPGA and assigns the partitioned areas to different processors. As shown in Figure 5, the whole circuit is partitioned into four parts, and each processor is in charge of one partition. The moves evaluated are much less restricted than in the parallel move approach. However, collisions can still happen, because multiple processors may move blocks belonging to the same net across partitions, as presented in Figure 5. For example, the bounding box of blocks 1, 2 and 3 cannot be computed exactly since they belong to different partitions. These errors can be tolerated because nets that span two or more partitions are not expected to occur very often. Moreover, as the temperature cools, swaps tend to happen between nearby blocks.

Figure 5. Collision in Area Partitioning.

Since each processor can only move blocks within its own partitioned area, the partitioning must be performed carefully so that each block has the freedom to move to any location in the FPGA; otherwise the placement could not reach the global minimum. The area based approach uses both horizontal and vertical partitions to ensure that the global minimum can be reached [1]. The experimental results show that a non-linear speedup is gained compared to the sequential placer and that the cost does not degrade as the number of processors increases. This is due to the lower synchronization requirements.

5.3. Deterministic Parallel Approach

One of the constraints of parallelism is the non-determinism of the results. This constraint has seldom been studied in past work (an exception is [2]), but it is vital in a commercial context for the following two reasons [10]. (i) When a user of a commercial FPGA placement tool reports a bug, the problem must be reproducible. Non-determinism makes this extremely difficult because the results are different for each run. (ii) In the release testing stage of building a placer, it would be terribly difficult to investigate failing tests if the results changed randomly.

The algorithm proposed in [10] parallelizes the placement while keeping the results deterministic. The deterministic parallel approach partitions a move into two stages: processing and finalization. As shown in Figure 6, during the processing stage each processor proposes a move and evaluates it. This takes the vast majority of the time and thus occurs in parallel. In order to avoid collisions and maintain the deterministic property, the calculated moves are put in a queue, and a dependency check ensures that there is no collision and re-proposes moves that have collided. Note that the finalization part can be done by any idle processor. In our example, C0 is idle when all the moves in the queue have been checked, thus C0 does the finalization job.

Figure 6. Deterministic Parallel Approach.

There are several advantages of the deterministic parallel approach. First, the speedup can be linear under the assumption that the finalization time is negligible. Second, a move is now processed entirely by one processor, which improves memory locality. Third, the results are deterministic and serially equivalent.

Future Work
Algorithms for FPGA placement play a vital role in modern integrated circuit design. As the comparisons of results in this paper show, placement is still a bottleneck, even though a tradeoff can be made between quality and efficiency. The potential for improving both run time and quality still exists through parallel methods [10]. Ideally, we want to systematically compare the results of each placement algorithm. However, direct comparisons of these algorithms are difficult, partly because of limited access to the algorithms and partly because of their different assumptions. Comparisons based on wire-length placement have been attempted in [5]. More work is needed to build a common framework for directly comparing the performance of different FPGA placers.
Conclusion
In this paper, a number of different FPGA placement algorithms are reviewed, and a summary of the comparison is shown in Figure 7. Simulated annealing placement in general outperforms the other placers with regard to the final results. Besides, it can overcome local minima and has an open cost function. However, it is too time consuming. Quadratic placement gives fast run times, but its results cannot reach global optimization and timing cannot be reflected in its cost function. The min-cut method can also speed up the run time, but the quality of the placement still cannot be guaranteed. Parallel algorithms can give good speedups with almost no loss of quality, but their scalability is restricted by the overhead due to memory.

Figure 7. Overall Comparison of the Four Placement Methods

Simulated Annealing
  Placement Quality: Overall, it gives the best results, particularly when using wire length as the cost function. The final placement result can reach global optimization.
  Placement Efficiency: It is very slow because of the computationally expensive evaluation of each move.

Quadratic Placement
  Placement Quality: On average, it requires 1.9% more wire length compared to the simulated annealing method. It only considers the wire length factor in the cost function, while timing cannot be reflected.
  Placement Efficiency: Compared to simulated annealing, it can be 5.8 times faster on average.

Min-Cut Placement
  Placement Quality: The final result varies dramatically based on how the partition is made.
  Placement Efficiency: An average speedup of 3-4 times can be gained compared to simulated annealing.

Parallel Placement
  Placement Quality: To the best of our knowledge, no direct comparison of results to simulated annealing has been made. In general, using the deterministic parallel placement, the results are as good as those of simulated annealing.
  Placement Efficiency: The best speedup can be linear in the number of processors if the finalization time is negligible.

References
[1] A. Nayak, A. Choudhary, M. Haldar and P. Banerjee. Parallel algorithms for FPGA placement. Proc. of the 10th Great Lakes Symposium on VLSI, pages 86-94, 2000.
[2] J. Chandy, S. Kim, B. Ramkumar, S. Parkes and P. Banerjee. An evaluation of parallel simulated annealing strategies with application to standard cell placement. TCAD, 16:398-410, 1997.
[3] P. Maidee, C. Ababei and K. Bazargan. Time-driven partitioning-based placement for island style FPGAs. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 24:395-406, 2005.
[4] J. Cong, D. Chen and P. Pan. FPGA design automation: A survey. Foundations and Trends in Electronic Design Automation, 1, 2006.
[5] S. Yildiz, M. Markov, I. Villarrubia, P. Parakh and Madden. Benchmarking for large scale placement and beyond. Proc. of the International Symposium on Physical Design, pages 95-103, 2003.
[6] V. Betz, J. Rose and A. Marquardt. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic, 1999.
[7] N. Viswanathan and C. Chu. FastPlace: Efficient analytical placement using cell shifting, iterative local refinement and a hybrid net model. Proc. of ISPD, 2004.
[8] S. Brown and J. Rose. FPGA and CPLD architectures: A tutorial. IEEE Design and Test of Computers, 12:42-57, 1996.
[9] V. Betz and J. Rose. VPR: A new packing, placement and routing tool for FPGA research. International Workshop on Field Programmable Logic and Applications, pages 213-222, 1997.
[10] A. Ludwin, V. Betz and K. Padalia. High-quality, deterministic parallel placement for FPGAs on commodity hardware. ACM/SIGDA Int. Symp. on FPGAs, pages 14-23, 2008.
[11] R. Aggarwal, V. Kumar, G. Karypis and S. Shekhar. Multilevel hypergraph partitioning: Application in VLSI domain. Proc. ACM/IEEE DAC, 1997.
[12] Y. Xu and M.A.S. Khalid. QPF: Efficient quadratic placement for FPGAs. International Conference on Field Programmable Logic and Application, pages 555-558, 2005.
