
FPGA implementation of Radix-2² Pipelined FFT Processor

Ahmed Saeed*, M. Elbably, G. Abdelfadeel and M. I. Eladawy


*Faculty of Engineering & Technology, Future University in Egypt, Helwan, Egypt. Tel: +2-010-2052382. E-mail: asaeed@fue.edu.eg
Faculty of Engineering, Helwan University, Helwan, Egypt. E-mail: saeeed3@ieee.org

Abstract: The Fast Fourier Transform (FFT) is a very important algorithm in signal processing, software-defined radio, and wireless communication. This paper explains the realization of a radix-2² single-path delay feedback (R2²SDF) pipelined FFT processor. This architecture has the same multiplicative complexity as the radix-4 algorithm but retains the simple butterfly structure of the radix-2 algorithm. The implementation was made on a Field Programmable Gate Array (FPGA) because an FPGA can achieve a higher computing speed than digital signal processors (DSPs) and can also achieve ASIC-like performance cost-effectively, with lower development time and risk. The processor has been developed using the hardware description language VHDL, simulated at up to 465 MHz, and exhibited an execution time of 0.135 µs for a 256-point transform on a Xilinx xc5vsx35t.

I. INTRODUCTION

The Fast Fourier Transform (FFT) is one of the most widely used and important signal processing functions. It has been widely applied in the analysis and implementation of digital communication systems and terrestrial television broadcasting systems, such as xDSL (de)modulators, phase correlation systems, mobile receivers, etc. [1], [2]. The classical implementation of the FFT algorithm with digital signal processors (DSPs) requires a sequential algorithm, which slows down the execution. On the other hand, modern programmable circuits such as FPGAs contain tens of thousands of look-up tables and flip-flops that operate concurrently, resulting in a parallel processing system; this gives the FPGA a significant computing-speed advantage over DSP chips [7], [8], [9].

When considering alternative implementations, the FFT algorithm should be chosen by weighing execution speed, hardware complexity, flexibility, and precision. For real-time systems, however, execution speed is the main concern [3], [4]. Several architectures have been proposed over the last three decades, such as single-memory, dual-memory, cached-memory, array, and pipelined architectures [4]. Pipelined architectures are characterized by real-time, non-stopping processing and offer smaller latency with low power consumption [5], which makes them suitable for most applications.

Pipelined architectures can be classified into two types: single-path architectures and multi-path architectures. The proposed single-path architectures include radix-2 single-path delay feedback, radix-4 single-path delay feedback, radix-2² single-path delay feedback, radix-2⁴ single-path delay feedback, split-radix single-path delay feedback, and radix-4 single-path delay commutator. The multi-path architectures include radix-2 multi-path delay commutator, radix-4 multi-path delay commutator, split-radix multi-path delay commutator, and mixed-radix multi-path delay commutator. A comparison of the listed architectures shows that the delay feedback architectures use memory more efficiently than the corresponding delay commutator architectures, and that radix-2² has a simpler butterfly and higher multiplier utilization [2], [3], [6]. This makes the radix-2² single-path delay feedback an attractive architecture for implementation.

This paper presents the implementation of a radix-2² single-path delay feedback pipelined FFT processor on an FPGA. It is compared with other implementations in terms of maximum operating frequency and execution speed. The results show that our design achieves a higher operating frequency and an appropriate execution speed.

The paper is structured as follows. The radix-2² FFT algorithm is illustrated in Section II. Section III discusses the implementation of the radix-2² algorithm on the FPGA. The synthesis results and consumed resources are given in Section IV. Finally, Section V concludes the paper.

II. RADIX-2² FFT ALGORITHM

The Discrete Fourier Transform (DFT) of an N-point input x(n) is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad 0 \le k < N \qquad (1)$$

where $W_N = e^{-j2\pi/N}$.

According to the decomposition method of [6], substituting

$$n = \left\langle \frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3 \right\rangle_N, \qquad k = \left\langle k_1 + 2k_2 + 4k_3 \right\rangle_N,$$

with $n_1, n_2, k_1, k_2 \in \{0, 1\}$ and $n_3, k_3 = 0, 1, \ldots, N/4 - 1$, yields

$$X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3)\, W_N^{n_3(k_1+2k_2)} \right] W_{N/4}^{n_3 k_3} \qquad (2)$$

with

$$H(k_1, k_2, n_3) = \left[ x(n_3) + (-1)^{k_1} x\!\left(n_3 + \tfrac{N}{2}\right) \right] + (-j)^{(k_1+2k_2)} \left[ x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1} x\!\left(n_3 + \tfrac{3N}{4}\right) \right] \qquad (3)$$

After this simplification we have a set of four DFTs of length N/4. Each of the two bracketed terms in equation (3) is a radix-2 butterfly (BFI), and the whole of equation (3) is a second radix-2 butterfly (BFII) with a trivial multiplication by -j. Fig. 1 shows an example of the radix-2² decimation-in-frequency (DIF) FFT algorithm for N = 16. Notice the order of input and output: the inputs are in normal order, whereas the outputs are in permuted (digit-reversed) order.

The pentagons between BFI and BFII represent the trivial multiplication by -j. After these two butterflies, full twiddle factor multipliers (TFM) are required to compute the multiplication by the twiddle factor $W_N^{n_3(k_1+2k_2)}$. Fig. 3 shows the block diagram of the radix-2² N-point FFT processor. The N-point FFT processor has $\lceil \log_4 N \rceil$ stages. A typical stage consists of BFI, BFII, delay feedback, ROM, and TFM. A $\log_2 N$-bit counter is used to control the processor. The last stage depends on the FFT size: if N is a power of 4, the last stage is composed of BFI and BFII; if N is a power of 2 but not of 4, the last stage is composed of BFI only.

III. IMPLEMENTATION OF RADIX-2² ON FPGA

Concerning the FPGA implementation, the selection of the target FPGA should consider the resources required by the pipelined architecture proposed in [6] for N = 256: complex multipliers, complex adders/subtractors for BFI and BFII, registers and memory for delay feedback and pipelining, ROM for storing the twiddle factors, and the control unit.

A. BFI Structure

The detailed structure of BFI is shown in Fig. 2. The A input comes from the previous component, the TFM, and the B output feeds the next component, normally BFII.

Fig. 2. Butterfly I structure

Fig. 1. Flow graph of the radix-2² DIF FFT algorithm for N = 16
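To make equations (2) and (3) concrete, the Python sketch below performs one radix-2² decomposition step and compares it with the direct DFT of equation (1). It is only an illustration: the helpers `W`, `dft`, and `radix22_step` are hypothetical names, not part of the described VHDL design. The step forms H(k1, k2, n3) from the two butterflies and the trivial factor (-j)^(k1+2k2), applies the twiddle factor, and finishes with four N/4-point DFTs over n3.

```python
import cmath

def W(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def dft(x):
    """Direct N-point DFT, equation (1)."""
    N = len(x)
    return [sum(x[n] * W(N, n * k) for n in range(N)) for k in range(N)]

def radix22_step(x):
    """One radix-2^2 decomposition step, equations (2) and (3)."""
    N = len(x)
    assert N % 4 == 0, "N must be a multiple of 4"
    Q = N // 4
    X = [0j] * N
    for k1 in (0, 1):
        for k2 in (0, 1):
            sign = (-1) ** k1
            g = []
            for n3 in range(Q):
                # H(k1, k2, n3): two BFI-type sums joined by a BFII-type sum
                # with the trivial factor (-j)^(k1 + 2 k2), equation (3)
                h = ((x[n3] + sign * x[n3 + 2 * Q])
                     + (-1j) ** (k1 + 2 * k2) * (x[n3 + Q] + sign * x[n3 + 3 * Q]))
                g.append(h * W(N, n3 * (k1 + 2 * k2)))  # twiddle of equation (2)
            G = dft(g)                                  # N/4-point DFT over n3
            for k3 in range(Q):
                X[k1 + 2 * k2 + 4 * k3] = G[k3]
    return X

x = [complex((7 * n) % 5 - 2, (3 * n) % 4 - 1.5) for n in range(16)]
print(max(abs(a - b) for a, b in zip(dft(x), radix22_step(x))))  # round-off level
```

For N = 16, each of the four remaining 4-point DFTs is itself one BFI/BFII pair without twiddle factors, which is the two-stage structure of the flow graph in Fig. 1.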

In the first $N/2^{2i+1}$ cycles, where i is the stage index (i = 0 for the first stage), the multiplexers direct the input data to the feedback registers until they are filled (position 0). In the next $N/2^{2i+1}$ cycles, the multiplexers select the outputs of the adders/subtractors (position 1), and the butterfly computes a 2-point DFT of the incoming data and the data stored in the feedback registers.

Fig. 3. Block diagram of the radix-2² N-point FFT processor

B. BFII Structure

The detailed structure of BFII is shown in Fig. 4. The B input comes from the previous component, BFI, and the Z output feeds the next component, normally the TFM.

In the first $N/2^{2i+2}$ cycles, the multiplexers direct the input data to the feedback registers until they are filled (position 0). In the next $N/2^{2i+2}$ cycles, the multiplexers select the outputs of the adders/subtractors (position 1), and the butterfly computes a 2-point DFT of the incoming data and the data stored in the feedback registers. The multiplication by -j involves a real-imaginary swap and a sign inversion. The real-imaginary swap is handled efficiently by the multiplexers MUXim, and the sign inversion is handled by switching the adding and subtracting operations by means of MUXsg. When a multiplication by -j is needed, all multiplexers switch to position 1: the real and imaginary data are swapped and the adding and subtracting operations are switched. Fig. 5 shows the sign inversion structure.
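The swap-and-switch trick can be checked in a few lines. The Python sketch below is a behavioural model under our own assumptions: the function `bf2_pair` and its (re, im) integer-pair interface are hypothetical, not taken from the paper's VHDL. Since -j(b_r + jb_i) = b_i - jb_r, multiplying by -j only requires swapping the real and imaginary parts of b (MUXim) and exchanging the adder and subtractor on the imaginary path (MUXsg).

```python
def bf2_pair(a, b, minus_j):
    """One BFII butterfly on two samples given as (re, im) integer pairs.

    Returns (a + b', a - b') with b' = -j*b when minus_j is set, else b' = b.
    No negation is performed: MUXim swaps b's real and imaginary parts and
    MUXsg exchanges the adder and subtractor on the imaginary path.
    """
    ar, ai = a
    br, bi = b
    if not minus_j:
        return (ar + br, ai + bi), (ar - br, ai - bi)
    sr, si = bi, br                                # MUXim: swapped operand
    return (ar + sr, ai - si), (ar - sr, ai + si)  # MUXsg: imaginary add/sub exchanged

# Usage: a = 3 + 4j and -j * (1 + 2j) = 2 - 1j
print(bf2_pair((3, 4), (1, 2), True))  # ((5, 3), (1, 5))
```

With `minus_j=False` the same function is the plain radix-2 butterfly. Because no operand is negated in this formulation, the asymmetry of the two's complement range (negating -32768 does not fit in 16 bits) does not arise at this point; only the usual one-bit growth of an addition remains.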

Fig. 4. Butterfly II structure

The adders and subtractors in BFI and BFII are fully pipelined (Fig. 6) and are followed by divide-by-2 scaling and rounding. The divide-by-2 scaling is used so as not to lose precision, since the word length would otherwise grow successively as the data pass through the add/subtract and multiply operations. Rounding has also been applied to reduce the scaling errors [4].

Fig. 5. Sign inversion structure
Fig. 6. Pipelined adder with divide-by-2 and round

C. TFM Structure

A fully pipelined TFM with a latency of six clock cycles has been implemented to multiply the twiddle factor by the output of BFII. The structure of the TFM is shown in Fig. 7. For a BFII output $B_r + jB_i$ and a twiddle factor $c + js$, it computes

$$(B_r + jB_i)(c + js) = (B_r c - B_i s) + j(B_i c + B_r s) \qquad (4)$$

There are many popular techniques for generating twiddle factors, including CORDIC algorithms, polynomial-based approaches, ROM-based schemes, and recursive function generators. For small lengths, such as 64 up to 512 points, the ROM-based scheme is a better choice [11]. The twiddle factors are generated in MATLAB according to equation (5), converted to fixed point, and then stored in ROM. The twiddle factor sequence at the ith TFM stage, with i = 0, 1, ..., (log4 N) - 2, is given by

$$W_i = u(x), \quad x = 0, 1, \ldots, \frac{N}{2^{2i}} - 1, \qquad u(x) = e^{-j2\pi z/N},$$
$$z = \begin{cases} 0, & 0 \le x < a \\ 2^{2i+1}(x - a), & a \le x < 2a \\ 2^{2i}(x - 2a), & 2a \le x < 3a \\ 3 \cdot 2^{2i}(x - 3a), & 3a \le x < 4a \end{cases} \qquad (5)$$

with $a = N/2^{2+2i}$ [3].

Fig. 7. Structure of the fully pipelined TFM

We chose the two's complement format to represent the signals in the digital domain because it handles negative numbers inherently, without an extra sign bit: positive and negative numbers are distinguished by the most significant bit. The word length is chosen to be 16 bits for the data and 11 bits for the twiddle factors in order to guarantee a signal-to-noise ratio (SNR) of 45 dB [10]; these widths also match the characteristics of the DSP48E slices.

The DSP48E is a DSP slice introduced by Xilinx in the Virtex-5 FPGAs (Virtex-4 devices provide its predecessor, the DSP48 slice). It has been used to implement the adders, subtractors, control unit, and complex multipliers. DSP48E slices provide reduced overall power consumption, increased maximum frequency, and reduced set-up plus clock-to-out time. Each slice implements a 25 x 18-bit multiplier with add, subtract, accumulate, and bitwise logic features [12]. These slices have optional connections between the multiplier and the adder/subtractor inside them; because these connections stay within the DSP slice, routing resources are saved and performance increases. To save registers in the FPGA fabric, the pipelining of the multipliers and adder/subtractor units was implemented inside the DSP48E slices.

D. Delay-Feedback Structure

To reuse the existing hardware, delay feedback is used. Delay feedback architectures reorder the input by first accepting part of the data stream into the butterfly elements; instead of being computed on immediately, this block is redirected into a feedback delay line, so that it reappears at the butterfly input when its partner samples arrive. A first-in first-out (FIFO) shift register is used to implement the delay feedback. The feedback delay of BFII at the ith stage is given by

$$D_i = \frac{N}{2^{2(i+1)}} \qquad (6)$$

and the BFI delay line of the same stage is twice as long, $N/2^{2i+1}$, matching the fill periods of the BFI and BFII structures.

E. Control Unit

The radix-2² control unit is very simple. A $\log_2 N$-bit counter is used to switch the butterflies between their modes. It is also used as the address of the ROMs in order to pick the twiddle factors.

IV. RESULTS

The FFT processor was described in the hardware description language VHDL using fixed-point arithmetic, synthesized with the XST tool in Xilinx ISE version 10.1 for an xc5vsx35t FPGA, and simulated using ModelSim Xilinx Edition (MXE III). In the top-level design shown in Fig. 8, Xn is the input data (16-bit real and 16-bit imaginary) and Xk is the output data (16-bit real and 16-bit imaginary); the remaining ports are the synchronous reset (rst), clock, chip enable, start, busy, and finish.

The synthesis tool allocated the following resources (Fig. 9): 838 slice registers (3%), 1033 look-up tables (4%), 717 look-up table/flip-flop pairs (62%), and 44 DSP48E slices (22%). The results of the synthesis tool and the timing analysis using the ModelSim simulator indicate a maximum operating frequency of 465 MHz, which provides an execution time of 0.135 µs for a 256-point complex transform.

Table 1 compares our design with some other FFT processors. Our radix-2² processor achieves the highest operating frequency of all the FPGA FFT implementations in Table 1. Note that the data for the Xilinx core are taken from [4].
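As a cross-check of the architecture described in Section III, the Python stream model below chains BFI, BFII, and TFM stages. It is a behavioural sketch under our own assumptions, not the authors' VHDL: the names `sdf_butterfly`, `twiddle_rom`, and `r22sdf` are hypothetical, the arithmetic is floating point rather than the 16-bit data and 11-bit twiddle words of the hardware, and the sample counter `c` stands in for the log2 N-bit control counter. Each butterfly is a delay-feedback unit whose FIFO depth follows the BFI and BFII delays, BFII applies -j in the last quarter of each frame, and the TFM multiplies by the ROM sequence of equation (5). The output is compared with a direct DFT, read in bit-reversed order.

```python
import cmath
from collections import deque

def sdf_butterfly(stream, L, bf2=False):
    """Radix-2 single-path delay-feedback butterfly with an L-deep FIFO.

    Position 0 (first L of every 2L cycles): the input is pushed into the
    FIFO while the FIFO head (a stored difference) is sent out.
    Position 1 (next L cycles): head + input is sent out and head - input
    is fed back into the FIFO. With bf2=True (BFII), inputs in the last
    quarter of every 4L-cycle frame are first multiplied by -j.
    """
    fifo = deque([0j] * L)
    out = []
    for c, x in enumerate(list(stream) + [0j] * L):  # L zeros flush the FIFO
        if bf2 and c % (4 * L) >= 3 * L:
            x = complex(x.imag, -x.real)             # -j: swap re/im, flip one sign
        head = fifo.popleft()
        if (c // L) % 2 == 0:
            fifo.append(x)
            out.append(head)
        else:
            fifo.append(head - x)
            out.append(head + x)
    return out[L:]                                    # drop the L-cycle latency

def twiddle_rom(N, i):
    """Unquantized ROM contents of TFM stage i, following equation (5)."""
    a = N // 2 ** (2 + 2 * i)
    rom = []
    for x in range(4 * a):
        q, r = divmod(x, a)                           # quarter q of the frame
        z = (0, 2, 1, 3)[q] * 2 ** (2 * i) * r
        rom.append(cmath.exp(-2j * cmath.pi * z / N))
    return rom

def r22sdf(x):
    """Stream model of an N-point R2^2SDF pipeline (output bit-reversed)."""
    N = len(x)
    assert N >= 2 and N & (N - 1) == 0, "N must be a power of 2"
    stream = [complex(v) for v in x]
    i = 0
    while N >> (2 * i) >= 4:
        Ns = N >> (2 * i)                             # frame length at stage i
        stream = sdf_butterfly(stream, Ns // 2)       # BFI: delay N/2^(2i+1)
        stream = sdf_butterfly(stream, Ns // 4, True) # BFII: delay N/2^(2(i+1))
        if Ns > 4:                                    # TFM whenever the frame exceeds 4 samples
            rom = twiddle_rom(N, i)
            # the low bits of the sample counter address the twiddle ROM
            stream = [v * rom[c % Ns] for c, v in enumerate(stream)]
        i += 1
    if N >> (2 * i) == 2:                             # N not a power of 4
        stream = sdf_butterfly(stream, 1)             # last stage: BFI only
    return stream

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def bit_reverse(k, bits):
    return int(format(k, f"0{bits}b")[::-1], 2)

# Usage: compare with a direct DFT, reading the pipeline output bit-reversed.
N = 64
x = [complex((5 * n) % 7 - 3, (3 * n) % 5 - 2) for n in range(N)]
X, Y = dft(x), r22sdf(x)
bits = N.bit_length() - 1
print(max(abs(Y[bit_reverse(k, bits)] - X[k]) for k in range(N)))  # round-off level
```

The loop mirrors the stage structure of Fig. 3: every stage whose frame is longer than four samples ends with a TFM, and a lone BFI closes the pipeline when N is a power of 2 but not of 4.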

Fig. 8. Top-level design
Fig. 9. Design summary

V. CONCLUSION

This paper describes the design and implementation of a radix-2² single-path delay feedback pipelined FFT processor with N = 256 points on an FPGA. The design was described in VHDL using Xilinx ISE, targeting the Virtex-5 family. The multipliers, adder/subtractor units, and control unit were implemented by efficiently inferring DSP48E blocks in order to obtain a faster design. The data and twiddle factor word lengths were chosen to achieve an acceptable SNR and to match the features of the DSP48E slices. The synthesis and simulation results indicate that the R2²SDF processor executes a 256-point transform in 0.135 µs at a maximum operating frequency of 465 MHz. The comparison with other FFT processors in Table 1 shows that our processor achieves the highest operating frequency among them.

REFERENCES

[1] C. Lin, Y. Yu, and L. Van, "A low-power 64-point FFT/IFFT design for IEEE 802.11a WLAN application," in Proc. International Symposium on Circuits and Systems, pp. 4523-4526, 2006.
[2] L. Yang, K. Zhang, H. Liu, J. Huang, and S. Huang, "An efficient locally pipelined FFT processor," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 53, no. 7, pp. 585-589, Jul. 2006.
[3] A. Cortés, I. Vélez, I. Zalbide, A. Irizar, and J. F. Sevillano, "An FFT core for DVB-T/DVB-H receivers," VLSI Design, vol. 2008, Article ID 610420, 9 pages, 2008.
[4] J. García, J. A. Michell, G. Ruiz, and A. M. Burón, "FPGA realization of a split radix FFT processor," Proc. of SPIE, Microtechnologies for the New Millennium, vol. 6590, pp. 65900P-1 to 65900P-11, 2007.
[5] T. Sansaloni, A. Pérez-Pascual, V. Torres, and J. Valls, "Efficient pipeline FFT processors for WLAN MIMO-OFDM systems," Electronics Letters, vol. 41, no. 19, pp. 1043-1044, 2005.
[6] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proc. 10th Int. Parallel Processing Symp., pp. 766-770, 1996.
[7] Z. Sun, X. Liu, and Z. Ji, "The design of radix-4 FFT by FPGA," in Proc. International Symposium on Intelligent Information Technology Application Workshops, pp. 765-768, 2008.
[8] J. Mar, Y. Lin, T. Lung, and T. Wei, "Realization of OFDM modulator and demodulator for DSRC vehicular communication system using FPGA chip," in Proc. International Symposium on Intelligent Signal Processing and Communications (ISPACS'06), pp. 477-480, 2006.
[9] M. Nilsson, "FFT, realization and implementation in FPGA," Master thesis, Ericsson Microwave Systems AB / Griffith University, 2000-2001.
[10] A. Cortés, I. Vélez, J. F. Sevillano, and A. Irizar, "An approach to simplify the design of IFFT/FFT cores for OFDM systems," IEEE Transactions on Consumer Electronics, no. 1, pp. 26-32, Feb. 2006.
[11] J. C. Chi and S. Chen, "An efficient FFT twiddle factor generator," in Proc. European Signal Processing Conf., pp. 1533-1536, 2004.
[12] Xilinx Inc., "Virtex-5 FPGA XtremeDSP Design Considerations," User Guide, Version 3.3, Jan. 2009.
[13] Xilinx Inc., "LogiCORE Fast Fourier Transform," Data Sheet, Version 4.1, Sep. 2008.

TABLE 1. COMPARISON WITH OTHER FFT PROCESSORS (N = 256, DATA BITS = 16)

Xilinx Core [4] | Our design | [13]; Execution time [µs]: 0.59, 0.123, 0.135; Frequency [MHz]: 350, 465, 432; FPGA family: Virtex 5, Stratix II, Virtex 5; Device: 5VSX35T, 5VSX35T, EP2S15F672C3
