
Application Note

07 April 2004 www.picochip.com

An FFT implementation on the picoArray


Summary

The Fast Fourier Transform (FFT) is a basic building block in signal processing applications. As such, the FFT is often used for benchmarking on a range of platforms (e.g. ASIC, FPGA, DSP), with a multitude of implementations available to optimize speed, area and power. This application note describes in detail a software implementation of a pipeline FFT [1] on the picoArray™ suitable for high-speed applications such as OFDM-based wireless LAN (802.11) and MAN (802.16).

Table of Contents

1 Introduction
2 FFT Overview
3 A Pipeline FFT
4 The PC102
  4.1 Array Elements
  4.2 picoBus Switching Fabric
5 picoArray™ Implementation of FFT
  5.1 Synchronous vs Asynchronous Operation
  5.2 A 2048-point FFT
  5.3 A Smaller FFT
  5.4 A Faster FFT
6 References
7 Appendix: Source Code

Table of Figures

Figure 1 - FFT with (a) Decimation in Frequency, (b) Decimation in Time
Figure 2 - N=256 R2²SDF FFT
Figure 3 - picoArray™ FFT architecture (N=256) on PC102

picoChip Designs Ltd.


1 Introduction

The picoArray™ is a multi-processor IC which integrates hundreds of processing elements into a single array. The individual elements have been optimized for signal processing and wireless algorithm computation and control. The result is a general-purpose wireless communications processor, capable of executing all contemporary wireless standards, which combines the computational density of a dedicated ASIC with the programmability of a traditional high-end Digital Signal Processor (DSP).

The Fast Fourier Transform (FFT) is an optimized implementation of the Discrete Fourier Transform (DFT). The FFT is used in a wide range of applications, some examples being:
- Frequency domain analysis of a signal
- FIR filtering
- Modulation/demodulation for orthogonal frequency division multiplexing (OFDM), as used in 802.11a/g and 802.16a/Revd/e.

A quick summary of the FFT is given in section 2, although the reader is assumed to be familiar with the general theory involved. The focus of this application note is the implementation of a specific pipeline FFT algorithm in software on the picoArray™.

Many algorithms and architectures exist for implementing the FFT. The differences in implementation take account of the target platform (hardware or software) and the mode of operation (sequential real-time or batch). Despite the picoArray™ being a software platform, a hardware-orientated implementation is used here for the FFT. This apparent dichotomy is actually one of the strengths of the picoArray™: it combines the time-to-market and abstraction benefits of a software development environment with the performance benefits gained by exploiting parallelism within an algorithm. An overview of the FFT architecture is given in section 3. The reader is directed to [1] for a detailed description of the particular FFT algorithm itself.

In order to describe the FFT implementation in detail, an understanding of some key features of the picoArray™ is required. An overview of the PC102 processors, or array elements (AEs), together with the switching fabric is given in section 4. Section 5 ties everything together and describes the implementation of the pipeline FFT on the PC102 picoArray™.

2 FFT Overview

The FFT is simply an efficient implementation of the DFT. Whereas a direct implementation of an N-point DFT requires of the order of N² complex multiply and add operations, the FFT only requires of the order of N·log2(N) operations. Figure 1 presents the (unscaled) definition of the DFT, which acts as the starting point for the FFT algorithm.

$$\mathrm{FFT}_N(k, f) = \sum_{n=0}^{N-1} f(n)\, e^{-j 2\pi k n / N}$$

(a)

$$\mathrm{FFT}_N(k, f) = \sum_{n=0}^{N/2-1} f(n)\, e^{-j 2\pi k n / N} + \sum_{n=N/2}^{N-1} f(n)\, e^{-j 2\pi k n / N}$$

(b)

$$\mathrm{FFT}_N(k, f) = \sum_{n'=0}^{N/2-1} f(2n')\, e^{-j 2\pi k (2n') / N} + \sum_{n'=0}^{N/2-1} f(2n'+1)\, e^{-j 2\pi k (2n'+1) / N}$$

Figure 1 FFT with (a) Decimation in Frequency, (b) Decimation in Time

Two approaches exist for reducing the DFT into a series of simpler calculations: (a) decimation in frequency and (b) decimation in time. Both approaches require the same number of complex multiplications and additions. The key difference between the two is that:
- Decimation in time takes bit-reversed inputs and generates normal-order outputs.
- Decimation in frequency takes normal-order inputs and generates bit-reversed outputs.

Each butterfly stage involves multiplying an input by a complex twiddle factor, $e^{-j 2\pi n / N}$.
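As a rough illustration of the decimation-in-frequency split in Figure 1(a), the following Python sketch compares a direct DFT with a recursive radix-2 decimation-in-frequency FFT and checks that the latter produces the same values in bit-reversed order. The function names (`dft`, `fft_dif`, `bit_reverse`) are chosen here for illustration and are not part of any picoChip library; floating-point arithmetic is assumed rather than the fixed-point formats used on the PC102.

```python
import cmath


def dft(x):
    """Direct DFT: of the order of N^2 complex multiply-add operations."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT; outputs are in bit-reversed order."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterfly: sums feed the even bins, twiddled differences feed the odd bins.
    top = [x[i] + x[i + half] for i in range(half)]
    bottom = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
              for i in range(half)]
    return fft_dif(top) + fft_dif(bottom)


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


x = [complex(n, -n) for n in range(8)]
reference = dft(x)
scrambled = fft_dif(x)
max_error = max(abs(scrambled[p] - reference[bit_reverse(p, 3)]) for p in range(8))
print(max_error < 1e-9)
```

The recursion makes the bit reversal visible: every level places the even-bin results ahead of the odd-bin results, which is what reorders the outputs.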

3 A Pipeline FFT

This section summarizes the pipeline FFT processor proposed in [1]. A pipeline FFT is characterized by real-time, non-stop processing of a sequential input stream. A hardware-orientated approach is taken with the aim of minimizing the area of a VLSI design by minimizing the number of complex multipliers used. Such an approach takes advantage of the parallel processing capabilities of a hardware-based solution.

The proposed FFT uses decimation in frequency in order to avoid reordering the inputs. The outputs are in bit-reversed order. The FFT/IFFT algorithm involves the temporal separation of data, i.e. the inputs to each radix-2 butterfly in the first stage of an N-point FFT are x(n) and x(n+N/2). Pipeline FFTs therefore require a method of buffering and reordering data. Various architectures exist for this. The proposed FFT processor is based on the Radix-2 Single-path Delay Feedback (R2SDF) architecture.

The delay between each butterfly stage in a pipeline FFT is dictated by the amount of buffering required for the inputs. The largest delay is in the first stage, where N/2 samples have to be buffered before outputs can be generated. The smallest delay is in the last stage, where only 1 sample needs to be buffered.

An optimization of the FFT algorithm is made with respect to the twiddle factors which halves the number of multipliers required. The proposed FFT retains the radix-2 butterfly structure but has the multiplicative complexity of a radix-4 algorithm. This results in two radix-2 butterflies for each complex multiplier, as shown in Figure 2. The resulting architecture is termed R2²SDF.
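The single-path delay feedback idea can be sketched as a behavioural model in Python. Note the assumptions: this models a plain R2SDF pipeline with a twiddle multiplication after every butterfly, not the R2²SDF twiddle decomposition of [1]; it uses floating-point arithmetic; and all names (`sdf_stage`, `r2sdf_fft`, `dft_bin`, `bit_reverse`) are hypothetical.

```python
import cmath
from collections import deque


def sdf_stage(stream, delay):
    """One radix-2 butterfly stage with a feedback FIFO of `delay` samples."""
    fifo = deque([0j] * delay)
    out = []
    for count, x in enumerate(stream):
        phase = count % (2 * delay)
        stored = fifo.popleft()
        if phase < delay:
            # First half of a block: emit the buffered differences, store the input.
            out.append(stored)
            fifo.append(x)
        else:
            # Second half: butterfly x(n) with x(n + delay) taken from the FIFO.
            twiddle = cmath.exp(-2j * cmath.pi * (phase - delay) / (2 * delay))
            out.append(stored + x)
            fifo.append((stored - x) * twiddle)
    return out


def r2sdf_fft(x):
    """Stream one block through stages with delays N/2, N/4, ..., 1."""
    n = len(x)
    stream = list(x) + [0j] * (n - 1)  # extra samples flush the pipeline
    delay = n // 2
    while delay >= 1:
        stream = sdf_stage(stream, delay)
        delay //= 2
    return stream[n - 1:]  # the first N-1 outputs are pipeline fill


def dft_bin(x, k):
    """Reference value of bin k from the DFT definition."""
    n = len(x)
    return sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))


def bit_reverse(index, bits):
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


x = [complex(n % 5, (3 * n) % 7) for n in range(16)]
out = r2sdf_fft(x)
error = max(abs(out[p] - dft_bin(x, bit_reverse(p, 4))) for p in range(16))
print(error < 1e-9)
```

In this model the results appear after N-1 samples of pipeline fill, which is the sum of the per-stage delays, and they emerge in bit-reversed order as the text describes.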


Figure 2 - N=256 R2²SDF FFT

Figure 2 shows two (radix-2) butterfly stages between each complex multiplier. The memory requirements for each butterfly stage are also shown.
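The per-stage buffering can be tabulated with a few lines of Python. This is a minimal sketch: `stage_delays` is a hypothetical helper, and the 4 bytes per sample assumes one 32-bit complex word per sample.

```python
def stage_delays(n):
    """Feedback buffer depth, in samples, for each radix-2 stage of an N-point pipeline."""
    delays = []
    depth = n // 2
    while depth >= 1:
        delays.append(depth)
        depth //= 2
    return delays


BYTES_PER_SAMPLE = 4  # assumption: one 32-bit complex sample

delays = stage_delays(256)
print(delays)                                    # [128, 64, 32, 16, 8, 4, 2, 1]
print(sum(delays))                               # 255
print([d * BYTES_PER_SAMPLE for d in delays])    # [512, 256, 128, 64, 32, 16, 8, 4]
```

The total of N-1 samples is the buffering contribution to the pipeline latency; the first two entries in bytes are the figures that decide which AE type each butterfly needs.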

4 The PC102

An overview of the salient features of the PC102 picoArray™ is given here in order to appreciate the FFT implementation described in section 5. Reference is made to the source code given in section 7 for the first butterfly stage for illustrative purposes.

The PC102 consists of an array of processors, or array elements (AEs), interconnected by a high-speed switching fabric called the picoBus.

4.1 Array Elements

The PC102 contains four different types of array elements (AEs), which are detailed in Table 1. Minor differences exist between the three programmable AE types (STAN2, MEM2 and CTRL2). These differences include the size of instruction/data memory, additional processing units and the instructions supported (e.g. multiply-accumulate, multiply). A long instruction word (LIW) of up to 64 bits allows up to 3 execution units to be targeted in a single cycle (160 MHz). Each AE has a number of ports for communicating with other AEs within the array.

In addition to the STAN2, MEM2 and CTRL2 AE types specified in Table 1, software for the PC102 can also be targeted at the ANY2 AE type, implying that:
- the function does not use any AE-specific instructions, and
- the code and data memory requirements can be met by all AE types.

Software, written in C or ASM, is targeted at an AE type depending on the processing units used and the memory required. Section 7 shows the ASM source between the code (line 32) and endcode (line 57) tags. The C or ASM code for each AE is contained within a picoVHDL wrapper which defines, amongst other things, the ports and the type of AE used. In section 7, line 27 (begin MEM) indicates that this code is targeted at a Memory AE type.

NOTE: The MAC, STAN, MEM, CTRL and ANY AE types are also supported on the PC102 for backwards compatibility with the PC101. Where an AE code body does not use any of the additional features which are specific to the PC102, using these AE types allows the software to run on both the PC101 and the PC102.


| Type | Description | Number | Memory (Bytes) |
|---|---|---|---|
| STAN2 | Standard. A standard AE type includes a multiply-accumulate peripheral as well as special instructions optimized for CDMA spread & de-spread. Memory is divided between 512 bytes code and 256 bytes data. | 240 | 768 |
| MEM2 | Memory. An AE having a multiply unit and additional memory. Memory division between code and data is configurable. | 64 | 8,704 |
| FAU | Function Accelerator Unit. A co-processor optimised for specific signal processing tasks (FEC, preamble detect, FHT, etc). Includes dedicated hardware for trellis operations. | 14 | n/a |
| CTRL2 | Control. An AE type with a multiply unit and larger amounts of data and instruction memory, optimized for the implementation of base station control functionality. Memory division between code and data is configurable. | 4 | 65,536 |
| Totals per PC102 device | | 322 | 1,003,520 |

Table 1: PC102 processor variants and memory distribution
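The totals in Table 1 can be cross-checked against the per-type figures. In the sketch below the CTRL2 count is not read from the table but taken as whatever remains of the 322-AE total, which is an inference; the memory total then serves as an independent check. Variable names are illustrative only.

```python
# (number of AEs, memory per AE in bytes) as listed in Table 1
per_type = {"STAN2": (240, 768), "MEM2": (64, 8704)}
FAU_COUNT = 14            # FAUs have no memory figure listed (n/a)
CTRL2_BYTES = 65536
TOTAL_AES = 322
TOTAL_BYTES = 1_003_520

# Inferred: the CTRL2 count is the remainder of the device total.
ctrl2_count = TOTAL_AES - FAU_COUNT - sum(count for count, _ in per_type.values())

memory = sum(count * size for count, size in per_type.values())
memory += ctrl2_count * CTRL2_BYTES

print(ctrl2_count, memory == TOTAL_BYTES)  # 4 True
```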

4.2 picoBus Switching Fabric

The picoBus is the name given to the switching fabric running vertically and horizontally between the processing elements in the array. AEs are assigned 32-bit slots on the picoBus at compile time, thereby removing the need for arbitration and making performance completely deterministic.

Each AE communicates over the picoBus via its ports. These are defined using picoVHDL. Each AE has a number of ports which can be configured to be read (incoming) or write (outgoing). Lines 20-22 in section 7 provide an example. Data sent between AEs is:
- written to a write port FIFO (by the sending AE),
- sent over the picoBus on the next available slot, and
- read from the read port FIFO (by the receiving AE).

By default, communication between AEs is data blocking. On attempting to read data from the picoBus, an AE will block until data becomes available in the read port FIFO. Similarly, when attempting to write data to the picoBus, the sending AE will block if its write port FIFO is full. A full write port FIFO implies that the receiving AE's read port is not taking data (i.e. is full itself).

Bandwidth on the picoBus between communicating AEs is assigned via @-rates. A signal is assigned an @-rate which is a power of 2, e.g. @8, @16. The @-rate is defined in the port declarations in both the sending and receiving AEs (see lines 21 & 22 in section 7). This @-rate is relative to the system clock (160 MHz) and indicates how often data may be sent. For example, @8 means that a 32-bit quantity can be sent every 8 cycles (of the 160 MHz bus). The receiving AE(s) must therefore issue a read (against the associated port) once every 8 cycles in order to prevent the sending AE from blocking.
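The relationship between an @-rate and the resulting bandwidth can be written out as a short calculation. This is a sketch based only on the figures quoted above (160 MHz clock, 32-bit slots); the function names are hypothetical and are not part of the picoChip tools.

```python
CLOCK_HZ = 160_000_000  # picoBus clock quoted in the text
SLOT_BITS = 32          # width of one picoBus transfer


def transfers_per_second(at_rate):
    """Maximum number of 32-bit transfers per second for a signal at @at_rate."""
    if at_rate < 1 or at_rate & (at_rate - 1):
        raise ValueError("@-rate must be a power of 2")
    return CLOCK_HZ // at_rate


def bandwidth_mbit_per_second(at_rate):
    """Peak bandwidth of the signal in Mbit/s."""
    return transfers_per_second(at_rate) * SLOT_BITS / 1e6


for rate in (4, 8, 16):
    print(f"@{rate}: {transfers_per_second(rate) / 1e6:g} M transfers/s, "
          f"{bandwidth_mbit_per_second(rate):g} Mbit/s")
```

At @8 this gives 20 million transfers per second, which matches the 20 Msps throughput quoted for the FFT when each sample occupies one 32-bit slot.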


5 picoArray™ Implementation of FFT

This section describes the implementation of the FFT processor proposed in section 3 on the PC102 picoArray™. The hardware-orientated approach adopted in the design of the proposed FFT enables it to be readily mapped to the parallel processing capabilities of the picoArray™. A performance summary for a 256-point FFT on the PC102 is given in Table 2.

| Parameter | Value |
|---|---|
| Input/Output | 12+j12 / 16+j16 |
| Throughput | 20 Msps |
| Latency | 13.6 µs |
| AE Resources (1) | 2 MEM, 3 STAN2, 11 ANY |

(1) The MEM and ANY AE types are used as no PC102-specific features are required.

Table 2 - 256-point FFT on PC102

Each of the butterfly and multiplier stages shown in Figure 2 is mapped to a separate array element (AE) as shown in Figure 3. This is in order to maximize the overall throughput. The use of the various AE types will be explained shortly.

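[Figure 3 block diagram: a pipeline of AEs in the order BF1 (MEM), BF2 (ANY), C.Mult (STAN2), Round (ANY), BF1 (ANY), BF2 (ANY), C.Mult (STAN2), Round (ANY), BF1 (ANY), BF2 (ANY), C.Mult (STAN2), Round (ANY), BF1 (ANY), BF2 (ANY). Separate Twiddles AEs (one MEM, one ANY) feed the first and second multipliers.]

Figure 3 - picoArray™ FFT architecture (N=256) on PC102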

Each AE takes its input from the picoBus, processes it and then provides an output to the next AE in the pipeline. Since the overall throughput is limited by the slowest AE, each loop on each AE should ideally take the same number of cycles for optimum performance.

Example: If each AE processes each sample in 8 cycles, then the maximum throughput is 160 MHz / 8 = 20 Msps (assuming 32-bit inputs).
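The same arithmetic extends to an unbalanced pipeline, where the slowest AE sets the rate. The sketch below uses made-up cycle counts purely for illustration; only the 160 MHz clock and the 8-cycle example come from the text.

```python
CLOCK_MHZ = 160.0


def pipeline_throughput_msps(cycles_per_sample):
    """Throughput of a pipeline whose stages each take the given cycles per sample."""
    return CLOCK_MHZ / max(cycles_per_sample)


balanced = [8] * 16            # every AE loop takes 8 cycles
unbalanced = [8] * 15 + [10]   # hypothetical: one AE needs 10 cycles

print(pipeline_throughput_msps(balanced))    # 20.0
print(pipeline_throughput_msps(unbalanced))  # 16.0
```

A single 10-cycle loop drops the whole pipeline from 20 Msps to 16 Msps, which is why each loop should ideally take the same number of cycles.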


The PC102 implementation of the FFT takes 12+j12 inputs and provides 16+j16 outputs. Bit growth occurs in both the butterfly stages (1 bit) and the multiplier stages (13 bits). The rounding scheme for the 256-point FFT shown in Figure 3 is as follows:

1. The 1st Rounding stage encountered only removes the bit growth (13 bits) due to the 1st complex multiplier. The bit growth due to the first 2 butterflies (2 bits in total) is not removed. The input to the 3rd butterfly is therefore 14+j14 (given a 12+j12 input).
2. The 2nd Rounding stage removes the bit growth (13 bits) due to the 2nd complex multiplier and also the bit growth (4 bits) due to the first 4 butterfly stages. The input to the 5th butterfly is 12+j12.
3. The 3rd Rounding stage only removes the bit growth (13 bits) due to the 3rd complex multiplier. The bit growth from the 5th, 6th, 7th and 8th butterflies results in a 16+j16 output.

The 256-point FFT in Figure 3 has 8 [log2(256)] butterfly stages and 3 [log4(256)-1] multipliers. The very first butterfly stage (BF1) uses a Memory (MEM) AE type, whereas every other butterfly stage uses the ANY AE type. The reason for this can be traced back to Figure 2, which shows the need to buffer 128 samples in the first butterfly stage. With 32-bit inputs, this equates to a memory requirement of 512 bytes, which exceeds the 256-byte data memory of the Standard (STAN2) AE type. Therefore, the first butterfly stage must use a MEM AE. The memory requirements for the second butterfly stage are half those of the first and can therefore be accommodated within any of the AE types.

The twiddle factors are held in memory in a separate AE from the multiplier (with the exception of the last multiplier). The amount of memory required for the twiddle store dictates the type of AE used. The first multiplier requires 256 (32-bit) twiddles and therefore a MEM. The second multiplier requires 64 (32-bit) twiddles, which can be accommodated within any AE type. The third multiplier has the twiddle store together with the multiplier in the one AE. Because the picoArray™ is a fixed-point processor, the twiddle factors are stored as 14+j14 values.
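The rounding scheme can be checked by tracking the word width through the pipeline. The sketch below encodes each stage as a change in bit width per component; the stage labels (`BF1` to `BF8`, `C.Mult 1`, `Round 1`, and so on) are labels chosen here for the illustration, while the growth and rounding amounts are the ones stated above.

```python
# (stage label, change in bit width per I/Q component)
PIPELINE = [
    ("BF1", +1), ("BF2", +1), ("C.Mult 1", +13), ("Round 1", -13),
    ("BF3", +1), ("BF4", +1), ("C.Mult 2", +13), ("Round 2", -(13 + 4)),
    ("BF5", +1), ("BF6", +1), ("C.Mult 3", +13), ("Round 3", -13),
    ("BF7", +1), ("BF8", +1),
]


def track_widths(input_bits, pipeline):
    """Return the bit width at the input of every stage, plus the final output."""
    widths = [input_bits]
    for _, delta in pipeline:
        widths.append(widths[-1] + delta)
    return widths


widths = track_widths(12, PIPELINE)
names = [name for name, _ in PIPELINE]

print(widths[names.index("BF3")])  # 14: input to the 3rd butterfly
print(widths[names.index("BF5")])  # 12: input to the 5th butterfly
print(widths[-1])                  # 16: output width
```

The widest intermediate value in this model is 29 bits at the output of the second multiplier, just before the second rounding stage brings it back to 12.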

5.1 Synchronous vs Asynchronous Operation

The source code for the 1st radix-2 butterfly, as shown in section 7, is designed for continuous synchronous operation. The output values for one set of FFT results are effectively clocked out by the subsequent set of inputs. The rate at which the outputs are provided is identical to that at which the inputs are presented to the FFT. The latency of such an implementation is therefore dependent on the input sample rate.

A closer look at the source code in section 7 explains this behavior. The bottom half of the radix-2 butterfly (loop2, lines 47-53) takes a complex (IQ) input and generates 2 complex results, one of which is sent immediately; the other is stored in memory for sending in the top half of the butterfly (loop1, lines 37-43). Each half of the butterfly consists of a loop with a get and a put ASM instruction. Each get instruction will block until an input sample is received for processing. Therefore, in order to send all the outputs held in memory (calculated previously in loop2), the top half of the butterfly (loop1) must run to completion, which requires input samples to be received for the next FFT set.

The tstport ASM instruction enables the outputs for the previous FFT set to be decoupled from the inputs for the following FFT set. In effect, the outputs are presented asynchronously relative to subsequent inputs to the FFT. The tstport instruction enables the butterfly AE to test for a received sample without blocking. In the top half of the butterfly (loop1), the outputs have already been calculated (previously in loop2) and are held in memory. Therefore, by using the tstport instruction in the top half of the butterfly, the AE can avoid blocking if no input is available, enabling it to load and output the results held in memory. If an input is available, it is processed as usual. Once all results have been sent, the AE can revert to using the blocking get instruction in order to receive the required number of samples prior to progressing to the bottom half of the butterfly (loop2).

Several points should be noted about this asynchronous mode of operation:
- The overall maximum throughput of the FFT is unaffected.
- All outputs are presented at the maximum throughput rate as soon as the last input sample in the FFT set has been received (by the 1st butterfly). For an input rate slower than the maximum throughput, this results in a burst of outputs followed by a gap before the next burst of outputs.
- The latency for the FFT is the same (i.e. the minimum possible latency) for all input sample rates.
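The difference between the two modes can be seen in a small behavioural model of the first butterfly AE. This is a sketch under stated assumptions, not a translation of the picoChip source: `ButterflyModel` and its members are hypothetical, blocking is modelled by returning when the input port is empty, real integers stand in for complex samples, and the asynchronous variant is one plausible reading of the tstport description (separate send and receive indices in loop1).

```python
from collections import deque


class ButterflyModel:
    """Behavioural model of the first butterfly AE (loop1 and loop2)."""

    def __init__(self, half, use_tstport=False):
        self.half = half                # N/2 samples held in the buffer
        self.use_tstport = use_tstport  # True models the asynchronous variant
        self.mem = [0] * half
        self.port = deque()             # input port FIFO
        self.out = []                   # everything written to the output port
        self.in_loop1 = True
        self.rx = 0                     # inputs consumed in the current loop
        self.tx = 0                     # stored results sent during loop1

    def feed(self, samples):
        """Deliver samples, then run until the AE would block on its input."""
        self.port.extend(samples)
        running = True
        while running:
            running = self._loop1_step() if self.in_loop1 else self._loop2_step()

    def _loop1_step(self):
        have_input = bool(self.port)
        pending = self.tx < self.half
        if not have_input and not (self.use_tstport and pending):
            return False                # blocked in get
        if pending:                     # send a result stored by loop2
            self.out.append(self.mem[self.tx])
            self.tx += 1
        if have_input:                  # buffer the new sample in a freed slot
            self.mem[self.rx] = self.port.popleft()
            self.rx += 1
        if self.rx == self.half:
            self.in_loop1, self.rx, self.tx = False, 0, 0
        return True

    def _loop2_step(self):
        if not self.port:
            return False                # blocked in get
        x = self.port.popleft()
        a = self.mem[self.rx]
        self.out.append(a + x)          # sum is sent immediately
        self.mem[self.rx] = a - x       # difference is kept for loop1
        self.rx += 1
        if self.rx == self.half:
            self.in_loop1, self.rx = True, 0
        return True


frame = [1, 2, 3, 4, 5, 6, 7, 8]
sync_ae = ButterflyModel(4)
async_ae = ButterflyModel(4, use_tstport=True)
sync_ae.feed(frame)
async_ae.feed(frame)
print(len(sync_ae.out), len(async_ae.out))  # 8 12
```

With a single frame and no further input, the synchronous model still holds its four differences in memory, whereas the asynchronous model has already sent them. In both cases the first four outputs are the initial (zero) buffer contents.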

5.2 A 2048-point FFT

The size of the FFT can be increased by adding AEs without impacting the throughput. For example, increasing the 256-point FFT in Figure 3 to a 2048-point FFT would involve adding:
- an additional radix-4 stage (taking it to 1024-point), and
- a radix-2 stage (taking it to 2048-point).

The additional radix-4 stage would consist of:
- 2 radix-2 butterflies (MEM)
- a complex multiplier (STAN2)
- a twiddle store (MEM)
- a rounding stage (ANY).

The existing entities in the FFT library can be used to create the radix-2 stage, which would consist of:
- a radix-2 butterfly (MEM)
- a complex multiplier (STAN2)
- a twiddle store (MEM)
- a rounding stage (ANY).

NOTE: The twiddles in the twiddle store for this stage are standard radix-2 twiddles (i.e. twiddle decomposition as used in the radix-4 stages is not used).

The performance for a 2048-point FFT on the PC102 is summarized in Table 3. The increase in latency is mainly due to the buffering in the additional butterfly stages.

| Parameter | Value |
|---|---|
| Input/Output | 12+j12 / 16+j16 |
| Throughput | 20 Msps |
| Latency | 104 µs |
| AE Resources (2) | 7 MEM, 5 STAN2, 13 ANY |

(2) The MEM and ANY AE types are used as no PC102-specific features are required.

Table 3 - 2048-point FFT on PC102
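The AE resources quoted for the 2048-point FFT follow from adding the two extra stages to the 256-point design. A minimal tally in Python, using the per-stage AE lists given above (the variable names are illustrative):

```python
from collections import Counter

# 256-point FFT resources (Table 2)
fft_256 = Counter({"MEM": 2, "STAN2": 3, "ANY": 11})

# Extra radix-4 stage: 2 butterflies (MEM), twiddle store (MEM),
# complex multiplier (STAN2), rounding stage (ANY)
radix4_stage = Counter({"MEM": 3, "STAN2": 1, "ANY": 1})

# Extra radix-2 stage: 1 butterfly (MEM), twiddle store (MEM),
# complex multiplier (STAN2), rounding stage (ANY)
radix2_stage = Counter({"MEM": 2, "STAN2": 1, "ANY": 1})

fft_2048 = fft_256 + radix4_stage + radix2_stage
print(dict(fft_2048))  # {'MEM': 7, 'STAN2': 5, 'ANY': 13}

# Butterfly count: 8 existing + 2 + 1 = log2(2048)
butterflies = 8 + 2 + 1
print(2 ** butterflies)  # 2048
```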

5.3 A Smaller FFT

For slower sample rates, the overall FFT can be made smaller by combining several functions into a single AE. In its current form, the FFT uses the multiply-accumulate instruction in the STAN2 for the multiplier stages. Therefore, the butterfly stages can only be combined with the multiplier stage if their memory requirements can be met by a STAN2.

5.4 A Faster FFT

To achieve an increase in throughput (above 20 Msps), two possibilities exist:

1. Further parallelism could be exploited within each butterfly stage, multiplier and rounding stage, with each function being spread across multiple AEs.
2. An alternative (and possibly simpler) approach would be to instantiate two FFTs and use an AE to multiplex between them.

6 References

[1] Shousheng He & Mats Torkelson, "A New Approach to Pipeline FFT Processor", Dept of Applied Electronics, Lund University, Sweden.
[2] Software Library Datasheet, FFT, picoChip.


7 Appendix: Source Code

     1  ------------------------------------------------------------------------------
     2  -- Bfly1Mem.vhd
     3  ------------------------------------------------------------------------------
     4  -- Copyright (c) 2002 picoChip Designs Ltd.
     5  -- Proprietary and Confidential Information.
     6  -- Not to be copied or distributed.
     7  ------------------------------------------------------------------------------
     8  -- Description: Type 1 butterfly as per fig 3 He & Torkelson.
     9  --              Buffering requires use of MEM AE.
    10  ------------------------------------------------------------------------------
    11
    12
    13  entity Bfly1Mem is
    14
    15    generic(
    16      NP1        : integer16;
    17      MEM_BUFFER : integer16);
    18
    19
    20    port(
    21      inputDataBfly1 : in complex16@4;
    22      outBfly1       : out complex16@4);
    23
    24  end entity Bfly1Mem;
    25
    26  architecture ASM of Bfly1Mem is
    27  begin MEM
    28
    29    initialize regs:= (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
    30    initialize memory 0: array (0 to 256) of integer16 :=(others => 0);
    31
    32    code
    33
    34    top:
    35      copy.0 MEM_BUFFER, AP \ copy.1 0,r11
    36    loop1:
    37      get inputDataBfly1, r[7:6]
    38      ldl (AP), r[9:8]
    39      stl r[7:6], (AP) \ add.0 r11, 1, r11
    40      copy.0 r9, r1 \ copy.1 r8, r0
    41      sub.0 r11, NP1, r15
    42      bne loop1 \ sub.0 AP, 4, AP
    43      =-> put r[1:0], outBfly1
    44
    45      copy.0 MEM_BUFFER, AP \ copy.1 0,r11
    46    loop2:
    47      get inputDataBfly1, r[7:6]
    48      ldl (AP), r[9:8] \ add.0 r11, 1, r11
    49      add.0 r8, r6, r0 \ add.1 r9, r7, r1
    50      sub.0 r8, r6, r4 \ sub.1 r9, r7, r5
    51      stl r[5:4], (AP) \ sub.0 r11, NP1, r15
    52      bne loop2 \ sub.0 AP, 4, AP
    53      =-> put r[1:0], outBfly1
    54
    55      bra top
    56
    57    endcode;
    58  end Bfly1Mem;
