
Parallel Systolic FFT Architectures for High-Speed, High Throughput Frequency-Domain Filtering

Oscar E. Agazzi
October 12, 2012


ClariPhy Confidential

Overview
Introduction
Systolic FFT architecture (radix 2)
Parallel systolic architectures
Storage requirements
Other considerations
Conclusions


Introduction (1)
In this presentation we investigate high-speed, high-throughput architectures for FFTs.
The main problem we want to address is how to simplify the complex interconnection pattern created by the butterflies in FFT implementations derived (directly or indirectly) from FFT flow diagrams.
Systolic architectures greatly simplify the interconnections, at the expense of increased storage requirements.
Systolic architectures alone may not be sufficient to achieve the throughput and speed required by the BCD filter in the CL10010.
Systolic architectures may need to be combined with parallel processing and some degree of traditional, butterfly-based architecture.


Introduction (2)
The work presented here is largely based on the systolic FFT architecture described in reference [1]. However, no good references have been found on how to combine systolic implementations with parallel processing.
The approach presented here may be similar to the one described in [2], but that reference is not explicit enough for its work to be replicated.

For simplicity, this presentation considers only radix-2 FFTs. Additional savings may be achieved by using higher-radix FFTs.
O'Leary [1] reports that savings may be achieved by using radix-4 transforms.


Systolic FFT Architecture (radix 2)


Example for N=8

[Figure: systolic radix-2 FFT for N=8. Labels: Input 1, Input 2, Top Output, Bottom Output; delay elements Delay 4, Delay 2, Delay 1; adders (+) and multipliers (X) with twiddle factors W0, W1, W2, W3 and W0, W2.]

Complexity vs. FFT Size N

FFT Size N   Memory (Complex Words), ~3N/2   Complex Multipliers, log2(N)-1   Complex Adders, 2log2(N)
8            12                              2                                6
16           24                              3                                8
32           48                              4                                10
64           96                              5                                12
128          192                             6                                14

I/O Timing

[Figure: I/O timing diagram. Labels: BLOCK A, BLOCK B, BLOCK C, BLOCK D; POS A, NEG A, POS B, NEG B, POS C, NEG C; time axis marked N/2, N, 3N/2, 2N, 5N/2.]
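
The complexity figures on this slide follow directly from the three formulas given with them (about 3N/2 complex words of memory, log2(N)-1 complex multipliers, 2log2(N) complex adders). The short Python sketch below simply evaluates those formulas; the function name `systolic_fft_complexity` is a hypothetical helper introduced for illustration, and the memory figure treats the approximate 3N/2 as exact.

```python
def systolic_fft_complexity(n):
    """Evaluate the slide's complexity formulas for a radix-2 systolic FFT.

    n must be a power of two. The memory value uses 3N/2 exactly, although
    the slide gives it only as an approximation (~3N/2).
    """
    stages = n.bit_length() - 1  # log2(n) for a power of two
    return {
        "memory_words": 3 * n // 2,
        "complex_multipliers": stages - 1,
        "complex_adders": 2 * stages,
    }


for size in (8, 16, 32, 64, 128):
    print(size, systolic_fft_complexity(size))
```

The multiplier and adder counts grow only with log2(N), while memory grows linearly with N.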


Discussion
The systolic processor has an extremely simple interconnection pattern.
Although memory size grows linearly with N, it is quite manageable for N=64 or even N=128, which are the likely sizes for a parallel/systolic FFT processor for the CL10010 BCD filter.
Notice that the processor shown in the previous slide can process two independent FFTs at the same time.
The inputs must be skewed in time by N/2 (this requires additional buffering).
The outputs come sequentially (aligning the outputs also requires additional buffering).
The outputs come in bit-reversed order.
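
To make the bit-reversed output order concrete, here is a minimal Python sketch of the standard bit-reversal permutation. The function names are hypothetical, and the sketch only illustrates the ordering itself; it does not model the processor's timing or buffering.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Frequency indices in the order a bit-reversed output would deliver them."""
    num_bits = n.bit_length() - 1  # log2(n) for a power of two
    return [bit_reverse(i, num_bits) for i in range(n)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the permutation twice restores natural order, which is why a later stage that also works in bit-reversed order can undo it without a separate reordering step.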


FFT Parallelization
In the following discussion we use a numerical example to make the ideas more concrete.
We assume that the FFT size is N=8192 and the desired throughput is 64 GS/s.
We also assume that the input arrives in blocks of D=128 consecutive samples.
Therefore a complete FFT block of 8192 samples can be thought of as a matrix of samples with 64 rows and 128 columns.
The FFT processor must accept blocks of 128 samples (where each block is a row of the matrix) at a rate of 500 MHz.

The discussion can easily be generalized to other FFT sizes N and decimation factors D.
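
The arithmetic behind this example can be sketched in a few lines of Python. The variable names are illustrative assumptions; the values are those stated on the slide, and the sample values are just their own indices so the layout is easy to inspect.

```python
N = 8192                 # FFT size
SAMPLE_RATE_HZ = 64e9    # desired throughput, 64 GS/s
D = 128                  # samples per input block (decimation factor)

rows = N // D                        # rows in the sample matrix
block_rate_hz = SAMPLE_RATE_HZ / D   # rate at which rows must be accepted

# Arrange one FFT block as a matrix: row m holds samples 128*m .. 128*m + 127.
samples = list(range(N))
matrix = [samples[m * D:(m + 1) * D] for m in range(rows)]

print(rows, block_rate_hz / 1e6, "MHz")
print(matrix[1][:4])  # first samples of the second row
```

Changing `N` and `D` gives the matrix shape and block rate for other FFT sizes and decimation factors.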


FFT Parallelization (cont.)


The parallelization of the FFT is based on the following factorization:

$$X(k) = \sum_{n=0}^{8191} x(n)\, W_{8192}^{nk}, \qquad W_N = \exp\left(-\frac{2\pi j}{N}\right)$$

This can be expressed as:

$$X(k) = \sum_{r=0}^{127} W_{8192}^{rk} \sum_{m=0}^{63} x(128m + r)\, W_{64}^{mk} = \sum_{r=0}^{127} W_{8192}^{rk}\, X_r(k)$$

Writing k = 64p + q with p = 0, ..., 127 and q = 0, ..., 63, and observing that X_r(k) is periodic in k with period 64, we can write:

$$X(64p + q) = \sum_{r=0}^{127} W_{8192}^{r(64p + q)}\, X_r(q) = \sum_{r=0}^{127} W_{128}^{rp}\, W_{8192}^{rq}\, X_r(q)$$

Finally:

$$X(64p + q) = \mathrm{FFT}_{128}\left\{ W_{8192}^{rq}\, X_r(q) \right\}$$

where the FFT is taken with respect to index r.
The implementation of this factorization is shown in the following slide.
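
As a numerical check of this factorization, the following Python sketch builds an 8192-point FFT from 128 leaf FFTs of size 64, a set of twiddle-factor scalers, and 64 output FFTs of size 128, then compares the result with a direct FFT. It is a software illustration only, not a model of the systolic hardware. The function names are hypothetical, the reference FFT is a plain recursive radix-2 routine, and the forward-transform convention W_N = exp(-2*pi*j/N) is assumed.

```python
import cmath
import random


def fft(x):
    """Recursive radix-2 FFT with W_N = exp(-2*pi*j/N); len(x) is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def parallel_fft(x, d):
    """Length-N FFT built from d leaf FFTs of size N/d, scalers, and size-d output FFTs."""
    n = len(x)
    rows = n // d
    # Leaf FFTs: X_r(q) = sum over m of x(d*m + r) * W_rows^(m*q), one per column r.
    leaves = [fft(x[r::d]) for r in range(d)]
    result = [0j] * n
    for q in range(rows):
        # Scalers: multiply X_r(q) by the twiddle factor W_N^(r*q).
        scaled = [cmath.exp(-2j * cmath.pi * r * q / n) * leaves[r][q]
                  for r in range(d)]
        # Output FFT over index r yields X(rows*p + q) for p = 0 .. d-1.
        column = fft(scaled)
        for p in range(d):
            result[rows * p + q] = column[p]
    return result


random.seed(0)
x = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(8192)]
max_error = max(abs(a - b) for a, b in zip(fft(x), parallel_fft(x, 128)))
print("max deviation from direct FFT:", max_error)
```

The slice `x[r::d]` picks out column r of the 64 by 128 sample matrix, so each leaf FFT operates on one column, and the only step that mixes columns is the final 128-point FFT.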

Parallel/Systolic Processor
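[Figure: block diagram of the parallel/systolic processor. The input at fs = 64 GHz enters a Serial to Parallel Converter (fD = 500 MHz), which feeds FFT Leaf 0 through FFT Leaf 63, followed by the Scalers and the 128 Point FFT. FFT Output: 64 blocks of 128 samples each.]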


Discussion
The only complex interconnections in this processor occur in the 128-point output FFT.
However, this FFT is relatively small, so its interconnections should not be a problem.
By comparison, the BCD filter in the CL4010 uses an FFT size of 512.
The FFT required by the processor proposed here is 4 times smaller, and the technology is more advanced than in the CL4010.
The processor described here lends itself to an extremely regular and simple layout.
The output comes in the form of a matrix of complex numbers with 64 rows and 128 columns, with both columns and rows in bit-reversed order.
It is not necessary to reorder them because the IFFT can automatically reverse the order of both rows and columns.
Frequency-domain filtering can be implemented in bit-reversed order.
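
The reason filtering can stay in bit-reversed order is that frequency-domain filtering is a pointwise multiplication, so it gives the same result in any ordering as long as the filter coefficients are stored in the same ordering as the spectrum. The Python sketch below illustrates this with a 64 by 128 matrix. The helper names, the test data, and the mapping of natural position (i, j) to the bit-reversed position are illustrative assumptions, not details taken from the slides.

```python
ROWS, COLS = 64, 128
ROW_BITS, COL_BITS = 6, 7  # log2(64) and log2(128)


def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def to_bit_reversed(matrix):
    """Move entry (i, j) to (bit_reverse(i), bit_reverse(j))."""
    out = [[None] * COLS for _ in range(ROWS)]
    for i in range(ROWS):
        for j in range(COLS):
            out[bit_reverse(i, ROW_BITS)][bit_reverse(j, COL_BITS)] = matrix[i][j]
    return out


def multiply(a, b):
    """Pointwise product of two matrices of equal shape."""
    return [[a[i][j] * b[i][j] for j in range(COLS)] for i in range(ROWS)]


# Made-up spectrum and filter response, in natural order.
spectrum = [[complex(i, j) for j in range(COLS)] for i in range(ROWS)]
response = [[complex(1 + i + j, i - j) for j in range(COLS)] for i in range(ROWS)]

# Filtering the permuted data with permuted coefficients ...
filtered_permuted = multiply(to_bit_reversed(spectrum), to_bit_reversed(response))
# ... matches permuting the naturally ordered result.
same = filtered_permuted == to_bit_reversed(multiply(spectrum, response))
print(same)
```

In practice this means the filter coefficients would be stored once in bit-reversed order, so no reordering logic is needed between the FFT and the IFFT.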

Hardware Requirements
Hardware Component                                        Number of Units
Memory (Complex Words)                                    10240
Memory (Bits) (assumes average word length is 24 bits)    491520
Complex Multipliers                                       896
Complex Adders                                            1216

Assumptions
Numbers are per polarization and per FFT block.
Assuming 2 polarizations and an IFFT similar to the FFT, the numbers in the table should be quadrupled.
Pipeline registers are not included.
The output FFT requires (N/2)log2(N) complex multipliers and an equal number of complex adders.
The scaler requires 128 complex multipliers.
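
The multiplier, adder, and bit counts in the table can be reproduced with the short Python sketch below. It rests on an assumed reading of the architecture: 64 leaves, each a 64-point systolic FFT with log2(N)-1 complex multipliers and 2log2(N) complex adders, plus the 128-point output FFT and the scaler described above. The memory word count is taken from the table rather than derived, and the factor of 2 (real and imaginary parts) in the bit count is an assumption that happens to match the table.

```python
NUM_LEAVES = 64          # FFT Leaf 0 .. FFT Leaf 63
LEAF_FFT_SIZE = 64       # assumed size of each systolic leaf FFT
OUTPUT_FFT_SIZE = 128    # butterfly-based output FFT
SCALER_MULTIPLIERS = 128
MEMORY_WORDS = 10240     # taken directly from the table
WORD_LENGTH_BITS = 24    # average word length stated in the table


def log2_int(n):
    """log2 of a power of two, as an integer."""
    return n.bit_length() - 1


# Systolic leaves: log2(N)-1 multipliers and 2*log2(N) adders each.
leaf_multipliers = NUM_LEAVES * (log2_int(LEAF_FFT_SIZE) - 1)
leaf_adders = NUM_LEAVES * 2 * log2_int(LEAF_FFT_SIZE)

# Output FFT: (N/2)*log2(N) multipliers and the same number of adders.
output_fft_units = (OUTPUT_FFT_SIZE // 2) * log2_int(OUTPUT_FFT_SIZE)

total_multipliers = leaf_multipliers + output_fft_units + SCALER_MULTIPLIERS
total_adders = leaf_adders + output_fft_units

# Assumption: each complex word holds a real and an imaginary part.
memory_bits = MEMORY_WORDS * 2 * WORD_LENGTH_BITS

print(total_multipliers, total_adders, memory_bits)
# Two polarizations and an IFFT similar to the FFT: quadruple everything.
print(4 * total_multipliers, 4 * total_adders, 4 * memory_bits)
```

Under these assumptions the leaves account for 320 of the multipliers and 768 of the adders, so the output FFT and scaler dominate the multiplier count.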


Conclusions
A systolic architecture can considerably simplify the routing of large-block-size, high-throughput, high-speed FFTs.
In deep-submicron CMOS technologies, interconnections have a large impact on power dissipation. It is therefore important to minimize interconnections and to use regular architectures that lead to an efficient layout.
In this presentation we have proposed an architecture that has the potential to meet the requirements of the CL10010.
However, significant work still needs to be done to explore alternative values of parameters such as DSP clock speed, parallelization factor, size of the front-end FFTs (FFT Leaves) versus size of the back-end FFT, radices other than 2, etc.
It is believed that this work can lead to a very efficient implementation of the BCD filter in the CL10010.


References
[1] G. C. O'Leary, "Nonrecursive Digital Filtering Using Cascade Fast Fourier Transformers," IEEE Transactions on Audio and Electroacoustics, Vol. AU-18, No. 2, June 1970, pp. 177-183.
[2] P. Jackson et al., "A Systolic FFT Architecture for Real Time FPGA Systems," MIT Lincoln Laboratory publication, September 29, 2004.
[3] T. Woodward, private communication.
[4] A. V. Oppenheim, Applications of Digital Signal Processing, Prentice Hall, 1978, Chapter 5 (Applications of Digital Signal Processing to Radar).
