
IMPLEMENTATION OF CHANNEL DEMODULATOR FOR DAB SYSTEM


Chien-Ming Wu, Ming-Der Shieh, Hsin-Fu Lo, and Min-Hsiung Hu
Graduate School of Engineering Science & Technology, National Yunlin University of Science & Technology, Taiwan
Department of Electronic Engineering, National Yunlin University of Science & Technology, Taiwan
Division of Design Service, National Science Council Chip Implementation Center (CIC), Taiwan

ABSTRACT

This paper describes the VLSI implementation of the Fast Fourier Transform (FFT) for the Eureka-147 Digital Audio Broadcasting (DAB) system. We emphasize how to minimize the hardware requirement and efficiently manage the memory to meet the DAB requirement. Implementation results demonstrate the applicability of our work, which features a modular design, consumes less silicon area, and facilitates extension to high-transmission-rate applications. The core size of the resulting chip implementation is 2086x1806 µm² based on the TSMC 0.35 µm 1P4M CMOS process. Performance evaluation reveals that our design for the targeted channel demodulator outperforms previous solutions.

1. INTRODUCTION

The Digital Audio Broadcasting (DAB) system, described in the European Eureka-147 standard [1], offers high-quality audio services, supports multimedia data for mobile reception, and might replace the traditional radio system. Basically, two strategies are employed to implement the DAB receiver: the DSP-based architecture [2, 3] and the ASIC-based implementation [4, 5]. The former offers maximum flexibility, ease of use, and simple programming, but it can only provide limited processing capability. In contrast, the ASIC-based implementation has the potential to support real-time symbol decoding at low cost.

Figure 1 shows an overview of the DAB system, in which ISO/MPEG coding is adopted for source coding and COFDM (Coded Orthogonal Frequency Division Multiplexing) for channel coding and modulation [1]. After convolutional coding, the generated codewords are interleaved in frequency for the fast information channel and in both time and frequency for the main service channel, and then the OFDM modulation is performed. In this paper, we focus on the design and implementation of the channel demodulator, which essentially performs a Fast Fourier Transform (FFT). In general, two basic types of FFT architectures can be found in the literature: the pipelined architecture, with each stage consisting of a butterfly unit [6, 7], and the single butterfly architecture [5, 8], which employs just one radix-r butterfly unit. The main concern is the trade-off between hardware overhead and speed requirement.

Although the pipelined architecture can provide a higher throughput rate than the single butterfly implementation, we are still interested in the single butterfly architecture because of the specifications of the channel demodulator as well as the hardware considerations on the implementation of DAB receivers. For the single butterfly implementation, a basic problem is how to efficiently manage memory read/write accesses in order to increase the throughput rate. The common solutions include: (1) Use a high-radix implementation to reduce the total number of memory accesses at the expense of increasing the arithmetic complexity, i.e., the hardware requirement of a high-radix butterfly unit. (2) Partition the memory into several banks in order to allow concurrent accesses of multiple data with a more complicated addressing scheme, which might correspond to a higher routing area.

In this paper, we describe the design and implementation of the FFT for the DAB channel demodulator. We show our experiences on how to use the conflict-free memory addressing arrangement in [9] to minimize the hardware requirement and to match the DAB requirement. Implementation results demonstrate the applicability of our work to the targeted channel demodulator and the advantages over previous solutions [5, 7] in terms of hardware requirement.

The rest of this paper is organized as follows: Section 2 reviews the background and our previous work [9] related to this paper. Section 3 describes the resulting architecture and design of the FFT processor. Then, the corresponding chip implementation and performance evaluation are shown in Section 4. Finally, Section 5 concludes this work.

Figure 1. An overview of the DAB system [5].

2. PRELIMINARY RESULTS

The N-point Discrete Fourier Transform (DFT) of a sequence x(k) is defined as

$X(n) = \sum_{k=0}^{N-1} x(k)\, W^{nk}$   (1)

where n = 0, 1, ..., N-1 and $W = e^{-j2\pi/N}$. From Eq. (1), we know that $N^2$ multiplications and N(N-1) additions are needed to directly perform the required computations. By applying the FFT, the computational complexity can be reduced to O(N log N). If the number of sampled points is a power of the radix r, then it is easy to compute the DFT by using a radix-r FFT algorithm. In such a case, the N-point DFT can be decomposed into a set of recursively related r-point transforms. Decimation in time (DIT) and decimation in frequency (DIF) are the two basic classes of FFT algorithm [10]. Specifically, the DIT FFT algorithm is based on decomposing the input sequence x(k) into successively smaller subsequences. The DIF FFT algorithm decomposes the output sequence X(n) into smaller subsequences in the same way. Figure 2 shows a DIT 8-point FFT algorithm, in which the data in each stage can be processed based on the so-called butterfly units.
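As a rough illustration of the relationship between the direct DFT of Eq. (1) and the radix-2 DIT decomposition, the following Python sketch computes both and compares the results. It is a textbook recursive formulation, not the hardware schedule used in this paper; the function names `dft` and `fft_dit` and the test sequence are assumptions made for the example.

```python
import cmath

def dft(x):
    """Direct N-point DFT, Eq. (1): X(n) = sum_k x(k) * W**(n*k), W = exp(-j*2*pi/N)."""
    n_points = len(x)
    w = cmath.exp(-2j * cmath.pi / n_points)
    return [sum(x[k] * w ** (n * k) for k in range(n_points))
            for n in range(n_points)]

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return list(x)
    even = fft_dit(x[0::2])   # transform of the even-indexed input samples
    odd = fft_dit(x[1::2])    # transform of the odd-indexed input samples
    out = [0j] * n_points
    for n in range(n_points // 2):
        t = cmath.exp(-2j * cmath.pi * n / n_points) * odd[n]  # twiddle times odd term
        out[n] = even[n] + t                    # upper butterfly output
        out[n + n_points // 2] = even[n] - t    # lower butterfly output
    return out

# Hypothetical 16-point test sequence, chosen only for illustration.
samples = [complex(k % 5, -(k % 3)) for k in range(16)]
max_err = max(abs(a - b) for a, b in zip(dft(samples), fft_dit(samples)))
print(max_err < 1e-9)
```

Each level of the recursion splits the input into two half-length subsequences, which is the "successively smaller subsequences" idea of the DIT algorithm; the loop body is one radix-2 butterfly.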

0-7803-7761-3/03/$17.00 ©2003 IEEE
Figure 2. The data flow graph of DIT 8-point FFT computation.

In general, an N-point FFT computation requires $(N/r) \times \log_r N$ radix-r butterfly computations, and either the pipelined architecture or the single butterfly architecture can be selected for a dedicated application. For the single butterfly implementation, it implies $2N \times \log_r N$ memory accesses, which are the main bottleneck for fast FFT computation. Therefore, we need an efficient memory management strategy to overcome this problem, i.e., to reduce the number of memory accesses or to increase the memory bandwidth. In our previous work [9], we presented a set of simple but efficient equations to partition the memory into a number of memory banks such that the equivalent memory bandwidth can be increased with simple interconnection networks.

Let m be the number of stages for the FFT computation; then the value m can be computed by

$m = \lceil \log_r N \rceil$   (2)

Following the notation of the conventional number system, it is assumed that the original memory address $(d)_r$ is expressed in unsigned radix-r representation defined as

$(d)_r = (d_{m-1}, d_{m-2}, \ldots, d_2, d_1, d_0)_r$   (3)

where $d_i$ is an integer and $0 \le d_i \le r-1$. In consequence, a feasible solution to partition the memory into r banks can be easily obtained as shown in Eq. (4), which implies that the original address $(d)_r$ will be distributed into the bank number B(d, r). The correctness of Eq. (4) is assured by observing that for a given butterfly index, the equation contains the distinguishable variable at each stage.

$B(d, r) = (d_{m-1} + d_{m-2} + \cdots + d_2 + d_1 + d_0) \bmod r$   (4)

Finally, we consider the mapping of $(d)_r$ into one of the address locations of the selected bank B(d, r). To simplify the hardware implementation, the assigned address BA(d, r) in the bank B(d, r) is obtained by discarding the least significant digit of the original address. Equation (5) causes no conflict because two original addresses that differ only in the least significant digit are distributed into different banks by Eq. (4), since $0 \le d_0 \le r-1$.

$BA(d, r) = (d_{m-1}, d_{m-2}, \ldots, d_2, d_1)_r$   (5)

3. FFT DESIGN AND IMPLEMENTATION

Figure 3 depicts the block diagram of the single butterfly architecture for our FFT processor. It operates on a 24.576 MHz clock and consists of a simple radix-2 DIT butterfly unit, a single-port FFT RAM, a coefficient ROM, a control unit, and an address-generate unit (AGU). All variables are complex and the internal datapath widths are either 8 or 16 bits. The details of the VLSI realization are described in the following subsections.

Figure 3. Block diagram of the FFT processor.

3.1 Memory Arrangement

For memory arrangement, first we have to decide whether the ping-pong mode or the in-place mode is to be applied to store the intermediate values when implementing the FFT RAM. The main disadvantage of the former is that twice as many memory spaces are required in comparison with the in-place operation, although the control circuit is simple. For in-place scheduling, exactly one memory space is needed for storing the intermediate values, and the old computed values are immediately overwritten by the newly computed values. This is an important feature for the realization of long FFTs because the area for storing the large amount of intermediate results will occupy a significant fraction of the available chip area. For this reason, we consider only in-place schemes in this work. Basically, the memory addresses of the in-place schedule can be generated with little hardware overhead based on the cyclically rotational property [11].

The lower hardware cost of the single butterfly architecture is achieved at the price of a lower throughput rate than the pipelined version. According to operational mode I defined in the Eureka-147 standard, a 2048-point FFT operation should be completed within 1.25 ms. Under such a circumstance, it is not possible to complete the desired FFT operation based on the radix-2 solution without memory partition, given the chosen operational frequency of 24.576 MHz. In order to make the single butterfly architecture meet the DAB requirement, memory partitioning becomes a cost-effective solution. In our implementation, the single-port FFT RAM is divided into r = 2 banks to meet the timing requirement, and the in-place scheduling scheme is applied for saving memory spaces.

The address-generate unit shown in Figure 4 is designed to generate addresses for two memory banks and the coefficient ROM. The butterfly counter is used to sequentially generate the required butterfly indices at stage one. The two barrel shifters first concatenate their indices, respectively, with the current butterfly index and then emulate the right rotational property of addresses at the present stage specified by the stage counter. Finally, the MUX distributes the addresses based on Eqs. (2)-(5) such that the output of each barrel shifter can be directed into the correct memory bank. For the radix-2 implementation, the control signal Bank-index is derived by performing a bit-wise XOR operation on the original addresses according to Eq. (4).

In addition, the contents of the coefficient ROM and the corresponding addressing rules can be easily decided by following the data flow graph of the DIT FFT computation. Note that we only need to store half the twiddle coefficients due to their symmetric property. Let the radix-2 twiddle coefficient $W^p = e^{-j2\pi p/N}$ be stored in the pth ROM address. Then, the ROM contents can be accessed based on the current butterfly index BI and the present
stage number t according to the following equations. Let the binary representation of the current butterfly index be given by

$BI = (b_{m-2}, b_{m-3}, \ldots, b_2, b_1, b_0)_2$   (6)

where $m = \log_2 N$ is the number of stages for the radix-2 implementation. From the data flow graph, the elements $b_i$ of BI can be used as variables in conjunction with the value t to generate proper ROM addresses. Specifically, we first generate a vector from the present t value based on Eq. (7), and then the desired ROM address p(BI, t) can be computed by using the vector as a mask to filter out unwanted $b_i$'s according to Eq. (8).

$2^{t-1} - 1 = [q_{m-2}, q_{m-3}, \ldots, q_1, q_0]_2$  for t = 1, 2, ..., m   (7)

Equation (7) can be easily implemented by resetting a shift register and then shifting in a "one" from the least significant bit each time the stage advances. Eq. (8) represents the masked output of the bit reversal of the current butterfly index. In both cases, the implementation cost is almost negligible.

Figure 4. The block diagram of the address-generate unit.

3.2 Butterfly Unit

The butterfly unit is the core of FFT processors and determines the desired clock speed and the resulting throughput. In this work, the butterfly unit was designed with the simple radix-2 DIT-FFT algorithm. As shown in Figure 5, the arithmetic operations consist of calculating a pair of complex values, A' = A + BW and B' = A - BW, from a pair of complex inputs, A and B, and the twiddle coefficient W.

Figure 5. The arithmetic of radix-2 DIT-FFT algorithm.

For a butterfly unit without pipelining, the critical path is the sum of the memory read operation, the arithmetic operation (multiplication and addition of complex numbers), and the memory write operation. To reduce the critical path delay, we divide the whole operation of the butterfly unit into (s+2) different steps (the first step for the memory read operation, the following s steps for the arithmetic operation, and the last step for the memory write operation) as indicated in Figure 6. Due to the in-place computation, we have to schedule the tasks assigned to the pipelined butterfly unit such that no control hazard occurs during memory accesses. A control hazard (see Figure 7(a)) results from the conflict when the butterfly unit intends to access more than two data in the same memory bank. Figure 7(b) shows the schedule that eliminates the control hazard, provided that only single-port memory is available in the implementation. The arrangement of Figure 7(b) results in only 50% hardware utilization of the pipelined butterfly unit. In contrast, 100% hardware utilization can be achieved if dual-port memory is employed in the design. Note that the area occupied by the memory module is proportional not only to the number of stored data but also to the number of ports. Obviously, the chip area of a dual-port memory is much higher than that of a single-port memory.

Since we use a 24.576 MHz clock in our FFT processor, the arithmetic operation can be finished within one clock cycle (s = 1). Each butterfly operation thus takes only three clock cycles, one each for the memory read operation, the arithmetic operation, and the memory write operation. In addition, only 50% hardware utilization is achieved because single-port memory is employed in our design to reduce the hardware cost.

Figure 6. Radix-2 DIT pipelined butterfly unit.

Figure 7. (a) The control hazard. (b) The schedule resolving the control hazard.

4. CHIP REALIZATION AND COMPARISON

All the modules in our design have been successfully implemented based on the TSMC (Taiwan Semiconductor Manufacturing Company) 0.35 µm 1P4M CMOS process and simulated using Synopsys and Cadence tools. Based on the specifications of the DAB channel demodulator, the resulting FFT processor is capable of completing the four operational modes (mode I: 2048 points, mode II: 512 points, mode III: 256 points, and mode IV: 1024 points) with a clock frequency of 24.576 MHz. The corresponding physical layout is shown in Figure 8; it includes 2x1024x16 SRAMs (two banks, each containing 1024x16 bits) and 2x1024x8 ROMs (one for the real part and another for the imaginary part). In terms of 2-input NAND gates, the total gate count is 4351, excluding the memories. The resulting core size of the chip implementation is about 2086x1806 µm² and the overall chip size including I/O pads is 2856x2594 µm².

Figure 8. The layout of the developed FFT processor.

We compare the performance of our implementation with the following FFT implementations: the pipelined architecture [7] and the single butterfly architecture [5]. The circuit complexities of these designs are compiled in Table 1. The pipelined architecture in [7] might be the preferred choice for high-speed applications, but it is not suitable for the DAB system. The memory bandwidth problem of [5] is solved by introducing a more complicated structure (the radix-4 butterfly unit) and utilizing more memory resources. Note that the operation frequency of [5] is 12.288 MHz. By taking advantage of efficient memory partitioning and employing the pipelined butterfly unit, our design reduces the required area complexity and still fits the DAB specifications. For DAB applications, it is clear that our design outperforms Delaruelle's work.

5. CONCLUSION

To date, much effort has been devoted to the development of low-cost DAB products. Among the key techniques needed to build a DAB receiver, the FFT is one of the key components and is very suitable for ASIC implementation. This paper explores efficient solutions for hardware implementations of the FFT processor such that they fit the specification of the Eureka-147 standard under limited hardware resources. All the functional blocks are designed, simulated, and verified using the Synopsys and Cadence software, and the final layout is ready for VLSI fabrication based on the 0.35 µm TSMC process and the Compass cell library. Results show that our implementation has the potential to consume less silicon area and to facilitate extension to high-transmission-rate requirements.

REFERENCES

[1] ETS 300 401, "Radio broadcasting system: Digital audio broadcasting (DAB) to mobile, portable and fixed receivers," ETSI, 2nd edition, May 1997.
[2] J. A. Huisken, F. V. Laar, A. Delaruelle, and N. J. L. Philips, "Specification, partitioning and design of a DAB channel decoder," in Proc. VLSI Signal Processing Workshop, pp. 21-29, 1993.
[3] M. Bolle, D. Clawin, K. Gieske, F. Hofmann, T. Mlasko, M. J. Ruf, and G. Spreitz, "The receiver engine chipset for digital audio broadcasting," in Proc. URSI Int. Symp. Signals, Systems, and Electronics, pp. 338-342, 1998.
[4] A. Delaruelle, J. Huisken, J. van Loon, and F. Welten, "A chip set for a digital audio broadcasting channel decoder," in Proc. IEEE Custom Integrated Circuits Conf., pp. 13.4.1-13.4.4, 1995.
[5] A. Delaruelle, J. Huisken, J. van Loon, and F. Welten, "A channel demodulator IC for digital audio broadcasting," in Proc. IEEE Custom Integrated Circuits Conf., pp. 47-50, 1994.
[6] S. He and M. Torkelson, "Design and implementation of a 1024-point pipeline FFT processor," in Proc. IEEE Custom Integrated Circuits Conf., pp. 131-134, 1998.
[7] E. Bidet, D. Castelain, C. Joanblanq, and P. Senn, "A fast single-chip implementation of 8192 complex point FFT," IEEE J. Solid-State Circuits, vol. 30, no. 3, pp. 300-305, March 1995.
[8] E. Cetin, Richard C. S. Morling, and I. Kale, "An extensible complex fast Fourier transform processor chip for real-time spectrum analysis and measurement," IEEE Trans. Instrumentation and Measurement, vol. 47, no. 1, pp. 95-99, Feb. 1998.
[9] H. F. Lo, M. D. Shieh, and C. M. Wu, "Design of an efficient FFT processor for DAB system," in Proc. IEEE Int. Symp. Circuits and Systems, pp. 654-657, 2001.
[10] E. O. Brigham, The Fast Fourier Transform and Its Applications, Prentice-Hall Inc., 1988.
[11] M. Biver, H. Kaeslin, and C. Tommasini, "In-place updating of path metrics in Viterbi decoders," IEEE J. Solid-State Circuits, vol. 24, pp. 1158-1159, Aug. 1989.

Table 1. Comparisons of different implementations

                          E. Bidet [7]                 A. Delaruelle [5]     Proposed
No. of butterfly units    log_r N, radix-r             1, radix-4            1, radix-2
Arithmetic components     3*(log_r N - 1) CM (1)       CM                    1 CM
                          4*log_r N Adder (2)          4 Adder               1 Adder
                          4*log_r N Sub (3)            4 Sub                 Sub
                                                       4 Register
Gate counts of            8160*(log_r N - 1)           9156                  2954
arithmetic components       + 896*log_r N
Memory size               2048 (dual-port)             2x2048
No. of clock cycles                                    4xA1 (4)
  N = 2048                2458                         11264                 22528

Note: (1) CM: 8-bit complex-number multiplier, (2) Adder: 16-bit adder, (3) Sub: 16-bit subtractor, (4) A1 = (N/4) log_4 N, and (5) A2 =
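The clock-cycle entry for the proposed design in Table 1 and the timing argument of Section 3.1 can be cross-checked with a few lines of arithmetic. The Python sketch below rests on two assumptions that are readings of the text, not statements from it: without partitioning, the single-port RAM serves one access per clock cycle, so the cycle count equals the $2N \log_2 N$ accesses of Section 2; with two banks and 50% butterfly utilization, one butterfly completes every two cycles. The function and constant names are hypothetical.

```python
F_CLK_HZ = 24.576e6          # clock frequency stated in the paper
MODE_I_BUDGET_S = 1.25e-3    # time allowed for the 2048-point FFT (mode I)

def log2_int(n):
    """Exact log2 for a power of two."""
    return n.bit_length() - 1

def cycles_without_partition(n):
    """Assumed: one access per cycle, 2*N*log2(N) accesses in total."""
    return 2 * n * log2_int(n)

def cycles_two_banks(n):
    """Assumed: (N/2)*log2(N) butterflies, one completed every 2 cycles."""
    return 2 * (n // 2) * log2_int(n)

# The four DAB operational modes and their FFT sizes, as listed in Section 4.
for mode, n in (("I", 2048), ("II", 512), ("III", 256), ("IV", 1024)):
    cycles = cycles_two_banks(n)
    print(mode, n, cycles, round(cycles / F_CLK_HZ * 1e3, 4), "ms")

unpartitioned_s = cycles_without_partition(2048) / F_CLK_HZ
two_bank_s = cycles_two_banks(2048) / F_CLK_HZ
print(unpartitioned_s > MODE_I_BUDGET_S, two_bank_s < MODE_I_BUDGET_S)
```

Under these assumptions the two-bank design needs 22528 cycles for N = 2048, matching the Table 1 entry, and finishes in about 0.92 ms. The unpartitioned radix-2 design would need about 1.83 ms and miss the 1.25 ms budget.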

