and 3 involve slightly more EPS than the Design 1 but offer nearly twice and thrice the MUF
at a cost of 55.0% and 60.6% more area, respectively.
3. Efficient Integer DCT Architectures for HEVC
In this paper, we present area- and power-efficient architectures for the implementation of
integer discrete cosine transform (DCT) of different lengths to be used in High Efficiency
Video Coding (HEVC). We show that an efficient constant matrix-multiplication scheme can
be used to derive parallel architectures for 1-D integer DCT of different lengths. We also
show that the proposed structure could be reusable for DCT of lengths 4, 8, 16, and 32 with a
throughput of 32 DCT coefficients per cycle irrespective of the transform size. Moreover, the
proposed architecture could be pruned to reduce the complexity of implementation
substantially with only a marginal effect on the coding performance. We propose power-efficient structures for folded and full-parallel implementations of 2-D DCT. From the
synthesis result, it is found that the proposed architecture involves nearly 14% less area-delay
product (ADP) and 19% less energy per sample (EPS) compared to the direct implementation
of the reference algorithm, on average, for integer DCT of lengths 4, 8, 16, and 32. Also, an
additional 19% saving in ADP and 20% saving in EPS can be achieved by the proposed
pruning algorithm with nearly the same throughput rate. The proposed architecture is found
to support ultrahigh definition 7680 × 4320 at 60 frames/s video, which is one of the
applications of HEVC.
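
As a rough illustration of the constant matrix-multiplication view described in this abstract, the sketch below computes a 4-point 1-D integer DCT in two ways: as a direct matrix-vector product, and with an even/odd (butterfly) decomposition that needs fewer constant multiplications. The matrix is the standard HEVC 4-point core transform; the abstract itself does not list it. The function names are hypothetical, and the butterfly is a textbook decomposition, not the architecture proposed in the paper.

```python
# Standard HEVC 4-point core transform matrix (integer approximation of the DCT-II).
# Assumption: supplied here for illustration; the abstract does not list it.
HEVC_DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def dct4_direct(x):
    """Direct constant matrix-vector product: 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in HEVC_DCT4]


def dct4_butterfly(x):
    """Even/odd decomposition: 6 multiplications by constants."""
    a0, a1 = x[0] + x[3], x[1] + x[2]   # inputs to the even-indexed outputs
    b0, b1 = x[0] - x[3], x[1] - x[2]   # inputs to the odd-indexed outputs
    return [
        64 * (a0 + a1),
        83 * b0 + 36 * b1,
        64 * (a0 - a1),
        36 * b0 - 83 * b1,
    ]


if __name__ == "__main__":
    sample = [1, 2, 3, 4]
    print(dct4_direct(sample))
    print(dct4_butterfly(sample))
```

Both functions return the same unscaled coefficients; the scaling and rounding stages of a real HEVC transform are omitted.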
4. An Optimized Modified Booth Recoder for Efficient Design of the Add-Multiply
Operator
Complex arithmetic operations are widely used in Digital Signal Processing (DSP)
applications. In this work, we focus on optimizing the design of the fused Add-Multiply
(FAM) operator for increasing performance. We investigate techniques to implement the
direct recoding of the sum of two numbers in its Modified Booth (MB) form. We introduce a
structured and efficient recoding technique and explore three different schemes by
incorporating them in FAM designs. Comparing them with the FAM designs which use
existing recoding schemes, the proposed technique yields considerable reductions in terms of
critical delay, hardware complexity and power consumption of the FAM unit.
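
To make the Modified Booth form concrete, here is a minimal sketch of the conventional baseline: the sum A + B is computed first and then recoded into radix-4 MB digits in {-2, -1, 0, 1, 2}, which select the partial products of the multiplication. This is the approach the abstract aims to improve on, since it needs a carry-propagate addition before recoding; the direct recoding schemes of the paper are not reproduced here. All function names are hypothetical.

```python
def mb_recode(y, n_bits):
    """Radix-4 Modified Booth digits (least significant first) of a
    two's-complement integer y that fits in n_bits (n_bits must be even)."""
    if n_bits % 2 != 0:
        raise ValueError("n_bits must be even")
    if not -(1 << (n_bits - 1)) <= y < (1 << (n_bits - 1)):
        raise ValueError("y does not fit in n_bits")
    pattern = y & ((1 << n_bits) - 1)   # two's-complement bit pattern

    def bit(i):
        # Bit i of the pattern; the bit below position 0 is defined as 0.
        return (pattern >> i) & 1 if i >= 0 else 0

    # Digit j is -2*y[2j+1] + y[2j] + y[2j-1].
    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
            for j in range(n_bits // 2)]


def mb_value(digits):
    """Value represented by a list of radix-4 MB digits."""
    return sum(d * 4 ** j for j, d in enumerate(digits))


def add_multiply(a, b, x, n_bits):
    """Compute x * (a + b) by summing MB-selected partial products."""
    digits = mb_recode(a + b, n_bits)   # baseline: add first, then recode
    return sum(d * x * 4 ** j for j, d in enumerate(digits))


if __name__ == "__main__":
    print(mb_recode(7, 8))
    print(add_multiply(5, 6, 7, 8))
```

Each digit only requires a shift and an optional negation of x, which is why the MB form halves the number of partial products compared with bit-by-bit multiplication.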
5. Improved design of high-frequency sequential decimal multipliers
Hardware implementation of decimal arithmetic operations has become a hot topic for
research during the last decade. Among various operations, decimal multiplication is
considered as one of the most complicated dyadic operations, which requires high-cost
hardware implementation. Therefore, the processor industry has opted to use the sequential
decimal multipliers to reduce the high cost of parallel architectures. However, the main
drawback of iterative multipliers is their high latency. In this reported work, the focus has
been on reducing the latency of decimal sequential multipliers while maintaining a low cost
of area. Consequently, a high-frequency sequential decimal multiplier is proposed whose
cycle time is reduced to the latency of a binary half-adder plus that of a decimal multiply-by-two operation, which overall is less than that of a decimal carry-save adder. The synthesis
results reveal that the proposed sequential multiplier works with a higher clock frequency
than the fastest previous decimal multiplier, which in turn leads to an overall latency advantage.
6. On-Chip Codeword Generation to Cope With Crosstalk
Capacitive and inductive coupling between bus lines results in crosstalk-induced delays.
Many bus encoding techniques have been proposed to improve the performance. Existing
implementation techniques and mapping algorithms in the literature only apply the specific
encoding. This paper presents the first generalized framework for a stall-free on-chip
codeword generation strategy that is scalable and easy to automate. It is applicable to the
coupling-aware encoding techniques that allow recursive codeword generation. The proposed
implementation strategy iteratively generates codewords without explicitly enumerating
them. Codeword mapping relies on a graph-based representation that is unique to the given
encoding technique. The codewords are calculated on-chip using basic function blocks, such
as adders and multiplexers. Three encoding techniques were implemented using the proposed
strategy. Experimental results show significant reduction in the area overhead and power
dissipation over the existing method that uses random logic to implement the codec.
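
The idea of producing a codeword from its index without enumerating the codebook can be illustrated with a Fibonacci-weighted numeral system, which needs only comparisons and subtractions per bit. The sketch below maps an index to an n-bit word that contains no two adjacent 1s. This constraint is a stand-in chosen for illustration; the abstract does not name the three encodings that were implemented, and the function names are hypothetical.

```python
def fib_weights(n):
    """Weights 1, 2, 3, 5, 8, ... for bit positions 0..n-1."""
    w = [1, 2]
    while len(w) < n:
        w.append(w[-1] + w[-2])
    return w[:n]


def encode(index, n):
    """Map index to an n-bit word (bits[0] is the LSB) with no adjacent 1s."""
    w = fib_weights(n + 1)
    capacity = w[n]                      # number of valid n-bit codewords
    if not 0 <= index < capacity:
        raise ValueError("index out of range for this bus width")
    bits = [0] * n
    for i in range(n - 1, -1, -1):       # greedy, most significant bit first
        if index >= w[i]:
            bits[i] = 1
            index -= w[i]
    return bits


def decode(bits):
    """Recover the index by adding the weights of the set bits."""
    w = fib_weights(len(bits))
    return sum(wi for wi, b in zip(w, bits) if b)


if __name__ == "__main__":
    for k in range(8):
        print(k, encode(k, 4))
```

In hardware terms, each loop iteration corresponds to one comparator, one subtractor and one multiplexer stage, so the codebook never has to be stored or listed.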
7. Effects of Random Delay Errors in Continuous-Time Semi-Digital Transversal
Filters
The implementation of transversal filters requires basic circuit elements such as adders,
multipliers and (unit) delay elements. The filters designed under infinite precision of these
elements may behave differently when implemented with components with limited accuracy.
In fact, the effects of the coefficient inaccuracies in analog and digital transversal filters have
been investigated extensively in the literature [1], [2]. On the other hand, the effects of the
unit delays with limited precision have not received similar attention. In this paper, we find
that such effects especially in very high frequency continuous-time semi-digital transversal
filters may not be ignored. As an example, we analyze the impact of delay errors in the
of conduction losses and cost. Finally, the prototype circuit with 40-V input voltage, 380-V
output, and 1000-W output power is operated to verify its performance. The highest
efficiency is 97.1%.
13. Low-Cost Low-Power ASIC Solution for Both DAB+ and DAB Audio Decoding
DAB+ is the upgraded version of digital audio broadcasting (DAB). DAB and DAB+ coexist
in many countries, so receivers are required to be compatible with both standards. In this
paper, a solution integrating an MPEG1-LayerII (MP2) decoder and an advanced audio
coding (AAC) low-complexity (AAC LC) decoder is proposed to provide basic audio
decoding for both DAB and DAB+. It also utilizes simple methods to improve high
frequencies and stereo quality instead of complicated spectrum band replication and
parametric stereo. A highly integrated low-power audio decoder design compatible with
DAB/DAB+ and using a purely ASIC approach is presented. As a result of the system
structure optimization and hardware sharing, the audio decoder is fabricated in 1P4M 0.18-µm CMOS technology using only 3.2 mm² silicon area (including 147 456 bits RAM and 170
496 bits ROM). The power consumption of the audio decoder is 10.4 mW for DAB audio
decoding and 8.5 mW for DAB+ audio decoding. Laboratory and field tests show that the
function is correct and the audio quality is good for receiving both DAB and DAB+. The
audio decoder is thus proven to be a low-cost low-power solution for the two existing DAB
standards.
14. Low-Power Digital Signal Processor Architecture for Wireless Sensor Nodes
Radio communication exhibits the highest energy consumption in wireless sensor nodes.
Given their limited energy supply from batteries or scavenging, these nodes must trade data
communication for on-the-node computation. Currently, they are designed around off-the-shelf low-power microcontrollers. But by employing a more appropriate processing element,
the energy consumption can be significantly reduced. This paper describes the design and
implementation of the newly proposed folded-tree architecture for on-the-node data
processing in wireless sensor networks, using parallel prefix operations and data locality in
hardware. Measurements of the silicon implementation show an improvement of 10-20× in
terms of energy as compared to traditional modern micro-controllers found in sensor nodes.
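
For readers unfamiliar with parallel prefix operations, the following sketch shows a tree-structured exclusive scan with an up-sweep phase and a down-sweep phase. All updates within one level are independent of each other, which is the property a tree of processing elements can exploit. It is a generic software model, assumed here for illustration; it does not model the folding of the tree onto fewer processing elements, and the function name is hypothetical.

```python
from operator import add


def tree_scan(values, op=add, identity=0):
    """Exclusive prefix scan over a power-of-two-length list using a
    binary tree: an up-sweep followed by a down-sweep."""
    n = len(values)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    t = list(values)

    # Up-sweep: combine pairs level by level toward the root.
    step = 1
    while step < n:
        for i in range(2 * step - 1, n, 2 * step):
            t[i] = op(t[i - step], t[i])
        step *= 2

    # Down-sweep: push prefixes from the root back toward the leaves.
    t[n - 1] = identity
    step = n // 2
    while step >= 1:
        for i in range(2 * step - 1, n, 2 * step):
            left_sum = t[i - step]
            t[i - step] = t[i]            # left child inherits the parent's prefix
            t[i] = op(t[i], left_sum)     # right child adds the left subtree
        step //= 2
    return t


if __name__ == "__main__":
    print(tree_scan([1, 2, 3, 4, 5, 6, 7, 8]))
```

Any associative operator can be passed as op, for example max for a running maximum over sensor readings.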
among the existing SQRT-CSLA designs, on average, for different bit-widths. The
application-specific integrated circuit (ASIC) synthesis result shows that the BEC-based
SQRT-CSLA design involves 48% more ADP and consumes 50% more energy than the
proposed SQRT-CSLA, on average, for different bit-widths.
19. Improved matrix multiplier design for high-speed digital signal processing
applications
A transistor level implementation of an improved matrix multiplier for high-speed digital
signal processing applications based on matrix element transformation and multiplication is
reported in this study. The improvement in speed was achieved by rearranging the matrix
element into a two-dimensional array of processing elements interconnected as a mesh. The
edges of each row and column were interconnected in torus structure, facilitating
simultaneous implementation of several multiplications. The functionality of the circuitry
was verified and the performance parameters, for example, propagation delay and dynamic
switching power consumption, were calculated using SPICE Spectre with 90 nm CMOS
technology. The proposed methodology ensures substantial reduction in propagation delay
compared with the conventional algorithm, systolic array and pseudo number theoretic
transformation (PNTT)-based implementation, which are the most commonly used
techniques, for matrix multiplication. The propagation delay of the implemented 4 × 4
matrix multiplier was only ~2 s, whereas the power consumption of the implemented 4 × 4
matrix multiplier was ~3.12 mW only. Improvement in speed compared with earlier reported
matrix multipliers, for example, conventional algorithm, systolic array and PNTT-based
implementation was found to be ~67, ~56 and ~65%, respectively.
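
One well-known algorithm that fits a mesh of processing elements with torus (wrap-around) links is Cannon's algorithm. The sketch below simulates it in software: each cell (i, j) holds one element of A and one of B, multiplies them, and then A shifts left and B shifts up with wrap-around. Whether the reported design uses this exact data movement is an assumption; the abstract only states that the elements are arranged in a torus-connected mesh. The function name is hypothetical.

```python
def cannon_matmul(a, b):
    """Multiply two n x n matrices by simulating an n x n torus of
    processing elements (Cannon's algorithm)."""
    n = len(a)
    # Initial skew: row i of A is rotated left by i, column j of B up by j.
    a_loc = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_loc = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]

    for _ in range(n):
        # Every cell multiplies its local pair; in hardware these n*n
        # multiplications happen simultaneously.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Torus shifts: A moves one step left, B one step up, with wrap-around.
        a_loc = [[a_loc[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b_loc = [[b_loc[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


if __name__ == "__main__":
    print(cannon_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

After n multiply-and-shift steps every cell has accumulated its full dot product, so an n × n product takes n steps on n² cells instead of n³ sequential multiplications.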
20. A Novel Distortion Model and Lagrangian Multiplier for Depth Maps Coding
In three-dimensional video (3-DV) coding systems, depth maps are not used for viewing but
for rendering virtual views. Therefore, the traditional rate distortion criterion (including
the distortion criterion and Lagrangian multiplier) is not suitable for depth map coding. In order
to design an effective rate distortion criterion for depth maps, the relationship between the
distortion of synthesized virtual view and the coding error of depth maps is analyzed in detail.
Through the analysis, a polynomial model revealing the relationship between the coding error
of depth maps and the distortion of synthesized virtual view is derived. Model parameters are
estimated by utilizing camera parameters and features of the texture video corresponding to
the depth map. Based on the model, a virtual view-based Lagrangian multiplier for depth map
coding is also proposed. Experimental results demonstrated the accuracy of the model. The
squared correlation coefficients between the actual distortion of virtual view and the
estimated distortion are all larger than 0.98 for all tested sequences. When incorporating the
proposed model and Lagrangian multiplier into the mode decision procedure of joint model
version 18.5 (JM18.5) of H.264/AVC, a maximum BD-PSNR gain of 0.470 dB and an average gain of 0.251
dB can be achieved.
to/from residue representation, along with the proposed residue Montgomery multiplication
algorithm, reveals common multiply-accumulate data paths both between the converters and
between the two residue representations. A versatile architecture is derived that supports all
operations of Montgomery multiplication in GF(p) and GF(2^n), input/output conversions,
Mixed Radix Conversion (MRC) for integers and polynomials, dual-field modular
exponentiation and inversion in the same hardware. Detailed comparisons with state-of-the-art implementations prove the potential of residue arithmetic exploitation in dual-field
modular multiplication.
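
As background for the Montgomery multiplication mentioned above, here is a minimal sketch of the classic integer version in GF(p), which replaces division by p with shifts and masks by a power of two R. It is not the residue (RNS) variant proposed in the paper and does not cover GF(2^n); the function names are hypothetical, and the modular inverse via three-argument pow assumes Python 3.8 or later.

```python
def montgomery_setup(p, k):
    """For odd p and R = 2**k > p, return (R, p_neg_inv) where
    p * p_neg_inv is congruent to -1 modulo R."""
    if p % 2 == 0 or (1 << k) <= p:
        raise ValueError("p must be odd and smaller than 2**k")
    r = 1 << k
    p_neg_inv = (-pow(p, -1, r)) % r
    return r, p_neg_inv


def mont_mul(a_bar, b_bar, p, k, p_neg_inv):
    """Return a_bar * b_bar * R^-1 mod p for operands below p."""
    mask = (1 << k) - 1
    t = a_bar * b_bar
    m = ((t & mask) * p_neg_inv) & mask   # makes t + m*p divisible by R
    u = (t + m * p) >> k                  # exact division by R
    return u - p if u >= p else u         # single conditional subtraction


def mod_mul(a, b, p):
    """Compute a * b mod p through the Montgomery domain."""
    k = p.bit_length()
    r, p_neg_inv = montgomery_setup(p, k)
    a_bar = (a * r) % p                   # convert operands into the domain
    b_bar = (b * r) % p
    c_bar = mont_mul(a_bar, b_bar, p, k, p_neg_inv)
    return mont_mul(c_bar, 1, p, k, p_neg_inv)   # convert the result back


if __name__ == "__main__":
    print(mod_mul(7, 11, 13))
```

The conversions into and out of the Montgomery domain only pay off when many multiplications are chained, as in modular exponentiation.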