Low Power VLSI Papers

LOW POWER VLSI
1. Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low Adaptation-Delay

In this paper, we present an efficient architecture for the implementation of a delayed least
mean square adaptive filter. For achieving lower adaptation-delay and area-delay-power
efficient implementation, we use a novel partial product generator and propose a strategy for
optimized balanced pipelining across the time-consuming combinational blocks of the
structure. From synthesis results, we find that the proposed design offers nearly 17% less
area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the best of
the existing systolic structures, on average, for filter lengths N=8, 16, and 32. We propose an
efficient fixed-point implementation scheme of the proposed architecture, and derive the
expression for steady-state error. We show that the steady-state mean squared error obtained
from the analytical result matches with the simulation result. Moreover, we have proposed a
bit-level pruning of the proposed architecture, which provides nearly 20% saving in ADP and
9% saving in EDP over the proposed structure before pruning without noticeable degradation
of steady-state-error performance.
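
As an illustration of the delayed LMS update that the abstract above refers to, the following is a minimal floating-point sketch. It is not the proposed architecture: the function names, the step size mu, the test system h, and the delay value are all assumptions chosen for illustration. The only point shown is that the weights at time n are updated with the error and input vector from `delay` samples earlier.

```python
import random
from collections import deque


def delayed_lms(x, d, num_taps, mu, delay):
    """Delayed LMS: w(n+1) = w(n) + mu * e(n - delay) * x(n - delay).

    delay = 0 reduces to the ordinary LMS algorithm.
    Returns the final weights and the error sequence.
    """
    w = [0.0] * num_taps
    buf = [0.0] * num_taps      # tapped delay line, newest sample first
    pending = deque()           # (input vector, error) pairs waiting to be applied
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))
        e = dn - y
        errors.append(e)
        pending.append((buf, e))
        if len(pending) > delay:
            old_buf, old_e = pending.popleft()
            w = [wi + mu * old_e * xi for wi, xi in zip(w, old_buf)]
    return w, errors


def identify_unknown_system(seed=1, delay=2):
    """Hypothetical demo: identify a 4-tap FIR system from noise-free data."""
    rng = random.Random(seed)
    h = [0.5, -0.3, 0.2, 0.1]   # assumed "unknown" system
    x = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
    d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
         for n in range(len(x))]
    w, errors = delayed_lms(x, d, num_taps=len(h), mu=0.05, delay=delay)
    return w, h, errors
```

With a small step size the delayed update still converges to the same weights as ordinary LMS in this noise-free setting; a larger adaptation delay generally tightens the range of step sizes for which the filter remains stable.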
2. Critical-Path Analysis and Low-Complexity Implementation of the LMS
Adaptive Algorithm
This paper presents a precise analysis of the critical path of the least-mean-square (LMS)
adaptive filter for deriving its architectures for high-speed and low-complexity
implementation. It is shown that the direct-form LMS adaptive filter has nearly the same
critical path as its transpose-form counterpart, but provides much faster convergence and
lower register complexity. From the critical-path evaluation, it is further shown that no
pipelining is required for implementing a direct-form LMS adaptive filter for most practical
cases, and can be realized with a very small adaptation delay in cases where a very high
sampling rate is required. Based on these findings, this paper proposes three structures of the
LMS adaptive filter: (i) Design 1 having no adaptation delays, (ii) Design 2 with only one
adaptation delay, and (iii) Design 3 with two adaptation delays. Design 1 involves the
minimum area and the minimum energy per sample (EPS). The best of existing direct-form
structures requires 80.4% more area and 41.9% more EPS compared to Design 1. Designs 2
and 3 involve slightly more EPS than Design 1 but offer nearly twice and thrice the maximum
usable frequency (MUF) at a cost of 55.0% and 60.6% more area, respectively.
3. Efficient Integer DCT Architectures for HEVC
In this paper, we present area- and power-efficient architectures for the implementation of
integer discrete cosine transform (DCT) of different lengths to be used in High Efficiency
Video Coding (HEVC). We show that an efficient constant matrix-multiplication scheme can
be used to derive parallel architectures for 1-D integer DCT of different lengths. We also
show that the proposed structure could be reusable for DCT of lengths 4, 8, 16, and 32 with a
throughput of 32 DCT coefficients per cycle irrespective of the transform size. Moreover, the
proposed architecture could be pruned to reduce the complexity of implementation
substantially with only a marginal effect on the coding performance. We propose
power-efficient structures for folded and full-parallel implementations of 2-D DCT. From the
synthesis result, it is found that the proposed architecture involves nearly 14% less area-delay
product (ADP) and 19% less energy per sample (EPS) compared to the direct implementation
of the reference algorithm, on average, for integer DCT of lengths 4, 8, 16, and 32. Also, an
additional 19% saving in ADP and 20% saving in EPS can be achieved by the proposed
pruning algorithm with nearly the same throughput rate. The proposed architecture is found
to support ultrahigh-definition 7680 × 4320 video at 60 frames/s, which is one of the
applications of HEVC.
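
To make the idea of constant matrix multiplication for the integer DCT concrete, here is a small sketch of the 4-point case. The coefficient matrix is the 4-point core transform matrix commonly quoted for HEVC; the even-odd butterfly is the standard textbook decomposition and is not claimed to be the architecture proposed in the paper. Scaling and rounding stages are omitted, and the function names are illustrative.

```python
# 4-point integer DCT coefficient matrix commonly quoted for HEVC (assumed here).
DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def dct4_matrix(x):
    """Direct constant matrix-vector product: 16 multiplications."""
    return [sum(c * xi for c, xi in zip(row, x)) for row in DCT4]


def dct4_butterfly(x):
    """Even-odd decomposition: one add/subtract stage, then 6 multiplications."""
    e0, e1 = x[0] + x[3], x[1] + x[2]   # even part
    o0, o1 = x[0] - x[3], x[1] - x[2]   # odd part
    return [
        64 * (e0 + e1),
        83 * o0 + 36 * o1,
        64 * (e0 - e1),
        36 * o0 - 83 * o1,
    ]
```

Because every multiplier operand is a fixed constant, each product can in hardware be replaced by a few shifts and additions, which is what makes such structures attractive for area and power.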
4. An Optimized Modified Booth Recoder for Efficient Design of the Add-Multiply
Operator
Complex arithmetic operations are widely used in Digital Signal Processing (DSP)
applications. In this work, we focus on optimizing the design of the fused Add-Multiply
(FAM) operator for increasing performance. We investigate techniques to implement the
direct recoding of the sum of two numbers in its Modified Booth (MB) form. We introduce a
structured and efficient recoding technique and explore three different schemes by
incorporating them in FAM designs. Comparing them with the FAM designs which use
existing recoding schemes, the proposed technique yields considerable reductions in terms of
critical delay, hardware complexity and power consumption of the FAM unit.
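
For readers unfamiliar with Modified Booth recoding, the sketch below shows the conventional baseline that a fused Add-Multiply unit improves on: the sum A+B is formed first and then recoded into radix-4 MB digits in {-2, ..., 2}, each of which selects one partial product of X. The direct recoding of the sum without a carry-propagate adder, which is the paper's contribution, is not reproduced here. All function names and bit-widths are assumptions.

```python
def modified_booth_digits(y, n_bits):
    """Radix-4 Modified Booth digits (least significant first) of the
    n_bits-wide two's complement integer y. n_bits must be even."""
    if n_bits % 2:
        raise ValueError("n_bits must be even")
    if not -(1 << (n_bits - 1)) <= y < (1 << (n_bits - 1)):
        raise ValueError("y does not fit in n_bits")
    u = y & ((1 << n_bits) - 1)                 # two's complement bit pattern
    bits = [(u >> i) & 1 for i in range(n_bits)]
    digits = []
    prev = 0                                    # y_{-1} = 0
    for j in range(n_bits // 2):
        b0, b1 = bits[2 * j], bits[2 * j + 1]
        digits.append(-2 * b1 + b0 + prev)      # d_j = -2*y_{2j+1} + y_{2j} + y_{2j-1}
        prev = b1
    return digits


def fused_add_multiply(a, b, x, n_bits):
    """Compute x * (a + b) from the MB digits of (a + b).
    a and b are n_bits-wide two's complement values."""
    width = n_bits + 1                          # the sum needs one extra bit
    width += width % 2                          # round up to an even width
    digits = modified_booth_digits(a + b, width)
    # Each digit selects 0, +/-x or +/-2x, shifted by two bit positions per digit.
    return sum((dg * x) << (2 * j) for j, dg in enumerate(digits))
```

Recoding halves the number of partial products compared with a bit-by-bit multiplier, since an n-bit operand yields n/2 digits.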
5. Improved design of high-frequency sequential decimal multipliers
Hardware implementation of decimal arithmetic operations has become a hot topic for
research during the last decade. Among various operations, decimal multiplication is
considered as one of the most complicated dyadic operations, which requires high-cost
hardware implementation. Therefore, the processor industry has opted to use the sequential
decimal multipliers to reduce the high cost of parallel architectures. However, the main
drawback of iterative multipliers is their high latency. In this reported work, the focus has
been on reducing the latency of decimal sequential multipliers while maintaining a low cost
of area. Consequently, a high-frequency sequential decimal multiplier is proposed whose
cycle time is reduced to the latency of a binary half-adder plus that of a decimal
multiply-by-two operation, which overall is less than that of a decimal carry-save adder. The synthesis
results reveal that the proposed sequential multiplier works with a higher clock frequency
than the fastest previous decimal multiplier which in turn leads to overall latency advantage.
6. On-Chip Codeword Generation to Cope With Crosstalk
Capacitive and inductive coupling between bus lines results in crosstalk induced delays.
Many bus encoding techniques have been proposed to improve the performance. Existing
implementation techniques and mapping algorithms in the literature only apply the specific
encoding. This paper presents the first generalized framework for a stall-free on-chip
codeword generation strategy that is scalable and easy to automate. It is applicable to the
coupling aware encoding techniques that allow recursive codeword generation. The proposed
implementation strategy iteratively generates codewords without explicitly enumerating
them. Codeword mapping relies on graph-based representation that is unique to the given
encoding technique. The codewords are calculated on-chip using basic function blocks, such
as adders and multiplexers. Three encoding techniques were implemented using the proposed
strategy. Experimental results show significant reduction in the area overhead and power
dissipation over the existing method that uses random logic to implement the codec.
7. Effects of Random Delay Errors in Continuous-Time Semi-Digital Transversal
Filters
The implementation of transversal filters requires basic circuit elements such as adders,
multipliers and (unit) delay elements. The filters designed under infinite precision of these
elements may behave differently when implemented with components with limited accuracy.
In fact, the effects of the coefficient inaccuracies in analog and digital transversal filters have
been investigated extensively in the literature [1], [2]. On the other hand, the effects of the
unit delays with limited precision have not received similar attention. In this paper, we find
that such effects especially in very high frequency continuous-time semi-digital transversal
filters may not be ignored. As an example, we analyze the impact of delay errors in the
implementation of the direct modulation transmitter. Specifically, we provide the analytical
statistical performance bounds and confirm the results with simulations.
8. Digitally Synthesized Stochastic Flash ADC Using Only Standard Digital Cells
It is demonstrated in this paper that it is possible to synthesize a stochastic flash ADC entirely
from Verilog code and a standard digital library. An analog comparator is introduced that is
constructed from two cross-coupled 3-input digital NAND gates, and can be described in
Verilog. The synthesized comparators have random, Gaussian offsets that are used as virtual
voltage references to make a flash ADC. A piecewise-linear inverse Gaussian CDF function
is used to correct the nonlinearity introduced by the Gaussian offset distribution. The
prototype IC is fabricated in 90 nm CMOS and implements a 2047-comparator version of the
proposed architecture. All components, including the comparators, the ones adder, and the
piecewise inverse Gaussian function, are implemented in Verilog. Conventional digital
synthesis and place-and-route are then used to generate the physical layout, making this the
first fully synthesized ADC. SNDR of 35.9 dB (without calibration) is achieved at 210 MSPS
from the Verilog synthesized design.
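
The principle described above can be modelled behaviourally in a few lines. This is a sketch under stated assumptions, not the fabricated design: the offset standard deviation sigma, the seed, and all function names are hypothetical, and the exact inverse Gaussian CDF from Python's `statistics.NormalDist` stands in for the piecewise-linear approximation used on chip.

```python
import random
from statistics import NormalDist


def make_comparator_offsets(n, sigma, seed=0):
    """Model n comparators whose input-referred offsets are Gaussian (assumed sigma)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


def ones_count(vin, offsets):
    """Ones adder: number of comparators whose offset lies below the input."""
    return sum(1 for off in offsets if vin > off)


def linearize(count, n, sigma):
    """Undo the Gaussian CDF nonlinearity: count/n approximates Phi(vin/sigma)."""
    p = (count + 0.5) / (n + 1)     # keep p strictly inside (0, 1)
    return sigma * NormalDist().inv_cdf(p)


def convert(vin, offsets, sigma):
    """Full conversion: comparator bank, ones adder, inverse-CDF correction."""
    return linearize(ones_count(vin, offsets), len(offsets), sigma)
```

For example, with `offsets = make_comparator_offsets(2047, 0.1)`, calling `convert(0.05, offsets, 0.1)` returns an estimate close to the input; the usable range is limited to a few sigma around zero, where enough comparators change state.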
9. Memory Footprint Reduction for Power-Efficient Realization of 2-D Finite
Impulse Response Filters
We have analyzed memory footprint and combinational complexity to arrive at a systematic
design strategy to derive area-delay-power-efficient architectures for two-dimensional (2-D)
finite impulse response (FIR) filter. We have presented novel block-based structures for
separable and non-separable filters with less memory footprint by memory sharing and
memory-reuse along with appropriate scheduling of computations and design of storage
architecture. The proposed structures involve L times less storage per output (SPO), and
nearly L times less energy consumption per output (EPO) compared with the existing
structures, where L is the input block-size. They involve L times more arithmetic resources
than the best of the corresponding existing structures, and produce L times more throughput
with less memory band-width (MBW) than others. We have also proposed separate generic
structures for separable and non-separable filter-banks, and a unified structure of filter-bank
constituting symmetric and general filters. The proposed unified structure for 6 parallel filters
involves nearly 3.6L times more multipliers, 3L times more adders, and (N^2-N+2) fewer registers
than a similar existing unified structure, and computes 6L times more filter outputs per cycle
with 6L times less MBW than the existing design, where N is FIR filter size in each
dimension. ASIC synthesis results show that for filter size (4 × 4), input-block size L=4, and
image size (512 × 512), the proposed block-based non-separable and generic non-separable
structures, respectively, involve 5.95 times and 11.25 times less area-delay product (ADP),
and 5.81 times and 15.63 times less EPO than the corresponding existing structures. The
proposed unified structure involves 4.64 times less ADP and 9.78 times less EPO than the
corresponding existing structure.
10. Improved matrix multiplier design for high-speed digital signal processing
applications
A transistor level implementation of an improved matrix multiplier for high-speed digital
signal processing applications based on matrix element transformation and multiplication is
reported in this study. The improvement in speed was achieved by rearranging the matrix
element into a two-dimensional array of processing elements interconnected as a mesh. The
edges of each row and column were interconnected in torus structure, facilitating
simultaneous implementation of several multiplications. The functionality of the circuitry
was verified and the performance parameters, for example, propagation delay and dynamic
switching power consumption, were calculated using SPICE (Spectre) simulation in 90 nm CMOS
technology. The proposed methodology ensures substantial reduction in propagation delay
compared with the conventional algorithm, systolic array and pseudo number theoretic
transformation (PNTT)-based implementation, which are the most commonly used
techniques for matrix multiplication. The propagation delay of the implemented 4 × 4
matrix multiplier was only ~2 μs, whereas the power consumption of the implemented 4 × 4
matrix multiplier was ~3.12 mW only. Improvement in speed compared with earlier reported
matrix multipliers, for example, conventional algorithm, systolic array and PNTT-based
implementation was found to be ~67, ~56 and ~65%, respectively.
11. High Step-Up High-Efficiency Interleaved Converter With Voltage Multiplier
Module for Renewable Energy System
A novel high step-up converter, which is suitable for renewable energy systems, is proposed in
this paper. Through a voltage multiplier module composed of switched capacitors and
coupled inductors, a conventional interleaved boost converter obtains high step-up gain
without operating at extreme duty ratio. The configuration of the proposed converter not only
reduces the current stress but also constrains the input current ripple, which decreases the
conduction losses and lengthens the lifetime of the input source. In addition, due to the
lossless passive clamp performance, leakage energy is recycled to the output terminal. Hence,
large voltage spikes across the main switches are alleviated, and the efficiency is improved.
Moreover, the low voltage stress allows low-voltage-rated MOSFETs to be adopted, reducing
conduction losses and cost. Finally, the prototype circuit with 40-V input voltage, 380-V
output, and 1000-W output power is operated to verify its performance. The highest
efficiency is 97.1%.
12. Ultra-High Throughput Low-Power Packet Classification

Packet classification is used by networking equipment to sort packets into flows by
comparing their headers to a list of rules, with packets placed in the flow determined by the
matched rule. A flow is used to decide a packet's priority and the manner in which it is
processed. Packet classification is a difficult task due to the fact that all packets must be
processed at wire speed and rulesets can contain tens of thousands of rules. The contribution
of this paper is a hardware accelerator that can classify up to 433 million packets per second
when using rulesets containing tens of thousands of rules with a peak power consumption of
only 9.03 W when using a Stratix III field-programmable gate array (FPGA). The hardware
accelerator uses a modified version of the HyperCuts packet classification algorithm, with a
new pre-cutting process used to reduce the amount of memory needed to save the search
structure for large rulesets so that it is small enough to fit in the on-chip memory of an FPGA.
The modified algorithm also removes the need for floating point division to be performed
when classifying a packet, allowing higher clock speeds and thus obtaining higher
throughputs.
13. Low-Cost Low-Power ASIC Solution for Both DAB+ and DAB Audio Decoding
DAB+ is the upgraded version of digital audio broadcasting (DAB). DAB and DAB+ coexist
in many countries, so receivers are required to be compatible with both standards. In this
paper, a solution integrating an MPEG1-LayerII (MP2) decoder and an advanced audio
coding (AAC) low-complexity (AAC LC) decoder is proposed to provide basic audio
decoding for both DAB and DAB+. It also utilizes simple methods to improve high
frequencies and stereo quality instead of complicated spectrum band replication and
parametric stereo. A highly integrated low-power audio decoder design compatible with
DAB/DAB+ and using a purely ASIC approach is presented. As a result of the system
structure optimization and hardware sharing, the audio decoder is fabricated in 1P4M 0.18-μm
CMOS technology using only 3.2 mm^2 of silicon area (including 147,456 bits of RAM and
170,496 bits of ROM). The power consumption of the audio decoder is 10.4 mW for DAB audio
decoding and 8.5 mW for DAB+ audio decoding. Laboratory and field tests show that the
function is correct and the audio quality is good for receiving both DAB and DAB+. The
audio decoder is thus proven to be a low-cost low-power solution for the two existing DAB
standards.
14. Low-Power Digital Signal Processor Architecture for Wireless Sensor Nodes
Radio communication exhibits the highest energy consumption in wireless sensor nodes.
Given their limited energy supply from batteries or scavenging, these nodes must trade data
communication for on-the-node computation. Currently, they are designed around
off-the-shelf low-power microcontrollers. But by employing a more appropriate processing element,
the energy consumption can be significantly reduced. This paper describes the design and
implementation of the newly proposed folded-tree architecture for on-the-node data
processing in wireless sensor networks, using parallel prefix operations and data locality in
hardware. Measurements of the silicon implementation show an improvement of 10-20× in
terms of energy as compared to traditional modern micro-controllers found in sensor nodes.
15. Area-Delay-Power Efficient Carry-Select Adder

In this brief, the logic operations involved in conventional carry select adder (CSLA) and
binary to excess-1 converter (BEC)-based CSLA are analyzed to study the data dependence
and to identify redundant logic operations. We have eliminated all the redundant logic
operations present in the conventional CSLA and proposed a new logic formulation for
CSLA. In the proposed scheme, the carry select (CS) operation is scheduled before the
calculation of final-sum, which is different from the conventional approach. Bit patterns of
two anticipating carry words (corresponding to $c_{\rm in} = 0$ and $1$) and
fixed $c_{\rm in}$ bits are used for logic optimization of CS and generation units. An
efficient CSLA design is obtained using optimized logic units. The proposed CSLA design
involves significantly less area and delay than the recently proposed BEC-based CSLA. Due
to the small carry-output delay, the proposed CSLA design is a good candidate for
square-root (SQRT) CSLA. A theoretical estimate shows that the proposed SQRT-CSLA involves
nearly 35% less area-delay product (ADP) than the BEC-based SQRT-CSLA, which is best
among the existing SQRT-CSLA designs, on average, for different bit-widths. The
application-specific integrated circuit (ASIC) synthesis result shows that the BEC-based
SQRT-CSLA design involves 48% more ADP and consumes 50% more energy than the
proposed SQRT-CSLA, on average, for different bit-widths.
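
As background for the abstract above, the following bit-level sketch models the conventional carry-select adder: each group computes its sum twice, once for carry-in 0 and once for carry-in 1, and the real carry then selects one result. It does not implement the optimized logic formulation proposed in the brief, where carry selection is scheduled before the final-sum calculation. The function names and the group sizes in the usage note are illustrative assumptions.

```python
def ripple_add(a_bits, b_bits, cin):
    """Ripple-carry addition of two equal-length bit lists (LSB first)."""
    sums, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sums, carry


def carry_select_add(a, b, width, group_sizes):
    """Conventional CSLA: returns (sum modulo 2**width, carry-out)."""
    if sum(group_sizes) != width:
        raise ValueError("group sizes must add up to the word width")
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    out, carry, pos = [], 0, 0
    for g in group_sizes:
        ga, gb = a_bits[pos:pos + g], b_bits[pos:pos + g]
        s0, c0 = ripple_add(ga, gb, 0)      # anticipated result for carry-in 0
        s1, c1 = ripple_add(ga, gb, 1)      # anticipated result for carry-in 1
        out += s1 if carry else s0          # multiplexer on the real carry
        carry = c1 if carry else c0
        pos += g
    total = sum(bit << i for i, bit in enumerate(out))
    return total, carry
```

A call such as `carry_select_add(a, b, 16, [2, 2, 3, 4, 5])` uses progressively wider groups, in the spirit of the square-root CSLA: later groups have more time to finish their two anticipated sums while the carry ripples through the earlier multiplexers.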
16. An Optimized Modified Booth Recoder for Efficient Design of the Add-Multiply
Operator
Complex arithmetic operations are widely used in Digital Signal Processing (DSP)
applications. In this work, we focus on optimizing the design of the fused Add-Multiply
(FAM) operator for increasing performance. We investigate techniques to implement the
direct recoding of the sum of two numbers in its Modified Booth (MB) form. We introduce a
structured and efficient recoding technique and explore three different schemes by
incorporating them in FAM designs. Comparing them with the FAM designs which use
existing recoding schemes, the proposed technique yields considerable reductions in terms of
critical delay, hardware complexity and power consumption of the FAM unit.
17. Improved design of high-frequency sequential decimal multipliers

Hardware implementation of decimal arithmetic operations has become a hot topic for
research during the last decade. Among various operations, decimal multiplication is
considered as one of the most complicated dyadic operations, which requires high-cost
hardware implementation. Therefore, the processor industry has opted to use the sequential
decimal multipliers to reduce the high cost of parallel architectures. However, the main
drawback of iterative multipliers is their high latency. In this reported work, the focus has
been on reducing the latency of decimal sequential multipliers while maintaining a low cost
of area. Consequently, a high-frequency sequential decimal multiplier is proposed whose
cycle time is reduced to the latency of a binary half-adder plus that of a decimal
multiply-by-two operation, which overall is less than that of a decimal carry-save adder. The synthesis
results reveal that the proposed sequential multiplier works with a higher clock frequency
than the fastest previous decimal multiplier which in turn leads to overall latency advantage.
18. Bit-Level Optimization of Adder-Trees for Multiple Constant Multiplications for
Efficient FIR Filter Implementation
Multiple constant multiplication (MCM) scheme is widely used for implementing transposed
direct-form FIR filters. While the research focus of MCM has been on more effective
common subexpression elimination, the optimization of adder-trees, which sum up the
computed sub-expressions for each coefficient, is largely omitted. In this paper, we have
identified the resource minimization problem in the scheduling of adder-tree operations for
the MCM block, and presented a mixed integer programming (MIP) based algorithm for
more efficient MCM-based implementation of FIR filters. Experimental results show that up
to 15% reduction of area and 11.6% reduction of power (with an average of 8.46% and
5.96%, respectively) can be achieved on top of an already optimized adder/subtractor
network of the MCM block.
19. Improved matrix multiplier design for high-speed digital signal processing
applications
A transistor level implementation of an improved matrix multiplier for high-speed digital
signal processing applications based on matrix element transformation and multiplication is
reported in this study. The improvement in speed was achieved by rearranging the matrix
element into a two-dimensional array of processing elements interconnected as a mesh. The
edges of each row and column were interconnected in torus structure, facilitating
simultaneous implementation of several multiplications. The functionality of the circuitry
was verified and the performance parameters, for example, propagation delay and dynamic
switching power consumption, were calculated using SPICE (Spectre) simulation in 90 nm CMOS
technology. The proposed methodology ensures substantial reduction in propagation delay
compared with the conventional algorithm, systolic array and pseudo number theoretic
transformation (PNTT)-based implementation, which are the most commonly used
techniques for matrix multiplication. The propagation delay of the implemented 4 × 4
matrix multiplier was only ~2 μs, whereas the power consumption of the implemented 4 × 4
matrix multiplier was ~3.12 mW only. Improvement in speed compared with earlier reported
matrix multipliers, for example, conventional algorithm, systolic array and PNTT-based
implementation was found to be ~67, ~56 and ~65%, respectively.
20. A Novel Distortion Model and Lagrangian Multiplier for Depth Maps Coding
In three-dimensional videos (3-DV) coding systems, depth maps are not used for viewing but
for rendering virtual views. Therefore, the traditional rate distortion criterion (including
distortion criterion, and Lagrangian multiplier) is not suitable for depth map coding. In order
to design an effective rate distortion criterion for depth maps, the relationship between the
distortion of synthesized virtual view and the coding error of depth maps is analyzed in detail.
Through the analysis, a polynomial model revealing the relationship between the coding error
of depth maps and the distortion of synthesized virtual view is derived. Model parameters are
estimated by utilizing camera parameters and features of the texture video corresponding to
the depth map. Based on the model, a virtual view-based Lagrangian multiplier for depth map
coding is also proposed. Experimental results demonstrated the accuracy of the model. The
squared correlation coefficients between the actual distortion of virtual view and the
estimated distortion are all larger than 0.98 for all tested sequences. When incorporating the
proposed model and Lagrangian multiplier into the mode decision procedure of joint model
version 18.5 (JM18.5) of H.264/AVC, a maximum BD-PSNR gain of 0.470 dB and an average
BD-PSNR gain of 0.251 dB can be achieved.
21. Dual-Basis Superserial Multipliers for Secure Applications and Lightweight
Cryptographic Architectures
Cryptographic algorithms utilize finite-field arithmetic operations in their computations. Due
to the constraints of the nodes which benefit from the security and privacy advantages of
these algorithms in sensitive applications, these algorithms need to be lightweight. One of the
well-known bases used in sensitive computations is dual basis (DB). In this brief, we present
low-complexity superserial architectures for the DB multiplication over GF(2^m). To the best
of our knowledge, this is the first time that such a multiplier is proposed in the open
literature. We have performed complexity analysis for the proposed lightweight architectures,
and the results show that the hardware complexity of the proposed superserial multiplier is
reduced compared with that of regular serial multipliers. This has been also confirmed
through our application-specific integrated circuit hardware- and time-equivalent estimations.
The proposed superserial architecture is a step forward toward efficient and lightweight
cryptographic algorithms and is suitable for constrained implementations of cryptographic
primitives in applications such as smart cards, handheld devices, life-critical wearable and
implantable medical devices, and constrained nodes in the blooming notion of Internet of
nano-Things.
22. Multifunction Residue Architectures for Cryptography

A design methodology for incorporating Residue Number System (RNS) and Polynomial
Residue Number System (PRNS) in Montgomery modular multiplication in GF(p) or GF(2^n),
respectively, as well as a VLSI architecture of a dual-field residue arithmetic
Montgomery multiplier are presented in this paper. An analysis of input/output conversions
to/from residue representation, along with the proposed residue Montgomery multiplication
algorithm, reveals common multiply-accumulate data paths both between the converters and
between the two residue representations. A versatile architecture is derived that supports all
operations of Montgomery multiplication in GF(p) and GF(2^n), input/output conversions,
Mixed Radix Conversion (MRC) for integers and polynomials, dual-field modular
exponentiation and inversion in the same hardware. Detailed comparisons with
state-of-the-art implementations prove the potential of residue arithmetic exploitation in dual-field
modular multiplication.
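
For orientation, this is a minimal sketch of plain Montgomery modular multiplication over the integers modulo an odd p, with R = 2^r_bits. It deliberately leaves out the residue number system (RNS/PRNS) representation and the GF(2^n) case that the paper addresses; the function names, the context tuple, and the modulus used in the usage note are assumptions. It relies on `pow(p, -1, r)`, available from Python 3.8.

```python
def montgomery_setup(p, r_bits):
    """Precompute constants for odd modulus p with R = 2**r_bits > p."""
    if p % 2 == 0 or p >= (1 << r_bits):
        raise ValueError("p must be odd and smaller than R")
    r = 1 << r_bits
    p_neg_inv = (-pow(p, -1, r)) % r    # -p^{-1} mod R
    r2 = (r * r) % p                    # R^2 mod p, used to enter Montgomery form
    return p, r_bits, p_neg_inv, r2


def mont_mul(a, b, ctx):
    """Montgomery product a * b * R^{-1} mod p, for 0 <= a, b < p."""
    p, r_bits, p_neg_inv, _ = ctx
    mask = (1 << r_bits) - 1
    t = a * b
    m = ((t & mask) * p_neg_inv) & mask   # makes t + m*p divisible by R
    u = (t + m * p) >> r_bits             # exact division by R
    return u - p if u >= p else u         # single conditional subtraction


def mont_modmul(a, b, ctx):
    """Ordinary (a * b) mod p computed through the Montgomery domain."""
    r2 = ctx[3]
    am = mont_mul(a, r2, ctx)             # a * R mod p
    bm = mont_mul(b, r2, ctx)             # b * R mod p
    cm = mont_mul(am, bm, ctx)            # a * b * R mod p
    return mont_mul(cm, 1, ctx)           # leave the Montgomery domain
```

For instance, with `ctx = montgomery_setup(1000003, 20)`, `mont_modmul(a, b, ctx)` agrees with `(a * b) % 1000003`. The reduction step needs only multiplications, a mask and a shift, which is why the method maps well onto multiply-accumulate data paths.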
