Final Year IEEE Project 2013-2014 - VLSI Project Title and Abstract

Elysium Technologies Private Limited
Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com

227-230 Church Road, Anna Nagar, Madurai 625020. 0452-4390702, 4392702, + 91-9944793398. info@elysiumtechnologies.com, elysiumtechnologies@gmail.com
S.P.Towers, No.81 Valluvar Kottam High Road, Nungambakkam, Chennai - 600034. 044-42072702, +91-9600354638, chennai@elysiumtechnologies.com
15, III Floor, SI Towers, Melapudur main Road, Trichy 620001. 0431-4002234, + 91-9790464324. trichy@elysiumtechnologies.com
577/4, DB Road, RS Puram, Opp to KFC, Coimbatore 641002 0422- 4377758, +91-9677751577. coimbatore@elysiumtechnologies.com

1st Floor, A.R.IT Park, Rasi Color Scan Building, Ramanathapuram - 623501. 04567-223225, +91-9677704922. ramnad@elysiumtechnologies.com
Plot No: 4, C Colony, P&T Extension, Perumal puram, Tirunelveli 627007. 0462-2532104, +91-9677733255, tirunelveli@elysiumtechnologies.com
74, 2nd floor, K.V.K Complex, Upstairs Krishna Sweets, Mettur Road, Opp. Bus stand, Erode 638011. 0424-4030055, +91-9677748477. erode@elysiumtechnologies.com
No: 88, First Floor, S.V.Patel Salai, Pondicherry 605 001. 0413 4200640 +91-9677704822 pondy@elysiumtechnologies.com
TNHB A-Block, D.no.10, Opp: Hotel Ganesh Near Busstand. Salem 636007, 0427-4042220, +91-9894444716. salem@elysiumtechnologies.com

ETPL VLSI-001 Pragmatic Integration of an SRAM Row Cache in Heterogeneous 3-D DRAM Architecture Using TSV
Abstract: As scaling DRAM cells becomes more challenging and energy-efficient DRAM chips are in high demand, the DRAM industry has started to undertake an alternative approach to address these looming issues: vertically stacking DRAM dies with through-silicon vias (TSVs) using 3-D IC technology. Furthermore, this emerging integration technology also makes heterogeneous die stacking in one DRAM package possible. Such a heterogeneous DRAM chip provides a unique, promising opportunity for computer architects to contemplate a new memory hierarchy for future system design. In this paper, we study how to design such a heterogeneous DRAM chip to improve both performance and energy efficiency. In particular, we found that, if we want to design an SRAM row cache in a DRAM chip, simple stacking alone cannot address the majority of traditional SRAM row cache design issues. To address these issues, we propose a novel floorplan and several architectural techniques that fully exploit the benefits of 3-D stacking technology. Our multi-core simulation results with memory-intensive applications suggest that, by tightly integrating a small row cache with its corresponding DRAM array, we can improve performance by 30% while saving dynamic energy by 31%.

ETPL VLSI-002 A Low-Complexity Turbo Decoder Architecture for Energy-Efficient Wireless Sensor Networks
Abstract: Turbo codes have recently been considered for energy-constrained wireless communication applications, since they facilitate a low transmission energy consumption. However, in order to reduce the overall energy consumption, lookup table-log-BCJR (LUT-Log-BCJR) architectures having a low processing energy consumption are required. In this paper, we decompose the LUT-Log-BCJR architecture into its most fundamental add compare select (ACS) operations and perform them using a novel low-complexity ACS unit. We demonstrate that our architecture employs an order of magnitude fewer gates than the most recent LUT-Log-BCJR architectures, facilitating a 71% energy consumption reduction. Compared to state-of-the-art maximum logarithmic Bahl-Cocke-Jelinek-Raviv implementations, our approach facilitates a 10% reduction in the overall energy consumption at ranges above 58 m.

ETPL VLSI-003 Pipelined Radix-2^k Feedforward FFT Architectures
Abstract: The appearance of radix-2^2 was a milestone in the design of pipelined FFT hardware architectures. Later, radix-2^2 was extended to radix-2^k. However, radix-2^k was only proposed for single-path delay feedback (SDF) architectures, but not for feedforward ones, also called multi-path delay commutator (MDC). This paper presents the radix-2^k feedforward (MDC) FFT architectures. In feedforward architectures radix-2^k can be used for any number of parallel samples which is a power of two. Furthermore, both decimation in frequency (DIF) and decimation in time (DIT) decompositions can be used. In addition to this, the designs can achieve very high throughputs, which makes them suitable for the most demanding applications. Indeed, the proposed radix-2^k feedforward architectures require fewer hardware resources than parallel feedback ones, also called multi-path delay feedback (MDF), when several samples in parallel must be processed. As a result, the proposed radix-2^k feedforward architectures not only offer an attractive solution for current applications, but also open up a new research line on feedforward structures.
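
Illustrative sketch (not part of the original abstract): the decimation-in-frequency (DIF) decomposition mentioned for ETPL VLSI-003 can be shown in software with a plain radix-2 recursion. This is only a behavioural model of the butterfly-and-twiddle structure, not the paper's pipelined radix-2^k MDC hardware. The function names fft_dif and dft are assumptions chosen for this example.

```python
import cmath


def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT.

    len(x) must be a power of two. Software sketch only; it does not
    model pipelining, parallel sample streams, or fixed-point effects.
    """
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterfly stage: sums feed the even-indexed outputs,
    # twiddled differences feed the odd-indexed outputs.
    top = [x[i] + x[i + half] for i in range(half)]
    bottom = [
        (x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
        for i in range(half)
    ]
    out = [0j] * n
    out[0::2] = fft_dif(top)
    out[1::2] = fft_dif(bottom)
    return out


def dft(x):
    """Direct O(n^2) DFT used as a reference for checking fft_dif."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    samples = [complex(i, 0) for i in range(8)]
    for fast, slow in zip(fft_dif(samples), dft(samples)):
        print(abs(fast - slow) < 1e-9)
```

Comparing fft_dif against the direct dft on a short power-of-two input is a quick way to confirm that the butterfly ordering and twiddle factors are consistent.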

ETPL VLSI-004 Algorithm and Architecture Design of Bandwidth-Oriented Motion Estimation for Real-Time Mobile Video Applications
Abstract: This paper proposes a data bandwidth-oriented motion estimation design for resource-limited mobile video applications using an integrated bandwidth rate distortion optimization framework. This framework predicts and allocates the appropriate data bandwidth for motion estimation under a limited bandwidth supply to fit a dynamically changing bandwidth supply. The simulation results show that our proposed algorithm can achieve 66% and 41% memory bandwidth savings while maintaining an equivalent rate-distortion performance and meeting real-time targets, when compared with conventional approaches for low-motion and high-motion D1 (704 × 576)-size video, respectively. The final implementation costs 122 K gate counts with TSMC 0.13-µm CMOS technology and consumes 74 mW of power for D1 resolution at 30 frames/s, which is 40% of that achieved in previous designs.

ETPL VLSI-005 STBC-OFDM Downlink Baseband Receiver for Mobile WMAN
Abstract: This paper proposes a space time block code-orthogonal frequency division multiplexing downlink baseband receiver for mobile wireless metropolitan area network. The proposed baseband receiver applied in the system with two transmit antennas and one receive antenna aims to provide high performance in outdoor mobile environments. It provides a simple and robust synchronizer and an accurate but hardware affordable channel estimator to overcome the challenge of multipath fading channels. The coded bit error rate performance for 16 quadrature amplitude modulation can achieve less than 10^-6 under the vehicle speed of 120 km/hr. The proposed baseband receiver designed in 90-nm CMOS technology can support up to 27.32 Mb/s uncoded data transmission under 10 MHz channel bandwidth. It requires a core area of 2.41 × 2.41 mm^2 and dissipates 68.48 mW at 78.4 MHz with 1 V power supply.

ETPL VLSI-006 Glitch-Free NAND-Based Digitally Controlled Delay-Lines
Abstract: The recently proposed NAND-based digitally controlled delay-lines (DCDL) present a glitching problem which may limit their use in many applications. This paper presents a glitch-free NAND-based DCDL which overcomes this limitation, opening the use of NAND-based DCDLs in a wide range of applications. The proposed NAND-based DCDL maintains the same resolution and minimum delay as the previously proposed NAND-based DCDL. The theoretical demonstration of the glitch-free operation of the proposed DCDL is also derived in the paper. Following this analysis, three driving circuits for the delay control-bits are also proposed. The proposed DCDLs have been designed in a 90-nm CMOS technology and compared, in this technology, to the state-of-the-art. Simulation results show that the novel circuits result in the lowest resolution, with a slight worsening of the minimum delay with respect to the previously proposed DCDL with the lowest delay. Simulations also confirm the correctness of the developed glitching model and sizing strategy. As an example application, the proposed DCDL is used to realize an all-digital spread-spectrum clock generator (SSCG). The use of the proposed DCDL in this circuit reduces the peak-to-peak absolute output jitter by more than 40% with respect to an SSCG using three-state inverter based DCDLs.

ETPL VLSI-007 A High-Efficiency, Wide Workload Range, Digital Off-Time Modulation (DOTM) DC-DC Converter With Asynchronous Power Saving Technique
Abstract: Conventionally, for wide workload range applications, a switching converter with multi-mode operation is necessary to keep good stability and high efficiency. With advanced digital signal processing, this work presents an asynchronous digital controller with a dynamic power saving technique to achieve high power efficiency. The regulation is based on off-time modulation, in which an adaptive resolution adjustment is proposed for the extension toward the light-loaded range. The DC-DC converter is fabricated in a 0.18-µm CMOS process. The input voltage is from 2.7 to 3.6 V and the regulated output is 1.8 V. The switching frequency is from 44 kHz to 1.65 MHz and the maximum output ripple is 20 mV with a 10-µF capacitor and a 2.2-µH inductor. The power efficiency is higher than 91% for the workload range from 3 to 400 mA.

ETPL VLSI-008 Formal Verification of Architectural Power Intent
Abstract: This paper presents a verification framework that attempts to bridge the disconnect between high-level properties capturing the architectural power management strategy and the implementation of the power management control logic using low-level per-domain control signals. The novelty of the proposed framework is in demonstrating that the architectural power intent properties developed using high-level artifacts can be automatically translated into properties over low-level control sequences gleaned from UPF specifications of power domains, and that the resulting properties can be used to formally verify the global on-chip power management logic. The proposed translation uses a considerable amount of domain knowledge and is also not purely syntactic, because it requires formal extraction of timing information for the low-level control sequences. We present a tool, called POWER-TRUCTOR, which enables the proposed framework, and several test cases of significant complexity to demonstrate the feasibility of the proposed framework.

ETPL VLSI-009 Statistical SRAM Read Access Yield Improvement Using Negative Capacitance Circuits
Abstract: SRAM has become the dominant block in modern ICs and constitutes more than 50% of the die area. The increase of process variations with continued CMOS technology scaling is considered one of the major challenges for SRAM designers. This increase in process variations causes SRAM cells to fail functionally and reduces the chip functional yield, considering static noise margin stability failures (i.e., cell flips when accessed), write failures (i.e., cell is not written within the write window), and read access failures (i.e., incorrect read operation). In this paper, novel negative capacitance circuits are developed, for the first time, to statistically improve the SRAM read access yield under process variations by reducing the bitline parasitic capacitance. Post-layout simulation results, referring to an industrial hardware-calibrated TSMC 65-nm CMOS technology, show that adding the negative capacitance circuit to a column of 512 SRAM cells is capable of improving the read access yield from 61.9% to 100%.

ETPL VLSI-010 An Energy-Efficient L2 Cache Architecture Using Way Tag Information Under Write-Through Policy
Abstract: Many high-performance microprocessors employ cache write-through policy for performance improvement and at the same time achieving good tolerance to soft errors in on-chip caches. However, write-through policy also incurs large energy overhead due to the increased accesses to caches at the lower level (e.g., L2 caches) during write operations. In this paper, we propose a new cache architecture referred to as way-tagged cache to improve the energy efficiency of write-through caches. By maintaining the way tags of L2 cache in the L1 cache during read operations, the proposed technique enables L2 cache to work in an equivalent direct-mapping manner during write hits, which account for the majority of L2 cache accesses. This leads to significant energy reduction without performance degradation. Simulation results on the SPEC CPU2000 benchmarks demonstrate that the proposed technique achieves 65.4% energy savings in L2 caches on average with only 0.02% area overhead and no performance degradation. Similar results are also obtained under different L1 and L2 cache configurations. Furthermore, the idea of way tagging can be applied to existing low-power cache design techniques to further improve energy efficiency.

ETPL VLSI-011 An Analytical Latency Model for Networks-on-Chip
Abstract: We propose an analytical model based on queueing theory for delay analysis in a wormhole-switched network-on-chip (NoC). The proposed model takes as input an application communication graph, a topology graph, a mapping vector, and a routing matrix, and estimates average packet latency and router blocking time. It works for arbitrary network topology with deterministic routing under arbitrary traffic patterns. This model can estimate per-flow average latency accurately and quickly, thus enabling fast design space exploration of various design parameters in NoC designs. Experimental results show that the proposed analytical model can predict the average packet latency more than four orders of magnitude faster than an accurate simulation, while the computation error is less than 10% in nonsaturated networks for different system-on-chip platforms.

ETPL VLSI-012 Built-In Generation of Functional Broadside Tests Using a Fixed Hardware Structure
Abstract: Functional broadside tests are two-pattern scan-based tests that avoid overtesting by ensuring that a circuit traverses only reachable states during the functional clock cycles of a test. In addition, the power dissipation during the fast functional clock cycles of functional broadside tests does not exceed that possible during functional operation. On-chip test generation has the added advantage that it reduces test data volume and facilitates at-speed test application. This paper shows that on-chip generation of functional broadside tests can be done using a simple and fixed hardware structure, with a small number of parameters that need to be tailored to a given circuit, and can achieve high transition fault coverage for testable circuits. With the proposed on-chip test generation method, the circuit is used for generating reachable states during test application. This alleviates the need to compute reachable states offline.

ETPL VLSI-013 Checkpointing for Virtual Platforms and SystemC-TLM
Abstract: Integrating simulation models created using different simulation systems is a common problem when constructing virtual platforms. Different companies and different departments can create models and virtual platforms for different purposes using different tools. There are also existing models that need to be integrated into new tools, or the other way around. The simulators can be quite different in details, even in the case of transaction-level models. We present work on integrating SystemC transaction-level models into two typical full-system simulation environments, QEMU and Simics. We present issues in reconciling the semantics of the different platforms, and our proposed solutions. In the Simics integration, we additionally enable checkpointing in the models, based on the Simics checkpoint mechanism.

ETPL VLSI-014 Design of a Practical Nanometer-Scale Redundant Via-Aware Standard Cell Library for Improved Redundant Via1 Insertion Rate
Abstract: Despite the rapid advances in process technology, via failure is still problematic in nanometer-scale semiconductor manufacturing. Adding redundant vias is a typical approach for improving yield and reliability. Cell-based design methodologies are widely adopted in the industry for application-specific integrated circuits. Standard cells are effective for increasing the insertion rate of redundant via1s in cell-based designs. This study proposes an efficient library check and staggered pin arrangement approach that compares redundant via1 insertion rate in different configurations such as double-via and rectangle-via. To compare the variability in standard cell (SC) libraries, accurate characterization results are provided. Moreover, the proposed SC library is easily implemented in all currently available routers. The experimental results reveal that the proposed library improves total inserted redundant vias, total inserted redundant via1s, and total run time by 20.2%, 51.9%, and 42.3%, respectively. In double-via pattern, the proposed approach improves average via1 insertion rate by 14.6%. In rectangle-via pattern, the proposed approach achieves a 100% via1 insertion rate.

ETPL VLSI-015 Scaling Energy Per Operation via an Asynchronous Pipeline
Abstract: A statistical analysis of computations per unit energy in processors over the last 30 years is given. It illustrates a sharp reduction in the rate of energy-efficiency improvements over the last several years, resulting in the formation of an asymptotic wall in our dataset; we use the measure of giga multiply-accumulates per joule. We have developed an energy model which takes into account the realities of scaling, specifically for asynchronous systems. Studies of an energy-efficient asynchronous pipeline show fabricated results of 17 giga operations per joule in 0.6 µm at subthreshold when fully pipelined, and simulations at a more modern 65-nm process show a further order of magnitude improvement on that.

ETPL VLSI-016 A High Speed Low Power CAM With a Parity Bit and Power-Gated ML Sensing
Abstract: Content addressable memory (CAM) offers a high-speed search function in a single clock cycle. Due to its parallel match-line (ML) comparison, CAM is power-hungry. Thus, robust, high-speed and low-power ML sense amplifiers are highly sought-after in CAM designs. In this paper, we introduce a parity bit that leads to a 39% sensing delay reduction at a cost of less than 1% area and power overhead. Furthermore, we propose an effective gated-power technique to reduce the peak and average power consumption and enhance the robustness of the design against process variations. A feedback loop is employed to automatically turn off the power supply to the comparison elements and hence reduce the average power consumption by 64%. The proposed design can work at a supply voltage down to 0.5 V.

ETPL VLSI-017 Error Detection in Majority Logic Decoding of Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes
Abstract: In a recent paper, a method was proposed to accelerate the majority logic decoding of difference set low density parity check codes. This is useful as majority logic decoding can be implemented serially with simple hardware but requires a large decoding time. For memory applications, this increases the memory access time. The method detects whether a word has errors in the first iterations of majority logic decoding, and when there are no errors the decoding ends without completing the rest of the iterations. Since most words in a memory will be error-free, the average decoding time is greatly reduced. In this brief, we study the application of a similar technique to a class of Euclidean geometry low density parity check (EG-LDPC) codes that are one step majority logic decodable. The results obtained show that the method is also effective for EG-LDPC codes. Extensive simulation results are given to accurately estimate the probability of error detection for different code sizes and numbers of errors.

ETPL VLSI-018 Techniques for Compensating Memory Errors in JPEG2000
Abstract: This paper presents novel techniques to mitigate the effects of SRAM memory failures caused by low voltage operation in JPEG2000 implementations. We investigate error control coding schemes, specifically single error correction double error detection code based schemes, and propose an unequal error protection scheme tailored for JPEG2000 that reduces memory overhead with minimal effect on performance. Furthermore, we propose algorithm-specific techniques that exploit the characteristics of the discrete wavelet transform coefficients to identify and remove SRAM errors. These techniques do not require any additional memory, have low circuit overhead, and more importantly, reduce the memory power consumption significantly with only a small reduction in image quality.

ETPL VLSI-019 Spatial Distribution Measurement of Dynamic Voltage Drop Caused by Pulse and Periodic Injection of Spot Noise
Abstract: This paper presents measured results of dynamic voltage drop caused by pulse and periodic injection of spot noise. The test structure, fabricated in a 45-nm low-power process, has 1024 delay probes to measure spatial distributions in response to the spot-noise generation. The test structure is an advanced version of its predecessor, which was fabricated at the 65-nm node, and can trace changes in the spatial distributions with time after the noise injection. The measured results are compared with SPICE simulations, in which package/socket LCR as well as power-line RC within the die is modeled. It is found that the simple model agrees well with the measured results.

ETPL VLSI-020 Low-Complexity Multiplier for GF(2^m) Based on All-One Polynomials
Abstract: This paper presents an area-time-efficient systolic structure for multiplication over GF(2^m) based on irreducible all-one polynomial (AOP). We have used a novel cut-set retiming to reduce the duration of the critical-path to one XOR gate delay. It is further shown that the systolic structure can be decomposed into two or more parallel systolic branches, where the pair of parallel systolic branches has the same input operand, and they can share the same input operand registers. From the application-specific integrated circuit and field-programmable gate array synthesis results we find that the proposed design provides significantly less area-delay and power-delay complexities over the best of the existing designs.

ETPL VLSI-021 Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip
Abstract: This paper presents the silicon-proven design of a novel on-chip network to support guaranteed traffic permutation in multiprocessor system-on-chip applications. The proposed network employs a pipelined circuit-switching approach combined with a dynamic path-setup scheme under a multistage network topology. The dynamic path-setup scheme enables runtime path arrangement for arbitrary traffic permutations. The circuit-switching approach offers a guarantee of permuted data and its compact overhead enables the benefit of stacking multiple networks. A 0.13-µm CMOS test-chip validates the feasibility and efficiency of the proposed design. Experimental results show that the proposed on-chip network achieves 1.9× to 8.2× reduction of silicon overhead compared to other design approaches.

ETPL VLSI-022 An On-Chip Network Fabric Supporting Coarse-Grained Processor Array
Abstract: Coarse-grained arrays (CGAs) with run-time reconfigurability play an important role in accelerating reconfigurable computing applications. It is challenging to design on-chip communication networks (OCNs) for such CGAs with dynamic run-time reconfigurability whilst satisfying the tight budgets of power and area for an embedded system. This paper presents a silicon-proven design of a 64-PE circuit-switched OCN fabric with a dynamic path-setup scheme capable of supporting an embedded coarse-grained processor array. A proof-of-concept test chip fabricated in a 0.13-µm CMOS process occupies a silicon area of 23 mm^2 and consumes a peak power of 200 mW at 128 MHz and 1.2 Vcc, at room temperature. The OCN overhead consumes 9.4% of the area and 18% of the power of the total chip. Experimental results and analysis show that the proposed OCN fabric with its dynamic path-setup is suitable for use in an embedded CGA supporting fast run-time reconfigurability.

ETPL VLSI-023 A Very Linear Low-Pass Filter with Automatic Frequency Tuning
Abstract: A Gm-C third-order Chebyshev low-pass filter with a novel switched capacitor frequency tuning technique for a zero-IF Bluetooth receiver has been designed. The frequency tuning scheme is simpler and has more relaxed specifications than conventional ones. Furthermore, a highly linear pseudo-differential transconductor with a compact feedback loop able to operate with low supply voltage has been used. This control loop holds the input transistors in triode region and provides high output resistance, keeping high linearity in a wide range of transconductance. The filter bandwidth is 0.5 MHz and the overall scheme consumes 1.1 mA from a 1.8-V supply. The measured third-order intermodulation (IM3) distortion of the filter for a 1 Vpp two-tone signal centered at 300 kHz is -65 dB.

ETPL VLSI-024 A High-Speed Low-Complexity Modified Radix-2^5 FFT Processor for High Rate WPAN Applications
Abstract: This paper presents a high-speed low-complexity modified radix-2^5 512-point fast Fourier transform (FFT) processor using an eight data-path pipelined approach for high rate wireless personal area network applications. A novel modified radix-2^5 FFT algorithm that reduces the hardware complexity is proposed. This method can reduce the number of complex multiplications and the size of the twiddle factor memory. It also uses a complex constant multiplier instead of a complex Booth multiplier. The proposed FFT processor achieves a signal-to-quantization noise ratio of 35 dB at 12 bit internal word length. The proposed processor has been designed and implemented using 90-nm CMOS technology with a supply voltage of 1.2 V. The results demonstrate that the total gate count of the proposed FFT processor is 290 K. Furthermore, the highest throughput rate is up to 2.5 GS/s at 310 MHz while requiring much less hardware complexity.

ETPL VLSI-025 Application Space Exploration of a Heterogeneous Run-Time Configurable Digital Signal Processor
Abstract: This paper describes the application space exploration of a heterogeneous digital signal processor with dynamic reconfiguration capabilities. The device is built around three reconfigurable engines featuring different flavours and computation granularities that make it suitable for a wide range of signal processing application domains such as video coding, image processing, telecommunications, and cryptography. Performance of signal processing applications is evaluated from measurements performed on a CMOS 90 nm prototype. In order to characterize the application space of the processor, performance is compared with state-of-the-art devices, taking programmability, computational capabilities, and energy efficiency as the main metrics. The device delivers significantly higher performance and energy efficiency than general purpose processors, while still maintaining a user-friendly programming approach that mainly relies on software-oriented languages. The device is able to achieve 1.2 to 15 GOPS with an energy efficiency from 2 to 50 GOPS/W when running the selected applications.

ETPL VLSI-026 A Unified Graphics and Vision Processor With a 0.89 µW/fps Pose Estimation Engine for Augmented Reality
Abstract: A unified vision and graphics processor with three layers is shown to provide a fast pipeline for augmented reality. In the image-level layer, a 153.6 GOPS massively parallel processing unit with eight SIMD processors, each containing 128 processing elements, performs highly data-parallel operations. In the sub-image layer, a rasterizer and a pixel arranger respectively generate and reduce data-level parallelism. In the descriptor-level layer, a pose estimation engine executes sequential programs. Our processor can provide images for augmented reality at 100 fps, for a power consumption of 413 mW. This is 39% faster than a comparable smartphone implementation. Our chip is fabricated in a 0.18 µm CMOS process and contains 0.95 M gates.

ETPL VLSI-027 CORDIC Designs for Fixed Angle of Rotation
Abstract: Rotation of vectors through fixed and known angles has wide applications in robotics, digital signal processing, graphics, games, and animation. However, we do not find any optimized coordinate rotation digital computer (CORDIC) design for vector rotation through specific angles. Therefore, in this paper, we present optimization schemes and CORDIC circuits for fixed and known rotations with different levels of accuracy. To reduce the area and time complexities, we propose a hardwired pre-shifting scheme in the barrel-shifters of the proposed circuits. Two dedicated CORDIC cells are proposed for the fixed-angle rotations. In one of those cells, micro-rotations and scaling are interleaved, and in the other they are implemented in two separate stages. Pipelined schemes are further suggested for cascading dedicated single-rotation units and bi-rotation CORDIC units for high-throughput and reduced-latency implementations. We have obtained the optimized set of micro-rotations for fixed and known angles. The optimized scale-factors are also derived and dedicated shift-add circuits are designed to implement the scaling. The fixed-point mean-squared-error of the proposed CORDIC circuit is analyzed statistically, and strategies for reducing the error are given. We have synthesized the proposed CORDIC cells by Synopsys Design Compiler using TSMC 90-nm library, and shown that the proposed designs offer higher throughput, less latency and less area-delay product than the reference CORDIC design for fixed and known angles of rotation. We find similar results of synthesis for different Xilinx field-programmable gate-array platforms.

ETPL VLSI-028 Application-Driven End-to-End Traffic Predictions for Low Power NoC Design
Abstract: As chip multiprocessors keep increasing the number of cores on the chip, the network-on-chip (NoC) technology is becoming essential for interconnecting the cores. While NoCs result in noticeable performance boost over conventional bus systems, they consume a non-negligible fraction of the system power. One promising solution is to dynamically adjust the working frequencies/voltages of the switches as well as the links between switches in the NoC to match the traffic flows. The question is when to adjust and by how much. Most previous works take a passive approach by reacting to fluctuations in local traffic flows. Unfortunately, this approach may be too slow and too conservative in adjusting the working frequencies/voltages. Since applications often exhibit periodic behaviors, we propose a hardware mechanism to proactively adjust the frequencies/voltages of switches and/or links in NoC by predicting the application runtime traffic. The evaluations show that our design achieves 86% dynamic power savings of the links in the on-chip network, and the resulting overheads from mispredictions are tolerable.

ETPL VLSI-029 Thermal-Constrained Task Allocation for Interconnect Energy Reduction in 3-D Homogeneous MPSoCs
Abstract: 3-D technology that stacks silicon dies with through-silicon vias (TSVs) is a promising solution to overcome the interconnect scaling problem in giga-scale integrated circuits (ICs). Thermal dissipation is a major challenge for 3-D integration, and prior thermal-balanced task scheduling methods for 3-D multiprocessor system-on-chips (MPSoCs) typically balance the power gradient across vertical stacks based on the assumption of strong thermal correlation among processing cores within a stack. On the other hand, 3-D MPSoCs typically employ network-on-chip (NoC) as the communication infrastructure, which consumes a large portion of the energy budget. TSVs consume much less energy than horizontal links in 3-D MPSoCs when transmitting the same amount of data, due to the reduced interconnect distance between vertically adjacent cores. This motivates allocating heavily communicating tasks within the same vertical stack as much as possible, so that traffic is restricted to the third dimension to reduce interconnect energy. However, aggregating active tasks within the same stack probably exacerbates the power density and results in hot spots. In this paper, we explore the tradeoff between thermal and interconnect energy when allocating tasks in 3-D homogeneous MPSoCs, and propose an efficient heuristic. Experimental results show that the proposed technique can reduce interconnect energy by more than 25% on average with almost the same peak temperature when compared with prior thermal-balanced solutions.

ETPL VLSI-030 A Wide-Range PLL Using Self-Healing Prescaler/VCO in 65-nm CMOS
Abstract: The variability and leakage current in nanoscale CMOS technology may degrade circuit performance significantly. To accommodate these issues in a wide-range phase-locked loop (PLL), a self-healing prescaler, a self-healing voltage-controlled oscillator (VCO), and a calibrated charge pump (CP) are presented. This PLL is fabricated in a 65-nm CMOS technology and its active area is 0.0182 mm². For the self-healing VCO, the measured frequency range is from 60 to 1489 MHz. When this PLL operates at 855 MHz, the measured rms and peak-to-peak jitters are 8.03 and 55.6 ps, respectively. The measured reference spur is -52.89 dBc. This PLL consumes 4.3 mW from a 1.2-V supply without buffers.
ETPL VLSI-031 A Clock Control Strategy for Peak Power and RMS Current Reduction Using Path Clustering
Abstract: Peak power reduction has been a critical challenge in the design of integrated circuits, impacting the chip's performance and reliability. The reduction of peak power also reduces the power density of integrated circuits. Due to large IR-voltage drops in circuits, transistor switching slows down, giving rise to timing violations and logic failures. In this paper, we present a new clock control strategy for peak-power reduction in VLSI circuits. In the proposed method, the simultaneous switching of combinational paths is minimized by taking advantage of the delay slacks among the paths and clustering the paths with similar slack values. Once the paths are identified based on the path delays and their slack values, the clustering algorithm determines the ideal number of clusters for the given circuit and, for each cluster, the maximum possible phase shift that can be applied to the clock. The paths are assigned to clusters in a load-balanced manner based on the slack values, and each cluster will have a phase shift possible on its clock depending on the slack. Thus, the proposed register-transfer level (RTL) method takes advantage of the logic-path timing slack to re-schedule circuit activities at optimal intervals within the unaltered clock period. When switching activities are redistributed more evenly across the clock period, the IC supply-current consumption is also spread across a wider range of time within the clock period. This has the beneficial effect of reducing peak-current draw in addition to reducing RMS power draw without having to change the operating frequency and without utilizing additional power supply voltages as in dual or multi-VT approaches. The proposed method is implemented and tested through simulations using an experimental setup with Synopsys Tools Suite and Cadence Tools on the ISCAS'85 benchmark circuits, OpenCore circuits and LEON processor multiplier circuit.
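The slack-based clustering described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the `Path` record, the `cluster_paths` helper, the fixed cluster count, and the rule "phase shift = smallest slack in the cluster" are all assumptions chosen for illustration. Paths are sorted by slack and split into near-equal groups, so each group holds paths of similar slack and the load stays balanced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """Hypothetical record for one combinational path."""
    name: str
    slack_ns: float  # timing slack of the path, in nanoseconds


def cluster_paths(paths, num_clusters):
    """Split paths into near-equal clusters of similar slack.

    Returns a list of (phase_shift_ns, [path names]) tuples. As an
    assumption of this sketch, the phase shift of a cluster is the
    smallest slack among its paths, i.e. the largest clock delay that
    no path in the cluster would violate.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    ordered = sorted(paths, key=lambda p: p.slack_ns)
    base, extra = divmod(len(ordered), num_clusters)
    clusters = []
    start = 0
    for i in range(num_clusters):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        group = ordered[start:start + size]
        start += size
        if group:
            shift = max(0.0, group[0].slack_ns)  # never shift by negative slack
            clusters.append((shift, [p.name for p in group]))
    return clusters


paths = [
    Path("p0", 0.1), Path("p1", 0.9), Path("p2", 0.4),
    Path("p3", 0.5), Path("p4", 1.2), Path("p5", 0.0),
]
clusters = cluster_paths(paths, 3)
for shift, names in clusters:
    print(f"shift {shift} ns: {names}")
```

In this toy input the six paths fall into three clusters of two, with the tightest cluster receiving no phase shift at all. Choosing the number of clusters automatically, as the abstract describes, is outside the scope of the sketch.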
Experimental results indicate that peak power can be reduced significantly to at least 72% depending on the number of clusters and the phase-shifted clock identified as suitable for the given circuit by the proposed algorithms. Although the proposed method incurs some power overhead compared to the traditional clocking method, the overhead can be made negligible compared to the peak-power reduction, as seen in the experimental results presented.
ETPL VLSI-032 A Fast-Locking All-Digital Deskew Buffer With Duty-Cycle Correction
Abstract: In this paper, a fast-locking all-digital deskew buffer with duty-cycle correction is proposed and implemented. A cyclic time-to-digital converter is introduced to decrease the locking time of a conventional register-controlled delay-locked loop to only two input clock cycles in coarse tuning. With the aid of the three-half-delay-line technique, the mismatch between half delay lines that causes duty-cycle distortion can be alleviated by interpolation. A balanced edge combiner to achieve a precise 50% output clock is also presented. A test chip is fabricated in 0.18-µm technology to demonstrate the feasibility of the proposed architecture. The circuit can accept input clock rates from 250 to 625 MHz with duty-cycle variation between 30% and 70% to generate 50% output clocks. It preserves the capability of closed-loop control with a small area and power consumption.
ETPL VLSI-033 A Built-In Repair Analyzer With Optimal Repair Rate for Word-Oriented Memories

Abstract: This paper presents a built-in self repair analyzer with the optimal repair rate for memory arrays with redundancy. The proposed method requires only a single test, even in the worst case. By performing the must-repair analysis on the fly during the test, it selectively stores fault addresses, and the final analysis to find a solution is performed on the stored fault addresses. To enumerate all possible solutions, existing techniques use depth first search using a stack and a finite-state machine. Instead, we propose a new algorithm and its combinational circuit implementation. Since our formulation for the circuit allows us to use the parallel prefix algorithm, it can be configured in various ways to meet area and test time requirements. The total area of our infrastructure is dominated by the number of content addressable memory entries to store the fault addresses, and it only grows quadratically with respect to the number of repair elements. The infrastructure is also extended to support various types of word-oriented memories. ETPL VLSI-034 System-Level Modeling and Analysis of Thermal Effects in Optical Networks-on-Chip
Abstract: The performance of multiprocessor systems, such as chip multiprocessors (CMPs), is determined not only by individual processor performance, but also by how efficiently the processors collaborate with one another. It is the communication architecture that determines the collaboration efficiency on the hardware side. Optical networks-on-chip (ONoCs) are emerging communication architectures that can potentially offer ultra-high communication bandwidth and low latency to multiprocessor systems. Thermal sensitivity is an intrinsic characteristic of the photonic devices used by ONoCs as well as a potential issue. This paper systematically models and quantitatively analyzes the thermal effects in ONoCs. We use an 8×8 mesh-based ONoC as a case study and evaluate the impact of thermal effects on the average power efficiency for real MPSoC applications. We reveal three important factors regarding ONoC power efficiency under temperature variations, and propose several techniques to reduce the temperature sensitivity of ONoCs. These techniques include the optimal initial setting of the microresonator resonant wavelength, increasing the 3-dB bandwidth of optical switching elements by parallel coupling of multiple microresonators, and the use of the passive-routing optical router Crux to minimize the number of switching stages in mesh-based ONoCs. We give a mathematical analysis of periodically parallel coupling of multiple microresonators and show that the 3-dB bandwidth of optical switching elements can be widened nearly linearly with the ring number. Evaluation results for different real MPSoC applications show that, on the basis of thermal tuning, the optimal device setting improves the average power efficiency by 54% to 1.2 pJ/bit when the chip temperature reaches 85 °C. The findings in this paper can help support the further development of this emerging technology.
ETPL VLSI-035 A Study of Tapered 3-D TSVs for Power and Thermal Integrity
Abstract: 3-D integration presents a path to higher performance, greater density, increased functionality and heterogeneous technology implementation. However, 3-D integration introduces many challenges for power and thermal integrity due to large switching currents, longer power delivery paths, and increased parasitics compared to 2-D integration. In this work, we provide an in-depth study of power and thermal issues while incorporating the physical design characteristics unique to 3-D integration. We provide a qualitative perspective of the power and thermal dissipation issues in 3-D and study the impact of through-silicon via (TSV) size on their mitigation. We investigate and discuss the design implications of power and thermal issues in the presence of decoupling capacitors, TSV/on-die/package parasitics, various resonance effects and power gating. Our study is based on a ten-tier system utilizing existing 3-D technology specifications. Based on detailed power distribution and heat dissipation models, we present a comprehensive analysis of TSV tapering for alleviating power and thermal integrity issues in 3-D ICs.
ETPL VLSI-036 Improved Trace Buffer Observation via Selective Data Capture Using 2-D Compaction for Post-Silicon Debug
Abstract: This paper presents a novel technique for extending the capacity of trace buffers when capturing debug data during post-silicon debug. It exploits the fact that it is not necessary to capture error-free data in the trace buffer, since that information can be obtained from simulation. A selective data capture method is proposed that only captures debug data during clock cycles in which errors are present. The proposed debug method requires only three debug sessions. The first session estimates a rough error rate, the second session identifies a set of suspect clock cycles where errors may be present, and the third session captures the suspect clock cycles in the trace buffer. The suspect clock cycles are determined through a 2-D compaction technique using multiple-input signature register signatures and cycling register signatures. Intersecting both signatures generates a small number of suspect clock cycles that the trace buffer needs to capture. The effective observation window of the trace buffer can be expanded significantly, by up to orders of magnitude. Experimental results indicate that very significant increases in the effective observation window of a trace buffer can be obtained.
ETPL VLSI-037 AC-Plus Scan Methodology for Small Delay Testing and Characterization
Abstract: Small delay defects escaping traditional delay testing could cause a device to malfunction in the field and thus detecting these defects is often necessary. To address this issue, we propose three test modes in a new methodology called AC-plus scan, in which versatile test clocks can be generated on the chip by embedding an all-digital phase-locked loop (ADPLL) into the circuit under test (CUT). AC-plus scan can be executed on an in-house wireless test platform called HOY system. The first test mode of our AC-plus scan provides a more efficient way to measure the longest path delay associated with each test pattern. Experimental result shows that our method could greatly reduce the test time by 81.8%. The second test mode is designed for volume production test. It could effectively detect small delay defects and provide fast characterization on those defective chips for further processing. This mode could be used to help predict which chips are more likely to fall victim to operational failure in the field. The third test mode is to extract the waveform of each flip-flop's output in a real chip. This is made possible by taking advantage of the almost unlimited test memory our HOY test platform provides, so that we could easily store a great volume of data and reconstruct the waveform for post-silicon debugging. We have successfully fabricated a Viterbi decoder chip with such an AC-plus scan methodology inside to demonstrate its capability. ETPL VLSI-038 A Variation Tolerant Current-Mode Signaling Scheme for On-Chip Interconnects
Abstract: Current-mode signaling (CMS) with dynamic overdriving is one of the most promising schemes for high-speed low-power communication over long on-chip interconnects. However, such schemes are sensitive to parameter variations due to reduced voltage swings on the line. In this paper, we propose a variation-tolerant dynamic overdriving CMS scheme. The proposed CMS scheme and a competing CMS scheme (CMS-Fb) are fabricated in 180-nm CMOS technology. Measurement results show that the proposed scheme offers a 34% reduction in energy/bit and a 42% reduction in energy-delay product over the CMS-Fb scheme for a 10 mm line operating at a data rate of 0.64 Gb/s. Simulations indicate that the proposed CMS scheme consumes 0.297 pJ/bit for data transfer over the 10 mm line at 2.63 Gb/s. Measurements indicate that the delay of CMS-Fb becomes 2.5 times its nominal value in the presence of intra-die variations, whereas the delay of the proposed scheme changes by only 5% for the same amount of intra-die variations. Measurement and simulation results show that both schemes are robust against inter-die variations. Experiments and simulations also indicate that the proposed CMS scheme is more robust against practical variations in supply and temperature than the CMS-Fb scheme.
ETPL VLSI-039 Modeling and Analysis of Power Distribution Networks in 3-D ICs
Abstract: This paper addresses the modeling and analysis problems for power distribution networks (PDNs) in 3-D ICs. An on-chip distributed model is proposed for 3-D power grids, in which the details of metal layers are considered. The distributed model is demonstrated to be essential to identifying the unique noise behavior of 3-D PDNs. A lumped model is proposed based on the distributed model. The lumped model features the connection impedance between tiers and is proven to be useful for designers to understand the global effects of 3-D PDNs. Based on the models, an analysis flow is designed for 3-D PDNs in both frequency domain and time domain. With the analysis flow, the electrical characteristics of 3-D PDNs are studied systematically for the first time. The frequency-domain analysis identifies the global and local resonance phenomena in 3-D PDNs that are distinct from those in 2-D PDNs. The physical mechanisms behind the resonance phenomena are investigated. The time-domain analysis predicts the worst-case supply noise based on distributed current constraints. The Rogue Wave concept is introduced to explain the spatial and temporal relations of the worst-case on-chip noise responses in 3D PDNs. ETPL VLSI-040 A Low-Cost, Systematic Methodology for Soft Error Robustness of Logic Circuits
Abstract: Due to current technology scaling trends such as shrinking feature sizes and decreasing supply voltages, circuit reliability is becoming more susceptible to radiation-induced transient faults (soft errors). Soft errors, which have been a great concern in memories, are now a main factor in reliability degradation of logic circuits as well. In this paper, we present a systematic and integrated methodology for circuit robustness to soft errors. The proposed soft error rate (SER) reduction framework, based on redundancy addition and removal (RAR), aims at eliminating those gates with a large contribution to the overall SER. Several metrics and constraints are introduced to guide the RAR-based approach toward SER reduction. Furthermore, we integrate a resizing strategy into our framework as a post-RAR additive SER optimization. The strategy can identify the most critical gates to be upsized and thereby minimize area and power overheads while maintaining a high level of soft error robustness. Experimental results show that the proposed RAR-based framework can achieve up to 70% reduction in output failure probability. On average, about 23% SER reduction is obtained with less than 4% area overhead.
ETPL VLSI-041 Low Complexity Out-of-Order Issue Logic Using Static Circuits
Abstract: In this paper, a single-cycle issue queue circuit architecture that simplifies the wakeup and selection logic is proposed. The micro-architecture and fully static CMOS circuits are presented for a 32-entry queue that issues four instructions per cycle. The instruction-ready signals are divided into groups and processed in parallel to issue the four oldest ready instructions. The complete issue queue and prioritization logic requires 20 inversions, allowing simulated circuit operation at over 4 GHz in a foundry 45 nm SOI fabrication process.
ETPL VLSI-042 Low Latency Systolic Montgomery Multiplier for Finite Field GF(2^m) Based on Pentanomials
Abstract: In this paper, we present a low latency systolic Montgomery multiplier over GF(2^m) based on irreducible pentanomials. An efficient algorithm is presented to decompose the multiplication into a number of independent units to facilitate parallel processing. In addition, a novel so-called pre-computed addition technique is introduced to further reduce the latency. The proposed design involves significantly less area-delay and power-delay complexities compared with the best of the existing designs. It has the same or shorter critical path and involves nearly one-fourth of the latency of the others in the case of the National Institute of Standards and Technology recommended irreducible pentanomials.
ETPL VLSI-043 Power-Up Sequence Control for MTCMOS Designs
Abstract: Power gating is effective for reducing standby leakage power, and multi-threshold CMOS (MTCMOS) designs have become popular in the industry. However, a large inrush current and dynamic IR drop may occur when a circuit domain is powered up with MTCMOS switches. This could in turn lead to improper circuit operation. We propose a novel framework for generating a proper power-up sequence of the switches to control the inrush current of a power-gated domain while minimizing the power-up time and reducing the dynamic IR drop of the active domains. We also propose a configurable domino-delay circuit for implementing the sequence. Experimental results based on state-of-the-art industrial designs demonstrate the effectiveness of the proposed framework in limiting the inrush current, minimizing the power-up time, and reducing the dynamic IR drop. Results further confirm the efficiency of the framework in handling large-scale designs with more than 80 K power switches and 100 M transistors.
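As a rough illustration of the power-up sequencing idea in VLSI-043, the Python sketch below groups power switches into turn-on stages so that the summed inrush estimate of each stage stays under a limit. This is not the framework from the paper: the `plan_power_up` helper, the fixed per-switch current estimates, and the greedy first-fit-decreasing grouping are all assumptions made for illustration. In a real design the inrush current depends on the virtual-rail voltage over time, which this sketch ignores.

```python
def plan_power_up(switch_currents_ma, inrush_limit_ma):
    """Group switches into sequential turn-on stages (first-fit decreasing).

    switch_currents_ma: dict mapping a hypothetical switch id to its
        estimated inrush contribution in mA (assumed constant here).
    inrush_limit_ma: maximum allowed summed inrush current per stage.
    Returns a list of stages, each a list of switch ids turned on together.
    """
    stages = []  # each entry: {"total": summed current, "switches": [ids]}
    # Largest contributors first; ties broken by id for determinism.
    ordered = sorted(switch_currents_ma.items(), key=lambda kv: (-kv[1], kv[0]))
    for switch_id, current in ordered:
        if current > inrush_limit_ma:
            raise ValueError(f"{switch_id} alone exceeds the inrush limit")
        for stage in stages:
            if stage["total"] + current <= inrush_limit_ma:
                stage["switches"].append(switch_id)
                stage["total"] += current
                break
        else:
            # No existing stage has room, so open a new (later) stage.
            stages.append({"total": current, "switches": [switch_id]})
    return [stage["switches"] for stage in stages]


currents = {"s0": 40, "s1": 30, "s2": 30, "s3": 20, "s4": 10}
sequence = plan_power_up(currents, inrush_limit_ma=60)
for index, stage in enumerate(sequence):
    print(f"stage {index}: {stage}")
```

Fewer stages mean a shorter power-up time, so packing each stage close to the limit is a simple stand-in for the time-minimization goal stated in the abstract.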
ETPL VLSI-044 Architecture and Design Flow for a Highly Efficient Structured ASIC
Abstract: As fabrication process technology continues to advance, mask set costs have become prohibitively expensive. Structured application-specific integrated circuits (sASICs) offer a middle ground in price and performance between ASICs and field-programmable gate arrays (FPGAs) by sharing masks across different designs. In this paper, two sASIC architectures are proposed, the first based on three-input lookup tables and the second on AOI22 gates. The sASICs are programmed using a standard-cell compatible design flow. They are customized using a minimum of three masks, i.e., two metals and one via. The area and delay of the sASIC are compared with ASICs and FPGAs. Results over a set of benchmark circuits show that our AOI22-based sASIC had an average 1.76x/1.41x increase in area/delay compared to ASICs, a considerable improvement compared with the 26.56x/5.09x increase for FPGAs. This is, to the best of our knowledge, the best performance reported in the literature for a practical sASIC. A prototype using the sASIC was fabricated using a universal machine control 0.13-µm mixed-mode/RF process. It was fully verified using scan and functional tests, and used in a demonstration system.
ETPL VLSI-045 Secure Dual-Core Cryptoprocessor for Pairings Over Barreto-Naehrig Curves on FPGA Platform
Abstract: This paper is devoted to the design and the physical security of a parallel dual-core flexible cryptoprocessor for computing pairings over Barreto-Naehrig (BN) curves. The proposed design is specifically optimized for field-programmable gate-array (FPGA) platforms. The design explores the in-built features of an FPGA device for achieving an efficient cryptoprocessor for computing 128-bit secure pairings. The work further pinpoints the vulnerability of those pairing computations to side-channel attacks and demonstrates experimentally that the power consumption of such devices can be used to attack these ciphers. Finally, we suggest a suitable countermeasure to overcome the respective weaknesses. The proposed secure cryptoprocessor needs 1 730 000, 1 206 000, and 821 000 cycles for the computation of Tate, ate, and optimal-ate pairings, respectively. The implementation results on a Virtex-6 FPGA device show that it consumes 23 k slices and computes the respective pairings in 11.93, 8.32, and 5.66 ms.
ETPL VLSI-046 In-Situ Method for TSV Delay Testing and Characterization Using Input Sensitivity Analysis
Abstract: In this paper, we propose a method and the required architecture for characterizing the propagation delays of the through-silicon vias (TSVs) in a 3-D IC. First of all, every two TSVs are paired up to form an oscillation ring with some peripheral circuits. Their joint performance can thus be measured roughly by the oscillation period of the ring. Next, we utilize a technique called sensitivity analysis to further derive the propagation delay of each individual TSV participating in an oscillation ring (a distilling process). In this process, we perturb the strength of the two TSV drivers, and then measure their effects in terms of the change in the oscillation ring's period. With some subsequent analysis, the propagation delay of each TSV can be revealed. On top of this scheme, we also present an architecture that can activate the performance characterization process of each test unit (which consists of two TSVs) one at a time in a proper sequence. The area overhead is only 18.97 equivalent two-input NAND gates per TSV, with which one gains the ability to profile the capacitances and the propagation delays of the TSVs on a 3-D IC.
ETPL VLSI-047 Low-Resolution DAC-Driven Linearity Testing of Higher Resolution ADCs Using Polynomial Fitting Measurements
Abstract: A low-cost linearity test methodology for high-resolution analog-to-digital converters (ADCs) is presented in this paper. Linearity testing of ADCs requires high-precision digital-to-analog conversion (DAC) capability, commonly with 3-bit higher resolution than the ADC under test. Further, a large number of ADC output data samples must be collected, making conventional histogram testing impractical for high-resolution ADCs with 18-24 bit precision. In the proposed test methodology, two low-precision and low-cost DACs are used to generate a high-resolution ADC test stimulus. Significant reductions in test cost and test time are achieved by using low-cost instrumentation and by making fewer measurements than required for a conventional histogram test. A least-squares-based polynomial fitting approach is used to determine the transfer function of the ADC under test. The generated transfer function is used to compute the non-linearity of the ADC accurately. No assumption is made regarding the linearity of the lower precision signal generators (DACs) used in the testing procedure. Software simulations and hardware experiments are performed to validate the proposed test methodology.
ETPL VLSI-048 Low-Cost Error Tolerance Scheme for 3-D CMOS Imagers
Abstract: This paper presents an error tolerance scheme for 3-D CMOS imagers that are constructed by stacking a pixel array of imager sensors, an analog-to-digital converter (ADC) array, and an image signal processor (ISP) array using microbumps (bumps) and through-silicon vias (TSVs). To deliver high-quality images in the presence of single or multiple bump, ADC, or TSV failures, we propose to interleave the connections from pixels to ADCs and recover the corrupted data in the ISPs. Key design parameters, such as the interleaving stride and the grouping ratio, are determined by analyzing the employed error correction algorithm. Architectural simulation results demonstrate that the error tolerance scheme enhances the effective yield of an exemplar 3-D imager from 44% to 97%.
ETPL VLSI-049 Computing Two-Pattern Test Cubes for Transition Path Delay Faults
Abstract: Considering full-scan circuits, incompletely-specified tests, or test cubes, are used for test data compression. When considering path delay faults, certain specified input values in a test cube are needed only for determining the lengths of the paths associated with detected faults. Path delay faults, and therefore, small delay defects, would still be detected if such values are unspecified. The goal of this paper is to explore the possibility of increasing the number of unspecified input values in a test set for path delay faults by unspecifying such values in order to make the test set more amenable to test data compression. Experimental results indicate that significant numbers of such values exist. The proposed procedure unspecifies them gradually to obtain a series of test sets with increasing numbers of unspecified values and decreasing path lengths. Experimental results also indicate that filling the unspecified values randomly (as with some test data compression methods) recovers some or all of the path lengths associated with detected path delay faults. The procedure uses a matching of the sets of detected faults for the comparison of path lengths ETPL VLSI-050 Integrated Energy-Harvesting Photodiodes With Diffractive Storage Capacitance
Abstract: Integrating energy-harvesting photodiodes with logic and exploiting on-die interconnect capacitance for energy storage can enable new, ultraminiaturized wireless systems. Unlike CMOS imager pixels, the proposed photodiode designs utilize p-diffusion fingers and are implemented in a conventional logic process. Also unlike specialized solar cell processes, the designs utilize the on-chip metal interconnect to form a diffraction grating above the p-diffusion fingers, which also provides capacitive energy storage. To explore the tradeoffs between optical efficiency and energy storage for integrated photodiodes, an array of photovoltaics with various diffractive storage capacitors was designed in a 90-nm CMOS logic process. The diffractive effects can be exploited to increase the photodiodes' response to off-axis illumination. Transient effects from interfacing the photodiodes with switched-capacitor DC-DC converters were examined, with measurements indicating a 50% reduction in the output voltage ripple due to the diffractive storage capacitance. A quantitative comparison between 90-nm and 0.35-µm CMOS logic processes for energy-harvesting capabilities was carried out. Measurements show an increase in power generation for the newer CMOS technology, though at the cost of reduced output voltage. One potential application for the integrated photodiodes is harvesting energy for a subdermal biomedical device.
ETPL VLSI-051 Fast Fixed-Outline 3-D IC Floorplanning With TSV Co-Placement
Abstract: Through-silicon vias (TSVs) are used to connect inter-die signals in a 3-D IC. Unlike conventional vias, TSVs occupy device area and are very large compared to logic gates. However, most previous 3-D floorplanners only view TSVs as points. As a result, whitespace redistribution is necessary for TSV insertion after the initial floorplan is computed, which leads to suboptimal layouts. In this paper, we propose a very efficient 3-D floorplanner that simultaneously floorplans the functional modules and places the TSVs, and optimizes the total wirelength under a fixed-outline constraint. Compared to the state-of-the-art 3-D floorplanner with TSV planning, our design consistently produces better floorplans with 15% shorter wirelength and 31% fewer TSVs on average. Our algorithm is extremely fast and only takes a few seconds to floorplan benchmarks with hundreds of modules, compared to the hours required by the previous state-of-the-art floorplanner.
ETPL VLSI-052 Reactivation Noise Suppression With Sleep Signal Slew Rate Modulation in MTCMOS Circuits
Abstract: Multi-threshold CMOS (MTCMOS) is commonly used for suppressing leakage currents in idle integrated circuits. Power and ground distribution network noise produced during SLEEP to ACTIVE mode transitions is an important reliability concern in MTCMOS circuits. Sleep signal slew rate modulation techniques for suppressing mode-transition noise are explored in this paper. A triple-phase sleep signal slew rate modulation (TPS) technique with a novel digital sleep signal generator is proposed. Reactivation time, mode-transition energy consumption, leakage power consumption, and layout area of different MTCMOS circuits are characterized under an equal-noise constraint. Influences of within-die and die-to-die parameter variations on the reactivation noise, time, and energy consumption of sleep signal slew rate modulated MTCMOS circuits are evaluated with a process-imperfection-aware robustness metric. The proposed triple-phase sleep signal slew rate modulation technique enhances the tolerance to process parameter fluctuations by up to 183.1× as compared to various alternative MTCMOS noise suppression techniques in a UMC 80-nm CMOS technology.
ETPL VLSI-053 Sub-mW LC Dual-Input Injection-Locked Oscillator for Autonomous WBSNs
Abstract: This paper presents a sub-mW, current-reused first-harmonic LC injection-locked oscillator (ILO) using an in-phase dual-input injection technique. It can be used as a power oscillator in the injection-locked transmitter of wireless biomedical sensor nodes (WBSNs) integrated into a wireless body area network. A prototype chip, implemented in a standard 0.13-µm CMOS process and occupying 200 × 380 µm, operates in the medical implantable communications service (MICS) band for medical implants. Measurement results show that the proposed ILO features a wide locking range of 800 MHz (150-950 MHz) at an input power of 0 dBm. More importantly, it has a high input sensitivity of -30 dBm to lock the 3-MHz bandwidth of the MICS band, while consuming only 660 µW at a 1-V supply. This ultralow power consumption enables autonomous WBSNs.
ETPL VLSI-054 Constant Delay Logic Style
Abstract: A constant delay (CD) logic style is proposed in this paper, targeting full-custom high-speed applications. The CD characteristic of this logic style, regardless of the logic type, makes it suitable for implementing complicated logic expressions such as addition. CD logic exhibits a unique characteristic where the output is pre-evaluated before the inputs from the preceding stage are ready. This feature offers a performance advantage over static and dynamic domino logic styles in a single-cycle multistage circuit block. Several design considerations, including timing window width adjustment and clock distribution, are discussed. Using 65-nm general-purpose CMOS technology, the proposed logic demonstrates an average speedup of 94% and 56% over static and dynamic domino logic, respectively, in five different logic gates. Simulation results of 8-bit ripple carry adders show that CD logic is 39% and 23% faster than the static and dynamic-based adders, respectively. CD logic also demonstrates 39% speedup and 64% (22%) energy-delay product (EDP) reduction from static logic at 100% (10%) data activity in 32-bit carry lookahead adders. For an 8-bit Wallace tree multiplier, CD logic achieves a similar speedup with at least 50% EDP reduction across all data activities.
ETPL VLSI-055 A Compact Clock Generator for Heterogeneous GALS MPSoCs in 65-nm CMOS Technology
Abstract: This paper presents an all-digital phase-locked loop (ADPLL) clock generator for globally asynchronous locally synchronous (GALS) multiprocessor systems-on-chip (MPSoCs). With its low power consumption of 2.7 mW and ultra-small chip area of 0.0078 mm², it can be instantiated per core for fine-grained power management such as DVFS. It is based on an ADPLL providing a multiphase clock signal from which core frequencies from 83 to 666 MHz with 50% duty cycle are generated by phase rotation and frequency division. The clock meets the specification for DDR2/DDR3 memory interfaces. Additionally, it provides a dedicated high-speed clock of up to 4 GHz for serial network-on-chip data links. Core frequencies can be changed arbitrarily within one clock cycle for fast dynamic frequency scaling applications. The performance, including statistical analysis of mismatch, has been verified by a prototype in 65-nm CMOS technology.
ETPL VLSI-056 A Colpitts CMOS Quadrature VCO Using Direct Connection of Substrates for Coupling
Abstract: A new low-phase-noise, low-power quadrature voltage-controlled oscillator (QVCO) using a differential Colpitts oscillator is presented. The proposed QVCO is composed of two identical current-switching differential Colpitts VCOs in which the first core VCO is coupled to the second in an in-phase manner, and the second core VCO is coupled to the first in an anti-phase manner. To couple the two core VCOs, the substrates of the cross-connected transistors as well as the substrates of MOS varactors are used, alleviating the need for any extra elements for coupling, which could add noise and increase power dissipation. A linear (sinusoidal) analysis is presented that confirms that the proposed circuit generates quadrature waveforms. The proposed coupling technique can be generalized to N differential Colpitts VCOs for multiphase signal generation. ETPL VLSI-057 A Self-Calibrated DLL-Based Clock Generator for an Energy-Aware EISC Processor
Abstract: This paper describes a low-jitter delay-locked loop (DLL)-based clock generator for dynamic frequency scaling in the extendable instruction set computing (EISC) processor. The DLL-based clock generator provides the system clock with frequencies of 0.5 to 8 times the reference clock, according to the workload of the EISC processor. The proposed analog self-calibration method and a phase detector with an auxiliary charge pump can effectively reduce the delay mismatch between delay cells in the voltage-controlled delay line and the static phase offset due to the current mismatch in the charge pump, respectively. The self-calibrated output waveform exhibits 9.7 ps of RMS jitter and 73.7 ps of peak-to-peak jitter at 120 MHz. The prototype clock generator implemented in a 0.18-µm CMOS process occupies an active area of 0.27 mm² and consumes 15.56 mA. ETPL VLSI-058 Clamping Virtual Supply Voltage of Power-Gated Circuits for Active Leakage Reduction and Gate-Oxide Reliability

Abstract: In an integrated circuit (IC) adopting a power-gating (PG) technique, the virtual supply voltage (VVDD) is susceptible to: 1) negative-bias temperature instability (NBTI) degradation that weakens the PG device over time and 2) temporal temperature variation that affects active leakage current (thus total current) of the IC. The PG device is sized to guarantee a minimum VVDD level over the chip lifetime. Thus, the NBTI degradation and the worst-case total current at high-temperature must be considered for sizing the PG device. This leads to higher VVDD (thus active leakage power) than necessary in early chip lifetime and/or at low temperature, negatively impacting the gate-oxide reliability of transistors. To reduce active leakage power increase and improve the gate-oxide reliability due to these effects, we propose two techniques that adjust the strength of a PG device based on its usage and IC's temperature at runtime. We demonstrate the efficacy of these techniques with an experimental setup using a 32-nm technology model in the presence of within-die spatial process and temperature variations. On an average of 100 die samples, they can reduce dynamic and active leakage power by up to 3.7% and 10% in early chip lifetime. Finally, these techniques also reduce the oxide failure rate by up to 5% across process corners over a period of 7 years. ETPL VLSI-059 10-bit 30-MS/s SAR ADC Using a Switchback Switching Method
Abstract: This brief presents a 10-bit 30-MS/s successive-approximation-register analog-to-digital converter (ADC) that uses a power-efficient switchback switching method. With respect to the monotonic switching method, the input common-mode voltage variation is reduced, which improves the dynamic offset and the parasitic capacitance variation of the comparator. The proposed switchback switching method does not consume any power at the first digital-to-analog converter switching, which can reduce the power consumption and design effort of the reference buffer. The prototype was fabricated in a 90-nm 1P9M CMOS technology. At 1-V supply and 30 MS/s, the ADC achieves a signal-to-noise-and-distortion ratio (SNDR) of 56.89 dB and consumes 0.98 mW, resulting in a figure-of-merit (FOM) of 57 fJ/conversion-step. The ADC core occupies an active area of only 190 × 525 µm². ETPL VLSI-060 Spur-Reduction Frequency Synthesizer Exploiting Randomly Selected PFD
Abstract: This brief presents a low-spur phase-locked loop (PLL) system for wireless applications. The low-spur frequency synthesizer randomizes the periodic ripples on the control voltage of the voltage-controlled oscillator to reduce the reference spur at the output of the PLL. A novel random clock generator is presented to perform the random selection of the phase frequency detector control for the charge pump in locked state. The proposed frequency synthesizer was fabricated in a TSMC 0.18-µm CMOS process. The proposed PLL achieved phase noise of -93 dBc/Hz with a 600-kHz offset frequency and reference spurs below -72 dBc. ETPL VLSI-061 Gain-Enhanced Monolithic Charge Pump With Simultaneous Dynamic Gate and Substrate Control
Abstract: This brief presents a gain-enhanced complementary metal-oxide-semiconductor (CMOS) charge pump (CP) circuit that dynamically controls the gate and substrate terminals of each pMOS pass transistor. The proposed control strategy frees the CP circuit from the threshold-voltage drops, the body effect, and the floating substrate terminals of pass devices. The on-resistance of each pass device is also reduced to improve the gain and the power efficiency of the CP circuit. Implemented in a 0.35-µm single n-well CMOS process, the proposed four-stage monolithic CP circuit can operate with a supply voltage down to 0.9 V and deliver a maximum output current of about 100 µA. The proposed CP circuit also achieves a high voltage gain of 4 with two complementary-phase nonoverlapping clock signals. ETPL VLSI-062 Embedding Repeaters in Silicon IPs for Cross-IP Interconnections
Abstract: During systems-on-a-chip (SoC) integration, silicon intellectual properties (IPs) are generally regarded as blockages to long interconnections that connect different IPs. With this constraint, conventional designs are forced to place those repeaters that drive long interconnections outside the IP. These designs either lead to a longer interconnection distance requiring more repeaters or result in a longer signal delay, since the interconnection wire is not appropriately segmented by the repeaters. To solve these problems, we designed the IPs such that designers can embed the repeaters in the IP for the SoC integration. In other words, it allows the cross-IP interconnections to be routed over the IP using repeaters inserted in the IP. The design concept, physical implementation, and application examples of the embedded repeaters are described in this brief. ETPL VLSI-063 RATS: Restoration-Aware Trace Signal Selection for Post-Silicon Validation
Abstract: Post-silicon validation is one of the most important and expensive tasks in modern integrated circuit design methodology. The primary problem governing post-silicon validation is the limited observability due to storage of a small number of signals in a trace buffer. The signals to be traced should be carefully selected in order to maximize restoration of the remaining signals. Existing approaches have two major drawbacks. They depend on partial restorability computations that are not effective in restoring maximum signal states. They also require long signal selection time due to inefficient computation as well as operating on gate-level netlist. We have proposed a signal selection approach based on total restorability at gate-level, which is computationally more efficient (10 times faster) and can restore up to three times more signals compared to existing methods. We have also developed a register transfer level signal selection approach, which reduces both memory requirements and signal selection time by several orders-of-magnitude. ETPL VLSI-064 Test Patterns of Multiple SIC Vectors: Theory and Application in BIST Schemes
Abstract: This paper proposes a novel test pattern generator (TPG) for built-in self-test. Our method generates multiple single-input change (MSIC) vectors in a pattern, i.e., each vector applied to a scan chain is an SIC vector. A reconfigurable Johnson counter and a scalable SIC counter are developed to generate a class of minimum transition sequences. The proposed TPG is flexible to both the test-per-clock and the test-per-scan schemes. A theory is also developed to represent and analyze the sequences and to extract a class of MSIC sequences. Analysis results show that the produced MSIC sequences have the favorable features of uniform distribution and low input transition density. The performances of the designed TPGs and the circuits under test with 45 nm are evaluated. Simulation results with ISCAS benchmarks demonstrate that MSIC can save test power and impose no more than 7.5% overhead for a scan design. It also achieves the target fault coverage without increasing the test length.
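The single-input-change property mentioned in the VLSI-064 abstract can be illustrated with a plain Johnson (twisted-ring) counter. The sketch below is only an illustration of why such a counter yields SIC sequences; it is not the reconfigurable Johnson counter or the scalable SIC counter of the paper, and the function names `johnson_sequence` and `hamming` are hypothetical.

```python
def johnson_sequence(width):
    """Return the 2*width states of a Johnson counter, starting from all zeros."""
    state = [0] * width
    states = []
    for _ in range(2 * width):
        states.append(tuple(state))
        # Shift by one position and feed the inverted last bit back to the front.
        state = [1 - state[-1]] + state[:-1]
    return states


def hamming(a, b):
    """Number of bit positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    seq = johnson_sequence(4)
    for current, following in zip(seq, seq[1:] + seq[:1]):
        # Each consecutive pair (including the wrap-around) differs in exactly one bit.
        print(current, "->", following, "distance", hamming(current, following))
```

Because every step changes exactly one bit, vectors taken from this sequence have low input transition density, which is the property the abstract associates with reduced test power.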

ETPL VLSI-065 Effective and Efficient Approach for Power Reduction by Using Multi-Bit Flip-Flops
Abstract: Power has become a burning issue in modern VLSI design. In modern integrated circuits, the power consumed by clocking gradually takes a dominant part. Given a design, we can reduce its power consumption by replacing some flip-flops with fewer multi-bit flip-flops. However, this procedure may affect the performance of the original circuit. Hence, flip-flop replacement without violating timing and placement capacity constraints becomes a quite complex problem. To deal with the difficulty efficiently, we have proposed several techniques. First, we perform a coordinate transformation to identify those flip-flops that can be merged and their legal regions. Second, we show how to build a combination table to enumerate possible combinations of flip-flops provided by a library. Finally, we use a hierarchical way to merge flip-flops. Besides power reduction, the objective of minimizing the total wirelength is also considered. The time complexity of our algorithm is $\Theta(n^{1.12})$, less than the empirical complexity of $\Theta(n^{2})$. According to the experimental results, our algorithm significantly reduces clock power by 20% to 30% and the running time is very short. In the largest test case, which contains 1 700 000 flip-flops, our algorithm only takes about 5 min to replace flip-flops and the power reduction can achieve 21%. ETPL VLSI-066 Reconfigurable Accelerator for the Word-Matching Stage of BLASTN
Abstract: BLAST is one of the most popular sequence analysis tools used by molecular biologists. It is designed to efficiently find similar regions between two sequences that have biological significance. However, because the size of genomic databases is growing rapidly, the computation time of BLAST, when performing a complete genomic database search, is continuously increasing. Thus, there is a clear need to accelerate this process. In this paper, we present a new approach for genomic sequence database scanning utilizing reconfigurable field programmable gate array (FPGA)-based hardware. In order to derive an efficient structure for BLASTN, we propose a reconfigurable architecture to accelerate the computation of the word-matching stage. The experimental results show that the FPGA implementation achieves a speedup around one order of magnitude compared to the NCBI BLASTN software running on a general purpose computer. ETPL VLSI-067 Architecturally Homogeneous Power-Performance Heterogeneous Multicore Systems
Abstract: Dynamic voltage and frequency scaling (DVFS), a widely adopted technique to ensure safe thermal characteristics while delivering superior energy efficiency, is rapidly becoming inefficient with technology scaling due to two critical factors: 1) inability to scale the supply voltage due to reliability concerns and 2) dynamic adaptations through DVFS cannot alter underlying power hungry circuit characteristics, designed for the nominal frequency. In this paper, we show that DVFS scaled circuits substantially lag in energy efficiency, by 22% to 86%, compared to ground up designs for target frequency levels. We propose architecturally homogeneous power-performance heterogeneous multicore systems, a fundamentally alternate means to design energy efficient multicore systems. Using a system level computer-aided design (CAD) approach, we seamlessly integrate architecturally identical cores, designed for different voltage-frequency domains. We use a combination of standard cell library based CAD flow and full system architectural simulation to demonstrate 11% to 22% improvement in energy efficiency using our design paradigm.

ETPL VLSI-068 Active Filter-Based Hybrid On-Chip DC-DC Converter for Point-of-Load Voltage Regulation
Abstract: An active filter-based on-chip DC-DC voltage converter for application to distributed on-chip power supplies in multivoltage systems is described in this paper. No inductor or output capacitor is required in the proposed converter. The area of the voltage converter is therefore significantly less than that of a conventional low-dropout (LDO) regulator. Hence, the proposed circuit is appropriate for point-of-load voltage regulation for noise sensitive portions of an integrated circuit. The performance of the circuit has been verified with Cadence Spectre simulations and fabricated with a commercial 110 nm complementary metal oxide semiconductor (CMOS) technology. The area of the voltage regulator is 0.015 $\mathrm{mm}^{2}$ and it delivers up to 80 mA of output current. The transient response with no output capacitor ranges from 72 to 192 ns. The parameter sensitivity of the active filter is also described. The advantages and disadvantages of the active filter-based, conventional switching, linear, and switched capacitor voltage converters are compared. The proposed circuit is an alternative to classical LDO voltage regulators, providing a means for distributing multiple local power supplies across an integrated circuit while maintaining high current efficiency and fast response time within a small area. ETPL VLSI-069 CusNoC: Fast Full-Chip Custom NoC Generation
Abstract: We propose a full-chip synthesis methodology to construct custom network-on-chips (CusNoCs) for NoC-based systems. The proposed scheme generates irregular network topologies for application-specific designs with known communication demands. In this method, processors and the communication architecture can be synthesized simultaneously in the floorplanning process, and thus it is called CusNoC. CusNoC synthesizes a custom NoC in two steps. The target network topology is first generated based on communication analysis. Processing elements are partitioned into groups such that the utility of routers will be maximized if a router is assigned to each group. In this way, the number of routers passed by a packet, or hops, is minimized, and so is the power consumption in the network. The final network topology is formed by properly connecting these groups. Wirelength-aware floorplanning is then carried out to optimize circuit size as well as wirelength. Experimental results show that CusNoC produces custom NoCs with better performance than previous methods while the computation time is significantly shorter. This method is also more scalable, which makes it ideal for complicated systems. ETPL VLSI-070 Cooperating Virtual Memory and Write Buffer Management for Flash-Based Storage Systems
Abstract: Flash memory is becoming the preferred choice of secondary storage in mobile devices and embedded systems. The performance of Flash memory is dictated by asymmetric speeds of read and write, limited number of erase times, and the absence of in-place updates. To improve the performance of Flash-based storage systems, the write buffer has been provided in Flash memories recently. At the same time, new virtual memory management strategies have been proposed in recent studies that consider the characteristics of Flash memory. Currently, approaches on these two memory layers are considered separately, which fails to explore the full potential of these two layers. In this paper, we propose cooperative management schemes for virtual memory and write buffer to maximize the performance of Flash-memory-based systems. Management on virtual memory is designed to exploit write buffer status via reordering of the write sequences. The proposed write buffer management scheme works seamlessly with the proposed virtual memory management scheme. Experimental results show that significant improvement in I/O performance and reduction of the number of erase and write operations can be achieved compared to the state-of-the-art approaches. ETPL VLSI-071 MDC FFT/IFFT Processor With Variable Length for MIMO-OFDM Systems
Abstract: This paper presents a multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length. Based on the MDC architecture, we propose to use radix-$N_{s}$ butterflies at each stage, where $N_{s}$ is the number of data streams, so that there is only one butterfly needed in each stage. Consequently, a 100% utilization rate in computational elements is achieved. Moreover, thanks to the simple control mechanism of the MDC, we propose simple memory scheduling methods for input data and output bit/set-reversing, which again results in a full utilization rate in memory usage. Since the memory requirements usually dominate the die area of FFT/inverse fast Fourier transform (IFFT) processors, the proposed scheme can effectively reduce the memory size and thus the die area as well. Furthermore, to apply the proposed scheme in practical applications, we let $N_{s}=4$ and implement a 4-stream FFT/IFFT processor with variable length including 2048, 1024, 512, and 128 for MIMO-OFDM systems. This processor can be used in IEEE 802.16 WiMAX and 3GPP long term evolution applications. The processor was implemented with a UMC 90-nm CMOS technology with a core area of 3.1 $\mathrm{mm}^{2}$. The power consumption at 40 MHz was 63.72/62.92/57.51/51.69 mW for 2048/1024/512/128-FFT, respectively, in the post-layout simulation. Finally, we analyze the complexity and performance of the implemented processor and compare it with other processors. The results show advantages of the proposed scheme in terms of area and power consumption. ETPL VLSI-072 Current-Reused 2.4-GHz Direct-Modulation Transmitter With On-Chip Automatic Tuning
Abstract: This paper presents the design, analysis, and experimental verification of a self-calibrating current-reused 2.4-GHz direct-modulation transmitter for short-range wireless applications. The key contributions are the design/analysis of a stacked power amplifier (PA)/voltage-controlled oscillator (VCO) architecture, the nonlinear frequency-dependent analysis of a Gilbert-cell-based root-mean-square detector, and an on-chip $LC$-tank calibration circuit that needs no analog-to-digital converter (ADC)/digital signal processor. The stacked architecture reduces the number of required regulators, utilizes supply headroom effectively, and allows for an ADC-less calibration loop that can dynamically tune the PA center frequency by sensing the transmitted signal. The very nature of direct-modulation architecture obviates additional high-purity signal generators, reducing complexity and allowing online calibration. The system was implemented in TSMC 0.18 $\mu\mathrm{m}$ CMOS, occupies 0.7 $\mathrm{mm}^{2}$ (TX) + 0.1 $\mathrm{mm}^{2}$ (self-tuning), and was measured in a QFN48 package on an FR4 PCB. Automatically correcting PA/VCO tank misalignment in this case yielded a $>4$ dB increase in output power. With the automatic tuning active, the transmitter delivers a measured output power $>0$ dBm to a 100-$\Omega$ differential load, and the system consumes 22.9 mA from a 1.8-V core-circuit supply.
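The power figures quoted in the VLSI-072 abstract can be related with two standard conversions: DC power is supply current times supply voltage, and a level in dBm (or a gain in dB) maps to a linear value through a power of ten. The sketch below applies only these textbook formulas to the numbers stated in the abstract. The helper names are hypothetical, and the resulting ratio is a rough lower bound derived here, not a result reported by the authors.

```python
def dc_power_mw(current_ma, supply_v):
    """DC power in milliwatts from supply current (mA) and supply voltage (V)."""
    return current_ma * supply_v


def dbm_to_mw(level_dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (level_dbm / 10)


def db_to_ratio(gain_db):
    """Convert a power gain in dB to a linear power ratio."""
    return 10 ** (gain_db / 10)


if __name__ == "__main__":
    p_dc = dc_power_mw(22.9, 1.8)   # current and supply quoted in the abstract
    p_out_min = dbm_to_mw(0.0)      # "> 0 dBm" means more than this value
    print(f"DC power: {p_dc:.2f} mW")
    print(f"Output power exceeds: {p_out_min:.2f} mW")
    print(f"Output/DC power ratio exceeds: {p_out_min / p_dc:.1%}")
    print(f"A 4 dB increase is a factor of {db_to_ratio(4.0):.2f} in power")
```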

ETPL VLSI-073 Reconfigurable Adaptive Singular Value Decomposition Engine Design for High-Throughput MIMO-OFDM Systems
Abstract: Singular value decomposition (SVD) is an optimal method to obtain spatial multiplexing gain in multi-input multi-output (MIMO) channels. However, the high cost of implementation and high decomposing latency of the SVD restrict its usage in current wireless communication applications. In this paper, we present a complete adaptive SVD algorithm and a reconfigurable architecture for high-throughput MIMO-orthogonal frequency division multiplexing systems. There are several proposed architectural design techniques: reconfigurable scheme, division-free adaptive step size scheme, early termination scheme, and data interleaving scheme. The reconfigurable scheme can support all antenna configurations in a MIMO system. The division-free adaptive step size and early termination schemes are used to effectively reduce the decomposing latency and improve hardware utilization. The data interleaving scheme helps to deal with several channel matrices concurrently. Besides, we propose an orthogonal reconstruction scheme to obtain more accurate SVD outputs, which greatly enhances the system performance. We apply our SVD design to IEEE 802.11n applications. This design is implemented and fabricated in UMC 90 nm 1P9M CMOS technology. The maximum operating frequency is measured to be 101.2 MHz, and the corresponding power dissipation is 125 mW. The core size is 2.17 $\mathrm{mm}^{2}$ and the die occupies 4.93 $\mathrm{mm}^{2}$. The chip result shows that the average latency is only 0.33% of the wireless local area network coherence time. Hence, the proposed reconfigurable adaptive SVD engine design is very suitable for high-throughput wireless communication applications. ETPL VLSI-074 The LUT-SR Family of Uniform Random Number Generators for FPGA Architectures
Abstract: Field-programmable gate array (FPGA) optimized random number generators (RNGs) are more resource-efficient than software-optimized RNGs because they can take advantage of bitwise operations and FPGA-specific features. However, it is difficult to concisely describe FPGA-optimized RNGs, so they are not commonly used in real-world designs. This paper describes a type of FPGA RNG called a LUT-SR RNG, which takes advantage of bitwise xor operations and the ability to turn lookup tables (LUTs) into shift registers of varying lengths. This provides a good resource-quality balance compared to previous FPGA-optimized generators, between the previous high-resource high-period LUT-FIFO RNGs and low-resource low-quality LUT-OPT RNGs, with quality comparable to the best software generators. The LUT-SR generators can also be expressed using a simple C++ algorithm contained within this paper, allowing 60 fully-specified LUT-SR RNGs with different characteristics to be embedded in this paper, backed up by an online set of very high speed integrated circuit hardware description language (VHDL) generators and test benches. ETPL VLSI-075 Exploring the Use of Emerging Nonvolatile Memory Technologies in Future FPGAs
Abstract: As new nonvolatile memory technologies become increasingly mature, there has been a growing interest in investigating their use in future field-programmable gate arrays (FPGAs). Similar to existing FPGAs with embedded Flash memory, future FPGAs can embed these new nonvolatile memories to persistently store configuration data. By comparing with prior work, we first propose the more appropriate design style for new nonvolatile configuration data storage memory. Moreover, this brief studies a dynamic random-access memory (DRAM)-based FPGA design strategy enabled by high-density embedded nonvolatile memory. Existing FPGAs do not use on-chip DRAM cells for configuration data storage mainly because DRAM self-refresh involves destructive DRAM read. This problem can be solved if we use embedded nonvolatile memory as primary FPGA configuration data storage and externally refresh on-chip DRAM cells. Analysis and simulations have been carried out to demonstrate the potential advantages of such a design strategy. ETPL VLSI-076 Broadside and Skewed-Load Tests Under Primary Input Constraints
Abstract: Tester limitations may impose certain constraints on the primary input vectors applicable as part of a two-pattern test for delay faults. Under these constraints, the primary input vectors may be held constant, or the second primary input vector of a test may be obtained by a single shift of a scan chain relative to the first. The goal of this brief is to study the differences in achievable transition fault coverage between various primary input constraints that are similar to the commonly used ones of holding or shifting primary input vectors. This brief also studies the possibility of combining the constraints in order to increase the transition fault coverage. The combination requires a fixed and circuit-independent hardware structure similar to the case where shifting of primary input vectors is used. This study is done using test sets that consist of both broadside and skewed-load tests in order to maximize the transition fault coverage. ETPL VLSI-078 Supply Noise Suppression by Triple-Well Structure
Abstract: This brief discusses the impact of twin- and triple-well structures on power supply noise, and a substrate model for simulating the power supply noise. We observed $V_{\mathrm{ss}}$ noise reduction by the resistive network of the p-substrate and $V_{\mathrm{dd}}$ noise reduction by the junction capacitance of a triple-well structure on a 90-nm test chip. Measurement results also showed that the total noise reduction of a triple-well structure is superior to that of a twin-well structure. The measurement results correlate well with the results obtained from the power supply noise simulation using a hierarchical resistive mesh model. Our simulation-based verification indicates that in common CMOS design, a triple-well structure can reduce the power supply drop by 10% to 40% or the decoupling capacitance area by 5% to 10%. We also verified that supply drop sensitivity to variation of the well junction capacitance is sufficiently small and that supply noise reduction using a triple-well structure is robust to process variation. ETPL VLSI-079 Software-Based Self Test Methodology for On-Line Testing of L1 Caches in Multithreaded Multicore Architectures
Abstract: The flexibility that allows the application of different March tests is a critical requirement for on-line testing of memory arrays. In a previous study, we have introduced a low-cost software-based self test (SBST) program development methodology for on-line periodic testing of L1 caches that utilizes direct cache access (DCA) instructions and exploits the native monitoring hardware available in modern architectures. In this brief, we discuss a multithreaded optimization of this SBST methodology that exploits the thread level parallelism of multithreaded multicore architectures in order to speed up March test execution by elaborating the low level multiple sub-bank cache organization. The effectiveness of the methodology and its multithreaded optimization is demonstrated on the L1 caches of OpenSPARC T1 processor. Our results showed a speedup of more than 1.7 when the multithreaded optimization is applied and an acceptable performance overhead (less than 11%), even in intensive periodic test scenarios.
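As a reminder of what a March test does, here is a minimal software model of the well-known March C- algorithm run against a simulated memory with an optional stuck-at fault. It is a sketch for illustration only: it does not use direct cache access instructions, sub-bank scheduling, or multithreading as in the methodology described in the VLSI-079 abstract, and the class and function names (`FaultyMemory`, `run_march`) are hypothetical.

```python
class FaultyMemory:
    """Bit-addressable memory model; stuck_at maps an address to a forced value."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = dict(stuck_at or {})

    def write(self, addr, value):
        # A stuck-at cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


# March C-: (w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); (r0)
MARCH_C_MINUS = [
    ("up", [("w", 0)]),
    ("up", [("r", 0), ("w", 1)]),
    ("up", [("r", 1), ("w", 0)]),
    ("down", [("r", 0), ("w", 1)]),
    ("down", [("r", 1), ("w", 0)]),
    ("up", [("r", 0)]),
]


def run_march(memory, size, elements=MARCH_C_MINUS):
    """Apply each March element to every address; return sorted failing addresses."""
    failing = set()
    for direction, ops in elements:
        addresses = range(size) if direction == "up" else range(size - 1, -1, -1)
        for addr in addresses:
            for op, value in ops:
                if op == "w":
                    memory.write(addr, value)
                elif memory.read(addr) != value:
                    failing.add(addr)
    return sorted(failing)


if __name__ == "__main__":
    print(run_march(FaultyMemory(8), 8))
    print(run_march(FaultyMemory(8, {3: 1}), 8))
```

Keeping the March elements in a data table rather than in code is one simple way to get the flexibility the abstract asks for, since a different March test only needs a different table.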

ETPL VLSI-080 Design of Ternary Logic Combinational Circuits Based on Quantum Dot Gate FETs
Abstract: In this paper, we discuss logic circuit designs using the circuit model of three-state quantum dot gate field effect transistors (QDGFETs). QDGFETs produce one intermediate state between the two normal stable ON and OFF states due to a change in the threshold voltage over this range. We have developed a simplified circuit model that accounts for this intermediate state. Interesting logic can be implemented using QDGFETs. In this paper, we discuss the designs of various two-input three-state QDGFET gates, including NAND- and NOR-like operations, and their application in different combinational circuits such as decoders, multipliers, and adders. The increased number of states in three-state QDGFETs increases the bit-handling capability of the device and helps to handle more bits at a time with fewer circuit elements. ETPL VLSI-081 Parametric DFM Solution for Analog Circuits: Electrical-Driven Hotspot Detection, Analysis, and Correction Flow
Abstract: As VLSI technology pushes into advanced nodes, designers and foundries have exposed a hitherto insignificant set of yield problems. To combat yield failures, the semiconductor industry has deployed new tools and methodologies commonly referred to as design for manufacturing (DFM). Most of the early DFM efforts concentrated on catastrophic failures, or physical DFM problems. Recently, there has been an increased emphasis on parametric yield issues, referred to as electrical-DFM (e-DFM). In this paper, we present a complete e-DFM solution that detects, analyzes, and fixes electrical hotspots (e-hotspots) within an analog circuit design that are caused by different process variations. Novel algorithms are proposed to implement the engines used to develop this solution. The solution is examined on a 130-nm parametrically-failing level shifter circuit, and verified with silicon wafer measurements that confirm the existence of parametric yield issues in the design. Additional experiments are applied on a 65-nm industrial operational amplifier and voltage control oscillator (VCO). E-hotspot devices with a 27.7% variation in dc current are identified. After fixing the e-hotspots, the dc current variation in these devices is dramatically reduced to 7%, which meets the designer acceptance criteria, while saving the original VCO specifications. ETPL VLSI-082 Unified Capture Scheme for Small Delay Defect Detection and Aging Prediction
Abstract: Small delay defect (SDD) and aging-induced circuit failure are both prominent reliability concerns for nanoscale integrated circuits. Faster-than-at-speed testing is effective on SDD detection in manufacturing testing, which is always implemented by designing a suite of test signal generation circuits on the chip. Meanwhile, the integration of online aging sensors is becoming attractive in monitoring aging-induced delay degradation in the runtime. These design requirements, if implemented in separate ways, will increase the complexity of a reliable design and consume more die area. In this paper, a unified capture scheme is proposed to generate programmable clock signals for the detection of both SDDs and circuit aging. Our motivation arises from the observations that SDD detection and online aging prediction both need to capture circuit response ahead of the functional clock. The proposed aging-resistant design method enables the offline test circuit to be reused in online operations. Reversed short channel effect is also exploited to make the underlying circuit resilient to process variations. The proposed scheme is validated by intensive HSPICE simulations. Experimental results demonstrate the effectiveness in terms of low area, power, and performance overheads.

ETPL VLSI-083 Novel MIMO Detection Algorithm for High-Order Constellations in the Complex Domain
Abstract: A novel detection algorithm with an efficient VLSI architecture featuring efficient operation over infinite complex lattices is proposed. The proposed design results in the highest throughput, the lowest latency, and the lowest energy compared to the complex-domain VLSI implementations to date. The main innovations are a novel complex-domain means of expanding/visiting the intermediate nodes of the search tree on demand, rather than exhaustively, as well as a new distributed sorting scheme to keep track of the best candidates at each search phase. Its support of unbounded infinite lattice decoding distinguishes the present method from previous K-Best strategies and also allows its complexity to scale sublinearly with the modulation order. Since the expansion and sorting cores are data-driven, the architecture is well suited for a pipelined parallel VLSI implementation. The proposed algorithm is used to fabricate a 4×4, 64-QAM complex multiple-input-multiple-output detector in a 0.13-µm CMOS technology, achieving a clock rate of 417 MHz with a core area of 340 kgates. The chip test results prove that the fabricated design can sustain a throughput of 1 Gb/s with energy efficiency of 110 pJ/bit, the best numbers reported to date. ETPL VLSI-084 High-Throughput 0.13-µm CMOS Lattice Reduction Core Supporting 880 Mb/s Detection
Abstract: This paper presents the first silicon-proven implementation of a lattice reduction (LR) algorithm, which achieves maximum likelihood diversity. The implementation is based on a novel hardware-optimized variant of the Lenstra, Lenstra, and Lovasz (LLL) algorithm, which significantly reduces its complexity by replacing all the computationally intensive LLL operations (multiplication, division, and square root) with low-complexity additions and comparisons. The proposed VLSI design utilizes a pipelined architecture that produces an LR-reduced matrix set every 40 cycles, which is a 60% reduction compared to current state-of-the-art LR field-programmable gate array implementations. The 0.13-μm CMOS LR core presented in this paper achieves a clock rate of 352 MHz, and thus is capable of sustaining a throughput of 880 Mb/s for 64-QAM multiple-input-multiple-output detection with superior performance while dissipating 59.4 mW at 1.32 V supply.

ETPL VLSI-085 Study of Through-Silicon-Via Impact on the 3-D Stacked IC Layout
Abstract: The technology of through-silicon vias (TSVs) enables fine-grained integration of multiple dies into a single 3-D stack. TSVs occupy significant silicon area due to their sheer size, which has a great effect on the quality of 3-D integrated chips (ICs). Whereas well-managed TSVs alleviate routing congestion and reduce wirelength, excessive or ill-managed TSVs increase the die area and wirelength. In this paper, we investigate the impact of the TSV on the quality of 3-D IC layouts. Two design schemes, namely TSV co-placement (irregular TSV placement) and TSV site (regular TSV placement), and accompanying algorithms to find and optimize locations of gates and TSVs are proposed for the design of 3-D ICs. Two TSV assignment algorithms are also proposed to enable the regular TSV placement. Simulation results show that the wirelength of 3-D ICs is shorter than that of 2-D ICs by up to 25%.

ETPL VLSI-086 Design of Hardware Function Evaluators Using Low-Overhead Nonuniform Segmentation With Address Remapping

Abstract: In the piecewise function evaluation with polynomial approximation, nonuniform segmentation can effectively reduce the size of lookup tables for some arithmetic functions compared to uniform segmentation approaches, at the cost of the extra segment address (index) encoder that results in area and delay overhead. Also, it is observed that the nonuniform segmentation reflects a design tradeoff between the ROM size and the area cost of the subsequent arithmetic computation hardware. In this paper, we propose a new nonuniform segmentation method that searches for the optimal segmentation scheme with the goal of minimized ROM, total area, or delay. For some high-variation arithmetic functions, the proposed segmentation method achieves significant area reduction compared to the uniform segmentation method. We also demonstrate the design tradeoff among uniform and nonuniform segmentation, and degree-one and degree-two polynomial approximations, with respect to precision ranging from 12 to 32 bits for the elementary function of reciprocal.

ETPL VLSI-087 Statistical Functional Yield Estimation and Enhancement of CNFET-Based VLSI Circuits
Abstract: Carbon nanotube field effect transistors (CNFETs) show great promise as extensions to silicon CMOS. However, imperfections, which are mainly related to carbon nanotubes (CNTs) growth process, result in metallic and nonuniform CNTs leading to significant functional yield reduction. This paper presents a comprehensive technique for statistical functional yield estimation and enhancement of CNFET-based VLSI circuits. Based on experimental data extracted from aligned CNTs, we propose a compact statistical model to estimate the failure probability of a CNFET. Using the proposed failure model, we show that enhancing the CNT synthesis process alone cannot achieve acceptable functional yield for upcoming CNFET-based VLSI circuits. We propose a technique which is based on replacing each transistor by series-parallel transistor structures to reduce the failure probability of CNFETs in the presence of metallic and nonuniform CNTs. The technique is adapted to use single directional independence, which is inherent in aligned CNTs, to enhance the functional yield as validated by theoretical analysis and simulation results. Tradeoffs between failure probability reduction and design overheads such as area and current drive are explored. As demonstrated by extensive simulation results, the proposed technique achieves 80% functional yield in CNFET technology at the cost of 7.5X area and 34% current drive overheads if the CNT density and the fraction of semiconducting CNTs are improved to 200 CNTs per μm and 99.99%, respectively.

ETPL VLSI-088 Theoretical Modeling of Elliptic Curve Scalar Multiplier on LUT-Based FPGAs for Area and Speed
Abstract: This paper uses a theoretical model to approximate the delay of different characteristic two primitives used in an elliptic curve scalar multiplier architecture (ECSMA) implemented on k input lookup table (LUT)-based field-programmable gate arrays. Approximations are used to determine the delay of the critical paths in the ECSMA. This is then used to theoretically estimate the optimal number of pipeline stages and the ideal placement of each stage in the ECSMA. This paper illustrates suitable scheduling for performing point addition and doubling in a pipelined data path of the ECSMA. Finally, detailed analyses, supported with experimental results, are provided to design the fastest scalar multiplier over generic curves. Experimental results for GF(2^163) show that, when the ECSMA is suitably pipelined, the scalar multiplication can be performed in only 9.5 μs on a Xilinx Virtex V. Notably the design has an area which is significantly smaller than other reported high-speed designs, which is due to the better LUT utilization of the underlying field primitives.

ETPL VLSI-089 Architecture for Real-Time Nonparametric Probability Density Function Estimation
Abstract: Adaptive systems are increasing in importance across a range of application domains. They rely on the ability to respond to environmental conditions, and hence real-time monitoring of statistics is a key enabler for such systems. Probability density function (PDF) estimation has been applied in numerous domains; computational limitations, however, have meant that proxies are often used. Parametric estimators attempt to approximate PDFs based on fitting data to an expected underlying distribution, but this is not always ideal. The density function can be estimated by rescaling a histogram of sampled data, but this requires many samples for a smooth curve. Kernel-based density estimation can provide a smoother curve from fewer data samples. We present a general architecture for nonparametric PDF estimation, using both histogram-based and kernel-based methods, which is designed for integration into streaming applications on field-programmable gate arrays (FPGAs). The architecture employs heterogeneous resources available on modern FPGAs within a highly parallelized and pipelined design, and is able to perform real-time computation on sampled data at speeds of over 250 million samples per second, while extracting a variety of statistical properties.

ETPL VLSI-090 Symbolic Moment Computation for Statistical Analysis of Large Interconnect Networks
Abstract: The shrinking technology feature size and dense large-scale integration make process variation a challenging issue directly confronting the latest design automation tools. Process variation causes severe variation in interconnect networks, including very large-scale integrated interconnect structures, such as clock trees, clock mesh, power-ground networks, and other wiring structures in 3-D integrated circuits. The traditional moment computation techniques are only partly useful for analyzing such variational problems; their computational efficiency cannot meet the quickly rising needs, such as statistical analysis. This paper presents a novel symbolic moment calculator (SMC) for variational interconnect analysis. The moment calculator is constructed in a regular data structure that incorporates binary decision diagrams for data storage and computation. Given an interconnect circuit, such a computation diagram has to be constructed only once and can be repeatedly invoked for computation of moments with varying parameter values. Also, the SMC is friendly to interconnect synthesis in that it can be incrementally modified according to the modifications made to the circuit structure. Applications of the SMC for fast moment computation, sensitivity analysis, and statistical timing analysis are addressed. Significant efficiency is demonstrated compared to other existing methods.

ETPL VLSI-091 C-Based Complex Event Processing on Reconfigurable Hardware
Abstract: This brief presents an efficient complex event-processing framework, designed to process a large number of sequential events on field-programmable gate arrays (FPGAs). Unlike conventional structured query language based approaches, our approach features logic automation constructed with a new C-based event language that supports regular expressions on the basis of C functions, so that a wide variety of event-processing applications can be efficiently mapped to FPGAs. Evaluations on an FPGA-based network interface card show that we can achieve 12.3 times better event-processing performance than CPU software in a financial trading application.
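As a rough illustration of what a regular expression over events built from ordinary functions can look like, the sketch below matches the pattern "one or more rises followed by a fall" on a stream of trade prices. It is written in Python for brevity and is not the paper's C-based event language; the function names, the predicates, and the price values are all assumptions made for this example.

```python
def match_plus_then(events, repeat_pred, final_pred):
    """Return the positions where the pattern `repeat_pred+ final_pred` completes.

    Each predicate looks at a pair (previous event, current event), so the
    pattern is a regular expression over event transitions.
    """
    run = 0    # automaton state: length of the current run of repeat_pred matches
    hits = []
    for i in range(1, len(events)):
        prev, cur = events[i - 1], events[i]
        if repeat_pred(prev, cur):
            run += 1
        else:
            if run > 0 and final_pred(prev, cur):
                hits.append(i)
            run = 0
    return hits


# Hypothetical predicates for a financial trading stream.
def is_rise(prev, cur):
    return cur > prev


def is_fall(prev, cur):
    return cur < prev


PRICES = [10, 11, 12, 11, 11, 13, 12]   # illustrative values only
PEAKS = match_plus_then(PRICES, is_rise, is_fall)
```

Because the matcher keeps only a small counter as state and inspects each event once, the same structure maps naturally onto a hardware state machine, which is the general reason streaming patterns of this kind suit FPGAs.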

ETPL VLSI-092 Task Allocation on Nonvolatile-Memory-Based Hybrid Main Memory
Abstract: In this paper, we consider the task allocation problem on a hybrid main memory composed of nonvolatile memory (NVM) and dynamic random access memory (DRAM). Compared to the conventional memory technology DRAM, the emerging NVM has excellent energy performance since it consumes orders of magnitude less leakage power. On the other hand, most types of NVMs come with the disadvantages of much shorter write endurance and longer write latency as opposed to DRAM. By leveraging the energy efficiency of NVM and long write endurance of DRAM, this paper explores task allocation techniques on hybrid memory for multiple objectives such as minimizing the energy consumption, extending the lifetime, and minimizing the memory size. The contributions of this paper are twofold. First, we design the integer linear programming (ILP) formulations that can solve different objectives optimally. Then, we propose two sets of heuristic algorithms including three polynomial time offline heuristics and three online heuristics. Experiments show that compared to the optimal solutions generated by the ILP formulations, the offline heuristics can produce near-optimal results.

ETPL VLSI-093 BilRC: An Execution Triggered Coarse Grained Reconfigurable Architecture
Abstract: We present Bilkent reconfigurable computer (BilRC), a new coarse-grained reconfigurable architecture (CGRA) employing an execution-triggering mechanism. A control data flow graph language is presented for mapping the applications to BilRC. The flexibility of the architecture and the computation model are validated by mapping several real-world applications. The same language is also used to map applications to a 90-nm field-programmable gate array (FPGA), giving exactly the same cycle count performance. It is found that BilRC reduces the configuration size about 33 times. It is synthesized with 90-nm technology, and typical applications mapped on BilRC run about 2.5 times faster than those on FPGA. It is found that the cycle counts of the applications for a commercial very long instruction word digital signal processor are 1.9 to 15 times higher than those of BilRC. It is also found that BilRC can run the inverse discrete cosine transform algorithm almost 3 times faster than the closest CGRA in terms of cycle count. Although the area required for BilRC processing elements is larger than that of existing CGRAs, this is mainly due to the segmented interconnect architecture of BilRC, which is crucial for supporting a broad range of applications.

ETPL VLSI-094 Block-Circulant RS-LDPC Code: Code Construction and Efficient Decoder Design
Abstract: This brief presents a method for constructing block-circulant (BC) Reed-Solomon-based low-density parity-check (RS-LDPC) codes and an efficient decoder design. The proposed construction method results in a BC form of a parity-check matrix from a random parity-check matrix for RS-LDPC codes. A decoder architecture and switch network for BC-RS-LDPC code are then developed based on the new BC parity-check matrix. Thus, an efficient decoder architecture dedicated to a promising class of high-performance BC-RS-LDPC codes is presented for the first time. Moreover, a (2048, 1723) BC-RS-LDPC decoder architecture is designed to demonstrate the efficiency of the presented techniques. Synthesis results show that the proposed decoder requires 1.3-M gates and can operate at 450 MHz to achieve a data throughput of 41 Gb/s with eight iterations.
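To make the term block-circulant concrete, the following Python sketch assembles a small parity-check matrix from a table of cyclic shifts, where each entry expands into a cyclically shifted identity block and -1 marks an all-zero block. This is a generic illustration of the matrix form only; it does not reproduce the paper's construction from an RS-LDPC matrix, and the shift table, block size, and function names are assumptions.

```python
def circulant_permutation(size, shift):
    """Return the size x size identity matrix with every row's 1 moved right cyclically by `shift`."""
    return [[1 if col == (row + shift) % size else 0 for col in range(size)]
            for row in range(size)]


def zero_block(size):
    """Return a size x size all-zero block."""
    return [[0] * size for _ in range(size)]


def block_circulant_matrix(shifts, size):
    """Expand a table of shift values into a full binary matrix; -1 means a zero block."""
    rows = []
    for shift_row in shifts:
        blocks = [zero_block(size) if s < 0 else circulant_permutation(size, s)
                  for s in shift_row]
        for r in range(size):
            # Concatenate row r of every block in this block-row.
            rows.append([bit for block in blocks for bit in block[r]])
    return rows


# Hypothetical 2 x 3 shift table with 4 x 4 blocks (illustrative values only).
SHIFTS = [[0, 1, -1],
          [2, -1, 3]]
H = block_circulant_matrix(SHIFTS, size=4)   # 8 x 12 binary matrix
```

Since every nonzero block is just a cyclic shift, a decoder can in general route messages with shifters rather than an arbitrary interconnect, which is the usual motivation for this matrix form.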

ETPL VLSI-095 Fault Demotion Using Reconfigurable Slack (FaDReS)
Abstract: We propose an active dynamic redundancy-based fault-handling approach exploiting the partial dynamic reconfiguration capability of static random-access memory-based field-programmable gate arrays. Fault detection is accomplished in a uniplex hardware arrangement while an autonomous fault isolation scheme is employed, which neither requires test vectors nor suspends the computational throughput. The deterministic flow of the fault-handling scheme achieves an improved recovery in a bounded number of reconfigurations. This approach extends existing signal processing properties to accommodate fault handling, and is validated by implementing an H.263 video encoder discrete cosine transform (DCT) block. The peak signal-to-noise ratio measure of the video sequences indicates fault tolerance in the DCT block with only limited quality degradation, during the isolation and recovery phases spanning a few frames.

ETPL VLSI-096 MAEPER: Matching Access and Error Patterns With Error-Free Resource for Low Vcc L1 Cache
Abstract: Large SRAMs are the practical bottleneck to achieving a low supply voltage, because they suffer from process variation-induced bit errors at a low supply voltage. In this paper, we present an error-resilient cache architecture that resolves the drawback of previous approaches, i.e., the performance degradation at a low supply voltage which is caused by cache misses in accesses to faulty resources. We utilize cache access locality and error-free resources in a cost-effective manner. First, we classify cache lines into fully and partially accessed groups and apply appropriate methods to each group. For the partially accessed group, we propose a method of matching memory access behavior and error locations with intra-cache line word-level remapping. In order to reduce the area overhead used to store the cache access information history, we present an access pattern-learning line-fill buffer (LFB). For the fully accessed group, we propose the utilization of error-free assist functions in the cache, i.e., an LFB and a victim cache with no process variation-induced error at the target minimum supply voltage. We also present an error-aware prefetch method that allows us to utilize the error-free victim cache to achieve a further reduction in cache misses due to faulty resources. Experimental results show that the proposed method gives an average 32.6% reduction in cycles per instruction at an error rate of 0.2% with a small area overhead of 8.2%.
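One way to picture intra-cache-line word-level remapping is as a placement problem: the words a program actually touches must land on word slots that are error-free at the target voltage. The Python sketch below is a simplified reading of that idea and not the paper's algorithm; the function name, the eight-word line, and the example access and fault sets are assumptions.

```python
def remap_words(accessed, faulty, words_per_line=8):
    """Map each accessed logical word to an error-free physical slot.

    Returns a dict {logical word: physical slot}, or None when the line has
    fewer error-free slots than accessed words (the line cannot be repaired
    by remapping alone).
    """
    healthy = [w for w in range(words_per_line) if w not in faulty]
    if len(accessed) > len(healthy):
        return None

    # Words whose own slot is error-free stay where they are.
    mapping = {w: w for w in accessed if w not in faulty}

    # Remaining error-free slots, lowest index first.
    free = [w for w in healthy if w not in mapping]

    # Words sitting on a faulty slot move to the next free error-free slot.
    for w in sorted(accessed):
        if w in faulty:
            mapping[w] = free.pop(0)
    return mapping


# Illustrative case: words 0-2 are accessed, slots 1 and 5 are faulty.
EXAMPLE = remap_words({0, 1, 2}, {1, 5})
```

In this reading, a partially accessed line survives as long as the number of accessed words does not exceed the number of error-free slots, which is why only fully accessed lines would need a separate error-free resource.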
