Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com
13 Years of Experience Automated Services 24/7 Help Desk Support Experience & Expertise Developers Advanced Technologies & Tools Legitimate Member of all Journals Having 1,50,000 Successive records in all Languages More than 12 Branches in Tamilnadu, Kerala & Karnataka. Ticketing & Appointment Systems. Individual Care for every Student. Around 250 Developers & 20 Researchers
227-230 Church Road, Anna Nagar, Madurai 625020. 0452-4390702, 4392702, + 91-9944793398. info@elysiumtechnologies.com, elysiumtechnologies@gmail.com
S.P.Towers, No.81 Valluvar Kottam High Road, Nungambakkam, Chennai - 600034. 044-42072702, +91-9600354638, chennai@elysiumtechnologies.com
15, III Floor, SI Towers, Melapudur main Road, Trichy 620001. 0431-4002234, + 91-9790464324. trichy@elysiumtechnologies.com
577/4, DB Road, RS Puram, Opp to KFC, Coimbatore 641002 0422- 4377758, +91-9677751577. coimbatore@elysiumtechnologies.com
1st Floor, A.R.IT Park, Rasi Color Scan Building, Ramanathapuram - 623501. 04567-223225, +919677704922.ramnad@elysiumtechnologies.com
Plot No: 4, C Colony, P&T Extension, Perumal puram, Tirunelveli627007. 0462-2532104, +919677733255, tirunelveli@elysiumtechnologies.com
74, 2nd floor, K.V.K Complex,Upstairs Krishna Sweets, Mettur Road, Opp. Bus stand, Erode-638 011. 0424-4030055, +919677748477 erode@elysiumtechnologies.com
No: 88, First Floor, S.V.Patel Salai, Pondicherry 605 001. 0413 4200640 +91-9677704822 pondy@elysiumtechnologies.com
TNHB A-Block, D.no.10, Opp: Hotel Ganesh Near Busstand. Salem 636007, 0427-4042220, +91-9894444716. salem@elysiumtechnologies.com
Abstract: As scaling DRAM cells becomes more challenging and energy-efficient DRAM chips are in high demand, the DRAM industry has started to undertake an alternative approach to address these looming issues-that is, to vertically stack DRAM dies with through-silicon-vias (TSVs) using 3-D-IC technology. Furthermore, this emerging integration technology also makes heterogeneous die stacking in one DRAM package possible. Such a heterogeneous DRAM chip provides a unique, promising opportunity for computer architects to contemplate a new memory hierarchy for future system design. In this paper, we study how to design such a heterogeneous DRAM chip for improving both performance and energy efficiency. In particular, we found that, if we want to design an SRAM row cache in a DRAM chip, simple stacking alone cannot address the majority of traditional SRAM row cache design issues. In this paper, to address these issues, we propose a novel floorplan and several architectural techniques that fully exploit the benefits of 3-D stacking technology. Our multi-core simulation results with memoryintensive applications suggest that, by tightly integrating a small row cache with its corresponding DRAM array, we can improve performance by 30% while saving dynamic energy by 31%. ETPL VLSI-002 A Low-Complexity Turbo Decoder Architecture for Energy-Efficient Wireless Sensor Networks
Abstract: Turbo codes have recently been considered for energy-constrained wireless communication applications, since they facilitate a low transmission energy consumption. However, in order to reduce the overall energy consumption, lookup table-log-BCJR (LUT-Log-BCJR) architectures having a low processing energy consumption are required. In this paper, we decompose the LUT-Log-BCJR architecture into its most fundamental add compare select (ACS) operations and perform them using a novel low-complexity ACS unit. We demonstrate that our architecture employs an order of magnitude fewer gates than the most recent LUT-Log-BCJR architectures, facilitating a 71% energy consumption reduction. Compared to state-of-the-art maximum logarithmic Bahl-Cocke-Jelinek-Raviv implementations, our approach facilitates a 10% reduction in the overall energy consumption at ranges above 58 m. ETPL VLSI-003 Pipelined Radix- 2k Feedforward FFT Architectures
Abstract: The appearance of radix-22 was a milestone in the design of pipelined FFT hardware architectures. Later, radix-22 was extended to radix-2k . However, radix-2k was only proposed for singlepath delay feedback (SDF) architectures, but not for feedforward ones, also called multi-path delay commutator (MDC). This paper presents the radix-2k feedforward (MDC) FFT architectures. In feedforward architectures radix-2k can be used for any number of parallel samples which is a power of two. Furthermore, both decimation in frequency (DIF) and decimation in time (DIT) decompositions can be used. In addition to this, the designs can achieve very high throughputs, which makes them suitable for the most demanding applications. Indeed, the proposed radix-2k feedforward architectures require fewer hardware resources than parallel feedback ones, also called multi-path delay feedback (MDF), when several samples in parallel must be processed. As a result, the proposed radix-2k feedforward architectures not only offer an attractive solution for current applications, but also open up a new research line on feedforward structures.
Abstract: This paper proposes a data bandwidth-oriented motion estimation design for resource-limited mobile video applications using an integrated bandwidth rate distortion optimization framework. This framework predicts and allocates the appropriate data bandwidth for motion estimation under a limited bandwidth supply to fit a dynamically changing bandwidth supply. The simulation results show that our proposed algorithm can achieve 66% and 41% memory bandwidth savings while maintaining an equivalent rate-distortion performance and meeting real-time targets, when compared with conventional approaches for low-motion and high-motion D1 (704 × 576)-size video, respectively. The final implementation costs 122 K gate counts with TSMC 0.13- m CMOS technology and consumes 74 mW of power for D1 resolution at 30 frames/s which is 40% of that achieved in previous designs. ETPL VLSI-005 STBC-OFDM Downlink Baseband Receiver for Mobile WMAN
Abstract: This paper proposes a space time block code-orthogonal frequency division multiplexing downlink baseband receiver for mobile wireless metropolitan area network. The proposed baseband receiver applied in the system with two transmit antennas and one receive antenna aims to provide high performance in outdoor mobile environments. It provides a simple and robust synchronizer and an accurate but hardware affordable channel estimator to overcome the challenge of multipath fading channels. The coded bit error rate performance for 16 quadrature amplitude modulation can achieve less than 10-6 under the vehicle speed of 120 km/hr. The proposed baseband receiver designed in 90-nm CMOS technology can support up to 27.32 Mb/s uncoded data transmission under 10 MHz channel bandwidth. It requires a core area of 2.41 2.41 mm2 and dissipates 68.48 mW at 78.4 MHz with 1 V power supply. ETPL VLSI-006 Glitch-Free NAND-Based Digitally Controlled Delay-Lines
Abstract: The recently proposed NAND-based digitally controlled delay-lines (DCDL) present a glitching problem which may limit their employ in many applications. This paper presents a glitch-free NANDbased DCDL which overcame this limitation by opening the employ of NAND-based DCDLs in a wide range of applications. The proposed NAND-based DCDL maintains the same resolution and minimum delay of previously proposed NAND-based DCDL. The theoretical demonstration of the glitch-free operation of proposed DCDL is also derived in the paper. Following this analysis, three driving circuits for the delay control-bits are also proposed. Proposed DCDLs have been designed in a 90-nm CMOS technology and compared, in this technology, to the state-of-the-art. Simulation results show that novel circuits result in the lowest resolution, with a little worsening of the minimum delay with respect to the previously proposed DCDL with the lowest delay. Simulations also confirm the correctness of developed glitching model and sizing strategy. As example application, proposed DCDL is used to realize an Alldigital spread-spectrum clock generator (SSCG). The employ of proposed DCDL in this circuit allows to reduce the peak-to-peak absolute output jitter of more than the 40% with respect to a SSCG using threestate inverter based DCDLs.
Abstract: Conventionally for wide workload range applications, to keep good stability and high efficiency, a switching converter with multi-mode operation is necessary. With the advanced digital signal processing, this work presents an asynchronous digital controller with dynamic power saving technique to achieve high power efficiency. The regulation is based on the off-time modulation, in which an adaptive resolution adjustment is proposed for the extension toward light-loaded range. The DC-DC converter is fabricated in a 0.18- m CMOS process. The input voltage is from 2.7 to 3.6 V and the regulated output is 1.8 V. The switching frequency is from 44 kHz to 1.65 MHz and the maximum output ripple is 20 mV with a 10-F capacitor and a 2.2-H inductor. The power efficiency is higher than 91% for the workload range from 3 to 400 mA. ETPL VLSI-008 Formal Verification of Architectural Power Intent
Abstract: This paper presents a verification framework that attempts to bridge the disconnect between high-level properties capturing the architectural power management strategy and the implementation of the power management control logic using low-level per-domain control signals. The novelty of the proposed framework is in demonstrating that the architectural power intent properties developed using high-level artifacts can be automatically translated into properties over low-level control sequences gleaned from UPF specifications of power domains, and that the resulting properties can be used to formally verify the global on-chip power management logic. The proposed translation uses a considerable amount of domain knowledge and is also not purely syntactic, because it requires formal extraction of timing information for the low-level control sequences. We present a tool, called POWER-TRUCTOR which enables the proposed framework, and several test cases of significant complexity to demonstrate the feasibility of the proposed framework. ETPL VLSI-009 Statistical SRAM Read Access Yield Improvement Using Negative Capacitance Circuits
Abstract: SRAM has become the dominant block in modern ICs and constitutes more than 50% of the die area. The increase of process variations with continued CMOS technology scaling is considered one of the major challenges for SRAM designers. This process variations increase causes the SRAM cells to functionally fail and reduces the chip functional yield considering the static noise margin stability failures (i.e., cell flips when accessed), write failures (i.e., cell is not written within the write window), and read access failures (i.e., incorrect read operation). In this paper, novel negative capacitance circuits are developed, for the first time, to statistically improve the SRAM read access yield under process variations by reducing the bitlines parasitic capacitance. Post layout simulation results, referring to an industrial hardware-calibrated TSMC 65-nm CMOS technology, show that the adoption of the negative capacitance circuit to a 512 SRAM cells column is capable of improving the read access yield from 61.9% to 100%. ETPL VLSI-010 An Energy-Efficient L2 Cache Architecture Using Way Tag Information Under WriteThrough Policy
Abstract: Many high-performance microprocessors employ cache write-through policy for performance improvement and at the same time achieving good tolerance to soft errors in on-chip caches. However,
Abstract: We propose an analytical model based on queueing theory for delay analysis in a wormholeswitched network-on-chip (NoC). The proposed model takes as input an application communication graph, a topology graph, a mapping vector, and a routing matrix, and estimates average packet latency and router blocking time. It works for arbitrary network topology with deterministic routing under arbitrary traffic patterns. This model can estimate per-flow average latency accurately and quickly, thus enabling fast design space exploration of various design parameters in NoC designs. Experimental results show that the proposed analytical model can predict the average packet latency more than four orders of magnitude faster than an accurate simulation, while the computation error is less than 10% in nonsaturated networks for different system-on-chip platforms. ETPL VLSI-012 Built-In Generation of Functional Broadside Tests Using a Fixed Hardware Structure
Abstract: Functional broadside tests are two-pattern scan-based tests that avoid overtesting by ensuring that a circuit traverses only reachable states during the functional clock cycles of a test. In addition, the power dissipation during the fast functional clock cycles of functional broadside tests does not exceed that possible during functional operation. On-chip test generation has the added advantage that it reduces test data volume and facilitates at-speed test application. This paper shows that on-chip generation of functional broadside tests can be done using a simple and fixed hardware structure, with a small number of parameters that need to be tailored to a given circuit, and can achieve high transition fault coverage for testable circuits. With the proposed on-chip test generation method, the circuit is used for generating reachable states during test application. This alleviates the need to compute reachable states offline. ETPL VLSI-013 Checkpointing for Virtual Platforms and SystemC-TLM
Abstract: Integrating simulation models created using different simulation systems is a common problem when constructing virtual platforms. Different companies and different departments can create models, and virtual platforms for different purposes using different tools. There are also existing models that need to be integrated into new tools, or the other way around. The simulators can be quite different in details, even in the case of transaction-level models. We present work in integrating SystemC transaction-level models into two typical full-system simulation environments, QEMU and Simics. We present issues in
Abstract: Despite the rapid advances in process technology, via failure is still problematic in nanometerscale semiconductor manufacturing. Adding redundant vias is a typical approach for improving yield and reliability. Cell-based design methodologies are widely adopted in the industry for application-specific integrated circuits. Standard cells are effective for increasing the insertion rate of redundant via1s in cellbased designs. This study proposes an efficient library check and staggered pin arrangement approach that compares redundant via1 insertion rate in different configurations such as double-via and rectangle-via. To compare the variability in standard cell (SC) libraries, accurate characterization results are provided. Moreover, the proposed SC library is easily implemented in all currently available routers. The experimental results reveal that the proposed library improves total inserted redundant vias, total inserted redundant via1s, and total run time by 20.2%, 51.9%, and 42.3%, respectively. In double-via pattern, the proposed approach improves average via1 insertion rate by 14.6%. In rectangle-via pattern, the proposed approach achieves a 100% via1 insertion rate. ETPL VLSI-015 Scaling Energy Per Operation via an Asynchronous Pipeline
Abstract: Statistical analysis of computations per unit energy in processors over the last 30 years is given that illustrates a sharp reduction in the rate of energy efficiency improvements over the last several years resulting in the formation of an asymptotic wall with our dataset; we use the measure of giga multiply accumulates per Joule. We have developed an energy model which takes into account the realities of scaling, specifically for asynchronous systems. Studies of an energy efficient asynchronous pipeline show fabricated results of 17 Giga Operations per Joule in 0.6 m at subthreshold when fully pipelined, and simulations at a more modern 65 nm process show a further order of magnitude improvement on that. ETPL VLSI-016 A High Speed Low Power CAM With a Parity Bit and Power-Gated ML Sensing
Abstract: Content addressable memory (CAM) offers high-speed search function in a single clock cycle. Due to its parallel match-line (ML) comparison, CAM is power-hungry. Thus, robust, high-speed and low-power ML sense amplifiers are highly sought-after in CAM designs. In this paper, we introduce a parity bit that leads to 39% sensing delay reduction at a cost of less than 1% area and power overhead. Furthermore, we propose an effective gated-power technique to reduce the peak and average power consumption and enhance the robustness of the design against process variations. A feedback loop is employed to auto-turn off the power supply to the comparison elements and hence reduce the average power consumption by 64%. The proposed design can work at a supply voltage down to 0.5 V. ETPL VLSI-017 Error Detection in Majority Logic Decoding of Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes
Abstract: In a recent paper, a method was proposed to accelerate the majority logic decoding of difference set low density parity check codes. This is useful as majority logic decoding can be implemented serially
Abstract: This paper presents novel techniques to mitigate the effects of SRAM memory failures caused by low voltage operation in JPEG2000 implementations. We investigate error control coding schemes, specifically single error correction double error detection code based schemes, and propose an unequal error protection scheme tailored for JPEG2000 that reduces memory overhead with minimal effect in performance. Furthermore, we propose algorithm-specific techniques that exploit the characteristics of the discrete wavelet transform coefficients to identify and remove SRAM errors. These techniques do not require any additional memory, have low circuit overhead, and more importantly, reduce the memory power consumption significantly with only a small reduction in image quality. ETPL VLSI-019 Spatial Distribution Measurement of Dynamic Voltage Drop Caused by Pulse and Periodic Injection of Spot Noise
Abstract: This paper presents measured results of dynamic voltage drop caused by pulse and periodic injection of spot noise. The test structure being fabricated by a 45 nm low-power process has 1024 delay probes to measure spatial distributions in response to the spot-noise generation. The test structure is the advanced version of our predecessor being fabricated by a 65-nm node, and can trace changes in the spatial distributions with time after the noise injection. The measured results are compared with SPICE simulations, in which package/socket LCR as well as power-line RC within the die is modeled. It is found that the simple model agrees well with the measured results. ETPL VLSI-020 Low-Complexity Multiplier for GF(2^{m}) Based on All-One Polynomials
Abstract: This paper presents an area-time-efficient systolic structure for multiplication over GF(2m) based on irreducible all-one polynomial (AOP). We have used a novel cut-set retiming to reduce the duration of the critical-path to one XOR gate delay. It is further shown that the systolic structure can be decomposed into two or more parallel systolic branches, where the pair of parallel systolic branches has the same input operand, and they can share the same input operand registers. From the applicationspecific integrated circuit and field-programmable gate array synthesis results we find that the proposed design provides significantly less area-delay and power-delay complexities over the best of the existing designs. ETPL VLSI-021 Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip
Abstract: Coarse grained arrays (CGAs) with run-time reconfigurability play an important role in accelerating reconfigurable computing applications. It is challenging to design on-chip communication networks (OCNs) for such CGAs with dynamic run-time reconfigurability whilst satisfying the tight budgets of power and area for an embedded system. This paper presents a silicon-proven design of a 64PE circuit-switched OCN fabric with a dynamic path-setup scheme capable of supporting an embedded coarse-grained processor array. A proof-of-concept test chip fabricated in a 0.13 m CMOS process occupies a silicon area of 23 mm2 and consumes a peak power of 200 mW @ 128 MHz and 1.2 Vcc, at room temperature. The OCN overhead consumes 9.4% of the area and 18% of the power of the total chip. Experimental results and analysis show that the proposed OCN fabric with its dynamic path-setup is suitable for use in an embedded CGA supporting fast run-time reconfigurability. ETPL VLSI-023 A Very Linear Low-Pass Filter with Automatic Frequency Tuning
Abstract: A Gm-C third-order Chebyshev low-pass filter with a novel switched capacitor frequency tuning technique for a zero-IF Bluetooth receiver has been designed. The frequency tuning scheme is simpler and has more relaxed specifications than conventional ones. Furthermore, a highly linear pseudodifferential transconductor with a compact feedback loop able to operate with low supply voltage has been used. This control loop holds the input transistors in triode region and provides high output resistance, keeping high linearity in a wide range of transconductance. The filter bandwidth is 0.5 MHz and the overall scheme consumes 1.1 mA from a 1.8-V supply. The measured third-order intermodulation (IM3) distortion of the filter for a 1 Vpp two-tone signal centered at 300 kHz is -65 dB. ETPL VLSI-024 A High-Speed Low-Complexity Modified {\rm Radix}-2^{5} FFT Processor for High Rate WPAN Applications
Abstract: This paper presents a high-speed low-complexity modified radix-25 512-point fast Fourier transform (FFT) processor using an eight data-path pipelined approach for high rate wireless personal area network applications. A novel modified radix-25 FFT algorithm that reduces the hardware complexity is proposed. This method can reduce the number of complex multiplications and the size of the twiddle factor memory. It also uses a complex constant multiplier instead of a complex Booth multiplier. The proposed FFT processor achieves a signal-to-quantization noise ratio of 35 dB at 12 bit internal word length. The proposed processor has been designed and implemented using 90-nm CMOS technology with a supply voltage of 1.2 V. The results demonstrate that the total gate count of the
Abstract: This paper describes the application space exploration of a heterogeneous digital signal processor with dynamic reconfiguration capabilities. The device is built around three reconfigurable engines featuring different flavours and computation granularities that make it suitable for a wide range of signal processing application domains such as video coding, image processing, telecommunications, and cryptography. Performance of signal processing applications is evaluated from measurements performed on a CMOS 90 nm prototype. In order to characterize the application space of the processor, performance is compared with state-of-the-art devices, taking programmability, computational capabilities, and energy efficiency as the main metrics. The device exploits performance and energy efficiency significantly more than general purpose processors, while still maintaining a user-friendly programming approach that mainly relies on software-oriented languages. The device is able to achieve 1.2 to 15 GOPS with an energy efficiency from 2 to 50 GOPS/W when running the selected applications ETPL VLSI-026 A Unified Graphics and Vision Processor With a 0.89 \mu W/fps Pose Estimation Engine for Augmented Reality
Abstract: A unified vision and graphics processor with three layers is shown to provide a fast pipeline for augmented reality. In the image-level layer, a 153.6 GOPS massively parallel processing unit with eight SIMD processors, each containing 128 processing elements, performs highly data-parallel operations. In the sub-image layer, a rasterizer and a pixel arranger respectively generate and reduce data-level parallelism. In the descriptor-level layer, a pose estimation engine executes sequential programs. Our processor can provide images for augmented reality at 100 fps, for a power consumption of 413 mW. This is 39% faster than a comparable smartphone implementation. Our chip is fabricated in a 0.18 m CMOS process and contains 0.95 M gates. ETPL VLSI-027 CORDIC Designs for Fixed Angle of Rotation
Abstract: Rotation of vectors through fixed and known angles has wide applications in robotics, digital signal processing, graphics, games, and animation. But, we do not find any optimized coordinate rotation digital computer (CORDIC) design for vector-rotation through specific angles. Therefore, in this paper, we present optimization schemes and CORDIC circuits for fixed and known rotations with different levels of accuracy. For reducing the area- and time-complexities, we have proposed a hardwired preshifting scheme in barrel-shifters of the proposed circuits. Two dedicated CORDIC cells are proposed for the fixed-angle rotations. In one of those cells, micro-rotations and scaling are interleaved, and in the other they are implemented in two separate stages. Pipelined schemes are suggested further for cascading dedicated single-rotation units and bi-rotation CORDIC units for high-throughput and reduced latency implementations. We have obtained the optimized set of micro-rotations for fixed and known angles. The optimized scale-factors are also derived and dedicated shift-add circuits are designed to implement the scaling. The fixed-point mean-squared-error of the proposed CORDIC circuit is analyzed statistically, and strategies for reducing the error are given. We have synthesized the proposed CORDIC cells by Synopsys Design Compiler using TSMC 90-nm library, and shown that the proposed designs offer higher
Abstract: As chip multiprocessors keep increasing the number of cores on the chip, the network-on-chip (NoC) technology is becoming essential for interconnecting the cores. While NoCs result in noticeable performance boost over conventional bus systems, they consume a non-negligible fraction of the system power. One promising solution is to dynamically adjust the working frequencies/voltages of the switches as well as the links between switches in the NoC to match the traffic flows. The question is when to adjust and by how much. Most previous works take a passive approach by reacting to fluctuations in local traffic flows. Unfortunately, this approach may be too slow and too conservative in adjusting the working frequencies/voltages. Since applications often exhibit periodic behaviors, we propose a hardware mechanism to proactively adjust the frequencies/voltages of switches and/or links in NoC by predicting the application runtime traffic. The evaluations show that our design achieves 86% dynamic power savings of the links in the on-chip network, and the resulting overheads from mispredictions are tolerable. ETPL VLSI-029 Thermal-Constrained Task Allocation for Interconnect Energy Reduction in 3-D Homogeneous MPSoCs
Abstract: 3-D technology that stacks silicon dies with through silicon vias (TSVs) is a promising solution to overcome the interconnect scaling problem in giga-scale integrated circuits (ICs). Thermal dissipation is a major challenge for 3-D integration and prior thermal-balanced task scheduling methods for 3-D multiprocessor system-on-chips (MPSoCs) typically balance power gradient across vertical stacks based on the assumption of strong thermal correlation among processing cores within a stack. On the other hand, 3-D MPSoCs typically employ network-on-chip (NoC) as the communication infrastructure which consumes a large portion of the energy budget. As TSVs consume much less energy than horizontal links in 3-D MPSoCs when transmitting the same amount data due to the reduced interconnect distance between vertical adjacent cores, it motivates to allocate heavily communicating tasks within the same vertical stack as much as possible, and thus traffic is restricted in the third dimension to reduce interconnect energy. However, aggregating active tasks within the same stack probably exacerbates the power density and result in hot spots. In this paper, we explore the tradeoff between thermal and interconnect energy when allocating tasks in 3-D Homogeneous MPSoCs, and propose an efficient heuristic. Experimental results show that the proposed technique can reduce interconnect energy by more than 25% on average with almost the same peak temperature when compared with prior thermal-balanced solutions. ETPL VLSI-030 A Wide-Range PLL Using Self-Healing Prescaler/VCO in 65-nm CMOS
Abstract: The variability and leakage current in nanoscale CMOS technology may degrade the circuit performances significantly. To accommodate the above issues in a wide-range phase-locked loop (PLL), a self-healing prescaler, a self-healing voltage-controlled oscillator (VCO), and a calibrated charge pump (CP) are presented. This PLL is fabricated in a 65-nm CMOS technology and its active area is 0.0182 mm2 . For the self-healing VCO, its measured frequency range is from 60 to 1489 MHz. When this PLL
Abstract: Peak power reduction has been a critical challenge in the design of integrated circuits impacting the chip's performance and reliability. The reduction of peak power also reduces the power density of integrated circuits. Due to large IR-voltage drops in circuits, transistor switching slows down giving rise to timing violations and logic failures. In this paper, we present a new clock control strategy for peakpower reduction in VLSI circuits. In the proposed method, the simultaneous switching of combinational paths is minimized by taking advantage of the delay slacks among the paths and clustering the paths with similar slack values. Once the paths are identified based on the path delays and their slack values, the clustering algorithm determines the ideal number of clusters for the given circuit and for each cluster the maximum possible phase shift that can be applied to the clock. The paths are assigned to clusters in a load balanced manner based on the slack values and each cluster will have a phase shift possible on its clock depending on the slack. Thus, the proposed register-transfer level (RTL) method takes advantage of the logic-path timing slack to re-schedule circuit activities at optimal intervals within the unaltered clock period. When switching activities are redistributed more evenly across the clock period, the IC supplycurrent consumption is also spread across a wider range of time within the clock period. This has the beneficial effect of reducing peak-current draw in addition to reducing RMS power draw without having to change the operating frequency and without utilizing additional power supply voltages as in dual or multi VT approaches. The proposed method is implemented and tested through simulations using an experimental setup with Synopsys Tools Suite and Cadence Tools on the ISCAS'85 benchmark circuits, OpenCore circuits and LEON processor multiplier circuit. Experimental results indicate that peak power can be reduced significantly to at- least 72% depending on the number of clusters and the phase-shifted clock identified as suitable for the given circuit by the proposed algorithms. Although the proposed method incurs some power overhead compared to the traditional clocking method, the overhead can be made negligible compared to the peak-power reduction as seen in the experimental results presented. ETPL VLSI-032 A Fast-Locking All-Digital Deskew Buffer With Duty-Cycle Correction
Abstract: In this paper, a fast-locking all-digital deskew buffer with duty cycle correction is proposed and implemented. A cyclic time-to-digital converter is introduced to decrease the locking time in conventional register-controlled delay-locked loop to only two input clock cycles in coarse tuning. With the aid of the three half delay lines technique, the mismatch between half delay lines causing the duty cycle distortion can be alleviated by interpolation. A balanced edge combiner to achieve a precise 50% output clock is also presented. A test chip is fabricated in 0.18-m technology to demonstrate the feasibility of the proposed architecture. The circuit can accept the input clock rates from 250 to 625 MHz with the duty cycle variation within 30% and 70% to generate 50% output clocks. It preserves the capability of closedloop control with a small area and power consumption. ETPL VLSI-033 A Built-In Repair Analyzer With Optimal Repair Rate for Word-Oriented Memories
Abstract: The performance of multiprocessor systems, such as chip multiprocessors (CMPs), is determined not only by individual processor performance, but also by how efficiently the processors collaborate with one another. It is the communication architecture that determines the collaboration efficiency on the hardware side. Optical networks-on-chip (ONoCs) are emerging communication architectures that can potentially offer ultra-high communication bandwidth and low latency to multiprocessor systems. Thermal sensitivity is an intrinsic characteristic of photonic devices used by ONoCs as well as a potential issue. This paper systematically modeled and quantitatively analyzed the thermal effects in ONoCs. We used an 8 8 mesh-based ONoC as a case study and evaluated the impacts of thermal effects in the average power efficiency for real MPSoC applications. We revealed three important factors regarding ONoC power efficiency under temperature variations, and proposed several techniques to reduce the temperature sensitivity of ONoCs. These techniques include the optimal initial setting of microresonator resonant wavelength, increasing the 3-dB bandwidth of optical switching elements by parallel coupling multiple microresonators, and the use of passive-routing optical router Crux to minimize the number of switching stages in mesh-based ONoCs. We gave a mathematical analysis of periodically parallel coupling of multiple microresonators and show that the 3-dB bandwidth of optical switching elements can be widened nearly linearly with the ring number. Evaluation results for different real MPSoC applications show that, on the basis of thermal tuning, the optimal device setting improves the average power efficiency by 54% to 1.2 pJ/bit when chip temperature reaches 85 C. The findings in this paper can help support the further development of this emerging technology. ETPL VLSI-035 A Study of Tapered 3-D TSVs for Power and Thermal Integrity
Abstract: 3-D integration presents a path to higher performance, greater density, increased functionality and heterogeneous technology implementation. However, 3-D integration introduces many challenges for power and thermal integrity due to large switching currents, longer power delivery paths, and increased parasitics compared to 2-D integration. In this work, we provide an in-depth study of power and thermal issues while incorporating the physical design characteristics unique to 3-D integration. We provide a qualitative perspective of the power and thermal dissipation issues in 3-D and study the impact of Through Silicon Vias (TSVs) size for their mitigation. We investigate and discuss the design implications of power and thermal issues in the presence of decoupling capacitors, TSV/on-die/package parasitics, various resonance effects and power gating. Our study is based on a ten-tier system utilizing existing 3-D
Abstract: This paper presents a novel technique for extending the capacity of trace buffers when capturing debug data during post-silicon debug. It exploits the fact that is it not necessary to capture error-free data in the trace buffer since that information can be obtained from simulation. A selective data capture method is proposed in this paper that only captures debug data during clock cycles in which errors are present. The proposed debug method requires only three debug sessions. The first session estimates a rough error rate, the second session identifies a set of suspect clock cycles where errors may be present, and the third session captures the suspect clock cycles in the trace buffer. The suspect clock cycles are determined through a 2-D compaction technique using multiple-input signature register signatures and cycling register signatures. Intersecting both signatures generates a small number of suspect clock cycles for which the trace buffer needs to capture. The effective observation window of the trace buffer can be expanded significantly, by up to orders of magnitude. Experimental results indicate very significant increases in the effective observation window for a trace buffer can be obtained. ETPL VLSI-037 AC-Plus Scan Methodology for Small Delay Testing and Characterization
Abstract: Small delay defects escaping traditional delay testing could cause a device to malfunction in the field and thus detecting these defects is often necessary. To address this issue, we propose three test modes in a new methodology called AC-plus scan, in which versatile test clocks can be generated on the chip by embedding an all-digital phase-locked loop (ADPLL) into the circuit under test (CUT). AC-plus scan can be executed on an in-house wireless test platform called HOY system. The first test mode of our AC-plus scan provides a more efficient way to measure the longest path delay associated with each test pattern. Experimental result shows that our method could greatly reduce the test time by 81.8%. The second test mode is designed for volume production test. It could effectively detect small delay defects and provide fast characterization on those defective chips for further processing. This mode could be used to help predict which chips are more likely to fall victim to operational failure in the field. The third test mode is to extract the waveform of each flip-flop's output in a real chip. This is made possible by taking advantage of the almost unlimited test memory our HOY test platform provides, so that we could easily store a great volume of data and reconstruct the waveform for post-silicon debugging. We have successfully fabricated a Viterbi decoder chip with such an AC-plus scan methodology inside to demonstrate its capability. ETPL VLSI-038 A Variation Tolerant Current-Mode Signaling Scheme for On-Chip Interconnects
Abstract: Current-mode signaling (CMS) with dynamic overdriving is one of the most promising scheme for high-speed low-power communication over long on-chip interconnects. However, they are sensitive to parameter variations due to reduced voltage swings on the line. In this paper, we propose a variation tolerant dynamic overdriving CMS scheme. The proposed CMS scheme and a competing CMS scheme (CMS-Fb) are fabricated in 180-nm CMOS technology. Measurement results show that the proposed scheme offers 34% reduction in energy/bit and 42% reduction in energy-delay-product over CMS-Fb
Abstract: This paper addresses the modeling and analysis problems for power distribution networks (PDNs) in 3-D ICs. An on-chip distributed model is proposed for 3-D power grids, in which the details of metal layers are considered. The distributed model is demonstrated to be essential to identifying the unique noise behavior of 3-D PDNs. A lumped model is proposed based on the distributed model. The lumped model features the connection impedance between tiers and is proven to be useful for designers to understand the global effects of 3-D PDNs. Based on the models, an analysis flow is designed for 3-D PDNs in both frequency domain and time domain. With the analysis flow, the electrical characteristics of 3-D PDNs are studied systematically for the first time. The frequency-domain analysis identifies the global and local resonance phenomena in 3-D PDNs that are distinct from those in 2-D PDNs. The physical mechanisms behind the resonance phenomena are investigated. The time-domain analysis predicts the worst-case supply noise based on distributed current constraints. The Rogue Wave concept is introduced to explain the spatial and temporal relations of the worst-case on-chip noise responses in 3D PDNs. ETPL VLSI-040 A Low-Cost, Systematic Methodology for Soft Error Robustness of Logic Circuits
Abstract: Due to current technology scaling trends such as shrinking feature sizes and decreasing supply voltages, circuit reliability is becoming more susceptible to radiation-induced transient faults (soft errors). Soft errors, which have been a great concern in memories, are now a main factor in reliability degradation of logic circuits as well. In this paper, we present a systematic and integrated methodology for circuit robustness to soft errors. The proposed soft error rate (SER) reduction framework, based on redundancy addition and removal (RAR), aims at eliminating those gates with large contribution to the overall SER. Several metrics and constraints are introduced to guide the RAR-based approach toward SER reduction. Furthermore, we integrate a resizing strategy into our framework, as post-RAR additive SER optimization. The strategy can identify most critical gates to be upsized and thereby, minimize area and power overheads while maintaining a high level of soft error robustness. Experimental results show that the proposed RAR-based framework can achieve up to 70% reduction in output failure probability. On average, about 23% SER reduction is obtained with less than 4% area overhead. ETPL Low Complexity Out-of-Order Issue Logic Using Static Circuits VLSI-041 Abstract: In this paper a single-cycle issue queue circuit architecture that simplifies the wakeup and selection logic is proposed. The micro-architecture and fully static CMOS circuits are presented for a 32entry queue that issues four instructions per cycle. The instruction-ready signals are divided into groups and processed in parallel to issue the four oldest ready instructions. The complete issue queue and prioritization logic requires 20 inversions, allowing simulated circuit operation at over 4 GHz in a foundry
Abstract: A low-cost linearity test methodology for high-resolution analog-to-digital converters (ADCs) is presented in this paper. Linearity testing of ADCs requires high-precision digital-to-analog conversion (DAC) capability, commonly 3-bit higher resolution than the ADC under test. Further, a large number of ADC output data samples must be collected making conventional histogram testing impractical for highresolution ADCs with 18-24 bit precision. In the proposed test methodology, two low-precision and lowcost DACs are used to generate a high-resolution ADC test stimulus. Significant reductions in test cost and test time are achieved by using low-cost instrumentation and by making fewer measurements than required for conventional histogram test. A least-squares-based polynomial fitting approach is used to determine the transfer function of the ADC under test. The generated transfer function is used to compute the non-linearity of the ADC accurately. No assumption is made regarding the linearity of the lower precision signal generators (DACs) used in the testing procedure. Software simulations and hardware experiments are performed to validate the proposed test methodology ETPL VLSI-048 Low-Cost Error Tolerance Scheme for 3-D CMOS Imagers
Abstract: This paper presents an error tolerance scheme for 3-D CMOS imagers that are constructed by stacking a pixel array of imager sensors, an analog-to-digital converter (ADC) array, and an image signal processor (ISP) array using microbumps (bumps) a nd through silicon vias (TSVs). To deliver highquality images in the presence of single or multiple bump, ADC, or TSV failures, we propose to interleave the connections from pixels to ADCs and recover the corrupted data in the ISPs. Key design parameters, such as the interleaving stride and the grouping ratio are determined by analyzing the employed error correction algorithm. Architectural simulation results demonstrate that the error tolerance
Abstract: Considering full-scan circuits, incompletely-specified tests, or test cubes, are used for test data compression. When considering path delay faults, certain specified input values in a test cube are needed only for determining the lengths of the paths associated with detected faults. Path delay faults, and therefore, small delay defects, would still be detected if such values are unspecified. The goal of this paper is to explore the possibility of increasing the number of unspecified input values in a test set for path delay faults by unspecifying such values in order to make the test set more amenable to test data compression. Experimental results indicate that significant numbers of such values exist. The proposed procedure unspecifies them gradually to obtain a series of test sets with increasing numbers of unspecified values and decreasing path lengths. Experimental results also indicate that filling the unspecified values randomly (as with some test data compression methods) recovers some or all of the path lengths associated with detected path delay faults. The procedure uses a matching of the sets of detected faults for the comparison of path lengths ETPL VLSI-050 Integrated Energy-Harvesting Photodiodes With Diffractive Storage Capacitance
Abstract: Integrating energy-harvesting photodiodes with logic and exploiting on-die interconnect capacitance for energy storage can enable new, ultraminiaturized wireless systems. Unlike CMOS imager pixels, the proposed photodiode designs utilize p-diffusion fingers and are implemented in a conventional logic process. Also unlike specialized solar cell processes, the designs utilize the on-chip metal interconnect to form a diffraction grating above the p-diffusion fingers which also provides capacitive energy storage. To explore the tradeoffs between optical efficiency and energy storage for integrated photodiodes, an array of photovoltaics with various diffractive storage capacitors was designed in a 90nm CMOS logic process. The diffractive effects can be exploited to increase the photodiodes' response to off-axis illumination. Transient effects from interfacing the photodiodes with switched-capacitor DC-DC converters were examined, with measurements indicating a 50% reduction in the output voltage ripple due to the diffractive storage capacitance. A quantitative comparison between 90-nm and 0.35-m CMOS logic processes for energy-harvesting capabilities was carried out. Measurements show an increase in power generation for the newer CMOS technology, however at the cost of reduced output voltage. One potential application for the integrated photodiodes is harvesting energy for a subdermal biomedical device. ETPL VLSI-051 Fast Fixed-Outline 3-D IC Floorplanning With TSV Co-Placement
Abstract: Through-silicon vias (TSVs) are used to connect inter-die signals in a 3-D IC. Unlike conventional vias, TSVs occupy device area and are very large compared to logic gates. However, most previous 3-D floorplanners only view TSVs as points. As a result, whitespace redistribution is necessary for TSV insertion after the initial floorplan is computed, which leads to suboptimal layouts. In this paper, we propose a very efficient 3-D floorplanner to simultaneously floorplan the functional modules and place the TSVs and to optimize the total wirelength under fixed-outline constraint. Compared to the state-
Abstract: Multi-threshold CMOS (MTCMOS) is commonly used for suppressing leakage currents in idle integrated circuits. Power and ground distribution network noise produced during SLEEP to ACTIVE mode transitions is an important reliability concern in MTCMOS circuits. Sleep signal slew rate modulation techniques for suppressing mode-transition noise are explored in this paper. A triple-phase sleep signal slew rate modulation (TPS) technique with a novel digital sleep signal generator is proposed. Reactivation time, mode-transition energy consumption, leakage power consumption, and layout area of different MTCMOS circuits are characterized under an equal-noise constraint. Influences of within-die and die-to-die parameter variations on the reactivation noise, time, and energy consumption of sleep signal slew rate modulated MTCMOS circuits are evaluated with a process imperfections aware robustness metric. The proposed triple-phase sleep signal slew rate modulation technique enhances the tolerance to process parameter fluctuations by up to 183.1 as compared to various alternative MTCMOS noise suppression techniques in a UMC 80-nm CMOS technology. ETPL VLSI-053 Sub-mW LC Dual-Input Injection-Locked Oscillator for Autonomous WBSNs
Abstract: This paper presents a sub-mW, current-reused first-harmonic LC injection-locked oscillator (ILO) using in-phase dual-input injection technique. It can be used as a power oscillator in the injectionlocked transmitter of wireless biomedical sensor nodes (WBSNs) integrated into a wireless body area network. A prototype chip, implemented in a standard 0.13-m CMOS process occupying 200 380 m, operates in the medical implantable communications service (MICS) band for medical implants. Measurement results show that the proposed ILO features a wide locking range of 800 MHz (150-950 MHz) at an input power of 0 dBm. More importantly, it has a high input sensitivity of -30 dBm to lock the 3-MHz bandwidth of the MICS band, while consuming only 660 W at 1 -V supply. This ultralow power consumption enables autonomous WBSNs ETPL VLSI-054 Constant Delay Logic Style
Abstract: A constant delay (CD) logic style is proposed in this paper, targeting at full-custom high-speed applications. The CD characteristic of this logic style regardless of the logic type makes it suitable in implementing complicated logic expressions such as addition. CD logic exhibits a unique characteristic where the output is pre-evaluated before the inputs from the preceding stage is ready. This feature offers performance advantage over static and dynamic domino logic styles in a single-cycle multistage circuit block. Several design considerations including timing window width adjustment and clock distribution are discussed. Using 65-nm general-purpose CMOS technology, the proposed logic demonstrates an average speedup of 94% and 56% over static and dynamic domino logic, respectively, in five different logic gates. Simulation results of 8-bit ripple carry adders show that CD logic is 39% and 23% faster than the static and dynamic-based adders, respectively. CD logic also demonstrates 39% speedup and 64%
Abstract: This paper presents an all-digital phase-locked loop (ADPLL) clock generator for globally asynchronous locally synchronous (GALS) multiprocessor systems-on-chip (MPSoCs). With its low power consumption of 2.7 mW and ultra small chip area of 0.0078 mm2 it can be instantiated per core for fine-grained power management like DVFS. It is based on an ADPLL providing a multiphase clock signal from which core frequencies from 83 to 666 MHz with 50% duty cycle are generated by phase rotation and frequency division. The clock meets the specification for DDR2/DDR3 memory interfaces. Additionally, it provides a dedicated high-speed clock up to 4 GHz for serial network-on-chip data links. Core frequencies can be changed arbitrarily within one clock cycle for fast dynamic frequency scaling applications. The performance including statistical analysis of mismatch has been verified by a prototype in 65-nm CMOS technology. ETPL VLSI-056 A Colpitts CMOS Quadrature VCO Using Direct Connection of Substrates for Coupling
Abstract: A new low-phase noise low-power quadrature voltage-controlled oscillator (QVCO) using differential Colpitts oscillator is presented. The proposed QVCO is composed of two identical currentswitching differential Colpitts VCOs in which the first core VCO is coupled to the second in an in-phase manner, and the second core VCO is coupled to the first in an anti-phase manner. To couple the two core VCOs, the substrates of the cross-connected transistors as well as the substrates of MOS varactors are used; alleviating the need for any extra elements for coupling, which could add noise and increase power dissipation. A linear (sinusoidal) analysis is presented that confirms that the proposed circuit generates quadrature waveforms. The proposed coupling technique can be generalized to N differential Colpitts VCOs for multiphase signals generation ETPL VLSI-057 A Self-Calibrated DLL-Based Clock Generator for an Energy-Aware EISC Processor
Abstract: This paper describes a low-jitter delay-locked loop (DLL)-based clock generator for dynamic frequency scaling in the extendable instruction set computing (EISC) processor. The DLL-based clock generator provides the system clock with frequencies of 0.5 to 8 of the reference clock, according to the workload of the EISC processor. The proposed analog self-calibration method and a phase detector with an auxiliary charge pump can effectively reduce the delay mismatch between delay cells in the voltage-controlled delay line and the static phase offset due to the current mismatch in the charge pump, respectively. The self-calibrated output waveform exhibits 9.7 ps of RMS jitter and 73.7 ps of peak-topeak jitter at 120 MHz. The prototype clock generator implemented in a 0.18-m CMOS process occupies an active area of 0.27 mm2 and consumes 15.56 mA ETPL VLSI-058 Clamping Virtual Supply Voltage of Power-Gated Circuits for Active Leakage Reduction and Gate-Oxide Reliability
Abstract: This brief presents a 10-bit 30-MS/s successive-approximation-register analog-to-digital converter (ADC) that uses a power efficient switchback switching method. With respect to the monotonic switching method, the input common-mode voltage variation reduces which improves the dynamic offset and the parasitic capacitance variation of the comparator. The proposed switchback switching method does not consume any power at the first digital-to-analog converter switching, which can reduce the power consumption and design effort of the reference buffer. The prototype was fabricated in a 90-nm 1P9M CMOS technology. At 1-V supply and 30 MS/s, the ADC achieves an sequenced neighbor double reservation of 56.89 dB and consumes 0.98 mW, resulting in a figure-of-merit (FOM) of 57 fJ/conversion-step. The ADC core occupies an active area of only 190 525 m2. ETPL VLSI-060 Spur-Reduction Frequency Synthesizer Exploiting Randomly Selected PFD
Abstract: This brief presents a low-spur phase-locked loop (PLL) system for wireless applications. The low-spur frequency synthesizer randomizes the periodic ripples on the control voltage of the voltagecontrolled oscillator to reduce the reference spur at the output of the PLL. A novel random clock generator is presented to perform the random selection of the phase frequency detector control for the charge pump in locked state. The proposed frequency synthesizer was fabricated in a TSMC 0.18-m CMOS process. The proposed PLL achieved phase noise of -93 dBc/Hz with a 600-kHz offset frequency and reference spurs below -72 dBc. ETPL VLSI-061 Gain-Enhanced Monolithic Charge Pump With Simultaneous Dynamic Gate and Substrate Control
Abstract: This brief presents a gain-enhanced complimentary metal-oxide-semiconductor (CMOS) charge pump (CP) circuit via dynamically controlling the gate and substrate terminals of each pMOS pass transistor. The proposed control strategy enables the CP circuit free of the threshold-voltage drops, the body effect, and the floating substrate terminals of pass devices. The on-resistance of each pass device is
Abstract: During systems-on-a-chip (SoC) integration, silicon intellectual properties (IPs) are generally regarded as blockages to long interconnections that connect different IPs. With this constraint, conventional designs are forced to place those repeaters that drive long interconnections outside the IP. These designs either lead to a longer interconnection distance requiring more repeaters or result in a longer signal delay, since the interconnection wire is not appropriately segmented by the repeaters. To solve these problems, we designed the IPs such that designers can embed the repeaters in the IP for the SoC integration. In other words, it allows the cross-IP interconnections to be routed over the IP using repeaters inserted in the IP. The design concept, physical implementation, and application examples of the embedded repeaters are described in this brief ETPL VLSI-063 RATS: Restoration-Aware Trace Signal Selection for Post-Silicon Validation
Abstract: Post-silicon validation is one of the most important and expensive tasks in modern integrated circuit design methodology. The primary problem governing post-silicon validation is the limited observability due to storage of a small number of signals in a trace buffer. The signals to be traced should be carefully selected in order to maximize restoration of the remaining signals. Existing approaches have two major drawbacks. They depend on partial restorability computations that are not effective in restoring maximum signal states. They also require long signal selection time due to inefficient computation as well as operating on gate-level netlist. We have proposed a signal selection approach based on total restorability at gate-level, which is computationally more efficient (10 times faster) and can restore up to three times more signals compared to existing methods. We have also developed a register transfer level signal selection approach, which reduces both memory requirements and signal selection time by several orders-of-magnitude. ETPL VLSI-064 Test Patterns of Multiple SIC Vectors: Theory and Application in BIST Schemes
Abstract: This paper proposes a novel test pattern generator (TPG) for built-in self-test. Our method generates multiple single-input change (MSIC) vectors in a pattern, i.e., each vector applied to a scan chain is an SIC vector. A reconfigurable Johnson counter and a scalable SIC counter are developed to generate a class of minimum transition sequences. The proposed TPG is flexible to both the test-per-clock and the test-per-scan schemes. A theory is also developed to represent and analyze the sequences and to extract a class of MSIC sequences. Analysis results show that the produced MSIC sequences have the favorable features of uniform distribution and low input transition density. The performances of the designed TPGs and the circuits under test with 45 nm are evaluated. Simulation results with ISCAS benchmarks demonstrate that MSIC can save test power and impose no more than 7.5% overhead for a scan design. It also achieves the target fault coverage without increasing the test length.
Abstract: BLAST is one of the most popular sequence analysis tools used by molecular biologists. It is designed to efficiently find similar regions between two sequences that have biological significance. However, because the size of genomic databases is growing rapidly, the computation time of BLAST, when performing a complete genomic database search, is continuously increasing. Thus, there is a clear need to accelerate this process. In this paper, we present a new approach for genomic sequence database scanning utilizing reconfigurable field programmable gate array (FPGA)-based hardware. In order to derive an efficient structure for BLASTN, we propose a reconfigurable architecture to accelerate the computation of the word-matching stage. The experimental results show that the FPGA implementation achieves a speedup around one order of magnitude compared to the NCBI BLASTN software running on a general purpose computer. ETPL VLSI-067 Architecturally Homogeneous Power-Performance Heterogeneous Multicore Systems
Abstract: Dynamic voltage and frequency scaling (DVFS), a widely adopted technique to ensure safe thermal characteristics while delivering superior energy efficiency, is rapidly becoming inefficient with technology scaling due to two critical factors: 1) inability to scale the supply voltage due to reliability concerns and 2) dynamic adaptations through DVFS cannot alter underlying power hungry circuit characteristics, designed for the nominal frequency. In this paper, we show that DVFS scaled circuits substantially lag in energy efficiency, by 22%86%, compared to ground up designs for target frequency levels. We propose architecturally homogeneous power-performance heterogeneous multicore systems, a fundamentally alternate means to design energy efficient multicore systems. Using a system level computer-aided design (CAD) approach, we seamlessly integrate architecturally identical cores, designed for different voltage-frequency domains. We use a combination of standard cell library based CAD flow and full system architectural simulation to demonstrate 11%22% improvement in energy efficiency using our design paradigm.
Abstract: An active filter-based on-chip DCDC voltage converter for application to distributed on-chip power supplies in multivoltage systems is described in this paper. No inductor or output capacitor is required in the proposed converter. The area of the voltage converter is therefore significantly less than that of a conventional low-dropout (LDO) regulator. Hence, the proposed circuit is appropriate for pointof-load voltage regulation for noise sensitive portions of an integrated circuit. The performance of the circuit has been verified with Cadence Spectre simulations and fabricated with a commercial 110 nm complimentary metal oxide semiconductor (CMOS) technology. The area of the voltage regulator is 0.015 ${rm mm}^{2}$ and delivers up to 80 mA of output current. The transient response with no output capacitor ranges from 72 to 192 ns. The parameter sensitivity of the active filter is also described. The advantages and disadvantages of the active filter-based, conventional switching, linear, and switched capacitor voltage converters are compared. The proposed circuit is an alternative to classical LDO voltage regulators, providing a means for distributing multiple local power supplies across an integrated circuit while maintaining high current efficiency and fast response time within a small area. ETPL VLSI-069 CusNoC: Fast Full-Chip Custom NoC Generation
Abstract: We propose a full-chip synthesis methodology to construct custom network-on-chips (CusNoCs) for NoC-based systems. The proposed scheme generates irregular network topologies for application-specific designs with known communication demands. In this method, processors and the communication architecture can be synthesized simultaneously in the floorplanning process, and thus it is called CusNoC. CusNoC synthesizes CusNoC in two steps. The target network topology is first generated based on communication analysis. Processing elements are partitioned into groups such that the utility of routers will be maximized if a router is assigned to each group. In this way, the number of routers passed by a packet, or hops, is minimized, and so is the power consumption in the network. The final network topology is formed by properly connecting these groups. A wirelength-aware floor planning is then carried out to optimize circuit size as well as wirelength. Experimental results show that CusNoC produces custom NoCs with better performance than previous methods while the computation time is significantly shorter. This method is also more scalable, which makes it ideal for complicated systems. ETPL VLSI-070 Cooperating Virtual Memory and Write Buffer Management for Flash-Based Storage Systems
Abstract: Flash memory is becoming the preferred choice of secondary storage in mobile devices and embedded systems. The performance of Flash memory is dictated by asymmetric speeds of read and write, limited number of erase times, and the absence of in-place updates. To improve the performance of Flash-based storage systems, the write buffer has been provided in Flash memories recently. At the same time, new virtual memory management strategies have been proposed in recent studies that consider the characteristics of Flash memory. Currently, approaches on these two memory layers are considered separately, which fail to explore the full potential of these two layers. In this paper, we propose cooperative management schemes for virtual memory and write buffer to maximize the performance of Flash-memory-based systems. Management on virtual memory is designed to exploit write buffer status via reordering of the write sequences. The proposed write buffer management scheme works seamlessly with the proposed virtual memory management scheme. Experimental results show that significant
Abstract: This paper presents an multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple outputorthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length. Based on the MDC architecture, we propose to use radix-$N_{s}$ butterflies at each stage, where $N_{s}$ is the number of data streams, so that there is only one butterfly needed in each stage. Consequently, a 100% utilization rate in computational elements is achieved. Moreover, thanks to the simple control mechanism of the MDC, we propose simple memory scheduling methods for input data and output bit/set-reversing, which again results in a full utilization rate in memory usage. Since the memory requirements usually dominate the die area of FFT/inverse fast Fourier transform (IFFT) processors, the proposed scheme can effectively reduce the memory size and thus the die area as well. Furthermore, to apply the proposed scheme in practical applications, we let $N_{s}=4$ and implement a 4-stream FFT/IFFT processor with variable length including 2048, 1024, 512, and 128 for MIMO-OFDM systems. This processor can be used in IEEE 802.16 WiMAX and 3GPP long term evolution applications. The processor was implemented with an UMC 90-nm CMOS technology with a core area of 3.1 ${rm mm}^{2}$. The power consumption at 40 MHz was 63.72/62.92/57.51/51.69 mW for 2048/1024/512/128-FFT, respectively in the post-layout simulation. Finally, we analyze the complexity and performance of the implemented processor and compare it with other processors. The results show advantages of the proposed scheme in terms of area and power consumption. ETPL VLSI-072 Current-Reused 2.4-GHz Direct-Modulation Transmitter With On-Chip Automatic Tuning
Abstract: This paper presents the design, analysis, and experimental verification of a self-calibrating current-reused 2.4-GHz direct-modulation transmitter for short-range wireless applications. The key contributions are the design/analysis of a stacked power amplifier (PA)/voltage-controlled oscillator (VCO) architecture, the nonlinear frequency-dependent analysis of a Gilbert-cell-based root-mean-square detector, and an on-chip $LC$-tank calibration circuit that needs no analog-to-digital convertor (ADC)/digital signal processor. The stacked architecture reduces the number of required regulators, utilizes supply headroom effectively, and allows for an ADC -less calibration loop that can dynamically tune the PA center frequency by sensing the transmitted signal. The very nature of direct-modulation architecture obviates additional high-purity signal generators, reducing complexity and allowing online calibration. The system was implemented in TSMC 0.18 $mu{rm m}$ CMOS, occupies 0.7 ${rm mm}^{2}~({rm TX})+0.1~{rm mm}^{2}$ (self-tuning), and was measured in a QFN48 package on an FR4 PCB. Automatically correcting PA/VCO tank misalignment in this case yielded ${>}{rm 4}~{rm dB}$ increase in output power. With the automatic tuning active, the transmitter delivers a measured output power ${>}{rm 0}~{rm dBm}$ to a 100-$Omega$ differential load, and the system consumes 22.9 mA from a 1.8-V core-circuit supply.
Abstract: Singular value decomposition (SVD) is an optimal method to obtain spatial multiplexing gain in multi-input multi-output (MIMO) channels. However, the high cost of implementation and high decomposing latency of the SVD restricts its usage in current wireless communication applications. In this paper, we present a complete adaptive SVD algorithm and a reconfigurable architecture for highthroughput MIMO-orthogonal frequency division multiplexing systems. There are several proposed architectural design techniques: reconfigurable scheme, division-free adaptive step size scheme, early termination scheme, and data interleaving scheme. The reconfigurable scheme can support all antenna configurations in a MIMO system. The division-free adaptive step size and early termination schemes are used to effectively reduce the decomposing latency and improve hardware utilization. The data interleaving scheme helps to deal with several channel matrices concurrently. Besides, we propose an orthogonal reconstruction scheme to obtain more accurate SVD outputs, and then the system performance will be greatly enhanced. We apply our SVD design to the IEEE 802.11 n applications. This design is implemented and fabricated in UMC 90 nm 1P9M CMOS technology. The maximum operating frequency is measured to be at 101.2 MHz, and the corresponding power dissipation is at 125 mW. The core size is 2.17 ${rm mm}^{2}$ and the die size occupies 4.93 ${rm mm}^{2}$. The chip result shows that the average latency is only 0.33% of the wireless local area network coherence time. Hence, the proposed reconfigurable adaptive SVD engine design is very suitable for high-throughput wireless communication applications. ETPL VLSI-074 The LUT-SR Family of Uniform Random Number Generators for FPGA Architectures
Abstract: Field-programmable gate array (FPGA) optimized random number generators (RNGs) are more resource-efficient than software-optimized RNGs because they can take advantage of bitwise operations and FPGA-specific features. However, it is difficult to concisely describe FPGA-optimized RNGs, so they are not commonly used in real-world designs. This paper describes a type of FPGA RNG called a LUT-SR RNG, which takes advantage of bitwise xor operations and the ability to turn lookup tables (LUTs) into shift registers of varying lengths. This provides a good resourcequality balance compared to previous FPGA-optimized generators, between the previous high-resource high-period LUT-FIFO RNGs and low-resource low-quality LUT-OPT RNGs, with quality comparable to the best software generators. The LUT-SR generators can also be expressed using a simple C++ algorithm contained within this paper, allowing 60 fully-specified LUT-SR RNGs with different characteristics to be embedded in this paper, backed up by an online set of very high speed integrated circuit hardware description language (VHDL) generators and test benches. ETPL VLSI-075 Exploring the Use of Emerging Nonvolatile Memory Technologies in Future FPGAs,
Abstract: As new nonvolatile memory technologies become increasingly mature, there has been a growing interest on investigating their use in future field-programmable gate arrays (FPGAs). Similar to existing FPGAs with embedded Flash memory, future FPGAs can embed these new nonvolatile memories to persistently store configuration data. By comparing with prior work, we first propose the more appropriate design style for new nonvolatile configuration data storage memory. Moreover, this brief studies a dynamic random-access memory (DRAM)-based FPGA design strategy enabled by high-
Abstract: Tester limitations may impose certain constraints on the primary input vectors applicable as part of a two-pattern test for delay faults. Under these constraints, the primary input vectors may be held constant, or the second primary input vector of a test may be obtained by a single shift of a scan chain relative to the first. The goal of this brief is to study the differences in achievable transition fault coverage between various primary input constraints that are similar to the commonly used ones of holding or shifting primary input vectors. This brief also studies the possibility of combining the constraints in order to increase the transition fault coverage. The combination requires a fixed and circuit-independent hardware structure similar to the case where shifting of primary input vectors is used. This study is done using test sets that consist of both broadside and skewed-load tests in order to maximize the transition fault coverage. ETPL VLSI-078 Supply Noise Suppression by Triple-Well Structure
Abstract: This brief discusses the impact of twin- and triple-well structures on power supply noise, and a substrate model for simulating the power supply noise. We observed $V_{rm ss}$ noise reduction by the resistive network of the p-substrate and $V_{rm dd}$ noise reduction by the junction capacitance of a triple-well structure on a 90-nm test chip. Measurement results also showed that the total noise reduction of a triple-well structure is superior to that of a twin-well structure. The measurement results correlate well with the results obtained from the power supply noise simulation using a hierarchical resistive mesh model. Our simulation-based verification indicates that in common CMOS design, a triple-well structure can reduce the power supply drop by 10%40% or the decoupling capacitance area by 5%10%. We also verified that supply drop sensitivity to variation of the well junction capacitance is sufficiently small and that supply noise reduction using a triple-well structure is robust to process variation. ETPL VLSI-079 Software-Based Self Test Methodology for On-Line Testing of L1 Caches in Multithreaded Multicore Architectures
Abstract: The flexibility that allows the application of different March tests is a critical requirement for on-line testing of memory arrays. In a previous study, we have introduced a low-cost software-based self test (SBST) program development methodology for on-line periodic testing of L1 caches that utilizes direct cache access (DCA) instructions and exploits the native monitoring hardware available in modern architectures. In this brief, we discuss a multithreaded optimization of this SBST methodology that exploits the thread level parallelism of multithreaded multicore architectures in order to speed up March test execution by elaborating the low level multiple sub-bank cache organization. The effectiveness of the methodology and its multithreaded optimization is demonstrated on the L1 caches of OpenSPARC T1 processor. Our results showed a speedup of more than 1.7 when the multithreaded optimization is applied and an acceptable performance overhead (less than 11%), even in intensive periodic test scenarios.
Abstract: In this paper, we discuss logic circuit designs using the circuit model of three-state quantum dot gate field effect transistors (QDGFETs). QDGFETs produce one intermediate state between the two normal stable ON and OFF states due to a change in the threshold voltage over this range. We have developed a simplified circuit model that accounts for this intermediate state. Interesting logic can be implemented using QDGFETs. In this paper, we discuss the designs of various two-input three-state QDGFET gates, including NAND- and NOR-like operations and their application in different combinational circuits like decoder, multiplier, adder, and so on. Increased number of states in three-state QDGFETs will increase the number of bit-handling capability of this device and will help us to handle more number of bits at a time with less circuit elements. ETPL VLSI-081 Parametric DFM Solution for Analog Circuits: Electrical-Driven Hotspot Detection, Analysis, and Correction Flow
Abstract: As VLSI technology pushes into advanced nodes, designers and foundries have exposed a hitherto insignificant set of yield problems. To combat yield failures, the semiconductor industry has deployed new tools and methodologies commonly referred to as design for manufacturing (DFM). Most of the early DFM efforts concentrated on catastrophic failures, or physical DFM problems. Recently, there has been an increased emphasis on parametric yield issues, referred to as electrical-DFM (e-DFM). In this paper, we present a complete e-DFM solution that detects, analyzes, and fixes electrical hotspots (e-hotspots) within an analog circuit design that are caused by different process variations. Novel algorithms are proposed to implement the engines used to develop this solution. The solution is examined on a 130-nm parametrically-failing level shifter circuit, and verified with silicon wafer measurements that confirm the existence of parametric yield issues in the design. Additional experiments are applied on a 65-nm industrial operational amplifier and voltage control oscillator (VCO). E-hotspot devices with a 27.7% variation in dc current are identified. After fixing the e-hotspots, the dc current variation in these devices is dramatically reduced to 7%, which meets the designer acceptance criteria, while saving the original VCO specifications. ETPL VLSI-082 Unified Capture Scheme for Small Delay Defect Detection and Aging Prediction
Abstract: Small delay defect (SDD) and aging-induced circuit failure are both prominent reliability concerns for nanoscale integrated circuits. Faster-than-at-speed testing is effective on SDD detection in manufacturing testing, which is always implemented by designing a suite of test signal generation circuits on the chip. Meanwhile, the integration of online aging sensors is becoming attractive in monitoring aging-induced delay degradation in the runtime. These design requirements, if implemented in separate ways, will increase the complexity of a reliable design and consume more die area. In this paper, a unified capture scheme is proposed to generate programmable clock signals for the detection of both SDDs and circuit aging. Our motivation arises from the observations that SDD detection and online aging prediction both need to capture circuit response ahead of the functional clock. The proposed aging-resistant design method enables the offline test circuit to be reused in online operations. Reversed short channel effect is also exploited to make the underlying circuit resilient to process variations. The proposed scheme is validated by intensive HSPICE simulations. Experimental results demonstrate the effectiveness in terms of low area, power, and performance overheads.
Abstract: A novel detection algorithm with an efficient VLSI architecture featuring efficient operation over infinite complex lattices is proposed. The proposed design results in the highest throughput, the lowest latency, and the lowest energy compared to the complex-domain VLSI implementations to date. The main innovations are a novel complex-domain means of expanding/visiting the intermediate nodes of the search tree on demand, rather than exhaustively, as well as a new distributed sorting scheme to keep track of the best candidates at each search phase. Its support of unbounded infinite lattice decoding distinguishes the present method from previous K-Best strategies and also allows its complexity to scale sublinearly with the modulation order. Since the expansion and sorting cores are data-driven, the architecture is well suited for a pipelined parallel VLSI implementation. The proposed algorithm is used to fabricate a 44, 64-QAM complex multiple-input-multiple-output detector in a 0.13-m CMOS technology, achieving a clock rate of 417 MHz with the core area of 340 kgates. The chip test results prove that the fabricated design can sustain a throughput of 1 Gb/s with energy efficiency of 110 pJ/bit, the best numbers reported to date. ETPL VLSI-084 High-Throughput 0.13- \mu{\rm m} CMOS Lattice Reduction Core Supporting 880 Mb/s Detection
Abstract: This paper presents the first silicon -proven implementation of a lattice reduction (LR) algorithm, which achieves maximum likelihood diversity . The implementation is based on a novel hardware-optimized due to the Lenstra , Lenstra, and Lovasz (LLL) algorithm, which significantly reduces its complexity by replacing all the computationally intensive LLL operations (multiplication, division, and square root) with low-complexity additions and comparisons. The proposed VLSI design utilizes a pipelined architecture that produces an LR-reduced matrix set every 40 cycles, which is a 60% reduction compared to current state-of-the-art LR field-programmable gate array implementations. The 0.13-m CMOS LR core presented in this paper achieves a clock rate of 352 MHz, and thus is capable of sustaining a throughput of 880 Mb/s for 64-QAM multiple-input-multiple-output detection with superior performance while dissipating 59.4 mW at 1.32 V supply. ETPL VLSI-085 Study of Through-Silicon-Via Impact on the 3-D Stacked IC Layout
Abstract: The technology of through-silicon vias (TSVs) enables fine-grained integration of multiple dies into a single 3-D stack. TSVs occupy significant silicon area due to their sheer size, which has a great effect on the quality of 3-D integrated chips (ICs). Whereas well-managed TSVs alleviate routing congestion and reduce wirelength, excessive or ill-managed TSVs increase the die area and wirelength. In this paper, we investigate the impact of the TSV on the quality of 3-D IC layouts. Two design schemes, namely TSV co-placement (irregular TSV placement) and TSV site (regular TSV placement), and accompanying algorithms to find and optimize locations of gates and TSVs are proposed for the design of 3-D ICs. Two TSV assignment algorithms are also proposed to enable the regular TSV placement. Simulation results show that the wirelength of 3-D ICs is shorter than that of 2-D ICs by up to 25%. ETPL VLSI-086 Design of Hardware Function Evaluators Using Low-Overhead Nonuniform Segmentation With Address Remapping
Abstract: Carbon nanotube field effect transistors (CNFETs) show great promise as extensions to silicon CMOS. However, imperfections, which are mainly related to carbon nanotubes (CNTs) growth process, result in metallic and nonuniform CNTs leading to significant functional yield reduction. This paper presents a comprehensive technique for statistical functional yield estimation and enhancement of CNFET-based VLSI circuits. Based on experimental data extracted from aligned CNTs, we propose a compact statistical model to estimate the failure probability of a CNFET. Using the proposed failure model, we show that enhancing the CNT synthesis process alone cannot achieve acceptable functional yield for upcoming CNFET-based VLSI circuits. We propose a technique which is based on replacing each transistor by series-parallel transistor structures to reduce the failure probability of CNFETs in the presence of metallic and nonuniform CNTs. The technique is adapted to use single directional independence, which is inherent in aligned CNTs, to enhance the functional yield as validated by theoretical analysis and simulation results. Tradeoffs between failure probability reduction and design overheads such as area and current drive are explored. As demonstrated by extensive simulation results, the proposed technique achieves 80% functional yield in CNFET technology at the cost of 7.5X area and 34% current drive overheads if the CNT density and the fraction of semiconducting CNTs are improved to 200 CNTs per m and 99.99%, respectively. ETPL VLSI-088 Theoretical Modeling of Elliptic Curve Scalar Multiplier on LUT-Based FPGAs for Area and Speed,
Abstract: This paper uses a theoretical model to approximate the delay of different characteristic two primitives used in an elliptic curve scalar multiplier architecture (ECSMA) implemented on k input lookup table (LUT)-based field-programmable gate arrays. Approximations are used to determine the delay of the critical paths in the ECSMA. This is then used to theoretically estimate the optimal number of pipeline stages and the ideal placement of each stage in the ECSMA. This paper illustrates suitable scheduling for performing point addition and doubling in a pipelined data path of the ECSMA. Finally, detailed analyses, supported with experimental results, are provided to design the fastest scalar multiplier over generic curves. Experimental results for GF(2163) show that, when the ECSMA is suitably pipelined, the scalar multiplication can be performed in only 9.5 s on a Xilinx Virtex V. Notably the design has an area which is significantly smaller than other reported high-speed designs, which is due to
Abstract: Adaptive systems are increasing in importance across a range of application domains. They rely on the ability to respond to environmental conditions, and hence real-time monitoring of statistics is a key enabler for such systems. Probability density function (PDF) estimation has been applied in numerous domains; computational limitations, however, have meant that proxies are often used. Parametric estimators attempt to approximate PDFs based on fitting data to an expected underlying distribution, but this is not always ideal. The density function can be estimated by rescaling a histogram of sampled data, but this requires many samples for a smooth curve. Kernel-based density estimation can provide a smoother curve from fewer data samples. We present a general architecture for nonparametric PDF estimation, using both histogram-based and kernel-based methods, which is designed for integration into streaming applications on field-programmable gate array (FPGAs). The architecture employs heterogeneous resources available on modern FPGAs within a highly parallelized and pipelined design, and is able to perform real-time computation on sampled data at speeds of over 250 million samples per second, while extracting a variety of statistical properties. ETPL VLSI-090 Symbolic Moment Computation for Statistical Analysis of Large Interconnect Networks,
Abstract: The shrinking technology feature size and dense large-scale integration make process variation a challenging issue directly confronting the latest design automation tools. Process variation causes severe variation in interconnect networks, including very large-scale integrated interconnect structures, such as clock trees, clock mesh, power-ground networks, and other wiring structures in 3-D integrated circuits. The traditional moment computation techniques are only partly useful for analyzing such variational problems, however, their computational efficiency cannot meet the quickly rising needs, such as statistical analysis. This paper presents a novel symbolic moment calculator (SMC) for variational interconnect analysis. The moment calculator is constructed in a regular data structure that incorporates binary decision diagrams for data storage and computation. Given an interconnect circuit, such a computation diagram has to be constructed only once and can be repeatedly invoked for computation of moments with varying parameter values. Also, the SMC is friendly to interconnect synthesis in that it can be incrementally modified according to the modifications made to the circuit structure. Applications of the SMC for fast moment computation, sensitivity analysis, and statistical timing analysis are addressed. Significant efficiency is demonstrated comparing to other existing methods. ETPL C-Based Complex Event Processing on Reconfigurable Hardware VLSI-091 Abstract: This brief presents an efficient complex event-processing framework, designed to process a large number of sequential events on field-programmable gate arrays (FPGAs). Unlike conventional structured query language based approaches, our approach features logic automation constructed with a new C-based event language that supports regular expressions on the basis of C functions, so that a wide variety of event-processing applications can be efficiently mapped to FPGAs. Evaluations on an FPGAbased network interface card show that we can achieve 12.3 times better event-processing performance than does a CPU software in a financial trading application.
Abstract: In this paper, we consider the task allocation problem on a hybrid main memory composed of nonvolatile memory (NVM) and dynamic random access memory (DRAM). Compared to the conventional memory technology DRAM, the emerging NVM has excellent energy performance since it consumes orders of magnitude less leakage power. On the other hand, most types of NVMs come with the disadvantages of much shorter write endurance and longer write latency as opposed to DRAM. By leveraging the energy efficiency of NVM and long write endurance of DRAM, this paper explores task allocation techniques on hybrid memory for multiple objectives such as minimizing the energy consumption, extending the lifetime, and minimizing the memory size. The contributions of this paper are twofold. First, we design the integer linear programming (ILP) formulations that can solve different objectives optimally. Then, we propose two sets of heuristic algorithms including three polynomial time offline heuristics and three online heuristics. Experiments show that compared to the optimal solutions generated by the ILP formulations, the offline heuristics can produce near-optimal results. ETPL VLSI-093 BilRC: An Execution Triggered Coarse Grained Reconfigurable Architecture
Abstract: We present Bilkent reconfigurable computer (BilRC), a new coarse-grained reconfigurable architecture (CGRA) employing an execution-triggering mechanism. A control data flow graph language is presented for mapping the applications to BilRC. The flexibility of the architecture and the computation model are validated by mapping several real-world applications. The same language is also used to map applications to a 90-nm field-programmable gate array (FPGA), giving exactly the same cycle count performance. It is found that BilRC reduces the configuration size about 33 times. It is synthesized with 90-nm technology, and typical applications mapped on BilRC run about 2.5 times faster than those on FPGA. It is found that the cycle counts of the applications for a commercial very long instruction word digital signal processor processor are 1.9 to 15 times higher than that of BilRC. It is also found that BilRC can run the inverse discrete cosine transform algorithm almost 3 times faster than the closest CGRA in terms of cycle count. Although the area required for BilRC processing elements is larger than that of existing CGRAs, this is mainly due to the segmented interconnect architecture of BilRC, which is crucial for supporting a broad range of applications. ETPL VLSI-094 Block-Circulant RS-LDPC Code: Code Construction and Efficient Decoder Design
Abstract: This brief presents a method for constructing block-circulant (BC) Reed-Solomon-based lowdensity parity-check (RS-LDPC) codes and an efficient decoder design. The proposed construction method results in a BC form of a parity-check matrix from a random parity-check matrix for RS-LDPC codes. A decoder architecture and switch network for BC-RS-LDPC code are then developed based on the new BC parity-check matrix. Thus, an efficient decoder architecture dedicated to a promising class of high-performance BC-RS-LDPC codes is presented for the first time. Moreover, a (2048, 1723) BC-RSLDPC decoder architecture is designed to demonstrate the efficiency of the presented techniques. Synthesis results show that the proposed decoder requires 1.3-M gates and can operate at 450 MHz to achieve a data throughput of 41 Gb/s with eight iterations.
Abstract: We propose an active dynamic redundancy-based fault-handling approach exploiting the partial dynamic reconfiguration capability of static random-access memory-based field-programmable gate arrays. Fault detection is accomplished in a uniplex hardware arrangement while an autonomous fault isolation scheme is employed, which neither requires test vectors nor suspends the computational throughput. The deterministic flow of the fault-handling scheme achieves an improved recovery in a bounded number of reconfigurations. This approach extends existing signal processing properties to accommodate fault handling, and is validated by implementing an H.263 video encoder discrete cosine transform (DCT) block. The peak signal-to-noise ratio measure of the video sequences indicates fault tolerance in the DCT block with only limited quality degradation, during the isolation and recovery phases spanning a few frames. ETPL VLSI-096 MAEPER: Matching Access and Error Patterns With Error-Free Resource for Low Vcc L1 Cache
Abstract: Large SRAMs are the practical bottleneck to achieve a low supply voltage, because they suffer from process variation-induced bit errors at a low supply voltage. In this paper, we present an errorresilient cache architecture that resolves the drawback of previous approaches, i.e., the performance degradation at a low supply voltage which is caused by cache misses in accesses to faulty resources. We utilize cache access locality and error-free resources in a cost-effective manner. First, we classify cache lines into fully and partially accessed groups and apply appropriate methods to each group. For the partially accessed group, we propose a method of matching memory access behavior and error locations with intra-cache line word-level remapping. In order to reduce the area overhead used to store the cache access information history, we present an access pattern-learning line-fill buffer (LFB). For the fully accessed group, we propose the utilization of error-free assist functions in the cache, i.e., a LFB and victim cache with no process variation-induced error at the target minimum supply voltage. We also present an error-aware prefetch method that allows us to utilize the error-free victim cache to achieve a further reduction in cache misses due to faulty resources. Experimental results show that the proposed method gives an average 32.6% reduction in cycles per instruction at an error rate of 0.2% with a small area overhead of 8.2%.