
ISSCC 2017 / SESSION 3 / DIGITAL PROCESSORS / OVERVIEW

Session 3 Overview: Digital Processors


DIGITAL ARCHITECTURES AND SYSTEMS SUBCOMMITTEE

Session Chair: Thomas Burd, AMD, Sunnyvale, CA
Session Co-Chair: James Myers, ARM, Cambridge, United Kingdom

Subcommittee Chair: Byeong-Gyu Nam, Chungnam National University, Korea

Digital processors continue to diversify in scope, utilizing a variety of process technologies, application-specific architectures, power
management techniques, and heterogeneous processors integrated on a single die. The first two papers cover next-generation high-
performance POWER and x86 CPUs, followed by ISSCC's first high-density FPGA. Next comes the first highly integrated,
power-optimized mobile SoC in 10nm, an automotive microcontroller, and a MIMO baseband chip. The final paper is a vision
processor for autonomous drones.

1:30 PM
3.1 POWER9™: A Processor Family Optimized for Cognitive Computing with 25Gb/s Accelerator Links and 16Gb/s PCIe Gen4
C. Gonzalez, IBM, Yorktown Heights, NY
In Paper 3.1, IBM describes the 8-billion-transistor 24-core POWER9™ processor implemented in a 14nm SOI
FinFET technology, featuring 48 lanes of PCIe Gen4 and 48 lanes of 25Gb/s links.

2:00 PM
3.2 Zen: A Next-Generation High-Performance x86 Core
T. Singh, AMD, Austin, TX
In Paper 3.2, AMD presents the next-generation high-performance x86 core, implemented in a 14nm FinFET
technology achieving 40% higher instructions-per-clock-cycle than the previous-generation processor.

2:30 PM
3.3 A 14nm 1GHz FPGA with 2.5D Transceiver Integration
D. Greenhill, Intel, San Jose, CA
In Paper 3.3, Intel presents a 17-billion-transistor 1GHz FPGA in a 14nm technology, with up to six 20nm
transceiver chips integrated within a 2.5D embedded bridge package.


3:15 PM
3.4 A 10nm FinFET 2.8GHz Tri-Gear Deca-Core CPU Complex with Optimized Power-Delivery Network for
Mobile SoC Performance
E. Wang, MediaTek, Hsinchu, Taiwan
In Paper 3.4, MediaTek describes a 10nm SoC featuring three different ARMv8-A microarchitectures in a tri-
cluster, deca-core configuration.

3:45 PM
3.5 A 40nm Flash Microcontroller with 0.80μs Field-Oriented-Control Intelligent Motor Timer and
Functional Safety System for Next-Generation EV/HEV
H. Kimura, Renesas Electronics, Tokyo, Japan
In Paper 3.5, Renesas describes a 40nm microcontroller with intelligent motor-timer hardware control that
realizes 0.8μs field-oriented-control execution and a functional safety mechanism for EV/HEV motor control.

4:15 PM
3.6 A 60pJ/b 300Mb/s 128×8 Massive MIMO Precoder-Detector in 28nm FD-SOI
H. Prabhu, Lund University, Lund, Sweden
In Paper 3.6, Lund University presents a 128×8 massive MIMO baseband implementation achieving 60pJ/b at a
300Mb/s detection rate, implemented in 28nm FD-SOI.

4:45 PM
3.7 A 1920×1080 30fps 2.3TOPS/W Stereo-Depth Processor for Robust Autonomous Navigation
Z. Li, University of Michigan, Ann Arbor, MI
In Paper 3.7, the University of Michigan presents a 30fps HD stereo-vision processor with 512 levels of depth
perception, featuring a deep pipeline and high-bandwidth custom SRAMs to achieve a 5.8× energy-efficiency
improvement.



ISSCC 2017 / SESSION 3 / DIGITAL PROCESSORS / 3.1

3.1 POWER9™: A Processor Family Optimized for Cognitive Computing with 25Gb/s Accelerator Links and 16Gb/s PCIe Gen4

Christopher Gonzalez1, Eric Fluhr2, Daniel Dreps2, David Hogenmiller2, Rahul Rao3, Jose Paredes2, Michael Floyd2, Michael Sperling4, Ryan Kruse2, Vinod Ramadurai2, Ryan Nett2, Saiful Islam2, Juergen Pille5, Donald Plass4

1IBM, Yorktown Heights, NY
2IBM, Austin, TX
3IBM, Bangalore, India
4IBM, Poughkeepsie, NY
5IBM, Boblingen, Germany

Cognitive computing and cloud infrastructure require flexible, connectable, and scalable processors with extreme IO bandwidth. With 4 distinct chip configurations, the POWER9 family of chips delivers multiple options for memory ports, core thread counts, and accelerator options to address this need. The 24-core scale-out processor is implemented in 14nm SOI FinFET technology [1] and contains 8.0B transistors. The 695mm² chip uses 17 levels of copper interconnect: 3-64nm, 2-80nm, 4-128nm, 2-256nm, 4-360nm pitch wiring for signals and 2-2400nm pitch wiring levels for power and global clock distribution. Digital logic uses three thin-oxide transistor Vts to balance power and performance requirements, while analog and high-voltage circuits eliminated thick-oxide devices, providing process simplification and cost reduction. By leveraging the FinFET's increased current per area, the base standard cell image shrunk from 18 tracks per bit in planar 22nm to 10 tracks per bit in 14nm, providing additional area scaling.

The POWER9 C4 array contains 19638 bumps: 2359 signals, 7370 power, and 9909 ground pins. There are 10 input voltages as shown in Fig. 3.1.2: core/cache logic (Vdd), cache arrays (Vcs), nest logic (Vdn), PCIe/25G/SMP (Vio), DDR (VDDR), I2C/SPI (VI2C), DPLL voltage supply (VDPLL), analog circuitry (VAVDD), stand-by logic (Vsb), and a high-precision reference (Vref); and the chip contains 48.5μF of deep-trench decoupling capacitance. The core and L2/L3 cache regions are divided into quads, which consist of 4 cores plus associated cache, and can use internal voltage regulators (iVRM) to control voltage. The iVRM uses an optimized multi-sector, multi-sense scheme that divides each core into 4 virtual regions, which are separately monitored and adjusted to the target. This allows for a 50% reduction in effective dropout voltage versus POWER8™ [2]. The iVRM can adjust voltage 10× faster by operating open-loop for a limited number of cycles until the output voltage target is met. In addition, each core can be individually power gated via distributed PFET headers.

The redesigned POWER9 cores are constructed using modular, execution-slice building blocks, where each 64b slice contains an arithmetic and a load/store unit. Figure 3.1.1 illustrates 24, 4-way simultaneous multithreaded (SMT) cores constructed from 4 64b slices. Each core has a 32KB L1 instruction cache and a 32KB L1 data cache, which supports up to 4 load or store instructions per cycle. Compared to POWER8, the pipeline length was reduced by 5 cycles and the fixed-point and floating-point execution units were merged to optimize data-type exchange. Every 2 cores share a 512KB L2 cache, supported by a 120MB shared L3 cache. Other chip variants use 8 64b slices to create 12 SMT8 cores per chip, each with a private 512KB L2 cache.

There are 4 types of customized RAM cells on chip: core SRAMs use a performance-optimized 0.102μm² cell, dense SRAMs use a leakage-optimized 0.102μm² cell, the compilable SRAM system uses an 8T 0.143μm² cell, and the L3 cache uses a 0.0174μm² eDRAM cell. Custom and compilable register files use ground-rule-clean cells. Two write-assist schemes lower the core SRAM minimum operating voltage (Vmin) and allow SRAMs and digital logic to share one voltage supply. Macros with banked read and write use a voltage-collapse assist scheme, where the cell voltage supply is lowered during a write operation, while non-banked designs use a negative-bitline assist technique where the bitline is capacitively coupled below ground to provide more write margin. All large arrays and register files implement column and row redundancy, and each 10MB of L3 cache has 2 fully redundant eDRAM blocks to improve yield.

As depicted in Fig. 3.1.2, the clock uses 2 system reference clocks and has 17 PLL/DLLs controlling 58 domains across the chip. Each of the 6 quads contains 7 synchronous meshes. Six 1:1 meshes are resonant using pulsed mesh buffers and the seventh is 2:1 non-resonant. The pulsed mesh buffers have 4 programmable pulse widths that are selected based on frequency. This addition of pulsed buffers reduces power by 10% over the POWER8 resonant design. Active de-skew circuitry continuously aligns all of the running meshes, while accommodating dynamic frequency and voltage scaling (DVFS), powering on/off the resonant and pulsed meshes, and other adaptive clocking frequency adjustments. Clock mesh nodes at the sensors are aligned with a skew of less than 15ps. The clock design supports both synchronous and asynchronous interfaces with DDR. Delay circuits are programmed at power-on to align 5 clock meshes in DDR synchronous mode. The clock design can also operate using a single system reference clock, where the second reference clock is created on-chip by a PLL that outputs the required spread-spectrum clock. To reduce load on the global clock mesh, the local clock buffer (LCB) circuits were redesigned to drive 37% more latches. This resulted in an 18% improvement in the chip-average number of latches per LCB compared to POWER8.

To maximize performance, the power management system leverages power headroom due to workload variation by deterministically raising frequency within system thermal and power supply constraints. In addition to an embedded PowerPC core (OCC), POWER9 includes 20 distributed reduced-ISA microcontrollers (PPE) dedicated to managing runtime power-performance efficiency. The chip has 63 digital thermal sensors and 30 voltage droop monitors, which feed into DVFS decisions. Fig. 3.1.3 shows the control feedback loops regulating both voltage and frequency. The new Instant-On idle power state improves core wakeup latency by 10× over POWER8.
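
The control feedback loops of Fig. 3.1.3 are only summarized here. As a rough illustration of how droop and thermal telemetry can feed a DVFS decision, the following Python sketch models one control step; the function name, thresholds, and step sizes are hypothetical and are not taken from the POWER9 design.

```python
# Hedged sketch of a DVFS control step driven by thermal and droop telemetry.
# Thresholds, step sizes, and frequency limits are illustrative only.
def dvfs_step(freq_mhz, temps_c, droops_mv, thermal_limit_c=85.0,
              droop_limit_mv=50.0, step_mhz=33, fmin=2000, fmax=4000):
    """Return the next frequency target given sensor readings."""
    hot = max(temps_c) > thermal_limit_c          # any thermal sensor over limit
    droopy = max(droops_mv) > droop_limit_mv      # any droop monitor over limit
    if hot or droopy:
        freq_mhz = max(fmin, freq_mhz - step_mhz) # back off frequency
    else:
        freq_mhz = min(fmax, freq_mhz + step_mhz) # exploit available headroom
    return freq_mhz

# Example: 63 thermal sensors and 30 droop monitors feeding one decision.
print(dvfs_step(3500, temps_c=[70.0] * 63, droops_mv=[20.0] * 30))
```
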
The POWER9 scale-out chip features 48 lanes of 25Gb/s PHY to support the next-generation NVLink™ protocol, enabling GPU acceleration with 7-10× the bandwidth of an industry-standard PCIe Gen3 connection. It also supports the OpenCAPI™ protocol, which has up to 48 interfaces to enable FPGA or ASIC acceleration across an open interface with minimal latency and up to 200GB/s of bi-directional bandwidth. The transmitter uses AC-boosting to effectively increase the launch voltage by 16% during high-frequency operation and the receiver uses 1 tap of decision feedback equalization (DFE) in the 7pJ/b link. POWER9 includes 48 lanes of 16Gb/s PCIe Gen4, providing a 3× bandwidth improvement over POWER8 with a 15% power increase per lane. The connection is also used for the next-generation CAPI 2.0 interface. As shown in Fig. 3.1.4, the chip has 8 ports of direct-attach 2.667GB/s DDR4 connecting up to 16 RDIMMs per socket, and the SMP interconnect consists of variable-speed differential links with a top speed of 16Gb/s at 5pJ/b using up to 12 taps of DFE when needed. The POWER9 scale-up chip IO configuration is optimized for larger SMP systems by replacing the 8 DDR4 channels with 9.6Gb/s memory links interfacing to memory buffer chips. Two additional 25G Link channels can function as either acceleration links or SMP busses, increasing total off-chip bandwidth to 12.9Tb/s.

Under nominal process and running a TDP workload, the chip power distribution is 80% AC and 20% DC. Leakage current is minimized by using less than 0.5% low-Vt gate width to meet timing objectives. Figure 3.1.5 shows that 73% of the power is consumed in the cores and caches. First-pass silicon hardware data is shown in Fig. 3.1.6.

References:
[1] C-H. Lin, B. Greene, et al., "High Performance 14nm SOI FinFET CMOS Technology with 0.0174μm² Embedded DRAM and 15 Levels of Cu Metallization," IEDM, pp. 3.8.1-3.8.2, 2014.
[2] B. Thompto, "POWER9: Processor for the Cognitive Era," Hot Chips, 2016.
[3] E. Fluhr, J. Friedrich, et al., "POWER8™: A 12-Core Server-Class Processor in 22nm SOI with 7.6Tb/s Off-Chip Bandwidth," ISSCC, pp. 96-97, 2014.


Figure 3.1.1: Annotated POWER9 die photo.
Figure 3.1.2: POWER9 clock distribution.
Figure 3.1.3: POWER9 power management system.
Figure 3.1.4: POWER9 scale-out chip interfaces.
Figure 3.1.5: Chip power breakdown.
Figure 3.1.6: Frequency vs. voltage.


Figure 3.1.7: POWER9 die photo.



ISSCC 2017 / SESSION 3 / DIGITAL PROCESSORS / 3.2

3.2 Zen: A Next-Generation High-Performance x86 Core

Teja Singh1, Sundar Rangarajan1, Deepesh John1, Carson Henrion2, Shane Southard1, Hugh McIntyre3, Amy Novak1, Stephen Kosonocky2, Ravi Jotwani1, Alex Schaefer1, Edward Chang2, Joshua Bell1, Michael Co1

1AMD, Austin, TX
2AMD, Fort Collins, CO
3AMD, Sunnyvale, CA

Codenamed Zen, AMD's next-generation, high-performance x86 core targets server, desktop, and mobile client applications. Utilizing GlobalFoundries' energy-efficient 14nm LPP FinFET process, the 44mm² Zen core complex unit (CCX) has 1.4B transistors and contains a shared 8MB L3 cache and four cores (Fig. 3.2.7). The 7mm² Zen core contains a dedicated 0.5MB L2 cache, 32KB L1 data cache, and 64KB L1 instruction cache. Each core has a digital low-drop-out (LDO) voltage regulator and digital frequency synthesizer (DFS) to independently vary frequency and voltage across power states.

The scalable single Zen core combines both low power and high performance to replace AMD's current two-core portfolio. The built-from-scratch Zen architecture improves instructions per clock cycle by 40% [1] without increasing power over Excavator (XV), and introduces simultaneous multi-threading, allowing eight active threads per CCX. Zen increases the issue width and execution resources by 150% and the instruction scheduler window by 175% over XV. The 168-entry integer register file has 12 read and six write ports. The integer unit can execute four ALU operations and two AGU operations, while the 128b FPU can execute two MUL and two ADD operations. The L2 cache supports an overall bandwidth of 32B/cycle in each direction, and the L2 latency is reduced vs. the previous generation. The L3 operates with all cores powered down and flushes itself, which proves invaluable in multiple-CCX SoC configurations. L3 cache bandwidth is 32B/cycle in each direction for a single core, or 128B/cycle in each direction for four cores. The L3 includes a duplicated L2 tag in a power-optimized structure to filter transactions to the core. Single-thread power ranges from <1W to 8W as Zen reduces AC capacitance (Cac) by >15% over XV for an average workload similar to the SpecInt06 benchmark. Emphasizing power efficiency, the team carefully optimized Cac across various workloads and process points. Zen adds an operation cache that stores decoded instructions, which increases ops/cycle and saves power by reducing effective pipeline length.

Zen uses an 11-layer telescoping metal stack combining thin metals for lower-level density, while utilizing wider metals for critical signals and clocks. An additional low-resistance tall aluminum layer provides connections to ESD protection and power redistribution necessary to support power headers. The FinFET's increased current density drives careful cell placement and a more robust power grid.

Zen timing is optimized across a wide voltage range to support fanless client and high-end desktop applications (Fig. 3.2.1). The achieved Vmax is considerably higher than the nominal process voltage and requires detailed analysis of gate and intra-dielectric breakdown. The core's power supply is generated through two PFET header channels located at the top and bottom of each core. These channels are used for both power gating and voltage regulation through the digital LDO (Fig. 3.2.2). The digital LDO consists of a high-precision slow loop utilizing a digital compensator and power supply monitor (PSM) for voltage monitoring, and a fast loop with a high-speed droop detector to provide charge injection in response to worst-case current transients.

Zen CCX achieves low Vmin by implementing wordline boost circuitry (Fig. 3.2.3) for the L1 macros and powering the L2 and L3 bitcells on a separate memory voltage plane. The L1 boost circuit supports fuse-controllable overdrive options to reduce the product Vmin. The system management unit (SMU) controls the boost circuit to activate only at low voltages to maintain reliability. To avoid circuit contention for improved low-voltage performance, the L1 local bitlines use delay-onset keepers [2] and the global bitlines employ a contention-free read dynamic circuit. Further robustness at Vmin is achieved by aggressively avoiding circuits with poor roll-off or high variation at low voltages.

A shared PLL per CCX and a fine-grain DFS allow each core and the L3 cache to operate at different frequencies, while maintaining a synchronous timing interface. The synchronous interface and floorplan optimization reduce the average L3 latency by >30% compared to the previous generation. The fine-grain DFS contains programmable structures that can adjust insertion delay up to 15% of cycle, and duty cycle up to 5% of cycle, for silicon tuning.

Global clock construction leverages configurable custom clocking cells and signal pre-routes to build a skew- and power-optimized recombinant mesh. Zen employs coarse-grain clock gating with an efficiency >50%. Coarse gating and optimized gater cloning reduce clock power as a percentage of total power by 30% over Jaguar [3] in average workloads like the SpecInt06 benchmark. L3 active power is reduced 35% for average workloads like the SpecInt06 benchmark and 60% during idle by clock gating the mesh. Excessive voltage guardbands are reduced through per-core regulation using the LDO in conjunction with an AVFS [4] system which runs on the SMU, utilizing frequency-to-PSM curves fused at test time to determine the optimal per-core voltage for a given target frequency.
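
The paper does not give the format of the fused frequency-to-PSM curves. A minimal sketch of the idea, assuming a per-core table of (frequency, voltage) calibration points fused at test time and interpolated at run time, is shown below; the table values and names are illustrative only.

```python
import bisect

# Hypothetical fused calibration data: per-core (frequency MHz, voltage V) points.
FUSED_CURVE = {0: [(1000, 0.70), (2000, 0.85), (3000, 1.00), (3600, 1.15)]}

def target_voltage(core, freq_mhz):
    """Linearly interpolate the fused curve to pick a per-core voltage target."""
    pts = FUSED_CURVE[core]
    freqs = [f for f, _ in pts]
    i = bisect.bisect_left(freqs, freq_mhz)
    if i == 0:
        return pts[0][1]
    if i == len(pts):
        return pts[-1][1]
    (f0, v0), (f1, v1) = pts[i - 1], pts[i]
    return v0 + (v1 - v0) * (freq_mhz - f0) / (f1 - f0)

print(round(target_voltage(0, 2500), 3))  # e.g. 0.925 V for a 2.5GHz request
```
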
Zen CCX incorporates several circuit solutions to mitigate frequency loss due to power supply droop events. An integrated power supply droop detector (DD) detects voltage droops that trigger a coarse-grain DFS for a short amount of time with a low-latency response time. Secondly, the DD triggers the digital LDO to turn on more drivers, limiting the droop longer term. Lastly, the DD triggers a fine-grain DFS to reduce clock frequency as a percentage of the cycle for the event duration (Fig. 3.2.4).

The team relies on standard place-and-route tools carefully tuned for a high-performance design. Zen is partitioned into blocks of <0.7M instances to minimize turnaround time, while maintaining high frequencies and area efficiency. Critical interfaces and global interconnects are manually placed and constrained to yield high-quality, repeatable results. Each block has a mix of preplaced structures and standard place-and-route logic with tuned optimization recipes. Extensive use of latch and flop arrays, with custom-structured read muxes, reduces area, power, and route congestion in regular storage queues and structures. Zen utilizes 3 primary Vt types with longer-length variants that allow for aggressive swapping algorithms to reduce leakage power (Fig. 3.2.5).

The available sequential cell palette is rich, with a full set of low-power to high-speed flipflops in both inverting and non-inverting variants. The rich flop library gives designers granularity to close tough critical paths. The fastest flop achieves a 7% frequency advantage at roughly 1.7× the cell dynamic power (Fig. 3.2.6). The Zen CCX is scalable for low-power and high-performance market segments and provides substantially better performance/W over previous-generation AMD cores.

Acknowledgements:
The authors would like to acknowledge AMD's talented Zen design team for their contributions.

References:
[1] M. Clark, "A New x86 Core Architecture for the Next Generation of Computing," Hot Chips, 2016.
[2] R. Jotwani, S. Sundaram, et al., "An x86-64 Core Implemented in 32nm SOI CMOS," ISSCC, pp. 106-107, 2010.
[3] T. Singh, J. Bell, et al., "Jaguar: A Next-Generation Low-Power x86-64 Core," ISSCC, pp. 52-53, 2013.
[4] A. Grenat, S. Sundaram, et al., "Increasing the Performance of a 28nm x86-64 Microprocessor Through System Power Management," ISSCC, pp. 74-75, 2016.

Figure 3.2.1: Zen CCX optimization for different market segments.
Figure 3.2.2: Digital LDO implementation.
Figure 3.2.3: L1 cache wordline boost and contention-free dynamic circuit.
Figure 3.2.4: Zen CCX clock stretch block diagram.
Figure 3.2.5: Power breakdown and Vt usage per core.
Figure 3.2.6: Flop library breakdown.


Figure 3.2.7: Zen CCX die photo.



ISSCC 2017 / SESSION 3 / DIGITAL PROCESSORS / 3.3

3.3 A 14nm 1GHz FPGA with 2.5D Transceiver Integration

David Greenhill1, Ron Ho1, David Lewis2, Herman Schmit1, Kok Hong Chan1, Andy Tong1, Sean Atsatt1, Dana How1, Peter McElheny1, Keith Duwel1, Jeffrey Schulz1, Darren Faulkner3, Gopal Iyer1, George Chen1, Hee Kong Phoon4, Han Wooi Lim4, Wei-Yee Koay4, Ty Garibay3

1Intel, San Jose, CA
2Intel, Toronto, Canada
3Intel, Austin, TX
4Intel, Penang, Malaysia

A Field-Programmable Gate Array (FPGA) family was designed to match a programmable fabric die built in 14nm process technology with 28Gb/s transceiver dice. The 2.5D packaging (Fig. 3.3.1) uses embedded interconnect bridges (EMIB) [1]. 20nm transceivers were reused, enabling a transceiver roadmap independent of the FPGA fabric. Fig. 3.3.2 shows a 560mm² fabric die and six transceiver dice. The programmable fabric contains 2.8M logic elements, DSP, memory components, and routing interconnect operating at up to 1GHz. Applications drove the need for improved flexibility and security of the FPGA configuration system. A triple-modular-redundant microprocessor-based secure device manager (SDM) was designed and is programmed by embedded software.

The FPGA die includes a pipelined routing fabric, IO columns, and a quad-core ARM A53 subsystem. Pipelined routing has previously been investigated in the context of pass-transistor architectures [2], with modest performance gain at significant cost. We used a pulse latch embedded into the internals of the routing multiplexer driver shown in Fig. 3.3.3. Two transistors are added to the first-stage buffer to selectively enable it, and a minimum-sized retention latch to the middle of the routing driver. A clock multiplexer selects from one or two clocks, or puts the latch in combinational mode. The provision of a storage element in every routing mux means that there are sufficient stages to deeply pipeline user designs [3].
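
The routing multiplexer with its embedded pulse latch can act either as a pipeline stage or as a purely combinational route, selected per mux. The following behavioral Python sketch illustrates that configurability; it is a conceptual model under assumed naming, not the actual circuit.

```python
class RoutingMux:
    """Behavioral model of a routing mux with an optional embedded latch."""
    def __init__(self, registered=True):
        self.registered = registered   # False = combinational (latch transparent)
        self.state = 0                 # retained value when registered

    def evaluate(self, inputs, select, clk_pulse):
        value = inputs[select]         # pass-gate multiplexing of routing inputs
        if not self.registered:
            return value               # combinational mode: no pipeline stage
        if clk_pulse:                  # pulse latch captures on the clock pulse
            self.state = value
        return self.state

# Example: the same mux used as a pipeline stage vs. pure routing.
piped = RoutingMux(registered=True)
print(piped.evaluate([0, 1, 0, 1], select=1, clk_pulse=True))   # captures 1
print(piped.evaluate([0, 0, 0, 0], select=1, clk_pulse=False))  # holds 1
print(RoutingMux(registered=False).evaluate([0, 1], select=0, clk_pulse=False))
```
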
The product family was built using standard EDA tools and file formats wherever possible. Due to the unique circuit topologies and programming requirements of FPGAs, we augmented industry tools with in-house flows. The highly regular, modular floorplan and routing required flows for design planning and hierarchical integration (Fig. 3.3.4). These in-house flows produced industry-standard file formats. Design collateral transformations were automated to integrate design information into the FPGA programming software. The design methodology included best practices to ensure timing will be accurately modeled for end-user FPGA programming. Careful attention was given to alignment between the user's programming view of functional boundaries and the custom layout. In the case of the routing fabric, many pass-gates are used to implement multiplexing between logic elements. Care was taken to determine the optimal cell boundary for timing characterization of the fabric logic. Decision criteria included accuracy of the output model and effects on capacity and runtime for user programming. Timing boundaries are not convenient layout rectangles, as would be expected for standard-cell-based designs. Instead, timing cells may correspond to irregular physical boundaries, allowing tight packing of custom logic. Special care is given to the extraction process to minimize accuracy losses.

The 14nm technology utilizes 2nd-generation 3D tri-gate transistors. The transistor fins are taller, thinner, and more closely spaced for improved density and lower capacitance. Fewer fins are required, further improving density. The SRAM cell size was almost half the area of that in the previous generation, enabling high density for SRAM macros and the 550Mb of configuration RAM (CRAM). Fig. 3.3.1 provides technology details.

2.5D interconnect required a new interface bus named AIB. It implemented an area- and power-efficient multi-die connection between the FPGA fabric and transceiver die (Fig. 3.3.5). The AIB supports independent source-synchronous clocks per channel and independent FPGA fabric clocks. There are 2500 micro-bumps per die connected by EMIB. The chip-to-chip connection is constructed with a layered protocol model, including the physical layer (PHY), protocol layer, and interfaces to the FPGA fabric and transceivers. The AIB PHY supports several IO signaling schemes: double-data-rate (DDR) signaling for high-speed data transfer, single-data-rate (SDR) signaling for control signals, and direct non-registered signaling. DDR mode runs at 2Gb/s per pin, a power of 1.2pJ/b per die, and a bandwidth density of 1.5Mb/s/μm². EMIB acts as a lumped RC interconnect allowing CMOS rail-to-rail unterminated signaling. Output and input buffers are CMOS inverters, with static drive-strength programmability. A digital duty-cycle correction circuit with 20% input correction range and 2% residual error improved transport clock quality. A digital delay-locked loop (DLL) was designed to center the sampling strobe for the data.

With 17B transistors in the FPGA die, a new way to control and contain complexity was required. The chip was defined as an array of self-contained mini-FPGA sectors. The floorplan is fully modular and facilitates two critical operations: design and validation of a single sector as if it were a full chip, and the proliferation of family members with known timing and clocking (Fig. 3.3.4). The clock architecture was a principal driver for sectors; the clocks provide the bounding edges of the sectors [4]. FPGA management functions include initialization, configuration, test, redundancy, and SEU scrubbing. These functions run on the SDM control processor and more than 100 Local Sector Manager (LSM) processors, distributed in sector and IO subsystems and connected by a Configuration Network on Chip (CNoC) (Fig. 3.3.6). Software loaded during configuration replaced centralized, tightly coupled finite state machines. Moving control from hardware to software reduced risk, improved flexibility, and allowed parallelization, increasing configuration bandwidth. During development, each sector and IO subsystem was simulated via standard interfaces, accelerating validation by two orders of magnitude from previous methodologies.

IO can be configured to support DDR3/4, LPDDR3, RLDRAM, QDR SRAM, TCAM, LVDS, and GPIO standards, all sharing the same pins. The IO reside in two columns embedded in the FPGA fabric die. Each column is organized into 12 tiles of 48 IO pins each, for over 1100 memory channels per chip. Each tile contains a highly configurable PLL, clock distribution trees for the PHY and DQS, four DLLs (one per lane of 12 IO pins), and a memory controller. The FPGA fabric connects across the IO columns using multiple feedthroughs, which are assigned during FPGA programming. Continued compatibility with LVDS and DDR3 protocols required support for 1.8V signals, requiring optimizations for a robust thick-oxide transistor. DDR4 interfaces running at 2666Mb/s per pin can range from ×1 to ×72 widths. Configuration of the memory system is handled by a microprocessor slaved to each IO column. These microprocessors are nodes of the CNoC. Upon power-up, these configuration controllers configure the memory interfaces, and run on-chip termination, duty-cycle correction, read training, and write leveling routines as needed for DDR3/4.

Acknowledgements:
The authors would like to acknowledge the many Intel engineers who contributed to this project.

References:
[1] R. Mahajan, R. Sankman, et al., "Embedded Multi-Die Interconnect Bridge (EMIB) - A High Density, High Bandwidth Packaging Interconnect," IEEE Electronic Components and Technology Conf., pp. 557-565, 2016.
[2] D. Singh and S. Brown, "The Case for Registered Routing Switches in Field Programmable Gate Arrays," ACM FPGA, pp. 161-169, 2001.
[3] D. Lewis, G. Chiu, et al., "The Stratix 10 Highly Pipelined FPGA Architecture," ACM FPGA, pp. 159-168, 2016.
[4] C. Ebeling, D. How, et al., "Stratix 10 High Performance Routable Clock Networks," ACM FPGA, pp. 64-73, 2016.


Figure 3.3.1: Package cross section and technology overview.
Figure 3.3.2: FPGA fabric die and transceiver die.
Figure 3.3.3: Routing fabric circuits.
Figure 3.3.4: Modular floorplan.
Figure 3.3.5: AIB Interconnect.
Figure 3.3.6: Configuration system.


Figure 3.3.7: Fabric die photo.



ISSCC 2017 / SESSION 3 / DIGITAL PROCESSORS / 3.4

3.4 A 10nm FinFET 2.8GHz Tri-Gear Deca-Core CPU Complex with Optimized Power-Delivery Network for Mobile SoC Performance

Hugh Mair1, Ericbill Wang2, Alice Wang2, Ping Kao2, Yuwen Tsai2, Sumanth Gururajarao1, Rolf Lagerquist1, Jin Son1, Gordon Gammie1, Gordon Lin2, Achuta Thippana1, Kent Li1, Manzur Rahman1, Wuan Kuo2, David Yen2, Yi-Chang Zhuang2, Ue Fu2, Hung-Wei Wang2, Mark Peng3, Cheng-Yuh Wu2, Taner Dosluoglu4, Anatoly Gelman4, Daniel Dia2, Girishankar Gurumurthy2, Tony Hsieh2, WX Lin2, Ray Tzeng2, Jengding Wu2, CH Wang2, Uming Ko2

1MediaTek, Austin, TX
2MediaTek, Hsinchu, Taiwan
3MediaTek, San Jose, CA
4Endura Technologies, San Diego, CA

This paper describes logic and circuit design features of a heterogeneous tri-cluster deca-core CPU complex incorporated into a 10nm FinFET mobile SoC for smartphone applications. Similar to Helio X20 [1], the deca-core compute function contains three separate clusters of ARMv8-A CPUs. The high-performance (HP) cluster is updated to incorporate the most power-efficient out-of-order Cortex-A73 CPU, operating at a max frequency of 2.8GHz. In X20, the low-power (LP) and ultra-low-power (ULP) clusters used Cortex-A53 with different implementation flows, while this work achieves a +44% more power-efficient ULP solution based on the newer Cortex-A35 CPU (Fig. 3.4.1). In addition, the LP cluster achieves +36% more performance than the ULP cluster or +40% more power-efficiency than the HP cluster, for optimal sustainable performance/power applications including augmented reality and virtual reality (AR/VR). A die photograph, Fig. 3.4.7, highlights the three CPU clusters.

Two new logic design blocks assist software debug: Internal Register Copy (IRC) and Precise Last Program Counter (PLPC) (Fig. 3.4.2). Under a system-hang condition, a watchdog circuit issues a system-wide reset. IRC intercepts the reset signal and halts the clocks to the CPUs. Scan chains are then utilized to shift the entire register contents (state) of the CPUs to an internal SRAM located outside of the CPU cluster. After reset, system boot code copies the internal SRAM contents to external flash storage for post-crash analysis. In many software and hardware debug scenarios, knowing the exact address of the last committed instruction, the PLPC, can dramatically reduce debug time; however, in an out-of-order CPU, finding the PLPC can be difficult, since no single bank of registers in the CPU contains this information. In addition, many CPU implementations do not propagate the full instruction address to the commit (outgoing instruction) stage. Therefore, a combination of hardware and software was devised to find the PLPC. The hardware component is an SRAM-based circular buffer which keeps a trace of decode (incoming instruction) addresses in program order. Software then uses the contents of the circular buffer in conjunction with IRC data to calculate the PLPC during post-crash analysis.

An emerging feature of smartphone CPUs is the ability to maximize performance within a fixed power budget [1-2], which requires continuous monitoring of the power being consumed by the CPUs, where transistor voltage is a primary factor. Due to complex DVFS and system operating scenarios, the applied voltage for a given CPU is often not readily available; therefore, the 10nm FinFET CPUs integrate continuous-time voltage monitoring. Shown in Fig. 3.4.3, the voltage monitor circuit utilizes switched-capacitor DACs and a StrongARM latch [3] as part of a digital feedback loop. Linear up/down counting of the signal DAC code is used to converge the signal DAC output (given as VDDSENSE*Code/255) to match a pre-defined ratio of the reference voltage (VREF/2). The switched-capacitor DAC arrays comprise 3 thermometer-coded bits and 5 binary bits driving a 256-segment array of capacitors. By supplying the signal DAC with only VDDSENSE, data-dependent switching noise is eliminated from the reference voltage; furthermore, the switching noise of the reference DAC is constant, eliminating low-frequency noise. Also, by implementing the reference DAC using the same switched-capacitor network as the signal DAC, any inaccuracy due to parasitic capacitance (Cp) is cancelled at the converged voltage point (VREF/2). Finally, a lookup table converts the signal DAC code to a resulting measured voltage code. Transient performance is measured on silicon and is shown in Fig. 3.4.3, where a max-power pattern is executed for ~10μs from the idle state. Measured waveforms of voltage and power show both third droop from the PMIC response of ~300kHz and static IR drop of ~20mV. Similar impairments are seen on the measured current, which is internally computed based on logic activity and measured voltage.
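
Because the loop converges when VDDSENSE*Code/255 equals VREF/2, the measured supply can be read back from the converged code as VDDSENSE = 255*VREF/(2*Code), which is what the lookup table realizes. A small Python sketch of this convergence and read-out follows; the numeric values are illustrative, not silicon data.

```python
def converge_code(vdd_sense, vref, code=128):
    """Linear up/down counting of the signal-DAC code until
    vdd_sense * code / 255 matches vref / 2."""
    for _ in range(512):                       # bounded number of update steps
        dac_out = vdd_sense * code / 255.0
        if abs(dac_out - vref / 2.0) < vdd_sense / 255.0:
            break
        code += 1 if dac_out < vref / 2.0 else -1
    return code

def code_to_voltage(code, vref):
    """Lookup-table style conversion from the converged code to a voltage."""
    return 255.0 * vref / (2.0 * code)

code = converge_code(vdd_sense=0.80, vref=0.60)
print(code, round(code_to_voltage(code, vref=0.60), 3))   # recovers ~0.80 V
```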

Another key focus area for modern CPU design is power supply droop mitigation. While prior work focused on tolerance to droop [4-5], the HP CPUs in this work integrate an analog circuit block to limit the droop. The circuit, shown in Fig. 3.4.4, utilizes a 1.8V supply to deliver momentary boosting current and charge to the CPU. A bank of five clocked on-die voltage sensors continuously monitors the internal power supply voltage, using an externally supplied 2.5GHz clock. State-based logic in turn monitors the five sensor outputs and, upon detecting a droop occurrence, sends a 12b activation code to the integrated bank of power switches. This switch activation effectively cancels the resonant droop before it reaches the minimum voltage, thus improving VMIN. For effective droop reduction, the entire process from detection to prevention must occur in less than 1ns.

Using an on-die oscilloscope module [1] to monitor both the statistical and transient behavior of the supply inside the CPU, silicon measurements of the 1st droop are made (Fig. 3.4.5). On the transient waveform the droop is improved from -110mV to -72mV; however, using the deeper statistical mode of the oscilloscope, the true worst-case VAVERAGE-to-VMIN droop (considering all power supply effects) is improved from -141mV to -103mV, an improvement of 38mV. Note: the 10nm implementation of the oscilloscope has a voltage resolution of 9.5mV. This 38mV improvement is directly translated into either increased frequency and performance, or reduced power consumption.

One of the significant challenges to logic performance in 10nm pertains to the Mid-End of Line (MEOL) and higher resistance between transistors and interconnect [6]. To mitigate this impact, a new standard cell topology was created. As shown in Fig. 3.4.6 for an example topology of a 2× inverter, additional transistors are placed to the left and right of the active transistors in order to reduce their source resistance. The lowered source resistance and resulting performance benefit are achieved by using a MEOL connection to short from the source of the active transistors across the source and drain to the respective power rail. Since these additional transistors have gate/drain/source shorted, they do not add to device leakage or input loading. This technique is applied to inverters, buffers, and commonly used combinational cells. The double-source cells are over 5% faster, and 18% larger on average, than conventional counterparts. Area impact from the larger double-source cells is managed by restricting their use to only the most critical timing paths.

In summary, a highly advanced 10nm deca-core CPU cluster with integrated power and performance enhancement circuits is presented. Integration of three distinct CPU microarchitectures further advances tri-gear performance/power efficiency, while added logic design elements assist software debug. New circuit blocks are integrated to improve system power budget management, and to drive continued advancement in both the characterization of, and reduction in, power supply 1st droop, resulting in a 38mV VDD improvement. Finally, a novel standard cell approach is used to mitigate the MEOL performance penalty in the 10nm era and drive continued clock frequency improvement.

References:
[1] H. Mair, et al., "A Tri-Cluster CPU Complex from a Highly Integrated Mobile SoC Featuring Adaptive Power Allocation and On-Die Oscilloscope," ISSCC, pp. 76-77, 2016.
[2] V. Krishnaswamy, et al., "Fine-Grained Adaptive Power Management of the SPARC M7 Processor," ISSCC, pp. 74-75, 2015.
[3] Y. T. Wang and B. Razavi, "An 8-bit 150-MHz CMOS A/D Converter," JSSC, vol. 35, pp. 308-317, 2000.
[4] K. Bowman, et al., "A 22nm All-Digital Dynamically Adaptive Clock Distribution for Supply Voltage Droop Tolerance," JSSC, vol. 48, no. 4, pp. 907-916, 2013.
[5] H. Mair, et al., "A Highly Integrated Smartphone SoC Featuring a 2.5GHz Octa-Core CPU with Advanced High-Performance and Low-Power Techniques," ISSCC, pp. 424-425, 2015.
[6] R. Pandey, N. Agrawal, R. Arghavani, and S. Datta, "Analysis of Local Interconnect Resistance at Scaled Process Nodes," Device Research Conference, p. 184, 2015.


Figure 3.4.1: DVFS power efficiency curves for 3 CPU types.
Figure 3.4.2: Software debug features block diagram.
Figure 3.4.3: Voltage monitor block diagram.
Figure 3.4.4: First droop mitigation circuit block diagram.


Figure 3.4.5: Measurement of first droop improvement.
Figure 3.4.6: Schematic and layout of double-source inverter.


Figure 3.4.7: Die photo.



ISSCC 2017 / SESSION 3 / DIGITAL PROCESSORS / 3.5

3.5 A 40nm Flash Microcontroller with 0.80μs Field-Oriented-Control Intelligent Motor Timer and Functional Safety System for Next-Generation EV/HEV

Hayato Kimura1, Hideyuki Noda1, Hisaaki Watanabe1, Takashi Higuchi1, Ryosaku Kobayashi2, Masayuki Utsuno1, Fumitake Takami1, Sugako Otani1, Masayuki Ito1, Yasuhisa Shimazaki1, Naoki Yada1, Hiroyuki Kondo1

1Renesas Electronics, Tokyo, Japan
2Renesas System Design, Tokyo, Japan

Electric Vehicle / Hybrid-Electric Vehicle (EV/HEV) technologies are becoming increasingly important for realizing fuel-efficient vehicles. In particular, maximizing the energy efficiency of motors, while enhancing power and speed, requires high-performance semiconductor devices, such as microcontrollers (MCU) and power devices. However, high-performance MCUs suitable for EV/HEV applications have yet to be reported, while attractive power devices, such as SiC devices, are becoming commonplace [1]. In this paper, we have designed an MCU architecture with an intelligent motor-timer system (IMTS), including 0.8μs high-speed field-oriented-control (FOC) processing and an angle-tuning circuit (ATC). We have also designed a functional safety mechanism essential for automotive use [2].

A general EV/HEV motor-control system, consisting of motors, resolver sensors for motor angle detection, inverters/pre-drivers using power devices, and an MCU, is shown in Fig. 3.5.1. As it requires parallel control for 2 motors (the drive motor and the regeneration motor), both a high-performance CPU core and peripheral I/O circuits corresponding to the motor system are needed in the MCU. A CPU core with the lock-step dual-core (LSDC) system is employed to satisfy both high-performance computing and functional safety requirements. The LSDC system is a promising architecture for ensuring accuracy of computed results, because the two CPU cores process the same program and compare their outputs continuously [3]. In terms of peripheral I/O circuits, A/D converters (ADC) are used to gauge the inverter current, and a resolver-to-digital converter (RDC) [4] is used to convert resolver signals to the digital motor angle. With this input data, namely the inverter current and the motor angle data, the MCU has to determine the next required states and generate fine-controlled pulse-width modulation (PWM) signals to control the external power devices. For this purpose, we have designed an application-specific circuit system, the intelligent motor-timer system (IMTS), to enhance the efficiency of EV/HEV motors.

Figure 3.5.2 illustrates motor-control system processing. In the EV/HEV motor-control system software hierarchy, the upper layers contain high-level processing, such as system control, functional safety, and advanced motor-control algorithms for maximizing the motor-control efficiency. The bottom layer consists of the FOC [5] and the angle-tuning (AT) algorithm. FOC calculates the PWM duty values for the optimum current control of inverters, according to the current value calculated by the speed-control algorithm located in the upper layer. The AT is effective for adjusting the angle data used in FOC. As shown in the detailed functional block diagram of the low-level processing, the FOC contains compute-intensive functions, including coordinate transformations, such as the Clarke and Park transformations. Such processing must occur with low latency, and as such, we offload the computational work from the CPU to the IMTS.
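
The Clarke and Park transformations used in FOC are standard signal-processing steps; the following short Python reference implementation (not the IMTS hardware) shows the arithmetic the FOC circuit accelerates.

```python
import math

def clarke(i_a, i_b):
    """Clarke transform: two measured phase currents to stationary alpha/beta."""
    i_alpha = i_a
    i_beta = (i_a + 2.0 * i_b) / math.sqrt(3.0)
    return i_alpha, i_beta

def park(i_alpha, i_beta, theta):
    """Park transform: rotate alpha/beta into the rotor frame using the motor angle."""
    i_d = i_alpha * math.cos(theta) + i_beta * math.sin(theta)
    i_q = -i_alpha * math.sin(theta) + i_beta * math.cos(theta)
    return i_d, i_q

# Example: phase currents from the ADC and a rotor angle from the RDC.
print(park(*clarke(1.0, -0.5), theta=math.radians(30.0)))
```
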
The upper side of Fig. 3.5.3 shows the IMTS circuit block diagram, comprising an up-down counter which measures a motor-current control period, a dedicated sequencer, an FOC circuit, a comparator for PWM signal generation, and an ATC. The IMTS sequencer manages the detailed timing of every circuit block, and it is triggered by the up-down counter when its value is at its minimum or maximum. The lower side of Fig. 3.5.3 shows the timing diagram of the IMTS processing flow. First, the sequencer activates the ADC and receives as input the converted current data by detecting the A/D completion timing. It also takes in the angle data at the optimum timing to compensate for the flight-time differences between current and resolver signals due to delays of wire harnesses. After preparing the input data, the FOC circuit is triggered and calculates the PWM duty values, which are converted to the PWM wave signals activated in the next control period. These processes are automatically executed and do not impact the CPU.

Motor-angle data is essential for FOC processing; however, it is very sensitive to slight attachment position errors of resolver sensors, and therefore tuning is desirable. The upper side of Fig. 3.5.4 shows the concept of angle tuning. The line of angle data from the RDC (theta_rdc) fluctuates due to sensor position errors; however, a motor rotation period T(n) is relatively reliable because motor rotation periods are stable among neighboring periods. Considering this, the ATC is designed and incorporated into the IMTS. The angle-tuning example shown in the figure is a simple model utilizing the former rotation period value to predict a current angle value. In the block diagram of the ATC, consisting of a programmable sequencer and accompanying peripheral circuits, the sequencer is triggered by an angle-detection circuit, which observes angle data changes of the RDC outputs. The ATC calculates the tuned angle data by processing the rotation period time (T) and theta_rdc and gives it to the FOC circuit. The programs used for angle tuning can be loaded from an embedded flash memory; therefore, they are adjustable by users. As shown in Fig. 3.5.4 (lower right), angle tuning improves angle data fluctuations.
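
The simple tuning model described above, which uses the previous rotation period to predict the current angle, can be expressed in a few lines. The sketch below is an illustrative reconstruction of the concept, not the ATC microcode; the blending weight is hypothetical.

```python
import math

def predicted_angle(theta_prev, t_elapsed, period_prev):
    """Predict the current angle from the last sample and the previous
    rotation period T(n-1), assuming near-constant speed over one period."""
    return (theta_prev + 2.0 * math.pi * t_elapsed / period_prev) % (2.0 * math.pi)

def tuned_angle(theta_rdc, theta_pred, weight=0.5):
    """Blend the fluctuating RDC angle with the prediction to smooth sensor error."""
    return (1.0 - weight) * theta_rdc + weight * theta_pred

pred = predicted_angle(theta_prev=1.00, t_elapsed=50e-6, period_prev=1e-3)
print(round(tuned_angle(theta_rdc=1.35, theta_pred=pred), 3))
```
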
In terms of functional safety requirements, the calculated results in the IMTS must be checked by other circuits. Double redundancy of the IMTS is the most reliable solution; however, it imposes an intolerable area overhead. Thus, we design a circuit structure corresponding to a functional safety concept, shown in Fig. 3.5.5. The IMTS has redundant registers for diagnosis, and the LSDC CPU checks these registers against software processing of FOC and AT periodically (at long time intervals). A system interrupt is used to trigger the CPU to check the IMTS computation of FOC and ATC, i.e., to check the data in the diagnosis registers. The timing diagram in Fig. 3.5.5 shows the CPU checking the IMTS, while the CPU is mainly engaged in high-level processing such as system control.

The left side of Fig. 3.5.6 shows measured results of the FOC processing time compared with the results of conventional work and the CPU processing in this work. The 160MHz IMTS is 50× faster than conventional results [5] and 7× faster than the 320MHz CPU embedded in the test chip. An EV/HEV system requires considerable CPU computing resources for high-level processing, as discussed in Fig. 3.5.2; therefore, the CPU load percentage is an important indicator for evaluating system capabilities. The right side of Fig. 3.5.6 shows the CPU load percentage in the assumed use cases. As shown in cases A, B, and C (without the IMTS), the CPU load percentage goes up to 65%, 85%, and 130% in combinations of the assumed EV/HEV use cases. As shown in case D, the CPU load percentage dramatically drops to 2.4% with the off-loading effect of employing the IMTS. The condition for functional safety is that the CPU checks 2 sets of FOC and AT data in every 500μs period, which means that the CPU can detect an anomaly within 1 motor rotation, even if the rotation speed is up to 100,000rpm. The remaining CPU computing resources can be assigned to the high-level processing, which allows the efficient motor control essential for an EV/HEV. Fig. 3.5.7 shows the chip micrograph. The essential macros for supporting advanced motor controls for EV/HEV are equipped in one chip, with a flash-embedded MCU fabricated in 40nm CMOS technology.

References:
[1] M. Novak, et al., "An SiC Inverter for High Speed Permanent Magnet Synchronous Machines," IEEE Industrial Elec. Conf., pp. 2397-2402, 2015.
[2] C. Takahashi, et al., "A 16nm FinFET Heterogeneous Nona-Core SoC Complying with ISO26262 ASIL-B: Achieving 10⁻⁷ Random Hardware Failures per Hour Reliability," ISSCC, pp. 80-81, 2016.
[3] A. Hanafi, et al., "Dual-Lockstep Microblaze-Based Embedded System For Error Detection And Recovery With Reconfiguration Technique," IEEE World Conf. on Complex Systems, 2015.
[4] N. Qamar, et al., "Speed Error Mitigation for a DSP-Based Resolver-to-Digital Converter Using Autotuning Filters," IEEE Transactions on Industrial Electronics, vol. 62, no. 2, pp. 1134-1139, 2015.
[5] H.-C. Liu, et al., "Micro Controller Unit Based Motor Control System using Field Oriented Control Algorithm," IEEE Conf. on Information, Communications, and Signal Processing, 2013.


Figure 3.5.1: Motor-control system for EV/HEV.
Figure 3.5.2: Motor-control system processing.
Figure 3.5.3: Intelligent motor-timer system (IMTS).
Figure 3.5.4: Angle-tuning circuit (ATC).
Figure 3.5.5: Functional safety for IMTS.
Figure 3.5.6: FOC processing time and CPU load.


Figure 3.5.7: Chip micrograph.



ISSCC 2017 / SESSION 3 / DIGITAL PROCESSORS / 3.6

3.6 A 60pJ/b 300Mb/s 128×8 Massive MIMO Precoder-Detector in 28nm FD-SOI

Hemanth Prabhu, Joachim Neves Rodrigues, Liang Liu, Ove Edfors

Lund University, Lund, Sweden

Further exploitation of the spatial domain, as in Massive MIMO (MaMi) systems, is imperative to meet future communication requirements [1]. Up-scaling of conventional 4×4 small-scale MIMO implementations to MaMi is prohibitive in terms of flexibility, as well as area and power cost. This work discloses a 1.1mm² 128×8 MaMi baseband chip, achieving up to 12dB array and 2× spatial multiplexing gains. The area cost, compared to previous state-of-the-art MIMO implementations [2-3], is reduced by 53% and 17% for up- and down-link, respectively. Algorithm optimizations and a highly flexible framework were evaluated on real measured channels. Extensive hardware time multiplexing lowered area cost, and leveraging flexible FD-SOI body bias and clock gating resulted in an energy efficiency of 6.56nJ/QRD and 60pJ/b at a 300Mb/s detection rate.

For the downlink, pre-coding is a highly critical task in terms of functionality and complexity, as the Gram matrix (GM) needs to be generated and inverted. Well-known low-complexity approaches, i.e., Neumann series and stationary iterative methods, suffer from slow convergence and are dependent on diagonal dominance of the GM. Furthermore, scaling of exact QR-decomposition (QRD)-based small-scale MIMO approaches is expensive if larger matrices need to be processed [3-4]. The proposed approximate Givens rotation (GR) reduces complexity by 50%, and thereby addresses the area penalty of a QRD for a large matrix, with minimal performance loss, as shown in Fig. 3.6.1. The architecture is highly reconfigurable, which enables an exact GR-based QRD for unfavorable channel scenarios.
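
For reference, an exact Givens-rotation QR decomposition of a complex channel matrix can be written in a few lines of NumPy, as sketched below; the chip's approximate GR simplifies the rotation coefficients to halve this complexity, which is not reproduced here.

```python
import numpy as np

def givens_qrd(H):
    """Exact Givens-rotation QR decomposition: returns Q, R with H = Q @ R."""
    m, n = H.shape
    R = H.astype(complex)
    Q = np.eye(m, dtype=complex)
    for j in range(n):
        for i in range(m - 1, j, -1):          # zero out entries below the diagonal
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(abs(a), abs(b))
            if r == 0.0:
                continue
            c, s = a / r, b / r                 # rotation coefficients
            G = np.array([[np.conj(c), np.conj(s)], [-s, c]])
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.conj().T
    return Q, R

H = (np.random.randn(8, 8) + 1j * np.random.randn(8, 8)) / np.sqrt(2)
Q, R = givens_qrd(H)
print(np.allclose(Q @ R, H), np.allclose(np.tril(R, -1), 0))
```
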
The high degree of freedom in the spatial domain of MaMi is exploited to reduce the peak-to-average ratio (PAR) of the transmit signals. An approach with a low computational complexity in the digital domain is to deliberately clip signals sent to antennas, while correction signals are transmitted on a set of dedicated antennas. In contrast to tone reservation, this approach has only a logarithmic impact on capacity, and up to 4dB reduction in PAR is achieved.

Figure 3.6.2 shows the pre-coder architecture split into three sub-modules: the triangular arrays perform a GM followed by QRD per channel realization; the vector projection unit (VPU) and backward substitution unit (BSU) provide implicit inversion on user data. Lastly, matched filtering (MF), inverse fast Fourier transform (IFFT), and an optional threshold clipping for on-demand PAR-aware pre-coding are executed. A coarse 2.5kb pipeline stage between sub-modules enables highly pipelined execution across the sub-carriers. The pipeline registers are distributed into the processing elements (PEs) of the BSU and VPU, wherein GR coefficients are stored, which in turn provides high access bandwidth. Extensive fine pipelining between PEs keeps the critical path of 0.95ns inside each PE.

A highly area-efficient unified PE is implemented (Fig. 3.6.3), which computes both the GM and QRD, and thereby reduces gate count by 2.7k per PE. The unified triangular array first computes the GM and feeds the QRD via the vertical interconnect. Furthermore, a highly time-multiplexed PE is realized by a single generic multiplier. An exact GR requires 16 clock cycles, whereas an approximate GR also utilizes a constant multiplier, lowering the number of cycles to 8. The two accumulator units perform matrix multiplication by reusing the generic multiplier. The VPU avoids explicit computation of Q and uses pre-computed GR coefficients to process the data stream in a butterfly-like fashion. The storage size reduces with each stage, with a total of 1.7kb required for exact computation; half of the storage is clock-gated when approximate rotation is used. A ping-pong buffer of size 0.4kB is used for both pipelining and reordering user-vector streams for the BSU (the PE rate matches the VPU). The BSU employs Newton-Raphson division and reuses the multiplier during initialization, improving the area efficiency.

Typically, MaMi channels are close to independent and identically distributed (IID). However, scenarios like highly correlated channels, e.g., dense user deployments, require non-linear schemes. QRD followed by tree-search is a near-optimal approach for small-scale MIMO systems. In MaMi, it is crucial to lower the detection matrix dimension by performing MF, which in turn colors the noise. The implemented flexible framework supports both linear and non-linear detection; see Fig. 3.6.4. The Cholesky decomposition (CD) unit determines whether to solve the linear equation to perform a minimum mean-squared-error (MMSE) detection operation, or to set up a triangularized system with decolored noise for tree-search algorithms. As the division unit dictates the accuracy and timing constraint, a pipelined sequential restoring bit-accurate division unit is designed. It provides a highly accurate decomposition with a 51dB signal-to-quantization-noise ratio at 12b internal word length. An 8×8 CD is computed in 325 cycles, followed by a streaming 8-cycle/sample forward- and back-substitution unit for linear detection.
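
As a point of reference for the linear path, MMSE detection after matched filtering amounts to solving (H^H H + N0 I) x = H^H y, which the CD unit realizes with a Cholesky factorization followed by forward and back substitution. A NumPy sketch of that computation (fixed-point word lengths and scheduling ignored):

```python
import numpy as np

def mmse_detect(H, y, n0):
    """Linear MMSE detection: solve (H^H H + n0 I) x = H^H y via Cholesky."""
    K = H.shape[1]
    A = H.conj().T @ H + n0 * np.eye(K)     # K x K Gram matrix plus noise loading
    z = H.conj().T @ y                      # matched-filter output
    L = np.linalg.cholesky(A)               # A = L @ L^H
    w = np.linalg.solve(L, z)               # solve L w = z (forward substitution)
    return np.linalg.solve(L.conj().T, w)   # solve L^H x = w (back substitution)

M, K = 128, 8
H = (np.random.randn(M, K) + 1j * np.random.randn(M, K)) / np.sqrt(2)
x = np.sign(np.random.randn(K)) + 1j * np.sign(np.random.randn(K))
y = H @ x + 0.01 * (np.random.randn(M) + 1j * np.random.randn(M))
print(np.round(mmse_detect(H, y, n0=1e-4), 2))
```
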
Figure 3.6.5 shows the measured power dissipation for pre-coding and detection at room temperature for different frequencies and supply voltages. Forward and reverse body bias is used for performance-power trade-off and fine tuning of process-variation effects. A clock rate of 300MHz achieves the required throughput of 300Mb/s, at a power consumption of 18mW and 31mW for detection and pre-coding, respectively. The downlink pre-coding employs a QRD unit with performance and energy efficiency of 34.1M-QRD/s/kGE and 6.56nJ/QRD, respectively, at low area cost, which is favorable compared to [3-4]. Using CORDIC as in [3] could further improve the QRD efficiency; however, a generic multiplier-based design is opted for, which provides the additional ability to compute the GM, not supported in the cited designs. The uplink utilizes a 1.08μs-latency 8×8 CD unit, with a high detection area efficiency and energy efficiency of 2.02Mb/s/kGE and 60pJ/b, respectively. The energy efficiency can be further improved by body bias, voltage, and frequency scaling. Compared to small-scale MIMO [6], MaMi systems with linear detection schemes provide superior performance and hardware efficiency. The array and spatial multiplexing gains in MaMi require handling large matrices; however, the reported higher hardware efficiencies and the low-complexity PAR scheme make it very promising for future deployments. A comparison with state-of-the-art MIMO implementations is shown in Fig. 3.6.6.

Acknowledgments:
This work was funded by Lund University, SSF-DISTRANT and MAMMOET. Furthermore, we thank STM for providing the opportunity to fabricate the chip in 28nm FD-SOI. We also thank Oskar A. and Rakesh G. for help with the backend flow.

References:
[1] F. Rusek, et al., "Scaling up MIMO: Opportunities and Challenges with Very Large Arrays," IEEE Signal Processing Magazine, vol. 30, no. 1, pp. 40-60, 2013.
[2] C. H. Chen, W. Tang, and Z. Zhang, "A 2.4mm² 130mW MMSE Nonbinary-LDPC Iterative Detector-Decoder for 4×4 256-QAM MIMO in 65nm CMOS," ISSCC, pp. 338-339, 2015.
[3] Z.-Y. Huang and P.-Y. Tsai, "Efficient Implementation of QR Decomposition for Gigabit MIMO-OFDM Systems," IEEE Trans. on Circuits and Systems-I, vol. 58, no. 10, pp. 2531-2542, 2011.
[4] M. Shabany, D. Patel, and P. Gulak, "A Low-Latency Low-Power QR Decomposition ASIC Implementation in 0.13μm CMOS," IEEE Trans. on Circuits and Systems-I, vol. 60, no. 2, pp. 327-340, 2013.
[5] B. Noethen, et al., "A 105GOPS 36mm² Heterogeneous SDR MPSoC with Energy-Aware Dynamic Scheduling and Iterative Detection-Decoding for 4G in 65nm CMOS," ISSCC, pp. 188-189, 2014.
[6] M. Winter, et al., "A 335Mb/s 3.9mm² 65nm CMOS Flexible MIMO Detection-Decoding Engine Achieving 4G Wireless Data Rates," ISSCC, pp. 216-218, 2012.


Figure 3.6.1: Massive MIMO model with spectral efficiency gains and BER for pre-coding and detection.

Figure 3.6.2: Architecture of downlink pre-coding.

Figure 3.6.3: Unified PE supporting Gram matrix generation and approximative/exact QRD.

Figure 3.6.4: Block diagram for linear/non-linear detection.
Figure 3.6.5: Chip measurement results. (Measured power consumption [mW] and frequency [MHz] versus core supply voltage VDD from 0.5V to 1.0V, for uplink detection and downlink pre-coding, under forward and reverse body bias with VBB from -0.2V to 0.4V; the logic-analyzer maximum frequency is also marked. Uplink detection achieves 18mW at 300MHz at (VDD, VBB) = (0.9V, -0.2V); downlink pre-coding achieves 31mW at 300MHz at (0.9V, 0.2V).)

Figure 3.6.6: Comparison with state-of-the-art designs.

Uplink detection:
                                     This Work     Chen [2]    Noethen [5]   Winter [6]
MIMO (M × K)                         128×8         4×4         4×4           4×4
Array Gain [dB], Spatial Multiplexing 12, 8        0, 4        0, 4          0, 4
Detection Algorithm                  ZF/MMSE       MMSE        SD SISO       SD SO
Modulation (QAM)                     256           256         64            64
Technology [nm]                      28            65          65            65
Power [mW]                           18 (a)        26.5        87            38
Frequency [MHz]                      300           517         445           333
Detection Area [kGE]                 148           347         383           215 (b)
Detection Data-rate [Mb/s]           300           1379        396           296-807
Area Efficiency (c) [Mb/s/kGE]       2.02          0.99        0.26          0.34-0.94
Energy Efficiency [pJ/b]             60            76.8        878.8         188.3

Downlink pre-coding:
                                     This Work (Approx. / Exact)   Shabany [4]   Huang [3]
MIMO (K × M)                         8×128                         4×4           4×4
Technology [nm]                      28                            130           130
Frequency [MHz]                      300                           278           100
Pre-coding Data-rate [Mb/s]          300                           -             -
QRD Algorithm                        Adp. GR                       Hy. GR        -
Gram Matrix Generation               Yes                           No            No
QRD Dimension                        8×8                           4×4           4×4
Gate Count [K]                       138                           36            111
Power [mW] (d)                       31                            48.2          319
QRD Latency [Cycles]                 64 / 128                      40            4
QRD Efficiency (e) [M-QRD/s/kGE]     34.1 / 17                     24.13         28.2
QRD Energy Efficiency [nJ/QRD]       6.56 / 13.2                   11.94         21.9

(a) Post-MF power; MIMO dimension lowers to 8×8.
(b) Pre-processing not included.
(c) Area Efficiency = (Data-rate / Gate Count) × (K²/8²); Energy Efficiency scaled by (K²/8²).
(d) Only QRD power consumption.
(e) QRD Efficiency = (Frequency / Cycles) / Gate Count × (K³/8³); Energy scaled by 28nm/Technology.
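To make the dimension normalization in footnote (c) concrete, the small check below recomputes the scaled entries for Chen [2] from the raw figures in the table; the reading of the (K²/8²) factor as referring the 4×4 design to the 8-user dimension is an assumption, but it reproduces the listed values.

```python
# Recomputing Chen [2]'s normalized entries from the raw figures in Fig. 3.6.6,
# assuming footnote (c) refers the 4x4 design to the 8-user dimension.
K_ref = 8
rate_mbps, gates_kge, power_mw, K = 1379.0, 347.0, 26.5, 4

area_eff = (rate_mbps / gates_kge) * (K / K_ref) ** 2                 # Mb/s/kGE
energy_pj_per_b = power_mw * 1e9 / (rate_mbps * 1e6) / (K / K_ref) ** 2

print(round(area_eff, 2), round(energy_pj_per_b, 1))   # ~0.99 and ~76.9 (table lists 0.99, 76.8)
```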



ISSCC 2017 PAPER CONTINUATIONS

Technology 28 nm FD-SOI
Supply voltage 0.5 V to 1.0 V
Body-Bias range -0.3 V to 1.0 V
Chip Gate count 350 K
Test Interface JTAG
Tested Max. Freq 300 MHz

128x8 Massive MIMO Pre-coder


Gate Count 138 K
QRD Clock Cycles 64
Algorithm Approx. GR
Word length 12-bit

128x8 Massive MIMO Detector


Gate Count 148 K
Cholesky clock cycles 325
Modulation 256-QAM
BSU and VPU cycles 8 cycles/sample

Figure 3.6.7: Chip microphotograph.



ISSCC 2017 / SESSION 3 / DIGITAL PROCESSORS / 3.7

3.7 A 1920×1080 30fps 2.3TOPS/W Stereo-Depth Processor for Robust Autonomous Navigation

Ziyun Li, Qing Dong, Mehdi Saligane, Benjamin Kempke, Shijia Yang, Zhengya Zhang, Ronald Dreslinski, Dennis Sylvester, David Blaauw, Hun Seok Kim

University of Michigan, Ann Arbor, MI

Precise depth estimation is a key kernel function for realizing autonomous navigation on micro-aerial vehicles (MAVs). The state-of-the-art semi-global matching (SGM) algorithm has become favored for its high accuracy. In particular, it effectively handles low-texture regions thanks to its global optimization of the disparity between a left and a right image over the entire frame. However, SGM involves massively parallel computation (~2TOP/s) and extremely high-bandwidth memory access (38.6Tb/s) for 30fps HD resolution. This leads to ~20s of runtime for an HD image pair on a 3GHz CPU [1], requiring ~386MB of memory and >35W of power. Together, these factors place it well outside the realm of MAVs. Prior ASIC implementations have used either simpler local methods [2] or aggressively truncated global algorithms [3] that produce a depth map with significantly inferior quality or limited disparity range (32 or 64 pixels), and therefore fail to support standard automotive scene benchmarks [2-5]. In addition, due to the high memory requirement of SGM, prior methods [3-4] have used external DRAM to store intermediate computation, significantly reducing performance and efficiency.
This paper presents a stereo vision processor that fully implements the SGM algorithm on a single chip. The design uses a new image-scanning stride to enable a deeply pipelined implementation with ultra-wide (1612b) custom SRAMs for 1.64Tb/s of on-chip access bandwidth. Our design is the first ASIC to report performance on the industry-standard KITTI benchmark, which contains realistic automobile scenes. The proposed design supports 512-level depth resolution at full-HD (1920×1080) resolution and a real-time 30fps, consuming 836mW from a 0.75V supply in 40nm CMOS. We also integrate the stereo chip with a quadcopter and demonstrate its operation in real-time flight.

Figure 3.7.1 (top right) visualizes the output difference between the local sum-of-absolute-differences (SAD) algorithm and SGM, clearly illustrating the higher quality of SGM. To make a single-chip SGM implementation feasible, i.e., to remove the need for external DRAM, we first observe that inter-pixel correlation diminishes when pixel pairs are more than 50 pixels apart. Hence, the proposed design processes the input image in units of 50×50 overlapping pixel blocks. Adjacent blocks overlap by 8 pixels to allow cost aggregation across block boundaries. This technique reduces the memory requirement for storing intermediate aggregation results by 95.4%. Fig. 3.7.1 shows a side-by-side comparison of this block-based SGM and the original SGM, which are almost identical. Fig. 3.7.1 also presents quantitative results evaluated on 194 KITTI test cases, showing only 0.5% accuracy degradation.
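As a rough illustration of this tiling, the sketch below enumerates 50×50 block origins with an 8-pixel overlap for a full-HD frame; the block size and overlap follow the text, while the helper name and edge handling are illustrative assumptions.

```python
# Sketch of the 50x50 block tiling with 8-pixel overlap used to bound on-chip
# aggregation storage. Only the block size and overlap come from the text.
def block_origins(length, block=50, overlap=8):
    """Top-left coordinates of overlapping blocks covering `length` pixels."""
    stride = block - overlap                      # 42-pixel stride between adjacent blocks
    origins = list(range(0, max(length - block, 0) + 1, stride))
    if origins[-1] + block < length:              # ensure the last block reaches the edge
        origins.append(length - block)
    return origins

rows, cols = block_origins(1080), block_origins(1920)
print(len(rows), len(cols), len(rows) * len(cols))   # 26 x 46 blocks for a full-HD frame
```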

As shown in Fig. 3.7.2, the processor streams left and right image blocks into two on-chip interleaved image buffers (30Kb each). It then performs a 7×7 census transformation on each pixel using its surrounding pixels and compares each census-transformed pixel of the left image with census-transformed pixels of the right image at 128 different disparity locations. This produces 128 Hamming distances (6b each) per pixel, which represent the local matching costs for the 128 disparities. The processor then aggregates the local matching costs (separately for each disparity) along 8 paths over the 50×50 block. By searching the summed aggregated costs over the disparities for the minimum value, the processor obtains the coarse (integer) SGM depth output. It then refines the depth precision by performing a quadratic fit on the three aggregated costs around the minimum, using a look-up table to provide sub-pixel depth accuracy.
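The per-pixel matching-cost step just described can be expressed as a compact behavioral sketch: a 7×7 census transform, Hamming distances over 128 disparities, and the quadratic sub-pixel refinement around the minimum. Window size, disparity count, and the parabola fit follow the text; the array layout, border handling, and function names are simplifications, not the chip's fixed-function pipeline.

```python
import numpy as np

def census7(img):
    """7x7 census transform: a 48-bit signature per pixel (wrap-around borders for brevity)."""
    sig = np.zeros(img.shape, dtype=np.uint64)
    for dy in range(-3, 4):
        for dx in range(-3, 4):
            if dy == 0 and dx == 0:
                continue
            bit = (np.roll(img, (-dy, -dx), axis=(0, 1)) < img).astype(np.uint64)
            sig = (sig << np.uint64(1)) | bit
    return sig

def matching_cost(cl, cr, max_d=128):
    """Hamming distance between left/right census signatures at each disparity (w > max_d assumed)."""
    h, w = cl.shape
    cost = np.full((h, w, max_d), 48, dtype=np.uint8)          # 48 = worst case for 7x7 census
    for d in range(max_d):
        diff = cl[:, d:] ^ cr[:, :w - d]
        bits = np.unpackbits(diff.view(np.uint8).reshape(h, w - d, 8), axis=-1)
        cost[:, d:, d] = bits.sum(axis=-1).astype(np.uint8)    # 6b local cost on chip
    return cost

def subpixel(c_m1, c_0, c_p1):
    """Quadratic fit around the minimum-cost disparity -> fractional offset."""
    denom = c_m1 - 2.0 * c_0 + c_p1
    return 0.0 if denom == 0 else 0.5 * (c_m1 - c_p1) / denom

# Example: costs = matching_cost(census7(left_block), census7(right_block))
```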
Conventionally, SGM is implemented with a forward and a backward raster scan, with each scan performing aggregation along 4 paths (8 paths in total). However, following this conventional raster-scan order creates a data dependency in which the previous pixel must complete its computation before the current pixel can be aggregated (Fig. 3.7.3, left). This dependency dominates the critical path, limiting the clock frequency and the voltage scalability for low-power operation. We therefore propose a dependency-resolving scan in which pixel processing proceeds diagonally (Fig. 3.7.3, right). When a pixel (F) is fetched into the pipeline, the aggregated costs of all previous pixels (light gray and dark gray) are already computed and stored in high-bandwidth custom SRAMs. This mechanism enables aggressive pipelining, yielding a 3× performance gain. As shown in the block diagram of Fig. 3.7.4 (top), our design leverages parallelism in cost aggregation by running 4 paths in parallel on 4 aggregation units, with each aggregation unit containing 128 processing elements and 512 selection units, resulting in 1.882TOP/s.
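The scan-order change can be illustrated with a behavioral sketch of one aggregation path evaluated in anti-diagonal (wavefront) order: every left-neighbor predecessor was issued on an earlier wavefront, so its aggregated cost is available when the current pixel enters the deep pipeline. The standard SGM recurrence with penalties P1/P2 is assumed, since the exact on-chip cost update is not spelled out above; the chip runs four such paths in parallel per pass.

```python
import numpy as np

def aggregate_left(cost, P1=8, P2=32):
    """One SGM aggregation path (predecessor = left neighbor), swept in anti-diagonal
    order so the predecessor's result is ready before the current pixel issues.
    `cost` is the (h, w, D) local matching-cost volume; P1/P2 are assumed penalties."""
    h, w, D = cost.shape
    big = np.iinfo(np.int32).max // 2
    L = cost.astype(np.int32)
    for s in range(1, h + w - 1):                         # anti-diagonal index: y + x = s
        for y in range(max(0, s - w + 1), min(h, s + 1)):
            x = s - y
            if x == 0:
                continue                                  # border pixels keep the raw cost
            prev = L[y, x - 1]                            # finished on a previous wavefront
            best = prev.min()
            up = np.concatenate(([big], prev[:-1])) + P1  # transition from disparity d-1
            dn = np.concatenate((prev[1:], [big])) + P1   # transition from disparity d+1
            cand = np.minimum(np.minimum(prev, best + P2), np.minimum(up, dn))
            L[y, x] = cost[y, x] + cand - best
    return L
```

Feeding the cost volume from the census/Hamming sketch above and summing the outputs of all processed paths gives the aggregated cost whose per-pixel minimum yields the integer disparity.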

Figure 3.7.4 (bottom) shows the proposed architecture of the customized compact high-bandwidth SRAM. In the proposed design, the row buffers are read and written simultaneously at 170MHz, and all 128 previous aggregated costs are accessed in a single cycle. This approach achieves the required memory bandwidth of 1.64Tb/s for the 3 row buffers accessed in parallel. This bandwidth would incur a large chip-area and power overhead if realized with compiled SRAMs. To provide an efficient area/power solution, we use a custom high-bandwidth SRAM that leverages the design's highly parallelized structure, in which each bank has only 50 words with a word size of 403b (Fig. 3.7.4). All four banks in one SRAM are read and written concurrently, realizing a 1612b dual-port access. To reduce leakage power in the 40nm technology, the custom 8T memory bitcell uses HVT transistors. Unlike conventional 8T cells, the read transistor stack is flipped such that the read transistor is not connected to RBL, reducing coupling between RWL and the short, low-capacitance RBL. Skewed inverters are used in place of conventional sense amplifiers (1612 per SRAM), reducing the sense-amplifier overhead by 2.8×. Overall, each 80Kb SRAM consumes 6mW while providing 548.1Gb/s of bandwidth.
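The SRAM figures above can be cross-checked directly; the short calculation below re-derives the access width, per-macro capacity, and bandwidth from the stated bank organization, assuming the quoted 548.1Gb/s counts the simultaneous read and write ports.

```python
# Cross-checking the custom-SRAM organization stated above
# (4 banks x 50 words x 403b, read and written every 170MHz cycle).
banks, words, word_bits, f_mhz, row_buffers = 4, 50, 403, 170, 3

access_bits = banks * word_bits                       # all four banks accessed together
capacity_kb = banks * words * word_bits / 1e3         # per-SRAM storage
bw_gbps = access_bits * 2 * f_mhz * 1e6 / 1e9         # x2: simultaneous read and write

print(access_bits, round(capacity_kb, 1), round(bw_gbps, 1))
# -> 1612 (b per access), 80.6 (kb per SRAM, the "80Kb" macro), 548.1 (Gb/s)
print(round(row_buffers * bw_gbps / 1e3, 2))          # -> 1.64 Tb/s for 3 row buffers
```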


The vision processor is fabricated in 40nm GP CMOS. Fig. 3.7.5 shows the measurement setup and the real-time demonstration platform mounted on a quadcopter. Real-time image streams captured by the stereo camera are rectified and block-partitioned by a Samsung Exynos-5422 processor on the ODROID-XU4 board, and then transmitted to the stereo processor through a USB3.0 interface. The processed real-time depth and confidence maps provide feedback to the Exynos processor through another USB3.0 channel. At the 0.9V nominal voltage, the real-time VGA (HD) frame-processing latency of the stereo processor is 4.1ms (26ms). In a KITTI automobile scene (Fig. 3.7.5, bottom) and a quadcopter scene captured on the fly by our demonstration platform (Fig. 3.7.5, middle), large (>100 pixel) disparities frequently occur, and the proposed processor generates an accurate depth map over the entire image thanks to its 512 levels of resolution. Fig. 3.7.6 shows the voltage and frequency scaling of the chip and provides a comparison with prior work. The proposed processor achieves 7% outlier accuracy on KITTI and an 8× improvement in disparity range compared with [2-5]. Note that [2-5] all lack a standard benchmark evaluation because of their limited depth range. Our system consumes 836mW to process 30fps full-HD images at 0.0262nJ normalized energy (an FoM proposed in [4] and defined in Fig. 3.7.6, top left), marking a 5.8× improvement over the listed prior work. Power reduces to 55mW for VGA images at 30fps, yielding 0.0117nJ normalized energy. Fig. 3.7.7 shows the die photo and a performance summary.

Acknowledgements:
We thank the TSMC University Shuttle Program for chip fabrication.

References:
[1] H. Hirschmuller, et al., "Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information," Computer Vision and Pattern Recognition, pp. 807-814, 2005.
[2] M. Hariyama, et al., "VLSI Processor for Reliable Stereo Matching Based on Window-Parallel Logic-in-Memory Architecture," IEEE Symp. VLSI Circuits, pp. 166-169, 2004.
[3] K. Lee, et al., "A 502GOPS and 0.984mW Dual-Mode ADAS SoC with RNN-FIS Engine for Intention Prediction in Automotive Black-Box System," ISSCC, pp. 256-257, 2016.
[4] H.-H. Chen, et al., "A 1920×1080 30fps 611mW Five-View Depth-Estimation Processor for Light-Field Applications," ISSCC, pp. 422-423, 2015.
[5] J. Park, et al., "A 30fps Stereo Matching Processor Based on Belief Propagation with Disparity-Parallel PE Array Architecture," ISCAS, pp. 453-454, 2010.

ISSCC 2017 / February 6, 2017 / 4:45 PM

Figure 3.7.1: Depth estimation on MAVs and associated requirements. Comparison of local, SGM, and proposed block-based SGM.

Figure 3.7.2: Proposed block-based SGM processing procedure.

Figure 3.7.3: Proposed dependency-resolving scan and corresponding pipelining and forwarding scheme.

Figure 3.7.4: Chip block diagram and schematic of proposed compact high-bandwidth SRAM.

Figure 3.7.5: Real-time demonstration setup and visualization of chip measurements.

Figure 3.7.6: Chip measurements and comparison with recent prior works.



ISSCC 2017 PAPER CONTINUATIONS

Figure 3.7.7: Die photo and summary of performance.

