2017 10 24 FPGA Development For C C++ Using HLS

FPGA Development for C/C++ Coders
using High-Level Synthesis

By Karl de Boois - EBV
Page 1 © Copyright 2016 Xilinx

.
Agenda
Introduction
EBV & Xilinx
Design flow: RTL -> HLS -> SDSoC
FPGA basics
– LUTs & Routing
– Clocking, Memory and DSP resources
Product portfolio & demo

.
Introduction - Karl de Boois
Hardware Design Engineer since 1991

Field Application Engineer for EBV Elektronik since 2002
Some personal facts:
Love to travel the world with my girlfriend.

Play video games in my mancave.
Mancave starts at the front door.

.
Design flow: RTL -> HLS -> SDSoC

.
Traditional FPGA design in Verilog
Not a programming language but a Hardware Description Language
What about VHDL?

.
The battle of the HDLs – VHDL vs Verilog

.
The (ideal) RTL design flow for FPGAs

.
Vivado High-Level Synthesis: Accelerated IP Development and Design Space Exploration
Comprehensive coverage
– C/C++/SystemC
– Arbitrary precision
– Floating-point
Accelerated verification
– 2 to 3 orders of magnitude faster than RTL for larger designs
Fast compilation and design exploration
– Algorithm feasibility
– Architecture iteration
Customer-proven results

.
Accelerating Design Productivity with HLx & SDx
Separate platform design from differentiated logic
– Let application designers focus on the differentiated logic
Spend less time on the standard connectivity
– IPI: configure & generate a platform on a custom board
Spend more time on the differentiated logic
– HLS: enabling core technology: C/C++/OpenCL synthesis
– HLx: IP design (HLS/SysGen) + connectivity platform (IP Integrator)
[Diagram: productivity of the HLx HW flow and the SDx SW flow. Labels: SystemC, C++, OpenCL; open source C++ library; Vivado HLS; AXI I/F; IP Integrator; platform awareness; RTL IP; platform RTL IP; SDK; Vivado synthesis and implementation.]

.
Vivado HLS Accelerates Verification Productivity
RTL-based approach: hours per iteration
– Functional verification with HDL simulation
– Result: verified RTL
C-based approach: seconds per iteration
– Functional verification with a C compiler
– 2 to 3 orders of magnitude faster than RTL for large designs
– RTL verification becomes the final check (final validation)
– Result: verified RTL
Video example – 10 frames of video data
– Functional verification in RTL = ~2 days
– Functional verification in Vivado HLS C = 10 sec
.
Accelerate Functions in Hardware with SDSoC™ (Zynq)
No RTL design required
1 System-level profiling
2 Toggle SW-HW partitioning: each C/C++ function is marked software [S] or hardware [H]
3 System optimizing compiler
– Takes C code and creates all hardware processing blocks and connections to make the software work
[Diagram: a chain of C/C++ functions between input and output, running on the ARM® core in the Processing System or moved into the Programmable Logic.]

.
Using the “Hello World” of SDSoC, the Floating Point Matrix Multiplier, to introduce the FPGA.

.
Floating Point Matrix Multiplier example
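
The slide itself shows only a screenshot, so the following is a minimal sketch of what such a floating-point matrix multiplier might look like in plain C++. The dimension `kDim` and the names `Matrix` and `mmult` are assumptions for illustration, not taken from the SDSoC example project. Fixed loop bounds and statically sized arrays are used because that style is generally friendly to high-level synthesis.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical size: the slides do not state the matrix dimensions.
constexpr std::size_t kDim = 4;
using Matrix = std::array<std::array<float, kDim>, kDim>;

// Plain C++ matrix multiply: c = a * b.
// All loop bounds are compile-time constants and no memory is allocated
// dynamically, so each loop has a known trip count.
void mmult(const Matrix& a, const Matrix& b, Matrix& c) {
    // The output must not alias an input, or partial results would
    // overwrite operands that are still needed.
    assert(&c != &a && &c != &b);
    for (std::size_t row = 0; row < kDim; ++row) {
        for (std::size_t col = 0; col < kDim; ++col) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < kDim; ++k) {
                acc += a[row][k] * b[k][col];  // one multiply-accumulate
            }
            c[row][col] = acc;
        }
    }
}
```

The innermost statement is a multiply-accumulate, which is the operation the DSP slices later in the deck are built around.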

.

FPGA basics

.
Logic (FF and LUT) & Interconnect

.
Slices in 7 Series FPGAs
A Configurable Logic Block (CLB) contains two slices
– All 7 series FPGAs use the same CLB structure
Two types of slice
– SLICE_L: Logic-only
– SLICE_M: Memory-capable (Logic + RAM / Shift Register)
– No SLICE_X (present in Spartan-6 FPGAs)
Not every CLB needs memory
– CLB_LL: two SLICE_Ls
– CLB_LM: SLICE_L + SLICE_M
Total Power Reduction: Ratio of Logic to Memory Optimized for Area and Power
.
Configurable Logic Block
Architecture consistent with Virtex-6 and Spartan-6
– Two side-by-side slices per CLB
– Four 6-input Look Up Tables (LUTs) per slice
– Two flip-flops per LUT
– MUXF7 and MUXF8 for creating larger logic structures
Ease of design migration
– Designs can be migrated easily from Spartan-6 FPGAs and Virtex-6 FPGAs to 7 series FPGAs
– Designs can be migrated between 7 series families as the user’s requirements change
.
Logic Resource Uses
– LUT6: six-input logic function
– LUT5 + LUT5: shared five-input logic functions
– LUT3 + LUT3: split logic functions
– 64x1 RAM: distributed memory
– SRL32: shift register logic
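
For readers coming from software, it may help to see that a six-input LUT is just a 64-entry, one-bit-wide truth table addressed by its inputs. The sketch below models that idea in C++. The names `lut6` and `make_init` are hypothetical, and the model ignores the LUT5/LUT3 sharing modes listed above.

```cpp
#include <cassert>
#include <cstdint>

// A 6-input LUT modeled as a 64-bit truth table: bit i of `init` is the
// output for input pattern i (the six inputs form the address).
bool lut6(std::uint64_t init, unsigned inputs) {
    assert(inputs < 64u);
    return ((init >> inputs) & 1u) != 0;
}

// Build the truth table for any 6-input boolean function by evaluating it
// on all 64 input patterns. This is a software stand-in for what a
// synthesis tool does when it maps logic into a LUT.
template <typename F>
std::uint64_t make_init(F f) {
    std::uint64_t init = 0;
    for (unsigned pattern = 0; pattern < 64u; ++pattern) {
        if (f(pattern)) {
            init |= (std::uint64_t{1} << pattern);
        }
    }
    return init;
}
```

With this model, a six-input AND is the table with only bit 63 set. Any other function of six inputs costs the same single lookup, which is why LUT count rather than gate count is the usual area measure.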
.
32-bit Shift Register in One LUT
Versatile SRL-type shift registers
– Variable-length shift register
– Synchronous FIFOs
– Content-addressable memory (CAM)
– Pattern generator
– Compensate for delay / latency
Shift register length is determined by the address
– Constant value giving fixed delay line
– Dynamic addressing for elastic buffer
Cascade up to 128x1 shift register in one slice
SRL configurations in one slice (4 LUTs)
– 16x1, 16x2, 16x4, 16x6, 16x8
– 32x1, 32x2, 32x3, 32x4
– 64x1, 64x2
– 96x1
– 128x1
[Diagram: LUT used as a 32-bit shift register with inputs D and CLK and a 5-bit address A; a 32-input MUX selects output Qn; Q31 is the last stage.]
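
The addressable-tap behaviour described above can be sketched as a small C++ model. The struct name `Srl32` and its methods are assumptions for illustration, and the model shifts on every call to `clock`, as if clock enable were always asserted.

```cpp
#include <cassert>
#include <cstdint>

// Software model of a 32-bit addressable shift register in one LUT.
struct Srl32 {
    std::uint32_t bits = 0;  // bit 0 holds the most recently shifted-in value

    // One clock edge: shift the register by one position and take in d.
    void clock(bool d) { bits = (bits << 1) | (d ? 1u : 0u); }

    // Addressable output Qn: address a selects the value that was shifted
    // in a + 1 clocks ago, so the address sets the delay length.
    bool q(unsigned a) const {
        assert(a < 32u);
        return ((bits >> a) & 1u) != 0;
    }

    // Fixed last-stage output, the tap used for cascading into the next LUT.
    bool q31() const { return q(31u); }
};
```

Holding the address constant gives a fixed delay line, and changing it at run time gives the elastic buffer mentioned on the slide.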
.
Arithmetic Carry
Two carry chains per CLB
– Running South to North for fast arithmetic addition and subtraction
– Each slice in a column has an independent carry chain
Carry lookahead
– Combinatorial carry lookahead over the four LUTs in a slice
– Improves arithmetic performance
Carry chains are cascadable
– Carry-Out connects to adjacent Carry-In in the same column
Carry chain uses
– Carry chain used as large AND/OR gates and decoders to reduce logic and improve performance
.
Fast Interconnect Between CLBs
CLBs are distributed in columns of tiles
– Separated by two back-to-back interconnect columns
Back-to-back interconnect results in metal area saving
– Reducing metal area lowers power consumption and overall device cost
– Fast connection between CLBs on either side of the interconnect pair
Only clock signals are shared between two columns
– Data signals feed to either the left or the right CLB column
.
Length of Routing Resources
Different-length interconnect resources
– Single lines connect adjacent tiles
– Double lines connect two tiles away
– Quad lines connect four tiles away
– Hex lines connect six tiles away
– Long lines connect between 12 and 18 tiles away
Quantity of routing resources increased over Virtex-6 FPGAs and Spartan-6 FPGAs
– To avoid encountering routing congestion
.
Using Logic and Interconnect Resources
All logic resources can be inferred
– Xilinx and third-party synthesis tools
Software intelligently packs logic
– Placer is aware of the lengths of available routing resources
[Diagram: the design below mapped onto the LUTs and flip-flops of an FPGA slice.]
...
-- Combinational logic: each expression is inferred as a LUT
lutout1 <= not (input1 and input2 and input3 and input4 and common);
lutout2 <= (input5 or input6 or input7 or common);
-- Registers: one flip-flop behind each LUT output
process (Clock)
begin
  if rising_edge (Clock) then
    flop1 <= lutout1;
  end if;
  if falling_edge (Clock) then
    flop2 <= lutout2;
  end if;
end process;
...
Placer software automatically chooses the best routing resources
– The fewer resources used, the quicker the connection, the faster the design
– A single routing resource will be used if possible
– Otherwise, multiple routing resources will be interconnected
.
Embedded Memory (BRAM)

.
7 Series FPGAs Memory Hierarchy
Distributed RAM/SRL32: using LUTs as storage elements
• Very granular, localized memory
• Minimal impact on logic routing
• Great for small FIFOs
On-chip BRAM/FIFO: dedicated internal memory arrays
• Efficient, on-chip blocks
• Flexible + optional FIFO logic
• Ideal for mid-sized buffering
Fast Memory Interfaces: ability to interface to external memory
• Memory-controller cores
• Cost-effective bulk storage
• For large memory requirements
• DRAM: SDRAM, DDR SDRAM, FCRAM, RLDRAM
• SRAM: Sync SRAM, DDR SRAM, ZBT, QDR
• FLASH
• EEPROM
Axis: Granularity → Capacity
.
Distributed RAM
Distributed memory
– Each LUT can be 64-bit memory
– Inherently single-port, but can be made dual-port, multi-port
Ideal for small and fast memories
– coefficient storage,
– small data buffers,
– small state machines,
– small FIFOs
– shift registers
– etc.
Adjacent LUTs can be cascaded
– Up to 256x1-bit single port memory or 64x1-bit quad port memory in a single slice
Capability of single SLICE_M
– Single Port: 32x2, 32x4, 32x6, 32x8, 64x1, 64x2, 64x3, 64x4, 128x1, 128x2, 256x1
– Dual Port: 32x2D, 32x4D, 64x1D, 64x2D, 128x1D
– Simple Dual Port: 32x6SDP, 64x3SDP
– Quad Port: 32x2Q, 64x1Q
Programmable System Integration: Flexible, Efficient Implementation of Common Functions
.
Quad-Port Memory in One SLICE_M
Write port:
– Four LUT6s can share the write address and data
– On the first LUT, read address = write address
Read ports:
– Three independent read operations
[Diagram: common write address and write data drive four LUTs; the first LUT reads at the write address, the other three each have an independent read address; every LUT outputs its associated data.]
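
One way to picture the quad-port arrangement is as four identical copies of a 64x1 memory that are always written together and read separately. The C++ model below is a sketch under that reading of the slide. The type `QuadPortRam64x1` and its method names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Model of a 64x1 quad-port distributed RAM built from the four LUTs of a
// memory-capable slice. Every write goes to all four LUTs, so each LUT
// holds the same contents and can serve its own read address.
struct QuadPortRam64x1 {
    std::array<std::uint64_t, 4> copy{};  // one 64-bit table per LUT

    // Write: the common write address and write data reach every LUT.
    void write(unsigned addr, bool data) {
        assert(addr < 64u);
        const std::uint64_t mask = std::uint64_t{1} << addr;
        for (auto& c : copy) {
            c = data ? (c | mask) : (c & ~mask);
        }
    }

    // Read on port 0..3. In the hardware described above, port 0 reads at
    // the write address while ports 1..3 take independent read addresses;
    // the model does not enforce that restriction on port 0.
    bool read(unsigned port, unsigned addr) const {
        assert(port < 4u && addr < 64u);
        return ((copy[port] >> addr) & 1u) != 0;
    }
};
```

The cost of the extra read ports is capacity: four LUTs store 64 bits between them instead of 256.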
.
36-Kbit Block RAM
36K/18K block RAM
– All Xilinx 7 series FPGA families use the same block RAM as Virtex-6
Two independent ports address common data
– Individual address, clock, write enable, clock enable
– Independent widths for each port
Integrated control for fast and efficient FIFOs
Integrated 64 / 72-bit Hamming error correction
[Diagram: one block usable as FIFO or as dual-port BRAM.]
.
Block RAM Configurations
36K/18K block RAM
– Single 36K block or two independent 18K blocks
– 32k x 1 to 512 x 72 in one 36K block RAM
Configurations similar to Virtex-6
– Single Port, Simple Dual Port and True Dual Port configurations
– Integrated cascade logic creates 64k x 1 from two 32k x 1 blocks
– Byte-write enable
– Software controlled power down of unused block RAM sites
[Diagram: 36Kb memory array with Port A and Port B, each with Addr, 36-bit Wdata and 36-bit Rdata.]
.
Dual Port Block RAM Configurations
True dual port – unrestricted flexibility
– Simultaneous or independent read and write operations on port A and port B
– Each port has its own clock, enable, write enable
– Every write also performs a read operation
• Read before Write, Write before Read, or No Change
– Simultaneous read + write or write + write to the same location can cause data corruption; avoiding this is the user’s responsibility
Simple Dual Port – allows widest implementation
– One read port and one write port
– Natural structure for FIFOs
– 72-bit data on one or both 36K ports
– 36 bits width for 18K BRAM
– This doubles the memory bandwidth per block
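
The three write behaviours named above (read before write, write before read, no change) differ only in what the port's data output shows during a write. The following C++ model is a sketch of one port. The names `WriteMode` and `BramPort`, the 32-bit data width and the depth of 1024 are assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class WriteMode { ReadFirst, WriteFirst, NoChange };

// Hypothetical model of a single block RAM port.
struct BramPort {
    std::array<std::uint32_t, 1024> mem{};
    std::uint32_t dout = 0;               // data output of the port
    WriteMode mode = WriteMode::ReadFirst;

    // One enabled clock edge on this port.
    void clock(std::size_t addr, bool we, std::uint32_t din) {
        assert(addr < mem.size());
        if (!we) {
            dout = mem[addr];             // plain read
            return;
        }
        switch (mode) {
            case WriteMode::ReadFirst:    // old contents appear on the output
                dout = mem[addr];
                break;
            case WriteMode::WriteFirst:   // newly written data appears
                dout = din;
                break;
            case WriteMode::NoChange:     // output keeps its previous value
                break;
        }
        mem[addr] = din;
    }
};
```

The model covers one port only. The collision case from the slide, where both ports touch the same location in the same cycle, is deliberately left out because its outcome is not defined by the text above.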
.
Many BRAM Configurations
True dual-port: two fully independent read and write operations
– Each 18K: 16Kx1, 8Kx2, 4Kx4, 2Kx9, 1Kx18
– Each 36K: 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, 1Kx36
Simple dual-port: 1 read & 1 write port, read AND write in 1 cycle
– Each 18K: 16Kx1, 8Kx2, 4Kx4, 2Kx9, 1Kx18, 512x36
– Each 36K: 16Kx2, 8Kx4, 4Kx9, 2Kx18, 1Kx36, 512x72
Single-port: 1 read & 1 write port, read OR write in 1 cycle
– Each 18K: 16Kx1, 8Kx2, 4Kx4, 2Kx9, 1Kx18, 512x36
– Each 36K: 16Kx2, 8Kx4, 4Kx9, 2Kx18, 1Kx36, 512x72
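
A quick arithmetic check shows how the listed shapes relate to the 18K and 36K block sizes: depth times width never exceeds the block capacity. The helper names below are hypothetical, and 1 Kbit is taken as 1024 bits.

```cpp
#include <cassert>

// Total number of bits addressed by a depth x width configuration.
unsigned long bram_config_bits(unsigned long depth, unsigned long width) {
    assert(depth > 0 && width > 0);
    return depth * width;
}

// True when the configuration fits in a block of the given capacity,
// with the capacity stated in Kbit (1 Kbit = 1024 bits).
bool fits_in_block(unsigned long depth, unsigned long width,
                   unsigned long block_kbit) {
    return bram_config_bits(depth, width) <= block_kbit * 1024UL;
}
```

For example, 2Kx9 comes to 18,432 bits, which is exactly 18 Kbit, while 16Kx1 comes to 16,384 bits. The x9, x18, x36 and x72 widths use the whole array, and the narrower widths address 16 or 32 Kbit of it.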
.
FPGA clock management & generation

.
Xilinx 7 Series FPGA Layout
Similar floorplan to Virtex-6 FPGA
– Provides easy migration to 7 series FPGAs
CMT columns adjacent to I/O columns
– Support for high performance interfaces
One I/O column per half device
– Uniform skew from center of device
All resources optimized for low power
Increased System Performance: FPGA Layout Optimized for High Performance Memory Interfaces
[Diagram labels: I/O Columns; Clock Management Columns; Clock Routing; CLB, BRAM, DSP Columns; GT Columns.]

.
Clock Region
All 7 series FPGAs split into uniform height clock regions
– Each region has its own resources
– All regions can share the available global resources
Every clock region is 50 rows of CLBs tall
– 25 rows above and 25 rows below the central horizontal clocking row (HROW)
All clock regions span from the global vertical clock column to the left or right edge of the device
.
BUFG (Global Clock Buffer)
For Driving the Global Clock Spine
Global buffer for distributing clock signals across the height of the device
32 BUFG per device located in the center of the vertical clock spine
16 BUFG driven by resources in north, 16 driven by south
Same primitive as previous generations
Glitch-less switching between clock sources
Clock Enable for clock gating
[Diagram: BUFG(CTRL) primitive with inputs I0, I1, S0, S1, CE0, CE1, IGNORE0, IGNORE1 and output O.]

.
MMCM and PLL
MMCM
– Functionally similar to Virtex-6 MMCM
– Seven output counters plus feedback
– Powerdown mode
– Input Clock Switching
– Fractional Divide on OUT0 and FBOUT
– Dynamic Phase Shift
– True and Complement outputs (O0-O3)
– Spread Spectrum Clock Generation
– Lock Detect
PLL
– Spartan-6 PLL/Virtex-6 MMCM features
– Six output counters plus feedback
– Powerdown mode
– Input Clock Switching
– Lock Detect
[Diagrams: in both blocks CLKIN1, CLKIN2 and CLKINFB pass through divider D into the PFD, charge pump, loop filter and VCO, with lock detect, lock monitor and a feedback counter M driving CLKFBOUT. The MMCM has output counters O0 to O6; the PLL has O0 to O5.]

.
MMCM and PLL Features
Frequency Synthesis
– Fout = Fin * M / (D*O)
– One M and one D value per MMCM or PLL
– Each MMCM and PLL output can have its own O value
• M: 1…64; D: 1…80; O: 1…128
Fractional Divide – MMCM Only
– Ability to configure O0 and CLKFBOUT as a counter with 1/8th granularity (e.g. 2.125, 2.250, 2.375 etc.)
Phase Shift
[Diagram: clock phases 0, 45, 90, 135, 180, 225, 270, 315.]
Dynamic Phase Shift – MMCM Only
– Phase Shift port to change the phase in real time in increments of 1/56 of VCO period
True and Complement outputs (O0-O3) – MMCM Only
– Negative polarity clock for easy generation of phase matched inverted clock and migration of DCM designs
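
The frequency formula and the phase-shift granularity lend themselves to a short worked example. The function names below are hypothetical. The assumption that the VCO runs at Fin * M / D is not stated on the slide and is marked as such in the code.

```cpp
#include <cassert>
#include <cmath>

// Output frequency from the slide's formula: Fout = Fin * M / (D * O).
double mmcm_fout_mhz(double fin_mhz, double m, double d, double o) {
    assert(d > 0.0 && o > 0.0);
    return fin_mhz * m / (d * o);
}

// Fractional divide (MMCM only) allows O values in steps of 1/8,
// for example 2.125 or 2.250.
bool is_eighth_step(double o) {
    const double scaled = o * 8.0;
    return std::floor(scaled) == scaled;
}

// Dynamic phase shift step: 1/56 of the VCO period, in picoseconds.
// Assumption (not on the slide): the VCO frequency is Fin * M / D.
double phase_step_ps(double fin_mhz, double m, double d) {
    assert(d > 0.0);
    const double fvco_mhz = fin_mhz * m / d;
    const double vco_period_ps = 1.0e6 / fvco_mhz;
    return vco_period_ps / 56.0;
}
```

As an example, a 100 MHz input with M = 10, D = 1 and O = 8 gives 100 * 10 / (1 * 8) = 125 MHz. Under the stated VCO assumption the VCO runs at 1000 MHz, so one dynamic phase-shift step is 1000 ps / 56, or about 17.9 ps.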

.
FPGA DSP resources

.
Massively Parallel Signal Processing
Standard DSP processor – sequential (generic DSP)
– Single-MAC unit: data in, coefficients, multiply, accumulate, data out
– 200 clock cycles needed
– 1.2 GHz / 200 clock cycles = 6.0 MSPS
FPGA – fully parallel implementation (Virtex-7 FPGA)
– Chain of registers on data in feeds 200 multipliers (coefficients C0 … C199) and adders
– 200 operations in 1 clock cycle
– 741 MHz / 1 clock cycle = 741 MSPS
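
The comparison above is a 200-tap filter computed either one multiply-accumulate at a time or all at once. The C++ sketch below shows the computation and the sample-rate arithmetic. The names `kTaps`, `fir_sample` and `msps` are assumptions for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kTaps = 200;  // tap count used on the slide

// One output sample of a direct-form FIR filter.
// On a single-MAC processor each loop iteration takes a clock, so one
// sample costs kTaps clocks. With the loop fully unrolled into hardware,
// all kTaps multiplies can happen in the same clock.
float fir_sample(const std::array<float, kTaps>& coeff,
                 const std::array<float, kTaps>& window) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < kTaps; ++i) {
        acc += coeff[i] * window[i];
    }
    return acc;
}

// Sample rate in MSPS for a clock in MHz and a number of clocks per sample.
double msps(double clock_mhz, double clocks_per_sample) {
    assert(clocks_per_sample > 0.0);
    return clock_mhz / clocks_per_sample;
}
```

Plugging in the slide's numbers, `msps(1200.0, 200.0)` gives 6.0 and `msps(741.0, 1.0)` gives 741, which is how a slower clock ends up with the higher throughput.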

.
DSP Slice Features
7 series FPGAs DSP slice 100% based on Virtex-6 FPGA DSP48E1

– 25x18 multiplier
– 25-bit pre-adder
– Flexible pipeline
– Cascade in and out
– Carry in and out
– 96-bit MACC
– SIMD support
– 48-bit ALU
– Pattern detect
– 17-bit shifter
– Dynamic operation (cycle by cycle)

.
7 Series FPGAs DSP Architecture
Based on proven Virtex-6 DSP48E1 design
Adder-chain implementation
– No performance degradation or slow-down when using pre- and post-adders
– Consumes zero logic, seamless cascading of DSP48E1 slices
– Filter speed is optimized if the number of taps fits within DSP column height
High-precision, high-bandwidth operation
– 25x18 input resolution, 48-bit output resolution
– Up to 5,335 GMAC/s (symmetrical filter implementation in XC7VX690T)
Cycle-by-cycle operation
– Time-sharing DSP slices with multiple data streams, processed and stored in SRL16s
Single-instruction, multiple-data operation and pattern detection
– Fully compatible with Virtex-6 IP
.
Pre-adder and Pipelines
Pre-adder and D pipeline with 2 new registers
– Doubles the efficiency of symmetrical filters and convolutions
Fine-grain access to the A and B pipelines
– Optimizes the implementation of certain algorithms, like short FFTs, sequential complex multiplications, etc.
Enhanced control of the paths to the post-adder and to the multiplier
– Easier pipeline balancing, higher operation frequency
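
The claim that a pre-adder doubles the efficiency of symmetrical filters can be seen in a few lines of C++. In a symmetric filter the first and last coefficients are equal, the second and second-to-last are equal, and so on, so the two samples that share a coefficient can be added first and multiplied once. The names and the tap count `kSymTaps` below are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kSymTaps = 8;  // hypothetical even tap count
static_assert(kSymTaps % 2 == 0, "sketch assumes an even number of taps");

// Reference version: one multiply per tap.
int fir_direct(const std::array<int, kSymTaps>& coeff,
               const std::array<int, kSymTaps>& x) {
    int acc = 0;
    for (std::size_t i = 0; i < kSymTaps; ++i) {
        acc += coeff[i] * x[i];
    }
    return acc;
}

// Pre-adder version: one multiply per pair of taps.
int fir_symmetric(const std::array<int, kSymTaps>& coeff,
                  const std::array<int, kSymTaps>& x) {
    int acc = 0;
    for (std::size_t i = 0; i < kSymTaps / 2; ++i) {
        const std::size_t mirror = kSymTaps - 1 - i;
        assert(coeff[i] == coeff[mirror]);  // filter must be symmetric
        const int pre = x[i] + x[mirror];   // the pre-adder step
        acc += pre * coeff[i];              // a single multiply for two taps
    }
    return acc;
}
```

Both functions return the same value for a symmetric coefficient set, but the second uses half as many multiplies, which in hardware means half as many multipliers.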

.
DSP Performance through the DSP48E1 Slice
Virtex-6, Spartan-7, Artix-7, Kintex-7, Virtex-7
2 DSP48E1 slices per DSP48 tile
– Input flexibility through 5 shared interconnect
741 MHz Fmax
[Diagram: DSP48E1 slice with inputs A, B, C and D, pre-adder, 25x18 multiplier, 48-bit accumulator (+/-), pattern detector (=) and output P.]

.
Zynq® UltraScale+ MPSoC System Features
Application Processor: 64-bit Quad-Core
Real-Time Processors: 32-bit Dual-Core
Graphics Processor: ARM Mali-400MP2
Memory Subsystem: High Bandwidth, Low Latency
High Speed Peripherals: Key Interfaces
Fabric Acceleration: Customizable Engines, High Speed Connectivity
Video Codec: 8K4K (15fps), 4K2K (60fps)
Platform & Power Management: Granular Power Control, Functional Safety
Configuration & Security Unit: Anti-Tamper & Trust, Industry Standards

.
New & Enhanced UltraScale+™ Capabilities
New at 16nm
– FinFET Perf/Watt: Optimized Process & Voltage Scaling
– SmartConnect: Addressing IP & Fabric Interconnect bottlenecks
– High Density I/O: Power-Optimized I/O, MIPI D-PHY Support
– Packaging: For Signal Integrity, PCB Area, Thermal
– UltraRAM: Massive Capacity SRAM Replacement
Enhanced at 16nm
– Security & Reliability: Decrypt/Auth/Anti-Tamper, Improved SEU Performance
– Transceivers: 16G, 32G, & 58G (58G new at 16nm), Fractional PLL
– 3rd Gen 3D IC & HBM: Greater Inter-Die FMAX, HBM for 20X bandwidth (HBM new at 16nm)
– Networking IP: 100G EMAC w/RS-FEC, 150G Interlaken w/300GLL
– DSP: Floating/Fixed Pt Enhanced, 2.5X Bandwidth (vs. 28nm)
– External Memory: 2400 Mb/s (20nm), 2,666 Mb/s (16nm)
– Block RAM: Hardened Cascading, Power-Optimized Silicon
– PCI Express®: Gen3 x16, Gen4 x8

.
Processing System Summary: From Cost-Optimized to UltraScale+
Application Processing Unit
– FPGA-Based SoC: 32-bit Xilinx MicroBlaze, up to 220MHz* @ 1.4DMIPS/MHz
– Zynq-7000 SoC: 32-bit ARM® Cortex™-A9, up to 1GHz @ 2.5DMIPS/MHz
– Zynq UltraScale+ MPSoC: 64-bit ARM Cortex-A53, up to 1.5GHz @ 2.3DMIPS/MHz
Real-Time Processing Unit
– FPGA-Based SoC: 32-bit Xilinx MicroBlaze, up to 220MHz* @ 1.3DMIPS/MHz
– Zynq-7000 SoC: 32-bit Xilinx MicroBlaze, up to 220MHz* @ 1.3DMIPS/MHz
– Zynq UltraScale+ MPSoC: Dual-core ARM Cortex-R5, up to 600MHz @ 1.6DMIPS/MHz
Multimedia Processing
– FPGA-Based SoC: --
– Zynq-7000 SoC: --
– Zynq UltraScale+ MPSoC: ARM Mali™-400 MP2 GPU, up to 667MHz; Video Codec supporting H.264/H.265
External DDR Interface
– FPGA-Based SoC: Flexible
– Zynq-7000 SoC: DDR3, DDR3L, DDR2, LPDDR2
– Zynq UltraScale+ MPSoC: DDR4, LPDDR4, DDR3, DDR3L, LPDDR3
High-Speed Peripherals
– FPGA-Based SoC: Flexible
– Zynq-7000 SoC: USB 2.0, Gigabit Ethernet, SD/SDIO
– Zynq UltraScale+ MPSoC: PCIe® Gen2, USB 3.0, SATA 3.1, DisplayPort, Gigabit Ethernet, SD/SDIO
Max I/O Count
– FPGA-Based SoC: Up to 500*
– Zynq-7000 SoC: Up to 528
– Zynq UltraScale+ MPSoC: Up to 668
Transceivers
– FPGA-Based SoC: Up to 16 @ 6.6Gb/s*
– Zynq-7000 SoC: Up to 16 @ 12.5Gb/s
– Zynq UltraScale+ MPSoC: Up to 76 @ up to 32.75Gb/s
Package Size
– FPGA-Based SoC: 8x8 to 35x35*
– Zynq-7000 SoC: 13x13 to 35x35
– Zynq UltraScale+ MPSoC: 19x19 to 45x45
* Specifications based on Cost-Optimized devices, higher capabilities on UltraScale™ devices

.
A Broad Range of Processing Performance
Full Scalable Software, Tools, and IP Ecosystem
– Migrate next gen design with same tools & code
[Diagram: processing capability rising across the portfolio, from MicroBlaze on Spartan-6, Spartan-7, Artix-7 and Kintex-7 FPGA logic, through single and dual A9, to dual and quad A53 with dual R5, GPU and VCU on UltraScale+ FPGA logic.]

.
Covering the Full Spectrum of Memory Solutions
Block RAM (10s of megabits)
– Shallow buffering
– 14.9GB/s, 36Kb density
UltraRAM (100s of megabits)
– Deep video & packet buffering
– 11.7GB/s, 288Kb density
High Bandwidth Memory (Multi-Gigabyte)
– Deeper buffering at highest performance/watt
– 460GB/s, 64Kb density
External Memory, 2666-DDR4 (Multi-Gigabyte)
– Large data storage
– 21.3GB/s, for 16GB density
[Diagram label: gap in memory hierarchy between Block RAM and external memory.]

.
Unlocking Performance, Bandwidth, & Integration
[Bar chart: 28nm, 20nm and 16nm compared relative to 28nm on four metrics. Bar labels: 1 (28nm baseline on each metric); 1.5, 1.7, 1.7, 2.1; 2.4X, 3X, 4X, 7X.]
– Logic Fabric Performance/Watt: enhanced fabric with FinFET performance
– Serial Bandwidth: up to 128 transceivers at up to 32.75 Gb/s
– DSP Bandwidth: ~12,000 DSP slices running at ~900 MHz
– On-Chip Memory: UltraRAM for SRAM device replacement

.
Hardware Determinism for Critical Tasks & Real-Time Response
ARM Cortex-A9 for application processing
– Uncertain response time (task0, task1 …)
– Non-critical + compute-intensive tasks
– Running Linux on the ARM Cortex-A9 processor
MicroBlaze soft processor (pipeline-configurable) in programmable logic
– Deterministic response time (task0, task1, task2 …)
– Critical tasks
– Running RTOS on MicroBlaze
Function accelerated in fabric: dedicated engines in programmable logic for deterministic processing
– Parallelized, highest performance (task0, task1, task2 … run side by side)
– Critical + compute-intensive tasks

.
MicroBlaze as a Co-Processor in a Zynq-7000
Maintaining Separate Threads for Greater Reliability and Performance
1 ARM processor for compute-intensive tasks
2 MicroBlaze for offloading the ARM
• Housekeeping chores
• Network communication
• User interfaces
Shared resources for integrated data passing
– Shared access (via DMA), master or slave
– Access to hardened peripherals
– Access to DDR controller (DMA)
– Access to on-chip memory (OCM)
– ARM access to BRAM (as cache)
– Shared low latency interrupt (GIC)
Multiple instantiations as needed (MicroBlaze1, MicroBlaze2 … MicroBlazeN)
– Drag, drop, and customize from the IP Catalog: MicroBlaze, MicroBlaze MCS, embedded block RAM
[Diagram: ARM Cortex™-A9 (single or dual-core) with 512KB L2 cache and 256KB OCM, AXI interconnect, DDR controller, flash controller, SPI, I2C, CAN, UART, SDIO, GPIO, USB, GigE, timer, JTAG, config, GIC and DMA.]
See it on: “Zynq & MicroBlaze IOP Block, OCM & Memory Resource Sharing”

.
System-Wide Safety and Security Across the Portfolio
Hardware (FPGA Fabric)
– AES-256 encryption (1) → anti-cloning & reverse engineering
– SHA-256 authentication → ensures trusted source
– Temp/Volt. monitor → flags ‘out-of-spec’ condition
– Isolated Design Flow (IDF) → fault containment (2)
Software (Zynq SoC)
– Secure boot, protects from attack at startup
– ARM TrustZone (3) to isolate ‘main’ OS from secure OS
– Memory protection against malware injection
– Rich ecosystem of run-time Security IP
Hardware attacks: tamper, cloning, reverse engineering, spoofing
Software attacks: snooping, code modification, malware, denial of service attack
1: NIST-Approved (National Institute of Standards and Technology)
2: Physical separation of safety-critical regions of the design
3: A compromised OS cannot access ‘secured’ data in the secure OS

.
Common Design Tools across the Portfolio
Vivado® HLx Design Suite: System- & IP-Centric Design
– IP integration and design for hardware design
– High level C synthesis for IP creation
– Scalable IP catalog (Xilinx IP, HLS IP, custom IP, …), best-in-class integration automation, implementation
Xilinx SDK: Processor Design and Debug
– MicroBlaze and ARM processors
– Familiar SW dev environments (C/C++, ecosystem, …)
– Leverage smallest, lowest cost device
SDSoC: Complete C/C++ Environment
– Design entirely in C/C++ for Zynq

.
Vivado® Design Suite: IP Integration and HW/SW Development
$1,000+ Value (Vivado + SDK): free tool, WebPACK download
Project Navigator, IP Catalog, Implementation & Verification
– Project navigator for a guided flow from design, to verification, to implementation
– Drag and drop hundreds of Xilinx and partner IP cores or design your own in RTL or C
– Fast HW and SW implementation and verification environments

.
Xilinx SDK for Ease-of-Use for Everything You Need
A Single Cockpit for Everything You Need
Free tool, WebPACK download
Windows or Linux hosted Eclipse-based IDE
Linaro GCC compiler
Built for
– Performance optimization
– Firmware & application development
– Linux and bare-metal development
– Code profiling
– Board bring-up
Supports all embedded configurations
– Simple single-processor applications
– Multiple-processor systems for both SMP and AMP configurations

.
Embedded Portfolio Run Time Software Support
OS ecosystem by processor: MicroBlaze, Cortex-A9, Cortex-A53, Cortex-R5
OS
– Baremetal
– OpenAMP
– Linux: Xilinx PetaLinux
– Linux: Mentor, Wind River, MontaVista
– Linux: ArchLinux, Enea, Timesys
– Android
– Microsoft – Windows Emb Compact ’07/’13
RTOS
– FreeRTOS
– Mentor Graphics - Nucleus
– Micrium - uC/OS-II & III
– Sciopta Systems – Sciopta
– Wind River - VxWorks7
– eSol eT-kernel
– GreenHills – INTEGRITY
– Lynx - LynxOS7
– QNX - Neutrino
– Sysgo – PikeOS
– expresslogic - ThreadX
Hypervisor
– Mentor Graphics
– Wind River
– Open Source - Xen
– Lynx - LynxSecure

.
Getting Started with Cost-Optimized and Zynq UltraScale+ Kits
– Microboard: Spartan-6 LX9
– ARTY S7: Spartan-7 S25, Spartan-7 S50
– ARTY A7: Artix-7 A35T
– ARTY Z7: Zynq-7000 7Z010, Zynq-7000 7Z020
– MiniZed: Zynq-7007S
– UltraZed: Zynq UltraScale+ ZU3EG
$109 $209

.
Zynq-7000S Single-Core on the Avnet MiniZed
Includes SDSoC License
Based on Single-Core Z-7007S Device (23K LCs)
Free SDSoC License with Evaluation Kit
Comes with Board Support Package
– Basic Software Stack
– Board-specific design rule checks for rapid bring-up
[Diagram: SDSoC flow from C/C++ Applications through System-level Profiling and Specify Functions for Acceleration to Full System Generation.]

.
Low Cost Single Core Zynq-7007S: ARM A9 + 23K FPGA Logic Cells
Zynq-7000S devices (ARM Cortex-A9 + 7 series PL equivalent)
– Z-7007S: Artix®-7, 23K logic cells
– Z-7012S: Artix-7, 55K logic cells
– Z-7014S: Artix-7, 65K logic cells
Spartan-7 part numbers for comparison
– XC7S6: 6,000 logic cells
– XC7S15: 12,800 logic cells
– XC7S25: 23,360 logic cells
FPGA fabric equivalent to 3rd smallest Spartan-7 device
Featured in the MiniZed Evaluation Kit
– Integrated Wifi and Bluetooth
– Arduino Shield headers
– 2 Pmod Headers: 100s of Pmods to choose from

.
MiniZed Platform Block Diagram
[Block diagram: Zynq-7000S XC7Z007S-1CLG225C (with heat sink) surrounded by 512Mb DDR3L (x16, Vterm), 128Mb QSPI (x4) and 8GB eMMC (SDIO); oscillators at 24MHz and 33.33MHz; USB 2.0 ULPI PHY with Type A connector; WiFi & Bluetooth module (SDIO, UART); dual USB to serial (UART, GPIO) over Micro AB to the connected computer; Arduino-style headers and two PMOD connectors; two PL bi-filament LEDs with LED drivers; PS user button, PL user button and reset button; boot mode select and power level switch x 2; Done indicator; MEMS microphone; motion & temp sensor; external 5V power supply over Micro AB; integrated DC/DC solution & power supervisor with reset, supplying 1.35V, 1.0V, 1.8V and 3.3V.]
.
Application Development
Algorithm Development
Platform Development
[Diagram labels: DNN, CNN, GoogLeNet, SSD, FCN …]
.
Removing the Barrier to Broad Adoption: reVISION Stack
[Chart: development time against ease of use for the traditional RTL flow, SDSoC C/C++, OpenCV apps and machine learning (ML) apps. Steps shown for development time: algorithm to RTL, system integration, bitstream generation.]
20% Xilinx/80% User
– “A subsystem design used to take 3 weeks. I’ve done it in 4 days with SDSoC.” - DSP Engineer
80% Xilinx/20% User
– “reVISION will shorten our development cycle for new products and upgrades by up to 12 months.” - System Architect

.