Performance and Programming environment of a combined GPU/FPGA architecture
Supercomputing 1969-2018
Milestones: 1969: MFlops; 1985: GFlops; 1997: TFlops; 2008: PFlops; 2018: EFlops?
[Figure: peak performance (log scale, 1.E+03 to 1.E+18) versus year. Year ticks: 1969, 1974, 1982, 1985, 1990, 1996, 1997, 2004, 2005, 2008, 2010, 2011. Machines plotted: CDC 7600, CDC STAR, Cray X-MP, Cray-2, Fujitsu NWT, Hitachi SR2201, Intel ASCI, NEC Earth Simulator, IBM Blue Gene, IBM Roadrunner, Tianhe I.]
Trend: MFLOPS(y) = 1.72^(y-1969)
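The trend formula can be evaluated directly to check it against the milestones. The following is a minimal sketch, assuming the formula is an exponential with base 1.72 per year and integer years; the function name trend_mflops is hypothetical and not from the slides.

```c
#include <assert.h>

/* Trendline from the slide: MFLOPS(y) = 1.72^(y - 1969).
   Assumption: integer years y >= 1969; repeated multiplication
   is used instead of pow() to avoid a libm dependency. */
double trend_mflops(int year) {
    assert(year >= 1969);
    double mflops = 1.0;               /* 1 MFlops in 1969 */
    for (int y = 1969; y < year; y++)
        mflops *= 1.72;                /* growth factor per year */
    return mflops;
}
```

Under this reading, trend_mflops(2008) lands on the order of 1e9 MFlops, i.e. the PFlops range, which agrees with the 2008 milestone.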
Trendlines
Supercomputing FLOPS > Moore's law
Memory speed increase << Moore's law
[Figure: log10(MFlops) (0 to 18) versus year (1960 to 2020). Series: MFlops; MFlops trendline (R = 0.97); Moore's law; memory speed increase (relative).]
Super desktop with GPU

2009: NVIDIA personal supercomputer: Tesla C2050
  515 DP GFlops, 1 TFlops (MADD)
  144 GB/s memory bandwidth
  384-bit bus (3 GB memory)
  448 thread processors
Limitations:
  regular SIMD applications
  PC-to-accelerator transfer: PCIe x16
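The peak figures above follow from simple arithmetic on the hardware parameters. The sketch below is illustrative only: the 1.15 GHz processor clock and the 3.0 GT/s effective memory data rate are assumptions not stated on the slide, and the function names are hypothetical.

```c
#include <assert.h>

/* Peak GFlops = cores * clock (GHz) * flops per core per cycle.
   Assumption: a MADD counts as 2 flops; double precision runs at half rate. */
double peak_gflops(int cores, double clock_ghz, double flops_per_cycle) {
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0.0);
    return cores * clock_ghz * flops_per_cycle;
}

/* Peak bandwidth (GB/s) = bus width in bytes * transfers per second (GT/s). */
double peak_bandwidth_gbs(int bus_bits, double gtransfers_per_s) {
    assert(bus_bits > 0 && bus_bits % 8 == 0);
    return (bus_bits / 8) * gtransfers_per_s;
}
```

With the assumed clocks, peak_gflops(448, 1.15, 2.0) gives about 1030 GFlops (the "1 TFlops (MADD)" figure), peak_gflops(448, 1.15, 1.0) gives about 515 GFlops, and peak_bandwidth_gbs(384, 3.0) gives 144 GB/s.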
Extended with FPGA processing

Field programmable gate array (FPGA)
  Up to 6 FPGA modules
  512 MB DDR3 per module
  PCIe x8 per module
  PCIe x16 per board
Combining GPU and FPGA strengths

  Image processing + Bio-informatics
  Face recognition + Security
  Audio processing + HMM speech recognition
  Traffic analysis + Neural network control
Super desktop architecture
Programming language: C
GPU: CUDA, OpenCL
  C -> PTX (Parallel Thread Execution)
FPGA: HLS (High Level Synthesis)
  C -> VHDL (VHSIC Hardware Description Language)
History:
  AutoESL (Xilinx) -> Vivado HLS
  Catapult C tool from Mentor Graphics
  C-to-HDL tool from Politecnico di Milano (Italy)
  C-to-Verilog tool from www.c-to-verilog.com
  DIME-C from Nallatech
  Handel-C from Celoxica (defunct)
  HercuLeS (C/assembly-to-VHDL) tool
  Impulse C from Impulse Accelerated Technologies
  Nios II C-to-Hardware Acceleration Compiler from Altera
  ROCCC 2.0 (free and open source C-to-HDL tool) from Jacquard Computing Inc.
  SPARK (a C-to-VHDL tool) from University of California, San Diego
  SystemC from Celoxica (defunct)
FPGA programming environment

ROCCC:
  target:
    platform-dependent modules (IP cores) go into a library
    platform-independent systems use modules as functions
    replicate, parallelize and pipeline
  optimizations:
    low level: arithmetic balancing
    high level: loop unrolling, fusion, wavefront, mul/div elimination, subexpression elimination
    data optimizations: stream with smart buffer
  output:
    VHDL design + testbench
    PCore (Xilinx)
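Three of the high-level optimizations listed for ROCCC (loop fusion, subexpression elimination, mul/div elimination) can be illustrated at source level. This is a hand-written sketch in plain C, not ROCCC output; the function names two_loops and fused are hypothetical, and unsigned data is assumed so the shift is well defined.

```c
#include <assert.h>
#include <stddef.h>

/* Before: two separate loops, each recomputing x[i] * 4. */
void two_loops(const unsigned *x, const unsigned *y,
               unsigned *a, unsigned *b, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = x[i] * 4u;
    for (size_t i = 0; i < n; i++) b[i] = x[i] * 4u + y[i];
}

/* After: the loops are fused, the shared product is computed once,
   and the multiply by a power of two becomes a shift. */
void fused(const unsigned *x, const unsigned *y,
           unsigned *a, unsigned *b, size_t n) {
    assert(x && y && a && b);
    for (size_t i = 0; i < n; i++) {
        unsigned t = x[i] << 2;   /* x[i] * 4 without a multiplier */
        a[i] = t;
        b[i] = t + y[i];
    }
}
```

Both versions produce the same a and b; the fused form reads x once per element and needs no multiplier, which is the kind of saving that matters when every operator costs FPGA area.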
FPGA programming environment

AutoESL:
  target:
    Xilinx FPGAs
  optimizations:
    code: loop unroll, fusion, pipeline, inline
    data: remap, partition, arrays, reshape, resource, stream
    interface selection: handshake, fifo, bus, register, ...
  output:
    VHDL design
    performance report: timing, design and loops latency, utilization, area, power, interface
    design viewer with timeline, regs and interfaces, with feedback to source code
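In C-based HLS flows, code optimizations such as pipeline and inline are typically requested through pragmas in the source. The sketch below uses the Vivado HLS pragma spelling (the successor of AutoESL); whether the AutoESL version used for the slides accepted exactly these directives is an assumption. The kernel axpy and helper mul_add are hypothetical examples. An ordinary C compiler ignores the pragmas, so the code still runs as plain C.

```c
#include <assert.h>

#define LEN 16

/* Small helper; INLINE asks the HLS tool to dissolve the function boundary. */
static int mul_add(int a, int x, int y) {
#pragma HLS INLINE
    return a * x + y;
}

/* out[i] = a * x[i] + y[i].
   PIPELINE II=1 requests that a new loop iteration starts every cycle. */
void axpy(int a, const int x[LEN], const int y[LEN], int out[LEN]) {
    assert(x && y && out);
    for (int i = 0; i < LEN; i++) {
#pragma HLS PIPELINE II=1
        out[i] = mul_add(a, x[i], y[i]);
    }
}
```

The functional behaviour is unchanged by the directives; only the generated hardware schedule differs, which is why the performance report is needed to see their effect.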
Compiler optimizations
Optimization (columns: AutoESL, ROCCC):
  Software pipelining
  Arithmetic balancing
  Loop unrolling
  Loop flatten hierarchy
  Loop fusion (merge)
  Function inlining
  Array map (combine arrays H or V)
  Array partition (into smaller, parallel arrays)
  Array reshape (cyclic, block)
  Array resource (e.g. single or DP RAM)
  Array streaming (FIFOs instead of RAMs)
  Smart Buffer
  Interface (handshake, none, stream, ...)
AutoESL: x x x x x x x x x x x x
ROCCC: x x x
Tuning design for performance

Simple example: sum of array (N = 1.e8)
  for (i = 0; i < N; i++) sum += A[i];
No optimizations: 2 * N = 200e6 cycles

Unroll 8 times + arithmetic balancing
  87e6 cycles, gain = 2.3
  only 2 parallel adds?
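Written out by hand, an 8-times unrolled loop with arithmetic balancing looks roughly as follows. This is a plain C sketch of the transformation, not tool output; the function name sum_unroll8 is hypothetical, and integer data is assumed so that reordering the additions does not change the result.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n ints, consuming 8 elements per iteration through a balanced
   adder tree (depth 3) instead of a chain of 8 dependent adds. */
long sum_unroll8(const int *A, size_t n) {
    assert(A != NULL || n == 0);
    long sum = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* Level 1: four independent adds. */
        long s01 = (long)A[i]     + A[i + 1];
        long s23 = (long)A[i + 2] + A[i + 3];
        long s45 = (long)A[i + 4] + A[i + 5];
        long s67 = (long)A[i + 6] + A[i + 7];
        /* Level 2: two independent adds. */
        long s0123 = s01 + s23;
        long s4567 = s45 + s67;
        /* Level 3, then accumulate. */
        sum += s0123 + s4567;
    }
    for (; i < n; i++)   /* remainder when n is not a multiple of 8 */
        sum += A[i];
    return sum;
}
```

The tree exposes four adds that could run in parallel at level 1, but each of them needs two operands from memory, so the number of loads available per cycle decides how many actually do.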

Dual-port memory: only 2 loads at a time! -> I/O bottleneck

Partition A over 4 memories (= 8 ports, 256 bits)
  8 loads, 4 parallel adds
  63e6 cycles, gain = 4.8
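The effect of partitioning can be modelled in software by spreading A over 4 banks. The sketch below assumes a cyclic distribution (element i goes to bank i % 4); the slides do not say which partitioning scheme was used, and the names partition_cyclic and sum_partitioned are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BANKS 4

/* Cyclic partition: element i lives in bank i % BANKS at offset i / BANKS. */
void partition_cyclic(const int *A, size_t n, int *bank[BANKS]) {
    for (size_t i = 0; i < n; i++)
        bank[i % BANKS][i / BANKS] = A[i];
}

/* Each bank models a dual-port memory: one iteration reads 2 words
   per bank (8 loads in total) and performs 4 independent adds. */
long sum_partitioned(int *bank[BANKS], size_t n) {
    assert(n % (2 * BANKS) == 0);      /* sketch handles full groups of 8 only */
    long sum = 0;
    size_t depth = n / BANKS;          /* words stored per bank */
    for (size_t j = 0; j < depth; j += 2) {
        long p[BANKS];
        for (int b = 0; b < BANKS; b++)
            p[b] = (long)bank[b][j] + bank[b][j + 1];
        sum += (p[0] + p[1]) + (p[2] + p[3]);
    }
    return sum;
}
```

In hardware the inner loop over the banks would be fully parallel, since every bank has its own two ports.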

Balancing Unrolling and Partitioning
[Figure: # cycles (0.E+00 to 3.E+08) versus unroll factor (log scale; 1, 8, 64, 512). Curves: 2 ports only; Partition=2, 4 parallel streams (DP); Partition=4, 8 parallel streams (DP); Partition=8, 16 parallel streams (DP); Partition=16, 32 parallel streams (DP). Regions marked: I/O bound, resource bound.]
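The I/O-bound region can be reasoned about with a very rough model: per cycle, no more elements can be consumed than there are memory ports, however far the loop is unrolled. The function below is an idealized lower bound under that single assumption; it ignores add latency, control overhead and resource limits, so it does not reproduce the measured cycle counts. The name min_cycles is hypothetical.

```c
#include <assert.h>

/* Idealized lower bound on cycles for summing n elements when at most
   `ports` loads can be issued per cycle and the loop is unrolled
   `unroll` times. Assumption: one element per port per cycle. */
long min_cycles(long n, int unroll, int ports) {
    assert(n > 0 && unroll > 0 && ports > 0);
    int per_cycle = unroll < ports ? unroll : ports;  /* elements consumed per cycle */
    return (n + per_cycle - 1) / per_cycle;           /* ceiling division */
}
```

In this model, unrolling beyond the number of ports gives no further gain: min_cycles(100000000, 8, 2) and min_cycles(100000000, 64, 2) are both 50e6, while raising the port count to 8 lowers the bound to 12.5e6.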

Resource saturation at 512 times unroll
[Figure: # cycles (0.E+00 to 3.E+08) versus unroll factor (log scale; 1, 8, 64, 512) for Partition=16, 32 parallel streams (DP), on Spartan3e and on Virtex 6.]
