
Performance and Programming environment of a combined GPU/FPGA architecture

Supercomputing 1969-2018
Milestones: 1969: MFlops; 1985: GFlops; 1997: TFlops; 2008: PFlops; 2018: EFlops?

[Figure: performance (log scale, 1.E+03 to 1.E+18) versus year (ticks: 1969, 1974, 1982, 1985, 1990, 1996, 1997, 2004, 2005, 2008, 2010, 2011). Systems shown: CDC 7600, CDC STAR, Cray X-MP, Cray-2, Fujitsu NWT, Hitachi SR2201, Intel ASCI, NEC Earth Simulator, IBM Blue Gene, IBM Roadrunner, Tianhe I.]

Trend: MFLOPS(y) = 1.72^(y-1969)
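
The trend MFLOPS(y) = 1.72^(y-1969) can be evaluated directly. The following is a minimal sketch; the function names trend_mflops and print_trend are illustrative and not part of the slides. For 2008 the fit gives roughly 1.5e9 MFLOPS, which is in the PFlops range.

```c
#include <stdio.h>

/* Trend from the slide: MFLOPS(y) = 1.72^(y - 1969).
 * The exponent is an integer, so a multiplication loop is enough. */
double trend_mflops(int year)
{
    double mflops = 1.0;
    for (int y = 1969; y < year; y++)
        mflops *= 1.72;
    return mflops;
}

/* Hypothetical helper: print the trend value for the milestone years. */
void print_trend(void)
{
    const int years[] = {1969, 1985, 1997, 2008, 2018};
    for (int i = 0; i < 5; i++)
        printf("%d: %.3g MFLOPS\n", years[i], trend_mflops(years[i]));
}
```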

Trendlines
Supercomputing FLOPS > Moore's law
Memory speed increase << Moore's law

[Figure: log10(MFlops), 0 to 18, versus year, 1960 to 2020. Series: MFlops; MFlops trendline (R = 0.97); Moore's law; memory speed increase (relative).]
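
To get a feeling for "FLOPS > Moore's law", the growth over a decade can be compared. This sketch assumes Moore's law means a doubling every two years; the slide does not state which variant it uses. The annual factor 1.72 is the one from the supercomputing trend. Under these assumptions, ten years give a factor of about 227 for FLOPS against 32 for Moore's law.

```c
/* Growth over `years` years at a fixed annual factor. */
double growth_over(double annual_factor, int years)
{
    double g = 1.0;
    for (int i = 0; i < years; i++)
        g *= annual_factor;
    return g;
}

/* Growth over `years` years when capacity doubles every `period` years.
 * Only whole doubling periods are counted. */
double doubling_growth(int period, int years)
{
    return growth_over(2.0, years / period);
}
```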

Super desktop with GPU


2009: NVIDIA "personal supercomputer", Tesla C2050
  515 DP GFlops, 1 TFlops (MADD)
  144 GB/s memory bandwidth
  384-bit bus (3 GB memory)
  448 thread processors
Limitations:
  regular SIMD applications
  PC to accelerator transfer: PCIe x16
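
The headline figures can be cross-checked with simple arithmetic. The sketch below assumes a 1150 MHz processor clock, an effective memory rate of 3000 MT/s, and double precision at half the MADD rate; none of these three values appears on the slide. The function names are illustrative.

```c
/* Peak rate in MFlops: cores x clock (MHz) x flops per core per cycle.
 * A MADD counts as 2 flops. */
long peak_mflops(long cores, long clock_mhz, long flops_per_cycle)
{
    return cores * clock_mhz * flops_per_cycle;
}

/* Memory bandwidth in MB/s: bus width in bytes x mega-transfers per second. */
long bandwidth_mbs(long bus_bits, long mega_transfers)
{
    return bus_bits / 8 * mega_transfers;
}
```

With 448 thread processors, peak_mflops(448, 1150, 2) gives 1030400 MFlops (about 1 TFlops), half of which is 515200 MFlops for double precision, and bandwidth_mbs(384, 3000) gives 144000 MB/s.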

Extended with FPGA processing


Field programmable gate array
  Up to 6 FPGA modules
  512 MB DDR3 per module
  PCIe x8 per module
  PCIe x16 per board

Combining GPU and FPGA strengths


  Image processing + Bio-informatics
  Face recognition + Security
  Audio processing + HMM speech recognition
  Traffic analysis + Neural network control

Super desktop architecture

Programming language: C
GPU: CUDA, OpenCL
  C -> PTX (Parallel Thread Execution)

FPGA: HLS (High Level Synthesis)
  C -> VHDL (VHSIC Hardware Description Language)
History:
  AutoESL (Xilinx) -> Vivado HLS
  Catapult C tool from Mentor Graphics
  C-to-HDL tool from Politecnico di Milano (Italy)
  C-to-Verilog tool from www.c-to-verilog.com
  DIME-C from Nallatech
  Handel-C from Celoxica (defunct)
  HercuLeS (C/assembly-to-VHDL) tool
  Impulse C from Impulse Accelerated Technologies
  Nios II C-to-Hardware Acceleration Compiler from Altera
  ROCCC 2.0 (free and open source C-to-HDL tool) from Jacquard Computing Inc.
  SPARK (a C-to-VHDL tool) from University of California, San Diego
  SystemC from Celoxica (defunct)

FPGA programming environment


ROCCC:
  target:
    platform dependent modules (IP cores) into library
    platform independent systems use modules as functions
    replicate, parallelize and pipeline
  optimizations:
    low level: arithmetic balancing
    high level: loop unrolling, fusion, wavefront, mul/div elimination, subexpression elimination
    data optimizations: stream with smart buffer
  output:
    VHDL design + testbench
    PCore (Xilinx)

FPGA programming environment


AutoESL:
  target:
    Xilinx FPGAs
  optimizations:
    code: loop unroll, fusion, pipeline, inline
    data: remap, partition, arrays, reshape, resource, stream
    interface selection: handshake, fifo, bus, register, ...
  output:
    VHDL design
    performance report: timing, design and loops latency, utilization, area, power, interface
    design viewer with timeline, regs and interfaces, with feedback to source code

Compiler optimizations (AutoESL vs. ROCCC)
Optimization:
  Software pipelining
  Arithmetic balancing
  Loop unrolling
  Loop flatten hierarchy
  Loop fusion (merge)
  Function inlining
  Array map (combine arrays H or V)
  Array partition (into smaller, parallel arrays)
  Array reshape (cyclic, block)
  Array resource (e.g. single or DP RAM)
  Array streaming (FIFOs instead of RAMs)
  Smart Buffer
  Interface (handshake, none, stream, ...)
AutoESL: x x x x x x x x x x x x
ROCCC: x x x
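
As an illustration of one entry in the list, loop fusion (merge) combines two loops over the same range into one, so the data is traversed once and the intermediate result need not be written back between passes. The sketch below is a hand-written example with hypothetical function names; in an HLS tool the transformation would be requested from the compiler instead.

```c
#include <stddef.h>

/* Before fusion: two separate passes over the data. */
void scale_then_offset(int *out, const int *in, size_t n, int k, int c)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k;
    for (size_t i = 0; i < n; i++)
        out[i] = out[i] + c;
}

/* After fusion: a single pass computing the same result. */
void scale_offset_fused(int *out, const int *in, size_t n, int k, int c)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k + c;
}
```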

Tuning design for performance


Simple example: sum of array (N=1.e8)
  for(i=0; i<N; i++) sum += A[i];
No optimizations: 2 * N = 200e6 cycles
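
A self-contained version of the baseline, as a sketch. Reading 2 * N as one load cycle plus one add cycle per element, with no overlap, is an assumption; the slide only gives the total. The element type int and the function names are also illustrative.

```c
#include <stddef.h>

/* Baseline loop from the slide: sum += A[i] for every element. */
long sum_array(const int *A, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += A[i];
    return sum;
}

/* Cycle count quoted on the slide for the unoptimized design: 2 * N. */
unsigned long baseline_cycles(unsigned long n)
{
    return 2UL * n;
}
```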

Tuning design for performance


Unroll 8 times, arithmetic balancing
  87e6 cycles, gain = 2.3
  only 2 parallel adds?
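
Written out by hand, unrolling by 8 with arithmetic balancing looks roughly as follows. This is a sketch of the transformation, not the tool's generated design; the function name and the remainder loop are additions for illustration. The balanced tree needs 3 adder levels for 8 operands, where a plain chain would need 7 dependent adds.

```c
#include <stddef.h>

long sum_unroll8(const int *A, size_t n)
{
    long sum = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* First level: 4 independent adds. */
        long s01 = (long)A[i]     + A[i + 1];
        long s23 = (long)A[i + 2] + A[i + 3];
        long s45 = (long)A[i + 4] + A[i + 5];
        long s67 = (long)A[i + 6] + A[i + 7];
        /* Second and third level, then accumulate. */
        sum += (s01 + s23) + (s45 + s67);
    }
    /* Remainder when n is not a multiple of 8. */
    for (; i < n; i++)
        sum += A[i];
    return sum;
}
```

The four first-level adds are independent, but they can only start together if all 8 operands are available in the same cycle.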

Tuning design for performance


Dual-port memory: only 2 loads at a time! I/O bottleneck

Tuning design for performance


Partition A over 4 memories (= 8 ports, 256 bits)
  8 loads, 4 parallel adds
  63e6 cycles, gain = 4.8
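
The effect of partitioning can be emulated in plain C by spreading A over 4 separate arrays. The sketch assumes a cyclic distribution (element i goes to bank i % 4) and a length that is a multiple of 4; the slide does not say which partitioning scheme was used, and all names are illustrative.

```c
#include <stddef.h>

#define BANKS 4

/* Cyclic partition: element i of A goes to bank i % BANKS, slot i / BANKS. */
void partition_cyclic(const int *A, size_t n, int *bank[BANKS])
{
    for (size_t i = 0; i < n; i++)
        bank[i % BANKS][i / BANKS] = A[i];
}

/* Each iteration reads 2 slots from each of the 4 banks (8 loads, one per
 * port of a dual-port memory) and begins with 4 independent adds.
 * Assumes n is a multiple of BANKS. */
long sum_partitioned(int *bank[BANKS], size_t n)
{
    long sum = 0;
    size_t slots = n / BANKS;
    size_t j = 0;
    for (; j + 2 <= slots; j += 2) {
        long p0 = (long)bank[0][j] + bank[0][j + 1];
        long p1 = (long)bank[1][j] + bank[1][j + 1];
        long p2 = (long)bank[2][j] + bank[2][j + 1];
        long p3 = (long)bank[3][j] + bank[3][j + 1];
        sum += (p0 + p1) + (p2 + p3);
    }
    /* Leftover slot when the number of slots is odd. */
    for (; j < slots; j++)
        for (int b = 0; b < BANKS; b++)
            sum += bank[b][j];
    return sum;
}
```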

Tuning design for performance


Balancing Unrolling and Partitioning
[Figure: number of cycles (0.E+00 to 3.E+08) versus unroll factor (1, 8, 64, 512). Curves: 2 ports only; Partition=2, 4 parallel streams (DP); Partition=4, 8 parallel streams (DP); Partition=8, 16 parallel streams (DP); Partition=16, 32 parallel streams (DP). Annotated regions: I/O bound, resource bound.]
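
The interplay between unroll factor and partitioning can be captured in a first-order model: the number of elements consumed per cycle is limited by whichever is smaller, the unroll factor or the number of memory ports (2 per partition with dual-port memories). This model is an assumption made here for illustration. It ignores latency and FPGA resource limits, so it yields lower bounds only and does not reproduce the measured cycle counts.

```c
/* Hypothetical model: cycles needed to stream n elements when at most
 * min(unroll, 2 * partitions) elements can be consumed per cycle. */
unsigned long model_cycles(unsigned long n, unsigned unroll, unsigned partitions)
{
    unsigned ports = 2u * partitions;
    unsigned per_cycle = unroll < ports ? unroll : ports;
    return (n + per_cycle - 1) / per_cycle;   /* ceiling division */
}

/* Returns 1 when the unroll factor exceeds the available ports, i.e. when
 * further unrolling cannot help without more partitions. */
int is_io_bound(unsigned unroll, unsigned partitions)
{
    return unroll > 2u * partitions;
}
```

For N = 1.e8 and unroll 8, the model gives 5.0e7 cycles with a single dual-port memory and 1.25e7 cycles with 4 partitions.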

Tuning design for performance


Resource saturation at 512 times unroll
[Figure: number of cycles (0.E+00 to 3.E+08) versus unroll factor (1, 8, 64, 512) for Partition=16, 32 parallel streams (DP), on two devices: Spartan3e and Virtex 6.]
