
Virtual and Physical Cellular Architectures for Kilo-processor Chip Computers

Tamás Roska
Hungarian Academy of Sciences and Pázmány P. Catholic University, Budapest, Hungary
IEEE CNNA 2010
A new era in computing starts
• NOT just parallel
• NOT using the present algorithms
• NOT massaging the nanoscale devices to repeat
the functional primitives of CMOS building
blocks
• Running time is NOT the only measure
• There are NO efficient tools

Key: understanding spatial-temporal algorithmic dynamics
Cellular Wave Computers: one way of thinking
Table of Contents
• The technology trend
• The six prototype platforms
• Virtual and Physical Cellular Machines and the
Design Scenario
• Processor and memory array representations
• Design tools and principles
• Qualitative theory of operators with local
connectedness
• New principles of Computational Complexity
The Technology Trend
• ~ 30 nm : over 10 billion transistors
• Severe limits for clock speed and power dissipation
• 25 k mixed mode processors and sensors on a
cellular visual microprocessor at 180 nm
• 45 nm: 70 k logic- and 2k arithmetic- processors
• about 1 million 8 bit microprocessors could be placed
(~5 Billion transistors) ...Why not?
• Wire delay is bigger than gate delay, hence
communication speed is limited (synchrony radius)
Hence
• The precedence of Locality ~ i.e. Cellular, mainly
locally connected architecture is a must for high
computing power
International Technology Roadmap for
Semiconductors, ITRS/2007 /2009
…but it also doesn’t scale terribly well.
The six prototype platforms - related new products
1. CELL Broadband Engine Architecture (CBEA) and high-end supercomputer
• Heterogeneous, multi-core CELL multiprocessor chip
– 241M transistors, 235 mm²
– 200 GFlops (SP) @ 3.2 GHz
– 200 GB/s bus (internal) @ 3.2 GHz
– dual XDR controller (25.6 GB/s)
– two configurable interfaces (76.8 GB/s)
• Power Processor Element (PPE)
– General purpose processor
• Synergistic Processor Element (SPE)
– SIMD-only engine
– fast access to 256 KB local memories
– Globally coherent DMA to transfer data
• Racks, cabinets, PetaFlops systems
2. FPGAs – three cell arrays: logic- and
arithmetic- processors, and memories
XILINX VIRTEX 6 FPGA
• Over 70k Advanced Silicon Modular Logic
Blocks in a cellular array
• Over 2k DSP "slices" in a cellular array
• 36 Kb memory blocks in a cellular array
• Application Specific array components
• Logic and arithmetic processor arrays and
memory arrays are tiled on a 2D plate
• Local directional preference (stream)
Xilinx Virtex-5 FPGA architecture
[Figure: tiled 2D array of CLBs with interleaved columns of DSP slices (18x25 bit multipliers) and memory blocks (BRAM).]
3. Stacked arrays – GPU: Graphics Processing Units
Fermi ( NVIDIA -2010):
- in 32 core units / 16 kB local memory
- 512 cores (16x32)
- IEEE 754-2008 32-bit and 64-bit
- Full 32-bit integer path with 64-bit extensions
- Memory access instructions - transition to 64-bit address
Fermi’s 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
• In each core an integer/logic (64 bit)
unit and an arithmetic unit (32/64
floating point)
• Local and Global memory in a cache
architecture
• Possibility to use several 512 core
chips
• Fully cellular architecture
• Physical Cellular Machine
SENSORY INTEGRATED CHIPS
4. WITH MIXED MODE CELL PROCESSORS – Eye-RIS
5. WITH DIGITAL CELL PROCESSORS – XENON
Eye-RIS v1.2, AnaFocus Ltd.
Many kilo-processor chips using cellular architecture
Eye-RIS camera computer: 25 k processors with optical sensors, 1000 frames per second
AnaFocus Ltd., Seville
XENON V3 (MDA SBIR Phase II Project: ROIC No2, 64x64 prototype)
[Figure: photo of the ASIC processor and block diagram of the XENON V3 cell array architecture. Labels: pixel (sensor) layer; cell (processor) array with communication to neighbors; cell blocks MUX, ADC, processor, memory and communication; µP, program/general memory, scheduler. The integrated InGaAs/InP sensors are not shown in the photo.]
• Processors: 64 (8x8)
• Sensors: 4096 (64x64)
• Processing speed (depends on algorithmic complexity): > 1000 fps
• Topographic sensor-processor arrangement
• Programming: general purpose
EUTECUS, Inc. - proprietary
6. Cellular big cores:
Intel Cloud 48 - 2010

48 IA processors in cellular communication scheme


Integrated Systems
• Bi-i systems with different sensor inputs and
outputs and mobile versions

• Desktop high-end systems

• VISCUBE 3D integrated circuits


Desktop Megaprocessor Computer
Viscube 3D architecture (Eutecus-SZTAKI-Anafocus, ONR Grant)
[Figure: stacked layers of the Viscube 3D chip. Labels: sensor layer (320x240); pixel-parallel mixed-signal processor array layer (160x120) with AD; digital memory array and digital processor array layers (32x24), ROI-parallel processing; data formats 1 bit/pixel (mask) and 8 bit/pixel (gray value).]
Virtual and Physical Cellular Machines and the Design Scenario

The many-core Cellular Wave Computer Components
The architectural element is defined as follows:
• A processor array is placed on a 2D or 3D grid.
• The processors communicate with a delay proportional to the distance.
• There are typically two speed classes: a local one, fo, within a synchrony radius r, and a global one, Fo, via a Manhattan-type bus system. In the case of continuous-time dynamics, f ~ 1/T, T being the dominant time constant.
• The local and global clock speeds are adjusted to keep the power dissipation within a prescribed limit Pd.
The Virtual Cellular architecture of a single array is shown, for a 2D case, in Figure 1 (hiding physical details).
[Figure 1: Virtual Cellular architecture of a single 2D array.]
New principles
• The geometric address of a single processor in an array introduces new algorithmic principles:
– the precedence of geometric locality (due to
the physical and logical locality), i.e. the
cellular character – granular, topographic,
etc.
– cellular wave dynamics as an instruction
– streaming activity pattern in logic
Do NOT parallelize existing single processor
algorithms – invent new cellular algorithms
The physical constraints on
communication speed and power dissipation
The basic constraints are wire delay and power dissipation. A side constraint is the number of communication pins (contacts).
The execution of a logic, arithmetic or symbolic elementary array instruction is defined via r input (u(t)), m output (y(t)) and n state (x(t)) variables (t is the time instant).
This unit is characterized by its
s, surface/area,
e, energy,
f, operating frequency,
w = e f, local power dissipation, and
the signals travel on a wire of length L and width q with speed vq, introducing a delay of D = L / vq.
• Ω cores can be placed on a single Chip, typically in a rectangular grid, with R input and Q output physical connectors, typically at the corners of the Chip; altogether there are K input/output connectors.
• The maximal value of dissipation of the Chip is W.
• The physics is represented by the maximal values of Ω, K, and W (as well as the operating frequency). The operating frequency might be global for the whole Chip, Fo, or local within the Chip, fo, fi (some parts might be switched off, fi = 0).
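The constraints on this slide lend themselves to a back-of-the-envelope check. The following Python sketch is only an illustration under stated assumptions: it takes the wire delay to be length divided by signal speed, treats the synchrony radius as the number of cell pitches a signal can cross within one local clock period, and bounds a common clock frequency by requiring Ω · e · f ≤ W. The function names and the cell pitch parameter are hypothetical and do not come from the talk.

```python
import math


def wire_delay(length, speed):
    """Delay D of a signal on a wire of length L at speed vq (assumed D = L / vq)."""
    return length / speed


def synchrony_radius(local_clock_hz, speed, cell_pitch):
    """Number of cell pitches reachable within one local clock period.

    cell_pitch is a hypothetical parameter: the centre-to-centre distance
    of neighbouring cores, in the same length unit as speed uses.
    """
    reach = speed / local_clock_hz  # distance covered in one clock period
    return math.floor(reach / cell_pitch)


def max_common_clock(cores, energy_per_op, chip_power_limit):
    """Highest frequency f with cores * e * f <= W, assuming w = e * f per core."""
    return chip_power_limit / (cores * energy_per_op)


if __name__ == "__main__":
    # Purely illustrative numbers, not measurements.
    print(wire_delay(length=0.01, speed=1.0e7))
    print(synchrony_radius(local_clock_hz=2.0, speed=8.0, cell_pitch=0.5))
    print(max_common_clock(cores=1000, energy_per_op=1.0e-9, chip_power_limit=10.0))
```

Switching parts of the chip off (fi = 0) would correspond to lowering the effective core count in the last function, which raises the frequency available to the remaining cores.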
Virtual and Physical Cellular
Machines
• The Virtual Cellular Machine is the modern
version of the Virtual Memory
– (i) to hide the physical details of the huge
number of processing elements (cores, cells,
threads, etc.),
– (ii) to provide the framework for designing the
algorithms with elementary array instructions,
and
– (iii) to serve as the starting phase for the
Virtual to Physical Cellular Machine mapping.
Virtual Cellular Machines
• Its computational building blocks are mainly
arrays, composed of operator and memory
elements, acting in space and time
• Heterogeneous and homogeneous
computational blocks
• Building blocks have no physical constraints (size, bandwidth, speed, power dissipation), except the two classes of memory access (local and global)
• A typical logic elementary array instruction (LA) is
a binary logic function on n variables, or a memory
look-up table array
• A typical arithmetic/analog elementary array (AA)
instruction is a multiply and accumulate (add) term
(MAC core) array,
• A symbolic elementary array instruction (SA) might
be a string manipulation core array (e.g. a P system)
• A complex cell based array instruction (XA) hosts cells with all three of the above types of data and instructions
An 8, 16, or 32 bit microprocessor could also be considered an elementary array instruction, with iterative or multi-thread implementation.
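The logic (LA) and arithmetic (AA) elementary array instructions described above can be sketched in a few lines. This is a minimal illustration and not the talk's definition: arrays are plain nested lists, the logic instruction applies one look-up table at every cell, and the arithmetic instruction performs one multiply-and-accumulate per cell. All function names are hypothetical.

```python
def logic_array_instruction(planes, truth_table):
    """LA sketch: apply one n-input Boolean function at every cell.

    planes is a list of n binary 2D arrays of equal size; truth_table has
    2**n entries and is indexed by the n bits read at a cell position,
    with the first plane as the most significant bit.
    """
    rows, cols = len(planes[0]), len(planes[0][0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            index = 0
            for plane in planes:
                index = (index << 1) | plane[i][j]
            out[i][j] = truth_table[index]
    return out


def mac_array_instruction(acc, a, b):
    """AA sketch: acc + a * b at every cell (one MAC core per cell)."""
    return [
        [acc[i][j] + a[i][j] * b[i][j] for j in range(len(acc[0]))]
        for i in range(len(acc))
    ]


if __name__ == "__main__":
    xor_table = [0, 1, 1, 0]
    print(logic_array_instruction([[[0, 1], [1, 0]], [[0, 0], [1, 1]]], xor_table))
    print(mac_array_instruction([[1, 1]], [[2, 3]], [[4, 5]]))
```

On a real cellular array every cell would execute its update concurrently; the nested loops here only emulate that behaviour sequentially.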
A Virtual Cellular Machine
is composed of five types of building blocks:
• (i) cellular processor arrays/layers, CP, with simple (L,
or A, or S type) or complex (X) cells and their local
memories, these are the protagonist building blocks,
• (ii) classical digital stored program computers, P
(microprocessors),
• (iii) multimodal topographic (T) or non-topographic
inputs and outputs, I/O (e.g. scalar, vector, or matrix
signals),
• (iv) global memories of different data types M,
organized in qualitatively different sizes and access
times (e.g. cache memories), and
• (v) interconnection pathways B (busses).
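Simple Example
[Figure: a Virtual Cellular Machine with global system control & memory and I/O; processors P0, P1, ..., Pn; cellular processor arrays CP1/1 ... CP1/g and CP2/1 ... CP2/h with 2D topographic (T) inputs; memories M, M1, M2; buses B0 and b0, b1, b2; local clocks f0 and global clock F0.]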
• Tasks, i.e. the algorithms to be implemented, are defined on the Data/Memory representations of the Virtual Cellular Machine
• Various data representations for
– Topographic (e.g. picture, image, PDE, molecular dynamics) and
– Non-topographic (e.g. Markov processes, algebra, number theory)
problems
Physical Cellular Machines
– the model for the physical implementation
of system architectures
We have three elementary programmable Cell processor
(cell core) types used in array implementations
A) An algorithm with input, state and output vectors having real/arithmetic, binary/digital logic, and symbolic variables (typically implemented via digital circuits).
B) A real valued state and output dynamic system with analog/continuous or
arithmetic variables (typically implemented via mixed mode/analog-and-logic
circuits and digital control processors)
C) A physical dynamic entity with well defined geometric layout and I/O ports
(function in layout) – (typical implementations are CMOS and/or beyond CMOS
nanoscale designs, or optical architectures with programmable control)
Physical Cellular Machines
The above cell processors have their own physical parameters; hence arrays of them, arranged in building blocks similar to those of the Virtual Cellular Machines, are defined by given physical parameters such as
• Size
• Bandwidth/latency
• Operation speed
• Power dissipation
A Physical Cellular Machine might have the same architecture as the Virtual Cellular Machine or a different one.
• The geometry of the architecture reflects the physical layout of the chips or systems. A building block could be implemented as a separate chip or as a part of a chip. This architectural geometry also defines the communication (interaction) speed ranges: physical closeness means higher speed ranges.
• The spatial location or topographic address of each elementary cell processor or cell core, as well as that of each building block within the architecture, and the communication bandwidth play a crucial role.

In addition to the number of processors, these are the most dramatic differences compared to classical computer science.
The Design Scenario
[Figure: design flow linking Task/Problem/Workload, algorithms, Virtual Cellular Machine (data/object topography), COMPILER, Physical Cellular Machine (processor/memory topography) and Physical Implementation.]
Processor and memory/Data array representations
Array signals, variables, memory:
• logic/symbolic array
• logic/symbolic value
• arithmetic/analog array
• arithmetic/analog value
Processors:
• logic/symbolic processor
• arithmetic/analog processor
• arithmetic/analog processor array
• logic/symbolic processor array


[Figure: FPGA organization (1D and 2D), GPU organization and CELL organization, annotated with values of 1c., 10c., ~50c., 100c. and 200c.]
• Many sensors/data/memory units on one Cell Processor
• Separate sensory/memory plane
• Flow architecture (labels 1, 2, K)
Design Tools and Principles
• Homogeneous cellular arrays
• Decomposition in 2D
• Decomposition in 1D
• Composing complex cells for a
homogeneous FPGA architecture
• Self-organizing distributed control
• The CNN Universal Machine as a compiler
• Nontopographic data representation
• Spatial distribution of arithmetic operators
with large number of terms
Equivalent transformations for finite size homogeneous cellular arrays (Á. Zarándy)

Four different methods when the size of the Physical Cellular Machine is smaller than that of the Virtual Cellular Machine
Simple example of a Physical Cellular Machine implemented on a single chip (FPGA)
[Figure: floorplan with six CP2 blocks of sizes {50x20}, {20x30}, {50x40}, {30x80}, {20x50} and {50x20}; arrows mark conditional data paths.]
Chip size: 100x80 processors

Spatial decomposition in 1D
IBM CELL BE based design
Partitioning
• Horizontal stripes
• SPE-SPE communication
– First and last line computation
• Row-wise data alignment
• Communication between 2 SPEs can be carried out by a single DMA command
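The stripe partitioning on this slide can be illustrated with a short sketch. It assumes that each worker (standing in for an SPE) owns a contiguous band of image rows and that neighbouring workers exchange only their boundary rows; with row-wise data alignment each such transfer is one contiguous block, which is consistent with the single DMA command mentioned above. The function names are hypothetical, and no CELL SDK call is used or implied.

```python
def split_into_stripes(rows, workers):
    """Return one (start, stop) row range per worker, as even as possible.

    Assumes rows >= workers so that no stripe is empty.
    """
    base, extra = divmod(rows, workers)
    ranges, start = [], 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def boundary_exchange_plan(ranges):
    """List the boundary-row transfers between neighbouring stripes.

    Each entry is (source_worker, destination_worker, row_index): the last
    row of a stripe goes to the stripe below, and the first row of that
    stripe goes to the one above.
    """
    plan = []
    for w in range(len(ranges) - 1):
        last_row = ranges[w][1] - 1
        first_row_below = ranges[w + 1][0]
        plan.append((w, w + 1, last_row))
        plan.append((w + 1, w, first_row_below))
    return plan


if __name__ == "__main__":
    stripes = split_into_stripes(rows=10, workers=4)
    print(stripes)
    print(boundary_exchange_plan(stripes))
```

For a 3x3 neighbourhood one boundary row per side is enough; a larger neighbourhood radius would need proportionally more rows per transfer.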
A compiler for FPGA implementation (Z.Nagy
and P.Szolgay)
• Compiler: the FALCON architecture via the
CNN Universal Machine.
• Special cases for
– Navier-Stokes 2D, 2½D, incompressible and compressible
– Multilayer retina model
• Introducing special FIFOs to overcome the
bandwidth constraint and the global control:
Cellular Distributed Self-organizing Control
structure ==>> streaming activity pattern
Virtex-5 Logic Cell Element
• Configurable Logic Block (CLB)
– 2 slices: Slice(0) and Slice(1)
[Figure: slice diagram with four LUT6 elements A, B, C and D (inputs A1-A6, B1-B6, C1-C6, D1-D6), each with RAM64/SRL32 modes, carry & control logic chained from Cin to Cout, and FF/latch outputs AQ, BQ, CQ and DQ.]
Arithmetic and Memory Cell Elements
• DSP Slice
• Dual-ported memory
[Figure: DSP slice with a 25 x 18 bit multiplier and an adder (labels A, B, C, R, ADD, P); 36 Kb dual-ported memory array with ports A and B, each with DI, DO, Addr, WE, EN and CLK signals.]
Programmable Interconnect
[Figure: CLB slices (Slice(0), Slice(1)), DSP48 and RAMB36 blocks, each attached to switch matrices of the routing fabric.]
Wire delay distribution (ns)
[Figure: wire delay as a function of distance (hops); both axes run from -5 to 5.]
The CNN State Equation
\[
C \dot{x}_{ij}(t) = -\frac{1}{R_x} x_{ij}(t) + \sum_{kl \in S_r(ij)} A(i,j,k,l)\, y_{kl}(t) + \sum_{kl \in S_r(ij)} B(i,j,k,l)\, u_{kl}(t) + z_{ij}
\]
• Assumed that the input is constant or changing slowly
• Forward Euler discretization
• Feedback equation
\[
x_{ij}[n+1] = \sum_{kl \in S_r(ij)} A(i,j,k,l)\, y_{kl}[n] + g_{ij}
\]
• Feed-forward equation
\[
g_{ij} = \sum_{kl \in S_r(ij)} B(i,j,k,l)\, u_{kl} + h z_{ij}
\]
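The discretized feedback and feed-forward equations can be exercised with a small sketch. Several points are assumptions rather than statements from the slide: h is read as the Euler time step, the templates are taken as space-invariant 3x3 matrices (the slide writes the general A(i,j,k,l)), cells outside the array contribute zero, and the output is the standard CNN saturation y = 0.5(|x + 1| - |x - 1|). All function names are hypothetical.

```python
def saturate(x):
    """Assumed standard CNN output function y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def neighbourhood_sum(template, field, i, j):
    """Sum of template * field over the 3x3 neighbourhood S_1(ij).

    Cells outside the array are treated as 0 (an assumed boundary condition).
    """
    rows, cols = len(field), len(field[0])
    total = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            k, l = i + di, j + dj
            if 0 <= k < rows and 0 <= l < cols:
                total += template[di + 1][dj + 1] * field[k][l]
    return total


def feed_forward(B, u, z, h):
    """g_ij = sum B u_kl + h z_ij; computed once because the input is constant."""
    return [
        [neighbourhood_sum(B, u, i, j) + h * z[i][j] for j in range(len(u[0]))]
        for i in range(len(u))
    ]


def feedback_step(A, x, g):
    """x_ij[n+1] = sum A y_kl[n] + g_ij, with y = saturate(x)."""
    y = [[saturate(v) for v in row] for row in x]
    return [
        [neighbourhood_sum(A, y, i, j) + g[i][j] for j in range(len(x[0]))]
        for i in range(len(x))
    ]


if __name__ == "__main__":
    A = [[0, 0, 0], [0, 0.5, 0], [0, 0, 0]]
    B = [[0, 0, 0], [0, 1.0, 0], [0, 0, 0]]
    g = feed_forward(B, u=[[1.0, -1.0]], z=[[0.0, 0.0]], h=0.1)
    x = [[0.0, 0.0]]
    for _ in range(2):
        x = feedback_step(A, x, g)
    print(x)
```

Because g depends only on the constant input, it is computed once and reused in every iteration, so the iterated step is a single template sum per cell.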
The Falcon processor core
• Memory unit
– Contains a belt of the cell array
• Mixer
– Contains cell values for the next updates
• Template memory
• Arithmetic unit
The complex arithmetic unit for implementing the CELL dynamics
• 9 multipliers (3x3 case)
• Adder tree
– DSP slice built-in adders
• Fully pipelined
• 1 cell update / clock cycle
[Figure: datapath from inputs S1, T1, S2, T2, ..., S9, T9, gij and xij through multipliers, shift registers, adder tree and shift & round to NewState.]
The Falcon array
• Each core processor works on a narrow slice of the image
• Each line computes one CNN iteration
• Results are shifted down to the next row of processors
• Linear speedup
Performance compared to a conventional microprocessor
[Figure: speedup (logarithmic axis, 10 to 10000) versus precision (bit), from 2 bits upward.]
*Estimated data
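Inviscid, Adiabatic, Compressible flows on the CNN Universal Machine
• Euler equations:
\[ \frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0 \]
\[ \frac{\partial (\rho \mathbf{v})}{\partial t} + \nabla \cdot \left( \rho \mathbf{v}\mathbf{v} + \hat{I} p \right) = 0 \]
\[ \frac{\partial E}{\partial t} + \nabla \cdot \left( (E + p)\, \mathbf{v} \right) = 0 \]
• Notations:
– t: time
– ∇: Nabla operator
– ρ: density
– v(u, v): velocity vector field
– p: pressure
– Î: identity matrix
– E: total energy
– γ: ratio of specific heats
• Total energy is defined as:
\[ E = \frac{p}{\gamma - 1} + \frac{1}{2} \rho\, \mathbf{v} \cdot \mathbf{v}, \qquad p = (\gamma - 1) \left( E - \frac{1}{2} \rho\, \mathbf{v} \cdot \mathbf{v} \right) \]
1st and 2nd order solution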
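The relation between total energy and pressure, together with the flux terms inside the divergence operators, can be written out for a single cell. This sketch only evaluates the formulas pointwise and is not the solver used on the CNN Universal Machine. The default γ = 1.4 is an assumed value typical of air, and the function names are hypothetical.

```python
def total_energy(p, rho, u, v, gamma=1.4):
    """E = p / (gamma - 1) + 0.5 * rho * (v . v) for velocity components (u, v)."""
    return p / (gamma - 1.0) + 0.5 * rho * (u * u + v * v)


def pressure(E, rho, u, v, gamma=1.4):
    """p = (gamma - 1) * (E - 0.5 * rho * (v . v)), the inverse relation."""
    return (gamma - 1.0) * (E - 0.5 * rho * (u * u + v * v))


def flux_x(rho, u, v, p, E):
    """x-direction flux of (rho, rho*u, rho*v, E) in the 2D Euler equations."""
    return (rho * u, rho * u * u + p, rho * u * v, (E + p) * u)


if __name__ == "__main__":
    E = total_energy(p=1.0, rho=1.0, u=0.5, v=0.0)
    print(E, pressure(E, rho=1.0, u=0.5, v=0.0))
    print(flux_x(rho=1.0, u=2.0, v=3.0, p=1.0, E=10.0))
```

A first-order finite-volume update would difference such fluxes between neighbouring cells, which is a locally connected operation and therefore maps naturally onto a cellular array.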
NEW algorithms for non-topographic
problems
Particle filters (A.Horváth and M. Rásonyi)
• Topographic representation of the
nontopographic data
• Random selection within topographic
locality
• 1000 trajectories x 100 time instants x a few
particles

• Convergence and stability both on Virtual
and on an 8 bit Physical Cellular Machine
Arithmetic operations on a large number of terms
• Intel Threading Building Blocks (~ log n)
• Spatial-temporal optimization and their shapes on the cellular core (thread) array
• Shape filling optimization (~log n/N)


Shape filling optimization
[Figure: 4 parallel arrays with data arriving in parallel waves; dimensions labelled N and Log N; fill factor: static/dynamic.]
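The logarithmic step count quoted for summing many terms corresponds to a pairwise (tree) reduction. The sketch below assumes one adder per pair at each level and counts the number of parallel levels; it illustrates the general technique only and is neither the Intel Threading Building Blocks API nor the shape filling method of the talk. The function name is hypothetical.

```python
def tree_sum(terms):
    """Pairwise summation; returns (total, number_of_parallel_steps).

    Each pass adds neighbouring pairs, which independent cores could do at
    the same time, so the step count is ceil(log2(n)) for n terms.
    """
    level = list(terms)
    steps = 0
    while len(level) > 1:
        paired = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element is carried to the next level
        level = paired
        steps += 1
    return (level[0] if level else 0), steps


if __name__ == "__main__":
    print(tree_sum(range(1, 9)))
    print(tree_sum(range(1, 6)))
```

In such a tree the number of busy adders halves at every level, so most cores idle in the later steps; this is the kind of under-utilisation that a fill factor measures.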
Algorithm Representations

• UMF DIAGRAMS
• Dynamic Operational Graphs
• Dynamic Acyclic Graphs
• Dynamic process graphs
• Etc.
UMF diagrams
A single cellular array/layer defined by a standard CNN dynamics:
[Figure: UMF symbol of one layer with inputs U, Xo and z, parameters τ and TEMk, and output Y.]
U: input array, Xo: initial state array, z: threshold or mask array, Y: output array,
τ: time constant or clock time, TEMk: local instruction
Algorithmic structures in terms of arrays/layers
• Cascade
• Parallel: a typical parallel structure with two parallel flows, combined in the final layer
[Figure: UMF diagrams of a cascade and of a parallel structure; labels U, U1, U2, X01, X02, X1, X2, z, Y.]
Optimization of algorithms represented by UMF diagrams
Directed acyclic graphs representing UMF diagrams
Genetic Programming with Indexed Memory, GP-IM (G. Pazienza)
Dynamic operational graphs
• Nodes are memories
• Branches are either
– the operators or
– communication paths
Extremal graph problems
Qualitative theory of Operators
with local connectedness
• Fundamental analytic results for 1D Binary CNN (L.O. Chua et al.) – the richness of patterns
• A fundamental Theorem for oscillating CNN
(F. Corinto, T.Roska, and M.Gilli):
Any oscillating fully connected dynamical system array
can be represented by a locally connected one
• Synchronization
A globally connected array of stochastic oscillator cells can also be implemented by locally connected ones (G. Máté, E.Á. Horváth, E. Káptalan, A. Tunyagi, Z. Néda, and T. Roska)
- level of synchrony
- size of local neighborhood
• Logic operators
Any 2D Boolean operator can be represented
by a sequence of locally connected standard
CNN
• Grammar cells
Locally connected simple grammar cell arrays are equivalent to Turing Machines (E. Csuhaj Varju et al.)
• True Randomness in space
True random spatial bit patterns can be
generated with standard CNN Universal
Machine and spatially correlated local noise
(M. Ercsey-Ravasz, Z. Néda, and T. Roska)
New principles of
Computational Complexity
• The algorithmic and physical complexity measures of these many-core architectures will eventually differ from those of single-processor systems, abandoning the asymptotic framework (what is big? 1 million?).
• With Ω cores and M total memory, the finite virtual algorithmic complexity is Ca = Ca(Ω, M)
• The finite physical computational complexity of
an algorithm is measured by a proper mix of
speed, power, area, and accuracy.
A new era in
• Computer Science
• Computer Engineering
begins…
