[Figure: FPGA fabric layout. A regular grid of CLBs interleaved with columns of Memory (BRAM) blocks and DSP Slices (18x25 bit Multiplier).]
3. Stacked arrays – GPU: Graphics
Processing Units
Fermi (NVIDIA, 2010):
- cores grouped in units of 32 / 16 kB local memory
- 512 cores (16x32)
- IEEE 754-2008 32-bit and 64-bit floating point
- Full 32-bit integer path with 64-bit extensions
- Memory access instructions: transition to 64-bit addressing
Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical
rectangular strip that contains an orange portion (scheduler and dispatch), a green
portion (execution units), and light blue portions (register file and L1 cache).
• In each core an integer/logic (64 bit)
unit and an arithmetic unit (32/64
floating point)
• Local and Global memory in a cache
architecture
• Possibility to use several 512 core
chips
• Fully cellular architecture
• Physical Cellular Machine
SENSORY INTEGRATED CHIPS
Processors: 64 (8x8)
New principles
• the geometric address of a single
processor in an array introduces new
algorithmic principles:
– the precedence of geometric locality (due to
the physical and logical locality), i.e. the
cellular character – granular, topographic,
etc.
– cellular wave dynamics as an instruction
– streaming activity pattern in logic
Do NOT parallelize existing single processor
algorithms – invent new cellular algorithms
The physical constraints on
communication speed and power dissipation
The basic constraints are wire delay and power dissipation. A
side constraint is the number of communication pins (contacts).
The execution of a logic, arithmetic or symbolic elementary
array instruction is defined via r input (u(t)), m output (y(t))
and n state (x(t)) variables (t is the time instant).
This unit is characterized by its
s, surface/area,
e, energy,
f, operating frequency,
w = e f, local power dissipation, and
the signals travel on a wire of length L and width q
with speed v_q, introducing a delay of D = L / v_q.
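The two relations above can be checked numerically with a minimal sketch. It assumes the delay formula is read as wire length divided by signal speed, and the function names and numeric values are illustrative only, not taken from the slides.

```python
def local_power(energy_per_op: float, frequency: float) -> float:
    """Local power dissipation w = e * f (joules per operation times Hz gives watts)."""
    return energy_per_op * frequency


def wire_delay(length: float, speed: float) -> float:
    """Wire delay D = L / v_q (metres divided by metres per second gives seconds)."""
    if speed <= 0:
        raise ValueError("signal speed must be positive")
    return length / speed


# Hypothetical values: 1 pJ per operation at 1 GHz, a 1 cm wire at 1e8 m/s.
w = local_power(1e-12, 1e9)
D = wire_delay(0.01, 1e8)
print(f"w = {w:.3e} W, D = {D:.3e} s")
```

Halving the wire length halves D, which is the quantitative form of the locality argument: nearby cores can exchange data within fewer clock periods.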
• Ω cores can be placed on a single Chip, typically in a
rectangular grid, with R input and Q output physical
connectors typically at the corners of the Chip; altogether
there are K input/output connectors.
• The maximal value of dissipation of the Chip is W.
• The physics is represented by the maximal values of Ω,
K, and W (as well as the operating frequency). The
operating frequency might be global for the whole Chip,
Fo, or could be local within the Chip, fo, fi (some parts
might be switched off, fi = 0).
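As a rough illustration of these chip-level limits, the sketch below checks a set of cores against a core budget Ω and a power budget W, with each core running at its own local frequency (zero meaning switched off). The class and function names are hypothetical, and the pin limit K is left out for brevity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Core:
    energy_per_op: float  # e, in joules per operation
    frequency: float      # local f_i in Hz; 0.0 means the core is switched off

    @property
    def power(self) -> float:
        # Local power dissipation w = e * f
        return self.energy_per_op * self.frequency


def chip_power(cores: Sequence[Core]) -> float:
    """Total dissipation as the sum of the local dissipations."""
    return sum(core.power for core in cores)


def fits_chip(cores: Sequence[Core], max_cores: int, max_power: float) -> bool:
    """True if the core count stays within Omega and the dissipation within W."""
    return len(cores) <= max_cores and chip_power(cores) <= max_power


# Hypothetical example: four cores, one of them switched off.
cores = [Core(1e-12, 1e9), Core(1e-12, 1e9), Core(1e-12, 1e9), Core(1e-12, 0.0)]
print(chip_power(cores), fits_chip(cores, max_cores=4, max_power=0.0035))
```

Switching a region off (f_i = 0) removes its contribution to the total, which is how a design can stay under W while still placing Ω cores.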
Virtual and Physical Cellular
Machines
• The Virtual Cellular Machine is the modern
version of the Virtual Memory
– (i) to hide the physical details of the huge
number of processing elements (cores, cells,
threads, etc.),
– (ii) to provide the framework for designing the
algorithms with elementary array instructions,
and
– (iii) to serve as the starting phase for the
Virtual to Physical Cellular Machine mapping.
Virtual Cellular Machines
• Its computational building blocks are mainly
arrays, composed of operator and memory
elements, acting in space and time
• Heterogeneous and homogeneous
computational blocks
• Building blocks have no physical constraints
(size, bandwidth, speed, power dissipation),
except the two classes of memory access (local
and global)
• A typical logic elementary array instruction (LA) is
a binary logic function on n variables, or a memory
look-up table array
• A typical arithmetic/analog elementary array instruction
(AA) is a multiply and accumulate (add) term
(MAC core) array
• A symbolic elementary array instruction (SA) might
be a string manipulation core array (e.g. a P system)
• A complex cell based array instruction (XA) hosts
cells with all three of the above types of data and
instructions
An 8, 16, or 32 bit microprocessor could also be considered
an elementary array instruction with
iterative or multi-thread implementation.
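To make the LA and AA instruction types concrete, here is a minimal NumPy sketch in which every cell of an array applies the same look-up table or the same multiply-accumulate. The function names and the bit-ordering convention are assumptions made for illustration.

```python
import numpy as np


def logic_array_instruction(lut, bits):
    """LA sketch: every cell evaluates the same n-input look-up table.

    bits has shape (rows, cols, n) with 0/1 entries; lut has 2**n entries.
    Bit k of a cell contributes weight 2**k to the table index (an assumed convention).
    """
    bits = np.asarray(bits)
    n = bits.shape[-1]
    weights = 1 << np.arange(n)
    index = (bits * weights).sum(axis=-1)
    return np.asarray(lut)[index]


def mac_array_instruction(acc, a, b):
    """AA sketch: every cell performs one multiply-accumulate, acc + a * b."""
    return np.asarray(acc) + np.asarray(a) * np.asarray(b)


# A 2x2 array of cells, each computing XOR of its two input bits.
xor_lut = [0, 1, 1, 0]
bits = np.array([[[0, 0], [1, 0]],
                 [[0, 1], [1, 1]]])
print(logic_array_instruction(xor_lut, bits))
print(mac_array_instruction(np.ones((2, 2)), 2.0, 3.0))
```

Both functions act on the whole array in one call, which mirrors the idea that the array, not the single cell, is the unit of instruction.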
A Virtual Cellular Machine
is composed of five types of building blocks:
• (i) cellular processor arrays/layers, CP, with simple (L,
or A, or S type) or complex (X) cells and their local
memories, these are the protagonist building blocks,
• (ii) classical digital stored program computers, P
(microprocessors),
• (iii) multimodal topographic (T) or non-topographic
inputs and outputs, I/O (e.g. scalar, vector, or matrix
signals),
• (iv) global memories of different data types M,
organized in qualitatively different sizes and access
times (e.g. cache memories), and
• (v) interconnection pathways B (busses).
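The five building-block types can be written down as a small data model. This is only a sketch: the class names, fields, and the `cell_count` helper are hypothetical and chosen for illustration, not part of any published Virtual Cellular Machine specification.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CellularProcessorArray:          # (i) CP
    shape: Tuple[int, int]
    cell_type: str                     # "L", "A", "S" (simple) or "X" (complex)
    local_memory_words: int = 0

    def __post_init__(self):
        if self.cell_type not in ("L", "A", "S", "X"):
            raise ValueError(f"unknown cell type: {self.cell_type}")


@dataclass
class StoredProgramComputer:           # (ii) P
    word_bits: int = 32


@dataclass
class IOPort:                          # (iii) I/O
    topographic: bool
    kind: str                          # e.g. "scalar", "vector", "matrix"


@dataclass
class GlobalMemory:                    # (iv) M
    size_words: int
    access_time_cycles: int


@dataclass
class Bus:                             # (v) B
    endpoints: Tuple[str, str]


@dataclass
class VirtualCellularMachine:
    arrays: List[CellularProcessorArray] = field(default_factory=list)
    processors: List[StoredProgramComputer] = field(default_factory=list)
    io_ports: List[IOPort] = field(default_factory=list)
    memories: List[GlobalMemory] = field(default_factory=list)
    buses: List[Bus] = field(default_factory=list)

    def cell_count(self) -> int:
        """Total number of cells over all cellular processor arrays."""
        return sum(rows * cols for rows, cols in (a.shape for a in self.arrays))


vcm = VirtualCellularMachine(
    arrays=[CellularProcessorArray((8, 8), "X"), CellularProcessorArray((4, 4), "L")],
    processors=[StoredProgramComputer()],
    io_ports=[IOPort(topographic=True, kind="matrix")],
    memories=[GlobalMemory(size_words=1024, access_time_cycles=10)],
    buses=[Bus(("CP0", "M0"))],
)
print(vcm.cell_count())
```

Note that no size, speed, or power field appears on the virtual blocks beyond the memory access time, in line with the statement that virtual building blocks carry no physical constraints other than the local/global memory distinction.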
Simple Example: Global system control & memory
[Figure: block diagram of a simple Virtual Cellular Machine; labels: I/O, M, M1, M2, f0, F0, B0.]
• Tasks, i.e. the algorithms to be implemented, are
defined on the Data/Memory representations
of the Virtual Cellular Machine
• Various data representations for
– Topographic (e.g. picture, image, PDE,
molecular dynamics) and
– Non-topographic (e.g. Markov processes,
algebra, number theory)
problems
Physical Cellular Machines
– the model for the physical implementation
of system architectures
We have three elementary programmable Cell processor
(cell core) types used in array implementations:
A) An algorithm with input, state and output vectors having real/arithmetic,
binary/digital logic, and symbolic variables (typically implemented via digital
circuits).
B) A real valued state and output dynamic system with analog/continuous or
arithmetic variables (typically implemented via mixed mode/analog-and-logic
circuits and digital control processors).
C) A physical dynamic entity with well defined geometric layout and I/O ports
(function in layout); typical implementations are CMOS and/or beyond CMOS
nanoscale designs, or optical architectures with programmable control.
Physical Cellular Machines
The above cell processors have their own physical
parameters; hence arrays of them, in building
arrangements similar to those of the Virtual Cellular
Machines, are defined by given physical
parameters such as
• Size
• Bandwidth/latency
• Operation speed
• Power dissipation
A Physical Cellular Machine might have the same architecture as the
Virtual Cellular Machine or a different one.
• The geometry of the architecture reflects the
physical layout of the chips or systems. A building
block could be implemented as a separate chip or a
part of a chip. This architectural geometry also defines
the communication (interaction) speed ranges. Hence
physical closeness means higher speed ranges.
• The spatial location or topographic address of each
elementary cell processor or cell core, as well as that
of each building block within the architecture, and the
communication bandwidth play a crucial role.
[Figure: algorithms link the Task/Problem/Workload to the Physical Implementation; processor arrays (logic/symbolic processor) are paired with memory/data array representations (signals, variables, logic/symbolic array); labels on the paths: 1c., 1c., 10c., 1c., 100c.]
CELL organization
Many sensors/data/memory units
on one Cell Processor
Separate sensory/memory plane
Flow architecture (stages 1, 2, ..., K)
Design Tools and Principles
• Homogeneous cellular arrays
• Decomposition in 2D
• Decomposition in 1D
• Composing complex cells for a
homogeneous FPGA architecture
• Self-organizing distributed control
• The CNN Universal Machine as a compiler
• Nontopographic data representation
• Spatial distribution of arithmetic operators
with large number of terms
Equivalent transformations for finite size
homogeneous cellular arrays (Á.Zarándy)
[Figure: a finite cellular array partitioned into CP2 blocks of sizes {50x20}, {20x30}, {50x40}, {30x80}, {20x50}, {50x20}.]
[Figure: FPGA fabric layout. A regular grid of CLBs interleaved with columns of Memory (BRAM) blocks and DSP Slices (18x25 bit Multiplier).]
Virtex-5 Logic Cell Element
• Configurable Logic Block (CLB)
– 2 slices
[Figure: CLB with Slice(0) and Slice(1); each slice contains LUT6 elements A to D (inputs A1-A6, B1-B6, C1-C6; labels LUT6, RAM64, SRL32), Carry & Control logic, FF/Latch outputs AQ, BQ, CQ, and a carry chain from Cin to Cout.]
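Arithmetic and Memory Cell
Elements
• Dual-ported memory
• DSP Slice
[Figure: DSP slice with a 25 x 18 bit multiplier and ADD stage (inputs A, B, C; output P; registers R); dual-ported 36Kb memory array with port A signals DIA, DOA, AddrA, WEA, ENA, CLKA and port B signals DIB, DOB, AddrB, WEB, ENB, CLKB.]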
Programmable Interconnect
[Figure: CLB slices (Slice(0), Slice(1)), DSP48 blocks and a RAMB36 block, each attached to the routing fabric through switch matrices.]
[Figure: Wire delay distribution (ns) over distance (hops); both axes run from -5 to 5.]
The CNN State Equation
$$C \dot{x}_{ij}(t) = -\frac{1}{R_x} x_{ij}(t) + \sum_{kl \in S_r(ij)} A(i,j,k,l)\, y_{kl}(t) + \sum_{kl \in S_r(ij)} B(i,j,k,l)\, u_{kl}(t) + z_{ij}$$
$$g_{ij} = \sum_{kl \in S_r(ij)} B(i,j,k,l)\, u_{kl} + h z_{ij}$$
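A forward-Euler step of the state equation can be sketched in NumPy as follows. Several things here are assumptions rather than content of the slides: space-invariant 3x3 templates (so A(i,j,k,l) depends only on the offset), zero boundary cells, the standard piecewise-linear output y = 0.5(|x+1| - |x-1|), and the step size h. The feed-forward term is computed once per step from the static input, in the spirit of the g term above.

```python
import numpy as np


def cnn_output(x):
    """Assumed standard CNN output nonlinearity: saturates at -1 and +1."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))


def neighborhood_sum(template, values):
    """Sum of template[k, l] * values[i + k - 1, j + l - 1] over the 3x3 neighborhood.

    Cells outside the array are treated as zero (an assumed boundary condition).
    """
    rows, cols = values.shape
    padded = np.pad(values, 1, mode="constant")
    total = np.zeros((rows, cols), dtype=float)
    for k in range(3):
        for l in range(3):
            total += template[k, l] * padded[k:k + rows, l:l + cols]
    return total


def cnn_euler_step(x, u, A, B, z, h=0.1, C=1.0, R=1.0):
    """One forward-Euler step of C dx/dt = -x/R + sum(A*y) + sum(B*u) + z."""
    g = neighborhood_sum(B, u) + z              # input-dependent part, constant if u is static
    dx = (-x / R + neighborhood_sum(A, cnn_output(x)) + g) / C
    return x + h * dx


# Hypothetical templates: self-feedback of 2 in A, no input coupling, zero bias.
A = np.zeros((3, 3)); A[1, 1] = 2.0
B = np.zeros((3, 3))
x0 = np.full((4, 4), 0.5)
x1 = cnn_euler_step(x0, np.zeros((4, 4)), A, B, z=0.0)
print(x1[0, 0])
```

With these templates each cell sees dx = -0.5 + 2 * 0.5 = 0.5, so one step with h = 0.1 moves the state from 0.5 to 0.55, away from zero and toward saturation.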
The Falcon processor core
• Memory unit
– Contains a belt of the
cell array
• Mixer
– Contains cell values
for the next updates
• Template memory
• Arithmetic unit
The complex arithmetic unit for
implementing the CELL dynamics
• Adder tree
– DSP slice built-in adders
• Fully pipelined
• 1 cell update / clock cycle
[Figure: datapath from the shift registers through the adder tree and the Shift & Round stage to NewState.]
The Falcon array
• Each core processor
works on a narrow
slice of the image
• Each line computes
one CNN iteration
• Results are shifted
down to the next row
of processors
• Linear speedup
Performance compared to a
conventional microprocessor
[Figure: speedup (logarithmic axis, 10 to 10000) versus precision (bit).]
*Estimated data
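Inviscid, Adiabatic, Compressible
flows on the CNN Universal Machine
• Euler equations:
$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0$$
$$\frac{\partial (\rho \mathbf{v})}{\partial t} + \nabla \cdot (\rho \mathbf{v}\mathbf{v} + \hat{I} p) = 0$$
$$\frac{\partial E}{\partial t} + \nabla \cdot ((E + p)\mathbf{v}) = 0$$
• Notations:
– t: time
– ∇: Nabla operator
– ρ: density
– v(u, v): velocity vector field
– p: pressure
– Î: identity matrix
– E: total energy
– γ: ratio of specific heats
• Total energy is defined as:
$$E = \frac{p}{\gamma - 1} + \frac{1}{2} \rho\, \mathbf{v} \cdot \mathbf{v}, \qquad p = (\gamma - 1)\left(E - \frac{1}{2} \rho\, \mathbf{v} \cdot \mathbf{v}\right)$$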
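The total-energy definition and its inverse are the per-cell arithmetic a solver would evaluate at every grid point. The sketch below shows the round trip for a two-dimensional velocity (u, v); the function names and the default γ = 1.4 (a common value for air) are assumptions for illustration.

```python
def total_energy(p: float, rho: float, u: float, v: float, gamma: float = 1.4) -> float:
    """E = p / (gamma - 1) + 0.5 * rho * (v . v) for the velocity vector (u, v)."""
    return p / (gamma - 1.0) + 0.5 * rho * (u * u + v * v)


def pressure(E: float, rho: float, u: float, v: float, gamma: float = 1.4) -> float:
    """p = (gamma - 1) * (E - 0.5 * rho * (v . v)), the inverse of total_energy."""
    return (gamma - 1.0) * (E - 0.5 * rho * (u * u + v * v))


# Hypothetical cell state: unit density and pressure, unit velocity along x.
E = total_energy(p=1.0, rho=1.0, u=1.0, v=0.0)
print(E, pressure(E, rho=1.0, u=1.0, v=0.0))
```

Only multiplications and additions on local values are involved, so the conversion maps directly onto a cell-level arithmetic unit.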
1st and 2nd order solution
NEW algorithms for non-topographic
problems
Particle filters (A.Horváth and M. Rásonyi)
• Topographic representation of the
nontopographic data
• Random selection within topographic
locality
• 1000 trajectories x 100 time instants x a few
particles
• Convergence and stability both on Virtual
and on an 8 bit Physical Cellular Machine
Arithmetic operations on a large
number of terms
[Figure: N terms reduced in Log N steps; 4 parallel arrays; fill factor: static/dynamic.]
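The Log N figure corresponds to a pairwise (tree) reduction: N terms are added in about log2 N levels, and all additions within a level are independent and could run on parallel adders. The sketch below is a sequential model of that schedule; the function name and the zero-padding of odd levels are illustrative choices.

```python
def tree_sum(terms):
    """Pairwise reduction; returns (sum, number of levels).

    Each level halves the number of partial sums, so N terms need
    ceil(log2(N)) levels. Additions within one level do not depend
    on each other, which is what allows them to run in parallel.
    """
    level = list(terms)
    if not level:
        return 0, 0
    depth = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels with a neutral element
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


total, depth = tree_sum(range(1, 9))
print(total, depth)
```

For the eight terms 1 to 8 the levels are [3, 7, 11, 15], [10, 26], [36], giving three levels instead of seven sequential additions.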
• UMF DIAGRAMS
• Dynamic Operational Graphs
• Dynamic Acyclic Graphs
• Dynamic process graphs
• Etc.
UMF diagrams
[Figure: UMF diagram notation; labels: inputs U, U1, U2; states Xo, X0, X1, X2; bias z; output Y; time constant τ; template TEMk.]
Optimization of algorithms
represented by UMF diagrams
begins….