(Re)configurable Computing
Case Studies
JR.S00 2
Applicability of Configurable
Processing
• Stand-alone computational engines
– E.g. PADDI, UCLA Mojave
• Adding programmable I/O to embedded
processors
– E.g. Napa 2000
• Augmenting the instruction set of processors
– E.g. GARP, RAW
• Providing programmable accelerator co-processors to embedded microprocessors and DSPs
– E.g. Chameleon, Pleiades, Morphics
Stand-Alone Computational Engines
• UCLA Mojave System
– Template Matching for Automatic Target Recognition
[Figure: UCLA Mojave System with I960 Board]
As Programmable Interface and I/O Processor
• Logic used in place of
– ASIC environment customization
– external FPGA/PLD devices
• Example
– bus protocols
– peripherals
– sensors, actuators
• Case for:
– Always have some system adaptation to do (varying glue logic requirements)
– Modern chips have capacity to hold processor + glue logic
– Reduces part count
– Value added must now be accommodated on chip (formerly board level)
Example:
Interface/Peripherals
• Triscend E5
Model: IO Processor
• Array dedicated to servicing IO channel
– sensor, lan, wan, peripheral
• Provides
– protocol handling
– stream computation
» compression, encrypt
• Looks like IO peripheral to processor
• Case for:
– many protocols, services
– only need few at a time
– dedicate attention, offload processor
IO Processing
NAPA 1000 Block Diagram
[Figure: block diagram with the following labeled blocks]
– TBT: ToggleBus™ Transceiver
– System Port
– CR32: CompactRISC™ 32 Bit Processor
– RPC: Reconfigurable Pipeline Cntr
– ALP: Adaptive Logic Processor
– BIU: Bus Interface Unit
– PMA: Pipeline Memory Array
– CIO: Configurable I/O
– External Memory Interface
– CR32 Peripheral Devices
– SMA: Scratchpad Memory Array
Source: National Semiconductor
NAPA 1000 as IO Processor
[Figure labels: SYSTEM HOST; System Port; Application Specific; Memory Interface; ROM & DRAM]
Source: National Semiconductor
I/O Stream Processor
Instruction Augmentation
• Idea:
– provide a way to augment the processor’s instruction set
– with operations needed by a particular application
– close semantic gap / avoid mismatch
• What’s required:
– some way to fit augmented instructions into stream
– execution engine for augmented instructions
» if programmable, has own instructions
– interconnect to augmented instructions
“First” Instruction
Augmentation
• PRISM
– Processor Reconfiguration through Instruction Set
Metamorphosis
• PRISM-I
– 68010 (10MHz) + XC3090
– can reconfigure FPGA in one second!
– 50-75 clocks for operations
[Athanas+Silverman: Brown]
PRISM-I Results
Raw kernel speedups
PRISM
• FPGA on bus
• access as memory mapped peripheral
• explicit context management
• some software discipline for use
• …not much of an “architecture” presented to
user
[Razdan+Smith: Harvard]
PRISC
• Takes next step
– what would it look like if we put it on chip?
– how to integrate it into the processor ISA?
• Architecture:
– couple into register file as “superscalar” functional unit
– flow-through array (no state)
PRISC
• ISA Integration
– add expfu instruction
– 11 bit address space for user-defined expfu instructions
– fault on pfu instruction mismatch
» trap code to service instruction miss
– all operations occur in one clock cycle
– easily works with processor context switch
» no state + fault on mismatch pfu instr
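The ISA integration above can be sketched as a small software model. This is only an illustration under stated assumptions: the names `pfu_t`, `pfu_library`, `pfu_register`, `pfu_miss_trap`, and the two example operations are hypothetical and do not come from the PRISC work; only the `expfu` name, the 11-bit function space, and the fault-on-mismatch behavior are taken from the slide.

```c
#include <stddef.h>
#include <stdint.h>

#define PFU_ID_BITS  11                      /* 11-bit expfu function space (from the slide) */
#define PFU_ID_COUNT (1u << PFU_ID_BITS)

/* A PFU function is stateless and flow-through: two register inputs, one result. */
typedef uint32_t (*pfu_fn)(uint32_t rs, uint32_t rt);

/* Hypothetical library of compiled PFU configurations, indexed by function id. */
static pfu_fn pfu_library[PFU_ID_COUNT];

typedef struct {
    int      valid;      /* is any configuration loaded? */
    uint32_t loaded_id;  /* id of the configuration currently in the array */
    pfu_fn   fn;         /* stands in for the programmed logic */
    unsigned misses;     /* number of reconfiguration traps taken */
} pfu_t;

void pfu_register(uint32_t id, pfu_fn fn)
{
    pfu_library[id & (PFU_ID_COUNT - 1u)] = fn;
}

/* Trap code: service the instruction miss by loading the requested function. */
void pfu_miss_trap(pfu_t *pfu, uint32_t id)
{
    pfu->fn = pfu_library[id];
    pfu->loaded_id = id;
    pfu->valid = 1;
    pfu->misses++;
}

/* expfu: compare the requested id with the loaded one, trap on mismatch, evaluate. */
uint32_t expfu(pfu_t *pfu, uint32_t id, uint32_t rs, uint32_t rt)
{
    id &= PFU_ID_COUNT - 1u;
    if (!pfu->valid || pfu->loaded_id != id)
        pfu_miss_trap(pfu, id);
    return pfu->fn ? pfu->fn(rs, rt) : 0u;   /* unregistered id yields 0 in this sketch */
}

/* Two example user-defined operations (illustrative only). */
uint32_t op_xor(uint32_t a, uint32_t b)    { return a ^ b; }
uint32_t op_andnot(uint32_t a, uint32_t b) { return a & ~b; }
```

Because the modeled PFU holds no data state, a context switch needs no save or restore in this sketch: the next process's first `expfu` with a different id simply takes the miss trap.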
PRISC Results
• All compiled
• working from MIPS
binary
• <200 4LUTs ?
– 64x3
• 200MHz MIPS base
[Razdan/Micro27]
Chimaera
[Hauck: Northwestern]
Chimaera Architecture
• “Live” copy of register
file values feed into
array
• Each row of array may
compute from register
values or intermediates
(other rows)
• Tag on array to indicate
RFUOP
Results (speedup)
• Compress 1.11
• Eqntott 1.8
• Life 2.06 (160 with hand parallelization)
[Hauck/FCCM97]
Instruction Augmentation
GARP
Identified Problems:
• Single-cycle flow-through
– not most promising usage style
• Moving data through Register File to/from
array
– can present a limitation
» bottleneck to achieving high computation rate
[Hauser+Wawrzynek: UCB]
GARP
• Integrate as coprocessor
– similar bandwidth to processor as FU
– own access to memory
• Support multi-cycle operation
– allow state
– cycle counter to track operation
• Fast operation selection
– cache for configurations
– dense encodings, wide path to memory
GARP
• ISA -- coprocessor operations
– issue gaconfig to make a particular configuration resident
(may be active or cached)
– explicitly move data to/from array
» 2 writes, 1 read
– processor suspend during coprocessor operation
» cycle count tracks operation
– array may directly access memory
» processor and array share memory space
• cache/mmu keeps consistency
» can exploit streaming data operations
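A minimal software model of the coprocessor-style interface listed above, assuming a four-register array and a per-cycle step function. The instruction names `gaconfig`, `mtga`, and `mfga` come from the slides; the `garp_t` layout, the argument lists, and the shift-and-add example configuration are assumptions made for illustration and are not the real GARP encoding.

```c
#include <stdint.h>

#define GARP_REGS 4                       /* assumed number of array registers */

/* A configuration is modeled as the work done in one array clock cycle. */
typedef void (*garp_config)(uint32_t reg[GARP_REGS]);

typedef struct {
    garp_config active;          /* configuration currently resident */
    uint32_t    reg[GARP_REGS];  /* array state visible to mtga/mfga */
    unsigned    cycles_left;     /* cycle counter tracking the operation */
    unsigned    loads;           /* times a configuration had to be loaded */
} garp_t;

/* gaconfig: make a configuration resident; a no-op if it already is. */
void gaconfig(garp_t *g, garp_config cfg)
{
    if (g->active != cfg) {
        g->active = cfg;
        g->loads++;
    }
}

/* mtga: move a value to the array; a nonzero cycle count starts the operation. */
void mtga(garp_t *g, unsigned r, uint32_t value, unsigned cycles)
{
    g->reg[r % GARP_REGS] = value;
    if (cycles)
        g->cycles_left = cycles;
}

/* mfga: the processor is suspended until the cycle counter reaches zero. */
uint32_t mfga(garp_t *g, unsigned r)
{
    while (g->cycles_left > 0 && g->active) {
        g->active(g->reg);
        g->cycles_left--;
    }
    return g->reg[r % GARP_REGS];
}

/* Example multi-cycle configuration with state: reg[2] += reg[0] * reg[1]. */
void cfg_shift_add_mul(uint32_t reg[GARP_REGS])
{
    if (reg[1] & 1u)
        reg[2] += reg[0];
    reg[0] <<= 1;
    reg[1] >>= 1;
}

uint32_t garp_multiply(garp_t *g, uint32_t a, uint32_t b)
{
    gaconfig(g, cfg_shift_add_mul);
    mtga(g, 2, 0u, 0);      /* clear the accumulator held in the array */
    mtga(g, 0, a, 0);
    mtga(g, 1, b, 32);      /* start a 32-cycle operation */
    return mfga(g, 2);
}
```

Unlike the single-cycle flow-through style, the array here keeps state across cycles, and a second call with the same configuration skips the load, which mirrors the role of the configuration cache.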
GARP
• Processor Instructions
GARP Array
• Row oriented logic
– denser for datapath
operations
• Dedicated path for
– processor/memory data
• Processor does not have to be involved in array⇔memory path
GARP Results
• General results
– 10-20x on stream, feed-forward operation
– 2-3x when data-dependencies limit pipelining
[Hauser+Wawrzynek/FCCM97]
PRISC/Chimaera … GARP
• PRISC/Chimaera
– basic op is single cycle: expfu (rfuop)
– no state
– could conceivably have multiple PFUs?
– Discover parallelism => run in parallel?
– Can’t run deep pipelines
• GARP
– basic op is multicycle
» gaconfig
» mtga
» mfga
– can have state/deep pipelining
– ? Multiple arrays viable?
– Identify mtga/mfga w/ corr gaconfig?
Common Theme
Example: Chameleon Reconfigurable Co-Processor (network, communication applications)
[Figure labels: Reconfigurable Logic; Local Store Memory (LSM); Configuration bit stream; Array of 32-bit Data Path Operators & Control Logic; Multiple banks of I/O]
Reconfigurable Processor Tools Flow
[Figure labels: Customer Application / IP (C code); RTL HDL; C Model; C Debugger; Simulator; Development Board]
Heterogeneous Reconfiguration
[Figure: reconfigurable fabrics at different granularities. Labels: CLB arrays; datapath elements (mux, reg0, reg1, adder, buffer); AddrGen, Memory, and MAC units; Data Memory, Program Memory, Instruction Decoder, Datapath & Controller]
[Figure labels: Control Processor; Arithmetic Processor (three instances); Dedicated Arithmetic; Configurable Datapath; Configurable Logic; Communication Network; Network Interface; Configuration]
Convolution
[Figure labels: Control Processor; AddressGen and Memory (two of each); MAC (two); AddrGen; L, G, C]
for(i=1;i<=L;i++)
  for(k=i;k<=L;k++)
    phi[i][k] = phi[i-1][k-1]
              + in[NP-i]*in[NP-k]
              - in[NA-1-i]*in[NA-1-k];
[Figure: mapping of the loop onto satellites. Labels: AddrGen; MEM: in; MPY (two); ALU; AddrGen; MEM: phi]
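A runnable version of the loop above, as a sketch. The function name `update_phi`, the bound `MAX_L`, the `double` element type, and passing `L`, `NP`, and `NA` as arguments are assumptions made for illustration; the slide does not define them. Row 0 of `phi` must be filled in by the caller, since the recurrence reads `phi[i-1][k-1]`.

```c
#define MAX_L 8   /* hypothetical upper bound on L, chosen for this sketch */

/*
 * Recursive update of the upper triangle of phi:
 * each element reuses its diagonal neighbor phi[i-1][k-1],
 * adds one product of input samples and subtracts another.
 * Requires NP - L >= 0 and NA - 1 - L >= 0 so that all indices into in[] are valid.
 */
void update_phi(int L, double phi[][MAX_L + 1], const double in[], int NP, int NA)
{
    for (int i = 1; i <= L; i++) {
        for (int k = i; k <= L; k++) {
            phi[i][k] = phi[i - 1][k - 1]
                      + in[NP - i] * in[NP - k]
                      - in[NA - 1 - i] * in[NA - 1 - k];
        }
    }
}
```

Each inner iteration needs two multiplies, one add, and one subtract, plus two address streams (one into `in`, one into `phi`), which is consistent with the two MPY units, the ALU, and the two address generators named in the figure labels.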
Inter-Satellite
Communication
• Data-driven execution
– A satellite processor is enabled only when input data is ready
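The data-driven rule can be sketched as follows. This is a simplified model, not the Pleiades handshake: the one-place `token_t` channels, the two-input `satellite_t`, and the choice to also wait for a free output slot are assumptions introduced here.

```c
/* One-place channel between satellites: a value plus a "data ready" flag. */
typedef struct {
    int value;
    int full;
} token_t;

/* A two-input satellite processor (for example MPY or an ALU). */
typedef struct {
    token_t *in_a;
    token_t *in_b;
    token_t *out;
    int (*op)(int, int);
    unsigned firings;     /* how many times the satellite was enabled */
} satellite_t;

/* Producer side: write a token if the channel is free. Returns 1 on success. */
int channel_put(token_t *c, int v)
{
    if (c->full)
        return 0;
    c->value = v;
    c->full = 1;
    return 1;
}

/*
 * The satellite is enabled only when both inputs hold data.
 * Assumption of this sketch: it also waits until its output slot is free.
 * Returns 1 if it fired, 0 if it stayed idle.
 */
int satellite_step(satellite_t *s)
{
    if (!s->in_a->full || !s->in_b->full || s->out->full)
        return 0;
    s->out->value = s->op(s->in_a->value, s->in_b->value);
    s->out->full = 1;
    s->in_a->full = 0;    /* consume both input tokens */
    s->in_b->full = 0;
    s->firings++;
    return 1;
}

int op_mpy(int a, int b) { return a * b; }
```

A satellite that never receives data never fires, so in this model idle units do no work without any central controller deciding that.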
[Figure: satellite processors exchanging data. Labels: Embedded processor; AddrGen; Memory; MPY; MAC; links annotated 1 and n]
Example: 16 point Complex Radix-2 FFT (Final Stage)
[Figure: comparison of StrongARM, TMS320C2xx, TMS320LC54x, and Pleiades. Axes: Normalized Energy/stage [nJ]; Normalized Delay/stage [s]; Normalized Energy*Delay/stage [Js*e-14]. Data labels visible in the original: 3970, 137, 13]
Adaptive Multi-User Detector for W-CDMA
Pilot Correlator Unit Using LMS
[Figure: dataflow graph with a Filter section and a Coefficient Update section. Labels: AG; MEM alt; MAC; MUL; ADD; SUB; ACC; signals s_r, s_i, y_r, y_i, Zmf_r, Zmf_i]
Architecture Comparison
LMS Correlator at 1.67 MSymbols/s Data Rate
Complexity: 300 Mmult/sec and 357 Macc/sec
16 Mmacs/mW!
Note: TMS implementation requires 36 parallel processors to meet data rate; validity questionable
Maia: Reconfigurable Baseband Processor for Wireless
[Figure: Multi-Bus interconnect between clusters. Labels: cluster; N Inputs; B Buses; M Outputs]
Software Methodology Flow
[Figure: design flow. Labels: Algorithms (C++); Kernel Detection; µproc & Accelerator PDA Models; Behavioral; Estimation/Exploration; SUIF + C-IF; Power & Timing Estimation of Various Kernel Implementations; Premapped Kernels; Partitioning; C++ Module Libraries; Software Compilation; Reconfig. Hardware Mapping; Interface Code Generation]
Hardware-Software Exploration
Macromodel call
Implementation Fabrics for
Protocols
A protocol = Extended FSM
[Figure: extended FSM example. State and transition labels: idle; RACH req; RACH akn; slotset; read; write; update. Datapath labels: Memory; Slot_Set_Tbl (2x16); BUF; addr; R_ENA; W_ENA]
The Software-Defined Radio
[Figure labels: FPGA; Embedded uP; Dedicated FSM; Dedicated DSP; Reconfigurable DataPath]
An Industrial Example:
Basestation for Cellular
Wireless
[Figure: basestation front end for the 800 MHz (blocks A, B) and 1900 MHz (blocks A, D, B, E, F, C) bands. Labels: Antenna System; RF/IF Tuner; Block-Spectrum A/D; multiple sectors; multi-band; T1/E1]
Modular/Parameterizable
• per carrier
• per TDMA time-slot
• per CDMA code
BTS Signal Processing
Platforms
[Figure labels: A/D Conversion; Hardwired ASICs; Comm Agent; N DSP/CPUs (Standard A, F1; Standard A, F2; Standard B, F1; ...); Comm Agent; CPU]
Basestation of the Next
Generation
[Figure labels: Wideband RF; Data; 10/100 or Gbit; ATM Networks]
Coexistence of Multiple
Standards In Product
Supply Chain
2G
– GSM
– DCS1800
– PCS1900
– IS-95
– IS-54B
– IS-136
– PDC
2.5G
– GPRS
– HSCSD
– IS-95 MDR
– IS-95 HDR
– IS-136 HS
3G
– ETSI UTRA
– ARIB W-CDMA
– TIA cdma2000
– W-TDMA (UWC)
CIRCUIT to PACKET, VOICE to DATA, NARROWBAND to WIDEBAND
Wideband CDMA: MOPS?
No, GOPS!
Single 384 kbps ARIB W-CDMA Channel
Function                    MIPS
Digital RRC Channel         3600
Searcher                    2100
RAKE                        1050
Maximal Ratio Combiner        24
Channel Estimator             12
AGC, AFC                      10
Deinterleaver                 15
Turbo Coder                   90
TOTAL                       6901
Source: J. Kohnen et al. “Baseband Solution for WCDMA,” Proc. IEEE Communication Theory Workshop, May 1999, Aptos, USA.
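The total can be re-derived from the per-function figures. The short sketch below only sums the numbers listed in the table; the names `load_t`, `wcdma_load`, `total_mips`, and `total_gops` are hypothetical, and treating 1000 MIPS as 1 GOPS follows the slide title rather than any stated definition.

```c
#include <stddef.h>

typedef struct {
    const char *function;
    unsigned    mips;
} load_t;

/* Per-function load for a single 384 kbps ARIB W-CDMA channel (from the table). */
static const load_t wcdma_load[] = {
    { "Digital RRC Channel",    3600 },
    { "Searcher",               2100 },
    { "RAKE",                   1050 },
    { "Maximal Ratio Combiner",   24 },
    { "Channel Estimator",        12 },
    { "AGC, AFC",                 10 },
    { "Deinterleaver",            15 },
    { "Turbo Coder",              90 },
};

/* Sum of all table entries, in MIPS. */
unsigned total_mips(void)
{
    unsigned sum = 0;
    for (size_t i = 0; i < sizeof wcdma_load / sizeof wcdma_load[0]; i++)
        sum += wcdma_load[i].mips;
    return sum;
}

/* Same total expressed in GOPS, assuming one instruction counts as one operation. */
double total_gops(void)
{
    return total_mips() / 1000.0;
}
```

The first three entries (channel filtering, searcher, RAKE) account for 6750 of the 6901 MIPS, which is why the title speaks of GOPS rather than MOPS.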
HW Multistandard
Solutions
The common approach to hardware design involves multiple ASICs to support each standard.
[Figure: one hardwired chain per standard. Labels: DSP; Control Processor; Digital Hardwired ASIC, IF, RF (three chains); Programmable; Unique Combinations; Analog]
…but this is simply not a good implementation for base stations:
- Unacceptably high cost per channel
- Unacceptably large power per channel
This is definitely not a viable implementation for terminals
The Law of Diminishing
Returns
• More transistors are being thrown at improving
general-purpose CPU and DSP performance
Embedded Logic
Evolution
– Increasing fixed-function hardwired content of systems
– Core+Logic becomes de-facto design architecture
– Move to deep sub-micron technology
» rapidly increasing product integration cycles
» increasingly constrained design resources
» sharp increases in cost of “trying out” an idea (NRE)
– Design methodologies optimized for random logic, homogeneous architectures,
and lower speed signal processing (i.e. control-flow dominated systems)
– Verification issues dominate design cycle time
FPGA the Solution?
Cellular Handset Using
Current FPGA
Some Interesting
Observations
Don’t use more transistors to stretch general-purpose performance,
whether for CPUs, DSPs, or
reconfigurable logic.
View The “Reconfigurability”
Problem From The System Level
What Are the Application-Specific Performance Needs?
Successfully Using
Reconfigurability
Application-Specific Leverage
Application-Specific MOPS in
Digital Communications
[Figure labels: RF/IF; Digital Downconversion and Channelization; TDMA Wideband Signal Processing Engine; CDMA Wideband Signal Processing Engine; Wideband Channel Decoder Engine; Programmable DSP; Microprocessor]
Morphics’ DRL
Architecture
Heterogeneous Multiprocessing Engine Using Application-Specific Reconfigurable Logic
[Figure: DATAFLOW between kernels. Labels: input; output; Clk; Enable]
DRL Kernels
[Figure labels: DATA MEMORY; PARAMETERIZABLE DATA SEQUENCER; CONFIGURABLE ALU]
Mapping Software to Target
Architecture
[Figure: two kernels, each with DATA MEMORY, PARAMETERIZABLE DATA SEQUENCER, and CONFIGURABLE ALU]
Programmer’s Guide
Document provided with each
/* morphics soft API usage examples: */
/* search a pilot set in search set maintenance mode. */
Key Pieces of Design
Methodology
System-level Profiling
• Analyze sequences of operations (arithmetic, memory access, etc)
• Analyze communication bottlenecks
• Key flexible parameters (algorithm vs. architecture parameters)
Architecture-level Profiling
• ALU/kernel definition (sequences of operators)
• Memory profile
• Type of configurability required for flexibility
• Macro-sequencer development
Implementation
• SW: programmer’s model developed at architecture specification stage
• SW: API proven out via behavioral models & demonstrator hardware
• VLSI: focus on regular predictable timing and routability
• VLSI: embedded reconfigurability in an ASIC flow
CDMA Modem Analysis
MOPS Breakdown Analysis
Prototype Demonstrator
RISC
Microprocessor
Configurable Kernel(s)
Data Router
Summary
• Configurable computing is finding its way into the
embedded processor space
• Best suited (so far) for
– Flexible I/O and Interface functionality
– providing task-level acceleration of “parameterizable” functions
• Improvement of IP seems limited
• Software flow still subject to improvement
• Might become more interesting with the emergence of
low-current devices (TFT, organic transistors,
molecular computing)