
Lecture 14:

(Re)configurable Computing
Case Studies

Prof. Jan Rabaey


Computer Science 252, Spring 2000

The contributions of Andre Dehon (Caltech) and Ravi Subramanian (Morphics)
to this slide set are gratefully acknowledged
JR.S00 1
Summary of Previous
Class
• Configurable Computing using “programming
in space” versus “programming in time” for
traditional instruction-set computers
• Key design choices
– Computational units and their granularity
– Interconnect Network
– (Re)configuration time and frequency
• Next class: Some practical examples of
reconfigurable computers

JR.S00 2
Applicability of Configurable
Processing
• Stand-alone computational engines
– E.g. PADDI, UCLA Mojave
• Adding programmable I/O to embedded
processors
– E.g. Napa 2000
• Augmenting the instruction set of processors
– E.g. GARP, RAW
• Providing programmable accelerator co-processors
to embedded microprocessors and DSPs
– E.g. Chameleon, Pleiades, Morphics

JR.S00 3
Stand-Alone Computational Engines
Template Matching for Automatic Target Recognition
[Figure: UCLA Mojave System, with I960 Board]

JR.S00 4
As Programmable Interface and I/O Processor
• Logic used in place of
– ASIC environment customization
– external FPGA/PLD devices
• Example
– bus protocols
– peripherals
– sensors, actuators
• Case for:
– Always have some system adaptation to do (varying glue logic requirements)
– Modern chips have capacity to hold processor + glue logic
– Reduces part count
– Value added must now be accommodated on chip (formerly board level)

JR.S00 5
Example:
Interface/Peripherals
• Triscend E5

JR.S00 6
Model: IO Processor
• Array dedicated to servicing IO channel
– sensor, lan, wan, peripheral
• Provides
– protocol handling
– stream computation
» compression, encrypt
• Looks like IO peripheral to processor
• Case for:
– many protocols, services
– only need few at a time
– dedicate attention, offload processor

JR.S00 7
IO Processing

• Single threaded processor
– cannot continuously monitor multiple data pipes (src, sink)
– need some minimal, local control to handle events
– for performance or real-time guarantees, may need to service event rapidly
– E.g. checksum (decode) and acknowledge packet

JR.S00 8
NAPA 1000 Block Diagram
[Figure: block diagram containing the following blocks]
• TBT: ToggleBus™ Transceiver
• System Port
• CR32: CompactRISC™ 32 Bit Processor
• RPC: Reconfigurable Pipeline Cntr
• ALP: Adaptive Logic Processor
• BIU: Bus Interface Unit
• PMA: Pipeline Memory Array
• CIO: Configurable I/O
• External Memory Interface
• CR32 Peripheral Devices
• SMA: Scratchpad Memory Array
Source: National Semiconductor

JR.S00 9
NAPA 1000 as IO Processor
[Figure: a SYSTEM HOST attaches to the NAPA1000 System Port; the CIO attaches to application-specific sensors, actuators, or other circuits; the Memory Interface attaches to ROM & DRAM]
Source: National Semiconductor

JR.S00 10
I/O Stream Processor

Philips Nexperia NX-2700: a programmable HDTV media processor
Combines Trimedia VLIW with configurable media co-processors

JR.S00 11
Model: Instruction
Augmentation
• Observation: Instruction Bandwidth
– Processor can only describe a small number of basic
computations in a cycle
» I bits → $2^{I}$ operations
– This is a small fraction of the operations one could do even in
terms of $w \otimes w \to w$ Ops
» $w\,2^{2^{2w}}$ operations
– Processor could have to issue $w\,2^{(2^{2w} - I)}$ operations just to
describe some computations
– An a priori selected base set of functions could be very bad for
some applications
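As a worked instance of the counts above, here is a small sketch that takes the slide's expressions at face value. The values $w = 2$ and $I = 8$ are chosen only for illustration and do not come from the slides.

```latex
\[
\underbrace{2^{I}}_{\text{ops nameable with } I \text{ bits}} = 2^{8} = 256,
\qquad
\underbrace{w\,2^{2^{2w}}}_{\text{slide's count of } w \otimes w \to w \text{ ops}}
  = 2 \cdot 2^{2^{4}} = 2 \cdot 2^{16} = 131072 .
\]
\[
\frac{w\,2^{2^{2w}}}{2^{I}} \;=\; w\,2^{\,2^{2w} - I} \;=\; 2 \cdot 2^{16-8} \;=\; 512 .
\]
```

Even at a 2-bit word width, an 8-bit opcode field names 1 in 512 of the operations counted this way. If every function from $2w$ input bits to $w$ output bits is counted exactly, the total is $2^{\,w\,2^{2w}}$, which is larger still, so the gap only widens.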

JR.S00 12
Instruction Augmentation

• Idea:
– provide a way to augment the processor’s instruction set
– with operations needed by a particular application
– close semantic gap / avoid mismatch
• What’s required:
– some way to fit augmented instructions into stream
– execution engine for augmented instructions
» if programmable, has own instructions
– interconnect to augmented instructions

JR.S00 13
“First” Instruction
Augmentation
• PRISM
– Processor Reconfiguration through Instruction Set
Metamorphosis
• PRISM-I
– 68010 (10MHz) + XC3090
– can reconfigure FPGA in one second!
– 50-75 clocks for operations

JR.S00 14
[Athanas+Silverman: Brown]
PRISM-1 Results

JR.S00 15
Raw kernel speedups
PRISM

• FPGA on bus
• access as memory mapped peripheral
• explicit context management
• some software discipline for use
• …not much of an “architecture” presented to
user

JR.S00 16
[Razdan+Smith: Harvard]

PRISC
• Takes next step
– what would it look like if we put it on chip?
– how to integrate it into the processor ISA?
• Architecture:
– couple into register file as “superscalar” functional unit
– flow-through array (no state)

JR.S00 17
PRISC

• ISA Integration
– add expfu instruction
– 11 bit address space for user-defined expfu instructions
– fault on pfu instruction mismatch
» trap code to service instruction miss
– all operations complete in one clock cycle
– easily works with processor context switch
» no state + fault on mismatch pfu instr
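The ISA integration above can be sketched as a small software model. This is only an illustration under stated assumptions: the names `pfu_t`, `pfu_register`, `expfu` (as a C function), `fn_avg` and `fn_andnot` are hypothetical, and the function table merely stands in for configurations held in memory. Only the 11-bit ID space, the fault on mismatch, and the stateless PFU come from the slide.

```c
#include <stdint.h>
#include <stddef.h>

#define PFU_ID_BITS  11                    /* 11-bit expfu address space (from the slide) */
#define PFU_ID_COUNT (1u << PFU_ID_BITS)   /* 2048 possible user-defined functions */
#define PFU_NONE     0xFFFFu               /* hypothetical marker: nothing loaded yet */

/* A PFU function is flow-through: two register inputs, one result, no state. */
typedef uint32_t (*pfu_fn)(uint32_t a, uint32_t b);

typedef struct {
    pfu_fn   table[PFU_ID_COUNT];  /* stands in for configurations kept in memory */
    uint16_t loaded_id;            /* ID of the configuration now in the PFU */
    pfu_fn   loaded_fn;            /* logic the PFU is currently programmed with */
    unsigned miss_count;           /* mismatch faults taken so far */
} pfu_t;

void pfu_init(pfu_t *p)
{
    for (size_t i = 0; i < PFU_ID_COUNT; i++)
        p->table[i] = NULL;
    p->loaded_id  = PFU_NONE;
    p->loaded_fn  = NULL;
    p->miss_count = 0;
}

/* Make a user-defined function available under an 11-bit ID. */
void pfu_register(pfu_t *p, uint16_t id, pfu_fn fn)
{
    p->table[id & (PFU_ID_COUNT - 1)] = fn;
}

/* Model of one expfu instruction.
 * Returns 0 on success, -1 if no configuration exists for this ID. */
int expfu(pfu_t *p, uint16_t id, uint32_t a, uint32_t b, uint32_t *result)
{
    id = (uint16_t)(id & (PFU_ID_COUNT - 1));
    if (p->loaded_id != id) {          /* mismatch: fault to the trap handler */
        if (p->table[id] == NULL)
            return -1;                 /* nothing to load for this ID */
        p->loaded_fn = p->table[id];   /* trap code reloads the PFU */
        p->loaded_id = id;
        p->miss_count++;
    }
    *result = p->loaded_fn(a, b);      /* flow-through evaluation */
    return 0;
}

/* Two example user-defined functions (hypothetical). */
uint32_t fn_avg(uint32_t a, uint32_t b)    { return (a & b) + ((a ^ b) >> 1); }
uint32_t fn_andnot(uint32_t a, uint32_t b) { return a & ~b; }
```

Because the modelled PFU carries no state beyond the loaded ID, a context switch needs no save or restore: the next `expfu` with a different ID simply takes one more fault and reloads.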

JR.S00 18
PRISC Results
• All compiled
• working from MIPS
binary
• <200 4LUTs ?
– 64x3
• 200MHz MIPS base

JR.S00 19
Razdan/Micro27
Chimaera

• Start from PRISC idea


– integrate as functional unit
– no state
– RFUOPs (like expfu)
– stall processor on instruction miss, reload
• Add
– manage multiple instructions loaded
– more than 2 inputs possible

JR.S00 20
[Hauck: Northwestern]
Chimaera Architecture
• “Live” copy of register
file values feed into
array
• Each row of array may
compute from register
values or intermediates
(other rows)
• Tag on array to indicate
RFUOP

Results
• Compress 1.11
• Eqntott 1.8
• Life 2.06 (160 hand parallelization)

[Hauck/FCCM97]
JR.S00 21
Instruction Augmentation

• Small arrays with limited state


– so far, for automatic compilation
» reported speedups have been small
– open
» discover less-local recodings which extract greater
benefit

JR.S00 22
GARP

Identified Problems:
• Single-cycle flow-through
– not most promising usage style
• Moving data through Register File to/from
array
– can present a limitation
» bottleneck to achieving high computation rate

JR.S00 23
[Hauser+Wawrzynek: UCB]
GARP

• Integrate as coprocessor
– similar bandwidth to processor as FU
– own access to memory
• Support multi-cycle operation
– allow state
– cycle counter to track operation
• Fast operation selection
– cache for configurations
– dense encodings, wide path to memory

JR.S00 24
GARP
• ISA -- coprocessor operations
– issue gaconfig to make a particular configuration resident
(may be active or cached)
– explicitly move data to/from array
» 2 writes, 1 read
– processor suspend during coprocessor operation
» cycle count tracks operation
– array may directly access memory
» processor and array share memory space
• cache/mmu keeps consistency
» can exploit streaming data operations
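The coprocessor protocol above can be sketched as a toy software model. The instruction names `gaconfig`, `mtga` and `mfga` appear in these slides; everything else is an assumption made for illustration, including the register count, the two-slot configuration cache, the separate `ga_start` call that sets the cycle counter, and the example configuration `step_triangle`. It is not a description of the real GARP hardware.

```c
#include <stdint.h>
#include <stddef.h>

#define GA_REGS        4   /* hypothetical number of array registers seen by the CPU */
#define GA_CACHE_SLOTS 2   /* hypothetical configuration-cache capacity */

typedef struct garp garp_t;
typedef void (*ga_step_fn)(garp_t *g);   /* one array clock of a configuration */

typedef struct {
    int        id;     /* configuration name; must be >= 0 in this model */
    ga_step_fn step;   /* behaviour of the configured array per cycle */
} ga_config_t;

struct garp {
    uint32_t   reg[GA_REGS];           /* array state: survives across cycles */
    ga_step_fn active;                 /* currently active configuration */
    int        cache[GA_CACHE_SLOTS];  /* cached configuration IDs, -1 = empty */
    int        next_victim;            /* round-robin replacement pointer */
    unsigned   cycles_left;            /* cycle counter tracking the operation */
    unsigned   slow_loads;             /* configurations fetched from memory */
};

void ga_init(garp_t *g)
{
    for (int i = 0; i < GA_REGS; i++) g->reg[i] = 0;
    for (int i = 0; i < GA_CACHE_SLOTS; i++) g->cache[i] = -1;
    g->active = NULL;
    g->next_victim = 0;
    g->cycles_left = 0;
    g->slow_loads = 0;
}

/* Make a configuration resident. Returns 1 on a cache hit, 0 on a load from memory. */
int gaconfig(garp_t *g, const ga_config_t *cfg)
{
    int hit = 0;
    for (int i = 0; i < GA_CACHE_SLOTS; i++)
        if (g->cache[i] == cfg->id) hit = 1;
    if (!hit) {
        g->cache[g->next_victim] = cfg->id;
        g->next_victim = (g->next_victim + 1) % GA_CACHE_SLOTS;
        g->slow_loads++;
    }
    g->active = cfg->step;
    return hit;
}

/* Explicit data movement to the array, and start of a multi-cycle operation. */
void mtga(garp_t *g, int r, uint32_t value) { g->reg[r] = value; }
void ga_start(garp_t *g, unsigned cycles)   { g->cycles_left = cycles; }

/* Read from the array: the processor is suspended until the counter reaches zero. */
uint32_t mfga(garp_t *g, int r)
{
    while (g->cycles_left > 0) {
        if (g->active) g->active(g);
        g->cycles_left--;
    }
    return g->reg[r];
}

/* Example stateful configuration: reg[1] accumulates reg[0], reg[0] counts down. */
void step_triangle(garp_t *g) { g->reg[1] += g->reg[0]; g->reg[0]--; }
```

Loading `step_triangle`, writing 4 and 0 with `mtga`, running 4 cycles and reading register 1 with `mfga` accumulates 4 + 3 + 2 + 1. A second `gaconfig` of the same configuration hits the cache rather than memory.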

JR.S00 25
GARP

• Processor Instructions

JR.S00 26
GARP Array
• Row oriented logic
– denser for datapath
operations
• Dedicated path for
– processor/memory data
• Processor does not have to be
involved in the
array⇔memory path

JR.S00 27
GARP Results
• General results
– 10-20x on stream, feed-forward operation
– 2-3x when data dependencies limit pipelining

[Hauser+Wawrzynek/FCCM97]
JR.S00 28
PRISC/Chimaera … GARP
• PRISC/Chimaera
– basic op is single cycle: expfu (rfuop)
– no state
– could conceivably have multiple PFUs?
– Discover parallelism => run in parallel?
– Can’t run deep pipelines
• GARP
– basic op is multicycle
» gaconfig
» mtga
» mfga
– can have state/deep pipelining
– ? Multiple arrays viable?
– Identify mtga/mfga w/ corr gaconfig?

JR.S00 29
Common Theme

• To get around instruction expression limits


– define new instruction in array
» many bits of config … broad expressability
» many parallel operators
– give array configuration short “name” which processor
can callout
» …effectively the address of the operation

But: Impact of using reconfiguration at Instruction Level seems limited
⇒ Explore opportunities at larger granularity levels
(basic block, task, process)

JR.S00 30
Applicability of Configurable
Processing
• Stand-alone computational engines
– E.g. PADDI, UCLA Mojave
• Adding programmable I/O to embedded
processors
– E.g. Napa 2000
• Augmenting the instruction set of processors
– E.g. GARP, RAW
• Providing programmable accelerator co-processors
to embedded microprocessors and DSPs
– E.g. Chameleon, Pleiades, Morphics

JR.S00 31
Example: Chameleon Reconfigurable
Co-Processor (network,
communication applications)

[Figure: block diagram containing the following blocks]
• JTAG Debugging Port, PCI Interface, Memory Controller, Data Memory
• Wide internal communications bus
• Bus Manager & DMA Controllers, ARC CPU, Instruction Cache
• Background Configuration Plane, fed by the configuration bit stream
• Reconfigurable Logic: array of 32-bit Data Path Operators & Control Logic, with Local Store Memory (LSM)
• Multiple banks of I/O

JR.S00 32
Reconfigurable Processor Tools Flow
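[Figure: tools flow with the following elements]
• Inputs: Customer Application / IP (C code) and RTL HDL
• C code path: C Compiler, ARC Object Code, Linker
• HDL path: Synthesis & Layout, Configuration Bits
• Result: Chameleon Executable
• Targets: C Model Simulator, C Debugger, Development Board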

JR.S00 33
Heterogeneous Reconfiguration

• Reconfigurable Logic (array of CLBs): bit-level operations, e.g. encoding
• Reconfigurable Datapaths (mux, reg0, reg1, adder, buffer): dedicated data paths, e.g. filters, AGU
• Reconfigurable Arithmetic (AddrGen, Memory, MAC): arithmetic kernels, e.g. convolution
• Reconfigurable Control (Data Memory, Program Memory, Instruction Decoder & Controller, Datapath): RTOS, process management
JR.S00 34
Multi-granularity Reconfigurable
Architecture:
The Berkeley Pleiades Architecture
[Figure labels: Control Processor; Satellite Processors (Arithmetic Processors, Dedicated Arithmetic, Configurable Datapath, Configurable Logic); Network Interface; Communication Network; Configuration Bus]

• Computational kernels are “spawned” to satellite processors
• Control processor supports RTOS and reconfiguration
• Order(s) of magnitude energy reduction over traditional programmable architectures
JR.S00 35
Matching Computation and
Architecture

[Figure: Convolution mapped onto AddressGen, Memory and MAC units alongside a Control Processor]

Two models of computation: communicating processes + data-flow
Two architectural models: sequential control + data-driven
JR.S00 36
Execution Model of a Data-Flow Kernel

Embedded processor:
    Code seg
    for(i=1;i<=L;i++)
        for(k=i;k<=L;k++)
            phi[i][k]= phi[i-1][k-1]
                       +in[NP-i]*in[NP-k]
                       -in[NA-1-i]*in[NA-1-k];
    Code seg

[Figure: the loop between the "start" and "end" markers is mapped onto two AddrGen units, MEM: in, MEM: phi, two MPY units and two ALUs]

• Distributed control and memory
JR.S00 37
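A runnable version of the loop on the slide above might look like the sketch below. The slide gives only the recurrence; the sizes `L_ORDER`, `NP`, `NA`, the `long` element type, and the helper `phi_direct` are assumptions added here. The slide also does not say how row 0 of `phi` is initialised, so this sketch expects the caller to fill it in. `phi_direct` uses the sum that the recurrence is consistent with, namely the sum over n from NP to NA-2 of in[n-i]*in[n-k].

```c
#define L_ORDER 3          /* "L" on the slide; hypothetical value */
#define NP      3          /* hypothetical; needs NP >= L_ORDER so NP-i stays >= 0 */
#define NA      10         /* hypothetical; in[] must hold at least NA-1 samples */
#define IN_LEN  (NA - 1)

/* Direct evaluation: phi[i][k] = sum over n = NP .. NA-2 of in[n-i] * in[n-k].
 * Assumed definition, used for row 0 and as a cross-check. */
long phi_direct(const long *in, int i, int k)
{
    long s = 0;
    for (int n = NP; n <= NA - 2; n++)
        s += in[n - i] * in[n - k];
    return s;
}

/* The kernel from the slide: each entry is derived from its diagonal
 * neighbour with two multiplies, one add and one subtract.
 * Row 0 of phi must already be filled in by the caller. */
void phi_kernel(const long *in, long phi[L_ORDER + 1][L_ORDER + 1])
{
    for (int i = 1; i <= L_ORDER; i++)
        for (int k = i; k <= L_ORDER; k++)
            phi[i][k] = phi[i - 1][k - 1]
                      + in[NP - i] * in[NP - k]
                      - in[NA - 1 - i] * in[NA - 1 - k];
}
```

Each inner iteration needs four reads of `in`, one read and one write of `phi`, two multiplies and two additive operations. That matches the kinds of units named in the figure (address generators, two memories, MPY and ALU), though the exact assignment of operations to units is not spelled out on the slide.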


Reconfigurable Kernels for
W-CDMA

• Dominant kernel M(MTX) requires array of MACs and segmented memories
• Additional operations such as sqrt(x), 1/x, and Trellis decoding may be
implemented using FPGA or CORDIC satellite

JR.S00 38
Inter-Satellite
Communication
• Data-driven execution
– A satellite processor is enabled only when input data is ready

• Data sources generate data of different types: scalars, vectors, matrices
• Data computing processors handle data inputs of different types

[Figure: data sources (Embedded processor, AddrGen, Memory) feed data computing processors (MPY, MAC); each link is labelled 1 or n, and an end-of-vector token is marked on the stream]
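The data-driven rule above (a satellite fires only when its input data is ready) and the end-of-vector token can be sketched in software as follows. All names here (`token_t`, `fifo_t`, `mac_t`, `mac_fire`) and the FIFO depth are hypothetical. The sketch models one MAC satellite that reduces two vector streams to a scalar and is not the Pleiades implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* A data token: a value plus a flag marking the last element of a vector. */
typedef struct {
    long value;
    bool end_of_vector;
} token_t;

#define FIFO_CAP 16   /* hypothetical link buffer depth */

typedef struct {
    token_t buf[FIFO_CAP];
    size_t  head;
    size_t  count;
} fifo_t;

bool fifo_push(fifo_t *f, token_t t)
{
    if (f->count == FIFO_CAP) return false;
    f->buf[(f->head + f->count) % FIFO_CAP] = t;
    f->count++;
    return true;
}

bool fifo_pop(fifo_t *f, token_t *t)
{
    if (f->count == 0) return false;
    *t = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return true;
}

typedef struct { long acc; } mac_t;   /* local state of the MAC satellite */

/* Try to fire the MAC once. It is enabled only when both inputs hold a token
 * and the output has room. Returns true if it fired. On the end-of-vector
 * token it emits the accumulated scalar and clears its accumulator. */
bool mac_fire(mac_t *m, fifo_t *a, fifo_t *b, fifo_t *out)
{
    token_t x, y;
    if (a->count == 0 || b->count == 0 || out->count == FIFO_CAP)
        return false;                      /* not enabled: wait for data */
    fifo_pop(a, &x);
    fifo_pop(b, &y);
    m->acc += x.value * y.value;
    if (x.end_of_vector) {                 /* vector in, scalar out */
        token_t r = { m->acc, true };
        fifo_push(out, r);
        m->acc = 0;
    }
    return true;
}
```

No central controller sequences the MAC in this model: calling `mac_fire` in a loop until it returns false drains whatever the sources have produced, and the token flag alone tells the unit when a result is complete.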


JR.S00 39
Impact of Architectural
Choice

Example: 16 point Complex Radix-2 FFT (Final Stage)
[Figure: three log-scale bar charts of normalized Energy/stage [nJ], Delay/stage [s] and Energy*Delay/stage [Js*e-14], comparing StrongARM, TMS320C2xx, TMS320LC54x and Pleiades]
JR.S00 40
Adaptive Multi-User Detector for W-CDMA
Pilot Correlator Unit Using LMS
[Figure: dataflow mapping with two parts, Filter and Coefficient Update, built from MEM, AG, MAC, MUL, ADD, SUB and ACC units, with signals s_r, s_i, y_r, y_i, Zmf_r and Zmf_i]
JR.S00 41
Architecture Comparison
LMS Correlator at 1.67 MSymbols Data Rate
Complexity: 300 Mmult/sec and 357 Macc/sec

16 Mmacs/mW!
Note: TMS implementation requires 36 parallel processors to meet data rate -
validity questionable
JR.S00 42
Maia: Reconfigurable Baseband
Processor for Wireless

• 0.25um tech: 4.5mm x 6mm


• 1.2 Million transistors
• 40 MHz at 1V
• 1 mW VCELP voice coder
• Hardware
• 1 ARM-8
• 8 SRAMs & 8 AGPs
• 2 MACs
• 2 ALUs
• 2 In-Ports and 2 Out-Ports
• 14x8 FPGA
JR.S00 43
Reconfigurable Interconnect
Exploration

[Figure: three interconnect options]
• Mesh
• Hierarchical Mesh (modules grouped into clusters)
• Multi-Bus (N Inputs, B Buses, M Outputs)
JR.S00 44
Software Methodology
Flow
[Figure: design flow with the following stages]
• Algorithms (C++)
• Kernel Detection
• Estimation/Exploration: Power & Timing Estimation of Various Kernel Implementations
• Partitioning
• Software Compilation, Reconfig. Hardware Mapping, Interface Code Generation
[Figure labels for inputs and tools: µproc & Accelerator PDA Models; Behavioral C++ Module Libraries; Premapped Kernels; SUIF + C-IF]
JR.S00 45
Hardware-Software Exploration

Macromodel call

JR.S00 46
Implementation Fabrics for
Protocols
A protocol = Extended FSM
[Figure: FSM with idle, RACH req, RACH akn, slotset, read, write and update states, driving a Memory (Slot_Set_Tbl, 2x16) through R_ENA, W_ENA, addr and BUF; fields slot_set <31:0>, Slot_no <5:0>, Slot start, Pkt end]
Intercom TDMA MAC
JR.S00 47


Intercom TDMA MAC
Implementation alternatives

          ASIC         FPGA         ARM8
Power     0.26 mW      2.1 mW       114 mW
Energy    10.2 pJ/op   81.4 pJ/op   n*457 pJ/op

• ASIC: 1 V, 0.25 µm CMOS process
• FPGA: 1.5 V, 0.25 µm CMOS low-energy FPGA
• ARM8: 1 V, 25 MHz processor; n = 13,000
• Ratio: 1 : 8 : >> 400
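The quoted ratio can be checked against the power row of the table. This is plain arithmetic on the slide's own numbers; reading the ratio as a power ratio normalized to the ASIC is an interpretation, not something the slide states.

```latex
\[
\frac{P_{\mathrm{FPGA}}}{P_{\mathrm{ASIC}}} = \frac{2.1\ \mathrm{mW}}{0.26\ \mathrm{mW}} \approx 8.1,
\qquad
\frac{P_{\mathrm{ARM8}}}{P_{\mathrm{ASIC}}} = \frac{114\ \mathrm{mW}}{0.26\ \mathrm{mW}} \approx 438,
\qquad
\frac{E_{\mathrm{FPGA}}}{E_{\mathrm{ASIC}}} = \frac{81.4\ \mathrm{pJ/op}}{10.2\ \mathrm{pJ/op}} \approx 8.0 .
\]
```

Both the power and the energy-per-operation figures give a factor of about 8 between FPGA and ASIC. The ARM8 power figure alone is already more than 400 times the ASIC's.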

JR.S00 48
The Software-Defined Radio

[Figure labels: FPGA; Embedded uP; Dedicated FSM; Dedicated DSP; Reconfigurable DataPath]

JR.S00 49
An Industrial Example:
Basestation for Cellular
Wireless
[Figure: two antenna systems, one for the 800 MHz band (blocks A B A B) and one for the 1900 MHz band (blocks A D B E F C), each with multiple sectors and multi-band operation; each feeds RF/IF Tuners and Block-Spectrum A/Ds onto a HIGH-SPEED DIGITAL BUS, which connects to T1/E1]

Modular/Parameterizable
•per carrier
•per TDMA time-slot
•per CDMA code
JR.S00 50
BTS Signal Processing
Platforms
[Figure: A/D Conversion and D/A Conversion attach to a High-Speed Digital Bus; on the bus sit Hardwired ASICs and N DSP/CPUs, divided by standard and frequency (Standard A, F1; Standard A, F2; Standard B, F1; ...), together with Comm Agents and a CPU]

JR.S00 51
Basestation of the Next
Generation

[Figure labels: Wideband RF; 10/100 or Gbit; ATM; Data Networks]

JR.S00 52
Coexistence of Multiple
Standards In Product
Supply Chain
2G
– GSM
– DCS1800
– PCS1900
– IS-95
– IS-54B
– IS-136
– PDC
2.5G
– GPRS
– HCSD
– IS-95 MDR
– IS-95 HDR
– IS-136 HS
3G
– ETSI UTRA
– ARIB W-CDMA
– TIA cdma2000
– W-TDMA (UWC)

CIRCUIT PACKET
VOICE DATA
NARROWBAND WIDEBAND

JR.S00 53
Wideband CDMA: MOPS?
No, GOPS!
Single 384 kbps ARIB W-CDMA Channel

Function                  MIPS
Digital RRC Channel       3600
Searcher                  2100
RAKE                      1050
Maximal Ratio Combiner      24
Channel Estimator           12
AGC, AFC                    10
Deinterleaver               15
Turbo Coder                 90
TOTAL                     6901

Source: J. Kohnen et al. “Baseband Solution for WCDMA,” Proc. IEEE Communication Theory Workshop, May 1999, Aptos, USA.
JR.S00 54
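A quick check of the table's total and of where the load sits. This is arithmetic on the table entries only; grouping the first three rows together is a choice made here for illustration.

```latex
\[
3600 + 2100 + 1050 + 24 + 12 + 10 + 15 + 90 = 6901\ \mathrm{MIPS},
\]
\[
\frac{3600}{6901} \approx 52\%,
\qquad
\frac{3600 + 2100 + 1050}{6901} = \frac{6750}{6901} \approx 97.8\% .
\]
```

The RRC channel filter, searcher and RAKE together account for nearly all of the roughly 6.9 GOPS, so a small number of regular kernels dominates the single-channel workload.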
HW Multistandard
Solutions
The common approach to hardware design involves multiple ASICs to support each standard.
[Figure: a programmable part (DSP, Control Processor) next to one Digital Hardwired ASIC + IF + RF chain per standard; labels: Programmable, Unique Combinations, Analog]

• Hardwired implementation is not scalable or upgradeable to new standards.
• This approach costs time in a time-to-market dominated world.
• Creating new chipsets for every technology combination critically challenges
available design resources!
JR.S00 55
SW Multistandard
Solution
Applying instruction-set processor architectures to
all baseband processing would be desirable...
[Figure: programmable DSP and Control Processor connected to three analog IF/RF chains; labels: Programmable, Analog]
…but is simply not a good implementation for base stations:
- Unacceptably high cost per channel
- Unacceptably large power per channel
This is definitely not a viable implementation for terminals
JR.S00 56
The Law of Diminishing
Returns
• More transistors are being thrown at improving
general-purpose CPU and DSP performance

• Fundamental bounds are being pushed


– limits on instruction-level parallelism
– limits on memory system performance

• Returns per transistor are diminishing


– new architectures realizing only 2-3 instructions/clock
– increasingly large caches to hide DRAM latency

JR.S00 57
Embedded Logic
Evolution
– Increasing fixed-function hardwired content of systems
– Core+Logic becomes de-facto design architecture
– Move to deep sub-micron technology
» rapidly increasing product integration cycles
» increasingly constrained design resources
» sharp increases in cost of “trying out” an idea (NRE)
– Design methodologies optimized for random logic, homogeneous architectures,
and lower speed signal processing (i.e. control-flow dominated systems)
– Verification issues dominate design cycle time

Growing Design Cycle Times At Odds With Shrinking Product Cycle Times

JR.S00 58
FPGA the Solution?
Cellular Handset Using
Current FPGA

JR.S00 59
Some Interesting
Observations
Don’t use more transistors to stretch general-purpose performance,
whether for CPUs, DSPs, or reconfigurable logic.

Don’t use more time to design dedicated hardwired solutions in cases where
mass customization is what the market demands.

JR.S00 60
View The “Reconfigurability”
Problem From The System Level
What Are the Application-Specific Performance Needs?

– What are the applications targeted?


– What algorithms are essential to achieving the performance goals?
– What are the functions at the heart of these algorithms?
– Which functions yield poor price-performance with general-purpose
MOPS and system memory models?
– What is the embedded systems programmer’s model?
– Best performance at what cost:
» Area - instruction-level parallelism, memory hierarchy
» Power- energy requirements on a function basis
» Time- quality and ease of programming for app development
» Pain- forward opportunity vs backward compatibility

JR.S00 61
Successfully Using
Reconfigurability
Application-Specific Leverage

Focus first on applications and constituent algorithms, not the
silicon architecture!
Wireless Communications Transceiver Signal Processing

Minimize the hardware reconfigurability to a constrained set

Maximize the software parameterizability and ease of use of the
programmer’s model for flexibility

Define optimal architecture for efficient implementation

JR.S00 62
Application-Specific MOPS in
Digital Communications

[Figure labels: RF/IF; Digital Downconversion and Channelization; TDMA Wideband Signal Processing Engine; CDMA Wideband Signal Processing Engine; Wideband Channel Decoder Engine; Programmable DSP; Microprocessor]

JR.S00 63
Morphics’ DRL
Architecture
Heterogeneous Multiprocessing Engine Using Application-Specific Reconfigurable Logic

[Figure: DATAFLOW between a Large Granularity Kernel and Small Granularity Kernels; each kernel has input ports, an output port, and Clk and Enable signals]

JR.S00 64
DRL Kernels

[Figure: a DRL kernel consists of DATA MEMORY, a PARAMETERIZABLE DATA SEQUENCER and a CONFIGURABLE ALU]

JR.S00 65
Mapping Software to Target
Architecture

[Figure: two kernel instances, each with DATA MEMORY, a PARAMETERIZABLE DATA SEQUENCER and a CONFIGURABLE ALU]

JR.S00 66
Programmer’s Guide
Document provided with each processor to enable application development,
configuration, and system control via host processor.

Includes complete
• description of each API function call
• system control functions
• variables, parameters
• coding examples, and performance realized (i.e. ROC curves)

/* morphics soft API usage examples: */
/* search a pilot set in search set */
/* maintenance mode. */
...
set_ptr = get_next_set();
if (no_need_to_throttle())
{
    search_set(set_ptr);
}
...
void search_set(PILOT_TYPE *pilot)
{
    /* search threshold is assumed to be set in other places */
    morphics_searcher_set_win_size(pilot->win_size);
    morphics_searcher_set_pn(pilot->pn);
    morphics_searcher_set_int_len(pilot->int_len);
}

/* finger re-assignments */
...
fing[i].pos = morphics_demod_get_fin_pos(i);
...
distance = calc_fing_movement(&fing[i], new_pn);
...
fing[i].slew = distance;
fing[i].pn = new_pn;
morphics_demod_set_fing_iq(i, fing[i].pn);
while ( slew_not_done(morphics_demod_get_status()) )
{
    morphics_demod_set_fing_slew(i, fing[i].slew);
}
...

JR.S00 67
Key Pieces of Design
Methodology
System-level Profiling
• Analyze sequences of operations (arithmetic, memory access, etc)
• Analyze communication bottlenecks
• Key flexible parameters (algorithm v architecture parameters)

Architecture-level Profiling
• ALU/kernel definition (sequences of operators)
• Memory profile
• Type of configurability required for flexibility
• Macro-sequencer development

Implementation
• SW- programmer’s model developed at architecture specification stage
• SW- API proven out via behavioral models & demonstrator hardware
• VLSI-focus on regular predictable timing and routability
• VLSI- embedded reconfigurability in an ASIC flow

JR.S00 68
CDMA Modem Analysis

JR.S00 69
MOPS Breakdown Analysis

JR.S00 70
Prototype Demonstrator

RISC
Microprocessor

Configurable Kernel(s)

Data Router

JR.S00 71
Summary
• Configurable computing is finding its way into the
embedded processor space
• Best suited (so far) for
– Flexible I/O and Interface functionality
– providing task-level acceleration of “parameterizable” functions
• Improvement of IP seems limited
• Software flow still subject to improvement
• Might become more interesting with the emergence of
low-current devices (TFT, organic transistors,
molecular computing)

DO NOT FORGET CONFIGURATION OVERHEAD

JR.S00 72
