(Re) Configurable Computing Case Studies: Prof. Jan Rabaey Computer Science 252, Spring 2000

Lecture 14:
(Re)configurable Computing
Case Studies
Prof. Jan Rabaey

Computer Science 252, Spring 2000
The contributions of André DeHon (Caltech) and Ravi Subramanian (Morphics)
to this slide set are gratefully acknowledged
JR.S00 1
Summary of Previous
Class
• Configurable Computing using “programming
in space” versus “programming in time” for
traditional instruction-set computers
• Key design choices
– Computational units and their granularity
– Interconnect Network
– (Re)configuration time and frequency
• Next class: Some practical examples of
reconfigurable computers
JR.S00 2
Applicability of Configurable
Processing
• Stand-alone computational engines
– E.g. PADDI, UCLA Mojave
• Adding programmable I/O to embedded
processors
– E.g. Napa 2000
• Augmenting the instruction set of processors
– E.g. GARP, RAW
• Providing programmable accelerator co-
processors to embedded micro’s and DSP
– Chameleon, Pleiades, Morphics
JR.S00 3
Stand-Alone Computational Engines
Template Matching for Automatic Target Recognition
[Figure: UCLA Mojave system with i960 board]
JR.S00 4
As Programmable Interface and I/O Processor
• Logic used in place of
– ASIC environment customization
– external FPGA/PLD devices
• Example
– bus protocols
– peripherals
– sensors, actuators
• Case for:
– Always have some system adaptation to do – varying glue logic requirements
– Modern chips have capacity to hold processor + glue logic
– Reduces part count
– Value added must now be accommodated on chip (formerly board level)
JR.S00 5
Example:
Interface/Peripherals
• Triscend E5
JR.S00 6
Model: IO Processor
• Array dedicated to servicing IO channel
– sensor, LAN, WAN, peripheral
• Provides
– protocol handling
– stream computation
» compression, encrypt
• Looks like IO peripheral to processor
• Case for:
– many protocols, services
– only need few at a time
– dedicate attention, offload processor
JR.S00 7
IO Processing
• Single threaded processor

– cannot continuously monitor multiple data pipes (src,
sink)
– need some minimal, local control to handle events
– for performance or real-time guarantees, may need to
service event rapidly
– E.g. checksum (decode) and acknowledge packet
JR.S00 8
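The checksum-and-acknowledge example can be sketched in C as a small local handler that the I/O array (or a minimal local controller) could run without involving the host processor. This is only a sketch under stated assumptions: the slide does not name a checksum algorithm, so a 16-bit one's-complement sum over big-endian words is assumed, and `io_service_packet`, `IO_ACK`, and `IO_NAK` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler results; the slide names no specific protocol. */
enum io_result { IO_NAK = 0, IO_ACK = 1 };

/* 16-bit one's-complement sum over big-endian 16-bit words.
 * The checksum style is an assumption made for illustration. */
static uint16_t ones_complement_sum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    while (sum >> 16)                      /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Local event handler: the packet is a payload followed by its 16-bit
 * checksum. Decide on the spot whether to acknowledge it. */
static enum io_result io_service_packet(const uint8_t *pkt, size_t len)
{
    if (len < 2 || (len % 2) != 0)         /* need whole 16-bit words */
        return IO_NAK;
    /* With the checksum included, a valid packet sums to 0xFFFF. */
    return ones_complement_sum(pkt, len) == 0xFFFFu ? IO_ACK : IO_NAK;
}
```

A caller would pass each received buffer to `io_service_packet` and send the acknowledgement only on `IO_ACK`; the host sees completed, verified packets rather than raw events.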
NAPA 1000 Block Diagram
[Figure: block diagram with the following blocks: System Port; TBT (ToggleBus™ Transceiver); CR32 (CompactRISC™ 32 Bit Processor); RPC (Reconfigurable Pipeline Cntr); ALP (Adaptive Logic Processor); BIU (Bus Interface Unit); PMA (Pipeline Memory Array); CIO (Configurable I/O); External Memory Interface; CR32 Peripheral Devices; SMA (Scratchpad Memory Array)]
Source: National Semiconductor
JR.S00 9
NAPA 1000 as IO Processor
[Figure: the NAPA1000 sits between the system host (via the System Port), application-specific sensors, actuators, or other circuits (via the CIO), and ROM & DRAM (via the Memory Interface)]
Source: National Semiconductor
JR.S00 10
I/O Stream Processor
Philips Nexperia NX-2700
A programmable HDTV media processor
Combines TriMedia VLIW with configurable media co-processors
JR.S00 11
Model: Instruction
Augmentation
• Observation: Instruction Bandwidth
– Processor can only describe a small number of basic
computations in a cycle
» $I$ bits → $2^{I}$ operations
– This is a small fraction of the operations one could do even in
terms of $w \otimes w \to w$ ops
» $w\,2^{2^{2w}}$ operations
– Processor could have to issue $w\,2^{(2^{2w} - I)}$ operations just to
describe some computations
– An a priori selected base set of functions could be very bad for
some applications
JR.S00 12
Instruction Augmentation
• Idea:
– provide a way to augment the processor’s instruction set
– with operations needed by a particular application
– close semantic gap / avoid mismatch
• What’s required:
– some way to fit augmented instructions into stream
– execution engine for augmented instructions
» if programmable, has own instructions
– interconnect to augmented instructions
JR.S00 13
“First” Instruction
Augmentation
• PRISM
– Processor Reconfiguration through Instruction Set
Metamorphosis
• PRISM-I
– 68010 (10MHz) + XC3090
– can reconfigure FPGA in one second!
– 50-75 clocks for operations
JR.S00 14
[Athanas+Silverman: Brown]
PRISM-I Results
[Figure: raw kernel speedups]
JR.S00 15
PRISM
• FPGA on bus
• access as memory mapped peripheral
• explicit context management
• some software discipline for use
• …not much of an “architecture” presented to
user
JR.S00 16
[Razdan+Smith: Harvard]
PRISC
• Takes next step
– what would it look like if we put it on chip?
– how to integrate it into the processor ISA?
• Architecture:
– couple into register file as “superscalar” functional unit
– flow-through array (no state)
JR.S00 17
PRISC
• ISA Integration
– add expfu instruction
– 11 bit address space for user-defined expfu instructions
– fault on pfu instruction mismatch
» trap code to service instruction miss
– all operations complete in one clock cycle
– easily works with processor context switch
» no state + fault on mismatch pfu instr
JR.S00 18
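The ISA integration described above (an 11-bit function id carried by `expfu`, a fault when the loaded PFU configuration does not match, and no state in the array) can be modeled in a few lines of C. This is a behavioral sketch, not the PRISC implementation: the `struct pfu` layout, the `PFU_NONE` marker, the function table, and the two example functions are all assumptions made for illustration.

```c
#include <stdint.h>

#define PFU_ID_BITS 11u
#define PFU_ID_MASK ((1u << PFU_ID_BITS) - 1u)  /* 2048 user functions */
#define PFU_NONE    0xFFFFu                     /* nothing loaded yet */

/* A PFU function is purely combinational: two register inputs, one result. */
typedef uint32_t (*pfu_fn)(uint32_t a, uint32_t b);

struct pfu {
    uint16_t loaded_id;   /* id of the configuration now in the array */
    pfu_fn   logic;       /* stands in for the configured LUTs */
    unsigned miss_count;  /* number of configuration faults taken */
};

/* Hypothetical application-specific functions. */
static uint32_t hamming_distance(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b, n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

static uint32_t saturating_add(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    return s < a ? 0xFFFFFFFFu : s;
}

/* Stand-in for the trap handler's lookup of a configuration by id. */
static pfu_fn pfu_lookup(uint16_t id)
{
    switch (id) {
    case 0:  return hamming_distance;
    case 1:  return saturating_add;
    default: return 0;
    }
}

/* Model of executing expfu: fault and reload on id mismatch, then evaluate.
 * Returns 0 on success, -1 if no such user function exists. */
static int expfu(struct pfu *p, uint16_t id, uint32_t a, uint32_t b,
                 uint32_t *out)
{
    id &= PFU_ID_MASK;
    if (p->loaded_id != id) {           /* mismatch: trap and reload */
        pfu_fn f = pfu_lookup(id);
        if (!f)
            return -1;
        p->logic = f;
        p->loaded_id = id;
        p->miss_count++;
    }
    *out = p->logic(a, b);
    return 0;
}
```

Because the model keeps no state besides the loaded id, a context switch needs no save or restore: the next `expfu` with a different id simply faults and reloads, which is the property the slide points out.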
PRISC Results
• All compiled
• working from MIPS
binary
• <200 4LUTs ?
– 64x3
• 200MHz MIPS base
JR.S00 19
Razdan/Micro27
Chimaera
• Start from PRISC idea

– integrate as functional unit
– no state
– RFUOPs (like expfu)
– stall processor on instruction miss, reload
• Add
– manage multiple instructions loaded
– more than 2 inputs possible
JR.S00 20
[Hauck: Northwestern]
Chimaera Architecture
• “Live” copy of register
file values feed into
array
• Each row of array may
compute from register
values or intermediates
(other rows)
• Tag on array to indicate
RFUOP
Results
• Compress 1.11
• Eqntott 1.8
• Life 2.06 (160 with hand parallelization)
[Hauck/FCCM97]
JR.S00 21
Instruction Augmentation
• Small arrays with limited state

– so far, for automatic compilation
» reported speedups have been small
– open
» discover less-local recodings which extract greater
benefit
JR.S00 22
GARP
Identified Problems:
• Single-cycle flow-through
– not most promising usage style
• Moving data through Register File to/from
array
– can present a limitation
» bottleneck to achieving high computation rate
JR.S00 23
[Hauser+Wawrzynek: UCB]
GARP
• Integrate as coprocessor
– similar bandwidth to processor as FU
– own access to memory
• Support multi-cycle operation
– allow state
– cycle counter to track operation
• Fast operation selection
– cache for configurations
– dense encodings, wide path to memory
JR.S00 24
GARP
• ISA -- coprocessor operations
– issue gaconfig to make a particular configuration resident
(may be active or cached)
– explicitly move data to/from array
» 2 writes, 1 read
– processor suspends during coprocessor operation
» cycle count tracks operation
– array may directly access memory
» processor and array share memory space
• cache/mmu keeps consistency
» can exploit streaming data operations
JR.S00 25
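The coprocessor protocol above can be illustrated with a small behavioral model in C. It is a sketch under stated assumptions, not the GARP hardware or toolchain: the C functions borrow the instruction names `gaconfig`, `mtga`, and `mfga` from the slides, while the cache size, the round-robin replacement, the `garp_run` helper, and the per-cycle work of the "configuration" are invented for illustration.

```c
#include <stdint.h>

#define GARP_CACHE_SLOTS 4   /* assumed size of the configuration cache */

struct garp {
    int      cached[GARP_CACHE_SLOTS]; /* cached configuration ids, -1 if empty */
    int      next_slot;                /* round-robin replacement (assumption) */
    int      active;                   /* resident configuration, -1 if none */
    unsigned memory_loads;             /* configurations fetched from memory */
    uint32_t in[2];                    /* two operands written by the processor */
    uint32_t state;                    /* the array may keep state across cycles */
    unsigned cycles_left;              /* cycle counter tracking the operation */
};

static void garp_init(struct garp *g)
{
    for (int i = 0; i < GARP_CACHE_SLOTS; i++)
        g->cached[i] = -1;
    g->next_slot = 0;
    g->active = -1;
    g->memory_loads = 0;
    g->in[0] = g->in[1] = 0;
    g->state = 0;
    g->cycles_left = 0;
}

/* gaconfig: make configuration `id` resident; cheap if already cached. */
static void gaconfig(struct garp *g, int id)
{
    int hit = 0;

    for (int i = 0; i < GARP_CACHE_SLOTS; i++)
        if (g->cached[i] == id)
            hit = 1;
    if (!hit) {                        /* slow path: load from memory */
        g->cached[g->next_slot] = id;
        g->next_slot = (g->next_slot + 1) % GARP_CACHE_SLOTS;
        g->memory_loads++;
    }
    g->active = id;
    g->state = 0;
}

/* mtga: move a processor value to one of the two array inputs. */
static void mtga(struct garp *g, int which, uint32_t value)
{
    g->in[which & 1] = value;
}

/* Run the array for `cycles` clocks; the processor is suspended until
 * the cycle counter reaches zero. The per-cycle work is illustrative. */
static void garp_run(struct garp *g, unsigned cycles)
{
    g->cycles_left = cycles;
    while (g->cycles_left > 0) {
        g->state += g->in[0] + g->in[1];
        g->cycles_left--;
    }
}

/* mfga: move the array result back to the processor. */
static uint32_t mfga(const struct garp *g)
{
    return g->state;
}
```

A typical sequence is `gaconfig`, two `mtga` writes, a multi-cycle run, and one `mfga` read, matching the "2 writes, 1 read" pattern on the slide; a second `gaconfig` with a cached id does not count as a memory load.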
GARP
• Processor Instructions
JR.S00 26
GARP Array
• Row oriented logic
– denser for datapath
operations
• Dedicated path for
– processor/memory data
• Processor does not have to be
involved in the
array⇔memory path
JR.S00 27
GARP Results
• General results
– 10-20x on stream, feed-forward operation
– 2-3x when data dependencies limit pipelining
[Hauser+Wawrzynek/FCCM97] JR.S00 28
PRISC/Chimaera … GARP
• PRISC/Chimaera
– basic op is single cycle: expfu (rfuop)
– no state
– could conceivably have multiple PFUs?
– Discover parallelism => run in parallel?
– Can’t run deep pipelines
• GARP
– basic op is multicycle
» gaconfig
» mtga
» mfga
– can have state/deep pipelining
– ? Multiple arrays viable?
– Identify mtga/mfga with corresponding gaconfig?
JR.S00 29
Common Theme
• To get around instruction expression limits

– define new instruction in array
» many bits of config … broad expressibility
» many parallel operators
– give array configuration short “name” which processor
can call out
» …effectively the address of the operation
But – Impact of using reconfiguration at Instruction Level seems limited
⇒ Explore opportunities at larger granularity levels
(basic block, task, process)
JR.S00 30
Applicability of Configurable
Processing
• Stand-alone computational engines
– E.g. PADDI, UCLA Mojave
• Adding programmable I/O to embedded
processors
– E.g. Napa 2000
• Augmenting the instruction set of processors
– E.g. GARP, RAW
• Providing programmable accelerator co-
processors to embedded micro’s and DSP
– Chameleon, Pleiades, Morphics
JR.S00 31
Example: Chameleon Reconfigurable Co-Processor (network, communication applications)
[Figure: chip block diagram with the following blocks: JTAG Debugging Port; PCI Interface; Memory Controller; Data Memory; Wide Internal Communications Bus; ARC CPU; Instruction Cache; Bus Manager & DMA Controllers; Background Configuration Plane (configuration bit stream); Reconfigurable Logic (array of 32-bit data path operators & control logic) with Local Store Memory (LSM); multiple banks of I/O]
JR.S00 32
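Reconfigurable Processor Tools Flow
[Figure: tools flow. The customer application / IP enters as C code and as RTL HDL. The C path goes through the C Compiler and ARC Linker to object code; the HDL path goes through Synthesis & Layout to configuration bits. Together they form the Chameleon executable, used with the C Model Simulator, C Debugger, and Development Board.]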
JR.S00 33
Heterogeneous Reconfiguration
• Reconfigurable Logic (CLB array): bit-level operations, e.g. encoding
• Reconfigurable Datapaths (mux, reg0, reg1, adder, buffer): dedicated data paths, e.g. filters, AGU
• Reconfigurable Arithmetic (AddrGen, memory, MAC): arithmetic kernels, e.g. convolution
• Reconfigurable Control (data memory, program memory, instruction decoder & controller, datapath): RTOS, process management
JR.S00 34
Multi-granularity Reconfigurable Architecture:
The Berkeley Pleiades Architecture
[Figure: a control processor and satellite processors (arithmetic processors, dedicated arithmetic, configurable datapath, configurable logic), each with a network interface to the communication network; a configuration bus carries configuration to the satellites]
• Computational kernels are “spawned” to satellite processors
• Control processor supports RTOS and reconfiguration
• Order(s) of magnitude energy-reduction over traditional programmable architectures
JR.S00 35
Matching Computation and Architecture
[Figure: a convolution mapped onto AddressGen, Memory, and MAC units under a control processor]
Two models of computation: communicating processes + data-flow
Two architectural models: sequential control + data-driven
JR.S00 36
Execution Model of a Data-Flow Kernel
Embedded processor code (the kernel sits between two ordinary code segments, marked start and end):
    for(i=1;i<=L;i++)
      for(k=i;k<=L;k++)
        phi[i][k]= phi[i-1][k-1]
                   +in[NP-i]*in[NP-k]
                   -in[NA-1-i]*in[NA-1-k];
[Figure: the loop nest is mapped onto satellites: two AddrGen units, MEM: in, MEM: phi, two MPY units, and two ALUs]
• Distributed control and memory
JR.S00 37
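For reference, the loop nest on this slide can be written as a self-contained C function. This is a minimal sketch: the values of `L`, `NP`, and `NA` are placeholders chosen so that every index stays in range (the slide does not give them), and row 0 of `phi` is assumed to have been computed beforehand.

```c
/* Sizes are illustrative assumptions; the slide does not state them.
 * They must satisfy NP >= L and NA >= L + 1 to keep indices valid. */
enum { L = 3, NP = 3, NA = 8 };

/* Recursive update of phi along its diagonals, as in the slide's loop.
 * phi[0][*] must already be filled in by the caller. */
static void phi_kernel(int phi[L + 1][L + 1], const int in[NA])
{
    for (int i = 1; i <= L; i++)
        for (int k = i; k <= L; k++)
            phi[i][k] = phi[i - 1][k - 1]
                      + in[NP - i] * in[NP - k]
                      - in[NA - 1 - i] * in[NA - 1 - k];
}
```

Each iteration needs two reads of `in`, one read and one write of `phi`, two multiplies, and an add and a subtract, which is why the mapping uses two address generators, two memories, two multipliers, and two ALUs.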

Reconfigurable Kernels for
W-CDMA
• Dominant kernel M(MTX) requires array of MACs and
segmented memories
• Additional operations such as sqrt(x), 1/x, and Trellis
decoding may be implemented using FPGA or CORDIC satellite
JR.S00 38
Inter-Satellite
Communication
• Data-driven execution
– A satellite processor is enabled only when input data is ready
• Data sources generate data of different types: scalars, vectors, matrices
• Data computing processors handle data inputs of different types
[Figure: data sources (embedded processor, AddrGen, Memory) feed data computing processors (MPY, MAC); links are annotated “1” or “n”, and an end-of-vector token marks the end of a vector]
JR.S00 39
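The data-driven rule (a satellite fires only when its input data is ready, and an end-of-vector token delimits a vector) can be sketched in C for a MAC satellite that turns two n-element streams into one scalar. The token layout and the `mac_fire` interface are assumptions for illustration; the slides do not define the token format.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed token format: a data word plus an end-of-vector marker. */
struct token {
    int32_t value;
    bool    end_of_vector;
};

struct mac_satellite {
    int64_t acc;   /* running sum of products */
};

/* Try to fire the MAC once.
 * a or b == NULL models a missing input: the satellite is not enabled
 * and nothing happens. Returns true only when a result token is
 * written to *out, i.e. when the end of the vector has been reached. */
static bool mac_fire(struct mac_satellite *m,
                     const struct token *a, const struct token *b,
                     struct token *out)
{
    if (a == NULL || b == NULL)
        return false;                   /* not enabled: input not ready */

    m->acc += (int64_t)a->value * b->value;

    if (a->end_of_vector || b->end_of_vector) {
        out->value = (int32_t)m->acc;   /* n inputs produce 1 scalar */
        out->end_of_vector = true;      /* a scalar is a 1-element stream */
        m->acc = 0;                     /* ready for the next vector */
        return true;
    }
    return false;
}
```

No central controller sequences the satellite: presenting tokens is enough, and the end-of-vector marker tells it when to emit the dot product and reset.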
Impact of Architectural Choice
Example: 16 point Complex Radix-2 FFT (Final Stage)
[Figure: log-scale bar charts comparing StrongARM, TMS320C2xx, TMS320LC54x, and Pleiades on normalized energy per stage (nJ), normalized delay per stage (s), and normalized energy*delay per stage (Js*e-14)]
JR.S00 40
Adaptive Multi-User Detector for W-CDMA
Pilot Correlator Unit Using LMS
[Figure: dataflow graph mapped onto satellites (AG, MEM, MUL, ADD, SUB, ACC, MAC) with a filter section and a coefficient update section; signals s_r, s_i, y_r, y_i, Zmf_r, Zmf_i]
JR.S00 41
Architecture Comparison
LMS Correlator at 1.67 MSymbols Data Rate
Complexity: 300 Mmult/sec and 357 Macc/sec
16 Mmacs/mW!
Note: TMS implementation requires 36 parallel processors to meet data rate -
validity questionable
JR.S00 42
Maia: Reconfigurable Baseband
Processor for Wireless
• 0.25um tech: 4.5mm x 6mm

• 1.2 Million transistors
• 40 MHz at 1V
• 1 mW VCELP voice coder
• Hardware
• 1 ARM-8
• 8 SRAMs & 8 AGPs
• 2 MACs
• 2 ALUs
• 2 In-Ports and 2 Out-Ports
• 14x8 FPGA
JR.S00 43
Reconfigurable Interconnect Exploration
[Figure: three interconnect options: mesh; hierarchical mesh (modules grouped into clusters); multi-bus (N inputs, B buses, M outputs)]
JR.S00 44
Software Methodology Flow
[Figure: design flow with the following stages and inputs: algorithms (C++); kernel detection; estimation/exploration (power & timing estimation of various kernel implementations); behavioral µproc & accelerator PDA models; SUIF + C-IF; premapped kernels; partitioning; C++ module libraries; software compilation; reconfigurable hardware mapping; interface code generation]
JR.S00 45
Hardware-Software Exploration
Macromodel call
JR.S00 46
Implementation Fabrics for Protocols
A protocol = Extended FSM
[Figure: Intercom TDMA MAC example. FSM states and events: idle, RACH req, RACH akn, slotset, read, write, update. Memory: Slot_Set_Tbl (2x16) with addr, R_ENA, W_ENA, and buffers. Fields: slot_set <31:0>, Slot_no <5:0>, Slot start, Pkt end.]
JR.S00 47

Intercom TDMA MAC
Implementation alternatives

          ASIC         FPGA         ARM8
Power     0.26mW       2.1mW        114mW
Energy    10.2pJ/op    81.4pJ/op    n*457pJ/op

• ASIC: 1V, 0.25 µm CMOS process
• FPGA: 1.5 V 0.25 µm CMOS low-energy FPGA
• ARM8: 1 V 25 MHz processor; n = 13,000
• Ratio: 1 - 8 - >> 400
JR.S00 48
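As a worked check of the quoted ratio, using only the figures on this slide and taking the ASIC as the unit: the FPGA lands at about 8 on both power and energy, and the ARM8 at about 440 on power. The last line is an interpretation rather than something the slide states: if $n$ is read as the number of processor operations per protocol operation, the energy gap is far larger than the power gap, which may be what ">>" is meant to convey.

```latex
\frac{P_{\mathrm{FPGA}}}{P_{\mathrm{ASIC}}} = \frac{2.1\,\mathrm{mW}}{0.26\,\mathrm{mW}} \approx 8.1,
\qquad
\frac{E_{\mathrm{FPGA}}}{E_{\mathrm{ASIC}}} = \frac{81.4\,\mathrm{pJ}}{10.2\,\mathrm{pJ}} \approx 8.0,
\qquad
\frac{P_{\mathrm{ARM8}}}{P_{\mathrm{ASIC}}} = \frac{114\,\mathrm{mW}}{0.26\,\mathrm{mW}} \approx 438

E_{\mathrm{ARM8}} = n \cdot 457\,\mathrm{pJ} = 13{,}000 \cdot 457\,\mathrm{pJ} \approx 5.9\,\mu\mathrm{J},
\qquad
\frac{E_{\mathrm{ARM8}}}{E_{\mathrm{ASIC}}} \approx \frac{5.9\,\mu\mathrm{J}}{10.2\,\mathrm{pJ}} \approx 5.8 \times 10^{5}
```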
The Software-Defined Radio
[Figure: heterogeneous platform combining FPGA, embedded µP, dedicated FSM, dedicated DSP, and reconfigurable datapath]
JR.S00 49
An Industrial Example: Basestation for Cellular Wireless
[Figure: 800 MHz and 1900 MHz bands (blocks A to F). Each band has an antenna system (multiple sectors, multi-band), RF/IF tuners, and block-spectrum A/D converters feeding a high-speed digital bus, with a T1/E1 interface.]
Modular/Parameterizable
• per carrier
• per TDMA time-slot
• per CDMA code
JR.S00 50
BTS Signal Processing Platforms
[Figure: A/D and D/A conversion and hardwired ASICs on a high-speed digital bus, with comm agents, a CPU, and N DSP/CPUs assigned per standard and frequency (Standard A, F1; Standard A, F2; Standard B, F1; ...)]
JR.S00 51
Basestation of the Next Generation
[Figure: wideband RF connected to data networks (10/100, ATM, or Gbit)]
JR.S00 52
Coexistence of Multiple Standards In Product Supply Chain
• 2G: GSM, DCS1800, PCS1900, IS-95, IS-54B, IS-136, PDC
• 2.5G: GPRS, HSCSD, IS-95 MDR, IS-95 HDR, IS-136 HS
• 3G: ETSI UTRA, ARIB W-CDMA, TIA cdma2000, W-TDMA (UWC)
Circuit → packet, voice → data, narrowband → wideband
JR.S00 53
Wideband CDMA: MOPS? No, GOPS!
Single 384 kbps ARIB W-CDMA Channel

Function                   MIPS
Digital RRC Channel        3600
Searcher                   2100
RAKE                       1050
Maximal Ratio Combiner       24
Channel Estimator            12
AGC, AFC                     10
Deinterleaver                15
Turbo Coder                  90
TOTAL                      6901

Source: J. Kohnen et al. “Baseband Solution for WCDMA,” Proc. IEEE Communication Theory Workshop, May 1999, Aptos, USA.
JR.S00 54
HW Multistandard
Solutions
The common approach to hardware design involves
multiple ASICs to support each standard.
[Figure: a programmable DSP and control processor next to one digital hardwired ASIC + IF + RF chain per standard (unique combinations, analog)]
• Hardwired implementation is not scalable or upgradeable to new standards.
• This approach costs time in a time-to-market dominated world.
• Creating new chipsets for every technology combination critically challenges
available design resources!
JR.S00 55
SW Multistandard
Solution
Applying instruction-set processor architectures to
all baseband processing would be desirable...
[Figure: programmable DSP and control processor driving the IF/RF (analog) chains directly]
…but is simply not a good implementation for base stations:
- Unacceptably high cost per channel
- Unacceptably large power per channel
This is definitely not a viable implementation for terminals
JR.S00 56
The Law of Diminishing
Returns
• More transistors are being thrown at improving
general-purpose CPU and DSP performance
• Fundamental bounds are being pushed

– limits on instruction-level parallelism
– limits on memory system performance
• Returns per transistor are diminishing

– new architectures realizing only 2-3 instructions/clock
– increasingly large caches to hide DRAM latency
JR.S00 57
Embedded Logic
Evolution
– Increasing fixed-function hardwired content of systems
– Core+Logic becomes de-facto design architecture
– Move to deep sub-micron technology
» rapidly increasing product integration cycles
» increasingly constrained design resources
» sharp increases in cost of “trying out” an idea- NRE
– Design methodologies optimized for random logic, homogeneous architectures,
and lower speed signal processing (i.e. control-flow dominated systems)
– Verification issues dominate design cycle time
Growing Design Cycle Times At Odds With Shrinking Product Cycle Times
JR.S00 58
FPGA the Solution?
Cellular Handset Using
Current FPGA
JR.S00 59
Some Interesting
Observations
Don’t use more transistors to stretch general-purpose performance,
whether for CPUs, DSPs, or
reconfigurable logic.
Don’t use more time to design

dedicated hardwired solutions in cases where
mass customization
is what the market demands.
JR.S00 60
View The “Reconfigurability”
Problem From The System Level
What Are the Application-Specific Performance Needs?
– What are the applications targeted?

– What algorithms are essential to achieving the performance goals?
– What are the functions at the heart of these algorithms?
– Which functions yield poor price-performance with general-purpose
MOPS and system memory models?
– What is the embedded systems programmer’s model?
– Best performance at what cost:
» Area - instruction-level parallelism, memory hierarchy
» Power- energy requirements on a function basis
» Time- quality and ease of programming for app development
» Pain- forward opportunity vs backward compatibility
JR.S00 61
Successfully Using
Reconfigurability
Application-Specific Leverage
Focus first on applications and constituent algorithms, not the
silicon architecture!
Wireless Communications Transceiver Signal Processing
Minimize the hardware reconfigurability to a constrained set
Maximize the software parameterizability and ease of use of the
programmer’s model for flexibility
Define optimal architecture for efficient implementation
JR.S00 62
Application-Specific MOPS in Digital Communications
[Figure: RF/IF feeds digital downconversion and channelization, followed by a TDMA wideband signal processing engine, a CDMA wideband signal processing engine, and a wideband channel decoder engine, alongside a programmable DSP and a microprocessor]
JR.S00 63
Morphics’ DRL Architecture
Heterogeneous Multiprocessing Engine Using Application-Specific Reconfigurable Logic
[Figure: dataflow between large-granularity and small-granularity kernels; each kernel has inputs, an output, Clk, and Enable]
JR.S00 64
DRL Kernels
[Figure: a DRL kernel consists of a data memory, a parameterizable data sequencer, and a configurable ALU]
JR.S00 65
Mapping Software to Target
Architecture
[Figure: software is mapped onto two DRL kernels, each with a data memory, a parameterizable data sequencer, and a configurable ALU]
JR.S00 66
Programmer’s Guide
Document provided with each processor to enable application development,
configuration, and system control via host processor.
Includes complete
• description of each API function call
• system control functions
• variables, parameters
• coding examples, and performance realized (i.e. ROC curves)

    /* morphics soft API usage examples: */
    /* search a pilot set in search set */
    /* maintenance mode. */
    ...
    set_ptr = get_next_set();
    if (no_need_to_throttle())
    {
        search_set(set_ptr);
    }
    ...
    void search_set(PILOT_TYPE *pilot)
    {
        /* search threshold is assumed to be set in other places */
        morphics_searcher_set_win_size(pilot->win_size);
        morphics_searcher_set_pn(pilot->pn);
        morphics_searcher_set_int_len(pilot->int_len);
    }

    /* finger re-assignments */
    ...
    fing[i].pos = morphics_demod_get_fin_pos(i);
    ...
    distance = calc_fing_movement(&fing[i], new_pn);
    ...
    fing[i].slew = distance;
    fing[i].pn = new_pn;
    morphics_demod_set_fing_iq(i, fing[i].pn);
    while ( slew_not_done(morphics_demod_get_status()) )
    {
        morphics_demod_set_fing_slew(i, fing[i].slew);
    }
    ...
JR.S00 67
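To try the searcher part of the example without the hardware, the API calls can be stubbed out. The sketch below is not the Morphics API: only the function names and the three `PILOT_TYPE` fields come from the slide, while the field widths, the argument types, and the `searcher_regs` structure that records the writes are assumptions made for illustration.

```c
#include <stdint.h>

/* Assumed layout: the slide only shows these three fields being read. */
typedef struct {
    uint16_t win_size;
    uint16_t pn;
    uint16_t int_len;
} PILOT_TYPE;

/* Hypothetical stand-in for the searcher kernel's parameter registers. */
static struct {
    uint16_t win_size;
    uint16_t pn;
    uint16_t int_len;
    unsigned writes;      /* number of parameter writes seen */
} searcher_regs;

/* Stubs for the calls named on the slide; real signatures are not given. */
static void morphics_searcher_set_win_size(uint16_t v)
{
    searcher_regs.win_size = v;
    searcher_regs.writes++;
}

static void morphics_searcher_set_pn(uint16_t v)
{
    searcher_regs.pn = v;
    searcher_regs.writes++;
}

static void morphics_searcher_set_int_len(uint16_t v)
{
    searcher_regs.int_len = v;
    searcher_regs.writes++;
}

/* Same body as on the slide: the kernel is parameterized, not rewired. */
static void search_set(PILOT_TYPE *pilot)
{
    /* search threshold is assumed to be set in other places */
    morphics_searcher_set_win_size(pilot->win_size);
    morphics_searcher_set_pn(pilot->pn);
    morphics_searcher_set_int_len(pilot->int_len);
}
```

The point the stub makes visible is the programming model: the host reprograms a searcher with three parameter writes per pilot, and the hardware configuration itself stays hidden behind the API.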
Key Pieces of Design
Methodology
System-level Profiling
• Analyze sequences of operations (arithmetic, memory access, etc)
• Analyze communication bottlenecks
• Key flexible parameters (algorithm v architecture parameters)
Architecture-level Profiling
• ALU/kernel definition (sequences of operators)
• Memory profile
• Type of configurability required for flexibility
• Macro-sequencer development
Implementation
• SW- programmer’s model developed at architecture specification stage
• SW- API proven out via behavioral models & demonstrator hardware
• VLSI-focus on regular predictable timing and routability
• VLSI- embedded reconfigurability in an ASIC flow
JR.S00 68
CDMA Modem Analysis
JR.S00 69
MOPS Breakdown Analysis
JR.S00 70
Prototype Demonstrator
RISC
Microprocessor
Configurable Kernel(s)
Data Router
JR.S00 71
Summary
• Configurable computing is finding its way into the
embedded processor space
• Best suited (so far) for
– Flexible I/O and Interface functionality
– providing task-level acceleration of “parameterizable” functions
• Improvement of IP seems limited
• Software flow still subject to improvement
• Might become more interesting with the emergence of
low-current devices (TFT, organic transistors,
molecular computing)
DO NOT FORGET CONFIGURATION OVERHEAD
JR.S00 72
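The closing warning can be made concrete with a simple amortization model, in which the configuration time is charged once and spread over the calls made before the next reconfiguration. This is a back-of-the-envelope sketch, not a model taken from the lecture; the function names and the single-configuration assumption are illustrative.

```c
/* Speedup of an accelerated kernel once configuration time is charged.
 * t_sw:     time per call in software
 * t_hw:     time per call on the configured array
 * t_config: one-time cost of loading the configuration
 * calls:    calls made before the array is reconfigured
 * All times are in the same unit. Returns 0 if no time is spent. */
static double effective_speedup(double t_sw, double t_hw,
                                double t_config, unsigned calls)
{
    double sw_total = t_sw * calls;
    double hw_total = t_config + t_hw * calls;

    if (hw_total <= 0.0)
        return 0.0;
    return sw_total / hw_total;
}

/* Smallest number of calls for which the array beats software,
 * or 0 if it never does (the array is no faster per call). */
static unsigned long break_even_calls(double t_sw, double t_hw,
                                      double t_config)
{
    if (t_hw >= t_sw)
        return 0;
    /* need calls * t_sw > t_config + calls * t_hw */
    return (unsigned long)(t_config / (t_sw - t_hw)) + 1;
}
```

With a per-call gain of 8 time units and a configuration cost of 80, the array only pays off from the eleventh call on. A one-second reload, as quoted for PRISM-I, pushes that break-even point out very far unless the kernel is called many times per configuration.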
