FPGA-Based System Design - Wayne Wolf (1) - 1

Overview
 Why VLSI?
 Moore’s Law.
 Why FPGAs?
 The VLSI and system design process.
FPGA-Based System Design: Chapter 1 Copyright © 2004 Prentice Hall PTR

Why VLSI?
 Integration improves the design:

– lower parasitics = higher speed;
– lower power;
– physically smaller.
 Integration reduces manufacturing cost:
(almost) no manual assembly.

VLSI and you
 Microprocessors:
– personal computers;
– microcontrollers.
 DRAM/SRAM/flash.
 Audio/video and other consumer systems.
 Telecommunications.

Moore’s Law
 Gordon Moore: co-founder of Intel.

 Predicted that number of transistors per chip
would grow exponentially (double every 18
months).
 Exponential improvement in technology is a
natural trend: steam engines, dynamos,
automobiles.

Moore’s Law plot

The cost of fabrication
 Current cost: $2-3 billion.

 Typical fab line occupies about 1 city block,
employs a few hundred people.
 New fabrication processes require 6-8
month turnaround.
 Most profitable period is first 18 months-2
years.

Cost factors in ICs
 For large-volume ICs:

– packaging is largest cost;
– testing is second-largest cost.
 For low-volume ICs, design costs may
swamp all manufacturing costs.
– $10 million-$20 million.

Mask cost vs. line width
[Figure: mask cost ($, axis from 0 to 1,000,000) plotted against line width (0.25, 0.18, 0.13, 0.09 micron).]

Field-programmable gate arrays
 FPGAs are programmable logic devices:

– Logic elements + interconnect.
– Provide multi-level logic.
[Figure: array of logic elements (LEs) connected by an interconnect network.]

FPGAs and VLSI
 FPGAs are standard parts:

– Pre-manufactured.
– Don’t worry (much) about physical design.
 Custom silicon:
– Tailored to your application.
– Generally lower power consumption.

Standard parts vs. custom
 Do you build your system with an FPGA or
with custom silicon?
– FPGAs have shorter design cycle.
– FPGAs have no manufacturing delay.
– FPGAs reduce inventory.
– FPGAs are slower, larger, more power-hungry.

Challenges in system design
 Multiple levels of abstraction: logic to
CPUs.
 Multiple and conflicting constraints: low
cost and high performance are often at odds.
 Short design time: Late products are often
irrelevant.

The system design process
 May be part of larger product design.

 Major levels of abstraction:
– specification;
– architecture; FPGA-based system design
– logic design;
– circuit design;
– layout.

Dealing with complexity
 Divide-and-conquer: limit the number of
components you deal with at any one time.
 Group several components into larger
components:
– transistors form gates;
– gates form functional units;
– functional units form processing elements;
– etc.
Hierarchical name
 Interior view of a component:

– components and wires that make it up.
 Exterior view of a component = type:
– body;
– pins.
[Figure: full adder with inputs a, b, cin and outputs sum, cout.]

Instantiating component types
 Each instance has its own name:

– add1 (type full adder)
– add2 (type full adder).
 Each instance is a separate copy of the type:
[Figure: two instances of the full adder type, Add1 and Add2; each has its own pins a, b, cin, sum, cout, named Add1.a, Add2.a, and so on.]

A hierarchical logic design
box1 box2 x

Net lists and component lists
 Net list:
net1: top.in1 i1.in
net2: i1.out xxx.B
topin1: top.n1 xxx.xin1
topin2: top.n2 xxx.xin2
botin1: top.n3 xxx.xin3
net3: xxx.out i2.in
outnet: i2.out top.out

 Component list:
top: in1=net1 n1=topin1 n2=topin2 n3=botin1 out=outnet
i1: in=net1 out=net2
xxx: xin1=topin1 xin2=topin2 xin3=botin1 B=net2 out=net3
i2: in=net3 out=outnet
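
The net list and the component list describe the same connectivity from two viewpoints, so one can be derived from the other. The following is a minimal Python sketch of that inversion, not a tool from the course. The dictionary layout and the name `component_list` are illustrative assumptions, and the data assumes the garbled slide entries read `i1.in` and `n3=botin1`.

```python
from collections import defaultdict

# Net list: each net names the component.pin terminals it connects.
# Assumption: the garbled slide entries read "i1.in" and "top.n3 on botin1".
NET_LIST = {
    "net1": ["top.in1", "i1.in"],
    "net2": ["i1.out", "xxx.B"],
    "topin1": ["top.n1", "xxx.xin1"],
    "topin2": ["top.n2", "xxx.xin2"],
    "botin1": ["top.n3", "xxx.xin3"],
    "net3": ["xxx.out", "i2.in"],
    "outnet": ["i2.out", "top.out"],
}


def component_list(net_list):
    """Invert a net list: map each component to a {pin: net} dictionary."""
    components = defaultdict(dict)
    for net, terminals in net_list.items():
        for terminal in terminals:
            component, pin = terminal.split(".")
            components[component][pin] = net
    return dict(components)


COMPONENTS = component_list(NET_LIST)
for name, pins in COMPONENTS.items():
    # One line per component, in the same "pin=net" style as the slide.
    print(name + ":", " ".join(f"{pin}={net}" for pin, net in pins.items()))
```

Each terminal appears on exactly one net, so the inversion loses no information; running it in the other direction would rebuild the net list.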

Component hierarchy
top
i1 xxx i2

Hierarchical names
 Typical hierarchical name:

– top/i1.foo
component pin

Layout and its abstractions
 Layout for dynamic latch:

Stick diagram

Transistor schematic

Mixed schematic
inverter

Levels of abstraction
 Specification: function, cost, etc.

 Architecture: large blocks.
 Logic: gates + registers.
 Circuits: transistor sizes for speed, power.
 Layout: determines parasitics.

Circuit abstraction
 Continuous voltages and time:

Digital abstraction
 Discrete levels, discrete time:

Register-transfer abstraction
 Abstract components, abstract data types:
[Figure: adders operating on 4-bit words (0010, 0001, 0011, 0100).]

Top-down vs. bottom-up design
 Top-down design adds functional detail.

– Create lower levels of abstraction from upper
levels.
 Bottom-up design creates abstractions from
low-level behavior.
 Good design needs both top-down and
bottom-up efforts.

Design abstractions
Representation        Level               Typical metrics
English               specification       function, cost
Executable program    behavior            throughput, design time
Sequential machines   register-transfer   function units, clock cycles
Logic gates           logic               literals, logic depth
Transistors           circuit             nanoseconds
Rectangles            layout              microns

FPGA design
 FPGA manufacturer creates an FPGA

fabric; system designer uses the fabric.
 FPGA fabric design issues:
– Study sample user designs.
– Select interconnect topology.
– Create logic element structures.
– Design circuits, layout.

Why do we care about layout?
 We won’t design layout.

 Layout determines:
– Logic delay.
– Interconnect delay.
– Energy consumption.
 We want to understand sources of FPGA
characteristics.

Design validation
 Must check at every step that errors haven’t
been introduced; the longer an error remains,
the more expensive it becomes to remove it.
 Forward checking: compare results of less-
and more-abstract stages.
 Back annotation: copy performance
numbers to earlier stages.

Topics
 Basic fabrication steps.

 Transistor structures.
 Basic transistor behavior.
 Latch up.

Fabrication processes
 IC built on silicon substrate:

– some structures diffused into substrate;
– other structures built on top of substrate.
 Substrate regions are doped with n-type and p-
type impurities. (n+ = heavily doped)
 Wires made of polycrystalline silicon (poly),
multiple layers of aluminum/copper (metal).
 Silicon dioxide (SiO2) is insulator.

Simple cross section
[Figure: cross section showing the substrate with n+ and p+ diffusions, a poly transistor gate, vias, metal1 through metal3, and SiO2 insulation.]

Photolithography
Mask patterns are put on wafer using photo-

sensitive material:

Process steps
First place tubs to provide properly-doped

substrate for n-type, p-type transistors:
p-tub n-tub
substrate

Process steps, cont’d.
Pattern polysilicon before diffusion regions:
poly gate oxide poly
p-tub n-tub

Process steps, cont’d
Add diffusions, performing self-masking:
poly poly
n+ p-tub n+ p+ n-tub p+

Process steps, cont’d
Start adding metal layers:
metal 1 metal 1
poly vias poly
n+ p-tub n+ p+ n-tub p+

Level 2 metal
 Polish SiO2 before adding metal 2:

metal 2
metal 1 metal 1
poly vias poly
n+ p-tub n+ p+ p-tub p+

Transistor structure
n-type transistor:

Transistor layout
n-type (tubs may vary):

Drain current characteristics

Drain current
 Linear region ($V_{ds} < V_{gs} - V_t$):
– $I_d = k' (W/L) [(V_{gs} - V_t) V_{ds} - 0.5 V_{ds}^2]$
 Saturation region ($V_{ds} \ge V_{gs} - V_t$):
– $I_d = 0.5 k' (W/L) (V_{gs} - V_t)^2$

90 nm transconductances
Typical parameters:
 n-type:
– $k'_n = 13\ \mu\mathrm{A/V^2}$
– $V_{tn} = 0.14$ V
 p-type:
– $k'_p = 7\ \mu\mathrm{A/V^2}$
– $V_{tp} = -0.21$ V

Current through a transistor
Use 90 nm parameters. Let W/L = 3/2.
Measure at boundary between linear and
saturation regions.
 $V_{gs} = 0.25$ V:
$I_d = 0.5 k' (W/L)(V_{gs} - V_t)^2 = 0.12\ \mu\mathrm{A}$
 $V_{gs} = 1$ V:
$I_d = 7.2\ \mu\mathrm{A}$

Basic transistor parasitics
 Gate to substrate, also gate to source/drain.

 Source/drain capacitance, resistance.

Basic transistor parasitics, cont’d
 Gate capacitance Cg. Determined by active

area.
 Source/drain overlap capacitances Cgs, Cgd.
Determined by source/gate and drain/gate
overlaps. Independent of transistor L.
– Cgs = Col W
 Gate/bulk overlap capacitance.

Latch-up
 CMOS ICs have parasitic silicon-controlled
rectifiers (SCRs).
 When powered up, SCRs can turn on,
creating low-resistance path from power to
ground. Current can destroy chip.
 Early CMOS problem. Can be solved with
proper circuit/layout structures.

Parasitic SCR
circuit I-V behavior

Parasitic SCR structure

Solution to latch-up
Use tub ties to connect tub to power rail. Use

enough to create low-voltage connection.

Topics
 Logic gate delay.

 Logic gate power consumption.
 Driving large loads.

Logic levels
 Solid logic 0/1 defined by VSS/VDD.

 Inner bounds of logic values VL/VH are not
directly determined by circuit properties, as
in some other logic families.
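[Figure: voltage axis from VSS to VDD; logic 0 lies between VSS and VL, logic 1 between VH and VDD, and the region between VL and VH is unknown.]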

Logic level matching
 Levels at output of one gate must be

sufficient to drive next gate.

Transfer characteristics
 Transfer curve shows static input/output

relationship—hold input voltage, measure
output voltage.

Inverter transfer curve

Logic thresholds
 Choose threshold voltages at points where

slope of transfer curve = -1.
 Inverter has a high gain between VIL and
VIH points, low gain at outer regions of
transfer curve.
 Note that logic 0 and 1 regions are not equal
sized—in this case, high pullup resistance
leads to smaller logic 1 range.

Noise margin
 Noise margin = voltage difference between

output of one gate and input of next. Noise
must exceed noise margin to make second
gate produce wrong output.
 In static gates, the $t = \infty$ output voltages are $V_{DD}$ and
$V_{SS}$, so noise margins are $V_{DD} - V_{IH}$ and
$V_{IL} - V_{SS}$.

Delay
 Assume ideal input (step), RC load.

Delay assumptions
 Assume that only one transistor is on at a

time. This gives two cases:
– rise time, pullup on;
– fall time, pullup off.
 Assume resistor model for transistor.
Ignores saturation region and
mischaracterizes linear region, but results
are acceptable.

Current through transistor
 Transistor starts in saturation region, then

moves to linear region.

Resistive model for transistor
 Average V/I at two voltages:

– maximum output voltage
– middle of linear region
 Voltage is Vds, current is given Id at that
drain voltage. Step input means that Vgs =
VDD always.

Resistive approximation

Ways of measuring gate delay
 Delay: time required for gate’s output to

reach 50% of final value.
 Transition time: time required for gate’s
output to reach 10% (logic 0) or 90% (logic
1) of final value.

Inverter delay circuit
 Load is resistor + capacitor, driver is

resistor.

Inverter delay with τ model
 τ model: gate delay based on RC time
constant τ.
 $V_{out}(t) = V_{DD} \exp\{-t / [(R_n + R_L) C_L]\}$
 $t_f = 2.2 R C_L$
 For pullup time, use pullup resistance.

τ model inverter delay
 90 nm process:
– $R_n = 11.1$ kΩ
– $C_L = 0.12$ fF
 So
– $t_f = 2.2 \times 11.1 \times 10^{3} \times 0.12 \times 10^{-15} = 2.9$ ps.
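
The factor 2.2 comes from the exponential discharge: the time to fall from 90% to 10% of the starting voltage is RC ln 9, about 2.2 RC. A minimal Python sketch follows, assuming a 1 V supply (a hypothetical value, not given on the slide) and taking the load resistance as zero; the function names are illustrative.

```python
import math

R_N = 11.1e3     # ohms, pulldown resistance of the minimum-size 90 nm inverter
C_L = 0.12e-15   # farads, load capacitance
VDD = 1.0        # volts; assumed supply voltage, not stated on the slide


def v_out(t, r=R_N, c=C_L, vdd=VDD):
    """Output voltage while the capacitor discharges through resistance r."""
    return vdd * math.exp(-t / (r * c))


def fall_time(r, c):
    """90%-to-10% transition time: RC * ln(0.9 / 0.1) = RC * ln 9."""
    return r * c * math.log(9.0)


T_F = fall_time(R_N, C_L)
print(f"ln 9 = {math.log(9.0):.3f}")
print(f"tf   = {T_F * 1e12:.2f} ps")
```

The result does not depend on the assumed supply voltage, because both thresholds are fractions of the same starting value.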

Quality of RC approximation

Quality of step input
approximation

Results of using small pullup

Other models
 Current source model (used in power/delay
studies):
– $t_f = C_L (V_{DD} - V_{SS}) / I_d = C_L (V_{DD} - V_{SS}) / [0.5 k' (W/L) (V_{DD} - V_{SS} - V_t)^2]$
 Fitted model: fit curve to measured circuit
characteristics.

Body effect and gates
 Difference between source and substrate

voltages causes body effect.
 Source for gates in middle of network may
not equal substrate:
0
Source above VSS
0

Body effect and gate input
ordering
 To minimize body effect, put early arriving

signals at transistors closest to power
supply:
Early arriving signal

Power consumption analysis
 Dynamic power consumption comes from

switching behavior.
 Static power dissipation comes from
leakage currents.
 Surprising result: dynamic power
consumption is independent of the sizes of
the pullups and pulldowns.

Power consumption circuit
 Input is square wave.

Power consumption
 A single cycle requires one charge and one
discharge of capacitor: $E = C_L (V_{DD} - V_{SS})^2$.
 Clock frequency $f = 1/t$.
 Energy $E = C_L (V_{DD} - V_{SS})^2$.
 Power $= E \times f = f C_L (V_{DD} - V_{SS})^2$.
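
A short Python sketch makes the scaling concrete. It uses the 0.12 fF load quoted elsewhere in these slides; the 1 V supply and the clock frequencies are assumed values chosen only for illustration, and the function names are hypothetical.

```python
def switching_energy(c_load, vdd, vss=0.0):
    """Energy per cycle (one charge plus one discharge): C (VDD - VSS)^2."""
    return c_load * (vdd - vss) ** 2


def dynamic_power(c_load, vdd, freq, vss=0.0):
    """Dynamic power: energy per cycle times cycles per second."""
    return freq * switching_energy(c_load, vdd, vss)


C_L = 0.12e-15  # farads, minimum-size inverter load from the delay example
VDD = 1.0       # volts; assumed, not stated on the slide

for freq in (0.5e9, 1.0e9):  # assumed clock frequencies
    energy = switching_energy(C_L, VDD)
    power = dynamic_power(C_L, VDD, freq)
    print(f"f = {freq / 1e9:.1f} GHz: E = {energy * 1e15:.2f} fJ/cycle, "
          f"P = {power * 1e9:.0f} nW")
```

Halving the frequency halves the power but leaves the energy per cycle unchanged, and no transistor resistance appears anywhere in the calculation.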

Observations on power
consumption
 Resistance of pullup/pulldown drops out of

energy calculation.
 Power consumption depends on operating
frequency.
– Slower-running circuits use less power (but not
less energy to perform the same computation).

Speed-power product
 Also known as power-delay product.

 Helps measure quality of a logic family.
 For static CMOS:
– $SP = P/f = C V^2$.
 Static CMOS speed-power product is
independent of operating frequency.
– Voltage scaling depends on this fact.
– Considers only dynamic power.

Sources of leakage
 Weak inversion current (subthreshold current)

 Gate-induced drain leakage at the gate/drain
overlap.
 Drain-induced barrier lowering of the source.
 Punchthrough currents.
 Reverse-biased pn junctions.
 etc.

Subthreshold leakage current
 Strong function of the threshold voltage Vt.

 Important in 90 nm and below technologies.
 Can adjust threshold by changing substrate
bias.
 Leakage through a chain of transistors is
lower than leakage through a single
transistor.

Driving large loads
 Sometimes, large loads must be driven:

– off-chip;
– long wires on-chip.
 Sizing up the driver transistors only pushes
back the problem—driver now presents
larger capacitance to earlier stage.

Cascaded driver circuit

Optimal sizing
 Use a chain of inverters, each stage has
transistors α times larger than previous stage.
 Minimize total delay through driver chain:
– $t_{tot} = n (C_{big}/C_g)^{1/n} t_{min}$.
 Optimal number of stages:
– $n_{opt} = \ln(C_{big}/C_g)$.
 Driver sizes are exponentially tapered with
size ratio α.
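
The sizing rule can be tried on a concrete load. The Python sketch below uses the minimum-size inverter values quoted elsewhere in these slides (0.12 fF, 2.9 ps); the 100 fF load is an assumed value, and the function names are illustrative.

```python
import math


def chain_delay(n, c_big, c_g, t_min):
    """Total delay of an n-stage tapered chain: n (Cbig/Cg)^(1/n) tmin."""
    return n * (c_big / c_g) ** (1.0 / n) * t_min


def optimal_stages(c_big, c_g):
    """Real-valued optimum n = ln(Cbig/Cg); round to get a stage count."""
    return math.log(c_big / c_g)


C_G = 0.12e-15    # farads, minimum-size inverter input capacitance
T_MIN = 2.9e-12   # seconds, minimum-size inverter delay
C_BIG = 100e-15   # farads; assumed load, chosen only for illustration

n_real = optimal_stages(C_BIG, C_G)
n = max(1, round(n_real))
alpha = (C_BIG / C_G) ** (1.0 / n)  # per-stage size ratio

print(f"n_opt = {n_real:.2f}, use {n} stages, size ratio = {alpha:.2f}")
for stages in (1, n - 1, n, n + 1):
    delay_ps = chain_delay(stages, C_BIG, C_G, T_MIN) * 1e12
    print(f"{stages} stage(s): {delay_ps:.1f} ps")
```

At the real-valued optimum the size ratio equals e; rounding n to an integer moves it slightly away from that value.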

Topics
 Wire and via structures.

 Wire parasitics.

Wires and vias
metal 3
metal 2
vias
metal 1
poly poly
n+ p-tub n+

Metal migration
 Current-carrying capacity of metal wire

depends on cross-section. Height is fixed,
so width determines current limit.
 Metal migration: when current is too high,
electron flow pushes around metal grains.
Higher resistance increases metal migration,
leading to destruction of wire.

Metal migration problems and
solutions
 Marginal wires will fail after a small

operating period—infant mortality.
 Normal wires must be sized to accommodate
maximum current flow:
$I_{max} = 1.5$ mA/μm of metal width.
 Mainly applies to VDD/VSS lines.

Diffusion wire capacitance
 Capacitances formed by p-n junctions:

sidewall
capacitances depletion region
n+ (ND)
bottomwall
substrate (NA) capacitance

Depletion region capacitance
 Zero-bias depletion capacitance:
– $C_{j0} = \epsilon_{si} / x_{d0}$.
 Depletion region width:
– $x_{d0} = \sqrt{(1/N_A + 1/N_D)\, 2 \epsilon_{si} V_{bi} / q}$.
 Junction capacitance is function of voltage
across junction:
– $C_j(V_r) = C_{j0} / \sqrt{1 + V_r / V_{bi}}$

Poly/metal wire capacitance
 Two components:
– parallel plate;
– fringe.
fringe
plate

Metal coupling capacitances
 Can couple to adjacent wires on same layer,

wires on above/below layers:
metal 2
metal 1 metal 1

Wire resistance
 Resistance of any size square is constant:

Metal mean-time-to-failure
 MTF for metal wires = time required for

50% of wires to fail.
 Depends on current density:
– proportional to $j^{-n} e^{Q/kT}$
– j is current density
– n is constant between 1 and 3
– Q is diffusion activation energy

Skin effect
 At low frequencies, most of copper

conductor’s cross section carries current.
 As frequency increases, current moves to
skin of conductor.
– Back EMF induces counter-current in body of
conductor.
 Skin effect most important at gigahertz
frequencies.

Skin effect, cont’d
[Figure: current distribution at low frequency and at high frequency, for an isolated conductor and for a conductor and ground.]

Skin depth
 Skin depth is depth at which conductor’s
current is reduced to 1/e = 37% of surface
value:
 $\delta = 1/\sqrt{\pi f \mu \sigma}$
– f = signal frequency
– μ = magnetic permeability
– σ = wire conductivity
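
The formula is easy to evaluate. This Python sketch assumes a copper wire with the free-space permeability and a textbook bulk conductivity of 5.8e7 S/m; neither value comes from the slides, and on-chip copper may differ.

```python
import math

MU_0 = 4 * math.pi * 1e-7  # H/m, permeability of free space (copper is non-magnetic)
SIGMA_CU = 5.8e7           # S/m; assumed textbook value for bulk copper


def skin_depth(freq_hz, mu=MU_0, sigma=SIGMA_CU):
    """Skin depth in metres: 1 / sqrt(pi * f * mu * sigma)."""
    return 1.0 / math.sqrt(math.pi * freq_hz * mu * sigma)


for freq in (1e6, 1e9, 4e9):
    print(f"f = {freq:.0e} Hz: skin depth = {skin_depth(freq) * 1e6:.2f} um")
```

Skin depth falls as the inverse square root of frequency, so quadrupling the frequency halves the depth.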

Effect on resistance
 Low frequency resistance of wire:
– $R_{dc} = 1/(\sigma w t)$
 High frequency resistance with skin effect:
– $R_{hf} = 1/[2 \sigma \delta (w + t)]$
 Resistance per unit length:
– $R_{ac} = \sqrt{R_{dc}^2 + k R_{hf}^2}$
 Typically k = 1.2.

Wire capacitance and resistance
Layer     Capacitance to ground (aF/μm)   Coupling capacitance (aF/μm)   Resistance/length (Ω/μm)
Metal 3   18                              9                              0.2
Metal 2   47                              24                             0.3
Metal 1   76                              36                             0.3

Gate delay vs. wire delay
 Minimum-size inverter delay: 2.9 ps

 Length of wire with equal delay: assume
wire with capacitance equal to inverter
input capacitance = 0.12 fF.
– Metal 3 length is 6.7 μm.
– About 75 times width of minimum-size
transistor.
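
The 6.7 μm figure follows from dividing the inverter input capacitance by the wire's capacitance per unit length. A small Python sketch, assuming the capacitance-to-ground values are in aF/μm (the reading that reproduces the quoted length); the names are illustrative.

```python
C_INVERTER_AF = 120.0  # 0.12 fF inverter input capacitance, in attofarads

# Capacitance to ground per unit length, assumed to be in aF/um.
CAP_TO_GROUND_AF_PER_UM = {"metal 3": 18.0, "metal 2": 47.0, "metal 1": 76.0}


def equal_cap_length_um(layer):
    """Wire length whose ground capacitance equals the inverter input capacitance."""
    return C_INVERTER_AF / CAP_TO_GROUND_AF_PER_UM[layer]


for layer in CAP_TO_GROUND_AF_PER_UM:
    print(f"{layer}: {equal_cap_length_um(layer):.1f} um")
```

The lower metal layers have more capacitance per unit length, so an even shorter wire on metal 1 or metal 2 matches the gate load.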

Topics
 Driving long wires.

Wire delay
 Wires have parasitic resistance, capacitance.

 Parasitics start to dominate in deep-
submicron wires.
 Distributed RC introduces time of flight
along wire into gate-to-gate delay.

RC transmission line
 Assumes that dominant capacitive coupling

is to ground, inductance can be ignored.
 Elemental values are ri, ci.

Elmore delay
 Elmore defined delay through linear

network as the first moment of the network
impulse response.

RC Elmore delay
 Can be computed as sum of sections:
$\delta_E = \sum_{i=1}^{n} r (n - i) c = 0.5\, r c\, n (n - 1)$
 Resistor ri must charge all downstream
capacitors.
 Delay grows as square of wire length.
 Minimizing rc product minimizes growth of
delay with increasing wire length.
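
The quadratic growth can be seen by evaluating the sum directly. The Python sketch below treats each section as 1 μm of metal 3 and takes r = 0.2 Ω and c = 18 aF per section; that per-section reading of the wire parameters is an assumption, and the function names are illustrative.

```python
def elmore_delay(r, c, n):
    """Elmore delay of n identical RC sections: sum over i = 1..n of r (n - i) c."""
    return sum(r * (n - i) * c for i in range(1, n + 1))


def elmore_closed_form(r, c, n):
    """Closed form of the same sum: 0.5 r c n (n - 1)."""
    return 0.5 * r * c * n * (n - 1)


R_SECTION = 0.2     # ohms per section; assumed 1 um of metal 3
C_SECTION = 18e-18  # farads per section; assumed 1 um of metal 3

for n in (100, 200, 400):
    summed = elmore_delay(R_SECTION, C_SECTION, n)
    closed = elmore_closed_form(R_SECTION, C_SECTION, n)
    print(f"n = {n}: sum = {summed:.3e} s, closed form = {closed:.3e} s")
```

Doubling the number of sections roughly quadruples the delay, which is the square-law growth stated above.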

RC transmission lines
 More complex analysis.

 Step response:
– $V(t) \approx 1 + K_1 \exp\{-\sigma_1 t / RC\}$.

Wire sizing
 Wire length is determined by layout

architecture, but we can choose wire width
to minimize delay.
 Wire width can vary with distance from
driver to adjust the resistance which drives
downstream capacitance.

Optimal wiresizing
 Wire with minimum delay has an

exponential taper.
 Optimal tapering improves delay by about
8%.

Approximate tapering
Can approximate optimal tapering with a few

rectangular segments.

Tapering of wiring trees
Different branches of tree can be set to

different lengths to optimize delay.
source
sink 1
sink 2

Spanning tree
A spanning tree has segments that go directly

between sources and sinks.
source
sink 1
sink 2

Steiner tree
A Steiner point is an intermediate point for the

creation of new branches.
source
Steiner point
sink 1
sink 2

RC trees
Generalization of RC transmission line.

Buffer insertion in RC
transmission lines
 Assume RC transmission line.

 Assume R0 is driver’s resistance, C0 is
driver’s input capacitance.
 Want to divide line into k sections of length
l. Each buffer is of size h.

Buffer insertion analysis
 Assume h = 1:
– $k = \sqrt{(0.4 R_{int} C_{int}) / (0.7 R_0 C_0)}$
 Assume arbitrary h:
– $k = \sqrt{(0.4 R_{int} C_{int}) / (0.7 R_0 C_0)}$
– $h = \sqrt{(R_0 C_{int}) / (R_{int} C_0)}$
– $T_{50\%} = 2.5 \sqrt{R_0 C_0 R_{int} C_{int}}$

Buffer insertion example
 10x minimum-size inverter drives metal 3
wire of 5000 λ x 3 λ.
– Driver: $R_0 = 1.11$ kΩ, $C_0 = 1.2$ fF
– Wire: $R_{int} = 100$ Ω, $C_{int} = 135$ fF.
 Then
– k = 2.4, approximately 2.
– h = 35.4.
– $T_{50\%} = 11 \times 10^{-12}$ sec
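
The example can be reproduced with a few lines of Python. This sketch assumes the driver resistance is 1.11 kΩ, one tenth of the 11.1 kΩ minimum-size value, since a 10x inverter has ten times the input capacitance and one tenth the resistance; that value reproduces the quoted k, h, and delay. The function name `buffer_insertion` is illustrative.

```python
import math


def buffer_insertion(r0, c0, r_int, c_int):
    """Return (k, h, t50): number of sections, buffer size, and 50% delay."""
    k = math.sqrt((0.4 * r_int * c_int) / (0.7 * r0 * c0))
    h = math.sqrt((r0 * c_int) / (r_int * c0))
    t50 = 2.5 * math.sqrt(r0 * c0 * r_int * c_int)
    return k, h, t50


# Driver: 10x minimum-size inverter (R0 assumed to be 11.1 kOhm / 10).
# Wire: metal 3 values from the example.
K, H, T50 = buffer_insertion(r0=1.11e3, c0=1.2e-15, r_int=100.0, c_int=135e-15)

print(f"k   = {K:.2f} sections (round to {round(K)})")
print(f"h   = {H:.1f}")
print(f"T50 = {T50 * 1e12:.1f} ps")
```

The number of sections must be an integer, so the real-valued k is rounded after the calculation.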

RC crosstalk
 Crosstalk slows down signals: it increases
settling noise.
 Two nets in analysis:
– aggressor net causes interference;
– victim net is interfered with.

Aggressors and victims
aggressor net
victim net

Wire cross-section
 Victim net is surrounded by two aggressors.

[Figure: cross-section of a victim wire between two aggressor wires above the substrate, with dimensions labeled S, W, and T.]

Crosstalk delay vs. wire aspect ratio
[Figure: relative RC delay plotted against increasing aspect ratio, with curves for increased spacing.]

Crosstalk delay
 There is an optimum wire width for any
given wire spacing, at bottom of U curve.
 Optimum width increases as spacing
between wires increases.

Topics
 Latches and flip-flops.

 RAMs and ROMs.

Register
 Stores a value as controlled by clock.

 May have load signal, etc.
 In CMOS, memory is created by:
– capacitance (dynamic);
– feedback (static).

Variations in registers
 Form of required clock signal.

 How behavior of data input around clock
affects the stored value.
 When the stored value is presented to the
output.
 Whether there is ever a combinational path
from input to output.

Register terminology
 Latch: transparent when internal memory is

being set from input.
 Flip-flop: not transparent—reading input
and changing output are separate events.

Clock terminology
 Clock edge: rising or falling transition.

 Duty cycle: fraction of clock period for
which clock is active (e.g., for active-low
clock, fraction of time clock is 0).

Register parameters
 Setup time: time before clock during which

data input must be stable.
 Hold time: time after clock event for which
data input must remain stable.
clock
data

Dynamic latch
Stores charge on inverter gate capacitance:

Latch characteristics
 Uses complementary transmission gate to

ensure that storage node is always strongly
driven.
 Latch is transparent when transmission gate
is closed.
 Storage capacitance comes primarily from
inverter gate capacitance.

Latch operation
 φ = 0: transmission gate is off, inverter
output is determined by storage node.
 φ = 1: transmission gate is on, inverter
output follows D input.
 Setup and hold times determined by
transmission gate—must ensure that value
stored on transmission gate is solid.

Stored charge leakage
 Stored charge leaks away due to reverse-

bias leakage current.
 Stored value is good for about 1 ms.
 Value must be rewritten to be valid.
 If not loaded every cycle, must ensure that
latch is loaded often enough to keep data
valid.

Multiplexer dynamic latch

Non-dynamic latches
 Must use feedback to restore value.

 Some latches are static on one phase
(pseudo-static)—load on one phase, activate
feedback on other phase.

Recirculating latch
Static on one phase:

Edge-triggered flip-flop
D Q

Master-slave operation
 φ = 0: master latch is disabled; slave latch is
enabled, but master latch output is stable, so
output does not change.
 φ = 1: master latch is enabled, loading value
from input; slave latch is disabled,
maintaining old output value.
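
The two phases can be sketched behaviorally in Python. The classes below are illustrative, not a circuit model: each latch is reduced to "copy the input while enabled, hold otherwise", and `phi` stands for the clock φ.

```python
class Latch:
    """Level-sensitive latch: transparent while enabled, holds value otherwise."""

    def __init__(self, value=0):
        self.q = value

    def update(self, d, enable):
        if enable:
            self.q = d
        return self.q


class MasterSlaveFlipFlop:
    """Master latch enabled on phi = 1, slave latch enabled on phi = 0."""

    def __init__(self):
        self.master = Latch()
        self.slave = Latch()

    def step(self, d, phi):
        # phi = 1: master loads the input while the slave holds the old output.
        self.master.update(d, enable=(phi == 1))
        # phi = 0: master holds, so the slave copies a stable value.
        return self.slave.update(self.master.q, enable=(phi == 0))


ff = MasterSlaveFlipFlop()
for d, phi in [(1, 1), (0, 1), (1, 1), (0, 0), (0, 1), (1, 0)]:
    print(f"D = {d}, phi = {phi} -> Q = {ff.step(d, phi)}")
```

In this sketch the output only changes when φ goes from 1 to 0, and it takes the value D held at that moment, so there is never a combinational path from input to output.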

High-density memory
architecture

Memory operation
 Address is divided into row, column.

– Row may contain full word or more than one
word.
 Selected row drives/senses bit lines in
columns.
 Amplifiers/drivers read/write bit lines.

Read-only memory (ROM)
 ROM core is organized as NOR gates—

pulldown transistors of NOR determine
programming.
 Mask-programmable ROM uses pulldowns
to determine ROM contents.

Flash memory
 Flash: electrically erasable PROM that can

be programmed with standard voltages.
 Uses dual capacitor structure.
 Available in some digital processes for
integrated memory, but raises the price of
the manufacturing process.

ROM core circuit

SRAM critical path
core
Sense
amp

Row decoders
 Decode row using NORs:

Static RAM (SRAM)
 Core cell uses six-transistor circuit to store

value.
 Value is stored symmetrically—both true
and complement are stored on cross-
coupled transistors.
 SRAM retains value as long as power is
applied to circuit.

SRAM core cell

SRAM core operation
 Read:
– precharge bit and bit’ high;
– set select line high from row decoder;
– one bit line will be pulled down.
 Write:
– set bit/bit’ to desired (complementary) values;
– set select line high;
– drive on bit lines will flip state if necessary.
SRAM sense amp

Sense amp operation
 Differential pair—takes advantage of

complementarity of bit lines.
 When one bit line goes low, that arm of diff
pair reduces its current, causing
compensating increase in current in other
arm.
 Sense amp can be cross-coupled to increase
speed.
3-transistor dynamic RAM
(DRAM)
 First form of DRAM—modern commercial

DRAMs use one-transistor cell.
 3-transistor cell can easily be made with a
digital process.
 Dynamic RAM loses value due to charge
leakage—must be refreshed.

3-T DRAM core cell

1-T RAM
 1 transistor + 1 capacitor:
word
bit

1-T DRAM with trench capacitor

1-T DRAM with stacked capacitor

Embedded DRAM
 Embedded DRAM is integrated with logic.

 DRAM and logic processes are hard to
make compatible.
– Capacitor requires high temperatures that
destroy fine-line transistors.
 Embedded DRAM is less dense than
commodity DRAM.

3-T DRAM operation
 Value is stored on gate capacitance of t1.
 Read:
– read = 1, write = 0, read_data’ is precharged;
– t1 will pull down read_data’ if 1 is stored.
 Write:
– read = 0, write = 1, write_data = value;
– guard transistor writes value onto gate capacitance.

 Cannot support full connectivity between all data path
elements; must choose number of transfers per cycle
allowed.
 A bus circuit is a specialized multiplexer circuit.
 Two major choices: pseudo-nMOS, precharged.

Topics
 FPGA fabric architecture concepts.

Elements of an FPGA fabric
 Logic.
 Interconnect.
 I/O pins.
[Figure: array of logic elements (LEs) joined by interconnect and bordered by I/O blocks (IOBs).]

Terminology
 Configuration: bits that determine logic

function + interconnect.
 CLB: configurable logic block = logic
element (LE).
 LUT: Lookup table = SRAM used for truth
table.
 I/O block (IOB): I/O pin + associated logic
and electronics.

Logic element
 Programmable:
– Input connections.
– Internal function.
 Coarser-grained than logic gates.
– Typically 4 inputs.
 Generally includes register.
 May provide specialized logic.
– Adder carry chain.

Example logic element
 Lookup table:
[Figure: inputs a and b address a small memory whose stored bits drive out, shown beside the corresponding truth table with columns a, b, out.]

Logic synthesis
 How do we break the function into logic

elements?
 How do we implement an operation within
a logic element?

Placement
 Where do we put each piece of logic in the

array of logic elements?
LE LE LE
LE LE
… LE
LE LE LE

Programmable wiring
 Organized into channels.

– Many wires per channel.
 Connections between wires made at
programmable interconnection points.
 Must choose:
– Channels from source to destination.
– Wires within the channels.

Programmable interconnection
point
D Q

Programmable wiring paths

Choosing a path
LE
LE

Routing problems
 Global routing:
– Which combination of channels?
 Local routing:
– Which wire in each channel?
 Routing metrics:
– Net length.
– Delay.

Segmented wiring
Length 1
Length 2

Offset segments

I/O
 Fundamental selection: input, output, three-

state?
 Additional features:
– Register.
– Voltage levels.
– Slew rate.

Programming technologies
 SRAM.
– Can be programmed many times.
– Must be programmed at power-up.
 Antifuse.
– Programmed once.
 Flash.
– Similar to SRAM but using flash memory.

Configuration
 Must set control bits for:

– LE.
– Interconnect.
– I/O blocks.
 Usually configured off-line.
– Separate burn-in step (antifuse).
– At power-up (SRAM).

Configuration vs. programming
 FPGA configuration:  CPU programming:

– Bits stay at the device – Instructions are fetched
they program. from a memory.
– A configuration bit – Instructions select
controls a switch or a complex operations.
logic bit.
add r1, r2 addIR

r1, r2
memory CPU

Reconfiguration
 Some FPGAs are designed for fast

configuration.
– A few clock cycles, not thousands of clock
cycles.
 Allows hardware to be changed on-the-fly.

FPGA fabric architecture
questions
 Given limited area budget:

– How many logic elements?
– How much interconnect?
– How many I/O blocks?

Logic element questions
 How many inputs?

 How many functions?
– All functions of n inputs or eliminate some
combinations?
– What inputs go to what pieces of the function?
 Any specialized logic?
– Adder, etc.
 What register features?

Interconnect questions
 How many wires in each channel?

 Uniform distribution of wiring?
 How should wires be segmented?
 How rich is interconnect between channels?
 How long is the average wire?
 How much buffering do we add to wires?

I/O block questions
 How many pins?

– Maximum number of pins determined by
package type.
 Are pins programmed individually or in
groups?
 Can all pins perform all functions?
 How many logic families do we support?

Topics
 SRAM-based FPGA fabrics:

– Xilinx.
– Altera.

SRAM-based FPGAs
 Program logic functions, interconnect using

SRAM.
 Advantages:
– Re-programmable;
– dynamically reconfigurable;
– uses standard processes.
 Disadvantages:
– SRAM burns power.
– Possible to steal, disrupt configuration bits.

Logic elements
 Logic element includes combinational

function + register(s).
 Use SRAM as lookup table for
combinational function.
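
A lookup table can be sketched in Python as a small memory addressed by the inputs. The class `LookupTable` and its methods are hypothetical names for illustration; the sketch ignores the register and any vendor-specific features.

```python
from itertools import product


class LookupTable:
    """n-input LUT: 2**n configuration bits addressed by the input values."""

    def __init__(self, n_inputs, config_bits):
        if len(config_bits) != 2 ** n_inputs:
            raise ValueError("an n-input LUT needs exactly 2**n configuration bits")
        self.n_inputs = n_inputs
        self.bits = list(config_bits)

    @classmethod
    def from_function(cls, n_inputs, func):
        """Fill the table by evaluating func on every input combination."""
        bits = [int(bool(func(*inputs)))
                for inputs in product((0, 1), repeat=n_inputs)]
        return cls(n_inputs, bits)

    def evaluate(self, *inputs):
        """Use the inputs as a memory address (first input is the high bit)."""
        address = 0
        for bit in inputs:
            address = (address << 1) | bit
        return self.bits[address]


xor_lut = LookupTable.from_function(2, lambda a, b: a ^ b)
majority_lut = LookupTable.from_function(3, lambda a, b, c: (a + b + c) >= 2)

print("XOR configuration bits:", xor_lut.bits)
print("majority(1, 0, 1) =", majority_lut.evaluate(1, 0, 1))
```

Changing the function only changes the stored bits; the lookup itself is identical for every function.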

LUT-based logic element
n
inputs
Lookup 1
table out
configuration
bits
Can multiplex at output or address at input

Example
[Figure: example lookup table with its configuration bits.]

Evaluation of SRAM-based LUT
 An n-input LUT can implement any function of n inputs, using $2^n$ configuration bits.

 All logic functions take the same amount of space.
 All functions have the same delay.
 SRAM is larger than static gate equivalent of
function.
 Burns power at idle.
 Want to selectively add register to LE:

Registers in logic elements
 Register may be selected into the circuit:
Configuration bit
LUT LE out
D Q

Other LE features
 Multiple logic functions in an LE.

 Addition logic:
– carry chain.
 Partitioned lookup tables.

Xilinx Spartan-II CLB
 Each CLB has two identical slices.

 Slice has two logic cells:
– LUT.
– Carry logic.
– Registers.

Spartan-II CLB details
 Each lookup table can be used as a 16-bit

synchronous RAM or 16-bit shift register.
 Arithmetic logic includes an XOR gate.
 Each slice includes a mux to combine the results
of the two function generators in the slice.
 Register can be configured as DFF or latch.
 Has three-state drivers (BUFTs) for on-chip
busses.

Spartan-II CLB operation
 Arithmetic:
– Carry block includes XOR gate.
– Use LUT for carry, XOR for sum.
 Each slice uses F5 mux to combine results of
multiplexers.
 F6 mux combines outputs of F5 muxes.
 Registers can be FF/latch; clock and clock enable.
 Includes three-state output for on-chip bus.

Altera APEX II logic element
 Each logic array block has 10 logic

elements.
 Logic elements share some logic.

Apex II LE modes
 Modes of operation:
– Normal.
– Arithmetic.
– Counter.

APEX-II LE normal mode

APEX-II LE arithmetic mode

APEX-II LE counter mode

APEX-II LE control logic

Programmable interconnect
 MOS switch controlled by configuration bit:
D Q

Programmable vs. fixed
interconnect
 Switch adds delay.

 Transistor off-state is worse in advanced
technologies.
 FPGA interconnect has extra length = added
capacitance.

Interconnect strategies
 Some wires will not be utilized.

 Congestion will not be same throughout
chip.
 Types of wires:
– Short wires: local LE connections.
– Global wires: long-distance, buffered
communication.
– Special wires: clocks, etc.

Paths in interconnect
 Connection may be long, complex:
LE LE LE LE LE
Wiring channel
Wiring channel
LE LE LE LE LE
LE LE LE LE LE

Interconnect architecture
 Connections from wiring channels to LEs.

 Connections between wires in the wiring
channels.
[Figure: wiring channel running between two LEs.]

Interconnect richness
 Within a channel:
– How many wires.
– Length of segments.
– Connections from LE to channel.
 Between channels:
– Number of connections between channels.
– Channel structure.

Segmented wiring
Length 1
Length 2

Offset segments

Switchbox
channel
channel channel
channel

Spartan-II interconnect
 Types of interconnect:
– local;
– general-purpose;
– dedicated;
– I/O pin.

Spartan-II general-purpose
network
 Provides majority of routing resources:

– General routing matrix (GRM) connects
horizontal/vertical channels and CLBs.
– Interconnect between adjacent GRMs.
– Hex lines connect GRM to GRMs six blocks
away.
– 12 longlines span the chip.

Spartan-II routing
 Relationship between GRM, hex lines, and local interconnect:

Spartan-II three-state bus
 Horizontal on-chip busses:

Spartan-II clock distribution

APEX II interconnect
row
column

Spartan-II I/O
 Supports multiple I/O standards:

– LVTTL, PCI, LVCMOS2, AGP2X, etc.
 Provides registers.
 Programmable delay for pin-dependent hold
time.
 Programmable weak keeper circuit.

Spartan-II I/O block diagram

Configuration
 Need to set all configuration SRAM bits:

– minimum pin cost;
– reasonable speed.
 Configuration can also be read back for
testing.

Configuration ROM
 Configured on start-up from ROM:
Configuration
memory
FPGA

Spartan-II configuration
 Configuration length depends on size of

chip:
– 200,000 to 1.3 million bits.
 Configuration modes:
– Master serial for first chip in chain.
– Slave serial for follow-on chips.
– Slave parallel.
– Boundary-scan.
Scan chain
 Scan chain: shift register used to access

internal state.
 Level-sensitive scan design (LSSD): scan
structure that uses some hardware for
normal mode and scan.

JTAG boundary scan
 JTAG: Joint Test Action Group.

 Boundary scan:
– provide scan chain at pins;
– allow control of chip interior;
– decouple chip from rest of board for test.

Chip-on-board testing
 Boundary scan decouples chips:
board

Boundary scan concepts
 TAP: test access port.

– Requires four dedicated pins not shared with other logic, plus an optional reset pin.
– Test clock, test mode select, test data in, test data out;
optional test reset.
 TAP controller recognizes pins, controls
boundary scan registers.
 Instruction register defines boundary scan
mode.
Topics
 Antifuse-based FPGA fabrics:

– Actel.
 Flash-based FPGAs

Antifuses
 Permanently programmed.
 Make a connection with electrical signal.
– More reliable than breaking a connection.
– Avoids shrapnel.
 Resistance of about 100 Ω.

Antifuse structure
Metal 2
antifuse
via
Metal 1
substrate

Flash-programmed FPGA
 Flash is electrically-erasable EPROM.

 Allows reprogramming without boot-up
procedure.

Flash-programmed switch

Logic blocks
 Program by making connections.

 Based on multiplexing.
Multiplexer: data inputs d0, d1; select a; output out.
Truth table:
a out
0 d0
1 d1
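
To illustrate how a multiplexer becomes a logic block, the sketch below ties the data inputs of a 2:1 multiplexer to constants or signals. The module names (mux2, mux_logic) are hypothetical, and the constant connections stand in for programmed connections in a real fabric.

```verilog
// 2:1 multiplexer: out = d1 when a = 1, d0 when a = 0.
module mux2(d0, d1, a, out);
  input d0, d1, a;
  output out;
  assign out = a ? d1 : d0;
endmodule

// Programming the data inputs selects the logic function.
module mux_logic(a, b, and_out, or_out);
  input a, b;
  output and_out, or_out;
  mux2 m_and(.d0(1'b0), .d1(b), .a(a), .out(and_out)); // a ? b : 0 = a & b
  mux2 m_or (.d0(b), .d1(1'b1), .a(a), .out(or_out));  // a ? 1 : b = a | b
endmodule
```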

Larger logic block
10 10 0 0
Actel 54SX logic element

Actel 54SX adder logic
 Uses two C-cells

in SuperCluster.
 Adds bits A0
and A1.
 Carry in FCI,
carry out FCO.
 Active when
CFN is high.

Actel 54SX R cell

Actel 54SX LE
 C/R cells organized into clusters.

– Type 1 cluster: CRC.
– Type 2 cluster: CRR.
 Clusters grouped into superclusters.
– Type 1: two type 1 clusters.
– Type 2: one type 1, one type 2.

Actel ProASIC 500K logic gate
 Uses switches to connect inputs, feedback, etc.

Actel 54SX interconnect
 FastConnect provides horizontal

connections between logic modules.
– Within a supercluster.
– To supercluster below.
 DirectConnect is within a supercluster:
– connects C-cell to R-cell neighbor.
 Generic global wiring in segmented
channels.
I/O pins
 Need programmable pins:

– Input or output.
– Three-state.
 Other features:
– Registers.
– Slew rate.
– Voltage levels.
– Double-data rate (DDR) support.

Altera APEX II I/O
 Supports SDRAM and double-data rate

(DDR) memory.
 Six registers and latch.
 Bidirectional buffers.
 Two inputs and two outputs.

APEX II I/O

Antifuse programming
 Need to be able to apply programming

voltage to every antifuse.
– Path from VDD to GND.
 Programming can be performed slowly.
– Don’t need a lot of parallelism.
 Use the wiring network to gain access to the
antifuses.
– Access transistors control path to antifuse.

Antifuse programming access
transistors

Topics
 Circuit design for FPGAs:

– Logic elements.
– Interconnect.

Multiplexers as logic elements
Figure: multiplexers wired as logic elements, producing (AB)’ and A^B, and a multiplexer-based latch with D, CLK, and CLR inputs and Q output.

Using antifuses

Static CMOS gate vs. LUT
 Number of transistors:
– NAND/NOR gate has 2n transistors.
– 4-input LUT has 128 transistors in SRAM, 96 in
multiplexer.
 Delay:
– 4-input NAND gate has 9τ delay.
– SRAM decoding has 21τ delay.
 Power:
– Static gate’s power depends on activity.
– SRAM always burns power.

Lookup table circuitry
 Demultiplexer or multiplexer?
adrs
adrs
LUT LUT

Traditional RAM/ROM
 Cell drives long bit line:
Bit line
adrs

Lookup memory
 Multiplexer presents smaller load to

memory cells.
– Allows smaller memory cells.

Multiplexer styles
static gates
pass transistors
Multiplexer design
 Pass transistor multiplexer uses fewer

transistors than fully complementary gates.
 Pass transistor is somewhat faster than
complementary switch:
– Equal-strength p-type is 2.5X n-type width.
– Total resistance is 0.5X, total capacitance is
3.5X.
– RC delay is 0.5 x 3.5 = 1.75 times n-type
switch.
Static gate four-input mux
 Delay through n-input NAND is (n+2)/3.
 lg b + 1 inputs at first level, so delay is (lg b + 3)/3.
 Delay at second level is (b+2)/3.
 Delay grows as b lg b.

Pass-transistor-based four-input mux
 Must include decode

logic in total delay.

Tree-based four-input mux
 Delay proportional to square of path length.
 Delay grows as (lg b)^2.

LE output drivers
 Must drive load:

– Wire;
– Destination LE.
 Different types of wiring present different
loads.

Avoiding programming hazards
 Want to disable connections to routing

channel before programming.
From LE
config
progb
Routing channel

Interconnect circuits
 Why so many types of

interconnect?
– Provide a choice of
delay alternatives.
 Sources of delay:
– Wires.
– Programming points.

Styles of programmable
interconnection point
pass transistor Three-state

Pass transistor programmable
interconnect point
 Small area.
 Resistive switch.
 Delay grows as the
square of the number
of switches.
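
The quadratic growth follows from a simple Elmore-delay estimate. The sketch below assumes each of the n series switches has on-resistance R and each intermediate node has capacitance C; both symbols are illustrative and not taken from the slides. Node k sees resistance kR back to the driver, so:

```latex
t_{\text{Elmore}} = \sum_{k=1}^{n} (kR)\,C = RC\,\frac{n(n+1)}{2} \approx \frac{RC}{2}\,n^{2}
```

Doubling the number of switches in series therefore roughly quadruples the delay, which is why long routes benefit from regenerative (buffered) connection points.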

Three-state programmable
interconnection point
 Larger area.
 Regenerative driver.

Switch area * wire delay vs. buffer
size (Betz & Rose)
© 1999 IEEE

Switch area * wire delay vs. pass
transistor width (Betz & Rose)
© 1999 IEEE

Wire delay vs. switch sizes (Chandra
and Schmidt)
 Delay vs. switch size for various driver sizes.
 U-shaped curve:
– Resistance initially decreases.
– Increased capacitance eventually dominates.
© 2002 IEEE
Clock drivers
 Clock driver tree:

Clock nets
 Must drive all LEs.

 Design parameters:
– number of fanouts;
– load per fanout;
– wiring tree capacitance.
 Determine optimal buffer sizes.

H tree
 Regular layout
structure.
– Recursive.

Topics
 The logic design process.

Combinational logic networks
 Functionality.
 Other requirements:
– Size.
– Power.
– Performance.
Figure: combinational logic block between primary inputs and primary outputs.

Non-functional requirements
 Performance:
– Clock speed is generally a primary requirement.
 Size:
– Determines manufacturing cost.
 Power/energy:
– Energy related to battery life, power related to heat.
– Many digital systems are power- or energy-limited.

Mapping into an FPGA
 Must choose the FPGA:

– Capacity.
– Pinout/package type.
– Maximum speed.

Hardware description languages
 Structural description:
– A connection of components.
 Functional description:
– A set of Boolean formulas, state transitions, etc.
 Simulation description:
– A program designed for simulation.
 Major languages:
– Verilog.
– VHDL.

Logic optimization
 Must transform Boolean expressions into a form

that can be implemented.
– Use available primitives (gates).
– Meet delay, size, energy/power requirements.
 Logic gates implement expressions.
– Must rewrite logic to use the expressions provided by
the logic gates.
 Maintain functionality while meeting non-
functional requirements.

Macros
 Larger modules designed to fit into a

particular FPGA.
– Hard macro includes placement.
– Soft macro does not include placement.

Physical design
 Placement:
– Place logic components into FPGA fabric.
 Routing:
– Choose connection paths through the fabric.
 Configuration generation:
– Generate bits required to configure FPGA.

Example: parity
 Simple parity function:

– P = a0 XOR a1 XOR a2 XOR a3.
 Implement with Xilinx ISE.

Xilinx ISE main screen
Sources in project
Source window
Processes for source
Output

New project

New project info

Create HDL file

I/O description

I/O info

Empty Verilog description
module parity(a,p);
input [31:0] a;
output p;
endmodule

Verilog with functional code
module parity(a,p);
input [31:0] a;
output p;
assign p = ^a;
endmodule
RTL schematic: top-level

RTL model: implementation

Example: simulation
 Apply stimulus/test vectors.

 Look at response/output vectors.
 Can’t exhaustively simulate but we can
exercise the module.
 Simulation before synthesis is faster and
easier than simulating the mapped design.
– Sometimes want to simulate the mapped
design.

Testbench
Stimulus
Unit
Under
Test
(UUT)
Response
testbench

Automatically-created testbench
module parity_testbench_v_tf();
// DATE: 11:48:13 11/07/2003
// MODULE: parity
// DESIGN: parity
// FILENAME: testbench.v
// PROJECT: parity
// VERSION:
// Inputs
reg [31:0] a;
// Outputs
wire p;
// Bidirs
// Instantiate the UUT
parity uut (
.a(a),
.p(p)
);
// Initialize Inputs
`ifdef auto_init
initial begin
a = 0;
end
`endif
endmodule
Test vector application code
initial begin
$monitor("a = %b, parity=%b\n",a,p);
#10 a = 0;
#10 a = 1;
#10 a = 2'b10;
#10 a = 2'b11;
#10 a = 3'b100;
#10 a = 3'b101;
#10 a = 3'b110;
#10 a = 3'b111;
…
#10 a = 1024;
#10 a = 1025;
#10 a = 16'b1010101010101010;
#10 a = 17'b11010101010101010;
#10 a = 17'b10010101010101010;
#10 a = 32'b10101010101010101010101010101010;
#10 a = 32'b11101010101010101010101010101010;
#10 a = 32'b10101010101010101010101010101011;
$finish;
end

Project summary

Topics
 Modeling with hardware description

languages (HDLs).

Hardware description languages
 Textual languages for describing hardware:

– structure;
– function.
 Most people today use textual languages
rather than schematics for most digital
design.
– Schematics make poor use of screen space.

Major HDLs
 Two major HDLs designed for simulation:

– VHDL;
– Verilog.
– Similar capabilities but somewhat different
language philosophies.
 EDIF is a standard netlist format.

Simulation vs. programming
 Simulation tags computations with times.

– Must know when signals change to properly
simulate hardware.
 Simulation is parallel.
– Many statements can execute at the same
(simulation) time.
– Just like hardware.

Types of simulation
 Compiled code simulation.

– Generate program that evaluates a hardware
block.
– Operational details within the hardware block
are lost.
 Event-driven simulation.
– Propagate events through simulation.
– Don’t simulate a block until its inputs change.

Event-driven simulation
 An event is a change in a net’s value.
 An event has two components:
– value;
– time.
Example event on net1: net1=0 @ 35 ns.
Events on a gate
 Propagate events only when nets change value.
 If an input change doesn’t cause an output change, no event is propagated.

Timewheel
 The timewheel is a data structure in the

simulator that efficiently determines the
order of events processed.
 Events are placed on the timewheel in time
order.
 Events are taken out of the head of the
timewheel to process them in order.
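
Verilog simulators are event-driven, so a small testbench can make the event ordering visible. The sketch below assumes a single NAND gate with a 1 ns delay; the module name timewheel_demo and the chosen delay are illustrative assumptions.

```verilog
`timescale 1ns/1ns
// Hypothetical demo: input events at 0 ns and 1 ns cause an output event
// that the simulator schedules one gate delay later.
module timewheel_demo;
  reg a, b;
  wire c;
  nand #1 g1(c, a, b);     // 1 ns gate delay (assumed)

  initial begin
    $monitor("t=%0t a=%b b=%b c=%b", $time, a, b, c);
    a = 1; b = 0;          // input events at 0 ns
    #1 b = 1;              // event b=1 at 1 ns
    #5 $finish;            // c has settled to 0 by 2 ns
  end
endmodule
```

The change on b at 1 ns makes the simulator evaluate the gate and place a new event for c one gate delay later; no statement in the testbench sets c directly.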

Timewheel operation
Netlist: gate with inputs a and b, output c.
Timewheel (in time order): a=1 @ 0 ns; b=1 @ 1 ns; c=0 @ 2 ns.

Order of evaluation
 Order of evaluation is important.

– Causality must be obeyed.
 Evaluating events in the wrong order can
cause inaccurate results.

Order of evaluation example
Netlist: nets a, b, c, d, e.
Timewheel (in time order): b=1 @ 1 ns; d=1 @ 2 ns; e=0 @ 4 ns.

Compiled simulation
 A block of code is generated to simulate a

block of hardware.
– Can use compiler to optimize the code.
 Code ignores much temporal behavior
within the block.
– Must still evaluate events in the right order.
– Must generate times at interface to event-driven
model.
Modeling
 Structural modeling describes the

connections between components.
– Netlists are structural models.
 Behavioral modeling describes the functional
relationship between inputs and outputs.
– Similar to programming but values are events.

HDL language constructs
 Must be able to define component types.

– A model may be behavioral or structural.
 May be able to define abstract data types.
– A wire may carry an enumerated value.
– Multi-valued simulation may be defined using
abstract data types.
 May be able to define modules to control
the scope of names.
Testbenches
 A testbench is a model used to exercise a

simulation.
– Provides stimulus.
– Checks outputs.
 Testbenches help automate design
verification.
– Rerun edited module against testbench.
– Run models at behavioral, RTL levels against
the same testbench.
Synthesis subsets
 VHDL and Verilog were designed for

simulation.
 A synthesis subset is:
– synthesizable;
– produces consistent simulation results.
 Different tools may use different synthesis
subsets.

Register-transfer synthesis
 Most common type of synthesis.

 Synthesizes gates from abstract RT model.
– Registers are explicit.
– Some tools will infer storage elements---be
careful.
 Optimized for performance, area, power.

Topics
 HDL coding for synthesis.

– Verilog.
– VHDL.

Synthesis vs. simulation
semantics
 Simulation:
– Events are interpreted during simulation.
 Synthesis:
– Logic/memory is extracted from the
description.
CL

Logic synthesis
 Synthesis = translation + optimization.

– Translated from HDL or direct Boolean
network.
– Ideally, translation includes don’t-cares.
– Optimization rewrites to satisfy objective
functions: area, speed, power.

Syntax-directed translation
x = a and b;
a
x
b
if (a or b)
begin
x = c;
end; a
b x

Verilog simulation and synthesis
 Signal assignments must use the assign

keyword:
– assign sig3 = sig1 & sig2;

Verilog structural descriptions
 Build a structure by wiring together components:
input [7:0] a, b;
input carryin;
output [7:0] sum;
output carryout;
wire [7:1] carry;
fulladd a0(a[0],b[0],carryin,sum[0],carry[1]);
fulladd a1(a[1],b[1],carry[1],sum[1],carry[2]);
Type name Instance name

VHDL for Synopsys synthesis
 Each process should start with an activation

list:
foo: process (a,b,in1,in2)
 At least two processes:
– combinational;
– sequential.
 Sequential process includes
wait until clock…
Initializing variables
All variables used must be initialized.

Uninitialized variables cause latches to be
introduced: BAD.
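
The latch problem can be illustrated with two small Verilog modules. This is a sketch with hypothetical module names (latch_inferred, no_latch); the default value 0 in the second module is an arbitrary choice.

```verilog
// Incomplete assignment: y is not assigned when sel == 0, so synthesis
// must remember the old value and infers a latch.
module latch_inferred(sel, a, y);
  input sel, a;
  output y;
  reg y;
  always @(sel or a)
    if (sel) y = a;
endmodule

// A default assignment gives y a value on every path,
// so the block stays purely combinational.
module no_latch(sel, a, y);
  input sel, a;
  output y;
  reg y;
  always @(sel or a) begin
    y = 1'b0;          // default value (arbitrary choice)
    if (sel) y = a;
  end
endmodule
```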

State machines
Use a case (or casex/casez) statement to decode the current state:
initial begin : Init s0 = 3'b000; end
case (curr)
2'b00:
if (in1 == 1'b0) begin o1 = a | b;
end
2'b01: ...

Process structure
 How many combinational processes?

– separate datapath;
– single process for data and control.
 Comparison:
– single process is simpler;
– separate datapath uses less logic.
ctrl
combin
combin seq vs. seq

dp
combin
Multiplexing a datapath element
case (muxctrl)
1'b0: muxout = a;
1'b1: muxout = b;
endcase
foo = muxout | c;

Arithmetic
 Can generate logic by hand.

 Operators (+,-,<,>,*,+1,-1,etc.) can be
mapped onto library elements.
– May be limited to standard widths.

General synthesis hints
 Check out all warnings carefully.

 An early synthesis run keeps you from
debugging a simulation that won’t
synthesize.

The synthesis process
 Synthesis is driven by a script:

compile -map_effort med
report_fpga > TOP + ".fpga"
 Script may be customized for the design.
– Verilog file foo.v, script file foo.script.
– Typically start with a standard script.

Timing constraints
 Clock period.
– Duty cycle, etc.
 Group path timing.
– Cells or ports that share the same timing
behavior.
 Input/output delay.
– End-to-end delay.

Hierarchical design and logic
optimization
 Boolean network model does not reach

across component boundaries.
 Tools generally won’t automatically flatten
logic.
– Size may blow up.
 You may direct the tool to flatten a group of
components.
– Heuristic flattening algorithms may be used.
Instantiating memory
 Use a memory model:

– primitive memories based on LUT;
– larger memories synthesized from multiple
logic elements.
 Synthesis can’t handle a memory described
behaviorally.
– Can handle behavioral ROM.
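
A behavioral ROM is usually written as a case statement over the address. The sketch below is illustrative: the module name rom4x8 and the contents are placeholders, and whether a given tool maps it onto LUT-based memory or ordinary logic is tool-dependent.

```verilog
// Hypothetical 4-word x 8-bit ROM described behaviorally.
module rom4x8(addr, data);
  input [1:0] addr;
  output [7:0] data;
  reg [7:0] data;
  always @(addr)
    case (addr)
      2'b00: data = 8'h3c;    // contents are arbitrary placeholders
      2'b01: data = 8'ha5;
      2'b10: data = 8'h0f;
      default: data = 8'hf0;  // covers address 2'b11
    endcase
endmodule
```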

I/O configuration
 Synthesis can automatically determine the

types of many I/O blocks, configure
appropriately.
 Some things that need to be specified:
– indirect three-state activity;
– I/O pin location;
– registered bidirectional I/O.

Timing model
 Synthesis system reads a wire load model

from a technology library.
– Model depends on part, speed grade.

Attribute passing
 FPGA Compiler allows attributes to be

passed to EDIF:
– BUFG X(.I(a),.O(b)); // synopsys attribute LOC
BR

Results and reports
 Save design as:

– database;
– EDIF.
 Types of reports:
– Default synthesis report.
– Configuration report.
» Describes LEs, IOBs, etc.
– Timing report.
Fun with CAD tools
 Array renaming between tools:

– <0>
– [0]
– _0_

Topics
 Combinational network delay.

 Combinational network energy/power.

Delay characteristics
 Measured from change in inputs to change in outputs.
 Data-dependent:
– Some inputs give longer delays than others.
 May exercise different paths through the
network.

Timing diagram
tc >= tx + ty
A
0
1
S
B C
X
time

Sources of delay
 Gate delay:
– intrinsic;
– drive;
– load.
 Wire:
– lumped load;
– transmission line.

Basic gate delay model
 Gate delay tg.

 Wire delay tw.
LE PIP LE

Optimizing a single link
 Custom design---improve gate delay:

– Transistor sizing.
– Gate topology.
 FPGA or custom design---improve wire
delay:
– Shorten wire length.
– Choose wire category.
– Increase driver size.

Fanout
 Fanout adds capacitance.
sink
source sink
sink

Driving fanout
 Adding gates adds capacitance:

Ways to drive large fanout
 Increase sizes of driver transistors. Must

take into account rules for driving large
loads.
 Add intermediate buffers. This may
require/allow restructuring of the logic.

Buffers

Wire capacitance
 Use layers with lower capacitance.

 Redesign layout to reduce length of wires
with excessive delay.

Path delay
 Combinational network delay is measured

over paths through network.
 Can trace a causality chain from inputs to
worst-case output.

Path delay example
network
graph model

Critical path
 Critical path = path which creates longest

delay.
 Can trace transitions which cause delays
that are elements of the critical delay path.

Delay model
 Nodes represent gates.

 Assign delays to edges—signal may have
different delay to different sinks.
 Lump gate and wire delay into a single
value.
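
With delays on edges, the worst-case delay can be written as a longest-path recurrence over the graph. The notation here is an assumption for illustration: d(u,v) is the lumped delay on the edge from node u to node v, E is the edge set, and T(v) is the latest arrival time at node v.

```latex
T(v) =
\begin{cases}
0 & \text{if } v \text{ is a primary input} \\
\max_{(u,v) \in E} \bigl( T(u) + d(u,v) \bigr) & \text{otherwise}
\end{cases}
\qquad
T_{\text{critical}} = \max_{o \,\in\, \text{outputs}} T(o)
```

Evaluating T in topological order visits each edge once; the edges that achieve the maximum at each node trace out the critical path.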

Critical path through delay graph

Reducing critical path length
 To reduce circuit delay, must speed up the

critical path—reducing delay off the path
doesn’t help.
 There may be more than one path of the
same delay. Must speed up all equivalent
paths to speed up circuit.
 Must speed up cutset through critical path.

False paths
 Logic gates are not simple nodes—some

input changes don’t cause output changes.
 A false path is a path which cannot be
exercised due to Boolean gate conditions.
 False paths cause pessimistic delay
estimates.

False path example

Another false path example
d = 10 d = 10
d = 20 d = 20
False path

Placement and delay
 Placement helps determine routing.

 Routing determines wire length.
 Wire length determines capacitive load.

Placement and wire capacitance
g1 g3
dvr
g2 g4
unbalanced load
g1 g3
dvr
g2 g4
more balanced

Optimizing network delay
 Identify the longest path.

 Improve delay along the longest path:
– Driver delay.
– Wire delay.
– Logic restructuring.

Example: adder placement and delay
 N-bit adder:
+ + + +

Bad placement and routing
placement routing
Better placement and routing
placement routing

Logic rewrites
shallow
deep logic logic

Logic transformations
 Can rewrite by using subexpressions.

– Simplifications affect the cost of rewrites.
 Flattening logic increases gate fanin.
 Logic rewrites may affect gate placement.

Power optimization
 Transitions cause power consumption.

 Logic network design helps control power
consumption:
– minimizing capacitance;
– eliminating unnecessary glitches.

Glitching example
 Gate network:

Glitching example behavior
 NOR gate produces 0 output at beginning

and end:
– beginning: bottom input is 1;
– end: NAND output is 1;
 Difference in delay between application of
primary inputs and generation of new
NAND output causes glitch.

Adder chain glitching
bad
good

Explanation
 Unbalanced chain has signals arriving at

different times at each adder.
 A glitch upstream propagates all the way
downstream.
 Balanced tree introduces multiple glitches
simultaneously, reducing total glitch
activity.

Power estimation tools
 Power estimator approximates power

consumption from:
– gate network;
– primary input transition probabilities;
– capacitive loading.
 May be switch/logic simulation based or
use statistical models.

Factorization for low power
 Proper factorization reduces glitching.
bad good
Factorization techniques
 In example, a has high transition

probability, b and c low probabilities.
 Reduce number of logic levels through
which high-probability signals must travel
in order to reduce propagation of glitches.

Layout for low power
 Place and route to minimize capacitance of

nodes with high glitching activity.
 Feed back wiring capacitance values to
power analysis for better estimates.

Topics
 Number representation.
 Shifters.
 Adders and ALUs.

Signed number representations
 One’s complement:
– 3 = 0011
– -3 = ~(0011) = 1100
– Two zeroes: 0000, 1111
 Two’s complement:
– 3 = 0011
– -3 = ~(0011) +1 = 1101
– One zero: 0000

Representations and arithmetic
 $N = \sum_i 2^i b_i$
 Test for zero: all bits are 0.
 Test for negative: sign bit is 1.
 Subtraction: negate then add.
– a – b = a + (-b) = a + (~b+1)
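
The negate-then-add identity maps directly onto an adder. The following is a minimal sketch with a hypothetical module name (subtract) and an assumed default width of 8 bits.

```verilog
// Hypothetical n-bit two's-complement subtractor built from an adder:
// a - b = a + (~b) + 1.
module subtract(a, b, diff);
  parameter n = 8;
  input [n-1:0] a, b;
  output [n-1:0] diff;
  assign diff = a + ~b + 1'b1;   // the +1 could also be fed in as the carry-in
endmodule
```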

Combinational shifters
 Useful for arithmetic operations, bit field

extraction, etc.
 Latch-based shift register can shift only one
bit per clock cycle.
 A multiple-shift shifter requires additional
connectivity.

Barrel shifter
 Can perform n-bit shifts in a single cycle.

Barrel shifter structure
Accepts 2n data inputs and n control signals,

producing n data outputs.
data 1
n bits
output
n bits
data 2
n bits

Barrel shifter operation
 Selects arbitrary contiguous n bits out of 2n input bits.
 Examples:
– right shift: data into top, 0 into bottom;
– left shift: 0 into top, data into bottom;
– rotate: data into top and bottom.
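
The select-n-of-2n behavior can be sketched in Verilog as a shift of the concatenated inputs. The module name barrel and the ports hi, lo, and sh are hypothetical; hi is assumed to be the more significant half, which may or may not match the drawing's "top".

```verilog
// Hypothetical barrel shifter: selects n contiguous bits out of the 2n-bit
// word {hi, lo}, starting sh bits above the bottom of lo.
module barrel(hi, lo, sh, result);
  parameter n = 8, lgn = 3;
  input [n-1:0] hi, lo;
  input [lgn-1:0] sh;
  output [n-1:0] result;
  wire [2*n-1:0] both = {hi, lo};
  assign result = both >> sh;   // assignment keeps the low n bits
endmodule
```

In this sketch, driving hi and lo with the same data gives a rotate right by sh; driving hi with 0 and lo with the data gives a logical right shift.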

Verilog for barrel shifter
module shifter(data,b,result);
parameter Nminus1 = 31; /* 32-bit shifter */
input [Nminus1:0] data; /* data to be shifted */
input [3:0] b; /* amount to shift */
output [Nminus1:0] result; /* shift result */
assign result = data << b;

endmodule

Adders
 Adder delay is dominated by carry chain.

 Carry chain analysis must consider
transistor, wiring delay.
 Modern VLSI favors adder designs which
have compact carry chains.

Full adder
 Computes one-bit sum, carry:

– s_i = a_i XOR b_i XOR c_i
– c_{i+1} = a_i b_i + a_i c_i + b_i c_i
 Half adder computes sum of two bits (no carry in).
 Ripple-carry adder: n-bit adder built from
full adders.
 Delay of ripple-carry adder goes through all
carry bits.

Verilog for full adder
module fulladd(a,b,carryin,sum,carryout);
input a, b, carryin; /* add these bits*/
output sum, carryout; /* results */
assign {carryout, sum} = a + b + carryin;

/* compute the sum and carry */
endmodule

Verilog for ripple-carry adder
module nbitfulladd(a,b,carryin,sum,carryout);
input [7:0] a, b; /* add these bits */
input carryin; /* carry in*/
output [7:0] sum; /* result */
output carryout;
wire [7:1] carry; /* transfers the carry between bits */
fulladd a0(a[0],b[0],carryin,sum[0],carry[1]);
…
fulladd a7(a[7],b[7],carry[7],sum[7],carryout);
endmodule

Carry-lookahead adder
 First compute carry propagate, generate:

– P_i = a_i + b_i
– G_i = a_i b_i
 Compute sum and carry from P and G:
– s_i = c_i XOR P_i XOR G_i
– c_{i+1} = G_i + P_i c_i

Carry-lookahead expansion
 Can recursively expand carry formula:

– c_{i+1} = G_i + P_i (G_{i-1} + P_{i-1} c_{i-1})
– c_{i+1} = G_i + P_i G_{i-1} + P_i P_{i-1} (G_{i-2} + P_{i-2} c_{i-2})
 Expanded formula does not depend on
intermediate carries.
 Allows carry for each bit to be computed
independently.

Depth-4 carry-lookahead

Analysis
 Deepest carry expansion requires gates with

large fanin: large, slow.
 Carry-lookahead unit requires complex
wiring between adders and lookahead
unit—values must be routed back from
lookahead unit to adder.
 Layout is even more complex with multiple
levels of lookahead.
Verilog for carry-lookahead carry
block
module carry_block(a,b,carryin,carry);
input [3:0] a, b; /* add these bits*/
input carryin; /* carry into the block */
output [3:0] carry; /* carries for each bit in the block */
wire [3:0] g, p; /* generate and propagate */
assign g[0] = a[0] & b[0]; /* generate 0 */

assign p[0] = a[0] ^ b[0]; /* propagate 0 */
assign g[1] = a[1] & b[1]; /* generate 1 */
assign p[1] = a[1] ^ b[1]; /* propagate 1 */
…
assign carry[0] = g[0] | (p[0] & carryin);
assign carry[1] = g[1] | p[1] & (g[0] | (p[0] & carryin));
assign carry[2] = g[2] | p[2] &
(g[1] | p[1] & (g[0] | (p[0] & carryin)));
assign carry[3] = g[3] | p[3] &
(g[2] | p[2] & (g[1] | p[1] & (g[0] | (p[0] & carryin))));
endmodule

Verilog for carry-lookahead sum
unit
module sum(a,b,carryin,result);
input a, b, carryin; /* bits to add */
output result; /* sum */
assign result = a ^ b ^ carryin;

/* compute the sum */
endmodule
Verilog for carry-lookahead adder
module carry_lookahead_adder(a,b,carryin,sum,carryout);
input [15:0] a, b; /* add these together */
input carryin;
output [15:0] sum; /* result */
output carryout;
wire [16:1] carry; /* intermediate carries */
assign carryout = carry[16]; /* for simplicity */

/* build the carry-lookahead units */
carry_block b0(a[3:0],b[3:0],carryin,carry[4:1]);
carry_block b1(a[7:4],b[7:4],carry[4],carry[8:5]);
carry_block b2(a[11:8],b[11:8],carry[8],carry[12:9]);
carry_block b3(a[15:12],b[15:12],carry[12],carry[16:13]);
/* build the sum */
sum a0(a[0],b[0],carryin,sum[0]);
sum a1(a[1],b[1],carry[1],sum[1]);
…
sum a15(a[15],b[15],carry[15],sum[15]);
endmodule

Carry-skip adder
 Looks for cases in which carry out of a set

of bits is identical to carry in.
 Typically organized into b-bit stages.
 Can bypass carry through all stages in a
group when all propagates are true: P_i P_{i+1}
… P_{i+b-1}.
– Carry out of group when carry out of last bit in
group or carry is bypassed.

Two-bit carry-skip structure
ci
Pi
Pi+1 AND
…
Pi+b-1
OR
Ci+b-1

Carry-skip structure
b adder stages b adder stages b adder stages
Carry out P[2b,3b-1] Carry out P[b,2b-1]Carry out P[0,b-1]

skip skip skip

Worst-case carry-skip
 Worst-case carry-propagation path goes

through first, last stages:

Verilog for carry-skip add with P
module fulladd_p(a,b,carryin,sum,carryout,p);
input a, b, carryin; /* add these bits */
output sum, carryout, p; /* results including
propagate */
assign {carryout, sum} = a + b + carryin;

/* compute the sum and carry */
assign p = a | b;
endmodule

Verilog for carry-skip adder
module carryskip(a,b,carryin,sum,carryout);
input [7:0] a, b; /* add these bits */
input carryin; /* carry in*/
output [7:0] sum; /* result */
output carryout;
wire [8:1] carry; /* transfers the carry between bits */
wire [7:0] p; /* propagate for each bit */
wire cs4; /* final carry for first group */
fulladd_p a0(a[0],b[0],carryin,sum[0],carry[1],p[0]);
fulladd_p a1(a[1],b[1],carry[1],sum[1],carry[2],p[1]);
assign cs4 = carry[4] | (p[0] & p[1] & p[2] & p[3] & carryin);
fulladd_p a4(a[4],b[4],cs4, sum[4],carry[5],p[4]);
…
assign carryout = carry[8] | (p[4] & p[5] & p[6] & p[7] & cs4);
endmodule

Delay analysis
 Assume that skip delay = 1 bit carry delay.

 Delay of k-bit adder with block size b:
– T = (b-1) + 0.5 + (k/b - 2) + (b-1)
– Terms in order: block 0 ripple, OR gate, skips, last block ripple.
 For equal sized blocks, optimal block size is
sqrt(k/2).
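
The optimal block size comes from minimizing the delay expression with respect to b, treating b as continuous. A short derivation using only the terms of the formula for T:

```latex
T(b) = 2(b-1) + 0.5 + \left(\frac{k}{b} - 2\right), \qquad
\frac{dT}{db} = 2 - \frac{k}{b^{2}} = 0
\;\Longrightarrow\; b = \sqrt{k/2}
```

For example, k = 32 gives b = 4 and T = 6 + 0.5 + 6 = 12.5 bit-carry delays.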

Carry-select adder
 Computes two results in parallel, each for

different carry input assumptions.
 Uses actual carry in to select correct result.
 Reduces delay to multiplexer.
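
A carry-select stage can be sketched in Verilog by computing the upper half twice and choosing with a multiplexer. The module name carry_select8 and the 4/4 split are illustrative assumptions.

```verilog
// Hypothetical 8-bit carry-select adder with one 4-bit select stage.
module carry_select8(a, b, carryin, sum, carryout);
  input [7:0] a, b;
  input carryin;
  output [7:0] sum;
  output carryout;
  wire c4;                 // actual carry out of the low half
  wire [3:0] s0, s1;       // high-half sums assuming carry in = 0 / 1
  wire c0, c1;             // high-half carry outs for each assumption

  assign {c4, sum[3:0]} = a[3:0] + b[3:0] + carryin;
  assign {c0, s0} = a[7:4] + b[7:4];          // assume carry in = 0
  assign {c1, s1} = a[7:4] + b[7:4] + 1'b1;   // assume carry in = 1

  assign sum[7:4] = c4 ? s1 : s0;  // multiplexer picks the correct result
  assign carryout = c4 ? c1 : c0;
endmodule
```

Both high-half additions run in parallel with the low half, so the high half only adds a multiplexer delay once c4 is known, at the cost of a duplicated adder.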

Carry-select structure

Carry-save adder
 Useful in multiplication.
 Input: 3 n-bit operands.
 Output: n-bit partial sum, n-bit carry.
– Use carry propagate adder for final sum.
 Operations:
– s = (x + y + z) mod 2.
– c = [(x + y + z) – s] / 2.
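
Applied bit by bit, the two operations are a full adder at each position with no carry ripple between positions. A minimal sketch, with a hypothetical module name (carry_save) and an assumed width of 8 bits:

```verilog
// Hypothetical n-bit carry-save adder: three operands in, sum and carry words out.
module carry_save(x, y, z, s, c);
  parameter n = 8;
  input [n-1:0] x, y, z;
  output [n-1:0] s, c;
  assign s = x ^ y ^ z;                    // per-bit (x + y + z) mod 2
  assign c = (x & y) | (x & z) | (y & z);  // per-bit carry, weight 2
endmodule
```

The three-operand total is recovered as s + 2c, which is the job of the final carry-propagate adder.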

FPGA delay model
 Xing/Yu---ripple-carry adder:
– n-stage adder divided into x blocks;
– each block has n/x stages;
– block k, 1<= k <= x.
– yk = # stages in block k.
 Delays:
– ripple-carry R(yk) = l1 + d yk, where l1 is a constant and d is the delay of a single stage;
– carry-generate G(yk) = l2 + d(yk-1)
– carry-terminate T(yk) = G(yk)

Carry-skip delay model
 Consider only inter-CLB delay.

 Delay dominated by interconnect:
– S(yk) = l3 + bl2
 Wire length l is proportional to the number
of carry-skip layers.

Adder comparison
 Ripple-carry adder has highest

performance/cost.
 Optimized adders are most effective in very
long bit widths (> 48 bits).

Plots of operational time (ns), cost (CLBs), and performance-cost ratio vs. bits for ripple, complete CLA, skip, and RC-select adders. © 1998 IEEE

Serial adder
 May be used in signal-processing arithmetic

where fast computation is important but
latency is unimportant.
 Data format (LSB first):
0 1 1 0
LSB

Serial adder structure
LSB control signal clears the carry shift

register:
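
A bit-serial adder along these lines can be sketched in Verilog as one full adder plus a carry flip-flop. The module and port names (serial_add, lsb) are hypothetical, and this sketch gates the stored carry at the LSB rather than clearing the register itself, which has the same effect on the sum.

```verilog
// Hypothetical bit-serial adder: operands arrive LSB first, one bit per clock.
// lsb is high during the first bit of each word.
module serial_add(clk, lsb, a, b, sum);
  input clk, lsb, a, b;
  output sum;
  reg carry;                       // carry saved from the previous bit
  wire cin = lsb ? 1'b0 : carry;   // LSB control discards the old carry
  assign sum = a ^ b ^ cin;
  always @(posedge clk)
    carry <= (a & b) | (a & cin) | (b & cin);
endmodule
```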

ALUs
 ALU computes a variety of logical and

arithmetic functions based on opcode.
 May offer complete set of functions of two
variables or a subset.
 ALU built around adder, since carry chain
determines delay.

ALU as multiplexer
 Compute functions then select desired one:
opcode
AND
OR
NOT
SUM

Verilog for ALU
`define PLUS 0
`define MINUS 1
`define AND 2
`define OR 3
`define NOT 4
module alu(fcode,op0,op1,result,oflo);
parameter n=16, flen=3;
input [flen-1:0] fcode;
input [n-1:0] op0, op1;
output [n-1:0] result;
output oflo;
assign
{oflo,result} =
(fcode == `PLUS) ? (op0 + op1) :
(fcode == `MINUS) ? (op0 - op1) :
(fcode == `AND) ? (op0 & op1) :
(fcode == `OR) ? (op0 | op1) :
(fcode == `NOT) ? {1'b0, ~op0} : 0; /* keep oflo clear for NOT */
endmodule

Topics
 Switch networks.
 Combinational testing.

Boolean functions and switches
pseudo-AND
pseudo-OR

Driving switch outputs
 If switch network output is not connected to

power supply through switch path, output
will float.
 Switch network inputs may be connected to
power supply or logic signals.

Switching logic signals
b’
a
b ab’ + a’b
a’

Switch multiplexer

Charge sharing
 Interior nodes in a switch network may not

be driven.
 Charge can accumulate on small parasitic
capacitances.
 Shared charge can produce erroneous output
values.

Charge division
 At undriven nodes, charge is divided

according to capacitance ratio.
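
The division follows from charge conservation. As a sketch, assume two nodes with capacitances C_1 and C_2, initially at voltages V_1 and V_2, are connected by a closed switch (the symbols are illustrative):

```latex
V_{\text{final}} = \frac{C_1 V_1 + C_2 V_2}{C_1 + C_2}
```

With equal capacitances, a node at logic 1 sharing with a node at logic 0 settles at 1/2, which is neither a valid 0 nor a valid 1.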

Charge sharing example
 Long chains of switches have intermediate

nodes which may be disconnected from
power supplies.
Cia Cab Cbc

Charge over time
time  i  Cia  a  Cab  b  Cbc  c  C
0     1  1    1  1    1  1    1  1
1     0  0    1  0    0  1    0  1
2     0  0    0  1/2  1  1/2  0  1
3     0  0    0  1/2  0  3/4  1  3/4
4     0  0    0  0    0  3/4  0  3/4
5     0  0    0  3/8  1  3/8  0  3/4

Avoiding charge sharing
 Make sure that for every input combination

there is a path from the power supply to the
output.

Manufacturing testing
 Errors are introduced during manufacturing.

 Testing verifies that chip corresponds to
design.
 Varieties of testing:
– functional testing;
– performance testing (binning chips by speed).
 Testing also weeds out infant mortality.

Testing and faults
 Fault model:
– possible locations of faults;
– I/O behavior produced by the fault.
 Good news: if we have a fault model, we
can test the network for every possible
instantiation of that type of fault.
 Bad news: it is difficult to enumerate all
types of manufacturing faults.
Stuck-at-0/1 faults
 Stuck-at-0/1: logic gate output is always

stuck at 0 or 1, independent of input values.
 Correspondence to manufacturing defects
depends on logic family.
 Experiments show that 100% stuck-at-0/1
fault coverage corresponds to high overall
fault coverage.

Testing procedure
 Testing procedure:
– set gate inputs;
– observe gate output;
– compare fault-free and observed gate output.
 Test vector: set of gate inputs applied to a
system.

Stuck-at faults in gates
NAND:
a b OK SA0 SA1
0 0 1  0   1
0 1 1  0   1
1 0 1  0   1
1 1 0  0   1
NOR:
a b OK SA0 SA1
0 0 1  0   1
0 1 0  0   1
1 0 0  0   1
1 1 0  0   1

Testing single gates
 Three ways to test NAND for stuck-at-0,

only one way to test it for stuck-at-1.
 Three ways to test NOR for stuck-at-1, only
one way to test it for stuck-at-0.

Testing combinational networks
 100% coverage: test every gate for

– stuck-at-0;
– stuck-at-1.
 Assume that there is only one faulty gate
per network.
 Most networks require more than one test
vector to test all gates.

Multiple test example

Example
 Can test both NANDs for stuck-at-0

simultaneously (abc = 000).
 Cannot test both NANDs for stuck-at-1
simultaneously due to inverter. Must use
two vectors.
 Must also test inverter.

Stuck-at-open/closed model
 Models transistors always on/off.

Stuck-open behavior
 If t1 is stuck open (switch cannot be closed),

there can be no path from VDD to output
capacitance.
 Testing requires two cycles:
– must discharge capacitor;
– try to operate t1 to see if capacitor can be
charged.

Delay fault
 Delay falls outside acceptable limits:

– gate delay fault assumes that all delays are
lumped into one gate;
– path delay fault models delay problems along
path through network.
 Delay problems reduce yield:
– performance problems;
– functional problems in some types of circuits.

Combinational network testing
Two parts to testing:

– controlling the inputs of (possibly interior)
gates;
– observing the outputs of (possibly interior)
gates.

Combinational testing example

Testing procedure
 Goal: test gate D for stuck-at-0 fault.

 First step: justify 0 values on gate inputs.
 Work backward from gate to primary
inputs:
– w1 = 0 (A output = 0);
– i1 = i2 = 1.

Testing procedure, cont’d
 Observe the fault at a primary output:

– o1 gives different values if D is true/faulty.
 Work forward and backward:
– F’s other input must be 0 to detect true/fault.
– Justify 0 at E’s output.
 In general, may have to propagate fault
through multiple levels of logic to primary
outputs.
Fault masking
Redundant logic can mask faults:

Redundancy example
 Testing NOR for SA0 requires setting both

inputs to 0.
 Network topology ensures that one NOR
input will always be 1.
 Function reduces to 0:
– f = ((ab)’ + b)’ = (a’ + b’ + b)’ = 0.
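
A small check of why the fault is masked. This sketch assumes the NOR's two inputs are the NAND output (ab)’ and b, which is one reading of the figure; the function names are hypothetical.

```python
from itertools import product

def network(a, b, nor_stuck_at=None):
    """f = NOR(NAND(a, b), b); nor_stuck_at forces the NOR output to 0 or 1."""
    if nor_stuck_at is not None:
        return nor_stuck_at
    nand_out = 1 - (a & b)
    return 1 - (nand_out | b)

def detecting_vectors(stuck_value):
    """Inputs for which the faulty network differs from the good one."""
    return [(a, b) for a, b in product((0, 1), repeat=2)
            if network(a, b) != network(a, b, nor_stuck_at=stuck_value)]

print("SA0 tests:", detecting_vectors(0))
print("SA1 tests:", detecting_vectors(1))
```

Under that assumption no input vector exposes the stuck-at-0 fault, because the fault-free output is already 0 everywhere.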

Redundancies and testing
 Redundant logic cannot be controlled.

 Observations requiring control of redundant
logic may not be possible.
 Logic should be minimized to eliminate
redundancy. Redundancies can introduce
delay faults and other problems.

Topics
 Multipliers.

Elementary school algorithm
     0110   multiplicand
x    1001   multiplier
     0110
+   0000    partial product
    00110
+  0000
   000110
+ 0110
  0110110
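
The same procedure as a minimal sketch: each multiplier bit selects either the multiplicand or zero, shifted to that bit's position. The function name and the width parameter are illustrative choices, not part of the slides.

```python
def shift_add_multiply(multiplicand, multiplier, width=4):
    """Unsigned multiply by accumulating shifted partial products."""
    product = 0
    for i in range(width):
        bit = (multiplier >> i) & 1
        partial = multiplicand if bit else 0   # one row of the hand method
        product += partial << i                # shift into position and add
    return product

print(format(shift_add_multiply(0b0110, 0b1001), "07b"))
```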
Word serial multiplier
register

Combinational multiplier
Uses n-1 adders, eliminates registers:

Array multiplier
 Array multiplier is an efficient layout of a

combinational multiplier.
 Array multipliers may be pipelined to
decrease clock period at the expense of
latency.

Array multiplier organization
[Figure: the partial products of 0110 x 1001 = 0110110 drawn as an array, with multiplicand, multiplier, and product labeled; the array is skewed for a rectangular layout.]
Unsigned array multiplier
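[Figure: unsigned array multiplier. The first row forms x2y0, x1y0, x0y0; later rows of adders add in x1y1, x0y1, x1y2, x0y2, ... up to xn-1yn-1; the outputs are the product bits P0 through P(2n-1).]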
Unsigned array multiplier, cont’d

Array multiplier critical path

Verilog for multiplier row
module multrow(part,x,ym,yo,cin,s,cout);
/* A row of one-bit multiplies */
input [2:0] part;
input [3:0] x;
input ym, yo;
input [2:0] cin;
output [2:0] s;
output [2:0] cout;
assign {cout[0],s[0]} = part[1] + (x[0] & ym) + cin[0];

assign {cout[1],s[1]} = part[2] + (x[1] & ym) + cin[1];
assign {cout[2],s[2]} = (x[3] & yo) + (x[2] & ym) + cin[2];
endmodule

Verilog for last multiplier row
module lastrow(part,cin,s,cout);
/* Last row of adders with full carry chain. */
input [2:0] part;
input [2:0] cin;
output [2:0] s;
output cout;
wire [1:0] carry;
assign {carry[0],s[0]} = part[0] + cin[0];

assign {carry[1],s[1]} = part[1] + cin[1] + carry[0];
assign {cout,s[2]} = part[2] + cin[2] + carry[1];
endmodule

Verilog for multiplier
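module array_mult(x,y,p);
input [3:0] x;
input [3:0] y;
output [7:0] p;
wire [2:0] row0, row1, row2, row3, c0, c1, c2, c3;
/* generate first row of products */
assign row0[2] = x[2] & y[0];
assign row0[1] = x[1] & y[0];
assign row0[0] = x[0] & y[0];
assign p[0] = row0[0];
assign c0 = 3'b000;
/* each multrow adds the next row of partial products */
multrow p0(row0,x,y[1],y[0],c0,row1,c1);
assign p[1] = row1[0];
multrow p1(row1,x,y[2],y[1],c1,row2,c2);
assign p[2] = row2[0];
multrow p2(row2,x,y[3],y[2],c2,row3,c3);
assign p[3] = row3[0];
/* last row resolves the remaining carries */
lastrow l({x[3] & y[3],row3[2:1]},c3,p[6:4],p[7]);
endmodule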

Baugh-Wooley multiplier
 Algorithm for two’s-complement

multiplication.
 Adjusts partial products to maximize
regularity of multiplication array.
 Moves partial products with negative signs
to the last steps; also adds negation of
partial products rather than subtracts.

Booth multiplier
 Encoding scheme to reduce number of

stages in multiplication.
 Performs two bits of multiplication at
once—requires half the stages.
 Each stage is slightly more complex than
simple multiplier, but adder/subtracter is
almost as small/fast as adder.

Booth encoding
 Two’s-complement form of multiplier:
– $y = -2^n y_n + 2^{n-1} y_{n-1} + 2^{n-2} y_{n-2} + \dots$
 Rewrite using $2^a = 2^{a+1} - 2^a$:
– $y = 2^n (y_{n-1} - y_n) + 2^{n-1} (y_{n-2} - y_{n-1}) + 2^{n-2} (y_{n-3} - y_{n-2}) + \dots$
 Consider first two terms: by looking at three
bits of y, we can determine whether to
add/subtract x, 2x to partial product.

Booth actions
yi yi-1 yi-2 | increment
000          | 0
001          | x
010          | x
011          | 2x
100          | -2x
101          | -x
110          | -x
111          | 0

Booth example
 x = 011001 ($25_{10}$), y = 101110 ($-18_{10}$).
 $y_1 y_0 y_{-1}$ = 100, P1 = P0 - (10 × 011001) = 11111001110.
 $y_3 y_2 y_1$ = 111, P2 = P1 + 0 = 11111001110.
 $y_5 y_4 y_3$ = 101, P3 = P2 - 0110010000 = 11000111110.
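
The example can be replayed with a short sketch of two-bit Booth recoding. The table is the Booth actions table; the function name, the step list, and the 11-bit mask used for printing are illustrative assumptions.

```python
# (high bit, middle bit, low bit) of each three-bit window -> multiple of x
BOOTH_ACTIONS = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

def booth_multiply(x, y, n_bits):
    """Multiply x by the n_bits-wide two's-complement y, two bits per stage."""
    assert n_bits % 2 == 0
    y_bits = y & ((1 << n_bits) - 1)

    def bit(i):
        return 0 if i < 0 else (y_bits >> i) & 1   # y(-1) is taken as 0

    product, steps = 0, []
    for i in range(0, n_bits, 2):
        window = (bit(i + 1), bit(i), bit(i - 1))
        product += (BOOTH_ACTIONS[window] * x) << i
        steps.append((window, product))
    return product, steps

product, steps = booth_multiply(25, -18, 6)
for window, partial in steps:
    print(window, format(partial & 0x7FF, "011b"))
```

Three stages handle the six-bit multiplier, and the final partial product is -450 = 25 × (-18).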

Booth structure

Wallace tree
 Reduces depth of adder chain.

 Built from carry-save adders:
– three inputs a, b, c
– produces two outputs y, z such that y + z = a + b + c
 Carry-save equations:
– yi = parity(ai,bi,ci)
– zi+1 = majority(ai,bi,ci) (the carry has twice the weight)
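
A minimal sketch of one carry-save adder on whole words, assuming non-negative integers. The majority bits are carries, so they are shifted left one position before being returned; the function name is hypothetical.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative integers to two with the same total."""
    y = a ^ b ^ c                              # per-bit parity (sum bits)
    majority = (a & b) | (a & c) | (b & c)     # per-bit majority (carry bits)
    z = majority << 1                          # carries have twice the weight
    return y, z

y, z = carry_save_add(0b0110, 0b0011, 0b0101)
print(y, z, y + z)
```

No carry propagates inside the adder, which is why a tree of these has small depth; only the final adder needs a carry chain.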
Wallace tree structure

Wallace tree operation
 At each stage, i numbers are combined to

form ceil(2i/3) sums.
 Final adder completes the summation.
 Wiring is more complex.
 Can build a Booth-encoded Wallace tree
multiplier.

Serial-parallel multiplier
 Used in serial-arithmetic operations.

 Multiplicand can be held in place by
register.
 Multiplier is shifted into array.

Serial-parallel multiplier
structure

Topics
 Logic synthesis.
 Placement and routing.

Logic optimization
 Logic synthesis programs transform

Boolean expressions into logic gate
networks in a particular library.
 Optimization goals: minimize area, meet
delay constraint; some power optimizations.

Syntax-directed translation
 Translate HDL into logic directly.

– ab + ac
 Generally requires optimization.

Macros
 Pre-designed logic.
– Generally identified by language features.
– + operator.
– xxx()
 Hard macro: includes placement.
 Soft macro: no placement.

Logic synthesis phases
 Technology-independent optimizations
work on logic representations that do not
directly model logic gates.
 Technology-dependent optimizations work
in the available set of logic gates.
 Transformation from technology-
independent to technology-dependent is
called library binding.
Technology-independent
optimizations
 Works on Boolean expression equivalent.

 Estimates size based on number of literals.
 Uses factorization, resubstitution,
minimization, etc. to optimize logic.
 Technology-independent phase uses simple
delay models.

Technology-dependent
optimizations
 Maps Boolean expressions into a particular

cell library.
 Mapping may take into account area, delay.
 May perform some optimizations in
addition to simple mapping.
 Allows more accurate delay models.

Boolean network
 A Boolean network is the main

representation of the logic functions for
technology independent optimizations.
 Each node can be represented as sum-of-
products (or product-of-sums).
 Provides multi-level structure, but functions
in the network need not correspond to logic
gates.
Boolean network example
primary outputs
out1 = k2 + x2’ out2 = k3 + x1
k2 = x1’ x2 x4 + k1
k3 = k1 x4’
k1 = x2 + x3
primary inputs
x1 x2 x3 x4
Terms
 Support: set of variables used by a function.

 Transitive fanout: all the primary outputs
and intermediate variables that depend on a
function, directly or indirectly.
 Transitive fanin: all the primary inputs and
intermediate variables used by a function,
directly or indirectly.
 Transitive fanin determines a cone of logic
between the primary inputs and an output.
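
As an illustration, the sketch below computes transitive fanin on the Boolean network example. The dictionary records only each node's support (which variables it reads); the representation and function name are assumptions made for this sketch.

```python
# node -> support (variables that appear in its expression)
NETWORK = {
    "out1": {"k2", "x2"},
    "out2": {"k3", "x1"},
    "k2": {"x1", "x2", "x4", "k1"},
    "k3": {"k1", "x4"},
    "k1": {"x2", "x3"},
}

def transitive_fanin(node, network=NETWORK):
    """All primary inputs and intermediate variables that node depends on."""
    seen, stack = set(), [node]
    while stack:
        for src in network.get(stack.pop(), ()):   # primary inputs have no entry
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

print(sorted(transitive_fanin("k3")))
```

The support of k3 is just {k1, x4}, while its transitive fanin also reaches x2 and x3 through k1.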

Technology-independent logic
optimization
 Simplification rewrites node to simplify its

form.
 Network restructuring introduces new nodes
for common factors, collapses several nodes
into one new node.
 Delay restructuring changes factorization to
reduce path length.

Cost in the Boolean network
 Don’t know exact gate structure, but can

estimate final network cost:
– area estimated by number of literals (true or
complement forms of variables);
– delay estimated by path length.

Partially-specified functions
 Don’t-cares can be implemented in either

the on-set or off-set.
 Don’t-cares provide the greatest
opportunities for minimization in many
cases.

Partially-specified function
example
x2
1 don’t care
x1
0
1
x3
Don’t-cares in Boolean networks
 In two-level function, don’t-cares are

defined at primary output.
 In Boolean network, structure of network
itself introduces don’t-cares.
 Types of structural don’t-cares:
– satisfiability;
– observability.

Satisfiability don’t-cares
 Occur when an intermediate variable value

is inconsistent with its function inputs.
Since this can’t happen, we don’t care.
– Example: g = ab drives y, and f = yc.
– y == g, so a = b = 0, f = 1 can’t happen.
– Don’t-care for f: y’g + yg’.
Observability don’t-cares
 Occur when an intermediate variable’s

value doesn’t affect the network primary
outputs.
a If a=1, then b is don’t-care
x

Optimizations
 Simplification.
– Changing the way a function is represented.
 Network restructuring.
– Adding and removing nodes.
 Delay restructuring.
– Optimizations that reduce the height of critical
paths.

Cube representation
 On-set, off-set, don’t-care set, cover:

x2
x1
x3
Espresso example
x2
x1
x3

Partial collapsing
f1 f4 F f4
f2 f3 f3
before after

Factorization
 Based on division:
– formulate candidate divisor;
– test how it divides into the function;
– if g = f/c, we can use c as an intermediate
function for f.
 Algebraic division: doesn’t take into account
Boolean simplification. Less expensive than
Boolean division.
Factorization using division
 Three steps:
– generate potential common factors and compute
literal savings if used;
– choose factors to substitute into network;
– restructure the network to use the new factors.
 Algebraic/Boolean division can be used to
implement first step.
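
A minimal sketch of algebraic (weak) division, assuming a sum-of-products function is stored as a set of cubes and each cube as a frozenset of literal names. The helper names cube and algebraic_divide are hypothetical, and no Boolean simplification is attempted.

```python
def cube(text):
    """Build a cube from space-separated literals, e.g. cube("a c")."""
    return frozenset(text.split())

def algebraic_divide(f, d):
    """Return (quotient, remainder) such that f = quotient * d + remainder."""
    quotient = None
    for d_cube in d:
        # What is left of each cube of f after removing this divisor cube.
        candidates = {f_cube - d_cube for f_cube in f if d_cube <= f_cube}
        quotient = candidates if quotient is None else quotient & candidates
    quotient = quotient or set()
    covered = {q | d_cube for q in quotient for d_cube in d}
    return quotient, set(f) - covered

f = {cube("a c"), cube("a d"), cube("b c"), cube("b d"), cube("e")}
d = {cube("a"), cube("b")}
quotient, remainder = algebraic_divide(f, d)
print(quotient, remainder)
```

Here f = ac + ad + bc + bd + e divides by a + b to give (a + b)(c + d) + e, cutting the literal count from nine to five.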

Technology mapping
 Cover the function:

FPGA tech mapping
 Cost (number of
inputs) doesn’t always
increase with added
functions:

FPGAs vs. custom logic
 Cost metric for static gates is literal:

– ax + bx’ has four literals, requires 8 transistors.
 Cost metric for FPGAs is logic element:
– All functions that fit in an LE have the same
cost.

LUT-based logic synthesis
 Find the largest logic cone that will fit into

the LUT:
r = q + s’
q = g’ + h s = d’
d=a+b

Placement and routing
 Two critical phases of layout design:

– placement of components on the chip;
– routing of wires between components.
 Placement and routing interact, but
separating layout design into phases helps
us understand the problem and find good
solutions.

Placement metrics
 Quality metrics for layout:

– area;
– delay.
 Area and delay determined partly by wiring.
 How do we judge a placement without
wiring? Estimate wire length without
actually performing routing.
 Design time may be important for FPGAs
Wire length as a quality metric
bad placement good placement

Wire length measures
 Estimate wire length by distance between components.
 Possible distance measures:
– Euclidean distance ($\sqrt{x^2 + y^2}$);
– Manhattan distance ($x + y$).
 Multi-point nets must be broken up into trees for good estimates.
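
A sketch of the two distance measures, plus one way to break a multi-point net into a tree. The spanning tree (Prim's algorithm) is a stand-in chosen for this sketch, not necessarily the estimator the slides have in mind; a tree with Steiner points can be shorter.

```python
import math

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def spanning_tree_length(pins, distance=manhattan):
    """Length of a minimum spanning tree over the pins (Prim's algorithm)."""
    if not pins:
        return 0
    in_tree, total = {0}, 0
    while len(in_tree) < len(pins):
        d, j = min((distance(pins[i], pins[j]), j)
                   for i in in_tree
                   for j in range(len(pins)) if j not in in_tree)
        total += d
        in_tree.add(j)
    return total

print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
print(spanning_tree_length([(0, 0), (4, 0), (2, 3)]))
```

For the three-pin net the spanning tree has Manhattan length 9, while a tree through a Steiner point at (2, 0) needs only 7.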
Wiring trees
Steiner point

Placement techniques
 Can construct an initial solution, improve an

existing solution.
 Pairwise interchange is a simple
improvement metric:
– Interchange a pair, keep the swap if it helps
wire length.
– Heuristic determines which two components to
swap.
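
A minimal sketch of pairwise interchange on two-point nets. The choice of which pair to try is the heuristic; this sketch simply tries every pair in sorted order, which is an assumption made for brevity.

```python
from itertools import combinations

def wire_length(placement, nets):
    """Total Manhattan length; placement maps component name -> (x, y)."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)

def pairwise_interchange(placement, nets):
    """Swap pairs of components, keeping a swap only if wire length drops."""
    placement = dict(placement)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(sorted(placement), 2):
            before = wire_length(placement, nets)
            placement[a], placement[b] = placement[b], placement[a]
            if wire_length(placement, nets) < before:
                improved = True
            else:  # undo a swap that did not help
                placement[a], placement[b] = placement[b], placement[a]
    return placement

start = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (3, 0)}
nets = [("A", "C"), ("B", "D")]
final = pairwise_interchange(start, nets)
print(wire_length(start, nets), wire_length(final, nets))
```

Because a swap is kept only when it strictly shortens the wiring, the loop terminates, but it can stop in a local minimum.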

Placement by partitioning
 Works well for components of fairly

uniform size.
 Partition netlist to minimize total wire
length using min-cut criterion.
 Partitioning may be interpreted as 1-D or 2-
D layout.

Recursive partitioning

Min-cut bisecting partitioning
1 net B
A
3 nets
C
D
partition 1 partition 2
Min-cut bisecting partitioning,
cont’d
 Swapping A and B:
– B drags 1 net;
– A drags 3 nets;
– total cut increase: 3 nets.
 Conclusion: probably not a good swap, but
must be compared with other pairs.
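
Evaluating a candidate swap amounts to counting cut nets before and after. The sketch below uses a made-up six-component netlist, not the one in the figure, and the function names are hypothetical.

```python
def cut_size(nets, side):
    """Number of nets with pins in both partitions; side maps name -> 0 or 1."""
    return sum(1 for net in nets if len({side[c] for c in net}) == 2)

def swap_gain(nets, side, a, b):
    """Reduction in cut size if components a and b trade partitions."""
    swapped = dict(side)
    swapped[a], swapped[b] = side[b], side[a]
    return cut_size(nets, side) - cut_size(nets, swapped)

# Hypothetical netlist: A, C, E start in partition 0; B, D, F in partition 1.
side = {"A": 0, "C": 0, "E": 0, "B": 1, "D": 1, "F": 1}
nets = [("A", "B"), ("A", "D"), ("C", "E"), ("F", "C"), ("F", "E")]

print(cut_size(nets, side))
print(swap_gain(nets, side, "A", "F"), swap_gain(nets, side, "A", "B"))
```

A positive gain means fewer nets cross the cut after the swap; every pair has to be compared before choosing one.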

Kernighan-Lin algorithm
 Compute min cut criterion:

– count total net cut change.
 Algorithm exchanges sets of nodes to
perform hill-climbing—finding
improvements where no single swap will
improve the cut.
 Recursively subdivide to determine
placement detail.
Simulated annealing
 Powerful but CPU-intensive optimization

technique.
 Analogy to annealing of metals:
– temperature determines probability of a
component jumping position;
– probabilistically accept moves.
– start at high temperature, cool to lower
temperature to try to reach good placement.

Routing
 Major phases in routing:

– global routing assigns nets to routing areas;
– detailed routing designs the routing areas.
 Net ordering is a major problem. Order in
which nets are routed determines quality of
result. Net ordering is a heuristic.

Global routing
 Choose a sequence of channels.

– Not tracks within a channel.
 Must take capacity into account.
 Channel graph allows path algorithms to be
used for global routing.

Channel graph
switch switch switch
box channel channel box
box
channel LE channel LE channel

channel box channel box
box
channel LE channel LE channel

box channel box channel box

Maze routing
 Will find shortest path for a single wire, if

such a path exists.
 Two phases:
– Label nodes with distance, radiating from
source.
– Use distances to trace from sink to source,
choosing a path that always decreases distance
to source.
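
The two phases can be sketched on a small grid. The grid encoding ('#' for a blocked cell) and the function name are assumptions for illustration; a real router works on the channel graph or routing fabric.

```python
from collections import deque

def maze_route(grid, source, sink):
    """Shortest path from source to sink as a list of (row, col), or None."""
    rows, cols = len(grid), len(grid[0])

    def neighbors(r, c):
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                yield nr, nc

    # Phase 1: label nodes with distance, radiating from the source.
    dist, frontier = {source: 0}, deque([source])
    while frontier:
        node = frontier.popleft()
        for nb in neighbors(*node):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                frontier.append(nb)
    if sink not in dist:
        return None

    # Phase 2: trace from sink to source along decreasing distance.
    path = [sink]
    while path[-1] != source:
        here = path[-1]
        path.append(next(nb for nb in neighbors(*here)
                         if dist.get(nb) == dist[here] - 1))
    path.reverse()
    return path

GRID = ["....",
        ".##.",
        "...."]
print(maze_route(GRID, (1, 0), (1, 3)))
```

The labeling phase visits every reachable cell, which is why the method always finds a shortest path when one exists.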

Lee (wavefront) router

FPGA issues
 Often want a fast answer. May be willing to

accept lower quality result for less
place/route time.
 May be interested in knowing wirability
without needing the final configuration.
 Fast placement: constructive placement,
iterative improvement through simulated
annealing.
FPGA routing
 Finding a route into given interconnection

network.
 Global routing assigns to channels.
 Local routing selects the programming
points used to make the connections.

FPGA routing techniques
 Nair: route based on congestion, not

distance. Route in two passes:
– Estimate congestion.
– Final routing.
 Triptych: more gradual penalty for
congestion.

Topics
 16 x 16 multiplier example.

The FPGA design process
 Translation from HDL.

– (synthesis, translation)
 Logic synthesis.
– (mapping)
 Placement and routing.
– (place and route)
 Configuration generation.
– (program file generation)

Design experiments
 Synthesize with no constraints.

 Synthesize with timing constraint.
– Tighten timing constraint.
 Synthesize with placement constraints.

Post-translation simulation model
 HDL model in terms of FPGA primitives.
 Example:
X_LUT4 \p12_Madd__n0015_Mxor_Result_Xo<1>1 (
.ADR0(x_7_IBUF),
.ADR1(y_13_IBUF),
.ADR2(c12[7]),
.ADR3(row12[8]),
.O(row13[7])
);

Mapping report
Design Summary
--------------
Number of errors: 0
Number of warnings: 0
Logic Utilization:
Number of 4 input LUTs: 501 out of 1,024 48%
Logic Distribution:
Number of occupied Slices: 255 out of 512 49%
Number of Slices containing only related logic: 255 out of 255 100%
Number of Slices containing unrelated logic: 0 out of 255 0%
*See NOTES below for an explanation of the effects of unrelated logic
Total Number 4 input LUTs: 501 out of 1,024 48%
Number of bonded IOBs: 64 out of 92 69%
Total equivalent gate count for design: 3,006

Additional JTAG gate count for IOBs: 3,072
Peak Memory Usage: 64 MB

Static timing analysis report
Timing constraint: TS_P2P = MAXDELAY FROM TIMEGRP "PADS" TO TIMEGRP "PADS" 99.999 uS ;
20135312 items analyzed, 0 timing errors detected. (0 setup errors, 0 hold errors)
Maximum delay is 20.916ns.
--------------------------------------------------------------------------------

Static timing report: delays along
paths
Data Sheet report:
-----------------
All values displayed in nanoseconds (ns)
Pad to Pad
------------------+----------------------+-----------+
Source Pad |Destination Pad| Delay |
------------------+----------------------+-----------+
x<0> |p<0> | 5.824|
x<0> |p<10> | 10.675|
x<0> |p<11> | 11.214|
x<0> |p<12> | 11.753|

Routing report
Phase 1: 1975 unrouted; REAL time: 11 secs
Phase 4: 619 unrouted; (0) REAL time: 12 secs

The NUMBER OF SIGNALS NOT COMPLETELY ROUTED for this design is:
0

Static timing after routing

"PADS" 99.999 uS ;

---------------------------------------------------------------
-----------------

Timing constraint
 Use timing constraint

editor:

Post-map static timing report

FROM TIMEGRP "PADS" TO
TIMEGRP "PADS" 32 nS ;


Post-routing static timing report

FROM TIMEGRP "PADS" TO
TIMEGRP "PADS" 32 nS ;


Tighter timing constraints
 Tighten requirement to 25 ns.

 Post-place-route timing report:
"PADS" 25 nS ;

Report on a violated path
Slack: -6.128ns (requirement - data path)
Source: y<0> (PAD)
Destination: p<30> (PAD)
Requirement: 25.000ns
Data Path Delay: 31.128ns (Levels of Logic = 31)
Data Path: y<0> to p<30>

Location Delay type Delay(ns) Physical Resource
Logical Resource(s)
------------------------------------------------- -------------------
K5.I Tiopi 0.825 y<0>
y<0>
y_0_IBUF
SLICE_X2Y11.G4 net (fanout=31) 1.792 y_0_IBUF
SLICE_X2Y11.Y Tilo 0.439 c2<5>
p0_Madd__n0017_Mxor_Result_Xo<1>1
SLICE_X2Y11.F4 net (fanout=2) 0.304 row1<6>
SLICE_X2Y11.X Tilo 0.439 c2<5>
p1_Madd__n0019_Cout1
SLICE_X5Y16.F3 net (fanout=2) 0.784 c2<5>
SLICE_X5Y16.X Tilo 0.439 c3<5>
p2_Madd__n0019_Cout1
SLICE_X2Y18.G4 net (fanout=2) 0.668 c3<5>
SLICE_X2Y18.Y Tilo 0.439 row5<4>
p3_Madd__n0019_Mxor_Result_Xo<1>1

Power report
Power summary: I(mA) P(mW)
----------------------------------------------------------------
Total estimated power consumption: 333
---
Vccint 1.50V: 0 0
Vccaux 3.30V: 100 330
Vcco33 3.30V: 1 3
---
Inputs: 0 0
Logic: 0 0
Outputs:
Vcco33 0 0
Signals: 0 0
---
Quiescent Vccaux 3.30V: 100 330
Quiescent Vcco33 3.30V: 1 3
Thermal summary:
----------------------------------------------------------------
Estimated junction temperature: 36C
Ambient temp: 25C
Case temp: 35C
Theta J-A: 34C/W

Power report: decoupling
capacitance
Decoupling Network Summary: Cap Range (uF) #
----------------------------------------------------------------
Capacitor Recommendations:
Total for Vccint : 4
470.0 - 1000.0 : 1
0.0100 - 0.0470 : 1
0.0010 - 0.0047 : 2
---
Total for Vccaux : 4
470.0 - 1000.0 : 1
0.0100 - 0.0470 : 1
0.0010 - 0.0047 : 2
---
Total for Vcco33 : 10
470.0 - 1000.0 : 1
0.470 - 2.200 : 1
0.0470 - 0.2200 : 2
0.0100 - 0.0470 : 3
0.0010 - 0.0047 : 3

Improving area
 Floorplanner window:
Chip
floorplan
LEs

Rat’s nest wiring

Routing editor view

Adding placement constraints
 Must add attributes to the Verilog:

// synthesis attribute rloc of p0 is X0Y0
multrow
p0(row0,x,y[1],y[0],c0,row1,c1);

Editing constraints
 Use constraints editor

to place constraints:

Design browser pane

Drag and drop constraints

Change the shape of constraints

Full set of placement constraints

Placement results

New timing report
 After placement constraints:

 Compares to 31 ns for unconstrained
placement.

Detailed routing constraints

Topics
 Basics of sequential machines.

 Sequential machine specification.
 Sequential machine design processes.

Sequential machines
 Use registers to make primary output

values depend on state + primary inputs.
 Varieties:
– Mealy—outputs function of present state,
inputs;
– Moore—outputs depend only on state.

FSM structure

Constraints on structure
 No combinational cycles.
 All components must have bounded delay.

Synchronous design
 Controlled by clock(s).
– State changes at time determined by the clock.
– Inputs to registers settle in time for state change.
– Primary inputs settle in time for combinational delay
through logic.
 Machine state is determined solely by registers.
– Don’t have to worry about timing constraints, events
outside the registers.

Non-functional requirements and
optimization
 Performance:
– Clock period is determined by combinational logic
delay.
 Area:
– Combinational logic size usually dominates area.
 Energy/power:
– Often dominated by combinational logic.
– May be improved by latching values.

Models of state machines
 Register-transfer:
– Combinational equations for inputs to registers.
 State transition graph/table:
– Next-state, output functions described
piecewise.

State transition graph
 Each transition describes part of the next-

state, output functions:
0/010 S2
S1
1/1-0 S3

Register-transfer structure
 Registers fed by combinational logic:
D Q D Q
D Q Combinational D Q
logic
D Q D Q

Block diagram
 Purely structural description:
A B1
B2

Symbolic values
 A sequential machine description may use

symbolic, not binary values.
– Symbolic values must be encoded during
implementation.
 Encoding may optimize implementation
characteristics:
– Area.
– Performance.
– Energy.

STG vs. register-transfer
 Each representation is easier for some types

of machines.
 Example: counter.

Counter state transition graph
 Cyclic structure:
1/1 1/2 1/7

0 1 6 7
…
1/0

Counter register-transfer function
 Specify using addition:

– Next_count = count + 1.
 Regular structure of logic.

Example: 01 string recognizer
 Recognize 01 sequence in input string:
input  output
0      0
0      0
1      1
1      0
0      0
1      1
Recognizer state transition graph
1/0 0/0
0/0
Bit 1 Bit 2
1/1

Mealy vs. Moore machine
 Moore machine:
– Output a function of state.
 Mealy machine:
– Output a function of primary inputs + state.

Sequential machine definition
 Machine computes next state N, primary

outputs O from current state S, primary
inputs I.
 Next-state function:
– N = δ(I,S).
 Output function (Mealy):
– O = λ(I,S).
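
A minimal sketch of the definition, using the 01 recognizer. One table plays the role of both functions: it maps (state, input) to (next state, output). The transitions are one reading of the recognizer's state transition graph, and the state names bit1 and bit2 are illustrative.

```python
# (current state, input) -> (next state, output)
RECOGNIZER = {
    ("bit1", 0): ("bit2", 0),   # saw a 0: wait for the 1
    ("bit1", 1): ("bit1", 0),
    ("bit2", 0): ("bit2", 0),
    ("bit2", 1): ("bit1", 1),   # 0 followed by 1: recognized
}

def run_mealy(table, start, inputs):
    """Apply the next-state and output functions once per input symbol."""
    state, outputs = start, []
    for symbol in inputs:
        state, out = table[(state, symbol)]
        outputs.append(out)
    return outputs

print(run_mealy(RECOGNIZER, "bit1", [0, 0, 1, 1, 0, 1]))
```

Fed the input string from the recognizer example, the machine raises its output on the third and sixth inputs.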

Reachability
 State is reachable if there is a path from

given state.
 Unreachable states may be created by state encoding:
s0 s1
s2 s3

Homing sequence
 Sequence of inputs that drives the machine

to a given state.
s0 s1
s2

Equivalent states
 States are equivalent if they cannot be

distinguished by any input sequence:
0/0
s1 s2
-/1
1/0 -/0
-/0
s3 s4

Networks of FSMs
 Functions can be built up from

interconnected FSMs:
I1 x I2
M1 y M2
O1 O2
Internal connections
External connections

Illegal composition of Mealy
machines
Combinational
Combinational
logic
logic
D
Q
Q
D

Communicating FSM states
0/0
s1 s3
-/1 1/1
1/0 -/0
s2 s4 0/0
M1 M2
Product machine
 Two connected machines:
i1 o1 i2 o2
R S

Component STGs

Behavior of connected machines
i1 o1 i2 o2
R S
0 R1 0 S1 1
0 R2 1 S2 0
0 R3 0 S1 0
0 R3 0 S1 0
Forming product machine
 Form Cartesian product of states:

– R1S1, R1S2, R2S1, R2S2, R3S1, R3S2.
 For each product state, determine the
combined behavior of each product
transition:
– Required inputs.
– Produced output.
– Next product state.

State assignment
 Find a binary code for symbolic values in

machine.
– Optimize area, performance.
– May be done on inputs, outputs as well.

Optimizing state assignments
 Codes affect the next-state, output logic.

– Compute conditions based on state.
 Best code depends on the input, output logic
and its interaction with state computations.

Encoding a shift register
 Symbolic state transition table for shift register:
input  present state  next state  output
0      S00            S00         0
1      S00            S10         0
0      S01            S00         1
1      S01            S10         1
0      S10            S01         0
1      S10            S11         0
0      S11            S01         1
1      S11            S11         1
Bad encoding
 Let S00 = 00, S01 = 01, S10 = 11, S11 = 10.

 Logic:
– Output = S1 S0’ + S1’ S0
– N1 = I
– N0 = I S1’ + I’ S1

Good encoding
 Let S00 = 00, S01 = 01, S10 = 10, S11 = 11.

 Logic:
– Output = S0
– N1 = I
– N0 = S1
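
The claim can be checked against the symbolic transition table. In this sketch the table rows are (input, present state, next state, output); the helper names and the second, swapped code are hypothetical and serve only to show that the same logic fails under a different assignment.

```python
TABLE = [  # (input, present state, next state, output)
    (0, "S00", "S00", 0), (1, "S00", "S10", 0),
    (0, "S01", "S00", 1), (1, "S01", "S10", 1),
    (0, "S10", "S01", 0), (1, "S10", "S11", 0),
    (0, "S11", "S01", 1), (1, "S11", "S11", 1),
]
GOOD_CODE = {"S00": (0, 0), "S01": (0, 1), "S10": (1, 0), "S11": (1, 1)}
SWAPPED_CODE = {"S00": (0, 0), "S01": (0, 1), "S10": (1, 1), "S11": (1, 0)}

def good_next(i, s1, s0):
    return (i, s1)          # N1 = I, N0 = S1

def good_output(i, s1, s0):
    return s0               # Output = S0

def logic_matches(code, next_logic, output_logic):
    """True if the logic reproduces every row of the table under this code."""
    for i, present, nxt, out in TABLE:
        s1, s0 = code[present]
        if next_logic(i, s1, s0) != code[nxt] or output_logic(i, s1, s0) != out:
            return False
    return True

print(logic_matches(GOOD_CODE, good_next, good_output))
print(logic_matches(SWAPPED_CODE, good_next, good_output))
```

With the good code the next state is just a shift of the state bits; swapping two codes breaks that, so more logic is needed.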

One-hot code
 N-state machine has n-bit encoding.

 Ith bit is 1 if machine is in state i.
 Comparison:
– Easy to tell what state the machine is in.
– Easy to get the machine into an illegal state
(0000, 1111, etc.).
– Uses a lot of registers.

Common factors in state coding
 Consider this set of transitions:

– 0, s1 OR s2 -> s3, 1
 Want to choose a code that easily produces
s1 OR s2.
– S1 = 00, S2 = 01.

State codes in n-space
1
s2 code = 110
s1 code = 111
0 1

State codes and delay

Topics
 Verilog styles for sequential machines.

 Flip-flops and latches.

Verilog always statement
 Use always to wait for clock edge:

always @(posedge clock) // start execution at the clock edge
begin
// insert combinational logic here
end

Verilog state machine
always @(posedge clock) // start execution at the clock edge
begin
  if (rst == 1)
    begin
      // reset code
    end
  else begin // state machine
    case (state)
      `state0: begin
        o1 = 0;
        state = `state1;
      end
      `state1: begin
        if (i1) o1 = 1; else o1 = 0;
        state = `state0;
      end
    endcase
  end // state machine
end
Traffic light controller
 Intersection of two roads:

– highway (busy);
– farm (not busy).
 Want to give green light to highway as
much as possible.
 Want to give green to farm when needed.
 Must always have at least one red light.

Traffic light system
traffic farm road

light
highway
sensor

System operation
 Sensor on farm road indicates when cars on

farm road are waiting for green light.
 Must obey required lengths for green,
yellow lights.

Traffic light machine
 Build controller out of two machines:

– sequencer which sets colors of lights, etc.
– timer which is used to control durations of
lights.
 Separate counter isolates logical design
from clock period.
 Separate counter greatly reduces number of
states in sequencer.
Sequencer state transition graph
(cars & long)’ / 0 green red
hwy-
green cars & long / 1 green red
short/ 1 red yellow
short’ / hwy-
farm- short’ /
0 red yellow yellow yellow 0 yellow red
short / 1 yellow red

cars’ & long / 1 red green farm-
green
cars & long’ / 0 red green
Verilog model of controller
module sequencer(rst,clk,cars,long,short,hg,hy,hr,fg,fy,fr,count_reset);
input rst, clk; /* reset and clock */
input cars; // high when a car is present at the farm road
input long, short; /* long and short timers */
output hg, hy, hr; // highway light: green, yellow, red
output fg, fy, fr; /* farm light: green, yellow, red */
reg hg, hy, hr, fg, fy, fr; // remember these outputs
output count_reset; /* reset the counter */
reg count_reset; // register this value for simplicity
// define the state codes
`define HWY_GREEN 0
`define HWY_YX 1
`define HWY_YELLOW 2
`define HWY_YY 3
`define FARM_GREEN 4
`define FARM_YX 5
`define FARM_YELLOW 6
`define FARM_YY 7
reg [2:0] state; // state of the sequencer
always @(posedge clk)
begin
  if (rst == 1)
    begin
      state = `HWY_GREEN; // default state
      count_reset = 1;
    end
Verilog model of controller, cont’d.
  else begin // state machine
    count_reset = 0;
    case (state)
      `HWY_GREEN: begin
        if (~(cars & long)) state = `HWY_GREEN;
        else begin
          state = `HWY_YX;
          count_reset = 1;
        end
        hg = 1; hy = 0; hr = 0; fg = 0; fy = 0; fr = 1;
      end
      `HWY_YX: begin
        state = `HWY_YELLOW;
        hg = 0; hy = 1; hr = 0; fg = 0; fy = 0; fr = 1;
      end
      `HWY_YELLOW: begin
        if (~short) state = `HWY_YELLOW;
        else begin
          state = `FARM_YY;
        end
        hg = 0; hy = 1; hr = 0; fg = 0; fy = 0; fr = 1;
      end
      `FARM_YY: begin
        state = `FARM_GREEN;
        hg = 0; hy = 0; hr = 1; fg = 1; fy = 0; fr = 0;
      end
      `FARM_GREEN: begin
        if (cars & ~long) state = `FARM_GREEN;
        else begin
          state = `FARM_YX;
          count_reset = 1;
        end
        hg = 0; hy = 0; hr = 1; fg = 1; fy = 0; fr = 0;
      end
      `FARM_YX: begin
        state = `FARM_YELLOW;
        hg = 0; hy = 0; hr = 1; fg = 1; fy = 0; fr = 0;
      end
      `FARM_YELLOW: begin
        if (~short) state = `FARM_YELLOW;
        else begin
          state = `HWY_GREEN;
        end
        hg = 0; hy = 0; hr = 1; fg = 0; fy = 1; fr = 0;
      end
      `HWY_YY: begin
        state = `HWY_GREEN;
        hg = 1; hy = 0; hr = 0; fg = 0; fy = 0; fr = 1;
      end
    endcase
  end // state machine
end // always
endmodule

Verilog model of timer
module timer(rst,clk,long,short);
input rst, clk; // reset and clock
output long, short; // long and short timer outputs
reg long, short; // outputs are assigned in the always block
reg [3:0] tval; // current state of the timer
always @(posedge clk) // update the timer and outputs
  if (rst == 1)
    begin
      tval = 4'b0000;
      short = 0;
      long = 0;
    end // reset
  else begin
    {long,tval} = tval + 1; // raise long at rollover
    if (tval == 4'b0100)
      short = 1'b1; // raise short after 2^2
  end
endmodule

Verilog model of system
module tlc(rst,clk,cars,hg,hy,hr,fg,fy,fr);
input rst, clk; // reset and clock
input cars; // high when a car is present at the farm road
output hg, hy, hr; // highway light: green, yellow, red
output fg, fy, fr; // farm light: green, yellow, red
wire long, short, count_reset; // long and short timers + counter reset
sequencer s1(rst,clk,cars,long,short,
             hg,hy,hr,fg,fy,fr,count_reset);
timer t1(count_reset,clk,long,short);
endmodule

The synchronous philosophy
 All operation is controlled by the clock.

– All timing is relative to clock.
– Separates functional, performance
optimizations.
 Put a lot of work into designing the clock
network so you don’t have to worry about it
throughout the combinational logic.

Register characteristics
 Form of clock signal used to trigger the

register.
 How the behavior of data around the clock
trigger affects the stored value.
 When the stored value is presented at the
output.
 Whether there is ever a combinational path
from input to output.

Types of registers
 Latch: transparent when internal memory is

being set.
 Flip-flop: not transparent, reading and
changing output are separate.

Types of registers
 D-type (data). Q output is determined by the

D input at the clocking event.
 T-type (toggle). Toggles its state at input
event.
 SR-type (set/reset). Set or reset by inputs
(S=R=1 not allowed).
 JK-type. Allows both J and K to be 1,
otherwise similar to SR.

Clock event
 Change in clock signal that controls register

behavior.
– 0-1 transition or 1-0 transition.
 Data must generally be held constant
around the clock event.

Setup and hold times
event
setup
hold
clock
changing stable
D
time
Duty cycle
 Percentage of time that clock is active.
50%

Topics
 Clocking disciplines.
– Flip-flops.
– Latches.

Clocking disciplines
 Rules for constructing sequential machines.

– Combinations of registers and gates.
– Behavior of clocks and primary inputs over
time.
 Rules are sufficient to guarantee that the
system will work at some clock rate.
– May not be as fast as we want.

Qualified clock
 Clock logically combined with signal:
D Q

sig1

Flip-flop-based sequential
machines

Flip-flop rules
 Primary inputs change after clock (φ) edge.

 Primary inputs must stabilize before next
clock edge.
 Rules allow changes to propagate through
combinational logic for next cycle.
 Flip-flop outputs hold current-state values
for next-state computation.

Signals in flip-flop system
positive clock edge

Latch-based machines
 Latches do not cut combinational logic

when clock is active.
 Latch-based machines must use multiple
ranks of latches.
 Multiple ranks require multiple phases of
clock.

Two-sided latch constraint
 Latch must be open less than the shortest

combinational delay.
 Period between latching operations must be
longer than the longest combinational delay.
 Note: difference between shortest and
longest combinational delay may be large
(sum0 vs. sum31).

Latch shoot-through
Latch may allow data to shoot through:

Strict two-phase clocking
discipline
 Strict two-phase discipline is conservative

but works.
 Can be relaxed later with proper knowledge
of constraints.
 Strict two-phase machine makes latch-based
machine behave more like flip-flop design,
but requires multiple phases.

Strict two-phase architecture

Two-phase clock
Phases must not overlap:
non-overlap region

Why it works
 Each phase has a one-sided constraint:

phase must be long enough for all
combinational delays.
 If there are no combinational loops, phases
can always be stretched to make that section
of the machine work.
 Total clock period depends on sum of phase
periods.
Clocking types
 Logic on different phases operates at
different times, so signals from different
phases can’t be mixed.
 Primary inputs must obey the same rules as
internal signals.
 Clocking types are bookkeeping that help us
ensure that machine structure is valid.

Stable signals
 A logic signal is always stable during one

phase—phase in which the latch which
produced it is not active.
 Easiest to think of machine behavior in
terms of stable signals, though signals
propagate while not stable.

Signal types
 Clocks are separate type: φ1, φ2.
 Two types of stable data signal:
– stable φ1 (sφ1)
– stable φ2 (sφ2)
 A stable signal has a complementary valid
signal:
– stable φ2 (sφ2) = valid φ1 (vφ1)

Stable data signal
[Figure: timing of a stable φ2 signal against the clock. The signal becomes valid at the end of φ1 and stays stable until the latch feeding this logic goes active.]
How clocking types combine

Clocking types in the two-phase
machine
I1(s 2) s 2
combinational
D Q logic
O1(s 2)
1
combinational I2(s 1)
s 1 logic Q D
O2(s 1)
2
Clocking type propagation
 Combinational logic does not change type

of signal.
 Primary inputs must be compatible.
 Latches change signals from one clock type
to another.
 In strict system, never mix clocks with data
signals in combinational logic.

Two-coloring
I1(s 2) s 2
combinational
D Q logic
O1(s 2)
1
combinational I2(s 1)
s 1 logic Q D
O2(s 1)
2
Example: shift register
 Want to displace bit by n registers in n

cycles.
 Each register requires two phases:

Shift register operation
1 = 1, 2 = 0
FPGA-Based System Design: Chapter 1 1 = 0, 2 = 1 Copyright  2004 Prentice Hall PTR

Non-strict disciplines
 Some relaxation of the rules can be useful:

– reduce area;
– increase performance.
 Rules must be relaxed in a way that ensures
the machine will still work.

Qualified clocks
 Use logic to generate a clock signal which

is not always active.
 Qualification must not introduce glitches
into the clock—glitches violate the
fundamental definition of a clock by
introducing extra edges.
 Use stable signals to qualify clocks.

Uses of qualified clocks
 May want to conditionally load a register.

 May qualify a clock to turn off machine for
low-power operation.
 Latch must not lose its value during
inactive period.
 Difficult to ensure that logic value will
come high in time—use quasi-static latch.

Qualified clocks and skew
 Logic in the clocking path introduces delay.

 Delay can cause clock to arrive at latches at
different times, violating clocking
assumptions.
 When designing qualification logic:
– minimize and check skew;
– sharpen clock edge.

Qualification skew example

Topics
 Performance analysis.

Unbalanced delays
Logic with unbalanced delays leads to

inefficient use of logic:
short clock period long clock period

Flip-flop-based system performance
analysis

Flip-flop-based system model
 Clock signal is perfect (no rise/fall), period P.

 Clock event on rising edge.
 Setup time s.
– Time from arrival of combinational logic event to clock
event.
 Propagation time p.
– Time for value to go from flip-flop input to output.
 Worst-case combinational delay C.
– Time from output of flip-flop to input.

Clock parameters

Clock period constraint
 P >= C + s + p.

Clock with rise/fall

Rise/fall clock period constraint
 P >= C + s + p + tr.
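A minimal sketch of the two period constraints above. The function name and the delay values (in nanoseconds) are illustrative assumptions and are not taken from the slides.

```python
def min_clock_period(comb_delay, setup, prop, rise_fall=0.0):
    """Smallest period P satisfying P >= C + s + p + tr."""
    return comb_delay + setup + prop + rise_fall

# Hypothetical delays in nanoseconds, chosen only for illustration.
C, s, p, tr = 6.0, 0.5, 0.75, 0.25

ideal = min_clock_period(C, s, p)        # perfect clock: P >= C + s + p
real = min_clock_period(C, s, p, tr)     # clock with rise/fall time tr
max_freq_mhz = 1000.0 / real             # period in ns -> frequency in MHz

print(ideal, real, round(max_freq_mhz, 1))
```

With these assumed numbers the rise/fall time costs a quarter of a nanosecond per cycle; the same function can be reused with measured delays from a timing report.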

Min-max delays
 Delays may vary:

– Manufacturing
variations.
– Temperature variations.
 Min/max delays
compound over paths.
– Delays within a chip
are correlated.

Latch system clock period
 For each phase, phase period must be longer

than sum of:
– combinational delay;
– latch propagation delay.
 Phase period depends on longest path.

Latch-based system model

Two-phase timing parameters

Clock period constraint
 Total clock period (both phases):

– P >= C1 + C2 + 2s + 2p.
 Each phase must meet timing for its own
latch.
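As a rough illustration of the two-phase constraint, the sketch below computes each phase length and the total period. It assumes each phase must cover its own combinational delay plus one latch setup and propagation time; the numbers are hypothetical.

```python
def two_phase_period(c1, c2, setup, prop):
    """Return (phase 1, phase 2, total period) for a two-phase latch machine.

    The total matches P >= C1 + C2 + 2s + 2p.
    """
    phase1 = c1 + setup + prop
    phase2 = c2 + setup + prop
    return phase1, phase2, phase1 + phase2

# Hypothetical, deliberately unbalanced delays in nanoseconds.
phase1, phase2, period = two_phase_period(c1=4.0, c2=2.0, setup=0.5, prop=0.5)
print(phase1, phase2, period)
```

Unbalanced values like these leave the shorter phase's logic idle for part of the cycle.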

Latch-based system model

Advanced performance analysis
 Latch-based systems always have some idle

logic.
 Can increase performance by blurring phase
boundaries. Results in cycle time closer to
average of phases.

Example with unbalanced phases
One phase is much longer than the other:

Spreading out a phase
Compute only part of long paths in one phase:

Spreading out a phase, cont’d.
Use other phase for end of long logic block

and all of short logic block:

Problems
 Hard to debug—can’t stop the system.

 Hard to initialize system state.
 More sensitive to process variations.

Timing and glitches in FSMs
 If inputs don’t change, can outputs glitch?
input output
logic
Q D

Skew
 Skew: relative delay between events.

 Signal skew: most important for
asynchronous, timing-dependent logic.
 Clock skew: can harm any sequential
system.

Signal skew
Machine data signals must obey setup and

hold times—avoid signal skew.

Signal skew example

Clock skew
Clock must arrive at all memory elements in

time to load data.

Clock skew example

Clock skew in system
D Q
logic
d
D Q

Clock skew and qualified clocks

Clock skew analysis model
s12 = d1 – d2, s21 = d2 – d1

Skew and clock period
 Assume that each flip-flop operates
instantaneously:
– T >= D2 + s12, where s12 = d1 – d2
 If clock arrives at FF2 after FF1, then we
have more time to compute.
 Given clock period, determine allowable
skew:
– s12 <= T – D2

Timing through logic
 As skew increases, we
have less time to get
the signal through the
logic.

Clock distribution
 Often one of the

hardest problems in
clock design.
– Fast edges.
– Minimum skew.

Clock skew example
D Q D Q
10 ps 10 ps
20 ps 20 ps
D Q
30 ps 30 ps

Retiming
Retiming moves registers through

combinational logic:

Retiming properties
 Retiming changes encoding of values in

registers, but proper values can be
reconstructed with combinational logic.
 Retiming may increase number of registers
required.
 Retiming must preserve number of registers
around a cycle—may not be possible with
reconvergent fanout.
Topics
 Basics of register-transfer design:

– data paths and controllers.
 High-level synthesis.

Register-transfer design
 A register-transfer system is a sequential

machine.
 Register-transfer design is structural—
complex combinations of state machines
may not be easily described solely by a
large state transition graph.
 Register-transfer design concentrates on
functionality, not details of logic design.
Register-transfer system example
A register-transfer machine has combinational

logic connecting registers:
Q D combinational
logic
combinational D Q combinational D Q
logic logic

Block diagrams
Block diagrams specify structure: wire bundle

of width 5

Data path-controller systems
 One good way to structure a system is as a

data path and a controller:
– data path executes regular operations
(arithmetic, etc.), holds registers with data-
oriented state;
– controller evaluates irregular functions, sets
control signals for data path.

Data and control are equivalent
 We can rewrite control into data and vice
versa:
– control: if i1 = '0' then o1 <= a; else o1 <=
b; end if;
– data: o1 <= ((i1 = '0') and a) or ((i1 = '1')
and b);
 Data/control distinction is useful but not
fundamental.
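The equivalence can be checked exhaustively for single-bit signals. The sketch below is an illustration in Python rather than an HDL; the function names are assumptions.

```python
from itertools import product

def control_form(i1, a, b):
    """Control view: a conditional chooses which value reaches o1."""
    if i1 == 0:
        return a
    else:
        return b

def data_form(i1, a, b):
    """Data view: the same choice written as AND/OR logic on single bits."""
    return (int(i1 == 0) & a) | (int(i1 == 1) & b)

# Compare the two forms on every combination of single-bit inputs.
equivalent = all(
    control_form(i1, a, b) == data_form(i1, a, b)
    for i1, a, b in product([0, 1], repeat=3)
)
print(equivalent)
```

The data form is exactly a 2-to-1 multiplexer, which is why conditionals in a hardware description imply multiplexers.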
Data and control
ctrl
carry select

Data operators
 Arithmetic operations are easy to spot in

hardware description languages:
– x <= a + b;
 Multiplexers are implied by conditionals.
Must evaluate entire program to determine
which sources of data for registers.
 Multiplexers also come from sharing
adders, etc.
Conditionals and multiplexers
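if x = '0' then
reg1 <= a;
else
reg1 <= b;
end if;
[Figure: the code and its register-transfer form, a multiplexer controlled by x that selects a or b as the input to reg1.]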
Alternate data path-controller
systems
[Figure: two organizations: one controller with one data path; two communicating data path-controller systems.]

High-level synthesis
 Sequential operation is not the most abstract

description of behavior.
 We can describe behavior without assigning
operations to particular clock cycles.
 High-level synthesis (behavioral synthesis)
transforms an unscheduled behavior into a
register-transfer behavior.

Tasks in high-level synthesis
 Scheduling: determines clock cycle on

which each operation will occur.
 Allocation: chooses which function units
will execute which operations.

Functional modeling code in
Verilog
assign o1 = i1 | i2;
if (!i3) begin
o1 = 1'b1;
o2 = a + b; // clock cycle boundary can be moved to design different register transfers
end
else begin
o1 = 1'b0;
end

Data dependencies
 Data dependencies describe relationships

between operations:
– x <= a + b; value of x depends on a, b
 High-level synthesis must preserve data
dependencies.

Data flow graph
 Data flow graph (DFG) models data

dependencies.
 Does not require that operations be
performed in a particular order.
 Models operations in a basic block of a
functional model—no conditionals.
 Requires single-assignment form.

Data flow graph construction
original code:      single-assignment form:
x <= a + b;         x1 <= a + b;
y <= a * c;         y <= a * c;
z <= x + d;         z <= x1 + d;
x <= y - d;         x2 <= y - d;
x <= x + c;         x3 <= x2 + c;
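The renaming above can be done mechanically. The following is a minimal sketch, assuming each statement is stored as a hypothetical (target, operator, left, right) tuple; it numbers only variables that are assigned more than once and does not guard against clashes with existing names such as x1.

```python
from collections import Counter

def to_single_assignment(stmts):
    """Rename repeated targets so every variable is assigned exactly once."""
    writes = Counter(target for target, _, _, _ in stmts)
    version = {}   # variable -> number of assignments seen so far
    current = {}   # variable -> name that holds its latest value
    result = []
    for target, op, left, right in stmts:
        # Sources refer to the most recent version of each variable.
        left = current.get(left, left)
        right = current.get(right, right)
        if writes[target] > 1:
            version[target] = version.get(target, 0) + 1
            new_name = f"{target}{version[target]}"
        else:
            new_name = target
        current[target] = new_name
        result.append((new_name, op, left, right))
    return result

code = [("x", "+", "a", "b"), ("y", "*", "a", "c"), ("z", "+", "x", "d"),
        ("x", "-", "y", "d"), ("x", "+", "x", "c")]
ssa = to_single_assignment(code)
for target, op, left, right in ssa:
    print(f"{target} <= {left} {op} {right};")
```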

Data flow graph construction,
cont’d
Data flow forms directed acyclic graph

(DAG):

Goals of scheduling and
allocation
 Preserve behavior—at end of execution,

should have received all outputs, be in
proper state (ignoring exact times of
events).
 Utilize hardware efficiently.
 Obtain acceptable performance.

Data flow to data path-controller
One feasible schedule for last DFG:

Binding values to registers
registers fall on
clock cycle
boundaries

Register lifetimes
[Figure: lifetime chart for the values a, b, c, d and the x values across clock cycles.]

Allocation creates multiplexers
 Same unit used for different values at

different times.
– Function units.
– Registers.
 Multiplexer controls which value has access
to the unit.

Choosing function units
Muxes allow function units to be shared for several operations.

Building the sequencer
sequencer requires three states,

even with no conditionals

Verilog for data path
module dp(reset,clock,a,b,c,d,muxctrl1,muxctrl2,muxctrl3,
muxctrl4,loadr1,loadr2,loadr3,loadr4,x3,z);
parameter n=7;
input reset;
input clock;
input [n:0] a, b, c, d; // data primary inputs
input muxctrl1, muxctrl2, muxctrl4; // mux control
input [1:0] muxctrl3; // 2-bit mux control
input loadr1, loadr2, loadr3, loadr4; // register control
output [n:0] x3, z;
reg [n:0] r1, r2, r3, r4; // registers
wire [n:0] mux1out, mux2out, mux3out, mux3bout, mux4out, mult1out, mult2out;
// multiplexers select the operands of the two shared function units
assign mux1out = (muxctrl1 == 0) ? a : r1;
assign mux2out = (muxctrl2 == 0) ? b : r4;
assign mux3out = (muxctrl3 == 0) ? a : (muxctrl3 == 1 ? r4 : r3);
assign mux4out = (muxctrl4 == 0) ? c : r2;
assign mult1out = mux1out * mux2out;
assign mult2out = mux3out * mux4out;
assign x3 = mult2out;
assign z = mult1out;
always @(posedge clock)
begin
if (reset) begin // synchronous reset clears every register
r1 <= 0; r2 <= 0; r3 <= 0; r4 <= 0;
end
else begin // otherwise load registers as the controller requests
if (loadr1) r1 <= mult1out;
if (loadr2) r2 <= mult2out;
if (loadr3) r3 <= c;
if (loadr4) r4 <= d;
end
end
endmodule

Choices during high-level
synthesis
 Scheduling determines number of clock

cycles required; binding determines area,
cycle time.
 Area tradeoffs must consider shared
function units vs. multiplexers, control.
 Delay tradeoffs must consider cycle time vs.
number of cycles.

Finding schedules
 Two simple schedules:

– As-soon-as-possible (ASAP) schedule puts
every operation as early in time as possible.
– As-late-as-possible (ALAP) schedule puts
every operation as late in schedule as possible.
 Many schedules exist between ALAP and
ASAP extremes.

ASAP and ALAP schedules
ASAP
ALAP

Verilog model of ASAP schedule
reg [n-1:0] w1reg, w2reg, w6reg1, w6reg2, w6reg3,
w6reg4, w3reg1, w3reg2, w4reg, w5reg;

begin
// cycle 1
w1reg = i1 + i2;
w3reg1 = i4 + i5;
w6reg1 = i7 + i8;
// cycle 2
w2reg = w1reg + i3;
w3reg2 = w3reg1;
w6reg2 = w6reg1;
// cycle 3
w4reg = w3reg2 + w2reg;
w6reg3 = w6reg2;
// cycle 4
w5reg = i6 + w4reg;
w6reg4 = w6reg3;
// cycle 5
o1 = w6reg4 + w5reg;
end

Verilog of ALAP schedule
reg [n-1:0] w1reg, w2reg, w6reg, w3reg, w4reg, w5reg;

begin
// cycle 1
w1reg = i1 + i2;
// cycle 2
w2reg = w1reg + i3;
w3reg = i4 + i5;
// cycle 3
w4reg = w3reg + w2reg;
// cycle 4
w5reg = i6 + w4reg;
w6reg = i7 + i8;
// cycle 5
o1 = w6reg + w5reg;
end
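The two schedules can be derived from the data dependencies alone. The sketch below is an illustration, not a synthesis tool: it assumes every operation takes one cycle and that there are no resource limits, and it encodes the dependencies of the seven additions in the Verilog models (primary inputs are left out because they are available from cycle 1).

```python
def asap(deps):
    """Earliest cycle for each operation: one cycle after its latest input."""
    step = {}
    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(d) for d in deps[op]), default=0)
        return step[op]
    for op in deps:
        visit(op)
    return step

def alap(deps, length):
    """Latest cycle for each operation: one cycle before its earliest user."""
    users = {op: [] for op in deps}
    for op, preds in deps.items():
        for d in preds:
            users[d].append(op)
    step = {}
    def visit(op):
        if op not in step:
            step[op] = min((visit(u) for u in users[op]), default=length + 1) - 1
        return step[op]
    for op in deps:
        visit(op)
    return step

# Each operation maps to the operations whose results it consumes.
dfg = {"w1": [], "w2": ["w1"], "w3": [], "w4": ["w3", "w2"],
       "w5": ["w4"], "w6": [], "o1": ["w6", "w5"]}
early = asap(dfg)
late = alap(dfg, max(early.values()))
slack = {op: late[op] - early[op] for op in dfg}  # freedom to move each operation
print(early, late, slack, sep="\n")
```

Operations with zero slack lie on the critical path; only w3 and w6 can move, which is why the ASAP model needs extra registers to carry their values forward.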

Critical path of schedule
Longest path through data flow determines

minimum schedule length:

Operator chaining
 May execute several

operations in sequence in one
cycle—operator chaining.
 Delay through function units
may not be additive, such as
through several adders.

Control implementation
 Clock cycles are also known as control

steps.
 Longer schedule means more states in
controller.
 Cost of controller may be hard to judge
from casual inspection of state transition
graph.

Controllers and scheduling
functional model:
x <= a + b;
y <= c + d;
[Figure: the model implemented in one state or in two states.]
Distributed control
two distributed controllers
one centralized controller

Synchronized communication
between FSMs
To pass values between two machines, must schedule output

of one machine to coincide with input expected by the other:

Hardwired vs. microcoded
control
 Hardwired control has a state register and

“random logic.”
 A microcoded machine has a state register
which points into a microcode memory.
 Styles are equivalent; choice depends on
implementation considerations.

Data path-controller delay
Watch out for long delay paths created by

combination of data path and controller:

Topics
 Low power design.

 Pipelining.

Rules for reducing power
consumption.
 Turn it off.
– Eliminates leakage current.
 Slow it down, reduce voltage.
– Performance is linear with clock frequency.
– Power is proportional to V^2.
 Don’t change its inputs.
– Activity-dependent.

Energy and power
 Energy = power * time.

 Energy consumption is critical for battery-
powered systems.
 Power consumption is critical for heat
dissipation limited systems.

Energy and performance
 In many cases, high performance = low

energy.
– Efficiency pays off in both arenas.
 In some cases, energy can be saved by
reducing performance.
– P = 1/2 CV^2
– Power goes down faster than performance.

Levels of abstraction
 Physical:
– Minimize capacitance.
 Gate:
– Use low leakage gates.
 Combinational:
– Avoid twitches.
 Register-transfer:
– Avoid using units.
 Architecture:
– Slow things down, turn them off.
Sources of energy consumption
 Static:
– Leakage.
 Dynamic:
– Switching activity.

Physical optimizations
 Assuming equal signal probabilities,
dynamic power consumption is proportional
to total wire capacitance.
 Shorter wires -> less power consumption.
 More active nets should be shortened first.

How to reduce wire length
 Use hard macros where possible.

 Add placement constraints.
 Use design hierarchy to guide placement
search.
 Use nets with small drivers where possible.
– Don’t drive a net faster than it needs to go.

Logic/circuit optimizations
 Turn off gate where possible.

– Not an option in most FPGAs, but it should be.
 Operate gate at low voltage.
– Speed decreases linearly, power decreases as
V^2.

Combinational optimizations
 Design network to avoid unnecessary

glitching where possible.
– Balance delays across paths.
 Can duplicate logic to reduce wire lengths.
– Does the duplicate logic use less power than the
wire?

Register-transfer optimizations
 Hold inputs when a unit’s output will not be

used.
– Put register at inputs.
 Turn off units when they won’t be used for
several cycles.
– Can’t selectively turn off LEs in most FPGAs.

Architectures for low power
 Two important methods:

– architecture-driven voltage scaling
– power-down modes

Architecture-driven voltage
scaling
 Add extra logic to increase parallelism so

that system can run at lower rate.
 Power improvement for n parallel units over
Vref:
– Pn(n) = [1 + Ci(n)/(n Cref) + Cx(n)/Cref] (V/Vref)^2
[Figure: example data paths labeled Clock = 25 MHz, Clock = 50 MHz, Clock = 25 MHz.]
Power-down modes
 CMOS doesn’t consume power when not

transitioning. Many systems can incorporate
power-down modes:
– condition the clock on power-down mode;
– add state to control for power-down mode;
– modify the control logic to ensure that power-
down/power-up don’t corrupt control state.

Pipelines
 Provide higher utilization of logic:

Combinational logic

Pipeline metrics
 Throughput: rate at which new values enter

the system.
– Initiation interval: time between successive
inputs.
 Latency: delay from input to output.

Simple pipelines
 Pure pipelines have no control.

 Choose latency, throughput.
 Choose register locations with retiming.
 Overhead:
– Setup, hold times.
– Power.

Complex pipelines
 Actions in pipeline depend on data or

external events.
 Actions on pipe:
– Stall values.
– Abort operation.
– Bypass values.

Pipeline metrics
 Ignore register delay:

– Combinational logic delay D.
– Latency L.
– Throughput T.
 Delay through unpipelined system.
– L = D.
– T = 1/D.

Adding pipeline stages
 Add a pipeline stage:

– Latency remains L = D.
– Throughput increases: T = 2/D.
 n-stage pipeline:
– Throughput increases: T = n/D.
 Clock period:
– P = D/n.
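A small sketch of these relations, assuming an ideal pipeline in which the logic divides evenly into stages and register delay is ignored, as on the slide. The delay value is hypothetical.

```python
def pipeline_metrics(total_delay, stages):
    """Return (clock period, throughput, latency) for an ideal n-stage pipeline."""
    period = total_delay / stages        # P = D/n
    throughput = stages / total_delay    # T = n/D
    latency = period * stages            # L = D, unchanged by pipelining
    return period, throughput, latency

D = 8.0  # hypothetical combinational delay in nanoseconds
results = {n: pipeline_metrics(D, n) for n in (1, 2, 4)}
for n, (period, throughput, latency) in results.items():
    print(n, period, throughput, latency)
```

In a real design the setup and propagation times of the added registers lengthen each stage, so throughput grows more slowly than n/D.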

Performance vs. pipeline stages
throughput
clock period
# stages
Adding pipeline stages
 Must add a pipeline stage that cuts the logic.

– Cutset for PI-PO graph.
 Can use retiming to position the registers in
the logic.

Cutsets

Bad pipeline

Pipeline utilization
 Need to fill up the pipeline.

– Later stages are unused as the pipeline fills up.
 Assume D stages of valid data, n total
stages.
– Utilization U = D / (D + n).
 In steady state, utilization approaches 1.
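To see how utilization approaches 1, the sketch below evaluates U = D / (D + n) for a hypothetical 4-stage pipeline and increasing amounts of valid data.

```python
def utilization(valid, stages):
    """Pipeline utilization U = D / (D + n)."""
    return valid / (valid + stages)

n = 4  # hypothetical number of pipeline stages
curve = [(d, utilization(d, n)) for d in (4, 36, 396)]
for d, u in curve:
    print(d, u)
```

The fill-up cost of n stages is paid once, so it matters less as more data flows through.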

Pipelines with control
 Pipeline may do different things at different

times.
– CPU control flow.
 Must make sure that the pipeline operates
properly in all cases.

Sending a control signal forward

Sending a control signal backward
 Make sure control arrives at right cycle:

Combining signals from multiple
cycles
 Different stages
can’t use ALU on
same cycle.

Distributed pipeline control

Pipeline control logic
 Ideal pipeline needs no significant control:
[State diagram: single state s1 with a self-loop labeled -/ALU = op.]

Simple decisions
 Simple decision doesn’t add states:
[State diagram: single state s1 with self-loops labeled 1/ALU = - and 0/ALU = +.]

Controlling a pipeline flush

Product machine for pipeline flush

Hardware sharing control

Pipeline verification
 Extensive simulation is required to exercise

the pipeline.
– State of pipeline stages interact.
 Symbolic simulation: simulate names, not
particular values.

Topics
 Design methodologies.

Design methodologies
 Every company has its own design

methodology.
 Methodology depends on:
– size of chip;
– design time constraints;
– cost/performance;
– available tools.

Design teams
 Almost all interesting projects are too big

for one person to handle.
 Need a team of people with varying skills.
 Who is in charge?

Documents
 Documents are critical:

– Writing it helps you decide what to do.
– Minimizes risk of hit-by-a-truck syndrome.
– Provides information for maintenance, next
generation.
 Each document serves as the contract
between the provider of the document and
the consumer of the document.
Major documents
 Requirements.
 Specification.
 Architecture.
 Module designs.
 Reference manual, user manual.

Types of information
 Functional description.
 Non-functional description: cycle time,
power, etc.
 Timetables.
 Design verification methods.
 Quality metrics.
 Job assignments.

Starting the project
 Requirements: English description of what

is to be done.
– Customer-oriented.
– High-level.
 May be written by marketing.
 Author of requirements should verify that
the requirements are accurate.

Specifications
 The specification is the contract between

marketing and the design team.
 The specification is more technical than the
requirements:
– delays, etc.
 An ideal specification would contain no
architectural information, but that goal may be
hard to achieve in practice.
– The specification says what to do, not how to do it.

Specification and planning
 Driven by contradictory impulses:

– customer-centric concerns about cost,
performance, etc.;
– forecasts of feasibility of cost and performance.
 Features, performance, power, etc. may be
negotiated at early stages; negotiation at
later stages creates problems.

Architecture
 The architecture document is the contract

between the system designers and the
component designers.
 Specifies major subsystems and their
interactions.
 Makes important design decisions.
 Isn’t a full implementation.

Module designs
 Specifies details of a module:

– functionality;
– non-functional parameters;
– design verification.

Do documents reflect the
product?
 In a word, no.
– Things change.
– People don’t have time to conform documents
to the final design.
 Some amount of updating is important for
maintenance, future generations.

Design reviews
 Have other designers (team + non-team)

evaluate a design.
– Relatively simple.
– Proven to work.
 Must walk through the design in detail to
look for problems, improvements.

Generic design flow
[Figure: design flow with the stages architectural simulation, detailed specs, register-transfer design, logic design, physical design, and configuration, together with a timing/area budget and final design verification.]

Estimation and planning
 Estimation techniques vary with module:

– memories may be generated once size is
known;
– data paths may be estimated from previous
design;
– controllers are hard to estimate without details.
 Estimates must include speed, area, power.

Floorplanning and budgeting
 Want some early physical design

information: area, delay, power, etc.
 Ways to get info:
– previous designs;
– quick design runs.

Architecture
 Need to build an executable model of the

architecture.
– Run vectors on architecture.
– Use as golden design for comparison with later
stages.
 Modeling languages:
– C: easier to write, less detailed.
– Verilog: harder to write, synthesizable with
effort.
Logic design
 For controllers, good state assignment
usually requires CAD tools.
 Logic synthesis is an option:
– very good for non-critical logic;
– can work well for speed-critical logic.
 Logic synthesis system may be sensitive to
changes in the input specification.

Place and route
 Most computationally expensive stage.

 Metrics take more time to judge than
functional vectors.
 Deciding how to fix a problem may take
effort.
– How to change placement, etc.

Design verification
 Functional verification:
– runs reasonable set of vectors.
 Non-functional verification:
– performance;
– power.

Functional verification
 At all levels of hierarchy: module,

subsystem, system.
 At every level of abstraction.
– Compare to previous level of abstraction,
golden model.
 Must check interfaces.
– Half of bugs are at the interface to other
modules.
Functional verification input
 Sources of vectors:
– Previous designs.
– Vectors from higher levels of abstraction.
– Vectors designed previously for this stage.
– Inputs from other modules.

Non-functional verification
 Performance:
– Static timing analysis.
 Power:
– Some information from timing analysis.
– Power analysis tools.

Breadboards
 May build a board to test an FPGA-based

design.
– Takes some time.
– May allow running the design against the real
I/O device.

Topics
 Bus interfaces.
 Platform FPGAs.

Bus interfaces
 Requirements:
– High performance.
– Variable signal environment.
 Techniques:
– Asynchronous logic.
– Handshaking-oriented protocols.

Timing diagrams
[Figure: timing diagram conventions for signals a, b, c: 0 and 1 levels, changing and stable regions, and a timing constraint between events.]

Asynchronous logic
 Distribute timing information with values.

– No global clock.
 Clock signal paths must have the same
delay as data values.

Latching an asynchronous signal
adrs
adrs D Q
adrs_ready

Asynchronous timing constraints
 Must satisfy setup, hold times.
adrs
Hold time
Setup time

Bus system design
 Requirements:
– Imposed by the other side of the system.
 Constraints:
– Imposed by this side of the system.
requirements
a b
constraints
Views of the bus
 Hardware:
D Q D Q
a b
Combinational
logic

Views of bus system, cont’d.
 Timing diagram:
x y
x
D Q D Q
a b
y Combinational
logic

Bus protocols
 Basic transaction:
– four-cycle handshake.
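The four steps of the handshake can be traced in software. This is a minimal sequential sketch, assuming request and acknowledge lines named enq and ack; in hardware each side would be its own state machine.

```python
def four_cycle_handshake(value):
    """Return the received value and the (enq, ack) trace of one transfer."""
    enq, ack = 0, 0
    trace = [(enq, ack)]      # idle: both lines low
    bus = value               # sender places the data on the bus
    enq = 1                   # cycle 1: sender raises enq
    trace.append((enq, ack))
    received = bus            # receiver captures the data while enq is high
    ack = 1                   # cycle 2: receiver raises ack
    trace.append((enq, ack))
    enq = 0                   # cycle 3: sender sees ack and drops enq
    trace.append((enq, ack))
    ack = 0                   # cycle 4: receiver sees enq low and drops ack
    trace.append((enq, ack))
    return received, trace

received, trace = four_cycle_handshake(0x2A)
print(received, trace)
```

Both lines return to zero at the end, so the next transfer can start from the same idle state.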

Handshake machine
 Each side is an FSM (possibly

asynchronous):
[Figure: two two-state machines, a and b (states 0 and 1), with transitions labeled Go, enq, and ack.]

Basic protocols
 Handshake transmits data:

Box 1 logic

Box 2 logic

Bus timing
Events:
– td1 = d stable
– td2 = d not stable
– tc1 = c rises
– tc2 = c falls
– tack1 = ack rises
Constraints:
– t1 = tc1 - td1 >= tr
– t2 = tack1 - tc1 >= th
– t3 = tc2 - tack1 >= th

Busses and systems
 Microprocessor systems often have several

busses running at different rates:
CPU
mem
High-speed
I/O bridge
Low-speed

Basic signals in a bus

Bus characteristics
 Physical
– Connector size, etc.
 Electrical
– Voltages, currents, timing.
 Protocol
– Sequence of events.

Advanced transactions
 Multi-cycle transfers:
– Several values on one handshake.
– May use implicit addressing.

PCI bus
 Used for box-level system interconnect.

 Two versions:
– 33 MHz.
– 66 MHz.
 Supports advanced transactions.

PCI bus read

Multi-rate systems
 Logic blocks running at different clock
rates may communicate:
– Multi-chip.
– Single-chip.
» Slow bus connects to fast logic.
[Figure: Logic 1 at 100 MHz communicating with Logic 2 at 33 MHz.]

Metastability
 Registers capturing
transitioning signals
may take an
arbitrarily long time
to settle.

Resynchronization
 Use cascaded registers to minimize the

chance of using a metastable value.
[Figure: two cascaded registers between input d and output dout.]

Platform FPGAs
 Put all the logic for a system on one FPGA.

 Requires large FPGAs plus:
– Specialized logic:
» I/O support;
» memory interface.
– CPUs.

Example: Virtex II Pro
 Major features:
– Large FPGA fabric.
– High-speed I/O.
– PowerPC.

Virtex II Pro High-speed I/O
 Rocket I/O:
– parallel/serial or serial/parallel transceiver.
 Clock recovery circuitry.
 Transceivers for multiple standards: Gigabit
Ethernet, Fibre Channel, etc.
 Programmable decoding features.
 Interface to FPGA fabric.

Virtex II Pro CPUs
 Up to 4 PowerPC 405s per chip:

– 5 stage pipe, static branch prediction, etc.
 Separate instruction, data caches.
 MMU.
 Timers.
 Scan-based debug support.

PowerPC CoreConnect

Altera Stratix
 Combines FPGA fabric, memory blocks,

multipliers.

Stratix DSP block

Topics
 Hardware/software co-design.

Why put CPUs on FPGAs?
 Shrink a board to a chip.

 What CPUs do best:
– Irregular code.
– Code that takes advantage of a highly
optimized datapath.
 What FPGAs do best:
– Data-oriented computations.
– Computations with local control.
System design
 True concurrency increases system

performance.
– CPU and accelerator should run in parallel.
 CPU cost is a non-linear function of
performance.
– Accelerator will be smaller, faster, lower
power.

Hardware/software partitioning
if (foo < 8) {
for (i=0; i<N; i++)
x[i] = y[i]*z[i];
}
CPU accelerator

Methodology
 Measure the application.

 Identify what to put onto the accelerator.
 Build interfaces.

Concurrency
 Concurrent applications provide the most
speedup.
[Figure: with no data dependencies between them, the CPU runs if (a > b) ... while the accelerator runs x[i] = y[i] * z[i].]

Concurrency analysis
 Data dependencies.
z= x * y;
w = z - v;
 Control dependencies.
if (a < b)
u = r + s;

Partitioning
 Can divide the application into several

processes that run concurrently.
 Process partitioning exposes opportunities
for parallelism.
– Process 1: if (i>b) …
– Process 2: for (i=0; i<N; i++) …
– Process 3: for (j=0; j<N; j++) ...

Partitioning programs
 Reasonable partitioning points:

– If statements,etc.
– Loop nests.

Multi-threaded systems
 Single thread:  Multi-thread:

Performance analysis
 Single threaded:
– Find longest possible execution path.
 Multi-threaded with no synchronization:
– Find the longest of several execution
paths.
 Multi-threaded with synchronization:
– Find the worst-case synchronization
conditions.
Multi-threaded performance
analysis
 Synchronization causes the delay along one

path to affect the delay along another.
ta tb
synchronization point
tc td
Delay = max(ta, tb) + td

Control
 Need to signal between CPU and

accelerator.
– Data ready.
– Complete.
 Implementations:
– Shared memory.
– Handshake.

Keeping the accelerator fed
 Must get data in, must get data out.

 Data transfer costs:
– flush CPU cache;
– device driver;
– bus transactions.

Memory buffers
 Must keep accelerator fed.

– Buffer size in accelerator depends on amount of
data needed at a time, delays in obtaining
needed values.
 Streaming generally requires small buffers:
– x[i] = y[i] * z[i];
 Values with long lifetimes need more buffer
space.
Allocation
 How do we decide what goes on the CPU,

what goes on the FPGA?
 Allocation puts functions on the CPU or
FPGA.

Speedup
 Speedup for one iteration:
– tSW - (tHW + tI + tO)
 May be able to set up many iterations at
once:
– N*(tSW - tHW) - tI - tO

Drivers
 Need interface between CPU and

accelerator:
– transfer data values;
– start, stop computation.
 If computation time is very predictable, a
simpler communication scheme may be
possible.

Debugging
 Hard to test a CPU/accelerator system:

– Hard to control and observe the accelerator
without the CPU.
– Software on CPU may have bugs.
 Build separate test benches for CPU code,
accelerator.
 Test integrated system after components
have been tested.
Topics
 Multi-FPGA systems.

Issues
 Types of multi-FPGA systems.

 Multi-FPGA networks.
 Multi-FPGA partitioning.

Types of systems
 Can build a specialized multi-FPGA

system.
– Wired for one purpose.
 Can build reusable multi-FPGA system.
– Emulators, other debugging systems.

Networks
 Ad hoc.
– Best suited for specialized systems.
 Crossbar.
– Fully connected.
 Specialized crossbars.
 Multi-stage.
– Not often used in multi-FPGA systems.

Crossbar
 Fully connected:
w
x
y
z
a b c d
Properties of crossbar
 Fully connected:
– Single source/destination.
– Multi-point.
 n^2 area.

Clos network
 System of crossbars that has less than n^2
area.
 Fully connected for single-destination
connections.
– Not fully connected for multiple destinations.

Clos network organization

Net size distribution
 Most nets are small, making Clos network

feasible for logic:
[Figure: histogram of number of nets versus number of pins per net.]
Partial crossbar
 Takes advantage of FPGA

reprogrammability.
 Several small crossbars.
– If your crossbar doesn’t have room for the
connection, reprogram to use another crossbar
on another pin.

Trees and fat trees
 Trees allow
communication
between leaves.
 Fat trees provide
more bandwidth
near root.
…

Multi-chip partitioning
 Somewhat similar to partitioning for LE

placement.
 Differences:
– k-way partitioning;
– pins are a major cost;
– must handle large problems.

K-way partitioning
 Direct:
– Divide into k sets.
 Iterative:
– Extract one set, then another, etc.

Clustering-based partitioning
 Grow a cluster to form a partition.

– Start with a seed for the cluster.
– Choose new nodes to add to the cluster.
 Next move depends on the quality of the
previous moves.

Fiduccia-Mattheyses partitioning
 Can deal with variable-sized blocks.

 Related to Kernighan-Lin partitioning.
– Uses a new data structure to determine the best cell to
move.
 Uses an improved algorithm for updating cell
gains after a move.
– Total gain recomputation can be performed by a set of
constant time gain increments/decrements.
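The cell gain that Fiduccia-Mattheyses tracks can be stated compactly. The sketch below recomputes the gain of each cell from scratch for a two-way partition; it does not implement the bucket data structure or the constant-time incremental updates, and the netlist is a hypothetical example.

```python
def cut_size(nets, side):
    """Number of nets with cells in both blocks."""
    return sum(1 for net in nets if len({side[c] for c in net}) > 1)

def cell_gain(cell, nets, side):
    """Reduction in cut size if `cell` moves to the other block."""
    gain = 0
    for net in nets:
        if cell not in net:
            continue
        same = sum(1 for c in net if side[c] == side[cell])
        other = len(net) - same
        if same == 1:    # cell is alone on its side: moving it uncuts the net
            gain += 1
        if other == 0:   # net lies entirely on the cell's side: moving it cuts the net
            gain -= 1
    return gain

# Hypothetical netlist: four cells, three nets, two blocks (0 and 1).
nets = [{"a", "c"}, {"a", "b"}, {"b", "c", "d"}]
side = {"a": 0, "b": 0, "c": 1, "d": 1}
gains = {cell: cell_gain(cell, nets, side) for cell in side}

best = max(gains, key=gains.get)          # cell with the highest gain
before = cut_size(nets, side)
moved = dict(side)
moved[best] = 1 - moved[best]             # move it to the other block
after = cut_size(nets, moved)
print(gains, best, before, after)
```

A full implementation would also enforce a balance constraint on block sizes and lock each cell after it moves.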

Topics
 Coarse-grained FPGAs.
 Reconfigurable systems.
 Reconfigurable ASICs.

FPGA granularity
 Typical LEs implement a small amount of

logic.
– Waste a lot of space/power on connecting logic
elements.
– Specialized adder logic tries to solve this
problem for a special case.
 Can build FPGAs with larger elements.

Granularity issues
 How big is the logic element?

 How flexible should it be?
 What interconnection network is needed?
 How do you program it?

Reconfigurable systems
 Reconfigure logic on-the-fly:

– application characteristics may change over
time.
 Issues:
– Reconfiguration time.
– Reconfiguration memory cost.
– Power consumption.
– Synthesis for reconfiguration.

PipeRench
 Reconfigurable pipeline:
– Each stage of the pipeline can be reconfigured
quickly and independently.
 Allows virtual pipeline that is longer than
physical pipeline.
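
One way to picture a virtual pipeline that is longer than the physical one is to reuse physical stages in round-robin order. The mapping below is an illustrative assumption, not a description of the actual PipeRench scheduler, and the function name is hypothetical.

```python
def stage_mapping(virtual_stages, physical_stages):
    """Map each virtual stage to the physical stage that hosts it.

    Assumption: physical stages are reused in round-robin order, so a
    physical stage is reconfigured each time a new virtual stage lands on it.
    Returns a list of (virtual_stage, physical_stage) pairs.
    """
    return [(v, v % physical_stages) for v in range(virtual_stages)]


# A 5-stage virtual pipeline on 3 physical stages.
for virtual, physical in stage_mapping(5, 3):
    print(f"virtual stage {virtual} -> physical stage {physical}")
```

This only works because each stage can be reconfigured quickly and independently of the others.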

PipeRench pipeline operation

RaPiD architecture
 Coarse-grained computational architecture:
– Soft control can be reconfigured on every cycle.
– Hard control can be reconfigured only in
configuration mode.
 Interconnect network allows computational
elements to be arranged in pipelines.

RaPiD pipeline

Reconfigurable ASICs
 Problems with ASICs:
– Mask cost.
– Manufacturing time.
 Solution is to mix ASIC and FPGA:
– Reconfigurable logic on bottom.
– Custom wiring on top.

Mask patterns are put on wafer using photo-