Dr. James P. Davis, Associate Professor; Director, VLSI Systems Design Lab, University of South Carolina, Columbia, S.C., U.S.A.
4. Session Wrap-Up
2004 Dr. James P. Davis
Design process, methods, and notation allow you to take any digital systems application, analyze it, and implement it: starting with math formulae, algorithms or protocols.
Iterative enhancement, top-down/bottom-up, stepwise refinement (staples of the Software Engineering discipline). Additional heuristics relate properties of digital circuits to high-level systems design, to make the systems planning process more effective.
Evaluate the goodness of your architecture based on the use of available tools, by collecting and analyzing post-implementation data to select a best fit.
Method of designing for circuit synthesis, then evaluating different architectures in order to select the best for the application at hand. Metrics will include speed, area, power consumption, module cohesion, module coupling, Power-Delay product, Area-Delay product, etc.
Telecomm
Computers
Consumer Electronics
"At the root of cascading changes of modern economic life...devaluing resources in technology, business and geopolitics...overcoming the constraints of material resources, the microchip has devalued most large accumulations of physical capital and made possible the launching of global economic enterprises...microchips find their value not in their substance but in their intellectual content: their design..." George Gilder, Microcosm, 1989
[Figure: wireless network types by range, including WWAN.]
The opportunity for creating value chains encompassing product offerings, distribution and new service offerings hinges on the ability to get low cost solutions to market quickly.
Deliver content to wireless handheld devices. Function convergence in the handset and at the base station. Requires large cross-functional design teams in varied disciplines.
Increasing disparity: capacity of the underlying technologies versus capability of designers to manage increasing design complexity.
The capability of designers and design teams to use this capacity isn't keeping up.
The Capacity versus Capability Gap is widening. Each set of technology and process changes requires designers to manage ever more complexity in the design process. New architectures, abstractions, methods and tools are required to address this increasing complexity.
[Figure: Device Capacity versus Design Capability, plotted as Design Size against year (1985 to 2000; 100K marked). Source: Gartner/Dataquest]
Electronic systems today are of different types, depending on the (1) function, (2) application.
Computing system: CPU and hardware executing an O/S, with applications running on top of the O/S.
Embedded system: fixed-function, instruction-based micro-controller with both hardware and software components.
On-chip system: complete control and data functions implemented in a "custom logic" VLSI package.
On-chip systems can be components of embedded systems, which can be part of a layered "virtual" machine.
[Figure: Embedded System block diagram (uC, ROM, RAM, I/O) and On-chip System.]
Levels of design abstraction: Architectural, Behavioral, Functional/RTL, Structural, Geometrical.
Representations, from highest to lowest level: algorithm, flowchart, queueing network, block diagram, Petri net; state equations, state diagram, flow diagram, ASM diagrams, RTL notation, datapath diagram, truth table, schematic diagram, netlist; layout, mask.
From the highest level to lower levels of design "abstraction", a ...
The design description is verified and validated at each level, often cycling between levels of abstraction. Design descriptions are expressed using one or more domain representations (Behavior, Structure, Physical).
Trade-offs and constraint checks at each node. At dead-end node, "backtrack" and try another path. Backtracking is costly and time consuming.
Specification; High-level Design and Verification; Logic Design, Timing Analysis; Layout, Routing; Production and Field Test
Problem
Behavioral or functional constraint violations cause 50-80% of cycling between design steps.
Goal
Eliminate unnecessary "cycling" through unplanned design steps. Improve the turnaround time per cycle.
Approach
Support easy exploration of design alternatives via iteration. Allow function & behavior changes to be made quickly.
Design Approach - "stepwise refinement", with "iterative enhancement". Create design "skeleton", with core functions and cycle-level timing information specified. Iterate the design through synthesis, checking key area and timing constraints.
[Flowchart: iterative design loop with YES/NO decision points, including a "Cycle-based Simulation?" check. Tools shown: Synopsys SGE(TM), Synopsys VSS(TM), FPGA Compiler(TM), DesignWare(TM).]
Return to the top of the process to make corrections, and to enhance the design description. Integrate completed behavioral block with other blocks for HDL "system" simulation.
[Flowchart, final YES/NO decision. Tool shown: Xilinx ISE(TM).]
[Figure: team design flow showing the Lead Designer, Designers #1 to #3 and the Test Engineer. Tools: IPalette(TM) blockHDL and Nimbus(TM) flowHDL. Labels: "Bind to blocks", "cycle-based verification".]
Dataflow structure. Usually represented as a lattice.
[Figure: FIR filter dataflow with inputs x(n), x(n-1), x(n-2), coefficients b0, b1, b2, multipliers (X) and adders (+).]
The intent is to specify the transfer function algorithm as a sequence of standard RTL statements. Note the alignment of operators so that they are the same width. We will then create a high-level design description for the control and data path operations comprising the design of this FIR filter element. The following diagram depicts the RTL representation of part of this functionality (ADDER units) using standard elements:
in1[7:0] a[7:0] b[7:0] ADD1[8:0] CLK RES
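The FIR element and its ADD units can be sketched in software before committing to RTL. This is a minimal sketch, assuming a 3-tap filter of the form y(n) = b0*x(n) + b1*x(n-1) + b2*x(n-2) as suggested by the dataflow lattice; the function names `add_rtl` and `fir3` and the sample values are hypothetical, not taken from the slides.

```python
def add_rtl(a, b, width=8):
    """Model an ADD unit: two width-bit operands produce a (width+1)-bit sum,
    mirroring in1[7:0] + b[7:0] -> ADD1[8:0]."""
    mask = (1 << width) - 1
    return (a & mask) + (b & mask)


def fir3(samples, b0, b1, b2):
    """Assumed 3-tap FIR: y(n) = b0*x(n) + b1*x(n-1) + b2*x(n-2)."""
    x1 = x2 = 0          # delay registers, cleared as if by reset
    out = []
    for x0 in samples:
        out.append(b0 * x0 + b1 * x1 + b2 * x2)
        x2, x1 = x1, x0  # shift the delay line on each clock
    return out


print(fir3([1, 2, 3], 1, 1, 1))
print(add_rtl(255, 255))
```

The extra result bit in `add_rtl` is the reason each adder stage in the datapath is one bit wider than its inputs.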
[Figure: FIR datapath alternatives with input IN, coefficients K0 to K3, multiplier (*) units, output OUT, and bus widths ranging from 4 to 10 bits.]
Additional MUX resources and additional registers required to resource share the Arithmetic units.
[Chart: Total Area by design pass, through the 3rd pass.]
The 2nd design model explored uses resources most efficiently. On examination, we see that the cost of the multiplexing and registers exceeds that of the original Multiplier resources (ignoring interconnect).
Next Section:
Discussion of the background of digital systems: their representation, specification and design.
Section 2: Introduction to High-Level Digital Design for Custom and Programmable Logic
High-level Design for Synthesis starts with an abstract description created in graphical notation. The notation is appropriate for capturing the structure and behavior of "on-chip" VLSI systems.
Notation #1: structured blocks for partitioning and interconnect of design components.
Notation #2: structured flowgraph for partially ordering a sequence of abstract operations.
Both notations are used to construct a "plan" for meeting the design spec, in terms of structure and behavior.
[Figure: block diagram notation (Block_1, Block_2, Block_3) and flowgraph notation (states s0 to s5, clocked on CLK rising with reset ^RES; signals outSig1, outSig2, idleSig, aSig; assignments aReg <- inBus, bReg <- '1', aReg <- '0', bReg <- '0').]
[Legend: higher-level block name, block name, block type (for example, Architectural - Local, or Functional Block such as UART), input pin and input port, output pin (appears as Output port at higher level), bi-directional pin (appears as Bi-directional port at higher level).]
Step #1: Define sequence of steps within control algorithm, and specific operations to occur in each step, using abstract "flow chart".
Step #2: Insert the necessary control to respond to events, using Case branch and Conditional branch.
Step #3: Allocate each abstract data operation to RTL macro operator.
Step #4: Bind individual buses for function variables to RTL bus units (registers, latches, wires).
Step #5: Define control scheduling with declaration of clocking.
Step #6: Decompose into concurrent "threads" for resource sharing.
[Figure: ASM flowgraph notation. State input conditions: Binary Decision Condition (If-Then) on !signal4 and !signal5; Multiway Branch Condition (CASE) on IObus with values 1001, 0110 and default. States s1 to s5. Assignments: A <- '0', Breg <- input2. Macro-function assignment: Output <- NMUX (Areg, Breg, in1). Memory Read/Write with Relative Addressing: MDR <- ScratchPad [MAR].]
Behavioral synthesis starts with an abstract description of behavior written in VHDL, SpecC or C, using no timing information.
Task #1: compile source code into intermediate format, for example, control-flow graph, dataflow graph.
Task #2: schedule data operations to occur on specific control cycles, determined by clocking.
Task #3: allocate data operations to RTL components implied by use of language operators <+, -, *...>.
Task #4: bind specific operations to individual RTL components, to construct complete circuit topology.
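Tasks #2 through #4 can be illustrated on a toy dataflow graph. The sketch below is an assumption-laden illustration, not the algorithm of any particular synthesis tool: it uses as-soon-as-possible (ASAP) scheduling, and the graph `OPS` and the names `asap_schedule`, `allocate` and `bind` are hypothetical.

```python
# Hypothetical dataflow graph: result name -> (operator type, operand names).
OPS = {
    "t1": ("ADD", ("a", "b")),
    "t2": ("ADD", ("c", "d")),
    "t3": ("ADD", ("t1", "t2")),
}


def asap_schedule(ops):
    """Task #2: put each operation in the earliest control step after its operands."""
    step = {}

    def visit(name):
        if name not in ops:
            return 0  # primary input, available before control step 1
        if name not in step:
            step[name] = 1 + max(visit(src) for src in ops[name][1])
        return step[name]

    for name in ops:
        visit(name)
    return step


def allocate(ops, step):
    """Task #3: units needed per operator type = peak use in any one control step."""
    use = {}
    for name, (kind, _) in ops.items():
        use[(kind, step[name])] = use.get((kind, step[name]), 0) + 1
    units = {}
    for (kind, _), count in use.items():
        units[kind] = max(units.get(kind, 0), count)
    return units


def bind(ops, step):
    """Task #4: assign each operation to a numbered unit, reusing units across steps."""
    next_free, binding = {}, {}
    for name in sorted(ops, key=lambda n: (step[n], n)):
        kind = ops[name][0]
        index = next_free.get((kind, step[name]), 0)
        next_free[(kind, step[name])] = index + 1
        binding[name] = f"{kind}{index}"
    return binding


schedule = asap_schedule(OPS)
print(schedule, allocate(OPS, schedule), bind(OPS, schedule))
```

Here two adders are needed in control step 1, and one of them is reused in control step 2, which is where the multiplexers in a resource-shared datapath come from.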
[Figure: scheduling and resource-sharing example with inputs a, b, c, d, adders (+) assigned to control steps 1 and 2, and MUXes with select signal s.]
Designer starts with a Functional Spec, and a "concept" of the design solution.
Choice #1: create initial design representation as RTL level VHDL or Verilog code, or...
Choice #2: create initial design representation using "behavioral" VHDL or SpecC, SystemC, C language code, or...
Choice #3: create initial design representation as high-level graphical "plan" of design solution.
[Figure: design flow with High-level Synthesis, RT level description, and Logic Synthesis stages.]
The goal is to create HDL code that is synthesizable and efficient for logic synthesis.
Graphical notations are more effective in chunking design knowledge: a few changes in the graphical model imply a large number of changes in HDL code.
More compact representation in graphics. The links between constructs carry much information.
Design consists of planning and configuration tasks, which are easier to perform with diagrammatic representations than textual ones. Graphics allow designers to keep focus at a higher level, making possible better trade-offs in the design, and also allowing more agile exploration of the design space.
[Figure: controller and datapath. Controller: Control in, input/next state decoding logic, State Registers, output decoding logic, control outputs, Control out; clocked by CLK with reset ^RES. Datapath: Data in, steering logic (MUX), combinational logic, clocked registers, Data out.]
[Figure: Moore machine. Inputs and present state feed the input/next state decoding logic; the next state is held in the State Registers; the output decoding logic drives the control outputs.]
Moore machines are used when it is important to synchronize all control actions with the change in state, and thus, by the clock. Moore machines effectively filter out transients, and can be used to eliminate race conditions when inputs are unfiltered.
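The defining property of a Moore machine, outputs that depend only on the registered state, can be shown with a small software model. This is a minimal sketch; the class `MooreFSM` and the "two consecutive ones" detector tables are hypothetical examples, not taken from the slides.

```python
class MooreFSM:
    """Outputs are a function of the current state only, so they change
    only when the state register is clocked."""

    def __init__(self, next_state, output, reset_state):
        self.next_state = next_state  # (state, input) -> next state
        self.output_table = output    # state -> output
        self.state = reset_state

    def output(self):
        return self.output_table[self.state]

    def clock(self, inp):
        """Sample the input at the active clock edge and update the state."""
        self.state = self.next_state[(self.state, inp)]
        return self.output()


# Hypothetical detector: output goes HIGH after two consecutive 1 inputs.
NEXT = {
    ("S0", 0): "S0", ("S0", 1): "S1",
    ("S1", 0): "S0", ("S1", 1): "S2",
    ("S2", 0): "S0", ("S2", 1): "S2",
}
OUT = {"S0": 0, "S1": 0, "S2": 1}

fsm = MooreFSM(NEXT, OUT, "S0")
trace = [fsm.clock(bit) for bit in [1, 1, 0, 1]]
print(trace)
```

Because `output()` never reads the input, a glitch on the input between clock edges cannot reach the outputs in this model.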
[Figure: state machine block diagram with inputs, input/next state decoding logic, State Registers, present state, output decoding logic and control outputs; timing point tp2 marked.]
Executable ASMs
States, Conditions, Cases, Conditional Outputs, Assertions, Assignments, Expressions, Macro-functions, Memory Indexing, Clocking, Reset, Synchronous events
[Figure: ASM notation elements. State Box; State Input Conditions: Binary Decision Condition (If-Then) on !signal4, Multiway Branch Condition (CASE) on IObus with values 1001 and 0110; states s1 to s5; assignment A <- '0'; macro-function assignment Output <- NMUX (Areg, Breg, in1); Memory Read/Write with Relative Addressing: MDR <- ScratchPad [MAR].]
[Figure: state diagram and ASM diagram compared for a sequence of states s0 to s2 with actions Action1 to Action6, and for binary branching on conditions a=0, a=1 and a|b.]
Sequence: Series of States. States follow the sequencing indicated by the direction of state transitions. ASM diagrams have actions attached to the right side of state "boxes" for clarity. In ASM diagrams, transitions to the next state are triggered by the system clock or any single-bit signal. States with no actions are called delay states, indicating a delay of one clock cycle.
Selection: Binary Branching. Binary branching is represented using a condition "diamond". Boolean expressions on conditions can be of arbitrary complexity, with many terms and variables. For simple branching situations, both representations are equally suited to the task.
[Figure: state diagram and ASM diagram compared for multi-way branching from s0 to states s1 to s4. Single-variable case: a=0100, others. Multi-variable conditions: a&b, ^a & b, ^a & c & d, ^b & c, default.]
Selection: Multi-way Branching, Single Variable. ASM diagrams use the case construct for multi-way branch conditions of a single, multi-bit variable/bus. Branch conditions are binary or enumerated values. In ASM diagrams, all undefined transitions are tied to a default transition. This eliminates possible transitions to unspecified states, a common cause of design failure in the field. State diagrams have no such mechanism, and thus are ambiguous.
Selection: Multi-way Branching, Multi-variable. In State diagrams: (1) the ordering of transitions is ambiguous; (2) not all transitions are specified (device problems likely in the field); (3) behavior is not predictable under all conditions. In ASM diagrams: (1) ordering of transitions is explicit; (2) transitions are specified for all possible input combinations (including MVL values); (3) behavior is predictable.
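The two ASM properties stressed above, a default transition and an explicit ordering of tests, can be sketched as next-state functions. This is an illustration only: the target states are hypothetical, and reading `^a` as "NOT a" is an assumption about the notation.

```python
def next_state_case(io_bus):
    """Single-variable CASE: listed values branch explicitly and every other
    value takes the default transition."""
    case_table = {0b1001: "s2", 0b0110: "s3"}  # hypothetical targets
    return case_table.get(io_bus, "s4")        # default transition


def next_state_ordered(a, b, c, d):
    """Multi-variable branch: conditions are tested in a fixed, explicit order,
    so overlapping conditions resolve the same way every time."""
    if (not a) and b:
        return "s1"
    if (not a) and c and d:
        return "s2"
    if (not b) and c:
        return "s3"
    return "s4"  # default covers every remaining input combination


print(next_state_case(0b1111), next_state_ordered(0, 1, 1, 1))
```

With inputs a=0, b=1, c=1, d=1 both of the first two conditions hold; the explicit ordering decides that the first one wins.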
[Figure: state diagram and ASM diagram of a control loop over states s0 to s2, branching on a and ^a.]
Repetition: "While-Do" Control Loop. The While-Do control structure is more apparent in the ASM diagram, and is consistent with hardware description language (HDL) constructs. A single decision point in the ASM diagram handles complexity more easily when multiple terms and variables are used in the looping condition expression. This type of control construct is used often for counting the number of loop iterations, where you test the condition prior to each execution of the loop, including the first pass.
Repetition: "Repeat-Until" Control Loop. Placement of the decision "diamond" defines the type of looping construct, and the type of control behavior. Repetition control structures are used in various styles of polling loops for implementing handshaking protocols. This type of control construct is used often for counting the number of loop iterations, where you test the condition after completing each execution of the loop, requiring execution of at least the first pass.
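The difference between the two loop forms comes down to where the test sits relative to the loop body. A minimal sketch, with hypothetical function names and a simple down-counter as the loop body:

```python
def while_do(count):
    """Test before each pass: zero passes are possible."""
    passes = 0
    while count > 0:
        count -= 1
        passes += 1
    return passes


def repeat_until(count):
    """Test after each pass: the body always runs at least once."""
    passes = 0
    while True:
        count -= 1
        passes += 1
        if count <= 0:
            break
    return passes


print(while_do(0), repeat_until(0))
print(while_do(3), repeat_until(3))
```

The two forms agree for any positive count and differ only when the condition is already false on entry.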
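[Figure: state diagram with prioritized transitions (a : P2, ^RES : P1) compared with an ASM diagram in which states s0, s1 and s2 each test a (0/1) and branch to s4 when a is asserted.]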
Control Interrupt Schemes - State Diagrams: The State diagram models state transitions, but the prioritization information--when multiple transitions are possible--isn't clear in the notation, without adding some additional symbol to indicate priority of the transitions. Also, if State transitions have additional conditions, it isn't clear what happens when conditions aren't met (incomplete specification). However, this may be handled by modeling looping on a state in some cases.
Control Interrupt Schemes - ASM Diagrams: In ASM diagrams, you can either define a test on each state transition for the specific event using condition "diamonds", or you can use the Enable Event construct, indicating that the specified event has precedence over the normally-specified next state transition. This works like a priority encoder on the next state decoding logic of the state machine. At any time when the input is sampled for determining next state, the transition for the Enable Event a will take precedence.
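The priority behavior of an Enable Event can be sketched as a next-state function in which the event is tested before the normal transition table. The state names, the table `NORMAL_NEXT` and the target `EVENT_TARGET` are hypothetical.

```python
NORMAL_NEXT = {"s0": "s1", "s1": "s2", "s2": "s0"}  # hypothetical normal sequencing
EVENT_TARGET = "s4"                                  # hypothetical handler state


def next_state(state, event_a):
    """The Enable Event is checked first, like the highest-priority input of a
    priority encoder; the normal transition applies only when it is inactive."""
    if event_a:
        return EVENT_TARGET
    return NORMAL_NEXT.get(state, "s0")  # states outside the table return to s0


print(next_state("s1", 0), next_state("s1", 1))
```

Because the test is applied in every state, there is no need to draw a separate condition diamond on each transition.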
Comparator (EQ):
- COMP allows comparison of two different inputs to check if they are equal. Alternate circuits evaluate the relative magnitude of two inputs.
- If they are equal, the output is HIGH, but if they are not equal, the output is LOW.
- The comparison must be done with two signal inputs of equal width.
- The output of the Comparator operation is a single-bit signal.
MUX allows the signal value of one of its data inputs (D0 to D7) to pass to the output F. The selection of the signal to be passed is controlled by SELECT lines (A, B, C). The number of select lines, n, follows from the number of inputs, m, which is a power of 2: if we have m inputs, we'll need n select lines so that 2**n = m. The MUX inputs and output must be the same width, and the SELECT lines are 1 bit each.
DECO takes a binary encoded input of n data bits, and decodes it into individual data output lines, where one output (D0 to D7) is enabled, depending on whether the encoded value corresponds to the data line number. A Decoder input with n lines means we can encode 2**n possible binary encoded values. With 2**n possible encoded values on the input, we'll need exactly 2**n output lines, one for each possible encoded input value. The output line corresponding to the decoded value will be enabled.
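The comparator, multiplexer and decoder described above can be modeled as small functions to check the width relationships. This is a behavioral sketch with hypothetical function names, not the library's macro definitions.

```python
def comp_eq(a, b):
    """Comparator (EQ): single-bit output, HIGH when the inputs are equal."""
    return 1 if a == b else 0


def select_lines(m):
    """Number of select lines n for an m-input MUX, where 2**n == m."""
    n = (m - 1).bit_length()
    if m < 1 or (1 << n) != m:
        raise ValueError("number of MUX inputs must be a power of 2")
    return n


def mux(inputs, select):
    """MUX: pass the data input chosen by the select value to the output."""
    select_lines(len(inputs))  # validates the input count
    return inputs[select]


def deco(value, n):
    """DECO: n encoded input bits enable exactly one of 2**n output lines."""
    return [1 if line == value else 0 for line in range(1 << n)]


print(select_lines(8))
print(mux(["D0", "D1", "D2", "D3", "D4", "D5", "D6", "D7"], 5))
print(deco(3, 3))
```

For the 8-input case on the slide, three select lines (A, B, C) address inputs D0 to D7, and a 3-bit decoder drives eight output lines.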
Macro types
Arithmetic: ADD, SUB, INCR, MUL, DIV, REM. Boolean logic: AND, OR, NOT. Steering logic: MUX, DECO, PENCO, DMUX. Combinational logic elements.
[Figure: AnyXN macro-function built from DECO, NOT, AND and BOR elements, with buses A_Bus, B_Bus and C_Bus; in the bound instance, nPort drives the select input and CollIn the data input, producing CollisionEvent.]
RTL macro-functions: contain over 30 primitive data path elements in a library. Macros are "scalable" - with any number of buses and any bus widths. Macros can be used to construct more complex user-defined data path functions.
Macro-function definition: AnyXN(A_bus,B_bus) ::= BOR(AND(B_bus,NOT(DECO(A_bus))))
Macro-function binding: CollisionEvent <- AnyXN(nPort,CollIn)
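The AnyXN composition can be traced in software. The sketch below rests on two assumptions that the slides do not spell out: DECO produces a one-hot word, and BOR is a reduction OR that collapses a bus to a single bit. Under that reading, AnyXN is HIGH when any B_bus bit other than the one selected by A_bus is set. All function names are hypothetical.

```python
def deco(value):
    """DECO (assumed one-hot): set only the bit numbered by the encoded value."""
    return 1 << value


def bus_not(word, width):
    """NOT: bitwise inversion limited to the bus width."""
    return ~word & ((1 << width) - 1)


def bus_and(a, b):
    """AND: bitwise AND of two buses."""
    return a & b


def bor(word):
    """BOR (assumed reduction OR): 1 if any bit of the bus is set."""
    return 1 if word != 0 else 0


def any_xn(a_bus, b_bus, n):
    """AnyXN(A_bus, B_bus) ::= BOR(AND(B_bus, NOT(DECO(A_bus))))
    where A_bus has n bits and B_bus has 2**n bits."""
    width = 1 << n
    return bor(bus_and(b_bus, bus_not(deco(a_bus), width)))


# Binding from the slide: CollisionEvent <- AnyXN(nPort, CollIn)
print(any_xn(2, 0b00000100, 3))  # activity only on port 2 itself
print(any_xn(2, 0b00000101, 3))  # activity also on port 0
```

Read this way, CollisionEvent flags activity on any port other than nPort, which fits the name of the signal.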
This circuit takes two single-bit inputs and adds them together to produce a Sum as output. Like the Half Adder, the Full Adder has a Carry output (called Carry Out). The Full Adder also has a Carry In signal, allowing the Carry Out from an earlier adder stage to be connected. This allows a multi-bit, multi-stage adder circuit to be constructed.
This circuit takes two multibit inputs and adds them together to produce a multibit Sum as output. The RPC Adder also has a Carry Out signal that results from the rippling of the carry output from each FA bit computation through the entire operand word length. The RPC Adder is a reasonable solution for small bit widths; however, a more elegant solution is needed when speed is critical, or when bit-widths get large.
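The rippling of the carry through a chain of Full Adders can be modeled bit by bit. A minimal sketch with hypothetical function names:

```python
def full_adder(a, b, cin):
    """One-bit Full Adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, width, cin=0):
    """Chain `width` Full Adders; each stage's Carry Out feeds the next Carry In."""
    carry = cin
    total = 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry  # (width-bit Sum, final Carry Out)


print(ripple_carry_add(200, 100, 8))
```

The loop makes the cost visible: the final carry is not valid until every stage has settled, so delay grows with operand width.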
Carry logic
The Add functions are a Boolean reduction of the Full Adder structure on previous page.
This is a structural model, with each bit modeled in terms of its gate-level logic. It is very low-level.
Each combinational block is represented as a separate, single-state concurrent thread. The final Sum bits are registered.
Each count cycle the Enable pulse is set, the Direction bit can be set to count in one or the other direction, and the Seed and NoCounts values are set. The counter will run for the number of cycles indicated by input NoCounts.
[Figure: counter control flow with states Poll and CntControl and a Valid signal.]
Setup Time
Time tsu is the amount of time we must keep stable data on the input prior to sampling at the active clock edge.
Hold Time
Time th is the amount of time the input signal value must be stable after the active clock edge, so that it can be sampled by the flip flop correctly.
[Timing diagram: tpsm = 0 and tpdp = 0, with events at t0 and t0+.]
Multiple clocks
Scheme 1: Single-phase, rising or falling edge clocks.
Scheme 2: Two-phase overlapping, rising or falling edge clocks.
Scheme 3: Two-phase non-overlapping, rising and/or falling edge clocks.
[Figure: three concurrent FSM threads. Thread 1: states s0, s1, s6. Thread 2: states s2, s3, s8. Thread 3: states s4, s5, s7. Signals shown: ASSERT_1, !ASSERT_2.]
Modeling Concurrency:
- Multiple model FSM "threads" having shared buses.
- Independent clocking schemes and enabling events (e.g., ^RES).
- Types of concurrent interaction:
I. Synchronization
- coordinated activities (e.g., handshaking, pipelining).
- implicit references to shared buses.
II. Competition
- shared resources (for example, bus arbitration).
- explicit use of other concurrent processes, components, or entities to model the arbitration protocol.
Architecture Patterns
Asynchronous handshaking:
FSM threads use an asynchronous interrupt mechanism to alert them when the event has occurred (however, it is most likely gated to a clock signal).
Sequencing
This pattern has the sequencing of data path operations by one or more state machines. The example shown is the data path for a small CPU, where micro-operations based on program instructions are decoded and staged to execute multi-cycle instructions out of memory. This example also uses a pipelining pattern structure (discussed later).
Source: Tanenbaum, 4th ed., 1999, Prentice-Hall
Pipelining - 1
There are two kinds of pipelining: data path pipelining and control pipelining. An example of control pipelining is the Instruction Fetch, Decode, Execute cycle used in all CPU architectures. Another example is Bus Reads and Writes, which are generally pipelined so as to interleave the control operations, thus saving clock cycles (shown in the figure).
Source: Tanenbaum, 4th ed., 1999, Prentice-Hall
Arbitration-2
The previous scheme involves use of a separate arbiter module, as is the case with most bus schemes.
An 802.11 WLAN operates this way when an Access Point is present, and the network is operating in PCF (Point Coordination Function) mode.
[Figure: 802.11 Wireless Medium (CSMA/CA) shared by Station-1 and Station-2.]
Memory Arrays
4 x 3 Memory Array
The Memory array is built up from gates and flip flops, to take advantage of certain properties of the devices. Each row is one of four 3-bit words. Data lines I0 to I2 feed all of the D FFs in a column. The address lines A0 and A1 act as select lines for a given word. The control signals CS (chip select), RD (read enable) and OE (output enable) are used to route data to and from memory (allowing writes and reads) based on the combinational logic gates. The bus drivers are enabled by the AND of the 3 control signals.
Here, we have two banks of memory that are selectable through a bank select signal. Usually, some number of upper bits of the Address are used for this purpose. Note how we access memory locations via assignment and an index register.
Invoke
Return
Invocation of a sub-flow in the ASM model: the handshaking logic is abstracted into a single thread fragment which is reused.
Request
Section 2 - Summary
Executable algorithmic state machines (E-ASM):
Allow both control and data operations to be specified in time (cycle by cycle scheduling) and space (allocation of specific resource types using abstract macro-functions). Notation we define is directly executable in the Nimbus tool set.
Realities:
Large multipliers don't behave like small ones, as they are not linearly scalable in space or time. [Parhami, 2000]
Questions:
What small multiplier unit configurations can be efficiently deployed to achieve the best tradeoffs in area and timing on the FPGA fabric? How do we structure the hierarchy of such units to achieve the best overall balance in building wide-bit MUL units for use in our application?
[Figure: Adder Base Unit and Adder Intermediate Unit, with operands A and B, result Z, carry-in CIN and carry-out COUT, and operand widths n, p, q and r (result widths 2n, "2p or p", "2q or q").]
Requires 6 32-bit MUL units, 3 64-bit ADD units, a 192-bit ADD unit, and lots of shifters and wide registers. We're I/O limited by a 32-bit PCI-X bus transfer to the FPGAs.
We size the MUL ops of this unit to match. We absorb some of the cycle cost of getting operand data by starting MUL pipelining after 7 bus cycles.
[Figure: wide-bit MUL pipeline. Operands A and B pass through a 32-bit Operand Select MUX into the multiplier (X) with a 64-bit result; registers Ra to Rf hold partial products; Shift_32 stages, MUXes and ripple-carry adders (+) with carry-out feed registers Rg to Ri and Rj to Rl, ending in the Final Product.]
MUL Architectures               Bit-widths
Virtex 18x18 Block (VE)         12, 16-bits
Behavioral VHDL * Op (BE)       32, 48, 64-bits
Booth (BO)                      16, 32, 48, 64-bits
Shift-Add (SA)                  16, 32, 48, 64-bits
Broadcast (BC)                  32, 48, 64, 96, 128, 192, 256-bits
Divide & Conquer (DC)           32, 48, 64, 96, 128, 192, 256-bits
Using a decomposition tree model, we can discuss size of the space of candidate MUL configurations, based on possible candidate partition trees.
Let R be the set of possible root node multipliers, of differing topology and bit-width (R type x R bit-width):
R = { DC192, DC256, BC192, BC256, .. }
Let NT be the set of non-terminal nodes, compositional MUL units, of different topology and bit-width, which can be further decomposed (either compositionally or recursively) into smaller NT units, where
NT = { DC128, DC96, DC64, DC48, DC32, BC128, BC96, BC64, BC48, BC32, BO64, BO48, BO32, SA64, SA48, SA32, BE64, BE48, BE32, ..}
Let T be the set of terminal nodes, base MUL units, which are not further decomposed, by type (width = 16 bits):
T = { VE16, VE12, BO32, BO16, SA32, SA16, ..}
[Figure: decomposition tree with the root node at depth d=0 (|S0| = N bits), non-terminal nodes at d=1 and d=2, and maximum depth dmax = 4.]
Goal: to evaluate our ability to estimate area and delay for different bit-width MUL unit architectures, top-down. Approach: build models of combinational & sequential logic units, whose width scalability costs can be assessed, bottom-up.
Build unit models using ASM. Compare cost of behavioral vs. structural model styles.
[Chart: adder cost by architecture (axis 0 to 20000). Carry Look Ahead 870.94; Carry Select 1031.10; Bit Serial (Gray) 19752.77; VHDL Add Operator ('+') 522.24.]
Metrics:
Area and delay are used to calculate the Area-Delay Product. Normalized Efficiency is the ratio of the lowest Area-Delay Product to each design's Area-Delay Product, so the most efficient design scores 100%.
[Chart: Normalized Efficiency.]
As ADD widths scale, do their costs factor significantly in overall cost of a wide-bit MUL?
What impact on design do ADD units have relative to the over-all MUL pipeline architecture? Focus on the delay cost of the carry chain.
[Chart: adder Delay (ns), 0 to 20, versus Operand Bit-width, 16 to 288.]
Factors:
For Virtex-II, with 18x18s, the ADD stages will be slower than the base unit MUL stages in the overall wide-bit pipeline. Clocking strategies will be gated by how fast ADD stages of different widths can run.
Split this 192-bit ADD into 3 64-bit ADD pipeline stages, then clock at faster rate.
[Figure: wide-bit MUL pipeline, repeated, with operands a0 to a5 and b0 to b5 supplied as 32-bit words through the Operand Select MUX; registers Ra to Rl, Shift_32 stages, MUXes, ripple-carry adders and the Final Product.]
[Chart data with partly lost labels: 269, 119, Knuth 425, Booth 387, 197; 13.563, 9.289, 4.421.]
Each path was costed out at width including Adder data collected in previous step. Some were pruned from further consideration.
At issue: how much better are Xilinx 18x18s than CLB-based MULs?
[Chart: delay values including Booth 9.289 and 6.223 (label lost). Preliminary data.]
[Chart: Area * Delay (axis to 3500). Shift-Add 1225.931; Knuth 3078.275; Booth 3594.843.]
How to assess the relative strength of using the Virtex II 18x18 multipliers as base units?
Use Normalized Efficiency, defined relative to the 18x18. Most of the other multiplier schemes are extremely inefficient when compared to the 18x18 as a base unit. The Shift-Add architectures reach about 43% of the 18x18 unit efficiency (60.4% in area, 71% in delay), making them the next candidate when 18x18 resources are exhausted. The Divide & Conquer is 14% as efficient as 18x18s at 32 bits!
[Chart: Normalized Efficiency relative to the 18x18 (100%). Shift-Add 42.9%; Knuth 17.1%; Booth 14.6%. Preliminary data.]
Worst case delay envelope => minimum cycle time Cycle time * number of cycles per operation => latency
[Chart: series DC-2 and BC. Preliminary data.]
State encoding scheme can impact delay cost from output decoding logic.
N states require register resources of log2(N)/2 CLB slices on a Virtex-II device (binary, Gray coding) or N/2 slices (one-hot), plus decoding LUTs and MUXes.
State encoding      Area * Delay      Normalized Efficiency
Binary              14081.23          79.02%
Gray                11126.29          100.00%
OneHot              14217.25          78.26%
- use of subunits consists of intermediate MUL units
- themselves divide & conquer units of the smaller width
- these ultimately make reference to MULT18x18s.
We can assess the cost of the MUL pipelines, in terms of latency and throughput. There is a big question as to whether DCn will scale effectively while holding resource utilization, latency and throughput at acceptable levels.
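The recursive structure of a divide & conquer multiplier, and how quickly its base-unit count grows, can be sketched as follows. This is an illustration under stated assumptions: a four-way split at each level, unsigned operands, and a 16-bit base unit standing in for a hardware block multiplier. `BASE_WIDTH`, `dc_mul` and `base_units` are hypothetical names.

```python
BASE_WIDTH = 16  # assumed width of the terminal (base) MUL unit


def dc_mul(a, b, width):
    """Multiply two unsigned width-bit operands by splitting each into halves
    and combining four half-width products with shifts and adds."""
    if width <= BASE_WIDTH:
        return a * b  # stands in for one base MUL unit
    half = width // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask
    hh = dc_mul(a_hi, b_hi, half)
    hl = dc_mul(a_hi, b_lo, half)
    lh = dc_mul(a_lo, b_hi, half)
    ll = dc_mul(a_lo, b_lo, half)
    return (hh << (2 * half)) + ((hl + lh) << half) + ll


def base_units(width):
    """Base MUL units used when every level is fully unrolled in space."""
    return 1 if width <= BASE_WIDTH else 4 * base_units(width // 2)


x = 0xFFFFFFFFFFFFFFFF
print(dc_mul(x, x, 64) == x * x)
print({w: base_units(w) for w in (32, 64, 128)})
```

Each doubling of the operand width quadruples the base-unit count in this fully spatial form, which is the scaling concern raised above; sharing units over time trades that area for latency.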
[Chart axis: Bit-width 32, 64, 128.]
[Figure: 802.11 Wireless Medium (CSMA/CA) shared by Station-1 and Station-2.]
Peak Power Consumption (mW/MHz)
Memory System
Selectable I & D cache sizes: 0, 4K, 8K ... 1M
Selectable I & D TCM sizes: 0, 4K, 8K ... 1M
1.1 mW/MHz @ 1.8 V (estimated)
Softdriver
PHY
MAC
Memory block (4~15 kB) + control block
10K gates
PCI/PCMCIA
Source: Knowledge Edge KK
Most 802.11b MAC implementations are done as embedded systems executing on a CPU (e.g., ARM microprocessor).
We'll be designing our MAC layer model in VLSI custom logic using concurrent state machines, and will generate a circuit using a Xilinx family FPGA device.
[Figure: class diagram for FrameSequencer (from Use Case View) with <<PartOf>> relations, the association Generates_Word, and "Maintains 1 State_Sequence".]
[Figure: state diagram with states Created, Target_Decoded, Data_Buffered and Done. Transition conditions: ^Select_Enable; Frame_Subtype != Data; Frame_Subtype = Data AND ^Buffer_Complete; Frame_Subtype = Data AND Buffer_Complete.]
4: Shift in 4-bits
5: Create Frame_Word
6: State = "Created"
7: Signal new Frame_Word
In this scenario, we assume this is the first word created for a new frame.
10: Set Decoder_Selector = "FrameHeaderDecoder"
11: Signal "Target_Decoded"
13: Signal Frame_Header_Decoder
14: Latch Frame_Word
15: Signal "Select_Enable"
16: State = "Latched"
17: Decode Header
18: Pass Frame_Subtype
19: Signal "Decode_Complete"
20: State = "Fully_Decoded"
9: Signal New Word Available
8b: Buffer Header Word
10: Initialize for New Frame Header
11: Latch Frame Header Word
12: Determine Frame Subtype & DS Bits
13: Buffer Frame Subtype and DS Bits
14: Signal Header_Data_Ready
15: Load Subtype & DS Bits
6: New_Word
Note that the Word_Counter of the previous slide is now decomposed into the Word-Selector and the Frame_Sequencer actors, which will be the basis for depicting the details in the block diagrams that follow.
11: Make Decoder Selection for Word Destination
12: Set Enable Line for downstream block
Poll waiting for second word (16 bits) of FCS frame field.
Check that FCS field bits match CRC-32 bits computed from frame stream on the fly.
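Computing the CRC "on the fly" means updating a running remainder as each 16-bit word arrives, rather than buffering the whole frame. The sketch below assumes the standard IEEE 802 CRC-32 parameters (reflected polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF), which is what Python's `zlib.crc32` implements; the low-byte-first order within each word and the function names are assumptions for illustration.

```python
import zlib

POLY_REFLECTED = 0xEDB88320  # IEEE 802 CRC-32 polynomial, bit-reflected form


def crc32_update(crc, byte):
    """Fold one byte into the running CRC, one bit per iteration."""
    crc ^= byte
    for _ in range(8):
        crc = (crc >> 1) ^ POLY_REFLECTED if crc & 1 else crc >> 1
    return crc


def fcs_on_the_fly(words16):
    """Update the CRC as each 16-bit word arrives (low byte first, an assumption)."""
    crc = 0xFFFFFFFF
    for word in words16:
        crc = crc32_update(crc, word & 0xFF)
        crc = crc32_update(crc, (word >> 8) & 0xFF)
    return crc ^ 0xFFFFFFFF


def fcs_matches(words16, received_fcs):
    """Compare the received FCS value with the CRC computed over the frame words."""
    return fcs_on_the_fly(words16) == received_fcs


data = b"12345678"  # hypothetical frame contents
words = [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]
print(fcs_on_the_fly(words) == zlib.crc32(data))
```

In hardware the inner loop becomes a fixed XOR network applied once per word, so the check adds no cycles beyond the arrival of the last FCS word.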
Test if the new word is the first word of a new MAC frame. Our sequencing choice depends on whether it is the first word or not. We'll always assume Frame Control Header if we have a new frame. Enable decoding of target block select, based on current state of Frame Sequencer.
This logic block encapsulates output decoding logic, which is good to put in ASM sub-flows.
Section 3 - Summary
Executable algorithmic state machines:
Allow both control and data operations to be specified in time (cycle by cycle scheduling) and space (allocation of specific resource types using abstract macro-functions). Notation we define is directly executable in the Nimbus tool set, hence, executable ASM models. Basic data operations and memory operations supported directly in the ASM notation using datapath macro-functions and memory arrays.