Tsinghua Session1 040425 D 2pg

High-level VLSI Digital Systems Design: Methods, Notation, Architecture
Dr. James P. Davis, Associate Professor; Director, VLSI Systems Design Lab; University of South Carolina, Columbia, S.C., U.S.A.
2004 Dr. James P. Davis
Seminar Session Outline

1. Introduction to High-level Design (50 minutes)
1.1 Rationale - Design Issues and Trends
1.2 Systems - Custom Logic versus Microprocessor Embedded
1.3 Concepts - Design Hierarchy, Modeling, Abstractions, Patterns, Reuse
1.4 Process - Design Objectives, Planning, Activities
1.5 Methods 1 - Design Representation and Search Space
1.6 Methods 2 - Model-Driven Architecture (MDA) for VLSI Systems
1.7 Metrics - Measuring Effectiveness and Productivity
Question & Answer
2. Teaching & Practicing High-level Design (70 minutes)

2.1 Using the Executable ASM Method
2.2 Architecture of Digital Systems - Control and Datapath
2.3 Analysis and Modeling of Algorithms and Protocols
2.4 Datapath-Dominated Designs - Arithmetic and Filter Circuits
2.5 Control-Dominated Designs - Protocol Engines
2.6 Applying Architecture Patterns for Reuse
Question & Answer
3. Example High-level System Designs (50 minutes)

3.1 Unsigned Integer Multiplier Circuits
3.2 MAC Layer for 802.11b Wireless LAN
Question & Answer
4. Session Wrap-Up
Section 1 Introduction to High-Level Design
Section 1 Key Points

High-level analysis and architecture design of VLSI-based digital systems.
Assume systems will be implemented wholly in custom logic. Assume for this lecture that applications are designed without use of CPU, and therefore do not incur the overhead of an Instruction Set.
Design process, methods, and notation allow you to take any digital systems application, analyze it, and implement it: starting with math formulae, algorithms or protocols.
Iterative enhancement, top-down/bottom-up design, and stepwise refinement (staples of the Software Engineering discipline). Additional heuristics relate properties of digital circuits to high-level systems design, to make the systems planning process more effective.
Evaluate the goodness of your architecture based on the use of available tools, by collecting and analyzing post-implementation data to select a best fit.
Method of designing for circuit synthesis, then evaluating different architectures in order to select the best for the application at hand. Metrics will include speed, area, power consumption, module cohesion, module coupling, Power-Delay product, Area-Delay product, etc.
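As a minimal sketch of how two of these metrics can rank candidate architectures: the two candidates and all of their numbers below are hypothetical, chosen only to show that Area-Delay product and Power-Delay product can favor different designs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    area: float       # e.g. gate equivalents (hypothetical values below)
    delay_ns: float   # critical-path delay
    power_mw: float   # average power


def area_delay_product(c: Candidate) -> float:
    return c.area * c.delay_ns


def power_delay_product(c: Candidate) -> float:
    return c.power_mw * c.delay_ns


def best_by(candidates, metric):
    """Return the candidate with the smallest value of the given metric."""
    return min(candidates, key=metric)


# Hypothetical post-synthesis data, for illustration only.
CANDIDATES = [
    Candidate("parallel", area=1200.0, delay_ns=10.0, power_mw=50.0),
    Candidate("shared", area=900.0, delay_ns=16.0, power_mw=30.0),
]

if __name__ == "__main__":
    print(best_by(CANDIDATES, area_delay_product).name)
    print(best_by(CANDIDATES, power_delay_product).name)
```

Which metric to minimize depends on the application at hand; the point of the sketch is only that the selection is made from measured data rather than intuition.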
System Design Trends What is driving us:

Complexity, Capacity vs. Capability
Introduction - Vertical Market Drivers
Telecomm
Computers
Consumer Electronics
VLSI Silicon "chip"
"At the root of cascading changes of modern economic life...devaluing resources in technology, business and geopolitics...overcoming the the constraints of material resources, the microchip has devalued most large accumulations of physical capital and made possible the launching of global economic enterprises...microchips find their value not in their substance but in their intellectual content: their design..." George Gilder, Microcosm, 1989
Example Wireless Communications

The market is seeking product technology options to cover different geography ranges and data rates.
Bluetooth: WPAN. IEEE 802.11: WLAN. 2G/3G network: WWAN.
The opportunity for creating value chains encompassing product offerings, distribution and new service offerings hinges on the ability to get low-cost solutions to market quickly.
Deliver content to wireless handheld devices. Function convergence in the handset and at the base station. Requires large cross-functional design teams in varied disciplines.
[Figure: Range versus Data Rate for WPAN (Bluetooth), WLAN (802.11b, IEEE 802.11g) and WWAN (2G/3G network, CDMA). Source: Knowledge Edge KK]
Introduction - VLSI SOC Drivers

Many market and technology factors coming together to create pressure on electronics product engineering organizations worldwide.
Increasing global competition and new markets.
Increasing rate of product innovations and new product introductions.
Decreasing time-to-market windows.
Decreasing shelf life for products in many categories.
Increasing pressure on competitive cost containment, profit margins.
Increasing convergence: integrated functionality in single electronics devices and product packages.
Increasing quality expectations: used as a means to better manage distribution and support costs.
Increasing innovation in silicon process technology and wafer scale integration densities.
Increasing disparity: capacity of the underlying technologies versus capability of designers to manage increasing design complexity.
The Capacity vs. Capability Gap

Increasing capacity of the technology:
The rate of new technology and associated silicon process changes has continued to follow Moore's Law.
The capability of designers and design teams to use this capacity isn't keeping up.
The Capacity versus Capability Gap is widening. Each set of technology and process changes requires designers to manage ever more complexity in the design process. New architectures, abstractions, methods and tools are required to address this increasing complexity.
[Figure: Device Capacity, Design Size and Design Capability (100K to 10,000K) and Product Time-to-Market (months), 1985 to 2000. Source: Gartner/Dataquest]
Introduction - VLSI SOC Objectives

The objectives of SOC approaches are to better manage design complexity.
Using better design planning in trade-off analysis and decision-making:
Greater availability of downstream design constraints earlier in the process.
Back-annotating early iteration data into high-level design activities.
Using electronic systems design best practices:
Increased levels of design reuse.
More effective hardware-software co-design.
Better trade-offs between general-purpose vs. domain-specific architectures and algorithms.
Greater integration of functionality on-chip (hardware-software, analog-digital).
SOC Architecture Imperatives:

Reusability/Extensibility: faster creation of primary and derivative products.
Reliability: managing device technology constraints as geometry shrinks.
Scalability: ever-larger design densities and levels of integration.
Performance: increasing data throughput, system capacity requirements.
Resource Utilization: efficient use of function, area, power, clocking, interconnect.
Design Problem-Solving What we use:

Methods, Notation, & Process
VLSI-based Design Space (Y-chart)
Categories of Computing Systems Design

"Layered" Computing System
Electronic systems today are of different types, depending on the (1) function, (2) application.
Computing system: CPU and hardware executing an O/S, with applications running on top of the O/S.
Embedded system: fixed-function, instruction-based micro-controller with both hardware and software components.
On-chip system: complete control and data functions implemented in a "custom logic" VLSI package.
On-chip systems can be components of embedded systems, which can be part of a layered "virtual" machine.
[Figure: layers of a computing system: VLSI hardware, microcode, machine code, operating system, application.]
[Figure: Classes of Electronic Systems. Embedded System: uC, ROM, RAM, I/O. On-chip System: ASICs and FPGAs.]
Mapping Algorithms/Protocols to Architecture
Levels of Abstraction in System Design

A design transforms from "concept" to "implementation" in a series of ordered levels.
From the highest level to lower levels of design "abstraction", a design is iteratively refined.
The design description is verified and validated at each level, often cycling between levels of abstraction. Design descriptions are expressed using one or more domain representations (Behavior, Structure, Physical).
[Figure: levels of abstraction: Protocol, Architectural, Behavioral, Functional/RTL, Structural, Geometrical. Example representations shown: UML diagrams, algorithm flowchart, queueing network, block diagram, petri net, state equations, state diagram, flow diagram, ASM diagrams, RTL notation, datapath diagram, truth table, schematic diagram, netlist, layout, mask.]
Algorithm to Architecture Process

[Figure: algorithm-to-architecture process. Inputs: Software Algorithm (C code); Algorithm Spec (Text or Math). Activities: Control Flow modeling (algorithmic structure); Data Flow modeling (operation ordering); Create Ordered Sequence of Operations; Overlay Operation Sequence onto Control Structure; Add Hardware Semantics (clocking, operation scheduling, parallelism, resource binding).]
Design as a Problem-solving Process

The "search" for an optimal solution involves tools & methods.

Many possible solutions, some better than others. Search through the "solution space".
Trade-offs and constraint checks at each node. At a dead-end node, "backtrack" and try another path. Backtracking is costly and time consuming.
[Figure: search tree rooted at Application Requirements, descending through Specification; High-level Design and Verification; Logic Design, Timing Analysis; Layout, Routing; Production and Field Test.]
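The search-with-backtracking idea can be sketched as a depth-first search over design decisions. Everything concrete here is an assumption made for illustration: the two decision levels, the option names, their area and delay figures, the budgets, and the simplification that area and delay simply add along a path.

```python
def search(levels, is_feasible, partial=(), visited=None):
    """Depth-first search through the solution space.

    Returns the first complete path that passes the constraint check at
    every node, or None. A None return from a child is a backtrack.
    """
    if visited is not None:
        visited.append(partial)           # record every node we had to examine
    if not is_feasible(partial):
        return None                       # dead-end node: caller tries another path
    if len(partial) == len(levels):
        return partial                    # all decisions made, constraints met
    for option in levels[len(partial)]:
        found = search(levels, is_feasible, partial + (option,), visited)
        if found is not None:
            return found
    return None


# Hypothetical options: (name, area, delay).
LEVELS = [
    [("1 multiplier", 300, 40), ("4 multipliers", 1000, 10)],
    [("ripple adder", 100, 20), ("lookahead adder", 250, 8)],
]


def within_budget(partial, max_area=1300, max_delay=25):
    """Constraint check applied at each node (additive cost model assumed)."""
    area = sum(option[1] for option in partial)
    delay = sum(option[2] for option in partial)
    return area <= max_area and delay <= max_delay


def explore():
    visited = []
    solution = search(LEVELS, within_budget, visited=visited)
    return [option[0] for option in solution], len(visited)


if __name__ == "__main__":
    print(explore())
```

In this toy space the search examines five nodes and backtracks twice before reaching a feasible design; in a real flow each of those nodes is a design iteration, which is why backtracking late in the process is so expensive.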
Impact of Backtracking on High-level Design

[Figure: design flow with backtracking paths. Steps: Application Requirements; Requirements Text Documentation; Function and Timing Specification; HDL Coding or Text Documentation; High-level Design and Verification; Logic Synthesis or Schematic Capture; Gate-level Analysis and Verification; Layout and Routing; Physical Analysis and Verification. Backtracking triggers: Requirements Violation; Behavioral or Functional Constraint Violation; Area or Timing Constraint Violation (after high-level verification and again after gate-level verification).]
Problem
Behavioral or functional constraint violations cause 50-80% of cycling between design steps.
Goal
Eliminate unnecessary "cycling" through unplanned design steps. Improve the turnaround time per cycle.
Approach
Support easy exploration of design alternatives via iteration. Allow function & behavior changes to be made quickly.
Process - Design-for-Synthesis Methods & Tools

Design Approach - "stepwise refinement", with "iterative enhancement". Create a design "skeleton", with core functions and cycle-level timing information specified. Iterate the design through synthesis, checking key area and timing constraints.
Return to the top of the process to make corrections, and to enhance the design description. Integrate the completed behavioral block with other blocks for HDL "system" simulation.
[Figure: design-for-synthesis flowchart. Steps and decision points: Start; Capture Design; Compile & Checking (Correct Entry?); Cycle-based Simulation?; Behavioral Simulation (Correct Behavior?); HDL Simulation Required?; Functional Simulation (Correct Function?); Logic Synthesis and Gate-level Timing Analysis (Correct Timing?); Partition, Place & Route (Area & Speed?); Fabricate Device; Done. Tools named: KBS flowHDL, Exsedia Nimbus(TM), IPstation blockHDL, Synopsys SGE, Synopsys VSS, Synopsys Design Compiler, HDL Compiler, FPGA Compiler, DesignWare, Design Analyzer, Timing Analyzer, Xilinx ISE(TM).]
The Design Integration Process

Decompose into functionally partitioned modules.
Capture, analyze and verify unit-level module behavior.
Capture, plan and verify the system test harness graphically.
[Figure: design integration flow. Roles: Lead Designer, Designers #1 to #3, Test Engineer. Activities: entry, cycle-based verification, automatic HDL code generation, test spec entry, bind to blocks, automatic HDL test harness generation, integrate HDL modules, integrate HDL test harness. Tools named: IPalette blockHDL(TM), Nimbus flowHDL(TM).]

Analysis & Design Process Variations

We can have several possible process variations, using the Executable ASM as the core methodology component in analysis and design.
Starting from a protocol description, derive an appropriately partitioned architecture of communicating concurrent ASM threads for realizing the distributed control.
Starting with an algorithm, derive an appropriately sequenced and scheduled set of data operations, allocated and bound to appropriate classes of macro operators, as modeled using one or more ASM threads.
Starting with an existing circuit design, refactor or reengineer the design into an abstract architecture consisting of one or more ASM threads.
Architecture Analysis Example
Finite Impulse Response Filter
Example Finite Impulse Response Filter

Mathematical structure.
In the 802.11 PHY Digital Signal Processor (DSP) block, it is a requirement to smooth time samples of data with a digital Finite Impulse Response (FIR) filter. Values of past input samples are required, making this a sliding-window (non-recursive) operation in which discrete data signals pass through a linear time-invariant system, given the transfer function. The equation of this system's behavior is as follows:
We generate an expansion for the above equation, as follows:
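Dataflow structure.
Usually represented as a lattice.
[Figure: dataflow lattice with input Xn, delayed samples x(n-1) and x(n-2), coefficient multipliers (X) for b0, b1 and b2, and two adders (+).]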
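A minimal behavioral sketch of the dataflow structure, assuming the standard three-tap form y(n) = b0*x(n) + b1*x(n-1) + b2*x(n-2), which matches the b0, b1, b2, x(n-1) and x(n-2) labels in the lattice. The function name, the zero initial history and the example coefficients are illustrative assumptions.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y(n) = sum over k of coeffs[k] * x(n - k).

    Samples before the start of the stream are assumed to be zero.
    """
    history = [0] * len(coeffs)           # x(n), x(n-1), ..., x(n-K+1)
    outputs = []
    for x in samples:
        history = [x] + history[:-1]      # shift the delay line by one sample
        outputs.append(sum(b * v for b, v in zip(coeffs, history)))
    return outputs


if __name__ == "__main__":
    # With b0 = b1 = b2 = 1 the filter is a 3-sample moving sum.
    print(fir_filter([1, 2, 3, 4], [1, 1, 1]))   # [1, 3, 6, 9]
    # An impulse input reads the coefficients back out, then goes to zero.
    print(fir_filter([1, 0, 0, 0], [3, 2, 1]))   # [3, 2, 1, 0]
```

The impulse case shows why the response is "finite": once the impulse has left the delay line, the output returns to zero.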
FIR Filter Analysis of Datapath Operations

Data path architecture.
The above operation is expressed as a sequence of Register Transfer statements as follows.
out1[9:0] <- ADD(ADD(a,b), CAT(0,in1))
a <- in1[7:0]
b <- a[7:0]
The intent is to specify the transfer function algorithm as a sequence of standard RTL statements. Note the alignment of operators so that they are the same width. We will then create a high-level design description for the control and data path operations comprising the design of this FIR filter element. The following diagram depicts the RTL representation of part of this functionality (ADDER units) using standard elements:
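[Figure: RTL datapath with input in1[7:0], registers a[7:0] and b[7:0], adder ADD1[8:0], constant '0' concatenated with in1, adder ADD0[9:0], output out1[9:0], and CLK and RES signals.]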
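The register-transfer statements can be exercised with a small cycle-level model. This is a sketch under stated assumptions: all three transfers are treated as registered and evaluated simultaneously on each clock edge, the registers reset to zero, and the helper names `clock_edge` and `run` are invented for the example. The masks mirror the 9-bit ADD1 and 10-bit ADD0 widths in the RTL diagram.

```python
def clock_edge(state, in1):
    """Apply the three register transfers at once, using pre-edge values."""
    a, b = state["a"], state["b"]
    in1 &= 0xFF                             # in1[7:0]
    add1 = (a + b) & 0x1FF                  # ADD(a, b): 9-bit result
    add0 = (add1 + in1) & 0x3FF             # ADD(add1, CAT(0, in1)): 10-bit result
    return {"out1": add0, "a": in1, "b": a}  # b receives the OLD value of a


def run(inputs):
    """Feed one input per clock edge; return out1 after each edge."""
    state = {"out1": 0, "a": 0, "b": 0}     # assumed reset values
    outputs = []
    for in1 in inputs:
        state = clock_edge(state, in1)
        outputs.append(state["out1"])
    return outputs


if __name__ == "__main__":
    print(run([1, 2, 3, 4]))   # [1, 3, 6, 9]
```

Because a and b hold the previous two inputs, out1 is the sum of the current and two prior samples, and the widest possible result (3 x 255 = 765) fits in 10 bits, which is why the operator widths are aligned as they are.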
Example - FIR Filter Architecture: 1st Pass

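[Figure: 1st-pass FIR architecture. Input IN, coefficients K0 to K3, registers a, b and c, clock CLK, 4-bit operands, 8-, 9- and 10-bit intermediate buses, output OUT.]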
Example - FIR Filter Architecture: 2nd Pass

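[Figure: 2nd-pass FIR architecture. Input IN, coefficients K0 to K3, four multipliers (*), 4-bit operands, 8-, 9- and 10-bit intermediate buses, control signals CL1 and CL3, output OUT.]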
Example - FIR Filter Architecture: 3rd Pass

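[Figure: 3rd-pass FIR architecture. Input IN, one shared multiplier (*), coefficients K0 to K3, control signals CL1 to CL7, 9- and 10-bit buses, output OUT.]
Additional MUX resources and additional registers are required to resource-share the arithmetic units.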
FIR Filter - Architecture: Area Evaluation

The 3 design architectures are synthesized into circuits, and their area data is compared to see which is most efficient in resource usage.
[Figure: chart of Total Area for the 1st Pass, 2nd Pass and 3rd Pass architectures; vertical axis from 1045 to 1100.]
The 2nd design model explored uses resources most efficiently. On examination, we see that the cost of the multiplexing and registers exceeds that of the original Multiplier resources (ignoring interconnect).
Section 1 Summary of Key Points

Increasing Pressure Factors.
System complexity. Design functionality. Productivity Gap: Device Capacity vs. Designer Capability.
Systems Design Methods.

Manage design complexity through analysis and design methods.
Model-driven architecture definition (hardware and software).
Map algorithms and protocols into architectures with high cohesion and low coupling (use of metrics).
Executable ASM models are an important design representation method, allowing you to bridge the gap between system algorithm or protocol specification and circuit architecture.
Use process to explore the design space and realize the best trade-off.
Next Section:
Discussion of the background of digital systems: their representation, specification and design.
Section 2 Introduction to High-Level Digital Design for Custom and Programmable Logic
Section 2 Key Points

Design of Digital Systems.
Combinational logic (truth tables, Boolean expressions, K-Maps for minimization up to 4-5 variables).
Sequential logic (state tables, bubble diagrams, ASM diagrams).
Digital System Unit = Control + Datapath + Timing Constraints.
Digital System = N Digital System Units + Sequenced Interaction Pattern.
Design of Computing Systems.

Specification as a custom-logic architecture can be done as effectively as programming a particular microprocessor architecture.
Custom-logic implementations are usually 100x faster than processor-based embedded software, due to elimination of the instruction set, implementing operations in parallel, and locating memory (registers, arrays) adjacent to the logic processing circuit.
Map algorithms and protocols into architectures with high cohesion and low coupling (use of metrics).
Executable ASM models are an important design representation method, allowing you to bridge the gap between system algorithm or protocol specification and circuit architecture.
Design Representation of Digital Systems or

What are the notations and methods of analysis and design?
Representation for Design for Synthesis

On-chip System
High-level Design for Synthesis starts with an abstract description created in graphical notation.
The notation is appropriate for capturing structure and behavior of "on-chip" VLSI systems.
Notation #1: structured blocks for partitioning and interconnect of design components.
Notation #2: structured flowgraph for partially ordering a sequence of abstract operations.
Both notations are used to construct a "plan" for meeting the design spec, in terms of structure and behavior.
[Figure: block diagram (Block_1, Block_2, Block_3) beside an ASM flowgraph clocked by CLK (rising) with ^RES, states s0 to s5, signals idleSig, aSig, outSig1, outSig2 and outSig3, and bus assignments such as aReg <- inBus, bReg <- '1', aReg <- '0', bReg <- '0', aReg <- '4', bReg <- '2'.]
Digital Systems Design - Structure

Block diagram: used to partition and decompose a design into its functional units.
Step #1: Identify the "top level" and all functional operations that transform the application's data.
Step #2: Group the functions by how the data is manipulated, or by what resources are needed.
Step #3: Separate different functions from each other by "interconnect" that shows the flow of data.
Step #4: Decompose more "abstract" functions into more "primitive" sets of functions that operate on data.
Step #5: Repeat Step #4 until all the functions are fully decomposed into a hierarchy of "primitive" units.
[Figure: annotated block-diagram notation for a UART example (Top_Level, block_1; Architectural - Local). Labeled elements: higher-level block name, block name, block type, input pin and input port, output pin (appears as Output port at higher level), internal output/buffer pin (appears as Buffer port at higher level), bi-directional pin (appears as Bi-directional port at higher level), functional block, bounding box (encapsulates the block's internal structure and interconnect).]
Digital Systems Design Behavior

E-ASM Diagram: Executable Algorithmic State Machine (ASM), used to decompose behavior within a function block into an ordered sequence of operations.
Step #1: Define the sequence of steps within the control algorithm, and the specific operations to occur in each step, using an abstract "flow chart".
Step #2: Insert the necessary control to respond to events, using Case branch and Conditional branch.
Step #3: Allocate each abstract data operation to an RTL macro operator.
Step #4: Bind individual buses for function variables to RTL bus units (registers, latches, wires).
Step #5: Define control scheduling with declaration of clocking.
Step #6: Decompose into concurrent "threads" for resource sharing.
[Figure: annotated E-ASM diagram with states s0 to s5. Labeled elements: clocking definition (CLK1 (rising)) and enabling event definition (^RES); Moore machine actions (synchronous or asynchronous), including signal assertion (signal1) and bus assignment (Areg <- '0'); state input conditions, including a binary decision condition (If-Then) on the Boolean input expression input1 & input2 and a multiway branch condition (CASE) on IObus with values 1001, 0110 and default; Mealy machine actions (synchronous or asynchronous), such as !signal5 and Breg <- input2; macro-function assignment (Output <- NMUX (Areg, Breg, in1)); memory read/write with relative addressing (MDR <- ScratchPad [MAR]).]
Overview of Behavioral Synthesis

Behavioral synthesis starts with an abstract description of behavior written in VHDL, SpecC or C, using no timing information.
Task #1: Compile source code into an intermediate format, for example, a control-flow graph or dataflow graph.
Task #2: Schedule data operations to occur on specific control cycles, determined by clocking.
Task #3: Allocate data operations to RTL components implied by use of language operators <+, -, *...>.
Task #4: Bind specific operations to individual RTL components, to construct the complete circuit topology.
[Figure: two implementations of d <= a + b + c; one performs both additions in control step 1 using two adders, the other spreads the additions over control steps 1 and 2 using a single adder whose inputs are steered by MUXes with select s.]
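The scheduling and binding tasks can be sketched with a small resource-constrained list scheduler. This is an illustrative simplification, not the algorithm of any particular synthesis tool: operations are named strings, an operation may start only after its dependencies finished in an earlier control step (so operator chaining within one step is not modeled), and the second dataflow graph is a hypothetical wider expression added to show the effect of a second adder.

```python
def schedule_and_bind(ops, n_adders):
    """ops maps each operation name to the operations it depends on.

    Returns a list of control steps; each step is a list of
    (operation, adder_index) pairs, i.e. the schedule plus the binding.
    """
    remaining = dict(ops)
    done = set()
    steps = []
    while remaining:
        ready = [op for op, deps in remaining.items()
                 if all(d in done for d in deps)]
        if not ready:
            raise ValueError("cyclic dependency in dataflow graph")
        chosen = ready[:n_adders]                  # limited by allocated adders
        steps.append([(op, adder) for adder, op in enumerate(chosen)])
        for op in chosen:
            del remaining[op]
        done.update(chosen)                        # results usable from next step
    return steps


# d <= a + b + c as two dependent additions.
SUM3 = {"t = a + b": [], "d = t + c": ["t = a + b"]}
# Hypothetical wider expression: e <= (a + b) + (c + d).
SUM4 = {"t1 = a + b": [], "t2 = c + d": [], "e = t1 + t2": ["t1 = a + b", "t2 = c + d"]}

if __name__ == "__main__":
    print(schedule_and_bind(SUM3, n_adders=1))
    print(len(schedule_and_bind(SUM4, n_adders=1)), "steps with 1 adder")
    print(len(schedule_and_bind(SUM4, n_adders=2)), "steps with 2 adders")
```

With one adder, both additions of d <= a + b + c are bound to adder 0 in consecutive control steps, which is the case that needs input multiplexing; allocating more adders trades area for fewer control steps only when the dataflow graph has independent operations.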
High-Level Design for Synthesis

Concept
Designer starts with a Functional Spec, and a "concept" of the design solution.
Choice #1: create the initial design representation as RTL-level VHDL or Verilog code, or...
Choice #2: create the initial design representation using "behavioral" VHDL or SpecC, SystemC, C language code, or...
Choice #3: create the initial design representation as a high-level graphical "plan" of the design solution.
[Figure: three paths from Concept to a gate level description. A behavioral C/VHDL description passes through High-level Synthesis; a graphical plan-based description passes through High-level design for synthesis; a register-level VHDL/Verilog description is written directly. Each path yields an RT level description, which Logic Synthesis turns into a gate level description.]
The goal is to create HDL code that is synthesizable and efficient for logic synthesis.
Graphical Languages versus HDLs

Why we choose to use graphical design notations for analysis and architecture design:
Human mind works more effectively with visual and spatial information (for learning, retention, manipulation of artifacts, and for communicating ideas).
During evolution of human species, we spent more time using pictures to convey ideas rather than text-based language. We use sophisticated graphical representation of design artifacts annotated with textual components.
Graphical notations are more effective in chunking design knowledge: a few changes in the graphical model imply a large number of changes in HDL code.
More compact representation in graphics. The links between constructs carry much information.
Design consists of planning and configuration tasks, which are easier to perform with diagrammatic representations than textual ones. Graphics allows designers to keep focus at higher-level, making possible better trade-offs in the design, also allowing more agile exploration of design space.
Digital Systems Design
Partitioning of Control and Data
Functional Partitioning of Control and Datapath

[Figure: functional partitioning. Control Unit: control in, inputs, input/next state decoding logic, State Registers (CLK, ^RES), present state and next state information, output decoding logic, control outputs, control out. Data Path Unit: data in, steering logic (MUX), clocked registers, combinational logic, data out. Status and Select signals connect the two units.]
Finite State Machine Model - Introduction

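[Figure: finite state machine model. Inputs pass through input synchronizing registers (CLK) to the input/next state decoding logic; next state information feeds the State Registers (CLK); present state information feeds back to the decoding logic and to the output decoding logic; control outputs pass through output filtering registers (CLK).]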
Components of FSM Model

State registers, input synchronization registers (optional) and output filter registers (optional).
Next state decoding logic, and output decoding logic - combinational logic blocks.
Input signals to the state machine, which are inputs to the next state and output decoding logic blocks (could be synchronized to the clock with input registers).
Next state information, which is generated as a result of input/next state decoding logic.
Present state information, output from the state registers, which is fed back as an input to both next state and output decoding logic blocks.
Outputs from the state machine - either generated synchronously from the output of the state registers (also used as present state information), or asynchronously as output of the output decoding logic block (which takes input and present state information to produce outputs). Could be filtered using output registers to eliminate possible signal transients.
Finite State Machine Design Model Types

Moore machines:
Control outputs generated by the state machine are dependent only on the present state information. The control outputs are synchronized to the clock that controls state transitions.
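[Figure: Moore machine. Inputs and present state feed the input/next state decoding logic; next state feeds the State Registers; present state alone feeds the output decoding logic, which produces the control outputs.]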
Moore machines are used when it is important to synchronize all control actions with the change in state, and thus, by the clock. Moore machines effectively filter out transients, and can be used to eliminate race conditions when inputs are unfiltered.
Finite State Machine Design Model Types

Mealy machines: The control outputs of the state machine are dependent on the inputs and present state information. The control outputs can be asynchronous, in that outputs can change value as the inputs change value, provided the appropriate present state information is maintained. The control outputs are gated by the present state. Mealy machines are used to create control blocks that respond quickly to external signal changes. Care must be taken to isolate the design from transients and race conditions.
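[Figure: Mealy machine. Inputs and present state feed the input/next state decoding logic; next state feeds the State Registers; inputs and present state both feed the output decoding logic, which produces the control outputs.]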
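The timing difference between the two machine types can be seen in a short cycle-level sketch. The two-state controller, its IDLE and BUSY state names and the `start` input are hypothetical, chosen only to put a Moore output and a Mealy output side by side on the same state register.

```python
def next_state(state, start):
    """Next state decoding logic (shared by both output styles)."""
    if state == "IDLE":
        return "BUSY" if start else "IDLE"
    return "IDLE"                       # BUSY lasts exactly one cycle


def moore_output(state):
    """Moore: output decoding depends on the present state only."""
    return state == "BUSY"


def mealy_output(state, start):
    """Mealy: output decoding depends on present state AND current input."""
    return state == "IDLE" and start


def simulate(inputs):
    """Return (moore_trace, mealy_trace), one sample per clock cycle."""
    state = "IDLE"
    moore, mealy = [], []
    for start in inputs:
        moore.append(moore_output(state))
        mealy.append(mealy_output(state, start))
        state = next_state(state, start)   # state register updates at the clock edge
    return moore, mealy


if __name__ == "__main__":
    moore, mealy = simulate([False, True, False, False])
    print(moore)   # [False, False, True, False]
    print(mealy)   # [False, True, False, False]
```

The Mealy output responds in the same cycle as the input change, while the Moore output appears one cycle later, after the state register has captured the event. That one-cycle difference is the price of the Moore machine's isolation from input transients.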
Sequential Logic Design

Use of storage elements in the data path to store signal values.
Purpose is to synchronize the behavior of complex circuits. Benefits of circuit synchronization:
Eliminate unpredictability of output behavior due to timing skew. Create signal stability, as they must have stable values for certain period of time. Better isolate signals from noise transients.
Use of storage to create complex control structures.

Controller sequences the operations in the data path. The movement of data through the data path is staged in pipelined fashion. Complex circuits can be broken down and architected in terms of their control- or data-dominated behaviors.
[Figure: pipelined data path. Inputs pass through combinational logic blocks separated by storage registers (data path pipeline stage 1 to stage n, labeled tp1 and tp2), driven by a synchronizing clock signal, producing the outputs.]
Executable ASM Method State Machines

The timing semantics of Moore and Mealy machine modeling
Composition of ASM Diagram - Example

Executable ASMs: States, Conditions, Cases, Conditional Outputs, Assertions, Assignments, Expressions, Macro-functions, Memory Indexing, Clocking, Reset, Synchronous events.
[Figure: annotated ASM diagram with a state box and states s0 to s5. Labeled elements: clocking definition (CLK1 (rising)) and enabling event definition (^RES); Moore machine actions (synchronous or asynchronous), including signal assertion (signal1) and bus assignment (Areg <- '0'); state input conditions, including a binary decision condition (If-Then) on the Boolean input expression input1 & input2 and a multiway branch condition (CASE) on IObus with values 1001, 0110 and default; Mealy machine actions (synchronous or asynchronous), such as !signal5 and Breg <- input2; macro-function assignment (Output <- NMUX (Areg, Breg, in1)); memory read/write with relative addressing (MDR <- ScratchPad [MAR]).]
Relationship of State Machines to Datapath

ASM diagrams incorporate information about control path and data path into a single representation. Using this notation, a design can express different design styles for both synchronous and asynchronous behavior of both the control and datapath.
Moore Machine - Registered Bus Assignments

Registered bus assignments may be used by placing expressions on buses that are defined as Register in the Element field of the Bus Table. The buses in the datapath will be realized using registered logic. The buses aReg and bReg are realized by using an additional layer of registers. This imposes a 1 clock cycle delay between when the operation is scheduled by the state machine in state s1 and when the updated values are propagated to the outputs of aReg and bReg. However, the resultant assignment value is preserved in the registered output until it is explicitly modified. Using registered bus assignments increases gate count and circuit delay of the datapath. However, the designer avoids race conditions, signal transients, and unwanted feedback conditions by designing with registered logic.
Moore Machine - Unregistered Bus Assignments

Unregistered bus assignments may be used by selecting bus Elements as Wire in Nimbus Bus Table. This implies that the buses in the datapath will not be realized using registers, but with wires. Unregistered datapath operations, though causing the datapath buses to be realized without registers, are still synchronized by the clock driving the Moore-style state register outputs. Using unregistered bus assignments reduces gate count of the datapath, and reduces circuit delay. However, special care must be taken to avoid race conditions, signal transients, and unwanted feedback loops in the datapath that can cause metastability (not settling to a specific value) or oscillation.
Mealy Machine - Unregistered Bus Assignments

For Mealy-style outputs specifying datapath Bus assignments, the control outputs change asynchronously with the value of input aSig. The use of Bus assignments with Mealy outputs implies that the datapath buses will be unregistered (either latches or wires, depending on the Element value specified in the Bus Table). Thus the result of the assignment will appear on the output of the datapath immediately (assuming no propagation delay, since it is assumed this is subsumed by the clocking around the other elements of the design unit).
Mealy Machine - Registered Bus Assignments

Specifying registered assignments of buses used in Mealy outputs is done by selecting Register as bus Element type in Bus Table. The values to buses aReg and bReg are assigned synchronous to the next active clock edge.
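The one-cycle delay and the value retention of a registered bus, compared with a wire, can be sketched at cycle level. This model is an illustration only and does not reproduce Nimbus semantics: it assumes a reset value of 0 for the register and, as a further assumption, that a wire with no assignment in a given cycle reads as 0.

```python
def bus_trace(assignments, registered):
    """Return the value visible on the bus output during each clock cycle.

    assignments holds, per cycle, the value the state machine schedules
    onto the bus, or None when the current state makes no assignment.
    """
    visible = []
    stored = 0                                   # assumed register reset value
    for value in assignments:
        if registered:
            visible.append(stored)               # output shows the last latched value
            if value is not None:
                stored = value                   # latched at the edge ending this cycle
        else:
            # Wire: the value appears in the same cycle and is not held.
            visible.append(value if value is not None else 0)
    return visible


if __name__ == "__main__":
    schedule = [None, 4, None, None]             # e.g. aReg <- '4' in the second cycle
    print(bus_trace(schedule, registered=True))   # [0, 0, 4, 4]
    print(bus_trace(schedule, registered=False))  # [0, 4, 0, 0]
```

The registered trace shows both properties described for Register elements: the value arrives one cycle after it is scheduled and then persists until explicitly modified.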
Executable ASM Method Control Structures

The inventory of algorithmic control constructs and comparing them to state diagrams
ASM vs. State Diagrams - Control Structures

[Figure: state diagrams beside equivalent ASM diagrams (CLK(rising), ^RES). Sequence example: states s0, s1, s2 with Action1 to Action6 attached. Binary branch example: from s0 to s1 or s2, labeled a=0 and a=1 in the state diagram and tested by a condition diamond (a|b, exits 0 and 1) in the ASM diagram.]
Sequence: Series of States. States follow the sequencing indicated by the direction of state transitions. ASM diagrams have actions attached to the right side of state "boxes" for clarity. In ASM diagrams, transitions to the next state are triggered by the system clock or any single-bit signal. States with no actions are called delay states, indicating a delay of one clock cycle.
Selection: Binary Branching. Binary branching is represented using a condition "diamond". Boolean expressions on conditions can be of arbitrary complexity, with many terms and variables. For simple branching situations, both representations are equally suited to the task.
ASM Diagram Nested Control Structures

If-Then-Else: Nested Statements
Source: Roth 1998 PWS Publishing
ASM vs. State Diagrams Branching Control Structures

[Figure: state diagrams beside equivalent ASM diagrams for multi-way branching from s0 to s1, s2, s3 and s4. Single-variable example: case values 0001, 0010, 0100 and 1000 of a, with a default transition in the ASM diagram ("others?" in the state diagram). Multi-variable example: conditions a&b, ^a & b, ^a & c & d and ^b & c, tested in explicit order by condition diamonds in the ASM diagram.]
Selection: Multi-way Branching, Single Variable. ASM diagrams use a case construct for multi-way branch conditions of a single, multi-bit variable/bus. Branch conditions are binary or enumerated values. In ASM diagrams, all undefined transitions are tied to a default transition. This eliminates possible transitions to unspecified states, a common cause of design failure in the field. State diagrams have no such mechanism, and thus are ambiguous.
Selection: Multi-way Branching, Multi-variable. In State diagrams: (1) the ordering of transitions is ambiguous; (2) not all transitions are specified (device problems likely in the field); (3) behavior is unpredictable under some conditions. In ASM diagrams: (1) ordering of transitions is explicit; (2) transitions are specified for all possible input combinations (including MVL values); (3) behavior is predictable.
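The value of the default transition can be shown with a small next-state function. The mapping of case values to target states and the choice of s0 as the default target are assumptions made for this sketch.

```python
# Hypothetical case table for the multi-way branch out of state s0.
CASE_TABLE = {
    0b0001: "s1",
    0b0010: "s2",
    0b0100: "s3",
    0b1000: "s4",
}


def next_state_from_s0(a, default="s0"):
    """CASE on the 4-bit bus a; every unlisted value takes the default."""
    return CASE_TABLE.get(a & 0xF, default)


def is_fully_specified(states):
    """Check that all 16 input values lead to a defined state."""
    return all(next_state_from_s0(a) in states for a in range(16))


if __name__ == "__main__":
    print(next_state_from_s0(0b0010))   # s2
    print(next_state_from_s0(0b0011))   # s0 (default, not an unspecified state)
    print(is_fully_specified({"s0", "s1", "s2", "s3", "s4"}))   # True
```

Without the default, twelve of the sixteen possible bus values would have no defined transition, which is the incomplete-specification hazard attributed to state diagrams.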
ASM vs. State Diagrams Loop Control Structures

[Figure: state diagrams beside equivalent ASM diagrams for two loops over states s0, s1 and s2 on condition a (and its complement ^a). In the While-Do version the condition diamond (exits 1 and 0) precedes the loop body; in the Repeat-Until version it follows the loop body.]
Repetition: "While-Do" Control Loop. The While-Do control structure is more apparent in the ASM diagram, and is consistent with hardware description language (HDL) constructs. A single decision point in the ASM diagram handles complexity more easily when multiple terms and variables are used in the looping condition expression. This type of control construct is used often for counting the number of loop iterations, where you test the condition prior to each execution of the loop, including the first pass.
Repetition: "Repeat-Until" Control Loop. Placement of the decision "diamond" defines the type of looping construct, and the type of control behavior. Repetition control structures are used in various styles of polling loops for implementing handshaking protocols. This type of control construct is used often for counting the number of loop iterations, where you test the condition after completing each execution of the loop, requiring execution of at least the first pass.
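The behavioral difference between the two loop styles comes down to where the test sits. In this minimal sketch the loop body is reduced to decrementing a counter, and the function names are illustrative.

```python
def while_do(count):
    """Condition tested BEFORE each pass, including the first."""
    passes = 0
    while count > 0:
        passes += 1        # loop body
        count -= 1
    return passes


def repeat_until(count):
    """Condition tested AFTER each pass, so the body runs at least once."""
    passes = 0
    while True:
        passes += 1        # loop body
        count -= 1
        if count <= 0:     # the decision "diamond" placed after the body
            break
    return passes


if __name__ == "__main__":
    print(while_do(3), repeat_until(3))   # 3 3
    print(while_do(0), repeat_until(0))   # 0 1
```

The two loops agree for any positive count and differ only when the condition is already false on entry, which is exactly the case a polling loop or iteration counter has to get right.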
ASM vs. State Diagrams Synchronous Transfer of Control

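[Figure: state diagram with transitions labeled a : P2 and ^RES : P1 among states s0, s1, s2 and s4, beside two ASM diagrams: one using a synchronous enable event on a, the other using explicit conditional tests on a in each state.]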
Control Interrupt Schemes - State Diagrams: The State diagram models state transitions, but the prioritization information--when multiple transitions are possible--isn't clear in the notation, without adding some additional symbol to indicate priority of the transitions. Also, if State transitions have additional conditions, it isn't clear what happens when conditions aren't met (incomplete specification). However, this may be handled by modeling looping on a state in some cases.
Control Interrupt Schemes - ASM Diagrams: In ASM diagrams, you can either define a test on each state transition for the specific event using condition "diamonds", or you can use the Enable Event construct, indicating that the specified event has precedence over the normally-specified next state transition. This works like a priority encoder on the next state decoding logic of the state machine. At any time when the input is sampled for determining next state, the transition for the Enable Event a will take precedence.
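A minimal sketch of the precedence rule, written in Python rather than ASM notation. The state names, the normal transition table and the target state of the enable event are all hypothetical; only the ordering (enable event tested before the normal next state) comes from the text.

```python
# Hypothetical controller: normal sequencing s0 -> s1 -> s2 -> s0.
NORMAL_NEXT = {"s0": "s1", "s1": "s2", "s2": "s0"}
ENABLE_TARGET = "s4"   # assumed state entered when enable event 'a' fires


def next_state(present, a):
    """Next-state decoding with the enable event given top priority."""
    if a:                                  # tested first, like the highest input of a priority encoder
        return ENABLE_TARGET
    return NORMAL_NEXT.get(present, "s0")  # otherwise the normally-specified transition
```

Because the event is tested ahead of the table lookup, it pre-empts the normal transition from every state without a separate condition diamond per state.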
Combinational Logic Units as Datapath Building Blocks
Gate Level Design Gate Devices

Symbols and Truth tables for 5 basic gate-level logic gates: NOT, NAND, NOR, AND, OR.
Tanenbaum 1999 Prentice-Hall Publishing
Combinational Logic - Comparator Circuit

Comparator (EQ):
- COMP allows comparison of two different inputs to check if they are equal. Alternate circuits evaluate the relative magnitude of two inputs.
- If they are equal, the output is HIGH, but if they are not equal, the output is LOW.
- The comparison must be done with two signal inputs of equal width.
- The output of the Comparator operation is a single-bit signal.
Combinational Steering Circuits - MUX

8-input Multiplexer (MUX):

MUX allows the signal value of one of its data inputs (D0-D7) to pass to the output F. The selection of the signal to be passed is controlled by the SELECT lines (A, B, C). The number of select lines, n, follows from the number of inputs, m, as a power of 2: with m inputs, we'll need n select lines so that 2**n = m. The MUX inputs and output must be the same width, and the SELECT lines are 1 bit each.
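The relation 2**n = m can be illustrated with a small behavioral sketch in Python. The function name is hypothetical, and treating the first select bit as the most significant is an assumption, since the slide does not say which of A, B, C is the MSB.

```python
def mux(inputs, select_bits):
    """Pass one of m = 2**n data inputs to the output.

    inputs:      list of m data values (all assumed to be the same width)
    select_bits: n one-bit select values, most significant first (assumption)
    """
    n = len(select_bits)
    if len(inputs) != 2 ** n:
        raise ValueError("need exactly 2**n data inputs for n select lines")
    index = 0
    for bit in select_bits:          # assemble the binary select value
        index = (index << 1) | (bit & 1)
    return inputs[index]
```

For an 8-input MUX, select bits (1, 0, 1) address data input D5.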
Combinational Encoder/Decoder Circuits

3:8 Decoder (n to 2**n DECO):

DECO takes a binary encoded input of n data bits, and decodes it into individual data output lines, where one output (D0-D7) is enabled, depending on whether the encoded value corresponds to the data line number. A Decoder input with n lines means we can encode 2**n possible binary encoded values. With 2**n possible encoded values on the input, we'll need exactly 2**n output lines, one for each possible encoded input value. The output line corresponding to the decoded value will be enabled.
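A behavioral sketch of the decoder in Python, for illustration only; the function name and the list-of-bits output representation are assumptions.

```python
def deco(value, n):
    """n-to-2**n decoder: returns 2**n output lines with exactly one enabled."""
    if not 0 <= value < 2 ** n:
        raise ValueError("encoded value does not fit in n input bits")
    # Output line k is enabled only when k equals the encoded input value.
    return [1 if line == value else 0 for line in range(2 ** n)]
```

For the 3:8 case, an encoded input of 5 enables only output line D5.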
Executable ASM Method Datapath

The timing semantics of Moore and Mealy driven datapath
Datapath Logic Design

Use of memory elements in the data path to store signal values.
Purpose is to synchronize the behavior of complex circuits. Benefits of circuit synchronization:
Eliminate unpredictability of output behavior due to timing skew. Create signal stability, as they must have stable values for certain period of time. Better isolate signals from noise transients.
Use of memory to create complex control structures.

Controller sequences operations in the data path.
[Figure: datapath pipeline from Inputs to Outputs, with pipeline stages 1 through n (delays tp1, tp2), each followed by Storage Registers.]
Datapath Logic Operations

Pre-defined executable datapath operator macros.
Are used on the RHS of macro assignment statements. Datapath macro operations are scheduled in states, on entry to the state, and are attached as a text expression to the state. LHS: assigned bus/signal. RHS: input args, macro function calls, possibly nested (like C functions).
C_Bus <- BOR(AND(B_Bus, NOT(DECO(A_Bus))))
In this expression, C_Bus is the specified output bus/signal; BOR, AND, NOT and DECO are pre-defined macro-functions; A_Bus and B_Bus are the specified input buses/signals.
Macro types
Arithmetic: ADD, SUB, INCR, MUL, DIV, REM. Boolean logic: AND, OR, NOT. Steering logic: MUX, DECO, PENCO, DMUX. Combinational logic elements.
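[Figure: datapath schematic in which A_Bus passes through DECO and NOT, is combined with B_Bus in AND, and is reduced by BOR onto C_Bus; the same structure is shown as AnyXN with inputs nPort and CollIn and output CollisionEvent.]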
Moore Machine - Macro-function Assignments
Creating User-defined Macro-functions

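[Figure: datapath schematic of DECO, NOT, AND and BOR between A_Bus, B_Bus and C_Bus, packaged as the macro-function AnyXN with inputs nPort and CollIn and output CollisionEvent.]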
RTL macro-functions: the library contains over 30 primitive data path elements. Macros are "scalable", accepting any number of buses and any bus widths. Macros can be used to construct more complex user-defined data path functions.
Macro-function definition: AnyXN(A_bus, B_bus) ::= BOR(AND(B_bus, NOT(DECO(A_bus))))
Macro-function binding: CollisionEvent <- AnyXN(nPort, CollIn)
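The nesting of macro-functions can be mimicked in Python to see what AnyXN computes. The semantics below are assumptions made for this sketch, not the Nimbus library definitions: DECO produces a one-hot word, NOT and AND are bitwise over a fixed width, and BOR is taken to be an OR-reduction to a single bit.

```python
def DECO(a):
    """Assumed: decode a binary value into a one-hot word."""
    return 1 << a


def NOT(x, width):
    """Assumed: bitwise complement, limited to the bus width."""
    return ~x & ((1 << width) - 1)


def AND(x, y):
    """Assumed: bitwise AND of two buses."""
    return x & y


def BOR(x):
    """Assumed: OR-reduction of all bits of the bus to one bit."""
    return 1 if x != 0 else 0


def any_xn(a_bus, b_bus, width=8):
    """AnyXN(A_bus, B_bus) ::= BOR(AND(B_bus, NOT(DECO(A_bus))))"""
    return BOR(AND(b_bus, NOT(DECO(a_bus), width)))
```

Under these assumptions, binding CollisionEvent <- AnyXN(nPort, CollIn) flags activity on any CollIn bit other than the one selected by nPort.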
Using Boolean Operators vs Macro-functions

Simple data path operations can be specified using either expression operators or macro-function operators.
Carry/Borrow Bit with Arithmetic Macro-functions
Combinational Logic Example:

Unsigned Integer Adder
Adder Circuit The Basic Adder Unit

Source: Tanenbaum, 4th Edition, 1999 Prentice-Hall Publishers.
Full Adder circuit:

This circuit takes two single-bit inputs and adds them together to produce a Sum as output. The Full Adder also has a Carry (called a Carry Out) like the Half Adder. The Full Adder also has a Carry In signal, allowing the Carry Out from an earlier adder stage to be connected. This allows a multi-bit, multi-stage adder circuit to be constructed.
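As a sketch of the Full Adder behavior, the Python below uses the standard sum and carry equations; the function name is chosen for this example only.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out) for single-bit inputs."""
    s = a ^ b ^ cin                        # sum bit
    cout = (a & b) | (cin & (a ^ b))       # carry out to the next stage
    return s, cout
```

Over all eight input combinations, sum + 2 * carry_out equals a + b + cin.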
Adder Circuit The Ripple Carry Adder

Source: Tanenbaum, 4th Edition, 1999 Prentice-Hall Publishers.
RPC Adder circuit:

This circuit takes two multi-bit inputs and adds them together to produce a multi-bit Sum as output. The RPC Adder also has a Carry Out signal that results from the rippling of the carry output from each FA bit computation through the entire operand word length. The RPC Adder is a reasonable solution for small bit widths; however, a more elegant solution is needed when speed is critical, or when bit-widths get large.
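The rippling of the carry can be sketched bit by bit in Python. This is a behavioral illustration with a hypothetical function name, not a model of the gate-level circuit in the figure.

```python
def ripple_carry_add(a, b, width, cin=0):
    """Add two width-bit operands one bit at a time, rippling the carry."""
    carry = cin
    total = 0
    for i in range(width):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        s = abit ^ bbit ^ carry                       # full-adder sum for bit i
        carry = (abit & bbit) | (carry & (abit ^ bbit))  # carry into bit i + 1
        total |= s << i
    return total, carry                               # (Sum, Carry Out)
```

The loop makes the cost visible: the carry into bit i cannot be known until bits 0 to i-1 have been resolved, which is why delay grows with operand width.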
ASM Models 32-bit Ripple Carry Adder

We model the Carry Logic separately from the Add Logic. Each bit of the 32-bit operands is sliced.
The Add functions are a Boolean reduction of the Full Adder structure on the previous page.
This is a structural model, with each bit modeled in terms of its gate-level logic. It is very low-level.
Each combinational block is represented as a separate, single-state concurrent thread. The final Sum bits are registered.
[Figure labels: Carry logic; 1st stage AND logic; 2nd stage OR logic.]
ASM Models 32-bit Ripple Carry Adder

This is a more abstract model, with each bit operated on by a 1-bit ADD unit. It is still a low-level model.
Using a Behavioral Adder 32-bit Multiplier

This is a reference to a behavioral Adder macro model, maintained by Nimbus(TM), with all bits operated on in parallel by a 32-bit unit. It is a higher-level model. The shift-add Multiplier scheme is the most basic of unsigned Integer multiplication algorithms. Note the algorithmic nature of the model: (1) loop control using a count register, (2) concurrent operations scheduled on each state, (3) bus slicing, shifting and register-register assignment.
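The shift-add scheme can be sketched in Python as follows. The register names and the exact ordering of operations are assumptions for illustration; the ASM model on the slide may schedule them differently.

```python
def shift_add_multiply(a, b, width=32):
    """Unsigned shift-add multiplication using a count register for loop control."""
    mask = (1 << width) - 1
    multiplicand = a & mask
    multiplier = b & mask
    product = 0                 # accumulates a result of up to 2 * width bits
    count = 0                   # loop-control count register
    while count < width:
        if multiplier & 1:      # bus slice: test the low bit of the multiplier
            product += multiplicand
        multiplicand <<= 1      # shift operations, done together in one state
        multiplier >>= 1
        count += 1
    return product
```

One loop pass per operand bit means a 32-bit multiply takes 32 passes, in exchange for needing only a single adder.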
Sequential Logic Example:

Binary Up/Down Counter
The Binary Up/Down Counter Block Diagram

The Binary Up/Down Counter: This multi-mode counter block takes an input seed value, and counts up or down from this value, based on the direction of count.
[Figure: block diagram with inputs NoCounts, Seed, Direction and Enable, and outputs CountValue and Valid; the ports are 1 or 8 bits wide.]
For each count cycle, the Enable pulse is set, the Direction bit can be set to count in one or the other direction, and the Seed and NoCounts values are set. The counter will run for the number of cycles indicated by input NoCounts.
Up/Down Counter Model ASM Diagram

The ASM model consists of a single thread, with two condition tests: (1) loop termination condition, and (2) count direction (up or down). We use arithmetic macro-functions INCRNC (increment by one with no carry) to increment the loop counter, and INCRNC and DECRNC (decrement by one with no carry) to modify the counter value, depending on the count direction.
CLK ^RES Bus_1 <- Seed Bus_2 <- Direction Bus_3 <- NoCounts Loop <- 0 CountVal <- 0 Enable = 1 N Y
Poll
CntControl
Loop = Bus_3 Y N Loop <- INCRNC(Loop) CountVal <- Bus_1
Bus_2 = 1 Poll Y N Bus_1 <- DECRNC(Bus_1) Bus_1 <- INCRNC(Bus_1)
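A rough Python sketch of the counting loop follows. It is an interpretation, not a transcription of the ASM diagram: the 8-bit width, the wrap-around meaning of "no carry", and the polarity of the direction test are assumptions, and the Enable polling state is left out.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def incrnc(value):
    """Increment by one with no carry: wraps around at the bus width (assumed)."""
    return (value + 1) & MASK


def decrnc(value):
    """Decrement by one with no carry: wraps around at the bus width (assumed)."""
    return (value - 1) & MASK


def run_counter(seed, count_up, no_counts):
    """Return the CountVal sequence produced for one enabled run."""
    bus_1 = seed & MASK
    bus_3 = no_counts & MASK
    loop = 0
    count_values = []
    while loop != bus_3:                  # condition test 1: loop termination
        loop = incrnc(loop)
        count_values.append(bus_1)        # CountVal <- Bus_1
        # condition test 2: count direction (polarity is an assumption)
        bus_1 = incrnc(bus_1) if count_up else decrnc(bus_1)
    return count_values
```

For example, a seed of 254 counting up for four cycles yields 254, 255, 0, 1 under the wrap-around assumption.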
Cycle-Level Timing Definition
Synchronous Design Quantized Timing via Cycles

In high-level design, we make a trade-off of timing accuracy versus design productivity. We abstract detailed timing to lower-level design, so we are not overwhelmed with design details; but we use a cycle-level timing model.
Source: Tanenbaum, 4th ed. 1999, Prentice-Hall
Digital Design Register Device Timing
Source: A. Tanenbaum, 4th ed., 1999, Prentice-Hall Inc.
Setup Time
Time tsu is the amount of time we must keep stable data on the input prior to sampling at the active clock edge.
Hold Time
Time th is the amount of time the input signal value must be stable after the active clock edge, so that it can be sampled by the flip flop correctly.
Example: D flip flop
Delay Assumption for Cycle-Based Timing

In Executable ASMs, we model the behavior of registers using the "limit" assumption. First, at some time tn corresponding to the active edge of a clock, there is a different value on the input of a register than on its output. The time between when the register input is "sampled" and when the value appears on the register output cannot be zero. We need to consider the change in "state" of the design on the clock edge, where we are assigning values to the input and wanting to see the results that appear on the output. We assume the mathematical limit of tn from both sides of the clock transition, which we refer to as times tn- and tn+. The time tn- is when the register input is being "sampled", and the time tn+ is when the sampled value appears on the register output.
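One way to picture the limit assumption is a two-phase update in a cycle-based simulator: all register inputs are sampled from present outputs (time tn-), and only then do the sampled values appear on the outputs (time tn+). The Python below is a sketch of that idea; the function and dictionary names are hypothetical and do not describe how Nimbus implements it.

```python
def clock_edge(registers, next_value_fns):
    """Apply one active clock edge to a set of registers.

    registers:      dict of register name -> present output value
    next_value_fns: dict of register name -> function computing its input
                    from the present outputs
    """
    # Time tn-: every register input is sampled using present outputs only.
    sampled = {name: fn(registers) for name, fn in next_value_fns.items()}
    # Time tn+: the sampled values appear on the register outputs.
    updated = dict(registers)
    updated.update(sampled)
    return updated
```

With two registers that each load the other's output, one edge swaps the values (A=1, B=2 becomes A=2, B=1), because both inputs are sampled before either output changes.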
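[Figure: register model of the controller and datapath. The State Reg takes the next state from the next state decoding logic and outputs the present state; the Data Path Reg is fed by combinational and steering logic. Delays tpsd, tpsm and tpdp are taken as 0 across the clock transition from t0- to t0+.]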
Register Level Design Clocking Schemes

A clocking element: An oscillating crystal with properties that allow it to serve as a synchronizing element in digital designs.
Timing diagram: Graphical means to indicate expected or observed timing behavior of a design. Represented by a waveform of significant signals, their timing and values at each instant.
Multiple clocks: Many designs operate on the clock to generate some multiple or fraction of the whole clock signal, which could be symmetric or asymmetric to the system clock.
Source: A. Tanenbaum, 4th ed., 1999, Prentice-Hall Inc.
Register Level Design Clocking Schemes

Executable ASM models support built-in clocking schemes.
Control and data path register clock pins tied to same synchronizing source.
Scheme 1:
Single-phase, rising or falling edge clocks.
Scheme 2:
Two-phase overlapping, rising or falling edge clocks.
Scheme 3:
Two-phase non-overlapping, rising and/or falling edge clocks.
Executable ASM Method Synchronization

The timing semantics of Reset, Clock, and synchronized concurrent threads
Synchronous vs Asynchronous Enable Events

Enable Events are used in specifying pre-emptive control behaviors of the design, where the normal control flow in the design is interrupted. Enable Events can be either synchronous or asynchronous.
Enable Events for Synchronization

[Figure: three concurrent FSM threads (Thread 1: states s0, s1, s6; Thread 2: states s2, s3, s8; Thread 3: states s4, s5, s7) using clocks CLK (rising), C1 (rising) and C2 (falling), enable events ^RES, AS1 and ^AS2, and the conditions ASSERT_1 and !ASSERT_2.]
Modeling Concurrency:
- Multiple model FSM "threads" having shared buses.
- Independent clocking schemes and enabling events (e.g., ^RES).
- Types of concurrent interaction:
I. Synchronization
- coordinated activities (e.g., handshaking, pipelining).
- implicit references to shared buses.
II. Competition
- shared resources (for example, bus arbitration).
- explicit use of other concurrent processes, components, or entities to model the arbitration protocol.
Architecture Patterns
Control Design Handshaking Pattern

Handshaking
Polled handshaking:
FSM A thread waits in a polling loop, testing for signal ZB to be asserted by FSM B. FSM B thread waits in IDLE loop for signal ZA to be asserted by FSM A.
Asynchronous handshaking:
FSM threads use an asynchronous interrupt mechanism to alert them when the event has occurred (however, it is most likely gated to a clock signal).
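The polled handshake can be sketched as two small state machines stepped on a common clock. The state names (REQUEST, POLL, DONE, IDLE, ACK) and the four-phase ordering are assumptions for this Python illustration; only the polling on ZA and ZB comes from the text.

```python
def simulate_polled_handshake(max_cycles=10):
    """Cycle-based simulation of FSM A polling ZB and FSM B polling ZA."""
    state_a, state_b = "REQUEST", "IDLE"
    za = zb = 0
    trace = []
    for cycle in range(max_cycles):
        za_sampled, zb_sampled = za, zb      # both threads sample before updating
        # FSM A: assert ZA, then wait in a polling loop for ZB.
        if state_a == "REQUEST":
            za, state_a = 1, "POLL"
        elif state_a == "POLL" and zb_sampled:
            za, state_a = 0, "DONE"
        # FSM B: wait in IDLE for ZA, acknowledge with ZB, release when ZA drops.
        if state_b == "IDLE" and za_sampled:
            zb, state_b = 1, "ACK"
        elif state_b == "ACK" and not za_sampled:
            zb, state_b = 0, "IDLE"
        trace.append((cycle, state_a, state_b, za, zb))
        if state_a == "DONE" and state_b == "IDLE":
            break
    return trace
```

In this sketch the exchange completes in four cycles, with each side spending at least one cycle in its polling loop.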
We use architecture and behavior patterns

Source: Roth 1998, PWS Publishing
Well-defined structure and behavior, re-used throughout a system.

Control Design Sequencing Pattern

Sequencing
This pattern has the sequencing of data path operations by one or more state machines. The example shown is the data path for a small CPU, where micro-operations based on program instructions are decoded and staged to execute multi-cycle instructions out of memory. This example also uses a pipelining pattern structure (discussed later).
Control Design Pipelining Pattern

N-Stage Pipeline
A pipeline allows serial processing, in sequence, of instructions or data elements. Each stage n in the pipeline processes its task, then passes the element to stage n+1 in the pipeline. Design structures that use pipelining: CPU Instruction Fetch Unit (IFU), digital filters (e.g., FIR, IIR).
Source: Tanenbaum, 4th ed. 1999, Prentice-Hall

Pipelining - 1
There are two kinds of pipelining: data path pipelining and control pipelining. An example of control pipelining is the Instruction Fetch, Decode, Execute cycle used in all CPU architectures. Another example is Bus Reads and Writes, which are generally pipelined so as to interleave the control operations, thus saving clock cycles (shown in the figure).

Pipelining - 2
The sequence of figures shows how pipelining works in the control path. The control pipelining is the Instruction Fetch, Decode, Execute cycle used in all CPU architectures. Each stage of the control pipeline is buffered by registers that provide setup of data. The different stages of the pipeline also use handshaking.
Control Design Arbitration Pattern

Arbitration-1
The pattern works in situations where multiple service requesters want access to a scarce resource (such as a Bus). There are different arbitration schemes for requesting by one or more requesters and granting control of the resource by the arbiter module. Some use daisy chaining, or other prioritization schemes, to grant access. Arbitration can be centralized, using an arbiter module, or it can be decentralized (see next example).
Control Design Arbitration Pattern
Station-3 (Internet Gateway) Station-4 (Print Server)
Arbitration-2
The previous scheme involves use of a separate arbiter module, as is the case with most bus schemes.
802.11 WLAN operates this way when an Access Point is present, and the network is operating in PCF (Point Coordination Function) mode.
802.11 Wireless Medium (CSMA/CA)
Another scheme involves no centralized arbiter:

This is the case when 802.11 is operating in ad-hoc DCF (Distributed Coordination Function) mode, without the centralized control of an Access Point; it then operates much like 802.3 Ethernet.
CSMA/CD: Carrier Sense Multiple Access/Collision Detect. Sense for a distributed carrier signal, and detect collisions, as a means to gain access to the shared resource (wired network medium).
CSMA/CA: Carrier Sense Multiple Access/Collision Avoidance. Sense for a carrier signal, but don't rely on it solely as the means for gaining access. Use an additional timing mechanism passed among the data frames (needed because of the hidden node problem).
Station-2
Station-1
Memory Arrays
Gates to Registers Single-Port Memory Array

4 x 3 Memory Array
The Memory array is built up from gates and flip-flops, to take advantage of certain properties of the devices. Each row is one of four 3-bit words. Data lines I0-I2 feed all of the D FFs in a column. The address lines A0-A1 act as select lines for a given bank. The control signals CS (chip select), RD (read enable) and OE (output enable) are used to route data to and from memory (allowing writes and reads) based on the combinational logic gates. The bus drivers are enabled by the AND of the 3 control signals.
Systems Design Memory Arrays

m x n Memory Array
Memory arrays allow multi-bit values to be stored and retrieved by address. A memory of n bits can be organized in different ways. A memory array has a word length: the number of bits of each word stored in memory (e.g., 32 bits). A memory array has a number of words (e.g., 64K). Each memory location is uniquely addressable (called random access memory).
Memory Use Example RAM Control Thread

Here, we have two banks of memory that are selectable through a bank select signal. Usually, some number of upper bits of the Address are used for this purpose. Note how we access memory locations via assignment and an index register.
Note: This bus serves as an index into the memory array.
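Bank selection from the upper address bits, and access through an index, might look like the following Python sketch. The 8-bit address, the single bank-select bit and the function names are assumptions made for the example, not values from the slide.

```python
ADDR_WIDTH = 8     # assumed total address width
BANK_BITS = 1      # assumed: one upper bit selects between two banks
OFFSET_BITS = ADDR_WIDTH - BANK_BITS

# Two banks of memory, each addressed by the lower address bits.
banks = [[0] * (1 << OFFSET_BITS) for _ in range(1 << BANK_BITS)]


def split_address(address):
    """Split an address into (bank select, index within the bank)."""
    bank = address >> OFFSET_BITS
    index = address & ((1 << OFFSET_BITS) - 1)
    return bank, index


def write(address, data):
    bank, index = split_address(address)
    banks[bank][index] = data          # assignment through the index register


def read(address):
    bank, index = split_address(address)
    return banks[bank][index]
```

Address 0x85 then falls in bank 1 at index 5, while address 0x05 is the same index in bank 0.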
Memory Use Example Memory Table

Memory creation consists of several steps: (1) Create a RAM array in the Memory Table (with Addr. Width, Data Width and Type). (2) Edit values in the array using the Matrix Table. (3) Reference the memory locations using an index.
Memory Arbitration Pattern-1
[Figure labels: Invoke; Return.]
Invocation of a sub-flow in the ASM model: the handshaking logic is abstracted into a single thread fragment which is reused.

Invocation of a sub-flow initiates the handshaking with the Arbiter thread for access to the memory array.
[Figure labels: Request; Grant, Start Transfer; Signal when Done.]
The handshaking is managed by polling loops on both sides of the protocol.

Section 2 - Summary
Executable algorithmic state machines (E-ASM):
Allow both control and data operations to be specified in time (cycle by cycle scheduling) and space (allocation of specific resource types using abstract macro-functions). Notation we define is directly executable in the Nimbus tool set.
Using ASM diagrams:

A thinking aid for defining the structure and sequencing behavior of Finite State Machines. Used in 3 different ways: (1) definition/specification of sequential systems, (2) analysis of sequential circuits, (3) design of combinational and sequential circuits behaviorally.
Next: we'll look at some designs and analysis.
Section 3 Example High-Level Designs of VLSI Circuits & Systems

Analysis & Design Process Variations

We can have several possible process variations, using the Executable ASM as the core methodology component in analysis and design.
- Starting from a protocol description, derive an appropriately partitioned architecture of communicating concurrent ASM threads for realizing the distributed control.
- Starting with an algorithm, derive an appropriately sequenced and scheduled set of data operations, allocated and bound to appropriate classes of macro operators, as modeled using one or more ASM threads.
- Starting with an existing circuit design, refactor or reengineer the design into an abstract architecture consisting of one or more ASM threads.
Process Design-for-Synthesis Methods & Tools

[Figure: design-for-synthesis flow chart, from Start to Done. Steps and decision points, in order: Capture Design; Compile & Checking (Correct Entry?); Cycle-based Simulation?; Behavioral Simulation (Correct Behavior?); HDL Simulation Required?; Functional Simulation (Correct Function?); Logic Synthesis and Gate-level Timing Analysis (Correct Timing?); Partition, Place & Route (Area & Speed?); Fabricate Device. Each decision has YES and NO branches. Tools named in the figure: KBS flowHDL, blockHDL and IPstation; Exsedia Nimbus; Synopsys SGE, VSS, Design Compiler, HDL Compiler, FPGA Compiler, DesignWare, Design Analyzer and Timing Analyzer; Xilinx ISE.]
Design Approach - "stepwise refinement", with "iterative enhancement". Create design "skeleton", with core functions and cycle-level timing information specified. Iterate the design through synthesis, checking key area and timing constraints.
Return to the top of the process to make corrections, and to enhance the design description. Integrate completed behavioral block with other blocks for HDL "system" simulation.
Executable ASM Design Method Using Nimbus(TM)
The Design Capture Process

Using Exsedia's Nimbus(TM) toolset:
We'll specify the behavior of design units as Executable ASM models. We'll apply design patterns for defining the system control as a collection of threads, executing concurrently. We'll verify execution of compiled ASM models using the integrated cycle-based graphical simulator in Nimbus(TM).
Arithmetic Datapath Unit Example:

192-bit Unsigned Integer Multiplier
Getting From Small to Large Multipliers

[Figure: hierarchy of arithmetic units. Multiplier Top-level Unit (q-bit operands A and B, product Z of 2q or q bits); Multiplier Intermediate Unit (p-bit operands, product of 2p or p bits); Multiplier Base Unit (n-bit operands, 2n-bit product); Adder Intermediate Unit and Adder Base Unit (operands A and B, result Z, with CIN and 1-bit COUT).]
Realities: Large multipliers don't behave like small ones, as they are not linearly scalable in space or time. [Parhami, 2000]
Questions: What small multiplier unit configurations can be efficiently deployed to achieve the best tradeoffs in area and timing on the FPGA fabric? How do we structure the hierarchy of such units to achieve the best overall balance in building wide-bit MUL units for use in our application?
Exploring the Architecture Design Space

To pursue this line of inquiry, there are a number of questions to be answered:
- Given the modeling of these arithmetic base units, can we use behavioral models, or must we use structural models?
- If we use behavioral models for multiplication, at what bit-widths do these models cease to be efficient?
- What combination of structural and behavioral models should be used, and where in the overall architectural hierarchy is each appropriate?
Candidate Wide-bit Multipliers BCn

The 192-bit Broadcast: [Buell et al., 2002]
[Figure: Broadcast multiplier datapath. 32-bit operand words a0-a5 and b0-b5 feed operand-select MUXes and a multiplier (X) producing 64-bit results, followed by Shift_32 units, registers Ra through Rl, adders (+) with MUXes and Ripple Carry stages, and the Final Product.]
Requires 6 32-bit MUL units, 3 64-bit ADD units, a 192-bit ADD unit, and lots of shifters & wide registers.
We're I/O limited by a 32-bit PCI-X bus transfer to the FPGAs. We size the MUL ops of this unit to match. We absorb some of the cycle cost of getting operand data by starting MUL pipelining after 7 bus cycles.
We want to exploit the Xilinx Virtex MULT18x18 hard macros.
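The arithmetic behind a word-wise wide multiplier can be sketched in Python: split each 192-bit operand into six 32-bit words, form 32x32 partial products, and accumulate them with 32-bit shifts. This is only an illustration of the decomposition; which operand is broadcast, the function names, and the accumulation order are assumptions, and the register schedule of the design in [Buell et al., 2002] is not reproduced.

```python
WORD = 32      # word size matching the 32-bit MUL units
WORDS = 6      # 6 x 32 bits = 192 bits


def split_words(x):
    """Split a 192-bit value into six 32-bit words, least significant first."""
    mask = (1 << WORD) - 1
    return [(x >> (WORD * i)) & mask for i in range(WORDS)]


def broadcast_multiply(a, b):
    """192 x 192-bit unsigned multiply built from 32 x 32-bit partial products."""
    a_words = split_words(a)
    b_words = split_words(b)
    product = 0
    for j, b_word in enumerate(b_words):         # one b word broadcast per step (assumed)
        row = 0
        for i, a_word in enumerate(a_words):     # six 32x32 -> 64-bit products per step
            row += (a_word * b_word) << (WORD * i)
        product += row << (WORD * j)             # shift by 32 bits per step, then add
    return product
```

Six partial products per step is consistent with the six 32-bit MUL units listed above, and the shifted accumulation is where the shifters and wide adders come in.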
Characterizing the Design Space-1
MUL Architectures             Bit-widths
Virtex 18x18 Block (VE)       12, 16-bits
Behavioral VHDL * Op (BE)     32, 48, 64-bits
Booth (BO)                    16, 32, 48, 64-bits
Shift-Add (SA)                16, 32, 48, 64-bits
Broadcast (BC)                32, 48, 64, 96, 128, 192, 256-bits
Divide & Conquer (DC)         32, 48, 64, 96, 128, 192, 256-bits
Using a decomposition tree model, we can discuss size of the space of candidate MUL configurations, based on possible candidate partition trees.
Let R be the set of possible root node multipliers, of differing topology and bit-width (R_type x R_bit-width):
R = { DC192, DC256, BC192, BC256, ... }
Let NT be the set of non-terminal nodes, compositional MUL units, of different topology and bit-width, which can be further decomposed (either compositionally or recursively) into smaller NT units, where
NT = { DC128, DC96, DC64, DC48, DC32, BC128, BC96, BC64, BC48, BC32, BO64, BO48, BO32, SA64, SA48, SA32, BE64, BE48, BE32, ... }
Let T be the set of terminal nodes, base MUL units, which are not further decomposed, by type (width = 16 bits):
T = { VE16, VE12, BO32, BO16, SA32, SA16, ... }
[Figure: partition tree. Depth of partition tree: the root node S0, with |S0| = N bits, is at d=0; non-terminal nodes follow at d=1 (|S2| = N/2) and d=2 (|S3| = N/4); terminal nodes are reached some number of partitions later, with dmax = 4.]
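The depth of such a partition tree can be estimated with a short Python sketch. It assumes every level halves the operand width, as the N, N/2, N/4 labels in the figure suggest; real decompositions may mix topologies or split differently, and the function name is hypothetical.

```python
def partition_widths(root_width, base_width):
    """List the operand width at each tree depth, halving until the base width."""
    widths = [root_width]
    while widths[-1] > base_width:
        if widths[-1] % 2:
            raise ValueError("width cannot be halved evenly")
        widths.append(widths[-1] // 2)     # next depth: half-width subunits
    return widths
```

Under this assumption, a 256-bit root decomposed to 16-bit terminal units passes through 128, 64 and 32 bits, giving a depth of 4.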
Collecting Configuration & Cost Data

For answers, models of different topologies and configurations must be created and their performance characterized:
- Metrics: pin-to-pin delay, resource usage (slices, MUL units, IOBs), and minimum clocking for circuit synchronization.
- Architectural techniques: concurrency through parallelism and pipelining (enhancing speed at the cost of FPGA resources), resource sharing (minimizing resource consumption at the cost of speed), and interleaving (trying to hide operation timing in the hard bottlenecks of other operations).
- Experimental Method: create models for Multipliers, and their constituent Adder, logic and register units, using different architectures, bit widths, decomposition depths, and decomposition strategies, and measure delay and resource usage for different permutations.
- Design Technique: try a number of combinations at a given level of design hierarchy, then use this selection as the primitive unit in the next layer upward in the component hierarchy, until we reach scale.
Step 1 Costing Logic Component Primitives

Preliminary data.
Goal: to evaluate our ability to estimate area and delay for different bit-width MUL unit architectures, top-down. Approach: build models of combinational & sequential logic units, whose width scalability costs can be assessed, bottom-up.
Build unit models using ASM. Compare cost of behavioral vs. structural model styles.
Step 2 Costing Adder Units

64-bit exemplar ADD units.
Adder schemes are not considered costly relative to multipliers. But we're using large bit-widths! Many ADD units of different widths are embedded in a single wide-bit MUL unit.
Metrics: area and delay are used to calculate the Area-Delay Product. Take the inverse ratio of the others to the lowest A-D Product, and normalize against the most efficient.

Adder scheme               Area * Delay   Normalized Efficiency
Ripple Carry                     408.70                  100.0%
Carry Look Ahead                 870.94                   46.9%
Carry Select                    1031.10                   39.6%
Bit Serial (Gray)              19752.77                    2.1%
VHDL Add Operator ('+')          522.24                   78.3%

Result: Ripple-Carry has the best efficiency score. Carry-Lookahead and Carry-Select are less than half as efficient, overall, due to their area being 2X that of Ripple-Carry.
Step 2 Costing Adder Units

Ripple-Carry ADD - Timing Constraint Evaluation
[Figure: plot of Delay (ns), 0 to 30, versus Operand Bit-width, 16 to 288 in steps of 16.]
As ADD widths scale, do their costs factor significantly in the overall cost of a wide-bit MUL? What impact on design do ADD units have relative to the overall MUL pipeline architecture? Focus on the delay cost of the carry chain.
Factors: For Virtex-II, with 18x18s, the ADD stages will be slower than the base unit MUL stages in the overall wide-bit pipeline. Clocking strategies will be gated by how fast ADD stages of different widths can run.
Split this 192-bit ADD into 3 64-bit ADD pipeline stages, then clock at a faster rate.
[Figure: Broadcast multiplier datapath with operand words a0-a5 and b0-b5, Operand Select MUXes, Shift_32 units, registers Ra through Rl, Ripple Carry adders, and the Final Product.]
Impact on setting the clocking scheme of the pipeline: Timing constraints defined for 15 ns and 30 ns Ripple-Carry ADD runs. ADD units < ~128-bit width can be clocked at 15 ns. Others require a slower clock, unless wide-bit ADD units are split across pipeline stages.
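Splitting the 192-bit ADD into three 64-bit stages amounts to adding one 64-bit slice per stage and passing the carry forward through a pipeline register. The Python sketch below shows only the arithmetic; the function name and the loop standing in for three clocked stages are assumptions.

```python
SLICE = 64
STAGES = 3     # 3 x 64 bits = 192 bits


def add_192_in_three_stages(a, b):
    """Add two 192-bit values as three 64-bit adds with a carry passed between stages."""
    mask = (1 << SLICE) - 1
    carry = 0
    result = 0
    for stage in range(STAGES):               # each iteration models one pipeline stage
        a_slice = (a >> (SLICE * stage)) & mask
        b_slice = (b >> (SLICE * stage)) & mask
        partial = a_slice + b_slice + carry
        result |= (partial & mask) << (SLICE * stage)
        carry = partial >> SLICE              # carry held for the next stage
    return result, carry
```

Each stage then only has to settle a 64-bit carry chain within one clock period, at the cost of two extra cycles of latency.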
Step 3 Costing Base Multiplier Choices

Data shown for 32-bit base unit configurations only. Preliminary data.

Base MUL unit       Utilization (CLB Slices)   Max Delay (ns)
Shift-Add                                197            6.223
Knuth                                    425            7.243
Booth                                    387            9.289
Divide & Conquer                         269           13.563
Xilinx Macro                             119            4.421

At issue was the selection of the base Multipliers: What architecture(s) should be used? What bit-width should constitute the base unit, from which higher-level wide-bit units would be subsequently built? How many levels of recombination were optimal, with how many base units at each level?
We settled on two paths: 32-bit base units (leveraging CLB LUTs alone), and 16-bit units (leveraging 18x18 MUL resources for Virtex-II).
Each path was costed out at width, including the Adder data collected in the previous step. Some were pruned from further consideration.
At issue: how much better are Xilinx 18x18s than CLB-based MULs?
Step 3 Costing Base Multiplier Choices

32-bit base unit MULs. Preliminary data.
How do the base units compare with one another? Use the Area-Delay Product as a metric to assess the relative strength of architectures. [Rabaey et al, 2002] This gives a measure that is independent of the area vs. speed tradeoff.
How to assess the relative strength of using the Virtex-II 18x18 multipliers as base units? Use Normalized Efficiency, defined relative to the 18x18.

Base MUL unit       Area * Delay   Normalized Efficiency
Shift-Add               1225.931                   42.9%
Knuth                   3078.275                   17.1%
Booth                   3594.843                   14.6%
Divide & Conquer        3648.447                   14.4%
Xilinx Macro             526.099                  100.0%

Most of the other multiplier schemes are extremely inefficient when compared to the 18x18 as a base unit. The Shift-Add architecture reaches about 43% of the 18x18 unit efficiency (60.4% in area, 71% in delay), making it the next candidate when 18x18 resources are exhausted. The Divide & Conquer is 14% as efficient as 18x18s at 32 bits!
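The two metrics can be recomputed from the slice counts and maximum delays charted for the 32-bit base units. The Python below is a sketch of the calculation only; the dictionary and function names are chosen for this example.

```python
# (CLB slices, max delay in ns) for the 32-bit base units, as charted in Step 3.
BASE_UNITS = {
    "Shift-Add": (197, 6.223),
    "Knuth": (425, 7.243),
    "Booth": (387, 9.289),
    "Divide & Conquer": (269, 13.563),
    "Xilinx Macro": (119, 4.421),
}


def area_delay_products(units):
    """Area-Delay Product: slices multiplied by maximum delay."""
    return {name: area * delay for name, (area, delay) in units.items()}


def normalized_efficiency(adp):
    """Inverse ratio of each product to the lowest one (best unit = 1.0)."""
    best = min(adp.values())
    return {name: best / value for name, value in adp.items()}


adp = area_delay_products(BASE_UNITS)
efficiency = normalized_efficiency(adp)
for name in BASE_UNITS:
    print(f"{name:18s} {adp[name]:10.3f} {efficiency[name] * 100:6.1f}%")
```

Because efficiency is the ratio of two area-delay products, it factors into an area ratio times a delay ratio, which is how the Shift-Add figure breaks down into its area and delay components.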
Step 4 Characterizing Wide Datapaths

Data for 32-bit base unit MULs only.
Preliminary data.
We have wide Multipliers of differing widths (e.g., 192 and 256-bits):

Estimate each one in terms of performance of its smaller-width subunits. Different unit architectures are possible: Divide & Conquer, and Broadcast, etc. These MUL architectures are then decomposed. We need resource cost estimates, checking which candidate MUL configurations to apply, which to prune. Consider cost of controller as well. We need worst case delay, along with the properties of the FPGA, to select the clocking strategy. We need to count cycles, to get at the latency of the unit, and trade this against FPGA device fit.

Worst case delay envelope => minimum cycle time
Cycle time * number of cycles per operation => latency
Step 4 Characterizing Wide Datapaths
Wide-bit MUL Comparison

[Figure: Wide-bit MUL Comparison. Area (CLB Slices), 0 to 10000, versus Bit-width (32, 64, 128, 256) for DC-1, DC-2 and BC.]
Assessing architectures at different bit-widths: two versions of Divide & Conquer and one of Broadcast, where the differences are apparent in CLB usage. 18x18 usage is the same across versions at differing bit-widths.
Comparing DC and BC architectures at different bit-widths: which architecture uses resources most efficiently as it scales?
Assessment is augmented by considering latency (after clock assignment) and throughput.
Preliminary data.
Step 5 Characterizing Pipeline Control

Using Control-Dataflow Graphs (CDFGs)[Karp & Miller, 1969; Gajski et al., 1994]
Specify cycle-based sequencing and scheduling of operation steps in deterministic graph. Algorithmic state machine (ASM) method allows direct representation of algorithm for encoding in VHDL. Combination of state machine and RTN specification of MUL operation steps. Allows direct exploration of design space, for trading off area/speed issues, and for pipelining & resource sharing. Allows direct consideration of the controller logic component of a given MUL architecture.
State encoding scheme can impact delay cost from output decoding logic.
N states require register resources of log2(N)/2 CLB slices on a Virtex-II device (binary, Gray coding) or N/2 slices (one-hot), plus decoding LUTs and MUXes.
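The register cost of each encoding can be estimated with a short Python sketch. It assumes two flip-flops per slice, which is what the N/2 figure for one-hot implies, and it counts state registers only (no decoding LUTs or MUXes); the function name is hypothetical.

```python
import math

FFS_PER_SLICE = 2   # assumption: two flip-flops per CLB slice


def state_register_slices(n_states, encoding):
    """Estimate slices needed for the state register alone."""
    if encoding in ("binary", "gray"):
        flip_flops = max(1, (n_states - 1).bit_length())   # ceil(log2(N)) state bits
    elif encoding == "onehot":
        flip_flops = n_states                              # one flip-flop per state
    else:
        raise ValueError("unknown encoding")
    return math.ceil(flip_flops / FFS_PER_SLICE)
```

For a 16-state controller this gives 2 slices for binary or Gray coding against 8 slices for one-hot, before any decoding logic is counted.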
Step 5 Characterizing Pipeline Control

Preliminary data.
16000.00 14000.00
Area Delay Product
12000.00 10000.00 8000.00 6000.00 4000.00 2000.00 0.00 Binary 14081.23 Gray 11126.29 OneHot 14217.25
How do architectures compare in terms of FSM encoding efficiency?

Use Area-Delay Product as metric. Averaged results over multiple architectures and differing wide bitwidths. Not looking at FSM efficiency for a given pipelined MUL architecture.
Assessing the impact of FSM encoding on the architectures:

Use Normalized Efficiency, defined relative to the Gray Code scheme. The other FSM schemes (Binary, One-Hot) are moderately inefficient compared to Gray Code. This is surprising, since One-Hot is generally preferred for FPGAs. On average, One-Hot encoding is about 78% as efficient as Gray Code for realizing the FSM, roughly as efficient as Binary encoding (about 79%), for these pipelined architectures. The disparity appears to derive from how delay is affected by MUX chains in the control path.
[Figure: bar chart of Normalized Efficiency (0% to 120%) by FSM encoding.]

Encoding   Efficiency
Binary     79.02%
Gray       100.00%
OneHot     78.26%
Step 6 Assessing Pipeline Latency & Throughput

Examination of the multi-level hierarchical wide-bit units:
MUL Unit (DCn)  Slices  18x18s  Clock Cycles  Clock Period (ns)  Latency (ns)  Throughput (MMOPS)
32              585     3       6             15                 8             12.5
64              2485    9       16            20                 32            3.1
128             9282    27      21            30                 63            1.6
Divide & Conquer Multiplier - Composite Plot

[Figure: composite plot of Slices, 18x18s, Latency and Throughput versus Bit-width (32, 64, 128).]

- Use of subunits consists of intermediate MUL units.
- These are themselves divide & conquer units of the smaller width.
- These ultimately make reference to MULT18x18s.

We can assess the cost of the MUL pipelines, in terms of latency and throughput. There is a big question as to whether DCn will scale effectively while holding resource utilization, latency and throughput at acceptable levels.
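The hierarchical decomposition can be illustrated with a small software model. The slides do not say which divide & conquer split is used, so this sketch makes assumptions: it uses a three-multiply (Karatsuba-style, subtractive) split, a 16-bit base case standing in for one MULT18x18 primitive, and operand widths that are a power-of-two multiple of the base width. The resulting primitive counts of 3, 9 and 27 appear consistent with the 18x18 counts listed for the 32-, 64- and 128-bit units.

```python
def dc_multiply(a, b, width, base_width=16):
    """Multiply two unsigned `width`-bit integers by divide and conquer.

    Returns (product, primitive_count), where primitive_count is the
    number of base-width multiplies used.
    """
    if width <= base_width:
        return a * b, 1  # stands in for one hardware multiplier primitive
    half = width // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask

    hi, n_hi = dc_multiply(a_hi, b_hi, half, base_width)
    lo, n_lo = dc_multiply(a_lo, b_lo, half, base_width)

    # Subtractive form keeps both operands within `half` bits.
    da, db = a_hi - a_lo, b_lo - b_hi
    sign = -1 if (da < 0) != (db < 0) else 1
    mid, n_mid = dc_multiply(abs(da), abs(db), half, base_width)

    cross = hi + lo + sign * mid  # equals a_hi*b_lo + a_lo*b_hi
    product = (hi << (2 * half)) + (cross << half) + lo
    return product, n_hi + n_lo + n_mid


for w in (32, 64, 128):
    x = (1 << w) - 1
    p, count = dc_multiply(x, x, w)
    print(w, count, p == x * x)
```

A plain four-multiply split would give 4, 16 and 64 primitives instead, so the choice of split is one of the architecture decisions that the resource estimates have to capture.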
Protocol State Machine Example:

802.11b WLAN MAC Layer
Example - 802.11 WLAN MAC

Wireless Local Networking
Supports pervasive computing on PCs, laptops, PDAs and other Internet-enabled devices. Most Access Points support broadband access to a wired network. Personal network management screens access functions that control behavior of the MAC Layer. Example: wireless router from D-Link.
[Figure: 802.11 Wireless Medium (CSMA/CA) connecting Station-1, Station-2, Station-3 (Internet Gateway) and Station-4 (Print Server). Screen shots 2002 D-Link Corporation.]
802.11 WLAN Typical Architecture

CPU core: ARM9E / ARM946E-S cached processor with tightly coupled memory interfaces.
Die area: 4.9 mm2 on 0.18 µm; estimated size with 16 KB instruction & 4 KB data caches and no TCMs, using Artisan cell library & RAM compiler.
Peak power consumption (mW/MHz): 1.1 mW/MHz @ 1.8 V (estimated).
Memory system: selectable I & D cache sizes: 0, 4K, 8K ... 1M; selectable I & D TCM sizes: 0, 4K, 8K ... 1M.
Clock frequency & MIPS performance: 150 MHz on TSMC 0.18 µm (worst case); 230 MHz on TSMC 0.18 µm (typical).

[Figure: typical 802.11 WLAN architecture. Block labels: MAC controller, BUS interface, RF Front End, BBP, MAC, device driver (basic function call, memory control, etc.), Protocol Stack & OS kernel, Client driver, Application, AP driver, Target BBP & MAC Controller Chip, Soft driver, PHY, MAC, PCI/PCMCIA. Annotations: memory block (4~15 kB) + control block; on-chip memory depends on firmware size (100~200 KB); the FW size depends on the MAC feature (11e & i?); 10K gates.]
Source: Knowledge Edge KK
Behavior to Architecture Mapping
Most 802.11b MAC implementations are done as embedded systems executing on a CPU (e.g., ARM microprocessor).
We'll be designing our MAC layer model in VLSI custom logic using concurrent state machines, and will generate a circuit using a Xilinx family FPGA device.
802.11 WLAN Core Functions (UML Use Case)

Inventory of basic functions supported in our MAC-layer Receiver architecture. Interaction at the system boundary with the PHY layer. Each Use Case will be iterated using Sequence diagrams, hardware block diagrams, and Executable ASM diagrams. Each Use Case will have a set of behaviors associated with it that we will want to model as an RTL hardware description.

[Figure: UML Use Case diagram. Labels: MAC_Layer, PHY_Layer, Operations, TransmitFrame, ReceiveFrame, DecodeFrameHeader, DecodeFrameCheck, DecodeAddresses, DecryptFrameData, DecodeDurationID, DecodeSequenceControl.]
802.11 WLAN Core Actors & Data (UML Class)

Model structure of the problem domain using Class diagram.
Most applications have inherent structure that you'll want to understand. Useful for defining problem scope and for capturing design requirements. The classes identified may become key application data, or modules that operate on data. We'll use some of the identified classes as modules in Sequence diagrams, which will aid in making partitioning decisions.
[Figure: UML Class diagram (from Use Case View). Classes: ShiftController, WordCounter, FrameSequencer, WordSelector, DecoderSelector, Frame_Word, Frame_Sequence_State, State_Sequence. Associations: <<PartOf>>, Generates_Word, Maintains, Checks_Current_Value, Forwards_Word_To_Target, Selects_Target_For_Word. Roles: +next_state, +previous_state. Multiplicities shown: 1, 0..*, 1..*.]
802.11 WLAN Partitioning (UML Sequence)

Modeling sequenced module interactions leads to a partitioned architecture.
Take each Use Case and decompose it into a sequence of interactions between system modules. This allows a designer to explore choices for partitioning the system. Partitioning metrics: (1) module cohesion, (2) inter-module coupling, (3) scope of module responsibility, (4) degree of information hiding. We use concepts of object-oriented systems analysis to converge on an optimal partitioning strategy. Generally, we trade off circuit optimization against degree of design reuse, extensibility, maintainability.
802.11 WLAN Behavior (UML Statechart)

Take each module and derive its internal behavior description from the sequenced interactions.
Each module has events and actions defined at its interfaces. These collectively provide an inventory of the behavior the module must support. Looking at the Sequence diagrams, we extract the sequence of steps to be performed within each module. We specify the internal behavior of a module using either a UML Statechart diagram (for state-based lifecycles) or UML Activity diagram (for algorithmic description).
[Figure: UML Statechart for Frame_Word. State and event labels: Start, Created, Data_Buffered, Target_Decoded, Target_Identified, Latched, Fully_Decoded, Done. Transition conditions shown: ^Target_Decoded AND Select_Enable; Frame_Subtype = Data AND Buffer_Complete; Frame_Subtype = Data AND ^Buffer_Complete; Frame_Subtype != Data; ^Select_Enable; Select_Enable; ^Decode_Complete; Decode_Complete.]
802.11 WLAN - Frame Word Lifecycle

Participants: Frame_Word, Shifter, WordSelector, Frame Sequencer, Decoder Selector, FrameHeader Decoder.

This scenario is designed to show how state information for the Frame_Word might be correlated with the passing of signals between blocks, actions corresponding to state change for the Frame_Word.
In this scenario, we assume this is the first word created for a new frame.

1: Shift in 4-bits
2: Shift in 4-bits
3: Shift in 4-bits
4: Shift in 4-bits
5: Create Frame_Word
6: State = "Created"
7: Signal new Frame_Word
8: Check if 1st Word in Frame
9: Signal Sequencer to 1st Word
10: Set Decoder_Selector = "FrameHeaderDecoder"
11: Signal "Target_Decoded"
12: State = "Target_Identified"
13: Signal Frame_Header_Decoder
14: Latch Frame_Word
15: Signal "Select_Enable"
16: State = "Latched"
17: Decode Header
18: Pass Frame_Subtype
19: Signal "Decode_Complete"
20: State = "Fully_Decoded"
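The state changes in this scenario can be modeled as a small transition table. The sketch below covers only the path the scenario walks through (Created, Target_Identified, Latched, Fully_Decoded); the class and function names are hypothetical, and holding the current state on an unexpected signal is an assumption rather than something the slides specify.

```python
from enum import Enum


class WordState(Enum):
    CREATED = "Created"
    TARGET_IDENTIFIED = "Target_Identified"
    LATCHED = "Latched"
    FULLY_DECODED = "Fully_Decoded"


# (current state, signal) -> next state, taken from scenario steps 11-20.
TRANSITIONS = {
    (WordState.CREATED, "Target_Decoded"): WordState.TARGET_IDENTIFIED,
    (WordState.TARGET_IDENTIFIED, "Select_Enable"): WordState.LATCHED,
    (WordState.LATCHED, "Decode_Complete"): WordState.FULLY_DECODED,
}


def step(state, signal):
    """Return the next state; hold the current state on an unexpected signal."""
    return TRANSITIONS.get((state, signal), state)


def run(signals, state=WordState.CREATED):
    """Apply a sequence of signals and return every state visited."""
    visited = [state]
    for signal in signals:
        state = step(state, signal)
        visited.append(state)
    return visited


trace = run(["Target_Decoded", "Select_Enable", "Decode_Complete"])
print([s.value for s in trace])
```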
802.11 WLAN Sequence Diagram-1

Participants: Shifter, WordCounter, Decoder Selector, FrameHeader Decoder.

Notes on the diagram: "See next slide!" and "You will need to create a block for this one!"

1: Buffer 1st Frame Word
2: Signal Start_of_Frame
3: Initialize for New Frame
4: SOF_Acknowledge
5: Signal Word_Ready
6: Initialize for Next Word in Frame
7: Latch New Frame Word
8: Select FrameHeaderDecoder
8b: Buffer Header Word
9: Signal New Word Available
10: Initialize for New Frame Header
11: Latch Frame Header Word
12: Determine Frame Subtype & DS Bits
13: Buffer Frame Subtype and DS Bits
14: Signal Header_Data_Ready
15: Load Subtype & DS Bits
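Step 12, determining the frame subtype and DS bits, amounts to slicing fields out of the 16-bit Frame Control word. The sketch below uses the standard 802.11 Frame Control layout (protocol version in bits 0-1, type in bits 2-3, subtype in bits 4-7, To DS in bit 8, From DS in bit 9). The function name is hypothetical, and it is an assumption that the header word reaches the decoder with bit 0 as the first transmitted bit.

```python
def decode_frame_control(word):
    """Extract the fields the header decoder needs from a 16-bit Frame Control word."""
    return {
        "protocol_version": word & 0x3,
        "type": (word >> 2) & 0x3,       # 0 = management, 1 = control, 2 = data
        "subtype": (word >> 4) & 0xF,
        "to_ds": (word >> 8) & 0x1,
        "from_ds": (word >> 9) & 0x1,
    }


# An RTS control frame (type 1, subtype 11) and a data frame heading to the DS.
print(decode_frame_control(0x00B4))
print(decode_frame_control(0x0108))
```

In hardware these fields are plain wire selections from the latched header word, so the decoder costs routing and a register for the buffered subtype and DS bits rather than logic.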
802.11 WLAN Sequence Diagram-2

Participants: PHY, Shifter, WordSelector, Frame Sequencer, Decoder Selector.

1: PHY_Word_Avail
2: Initialize
3: Get 4-bit Word
4: Shift_In (do it 4 times)
5: Create Frame Word (register)
6: New_Word
6b: New_Word
7: Initialize
7b: Select Next State (register)
8: Latch New Word
9: Read Frame State
10: CASE: switch on Frame State
11: Make Decoder Selection for Word Destination
12: Set Enable Line for downstream block

Note that the Word_Counter of the previous slide is now decomposed into the Word_Selector and the Frame_Sequencer actors, which will be the basis for depicting the details in the block diagrams that follow.
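Steps 3 to 5 of this sequence, where four 4-bit words from the PHY are shifted in to form one 16-bit frame word, can be modeled as follows. The function names are hypothetical, and shifting the first nibble into the most significant position is an assumption; the slides do not state the bit ordering.

```python
def shift_in_nibble(register, nibble):
    """Shift a 4-bit value into the low end of a 16-bit register."""
    return ((register << 4) | (nibble & 0xF)) & 0xFFFF


def assemble_word(nibbles):
    """Build one 16-bit frame word from exactly four 4-bit PHY words."""
    if len(nibbles) != 4:
        raise ValueError("expected four 4-bit words")
    register = 0
    for nibble in nibbles:
        register = shift_in_nibble(register, nibble)
    return register


print(hex(assemble_word([0x1, 0x2, 0x3, 0x4])))
```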
802.11 WLAN MAC Receiver Block Diagram
802.11 WLAN Receive Word Shifter

A model for MAC-layer data receiver/shifter thread.
We use Executable ASM diagrams to model state machine behavior and datapath operations for the hardware. Executable ASM models have a graphical symbol set that looks like a flowchart. Algorithm structure can be easily modeled using the ASM graphics. The diagrams are annotated with register transfer notation (RTN) expressing operations and events. Executable ASM models are directly executable in Nimbus™, are correct by construction and result in VHDL/Verilog code optimized for circuit synthesis.
802.11 WLAN MAC Receiver: DID Decoder
Go to this state on error.
Go back to the poll state.

802.11 WLAN MAC Receiver: FCS Decoder

Go to this state on reset from MAC frame error. Poll for start of Frame Check Sequence bit stream.
Poll waiting for second word (16 bits) of FCS frame field.
Check that FCS field bits match CRC-32 bits computed from frame stream on the fly.
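The on-the-fly check can be sketched in software with Python's `zlib.crc32`, which implements the same CRC-32 polynomial that 802.11 uses for the FCS and accepts a running value so the checksum can be updated one word at a time. The function name is hypothetical, and the sketch assumes the received FCS has already been assembled into a single 32-bit integer in the same bit order that `zlib.crc32` produces.

```python
import zlib


def fcs_matches(frame_words, received_fcs):
    """Update a running CRC-32 per 16-bit word, then compare with the FCS.

    `frame_words` is an iterable of 2-byte `bytes` objects covering the
    frame up to, but not including, the FCS field.
    """
    crc = 0
    for word in frame_words:
        crc = zlib.crc32(word, crc)  # running CRC, updated as words arrive
    return crc == received_fcs


# Hypothetical frame contents, used only to exercise the check.
words = [b"\x08\x01", b"\x00\x00", b"\xde\xad"]
fcs = zlib.crc32(b"".join(words))
print(fcs_matches(words, fcs))
```

The hardware version would use a 32-bit LFSR clocked as data arrives, but the structure is the same: no frame buffering is needed before the comparison.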
802.11 WLAN Frame Word Counter

Poll for new 16-bit word in receiver stream.
Test if the new word is the first word of a new MAC frame. Our sequencing choice depends on first word or not. We'll always assume Frame Control Header if we have a new frame. Enable decoding of target block select, based on current state of Frame Sequencer.
802.11 WLAN Transmit Frame Selector

ASM sub-flow has cascading tests as a nested If-Then-Else structure. Priority sequence is defined by the order of nesting. NOTE: there is an error in the design that is easy to see. The Transmitter should send a reply to either RTS or DATA frame events reported by the Receiver before sending an MSDU passed down from the LLC layer of the network protocol stack.
This logic block encapsulates output decoding logic, which is good to put in ASM sub-flows.
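The corrected priority order can be written as the same cascade of tests. This is a sketch of the intended behavior, not a transcription of the ASM diagram: the function and argument names are hypothetical, the replies are the standard 802.11 ones (CTS answers RTS, ACK answers a DATA frame), and placing the RTS test ahead of the DATA test is an assumption.

```python
def select_transmit_frame(rts_received, data_received, msdu_pending):
    """Pick the next frame to transmit; earlier tests have higher priority."""
    if rts_received:
        return "CTS"    # reply to an RTS reported by the Receiver
    elif data_received:
        return "ACK"    # reply to a DATA frame reported by the Receiver
    elif msdu_pending:
        return "MSDU"   # only then send data passed down from the LLC layer
    return None         # nothing to send


print(select_transmit_frame(False, True, True))
```

Because the order of nesting sets the priority, moving the MSDU test above either reply test reproduces the error noted on the slide.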
Section 3 - Summary
Executable algorithmic state machines:
Allow both control and data operations to be specified in time (cycle by cycle scheduling) and space (allocation of specific resource types using abstract macro-functions). Notation we define is directly executable in the Nimbus tool set, hence, executable ASM models. Basic data operations and memory operations supported directly in the ASM notation using datapath macro-functions and memory arrays.
Using ASM diagrams:

A thinking aid for defining the structure and sequencing behavior of Finite State Machines. Used in 3 different ways: (1) definition/specification of sequential systems, (2) analysis of sequential circuits, (3) design of combinational and sequential circuits behaviorally.
Algorithms and Protocols

Can directly support the mapping of an algorithm into one or more candidate architectures. Can directly support exploration of protocol implementations distributed across many concurrent state machine threads.

