07 RTL Optimization Techniques

Contents
Timing optimization
Area optimization
Additional readings
Budapest University of Technology and Economics
RTL Optimization Techniques

Péter Horváth
Department of Electron Devices
March 30, 2016
Contents
timing optimization concepts and design techniques
  throughput, latency, local datapath delay
  loop unrolling, removing pipeline registers, register balancing
area optimization concepts and design techniques
  resource requirement metrics in standard cell ASIC and FPGA
  control-based logic reuse, priority encoders, considering technology primitives
additional readings
Timing optimization
Computation performance concepts
Three concepts are central to computation performance:
throughput: the amount of data processed per clock cycle (bits per cycle; multiplied by the clock frequency, this gives bits per second).
latency: the time elapsed between data input and the corresponding processed data output (clock cycles).
local datapath delay: the delay of the logic between two storage elements (nanoseconds). The slowest local datapath determines the maximum clock frequency.
High throughput: loop unrolling (pipeline)

[Figure: two block diagrams of the pow3 datapath (32-bit input x, 32-bit output pow). Iterative version (one multiplier, register pow, start input): throughput 32/3 = 10.7 bits/cycle; latency 3 cycles. Pipelined version (registers x1, x2, pow1, pow): throughput 32/1 = 32 bits/cycle; latency 3 cycles.]
Timing optimization techniques
High throughput: loop unrolling (pipeline)

High-throughput optimization does not aim to shorten the time needed to process a single data item; it minimizes the time elapsed between two consecutive input reads.
Data item n+1 is read while data item n is still being processed.
architecture iterative of pow3 is
  -- declarations of count and stop are omitted, as on the slide
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (start = '1') then
        -- load the operand; two more multiplications follow
        count <= 2;
        pow <= x;
      elsif (stop = '0') then
        count <= count - 1;
        pow <= pow * x;
      end if;
    end if;
  end process;
  stop <= '1' when count = 0 else '0';
end architecture;
throughput: 32/3 = 10.7 bits/cycle; latency: 3 cycles

architecture pipelined of pow3 is
  -- declarations of x1, x2 and pow1 are omitted, as on the slide
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- stage 1
      x1 <= x;
      -- stage 2
      x2 <= x1;
      pow1 <= x1 * x1;
      -- stage 3
      pow <= pow1 * x2;
    end if;
  end process;
end architecture;
throughput: 32/1 = 32 bits/cycle; latency: 3 cycles
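The statement that a new operand is read on every clock cycle while earlier operands are still in flight can be checked in simulation. The following is a minimal, self-contained sketch, not the author's test setup. The entity declaration, the unsigned port types, the truncation of each product to 32 bits with resize, and the testbench pow3_tb are all assumptions added for illustration, since the slides show only the architecture bodies.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity declaration: the slides show only the architectures.
entity pow3 is
  port (
    clk : in  std_logic;
    x   : in  unsigned(31 downto 0);
    pow : out unsigned(31 downto 0)
  );
end entity;

architecture pipelined of pow3 is
  signal x1, x2, pow1 : unsigned(31 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- stage 1
      x1   <= x;
      -- stage 2 (assumption: keep only the lower 32 bits of each product)
      x2   <= x1;
      pow1 <= resize(x1 * x1, pow1'length);
      -- stage 3
      pow  <= resize(pow1 * x2, pow'length);
    end if;
  end process;
end architecture;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench: applies the operands 1, 2, 3, ... back to back.
entity pow3_tb is
end entity;

architecture sim of pow3_tb is
  signal clk : std_logic := '0';
  signal x   : unsigned(31 downto 0) := (others => '0');
  signal pow : unsigned(31 downto 0);
begin
  dut : entity work.pow3(pipelined)
    port map (clk => clk, x => x, pow => pow);

  stimulus : process
  begin
    for n in 1 to 8 loop
      -- a new operand is applied before every rising clock edge
      x   <= to_unsigned(n, x'length);
      clk <= '0';
      wait for 5 ns;
      clk <= '1';
      wait for 5 ns;
      -- operand n-2 was captured three edges ago, so its cube is at the output
      if n >= 3 then
        assert pow = to_unsigned((n - 2) ** 3, pow'length)
          report "unexpected value on pow" severity failure;
      end if;
    end loop;
    report "pipelined pow3: one result per clock cycle, latency 3 cycles";
    wait;
  end process;
end architecture;
```

After the third clock edge, a finished result appears on every edge, which is the 32 bits/cycle throughput quoted for the pipelined version, while each individual operand still takes three cycles to pass through.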

Low latency: removing pipeline registers

[Figure: two block diagrams of the pow3 datapath (32-bit input x, 32-bit output pow). Pipelined version (registers x1, x2, pow1, pow): latency 3 cycles. Version with the pipeline registers removed: latency 1 cycle.]
Low latency: removing pipeline registers

The objective of low-latency optimization is to pass data from the input to the output with minimal internal processing delay.
A low-latency design uses parallelism and removes pipeline registers.
architecture async of pow3 is
begin
  -- purely combinational: no pipeline registers
  process (x)
  begin
    x1 <= x;
  end process;
  process (x1)
  begin
    x2 <= x1;
    pow1 <= x1 * x1;
  end process;
  pow <= pow1 * x2;
end architecture;
latency: 1 cycle (with an additional output register)

architecture pipelined of pow3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- stage 1
      x1 <= x;
      -- stage 2
      x2 <= x1;
      pow1 <= x1 * x1;
      -- stage 3
      pow <= pow1 * x2;
    end if;
  end process;
end architecture;
latency: 3 cycles
Minimizing logic delay: register layers

[Figure: two block diagrams of the FIR filter (inputs x, A, B, C; output y). Single-cycle version (registers x1, x2, y): local datapaths contain 1 adder and 1 multiplier. Version with an additional register layer (registers x1, x2, prod1, prod2, prod3, y): local datapaths contain 1 adder or 1 multiplier.]
Minimizing logic delay: register layers

The logic between two sequential elements is called a local datapath.
The delay of the slowest local datapath determines the maximum clock frequency.
The local datapath delay can be reduced by inserting additional register layers; each added layer increases the latency by one clock cycle.
architecture single_cycle of fir is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (valid = '1') then
        x1 <= x;
        x2 <= x1;
        -- multiplication and addition in the same local datapath
        y <= A*x + B*x1 + C*x2;
      end if;
    end if;
  end process;
end architecture;

architecture multi_cycle of fir is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (valid = '1') then
        x1 <= x;
        x2 <= x1;
        -- additional register layer after the multipliers
        prod1 <= A * x;
        prod2 <= B * x1;
        prod3 <= C * x2;
        y <= prod1 + prod2 + prod3;
      end if;
    end if;
  end process;
end architecture;
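As a rough illustration of why the extra register layer helps, the maximum clock frequency can be estimated from the slowest register-to-register path. The delay values below are assumed purely for illustration and do not come from the slides: 6 ns for a multiplier, 2 ns for the three-operand adder, and 1 ns in total for the clock-to-output delay plus the setup time of the registers.

```latex
f_{max} = \frac{1}{t_{cq} + t_{logic,max} + t_{su}}

\text{single\_cycle: } t_{logic,max} = t_{mul} + t_{add} = 6\,\text{ns} + 2\,\text{ns} = 8\,\text{ns}
  \;\Rightarrow\; f_{max} = \frac{1}{1\,\text{ns} + 8\,\text{ns}} \approx 111\,\text{MHz}

\text{multi\_cycle: } t_{logic,max} = \max(t_{mul},\, t_{add}) = 6\,\text{ns}
  \;\Rightarrow\; f_{max} = \frac{1}{1\,\text{ns} + 6\,\text{ns}} \approx 143\,\text{MHz}
```

Under these assumed delays the multi_cycle architecture could be clocked about 29% faster, at the price of one additional cycle of latency.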
Minimizing logic delay: register balancing

[Figure: two block diagrams of the three-input adder add3 (inputs in_a, in_b, in_c; output sum). Not balanced (registers reg_a, reg_b, reg_c, sum): local datapaths contain 2 adders. Balanced (registers reg_ab_sum, reg_c, sum): local datapaths contain 1 adder.]
Minimizing logic delay: register balancing

During register balancing, the logic between registers is redistributed in order to minimize the worst-case delay between any pair of registers.
architecture not_balanced of add3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      reg_a <= in_a;
      reg_b <= in_b;
      reg_c <= in_c;
      -- both additions sit between the same pair of register layers
      sum <= reg_a + reg_b + reg_c;
    end if;
  end process;
end architecture;

architecture balanced of add3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- one addition is moved in front of the first register layer
      reg_ab_sum <= in_a + in_b;
      reg_c <= in_c;
      sum <= reg_ab_sum + reg_c;
    end if;
  end process;
end architecture;
Area optimization
Area concepts
The resource requirement is the number of basic functional primitives needed to implement the described functionality.
In standard cell ASICs, the basic functional primitives are the standard cells: simple logic gates and flip-flops, but also more complex arithmetic-logic functions or memories.
In FPGAs, the basic logic elements (BLEs) consist of a logic function (the number of inputs depends on the vendor and the device family), a flip-flop and a multiplexer. There are special-purpose resources as well, such as memory blocks and signal processing elements (multipliers).
Area optimization techniques
Minimizing area: control-based logic reuse

Control-based logic reuse can be considered the opposite of loop unrolling. A pipeline requires internal data storage resources and additional logic to implement parallel operation. With logic reuse, the same resources are shared among the operations at the cost of reduced throughput.
[Figure: block diagrams of an accumulator datapath with and without control-based logic reuse. Labels recoverable from the diagrams: inputs in1 to in4, registers plr1, plr2 and acc, an adder, an FSM, and the control signals sel, sel_input, zero, ce and ce_acc.]
Control-based logic reuse requires an FSM to generate the control signals.
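Since this technique is presented only as a diagram, a small VHDL sketch may help. It sums four inputs with a single shared adder, an input multiplexer and a counter-style FSM. This is one possible reading of the figure, not the author's implementation: the entity name add4_reuse, the done output, the unsigned port types and the way the FSM state doubles as the multiplexer select are all assumptions; only in1 to in4, sel_input, acc, clk and reset follow labels from the diagram.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names follow the figure labels where possible.
entity add4_reuse is
  port (
    clk, reset         : in  std_logic;
    in1, in2, in3, in4 : in  unsigned(31 downto 0);
    acc                : out unsigned(31 downto 0);
    done               : out std_logic  -- assumed output, not in the figure
  );
end entity;

architecture fsm_mux of add4_reuse is
  -- the FSM state doubles as the multiplexer select signal
  signal sel_input : natural range 0 to 3 := 0;
  signal acc_reg   : unsigned(31 downto 0) := (others => '0');
  signal operand   : unsigned(31 downto 0);
begin
  -- input multiplexer controlled by the FSM
  with sel_input select operand <=
    in1 when 0,
    in2 when 1,
    in3 when 2,
    in4 when others;

  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      if reset = '1' then
        sel_input <= 0;
        acc_reg   <= (others => '0');
      elsif sel_input = 0 then
        -- first step: restart the accumulation with in1
        acc_reg   <= operand;
        sel_input <= 1;
      else
        -- the single adder is reused for in2, in3 and in4
        acc_reg <= acc_reg + operand;
        if sel_input = 3 then
          sel_input <= 0;
          done      <= '1';
        else
          sel_input <= sel_input + 1;
        end if;
      end if;
    end if;
  end process;

  acc <= acc_reg;
end architecture;
```

The inputs are assumed to be held stable for four clock cycles. One sum is produced every four cycles, flagged by done, so the saving in adders is paid for with a throughput four times lower than that of a fully unrolled datapath.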
Minimizing area: priority encoders

The resource requirement can be reduced by exploiting mutual exclusion. The elsif statement should be used only if a priority encoder is required, that is, when the conditions are not mutually exclusive.
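architecture priority of logic is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- elsif chain: infers a priority encoder
      if (ctrl(0) = '1') then
        output(0) <= input;
      elsif (ctrl(1) = '1') then
        output(1) <= input;
      elsif (ctrl(2) = '1') then
        output(2) <= input;
      elsif (ctrl(3) = '1') then
        output(3) <= input;
      end if;
    end if;
  end process;
end architecture;

architecture not_priority of logic is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- independent if statements: no priority logic
      if (ctrl(0) = '1') then output(0) <= input; end if;
      if (ctrl(1) = '1') then output(1) <= input; end if;
      if (ctrl(2) = '1') then output(2) <= input; end if;
      if (ctrl(3) = '1') then output(3) <= input; end if;
    end if;
  end process;
end architecture;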
Minimizing area: priority encoders

[Figure: enable logic of the registers output_a to output_d, each loaded from the 32-bit input under control of the 4-bit ctrl signal. Without exploiting mutual exclusion: the enable logic uses ctrl[0] for output_a, ctrl[0] and ctrl[1] for output_b, ctrl[0] to ctrl[2] for output_c, and ctrl[0] to ctrl[3] for output_d. With exploiting mutual exclusion: each register is enabled by a single ctrl bit (ctrl[0] to ctrl[3]).]
Minimizing area: considering technology primitives

An appropriate HDL coding style leads to more efficient logic synthesis. Synthesis tool vendors usually provide coding guidelines that improve the resource requirement or the timing parameters of the design. The recommended coding style takes the unique characteristics of the technology primitives into consideration.
utilizing block RAM modules in FPGAs: Block RAM modules do not have a reset input for the memory content, and their outputs are synchronous to a clock signal. Only HDL models with these properties can be implemented in block RAMs.
utilizing high-quality DSP units: The DSP slices in FPGAs have synchronous outputs. This restriction has to be taken into account when writing the HDL model.
Minimizing area: considering technology primitives

architecture FFS of RAM is
begin
  process (clk, reset)
  begin
    if (reset = '1') then
      -- asynchronous reset of the whole memory content
      content <= (others=>(others=>'0'));
    elsif (rising_edge(clk)) then
      if (write = '1') then
        content(address) <= data_in;
      end if;
    end if;
  end process;
  -- asynchronous (combinational) read
  data_out <= content(address);
end architecture;
Because of the asynchronous output, this model cannot be implemented in block RAM. The reset function hinders the LUT implementation as well.

architecture BRAM of RAM is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (write = '1') then
        content(address) <= data_in;
      end if;
      -- synchronous read
      data_out <= content(address);
    end if;
  end process;
end architecture;
This model can be implemented as flip-flops, LUT RAM and block RAM as well.
Additional readings
Steve Kilts: Advanced FPGA Design: Architecture, Implementation, and Optimization
David Money Harris, Sarah L. Harris: Digital Design and Computer Architecture
Peter J. Ashenden: Digital Design: An Embedded Systems Approach Using VHDL
M. Morris Mano, Charles R. Kime: Logic and Computer Design Fundamentals
Pong P. Chu: RTL Hardware Design Using VHDL
Peter Wilson: Design Recipes for FPGAs
