

A substantial improvement in performance can be achieved by overlapping the execution
of basic steps, a technique called pipelining.
Example: Suppose that we want to perform the combined multiply and add operations
with a stream of numbers:
Ai * Bi + Ci for i = 1, 2, 3, ..., 7
Each sub operation is to be implemented in a segment within a pipeline. The sub
operations performed in each segment of the pipeline are as follows:

R1 ← Ai, R2 ← Bi          Input Ai & Bi

R3 ← R1 * R2, R4 ← Ci     Multiply & input Ci

R5 ← R3 + R4              Add Ci to product

[Figure: three-segment pipeline. R1 and R2 receive Ai and Bi; the multiplier output goes to R3; Ci is loaded into R4; the adder output goes to R5.]

Clock Segment 1 Segment 2 Segment 3

Pulse R1 R2 R3 R4 R5
1 A1 B1 --- --- ---
2 A2 B2 A1 * B1 C1 ---
3 A3 B3 A2 * B2 C2 A1 * B1 + C1
4 A4 B4 A3 * B3 C3 A2 * B2 + C2
5 A5 B5 A4 * B4 C4 A3 * B3 + C3
6 A6 B6 A5 * B5 C5 A4 * B4 + C4
7 A7 B7 A6 * B6 C6 A5 * B5 + C5
8 --- --- A7 * B7 C7 A6 * B6 + C6
9 --- --- --- --- A7 * B7 + C7
Contents of the registers in the pipeline

1. Suppose there is a k-segment pipeline with a clock cycle time of tp, used to execute n tasks.
2. The first task T1 requires a time equal to k·tp to complete its operation, using all k
segments in the pipe.
3. The remaining n − 1 tasks emerge from the pipe at the rate of one task per clock
cycle, so they are completed after an additional time of (n − 1)·tp.
4. Therefore, completing n tasks using a k-segment pipeline requires k + (n − 1)
clock cycles.


The speedup of pipelined processing over equivalent nonpipelined processing is
defined by the ratio

S = n·tn / [(k + n − 1)·tp]

where tn is the time required per task on the nonpipelined machine. As the number of
tasks increases, n becomes much larger than k − 1, and k + n − 1 approaches the value
of n. Under this condition, the speedup becomes

S = tn / tp

If we assume that the time it takes to process a task is the same in the pipeline and
nonpipeline circuits, we will have tn = k·tp. Including this assumption, the speedup
reduces to

S = k·tp / tp = k

This shows that the theoretical maximum speedup that a pipeline can provide is k,
where k is the number of segments in the pipeline.
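These relations are easy to check numerically. Below is a minimal Python sketch (the function name and sample numbers are my own, for illustration only):

```python
def pipeline_speedup(n, k, tp, tn=None):
    """Speedup of a k-segment pipeline over a nonpipelined unit.

    n  -- number of tasks
    k  -- number of pipeline segments
    tp -- pipeline clock cycle time
    tn -- nonpipelined time per task (defaults to k * tp,
          the equal-processing-time assumption from the text)
    """
    if tn is None:
        tn = k * tp
    return (n * tn) / ((k + n - 1) * tp)

print(pipeline_speedup(n=7, k=3, tp=10))     # small n: well below k
print(pipeline_speedup(n=1000, k=3, tp=10))  # large n: approaches k = 3
```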

For example:

There are 4 segments with delays t1 = 60 ns, t2 = 70 ns, t3 = 100 ns, t4 = 80 ns, and the
interface registers have a delay of tr = 10 ns.

Using a pipeline:

The clock cycle is chosen to be tp = t3 + tr = 100 + 10 = 110 ns (the slowest segment plus
the register delay).

Using a non-pipeline:

tn = t1 + t2 + t3 + t4 + tr = 60 + 70 + 100 + 80 + 10 = 320 ns

The pipeline adder therefore has a speedup of 320/110 = 2.9 over the nonpipelined circuit.
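The same arithmetic, as a runnable check (variable names are my own):

```python
# Worked version of the 4-segment example (all times in ns).
seg = [60, 70, 100, 80]   # segment delays t1..t4
tr = 10                   # interface register delay

tp = max(seg) + tr        # pipeline clock cycle: slowest segment + latch delay
tn = sum(seg) + tr        # nonpipelined flow-through delay

print(tp)                 # 110
print(tn)                 # 320
print(round(tn / tp, 1))  # 2.9 -- asymptotic speedup
```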

For example:

If a pipeline executes many tasks, the overhead of start-up and drainage can be
ignored and the effective throughput of the pipeline taken to be 1 task per cycle. In terms
of instruction execution, this corresponds to an effective CPI (cycles per instruction) of 1.

Linear Pipeline Processor

Linear pipelines are applied for instruction execution, arithmetic computation, and
memory-access operations.

A linear pipeline processor is constructed with k processing stages. External

inputs (operands) are fed into the pipeline at the first stage S1. The processed results are
passed from stage Si to stage Si+1 for all i = 1,2, ... ,k -1. The final result emerges from the
pipeline at the last stage Sk.

Depending on the control of data flow along the pipeline, we model linear pipelines in
two categories:

1. Asynchronous
2. Synchronous.
Asynchronous Model:

1. Data flow between adjacent stages in an asynchronous pipeline is controlled by a

handshaking protocol.

2. When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage
Si+1 receives the incoming data, it returns an acknowledge signal to Si.
3. Asynchronous pipelines are useful in designing communication channels in message
passing multicomputers.

4. Asynchronous pipelines may have a variable throughput rate.

5. Different amounts of delay may be experienced in different stages.

Synchronous Model

1. Clocked latches are used to interface between stages.

2. The latches are made with master-slave flip-flops, which can isolate inputs from
outputs.

3. Upon the arrival of a clock pulse, all latches transfer data to the next stage
simultaneously.

The pipeline stages are combinational logic circuits. It is desired to have approximately
equal delays in all stages. These delays determine the clock period and thus the
speed of the pipeline.

Clocking and Timing Control

The clock cycle τ of a pipeline is determined as follows. Let τi be the time delay of
the circuitry in stage Si and d the time delay of a latch, as shown in Fig. 6.1b.

Clock Cycle and Throughput

Denote the maximum stage delay as τm; then we can write τ as

τ = max{τi for 1 ≤ i ≤ k} + d = τm + d

At the rising edge of the clock pulse, the data is latched into the master flip-flops of
each latch register. The clock pulse has a width equal to d. In general, τm >> d by
one to two orders of magnitude. This implies that the maximum stage delay τm
dominates the clock period.

The pipeline frequency is defined as the inverse of the clock period:

f = 1 / τ

If one result is expected to come out of the pipeline per cycle, f represents the
maximum throughput of the pipeline.

Speedup, Efficiency, and Throughput

Ideally, a linear pipeline of k stages can process n tasks in k + (n − 1) clock cycles,
where k cycles are needed to complete the execution of the very first task and the
remaining n − 1 tasks require n − 1 more cycles. Thus the total time required is

Tk = [k + (n − 1)] τ

where τ is the clock period. Consider an equivalent-function nonpipelined processor
which has a flow-through delay of kτ. The amount of time it takes to execute n tasks on
this nonpipelined processor is T1 = n·k·τ.

Speedup Factor: The speedup factor of a k-stage pipeline over an equivalent nonpipelined
processor is defined as

Sk = T1 / Tk = n·k·τ / [k·τ + (n − 1)·τ] = n·k / [k + (n − 1)]

Optimal Number of Stages

The optimal choice of the number of pipeline stages should be able to maximize a
performance/cost ratio.

Let t be the total time required for a nonpipelined sequential program of a given
function. To execute the same program on a k-stage pipeline with an equal flow-through
delay t, one needs a clock period of p = t/k + d, where d is the latch delay.

Thus, the pipeline has a maximum throughput of f = 1/p = 1/(t/k + d).

The total pipeline cost is roughly estimated by c + k·h, where c covers the cost of
all logic stages and h represents the cost of each latch.

A pipeline performance/cost ratio (PCR) has been defined by Larson (1973):

PCR = f / (c + k·h) = 1 / [(t/k + d)(c + k·h)]

Figure 2 plots the PCR as a function of k. The peak of the PCR curve corresponds
to an optimal choice for the number of desired pipeline stages:

k0 = sqrt(t·c / (d·h))

where t is the total flow-through delay of the pipeline. The total stage cost c, the
latch delay d, and the latch cost h can be adjusted to achieve the optimal value k0.
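Larson's result can be sanity-checked numerically. A small sketch, with made-up values for t, d, c, and h (the peak of the PCR curve should sit near k0):

```python
import math

def pcr(k, t, d, c, h):
    """Larson's performance/cost ratio for a k-stage pipeline.
    t: total flow-through delay, d: latch delay,
    c: total stage logic cost, h: per-latch cost."""
    return 1.0 / ((t / k + d) * (c + k * h))

t, d, c, h = 128.0, 2.0, 64.0, 4.0   # illustrative values, not from the text
k0 = math.sqrt(t * c / (d * h))      # optimal stage count
print(k0)                            # 32.0

# The PCR at k0 should beat choices far away on either side:
print(pcr(32, t, d, c, h) > pcr(8, t, d, c, h))    # True
print(pcr(32, t, d, c, h) > pcr(128, t, d, c, h))  # True
```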
Nonlinear Pipeline Processor

The traditional linear pipelines are static pipelines because they are used to
perform fixed functions.

1. A dynamic pipeline can be reconfigured to perform variable functions at

different times.
2. A dynamic pipeline allows feed forward and feedback connections in
addition to the streamline connections. For this reason, some authors call
such a structure a nonlinear pipeline.

Pipeline conflicts:
There are 3 major difficulties that cause the instruction pipeline to deviate from its
normal operation.

1. Resource conflict
2. Data dependency
3. Branch difficulties

Resource conflict:

A resource conflict is caused by access to memory by two segments at the same
time. Most of these conflicts can be resolved by using separate instruction and data
memories.

Data dependency:

A data dependency arises when an instruction depends on the result of a previous
instruction, but this result is not yet available.

Branch difficulties:

Branch difficulties arise from branch and other instructions that change the value
of the PC.


Instr 1 FI DA FO EX
Instr 2 FI DA FO EX
Branch Instr 3 FI DA FO EX
Instr 4 - - FI DA FO EX
Inst 5 - - - FI DA FO EX

1. In a pipelined system, instructions Ik+1, Ik+2 may be started and completed before
instruction Ik is completed.
2. This difference can cause problems if not properly considered in the design of
the control.
3. The existence of such dependencies causes what is called a "hazard".
4. The hardware technique that detects and resolves hazards is called an interlock.

Types of hazards:

1. Instruction hazard
2. Data hazard
3. Branching hazards
4. Structural hazards


1. An instruction RAW hazard occurs when the instruction to be initiated in the
pipeline is fetched from a location that is yet to be updated by some
uncompleted instruction in the pipeline.

2. To handle this hazard, a centralized controller is required to keep track of the
address range sets for the instructions inside the pipeline.

3. On every instruction fetch, the PC must be compared against the addresses in the
address range sets used by the subsequent stages. A match with any of these
addresses means there is an instruction RAW hazard.
Types of hazards:

1. RAW (read after write) or true-dependency hazard
2. WAR (write after read) or anti-dependency hazard
3. WAW (write after write) or output-dependency hazard

We demonstrate the three hazards with the help of a sample program:

Instruction no.   Code

2.                SUB R3, A
4.                ADD R3, A

RAW hazard:

In the above code, the second instruction must use the value of A updated by the first
instruction. If the second instruction (SUB) reads the value of A before instruction 1 has
had a chance to update it, a wrong value will be used by the CPU. This situation is
called a RAW hazard.

WAR hazard:

A WAR hazard between two instructions i and j occurs when instruction j attempts to
write to some object that is being read by instruction i.

A WAR hazard exists between instructions 2 and 3, since an attempt by
instruction 3 to record a value in A before instruction 2 has read the old value is clearly
incorrect.
WAW hazard:

A WAW hazard between two instructions i and j occurs when instruction j
attempts to write to some object that is also to be modified by instruction i.

Similarly, a WAW hazard occurs between instructions 3 and 5, since an attempt by
instruction 5 to store its result before the store of instruction 3 is clearly incorrect.


Data hazards occur when data is modified. Ignoring potential data hazards can result
in race conditions. There are three situations in which a data hazard can occur:

1. RAW (read after write).

An operand is modified and read soon after.


Instr 1: R3 ← R1 + R2
Instr 2: R5 ← R3 + R2

The first instruction calculates a value to be saved in register R3, and the second is going
to use this value to compute a result for register R5.

2. WAR (write after read)

An operand is read and soon after overwritten. We must ensure that we do not store a
result into register R3 before the earlier instruction has had a chance to fetch its
operands.

3. WAW (write after write)

Two instructions that write the same operand are performed.

For example:

Instr -1: R2←R1+R3

Instr -2: R2←R4+R5

We must delay the WB (write back) of instruction 2 until instruction 1 has executed.
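The three hazard types can be detected mechanically by comparing each instruction's read and write sets. A minimal illustrative sketch, assuming instructions are represented as (reads, writes) pairs (the representation and names are my own):

```python
def hazards(i, j):
    """Classify hazards when instruction j follows instruction i.
    Each instruction is a (reads, writes) pair of register-name sets."""
    ri, wi = i
    rj, wj = j
    found = []
    if wi & rj:
        found.append("RAW")   # j reads what i writes (true dependence)
    if ri & wj:
        found.append("WAR")   # j writes what i reads (anti-dependence)
    if wi & wj:
        found.append("WAW")   # both write the same object (output dependence)
    return found

i1 = ({"R1", "R2"}, {"R3"})   # R3 <- R1 + R2
i2 = ({"R3", "R2"}, {"R5"})   # R5 <- R3 + R2
print(hazards(i1, i2))        # ['RAW']
```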

Eliminating hazards:

There are several established techniques for preventing hazards:

1. Bubbling the pipeline (pipeline break or pipeline stall)
2. Eliminating data hazards

1. Pipeline stalls (bubbles):

Data dependency in pipelines (execution of one instruction depending on completion of a
previous instruction) can cause pipeline stalls, which diminish performance.

Read-after-compute: register access after updating it with a computed value
Read-after-load: register access after updating it with data from memory

As an instruction is fetched, control logic determines whether a hazard could or will occur.
If so, the control logic inserts NOPs into the pipeline. Thus, before the next instruction
is executed, the previous one will have had sufficient time to complete and the hazard is
prevented.
Bubble insertion

1. Insertion of bubbles in a pipeline implies reduced throughput.
2. An objective of pipeline designers is to minimize the instances of bubble
insertion. An intelligent compiler can also help mitigate the performance penalty.
3. This is shown in the figure, where the third instruction uses the value that the
second instruction loads into register $8. Without data forwarding, the third
instruction must read register $8 in cycle 7. This data dependency problem
can be solved by bubble insertion or via data forwarding.

sw – store word
lw – load word

Data Forwarding:
1. Although the result of the second instruction is not yet stored in register $8 by the
time the third instruction needs it, the result is in fact available at the output of the
ALU. Thus, the needed value can be passed (over a bypass path) from the second
instruction to the third one.
For example:
Say register 1 initially holds 6; we then want to write the value 3 to register 1, and then
add 7 to register 1 and store the result in register 2, i.e.

Instr 0: register 1 = 6
Instr 1: register 1 = 3
Instr 2: register 2 = register 1 + 7 = 10

1. Following execution, register 2 should contain the value 10.

2. However, if instruction 1 does not completely exit the pipeline before instruction 2
starts execution, register 1 does not yet contain the new value when instruction 2
performs its addition.

3. In such an event, instruction 2 adds 7 to the old value of register 1, so register 2
would contain 13.
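The 10-versus-13 outcome can be modeled with a toy sketch (the two-dictionary model of committed versus in-flight values is my own simplification):

```python
# Toy model of the register-1 example: with a stale read (no forwarding),
# instruction 2 sees the old value 6 and computes 13; with forwarding it
# sees the in-flight value 3 and computes the correct 10.

def run(forwarding):
    committed = {1: 6}    # instr 0: register 1 = 6 (already written back)
    in_flight = {1: 3}    # instr 1: register 1 = 3 (still in the pipeline)
    src = in_flight[1] if forwarding else committed[1]
    return src + 7        # instr 2: register 2 = register 1 + 7

print(run(forwarding=False))  # 13 -- hazard: stale operand used
print(run(forwarding=True))   # 10 -- value bypassed from the ALU output
```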

Operand hazards:

1. Logic to detect hazards in operand fetches can be worked out as follows from
the definitions of the range and domain sets.

2. We can devise a mechanism to keep track of the domain and range sets for
the various instructions passing through the pipeline stages.

3. Each stage will be associated with one set of storage registers for the domain-set
addresses and another for the range-set addresses.

4. This storage is required for all stages beyond the decode stage.

5. When an instruction moves from stage to stage, it also carries its range and domain
set information.
Scalar processor:

Scalar processors represent the simplest class of computer microprocessors. A scalar
processor processes one data item at a time.

Vector processor:

A vector processor, or array processor, is a central processing unit design that is
able to run mathematical operations on multiple data elements simultaneously.

4.2.1 Superscalar Processor

1. Superscalar processors are designed to exploit more instruction-level

parallelism in user programs.
2. Pipelining in superscalar processors: the fundamental structure of a
superscalar pipeline is illustrated in Fig. 4.
3. A superscalar processor of degree m can issue m instructions per cycle.

A superscalar pipeline design is shown in Fig. 28a. In this design, the processor can
issue two instructions per cycle if there is no resource conflict and no data dependence
problem.
4. For simplicity, we assume that each pipeline stage requires one cycle, except the
execute stage which may require a variable number of cycles. Four functional
units, multiplier, adder, logic unit, and load unit, are available for use in the
execute stage.
5. The multiplier itself has three pipeline stages, the adder has two stages, and the
others each have only one stage.

Lookahead Window

There is a lookahead window with its own fetch and decode logic. This
window is used for instruction lookahead when out-of-order instruction issue is
desired to achieve better pipeline throughput.


1. The performance improvement available from superscalar techniques is limited by
two key factors:

a) the degree of intrinsic parallelism in the instruction stream, i.e., the limited
amount of instruction-level parallelism;
b) the complexity and time cost of the dispatcher and the associated dependency-checking
logic.
Data Dependences:

Consider the example program in Fig. 28b. A dependence graph is drawn to indicate the
relationship among the instructions.

1. Because the register content in R1 is loaded by I1 and then used by I2, we have a
flow dependence: I1 → I2.

2. Because the result in register R4 after executing I4 may affect the operand register
R4 used by I3, we have an antidependence: I3 → I4.

3. Since both I5 and I6 modify register R6, and R6 supplies an operand for I6, we
have both flow and output dependences: I5 → I6 and I5 → I6, as shown in the
dependence graph.

Pipeline Stalling :

1. This is a problem which may seriously lower pipeline utilization.

2. Proper scheduling avoids pipeline stalling.
3. Stalling can be caused by data dependences or by resource conflicts among
instructions already in the pipeline or about to enter the pipeline.

We use an example to illustrate the conditions causing pipeline stalling.

Consider the scheduling of two instruction pipelines in a two-issue superscalar
processor. Figure 29a shows the case of no data dependence on the left and flow
dependence (I1 → I2) on the right. Without data dependence, all pipeline stages are
utilized without idling.

With the dependence, instruction I2 entering the second pipeline must wait for two
cycles (shaded time slots) before entering the execution stages. This delay may also pass
to the next instruction, I4, entering the pipeline.

In Fig. 29b, we show the effect of branching (instruction I2). A delay slot of
four cycles results from a branch taken by I2 at cycle 5. Therefore, both pipelines must be
flushed before the target instructions I3 and I4 can enter the pipelines from cycle 6.

In Fig. 29c, we show a combined problem involving both resource conflicts and
data dependence. Instructions I1 and I2 need to use the same functional unit, and a flow
dependence I2 → I4 exists.

The net effect is that I2 must be scheduled one cycle behind, because the two
pipeline stages (e1 and e2) of the same functional unit must be used by I1 and I2 in an
overlapped fashion. For the same reason, I3 is also delayed by one cycle. Instruction I4 is
delayed by two cycles due to its flow dependence on I2. The shaded boxes in all the
timing charts correspond to idle stages.
Multipipeline Scheduling:

Instruction issue and completion policies are critical to superscalar processor

performance. Three scheduling policies are introduced below.
In-order issue: when instructions are issued in program order, we call it in-order issue.
When program order is violated, out-of-order issue is being practiced.

In-order completion: if the instructions must be completed in program order, it is called
in-order completion; otherwise, out-of-order completion.

In-order issue is easier to implement but may not yield the optimal performance.
In-order issue may result in either in-order or out-of-order completion.
Out-of-order issue usually ends up with out-of-order completion. The purpose
of out-of-order issue and completion is to improve performance.

In-Order Issue

Figure 30a shows a schedule for the six instructions being issued in program order I1, I2,
..., I6. Pipeline 1 receives I1, I3, and I5, and pipeline 2 receives instructions I2, I4, and I6
in three consecutive cycles. Due to I1 → I2, I2 has to wait one cycle to use the data
loaded in by I1.

I3 is delayed one cycle for the same adder used by I2. I6 has to wait for the
result of I5 before it can enter the multiplier stages. In order to maintain in-order
completion, I5 is forced to wait for two cycles to come out of pipeline 1. In total, nine
cycles are needed, and five idle cycles (shaded boxes) are observed.

In Fig. 30b, out-of-order completion is allowed even though in-order issue is practiced.

The only difference between this out-of-order schedule and the in-order schedule is that
I5 is allowed to complete ahead of I3 and I4, which are totally independent of I5. The
total execution time does not improve; however, the pipeline utilization rate does.

Only three idle cycles are observed. Note that in Figs. 30a and 30b, we did not
use the lookahead window. In order to shorten the total execution time, the window can
be used to reorder the instruction issues.

Out-of-Order Issue

By using the lookahead window, instruction I5 can be decoded in advance because it is
independent of all the other instructions. The six instructions are issued in three cycles as
shown: I5 is fetched and decoded by the window, while I3 and I4 are decoded
concurrently.

This is followed by issuing I6 and I1 at cycle 2, and I2 at cycle 3. Because the issue
is out of order, the completion is also out of order, as shown in Fig. 30c. Now the total
execution time has been reduced to seven cycles, with no idle stages during the execution
of these six instructions.

In-order issue and completion is the simplest policy to implement. It is rarely
used today, even in a conventional scalar processor, due to the unnecessary delays in
maintaining program order.

However, in a multiprocessor environment, this policy is still attractive. Out-of-order
completion can be found in both scalar and superscalar processors.

Simple data dependence checking, a small lookahead window, and scoreboarding
mechanisms are needed, along with an optimizing compiler, to exploit instruction
parallelism in a superscalar processor.
Definition :

• Operation latency (OL):

The number of cycles until the result of an instruction is available.

• (Instruction) Issue Latency (IL):

The number of cycles required between issuing two consecutive instructions.

• (Instruction) Issue Rate (Issue Parallelism) (IP):

The maximum number of instructions that can be issued in a cycle. (degree)

• Maximum Instruction-level Parallelism (MILP):

The maximum number of simultaneously executing instructions

A base scalar processor:

1. OL = 1 cycle
   Single-cycle latency for a simple operation.

2. IL = 1 cycle
   One cycle between issuing two consecutive instructions; a base scalar processor
   can be defined as a machine with one instruction issued per cycle.

3. IP = 1 instruction / cycle
   The issue degree is one.

4. MILP = 1 × k = k
   At most k instructions (one per stage) execute simultaneously.

The instruction pipeline can be fully utilized if successive instructions can enter it
continuously at the rate of one per cycle, as shown in Fig. 1.

The time required to execute N instructions is

T(1,1) = k + (N − 1)

For example, with k = 4 and N = 10:

T(1,1) = 4 + (10 − 1) = 13 base cycles
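As a runnable check of the base cycle count (function name is my own):

```python
def t_base(k, n):
    """Base scalar machine: k-stage pipeline, one instruction issued per cycle."""
    return k + (n - 1)

print(t_base(4, 10))  # 13 base cycles, matching the worked example
```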

Superscalar Performance:

To compare the relative performance of a superscalar processor with that of the scalar base
machine, we estimate the ideal execution time of N independent instructions through the
pipelines.

The time required by the scalar base machine is

T(1,1) = k + N − 1 (base cycles)

For an m-issue superscalar machine:

• OL (operation latency) = 1 cycle
  Single-cycle latency for a simple operation.

• IL (instruction issue latency) = 1 cycle
  One cycle between issuing two consecutive groups of instructions.

• IP (instruction issue rate) = m instructions / cycle
  The issue degree is m.

• MILP = m·k
  At most m·k instructions can be executing simultaneously.
The ideal execution time required by an m-issue superscalar machine is

T(m,1) = k + (N − m)/m (base cycles)

where k is the time required to execute the first m instructions through the m
pipelines simultaneously, and the second term corresponds to the time required to execute
the remaining N − m instructions, m per cycle, through the m pipelines.

For example, with k = 4, N = 12, and m = 3:

T(3,1) = 4 + (12 − 3)/3 = 7 base cycles

The ideal speedup of the superscalar machine over the base machine is

S(m,1) = T(1,1) / T(m,1) = (N + k − 1) / (N/m + k − 1) = m(N + k − 1) / (N + m(k − 1))

As N → ∞, the speedup limit S(m,1) → m, as expected.
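The two timing formulas and the speedup limit can be checked numerically (function names are my own):

```python
def t_base(k, n):
    return k + (n - 1)            # scalar base machine

def t_superscalar(k, n, m):
    return k + (n - m) / m        # m-issue machine, ideal schedule

def speedup(k, n, m):
    return t_base(k, n) / t_superscalar(k, n, m)

print(t_superscalar(4, 12, 3))          # 7.0, as in the worked example
print(round(speedup(4, 10**7, 3), 3))   # 3.0 (approaches m = 3 for large N)
```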

Superpipelined Design

In a superpipelined processor of degree n, the pipeline cycle time is 1/n of the

base cycle. As a comparison, while a fixed-point add takes one cycle in the base scalar
processor, the same operation takes n short cycles in a superpipelined processor
implemented with the same technology.

Superpipelining is not possible without a high speed clocking mechanism.

• OL (operation latency) = n minor cycles = 1 baseline cycle
  A simple operation has a latency of n pipeline cycles; the pipeline cycle time is 1/n
  of the baseline cycle.

• IL (instruction issue latency) = 1 minor cycle = 1/n baseline cycle

• IP = 1 instruction / minor cycle = n instructions / baseline cycle

• MILP (maximum number of simultaneously executing instructions) = n·k

Superpipeline Performance:

The minimum time required to execute N instructions for a superpipelined machine of
degree n with k stages in the pipeline is

T(1,n) = k + (N − 1)/n (base cycles)

Thus, the potential speedup of a superpipelined machine over the base machine is

S(1,n) = T(1,1) / T(1,n) = (k + N − 1) / (k + (N − 1)/n) = n(k + N − 1) / (nk + N − 1)

The speedup S(1,n) → n as N → ∞.

Superpipelined Superscalar Design :

Superpipeline and superscalar approaches can be combined in building so called

superpipelined superscalar processors.

A superpipelined superscalar processor of degree (m,n) is shown below with

(m,n) = (3,3). This machine executes m instructions every cycle with a pipeline cycle 1/n
of the base cycle. Simple operation latency is n pipeline cycles. The level of parallelism
required to fully utilize this machine is mn instructions.

• OL = n minor cycles = 1 baseline cycle
  A simple operation has a latency of n pipeline cycles; the pipeline cycle time is 1/n
  of the baseline cycle.

• IL = 1 minor cycle = 1/n baseline cycle

• IP = m instructions / minor cycle = nm instructions / baseline cycle

• MILP (maximum number of simultaneously executing instructions) = n·m·k

The time required to execute N instructions is

T(m,n) = k + (N − m)/(mn) (base cycles)

Thus the speedup over the base machine is

S(m,n) = T(1,1) / T(m,n) = (k + N − 1) / (k + (N − m)/(mn)) = mn(k + N − 1) / (mnk + N − m)

The speedup limit S(m,n) → mn as N → ∞.
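A combined sketch of the (m, n) timing model; setting m = n = 1 should recover the base machine (function names are my own):

```python
def t_base(k, n_instr):
    return k + (n_instr - 1)

def t_super(k, n_instr, m, n):
    """m-issue superpipelined machine of degree n (minor cycle = 1/n base cycle)."""
    return k + (n_instr - m) / (m * n)

def speedup(k, n_instr, m, n):
    return t_base(k, n_instr) / t_super(k, n_instr, m, n)

# With m = n = 1, the model reduces to the base machine (speedup 1):
print(speedup(4, 100, 1, 1))             # 1.0
# For large N, the speedup approaches m*n, here 3*3 = 9:
print(round(speedup(4, 10**7, 3, 3), 2)) # 9.0
```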

Pipeline Scheduling & Reservation Analysis

A pipeline may have a feedback structure; that is, data paths among function units
may have directed loops, as shown in the figure.

Example of a feedback pipeline

The timing information of a pipeline is fully described by a simple table called a

reservation table, which specifies the function units that are busy at each clock
when a task is processed without overlapping execution.

Example of ``reservation table''

clock 0 1 2 3 4 5 6
unit0 X . . . X X .
unit1 . X . . . . .
unit2 . . X . . . .
unit3 . . . X . . .
unit4 . . . . . . X

In reservation tables, 'X' means "the function unit is busy at that clock"
and '.' means "the function unit is not busy at that clock".

Once a task enters the pipeline, it is processed by unit0 at the first clock, by unit1
at the second clock, and so on. It takes seven clock cycles to perform a task.

Example of ``conflict''

clock 0 1 2 3 4 5 6 7
unit0 0 1 . . 0 C 1 .
unit1 . 0 1 . . . . .
unit2 . . 0 1 . . . .
unit3 . . . 0 1 . . .
unit4 . . . . . . 0 1
(`0's and `1's in this table except those in the first row represent tasks 0 and 1,
respectively, and `C' means the conflict.)

Notice that no special hardware is provided to avoid simultaneous use of the same
function unit.

Therefore, a task must not be started if it would conflict with any tasks being processed.

For instance:

If two tasks, say task 0 and task 1, were started at clock 0 and clock 1, respectively, a
conflict would occur on unit0 at clock 5.

This means that you should not start two tasks with a single-cycle interval.

This invalid schedule is depicted in the process table shown above, which is obtained by
overlapping two copies of the reservation table, with one shifted to the right by 1 clock.
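Forbidden latencies follow directly from the reservation table: a latency L is forbidden if some unit is busy at two clocks that are L apart, so starting a second task L cycles later would collide. A small sketch for the table above (the helper name is my own):

```python
# Reservation table from the example: for each unit, the clocks at which it is busy.
table = {
    "unit0": [0, 4, 5],
    "unit1": [1],
    "unit2": [2],
    "unit3": [3],
    "unit4": [6],
}

def forbidden_latencies(table):
    """A latency L is forbidden if some unit is busy at clocks t and t + L."""
    forbidden = set()
    for clocks in table.values():
        for a in clocks:
            for b in clocks:
                if b > a:
                    forbidden.add(b - a)
    return sorted(forbidden)

print(forbidden_latencies(table))  # [1, 4, 5] -- latency 1 causes the clock-5 conflict
```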
Reservation Analysis

1. In a static pipeline, it is easy to partition a given function into a sequence
of linearly ordered subfunctions.

2. Function partitioning in a dynamic pipeline becomes quite involved
because the pipeline stages are interconnected with loops in addition to
streamline connections.
A multifunction dynamic pipeline is shown in Fig. 3a. This pipeline has three
stages. Besides the streamline connections from S1 to S2 and from S2 to S3, there are:

1. a feedforward connection from S1 to S3, and

2. two feedback connections, from S3 to S2 and from S3 to S1.

With these connections, the output of the pipeline is not necessarily from the last
stage. In fact, following different dataflow patterns, one can use the same pipeline
to evaluate different functions.
Reservation Tables
1. The reservation table for a static linear pipeline is trivial in the sense that
dataflow follows a linear streamline.
2. The reservation table for a dynamic pipeline becomes more interesting
because a nonlinear pattern is followed. Given a pipeline configuration,
multiple reservation tables can be generated for the evaluation of different
functions.
Two reservation tables are given in Figs. 3b and 3c, corresponding to a function X
and a function Y, respectively. Each function evaluation is specified by one
reservation table.

A static pipeline is specified by a single reservation table. A dynamic pipeline

may be specified by more than one reservation table.

For example:

The function X requires eight clock cycles to evaluate, and function Y requires
six cycles, as shown in Figs. 3b and 3c, respectively.

Static versus dynamic reservation table:

1. A pipeline initiation table corresponds to each function evaluation. All

initiations to a static pipeline use the same reservation table.

2. A dynamic pipeline may allow different initiations to follow a mix of

reservation tables.
The Instruction Execution Unit

Instructions are executed in hardware as follows: the content of the program counter (PC)
is supplied to the instruction cache, and an instruction word is read out from the specified
location.

rd – specifies the destination register (for register-type instructions)

rt – designates the destination register (for immediate-type instructions)

Once an instruction has been read out from the instruction cache, its various fields are
separated and each is dispatched to the appropriate place:

1. the op and fn fields go to the control unit,

2. whereas rs, rt, and rd are sent to the register file.
3. The upper input of the ALU always comes from register rs, whereas its
lower input can be either the content of rt or the immediate field of the
instruction.
4. For many instructions, the output of the ALU is stored in a register; in such
cases, the data cache is bypassed.

A Single-Cycle Data Path

The data path is composed of:

1. the program counter
2. the instruction cache
3. the register file
4. the ALU
5. the data cache.

There are three multiplexers used:

1. input to the register file

2. lower input of the ALU
3. outputs of the ALU and data cache.
1. Input to the register file:
(i) rt, rd, or $31 can be used as the index of the destination register into which a
result will be written.
(ii) The choice is made by a pair of logic signals, RegDst, supplied by the control unit.
     RegDst is set to:
     a) 00 for selecting rt
     b) 01 for rd
     c) 10 for $31.
(iii) Of course, not every instruction writes a value into a register. Writing into a
register requires that the RegWrite control signal be asserted by the control
unit; otherwise, regardless of the state of RegDst, nothing is written into the
register file.
(iv) The instruction cache does not receive any control signal, because an
instruction is read out in every cycle (i.e., the clock signal serves as read
control for the instruction cache).
2. Lower input of the ALU:
a) The control unit chooses either the content of rt or the 32-bit sign-extended
version of the 16-bit immediate operand.
b) The first (top) input always comes from rs.
c) This choice is controlled by asserting or deasserting the control signal
ALUSrc. If this signal is deasserted (has the value 0), the content of rt is
used as the lower input to the ALU; otherwise, the immediate operand,
sign-extended to 32 bits, is used.
3. Outputs of the ALU and data cache:
a) This multiplexer allows the word supplied by the data cache, the output from the
ALU, or the incremented PC value to be sent to the register file for writing; the choice
is effected by a pair of control signals, RegInSrc,
b) set to 00 for choosing the data cache output, 01 for the ALU output, and 10 for the
incremented PC value coming from the next-address block.
In this case the data path is capable of executing one instruction per clock cycle; hence
the name "single-cycle data path."
The ALUFunc control signal bundle contains 5 bits:
1. one bit for adder control (Add, Sub)
2. two bits for controlling the logic unit (logic function)
3. two bits for controlling the rightmost multiplexer.

The ALU also produces an output signal indicating overflow in addition or subtraction.

In the case of arithmetic and logic instructions, the ALU result must be stored in
the destination register and is thus forwarded to the register file through the
feedback path. In the case of memory-access instructions, the ALU output is a data
address, used for writing into the data cache (DataWrite asserted) or reading from it
(DataRead asserted). In the latter case, the data cache output, which appears after a
short latency, is sent through the lower feedback path to the register file for writing.

The five instruction execution steps:

(1) Instruction fetch
(2) Instruction decode and register access
(3) ALU operation
(4) Data memory access
(5) Register write-back.
Figure 1 shows the execution of five instructions in such a pipelined fashion.
There are at least two strategies for achieving greater performance:

Strategy 1:
Use multiple independent data paths that can accept several instructions that are
read out at once: this leads to a multiple-instruction-issue, or superscalar,
organization.

Strategy 2:
Overlap the execution of several instructions in the single-cycle design, starting
the next instruction before the previous one has run to completion: this leads to a
pipelined or superpipelined organization.

Pipeline Timing and Performance

A q-stage pipeline ideally increases instruction execution throughput by a factor of q. This is

because without any overhead or dependency of any kind, and assuming that a task of latency t is
divisible into q stages having equal latencies t / q, a new task can be initiated every t / q time units
instead of every t time units without pipelining.

In this section, we aim to determine the actual performance gain by modeling the effects
of overheads and dependencies.

[Figure 6: pipelined form of a function unit with latching overhead; q stages, with results latched between stages.]

The overhead due to unequal stage latencies and the latching of signals between
stages can be modeled by taking the stage delay to be t/q + r instead of the ideal t/q, as
in Figure 6.
Then the increase in throughput drops from the ideal of q to

Throughput increase in a q-stage pipeline = t / (t/q + r) = q / (1 + q·r/t)

So the ideal throughput enhancement factor of q can be approached, provided the
cumulative pipeline latching overhead q·r is considerably smaller than t.

Figure 7 plots the throughput improvement as a function of q for different overhead
ratios r/t. We note that the relative magnitude r/t of the per-stage time overhead limits
the maximum performance that can be attained.

For example:

Suppose t = 8 ns (corresponding to the latency of the single-cycle implementation),
q = 5, and q·r = 2 ns (i.e., r = 0.4 ns).

This leads to a throughput improvement factor of 8/(8/5 + 0.4) = 4 with our 5-stage
pipeline.
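The same computation as a runnable check (function name is my own):

```python
def throughput_gain(t, q, r):
    """Throughput increase of a q-stage pipeline with per-stage latching
    overhead r, relative to a nonpipelined unit of latency t."""
    return t / (t / q + r)

# Worked example from the text: t = 8 ns, q = 5, total overhead q*r = 2 ns.
print(throughput_gain(t=8, q=5, r=0.4))  # 4.0
```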

Note, however, that we have not yet considered the effects of data and control
dependencies.

1. To keep our analysis simple, we assume that most data dependencies are
resolved through data forwarding, with a single bubble inserted in the pipeline
only for a read-after-load data dependency (Figure 4).
2. Similarly, a fraction of branch instructions (those for which the compiler does
not succeed in filling the branch delay slot with a useful instruction) lead to the
insertion of a single bubble in the pipeline.
3. In any case, no type of dependency results in the insertion of more than one
bubble. Let the fraction of all instructions that are followed by a bubble in the
pipeline be β. Such bubbles effectively reduce the throughput by a factor of
1 + β, making the throughput equation

Throughput increase with dependencies = q / [(1 + q·r/t)(1 + β)]