
EE 457 Midterm Review

Unit 1: Digital Logic

Unit 2a: Fixed Point Arithmetic


Sign Extension
In unsigned systems, extension is simply a matter of adding 0s in front of the most significant bit (zero extension). Sign extension in signed (2's complement) systems is slightly more complicated: if the number is negative, sign extension is performed by adding 1s in front of the MSB; if the number is positive, by adding 0s. In other words, the sign bit is replicated.
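As a quick illustration, here is a minimal Python sketch of both cases (the helper names are mine, not part of the notes):

```python
def zero_extend(value, from_bits, to_bits):
    # Unsigned: the upper bits are simply filled with 0s.
    return value & ((1 << from_bits) - 1)

def sign_extend(value, from_bits, to_bits):
    # Signed (2's complement): replicate the MSB into the upper bits.
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):          # MSB is 1 -> negative
        value |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return value

print(format(sign_extend(0b1010, 4, 8), '08b'))  # 11111010 (-6 stays -6)
print(format(zero_extend(0b1010, 4, 8), '08b'))  # 00001010 (10 stays 10)
```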
Overflow
Overflow occurs when the result of an arithmetic operation on valid representations of numbers
falls outside the valid representable range. If overflow occurs, the resulting value is greater than
the maximum representable value or less than the minimum representable value.
Unsigned Binary: If adding two values causes a carry out, overflow has occurred. If subtracting two values does not result in a carry out, overflow (a borrow) has occurred.
2's Complement: If a 0 is carried into the most significant column and a 1 is carried out, or vice versa, overflow has occurred. Thus, in the addition of two 4-bit numbers, C3 XOR C4 tells us whether overflow has occurred.
Occasionally, we need to know more than just whether overflow has occurred:
For example, say we want to output a fifth bit, S4 in the 2s complement addition of 4-bit
numbers A and B. OVERFLOW is determined through C3 XOR C4. If 4-bit signed OVERFLOW
did not occur, then S4 and S3 should be the same and if 4-bit signed overflow did occur, S4 and
S3 should be different. Thus S4 is determined from XOR logic between S3 and the OVERFLOW
signal.
In another example, say we want to determine whether A < B, where A and B are 4-bit signed
numbers. In this case, we use an adder to perform the subtraction (A - B). We determine
OVERFLOW through C3 XOR C4. If OVERFLOW did not occur, then A < B if the result was
negative, or S3 = 1. Conversely, if OVERFLOW did occur, then A < B if the result was positive,
or S3 = 0. Thus A < B is the result of S3 XOR OVERFLOW.
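A small Python sketch of these rules (bit and signal names follow the text; this is an illustration of the logic, not a hardware description):

```python
def add4(a, b, cin=0):
    # 4-bit two's complement addition, tracking C3 (carry into bit 3) and C4 (carry out).
    s, c, c3_in = 0, cin, 0
    for i in range(4):
        bit = ((a >> i) & 1) + ((b >> i) & 1) + c
        s |= (bit & 1) << i
        if i == 3:
            c3_in = c           # carry into the MSB column
        c = bit >> 1
    c4 = c
    overflow = c3_in ^ c4       # C3 XOR C4
    s3 = (s >> 3) & 1
    s4 = s3 ^ overflow          # extended (fifth) sum bit
    return s, overflow, s4

def signed_less_than(a, b):
    # A < B via A - B: add A to the complement of B with a carry-in of 1.
    diff, overflow, _ = add4(a, (~b) & 0xF, cin=1)
    s3 = (diff >> 3) & 1
    return bool(s3 ^ overflow)   # S3 XOR OVERFLOW

print(add4(0b0111, 0b0110))              # (13, 1, 0): sum bits 01101 give +13 once S4 is appended
print(signed_less_than(0b1000, 0b0111))  # True: -8 < 7
```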

Unit 2b: Adders


Ripple Carry Adder
A ripple carry adder is a multi-bit adder made up of chained full adder units. The full adder unit requires two 2-input XOR gates to compute the sum of X, Y, and Cin. The full adder also has a 2-level SOP logic gate configuration to determine the carry out: Cout = XY + XCin + YCin. Chaining these adder units together gives an n-bit RCA a total delay of roughly 2n gate delays, since each stage contributes two gate delays to the carry chain.
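A minimal Python sketch of the ripple structure (function names are mine):

```python
def full_adder(x, y, cin):
    s = x ^ y ^ cin                         # two chained XORs
    cout = (x & y) | (x & cin) | (y & cin)  # 2-level SOP carry logic
    return s, cout

def ripple_carry_add(a, b, n=4, cin=0):
    s, c = 0, cin
    for i in range(n):                      # carry ripples one stage at a time
        bit, c = full_adder((a >> i) & 1, (b >> i) & 1, c)
        s |= bit << i
    return s, c                             # n-bit sum and final carry out

print(ripple_carry_add(0b0111, 0b0101))     # (12, 0): 7 + 5 = 12
```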
Carry Lookahead Adder
Carry lookahead adders are built upon the idea that we can generate all the carries in two logic
levels of generate and propagate signals. A generate signal implies that a column will generate
a carry-out whether or not the carry in is 1. A propagate signal implies that a column will
propagate a carry-in if it is a 1.
Gi = Xi Yi
Pi = Xi + Yi
Using these signals, we can define the carry out of each column:
C(i+1) = Gi + Pi Ci
Expanding these equations for a 4-bit block, we get:
C1 = G0 + P0 C0
C2 = G1 + P1 G0 + P1 P0 C0
C3 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 C0
C4 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 C0
These equations form the carry-lookahead logic (CLL) that is fundamental to the fast-adder functionality. The carry-lookahead logic generates all the carries in 2 gate delays, and the generate and propagate signals are all produced in 1 gate delay. Like the RCA, the sum Si is still a function of Xi, Yi, and Ci, and is created with two chained XOR gates. However, since Xi and Yi are known at the start, the first XOR can be computed while the PG and CLL logic is working. Computing the sums then adds two more gate delays, which allows a 4-bit CLA to generate all the sums in 5 gate delays instead of the 8 that a 4-bit RCA would require.
A CLA has a delay proportional to log_b(n), where b is the blocking factor, or the number of carries produced in parallel. In a 4-bit CLA, the blocking factor is 4.
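A small Python sketch of the 4-bit carry-lookahead equations above (names follow the text; this models the logic values, not the gate timing):

```python
def cla4(a, b, c0=0):
    x = [(a >> i) & 1 for i in range(4)]
    y = [(b >> i) & 1 for i in range(4)]
    g = [x[i] & y[i] for i in range(4)]   # generate:  Gi = Xi Yi
    p = [x[i] | y[i] for i in range(4)]   # propagate: Pi = Xi + Yi
    c = [c0, 0, 0, 0, 0]
    for i in range(4):                    # C(i+1) = Gi + Pi Ci; in hardware the expanded
        c[i + 1] = g[i] | (p[i] & c[i])   # equations produce all four carries in parallel
    s = sum((x[i] ^ y[i] ^ c[i]) << i for i in range(4))  # Si = Xi XOR Yi XOR Ci
    return s, c[4]

print(cla4(0b0111, 0b0101))  # (12, 0)
```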

Unit 3: Instruction Set


Instruction Set Architecture
The instruction set is the vocabulary the HW can understand and the SW is composed with.
This architecture is the software's interface to the processor and memory system, and its instructions fall into three categories: (1) arithmetic/logic, (2) data transfer, and (3) control.
There are two main types of instruction sets: (1) complex instruction set computers (CISC) and
(2) reduced instruction set computers (RISC). CISC has a larger vocabulary, but because it
requires more work per instruction, it usually requires a slower clock. On the other hand, RISC
has a much more limited vocabulary, but because it requires less work per instruction, it can
have a faster clock cycle. RISC architectures are more common today than CISC architectures.
Addressing Modes

Addressing modes refer to how instructions specify where an operand is located. Load/Store operations often need to reference a specific memory location, and the addressing mode determines how that address is formed. Common modes include:
Direct addressing
Indirect addressing
Base addressing (indirect w/ offset)
Immediate addressing
MIPS Instructions
MIPS is a RISC style instruction set architecture. MIPS operates on a 32-bit internal and
external data size, where the registers and ALU are 32-bits wide and the memory bus is
logically 32-bits wide. All MIPS instructions are encoded as a single 32-bit word.
There are three important special purpose registers we should take note of: (1) the program counter PC, (2) the hi-half register HI, and (3) the lo-half register LO. The program counter holds the address of the next instruction to be fetched from memory and executed. The hi-half register stores the 32 MSBs of a multiply operation or the 32-bit remainder of a divide operation. The lo-half register stores the 32 LSBs of a multiply operation or the 32-bit quotient of a divide operation.
R-Type Instructions
Opcode=6 | Rs=5 | Rt=5 | Rd=5 | Shamt=5 | Func=6

Rs, Rt, and Rd are 5-bit fields for register numbers
Shamt indicates the number of places to shift bits
Opcode and Func identify the actual operation

I-Type Instructions
Opcode=6 | Rs=5 | Rt=5 | Immed=16

Rs and Rt are 5-bit fields for register numbers
Immed is a 16-bit constant
Opcode identifies the actual operation

J-Type Instructions
Opcode=6 | Jump address=26

Jump address is a 26-bit address, not a displacement
Opcode identifies the jump operation

Branch Instructions

A branch instruction is an I-Type instruction. The value specified by the 16-bit immediate is a signed displacement that is added to the program counter to specify the new PC location. Because the last two bits of an instruction address are always 0s, they are not included in the 16-bit immediate value, but rather tacked on after the instruction is decoded. This gives a branch instruction an 18-bit signed displacement range (roughly ±128 KB).
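A hedged sketch of how the branch target is formed from the 16-bit immediate, following the description above (the helper name is mine):

```python
def branch_target(pc, imm16):
    # Sign-extend the 16-bit immediate, append the two implicit 0 bits,
    # and add the result to PC + 4.
    if imm16 & 0x8000:
        imm16 -= 1 << 16                  # interpret as a signed 16-bit value
    return (pc + 4) + (imm16 << 2)        # << 2 restores the dropped 00 bits

print(hex(branch_target(0x00400000, 0x0003)))  # 0x400010: 3 instructions ahead of PC + 4
print(hex(branch_target(0x00400000, 0xFFFF)))  # 0x400000: displacement -1 lands on the branch itself
```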
Memory Organization
In MIPS, integer data comes in (1) bytes (8 bits), (2) halfwords (16 bits), and (3) words (32 bits).
Floating point data comes in (1) single precision (32 bits) and (2) double precision (64 bits).
Little vs. Big Endian
Endianness refers to ordering of bytes within a larger chunk. In a big-endian system, byte 0 is
at the big-end (left justified) of a word. In a little-endian system, byte 0 is at the little-end (right
justified) of a word.
Load/Store Instructions
There are three types of load and store instructions, covering the single byte, half word, and full word; byte and half-word loads additionally come in signed and unsigned variants. Loading a full 32-bit immediate requires two instructions (e.g., a lui followed by an ori), since an immediate field is only 16 bits wide.
Support for Subroutines

Unit 4: Computer System Performance


Latency vs. Bandwidth
Latency is the time from the start of an operation until the operation completes (sec). Bandwidth, or throughput, is the number of jobs completed per unit time (jobs/sec). Latency takes the perspective of a single task; bandwidth takes the perspective of multiple tasks. Bandwidth is not simply the inverse of latency because, oftentimes, multiple tasks can be completed in parallel.
Execution Time
Absolute execution time is the ultimate metric for comparing systems. Execution time is
preferable to clock rates because such rates are not necessarily normalized against clocks per
instruction.
MIPS
Computers can execute instructions very quickly, so the unit MIPS, or millions of instructions per second, is used to quantify the rate at which instructions are executed. However, the MIPS ratings of two devices alone cannot tell us whether one device is faster than the other. We also need the number of instructions a program requires to complete the same task on each device to normalize the comparison.
Performance

Performance is the inverse of execution time:

Performance = 1 / Execution Time

Performance can be modeled using three components: (1) Dynamic instruction count (total
number of instructions actually performed) [IC], (2) clocks per instruction (average number of
clock cycles to execute each instruction) [CPI], and (3) clock period or frequency [T].
Execution Time = IC × CPI × T = (IC × CPI) / f
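A quick numeric sketch of this model (the numbers below are invented for illustration):

```python
ic = 2_000_000          # dynamic instruction count
cpi = 1.5               # average clocks per instruction
freq = 2e9              # clock frequency in Hz (T = 1 / freq)

exec_time = ic * cpi / freq
performance = 1 / exec_time
print(f"Execution time = {exec_time * 1e3:.3f} ms")    # 1.500 ms
print(f"Performance    = {performance:.1f} runs/sec")  # ~666.7
```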
Speedup
When comparing relative performance, we can use a metric known as speedup. Speedup is
how many times faster a new system is than an older one:
Speedup = Performance_new / Performance_old = Execution Time_old / Execution Time_new
Instruction Count
There are two types of instruction counts: (1) static instruction count and (2) dynamic instruction
count. Static instruction count is the number of written instructions, but dynamic instruction
count is the trace count, or how many instructions were actually executed at run time.
CPI
CPI = CPU Clock Cycles / Instruction Count

CPU Clock Cycles = IC × CPI

Execution Time = IC × CPI × T
The average CPI can be calculated by analyzing the instructions in a code sequence and
summing the product of the CPI of an individual instruction type and the probability of that
instruction (via the dynamic instruction count).
Average CPI = Σi (CPIi × Fi), where Fi = ICi / IC is the fraction of dynamic instructions of type i
Performance Aspects
Algorithm: software-level component that affects instruction count and CPI by determining how many instructions, and which kinds, are executed
Programming Language: software-level component that affects instruction count and CPI by determining the constructs that need to be translated and the kinds of instructions used
Compiler: software-level component that affects instruction count and CPI through the efficiency with which a programming language is translated into instructions
Instruction Set: hardware-level component that affects instruction count, CPI, and clock cycle by specifying the instructions that are available and the work each instruction actually performs
Microarchitecture: hardware-level component that affects CPI and clock cycle by determining how each instruction is actually executed
Other Performance Measures
Programs are either memory bound or computationally bound.
Memory bound systems are limited by memory bandwidth (bytes/sec): the maximum
bytes of memory per second that can be read/written.
Computationally bound systems are limited by the ALU operations per second
(OPS/FLOPS): the maximum number of arithmetic operations per second the processor
can achieve
Computers are also compared using performance/watt.
Amdahl's Law
Amdahl's law describes the overall performance gain obtained by improving a part of a system.

Execution Time_new = Execution Time_unaffected + (Execution Time_affected / Improvement Factor)

Speedup_overall = Execution Time_old / Execution Time_new = 1 / ((1 − Fraction_affected) + (Fraction_affected / Improvement Factor))
Amdahl's law says that a small improvement on a large piece of a program is more important than a large improvement on a small piece of a program. Amdahl's law requires the percentage of execution time, not the percentage of CPI or instruction count.
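A quick numeric sketch of Amdahl's law (the fractions and improvement factors are invented):

```python
def amdahl_speedup(fraction_affected, improvement):
    # Overall speedup when `fraction_affected` of the execution TIME
    # (not CPI or instruction count) is sped up by `improvement`x.
    return 1 / ((1 - fraction_affected) + fraction_affected / improvement)

print(amdahl_speedup(0.80, 2))    # big piece, modest gain  -> ~1.67x
print(amdahl_speedup(0.10, 100))  # small piece, huge gain  -> ~1.11x
```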

Unit 5: Single Cycle CPU

Single Cycle CPU


A single cycle CPU executes each instruction in one clock cycle. In other words, one clock cycle is necessary to execute any instruction: CPI = 1.
In order to assure that our processor operates correctly, our slowest instruction must be able to complete execution correctly in one clock cycle. Thus, the time of the slowest instruction execution limits the fastest clock frequency. On one level, this is a disadvantage of the Single Cycle CPU, because shorter operations end up wasting time during a portion of the clock cycle. On the other hand, the design is simple and easy to implement.
Single Cycle Datapath
A datapath contains all the functional units and connections necessary to implement an
instruction set architecture. In this case, we are studying the functional units to implement a
subset of the MIPS architecture: Load Word (LW), Store Word (SW), Arithmetic and Logic
Instructions (ADD, SUB, AND, OR, SLT), and Branch and Jump Instructions (BEQ, J).
In order to implement these instructions, the datapath must be able to accommodate the
following operations:

Instruction fetch - grab the instruction located at PC
Instruction decode - parse the instruction to determine the opcode, register values, immediate values, etc.
Read registers - grab whatever register values we need from the register file
Execute - execute the instruction's arithmetic and logic portions
Write register - write a new register value into the register file, if necessary
Update PC - if we executed a branch instruction and the branch was taken, set the PC to the target of the branch instruction; otherwise, set PC to PC + 4
Repeat - start the cycle over again

These operations can be separated into five stages:

1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Execute (EX)
4. Memory Access (MEM)
5. Write Back (WB)
Instruction Fetch (IF)
In the instruction fetch datapath, the address in PC provides an instruction cache (I-Cache) with
an address that is then decoded in the next stage. Additionally, in this stage, the program
counter is incremented and a mux chooses between a branch or jump address or the
incremented program count. Thus, instruction fetch requires a (1) PC register, (2) I-Mem, (3)
Adder (to increment the PC), and (4) a Mux.
The adder in the instruction fetch datapath cannot be the ALU because we cannot share
resources in a single-cycle CPU. A write enable is not required on the PC register, because the
PC register increments on every clock cycle.
Instruction Decode (ID)
In the instruction decode datapath, the instruction pulled from the I-Mem is decoded in order to
pull the appropriate resources to execute the operation and produce the control signals. The
instruction decode stage requires (1) the register file, (2) a sign extension unit, (3) the control
unit, and (4) a few muxes.
The register file has a write enable, REGWrite, to indicate whether write data should be written
to a register or not. Read enables are not required because unintentionally reading does not
change the state of the processor. While it may be unnecessary, reading data out of the register
file will not cause any harm.
The sign extension unit simply replicates the MSB of the 16-bit immediate. Thus, if the MSB is a 1, the sign extension writes 1s to the upper 16 bits, and if it is a 0, it writes 0s.
Execute (EX)
In the execute data path, the ALU takes inputs from the register file and performs add, sub, and,
or, and slt operations. The results are written back to a destination register.
Memory Access (MEM)
In the memory access data path, operands are read from the register file and the offset is sign
extended. The ALU calculates the effective address, and a D-Mem is accessed to perform load
and store words. In the case of a load word, the output from the D-Mem is written back to a
register.
The single cycle CPU requires a separate I-Mem and D-Mem to allow instruction fetch and
read/write operations in the same clock. The D-Mem only requires one address input because
we only read or write memory in one clock. Unlike the register file, the memory does need a read enable because a memory read consumes more energy than a register read; thus, we don't want to read or write the memory unnecessarily.
Write Back (WB)
Write back datapath is really a subset of the execute and memory access datapaths. The write
back datapath simply allows us to write data from the ALU or the D-Mem back to a register in
the reg-file.
Branch Datapath
In the case of a branch instruction, the ALU is used to compare two register values from the reg
file. In the case of a beq, if the comparison results in an ALU output of ZERO=1, the sign
extended, left shifted branch target address is written to the PC.
The calculation of the branch target address and the ALU comparison can happen in parallel.
Whether or not the branch is actually taken is a non-issue.
Control Signals
A single cycle CPU is not time dependent like a normal state machine. A single cycle CPU
always performs the same behavior: IF, ID, EX, MEM, WB, so there is technically only one
state. Thus, a single cycle CPU does not require Next State Logic (NSL) or State Memory (SM).
A SCCPU only requires output function logic in the form of a control unit that produces the following signals to control the datapath.
The Control signals are generated by decoding the OpCode (Instruction[31:26]) into R-Type,
LW, SW, BEQ, and JUMP instructions. The following control signals are generated from these
decoded signals.
Jump
The Jump mux catches the output of the Branch, or PCSrc Mux, and updates the next PC
address to a Jump address if necessary. The Jump Address is given by
{NewPC[31:28],Inst[25:0],00}. When Jump is 1, this address is sourced to the PC, and when it is 0, the output of the PCSrc mux (PC + 4 or the branch target) is sent to the PC.
Branch
Branch is a signal generated by the control unit, when a branch instruction is received. For a
branch to actually occur, BRANCH must be 1 and the ALU comparison of the two read registers
must result in a ZERO = 1. The ANDing of these two signals results in the PCSrc select signal,
which chooses between the PC + Offset and the PC + 4 and feeds the PC register. PC + Offset
is generated by sign extending the offset value and shifting it left by two. The operation is ultimately (PC + 4) + (Offset × 4); thus, the offset value encoded in the instruction is really DesiredOffset / 4. The PC offset value
requires its own adder because resources cannot be shared in a Single Cycle CPU.
MemRead

MemRead is a read enable on the D-Mem. The D-Mem requires a read enable because, unlike register reads, memory reads require more energy. An unnecessary read could also access an invalid address or result in a cache miss, which would waste time or displace necessary data.
MemWrite
MemWrite is a write enable on the D-Mem. The D-Mem requires a write enable to prevent
writing when a SW instruction is not received.
MemtoReg
The MemtoReg mux controls the writeback value to the register file. ALU instructions want to
writeback the result from the ALU while LW wants to writeback the data from the D-Mem. When
MemtoReg is a 0, the ALU output is written back and when MemtoReg is a 1, the D-Mem output
is written back. The output of this mux feeds the Write Data input in the register file.
ALUSrc
The ALUSrc Mux controls the second input of the ALU. ALU instructions want to feed the Read
Register 2 data to the 2nd input of the ALU while LW/SW use the 2nd input of the ALU to form an
effective address for the D-Mem. The ALUSrc select signal chooses the register input at 0 and
the offset value at 1.
RegDst
The RegDst Mux selects which instruction field supplies the destination register number for the register file. ALU instructions write to rd, specified by Instruction[15:11], while LW instructions write to rt, specified by Instruction[20:16]. The RegDst select signal chooses rd when 1 and rt when 0.
RegWrite
RegWrite is a write enable signal for the RegFile. This write enable is necessary because we
have certain instructions like BEQ and SW that do not cause a register to be updated. Thus, we
need to assure that the register file is only written to during the appropriate instruction.
ALUOp and Func
The Single Cycle CPU has a separate ALU Control Unit to provide control signals to the ALU,
telling it which operations need to be performed. The inputs to the ALU Control Unit are the
ALUOp and Func. The ALUOp is a two bit signal generated by the control unit, and it specifies
between LW/SW operations (ADD), Branch operations (SUB), and R-Type operations (Func).
While LW/SW and Branch operations require fixed ALU computations, the Func signal is used
to determine the specific ALU computation that must be applied during the current instruction.
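Putting the control signals together, here is a hedged sketch of a decoder table consistent with the descriptions above (the values follow the standard single-cycle MIPS control; don't-care values are simply written as 0, and the ALUOp encoding is symbolic):

```python
# Control word per decoded instruction class:
CONTROL = {
    "R-type": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, Jump=0, ALUOp="func"),
    "lw":     dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, Jump=0, ALUOp="add"),
    "sw":     dict(RegDst=0, ALUSrc=1, MemtoReg=0, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, Jump=0, ALUOp="add"),
    "beq":    dict(RegDst=0, ALUSrc=0, MemtoReg=0, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, Jump=0, ALUOp="sub"),
    "j":      dict(RegDst=0, ALUSrc=0, MemtoReg=0, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=0, Jump=1, ALUOp="add"),
}

print(CONTROL["lw"])   # LW: ALUSrc=1 (offset), MemRead=1, MemtoReg=1, RegWrite=1, RegDst=0 (rt)
```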

Unit 6a: Pipelining


Pipelining is a technique that overlaps the execution of multiple instructions at once. By
separating the single-cycle CPU datapath into separate stages, we can actually allow each
stage to operate on a different instruction during a single clock cycle. This process improves
throughput, because instead of operating at a clock rate limited by the execution of the entire
single cycle CPU datapath, we can operate at a clock rate limited by the slowest pipelined
stage.
Note that an N-stage pipeline does not necessarily realize an Nx speedup because (1) the stages may not all require the same amount of time, so some time is wasted, (2) there is overhead in filling up the pipe initially, (3) there is overhead (setup time and clock-to-Q delay) from the stage registers, and (4) branches and data hazards make it impossible to keep the pipe full at all times.
Pipelined Timing
The execution time of n instructions using a k-stage datapath (in units of one stage delay):
Without pipelining: n × k
With pipelining: k + (n − 1)
In the pipelined case, we require k cycles to execute the first instruction plus (n − 1) cycles to execute the remaining (n − 1) instructions, assuming the pipeline remains full the entire time.
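A quick numeric sketch of this timing model (the instruction and stage counts are arbitrary):

```python
def pipeline_cycles(n_instructions, k_stages):
    unpipelined = n_instructions * k_stages        # each instruction uses all k stages serially
    pipelined = k_stages + (n_instructions - 1)    # fill the pipe, then one instruction finishes per cycle
    return unpipelined, pipelined

n, k = 1000, 5
u, p = pipeline_cycles(n, k)
print(u, p, f"speedup = {u / p:.2f}x")   # 5000 1004 speedup = 4.98x
```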
Pipelining with the Single Cycle CPU
Ultimately, pipelining requires each stage to have its own resources, because each stage would
be operating a different instruction at the same time. Thus, the single cycle CPU can be
separated into stages for pipelining.
Register File and Information Flow
Pipelining requires a linear flow of information in order to work correctly, but the single cycle CPU requires us to write back to the register file in some cases. This does not violate the linear flow of the datapath, because the write-back uses a different half of the register file (the write port) than a register read (the read ports), so both operations can be done during a single clock cycle without disrupting each other. However, an issue arises when we want to read and write the same register. For this case, we design register files with internal forwarding to immediately pass the write data to the output when a read and a write target the same register.
Basic 5-Stage Pipeline
The 5-stage pipeline works by adding pipeline registers between each stage. These pipeline
registers store all the information needed for any instruction in each stage. The Single Cycle CPU is divided into the following stages: (1) Fetch, (2) Decode, (3) Execute, (4) Memory Access, and (5) Write Back.
In the write-back stage, we must preserve the write register and transfer it from the decode
stage all the way to the write back stage in order to operate the pipelined SCCPU correctly.
Pipeline Control

To create the 5-stage pipeline with a single cycle CPU, we must create separate pipeline
registers for each stage. In addition to these pipeline registers, we must add special stage
control registers to transfer the control signals throughout the pipeline as well. Like the PC, the
pipeline registers update on each clock cycle, so no write enable signals are needed on these
registers.
While all the control signals are generated in one clock cycle, they are all consumed in
subsequent clock cycles in the Execution, Memory, and WriteBack stages.
Instruction Fetch
This stage does not require special control signals because we should always read from the
instruction memory and write the PC on each clock cycle.
Instruction Decode & Register File
This stage also does not require special control signals, because we should always decode the
instruction into control signals and read from the register file.
Execution
The signals RegDst, ALUOp/Func, and ALUSrc are required to properly execute the ALU
operation and store the appropriate destination register for writeback.
Memory
The signals Branch, MemRead, and MemWrite are required to operate the D-Mem and determine whether a branch should be executed.
WriteBack
The signals RegWrite and MemtoReg are required to control the write-enable on the register file and the write data for the register file, respectively.
Pipelining Review
1. Although an instruction can begin at each clock cycle, a single instruction takes 5 clock
cycles to go through the entire datapath
2. It takes four clock cycles before the five-stage pipeline is operating at full efficiency
3. Register write-back is controlled by the WB stage even though the register file is located
in the ID stage; we must transfer the destination register address to the writeback stage
4. When a stage is inactive, the value of the control lines are deasserted to prevent
anything harmful from occurring (*)
5. No state machine is required because the pipeline follows the single state SCCPU
datapath

Unit 6b: Data Hazards


Data hazards occur when the pipeline changes the order of read/write accesses to operands in a way that differs from a sequentially executing, unpipelined machine. These order changes can cause issues when multiple sequential operations rely on or manipulate the data in the same register.
A specific data hazard, known as a Read After Write (RAW) data hazard, occurs when a register is written in one instruction and then read by the instructions that follow. The pipelined system takes three additional clock cycles to actually write back the data, so the three following instructions would read incorrect (stale) data from the register.
There are two hardware solutions to prevent data hazards: (1) stalls and (2)
forwarding/bypassing.
Stalling Strategy for Data Hazard Prevention
A stall, or no operation (nop), is a dummy operation to prevent data hazards. To determine how
many stalls are necessary for subsequent instructions (1) determine the stage where data
becomes available from an instruction, (2) determine the stage where data gets consumed by
an instruction, and (3) measure the stage difference between them. The result is how many
stalls will be required. In the case of a write and a dependent read for two arithmetic operations, the data becomes available in the WB stage and is consumed in the ID stage, so 3 stalls are required.
The process of stalling requires us to detect the hazard and stall the dependent instructions until
the hazard is resolved by sending nops down the pipeline.
Hazard Detection Unit (HDU)
The Hazard Detection Unit is an extra hardware unit that is added to the ID stage to determine if the 5-stage pipeline needs to be stalled. The unit stalls the pipeline when an instruction already in the pipe is going to write a register that the instruction in ID wants to read. The following cases should result in a stall:
1a. ID/EX.RegWrite && (ID/EX.WriteRegister == IF/ID.ReadRegister1)
1b. ID/EX.RegWrite && (ID/EX.WriteRegister == IF/ID.ReadRegister2)
2a. EX/MEM.RegWrite && (EX/MEM.WriteRegister == IF/ID.ReadRegister1)
2b. EX/MEM.RegWrite && (EX/MEM.WriteRegister == IF/ID.ReadRegister2)
3a. MEM/WB.RegWrite && (MEM/WB.WriteRegister == IF/ID.ReadRegister1)
3b. MEM/WB.RegWrite && (MEM/WB.WriteRegister == IF/ID.ReadRegister2)
Essentially, if RegWrite is enabled and the Write Register in the last three stages matches either
of the read registers in the IF/ID pipelined register, then a stall should be rendered down the
pipeline. If a hazard does exist, the control signals are all cleared to 0.
The HDU requires six 5-bit comparators along with AND and OR gates. A stall is implemented
via a hardware generated NOP that turns all the control signals to zero.
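A hedged sketch of this stall check (the pipeline registers are modeled as plain dictionaries, and the field names follow the conditions listed above):

```python
def needs_stall(if_id, id_ex, ex_mem, mem_wb):
    # Stall if any in-flight instruction will write a register that
    # the instruction currently in ID wants to read.
    reads = (if_id["ReadRegister1"], if_id["ReadRegister2"])
    for stage in (id_ex, ex_mem, mem_wb):
        if stage["RegWrite"] and stage["WriteRegister"] in reads:
            return True     # render a hardware NOP: zero all control signals
    return False

# Example: the instruction in EX will write $8, and the instruction in ID reads $8.
print(needs_stall(
    if_id={"ReadRegister1": 8, "ReadRegister2": 9},
    id_ex={"RegWrite": 1, "WriteRegister": 8},
    ex_mem={"RegWrite": 0, "WriteRegister": 0},
    mem_wb={"RegWrite": 0, "WriteRegister": 0},
))  # True
```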
HDU Implementation

The number of stalls necessary depends on when the writeback to the register file occurs and
the stage of the write and read instructions. Assuming no internal forwarding in a register file,
subsequent write and read instructions would require 3 stalls in the 5-stage pipeline. This is
because the write instruction would need two stalls to reach the writeback stage and an extra
stall to actually write the register file. After three stalls, the hazards would be cleared and the
read operation could begin.
Despite the fact that hazards occur over multiple stages, the HDU does not require a counter or
a state machine because hazards are produced independently by the EX, MEM, and WB
stages. Thus, a write instruction will produce an EX hazard, then a MEM hazard, and then a WB hazard as it moves down the pipe, before clearing.
In addition to clearing all the control signals, PCWrite and IRWrite signals are generated to prevent the PC from incrementing and the IF/ID Pipeline Register from being written during a stall.
Register Forwarding/Bypassing for Data Hazard Prevention
The key idea of register forwarding/bypassing is to transfer dependent register values when
they are needed instead of stalling the pipeline. This would allow for a higher throughput by
allowing us to maximize the productivity of the pipeline during its operation.
Register File Internal Forwarding
Internal forwarding within the register file allows us to shave a stall during the WB stage. If the
write register is equal to the read register, we transfer the write data directly to the output
instead of stalling.
Thus, if we return back to the example in which a write instruction is followed by a dependent
read instruction, we would only require 2 stalls to transfer the write instruction from EX to WB.
Thus, we could ignore the WB Hazard case if we have internal forwarding in our register file.
Forwarding Unit
The forwarding unit is a hardware unit that allows us to source the EX stage with data from the
ID/EX pipeline register, the EX/MEM pipeline register, or the MEM/WB pipeline register. In our
forwarding unit, we catch the register write values and read the dependent register read values
at the beginning of the clock cycle.
The forwarding unit lets the stale register data flow through the ID/EX pipeline register and then replaces the data just before execution at the beginning of the EX stage. The data needs to be
forwarded back from the MEM stage or the WB stage, with data from the MEM stage having
higher priority than data from the WB stage.
Muxes at the beginning of the EX stage allow us to pass the original Register File Read data,
the data from the MEM stage, or the data from the WB stage. The forwarding unit compares the
Rs and Rt values from the ID stage with the Rd value from the MEM and WB stage, and if the
comparison matches, the data is replaced via the muxes.
The forwarding unit only requires four comparators compared to the old HDU, because the
comparison occurs in the EX stage instead of the ID stage.

Forwarding Unit - Hazard Definitions


An EX hazard occurs when data from the MEM stage is forwarded back to the EX stage. An EX
hazard occurs for Rs if EX/MEM.RegWrite is enabled, EX/MEM.WriteReg is not $0, and
EX/MEM.WriteReg == ID/EX.ReadReg1. In this case, ALUSelA = 01. This can be repeated for
Rt by checking whether EX/MEM.WriteReg == ID/EX.ReadReg2.
A MEM hazard occurs when data from the WB stage is forwarded back to the EX stage. A MEM hazard occurs for Rs if MEM/WB.RegWrite is enabled, MEM/WB.WriteReg is not $0, MEM/WB.WriteReg == ID/EX.ReadReg1, and an EX hazard has not occurred for Rs. In this case, ALUSelA = 10. This can be repeated for Rt by checking whether MEM/WB.WriteReg == ID/EX.ReadReg2 and an EX hazard has not occurred for Rt.
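A hedged sketch of the forwarding select logic for the Rs input (the 00/01/10 encoding follows the text; the Rt side is symmetric):

```python
def alu_sel_a(id_ex_read_reg1, ex_mem, mem_wb):
    # EX hazard: forward the result sitting in the EX/MEM pipeline register.
    if ex_mem["RegWrite"] and ex_mem["WriteReg"] != 0 \
            and ex_mem["WriteReg"] == id_ex_read_reg1:
        return "01"
    # MEM hazard: forward the value sitting in the MEM/WB pipeline register,
    # but only if the newer EX-hazard case did not already match.
    if mem_wb["RegWrite"] and mem_wb["WriteReg"] != 0 \
            and mem_wb["WriteReg"] == id_ex_read_reg1:
        return "10"
    return "00"   # no hazard: use the value read from the register file

print(alu_sel_a(8,
                ex_mem={"RegWrite": 1, "WriteReg": 8},
                mem_wb={"RegWrite": 1, "WriteReg": 8}))  # "01": the newer MEM-stage data wins
```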
Forwarding Unit with Simplified HDU
While the Forwarding Unit allows us to avoid stalling in many situations, there are still specific
situations where stalling would be necessary. In the specific case where a LW writes to a
register that is read in the next instruction, an issue occurs because when the data is finally read
from memory and loaded to the MEM/WB pipeline register, the sequential dependent operation
would have gone through the EX stage with a stale register value and written to the EX/MEM
pipeline register.
For this specific case of an LW followed by a dependent instruction, we introduce a simplified
HDU, which stalls the 5-stage pipeline when this case occurs. Thus a stall should occur if
ID/EX.RegWrite is enabled, ID/EX.RegDst == 0, and (ID/EX.WriteRegRt == IF/ID.ReadReg1) or
(ID/EX.WriteRegRt == IF/ID.ReadReg2).
In this case, we use RegDst == 0 or MemRead == 1 or MemtoReg == 1 to determine whether
we have a Load Word.
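A hedged sketch of this load-use stall check (field names follow the condition above; the load-word test uses MemRead, which is equivalent to the RegDst/MemtoReg tests described):

```python
def load_use_stall(if_id, id_ex):
    # Stall only for a LW (detected here via MemRead) whose destination (rt)
    # matches a source register of the instruction currently in ID.
    is_load = id_ex["MemRead"] == 1            # equivalently RegDst == 0 or MemtoReg == 1
    return bool(is_load and id_ex["RegWrite"] and
                id_ex["WriteRegRt"] in (if_id["ReadReg1"], if_id["ReadReg2"]))

print(load_use_stall({"ReadReg1": 8, "ReadReg2": 9},
                     {"MemRead": 1, "RegWrite": 1, "WriteRegRt": 8}))  # True
```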
To determine the number of stalls necessary:
(1) Determine what stage data gets produced by an instruction
(2) Determine what stage data gets consumed by an instruction
(3) Measure the stage difference between them
Forwarding logic grows quickly as pipelines grow deeper, since more in-flight results must be compared against and forwarded.
Forwarding Unit with a Dependent Store
If we have a dependent store, we can actually forward the write data from the MEM stage back
to the EX stage, but we need to tap the write data from the output of the Rt mux, not before the
Rt Mux.

Unit 6c: Control Hazards


Control (branch) hazards deal with issues related to program control instructions
(branch, jump, subroutine call, etc). These hazards occur because there is delay in
determining a branch or jump instruction, and thus, incorrect instructions may be
inserted into the pipeline.
Stalling Strategy for Control Hazard Prevention
The issue with a branch instruction is that a branch is realized in the MEM stage. Thus,
while a branch is being processed, three other instructions exist at that same clock
cycle in the EX, ID, and IF stages. The stalling solution calls for stalls to occur as soon
as we know that a branch instruction has been received (after the ID stage). The issue
is that even with this solution, one instruction would have slipped through before we
realize it is a branch instruction.
Flushing Strategy for Control Hazard Prevention
It is important to realize that while three instructions are in flight by the time we know
whether we will actually branch, those three instructions have, at most, completed the
EX stage, meaning that they have not written to memory or to the register file.
Additionally, if we don't actually branch, then there was no harm in processing those
three instructions. Thus, we can use an alternative solution known as flushing, which
squashes the three instructions only if we do in fact need to branch.
When a branch outcome is true, the following steps occur: (1) zero out the control
signals in the ID, EX, and MEM stages, and (2) set a control bit in the IF/ID stage
register that will tell the ID stage on the next clock cycle that the instruction it just
received is INVALID.
An issue can occur if a stall occurs at the same time a successful branch occurs.
Normally, when a successful branch occurs, the PC will be written with the PC + disp
value and the branch will be taken successfully. However, if a stall occurs, the PCWrite
enable would be disabled when we attempt to write PC + disp to the program counter.
The solution is to OR the FLUSH signal with PCWrite, thus enabling PCWrite anytime a
FLUSH occurs.
Early Branch Determination
It may be worth considering how we can reduce the branch penalty (the number of
instructions lost during a branch). Technically, at the output of the register file, we
should be able to add a comparator between ReadData1 and ReadData2 and
determine whether a branch will occur. The issue is that if the data is modified by an
earlier instruction still in the pipe, we wouldn't actually be comparing the correct values,
so we could go a step further and move the entire forwarding unit into the ID stage. By
doing this, and adding a comparison of the outputs of the ALUSelA and ALUSelB muxes
to determine whether a branch should occur, we could perform branches and only lose
1 instruction instead of three.
In this case, the distance between the branch determination and the fetch is 1 stage,
versus the old situation, where the distance between the branch (MEM) and fetch (IF)
was 3 stages.
Branch Delay Slots
After the Early Branch Determination modification, we have one wasted instruction in
the case where a branch is actually taken. One way to potentially prevent this time slot
from being wasted is to replace the instruction in this delay slot with an instruction that
should always be executed no matter what the branch outcome actually is.
Implementing branch delay slots is handled at the compiler level. Essentially, the
compiler reorganizes the code to fill the branch delay slot with an instruction that
should always be executed regardless of the branch outcome. If the compiler cannot
find an instruction that the branch does not depend on, it can simply insert a NOP in
the delay slot to avoid any issues.
Other Delay Slots
This concept of a delay slot can also be applied to other areas where stalls may be
necessary. For example, in the case of a load word followed by a dependent instruction,
our HDU logic must insert a hardware NOP to avoid a hazard. MIPS ISA could declare
a delay slot, allowing the compiler to schedule an independent instruction into the delay
slot after a LW to avoid any stalls.
