YCin = Cout. Chaining these adder units together creates a total delay for an n-bit RCA
equivalent to $2n$ gate delays (8 for a 4-bit RCA).
Carry Lookahead Adder
Carry lookahead adders are built upon the idea that we can generate all the carries in two logic
levels of generate and propagate signals. A generate signal implies that a column will generate
a carry-out whether or not the carry in is 1. A propagate signal implies that a column will
propagate a carry-in if it is a 1.
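$g_i = a_i b_i$
$p_i = a_i + b_i$
Using these signals, we can define the carry out ($c_{i+1}$):
$c_{i+1} = g_i + p_i c_i$
Expanding these equations, we get the result:
$c_1 = g_0 + p_0 c_0$
$c_2 = g_1 + p_1 g_0 + p_1 p_0 c_0$
$c_3 = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c_0$
$c_4 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0 + p_3 p_2 p_1 p_0 c_0$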
These equations form the carry-lookahead logic (CLL) that is fundamental to the fast-adder
functionality. The carry-lookahead logic generates all the carries in 2 gate delays. The generate
and propagate signals are all generated in 1 gate delay. Like the RCA, the sum is still a function
of $a_i$, $b_i$, and $c_i$, and is created with two chained XOR gates. However, since we know $a_i$ and $b_i$
at the beginning, their XOR can be computed while the PG and CLL logic is being evaluated.
Thus, computing the sums of a CLA requires two extra gate delays. This allows a 4-bit CLA
to generate all the sums in 5 gate delays instead of the 8 that a 4-bit RCA would require.
A CLA has a delay on the order of $\log_k n$, where $n$ is the number of bits and $k$ is the
blocking factor, or the number of carries produced in parallel. In a 4-bit CLA, the blocking factor is 4.
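The lookahead equations can be checked with a short bit-level model. The sketch below is illustrative only: the function name `cla_4bit` and its argument layout are assumptions made for this example, and it models the logic equations, not gate timing.

```python
def cla_4bit(a, b, c0=0):
    """Add two 4-bit values using carry-lookahead equations.

    Returns (4-bit sum, carry out). Names are illustrative assumptions.
    """
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(a_bits, b_bits)]  # generate:  g_i = a_i AND b_i
    p = [x | y for x, y in zip(a_bits, b_bits)]  # propagate: p_i = a_i OR b_i

    # Every carry is a two-level AND/OR function of g, p, and c0 only.
    c = [c0, 0, 0, 0, 0]
    c[1] = g[0] | (p[0] & c[0])
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0])
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0])
    c[4] = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c[0]))

    # Each sum bit is a_i XOR b_i XOR c_i.
    s = 0
    for i in range(4):
        s |= (a_bits[i] ^ b_bits[i] ^ c[i]) << i
    return s, c[4]


print(cla_4bit(9, 8))  # 9 + 8 = 17 -> sum bits 0001 with carry out 1
```

Because no carry depends on another carry, all four can be evaluated in parallel, which is where the speed advantage over the RCA comes from.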
Addressing modes refer to how instructions specify where an operand is located. Load/Store
operations often need to reference a specific
Direct addressing
Indirect addressing
Base addressing (indirect w/ offset)
Immediate addressing
MIPS Instructions
MIPS is a RISC-style instruction set architecture. MIPS operates on a 32-bit internal and
external data size: the registers and ALU are 32 bits wide and the memory bus is
logically 32 bits wide. All MIPS instructions are encoded as a single 32-bit word.
There are three important special purpose registers we should take note of: (1) the program
counter PC, (2) the hi-half reg HI, and (3) the lo-half reg LO. The program counter holds the
address of the next instruction to be fetched from memory and executed. The hi-half register
stores the 32 MSBs of a multiply operation or the 32-bit remainder of a divide operation. The
lo-half register stores the 32 LSBs of a multiply operation or the 32-bit quotient of a divide
operation.
R-Type Instructions
Opcode=6 | Rs=5 | Rt=5 | Rd=5 | Shamt=5 | Func=6
I-Type Instructions
Opcode=6 | Rs=5 | Rt=5 | Immed=16
J-Type Instructions
Opcode=6 | Jump address=26
Branch Instructions
A branch instruction is an I-Type instruction. The address specified by the 16-bit immediate is a
signed displacement value that is added to the program counter to specify the new PC location.
Because the last two bits of an instruction address are always 0s, they are not included in the
16-bit immediate value, but rather, tacked on after the instruction is decoded. This gives a
branch instruction an 18-bit signed displacement, for a range of ±128 KB.
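As a rough illustration of the field widths and the branch displacement, the sketch below packs R-Type and I-Type fields into a 32-bit word and computes a branch target. The helper names are assumptions made for this example, and the target calculation assumes the displacement is added to PC + 4, as the Branch control-signal discussion later in these notes states.

```python
def encode_r_type(rs, rt, rd, shamt, func):
    """Pack R-Type fields: opcode(6)=0 | rs(5) | rt(5) | rd(5) | shamt(5) | func(6)."""
    return (0 << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | func


def encode_i_type(opcode, rs, rt, imm16):
    """Pack I-Type fields: opcode(6) | rs(5) | rt(5) | immediate(16)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm16 & 0xFFFF)


def branch_target(pc, imm16):
    """Sign-extend the 16-bit immediate, append two zero bits, add to PC + 4."""
    offset = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    return (pc + 4 + (offset << 2)) & 0xFFFFFFFF


# add $t0, $s1, $s2 -> rs=17, rt=18, rd=8, func=0x20
print(hex(encode_r_type(17, 18, 8, 0, 0x20)))   # 0x2324020
# beq $t0, $t1, +3 instructions -> opcode=4, rs=8, rt=9
print(hex(encode_i_type(4, 8, 9, 3)))           # 0x11090003
print(hex(branch_target(0x1000, 3)))            # 0x1010

# The reachable byte offsets span -2^17 to 2^17 - 4, i.e. roughly +/-128 KB.
print(-0x8000 << 2, 0x7FFF << 2)                # -131072 131068
```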
Memory Organization
In MIPS, integer data come in (1) bytes (8 bits), (2) halfwords (16 bits), and (3) words (32 bits).
Floating point data come in (1) single precision (32 bits) and (2) double precision (64 bits).
Little vs. Big Endian
Endianness refers to ordering of bytes within a larger chunk. In a big-endian system, byte 0 is
at the big-end (left justified) of a word. In a little-endian system, byte 0 is at the little-end (right
justified) of a word.
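A quick way to see the difference is to lay out the same 32-bit word both ways. The sketch below uses Python's built-in `int.to_bytes`; the function name `byte_layout` and the sample value are assumptions chosen for illustration.

```python
def byte_layout(word):
    """Return the four bytes of a 32-bit word in address order (byte 0 first)."""
    big = list(word.to_bytes(4, byteorder="big"))
    little = list(word.to_bytes(4, byteorder="little"))
    return big, little


big, little = byte_layout(0x12345678)
print([hex(b) for b in big])     # byte 0 holds the most significant byte
print([hex(b) for b in little])  # byte 0 holds the least significant byte
```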
Load/Store Instructions
There are three types of load and store instructions for both signed and unsigned cases. These
three cases are for the single byte, half word, and full word. Loading an immediate requires
Support for Subroutines
Performance can be modeled using three components: (1) Dynamic instruction count (total
number of instructions actually performed) [IC], (2) clocks per instruction (average number of
clock cycles to execute each instruction) [CPI], and (3) clock period or frequency [T].
$\text{Execution Time} = IC \times CPI \times T = \frac{IC \times CPI}{f}$
Speedup
When comparing relative performance, we can use a metric known as speedup. Speedup is
how many times faster a new system is than an older one:
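$\text{Speedup} = \frac{\text{Performance}_{new}}{\text{Performance}_{old}} = \frac{\text{Execution Time}_{old}}{\text{Execution Time}_{new}}$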
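A small numeric sketch of the execution-time model and the speedup ratio follows. The function names and the sample numbers are assumptions chosen for illustration, not measurements.

```python
def execution_time(instruction_count, cpi, clock_period):
    """Execution time = IC x CPI x T (seconds if T is in seconds)."""
    return instruction_count * cpi * clock_period


def speedup(time_old, time_new):
    """How many times faster the new system is than the old one."""
    return time_old / time_new


# Hypothetical program: 1 million dynamic instructions on a 1 GHz clock.
old = execution_time(1_000_000, 2.0, 1e-9)  # CPI of 2.0
new = execution_time(1_000_000, 1.6, 1e-9)  # same program, CPI reduced to 1.6
print(old, new, speedup(old, new))
```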
Instruction Count
There are two types of instruction counts: (1) static instruction count and (2) dynamic instruction
count. Static instruction count is the number of written instructions, but dynamic instruction
count is the trace count, or how many instructions were actually executed at run time.
CPI
$CPI = \frac{\text{Total Clock Cycles}}{IC}$
The average CPI can be calculated by analyzing the instructions in a code sequence and
summing the product of the CPI of an individual instruction type and the probability of that
instruction (via the dynamic instruction count).
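$CPI_{avg} = \sum_i CPI_i \times P_i$, where $P_i$ is the fraction of the dynamic instruction count made up of instruction type $i$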
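The weighted sum can be sketched as below. The instruction mix is hypothetical and the function name `average_cpi` is an assumption made for this example.

```python
def average_cpi(mix):
    """Average CPI from a mix of {type: (cpi, dynamic_count)} entries."""
    total = sum(count for _, count in mix.values())
    # Weight each type's CPI by its share of the dynamic instruction count.
    return sum(cpi * count / total for cpi, count in mix.values())


# Hypothetical trace: 50 ALU ops, 30 loads, 20 branches.
mix = {"alu": (1, 50), "load": (2, 30), "branch": (3, 20)}
print(average_cpi(mix))  # (1*50 + 2*30 + 3*20) / 100, approximately 1.7
```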
Performance Aspects
Algorithm: software-level computer component that affects instruction count and CPI by
determining how many instructions and which kinds are executed
Programming Language: software-level computer component that affects instruction count
and CPI by determining the constructs that need to be translated and the kind of instructions
Compiler: software-level computer component that affects instruction count and CPI through
the efficiency by which a programming language is translated into instructions
Instruction Set: hardware-level computer component that affects instruction count, CPI and
clock cycle by specifying the instructions that are available and the work each instruction actually
performs
Microarchitecture: hardware-level computer component that affects CPI and clock cycle by
determining how each instruction is actually executed
Other Performance Measures
Programs are either memory bound or computationally bound.
Memory bound systems are limited by memory bandwidth (bytes/sec): the maximum
bytes of memory per second that can be read/written.
Computationally bound systems are limited by the ALU operations per second
(OPS/FLOPS): the maximum number of arithmetic operations per second the processor
can achieve
Computers are also compared using performance/watt.
Amdahl's Law
Amdahl's law describes the overall performance gain obtained by improving a part of a system.
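$\text{Execution Time}_{new} = \text{Execution Time}_{unaffected} + \frac{\text{Execution Time}_{affected}}{\text{Improvement Factor}}$
$\text{Speedup} = \frac{\text{Execution Time}_{old}}{\text{Execution Time}_{new}} = \frac{1}{\%_{unaffected} + \frac{\%_{affected}}{\text{Improvement Factor}}}$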
Amdahl's law says that a small improvement on a large piece of a program can matter more
than a large improvement on a small piece of a program. Amdahl's law requires a percentage
of execution time, not the percentage of CPI or instruction count.
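To make the comparison concrete, the sketch below evaluates Amdahl's law for two hypothetical cases. The function name and the sample fractions are assumptions for illustration.

```python
def amdahl_speedup(fraction_affected, improvement_factor):
    """Overall speedup when a fraction of execution time is sped up by a factor."""
    fraction_unaffected = 1.0 - fraction_affected
    return 1.0 / (fraction_unaffected + fraction_affected / improvement_factor)


# Modest 2x improvement on 80% of the execution time...
print(amdahl_speedup(0.80, 2))    # about 1.67
# ...beats a huge 100x improvement on only 10% of the execution time.
print(amdahl_speedup(0.10, 100))  # about 1.11
```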
Write register: write a new register value into the register file, if necessary
Update PC: if we executed a branch instruction, and the branch was taken, set the PC
to the target of the branch instruction; otherwise, set PC to PC + 4
Repeat: start the cycle over again
read enable because a memory read requires more energy than a register read. Thus, we
don't want to read or write memory unnecessarily.
Write Back (WB)
The write-back datapath is really a subset of the execute and memory access datapaths. It
simply allows us to write data from the ALU or the D-Mem back to a register in
the reg-file.
Branch Datapath
In the case of a branch instruction, the ALU is used to compare two register values from the reg
file. In the case of a beq, if the comparison results in an ALU output of ZERO=1, the sign
extended, left shifted branch target address is written to the PC.
The calculation of the branch target address and the ALU comparison can happen in parallel.
The target address is computed whether or not the branch is actually taken; it is simply not
selected when the branch falls through.
Control Signals
A single cycle CPU is not time dependent like a normal state machine. A single cycle CPU
always performs the same behavior: IF, ID, EX, MEM, WB, so there is technically only one
state. Thus, a single cycle CPU does not require Next State Logic (NSL) or State Memory (SM).
A SCCPU only requires output function logic in the form of a control unit that produces the
signals controlling the datapath.
The Control signals are generated by decoding the OpCode (Instruction[31:26]) into R-Type,
LW, SW, BEQ, and JUMP instructions. The following control signals are generated from these
decoded signals.
Jump
The Jump mux takes the output of the Branch (PCSrc) mux and replaces the next PC
address with a jump address if necessary. The jump address is given by
{NewPC[31:28], Inst[25:0], 00}. When Jump is 1, this address is sourced to the PC; when it is
0, the output of the PCSrc mux (PC + 4 or the branch target address) is sent to the PC.
Branch
Branch is a signal generated by the control unit when a branch instruction is received. For a
branch to actually occur, BRANCH must be 1 and the ALU comparison of the two read registers
must result in ZERO = 1. The ANDing of these two signals results in the PCSrc select signal,
which chooses between the PC + Offset and the PC + 4 and feeds the PC register. PC + Offset
is generated by sign extending and left shifting the offset value. The operation is ultimately (PC
+ 4) + (Offset × 4); thus, the offset value is really DesiredOffset ÷ 4. The PC offset value
requires its own adder because resources cannot be shared in a Single Cycle CPU.
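The Jump and Branch muxes together can be modeled as a next-PC function. This is a sketch under stated assumptions: the function and parameter names are invented for this example, and NewPC in the jump address is taken to be PC + 4.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's-complement value."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc, instruction, branch, zero, jump):
    """Model the PCSrc mux followed by the Jump mux."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    # Dedicated adder: (PC + 4) + (Offset x 4)
    branch_target = (pc_plus_4 + (sign_extend16(instruction & 0xFFFF) << 2)) & 0xFFFFFFFF
    pc_src = branch & zero  # branch is taken only if both signals are 1
    after_branch_mux = branch_target if pc_src else pc_plus_4
    # {NewPC[31:28], Inst[25:0], 00}, assuming NewPC = PC + 4
    jump_address = (pc_plus_4 & 0xF0000000) | ((instruction & 0x03FFFFFF) << 2)
    return jump_address if jump else after_branch_mux


beq = 0x11090003  # beq $t0, $t1, +3
print(hex(next_pc(0x00400000, beq, branch=1, zero=1, jump=0)))  # 0x400010 (taken)
print(hex(next_pc(0x00400000, beq, branch=1, zero=0, jump=0)))  # 0x400004 (not taken)
```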
MemRead
MemRead is a read enable on the D-Mem. The D-Mem requires a read enable because, unlike
register reads, memory reads require more energy. An unnecessary read could also access an
invalid address or result in a cache miss, which would waste more time or evict necessary data.
MemWrite
MemWrite is a write enable on the D-Mem. The D-Mem requires a write enable to prevent
writing when a SW instruction is not received.
MemtoReg
The MemtoReg mux controls the writeback value to the register file. ALU instructions want to
writeback the result from the ALU while LW wants to writeback the data from the D-Mem. When
MemtoReg is a 0, the ALU output is written back and when MemtoReg is a 1, the D-Mem output
is written back. The output of this mux feeds the Write Data input in the register file.
ALUSrc
The ALUSrc Mux controls the second input of the ALU. ALU instructions want to feed the Read
Register 2 data to the 2nd input of the ALU while LW/SW use the 2nd input of the ALU to form an
effective address for the D-Mem. The ALUSrc select signal chooses the register input at 0 and
the offset value at 1.
RegDst
The RegDst Mux controls the ID field of the destination register in the register file. ALU
instructions write to rd, specified by Instruction[15:11] while LW instructions write to rt specified
by Instruction[20:16]. The RegDst select signal chooses rd when 1 and rt when 0.
RegWrite
RegWrite is a write enable signal for the RegFile. This write enable is necessary because we
have certain instructions like BEQ and SW that do not cause a register to be updated. Thus, we
need to ensure that the register file is only written to during the appropriate instruction.
ALUOp and Func
The Single Cycle CPU has a separate ALU Control Unit to provide control signals to the ALU,
telling it which operations need to be performed. The inputs to the ALU Control Unit are the
ALUOp and Func. The ALUOp is a two bit signal generated by the control unit, and it specifies
between LW/SW operations (ADD), Branch operations (SUB), and R-Type operations (Func).
While LW/SW and Branch operations require fixed ALU computations, the Func signal is used
to determine the specific ALU computation that must be applied during the current instruction.
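The control signals described above can be collected into a decode table keyed by opcode. The opcode values are the standard MIPS encodings. The 2-bit ALUOp codes (00 = ADD, 01 = SUB, 10 = use Func) follow a common textbook convention and are an assumption here, as is writing don't-care outputs as 0.

```python
# Signal order matches the descriptions above; don't-care values are written as 0.
CONTROL_TABLE = {
    0x00: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1, MemRead=0,
               MemWrite=0, Branch=0, Jump=0, ALUOp=0b10),  # R-Type
    0x23: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1, MemRead=1,
               MemWrite=0, Branch=0, Jump=0, ALUOp=0b00),  # LW
    0x2B: dict(RegDst=0, ALUSrc=1, MemtoReg=0, RegWrite=0, MemRead=0,
               MemWrite=1, Branch=0, Jump=0, ALUOp=0b00),  # SW
    0x04: dict(RegDst=0, ALUSrc=0, MemtoReg=0, RegWrite=0, MemRead=0,
               MemWrite=0, Branch=1, Jump=0, ALUOp=0b01),  # BEQ
    0x02: dict(RegDst=0, ALUSrc=0, MemtoReg=0, RegWrite=0, MemRead=0,
               MemWrite=0, Branch=0, Jump=1, ALUOp=0b00),  # JUMP
}


def decode_control(instruction):
    """Look up control signals from the OpCode field, Instruction[31:26]."""
    opcode = (instruction >> 26) & 0x3F
    return CONTROL_TABLE[opcode]


print(decode_control(0x8D280004))  # lw $t0, 4($t1)
```

Since the table is purely combinational (one output row per opcode, no stored state), it matches the observation that a single cycle CPU needs only output function logic.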
To create the 5-stage pipeline with a single cycle CPU, we must create separate pipeline
registers for each stage. In addition to these pipeline registers, we must add special stage
control registers to transfer the control signals throughout the pipeline as well. Like the PC, the
pipeline registers update on each clock cycle, so no write enable signals are needed on these
registers.
While all the control signals are generated in one clock cycle, they are all consumed in
subsequent clock cycles in the Execution, Memory, and WriteBack stages.
Instruction Fetch
This stage does not require special control signals because we should always read from the
instruction memory and write the PC on each clock cycle.
Instruction Decode & Register File
This stage also does not require special control signals, because we should always decode the
instruction into control signals and read from the register file.
Execution
The signals RegDst, ALUOp/Func, and ALUSrc are required to properly execute the ALU
operation and store the appropriate destination register for writeback.
Memory
The signals Branch, MemRead, and MemWrite are required to operate the D-Mem and determine
whether a branch should be executed.
WriteBack
The signals RegWrite and MemtoReg are required to control the write-enable on the register file
and the write data for the register file, respectively.
Pipelining Review
1. Although an instruction can begin at each clock cycle, a single instruction takes 5 clock
cycles to go through the entire datapath
2. It takes four clock cycles before the five-stage pipeline is operating at full efficiency
3. Register write-back is controlled by the WB stage even though the register file is located
in the ID stage; we must transfer the destination register address to the writeback stage
4. When a stage is inactive, its control lines are deasserted to prevent
anything harmful from occurring (*)
5. No state machine is required because the pipeline follows the single state SCCPU
datapath
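Points 1 and 2 imply a simple cycle count for an ideal pipeline. The sketch below assumes no stalls or flushes, and the function name is an assumption made for this example.

```python
def pipelined_cycles(n_instructions, stages=5):
    """Cycles to finish n instructions on an ideal pipeline (no stalls)."""
    if n_instructions <= 0:
        return 0
    # The first instruction needs `stages` cycles; each later one finishes 1 cycle after.
    return stages + (n_instructions - 1)


print(pipelined_cycles(1))    # 5: one instruction still takes 5 cycles
print(pipelined_cycles(100))  # 104: 4 fill cycles, then one completion per cycle
```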
The number of stalls necessary depends on when the writeback to the register file occurs and
the stage of the write and read instructions. Assuming no internal forwarding in a register file,
subsequent write and read instructions would require 3 stalls in the 5-stage pipeline. This is
because the write instruction would need two stalls to reach the writeback stage and an extra
stall to actually write the register file. After three stalls, the hazards would be cleared and the
read operation could begin.
Despite the fact that hazards occur over multiple stages, the HDU does not require a counter or
a state machine because hazards are produced independently by the EX, MEM, and WB
stages. Thus, a write operation in the EX stage will produce an EX hazard, a MEM hazard, and a
WB hazard before clearing.
In addition to clearing all the control registers, PCWrite and IRWrite signals are generated to
prevent the PC from incrementing and the IF/ID Pipeline Register from being written.
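The stateless behavior of the HDU can be sketched as a pure comparison function. All names below are assumptions for illustration. The sketch assumes no forwarding of any kind, treats register 0 as never hazardous, and uses `None` for a stage that does not write a register.

```python
def detect_hazard(id_rs, id_rt, ex_dest, mem_dest, wb_dest):
    """Stall if a register read in ID is still pending a write in EX, MEM, or WB."""
    sources = {id_rs, id_rt} - {0}
    pending = {ex_dest, mem_dest, wb_dest} - {None, 0}
    stall = bool(sources & pending)
    # On a stall, PCWrite and IRWrite are deasserted to freeze the PC and IF/ID.
    return {"stall": stall, "PCWrite": int(not stall), "IRWrite": int(not stall)}


def count_stalls(id_rs, id_rt, ex_dest):
    """Count stall cycles while a writer that starts in EX drains through MEM and WB."""
    stages = [ex_dest, None, None]  # destination registers in EX, MEM, WB
    stalls = 0
    while detect_hazard(id_rs, id_rt, *stages)["stall"]:
        stalls += 1
        stages = [None] + stages[:2]  # a bubble enters EX; the writer advances
    return stalls


print(count_stalls(id_rs=8, id_rt=9, ex_dest=8))  # 3 stalls without internal forwarding
```

No counter is needed: each cycle the same comparison is re-evaluated against whichever stage currently holds the writing instruction.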
Register Forwarding/Bypassing for Data Hazard Prevention
The key idea of register forwarding/bypassing is to transfer dependent register values when
they are needed instead of stalling the pipeline. This would allow for a higher throughput by
allowing us to maximize the productivity of the pipeline during its operation.
Register File Internal Forwarding
Internal forwarding within the register file allows us to shave a stall during the WB stage. If the
write register is equal to the read register, we transfer the write data directly to the output
instead of stalling.
Thus, if we return to the example in which a write instruction is followed by a dependent
read instruction, we would only require 2 stalls to transfer the write instruction from EX to WB.
Thus, we could ignore the WB Hazard case if we have internal forwarding in our register file.
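A minimal model of a register file with internal forwarding is sketched below. The class and method names are assumptions for illustration, and one call to `cycle` stands in for one clock cycle in which the ID-stage reads and the WB-stage write happen together.

```python
class RegisterFile:
    """32-entry register file that forwards same-cycle writes to its read ports."""

    def __init__(self):
        self.regs = [0] * 32

    def cycle(self, read1, read2, write_reg=0, write_data=0, reg_write=0):
        data = write_data & 0xFFFFFFFF

        def read(r):
            # Internal forwarding: if this register is being written now, bypass it.
            if reg_write and r == write_reg and r != 0:
                return data
            return self.regs[r]

        outputs = (read(read1), read(read2))
        if reg_write and write_reg != 0:  # register 0 stays hardwired to zero
            self.regs[write_reg] = data
        return outputs


rf = RegisterFile()
# WB writes 42 to register 8 in the same cycle that ID reads registers 8 and 9.
print(rf.cycle(8, 9, write_reg=8, write_data=42, reg_write=1))  # (42, 0)
```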
Forwarding Unit
The forwarding unit is a hardware unit that allows us to source the EX stage with data from the
ID/EX pipeline register, the EX/MEM pipeline register, or the MEM/WB pipeline register. In our
forwarding unit, we catch the register write values and read the dependent register read values
at the beginning of the clock cycle.
The forwarding unit lets the stale register data flow through the ID/EX pipeline register and then
replaces the data just before it executes at the beginning of the EX stage. The data needs to be
forwarded back from the MEM stage or the WB stage, with data from the MEM stage having
higher priority than data from the WB stage.
Muxes at the beginning of the EX stage allow us to pass the original Register File Read data,
the data from the MEM stage, or the data from the WB stage. The forwarding unit compares the
Rs and Rt values carried in the ID/EX pipeline register with the Rd value from the MEM and WB
stages, and if the comparison matches, the data is replaced via the muxes.
The forwarding unit only requires four comparators compared to the old HDU, because the
comparison occurs in the EX stage instead of the ID stage.
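The mux selection can be sketched as two independent lookups, one per ALU input. The function names and the string labels for the mux settings are assumptions for illustration. The checks on the write-enable and on register 0 are common safeguards that the text above does not spell out.

```python
def forward_select(src, mem_rd, mem_reg_write, wb_rd, wb_reg_write):
    """Pick the source for one ALU input; MEM has priority over WB."""
    if mem_reg_write and mem_rd != 0 and mem_rd == src:
        return "MEM"  # newest value, from the EX/MEM pipeline register
    if wb_reg_write and wb_rd != 0 and wb_rd == src:
        return "WB"   # older value, from the MEM/WB pipeline register
    return "REG"      # no dependency: use the register file read data


def forwarding_unit(ex_rs, ex_rt, mem_rd, mem_reg_write, wb_rd, wb_reg_write):
    """Four comparisons in total: Rs and Rt, each against the MEM and WB destinations."""
    forward_a = forward_select(ex_rs, mem_rd, mem_reg_write, wb_rd, wb_reg_write)
    forward_b = forward_select(ex_rt, mem_rd, mem_reg_write, wb_rd, wb_reg_write)
    return forward_a, forward_b


# Both MEM and WB hold pending writes to register 8; Rs = 8 takes the MEM value.
print(forwarding_unit(ex_rs=8, ex_rt=9, mem_rd=8, mem_reg_write=1,
                      wb_rd=8, wb_reg_write=1))  # ('MEM', 'REG')
```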
a branch should occur, we could perform branches and only lose 1 instruction instead of
two.
In this case, the distance between the branch determination and the fetch is 1 stage,
versus the old situation, where the distance between the branch (MEM) and fetch (IF)
was 3 stages.
Branch Delay Slots
After the Early Branch Determination modification, we have one wasted instruction in
the case where a branch is actually taken. One way to potentially prevent this time slot
from being wasted is to replace the instruction in this delay slot with an instruction that
should always be executed no matter what the branch outcome actually is.
Branch delay slots can be implemented at the compiler level. Essentially, a
compiler can reorganize the code to fill the branch delay slots with instructions that
should always be executed. If the compiler cannot find instructions that the branch
does not depend on, it can simply insert a NOP in the delay slot to avoid any issues.
Other Delay Slots
This concept of a delay slot can also be applied to other areas where stalls may be
necessary. For example, in the case of a load word followed by a dependent instruction,
our HDU logic must insert a hardware NOP to avoid a hazard. The MIPS ISA could declare
a delay slot, allowing the compiler to schedule an independent instruction into the delay
slot after a LW to avoid any stalls.