Overview
This project requires us to implement the pipelined datapath for a MIPS processor. In this project, we need to:
o Modify the datapath starting from the single-cycle datapath.
o Apply the hazard detection unit.
o Apply the data forwarding unit.
o Extend the instruction pool.
o Develop some test cases for the circuit.
o Run a simple for loop on it.
Description of the project
Pipelining is a technique that exploits parallelism among the instructions in a sequential instruction stream. Each instruction is split into a sequence of steps, and different steps of different instructions are executed concurrently. This increases instruction throughput and shortens the overall execution time. However, it brings some issues which never exist in the single-cycle design. We need to deal with data hazards and control hazards in this project. Besides this, we need to assemble the datapath correctly. Together, these make sure we are able to run a simple program once our processor is completed.
Determining the architecture of the processor
Our processor is a 32-bit MIPS processor. Its function is based on the lectures and textbook from CprE 381, and its operation is well explained in the textbook. We modified some parts so that it can run more instructions.
CPRE 381 FINAL PROJECT May 1, 2014
1. DataPath
Figure 1 shows the single-cycle datapath with the pipeline stages identified. Dividing an instruction into five stages gives a five-stage pipeline. The five stages are:
1. IF: Instruction fetch
2. ID: Instruction decode and register file read
3. EX: Execution or address calculation
4. MEM: Data memory access
5. WB: Write back
Thus we need to add pipeline registers between stages.
Figure 1. Single Cycle Datapath
After these registers are set up, we need to pass the correct values from register to register. For example, we need to pass the instruction and PC value to the IF/ID registers, and the control signals from the control unit to the ID/EX registers.
Since some instructions need to track extra values, we need to add some extra registers. For example, for the jal instruction, we need to add a control signal jal to tell the write-back stage that we are using 31 as the write address for the register file. Also, since we are writing the PC value into the register file, we need extra registers to pass the PC value along. The original IF/ID.PC and ID/EX.PC can be used, but EX/MEM.PC cannot hold PC+4, because EX/MEM.PC will hold the branch address. (Our machine predicts PC = PC+1 rather than PC+4, because the given memory module is word based and its index increases by 1.) As a result, we need EX/MEM.PC.jal and MEM/WB.PC.jal to keep the PC value that we want to write into the register file.
Figure 2. Pipeline Registers
Figure 3. Pipeline Registers in Code
2. Memory
The memory in the book is byte based, which means each instruction address increases by 4 in the instruction memory. Thus when predicting the next PC value, the book uses PC = PC+4. When we read consecutive words from data memory, the address also has to be incremented by 4.
However, in our lab, the memory module produced by the MegaWizard is word based, which means each address points to a whole word. As a result, our prediction for the PC is PC = PC+1.
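The difference between the two addressing schemes can be sketched in a few lines of Python. This is only an illustrative model, not the project's Verilog; the names next_pc and branch_target are hypothetical, and the branch target formula assumes the usual MIPS rule (next PC plus the scaled offset).

```python
# Hypothetical model of PC arithmetic; not taken from the project's Verilog.
WORD_STEP = 1  # this design: word-addressed memory, PC advances by 1
BYTE_STEP = 4  # textbook MIPS: byte-addressed memory, PC advances by 4


def next_pc(pc, step=WORD_STEP):
    """Predicted address of the next sequential instruction."""
    return pc + step


def branch_target(pc, offset, step=WORD_STEP):
    """Branch target: next PC plus the offset, scaled by the address step.

    Textbook: PC + 4 + (offset << 2). Word-addressed: PC + 1 + offset.
    """
    return next_pc(pc, step) + offset * step
```

With the word-addressed step, a beq at address 0 with offset 8 lands on address 9, which matches the branch flush test later in the report.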
Figure 4. Bubble Sort Instructions in the memory file
The memory is sized Width = 32 and Depth = 1024. But because we only use 8 bits of the immediate field, the effective memory address ranges from 0 to 255.
Another thing we noticed is that the memory module is clocked, which means its output only updates on a clock edge. However, in the single-cycle datapath, our other modules are combinational logic, which means the result is produced directly from the inputs. The clock delay made the single-cycle datapath fail. To solve this problem, we added a much faster clock signal to drive the memory module, so the correct value is available when the rising edge of the normal clock comes.
3. Registers
The register file consists of 32 32-bit registers. Most are general-purpose registers, but some serve a special purpose.
Register  Purpose
0         The constant zero
31        Link register
4. Instruction Format
The instruction format is exactly the same as in MIPS, which makes things easy because we can get a lot of help from the textbook. This website helps us convert the instructions into binary and hexadecimal: http://www.mipshelper.com/mips-converter.php
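As a cross-check on the converter, the three standard MIPS formats can also be packed by hand. The sketch below is a minimal Python encoder; the function names are hypothetical, and the field layout and opcodes are the standard MIPS32 ones (addi = 8, j = 2, jal = 3, slt funct = 0x2A).

```python
# Hypothetical helper names; field layout follows the standard MIPS32 formats.
def encode_r(rs, rt, rd, funct, shamt=0):
    """R-type: opcode 0 | rs | rt | rd | shamt | funct."""
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(opcode, rs, rt, imm):
    """I-type: opcode | rs | rt | 16-bit immediate."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(opcode, target):
    """J-type: opcode | 26-bit target."""
    return (opcode << 26) | (target & 0x3FFFFFF)


def to_hex(word):
    """Format a 32-bit word the way the memory file lists it."""
    return "%08X" % word
```

For example, to_hex(encode_i(8, 0, 1, 1)) gives 20010001 for addi $1, $0, 1, and to_hex(encode_j(2, 9)) gives 08000009 for j 9, the same machine codes used in the test programs of this report.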
Figure 5. www.mipshelper.com converter
5. Addressing Modes
The processor supports the following addressing modes:
Branch operations
o Branch Equal (8-bit field)
o Branch Not Equal (8-bit field)
Jump operations
o Absolute jump (8-bit field)
o Jump and Link (8-bit field)
Memory load and store operations
o Register base address with 8-bit immediate field
6. Instruction Supported
Supporting more instructions is the most challenging and time-consuming part of the project.
Since we did this project in about two days, we did not add many instructions. But the processor supports the basic functions:
We also implemented branch not equal. To implement BNE, we need another control signal called Branch_NE, which indicates that the instruction is a branch-not-equal instruction. The branch condition for BNE is then (Branch_NE && !Zero).
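Putting the BEQ and BNE cases together, the branch decision can be sketched as below. This is an illustrative Python model rather than the project's Verilog; the function name is hypothetical, and combining the two conditions with an OR is an assumption based on the textbook beq condition (Branch && Zero).

```python
# Hypothetical model of the branch decision logic.
def branch_taken(branch, branch_ne, zero):
    """Return True when the PC should be redirected to the branch target.

    beq: taken when Branch is set and the ALU Zero flag is 1.
    bne: taken when Branch_NE is set and the ALU Zero flag is 0.
    """
    return bool((branch and zero) or (branch_ne and not zero))
```

Keeping Branch_NE as a separate signal means the existing beq path stays untouched; only one extra AND gate and one OR gate are needed.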
We have already talked about jal. It also needs an additional control signal and additional pipeline registers. The change in the datapath is as follows:
INST    RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0  jal
R-Type  1       0       0         1         0        0         0       1       0       0
lw      0       1       1         1         1        0         0       0       0       0
sw      X       1       X         0         0        1         0       0       0       0
beq     X       0       X         0         0        0         1       0       1       0
addi    0       1       0         1         0        0         0       0       0       0
ori     0       1       0         1         0        0         0       1       0       0
andi    0       1       0         1         0        0         0       1       0       0
j       X       X       X         0         0        0         X       X       X       0
jal     X       X       X         1         0        0         X       X       X       1
Although the figure shows the single-cycle datapath, we were able to modify it relatively easily. The key point in implementing jal is that we use IDEX_jal to select the write address between 31 and the Rt/Rd field. This happens in the EX stage.
We then use MEMWB_jal to select the data to write. This happens in the WB stage.
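The two jal selections described above amount to two extra multiplexers. A minimal Python sketch follows; the function and argument names are hypothetical stand-ins for the pipeline register fields, and the inner RegDst and MemtoReg selections are assumed to be the usual textbook muxes.

```python
# Hypothetical model of the two jal multiplexers; not the project's Verilog.
LINK_REGISTER = 31


def select_write_register(idex_jal, reg_dst, rt, rd):
    """EX stage: jal forces register 31, otherwise the RegDst mux picks Rd or Rt."""
    if idex_jal:
        return LINK_REGISTER
    return rd if reg_dst else rt


def select_write_data(memwb_jal, mem_to_reg, alu_result, mem_data, saved_pc):
    """WB stage: jal writes the saved PC, otherwise the MemtoReg mux decides."""
    if memwb_jal:
        return saved_pc
    return mem_data if mem_to_reg else alu_result
```

In this model saved_pc stands for the value carried in the extra PC pipeline registers, so the link value is still available in WB even though EX/MEM.PC already holds the branch address.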
7. Specifying the behavior of the clock signal
Strictly speaking, we should use the rising clock edge to update the PC and pipeline registers, and the falling clock edge to update the register file and the memory unit. However, we actually use the rising clock edge for all registers and it works well. We think this is because in ModelSim we do not need to deal with the time the clock signal takes to flip. The simulation is ideal and everything happens instantly. In real hardware we may have slew rate issues and have to make sure the correct value is stable before the clock changes.
Figure 6. Jal Jr added single cycle datapath
8. Design, implement, and test the ALU
Our ALU is the same as the one in the lecture slides and textbook. It is 32 bits wide and implements bitwise AND, bitwise OR, addition, subtraction and set-less-than. Shift is not supported, although we could implement it inside the ALU using the shift module from lab 3.
Our ALU uses the carry look-ahead method to compute the addition. Each bit produces a generate and a propagate signal, g and p. Each group of 4 bits is then combined to produce a group P and G, and we reuse the 4-bit carry look-ahead unit on these group signals to produce the carries for 16 bits. Finally, we use the ripple-carry method to combine two 16-bit ALUs into a 32-bit ALU.
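The structure described above can be modelled bit by bit in Python. This is a behavioral sketch under assumptions, not the project's Verilog: the function names are hypothetical, propagate is taken as a OR b as in the textbook, and the loop in lookahead_carries stands in for the flattened carry equations that the hardware evaluates in parallel.

```python
# Hypothetical bit-level model of a two-level carry look-ahead adder.
def lookahead_carries(g, p, c0):
    """Carries c1..c4 from four generate/propagate pairs (c[i+1] = g[i] | p[i] & c[i])."""
    carries, c = [], c0
    for gi, pi in zip(g, p):
        c = gi | (pi & c)
        carries.append(c)
    return carries


def group_pg(g, p):
    """Group generate and propagate for one 4-bit block."""
    big_p = p[0] & p[1] & p[2] & p[3]
    big_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
             | (p[3] & p[2] & p[1] & g[0]))
    return big_g, big_p


def add16(a, b, c0):
    """16-bit add: four 4-bit blocks plus one reused 4-bit look-ahead unit."""
    a_bits = [(a >> i) & 1 for i in range(16)]
    b_bits = [(b >> i) & 1 for i in range(16)]
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x | y for x, y in zip(a_bits, b_bits)]
    groups = [group_pg(g[i:i + 4], p[i:i + 4]) for i in (0, 4, 8, 12)]
    # The same look-ahead equations are applied to the group signals.
    block_out = lookahead_carries([gg for gg, _ in groups],
                                  [pp for _, pp in groups], c0)
    block_in = [c0] + block_out[:3]
    result = 0
    for k in range(4):
        lo = 4 * k
        inner = lookahead_carries(g[lo:lo + 4], p[lo:lo + 4], block_in[k])
        carry_in = [block_in[k]] + inner[:3]
        for j in range(4):
            result |= (a_bits[lo + j] ^ b_bits[lo + j] ^ carry_in[j]) << (lo + j)
    return result, block_out[3]


def add32(a, b, c0=0):
    """32-bit add: two 16-bit look-ahead adders joined by a rippled carry."""
    low, c16 = add16(a & 0xFFFF, b & 0xFFFF, c0)
    high, c32 = add16((a >> 16) & 0xFFFF, (b >> 16) & 0xFFFF, c16)
    return (high << 16) | low, c32
```

Rippling only the single carry between the two 16-bit halves keeps the design simple while still avoiding a 32-stage ripple chain.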
ALUOp1  ALUOp0  F5  F4  F3  F2  F1  F0  Operation
0       0       X   X   X   X   X   X   010
X       1       X   X   X   X   X   X   110
1       X       X   X   0   0   0   0   010
1       X       X   X   0   0   1   0   110
1       X       X   X   0   1   0   0   000
1       X       X   X   0   1   0   1   001
1       X       X   X   1   0   1   0   111
Figure 7. 16 bit carry look-ahead ALU
Figure 8. ALU Control Unit Truth Table
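The ALU control truth table can be read as a small decoder. The Python sketch below follows the rows of that table one by one; the function name is hypothetical and the model is only an illustration of the table, not the project's Verilog.

```python
# Hypothetical model of the ALU control truth table (Figure 8).
FUNCT_TO_OPERATION = {
    0b0000: 0b010,  # add
    0b0010: 0b110,  # sub
    0b0100: 0b000,  # and
    0b0101: 0b001,  # or
    0b1010: 0b111,  # slt
}


def alu_control(alu_op1, alu_op0, funct):
    """Return the 3-bit Operation code for the given ALUOp bits and funct field."""
    if alu_op1 == 0 and alu_op0 == 0:
        return 0b010  # lw, sw, addi: add
    if alu_op0 == 1:
        return 0b110  # beq, bne: subtract
    # ALUOp1 = 1: only F3..F0 matter, F5 and F4 are don't-cares.
    return FUNCT_TO_OPERATION[funct & 0xF]
```

For example, the funct field of slt is 0x2A, whose low four bits are 1010, so alu_control(1, 0, 0x2A) selects operation 111.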
Operation is a three-bit control signal. The most significant bit is used as B_invert. Operation[1:0] determines which function the ALU performs:
a) Logical AND (operation 00)
b) Logical OR (operation 01)
c) Addition (operation 10)
d) Subtraction (operation 10 with B_invert)
e) Set-on-Less-Than (operation 11 with B_invert)
Besides the result, the ALU also outputs an Overflow flag and a Zero flag. Overflow is set when an arithmetic overflow occurs, and Zero is set when the result is zero.
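The operation encoding and the two flags can be summarised in a short behavioral model. This Python sketch is an assumption-laden illustration, not the project's Verilog: the function name is hypothetical, B_invert is taken as Operation[2] (as the codes 110 and 111 imply), and set-on-less-than is modelled as the sign of a - b corrected by the overflow bit.

```python
# Hypothetical behavioral model of the 32-bit ALU.
MASK32 = 0xFFFFFFFF


def alu(a, b, operation):
    """Return (result, zero, overflow) for a 3-bit Operation code."""
    b_invert = (operation >> 2) & 1
    select = operation & 0b11
    a &= MASK32
    b_eff = (~b if b_invert else b) & MASK32
    # With B_invert set, carry-in is 1, so the adder computes a - b.
    total = (a + b_eff + b_invert) & MASK32
    sign_a, sign_b, sign_t = a >> 31, b_eff >> 31, total >> 31
    # Only meaningful for add, sub and slt.
    overflow = int(sign_a == sign_b and sign_t != sign_a)
    if select == 0b00:
        result = a & b_eff
    elif select == 0b01:
        result = a | b_eff
    elif select == 0b10:
        result = total
    else:
        result = sign_t ^ overflow  # 1 when a < b as signed numbers
    return result, int(result == 0), overflow
```

A beq or bne comparison uses operation 110: alu(5, 5, 0b110) returns a zero result with the Zero flag set, which is what the branch logic tests.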
9. Drawing the datapath
Several drafts have been sketched, but there is no final version because we keep adding new things to it. It is very similar to the one in the textbook, with additional multiplexers and some wires added. It is closest to the datapath in Figure 6, plus the stage registers.
10. Defining the control lines
Our control lines are based on the textbook. The new control lines are discussed here.
Jal: Our datapath adds a Jal control line. Asserting this line causes the register file to record the PC value of the instruction in the WB stage into register 31. When the control unit detects a jump-and-link instruction in the ID stage, its output jal is set and is passed on to the next stage. In the EX stage, we use ID/EX.jal to select 31 or the normal Rt/Rd field as the write address; in the WB stage, we use MEM/WB.jal to select between the recorded PC value and the data from the MemtoReg multiplexer.
Branch_NE: This signal is asserted when the instruction is a branch-not-equal instruction. It is ANDed with the !Zero signal to determine whether the branch is taken.
11. Determine the control line coding for each instruction
With all of the control lines defined in the table, we are able to code the control unit for each instruction. The control signals are as follows:
The code for the control unit is as follows:
The page is not wide enough to show the complete equations. When we need to add a new instruction, we add a term for it to each signal that it should assert; otherwise the signal stays deasserted.
12. Implementing and testing the control module.
We did not do separate testing for the control module because this part is embedded into the testing for instructions. As long as the instructions are working fine, there should be no problems.
INST    RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0  jal  BNE
R-Type  1       0       0         1         0        0         0       1       0       0    0
lw      0       1       1         1         1        0         0       0       0       0    0
sw      X       1       X         0         0        1         0       0       0       0    0
beq     X       0       X         0         0        0         1       0       1       0    0
bne     X       0       X         0         0        0         0       0       1       0    1
addi    0       1       0         1         0        0         0       0       0       0    0
j       X       X       X         0         0        0         X       X       X       0    0
jal     X       X       X         1         0        0         X       X       X       1    0
Figure 9. Control Unit Truth Table
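For reference while testing, the truth table can be captured as a lookup table. The Python sketch below copies the rows of Figure 9 directly; the names FIELDS, CONTROL and control_signals are hypothetical, and don't-care entries (X) are represented as None. The real control unit decodes the opcode with logic equations rather than looking up a mnemonic.

```python
# Hypothetical lookup-table model of the control unit truth table (Figure 9).
X = None  # don't care

FIELDS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
          "MemWrite", "Branch", "ALUOp1", "ALUOp0", "jal", "BNE")

CONTROL = {
    "R-Type": (1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0),
    "lw":     (0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
    "sw":     (X, 1, X, 0, 0, 1, 0, 0, 0, 0, 0),
    "beq":    (X, 0, X, 0, 0, 0, 1, 0, 1, 0, 0),
    "bne":    (X, 0, X, 0, 0, 0, 0, 0, 1, 0, 1),
    "addi":   (0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0),
    "j":      (X, X, X, 0, 0, 0, X, X, X, 0, 0),
    "jal":    (X, X, X, 1, 0, 0, X, X, X, 1, 0),
}


def control_signals(instruction):
    """Return the control signals for one instruction class as a dict."""
    return dict(zip(FIELDS, CONTROL[instruction]))
```

Note that every signal which changes machine state (RegWrite, MemRead, MemWrite) has a definite 0 or 1 in every row, so a don't-care can never cause an unintended write.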
13. Designing and implementing the remaining modules
The other modules were developed and tested individually, and we have confirmed that they work correctly. This individual testing was done in each lab assignment, because these modules change far less than the datapath and control unit do. We have used them in the single-cycle datapath and the multi-cycle datapath. The only modules we use for the first time are the hazard detection unit and the forwarding unit.
14. Implementing the datapath
The datapath was implemented in Verilog by connecting the modules we had created with wires. The most important thing when coding is to make sure you do not enter the wrong variable name in a module port. It is a very common mistake when assembling the datapath. You may get a wrong value, spend hours finding out what is going on, and finally discover that you put the wrong name in the slot. With the help of the textbook and lecture slides, and some resources from the internet, actually connecting the modules is relatively easy.
15. Selecting the sequence of instructions to test the processor
The tested value is the same as we expected. The add $1, $1, $3 instruction needs a value which has not yet been written into register $1, but it is forwarded from a later pipeline stage.
This case is actually the double data hazard case from the lecture slides: both hazards occur, and we want to use the most recent result. Thus we used the revised MEM hazard condition: only forward from MEM/WB if the EX hazard condition is not true.
The result is as we expected: add $1, $1, $4 takes its forwarded value from the EX/MEM registers rather than from the MEM/WB registers.
Figure 10. Data forwarding test result
Figure 11. Double Hazard Data forwarding test result
Forwarding condition:
EX hazard: Data forwarding from EX/MEM register
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
MEM hazard (revision): Data forwarding from MEM/WB register
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
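The forwarding conditions above can be expressed compactly by checking the EX hazard first. The following Python sketch is only a behavioral model with a hypothetical function name; the early return for the EX hazard plays the role of the "not (EX hazard)" clause in the revised MEM hazard condition.

```python
# Hypothetical model of one forwarding mux select (ForwardA or ForwardB).
def forward_select(exmem_regwrite, exmem_rd, memwb_regwrite, memwb_rd, idex_reg):
    """Return the 2-bit select for the ALU operand read from idex_reg.

    0b10: forward from EX/MEM, 0b01: forward from MEM/WB, 0b00: no forwarding.
    """
    if exmem_regwrite and exmem_rd != 0 and exmem_rd == idex_reg:
        return 0b10  # EX hazard: the most recent result wins
    if memwb_regwrite and memwb_rd != 0 and memwb_rd == idex_reg:
        return 0b01  # MEM hazard, reached only when there is no EX hazard
    return 0b00
```

ForwardA is obtained by passing ID/EX.RegisterRs as idex_reg and ForwardB by passing ID/EX.RegisterRt. In the double hazard case, where both EX/MEM and MEM/WB are about to write $1, the function returns 0b10, so the newer value is used.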
As we can see, lw $3, 2($8) uses the result of addi $8, $0, 4, which is forwarded from the following pipeline stage. This takes only 1 cycle.
The add $4, $3, $8 instruction takes two cycles, because we have to stall the pipeline and wait for the loaded value to come out of the MEM stage.
The result is the same as we expected.
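The stall described above is the load-use case. A minimal Python sketch of the detection follows; it uses the textbook load-use condition, which is an assumption here since the report's own condition is not reproduced in the text, and the function name is hypothetical.

```python
# Hypothetical model of load-use hazard detection (textbook condition).
def load_use_stall(idex_mem_read, idex_rt, ifid_rs, ifid_rt):
    """Return True when the instruction in ID must wait one cycle.

    The instruction in EX is a load (ID/EX.MemRead) whose destination Rt
    matches a source register of the instruction currently in ID.
    """
    return bool(idex_mem_read and idex_rt in (ifid_rs, ifid_rt))
```

For the test sequence, lw $3, 2($8) followed by add $4, $3, $8 gives load_use_stall(1, 3, 3, 8), which is True, while addi $8, $0, 4 followed by the lw gives no stall because addi does not read memory. In the textbook scheme, a stall holds the PC and IF/ID registers and zeroes the control signals entering ID/EX for one cycle.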
Hazard Detection Condition:
Control Hazard Detection
Branch flush:
PC  Instruction          Machine Code  Expected Value
0   beq $zero, $zero, 8  10000008      branch to 9
1   addi $1, $0, 1       20010001      flushed
2   addi $2, $0, 2       20020002      flushed
3   addi $3, $0, 3       20030003      flushed
4   addi $4, $0, 4       20040004      Not in the pipeline
5
9   addi $5, $0, 5       20050005      $5 = 5
7   NOP                  00000000
Figure 12. Load Use Hazard test
Simulation result
Since we resolve the branch in the MEM stage, each taken branch costs 3 cycles of performance. We flush the three instructions following the branch instruction. As a result, $1, $2 and $3 are not assigned any value. The instruction addi $4, $0, 4 is not even fetched into the pipeline registers. After the branch is taken, the PC is set to 9, addi $5, $0, 5 executes, and $5 becomes 5.
The result is the same as we expected. The branch takes 4 cycles to determine whether it is taken or not, and addi $5, $0, 5 takes 5 cycles to write its result back to the register file.
That is why you see 9 clock cycles on the waveform.
Figure 13. Control Hazard Detection
Figure 14. Branch Flush process
For the IF/ID register, we set the instruction to 0, which is essentially add $0, $0, $0. This is equivalent to a NOP and does no harm to our data. For the ID/EX and EX/MEM registers, we set the control signals to 0 so that the flushed instructions do not write into memory or the register file.
Jump flush:
PC  Instruction     Machine Code  Expected Value
0   j 9             08000009      jump to 9
1   addi $1, $0, 1  20010001      flushed
2   addi $2, $0, 2  20020002      Not in the pipeline
3   addi $3, $0, 3  20030003      Not in the pipeline
4   addi $4, $0, 4  20040004      Not in the pipeline
5
9   addi $5, $0, 5  20050005      $5 = 5
10  NOP             00000000
Simulation Result
The result is the same. The only difference is that the jump instruction wastes only 1 clock cycle, since only 1 instruction behind the jump is fetched, and that is the only one we need to flush.
The addi instruction still takes 5 cycles, and the total number of clock cycles needed is 7. You only see 6 cycles on the waveform because the first one is triggered by reset, and the write-back happens at the beginning of the WB stage.
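The cycle counts quoted for the branch and jump tests follow from simple bookkeeping, sketched here in Python. The function name and the formula are an illustrative reading of the numbers in this report, not something taken from the simulator.

```python
# Hypothetical cycle-count bookkeeping for the flush tests.
PIPELINE_DEPTH = 5  # IF, ID, EX, MEM, WB


def cycles_until_target_completes(flushed_instructions):
    """Cycles from fetching a taken branch/jump until its target finishes WB.

    One cycle fetches the branch or jump itself, each flushed instruction
    wastes one fetch cycle, and the target then needs all five stages.
    """
    return 1 + flushed_instructions + PIPELINE_DEPTH
```

A branch resolved in MEM flushes 3 instructions, giving 1 + 3 + 5 = 9 cycles, while a jump flushes only 1, giving 1 + 1 + 5 = 7 cycles, in line with the two simulations above.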
Figure 14. Jump Flush process
Jal flush and record PC
PC  Instruction     Machine Code  Expected Value
0   jal 9           0C000009      jump to 9 and link
1   addi $1, $0, 1  20010001      flushed
2   addi $2, $0, 2  20020002      Not in the pipeline
3   addi $3, $0, 3  20030003      Not in the pipeline
4   addi $4, $0, 4  20040004      $31 = 1
5
9   addi $5, $0, 5  20050005      $5 = 5
10  NOP             00000000
Simulation Result:
We can see that the PC+1 value at the time of the jump is written into register 31 at the beginning of the fourth cycle. Again, the first cycle is triggered by reset and cannot be seen clearly. After the first instruction, everything is as we expected.
It also flushes the one following instruction, just as the jump instruction does.
Figure 14. Jal Flush and Record process
Run a program to test the whole processor
After testing each instruction, we are able to combine them to run some simple programs.
Bubble Sort Program
Special thanks to Nicholas Cervantes, who translated the C code of bubble sort into MIPS assembly code and shared it with me to test my processor.
C code:
int main ()
{
    int arr[10] = {1, 7, 10, 2, 3, 8, 4, 9, 5, 6};

    int i = 0;
    int d = 0;

    // Bubble sort for the given array.
    for (i = 0; i < 10; i++)
    {
        for (d = 0; d < 9 - i; d++)
        {
            if (arr[d] > arr[d+1])
            {
                int swap = arr[d];
                arr[d] = arr[d+1];
                arr[d+1] = swap;
            }
        }
    }

    return 0;
}
Assembly Code:
PC  Instruction         Machine Code  Expected Value
0   addi $19, $0, 9     20130009      Set $19 = 9
1   addi $20, $0, 10    2014000A      Set $20 = 10
2   add $1, $0, $0      00000820      Set $1 = i = (0)
3   slt $11, $1, $20    0034582A      branch check (i<10) #OUTER LOOP
4   beq $0, $11, 14     100B000E      Branch to where Loop Ends
5   sub $18, $19, $1    02619022      $18 = 9 - i
6   add $2, $0, $0      00001020      Set $2 = d = (0)
7   slt $11, $2, $18    0052582A      branch check (d<9-i)
8   beq $0, $11, 8      100B0008      branch to OUTER if not (d<9-i) #INNER LOOP
9   lw $4, 0($2)        8C440000      load arr[d] -> $4
10  lw $5, 1($2)        8C450001      load arr[d+1] -> $5
11  slt $11, $5, $4     00A4582A      Note: a bubble should form (from lw, followed by read)
12  beq $0, $11, 2      100B0002      Branch if no swap needed.
13  sw $5, 0($2)        AC450000      Place arr[d+1] into arr[d]
14  sw $4, 1($2)        AC440001      Place arr[d] into arr[d+1]
15  addi $2, $2, 1      20420001      Addi $2++ (d++)
16  J 3                 08000003      Jump to the #INNER LOOP
17  addi $1, $1, 1      20210001      Addi $1++ (i++)
18  j 7                 08000007      Jump to OUTER LOOP
19  lw $1, 0($0)        8C010000      Load all values into the regs to
20  lw $2, 1($0)        8C020001      check their values
21  lw $3, 2($0)        8C030002
22  lw $4, 3($0)        8C040003
23  lw $5, 4($0)        8C050004
24  lw $6, 5($0)        8C060005
25  lw $7, 6($0)        8C070006
26  lw $8, 7($0)        8C080007
27  lw $9, 8($0)        8C090008
28  lw $10, 9($0)       8C0A0009
29  NOP                 00000000
Figure 15. Bubble Sort MIPS Code
We expect the result to be sorted from smallest to largest, from register 1 to register 10.
As you can see, the final result is sorted as we want. The data in the data memory was placed in random order. Let us try one more.
Data in the data memory is like this:
The result is as follows:
Figure 16. Sorting example 1
Figure 17. Sorting example 2
Conclusion: We should have started this project earlier, so that we would have had enough time to implement more things. On Friday morning, we were still writing the report. This class is not required for EE students, but I am quite interested in it and fortunately I am doing well. As an EE senior, I am more familiar with Verilog than my partner, who is majoring in computer engineering. In the end, we went through all the labs and finished our final project, and it is a great success. All the instructions work correctly, and we have reviewed what we learned in the textbook and verified the design. We gained a lot from implementing this pipelined datapath project, and we appreciate it.