
CS433: Computer Systems Organization, Fall 2007
Homework 1
Assigned: 8/30    Due: in class, 9/13
Points: 48 (Undergraduate), 57 (Graduate)

Instructions: Please write your name, NetID and an alias on your homework submissions for posting grades. We will use this alias throughout the quarter. Please show all the work you used to arrive at your answer. Answers without justification will not receive credit. Homeworks are due in class on the date posted.

1. [4 points] Consider the following two J-type MIPS instructions:

    JR   Reg
    J    name

The first one is encoded as 6 bits of opcode and 5 bits of register. The second one is encoded as 6 bits of opcode and 26 bits of offset. Assume that both instructions compute their jump destination in the EX stage. Draw the ID and EX stages of the corresponding MIPS pipeline, using a separate figure for each instruction.

2. [16 points] Suppose we have a MIPS processor with one branch delay slot. Consider codes (a) through (c):
Code (a):

            ADD   R1,R2,R3
            NOP
            BEQZ  R4, label
            [              ]    <- branch delay slot
            ADD   R10,R10,R10
            JMP   end
            NOP
    label:  ADD   R6,R6,R6
    end:

Code (b):

            ADD   R1,R2,R3
            NOP
            BEQZ  R1, label
            [              ]    <- branch delay slot
            ADD   R7,R9,R10
            JMP   end
            NOP
    label:  ADD   R7,R11,R12
    end:

Code (c):

            ADD   R1,R2,R3
            NOP
            BEQZ  R1, label
            [              ]    <- branch delay slot
            ADD   R13,R9,R10
            JMP   end
            NOP
    label:  ADD   R14,R9,R10
    end:
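Some background for parts (a) through (e): the standard ways of filling a branch delay slot differ in when the moved instruction does useful work.

- An instruction moved from before the branch is always useful.
- One taken from the branch target is useful only when the branch is taken.
- One taken from the fall-through path is useful only when the branch is not taken.

The Python sketch below is a minimal model that assumes the chosen instruction is safe to execute on the other path. It returns the expected fraction of executions in which the slot does useful work. `SlotSource` and `useful_slot_fraction` are hypothetical names. The sketch does not check legality, which is the part the questions ask you to reason about.

```python
from enum import Enum


class SlotSource(Enum):
    """Where the instruction placed in the delay slot comes from."""
    BEFORE = "before"              # independent instruction from before the branch
    TARGET = "target"              # copy of the first instruction at the branch target
    FALL_THROUGH = "fall_through"  # first instruction on the not-taken path


def useful_slot_fraction(source: SlotSource, p_taken: float) -> float:
    """Expected fraction of executions in which the delay-slot instruction is useful.

    Assumes the filled instruction is harmless on the path it was not taken from;
    if it is not, the slot must hold a NOP and this model does not apply.
    """
    if not 0.0 <= p_taken <= 1.0:
        raise ValueError("p_taken must be a probability")
    if source is SlotSource.BEFORE:
        return 1.0
    if source is SlotSource.TARGET:
        return p_taken
    return 1.0 - p_taken


if __name__ == "__main__":
    # In code (c) the branch is taken when R1 = R2 + R3 equals 0.
    for p in (0.6, 0.4, 0.5):
        for src in SlotSource:
            print(f"p_taken={p:.1f} {src.value:>12}: {useful_slot_fraction(src, p):.2f}")
```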




(a) [2 points] What is the best instruction to put in the delay slot in code (a)? Explain why.
(b) [2 points] What is the best instruction to put in the delay slot in code (b)? Explain why.
(c) [5 points] What is the best instruction to put in the delay slot in code (c) if R2+R3 = 0 60% of the time? Show the resulting code. In this case, which instructions are executed when R2+R3 = 0, and which are executed when R2+R3 != 0?
(d) [5 points] Repeat part (c) if R2+R3 = 0 40% of the time.
(e) [2 points] If R2+R3 = 0 50% of the time, which code do you prefer, the one from part (c) or the one from part (d)? Why?

3. [5 points] Consider the following chart of instruction type frequencies for a certain program.

    Instruction Type      Frequency   Cycles
    Loads                 25%         3
    Stores                5%          4
    Branches              30%         1
    ALU-Integer           30%         2
    ALU-Floating-Point    10%         5

(a) [1 point] Calculate the average CPI of the program.
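Average CPI is the frequency-weighted sum of the per-class cycle counts. In part (b) the instruction count is fixed, so execution time scales with CPI times the clock period. Below is a small Python sketch of both calculations, using the table above. The helper names `average_cpi`, `relative_time` and `with_cpi` are hypothetical.

```python
# Instruction mix from the table: class -> (frequency, cycles per instruction).
BASE_MIX = {
    "load": (0.25, 3),
    "store": (0.05, 4),
    "branch": (0.30, 1),
    "alu_int": (0.30, 2),
    "alu_fp": (0.10, 5),
}


def average_cpi(mix: dict[str, tuple[float, float]]) -> float:
    """Weighted CPI = sum over instruction classes of frequency * cycles."""
    total_freq = sum(freq for freq, _ in mix.values())
    if abs(total_freq - 1.0) > 1e-9:
        raise ValueError(f"frequencies sum to {total_freq}, expected 1.0")
    return sum(freq * cycles for freq, cycles in mix.values())


def relative_time(mix: dict[str, tuple[float, float]], clock_scale: float = 1.0) -> float:
    """Time per instruction in units of the original clock period (CPI * period).

    The instruction count is unchanged between designs, so it cancels out.
    """
    return average_cpi(mix) * clock_scale


def with_cpi(mix: dict[str, tuple[float, float]], **new_cycles: float) -> dict[str, tuple[float, float]]:
    """Copy of `mix` with the cycle counts of the named classes replaced."""
    updated = dict(mix)
    for name, cycles in new_cycles.items():
        freq, _ = updated[name]
        updated[name] = (freq, cycles)
    return updated


if __name__ == "__main__":
    base = relative_time(BASE_MIX)
    p1 = relative_time(with_cpi(BASE_MIX, alu_int=1, alu_fp=4), clock_scale=1.10)
    p2 = relative_time(with_cpi(BASE_MIX, alu_fp=2.5), clock_scale=1.05)
    print(f"base CPI = {average_cpi(BASE_MIX):.2f}")
    print(f"speedup, proposal 1 = {base / p1:.3f}")
    print(f"speedup, proposal 2 = {base / p2:.3f}")
```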

(b) [4 points] Suppose one of the following two proposals for speeding up the program is to be implemented. Proposal 1 decreases the average CPI of integer ALU operations to 1 and that of floating-point ALU operations to 4, but increases the clock cycle time by 10%. Proposal 2 targets only floating-point ALU operations, decreasing their CPI to 2.5, and increases the clock cycle time by 5%. Which proposal gives a higher speedup? Assume that the instruction count remains the same for both proposals.

4. [13 points] Consider the following code fragment.

    loop:  LD     R1, 0(R2)
           DADD   R1, R1, R3
           LD     R4, 0(R1)
           DADD   R1, R1, R4
           SD     R1, 0(R2)
           DADDI  R2, R2, -4
           BNEZ   R2, loop
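Before drawing the charts in parts (a) and (b), it helps to list the read-after-write dependences in the loop body. These are what force stalls; how many depends on forwarding and on the distance between producer and consumer.

The Python sketch below finds them. The register lists were transcribed by hand from the listing above, and `Instr`, `raw_dependences` and `trip_count` are hypothetical names. It also checks the trip count implied by R2 starting at 60 and being decremented by 4 until BNEZ falls through.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    text: str
    dest: Optional[str]      # register written, if any
    srcs: tuple[str, ...]    # registers read


# Loop body from question 4 (operands transcribed by hand).
LOOP = [
    Instr("LD    R1, 0(R2)", "R1", ("R2",)),
    Instr("DADD  R1, R1, R3", "R1", ("R1", "R3")),
    Instr("LD    R4, 0(R1)", "R4", ("R1",)),
    Instr("DADD  R1, R1, R4", "R1", ("R1", "R4")),
    Instr("SD    R1, 0(R2)", None, ("R1", "R2")),
    Instr("DADDI R2, R2, -4", "R2", ("R2",)),
    Instr("BNEZ  R2, loop", None, ("R2",)),
]


def raw_dependences(code):
    """Yield (producer, consumer, register, distance) for every RAW dependence
    inside one iteration, pairing each source with its most recent earlier writer."""
    for j, consumer in enumerate(code):
        for reg in consumer.srcs:
            for i in range(j - 1, -1, -1):
                if code[i].dest == reg:
                    yield i, j, reg, j - i
                    break


def trip_count(start: int, step: int) -> int:
    """Passes through a loop whose counter starts at `start`, changes by `step`
    each pass, and exits when it reaches exactly zero (as BNEZ does)."""
    if step >= 0 or start <= 0 or start % -step != 0:
        raise ValueError("counter must count down to exactly zero")
    return start // -step


if __name__ == "__main__":
    for i, j, reg, dist in raw_dependences(LOOP):
        print(f"{LOOP[i].text:<18} -> {LOOP[j].text:<18} via {reg} (distance {dist})")
    print("iterations:", trip_count(60, -4))
```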

Before the loop begins, the value of R2 is 60. Assume the system is the classic 5-stage integer RISC pipeline discussed in class and that memory accesses take 1 cycle. Answer the following questions:

(a) [5.5 points] Using a pipeline timing chart similar to Figure A.10 of the textbook (Figure A.5 in the 3rd edition), show the timing of the above code fragment as it executes. Show only one iteration of the loop plus the load of the following iteration. Assume the following:
- There is no bypassing/forwarding hardware.
- Branches are resolved in the ID stage, and the target instruction is fetched in the cycle after the one in which the branch is resolved.
- Register writes occur in the first half of the clock cycle and register reads in the second half.
Use IF, ID, EX, MEM, WB to indicate which stage an instruction is in, and S to indicate stalls. (Using R to mark the cycle in which a branch is resolved is recommended but not required.)
(b) [5.5 points] Repeat part (a), this time assuming there is bypassing/forwarding hardware.
(c) [2 points] Consider the following sequence of instructions for our 5-stage pipeline:

        LD    R6, 0(R4)
        DSUB  R1, R6, R2

Even with forwarding hardware, the load causes a one-cycle stall in the pipeline. A designer decides to speed up such code with an enhancement similar to the branch delay slot: the instruction following a load uses the operand values assigned before the load. In the code above, DSUB would use the value of R6 from before the LD instruction. Will this design choice work? If yes, what issues must be considered? If no, why not? Justify your answer.

5. [10 points] Consider a pipeline with the following structure: IF ID EX MEM WB. Assume that the EX stage is 1 cycle long for all ALU operations, loads and stores, 3 cycles long for FP add, and 7 cycles long for FP multiply. The pipeline supports full forwarding.
All other stages in the pipeline take one cycle each. Branches are resolved in the ID stage. WAW hazards are resolved by stalling the second instruction in its ID stage until the first instruction enters its MEM stage.
(a) [4 points] Show the timing of the following code through one loop iteration.
(b) [6 points] List all of the data hazards that cause stalls in the following code segment and explain why they occur. (Assume there are no structural hazards.)



    …   F0, 0(R1)
    …   F2, 8(R1)
    …   F4, F0, F2
    …   R1, R1, 16
    …   F4, F8, F0
    …   0(R2), F4
    …   R2, R2, 8
    …   R1, loop
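To count the RAW stalls in part (b), it helps to know when each result can be forwarded. The Python sketch below uses the EX lengths given in question 5. It makes three assumptions:

- ALU and FP results are forwarded from the end of the producer's last EX cycle, and load data from the end of its MEM cycle.
- Every consumer needs its operands at the start of its first EX cycle.
- Branches (which need operands in ID) and stores (which could receive data later, in MEM) are not modeled specially.

The names `EX_CYCLES`, `stage_cycles`, `earliest_dependent_ex` and `raw_stalls` are hypothetical.

```python
# EX-stage length in cycles for each instruction class, from question 5.
EX_CYCLES = {"alu": 1, "load": 1, "store": 1, "fadd": 3, "fmul": 7}


def stage_cycles(kind: str, ex_start: int) -> dict:
    """Cycle numbers of the EX, MEM and WB stages for an instruction whose
    first EX cycle is `ex_start` (MEM and WB take one cycle each)."""
    ex_len = EX_CYCLES[kind]
    return {
        "EX": list(range(ex_start, ex_start + ex_len)),
        "MEM": ex_start + ex_len,
        "WB": ex_start + ex_len + 1,
    }


def earliest_dependent_ex(kind: str, ex_start: int) -> int:
    """First cycle in which a dependent instruction may begin EX, assuming
    full forwarding into the start of EX."""
    timing = stage_cycles(kind, ex_start)
    if kind == "load":
        # Loaded data is available only at the end of MEM.
        return timing["MEM"] + 1
    # ALU and FP results are forwarded from the end of the last EX cycle.
    return timing["EX"][-1] + 1


def raw_stalls(producer: str, producer_ex: int, consumer_ex: int) -> int:
    """Stall cycles for a consumer that, absent the hazard, would begin EX
    in cycle `consumer_ex`."""
    return max(0, earliest_dependent_ex(producer, producer_ex) - consumer_ex)


if __name__ == "__main__":
    # A dependent instruction issued right behind its producer would
    # otherwise begin EX one cycle after the producer does.
    for kind in ("alu", "load", "fadd", "fmul"):
        print(f"{kind:>4} -> dependent: {raw_stalls(kind, 4, 5)} stall cycle(s)")
```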

The following questions are for graduate students only.

6. [7 points] Three enhancements with the following speedups are proposed for a new architecture (assume they apply to non-overlapping parts of the execution): Speedup1 = 40, Speedup2 = 20, Speedup3 = 10.
(a) [2 points] If enhancements 1 and 2 are each usable for 40% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?
(b) [2 points] Assume the three enhancements are used for 20%, 30%, and 20% of the original execution time, respectively. Now consider the enhanced version. For what fraction of the new execution time is no enhancement in use?
(c) [3 points] Assume that for some benchmark, enhancement 1 is used for 20% of the original execution, enhancement 2 for 20%, and enhancement 3 for 60%. We want to maximize performance. If only one enhancement can be implemented, which should it be? If two enhancements can be implemented, which should be chosen?

7. [2 points] Consider two different 5-stage pipeline machines (IF ID EX MEM WB). The first machine resolves branches in the ID stage, uses one branch delay slot, and can fill 60% of the delay slots with useful instructions. The second machine resolves branches in the EX stage and uses a predict-not-taken scheme. Assume that the cycle times of the machines are identical. Assume that 15% of the instructions are branches, that 30% of branches are taken, and that stalls are due to branches alone. Which machine is faster? Justify your answer by showing your work to get credit.
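Questions 6 and 7 both reduce to short formulas. For non-overlapping enhancements, Amdahl's law generalizes to overall speedup = 1 / ((1 - sum of f_i) + sum of f_i / s_i), where f_i are fractions of the original execution time. With stalls due only to branches, CPI is 1 + (branch frequency) x (fraction of branches that stall) x (penalty).

The Python sketch below encodes both. The function names are hypothetical. The penalties in the usage lines are assumptions:

- Each unfilled delay slot costs one cycle.
- A taken branch resolved in EX under predict-not-taken costs two cycles, taking the target to be fetched in the cycle after resolution, as in question 4.

```python
def amdahl_speedup(fractions, speedups):
    """Overall speedup when non-overlapping fractions f_i of the ORIGINAL
    execution time are sped up by factors s_i."""
    if len(fractions) != len(speedups):
        raise ValueError("need one speedup per fraction")
    unenhanced = 1.0 - sum(fractions)
    if unenhanced < -1e-12:
        raise ValueError("fractions of the original execution cannot exceed 1")
    new_time = max(unenhanced, 0.0) + sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / new_time


def unenhanced_share(fractions, speedups):
    """Fraction of the NEW execution time during which no enhancement is active."""
    unenhanced = max(1.0 - sum(fractions), 0.0)
    new_time = unenhanced + sum(f / s for f, s in zip(fractions, speedups))
    return unenhanced / new_time


def branch_cpi(branch_freq, stall_fraction, penalty_cycles, base_cpi=1.0):
    """CPI when the only stalls come from branches."""
    return base_cpi + branch_freq * stall_fraction * penalty_cycles


if __name__ == "__main__":
    print("speedup:", amdahl_speedup([0.2, 0.3, 0.2], [40, 20, 10]))
    print("unenhanced share:", unenhanced_share([0.2, 0.3, 0.2], [40, 20, 10]))
    # Question 7 under the penalty assumptions stated above.
    print("machine 1 CPI:", branch_cpi(0.15, 1 - 0.60, 1))
    print("machine 2 CPI:", branch_cpi(0.15, 0.30, 2))
```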