Ece 4750 Pset 4

ECE 4750 Computer Architecture, Fall 2018
Problem Set #4
http://www.csl.cornell.edu/courses/ece4750
School of Electrical and Computer Engineering
Cornell University
revision: 2018-11-19-13-19
The problem sets are intended to provide you with an opportunity to put the concepts you have been
learning in lecture into practice. We encourage you to discuss the information and concepts covered
in lecture with other students in the course; one of the best ways to learn new and challenging
material is through substantive discussion with your peers. You can give “consulting” help to or
receive “consulting” help from other students in the course to help clarify your understanding of
the problems, but under no circumstances should you directly copy solutions from another student.
If you give or receive “consulting” help you should explicitly include mention of this in your
problem set. Each student must turn in his or her own solutions to the problems, and these solutions
are expected to accurately reflect your understanding of the material. Please see the course
syllabus if you have any questions concerning the collaboration policy for the course.
Problem sets must be submitted electronically in PDF format through the online CMS system. No
other format or method of submission is allowed! Problem sets are due at 11:59pm on the date listed
on the course webpage. You can continue to resubmit your file as many times as you would like
up until the deadline, so please feel free to upload early and often. No late submissions will be
accepted and no extensions will be granted except for a family or medical emergency. If you miss
the deadline, then your assignment will not count as submitted, and you will receive a zero!
As an exception to this rule, each student has four slip days that may be used to secure 24-hour
extensions on problem sets. The maximum extension for any given assignment is 48 hours.
Students are strongly encouraged to type their solutions; if necessary students can scan
handwritten diagrams and combine these diagrams with typed solutions. Please use a reasonably large
font size, especially in tables. If students absolutely must write up their entire solutions by hand,
then they must scan their final version into a legible PDF. Illegible scans will not be graded. A
digital photograph will almost certainly not be legible. Please bold, circle, or otherwise highlight
your specific answer where appropriate. All pages must be 8.5”×11” using a portrait orientation.
If you cannot print your problem set using standard printer settings, then we also will not be able
to print your problem set. If we cannot print your problem set, then we will not be able to grade
your hard work. Please try to include your full name and NetID on every page. Number the pages. We
do not want to lose any pages.
List of Problems
1 Register Renaming
  1.A Architectural RAW, WAW, and WAR Dependencies
  1.B Pipeline Diagram with Register Renaming
  1.C Register Renaming with Pointers in the IQ/ROB
  1.D Register Renaming with Values in the IQ/ROB
2 Branch Prediction
  2.A Two-Bit Saturating Counter Branch History Table
  2.B Two-Level Adaptive Branch Predictor to Exploit Temporal Correlation
  2.C Two-Level Adaptive Branch Predictor to Exploit Spatial Correlation
  2.D Branch Predictor Comparison
Problem 1. Register Renaming

In this problem, we will be looking at the detailed microarchitectural state required to implement
register renaming both with pointers and with values in the issue queue and reorder buffer. We
will be executing the following assembly sequence:
1 mul x1, x2, x3
2 mul x4, x1, x5
3 add x6, x7, x8
4 mul x1, x2, x5
5 add x6, x6, x9
For this problem you should assume that each microarchitecture follows the details described in
the lecture notes. Assume a fully-pipelined four-cycle multiplier. Do not assume that all of the
instructions are waiting in the issue queue. You must explicitly fetch and decode each instruction
in-order.
Part 1.A Architectural RAW, WAW, and WAR Dependencies
List the assembly instructions above and clearly draw arrows to indicate the architectural RAW,
WAW, and WAR dependencies between instructions. You must clearly disambiguate the different
types of dependencies.
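As a reminder of the three definitions, here is a minimal C sketch that classifies the dependencies between one earlier and one later instruction. The `Instr` struct and the helper names are hypothetical and chosen only for illustration. The check is purely pairwise: it ignores any intervening instruction that overwrites a register, and it does not treat x0 specially.

```c
#include <stdbool.h>

/* Hypothetical three-operand instruction: one destination, two sources,
   each given as an architectural register number. */
typedef struct {
    int dest;
    int src0;
    int src1;
} Instr;

/* RAW: the later instruction reads a register the earlier one writes. */
bool has_raw(Instr earlier, Instr later) {
    return later.src0 == earlier.dest || later.src1 == earlier.dest;
}

/* WAW: both instructions write the same register. */
bool has_waw(Instr earlier, Instr later) {
    return later.dest == earlier.dest;
}

/* WAR: the later instruction writes a register the earlier one reads. */
bool has_war(Instr earlier, Instr later) {
    return later.dest == earlier.src0 || later.dest == earlier.src1;
}
```

For a made-up pair such as `add x3, x1, x2` followed by `sub x1, x3, x4`, the helpers report a RAW dependency through x3 and a WAR dependency through x1, but no WAW dependency.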
Part 1.B Pipeline Diagram with Register Renaming
Draw the pipeline diagram for the above assembly code with register renaming for an IO2L
microarchitecture with an in-order front-end, out-of-order issue and writeback/completion with late
commit. Label each column with the cycle number starting with cycle 0.
Part 1.C Register Renaming with Pointers in the IQ/ROB
Create two tables similar to the ones in Figures 1 and 2. Use the tables to show how the
microarchitectural state (i.e., the rename table, free list, issue queue, and reorder buffer) changes
over time for a register renaming scheme with pointers in the issue queue and reorder buffer. Assume
that the physical register file and the architectural register file are combined into a single
unified register file (i.e., instead of copying values into a separate architectural register file in
commit, we simply update pointers in the architectural rename table). The execution should directly
correspond to the pipeline diagram from the previous part. The first three cycles have already been
filled in for you. Carefully handle deallocating physical registers and placing them back on the free
list.
Fill in the D, I, W, C columns with the instruction number (1–5) to indicate when each instruction is
in the corresponding stage. Rename table entries should indicate the correct architectural to physical
mapping. Use an asterisk symbol (*) to indicate when a rename table entry is pending. Issue
queue entries should be of the form pdest/psrc0/psrc1 where pdest is the physical destination register
specifier and psrc0/psrc1 are the physical source register specifiers. Use an asterisk symbol (*) to
indicate when a physical source register specifier is pending (i.e., the source data is not ready in the
physical register file yet). Note that to simplify our tables, you only need to show the issue queue
entry on the cycle after that entry is written into the issue queue (this allows us to represent the
issue queue as a single column). The reorder buffer entries should be of the form pdest/adest/ppreg
where pdest is the physical destination register specifier, adest is the architectural destination
register specifier, and ppreg is the previous physical register specifier. Use an asterisk symbol (*)
to indicate when an entry in the ROB is not finished. You can use a vertical line (|) or vertical
dots (:) to indicate that a field does not change on that cycle. Note that the physical register
specifiers are denoted in hex from p0 to pF. All state should be shown in the table as it exists at
the beginning of the cycle.

       Stage        RT
Cycle  D  I  W  C   x1   x2  x3  x4  x5  x6  x7  x8  x9   Free List            IQ
0                   p0   p1  p2  p3  p4  p5  p6  p7  p8   p9, pA, pB, pC, pD
1      1            :    :   :   :   :   :   :   :   :    p9, pA, pB, pC, pD
2      2  1         p9*  :   :   :   :   :   :   :   :    pA, pB, pC, pD       p9/p1/p2
...
Figure 1: Microarchitectural State (RT/FL/IQ) for Reg Renaming with Pointers in the IQ/ROB

       ROB
Cycle  0           1   2   3   4
0
1
2      p9*/x1/p0
...
Figure 2: Microarchitectural State (ROB) for Reg Renaming with Pointers in the IQ/ROB
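The bookkeeping described above can be sketched in C as follows. This is only an illustrative model, not the scheme from the lecture notes: the `RenameState` and `Renamed` types and both function names are hypothetical, the free list is assumed to behave as a FIFO, and the previous physical register (ppreg) is assumed to return to the free list when the instruction commits.

```c
#include <assert.h>

#define NUM_ARCH_REGS 32
#define NUM_PHYS_REGS 16   /* p0..pF, as in the problem statement */

typedef struct {
    int rename_table[NUM_ARCH_REGS]; /* architectural -> physical mapping */
    int free_list[NUM_PHYS_REGS];    /* circular FIFO of free physical regs */
    int head;                        /* index of the next register to allocate */
    int count;                       /* number of registers currently free */
} RenameState;

/* What decode records for one renamed instruction. */
typedef struct {
    int pdest; /* newly allocated physical destination */
    int psrc0; /* physical register holding source 0 */
    int psrc1; /* physical register holding source 1 */
    int ppreg; /* previous mapping of the architectural destination */
} Renamed;

Renamed rename_instr(RenameState *s, int adest, int asrc0, int asrc1) {
    assert(s->count > 0); /* a real decode stage would stall instead */
    Renamed r;
    /* Look up the sources before overwriting the destination mapping. */
    r.psrc0 = s->rename_table[asrc0];
    r.psrc1 = s->rename_table[asrc1];
    r.ppreg = s->rename_table[adest];
    r.pdest = s->free_list[s->head];
    s->head = (s->head + 1) % NUM_PHYS_REGS;
    s->count--;
    s->rename_table[adest] = r.pdest;
    return r;
}

/* Assumption: ppreg is deallocated at commit and appended to the free list. */
void commit_instr(RenameState *s, Renamed r) {
    assert(s->count < NUM_PHYS_REGS);
    int tail = (s->head + s->count) % NUM_PHYS_REGS;
    s->free_list[tail] = r.ppreg;
    s->count++;
}
```

Starting from the cycle 0 state in Figure 1 (x1 through x9 mapped to p0 through p8, free list p9 through pD), renaming `mul x1, x2, x3` with this sketch yields pdest = p9, sources p1 and p2, and ppreg = p0, which matches the pre-filled IQ entry p9/p1/p2 and ROB entry p9*/x1/p0.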
Part 1.D Register Renaming with Values in the IQ/ROB
Create a table similar to the one in Figure 3. Use the table to show how the microarchitectural state
(i.e., RT, IQ, ROB) changes over time for a register renaming scheme with values in the issue queue
and the reorder buffer. The execution should directly correspond to the pipeline diagram from the
previous part. The first column and first three cycles have already been filled in for you. Carefully
handle deallocating physical registers and placing them back on the free list.
Fill in the D, I, W, C columns with the instruction number (1–5) to indicate when each instruction
is in the corresponding stage. Rename table entries should indicate the correct architectural to
physical mapping. If there is no mapping in the rename table (i.e., that entry is invalid), then we get
the value from the architectural register file. Use an asterisk symbol (*) to indicate when a rename
table entry is pending. Issue queue entries should be of the form pdest/src0/src1 where pdest is the
physical destination register specifier and src0/src1 are either the physical source register specifiers
or the actual source data. Use the architectural register specifier for the source in the table if the
data is read from the architectural register file and copied into the issue queue in the decode stage.
Use the physical register specifier if the source is not ready yet. For consistency, use an asterisk
symbol (*) to indicate when a physical source register specifier is pending (i.e., the source data is
being written by an instruction in-flight that has not finished yet). Note that to simplify our tables,
you only need to show the issue queue entry on the cycle after that entry is written into the issue
queue (this allows us to represent the issue queue as a single column). The reorder buffer entries
should be of the form pdest/adest where pdest is the physical destination register specifier (this is
always the same as the ROB entry number) and adest is the architectural destination register specifier.
You can use a vertical line (|) or vertical dots (:) to indicate that a field does not change on that
cycle. All state should be shown in the table as it exists at the beginning of the cycle.

       Stage        RT                                               ROB
Cycle  D  I  W  C   x1   x2  x3  x4  x5  x6  x7  x8  x9   IQ         0        1   2   3   4
0
1      1
2      2  1         p0*                                   p0/x2/x3   p0*/x1
...
Figure 3: Microarchitectural State for Reg Renaming with Values in the IQ/ROB
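To make the "specifier or actual data" distinction concrete, here is a small C sketch of an issue queue entry whose operands hold either a value or a tag to wait on. The `Operand` and `IQEntry` types and the `wakeup` and `can_issue` helpers are hypothetical names, and the broadcast-on-writeback behavior is an assumption about a typical design rather than a description of the lecture-note microarchitecture.

```c
#include <stdbool.h>

/* One source operand in the issue queue. */
typedef struct {
    bool ready; /* true: value holds the data; false: still waiting on tag */
    int  tag;   /* physical register (ROB entry) that will produce the data */
    int  value; /* source data, meaningful only when ready is true */
} Operand;

typedef struct {
    int     pdest; /* physical destination (same as the ROB entry number) */
    Operand src0;
    Operand src1;
} IQEntry;

/* Assumed writeback broadcast: capture the value if an operand waits on tag. */
void wakeup(IQEntry *e, int tag, int value) {
    if (!e->src0.ready && e->src0.tag == tag) {
        e->src0.value = value;
        e->src0.ready = true;
    }
    if (!e->src1.ready && e->src1.tag == tag) {
        e->src1.value = value;
        e->src1.ready = true;
    }
}

/* An entry may issue once both operands hold data. */
bool can_issue(const IQEntry *e) {
    return e->src0.ready && e->src1.ready;
}
```

In the table notation, an operand with `ready == false` corresponds to a pending physical specifier marked with an asterisk, and an operand with `ready == true` corresponds to data already copied into the issue queue.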
Problem 2. Branch Prediction

Consider the C code shown in Figure 4. This function processes an input signal (stored in array
src), which, for example, might represent a sampled audio signal. The function performs two
operations on the input signal to produce a processed output signal (stored in array dest). First,
the function “saturates” the signal such that any input sample greater than the given limit is set
to the limit value. Second, the function counts the number of non-zero input samples and returns
this count.
The corresponding assembly code is shown in Figure 5. Assume that when the assembly code
begins, the destination array pointer is stored in x1, the source array pointer is stored in x2, the
size of the two arrays is stored in x3, the limit value is stored in x4, and the return value should
ultimately be written to x5. Note that this function requires the following three branches:
• Branch B0 – Used to saturate the input signal. Branch is taken if src[i] ≤ limit.
• Branch B1 – Used to count number of non-zero input samples. Branch is taken if src[i] = 0.
• Branch B2 – Used to implement the loop.
In this problem, we will explore how three different branch predictors perform on this assembly
code. For all parts, assume that there is no unwanted aliasing and that the branch predictors are
updated immediately after the branch is predicted. This is obviously unrealistic, since we need to
first resolve the branch before updating the predictor, but it simplifies our analysis. For all parts,
limit is 10 and the input signal includes the following 20 samples:
0, 0, 12, 15, 0, 0, 11, 17, 0, 0, 11, 13, 9, 0, 12, 15, 0, 8, 12, 18
This means the size input argument is 20.
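The three branch conditions listed above can be written out as a short C sketch. The `Outcomes` struct and the `branch_outcomes` function are hypothetical helpers introduced only for illustration; the conditions themselves follow the branch descriptions above and the loop structure of the assembly code, where B2 is taken whenever another iteration remains.

```c
/* Taken (1) or not-taken (0) resolution of each branch in one iteration. */
typedef struct {
    int b0;
    int b1;
    int b2;
} Outcomes;

/* sample is src[i]; i is the zero-based iteration index. */
Outcomes branch_outcomes(int sample, int limit, int i, int size) {
    Outcomes o;
    o.b0 = (sample <= limit); /* B0: skip the saturation when within limit */
    o.b1 = (sample == 0);     /* B1: skip the increment for a zero sample */
    o.b2 = (i + 1 < size);    /* B2: loop back while iterations remain */
    return o;
}
```

For example, a sample of 0 in iteration 0 resolves all three branches as taken, which agrees with the actual-resolution entries already filled in for the first row of the figures in the parts below.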
int process_signal( int dest[], int src[], int size, int limit )
{
  int num_non_zeros = 0;
  for ( int i = 0; i < size; i++ ) {
    dest[i] = src[i];        // Ensure samples in destination array
    if ( src[i] > limit )    // do not exceed limit
      dest[i] = limit;       //

    if ( src[i] > 0 )        // Count number of non-zero samples
      num_non_zeros++;       //
  }
  return num_non_zeros;
}
Figure 4: C Code for Signal Processing Function
  # x1 = dest_ptr, x2 = src_ptr, x3 = size, x4 = limit
  # x5 = return value

  addi x5, x0, 0         # num_non_zeros = 0

loop:

  # Ensure samples in destination array do not exceed limit

  lw   x6, 0(x2)         # temp_a = *src_ptr
  addi x7, x6, 0         # temp_b = temp_a
  slt  x8, x4, x6        # if ( limit >= temp_a )
  beq  x8, x0, L1        #   goto L1                       (Branch B0)
  addi x7, x4, 0         # temp_b = limit
L1:                      # L1:
  sw   x7, 0(x1)         # *dest_ptr = temp_b

  # Count number of non-zero samples

  beq  x6, x0, L2        # if ( temp_a == 0 ) goto L2      (Branch B1)
  addi x5, x5, 1         # num_non_zeros++
L2:                      # L2:

  # Increment pointers and iterate

  addi x1, x1, 4         # dest_ptr++
  addi x2, x2, 4         # src_ptr++
  addi x3, x3, -1        # size--
  blt  x0, x3, loop      # if ( size > 0 ) goto loop       (Branch B2)
Figure 5: Assembly Code for Signal Processing Function
Part 2.A Two-Bit Saturating Counter Branch History Table
We begin with a basic two-bit saturating counter finite-state machine shown in Figure 6, implemented
with the simple one-level branch history table (BHT) shown in Figure 7. There are four
states in the finite-state machine:
• Strongly Taken – (ST) predict taken
• Weakly Taken – (WT) predict taken
• Weakly Not-Taken – (WNT) predict not-taken
• Strongly Not-Taken – (SNT) predict not-taken
Assume that all entries in the BHT are initialized to the weakly taken (WT) state.
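The four states can be modeled as a counter from 0 to 3, as in the minimal C sketch below. The numeric encoding and the names `Counter`, `predict_taken`, and `update` are assumptions made for illustration, and the sketch assumes Figure 6 is the usual saturating counter that moves one state per resolved branch.

```c
/* Assumed encoding: larger values lean further toward taken. */
typedef enum { SNT = 0, WNT = 1, WT = 2, ST = 3 } Counter;

/* Predict taken in either of the two taken states. */
int predict_taken(Counter c) {
    return c >= WT;
}

/* Move one step toward the actual resolution, saturating at SNT and ST. */
Counter update(Counter c, int taken) {
    if (taken) {
        return (c == ST) ? ST : (Counter)(c + 1);
    }
    return (c == SNT) ? SNT : (Counter)(c - 1);
}
```

With this encoding, a taken branch in the WT state moves the entry to ST, which is the transition described for branch B0 in the first iteration.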
Create a table similar to the one shown in Figure 8. There should be twenty rows, one for each
iteration of the loop. There are columns for the branch predictor state, prediction (column labelled
P), and actual resolution (column labelled A) for each of the three branches. In this table, the branch
predictor state should always reflect the state of the predictor that is used to make the prediction
(i.e., before we update the state for that branch resolution). So for example, the BHT entry for
branch B0 is initially WT and thus we predict taken for the first iteration of the loop when src[0]
is zero. The branch is indeed taken, and thus we update the BHT entry for branch B0 to strongly
taken (ST) as indicated by the BHT entry for the second iteration. Calculate the branch predictor
accuracy for each of the three branches individually, and then also for the entire function. Hint:
Fill in the actual resolution column for all rows and all branches first, then fill in the BHT state, and finally
fill in the prediction column.
Figure 6: Two-Bit Saturating Counter FSM

Figure 7: BHT Implementation
             Branch B0      Branch B1      Branch B2
i   src[i]   BHT  P  A      BHT  P  A      BHT  P  A
0   0        WT   T  T
1   0        ST   T  T
2   12       ...
...
Figure 8: Two-Bit Saturating Counter BHT Execution
Part 2.B Two-Level Adaptive Branch Predictor to Exploit Temporal Correlation
The input signal has interesting temporal correlation that we may be able to capture with a more
sophisticated two-level adaptive branch predictor. Consider the two-level BHT shown in Figure 9.
In this predictor, we use a branch history shift register table (BHSRT) to capture a local branch
history pattern for each branch. The patterns are used to choose between one of several BHTs.
Each entry in the BHT is a two-bit saturating counter initialized to the weakly taken (WT) state.
For this problem, assume that each entry in the BHSRT is three bits. This means that we need eight
BHTs. Assume that all entries in the BHSRT are initialized to zero.
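One way to picture this lookup is the C sketch below. It is a simplified model under stated assumptions: the type and function names are hypothetical, each branch is indexed directly by a small id (0 for B0, 1 for B1, 2 for B2) to reflect the no-aliasing assumption, counters use 0 to 3 for SNT through ST, and the newest resolution is shifted into the least significant history bit with 1 meaning taken.

```c
#define HIST_BITS    3
#define NUM_PATTERNS (1 << HIST_BITS) /* eight BHTs for a three-bit history */
#define NUM_BRANCHES 3

typedef struct {
    unsigned bhsrt[NUM_BRANCHES];         /* local history per branch */
    int bht[NUM_PATTERNS][NUM_BRANCHES];  /* two-bit counters, 0..3 */
} LocalPredictor;

/* Histories start at zero; every counter starts at weakly taken (2). */
void local_init(LocalPredictor *p) {
    for (int b = 0; b < NUM_BRANCHES; b++) {
        p->bhsrt[b] = 0;
        for (int h = 0; h < NUM_PATTERNS; h++) {
            p->bht[h][b] = 2;
        }
    }
}

/* Returns the prediction (1 = taken), then updates counter and history. */
int local_predict_and_update(LocalPredictor *p, int branch, int taken) {
    unsigned h = p->bhsrt[branch];
    int *ctr = &p->bht[h][branch];   /* the history selects which BHT */
    int prediction = (*ctr >= 2);
    if (taken && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
    p->bhsrt[branch] = ((h << 1) | (taken ? 1u : 0u)) & (NUM_PATTERNS - 1);
    return prediction;
}
```

After one taken resolution of B0, this model leaves B0's history at 001 and selects a fresh WT counter for the next prediction, consistent with the first two rows of Figure 10.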
Create a table similar to the one shown in Figure 10. Calculate the branch predictor accuracy for
each of the three branches individually, and then also for the entire function. Hint: Fill in the
actual resolution column for all rows and all branches first (should be the same as in the previous part), then
fill in the BHSRT state, then fill in the BHT state, and finally fill in the prediction column.
Figure 9: Two-Level BHT for Temporal Correlation

             Branch B0             Branch B1             Branch B2
i   src[i]   BHSRT  BHT  P  A      BHSRT  BHT  P  A      BHSRT  BHT  P  A
0   0        000    WT   T  T
1   0        001    WT   T  T
2   12       ...
...
Figure 10: Two-Level BHT for Temporal Correlation Execution
Part 2.C Two-Level Adaptive Branch Predictor to Exploit Spatial Correlation
The branches in the assembly sequence have spatial correlation that we may be able to capture with
a more sophisticated two-level adaptive branch predictor. Consider the two-level BHT shown in
Figure 11. In this predictor, we use a single branch history shift register (BHSR) to capture a global
branch history pattern across all branches. This pattern is used to choose between one of several
BHTs. Each entry in the BHT is a two-bit saturating counter initialized to the weakly taken (WT)
state. For this problem, assume that the BHSR is only one bit. This means that we need two BHTs.
Assume that the BHSR is initially zero.
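The global-history variant can be sketched in C in much the same way. Again this is only an illustrative model: the names are hypothetical, branches are indexed directly by id (0 for B0, 1 for B1, 2 for B2) under the no-aliasing assumption, counters use 0 to 3 for SNT through ST, and a history bit of 1 is assumed to mean that the most recent branch of any kind was taken.

```c
#define NUM_BRANCHES 3

typedef struct {
    unsigned bhsr;              /* one-bit global history */
    int bht[2][NUM_BRANCHES];   /* one BHT per history value, counters 0..3 */
} GlobalPredictor;

/* History starts at zero; every counter starts at weakly taken (2). */
void global_init(GlobalPredictor *g) {
    g->bhsr = 0;
    for (int h = 0; h < 2; h++) {
        for (int b = 0; b < NUM_BRANCHES; b++) {
            g->bht[h][b] = 2;
        }
    }
}

/* Returns the prediction (1 = taken), then updates counter and history. */
int global_predict_and_update(GlobalPredictor *g, int branch, int taken) {
    int *ctr = &g->bht[g->bhsr][branch]; /* shared history selects the BHT */
    int prediction = (*ctr >= 2);
    if (taken && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
    g->bhsr = taken ? 1u : 0u;           /* a one-bit history keeps only the latest */
    return prediction;
}
```

Running B0, B1, and B2 as taken for the first iteration leaves the history at 1, B0's counter under history 1 still at WT, and B1's counter under history 1 at ST, which lines up with the pre-filled rows of Figure 12.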
Create a table similar to the one shown in Figure 12. Calculate the branch predictor accuracy for
each of the three branches individually, and then also for the entire function. Hint: Fill in the
actual resolution column for all rows and all branches first (should be the same as in the previous parts), then
fill in the BHSR state, then fill in the BHT state, and finally fill in the prediction column.
Figure 11: Two-Level BHT for Spatial Correlation

             Branch B0            Branch B1            Branch B2
i   src[i]   BHSR  BHT  P  A      BHSR  BHT  P  A      BHSR  BHT  P  A
0   0        0     WT   T  T      1     WT   T  T      1     WT   T  T
1   0        1     WT   T  T      1     ST   T  T
2   12       ...
...
Figure 12: Two-Level BHT for Spatial Correlation Execution
Part 2.D Branch Predictor Comparison
Create a table similar to the one in Figure 13 to record the branch prediction accuracies for all three
branch predictors. For each predictor and each branch, discuss why the accuracy is better or worse
than the other predictors on the same branch. Your answer should reflect your understanding of
the temporal and spatial correlation in this example and how it impacts the various prediction
accuracies.
               Two-Bit FSM   Two-Level Temporal   Two-Level Spatial
               Accuracy      Accuracy             Accuracy
Branch B0
Branch B1
Branch B2
All Branches
Figure 13: Summary of Branch Predictor Accuracies