
Superscalar Processor Simulator

Submitted for: ECE 636 (Fall 2013)

Submitted by: Roshni Uppala, Damon Stachler
Submitted on: 25th November, 2013
1 OBJECTIVE
A superscalar processor is one that can execute multiple instructions in a single cycle. This report presents a trace-driven execution simulator for a superscalar processor, which is analyzed for several classes of applications: scientific, integer, and multimedia.

Fig 1: Super Scalar Processor definition
2 OVERALL PROGRAM DESIGN
In this section, we present a schematic flow diagram for every stage implemented in the source code sim.c. The program models the pipeline stages of the architecture: fetch, decode (Fig 2), dispatch, issue, execute, and commit.
2.1 FETCH AND DECODE STAGE:
The fetch stage brings instructions into the IF_ID buffer, up to the superscalar width per cycle. Each instruction is fetched from the L1 cache, the L2 cache, or main memory, depending on the cache hit rates.
In an actual processor, the instruction cache hit rate is an average over all instructions. In contrast, this project does not model a cache; instead, a random number decides whether each instruction access hits in a cache, and the corresponding access latency is charged. Once all the instructions (up to the superscalar width) have been accessed, they are placed in the IF_ID buffer in order. Note that instructions are accessed from the caches in order.
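The random hit-rate model described above can be sketched as follows. This is a minimal illustration, not the code from sim.c: the function name and parameters are assumptions, and the default rates and latencies are the ones used later in this report (98% L1, 70% L2; 1, 6, and 60 cycles).

```c
#include <stdlib.h>

/* Hypothetical fetch-latency model: a random number decides whether an
 * instruction access hits in L1, in L2, or goes to main memory, and the
 * corresponding latency is charged. */
int fetch_latency(double l1_hitrate, int l1_lat,
                  double l2_hitrate, int l2_lat, int mem_lat)
{
    double r = (double)rand() / RAND_MAX;   /* uniform in [0,1] */
    if (r < l1_hitrate)
        return l1_lat;                      /* L1 instruction cache hit */
    r = (double)rand() / RAND_MAX;
    if (r < l2_hitrate)
        return l2_lat;                      /* L1 miss, L2 hit */
    return mem_lat;                         /* missed both: main memory */
}
```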
The decode stage decodes the instructions held in the IF_ID buffer. It identifies the instruction boundaries and reads each instruction's opcode, source registers, and destination register. After all the instructions are decoded, they are moved into the dispatch buffer in order.
In this project, the IF_ID buffer and the dispatch buffer are implemented as circular lists.
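A circular buffer of the kind used for the IF_ID and dispatch buffers might look like the sketch below. The struct and function names are illustrative, not taken from sim.c; the point is that pushing and popping at head/tail preserves program order.

```c
/* Minimal circular-buffer sketch for an in-order instruction queue. */
#define BUF_SIZE 16

typedef struct {
    int entries[BUF_SIZE];    /* decoded instruction identifiers */
    int head, tail, count;    /* head = oldest, tail = next free slot */
} CircBuf;

int buf_push(CircBuf *b, int insn)      /* enqueue in program order */
{
    if (b->count == BUF_SIZE) return 0; /* buffer full: stall fetch */
    b->entries[b->tail] = insn;
    b->tail = (b->tail + 1) % BUF_SIZE;
    b->count++;
    return 1;
}

int buf_pop(CircBuf *b, int *insn)      /* dequeue the oldest entry */
{
    if (b->count == 0) return 0;        /* buffer empty */
    *insn = b->entries[b->head];
    b->head = (b->head + 1) % BUF_SIZE;
    b->count--;
    return 1;
}
```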

Fig 2 : Fetch and Decode stage
2.2 DISPATCH STAGE:

Instructions move in order from the dispatch buffer into the rename register file (RRF), the reorder buffer (ROB), and their respective reservation stations (RS). Four reservation stations are used in this project: integer (RS_int), floating point (RS_fpp), memory (RS_mem), and branch (RS_br) instructions. If any of the ROB, RRF, or RS has no vacancy for a new instruction, the instruction is not dispatched and waits in the dispatch buffer until all of them have a vacancy. Once an instruction has been dispatched, three main operations are performed: source read (reading the source register values from the ARF, or from the RRF entry indicated by the RRF tag in the ARF), destination allocate (allocating a destination register in the RRF and updating its rrf_tag in the ARF), and register update (forwarding result data to instructions that are still waiting for their source register values, i.e., whose operands are not yet valid and which are therefore not yet ready for execution).
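The source-read and destination-allocate operations can be sketched as below. The ARF/RRF field names and table sizes here are assumptions for illustration (the RRF size matches the 30 entries used later for fpppp.tra), not the actual declarations in sim.c or globals.h.

```c
#define NUM_ARF 32
#define NUM_RRF 30

typedef struct { int data; int busy; int rrf_tag; } ARFEntry;
typedef struct { int data; int valid; int in_use; } RRFEntry;

ARFEntry arf[NUM_ARF];
RRFEntry rrf[NUM_RRF];

/* Source read: take the value from the ARF, or from the RRF entry the
 * ARF tag points to; returns 1 and the value only if it is available. */
int source_read(int reg, int *value)
{
    if (!arf[reg].busy) { *value = arf[reg].data; return 1; }
    if (rrf[arf[reg].rrf_tag].valid) {
        *value = rrf[arf[reg].rrf_tag].data;
        return 1;
    }
    return 0;   /* operand not ready: wait in the reservation station */
}

/* Destination allocate: claim a free RRF entry and record its tag in
 * the ARF so that later readers are redirected to it. */
int dest_allocate(int reg)
{
    for (int t = 0; t < NUM_RRF; t++) {
        if (!rrf[t].in_use) {
            rrf[t].in_use = 1;
            rrf[t].valid  = 0;        /* result not produced yet */
            arf[reg].busy = 1;
            arf[reg].rrf_tag = t;
            return t;
        }
    }
    return -1;  /* no free rename register: stall dispatch */
}
```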


Fig 3 : Dispatch stage
2.3 ISSUE AND EXECUTE STAGE:

In the issue stage, instructions whose ready bit is set to 1 in their reservation stations are sent to their respective functional units for execution. The ready bit of an instruction in the RS is set to 1 when both of its source operands are valid, i.e., valid1 = 1 and valid2 = 1. Therefore, an instruction can remain in its RS for one or more cycles while waiting for its source operands. Once the source operands are available, the instruction is issued from the RS to an execution unit.

Fig 4: Structure of Reservation station used in the code

In this stage, each RS (RS_int, RS_fpp, RS_mem, RS_br) is checked for set ready bits. If multiple instructions are ready, the one fetched earliest is selected; instructions are thus issued for execution in order of their fetch time.
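The ready and select logic just described can be sketched as follows. The RSEntry fields mirror the structure of Fig 4, but the names are assumptions for illustration rather than the declarations used in sim.c.

```c
/* A reservation-station entry becomes ready when both source operands
 * are valid; among ready entries, the oldest (earliest fetch time) is
 * selected for issue. */
typedef struct {
    int busy;            /* entry holds an instruction */
    int valid1, valid2;  /* source operand valid bits */
    int ready;           /* both operands valid: may issue */
    int fetch_time;      /* cycle the instruction was fetched */
} RSEntry;

/* Wakeup: set the ready bit once both operands are valid. */
void rs_wakeup(RSEntry *rs, int n)
{
    for (int i = 0; i < n; i++)
        if (rs[i].busy && rs[i].valid1 && rs[i].valid2)
            rs[i].ready = 1;
}

/* Select: return the index of the oldest ready entry, or -1. */
int rs_select_oldest(const RSEntry *rs, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (rs[i].busy && rs[i].ready &&
            (best < 0 || rs[i].fetch_time < rs[best].fetch_time))
            best = i;
    return best;
}
```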
In the execute stage, instructions are executed. Each functional unit has a specific latency, after which the instruction is considered to have finished its execution. A hybrid type of reservation station is used in this project, in which instructions are dispatched to multiple reservation stations and each reservation station can feed, or be shared by, more than one functional unit. The number of functional units is specified by the user.
2.4 COMMIT STAGE:

Instructions that have finished execution in their respective functional units are marked as finished in the reorder buffer. Note that issue and execution are performed out of order, but instructions commit to the architected registers in order. The commit stage mainly implements the functioning of the reorder buffer, where instructions are completed according to program order. The reorder buffer is managed as a circular queue with instructions arranged in program order.
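In-order commit from a circular ROB can be sketched as below: finished instructions retire from the head in program order, and commit stops at the first unfinished entry. The struct and field names are illustrative, not taken from sim.c; the ROB size matches the 50 entries used later for fpppp.tra.

```c
#define ROB_SIZE 50

typedef struct { int busy; int finished; } ROBEntry;

typedef struct {
    ROBEntry entry[ROB_SIZE];
    int head, tail, count;    /* circular queue in program order */
} ROB;

/* Commit up to `width` instructions per cycle; returns how many retired. */
int rob_commit(ROB *rob, int width)
{
    int retired = 0;
    while (retired < width && rob->count > 0 &&
           rob->entry[rob->head].finished) {
        /* here the result would be written to the architected register */
        rob->entry[rob->head].busy = 0;
        rob->head = (rob->head + 1) % ROB_SIZE;
        rob->count--;
        retired++;
    }
    return retired;
}
```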

Fig 5: Structure of ROB used in the code
2.5 BRANCH PREDICTION:

The simulator implements a gshare predictor by utilizing a branch history shift register (BHSR), a branch target buffer (BTB), and a gshare buffer of saturating counters. The PC of each instruction is sent to the gshare predictor in the fetch stage to make a taken/not-taken prediction for the instruction. The branch is resolved in the execute stage, which updates the prediction counters, the BHSR, and the BTB.
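The core of a gshare predictor of this kind can be sketched as below: the branch PC is XORed with the BHSR to index a table of 2-bit saturating counters. The table size and names here are assumptions for illustration (the BTB lookup is omitted), not the code from sim.c.

```c
#define GSHARE_BITS 10
#define GSHARE_SIZE (1 << GSHARE_BITS)

static unsigned char counters[GSHARE_SIZE];  /* 2-bit counters: 0..3 */
static unsigned bhsr;                        /* global branch history */

/* Predict taken when the indexed counter is in a "taken" state (2, 3). */
int gshare_predict(unsigned pc)
{
    unsigned idx = (pc ^ bhsr) & (GSHARE_SIZE - 1);
    return counters[idx] >= 2;
}

/* On resolution in execute: saturate the counter toward the outcome
 * and shift the outcome into the history register. */
void gshare_update(unsigned pc, int taken)
{
    unsigned idx = (pc ^ bhsr) & (GSHARE_SIZE - 1);
    if (taken  && counters[idx] < 3) counters[idx]++;
    if (!taken && counters[idx] > 0) counters[idx]--;
    bhsr = ((bhsr << 1) | (taken ? 1u : 0u)) & (GSHARE_SIZE - 1);
}
```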

Fig 6 :Gshare predictor
3 STEPS TO EXECUTE THE PROGRAM

To run the program from the terminal on Unix-based systems (Linux):
1. Download the files sim.c, globals.h and Makefile to your local system.
2. Type make at the command line.
3. Run ./sim.
4. The program will ask you for an input trace file as well as for values of parameters such as superscalar width, number of functional units, number of reservation station entries, number of reorder buffer entries, number of rename table entries, size of the BTB, L1 cache latency, L2 cache latency, L1 data cache hit rate, L2 cache hit rate, L1 instruction cache hit rate, memory latency, and L1 data cache access time.
(Make sure the three classes of application trace files are stored in the same folder as the code.)
4 OPTIMUM SIMULATOR ARCHITECTURES
The optimum simulator architecture values for the different classes of application are obtained in two stages. First, each parameter is varied while the other parameters are held constant at their maximum values. Second, once optimal values are obtained from stage 1, they are used as the maximum values for the parameters and stage 1 is repeated. Finally, the parameter values are adjusted to give the best architecture for each class of application. The user-supplied parameters in the architecture design are: superscalar width (SW), number of functional units (FU_no), number of reservation station entries (RS_entries), number of reorder buffer entries (ROB_entries), number of rename table entries (RRF_entries), L1 cache latency, L1 cache hit rate, L2 cache latency, L2 cache hit rate, memory access time, branch target buffer size (BTB_size), and the input program file. In the following analysis, we fix the L1 access time at 1 cycle, the L1 instruction cache hit rate at 98%, the L1 data cache hit rate at 95%, the L2 cache hit rate at 70%, the L2 cache access time at 6 cycles, the memory latency at 60 cycles, and the branch target buffer size at 64.
The functional unit latencies are as follows:
Integer functional units 1 cycle
Floating point functional units 5 cycles
Memory functional units 2 cycles
Branch functional units 1 cycle
Multiply/divide functional units 4 cycles
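The latency table above might be encoded in the simulator as a simple lookup array, sketched below. The enum and array names are illustrative assumptions, not the declarations from sim.c or globals.h.

```c
/* Functional-unit latencies in cycles, as listed in the table above. */
enum { FU_INT, FU_FPP, FU_MEM, FU_BR, FU_MULDIV, FU_TYPES };

static const int fu_latency[FU_TYPES] = {
    [FU_INT]    = 1,   /* integer         */
    [FU_FPP]    = 5,   /* floating point  */
    [FU_MEM]    = 2,   /* memory          */
    [FU_BR]     = 1,   /* branch          */
    [FU_MULDIV] = 4,   /* multiply/divide */
};
```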

4.1 SCIENTIFIC CLASS OF APPLICATION
4.1.1 Architecture and Utilization rates
The optimum architecture values for fpppp.tra (the scientific class of application) were found to be:

Number of Functional Units 7
Superscalar Width 7
Number of Reservation Station Entries 8
Number of Rename Register File Entries 30
Number of Reorder Buffer entries 50

These values were obtained in stage 1 by keeping all parameters but one at their maximum values and sweeping the remaining one; this was done for each parameter in turn. The relevant charts are included below:

Fig 7 : Plots from stage 1 for fpppp.tra
Next, the optimal values obtained from stage 1 were set, and each parameter was swept in turn. The following charts show these results.

Fig 8: Stage 2 analysis of fpppp.tra
Using this final optimum architecture, the simulation was run and the IPC, branch prediction rate, and various utilization rates were calculated. They are as follows:



IPC 2.87
Branch Prediction Rate 30.54%
FU_int Utilization 15.37%
FU_fpp Utilization 55.80%
FU_mem Utilization 53.91%
FU_br Utilization 19.48%
RS_int Utilization 52.26%
RS_fpp Utilization 54.8%
RS_mem Utilization 26.17%
RS_br Utilization 4.31%
RRF Utilization 46.64%
ROB Utilization 68.23%

4.1.2 Analysis:
Using the architecture obtained from stages 1 and 2 for fpppp.tra, the ROB utilization is 68%, which means it remains idle for 32% of the time; this is a reasonable figure. The branch prediction rate, however, is quite poor at 30%. The floating point reservation station is the most heavily utilized at 55%, and likewise the floating point functional units are the most utilized at 56%. This design is a reasonable architecture for fpppp.tra, although the branch predictor would have to be improved to obtain a higher prediction rate. This processor achieves a high throughput of 2.87 IPC for fpppp.tra.
4.2 INTEGER CLASS OF APPLICATION
4.2.1 Architecture and Utilization rates:
The optimum values for perl.tra (the integer application) were found to be:
Number of Functional Units 6
Superscalar Width 9
Number of Reservation Station Entries 6
Number of Rename Register File Entries 20
Number of Reorder Buffer entries 20
The same simulation was run for perl.tra as for fpppp.tra. The results from stage 1 of the analysis are shown in Figure 9.


Fig 9: Stage 1 analysis of perl.tra
The results from the second stage of analysis are shown in the following graphs.

Fig 10: Stage 2 analysis of perl.tra

The IPC, branch prediction rate, and utilization rates for the optimal values obtained from stages 1 and 2 are shown below.


IPC 2.39
Branch Prediction Rate 32.79%
FU_int Utilization 12.45%
FU_fpp Utilization 0%
FU_mem Utilization 66.02%
FU_br Utilization 34.54%
RS_int Utilization 55.59%
RS_fpp Utilization 0%
RS_mem Utilization 46.20%
RS_br Utilization 6.73%
RRF Utilization 21.33%
ROB Utilization 83.37%
4.2.2 Analysis:
From the above table it is seen that, for this architecture for perl.tra, the ROB utilization is quite satisfactory at 83%; it is idle only 17% of the time. The branch prediction rate, however, needs improvement, standing at only 33%. It is also noticeable that the floating point functional unit is never used, while the memory functional units have the maximum usage at 66%. The RRF utilization rate is quite poor for this architecture, as it remains idle 79% of the time. This processor achieves a throughput of 2.39 IPC for perl.tra.
4.3 MULTIMEDIA CLASS OF APPLICATION
4.3.1 Architecture and Utilization rates
The optimum architecture values obtained for mpeg2e.tra (the multimedia application) are shown below.
Number of Functional Units 4
Superscalar Width 7
Number of Reservation Station Entries 4
Number of Rename Register File Entries 20
Number of Reorder Buffer entries 20
The same analysis was run for mpeg2e.tra as for the first two applications. The results from the first
stage of analysis are shown in the figures below.

Fig 11 : Stage 1 analysis of mpeg2e.tra

The results for the second stage of analysis are shown in the following figures.

Fig 12 :Stage 2 analysis of mpeg2e.tra
The IPC, branch prediction rate, and utilization rates for the ideal values are shown in the following
table.


IPC 1.84
Branch Prediction Rate 88.29%
FU_int Utilization 30.53%
FU_fpp Utilization 44.59%
FU_mem Utilization 23.44%
FU_br Utilization 17.63%
RS_int Utilization 98.26%
RS_fpp Utilization 15.37%
RS_mem Utilization 8.02%
RS_br Utilization 4.46%
RRF Utilization 43.59%
ROB Utilization 61.85%
4.3.2 Analysis:
From the above table it is observed that the ROB is utilized almost 62% of the time, and the branch prediction rate is quite good at 88%. The integer reservation station is busy almost 98% of the time, whereas the branch reservation station is the least busy. The RRF utilization, at 44%, could be improved. Although this design achieves a good branch prediction rate, the ROB stays idle for almost 38% of the time, and the instruction throughput is 1.84 IPC.
5 CONCLUSION
In this report we present a trace-driven simulator which has been tested on three classes of application: scientific (fpppp.tra), integer (perl.tra), and multimedia (mpeg2e.tra). Optimum architecture values are obtained for each class of application. Further, performance metrics such as instructions-per-cycle throughput (IPC), branch prediction rate, and resource utilization of the designed processor are obtained for each class. It is observed that the processor for the multimedia application obtains a better branch prediction rate while its ROB is not fully utilized. In contrast, the processor for the integer application obtains a better ROB utilization but a lower branch prediction rate. Future work would involve optimizing the architecture values further, to gain better resource utilization together with good branch prediction rates and a higher overall throughput. A power/performance analysis is also planned as a future extension of this study.
6 REFERENCES
Shen, J. P., and Lipasti, M. H. (2004). Modern Processor Design: Fundamentals of Superscalar Processors (McGraw-Hill Series in Electrical and Computer Engineering). McGraw-Hill Science/Engineering/Math.
