This report presents a trace driven execution simulator for a superscalar processor. It is analyzed for various applications such as scientific, integer and multimedia. The program incorporates the different architecture stages starting from the fetching of the instructions, decode, dispatch, issue, commit.
Submitted by: Roshni Uppala, Damon Stachler
Submitted on: 25th November, 2013

1 OBJECTIVE

A superscalar processor is one that can execute multiple instructions in a single cycle. This report presents a trace-driven execution simulator for a superscalar processor, which is analyzed for three classes of application: scientific, integer and multimedia.
Fig 1: Superscalar processor definition

2 OVERALL PROGRAM DESIGN

In this section, we present a schematic flow diagram for every stage implemented in the source code sim.c. The program incorporates the different architecture stages: fetch, decode (Fig 2), dispatch, issue, execute and commit.

2.1 FETCH AND DECODE STAGE:

The fetch stage brings instructions, up to the superscalar width per cycle, into the IF_ID buffer. Instructions are accessed from the L1 cache, the L2 cache or main memory depending on the cache hit rates. In an actual processor, the instruction cache hit rate is an average over all instructions. In contrast, this project does not model a cache. Instead, a random number is used to decide, according to the configured hit rates, which level services each instruction. The latency of accessing each level is also incorporated. Once all the instructions are accessed (up to the superscalar width, SW), they are put in the IF_ID buffer in order. Note that instructions are accessed from the caches in order.

The decode stage decodes the instructions held in the IF_ID buffer. It identifies the instruction boundaries and reads each instruction's opcode, sources and destination. After all the instructions are decoded, they are moved into the dispatch buffer in order. In this project, the IF_ID buffer and the dispatch buffer are designed as circular lists.
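The random hit/miss decision described above can be sketched in C as follows. This is a minimal illustration, not the code in sim.c: the struct, its field names and both function names are hypothetical, and it assumes that an access pays only the latency of the level that services it (latencies are not accumulated across levels).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical parameter bundle; names are illustrative, not taken from sim.c. */
typedef struct {
    int l1_hit_pct;   /* L1 instruction cache hit rate, in percent */
    int l2_hit_pct;   /* L2 cache hit rate, in percent */
    int l1_latency;   /* cycles to service an access that hits in L1 */
    int l2_latency;   /* cycles to service an access that hits in L2 */
    int mem_latency;  /* cycles to service an access from main memory */
} CacheParams;

/* Map two draws in [0, 99] to a latency: the first draw decides L1 hit or
 * miss, the second decides L2 hit or miss. Assumption: only the latency of
 * the servicing level is charged. */
int latency_from_draws(const CacheParams *p, int draw_l1, int draw_l2)
{
    assert(draw_l1 >= 0 && draw_l1 < 100);
    assert(draw_l2 >= 0 && draw_l2 < 100);
    if (draw_l1 < p->l1_hit_pct)
        return p->l1_latency;
    if (draw_l2 < p->l2_hit_pct)
        return p->l2_latency;
    return p->mem_latency;
}

/* Latency of one instruction fetch, using the C library random generator. */
int fetch_latency(const CacheParams *p)
{
    return latency_from_draws(p, rand() % 100, rand() % 100);
}
```

Separating the draw from the decision keeps the hit/miss logic deterministic and easy to check, while fetch_latency supplies the randomness.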
Fig 2: Fetch and decode stage

2.2 DISPATCH STAGE:
Instructions move from the dispatch buffer in order into the rename register file (RRF), the reorder buffer (ROB) and their respective reservation stations (RS). Four reservation stations are used in this project: integer (RS_int), floating point (RS_fpp), memory (RS_mem) and branch (RS_br). If any one of the ROB, the RRF or the required RS has no vacancy, the instruction is not dispatched and waits in the dispatch buffer until all of them have a vacancy. When instructions are dispatched, three operations are performed: source read (reading the source register values from the ARF, or from the RRF entry indicated by the RRF tag in the ARF), destination allocate (allocating the destination register in the RRF and updating its rrf_tag in the ARF) and register update (supplying data to instructions that have been waiting for their source register values, i.e. instructions whose operands are not yet valid and which are therefore not yet ready for execution).
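The source read and destination allocate steps can be sketched as below. This is an illustrative sketch under assumed data layouts: the struct and function names, the 32-entry ARF and the 64-entry RRF bound are hypothetical and are not taken from sim.c. Only the rrf_tag idea comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_ARCH_REGS 32   /* assumed ARF size */
#define MAX_RRF       64   /* assumed upper bound on RRF entries */

typedef struct { int data; bool busy; int rrf_tag; } ArfEntry;
typedef struct { int data; bool valid; bool in_use; } RrfEntry;

/* An operand holds a value when valid, otherwise the RRF tag to wait on. */
typedef struct { int value_or_tag; bool valid; } Operand;

typedef struct {
    ArfEntry arf[NUM_ARCH_REGS];
    RrfEntry rrf[MAX_RRF];
    int rrf_entries;           /* number of RRF entries actually configured */
} RegState;

/* Source read: take the value from the ARF if the register is not renamed,
 * from the RRF if the renamed value is already valid, else keep the tag. */
Operand source_read(const RegState *s, int reg)
{
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    Operand op;
    const ArfEntry *a = &s->arf[reg];
    if (!a->busy) {
        op.value_or_tag = a->data;
        op.valid = true;
    } else if (s->rrf[a->rrf_tag].valid) {
        op.value_or_tag = s->rrf[a->rrf_tag].data;
        op.valid = true;
    } else {
        op.value_or_tag = a->rrf_tag;
        op.valid = false;
    }
    return op;
}

/* Destination allocate: claim a free RRF entry and record its tag in the
 * ARF. Returns the tag, or -1 when the RRF is full (dispatch must stall). */
int dest_allocate(RegState *s, int reg)
{
    assert(reg >= 0 && reg < NUM_ARCH_REGS);
    for (int t = 0; t < s->rrf_entries; t++) {
        if (!s->rrf[t].in_use) {
            s->rrf[t].in_use = true;
            s->rrf[t].valid = false;
            s->arf[reg].busy = true;
            s->arf[reg].rrf_tag = t;
            return t;
        }
    }
    return -1;
}
```

A return value of -1 from dest_allocate corresponds to the "no vacancy" case, in which the instruction stays in the dispatch buffer.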
Fig 3: Dispatch stage

2.3 ISSUE AND EXECUTE STAGE:
In the issue stage of the code, instructions whose ready bit is set to 1 in their reservation station are sent to their respective functional units for execution. The ready bit of an instruction in the RS is set to 1 when both of its source operands are valid, i.e. valid1 = 1 and valid2 = 1. Instructions can therefore remain in their RS for one or more cycles while waiting for their source operands. Once the source operands are available, the instruction is issued from the RS into the execution unit.
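A minimal C sketch of the ready-bit logic follows. The valid1, valid2 and ready fields mirror the description above; every other name (RsEntry, fetch_time, the three functions) is hypothetical, and picking the earliest-fetched ready entry is shown as one possible selection policy rather than as the code in sim.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool busy;                 /* entry holds an instruction */
    int  src1, src2;           /* operand value when valid, RRF tag otherwise */
    bool valid1, valid2;       /* operand availability */
    bool ready;                /* set when both operands are valid */
    unsigned long fetch_time;  /* cycle in which the instruction was fetched */
} RsEntry;

/* The ready bit is the AND of the two valid bits for an occupied entry. */
void rs_update_ready(RsEntry *e)
{
    e->ready = e->busy && e->valid1 && e->valid2;
}

/* Forward a produced result: every waiting operand that carries the
 * matching tag receives the value and becomes valid. */
void rs_wakeup(RsEntry *rs, size_t n, int tag, int value)
{
    for (size_t i = 0; i < n; i++) {
        if (!rs[i].busy)
            continue;
        if (!rs[i].valid1 && rs[i].src1 == tag) {
            rs[i].src1 = value;
            rs[i].valid1 = true;
        }
        if (!rs[i].valid2 && rs[i].src2 == tag) {
            rs[i].src2 = value;
            rs[i].valid2 = true;
        }
        rs_update_ready(&rs[i]);
    }
}

/* Return the index of the earliest-fetched ready entry, or -1 if none. */
int rs_select_oldest(const RsEntry *rs, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (rs[i].busy && rs[i].ready &&
            (best < 0 || rs[i].fetch_time < rs[best].fetch_time))
            best = (int)i;
    }
    return best;
}
```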
Fig 4: Structure of Reservation station used in the code
In the issue stage, each RS (RS_int, RS_fpp, RS_mem, RS_br) is checked for entries with the ready bit set. If multiple instructions are ready, the one that was fetched earliest is selected for execution first.

In the execute stage, instructions are executed. Each functional unit has a specific latency, after which the instruction is considered to have finished execution. Hybrid reservation stations are used in this project: instructions are dispatched to multiple reservation stations, and each reservation station can feed more than one functional unit. The number of functional units is supplied by the user.

2.4 COMMIT STAGE:
Instructions that have finished execution in their respective functional units are marked as finished in the reorder buffer. Note that issue and execution happen out of order, but instructions are committed to the architectural register file in order. The commit stage mainly models the reorder buffer, in which instructions complete according to program order. The reorder buffer is managed as a circular queue with instructions arranged in program order.
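The circular-queue behaviour of the reorder buffer can be sketched as follows. This is an assumption-laden illustration: the entry layout, the ROB_CAPACITY bound and all function names are hypothetical, and the sketch commits directly into a plain integer array standing in for the architectural register file.

```c
#include <assert.h>
#include <stdbool.h>

#define ROB_CAPACITY 64   /* assumed upper bound on ROB entries */

typedef struct { int dest_reg; int value; bool finished; } RobEntry;

typedef struct {
    RobEntry e[ROB_CAPACITY];
    int head;    /* oldest instruction, next to commit */
    int tail;    /* next free slot */
    int count;   /* occupied entries */
    int size;    /* configured number of entries */
} Rob;

void rob_init(Rob *r, int size)
{
    assert(size > 0 && size <= ROB_CAPACITY);
    r->head = r->tail = r->count = 0;
    r->size = size;
}

/* Allocate an entry at the tail in program order; -1 means the ROB is full. */
int rob_alloc(Rob *r, int dest_reg)
{
    if (r->count == r->size)
        return -1;
    int idx = r->tail;
    r->e[idx].dest_reg = dest_reg;
    r->e[idx].value = 0;
    r->e[idx].finished = false;
    r->tail = (r->tail + 1) % r->size;
    r->count++;
    return idx;
}

/* Called when a functional unit finishes, possibly out of order. */
void rob_finish(Rob *r, int idx, int value)
{
    r->e[idx].value = value;
    r->e[idx].finished = true;
}

/* Commit up to `width` entries from the head, stopping at the first
 * unfinished one so that architectural state is updated in program order. */
int rob_commit(Rob *r, int *arf, int width)
{
    int committed = 0;
    while (committed < width && r->count > 0 && r->e[r->head].finished) {
        arf[r->e[r->head].dest_reg] = r->e[r->head].value;
        r->head = (r->head + 1) % r->size;
        r->count--;
        committed++;
    }
    return committed;
}
```

Because rob_commit stops at the first unfinished entry, a younger instruction that finishes early cannot update the register file before an older one.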
Fig 5: Structure of the ROB used in the code

2.5 BRANCH PREDICTION:
The simulator implements a gshare predictor using a branch history shift register (BHSR), a branch target buffer (BTB) and a gshare buffer of saturating counters. The PC of each instruction is sent to the gshare predictor in the fetch stage to make a taken/not-taken prediction. The branch is resolved in the execute stage, which updates the prediction, the BHSR and the BTB.
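A compact sketch of the direction-prediction part of gshare is given below. The report does not state the table size, counter width or index hash, so the 10-bit history, the 2-bit counters, the PC shift by 2 and all names here are assumptions. The BTB is left out of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GSHARE_BITS 10                    /* assumed history length */
#define GSHARE_SIZE (1u << GSHARE_BITS)   /* number of counters */

typedef struct {
    uint32_t bhsr;                    /* global branch history */
    uint8_t  counters[GSHARE_SIZE];   /* 2-bit saturating counters, 0..3 */
} Gshare;

void gshare_init(Gshare *g)
{
    g->bhsr = 0;
    for (uint32_t i = 0; i < GSHARE_SIZE; i++)
        g->counters[i] = 1;           /* start at weakly not taken */
}

/* gshare index: branch PC XORed with the global history. */
static uint32_t gshare_index(const Gshare *g, uint32_t pc)
{
    return ((pc >> 2) ^ g->bhsr) & (GSHARE_SIZE - 1);
}

/* Predict taken when the counter is in one of its two upper states. */
bool gshare_predict(const Gshare *g, uint32_t pc)
{
    return g->counters[gshare_index(g, pc)] >= 2;
}

/* On branch resolution: train the counter, then shift the outcome into
 * the history register. */
void gshare_update(Gshare *g, uint32_t pc, bool taken)
{
    uint8_t *c = &g->counters[gshare_index(g, pc)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    g->bhsr = ((g->bhsr << 1) | (taken ? 1u : 0u)) & (GSHARE_SIZE - 1);
}
```

In this sketch gshare_predict would be called from the fetch stage and gshare_update from the execute stage, matching the flow described above.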
Fig 6: Gshare predictor

3 STEPS TO EXECUTE THE PROGRAM
To run the program from a terminal on Unix-based systems (Linux), download the files sim.c, globals.h and Makefile to your local system, then:

1. Type make at the command line.
2. Run ./sim
3. The program will ask for an input trace file as well as for values of parameters such as superscalar width, number of functional units, number of reservation station entries, number of reorder buffer entries, number of rename table entries, size of the BTB, L1 cache latency, L2 cache latency, L1 data cache hit rate, L2 cache hit rate, L1 instruction cache hit rate, memory latency and L1 data cache access time. (Make sure the trace files for the three classes of application are stored in the same folder as the code.)

4 OPTIMUM SIMULATOR ARCHITECTURES

The optimum simulator architecture values for the different classes of application are obtained in two stages. First, each parameter is varied while the other parameters are held constant at their maximum values. Second, the optimal values obtained from stage 1 are used as the new maximum values for the parameters and stage 1 is repeated. Finally, the parameter values are adjusted to give the best architecture for each class of application.

The parameters of the architecture design are given by the user. They are the superscalar width (SW), number of functional units (FU_no), number of reservation station entries (Rs_entries), number of reorder buffer entries (ROB_entries), number of renaming table entries (RRF_entries), L1 cache latency, L1 cache hit rate, L2 cache latency, L2 cache hit rate, memory access time, branch target buffer size (BTB_size) and the input program file.

In the following analysis, we fix the L1 access time at 1 cycle, the L1 instruction cache hit rate at 98%, the L1 data cache hit rate at 95%, the L2 cache hit rate at 70%, the L2 cache access time at 6 cycles, the memory latency at 60 cycles and the branch target buffer size at 64.
The functional unit latencies are as follows:

Functional unit         Latency (cycles)
Integer                 1
Floating point          5
Memory                  2
Branch                  1
Multiply/Divide         4
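As a rough cross-check on the fixed cache parameters used in this analysis, the average access latency implied by a random hit/miss model can be estimated in closed form. The sketch below assumes that an access pays only the latency of the level that services it, which the report does not state explicitly; the function name is hypothetical.

```c
#include <assert.h>

/* Expected access latency in cycles for a two-level cache plus memory,
 * assuming only the servicing level's latency is charged.
 * Hit rates are fractions in [0, 1]. */
double expected_latency(double l1_hit, double l2_hit,
                        double l1_lat, double l2_lat, double mem_lat)
{
    assert(l1_hit >= 0.0 && l1_hit <= 1.0);
    assert(l2_hit >= 0.0 && l2_hit <= 1.0);
    double miss_l1 = 1.0 - l1_hit;
    double miss_l2 = 1.0 - l2_hit;
    return l1_hit * l1_lat + miss_l1 * (l2_hit * l2_lat + miss_l2 * mem_lat);
}
```

Under that assumption, expected_latency(0.98, 0.70, 1, 6, 60) gives about 1.42 cycles per instruction fetch and expected_latency(0.95, 0.70, 1, 6, 60) gives about 2.06 cycles per data access.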
4.1 SCIENTIFIC CLASS OF APPLICATION

4.1.1 Architecture and Utilization rates

The optimum architecture values for fpppp.tra (the scientific class of application) were found to be:

Number of Functional Units                 7
Superscalar Width                          7
Number of Reservation Station Entries      8
Number of Rename Register File Entries     30
Number of Reorder Buffer entries           50

These numbers were found by first running stage 1: keeping all parameters but one at their maximum values and sweeping the remaining one. This was repeated for every parameter. The relevant charts are included below:
Fig 7: Plots from stage 1 for fpppp.tra

Next, the optimal values obtained from stage 1 were set, and each parameter was swept in turn. The following charts show these results.

Fig 8: Stage 2 analysis of fpppp.tra

Using this final optimum architecture, the simulation was run and the IPC, branch prediction rate and various utilization rates were calculated. They are as follows:
4.1.2 Analysis:

Using the architecture obtained from stages 1 and 2 for the fpppp.tra class of application, the ROB utilization is 68%, so the ROB sits idle for a considerable share of the time. The branch prediction rate is quite poor at 30%. The floating point RS is utilized more than the other reservation stations, at 55%, and the floating point functional units are utilized more than the other FUs, at 56%. This design is a reasonable architecture for fpppp.tra, although the branch predictor would have to be designed better to obtain a higher branch prediction rate. This processor achieves an IPC of 2.87 for fpppp.tra.

4.2 INTEGER CLASS OF APPLICATION

4.2.1 Architecture and Utilization rates:

The optimum values for perl.tra (the integer application) were found to be:

Number of Functional Units                 6
Superscalar Width                          9
Number of Reservation Station Entries      6
Number of Rename Register File Entries     20
Number of Reorder Buffer entries           20

The same simulation was run for perl.tra as for fpppp.tra. The results from stage 1 of the analysis are shown in Fig 9.
Fig 9: Stage 1 analysis of perl.tra

The results from the second stage of analysis are shown in the following graphs.

Fig 10: Stage 2 analysis of perl.tra

The IPC, branch prediction rate and utilization rates for the optimal values obtained from stage 1 and stage 2 are shown below.
IPC                        2.39
Branch Prediction Rate     32.79%
FU_int Utilization         12.45%
FU_fpp Utilization         0%
FU_mem Utilization         66.02%
FU_br Utilization          34.54%
RS_int Utilization         55.59%
RS_fpp Utilization         0%
RS_mem Utilization         46.20%
RS_br Utilization          6.73%
RRF Utilization            21.33%
ROB Utilization            83.37%

4.2.2 Analysis:

From the above table, the ROB utilization for this architecture on perl.tra is quite satisfactory at 83%, so the ROB is idle only 17% of the time. The branch prediction rate, which stays at 33%, has to be improved. The floating point functional unit is never used to execute any instruction, while the memory functional units have the highest usage at 66%. The RRF utilization rate is quite poor for this architecture, as the RRF remains idle for 79% of the time. This processor achieves an IPC of 2.39 for perl.tra.

4.3 MULTIMEDIA CLASS OF APPLICATION

4.3.1 Architecture and Utilization rates

The architecture values obtained for mpeg2e.tra (the media application) are shown below.

Number of Functional Units                 4
Superscalar Width                          7
Number of Reservation Station Entries      4
Number of Rename Register File Entries     20
Number of Reorder Buffer entries           20

The same analysis was run for mpeg2e.tra as for the first two applications. The results from the first stage of analysis are shown in the figures below.
Fig 11 : Stage 1 analysis of mpeg2e.tra
The results for the second stage of analysis are shown in the following figures.
Fig 12: Stage 2 analysis of mpeg2e.tra

The IPC, branch prediction rate and utilization rates for the ideal values are shown in the following table.
IPC                        1.84
Branch Prediction Rate     88.29%
FU_int Utilization         30.53%
FU_fpp Utilization         44.59%
FU_mem Utilization         23.44%
FU_br Utilization          17.63%
RS_int Utilization         98.26%
RS_fpp Utilization         15.37%
RS_mem Utilization         8.02%
RS_br Utilization          4.46%
RRF Utilization            43.59%
ROB Utilization            61.85%

4.3.2 Analysis:

From the above table, the ROB is utilized almost 62% of the time, and the branch prediction rate is good at 88%. The integer RS is busy about 98% of the time, whereas the branch RS is the least busy. The RRF utilization of 44% leaves room for improvement. Although the branch prediction rate is better than for the other applications, the ROB stays idle for about 38% of the time. The instruction throughput is 1.84 IPC.

5 CONCLUSION

In this report we present a trace-driven simulator which has been tested on three classes of application, namely scientific (fpppp.tra), integer (perl.tra) and multimedia (mpeg2e.tra). Optimum architecture values are obtained for each class of application. Further, different performance metrics, namely instructions-per-cycle throughput (IPC), branch prediction rate and resource utilization of the designed processor, are obtained for each class of application. For the multimedia application, the processor obtains a better branch prediction rate while the ROB is not fully utilized. In contrast, the processor for the integer application shows better ROB utilization but a lower branch prediction rate. Future work would involve optimizing the architecture values further to gain better resource utilization rates together with good branch prediction rates and a higher overall throughput. Power and performance metrics would also be part of future study.

6 REFERENCES

Shen, J. P. and Lipasti, M. H. July 7, 2004. Modern Processor Design: Fundamentals of Superscalar Processors (McGraw-Hill Series in Electrical and Computer Engineering). McGraw-Hill Science/Engineering/Math.