PROC. 20th INTERNATIONAL CONFERENCE ON MICROELECTRONICS (MIEL'95), VOL. 2, NIŠ, SERBIA, 12-14 SEPTEMBER, 1995
Branch Mechanisms in Deep Pipelines: Reevaluating the Existing Solutions and Proposing a New Guideline

Milena Petrović, Igor Tartalja, and Veljko Milutinović

This research was partially supported by the FNRS. M. Petrović, I. Tartalja, and V. Milutinović are with the School of Electrical Engineering, University of Belgrade, Bul. Revolucije 73, POB 816, 11000 Beograd, Yugoslavia. E-mail: {epetromi, etartalj, emilutiv}@ubbg.etf.bg.ac.yu.

Abstract: In this study we show that the ranking of branch mechanisms changes when the underlying technology changes from the advanced CMOS used in state-of-the-art commercial microprocessors to the more advanced technologies that will probably be used in future microprocessors. New technologies such as GaAs or GaInAs imply a deeper instruction pipeline. The problem studied here is therefore important, because branch instruction execution is one of the most serious causes of performance degradation in deep-pipeline processors. Software methods (Gross-Hennessy and Ignore), which exhibit an advantage over the elementary hardware schemes (Assume Branch Not Taken and Branch Target Buffer) in short pipelines, become inferior in deep pipelines.

1. Introduction

Efficient design of branching-related mechanisms is of crucial importance for the overall performance and commercial success of novel microprocessors. Most of the research effort is directed towards issues in superscalar and superpipelined computing, assuming conditions typical of present-day CMOS silicon (instruction pipelines that are not very deep). However, very little research effort is devoted to the problems typical of emerging technologies characterized by a relatively large ratio of off-chip to on-chip delays (which results in extremely deep pipelines).

Some of the early research on branching for deep pipelines was related to DARPA's effort to develop a GaAs RISC machine for VLSI [1]. In the meantime, we realized that the solution used there (the ignore instruction), with some improvements, could be used in present-day efforts dedicated to more advanced technologies [2]. However, the problem to be resolved before an aggressive usage of ignore-style mechanisms is the precise determination of their performance and complexity implications in different environments of interest. This is exactly the subject and the goal of the research presented in this paper.

With all of the above in mind, our paper compares various previously proposed and some newly devised ignore-style mechanisms with a number of state-of-the-art branching mechanisms proposed in the literature or used in modern microprocessor systems (Assume Branch Not Taken, Branch Target Buffer, Delayed Branch). In our analysis, we used a modeling and evaluation methodology related to the one from [3]. We show that the ranking of the above-mentioned mechanisms changes when the underlying technology changes from advanced CMOS (used for state-of-the-art commercial microprocessors) to the more advanced technologies characterized by a relatively large ratio of off-chip to on-chip delays (to be used for future commercial microprocessors). To make our results practically relevant, all analyses assume an underlying architecture with cache memory.

2. Branch Mechanisms

Conditional changes in control flow are among the most important causes of pipeline performance degradation. Basically, there are two possible approaches to the branching problem: a software-oriented approach (the compiler tries to fill the branch delay slots with useful instructions), and hardware-oriented speculative execution (the outcome of the branch is predicted in an early pipeline stage). Speculative execution can be static or dynamic.
In this study we compared the performance of several branch mechanisms characterized by low hardware complexity, and thus well suited to new and expensive technologies.

0-7803-2786-1/95/$4.00 © 1995 IEEE

Assume Branch Not Taken [4]. A conditional branch is always predicted as not taken, and instruction fetching continues sequentially after the branch. If the prediction is wrong, the pipeline is flushed and instructions are fetched from the branch target.

Branch Target Buffer (BTB) [5]. The BTB is a small cache memory that can store the predicted target address (or the target instructions) and the information necessary for branch prediction. A number of recent reports describe very sophisticated branch predictors, but these are generally characterized by high hardware complexity. The BTB that we modeled is a very simple one: only taken branches are stored in it, and there are no prediction bits. The BTB is a 64-entry, fully associative LRU cache. No cycles are lost if the prediction is correct. Updating the BTB after an incorrect prediction takes one cycle.

Delayed Branch [6]. The compiler tries to fill as many branch delay slots as possible with useful instructions (instructions from the basic block above the branch, from the block sequentially after the branch, or from the target), according to the Gross-Hennessy optimization algorithm. Since the compiler is unable to fill more than a few delay slots, the rest are filled with nop instructions. The improved version is Delayed Branch With Squashing [7], which allows any instruction from the branch destination basic block to be placed in the delay slots and squashed in the case of a misprediction.

Ignore [1,8]. The Ignore mechanism is a further improvement of the software methods. The Ignore Always instruction replaces a series of nop instructions, forbidding new instruction fetches for the appropriate number of cycles. Conditional Ignore allows partial squashing of delay slots.
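The code-size effect of Ignore Always can be illustrated with a minimal sketch. It assumes a hypothetical textual instruction stream in which unfilled delay slots appear as `nop` entries, and a hypothetical `ignore n` encoding for an instruction that suppresses fetching for n cycles; the mnemonics and the function name `collapse_nops` are illustrative assumptions, not part of the mechanism as published.

```python
def collapse_nops(instructions):
    """Replace every run of consecutive 'nop' entries with one 'ignore n'.

    'ignore n' is a hypothetical encoding (an assumption of this sketch):
    it stands for a single instruction that blocks fetching for n cycles.
    """
    result = []
    run = 0  # length of the current run of nops
    for ins in instructions:
        if ins == "nop":
            run += 1
            continue
        if run:
            result.append(f"ignore {run}")
            run = 0
        result.append(ins)
    if run:  # flush a run that ends the stream
        result.append(f"ignore {run}")
    return result


if __name__ == "__main__":
    # Hypothetical delayed-branch code: one filled slot, three unfilled slots.
    delayed = ["beq r1, r2, L1", "add r3, r4, r5", "nop", "nop", "nop",
               "sub r6, r7, r8"]
    compact = collapse_nops(delayed)
    print(len(delayed), "instructions before,", len(compact), "after")
    print(compact)
```

The transformation leaves the number of idle cycles unchanged and only shortens the code, which matters when code expansion increases the number of instruction cache misses.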
In Conditional Ignore, the branch delay area is divided into two parts: one filled by the Gross-Hennessy algorithm with always-executable instructions, and the other filled with squashable instructions from the predicted branch destination. In a very deep pipeline the compiler is unable to fill the whole squashable area, so the rest is filled with nops. We also propose a Mixed Ignore instruction, which improves Conditional Ignore. As in the Ignore Always approach, rather than filling the remaining branch delay slots with nops, the Ignore instruction is inserted after the squashable area. All Ignore versions can be either integrated with the branch instruction or implemented as a separate instruction.

3. Conditions of the Analysis

The performance of the different branch mechanisms is evaluated using a pipelined processor model developed for this purpose, with one level of cache memory (split into data and instruction caches). The cache is direct-mapped; one cache line stores one instruction word. We functionally simulated just two pipeline stages: the Fetch stage and the Branch Evaluation stage (the rest of the pipeline is of no interest for this research).

We decided to use synthetic programs as the simulator input, in order to have great flexibility in the choice of parameter values. A special synthetic code generator, based on a probabilistic model, was developed. Representative test programs, with acceptable length and simulation time, are obtained using this generator [9]. The generated code is structured and consists of conditional branch and non-branch instructions. A forward branch is assumed to be an if-branch and a backward branch is assumed to be a loop-branch. Some of the synthetic benchmark parameter values were chosen according to the values measured for the SPEC92 set of benchmarks [10].

4. Analytical Evaluation

We used a simple analytical model to evaluate the performance of the different branch mechanisms.

[Figure: bar chart of U for the series btb, abnt, ignc, and grhe.]
Figure 1. Processor utilization for 10 synthetic benchmarks - analytical model.
Parameter values: D=12, Miss=0.

The performance measure is processor utilization, U, equal to the ratio of useful cycles to all cycles (useful and wasted). If the pipeline is flushed because of a wrong branch prediction, or if a nop instruction is executed, processor cycles are wasted and processor utilization decreases. Cache miss cycles are not included in the model.

The parameters of interest are: DB (dynamic branch percentage), D (number of pipeline stages needed for branch resolution), $P_{lp}$ and $P_{if}$ (percentages of backward and forward dynamic branches), $P_t$ and $P_{nt}$ (percentages of taken and not taken branches), $P_{lp,t}$ and $P_{if,t}$ (percentages of taken branches among the backward and the forward ones), BPR (branch prediction rate of the BTB), $U_d$ (BTB updating cycles), $P_f(i)$, $P_a$, $P_b$, $P_c$, and $N_{fill}$ (the frequency of filling the i-th delay slot, the frequencies of slots filled according to schemes a, b, and c of the Gross-Hennessy algorithm, and the average number of filled slots), and $P_{Atar}$ and $P_{Aseq}$ (the percentages of delay slots filled from the target basic block and from the sequential one, for the Conditional Ignore mechanism):

$$N_{fill} = \sum_{i} P_f(i), \qquad P_t = P_{if,t}\,P_{if} + P_{lp,t}\,P_{lp}, \qquad P_{nt} = 1 - P_t, \qquad P_a + P_b + P_c = 1$$

U is equal to:

Assume Branch Not Taken:
$$U = \frac{1}{P_t\,(D-1)\,DB + 1}$$

Branch Target Buffer:
$$U = \frac{1}{(1-BPR)\,DB\,(D-1+U_d) + 1}$$

Gross-Hennessy and Ignore Always:
$$U = \frac{1}{\bigl((D-1) - N_{fill}\,(P_a + P_b P_{nt} + P_c P_t)\bigr)\,DB + 1}$$

Conditional Ignore and Mixed Ignore:
$$U = \frac{1}{\bigl(D-1 - N_{fill}\,(P_a + P_b P_{nt} + P_c P_t) - (P_{Atar}\,P_{if,t}\,P_{if} + P_{Aseq}\,P_{lp,t}\,P_{lp})\,(D-1-N_{fill})\bigr)\,DB + 1}$$

Figure 1 shows the values obtained from our analytical model, with the same parameter values as for the simulation-based analysis presented in the next section.

5. Simulation-Based Evaluation

The synthetic model includes two additional parameters: the cache size (Csize) and the number of extra cycles due to a cache miss (Miss) [11].

[Table fragments recovered at this point: benchmark classes by percentage of taken conditional branches, 2: 50-60, 3: 60-70, 4: 70-90, 5: >90; other values: 5, 12.6, 4.4, 18 and 2.5, 11.5, 1.6, 10.8.]
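Before turning to the simulation results, the closed-form expressions of Section 4 can be evaluated numerically. The sketch below covers the Assume Branch Not Taken, BTB, and Gross-Hennessy / Ignore Always formulas only. The function names and every numeric value in the example are illustrative assumptions, not the benchmark parameters of this study, and DB and the probabilities are expressed as fractions rather than percentages.

```python
def u_abnt(p_t, d, db):
    """Assume Branch Not Taken: each taken branch wastes D-1 cycles."""
    return 1.0 / (p_t * (d - 1) * db + 1.0)


def u_btb(bpr, d, db, ud=1):
    """BTB: each misprediction wastes D-1 cycles plus Ud update cycles."""
    return 1.0 / ((1.0 - bpr) * db * (d - 1 + ud) + 1.0)


def u_gross_hennessy(p_t, d, db, n_fill, p_a, p_b, p_c):
    """Gross-Hennessy / Ignore Always: D-1 delay slots minus the useful ones."""
    p_nt = 1.0 - p_t
    useful_slots = n_fill * (p_a + p_b * p_nt + p_c * p_t)
    return 1.0 / (((d - 1) - useful_slots) * db + 1.0)


if __name__ == "__main__":
    # Assumed, illustrative values (not taken from the paper's tables).
    p_t, db = 0.6, 0.15
    p_a, p_b, p_c = 0.5, 0.25, 0.25
    # Assumption: the compiler fills only a few slots regardless of depth.
    for d, n_fill in ((3, 1.5), (12, 2.0)):
        print(f"D={d}: "
              f"abnt={u_abnt(p_t, d, db):.3f} "
              f"btb={u_btb(0.8, d, db):.3f} "
              f"grhe={u_gross_hennessy(p_t, d, db, n_fill, p_a, p_b, p_c):.3f}")
```

With these assumed values the software scheme leads Assume Branch Not Taken at D=3 and falls behind it at D=12, because the number of fillable slots stays small while the number of delay slots grows with D. Other parameter choices shift the crossover point.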
Calder and Grunwald [10] reported some of the parameters of interest for our research, for 19 different SPEC92 programs and 5 selected C benchmarks. All benchmarks can be divided into five classes, according to the percentage of taken conditional branches (Table 1). The other important parameter, DB, varies within each class. We generated two synthetic benchmarks per class, with DB taking values near the minimum and the maximum of the interval (Table 2).

Table 1. Benchmark classes

The code length is 20 000 object instructions, and at least 75% of it is executed during the simulation time (1 000 000 dynamic instructions). The simulation time and code length are statistically acceptable, as shown in [9].

Table 2. Synthetic benchmark parameter values

Some of the simulation results are presented in Figure 2. If the number of pipeline stages needed for branch resolution (D) is relatively small, the obtained results justify the usage of software methods. For very deep pipelines (e.g., D=12), even the simplest hardware method, which always falls through, shows nearly equal or even better performance than the compiler-supported methods for the majority of the examined benchmarks (classes 1, 2, 3, and 4). Only when $P_t > 90\%$ (highly repetitive structures) do software methods perform significantly better than Assume Branch Not Taken, but they remain inferior to the BTB mechanism. Software methods expand the code and increase the number of cache misses; on the other hand, not all the slots are filled with useful instructions, hence the poor performance in very deep pipelines.

[Figure: three bar charts (2-a, 2-b, 2-c) of U for the series btb, abnt, ignm, igna, ignc, and grhe.]
Figure 2. Processor utilization for 10 synthetic benchmarks - simulation model. Parameter values: 2-a: D=3, Csize=16K, Miss=1; 2-b: D=12, Csize=8K, Miss=3; 2-c: D=12, Csize=16K, Miss=3.
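The BTB configuration described in Section 2 (64 entries, fully associative, LRU replacement, only taken branches stored, no prediction bits) can be sketched as follows. This is an illustrative model rather than the simulator used in this study: the class name `SimpleBTB`, the trace format, and the choice to drop an entry when its branch falls through are assumptions.

```python
from collections import OrderedDict


class SimpleBTB:
    """Fully associative, LRU-replaced buffer holding taken branches only."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = OrderedDict()  # branch PC -> target, least recent first

    def predict(self, pc):
        """Return the predicted target, or None to predict 'not taken'."""
        target = self.table.get(pc)
        if target is not None:
            self.table.move_to_end(pc)  # a hit makes the entry most recent
        return target

    def update(self, pc, taken, target=None):
        """Record the resolved outcome of the branch at pc."""
        if taken:
            self.table[pc] = target
            self.table.move_to_end(pc)
            if len(self.table) > self.entries:
                self.table.popitem(last=False)  # evict least recently used
        else:
            # Assumption: with no prediction bits, a branch that falls
            # through is removed so that only taken branches are stored.
            self.table.pop(pc, None)


def prediction_rate(trace, btb=None):
    """Fraction of correctly predicted branches in (pc, taken, target) tuples."""
    if btb is None:
        btb = SimpleBTB()
    correct = 0
    for pc, taken, target in trace:
        predicted = btb.predict(pc)
        if taken:
            correct += predicted == target
        else:
            correct += predicted is None
        btb.update(pc, taken, target)
    return correct / len(trace)


if __name__ == "__main__":
    # Hypothetical loop branch: taken nine times, then falls through once.
    loop = [(100, True, 40)] * 9 + [(100, False, None)]
    print(prediction_rate(loop))
```

In this model a loop branch is mispredicted on its first iteration and again on loop exit, so the prediction rate grows with the fraction of taken branches.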
The simulated BTB is one of the simplest dynamic predictors, with a relatively low branch prediction rate. If $P_t > 65\%$, BPR increases; as a result, processor utilization increases considerably. More complex dynamic methods would certainly give better results, but at a higher hardware cost.

6. Conclusion

New technologies, characterized by very deep pipelines and relatively expensive chip area, require a reevaluation of branch mechanisms under the changed conditions of analysis. Software methods (especially Ignore) can be used in a relatively shallow pipeline, but deep pipelines require hardware approaches. If hardware complexity is the critical issue, simple hardware methods (e.g., a simple BTB) give the best performance. Even the simplest hardware method, Assume Branch Not Taken, outperforms the Ignore mechanisms for applications in which the probability of a taken conditional branch is less than 90%.

References

[1] Helbig, W., Milutinović, V., A DCFL E/D-MESFET GaAs Experimental RISC Machine, IEEE Transactions on Computers, February 1989, Vol. 38, No. 2, pp. 263-274.
[2] Milutinović, V., Surviving the Design of a 200 MHz RISC Microprocessor: Lessons Learned, IEEE Computer Society Press, Los Alamitos, California, U.S.A., 1996.
[3] Cragon, H.G., Branch Strategy Taxonomy and Performance Models, IEEE Computer Society Press, Los Alamitos, California, U.S.A., 1992.
[4] Patterson, D.A., Hennessy, J.L., Computer Organization and Design, Morgan Kaufmann Publishers, San Mateo, California, U.S.A., 1994, pp. 425-430.
[5] Perleberg, C., Smith, A.J., BTB Design and Optimization, IEEE Transactions on Computers, April 1993, Vol. 42, No. 4, pp. 396-412.
[6] Hennessy, J.L., Gross, T.R., Optimizing Delayed Branches, Proceedings of the MICRO-15 Workshop, Palo Alto, California, U.S.A., 1982, pp. 114-120.
[7] Chow, P., Horowitz, M., Architectural Tradeoffs in the Design of MIPS-X, Proceedings of the 14th ISCA, Pittsburgh, Pennsylvania, U.S.A., June 1987, pp. 300-308.
[8] Milutinović, V., Simulation Study of a Mechanism for Delayed Branch Control, Facta Universitatis, Series: El. and En., Vol. 1, Niš, Srbija, Jugoslavija, 1995, pp. 133-147.
[9] Petrović, M., Tartalja, I., The Development of Tools For Evaluation of Branch Mechanisms In Deep Pipeline Processors, Proceedings of YUINFO'95, Brezovica, Srbija, Jugoslavija, April 1995.
[10] Calder, B., Grunwald, D., Reducing Branch Cost via Branch Alignment, Proceedings of the ASPLOS VI, San Jose, California, U.S.A., October 1994, pp. 242-251.
[11] Petrović, M., Tartalja, I., Milutinović, V., A Comparation of Branch Mechanisms With Small Hardware Complexity In Deep Pipeline Processors, IFACT Technical Report, Belgrade, Serbia, Yugoslavia, January 1995.