Pipe Lining

PROC. 20th INTERNATIONAL CONFERENCE ON MICROELECTRONICS (MIEL'95), VOL.
2, NIŠ, SERBIA, 12-14 SEPTEMBER, 1995

Branch Mechanisms in Deep Pipelines: Reevaluating
the Existing Solutions and Proposing a New Guideline
Milena Petrović, Igor Tartalja, and Veljko Milutinović
Abstract: In this study we show that the ranking of branch
mechanisms changes when the underlying technology changes
from advanced CMOS, used for state-of-the-art commercial
microprocessors, to the more advanced technologies that will
probably be used in future microprocessors. New technologies
such as GaAs or GaInAs imply a deeper instruction pipeline.
Hence, the problem studied here is important, because branch
instruction execution is one of the most serious causes of
performance degradation in deep pipeline processors. Software
methods (Gross-Hennessy and Ignore), which exhibit an
advantage over the elementary hardware schemes (Assume
Branch Not Taken and Branch Target Buffer) in short pipelines,
become inferior in deep pipelines.
1. Introduction
Efficient design of branching-related mechanisms is of
crucial importance for the overall performance and
commercial success of novel microprocessors. Most of the
research effort is directed towards the issues in superscalar
and superpipelined computing, assuming conditions
typical for present-day CMOS silicon (not so deep
instruction pipelines). However, very little research effort
is devoted to the problems typical for the emerging
technologies characterized by relatively large ratios of
off-chip to on-chip delays (which result in extremely
deep pipelines).
Some of the early research in the field of branching for
deep pipelines was related to DARPA's effort to
develop a GaAs RISC machine for VLSI [1]. In the
meantime, we realized that the solution used there (the ignore
instruction), with some improvements, could be used in
present-day efforts dedicated to more advanced
technologies [2]. However, the problem to be resolved
before an aggressive usage of the ignore-style mechanisms
is the precise determination of the performance and
complexity related implications in different environments
of interest. This is exactly the subject and the goal of the
research presented in this paper.
This research was partially supported by the FNRS.
M. Petrović, I. Tartalja, and V. Milutinović are with the
School of Electrical Engineering, University of Belgrade,
Bul. Revolucije 73, POB 816, 11000 Beograd, Yugoslavia.
E-mail: {epetromi, etartalj, emilutiv}@ubbg.etf.bg.ac.yu.
With all of the above in mind, our paper compares various
previously proposed and some newly devised ignore-style
mechanisms with a number of state-of-the-art branching
mechanisms proposed in the literature or utilized in modern
microprocessor systems (Assume Branch Not Taken,
Branch Target Buffer, Delayed Branch).
In our analysis, we have used a modeling and
evaluation methodology related to the one from
[3]. We have shown that the ranking of the above
mentioned mechanisms changes when the underlying
technology changes from advanced CMOS (used for
state-of-the-art commercial microprocessors) to the
more advanced technologies characterized by a relatively
large ratio of off-chip to on-chip delays (to be used for
future commercial microprocessors). In order to make our
results of practical value, all the analyses
assume an underlying architecture with cache memory.
2. Branch Mechanisms
The conditional changes in control flow are one of the
most important causes of pipeline performance
degradation. Basically, there are two possible approaches
to the branching problem: a software-oriented approach
(the compiler tries to fill the branch delay slots with useful
instructions), and hardware-oriented speculative
execution (the outcome of the branch is predicted in an
early pipeline stage). Speculative execution can be static
or dynamic.
In this study we compared the performance of several
branch mechanisms characterized by small hardware
complexity, and thus very appropriate for new and
expensive technologies.
0-7803-2786-1/95/$4.00 © 1995 IEEE
Assume Branch Not Taken [4]. A conditional branch is
always predicted as not taken, and instruction fetching
continues sequentially after the branch. In the wrong
prediction case, the pipeline is flushed and instructions
from the branch target are fetched.
Branch Target Buffer (BTB) [5]. The BTB is a small cache
memory that can store the predicted target address (or
the target instructions) and the information necessary for
branch prediction. A number of recent reports describe very
sophisticated branch predictors, but these are generally
characterized by high hardware complexity. The BTB that we
modeled is a very simple one: only taken branches are
stored, and there are no prediction bits. The BTB is a
64-entry, fully associative LRU cache. No cycles are lost if
the prediction is correct. Updating the BTB after an
incorrect prediction takes one cycle.
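The BTB described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulator used in the study: the class name `SimpleBTB` and its methods are hypothetical, and removing an entry when a stored branch turns out not taken is our reading of "only taken branches are stored".

```python
from collections import OrderedDict


class SimpleBTB:
    """Fully associative, LRU-replaced BTB with no prediction bits (sketch)."""

    def __init__(self, entries=64):
        self.entries = entries
        # Maps branch address -> target address, least recently used first.
        self.table = OrderedDict()

    def predict(self, pc):
        """Return the stored target on a hit (predict taken), else None."""
        target = self.table.get(pc)
        if target is not None:
            self.table.move_to_end(pc)  # a hit makes the entry most recent
        return target

    def update(self, pc, taken, target):
        """Record a resolved branch: keep taken branches, drop the others."""
        if taken:
            self.table[pc] = target
            self.table.move_to_end(pc)
            if len(self.table) > self.entries:
                self.table.popitem(last=False)  # evict the LRU entry
        else:
            # Assumption: a stored branch that falls through is removed.
            self.table.pop(pc, None)
```

A branch that misses in the table is treated as predicted not taken, so with an empty table the behaviour matches Assume Branch Not Taken.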
Delayed Branch [6]. The compiler tries to fill as many
branch delay slots as possible with useful instructions
(instructions from the basic block above the branch, from
the sequential path after the branch, or from the target),
according to the Gross-Hennessy optimization algorithm. Since
the compiler is unable to fill more than a few delay slots, the
rest are filled with nop instructions. An improved version
is Delayed Branch With Squashing [7], which allows any
instructions from the branch destination basic block to be
placed in the delay slots and squashed in the misprediction
case.
Ignore [1,8]. The Ignore mechanism is a further
improvement of the software methods. The Ignore Always
instruction replaces a series of nop instructions, forbidding
new instruction fetch during the appropriate number of
cycles. Conditional Ignore allows partial squashing of
delay slots. The branch delay area is divided in two parts:
one filled by the Gross-Hennessy algorithm with always
executable instructions, and the other filled with
squashable instructions from the predicted branch destination.
In the case of a very deep pipeline the compiler is unable
to fill the whole squashable area, so the rest is filled with
nops. We also propose a Mixed Ignore instruction, which
improves Conditional Ignore. As in the Ignore Always
approach, rather than filling the remaining branch delay
slots with nops, the Ignore instruction is inserted after the
squashable area. All Ignore versions can be integrated (with a
branch instruction) or separated.
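To make the difference between the software methods concrete, the following Python sketch lists the code words that follow a branch under each scheme. It is a hypothetical illustration: the function name, the labels, and the choice to model the separated Ignore variant as one extra instruction word are assumptions, not details taken from the cited papers.

```python
def delay_area(depth, always_filled, squash_filled, mechanism):
    """Return labels for the code words placed after a branch (sketch).

    depth         -- D, the stage in which the branch is resolved
    always_filled -- slots filled with always executable instructions
    squash_filled -- slots filled with squashable instructions
                     (used only by the two conditional schemes)
    """
    slots = depth - 1  # D - 1 delay slots follow the branch
    if always_filled + squash_filled > slots:
        raise ValueError("more filled slots than delay slots")
    useful = ["useful"] * always_filled
    squashable = ["squashable"] * squash_filled
    if mechanism == "delayed_branch":
        return useful + ["nop"] * (slots - always_filled)
    if mechanism == "ignore_always":
        # One Ignore word stands in for the whole run of nops.
        return useful + (["ignore"] if slots > always_filled else [])
    rest = slots - always_filled - squash_filled
    if mechanism == "conditional_ignore":
        return useful + squashable + ["nop"] * rest
    if mechanism == "mixed_ignore":
        return useful + squashable + (["ignore"] if rest > 0 else [])
    raise ValueError("unknown mechanism: " + mechanism)
```

For D=12 with two always executable and three squashable instructions, the delayed branch occupies eleven words after the branch, while Mixed Ignore occupies six. The smaller code footprint is relevant for the instruction cache.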
3. Conditions of the Analysis
The performance of the different branch mechanisms
is evaluated using a pipelined processor model developed
for this purpose, with one level of cache memory (separate
for data and instructions). The cache memory is direct-mapped;
one cache line stores one instruction word. We have
functionally simulated just two pipeline stages: the Fetch
stage and the Branch Evaluation stage (the rest of the
pipeline is of no interest for this research).
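A minimal Python sketch of such a two-stage model, for the Assume Branch Not Taken scheme only, is given below. It is an illustration under simplifying assumptions rather than the simulator used here: the function name and trace format are hypothetical, the full address serves as the cache tag, and wrong-path fetches are assumed not to disturb the cache.

```python
def simulate_abnt(trace, depth, cache_lines, miss_penalty):
    """Return utilization = useful cycles / all cycles (sketch).

    trace        -- iterable of (pc, is_branch, taken), one tuple per
                    dynamically executed instruction
    depth        -- D, the stage in which the branch is resolved
    cache_lines  -- number of lines in the direct-mapped instruction cache
    miss_penalty -- extra cycles per instruction cache miss
    """
    tags = [None] * cache_lines  # one instruction word per cache line
    useful = 0
    wasted = 0
    for pc, is_branch, taken in trace:
        index = pc % cache_lines
        if tags[index] != pc:  # instruction cache miss
            tags[index] = pc
            wasted += miss_penalty
        useful += 1
        if is_branch and taken:
            # The D - 1 sequentially fetched instructions are flushed.
            wasted += depth - 1
    return useful / (useful + wasted)
```

With a miss penalty of zero the result depends only on the taken branches, which corresponds to the conditions of the analytical model.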
We decided to use synthetic programs as the
simulator input, in order to have great flexibility in the
choice of parameter values. A special synthetic code
generator was developed, based on a probabilistic model. The
representative test programs, with acceptable length and
simulation time, are obtained using this generator [9]. The
generated code is structured and consists of conditional
branch and non-branch instructions. A forward branch is
assumed to be an if-branch and a backward branch is assumed
to be a loop-branch. Some of the synthetic benchmark
parameter values were chosen according to the values of
the SPEC92 set of benchmarks [10].
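The idea of a probabilistic code generator can be sketched as follows. This does not reproduce the generator from [9], whose details are not given here: the function name, the parameters, and the uniform choice of branch distance are assumptions, and the sketch does not enforce properly nested (structured) code.

```python
import random


def generate_program(length, branch_fraction, loop_fraction, max_span, seed=0):
    """Generate a static program as a list of (kind, target) pairs (sketch).

    kind is "op" (non-branch), "if" (forward branch) or "loop"
    (backward branch); target is an instruction index, or None for "op".
    max_span must be at least 2.
    """
    rng = random.Random(seed)  # seeded, so the program is reproducible
    program = []
    for address in range(length):
        if rng.random() >= branch_fraction:
            program.append(("op", None))
        elif rng.random() < loop_fraction and address > 0:
            # Backward branch: target lies before the branch itself.
            target = rng.randint(max(0, address - max_span), address - 1)
            program.append(("loop", target))
        else:
            # Forward branch: target lies after the branch; a target equal
            # to `length` means a jump to the end of the program.
            target = min(length, address + rng.randint(2, max_span))
            program.append(("if", target))
    return program
```

Taken probabilities for the two branch kinds would be attached at simulation time, which keeps the static code and its dynamic behaviour as separate parameters.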
4. Analytical Evaluation
We used a simple analytical model to evaluate the
performance of the different branch mechanisms.
[Figure 1: bar chart of processor utilization U (vertical axis, in %) for the btb, abnt, ignc and grhe mechanisms.]
Figure 1. Processor utilization for 10 synthetic benchmarks -
Analytical model. Parameter values: D=12, Miss=0.
The performance measure is processor utilization, U
(equal to the ratio of useful cycles to the number of all
cycles, useful and wasted). If the pipeline is flushed due to
the wrong branch prediction or if the nop instruction is
executed, processor cycles are wasted and processor
utilization decreases. The cache miss cycles are not
included in the model.
Parameters of interest are: DB (dynamic branch
percentage), D (number of pipeline stages needed for
branch resolution), $P_{lp}$ and $P_{if}$ (percentage of dynamic
branches backward and forward), $P_t$ and $P_n$ (percentage of
taken and not taken branches), $P_{lpt}$ and $P_{ift}$ (percentage of
taken branches among the backward and forward branches,
respectively), BPR (branch prediction rate of the BTB), Ud
(BTB updating cycles), $P_f(i)$, $P_a$, $P_b$, $P_c$, and $N_{fill}$ (the
frequency of filling the i-th delay slot, the frequency of
filled slots according to the a, b, and c Gross-Hennessy
algorithm schemes, and the average number of filled slots),
$P_{ftar}$ and $P_{fseq}$ (the percentages of delay slots filled from
the target basic block and from the sequential one, for the
Conditional Ignore mechanism):
\[ N_{fill} = \sum_{i} P_f(i); \qquad P_t = P_{ift} P_{if} + P_{lpt} P_{lp} \]
\[ P_n = 1 - P_t; \qquad P_a + P_b + P_c = 1 \]
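As a worked example of the relation between the overall taken percentage and the per-direction percentages, the short Python sketch below computes $P_t$ and $P_n$. The function name and the input values are made up for illustration and are not measurements from the study.

```python
def taken_fraction(p_if, p_ift, p_lp, p_lpt):
    """Overall fraction of taken branches, P_t (all arguments in 0..1).

    p_if, p_lp   -- fractions of forward (if) and backward (loop) branches
    p_ift, p_lpt -- fractions of if-branches and loop-branches that are taken
    """
    if abs(p_if + p_lp - 1.0) > 1e-9:
        raise ValueError("every branch is either forward or backward")
    return p_ift * p_if + p_lpt * p_lp


# Hypothetical mix: 70% if-branches taken 40% of the time,
# 30% loop-branches taken 90% of the time.
p_t = taken_fraction(p_if=0.7, p_ift=0.4, p_lp=0.3, p_lpt=0.9)
p_n = 1.0 - p_t
print(round(p_t, 2), round(p_n, 2))  # prints "0.55 0.45"
```

In this example the loop-branches contribute almost as much to $P_t$ as the if-branches, although they are less than half as frequent.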
U is equal to:
Assume Branch Not Taken:
\[ U = \frac{1}{P_t \, (D-1) \, DB + 1} \]
Branch Target Buffer:
\[ U = \frac{1}{(1-BPR) \, DB \, (D-1+Ud) + 1} \]
Gross-Hennessy and Ignore Always:
\[ U = \frac{1}{\bigl((D-1) - N_{fill}\,(P_a + P_b P_n + P_c P_t)\bigr) \, DB + 1} \]
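Conditional Ignore and Mixed Ignore:
\[ U = \frac{1}{\bigl((D-1) - N_{fill}\,(P_a + P_b P_n + P_c P_t) - (P'_f P_{ift} P_{if} + P''_f P_{lpt} P_{lp})\,(D-1-N_{fill})\bigr) \, DB + 1} \]
Here $P'_f$ and $P''_f$ stand for the slot-filling percentages of the
Conditional Ignore mechanism ($P_{ftar}$, $P_{fseq}$); their subscripts,
and the sign joining the last term, are not clearly legible in
the source.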
Figure 1 shows the values obtained from our analytical
model, with the same parameter values as for the
simulation-based analysis presented in the
next section.
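The analytical expressions for Assume Branch Not Taken, the BTB, and Gross-Hennessy / Ignore Always can be evaluated with a few lines of Python. This is a sketch: the function names are ours, DB and the probabilities are passed as fractions rather than percentages, scheme b slots are assumed useful when the branch is not taken and scheme c slots when it is taken, and the input values are illustrative rather than those behind Figure 1.

```python
def u_abnt(d, db, p_t):
    """Utilization for Assume Branch Not Taken."""
    return 1.0 / (p_t * (d - 1) * db + 1.0)


def u_btb(d, db, bpr, ud=1):
    """Utilization for the simple BTB; ud is the update time in cycles."""
    return 1.0 / ((1.0 - bpr) * db * (d - 1 + ud) + 1.0)


def u_gross_hennessy(d, db, n_fill, p_a, p_b, p_c, p_t):
    """Utilization for Gross-Hennessy and Ignore Always."""
    p_n = 1.0 - p_t
    useful_slots = n_fill * (p_a + p_b * p_n + p_c * p_t)
    return 1.0 / (((d - 1) - useful_slots) * db + 1.0)


# Hypothetical workload: 15% branches, 60% of them taken, and on average
# 1.5 delay slots filled by the compiler.
for d in (3, 12):
    abnt = u_abnt(d, db=0.15, p_t=0.6)
    grhe = u_gross_hennessy(d, db=0.15, n_fill=1.5,
                            p_a=0.6, p_b=0.2, p_c=0.2, p_t=0.6)
    print(d, round(abnt, 3), round(grhe, 3))
```

With these made-up inputs the software method is ahead at D=3 and behind at D=12, because the number of slots the compiler can fill stays fixed while the number of delay slots grows with D.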
5. Simulation-Based Evaluation
The synthetic model includes two additional parameters:
the cache size (Csize) and the number of extra cycles due
to a cache miss (Miss) [11].
Calder and Grunwald in [10] reported some of the
parameters of interest for our research, for 19 different
SPEC92 programs and 5 selected C benchmarks. All
benchmarks can be divided into five classes, according to
the percentage of taken conditional branches (Table 1). The
other important parameter, DB, varies within each class.
We generated two synthetic benchmarks per class, with
DB taking values near the minimum and maximum of its
interval (Table 2).
Table 1. Benchmark classes
(The class 1 entry did not survive extraction. The second column
holds the two numbers found next to each class in the source,
presumably the bounds of the DB interval in %.)
Class   Taken conditional branches (%)   Values listed with the class
2       50-60                            5, 12.6
3       60-70                            4.4, 18
4       70-90                            2.5, 11.5
5       >90                              1.6, 10.8
The code length is 20 000 object instructions, and at
least 75% is executed during the simulation time (1 000 000
dynamic instructions). The simulation time and code
length are statistically acceptable, as shown in [9].
Table 2. Synthetic benchmark parameter values
Some of the simulation results are presented in
Figure 2. If the number of pipeline stages needed for
branch resolution (D) is relatively small, the obtained
results justify the usage of software methods.
For very deep pipelines (e.g., D=12), even the simplest
hardware method, which always falls through, shows nearly
equal or even better performance than the compiler-supported
methods for the majority of examined
benchmarks (classes 1, 2, 3 and 4). Only when $P_t$ > 90%
(highly repetitive structures) do software methods perform
significantly better than Assume Branch Not Taken, but
they are still inferior to the BTB mechanism.
Software methods expand the code and increase the
number of cache misses; on the other hand, not all the
slots are filled with useful instructions, hence the poor
performance in very deep pipelines.
[Figure 2: three bar charts (panels 2-a, 2-b, 2-c) of processor utilization U (vertical axis, in %) for the btb, abnt, ignm, igna, ignc and grhe mechanisms.]
Figure 2. Processor utilization for 10 synthetic benchmarks -
Simulation model. Parameter values: 2-a: D=3, Csize=16K,
Miss=1; 2-b: D=12, Csize=8K, Miss=3; 2-c: D=12,
Csize=16K, Miss=3.
The simulated BTB is one of the simplest dynamic
predictors, with a relatively low branch prediction rate. If
$P_t$ > 65%, then BPR increases; as a result, the processor
utilization increases considerably. More complex dynamic
methods would certainly give better results, but at a
higher hardware cost.
6. Conclusion
New technologies, characterized by very deep
pipelines and relatively expensive chip area, require
reevaluation of branch mechanisms under the changed
conditions of analysis. Software methods (especially
Ignore) can be used in a relatively shallow pipeline, but
deep pipelines require hardware approaches. If hardware
complexity is the critical issue, simple hardware methods
(e.g., a simple BTB) give the best performance. Even
the simplest hardware method, Assume Branch Not Taken,
outperforms the Ignore mechanisms for applications
in which the probability of a taken conditional branch is
less than 90%.
References
[1] Helbig, W., Milutinović, V., A DCFL E/D-MESFET GaAs
Experimental RISC Machine, IEEE Transactions on
Computers, February 1989, Vol. 38, No. 2, pp. 263-274.
[2] Milutinović, V., Surviving the Design of a 200 MHz RISC
Microprocessor: Lessons Learned, IEEE Computer Society
Press, Los Alamitos, California, U.S.A., 1996.
[3] Cragon, H.G., Branch Strategy Taxonomy and Performance
Models, IEEE Computer Society Press, Los Alamitos,
California, U.S.A., 1992.
[4] Patterson, D.A., Hennessy, J.L., Computer Organization and
Design, Morgan Kaufmann Publishers, San Mateo,
California, U.S.A., 1994, pp. 425-430.
[5] Perleberg, C., Smith, A.J., BTB Design and Optimization,
IEEE Transactions on Computers, April 1993, Vol. 42,
No. 4, pp. 396-412.
[6] Hennessy, J.L., Gross, T.R., Optimizing Delayed
Branches, Proceedings of the MICRO-15 Workshop, Palo
Alto, California, U.S.A., 1982, pp. 114-120.
[7] Chow, P., Horowitz, M., Architectural Tradeoffs in the
Design of MIPS-X, Proceedings of the 14th ISCA,
Pittsburgh, Pennsylvania, U.S.A., June 1987, pp. 300-308.
[8] Milutinović, V., Simulation Study of a Mechanism for
Delayed Branch Control, Facta Universitatis, Series: El.
and En., Vol. 1, Niš, Srbija, Jugoslavija, 1995, pp. 133-147.
[9] Petrović, M., Tartalja, I., The Development of Tools For
Evaluation of Branch Mechanisms In Deep Pipeline
Processors, Proceedings of YUINFO95, Brezovica, Srbija,
Jugoslavija, April 1995.
[10] Calder, B., Grunwald, D., Reducing Branch Cost via Branch
Alignment, Proceedings of the ASPLOS VI, San Jose,
California, U.S.A., October 1994, pp. 242-251.
[11] Petrović, M., Tartalja, I., Milutinović, V., A Comparation of
Branch Mechanisms With Small Hardware Complexity In
Deep Pipeline Processors, IFACT Technical Report,
Belgrade, Serbia, Yugoslavia, January 1995.