
Technische Universität Ilmenau

Fakultät für Informatik und Automatisierung


Institut für Theoretische Informatik
Fachgebiet Komplexitätstheorie und Effiziente Algorithmen

Bachelor Thesis

Influence of Branch Mispredictions on


Sorting Algorithms

Mohamad Karam Kassem

Ilmenau, 2 May 2018

Supervised by
Univ.-Prof. Dr. Martin Dietzfelbinger
Acknowledgement

“Praise be to Allah, the Lord of the entire universe” [Quran 1:2]

I would like first and foremost to thank my supervisor Univ.-Prof. Dr. Martin Dietzfelbinger for giving me the opportunity to explore such an interesting topic and for supporting me with the resources needed to continue this work. I would also like to thank my parents Samar and Essam for their endless love and support, and my wonderful brothers Kasem, Firas and Ghaith, who took care of me during the hard days of my study. Thanks to all my friends who helped in any matter, small or big.
Abstract

Classical analysis of algorithms is usually interested in counting the classical elementary operations (additions, multiplications, comparisons, swaps, etc.). Sometimes such an approach is not enough to quantify an algorithm's efficiency. Modern processor architectures employ many techniques to speed up execution, such as instruction pipelining, which in turn introduces effects (such as pipeline hazards) that have a significant impact and should be considered as well. We present an in-depth study of optimization methods for variants of Quicksort which take these aspects, in particular branch misprediction, into account. We observe significant improvements over naive approaches.
Zusammenfassung

Die klassische Analyse von Algorithmen beschäftigt sich normalerweise mit der Anzahl der klassischen Elementaroperationen (Additionen, Multiplikationen, Vergleiche, Swaps usw.). Manchmal reichen solche Analyseansätze nicht aus, um die Effizienz eines Algorithmus zu quantifizieren. Moderne Prozessorarchitekturen haben viele verschiedene Techniken, um ihre Ausführung zu verbessern, wie z.B. Befehls-Pipelining, die viele Türen öffnen (wie Pipeline-Hazards), die erhebliche Auswirkungen haben und betrachtet werden sollen. Wir präsentieren eine detaillierte Studie zu Optimierungsmethoden für Quicksort-Varianten, die diese Aspekte, insbesondere die Branch-Misprediction, berücksichtigen. Wir beobachten signifikante Verbesserungen gegenüber naiven Ansätzen.

List of Figures

1.1  The relation between execution time, number of branch misses and number of executed instructions [KS06].
2.1  Pipelining instructions (figure in the style of [Zar96]).
2.2  A classification of assembly instructions in terms of execution speed.
2.3  Branch misprediction (Listing 2.2), assuming an always-taken static predictor (figure in the style of [Zar96]).
2.4  Always-taken static predictor (left) and always-not-taken static predictor (right).
2.5  1-bit dynamic predictor.
2.6  2-bit-saturate-counter dynamic predictor.
2.7  2-bit-flip-on-consecutive dynamic predictor.
2.8  Probability of a misprediction of local prediction schemes [MNW15].
2.9  Execution time of simultaneous minimum and maximum searching [ANP16].
2.10 Array partitioning using a single pivot.
2.11 Quicksort using Hoare partitioning method and median-of-three pivot example.
2.12 Array partitioning using dual pivots according to YBB.
3.1  The expected number of branch misses according to the 2-bit-saturate-counter predictor for programs E and M.
3.2  Branch categorization.
4.1  First step of lean symmetric dual-pivot partitioning (Listing 4.7).
4.2  Pipelining the loop body of naive-MinMax assuming no misprediction happens.
4.3  Pipelining the loop body of lean naive-MinMax.
4.4  Unrollsix partitioning.
4.5  Unrollsix partitioning - State 7 = [000 111]_2.
4.6  Unrollsix partitioning - State 11 = [001 011]_2.
4.7  Unrollsix partitioning - State 47 = [101 111]_2.
4.8  Lookup table guided power (introduced by Dietzfelbinger, Listing 4.17).
4.9  Lookup table MinMax (Listing 4.18).
4.10 Variants of selection (left: conditional select, right: pseudo conditional select).
4.11 Variants of conditional swap (left: conditional swap, right: pseudo conditional swap).
4.12 Variants of Guided Power (left: Guided Power (Listing 7.1) [ANP16], right: pseudo conditional branch Guided Power (Listing 7.2)).
4.13 Blind Single-Pivot partitioning.
4.14 Blind Single-Pivot partitioning example for 9 elements.
4.15 Storing indexes in preparation for the rearrangement phase in single-pivot block partitioning (Listing 7.9).
5.1  Simultaneous maximum and minimum.
5.2  Exponentiation by squaring.
5.3  Single-pivot partitioning algorithms.
5.4  Dual-pivot partitioning algorithms.
5.5  Triple-pivot partitioning algorithms.
5.6  Various Quicksort implementations test.
5.7  Quicksort race of all partitioning algorithms considered in this thesis.
5.8  Quicksort algorithms ranking.

List of Tables

2.1  Misprediction penalty of various processors measured in CPU cycles [Fog17b].
2.2  Local prediction schemes.
3.1  The expected numbers of branch misses (Section 2.3.1).
4.1  Determining the smaller/larger elements in Listing 4.13.
5.1  Execution time of naive-MinMax and lean-MinMax measured in microseconds.

Listings

2.1  A simple conditional branch written in C++.
2.2  Listing 2.1 transformed to assembly code using the g++ compiler.
4.1  Swap two elements.
4.2  Three elements rotation.
4.3  Four elements rotation.
4.4  Assembly code with an unpredictable branch.
4.5  Listing 4.4 optimized by the SETcc instruction.
4.6  Listing 4.4 optimized by the CMOVcc instruction.
4.7  Lean Symmetric Dual-Pivot Partitioning.
4.8  The for-loop of Listing 4.7 transformed to assembly code using g++.
4.9  Naive-MinMax [ANP16].
4.10 The main loop of Listing 4.9 transformed to assembly code using the g++ compiler.
4.11 Lean Naive-MinMax.
4.12 The main loop of Listing 4.11 transformed to assembly code using the g++ compiler.
4.13 Optimized 3/2 MinMax.
4.14 The main loop in Listing 4.13 transformed to assembly code using the g++ compiler.
4.15 Loop before unrolling.
4.16 Loop after unrolling.
4.17 Lookup Table Guided Power (Dietzfelbinger, Section 4.3).
4.18 Lookup Table Naive MinMax.
4.19 Conditional swap.
4.20 Pseudo conditional swap.
4.21 The main loop of lean Lomuto partitioning (Listing 7.5).
4.22 Listing 4.21 transformed to assembly code using the g++ compiler.
4.23 The main loop of lean Lomuto partitioning using CMOVcc (Listing 7.6).
4.24 Listing 4.23 transformed to assembly code using the g++ compiler.
5.1  Generating integer permutation.
7.1  Guided Power (Auger, Nicaud and Pivoteau [ANP16]) (Sections 4.3 and 4.4).
7.2  Pseudo Conditional Branch Guided Power (Section 4.4).
7.3  3/2 MinMax (Auger, Nicaud and Pivoteau [ANP16], Section 4.1).
7.4  Hoare Single-Pivot Partitioning (Section 4.2).
7.5  Lean Lomuto Single-Pivot Partitioning (Katajainen [Kat14], Section 4.4).
7.6  Lean Lomuto (cmov) Single-Pivot Partitioning (Katajainen [Kat14], Section 4.4).
7.7  Lean Blind Single-Pivot Partitioning (Seidel idea, Section 4.5).
7.8  Unrollsix Single-Pivot Partitioning (Section 4.2).
7.9  Block Single-Pivot Partitioning (Section 4.6).
7.10 YBB Dual-Pivot Partitioning (Yaroslavskiy, Bentley and Bloch).
7.11 Lean Blind Dual-Pivot Partitioning (Section 4.5).
7.12 Symmetric Triple-Pivot Partitioning (Aumüller and Dietzfelbinger [AD16]).
7.13 Lean Symmetric Triple-Pivot Partitioning (Section 4.1).
7.14 Lean Blind Triple-Pivot Partitioning (Section 4.5).
7.15 Single-Pivot Quicksort.
7.16 Dual-Pivot Quicksort.
7.17 Triple-Pivot Quicksort.

Contents

List of Figures
List of Tables
1 Introduction
  1.1 History
  1.2 Motivation
  1.3 Thesis Outline
2 Background
  2.1 Pipelining
  2.2 Instructions costs
  2.3 Branch misses
    2.3.1 Prediction schemes
    2.3.2 Observation from the literature
      2.3.2.1 Simultaneous Maximum and Minimum
      2.3.2.2 Exponentiation by Squaring
  2.4 Lean procedure
  2.5 Quicksort
    2.5.1 Classical Quicksort
    2.5.2 New Quicksort
    2.5.3 Approaches in previous work
      2.5.3.1 Katajainen approach
      2.5.3.2 Block approach
      2.5.3.3 Seidel approach (private communication)
3 Branch categorization
  3.1 Branch categories
  3.2 Real example
4 Eliminating branch misses
  4.1 Conditional instruction
  4.2 Loop Unrolling
  4.3 Lookup table
  4.4 Pseudo conditional branch
  4.5 Safety blind swaps
  4.6 Static fragmentation
5 Experiments
  5.1 Setup for the Experiments
    5.1.1 Machine and Software
    5.1.2 Input Generation
    5.1.3 Runtime Measurement Methodology
  5.2 Simultaneous Maximum and Minimum
  5.3 Exponentiation by Squaring
  5.4 Single-pivot partitioning
  5.5 Multi-pivot partitioning
  5.6 Quicksort
6 Conclusion
7 Appendix
  7.1 Proofs
  7.2 GNU Compiler Collection
  7.3 Routines
8 Bibliography

Chapter 1

Introduction

In the real world, a very wide range of applications handles data: databases, web servers, schedulers, networking systems and many engineering applications. Sorting makes many tasks easier, for example searching (binary search in O(log2 n) time) and selection/order statistics (in a sorted array the k-th smallest element can be found in constant time by simply looking at the k-th position).

“Sorting a sequence of elements of some totally ordered universe remains one of the most fascinating and well-studied topics in computer science. Moreover, it is an essential part of many practical applications. Thus, efficient sorting algorithms directly transfer to a performance gain for many applications.” [EW16]
There are several reasons to consider sorting the most fundamental problem in the study of algorithms:

• Some applications/algorithms need to sort information (e.g. task scheduling, or rendering layered graphical objects according to an above-relation) [CLRS09].

• Many important techniques that arise in the sorting problem are used throughout many other algorithms; hence improvements to sorting algorithms carry over to many other algorithms [CLRS09].

• Many engineering issues come to the fore when implementing sorting algorithms, which depend on many factors such as prior knowledge about the keys and satellite data, the memory hierarchy (caches and virtual memory) of the host computer, and the software environment [CLRS09].

“A sorting algorithm describes the method by which we determine the sorted order, regardless of whether we are sorting individual numbers or large records containing many bytes of satellite data.” [CLRS09]

The sorting problem is defined in [CLRS09] as follows:

Input: A sequence of n numbers (a1, a2, . . . , an).
Output: A permutation (reordering) (a′1, a′2, . . . , a′n) of the input sequence such that a′1 ≤ a′2 ≤ · · · ≤ a′n.
In fact, the values to be sorted are rarely isolated; they are usually part of a non-trivial data structure called a record. Each record has a key, on which the records are to be sorted; the remainder of the record consists of satellite data, which is carried around with the key. A sorting algorithm therefore permutes not only the keys but the satellite data as well. If the records contain a large amount of satellite data, these movements are costly, and in order to minimize data movement we permute pointers to the records rather than the records themselves [CLRS09].
Sorting algorithms are classified according to many aspects, such as the use of comparisons (comparison sort vs. counting sort), memory usage (in-place vs. not in-place), and so on [CLRS09].
Comparison is one of the most important of these aspects; it divides sorting algorithms into comparison-based ones (Quicksort, Insertionsort, etc.) and counting-based ones (Radixsort, Bucketsort, etc.). Comparison-based algorithms determine the sorted order of an input array by comparing elements, whereas counting-based algorithms sort n numbers using array indexing as the tool for determining relative order [CLRS09].

1.1 History
In 1962, the Quicksort algorithm was introduced by Tony Hoare; it is considered to be one of the most efficient sorting algorithms. It gained widespread adoption, appearing, for example, in the C standard library and in Java. Quicksort follows a divide-and-conquer strategy: it chooses one pivot element from the input, puts it into its correct position in the array, and partitions the remaining elements around it [EW16].

In 2009, Yaroslavskiy together with Bentley and Bloch improved Quicksort using two pivots (p and q, with p ≤ q) instead of a single one; their algorithm outperforms the state-of-the-art implementation of classical Quicksort used in the Java runtime library (10% faster and 5% fewer comparisons, although with about 150% of the swaps of classical Quicksort [Wil13]). The Yaroslavskiy-Bentley-Bloch (YBB) algorithm was deployed to millions of devices with the release of Java 7 in 2011, which offers it as the default sorting method for primitive-type arrays [Wil16].

1.2 Motivation
In 2006, Kaligosi and Sanders [KS06] noticed that in comparison-based sorting algorithms like Quicksort or Mergesort, neither the executed instructions nor the cache faults dominate the execution time. Comparisons are much more important, but only indirectly, since they cause the direction of the branch instructions depending on them to be mispredicted. Hence improvements to Quicksort such as median-of-three pivot selection bring no significant benefit in practice (at least for sorting small objects), because they increase the number of branch mispredictions. Therefore, it is not only the number of executed instructions that plays a major role in the running time (Figure 1.1).
The findings of Kaligosi and Sanders [KS06] showed that branch mispredictions may have a significant effect on the speed of programs. In Quicksort, for instance, a skewed pivot led to better branch prediction and, possibly, a decrease in computation time.

Figure 1.1: The relation between execution time, number of branch misses and
number of executed instructions [KS06].

An experimental study of sorting and branch prediction [BNWG08] appeared in 2008 and showed that Quicksort's optimization strategies¹ have a strong influence on the predictability of its branches.
“Sorting (and indeed searching) algorithms have been designed to minimize the number of comparisons necessary in the worst case. However, minimizing the number of comparisons makes each comparison less predictable. In addition, it is important to note that a sorting algorithm which performs more comparisons than another does not necessarily have more predictable branches with respect to a given predictor.” [BNWG08]

These results inspired Elmasry and Katajainen. They took the idea of decoupling element comparisons from branches from Mortensen [Mor01], which had also been used by Sanders and Winkel in their Samplesort [SW04], and tried to avoid unpredictable branches altogether (note that after decoupling element comparisons from branches, the lower bound on the number of branch mispredictions proved by Brodal and Moruz in [BM05] no longer applies).
In 2012, Katajainen showed in a talk (updated in 2014) [Kat12] that Quicksort is really faster than Mergesort, and that the secret behind this result is avoiding unpredictable branches as far as possible.
Very recently, in 2017, Seidel² shared with us, in a private communication, his new idea for avoiding unpredictable branches in single-pivot partitioning algorithms. The idea relies strongly on blind swaps (Section 4.5), uses no extra space, and needs only a few assembly instructions.

¹ Several optimization directions: memory (an explicit stack instead of recursion, and quicksorting the smaller number of elements first in order to reduce the worst-case stack space from O(n) to O(log n)), and reducing both the chance of very unbalanced partitioning and the instruction count by median-of-3 pivoting.
² Prof. Dr. Raimund Seidel, Head of Theoretical Computer Science at Saarland University.

1.3 Thesis Outline

In the following chapters, the contribution of this thesis is discussed in detail, in order to explain the suggested solutions and present effective optimization techniques for various algorithms:

• Chapter 2 presents background information about modern hardware architecture, branch mispredictions, Quicksort and related work on this topic.

• Chapter 3 suggests a branch categorization.

• Chapter 4 discusses various strategies to avoid/eliminate branch misses in Quicksort and some other algorithms.

• Chapter 5 shows the results of the algorithm improvements described in Chapter 4.

Chapter 2

Background

2.1 Pipelining
Pipelining is an implementation technique in which multiple instructions are overlapped in execution; it exploits parallelism among the instructions in a sequential instruction stream [PH13].
Modern processors pipeline their instructions in order to speed up execution and reduce run time. Each instruction is split into several stages/phases (the number of stages depends heavily on the architecture). In the best case pipelining speeds up execution by a factor of up to the number of stages (e.g. MIPS instructions classically take five steps). For simplicity we assume that there are five stages:

F  Fetch instruction from memory.
D  Instruction decoding.
O  Operand fetch (register reading).
X  Operation execution or address calculation.
W  Write back the result into a register.

Each stage/phase takes one CPU cycle, so each instruction takes as many cycles as it has phases. The speedup comes from executing different phases of different instructions in parallel, so that no instruction unit is idle. At best, the processor completes one entire instruction per CPU cycle (Figure 2.1 at tick t4).
However, instruction pipelining has to overcome some difficulties. In various situations the next instruction cannot be executed in its clock slot. There are three different types of these problems, which are called hazards (structural, data and control/branch hazards) [PH13].

Structural Hazard. Hardware does not support the combination of instructions that are set to execute.

Data Hazard. Data that is needed to execute the instruction is not yet available.

Control/Branch Hazard. The instruction that was fetched is not the one that is needed, because of a branch misprediction (this topic is explained in depth in Section 2.3).

[Figure: staggered F/D/O/X/W stages of consecutive instructions over clock ticks t0, t1, . . ., with up to five instructions in flight at once.]
Figure 2.1: Pipelining instructions (figure in the style of [Zar96]).

2.2 Instructions costs

Code optimization is not a trivial process, because improving one dimension often means worsening another. Since programming languages such as C/C++ are translated to assembly code as an intermediate step between the source language and machine code, it is important to know the latency/cost of the assembly instructions. We make the following simplifying assumptions.

Simple instructions. Simple instructions include integer arithmetic (add/sub), logic (xor/and) and movement instructions between registers (including the CMOVcc and SETcc instruction families). Such instructions cost about 1 CPU cycle (floating-point addition/subtraction costs the same in x86/x64) [Fog17a].

Multiplication/Division. Integer multiplication instructions cost considerably more than simple instructions (1-7 CPU cycles in x86/x64), and division is even more expensive (12-44 CPU cycles in x86/x64). Processors spend 2-5 CPU cycles for floating-point multiplication and 37-39 for floating-point division [Fog17a].

Branching. Branch instructions cost 1-2 CPU cycles if the branch has been predicted correctly. However, in some cases processors mispredict the outcome of the branch, which leads to a very costly penalty (10-20 CPU cycles) [Fog17a].

Memory Load/Store. Modern computers have about three levels of caches, L1, L2 and L3, each with its own access speed; the lower the cache level, the faster it is (accessing L1/L2/L3 costs about 4/12/44 CPU cycles on Intel Skylake [Cor16], but these numbers differ from one architecture to another). Loading data from RAM is even more expensive than from cache (hundreds of CPU cycles). A memory store is performed in about 1 CPU cycle [Fog17a].

Processor                                    Misprediction Penalty
Intel Core i7                                15
Intel Nehalem                                ≥ 17
Intel Sandy Bridge and Ivy Bridge            ≥ 15
Intel Haswell and Broadwell                  15-20
Intel Pentium 4 processors                   30
AMD K8 and K10                               12-13
AMD Bulldozer, Piledriver and Steamroller    19

Table 2.1: Misprediction penalty of various processors measured in CPU cycles [Fog17b].

1. Simple (arithmetical/logical) instructions and memory stores (fastest)
2. Branching (with correct prediction)
3. Multiplication and division
4. Memory load (slowest)

Figure 2.2: A classification of assembly instructions in terms of execution speed, from fastest to slowest.

2.3 Branch misses

Waiting for the outcome of a conditional branch is very expensive on processors with long pipelines. This is why processors try to predict the next instruction that should be fetched.
Control/branch hazards, caused by conditional branching, are the most harmful hazards for processors. The problem appears when the processor fetches a wrong instruction according to its prediction.

Branch prediction. A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome [PH13].
A correct prediction costs as little as a simple instruction, but a misprediction forces the processor to flush its pipeline and restart from the correct target of the branch. The penalty of a branch misprediction is therefore typically harmful (a large multiple of the cost of executing a correctly predicted branch; see Table 2.1). A processor that pipelines instructions in n stages/phases spends (n − 1) CPU cycles for each branch misprediction [BNWG08] [ANP16].

Listing 2.1: A simple conditional branch written in C++.

int main(int argc, char *argv[]) {
    int a, b, c;

    // func(&a, &b)

    if (a < b) {
        c = a;
    } else {
        c = b;
    }

    return 0;
}

Listing 2.2: Listing 2.1 transformed to assembly code using the g++ compiler.

 1  ; %rbp: base pointer
 2  ; %rsp: stack pointer
 3
 4  pushq %rbp
 5  movq  %rsp, %rbp
 6  movl  -4(%rbp), %eax
 7  cmpl  -8(%rbp), %eax
 8  ; go to L2, if a >= b
 9  jge   L2
10  movl  -4(%rbp), %eax
11  movl  %eax, -12(%rbp)
12  jmp   L3
13  L2:
14  movl  -8(%rbp), %eax
15  movl  %eax, -12(%rbp)

As we can see in the assembly code (Listing 2.2), at line 9 the processor tries to guess whether the next instruction is the one at line 10 or the one at line 13.

[Figure: pipeline diagram of Listing 2.2 in which the instructions fetched after the mispredicted jge L2 are flushed and execution restarts at the correct branch target.]
Figure 2.3: Branch misprediction (Listing 2.2), assuming an always-taken static predictor (figure in the style of [Zar96]).

2.3.1 Prediction schemes

The processor predicts, for each conditional branch, the next instruction that should be executed (in other words, at each if, while, do-while, ...).
Many different strategies and techniques - global³ as well as local prediction schemes - have been invented in order to improve the quality of those predictions. The simplest one is a local static branch predictor (Figure 2.4), which assumes the branch is always taken/not-taken, without using any information from the code execution. There are also adaptive local prediction schemes, called dynamic, that take the decision about a particular branch according to its previous outcome(s) [ANP16].

³ Global prediction schemes will not be considered in our work.

The main difference between 1-bit (Figure 2.5) and 2-bit (Figures 2.6 and 2.7) prediction schemes is the depth of the behavior history of the branch. The first has only one bit, which can represent only two decision states (T/taken, N/not-taken) and carries information only about the previous outcome. 2-bit predictors, however, have four decision states (ST/strongly taken, WT/weakly taken, SN/strongly not-taken, WN/weakly not-taken) and rely on more information to take the decision, which makes the prediction more adaptive [ANP16].
Table 2.2 shows details about five prediction schemes (two of them static), assuming p is the probability that the branch is actually taken (as opposed to predicted taken).

[Figure: two one-state automata, always predicting T (taken) and N (not-taken) respectively.]
Figure 2.4: Always-taken static predictor (left) and always-not-taken static predictor (right).

[Figure: two-state automaton T/N; a taken branch moves to T, a not-taken branch moves to N.]
Figure 2.5: 1-bit dynamic predictor.

[Figure: four-state automaton ST-WT-WN-SN; each taken branch moves one state towards ST, each not-taken branch one state towards SN.]
Figure 2.6: 2-bit-saturate-counter dynamic predictor.

Table 2.2 lists the prediction schemes considered here, with the expected number of branch misses. Proofs of the formulas can be found in Appendix 7.1.

[Figure: four-state automaton ST-WT-WN-SN; the prediction flips only after two consecutive mispredictions.]
Figure 2.7: 2-bit-flip-on-consecutive dynamic predictor.

Name                        Type      Miss Probability                            E[Branch Misses]
always taken                static    1 − p                                       Σ_{k=1}^{n} (1 − p)
always not-taken            static    p                                           Σ_{k=1}^{n} p
1-bit                       dynamic   2p(1 − p)                                   Σ_{k=1}^{n} 2p(1 − p)
2-bit-saturate-counter      dynamic   p(1 − p) / (1 − 2p(1 − p))                  Σ_{k=1}^{n} p(1 − p) / (1 − 2p(1 − p))
2-bit-flip-on-consecutive   dynamic   (2p²(1 − p)² + p(1 − p)) / (1 − p(1 − p))   Σ_{k=1}^{n} (2p²(1 − p)² + p(1 − p)) / (1 − p(1 − p))

Table 2.2: Local prediction schemes.

[Figure: plot of Pr(branch miss) as a function of p ∈ [0, 1] for the five local prediction schemes of Table 2.2.]
Figure 2.8: Probability of a misprediction of local prediction schemes [MNW15].



2.3.2 Observation from the literature

2.3.2.1 Simultaneous Maximum and Minimum
Auger, Nicaud and Pivoteau [ANP16] considered the simple problem of computing both the minimum and the maximum of an array of size n as an introductory example. The naive approach (Listing 4.9) compares each element of the input with both the current minimum and the current maximum, so it uses 2n comparisons. In terms of the number of comparisons there is a better solution, namely 3/2-MinMax (Listing 7.3), which looks at the elements of the array two by two: it compares the smaller one to the current minimum and the larger one to the current maximum, using (3/2)n comparisons.
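For illustration, a minimal sketch of this two-by-two scan (our own rendering; the thesis's actual implementation is Listing 7.3):

// A minimal sketch of the two-by-two scan (cf. Listing 7.3). Assumes N is even.
// The branch on (lo > hi) is the unpredictable one.
void half3_minmax_sketch(int *A, unsigned long int N, int &min, int &max) {
    min = A[0]; max = A[0];
    for (unsigned long int i = 0; i < N; i += 2) {
        int lo = A[i], hi = A[i + 1];
        if (lo > hi) { int t = lo; lo = hi; hi = t; } // mispredicted with prob. 1/2
        if (lo < min) min = lo;  // compare the smaller with the current minimum
        if (max < hi) max = hi;  // compare the larger with the current maximum
    }
}

Each pair of elements costs 3 comparisons, hence (3/2)n comparisons in total.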
However, the surprising observation in [ANP16] was that naive-MinMax is faster than 3/2-MinMax. The reason behind this unexpected result is branch misses: the naive approach makes O(log n) mispredictions, whereas 3/2-MinMax induces O(n) branch misses.

Figure 2.9: Execution time of simultaneous minimum and maximum searching [ANP16].

After analyzing mispredictions in a uniform-random-distribution probabilistic model, they found that the expected number of mispredictions performed by naive-MinMax is asymptotically equivalent to 4 log n for the 1-bit predictor and 2 log n for the 2-bit predictors, while the expected number of mispredictions performed by 3/2-MinMax is asymptotically equivalent to n/4 for all considered predictors [ANP16]. The weakness of 3/2-MinMax (Listing 7.3) is the conditional statement if (A[i] < A[i+1]), which is mispredicted with probability 1/2 (the most harmful probability).
The problem is discussed in depth in Sections 4.1 and 4.3.

2.3.2.2 Exponentiation by Squaring

Auger, Nicaud and Pivoteau also analysed another problem in [ANP16], namely exponentiation by squaring.
“The classical divide-and-conquer algorithm to compute x^n consists in rewriting x^n = (x^2)^⌊n/2⌋ · x^{n_0}, where n_k . . . n_1 n_0 is the binary decomposition of n, in order to divide the size n of the problem by two.” [ANP16]
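As an illustration, a minimal iterative sketch of this scheme (our own, not one of the thesis's listings); the branch tests the low-order bit of n, so its outcome follows the random-looking binary digits of n:

// A minimal iterative sketch of exponentiation by squaring (our illustration).
// The branch on (n & 1) follows the binary digits of n and is hard to predict.
long long power_sketch(long long x, unsigned long long n) {
    long long result = 1;
    while (n > 0) {
        if (n & 1)        // the unpredictable branch
            result *= x;  // multiply in the current bit's factor
        x *= x;           // square
        n >>= 1;          // move to the next binary digit
    }
    return result;
}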

They implemented a new Guided approach (Listing 7.1) to reduce the harmful misprediction effect present in the Classical and Unrolled approaches. The results were very good once they had replaced the branch causing the sizable number of branch misses with a less harmful one.
The problem is discussed in depth in Sections 4.3 and 4.4.

2.4 Lean procedure

Elmasry, Katajainen and Stenmark [EKS12] defined a lean procedure as a procedure that incurs only O(1) branch mispredictions, and showed how to efficiently implement many algorithms related to sorting with only a few branch mispredictions in this way.
“According to a folk theorem, every program can be transformed into a program that produces the same output and only has one loop. We generalize this to a form where the resulting program has one loop and no other branches than the one associated with the loop control.” [EK12]
Elmasry and Katajainen [EK12] proved that a program P of length κ - measured as the number of pure-C⁴ instructions - with a running time of t(n) for an input of size n can be transformed into a branchless program Q (inducing O(1) branch mispredictions) of length O(κ) that runs in O(κ · t(n)) time for the same input as P.
Katajainen [Kat14] divided sorting programs into three main categories, depending on the expected number of branch mispredictions during their running time:

1. Lean: O(1) branch mispredictions incurred when the branch predictor used by the underlying hardware is static.

2. Moderately optimized: O(n) branch mispredictions incurred under the same assumptions as above.

3. Unoptimized: the number of branch mispredictions incurred is proportional to the number of element comparisons performed.

⁴ A pure-C program is a sequence of possibly labelled statements that are executed sequentially unless the order is altered by a branch statement [EK12]. See also the pure-C cost model ([Mor01], Chapter 3).

2.5 Quicksort
Quicksort is one of the most widely used sorting algorithms; it was introduced by Hoare in 1962 [ADK16] and is considered to be one of the most efficient sorting algorithms [EW16].
Quicksort is a comparison-based sorting algorithm which follows the divide-and-conquer paradigm (like Mergesort). It is the most frequently used sorting algorithm, since it is very fast in practice and needs almost no additional memory (except for the recursion stack), without any assumptions on the distribution of the input [KS06].
The main part of Quicksort is the partitioning procedure, which is explained in Section 2.5.1. This is why we consider only partitioning algorithms when trying to improve the runtime of Quicksort in this work.

2.5.1 Classical Quicksort

Classical Quicksort sorts arrays in three⁵ phases (the second one is the most important part of the algorithm):

• Triviality test: if the input size n is larger than a predefined threshold n0, the algorithm starts a partitioning phase. Otherwise it sorts the input using Insertionsort.

• Partitioning: choose an arbitrary pivot element, which is used to classify the elements of the array into two classes (smaller and larger). The array is rearranged such that all elements smaller than the pivot are moved to the left side of the array and all elements larger than the pivot are moved to the right side.

• Recursive step: apply Quicksort to the left and right sides recursively.

In the partitioning phase the array is rearranged into two sub-arrays A[l . . . p − 1] and A[p + 1 . . . r], where all elements of A[l . . . p − 1] are smaller than A[p] and all elements of A[p + 1 . . . r] are larger than or equal to A[p]. Quicksort then calls itself recursively for both sub-arrays A[l . . . p − 1] and A[p + 1 . . . r] (Figure 2.10).
Although its average number of comparisons is not optimal - 1.38n log n + O(n) vs. n log n + O(n) for Mergesort - its overall instruction count is very low, and by choosing the pivot element as the median of some larger sample it behaves even better.
Figure 2.11 shows Quicksort using the Hoare partitioning method and a median-of-three pivot. The algorithm switches to Insertionsort if the input size is ≤ 2. The first and last snapshots are taken before and after sorting; the other 7 snapshots (in between) are a sample of 64 snapshots taken while quicksorting an array of 100 elements.

⁵ Divide-and-conquer actually has one more phase, the combining phase. However, there is no need for this phase in Quicksort.

Algorithm 1: Simple classical Quicksort algorithm.

1 Quicksort (A[l . . . r])
2   if (r − l + 1) > n0 then
3     p ← Partition (A[l . . . r])
4     Quicksort (A[l . . . p − 1])
5     Quicksort (A[p + 1 . . . r])
6   else Insertionsort (A[l . . . r])

[Figure: an array A[l . . . r] with pivot v is rearranged so that elements < v precede position p and elements ≥ v follow it, with v at position p.]
Figure 2.10: Array partitioning using a single pivot.

Hoare partitioning uses two pointers. One of them scans the input from the left side and stops when it finds an element larger than the pivot. The second one scans the input from the right side and stops when a smaller element has been found. After the scanning phase, the algorithm exchanges the elements at which the pointers stopped, provided the pointers have not crossed yet, and returns to the scanning phase; otherwise the algorithm is done.
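A minimal sketch of this scheme (our own rendering; the thesis's version is Listing 7.4), with the pivot taken as A[lo] for simplicity:

// A minimal sketch of Hoare partitioning (cf. Listing 7.4).
// Returns j such that A[lo..j] <= pivot <= A[j+1..hi].
int hoare_partition_sketch(int *A, int lo, int hi) {
    int v = A[lo];                    // pivot (median-of-three in practice)
    int i = lo - 1, j = hi + 1;
    while (true) {
        do { i++; } while (A[i] < v); // scan from the left for an element >= v
        do { j--; } while (A[j] > v); // scan from the right for an element <= v
        if (i >= j) return j;         // pointers crossed: done
        int t = A[i];                 // exchange the two misplaced elements
        A[i] = A[j];
        A[j] = t;
    }
}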

2.5.2 New Quicksort

In 2009, Yaroslavskiy together with Bentley and Bloch improved Quicksort by using two pivots v1 and v2, with v1 = A[p1] ≤ A[p2] = v2, instead of a single one, partitioning the input into 3 parts (Figure 2.12). The first part contains the elements smaller than or equal to A[p1], the second contains the elements between A[p1] and A[p2], and the third contains the elements larger than or equal to A[p2].

Algorithm 2: Simple Dual-Pivot Quicksort algorithm.

1 Quicksort (A[l . . . r])
2   if n > n0 then
3     (p1 , p2 ) ← Partition (A[l . . . r])
4     Quicksort (A[l . . . p1 − 1])
5     Quicksort (A[p1 + 1 . . . p2 − 1])
6     Quicksort (A[p2 + 1 . . . r])
7   else Insertionsort (A[l . . . r])

YBB Quicksort became the new standard Quicksort algorithm in Oracle's Java 7 runtime library.

[Figure: nine scatter-plot snapshots of A[k] against k for an array of 100 elements, taken before, during (7 partitioning snapshots) and after sorting.]
Figure 2.11: Quicksort using Hoare partitioning method and median-of-three pivot example.

[Figure: an array with pivots v1 and v2 is rearranged into three parts: elements < v1, elements between v1 and v2, and elements > v2, with the pivots at positions p1 and p2.]
Figure 2.12: Array partitioning using dual pivots according to YBB.



It makes about 1.9n ln n + O(n) comparisons on average, in contrast to the 2n ln n + O(n) of standard Quicksort and the (32/15)n ln n + O(n) of Sedgewick's dual-pivot algorithm; and although the number of swaps in YBB Quicksort, about 0.6n ln n + O(n), is much larger than the 0.33n ln n + O(n) swap operations of classical Quicksort, it is still faster [Wil16].

2.5.3 Approaches in previous work

2.5.3.1 Katajainen approach
In 2012, Katajainen showed in a talk (updated in 2014) [Kat12] that Quicksort can be made faster than Mergesort, and the secret behind that is avoiding unpredictable branches as far as possible.
Elmasry and Katajainen make their procedures lean using the following steps (the first two were already mentioned in [EKS12]; see the sketch after this list):
1. Storing the result of a comparison in a boolean variable and using this value in normal integer arithmetic.
2. Moving the data from one place to another conditionally.
3. Making sure that the loop has a fixed number of iterations.
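A minimal sketch of these steps for a Lomuto-style single-pivot partition (our own illustration; Katajainen's actual lean Lomuto partitioner is Listing 7.5):

// A minimal sketch of steps 1-3 (cf. Listing 7.5). The comparison result
// drives index arithmetic instead of a conditional jump.
void lean_lomuto_sketch(int *A, int n, int v) {
    int store = 0;                 // boundary: A[0..store-1] < v
    for (int k = 0; k < n; k++) {  // step 3: fixed number of iterations
        int x = A[k];
        bool smaller = (x < v);    // step 1: comparison result as a boolean
        A[k] = A[store];           // step 2: unconditional data movement
        A[store] = x;
        store += smaller;          // boundary advances only if x < v
    }
}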

2.5.3.2 Block approach

Edelkamp and Weiß [EW16] presented the BlockQuicksort approach. They introduced two buffers for storing pointers to elements on the left and the right side of the input that should be swapped (the left buffer stores pointers to elements on the left side of the array which are greater than or equal to the pivot element; likewise the other buffer for the right side).
The algorithm behaves very similarly to classical Hoare partitioning, with one critical modification: instead of stopping at the first element that should be swapped, only a pointer to the element is stored in the respective buffer, and the scan pointer continues moving towards the middle.
After scanning a whole block of elements, the algorithm enters a new phase: it starts at the first positions of the two buffers and swaps elements until one of the buffers contains no more pointers to elements to swap.
The algorithm continues this way until fewer elements than two block sizes remain; the simplest variant then switches to the usual Hoare partitioning for the remaining elements.
In [EW16] it is proved that for a block size B with median-of-three pivoting, BlockQuicksort induces less than (6/B) n log n + O(n) branch mispredictions on average (for extensive experimental results, see [EW16]).
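A compact sketch of one scanning/rearrangement round (our own simplification with an assumed block size; the thesis's block partitioner is Listing 7.9):

// A compact sketch of one BlockQuicksort round (our simplification; cf. Listing 7.9).
// L points at the leftmost unscanned element, R at the rightmost; B is an
// assumed block size. Offsets of misplaced elements are buffered, not swapped at once.
const int B = 128;

void block_round_sketch(int *L, int *R, int v) {
    int offsetsL[B], offsetsR[B];
    int numL = 0, numR = 0;
    for (int i = 0; i < B; i++) {       // scanning phase: no unpredictable branch
        offsetsL[numL] = i; numL += (L[i] >= v);  // left element belongs right
        offsetsR[numR] = i; numR += (R[-i] <= v); // right element belongs left
    }
    int m = (numL < numR ? numL : numR);
    for (int i = 0; i < m; i++) {       // rearrangement phase: swap buffered pairs
        int t = L[offsetsL[i]];
        L[offsetsL[i]] = R[-offsetsR[i]];
        R[-offsetsR[i]] = t;
    }
    // The |numL - numR| leftover offsets carry over to the next round.
}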

2.5.3.3 Seidel approach (private communication)

In a private communication in 2017, Seidel shared with us his new idea for avoiding unpredictable branch misses altogether in single-pivot partitioning algorithms. The idea relies strongly on blind swaps (discussed in Section 4.5). This idea won all races between Quicksort variants in the experiments in this work.

Chapter 3

Branch categorization

Elmasry and Katajainen [EK12] mentioned that some branches, called easy-to-predict branches, are friendly and need not be eliminated, while others, called unpredictable branches, should be eliminated as long as no extra complications are introduced.
They thus defined two classes of conditional branches: easy-to-predict branches, which cause O(1) mispredictions - such branches are usually seen in loops, which mispredict only in the last iteration - and which need no special attention; and unpredictable (or hard-to-predict, as defined in [Mor01]) branches, which have a variable number of mispredictions and lead to worse behaviour.
“Hard-to-predict [branches] are the primary source for branch misses. Branch misses are very expensive, costing up to 15 clock cycles on Pentium Pro architecture.” [Mor01]
The golden part of the statement of Elmasry and Katajainen above is that unpredictable branches should be eliminated as long as no extra complications are introduced.

3.1 Branch categories

As the reader gathers from the statement of Elmasry and Katajainen, removing an unpredictable branch does not necessarily improve the run time, because the benefit gained by eliminating the branch may not be better than the processor's prediction (for that particular branch).
To understand the situation, let us put the problem in a mathematical setting. Let P := (IF condition THEN A ELSE B END) be a program that contains a branch and induces m(n) mispredictions on average for n executions (in other words, n is the input size). Then we call ρ = lim_{n→∞} m(n)/n the misprediction ratio.

Definition 3.1 (Negligible Branch)
A negligible branch is a branch with misprediction ratio ρ = 0.

Definition 3.2 (Critical Branch)
A critical branch is a branch with misprediction ratio ρ ≠ 0.

3.2 Real example

Let us consider an example to understand the branch categorization. Assume a = (a1, a2, . . . , an) ∈ ℕⁿ is a sequence of n natural numbers with ai ≠ aj for all i ≠ j. Let us define the following two programs (the first program E computes the number of even values in a, the second one M searches for the maximum in a):

E := (c := 0; FOR k := 1..n DO IF ak is even THEN c := c + 1; END)

M := (max := a1 ; FOR k := 1..n DO IF max < ak THEN max := ak ; END)
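Rendered in C++ (our own illustration), the two loops look as follows:

// Programs E and M in C++ (our illustration). The branch in count_even is
// critical (its condition holds with probability 1/2 in each iteration), while
// the branch in maximum is negligible (it holds with probability 1/k).
int count_even(const int *a, int n) {       // program E
    int c = 0;
    for (int k = 0; k < n; k++)
        if (a[k] % 2 == 0) c++;             // critical branch
    return c;
}

int maximum(const int *a, int n) {          // program M
    int max = a[0];
    for (int k = 1; k < n; k++)
        if (max < a[k]) max = a[k];         // negligible branch
    return max;
}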

First of all, we determine the probability that each branch in E and M is taken.
A left-to-right maximum in a sequence is an element which is larger than all preceding elements [MS08].
We have Pr(ak is even) = 1/2 and Pr(ak is a left-to-right maximum) = 1/k. In other words, c is incremented in every iteration with probability 1/2, and max is updated in the k-th iteration with probability 1/k.

Predictor                   E     M
always taken                n/2   n − Hn
always not-taken            n/2   Hn
1-bit                       n/2   2(Hn − ζ(2))
2-bit-saturate-counter      n/2   Θ(Hn)
2-bit-flip-on-consecutive   n/2   Θ(Hn)

Table 3.1: The expected numbers of branch misses (Section 2.3.1).

According to the 2-bit-saturate-counter predictor, the misprediction ratio of the first program is ρE = lim_{n→∞} (n/2)/n = 1/2, and that of the second is ρM = lim_{n→∞} Hn/n = lim_{n→∞} (log n)/n = 0. So program E has a critical branch and program M has a negligible one. Figure 3.1 shows the huge difference between the expected numbers of branch misses of programs E and M.

Summary. Some types of branches are harmful and induce a large number of mispredictions, because such branches are hard to predict. As Elmasry and Katajainen [EK12] mentioned, unpredictable branches should be eliminated as long as no extra complications are introduced.
We made the same observation when we tried to make the naive-MinMax algorithm (Listing 4.9) lean, which did not lead to better behaviour. One explanation is that the benefit of avoiding branch mispredictions could not cover the loss in other performance aspects (e.g. memory I/O).

[Figure: expected number of branch misses as a function of the input size n; the curve for E grows linearly while the curve for M stays nearly flat.]
Figure 3.1: The expected number of branch misses according to the 2-bit-saturate-counter predictor for programs E and M.

[Figure: branches split into predictable (O(1) misses) and unpredictable; unpredictable branches split further into negligible and critical.]
Figure 3.2: Branch categorization.



Chapter 4

Eliminating branch misses

In the course of this work various strategies were applied in order to eliminate branch misses in Quicksort algorithms. Moreover, we show how branch mispredictions affect other algorithms, like simultaneous maximum and minimum and exponentiation by squaring.
Before presenting the techniques for eliminating branch misses, we introduce three simple procedures which are used by the partitioning algorithms in this work: the swap and rotation procedures (Listings 4.1, 4.2 and 4.3).

Listing 4.1: Swap two elements.

void SWAP(int &a, int &b) {
    int t = a;
    a = b;
    b = t;
}

Listing 4.2: Three elements rotation.

void ROTATE(int &a, int &b, int &c) {
    int t = a;
    a = b;
    b = c;
    c = t;
}

Listing 4.3: Four elements rotation.

void ROTATE(int &a, int &b, int &c, int &d) {
    int t = a;
    a = b;
    b = c;
    c = d;
    d = t;
}

4.1 Conditional instruction

In [AGK13], on low-level code optimization, Antyipin, Gobi and Kozsik give the following description: “In assembly language, conditional instructions are usually written in the form OPCODEcc, where OPCODE is a conditional instruction itself, and cc (called condition code) is one of the predefined conditions over the state of the status flags. If the actual state of the status flags satisfies this predefined condition, the operation described by the conditional instruction is performed, otherwise no action is taken.”

The CMOVcc⁶ instruction was introduced in the P6 processor family (Intel Pentium II):

CMOVcc source, destination

Here source can be either a register or an in-memory variable, destination is a register, and cc is the condition code.
Edelkamp and Weiß noted in [EW16] that there are at least two methods by which conditional jumps can be avoided, both supported by the hardware of modern processors. The first one is the same observation as in [AGK13]; the second has been used in [EKS12]:

1. Conditional moves (CMOVcc instructions on x86 processors) or, more generally, conditional execution. In C++ compilation a conditional move can (often) be triggered by i = (x < y) ? j : i;.

2. Casting boolean variables to integers (SETcc instructions on x86 processors). In C++: int i = (x < y);

The same methods were suggested in [Cor16] in order to optimize the prediction of conditional branches: arrange the code to make basic blocks contiguous, unroll loops, and use the CMOVcc and SETcc instructions. Consider the following line of C, whose result is one of two constants depending on a condition:

x = (a < b) ? const1 : const2;

We consider three possible ways to compile this line of C. Listing 4.4 comes from [Cor16] and shows a bad compilation with an unpredictable branch; Listing 4.5 uses a SETcc instruction to avoid the branch, and Listing 4.6 uses a CMOVcc instruction.

Listing 4.4: Assembly code with an unpredictable branch.

      cmp a, b          ; Condition
      jbe L30           ; Conditional branch
      mov ebx, const1   ; ebx holds x
      jmp L31           ; Unconditional branch
L30:  mov ebx, const2
L31:

Listing 4.5: Listing 4.4 optimized by the SETcc instruction.

      xor ebx, ebx      ; Clear ebx (x in the C code)
      cmp a, b
      setge bl          ; ebx = 0 or 1 (using the complement condition)
      sub ebx, 1        ; ebx = 11...11 or 00...00
      and ebx, const3   ; const3 = const1 - const2
      add ebx, const2   ; ebx = const1 or const2

Listing 4.6: Listing 4.4 optimized by the CMOVcc instruction.

      mov ebx, const2   ; ebx holds x
      cmp a, b          ; Condition
      cmovl ebx, const1 ; move happens if condition (a < b) is satisfied

⁶ See [ALSU06], page 718.

We demonstrate this approach with three examples: Quicksort, naive-MinMax and 3/2-MinMax.

Quicksort. In order to improve the YBB partitioning (Listing 7.10) as given in [AD16], we designed a lean version (Listing 4.7).
Consider this lean symmetric dual-pivot partitioning procedure of Quicksort. The procedure has three pointers l, k and g, where all elements in A[1 . . . l − 1] are smaller than the smaller pivot v1 and the elements in A[g + 1 . . . n − 2] are larger than the larger pivot v2. The third pointer k always points to the next element to be classified, and all elements in A[l . . . k − 1] are greater than or equal to the smaller pivot and less than or equal to the larger one.
The procedure is split into two steps. The first step (Figure 4.1) computes the position alpha to which A[k] should be moved, using a conditional move CMOVcc.
In the second step the pointers are moved to their new positions using simple addition/subtraction operations based on the values (b0, b1, b2) determined by casting the conditional expressions to a numerical type using the SETcc instruction, where (b0, b1, b2) ← (v2 < x, x < v1, v2 ≥ x).
For example, if the new element x at position k is neither smaller than v1 nor larger than v2, alpha is set to l in the first step. Since (b0, b1, b2) equals (0, 0, 1), the pointers are modified as follows: (g, k, l) ← (g − b0, k + b2, l + b1), i.e. only k advances.
The same idea was applied to implement the lean symmetric triple-pivot partitioning (Listing 7.13).

[Figure: the element x at position k is swapped with position g if v2 < x, and with position l otherwise.]
Figure 4.1: First step of lean symmetric dual-pivot partitioning (Listing 4.7).

Listing 4.7: Lean Symmetric Dual-Pivot Partitioning.

void symm_dl(int *left, int *right, int *&pos_p1, int *&pos_p2) {
    int *l = left + 1;
    int *g = right - 1;
    int *k = l;

    int v1 = *left, v2 = *right;

    for (unsigned long counter = 0; counter < right - left - 1; counter++) {
        int x = *k;
        bool b0 = (v2 < x);
        bool b2 = (v2 >= x); // !b0;
        bool b1 = (x < v1);

        // Step 1
        int *alpha = (v2 < x ? g : l);
        SWAP(*k, *alpha);

        // Step 2
        g -= b0;
        k += b2;
        l += b1;
    }

    SWAP(*left, *--l);
    SWAP(*right, *++g);
    pos_p1 = l;
    pos_p2 = g;
}

Listing 4.8: The for-loop of Listing 4.7 transformed to assembly code using g++.

L5:                          ; Step 1
    movl (%edx), %eax        ; x = *k
    movl %esi, %ecx          ; alpha = g
    cmpl %eax, %edi          ; x <= p2 ?
    cmovge %ebx, %ecx        ; alpha = l, if x <= p2

    ; SWAP(*k, *alpha)
    movl (%ecx), %ebp
    movl %ebp, (%edx)
    movl %eax, (%ecx)

    ; Step 2
    ; [*] integer pointer += 1 ~ +4 bytes
    setl %cl
    movzbl %cl, %ecx
    sall $2, %ecx            ; [*] ecx = 4*(p2 < x)
    subl %ecx, %esi          ; g -= ecx

    xorl %ecx, %ecx
    cmpl %eax, 4(%esp)
    setg %cl                 ; ecx = (x < p1)
    cmpl %eax, %edi
    setge %al                ; eax = (x <= p2)
    addl $1, (%esp)
    leal (%ebx,%ecx,4), %ebx ; [*] l += 4*ecx
    movzbl %al, %eax
    movl 8(%esp), %ecx
    leal (%edx,%eax,4), %edx ; [*] k += 4*eax
    movl (%esp), %eax
    cmpl %ecx, %eax
    jne L5

Simultaneous naive Maximum and Minimum. In Section 2.3.2.1 we mentioned this problem without giving any solution. Now it is time to analyze both algorithms, naive-MinMax and 3/2-MinMax.
naive-MinMax has two conditional branches, as can be seen in the transformed code (Listing 4.10) at line 4 and line 10. These two branches induce Θ(Hn − ζ(2)) = Θ(log n) branch misses according to the 2-bit-saturate-counter predictor, which is a very small number.

Listing 4.9: Naive-MinMax [ANP16].

void naive_minmax(int *A, unsigned long int N, int &min, int &max) {
    max = - 1; min = MAX + 1; // MAX = 10000000

    for (unsigned long int i = 0; i < N; i++) {
        int t = A[i];
        if (t < min) min = t;
        if (t > max) max = t;
    }
}

Listing 4.10: The main loop of Listing 4.9 transformed to assembly code using the g++ compiler.
1 L7:
2 movl (%eax), %edx
3 cmpl (%ebx), %edx
4 jge L3
5 ; min = A[i]
6 movl %edx, (%ebx)
7 movl (%eax), %edx
8 L3:
9 cmpl %edx, (%ecx)
10 jge L4
11 ; max = A[i]
12 movl %edx, (%ecx)
13 L4:
14 addl $4, %eax
15 cmpl %eax, %esi
16 jne L7

We tried to optimize the naive approach by using conditional instructions instead of conditional branches. This modification guarantees fully lean processing. The conditional branches in Listing 4.10 are replaced with conditional move instructions CMOVcc in Listing 4.12 at line 5 and line 8. However, although this looks promising - the entire body of the main loop in the lean approach has no conditional branches - it does not improve the performance. The reason is that the number of instructions in the lean approach (Listing 4.12) is slightly larger than in the naive approach.

Listing 4.11: Lean Naive-MinMax.

void naive_minmax_l(int *A, unsigned long int N, int &min, int &max) {
    max = - 1; min = 1 + MAX; // MAX = 10000000
    for (unsigned long int i = 0; i < N; i++) {
        int t = A[i];
        min = (t < min ? t : min);
        max = (max < t ? t : max);
    }
}

Listing 4.12: The main loop of Listing 4.11 transformed to assembly code using the g++ compiler.
1 L14:
2 movl (%edx), %eax
3 cmpl %eax, (%ebx)
4 movl %eax, %esi
5 cmovle (%ebx), %esi
6 movl %esi, (%ebx)
7 cmpl %eax, (%ecx)
8 cmovge (%ecx), %eax
9 addl $4, %edx
10 cmpl %edx, %edi
11 movl %eax, (%ecx)
12 jne L14

If we assume a simple pipelined processor with 5 stages, then the naive approach makes no mispredictions, since its branches are negligible, while the lean approach spends 3 ticks more per iteration (Figure 4.3) than the naive one (Figure 4.2).

[Figure: pipeline diagram of the 7-instruction loop body of naive-MinMax.]
Figure 4.2: Pipelining the loop body of naive-MinMax assuming no misprediction happens.

[Figure: pipeline diagram of the 10-instruction loop body of lean naive-MinMax.]
Figure 4.3: Pipelining the loop body of lean naive-MinMax.



Simultaneous 3/2-Maximum and Minimum. The huge expected number of mispredictions in 3/2-MinMax, which is introduced in [ANP16], is caused by the conditional branch if (A[i] < A[i + 1]): the probability of a misprediction at the i-th iteration is 1/2, so the expected number of mispredictions is about Σ_{i=1}^{n/2} 1/2 = n/4 (cf. the left-to-right maximum lemma in [MS08]).
According to the branch categorization in Section 3.1, we have three branches in 3/2-MinMax. One of them is critical and must be eliminated; the remaining ones are negligible. The critical one actually has only one task: determining the smaller element of A[i] and A[i + 1], so that the smaller one can be compared with the current minimum and the other one with the current maximum.
Our approach is split into two steps. The first one is to find out which of A[i] and A[i + 1] is the smaller/larger one. In the second step we compare the smaller one with the current minimum and the larger one with the current maximum.
At the end of the first step the smaller element should be in t1 and the larger in t2. We start by assuming that A[i] is smaller than A[i + 1], storing A[i] in t1 and A[i + 1] in t2.
If our initial assumption was right, step 2 can start directly. However, the assumption may be false; then t1 and t2 must be swapped before step 2 starts. In order to perform this without branching, we use a temporary variable t0 which contains t2 − t1 if t2 is smaller than t1, and zero otherwise, computed with a conditional move CMOVcc (Table 4.1). After adding t0 to t1 and subtracting t0 from t2, we can be sure that t1 contains the smaller element and t2 the larger one, and step 2 can start.
Listing 4.14 shows the main loop of our optimized 3/2-MinMax transformed to assembly code. Step 1 is done branchless using 7 simple instructions.
However, this idea can only be applied to integers. With floating-point numbers, the arithmetic operations involved may introduce rounding and thus lead to wrong results.

  t0                t1     t2         t1 + t0    t2 − t0
  0                 A[i]   A[i + 1]   A[i]       A[i + 1]
  A[i + 1] − A[i]   A[i]   A[i + 1]   A[i + 1]   A[i]

Table 4.1: Determining the smaller/larger element in Listing 4.13.



Listing 4.13: Optimized 3/2 MinMax.


1 void opt_half3_minmax(int *A, unsigned long int N, int &min, int &max) {
2 max = - 1; min = MAX + 1; // MAX = 10000000
3
4 for (unsigned long int i = 0; i < N; i += 2) {
5 // Step 1
6 int t1 = A[i];
7 int t2 = A[i + 1];
8
9 int t0 = (t1 > t2 ? t2 - t1 : 0);
10
11 t1 += t0;
12 t2 -= t0;
13
14 // Step 2
15 if (t1 < min) min = t1;
16 if (max < t2) max = t2;
17 }
18 }

Listing 4.14: The main loop in listing 4.13 transformed to assembly code using the g++ compiler.
1 L43:
2 movl 4(%esi,%ecx,4), %eax ; t2 = A[i + 1]
3 movl (%esi,%ecx,4), %edx ; t1 = A[i]
4
5 ; Step 1
6 movl $0, %edi ; edi = 0 (zero source for the cmov below)
7 movl %eax, %ebx
8 subl %edx, %ebx ; ebx = (t2 - t1)
9 cmpl %eax, %edx ; compare t1 with t2
10 cmovle %edi, %ebx ; t0 = 0 if (t1 <= t2), else t0 = (t2 - t1)
11 addl %ebx, %edx ; t1 += t0
12 subl %ebx, %eax ; t2 -= t0
13
14
15 ; Step 2
16 cmpl 0(%ebp), %edx
17 jge L38 ; if (min <= t1) goto L38
18 movl %edx, 0(%ebp) ; min = t1
19 L38:
20 movl 32(%esp), %edi
21 cmpl (%edi), %eax
22 jle L39 ; if (t2 <= max) goto L39
23 movl %eax, (%edi) ; max = t2
24 L39:
25 addl $2, %ecx
26 cmpl %ecx, 24(%esp)
27 ja L43

4.2 Loop Unrolling


Loops are one of the basic control-flow constructs of computer programming; a loop consists of a set of instructions I which is executed repeatedly n ∈ ℕ times (instead of a direct implementation with |I|·n instructions). There are various loop structures, such as while-, do-while- and for-loops [Ker88].
Through loops, computers can perform particular tasks as many times as necessary while keeping the code size small. Besides recursion, this is the only way one can deal with large inputs. Listing 4.15 shows a simple loop and Listing 4.16 shows the same loop after unrolling. The example comes from [Cor16].

Listing 4.15: Loop before unrolling.
1 for (int i = 0; i < N; i++) {
2   if (i % 2) {
3     A[i] = X;
4   } else {
5     A[i] = Y;
6   }
7 }

Listing 4.16: Loop after unrolling.
1 for (int i = 0; i < N - 1; i += 2) {
2   A[i] = Y;
3   A[i + 1] = X;
4 }

Loop unrolling is a transformation technique used to optimize a loop in terms of execution time, and it is one of the tasks of the compiler (Section 7.2). The goal of loop unrolling is to increase execution speed by reducing the instructions that control the loop (removal of redundant loads, common-subexpression elimination, etc.) and by amortizing the branch overhead (see Listing 4.16).
However, these benefits do not come without disadvantages7. For example, unrolling a very large loop increases the code size, and unrolling loops which contain branches puts pressure on the BTB (Branch Target Buffer) [Cor16].
In the following we show an improvement attempt for Quicksort using the unrolling technique.

Quicksort. We consider the classical Quicksort with Hoare partitioning (Listing 7.4), which uses two pointers. The first one, k, scans the array from the left and stops when an element larger than the pivot has been found. The second one, g, scans from the right side and stops when it finds an element smaller than the pivot. Then the elements A[k] and A[g] are swapped if the pointers have not crossed yet; otherwise the algorithm is done.
Our approach scans six elements A[k . . . k + 2] and A[g − 2 . . . g] directly and represents the comparison outcomes by a flag pattern k_0 k_1 k_2 g_2 g_1 g_0, where k_i represents (A[k + i] < p) and g_i represents (A[g − i] > p). Each zero means the element should be swapped, and a one indicates that no swap is needed (the flag computation is sketched below).
So we have 2^6 = 64 cases to handle. However, the most important point in taking advantage of the unrolling strategy is to make sure in every iteration that each of those six elements is moved to a correct position and is never scanned again.
7
-funroll-loops makes code larger, and may or may not make it run faster; -funroll-all-loops usually makes programs run more slowly [Sta03].
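The flag computation at the top of each iteration is itself cheap and branchless; this is exactly the prologue of Listing 7.8, with p denoting the pivot:

  unsigned char flag = (*k < p);        // k0 (most significant bit)
  flag = (flag << 1) | (*(k + 1) < p);  // k1
  flag = (flag << 1) | (*(k + 2) < p);  // k2
  flag = (flag << 1) | (p < *(g - 2));  // g2
  flag = (flag << 1) | (p < *(g - 1));  // g1
  flag = (flag << 1) | (p < *g);        // g0 (least significant bit)
  switch (flag) { /* 2^6 = 64 cases, see Listing 7.8 */ }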

[Diagram: between l and r, the elements left of k are already ≤ v and the elements right of g are ≥ v; the state machine consumes the six boundary elements A[k . . . k + 2] and A[g − 2 . . . g] and advances k and g accordingly.]

Figure 4.4: Unrollsix partitioning

Definitely, the code size increases as we increase the unrolling level: in our approach, unrolling six elements means that 2^6 = 64 states must be handled (in general, unrolling m elements means 2^m states).
The state machine (Figure 4.4), which decides what to do with the current six elements, is implemented using switch-case control flow, since this is a form of multiway branching that takes advantage of a branch table/jump table instead of a tree of conditional branches (in other words, we avoid unpredictable branches)8.
Figure 4.5 represents state 7 (flag pattern [000 111]_2). It is handled by three swaps (A[k] ↔ A[g − 5]), (A[k + 1] ↔ A[g − 4]) and (A[k + 2] ↔ A[g − 3]). Then the pointers are updated as (k, g) ← (k, g − 6). Figures 4.6 and 4.7 explain states 11 and 47.

[Diagram: example arrays before and after handling state 7; the three elements at k, k + 1, k + 2 are swapped with those at g − 5, g − 4, g − 3, and g is decreased by 6.]

Figure 4.5: Unrollsix partitioning - State 7 = [000 111]_2

8
Use the flag --param case-values-threshold=8 (or any number < 64) to make sure that the switch-case is transformed into a branch table/jump table and not into a simple tree of conditional branches, which would lead to branch mispredictions.

[Diagram: example arrays before and after handling state 11.]

Figure 4.6: Unrollsix partitioning - State 11 = [001 011]_2

[Diagram: example arrays before and after handling state 47.]

Figure 4.7: Unrollsix partitioning - State 47 = [101 111]_2

4.3 Lookup table


A lookup table is a technique used to escape the harmful cost of branch mispredictions instead of using conditional branches directly: the outcome of the conditional expression is cast to a numerical type and then used as an index. For example, the following C line
if (x < p) y = a; else { if (x > p) y = c; else y = b; }
can be written as
T = {a, b, c}; y = T[1 - (x < p) + (x > p)];
However, this technique is delicate, because it depends on memory load instructions, which are expensive but may still be cheaper than the branch misprediction penalty. According to the branch classification in Chapter 3, lookup tables pay off only for very critical (hard-to-predict) branches and should not be used as a first solution for branch elimination.
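As a self-contained illustration, the transformation above can be packaged as a small function (the wrapper select3 and its name are ours, not part of the thesis code):

  int select3(int x, int p, int a, int b, int c) {
      int T[3] = {a, b, c};
      // (x < p) and (x > p) evaluate to 0 or 1, so the index is 0, 1 or 2
      return T[1 - (x < p) + (x > p)];
  }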
We will demonstrate two optimization attempts using lookup tables. The first one shows the best way of using this technique, and the second one clarifies how it can be a very harmful solution.

Exponentiation by Squaring. Auger, Nicaud and Pivoteau introduced the Guided Power algorithm (Listing 7.1) in [ANP16]. Their solution reduces the number of branch mispredictions to 0.45·log_2 n, while it is 0.5·log_2 n for Classical Power and Unrolled Power, according to the 2-bit-saturating-counter predictor.
Dietzfelbinger suggested a solution (Listing 4.17) based on the lookup-table technique in order to eliminate branch misses altogether: we replace all branches in the main loop of the Guided Power algorithm by a lookup table (Figure 4.8).

Listing 4.17: Lookup Table Guided Power (Dietzfelbinger, Section 4.3).


1 double guided_power_lt(double x, int n) {
2 double r = 1;
3 double a[4] = {1, x, 0, 0};
4
5 while (n > 0) {
6 a[2] = mult(a[1], a[1]); // or: square(a[1])
7 a[3] = mult(a[2], a[1]);
8 r = mult(r, a[n & 3]);
9 a[1] = mult(a[3], a[1]);
10 n >>= 2;
11 }
12
13 return r;
14 }

The lookup table is a = (1, x, t, xt) with t = x^2 (indices 0 to 3); in every iteration, r is updated according to the two lowest bits n_1 n_0 of n:

  r ← r · a[n_1 n_0] =  r    if n_1 n_0 = 00
                        rx   if n_1 n_0 = 01
                        rt   if n_1 n_0 = 10
                        rxt  if n_1 n_0 = 11

Figure 4.8: Lookup table guided power (introduced by Dietzfelbinger, Listing 4.17)

Simultaneous Maximum and Minimum. Indeed, both branches in naive-MinMax (Listing 4.9) are negligible and do not need to be eliminated (there is no better solution in terms of branch mispredictions).
However, we want to show a situation in which lookup tables do not help. In Section 5.2, the experiments show that the lookup-table approach takes 3 times as long as the naive approach.
This improvement attempt (Listing 4.18) does not succeed because the conditional branches in naive-MinMax (which check whether A[i] is a left-to-right minimum/maximum) are negligible and induce a very small number of branch misses. Trying to eliminate such negligible branches with this dangerous technique leads to catastrophic behavior.

Listing 4.18: Lookup Table Naive MinMax.


1 void naive_minmax_lt(int *A, unsigned long int N, int &min, int &max) {
2 int T[ ] = {MAX + 1, 0, -1}; // MAX = 10000000
3
4 for (unsigned long int i = 0; i < N; i++) {
5 int t = A[i];
6 bool b0 = (t < T[0]), b1 = (t > T[2]);
7 T[1 - b0] = t;
8 T[1 + b1] = t;
9 }
10
11 min = T[0]; max = T[2];
12 }

The table T = (T_0, T_1, T_2) stores the current minimum in T_0 and the current maximum in T_2 (T_1 is a dummy slot); for each element a_k, the update is performed branchlessly:

  b_0 ← (a_k < T_0),  b_1 ← (T_2 < a_k)
  T[1 − b_0] ← a_k,   T[1 + b_1] ← a_k

Figure 4.9: Lookup table MinMax (Listing 4.18).

4.4 Pseudo conditional branch


Some conditional branches can be very critical and cause harmful misprediction penalties (Chapter 3). Such conditional branches can be avoided by pseudo conditional branches, instead of spending most of the time suffering from branch mispredictions.
The idea is nothing more than using the outcome of the conditional expression as a simple number (zero/one) and performing the same work a branch does through arithmetic operations. This technique works only with integers, because the arithmetic operations involved can lead to wrong results with floating-point numbers, e.g., through rounding errors.
To understand the transformation process, let us take the binary selection problem:

  x ← s_1 if α = 1,   x ← s_2 if α = 0.

Such a conditional branch can be transformed into the following pseudo conditional form (Figure 4.10):

  x ← s_2 + α(s_1 − s_2)
Conditional swaps (Listing 4.19) may also induce a large number of branch misses if the condition for the swap is critical (hard to predict). Listing 4.20 shows a lean swap variant using a pseudo conditional branch (Figure 4.11).

Listing 4.19: Conditional swap.
1 void cswap(bool c, int &a, int &b) {
2   if (c) {
3     int t = a;
4     a = b;
5     b = t;
6   }
7 }

Listing 4.20: Pseudo conditional swap.
1 void pcswap(int c, int &a, int &b) {
2   int t = a;
3   a = a + c*(b - a);
4   b = b + c*(t - b);
5 }

The following two examples show how to apply this technique to Quicksort and Guided Power.

[Diagram: on the left, the branching version assigns x ← s_1 if α holds and x ← s_2 otherwise; on the right, the branchless version computes x ← s_2 + α(s_1 − s_2) directly.]

Figure 4.10: Variants of selection (Left: conditional select, Right: pseudo conditional select).

[Diagram: on the left, the conditional swap executes t ← a; a ← b; b ← t only if c holds; on the right, the pseudo conditional swap always executes t ← a; a ← a + c(b − a); b ← b + c(t − b).]

Figure 4.11: Variants of conditional swap (Left: conditional swap, Right: pseudo conditional swap).

Quicksort. We consider a special version of Lomuto partitioning, namely lean Lomuto. In a few lines: Lomuto is a partitioning algorithm introduced by Nico Lomuto which works as follows. It maintains two pointers, say j and i. The first one scans the input array left-to-right; once it finds an element smaller than the pivot, the algorithm swaps this element with A[i + 1] and moves the second pointer i one position forward, so that all elements A[l . . . i] are smaller than the pivot.
Katajainen [Kat14] used this technique in his lean version of the Lomuto partitioning routine (Listing 7.5). Unfortunately, it was not that effective on our machine, because of the multiplication instruction (Listings 4.21 and 4.22), which is more expensive than CMOVcc.
In Section 2.2 we discussed (under the very simple assumption that processors are 5-stage pipelined) the costs of assembly instructions. Conditional instructions (not conditional jumps/branches) such as CMOVcc and SETcc are really simple. On the other hand, multiplication instructions can cost 1-7 times as much as CMOVcc/SETcc. This is why we implemented the routine again (Listing 7.6), but using CMOVcc (Listings 4.23 and 4.24).

Listing 4.21: The main loop of lean Lomuto partitioning (Listing 7.5).
1 int *lomuto_sl(..) {
2
3   // ..
4
5   while (q <= r) {
6     x = *q;
7     smaller = (x < v);
8     p += smaller;
9     delta = smaller * (q - p);
10
11    // ..
12  }
13
14  // ..
15 }

Listing 4.22: Listing 4.21 transformed to assembly code using the g++ compiler.
1 L3:
2   movl (%ebx), %ecx        ; x = *q
3
4   xorl %edx, %edx
5   cmpl %ecx, %edi
6   setg %dl                 ; smaller = (x < v)
7
8   ; [*] integer pointer += 1 ~ +4 Bytes
9   movl %edx, %esi
10  movl %ebx, %edx
11  leal (%eax,%esi,4), %eax ; [*] p += 4*smaller
12
13  subl %eax, %edx          ; q - p (in bytes)
14  movl (%eax), %ebp        ; *p
15  sarl $2, %edx            ; [*] (q - p)/4: bytes -> elements
16  imull %esi, %edx         ; [*] delta = smaller*(q - p)
17
18  ; ..
19
20  jnb L3

Listing 4.23: The main loop of lean Lomuto partitioning using CMOVcc (Listing 7.6).
1 int *lomuto_sl_cmov(..) {
2
3   // ..
4
5   while (q <= r) {
6     x = *q;
7     smaller = (x < v);
8     p += smaller;
9     delta = (smaller ? q - p : 0);
10
11    // ..
12  }
13
14  // ..
15 }

Listing 4.24: Listing 4.23 transformed to assembly code using the g++ compiler.
1 L3:
2   movl (%ecx), %ebx        ; x = *q
3
4   xorl %edx, %edx
5   movl $0, %ebp            ; delta = 0
6   cmpl %ebx, %esi
7   setg %dl                 ; smaller = (x < v)
8
9   ; [*] integer pointer += 1 ~ +4 Bytes
10  leal (%eax,%edx,4), %eax ; [*] p += 4*smaller
11  movl %ecx, %edx          ; edx = q
12  movl (%eax), %edi        ; edi = *p
13
14  subl %eax, %edx          ; (q - p)
15  cmpl %ebx, %esi          ; x < v
16  cmovg %edx, %ebp         ; delta = (q - p) if (x < v)
17
18  ; ..
19
20  jnb L3

Exponentiation by Squaring. We have shown an optimized variant of the Guided Power algorithm using a lookup table in Section 4.3.
In this section we discuss another one, which is optimized by pseudo conditional branches (Figure 4.12).
The difference between the lookup-table approach and this one is that no memory load instructions are used here; we use only arithmetic instructions.
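Concretely, the two branchless updates of the main loop in Listing 7.2 read as follows (mult is the multiplication wrapper used in [ANP16]):

  r = mult(r, 1 + mult(n & 1, x - 1));        // r <- r*x exactly when bit n0 is set
  r = mult(r, 1 + mult((n & 2) >> 1, t - 1)); // r <- r*t exactly when bit n1 is set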

With n = . . . n_5 n_4 n_3 n_2 n_1 n_0 in binary, the guided approach (left) tests the two lowest bits with conditional branches, while the PCB variant (right) performs the equivalent branchless updates:

  Guided:  if (n_1 n_0) { if (n_0) r ← rx;  if (n_1) r ← rt }
  PCB:     r ← n_0 · r(x − 1) + r;   r ← n_1 · r(t − 1) + r

Figure 4.12: Variants of Guided Power (Left: Guided Power (Listing 7.1) [ANP16], Right: pseudo conditional branch Guided Power (Listing 7.2)).

4.5 Safety blind swaps


As Kaligosi and Sanders mention in [KS06], exact-median Quicksort executes far fewer instructions than skewed-pivot Quicksort but is still slower, due to branch misses (Figure 1.1). So in some cases it does not matter how many instructions the processor executes; branch mispredictions are more important.

Quicksort. The idea of Seidel is very simple (Figure 4.13). The algorithm (Listing 7.7) has two pointers k and g, where all elements in A[l . . . k] are smaller than the pivot and all elements in A[k + 1 . . . g − 1] are larger than or equal to the pivot. At the same time, g points to the current element, which is to be classified. We swap the elements A[k + 1] and A[g] without any condition (we call this a blind swap), then we move the pointer g one position forward. After this blind swap, the new element to be classified is exactly at position k + 1. So if the element A[k + 1] is smaller than the pivot, the pointer k moves one position forward; otherwise it stays. This step is done branchless by casting the outcome of (A[k + 1] < p) to a numerical type, which is then added to the pointer k as k ← k + (A[k + 1] < p). The algorithm stops once the pointer g crosses the rightmost position of the input array.
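The resulting main loop is tiny; the following is essentially the loop of Listing 7.7:

  for (; g <= right; g++) {
      SWAP(*g, *(k + 1));   // blind swap, performed unconditionally
      k += (*(k + 1) < p);  // branchless classification of the element now at k + 1
  }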
We applied this concept to dual- and triple-pivot partitioning algorithms (Listings 7.11 and 7.14) as variants of Always-Large-First partitioning algorithms.
Through this technique, we reduced the running time of the Hoare algorithm by more than half.

[Diagram, Step 1: the current element β = A[g] is blindly swapped with α = A[k + 1], the first element of the ≥ v region, and g is advanced. Step 2: β, now at position k + 1, is classified; if β < v the pointer k advances, otherwise it stays.]

Figure 4.13: Blind Single-Pivot partitioning.



[Trace: successive array states during blind single-pivot partitioning of 9 elements with pivot 5; the markers for k and g are omitted.]

5 8 0 7 3 6 3 8 1
5 8 0 7 3 6 3 8 1
5 0 8 7 3 6 3 8 1
5 0 7 8 3 6 3 8 1
5 0 3 8 7 6 3 8 1
5 0 3 6 7 8 3 8 1
5 0 3 3 7 8 6 8 1
5 0 3 3 8 8 6 7 1
5 0 3 3 1 8 6 7 8
1 0 3 3 5 8 6 7 8

Figure 4.14: Blind Single-Pivot partitioning example for 9 elements.



4.6 Static fragmentation


Normally, algorithms perform particular tasks after certain predefined events. For example, when a new left-to-right maximum (minimum) is found, it replaces the current global maximum (minimum); or in single-pivot Hoare partitioning, scanning of the left (right) side stops when a larger-or-equal (smaller-or-equal) element has been found, the two elements are swapped, and the scanning phase resumes.
What about splitting the input into two blocks of equal size, A[k . . . k + B − 1] and A[g − B + 1 . . . g]? Each block then has its own history (the events that occurred while scanning the elements of the block).

In our block partitioning algorithm (Listing 7.9)9, we store the indices of the elements of the left side which are not smaller than the pivot in a buffer Bk, and we do the same for the right side of the input in a buffer Bg (Figure 4.15; a sketch of the scanning phase follows below).
This leads to three cases. First case: Bk contains the same number m of indices as Bg; then we only swap all corresponding elements and move the pointers (k, g) ← (k + m, g − m). Second case: if |Bk| > |Bg|, we first swap |Bg| corresponding elements, then put the remaining elements that have to be swapped into their correct positions and move the pointers k and g accordingly. Third case: analogous to the second case.
The main differences between our implementation and Block partitioning in [EW16] are:

• In the rearrangement phase, they swap until one of the buffers contains no more pointers to elements which should be swapped, and they move the corresponding scanning pointer only for the emptied buffer; then they restart the scanning phase. In our implementation, we make sure that both buffers are empty and all elements have been moved to their correct positions before restarting the scanning phase (so the distance between the scanning pointers k and g is reduced by 2B in every iteration, where B is the size of the block).

• In the scanning phase, they do not refill the non-empty buffer (there is at most one non-empty buffer). In our implementation there is no non-empty buffer, since we empty both in the rearrangement phase; so we refill both again.

9
The idea was implemented in [EW16] slightly differently from our implementation.
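The scanning phase of Listing 7.9 fills both buffers branchlessly: the current index is always stored, and the buffer pointer is advanced only when the comparison outcome b equals 1:

  for (unsigned int i = 0; i < B; i++) {
      b = (*(k + i) >= p);   // left element that has to move right?
      *pk = k + i;           // store the index unconditionally ...
      pk += b;               // ... but keep it only if b == 1

      b = (p >= *(g - i));   // right element that has to move left?
      *pg = g - i;
      pg += b;
  }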

[Diagram: in the scanning phase, the indices {i | v ≤ a_i} of the left block are collected in buffer Bk and the indices {i | a_i ≤ v} of the right block in buffer Bg; the rearrangement phase then swaps the corresponding elements.]

Figure 4.15: Storing indices in preparation for the rearrangement phase in single-pivot block partitioning (Listing 7.9).

Chapter 5

Experiments

5.1 Setup for the Experiments


5.1.1 Machine and Software
We used the GNU compiler g++ version 4.9.2 on a Linux server to compile all algorithms, with a special flag for Quicksort, --param case-values-threshold=8, to make sure that switch-case is transformed into a branch table and not into a simple tree of conditional branches. Furthermore, the optimization flag was set to the highest level, -O3.
Of course, different hardware architectures and operating systems have an influence that must not be ignored. The machine has an Intel(R) Xeon(R) X5690 with 24 CPUs running at 3.47 GHz. This processor has three cache levels: L1d and L1i, each of size 32 KB; L2 with 256 KB; and L3 with 12 MB. The system also has 140 GB of main memory.

5.1.2 Input Generation


For Quicksort, the C standard library <stdlib.h> was used to generate random integers. The function rand(void) returns a pseudo-random integral number in [0, RAND_MAX] (RAND_MAX is a library-dependent constant defined in <cstdlib>, but is guaranteed to be at least 32767 on any standard library implementation), which is generated by an algorithm that returns a sequence of apparently unrelated numbers each time it is called. This algorithm uses a seed to generate the series, which should be initialized to some distinctive value using the function srand(unsigned int).
For the sizes (10^3, 10^4, 10^5, 10^6, 10^7, 10^8), each random integer input is created as follows:

Listing 5.1: Generating integer permutation.


1 int *generate_random_vector(unsigned long size) {
2 srand(time(0));
3 int *a = new int[size];
4 for (unsigned long i = 0; i < size; i++) {
5 a[i] = rand();
6 }
7 return a;
8 }

For the simultaneous maximum and minimum and the exponentiation by squaring algorithms, we used exactly the same methods as those used in [ANP16].

5.1.3 Runtime Measurement Methodology


The function clock(void) from the C runtime library <time.h> was used. This function returns a value representing the current time in clock ticks, which are units of time of a constant but system-specific length (the time measured in seconds can be calculated via the constant CLOCKS_PER_SEC, which relates clock ticks to seconds).
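A minimal sketch of this measurement pattern (the helper function measure_seconds is our illustration, not taken from the thesis code):

  #include <ctime>

  double measure_seconds(void (*run)()) {
      clock_t start = clock();                 // current time in clock ticks
      run();                                   // the algorithm under test
      clock_t ticks = clock() - start;
      return double(ticks) / CLOCKS_PER_SEC;   // convert ticks to seconds
  }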

Each point in the tests of Quicksort and the partitioning procedures represents the average runtime of 1000 tests. For Simultaneous Maximum and Minimum the test was not repeated, as in [ANP16]. For the Exponentiation by Squaring algorithms, we measured the time of computing 2^a (as in [ANP16]) for all a ∈ {1, . . . , n}, where n ∈ {10^3, 10^4, 10^5, 10^6, 10^7, 10^8}.

5.2 Simultaneous Maximum and Minimum

[Plot: Time/n in seconds (scale 10^-9, range about 0.5-2) versus input size log10 n (3 to 9), for naive MinMax, lean-MinMax, lookup-table MinMax, 3/2-MinMax and optimized 3/2-MinMax.]

Figure 5.1: Simultaneous maximum and minimum.

We ran these experiments (Figure 5.1) to clarify two things.
First: negligible conditional branches must not be eliminated at all. For example, naive-MinMax has two negligible branches which induce O(log n) branch mispredictions (a very small number); in other words, the prediction strategy helps the processor very well. As an attempt to optimize naive-MinMax, we tried to use the conditional instruction CMOVcc (Section 4.1), which did not lead to better behavior (Table 5.1). We also applied a lookup table (Section 4.3), which was the most harmful MinMax approach we have: it takes about 3 times as long as naive-MinMax. This was expected because of the use of memory store/load instructions.
Second: how helpful the elimination of critical conditional branches is. 3/2-MinMax suffers a lot from branch misses, and by applying some simple techniques we saved about 25% of the execution time (Section 4.1).

  Input Size   naive-MinMax   Lean-MinMax
  10^3                    1             1
  10^4                    5             5
  10^5                   46            50
  10^6                  531           486
  10^7                 5871          5628
  10^8                57298         57417
  10^9               573350        573326

Table 5.1: Execution times of naive-MinMax and lean-MinMax, measured in microseconds.

5.3 Exponentiation by Squaring

[Plot: Time/n in seconds (scale 10^-8) versus number of items log10 n (3 to 8), for Guided Power, PCB Guided Power and lookup-table Guided Power.]

Figure 5.2: Exponentiation by squaring.

The Guided Power algorithm from [ANP16] is doubtless better than the classical and the unrolled approach. Although the branches in the guided approach induce fewer branch misses, they are still critical. From this standpoint we decided to eliminate those branches altogether. The PCB approach needs 61% of the time the guided approach needs to perform the same job, while the lookup-table approach needs 65%. Both the PCB and the lookup-table approach cast the outcome of the branch expression to a numerical type. The main difference between them is how they use those cast numerical values, which determine the value by which the variable r must be multiplied: PCB applies some arithmetic operations, and the lookup table uses an auxiliary array. Both have very similar runtimes.

5.4 Single-pivot partitioning

[Plot: Time/n in seconds (scale 10^-9, range about 1-4) versus input size log10 n (3 to 8), for Hoare, Unrollsix, Block, Lean Lomuto, cmov-Lean Lomuto and Lean Blind.]

Figure 5.3: Single pivot partitioning algorithms.

We started from the Hoare algorithm (Listing 7.4), which suffers a lot from branch misses. The unrolling strategy (Listing 7.8) was not that good a variant, but it is still better than Hoare, since it saves about 9% of the time. Two main ideas are behind this improvement in the unrolling strategy: the first is loop unrolling, and the second is using switch-case control flow instead of if-else, which enables the use of a branch table in order to avoid branch misses.
Katajainen implemented a lean version of Lomuto (Listing 7.5) in [Kat14] which behaves well on his machine. Surprisingly, however, it was not much better than Hoare on our machine, and the reason behind this unexpected result is the difference between the processor architectures. On our machine, multiplication instructions may cost approximately as much as the branch mispredictions in Hoare, and they are doubtless more expensive than conditional instructions like CMOVcc. This observation motivated us to replace the multiplication instructions with conditional instructions (Listing 7.6), which speeds up the algorithm by a factor of about 1.3.
The real competition is between the Block algorithm (Listing 7.9) and the lean-Blind algorithm (Listing 7.7). Both save about 55% of the time needed by Hoare (they can partition 2 arrays while Hoare is still processing the first one). The lean-Blind approach looks very simple and has a small number of instructions in its main loop, without using any extra space, unlike the Block approach, which needs about 16 KB in our implementation (the algorithm uses 2 buffers, each of which can store 1025 64-bit addresses, so both together need 16 KB and 16 Bytes).

5.5 Multi-pivot partitioning

[Plot: Time/n in seconds (scale 10^-9, range about 4-5) versus input size log10 n (3 to 8), for YBB, Lean Symmetric and Lean Blind.]

Figure 5.4: Dual pivot partitioning algorithms.

[Plot: Time/n in seconds (scale 10^-9, range about 5-6) versus input size log10 n (3 to 8), for Symmetric, Lean Symmetric and Lean Blind.]

Figure 5.5: Triple pivot partitioning algorithms.

Apparently, applying the safety blind swap idea (Section 4.5) to the dual- and triple-pivot partitioning algorithms improves both of them. It speeds up YBB (Listing 7.10) by a factor of about 1.9 and Symmetric triple-pivot (Listing 7.12) by approximately 1.5.
Using the conditional instructions (Listings 4.7 and 7.13) also led to better behavior for both. Lean dual symmetric (Listing 4.7) saved about 20% of the execution time of YBB (Listing 7.10), and lean triple symmetric (Listing 7.13) saved about 25% of the execution time of symmetric triple-pivot (Listing 7.12).
Both ideas, blind swaps and conditional statements, make the main loop of the partitioning procedure lean. Seeking the reason for the superiority of the blind swap strategy, we translated the code of both approaches to assembly level. Indeed, both induce only O(1) branch misses, but the blind approach has a smaller number of instructions.

5.6 Quicksort

[Plot: Time/(n log n) in seconds (scale 10^-9, range about 2.5-4) versus input size log10 n (3 to 8), for Block CQS, Lean Blind CQS, Lean Symmetric Dual-Pivot QS, Lean Blind Dual-Pivot QS, Lean Symmetric Triple-Pivot QS and Lean Blind Triple-Pivot QS.]

Figure 5.6: Various Quicksorts implementation test.

The idea of blind swaps showed very efficient behaviour: Blind Single-Pivot Quicksort (Listing 7.7) was the fastest algorithm, being twice as fast as Hoare. Even with dual and triple pivots (as variants of the Always-Large-First approach) it outperforms the alternatives. The two main reasons are the small number of instructions and the constant number of branch misses, O(1).
Lean Symmetric Triple- and Dual-Pivot Quicksort (Listings 7.13 and 4.7) were attempts to replace the conditional branches in Symmetric Triple-Pivot (Listing 7.12 [AD16]) and YBB (Listing 7.10) with conditional instructions. Although we expected the speedup to be very close to that of the blind approaches (Always-Large-First), the idea did not reach the same level of efficiency. One of the reasons is that the blind approaches execute fewer instructions, although both induce only O(1) branch misses.

Block Quicksort came in second position, directly after Blind Single-Pivot Quicksort. Indeed, it does excellent work, but it needs extra space for its auxiliary buffers, which should be adapted to the cache size (in our implementation the algorithm needs 16 KB and 16 Bytes, where the size of the L1 cache is 32 KB).
The unrolling strategy, together with using switch-case in order to take advantage of branch tables, was not a bad idea, and it may perform better if more elements are unrolled, with due care for the code size (in our implementation we have 2^6 = 64 cases in the switch-case statement). However, an implementation of a switch-case with 2^n cases does not look like a practical solution.
The lean version of Lomuto introduced by Katajainen [Kat14] (Listing 7.5) was surprisingly not a good variant. The only interpretation we have is that the algorithm uses 64-bit multiplication instructions in the main loop of the partitioning phase. Once we transformed the multiplication instruction into a conditional move instruction, the performance increased.

[Plot: Time/(n log n) in seconds (scale 10^-9) versus input size log10 n (3 to 8), for Hoare CQS, Unrollsix CQS, Block CQS, Lean Lomuto CQS, cmov-Lean Lomuto CQS, Lean Blind CQS, YBB Dual-Pivot QS, Lean Symmetric Dual-Pivot QS, Lean Blind Dual-Pivot QS, Symmetric Triple-Pivot QS, Lean Symmetric Triple-Pivot QS and Lean Blind Triple-Pivot QS.]

Figure 5.7: Quicksorts race of all partitioning algorithms which have been considered in this thesis.

Figure 5.8 ranks the Quicksorts we considered by their speedup relative to Hoare CQS. Let T_Hoare be the average runtime of Hoare CQS and T_Q the average runtime of Quicksort Q, for input of size 10^8. Then the speedup of algorithm Q is defined as follows:

  Speedup_Q = T_Hoare / T_Q

For example, Lean Blind CQS achieves a speedup of 1.959, i.e., it is almost twice as fast as Hoare CQS.

  Rank  Algorithm                        Speedup
  1     Lean Blind CQS                   1.959
  2     Block CQS                        1.859
  3     Lean Blind Dual-Pivot QS         1.812
  4     Lean Blind Triple-Pivot QS       1.606
  5     Lean Symmetric Triple-Pivot QS   1.405
  6     cmov-Lean Lomuto CQS             1.291
  7     Lean Symmetric Dual-Pivot QS     1.285
  8     Unrollsix CQS                    1.104
  9     Symmetric Triple-Pivot QS        1.054
  10    Lean Lomuto CQS                  1.045
  11    YBB Dual-Pivot QS                1.032
  12    Hoare CQS                        1.0

Figure 5.8: Quicksort Algorithms Ranking.



Chapter 6

Conclusion

The main contribution of this thesis is to show the important role that modern processor architectures play with regard to the execution time of algorithms. Conditional branch instructions induce one of the most harmful hazards, which we should take care of. Reducing the probability of branch mispredictions, or even eliminating branches altogether (lean procedures), can be a good solution. The results we obtained on our machine showed a clear improvement for Quicksort and some other algorithms. The developer is responsible for the optimization strategy (which branch should be optimized/eliminated and which should not, taking the architecture of the processor into account).
All strategies and techniques we used in this work are language-dependent (implemented using C/C++), and this leads to an interesting question I look forward to answering: which strategies and techniques are applicable in Java, and are there new factors which should be considered in such an optimization process?

Chapter 7

Appendix

7.1 Proofs
Proof (1-bit prediction scheme). We prove the form of the expected number of branch mispredictions given in Table 2.2 (Section 2.3.1). To get the probability of a branch miss, we use the stationary distribution of the corresponding Markov chain, as follows. Assume p is the probability that the branch is taken, and let (π1, π2) represent the states (T, N).
The transition matrix is

         T     N
  Π =  T (  p   1 − p )
       N (  p   1 − p )

The stationarity and normalization conditions are

  π1 + π2 = 1
  (1 − p)π1 = pπ2

We conclude π2 = ((1 − p)/p) π1; together with Σ_{k=1}^{2} πk = 1 we get π(p) = (p, 1 − p).
A branch miss occurs when the prediction automaton is in state T and the branch is not taken, or when the automaton is in state N and the branch is taken. Hence the probability of a branch miss is π(p) · (1 − p, p) = 2p(1 − p), and the expected number of branch misses is Σ_{k=1}^{n} 2p(1 − p) = 2np(1 − p) for all n > 0. For example, for p = 1/2 the miss probability is 2 · (1/2) · (1/2) = 1/2, as expected for a completely random branch. ∎

Proof (2-bit-saturate-counter prediction scheme). We prove the form of the expected number of branch mispredictions given in Table 2.2 (Section 2.3.1). The following proof is from [MNW15]. To get the probability of a branch miss, we use the stationary distribution of the corresponding Markov chain, as follows. Assume p is the probability that the branch is taken, and let (π1, π2, π3, π4) represent the states (ST, WT, WN, SN).
The transition matrix is

            ST     WT     WN     SN
  Π =  ST (  p   1 − p     0      0  )
       WT (  p     0    1 − p     0  )
       WN (  0     p      0    1 − p )
       SN (  0     0      p    1 − p )

The stationarity and normalization conditions are

  π1 + π2 + π3 + π4 = 1
  (1 − p)π1 = pπ2
  pπ2 + (1 − p)π2 = (1 − p)π1 + pπ3
  pπ3 + (1 − p)π3 = (1 − p)π2 + pπ4
  pπ4 = (1 − p)π3

We conclude π2 = ((1 − p)/p) π1, π3 = ((1 − p)/p)^2 π1 and π4 = ((1 − p)/p)^3 π1; together with Σ_{k=1}^{4} πk = 1 we get

  π(p) = (1/(1 − 2p(1 − p))) · (p^3, p^2(1 − p), p(1 − p)^2, (1 − p)^3).

A branch miss occurs when the prediction automaton is in state ST or WT and the branch is not taken, or when it is in state SN or WN and the branch is taken. Hence the probability of a branch miss is

  π(p) · (1 − p, 1 − p, p, p) = p(1 − p)/(1 − 2p(1 − p)),

and the expected number of branch misses is Σ_{k=1}^{n} p(1 − p)/(1 − 2p(1 − p)). ∎
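As a quick sanity check of the normalization above (our own verification, not part of [MNW15]):

\[
p^3 + p^2(1-p) + p(1-p)^2 + (1-p)^3 = p^2 + (1-p)^2 = 1 - 2p(1-p),
\]

so dividing by 1 − 2p(1 − p) indeed makes the entries of π(p) sum to 1.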

Proof (2-bit-flip-on-consecutive prediction scheme). We prove the form of the expected number of branch mispredictions given in Table 2.2 (Section 2.3.1). The following proof is from [MNW15]. To get the probability of a branch miss, we use the stationary distribution of the corresponding Markov chain, as follows. Assume p is the probability that the branch is taken, and let (π1, π2, π3, π4) represent the states (ST, WT, WN, SN).
The transition matrix is

            ST     WT     WN     SN
  Π =  ST (  p   1 − p     0      0  )
       WT (  p     0       0   1 − p )
       WN (  p     0       0   1 − p )
       SN (  0     0       p   1 − p )

The stationarity and normalization conditions are

  π1 + π2 + π3 + π4 = 1
  (1 − p)π1 = pπ2 + pπ3
  pπ2 + (1 − p)π2 = (1 − p)π1
  pπ3 + (1 − p)π3 = pπ4
  pπ4 = (1 − p)π2 + (1 − p)π3

We conclude π2 = (1 − p)π1, π3 = ((1 − p)^2/p) π1 and π4 = ((1 − p)^2/p^2) π1; together with Σ_{k=1}^{4} πk = 1 we get

  π(p) = (1/(1 − p(1 − p))) · (p^2, p^2(1 − p), p(1 − p)^2, (1 − p)^2).

A branch miss occurs when the prediction automaton is in state ST or WT and the branch is not taken, or when it is in state SN or WN and the branch is taken. Hence the probability of a branch miss is

  π(p) · (1 − p, 1 − p, p, p) = (2p^2(1 − p)^2 + p(1 − p))/(1 − p(1 − p)),

and the expected number of branch misses is Σ_{k=1}^{n} (2p^2(1 − p)^2 + p(1 − p))/(1 − p(1 − p)). ∎

7.2 GNU Compiler Collection


“GCC is an integrated distribution of compilers for several major programming
languages. These languages currently include C, C++, Objective-C, Objective-
C++, Fortran, Ada, Go, and BRIG (HSAIL). The language-independent component
of GCC includes the majority of the optimizers, as well as the back ends that
generate machine code for various processors. Historically, compilers for many
languages, including C++ and Fortran, have been implemented as preprocessors
which emit another high level language such as C. None of the compilers included in
GCC are implemented this way; they all generate machine code directly. [Sta03]”
The following flags and parameters are important and related to our work:

• --param max-unrolled-insns=n : The maximum number of instructions


that a loop may have to be unrolled.

• --param max-unroll-times=n : The maximum number of unrollings of a


single loop.

• --param case-values-threshold=n : The smallest number of different val-


ues for which it is best to use a jump-table instead of a tree of conditional
branches. If the value is 0, use the default for the machine. The default is 0.

• -funroll-loops : Unroll loops whose number of iterations can be deter-


mined at compile time or upon entry to the loop. This option makes code
larger, and may or may not make it run faster.

• -funroll-all-loops : Unroll all loops, even if their number of iterations is


uncertain when the loop is entered. This usually makes programs run more
slowly.

• -fno-jump-tables : Do not use jump tables for switch statements even


where it would be more efficient than other code generation strategies.

• -fif-conversion : Attempt to transform conditional jumps into branch-


less equivalents. This includes use of conditional moves, min, max, set flags
and abs instructions, and some tricks doable by standard arithmetics.

• -fif-conversion2 : Use conditional execution (where available) to trans-


form conditional jumps into branch-less equivalents.

• -O0 : Reduce compilation time and make debugging produce the expected
results. This is the default.

• -O1 : Optimizing compilation takes somewhat more time, and a lot more
memory for a large function. With -O, the compiler tries to reduce code size
and execution time, without performing any optimizations that take a great
deal of compilation time.

• -O2 : Optimize even more. GCC performs nearly all supported optimizations
that do not involve a space-speed tradeoff. As compared to -O, this option
increases both compilation time and the performance of the generated code.

• -O3 : Optimize yet more. ‘-O3’ turns on all optimizations specified by -O2 and turns on all other optimization flags.

7.3 Routines

Listing 7.1: Guided Power (Auger, Nicaud and Pivoteau [ANP16]) (Section 4.3
and 4.4).
1 double guided_power(double x,int n){
2 double r = 1, t;
3
4 while (n > 0) {
5 t = mult(x,x);
6
7 if (n & 3) {
8 if (n & 1) r = mult(r,x);
9 if (n & 2) r = mult(r,t);
10 }
11
12 x = mult(t,t);
13 n >>= 2;
14 }
15
16 return r;
17 }

Listing 7.2: Pseudo Conditional Branch Guided Power (Section 4.4).


1 double guided_power_pcb(double x, int n) {
2 double r = 1, t;
3
4 while (n > 0) {
5 t = mult(x, x);
6 r = mult(r, 1 + mult(n & 1, x - 1));
7 r = mult(r, 1 + mult((n & 2) >> 1, t - 1));
8 x = mult(t, t);
9 n >>= 2;
10 }
11
12 return r;
13 }

Listing 7.3: 3/2 MinMax (Auger, Nicaud and Pivoteau [ANP16], Section 4.1).
1 void half3_minmax(int *A, unsigned long int N, int &min, int &max) {
2 max = - 1; min = MAX + 1; // MAX = 10000000
3
4 for (unsigned long int i = 0; i < N; i += 2) {
5 if (A[i] < A[i + 1]) {
6 if (A[i] < min) min = A[i];
7 if (A[i + 1] > max) max = A[i + 1];
8 } else {
9 if (A[i + 1] < min) min = A[i + 1];
10 if (A[i] > max) max = A[i];
11 }
12 }
13 }

Listing 7.4: Hoare Single-Pivot Partitioning (Section 4.2).


1 int *hoare_s(int *left, int *right) {
2 int *k = left;
3 int *g = right + 1;
4 int p = *left;
5
6 do {
7 do { k++; } while (*k < p);
8 do { g--; } while (p < *g);
9
10 if (k < g) SWAP(*k, *g);
11 } while (g > k);
12
13 SWAP(*--k, *left);
14
15 return k;
16 }

Listing 7.5: Lean Lomuto Single-Pivot Partitioning (Katajainen [Kat14], Section


4.4).
1 int *lomuto_sl(int *left, int *right) {
2 int *r = right, *p = left, *first = left;
3 int v = *left, x;
4 int *s, *t;
5 unsigned long delta;
6 bool smaller;
7 int *q = first + 1;
8
9 while (q <= r) {
10 x = *q;
11 smaller = (x < v);
12 p += smaller;
13 delta = smaller * (q - p);
14 s = p + delta;
15 t = q - delta;
16 *s = *p;
17 *t = x;
18 ++q;
19 }
20
21 *first = *p;
22 *p = v;
23
24 return p;
25 }

Listing 7.6: Lean Lomuto (cmov) Single-Pivot Partitioning (Katajainen [Kat14],


Section 4.4).
1 int *lomuto_sl_cmov(int *left, int *right) {
2 int *r = right, *p = left, *first = left;
3 int v = *left, x;
4 int *s, *t;
5 unsigned long delta;
6 bool smaller;
7 int *q = first + 1;
8
9 while (q <= r) {
10 x = *q;
11 smaller = (x < v);
12 p += smaller;
13 delta = smaller ? (q - p) : 0;
14 s = p + delta;
15 t = q - delta;
16 *s = *p;
17 *t = x;
18 ++q;
19 }
20
21 *first = *p;
22 *p = v;
23
24 return p;
25 }

Listing 7.7: Lean Blind Single-Pivot Partitioning (Seidel idea, Section 4.5).
1 int *blind_sl(int *left, int *right) {
2
3 int *k = left;
4 int *g = left + 1;
5 int p = *left;
6
7 for (; g <= right; g++) {
8 SWAP(*g, *(k + 1));
9 k += (*(k + 1) < p);
10 }
11
12 SWAP(*left, *k);
13 return k;
14 }

Listing 7.8: Unrollsix Single-Pivot Partitioning (Section 4.2).


1 int *unrollsix_s(int *left, int *right) {
2 int *k = left + 1;
3 int *g = right;
4 int p = *left;
5
6 unsigned char flag;
7
8 while (g - k > 5) {
9 flag = *k < p;
10 flag = (flag << 1) | (*(k + 1) < p);
11 flag = (flag << 1) | (*(k + 2) < p);
12 flag = (flag << 1) | (p < *(g - 2));
13 flag = (flag << 1) | (p < *(g - 1));
14 flag = (flag << 1) | (p < *g);
15
16 switch (flag) {
17 case 0: SWAP(*k, *g); SWAP(*(k + 1), *(g - 1)); SWAP(*(k + 2), *(g - 2));
18 k += 3; g -= 3; break;
19
20 case 1: SWAP(*k, *(g - 1)); SWAP(*(k + 1), *(g - 2)); SWAP(*(k + 2), *(g - 3));
21 k += 2; g -= 4; break;
22
23 case 2: SWAP(*k, *g); SWAP(*(k + 1), *(g - 2)); SWAP(*(k + 2), *(g - 3));
24 k += 2; g -= 4; break;
25
26 case 3: SWAP(*k, *(g - 2)); SWAP(*(k + 2), *(g - 3)); SWAP(*(k + 1), *(g - 4));
27 k += 1; g -= 5; break;
28
29 case 4: SWAP(*k, *g); SWAP(*(k + 1), *(g - 1)); SWAP(*(k + 2), *(g - 3));
30 k += 2; g -= 4; break;
31
32 case 5: SWAP(*k, *(g - 1)); SWAP(*(k + 2), *(g - 3)); SWAP(*(k + 1), *(g - 4));
33 k += 1; g -= 5; break;
34
35 case 6: SWAP(*k, *g); SWAP(*(k + 2), *(g - 3)); SWAP(*(k + 1), *(g - 4));
36 k += 1; g -= 5; break;
37
38 case 7: SWAP(*(k + 2), *(g - 3)); SWAP(*(k + 1), *(g - 4)); SWAP(*k, *(g - 5));
39 g -= 6; break;
40
41 case 8: SWAP(*k, *g); SWAP(*(k + 1), *(g - 1)); SWAP(*(k + 3), *(g - 2));
42 k += 4; g -= 2; break;
43
44 case 9: SWAP(*k, *(g - 1)); SWAP(*(k + 1), *(g - 2));
45 k += 3; g -= 3; break;
46
47 case 10: SWAP(*k, *g); SWAP(*(k + 1), *(g - 2));
48 k += 3; g -= 3; break;
49
50 case 11: SWAP(*k, *(g - 2)); ROTATE(*(k + 1), *(k + 2), *(g - 3));
51 k += 2; g -= 4; break;
52
53 case 12: SWAP(*k, *g); SWAP(*(k + 1), *(g - 1));
54 k += 3; g -= 3; break;
55
56 case 13: SWAP(*k, *(g - 1)); ROTATE(*(k + 1), *(k + 2), *(g - 3));
57 k += 2; g -= 4; break;
58
59 case 14: SWAP(*k, *g); ROTATE(*(k + 1), *(k + 2), *(g - 3));
60 k += 2; g -= 4; break;
61
62 case 15: SWAP(*(k + 1), *(g - 3)); ROTATE(*k, *(k + 2), *(g - 4));
63 k += 1; g -= 5; break;
64
65 case 16: SWAP(*k, *g); SWAP(*(k + 2), *(g - 1)); SWAP(*(k + 3), *(g - 2));
66 k += 4; g -= 2; break;
67
68 case 17: SWAP(*k, *(g - 1)); SWAP(*(k + 2), *(g - 2));

69 k += 3; g -= 3; break;
70
71 case 18: SWAP(*k, *g); SWAP(*(k + 2), *(g - 2));
72 k += 3; g -= 3; break;
73
74 case 19: SWAP(*k, *(g - 2)); SWAP(*(k + 2), *(g - 3));
75 k += 2; g -= 4; break;
76
77 case 20: SWAP(*k, *g); SWAP(*(k + 2), *(g - 1));
78 k += 3; g -= 3; break;
79
80 case 21: SWAP(*k, *(g - 1)); SWAP(*(k + 2), *(g - 3));
81 k += 2; g -= 4; break;
82
83 case 22: SWAP(*k, *g); SWAP(*(k + 2), *(g - 3));
84 k += 2; g -= 4; break;
85
86 case 23: SWAP(*(k + 2), *(g - 3)); ROTATE(*k, *(k + 1), *(g - 4));
87 k += 1; g -= 5; break;
88
89 case 24: SWAP(*k, *g); SWAP(*(k + 3), *(g - 2)); SWAP(*(k + 4), *(g - 1));
90 k += 5; g -= 1; break;
91
92 case 25: SWAP(*k, *(g - 1)); SWAP(*(k + 3), *(g - 2));
93 k += 4; g -= 2; break;
94
95 case 26: SWAP(*k, *g); SWAP(*(k + 3), *(g - 2));
96 k += 4; g -= 2; break;
97
98 case 27: SWAP(*k, *(g - 2));
99 k += 3; g -= 3; break;
100
101 case 28: SWAP(*k, *g); ROTATE(*(g - 1), *(g - 2), *(k + 3));
102 k += 4; g -= 2; break;
103
104 case 29: SWAP(*k, *(g - 1));
105 k += 3; g -= 3; break;
106
107 case 30: SWAP(*k, *g);
108 k += 3; g -= 3; break;
109
110 case 31: ROTATE(*k, *(k + 2), *(g - 3));
111 k += 2; g -= 4; break;
112
113 case 32: SWAP(*(k + 1), *g); SWAP(*(k + 2), *(g - 1)); SWAP(*(k + 3), *(g - 2));
114 k += 4; g -= 2; break;
115
116 case 33: SWAP(*(k + 1), *(g - 1)); SWAP(*(k + 2), *(g - 2));
117 k += 3; g -= 3; break;
118
119 case 34: SWAP(*(k + 1), *g); SWAP(*(k + 2), *(g - 2));
120 k += 3; g -= 3; break;
121
122 case 35: SWAP(*(k + 1), *(g - 2)); SWAP(*(k + 2), *(g - 3));
123 k += 2; g -= 4; break;
124
125 case 36: SWAP(*(k + 1), *g); SWAP(*(k + 2), *(g - 1));
126 k += 3; g -= 3; break;
127
128 case 37: SWAP(*(k + 1), *(g - 1)); SWAP(*(k + 2), *(g - 3));
129 k += 2; g -= 4; break;
130
131 case 38: SWAP(*(k + 1), *g); SWAP(*(k + 2), *(g - 3));
132 k += 2; g -= 4; break;
133
134 case 39: SWAP(*(k + 1), *(g - 3)); SWAP(*(k + 2), *(g - 4));
135 k += 1; g -= 5; break;
136
137 case 40: SWAP(*(k + 1), *g); SWAP(*(k + 3), *(g - 1)); SWAP(*(k + 4), *(g - 2));
138 k += 5; g -= 1; break;

139
140 case 41: SWAP(*(k + 1), *(g - 1)); SWAP(*(k + 3), *(g - 2));
141 k += 4; g -= 2; break;
142
143 case 42: SWAP(*(k + 1), *g); SWAP(*(k + 3), *(g - 2));
144 k += 4; g -= 2; break;
145
146 case 43: SWAP(*(k + 1), *(g - 2));
147 k += 3; g -= 3; break;
148
149 case 44: SWAP(*(k + 1), *g); ROTATE(*(g - 1), *(g - 2), *(k + 3));
150 k += 4; g -= 2; break;
151
152 case 45: SWAP(*(k + 1), *(g - 1));
153 k += 3; g -= 3; break;
154
155 case 46: SWAP(*(k + 1), *g);
156 k += 3; g -= 3; break;
157
158 case 47: ROTATE(*(k + 1), *(k + 2), *(g - 3));
159 k += 2; g -= 4; break;
160
161 case 48: SWAP(*(k + 2), *g); SWAP(*(k + 3), *(g - 1)); SWAP(*(k + 4), *(g - 2));
162 k += 5; g -= 1; break;
163
164 case 49: SWAP(*(k + 2), *(g - 1)); SWAP(*(k + 3), *(g - 2));
165 k += 4; g -= 2; break;
166
167 case 50: SWAP(*(k + 2), *g); SWAP(*(k + 3), *(g - 2));
168 k += 4; g -= 2; break;
169
170 case 51: SWAP(*(k + 2), *(g - 2));
171 k += 3; g -= 3; break;
172
173 case 52: SWAP(*(k + 2), *g); ROTATE(*(g - 1), *(g - 2), *(k + 3));
174 k += 4; g -= 2; break;
175
176 case 53: SWAP(*(k + 2), *(g - 1));
177 k += 3; g -= 3; break;
178
179 case 54: SWAP(*(k + 2), *g);
180 k += 3; g -= 3; break;
181
182 case 55: SWAP(*(k + 2), *(g - 3));
183 k += 2; g -= 4; break;
184
185 case 56: SWAP(*(k + 3), *g); SWAP(*(k + 4), *(g - 1)); SWAP(*(k + 5), *(g - 2));
186 k += 6; break;
187
188 case 57: SWAP(*(k + 3), *(g - 1)); SWAP(*(k + 4), *(g - 2));
189 k += 5; g -= 1; break;
190
191 case 58: SWAP(*(k + 3), *(g - 2)); ROTATE(*g, *(g - 1), *(k + 4));
192 k += 5; g -= 1; break;
193
194 case 59: SWAP(*(k + 3), *(g - 2));
195 k += 4; g -= 2; break;
196
197 case 60: SWAP(*(k + 4), *(g - 1)); ROTATE(*g, *(g - 2), *(k + 3));
198 k += 5; g -= 1; break;
199
200 case 61: ROTATE(*(g - 1), *(g - 2), *(k + 3));
201 k += 4; g -= 2; break;
202
203 case 62: ROTATE(*g, *(g - 2), *(k + 3));
204 k += 4; g -= 2; break;
205
206 case 63: k += 3; g -= 3;
207
208 }

209 }
210
211 SWAP(*left, *--k);
212 k = blind_sl(k, g);
213
214 return k;
215 }

Listing 7.9: Block Single-Pivot Partitioning (Section 4.6).


1 #define B 1024
2 #define Bx3 3072
3 int *block_s(int *left, int *right) {
4
5 if ((right - left) < Bx3) return blind_sl(left, right);
6
7 int *k = left + 1;
8 int *g = right;
9 int p = *left;
10
11 int **kifo = new int *[B + 1]; int **gifo = new int *[B + 1];
12 int **pk = kifo; int **pg = gifo;
13 int **sk; int **sg;
14 bool b;
15
16 while ((g - k) > Bx3) {
17 pk = kifo; pg = gifo;
18
19 for (unsigned int i = 0; i < B; i++) {
20 b = (*(k + i) >= p);
21 *pk = k + i;
22 pk += b;
23
24 b = (p >= *(g - i));
25 *pg = g - i;
26 pg += b;
27 }
28
29 *pk = k + B; *pg = g - B;
30
31 sk = kifo; sg = gifo;
32 unsigned int min = (pg - gifo);
33 b = (pk - kifo) < min;
34 min += b*((pk - kifo) - min);
35
36 for (unsigned int i = 0; i < min; i++) {
37 SWAP(**sk, **sg);
38 sk++; sg++;
39 }
40
41 k = *sk; g = *sg;
42
43 if ((sk < pk) || (sg < pg)) {
44 while (sk < pk) {
45 SWAP(**sk, *g);
46 sk++; g--;
47 }
48
49 while (sg < pg) {
50 SWAP(*k, **sg);
51 sg++; k++;
52 }
53 }
54 }
55
56 SWAP(*left, *--k);
57 return blind_sl(k, g);
58 }

Listing 7.10: YBB Dual-Pivot Partitioning (Yaroslavskiy, Bentley and Bloch).


1 void ybb_d(int *left, int *right, int *&pos_p, int *&pos_q) {
2 int *l = left + 1;
3 int *g = right - 1;
4 int *k = l;
5
6 int p1 = *left, p2 = *right;
7
8 while (k <= g) {
9 if (*k < p1) {
10 SWAP(*k, *l);
11 l++;
12 } else {
13 if (*k > p2) {
14 while (*g > p2) g--;
15 if (k < g) {
16 if (*g < p1) {
17 ROTATE(*g, *k, *l);
18 l++;
19 } else SWAP(*k, *g);
20 g--;
21 }
22 }
23 }
24 k++;
25 }
26
27 SWAP(*left, *--l);
28 SWAP(*right, *++g);
29 pos_p = l;
30 pos_q = g;
31 }

Listing 7.11: Lean Blind Dual-Pivot Partitioning (Section 4.5).


1 void blind_dl(int *left, int *right, int *&pos_p, int *&pos_q) {
2 int *l = left;
3 int *k = left;
4 int *g = left + 1;
5
6 int p1 = *left, p2 = *right;
7
8 while (g < right) {
9 int x = *g;
10 bool b0 = (x <= p2);
11 bool b1 = (x < p1);
12
13 SWAP(*g, *(k + 1));
14 k += b0;
15 if (l < k) SWAP(*k, *(l + 1)); // O(1) branch misprediction
16 l += b1;
17 g++;
18 }
19
20 SWAP(*left, *l);
21 SWAP(*right, *++k);
22
23 pos_p = l;
24 pos_q = k;
25 }

Listing 7.12: Symmetric Triple-Pivot Partitioning (Aumüller and Dietzfelbinger [AD16]).
1 void symm_t(int *left, int *right, int *&pos_p1, int *&pos_p2, int *&pos_p3) {
2 int *i = left + 2;
3 int *j = i;
4 int *k = right - 1;
5 int *l = k;
6
7 int p1 = *left, p2 = *(left + 1), p3 = *right;
8
9 while (j <= k) {
10 while (*j < p2) {
11 if (*j < p1) {
12 SWAP(*i, *j);
13 i++;
14 }
15 j++;
16 }
17
18 while (*k > p2) {
19 if (*k > p3) {
20 SWAP(*k, *l);
21 l--;
22 }
23 k--;
24 }
25
26 if (j <= k) {
27 if (*j > p3) {
28 if (*k < p1) {
29 ROTATE(*j, *i, *k, *l);
30 i++;
31 } else ROTATE(*j, *k, *l);
32 l--;
33 } else {
34 if (*k < p1) {
35 ROTATE(*j, *i, *k);
36 i++;
37 } else SWAP(*j, *k);
38 }
39 j++;
40 k--;
41 }
42 }
43
44 ROTATE(*(left + 1), *(i - 1), *(j - 1));
45 SWAP(*left, *(i - 2));
46 SWAP(*right, *(l + 1));
47
48 pos_p1 = i - 2;
49 pos_p2 = j - 1;
50 pos_p3 = l + 1;
51 }

Listing 7.13: Lean Symmetric Triple-Pivot Partitioning (Section 4.1).


1 void symm_tl(int *left, int *right, int *&pos_p1, int *&pos_p2, int *&pos_p3) {
2
3 int *i = left + 2;
4 int *j = i;
5 int *k = right - 1;
6 int *l = k;
7
8 int p1 = *left, p2 = *(left + 1), p3 = *right;
9
10 for (unsigned long int c = 0; c < right - left - 2; c++) {
11 int x = *j;
12
13 bool b0 = (x < p2);
14 bool b1 = (x < p1);
15 bool b2 = (p3 < x);
16 bool b3 = (x >= p2);
17
18 int *beta = (b0 ? i : k);
19 int *alpha = (b2 ? l : beta);
20
21 ROTATE(*j, *beta, *alpha);
22
23 j += b0;
24 i += b1;
25 l -= b2;
26 k -= b3;
27 }
28
29 ROTATE(*(left + 1), *(i - 1), *(j - 1));
30 SWAP(*left, *(i - 2));
31 SWAP(*right, *(l + 1));
32
33 pos_p1 = i - 2;
34 pos_p2 = j - 1;
35 pos_p3 = l + 1;
36 }

Listing 7.14: Lean Blind Triple-Pivot Partitioning (Section 4.5).


1 void blind_tl(int *left, int *right, int *&pos_p0, int *&pos_p1, int *&pos_p2) {
2
3 int *l = left + 1;
4 int *s = left + 1;
5 int *k = left + 1;
6 int *g = left + 2;
7
8 int p0 = *left, p1 = *(left + 1), p2 = *right;
9
10 while (g < right) {
11 int x = *g;
12 bool b0 = (x < p2);
13 bool b1 = (x < p1);
14 bool b2 = (x < p0);
15
16 SWAP(*g, *(k + 1));
17 k += b0;
18
19 if (s < k) SWAP(*k, *(s + 1)); // O(1) branch misprediction
20 s += b1;
21
22 if (l < s) SWAP(*s, *(l + 1)); // O(1) branch misprediction
23 l += b2;
24
25 g++;
26 }
27
28 ROTATE(*(left + 1), *l, *s);
29 SWAP(*left, *--l);
30 SWAP(*right, *++k);
31
32 pos_p0 = l;
33 pos_p1 = s;
34 pos_p2 = k;
35 }

Listing 7.15: Single-Pivot Quicksort.


1 #define N0 30
2 void qs_s(int *left, int *right, int * (* partitioning) (int *, int *)) {
3 if (right - left > N0) {
4 int *p = median_of_3_pivot(left, right);
5 SWAP(*left, *p);
6
7 p = ((* partitioning)(left, right));
8 qs_s(left, p - 1, partitioning);
9 qs_s(p + 1, right, partitioning);
10 } else insertion_sort(left, right);
11 }

Listing 7.16: Dual-Pivot Quicksort.


1 #define N0 30
2 void qs_d(int *left, int *right, void (* partitioning) (int *, int *, int *&, int *&)) {
3 if (right - left > N0) {
4 int *p0, *p1;
5 choose_2_pivots(left, right);
6
7 ((* partitioning)(left, right, p0, p1));
8
9 qs_d(left, p0 - 1, partitioning);
10 qs_d(p0 + 1, p1 - 1, partitioning);
11 qs_d(p1 + 1, right, partitioning);
12 } else insertion_sort(left, right);
13 }

Listing 7.17: Triple-Pivot Quicksort.


1 #define N0 30
2 void qs_t(int *left, int *right, void (* partitioning) (int *, int *, int *&, int *&, int *&)) {
3 if (right - left > N0) {
4 int *p0, *p1, *p2;
5 choose_3_pivots(left, right);
6 ((* partitioning)(left, right, p0, p1, p2));
7
8 qs_t(left, p0 - 1, partitioning);
9 qs_t(p0 + 1, p1 - 1, partitioning);
10 qs_t(p1 + 1, p2 - 1, partitioning);
11 qs_t(p2 + 1, right, partitioning);
12 } else insertion_sort(left, right);
13 }

Chapter 8

Bibliography

[AD16] Aumüller, Martin ; Dietzfelbinger, Martin: Optimal Partition-


ing for Dual-Pivot Quicksort. In: ACM Trans. Algorithms 12 (2016),
Nr. 2, S. 18:1–18:36. http://dx.doi.org/10.1145/2743020. – DOI
10.1145/2743020

[ADK16] Aumüller, Martin ; Dietzfelbinger, Martin ; Klaue, Pascal:


How Good Is Multi-Pivot Quicksort? In: ACM Trans. Algorithms
13 (2016), Nr. 1, S. 8:1–8:47. http://dx.doi.org/10.1145/2963102.
– DOI 10.1145/2963102

[AGK13] Antyipin, Artyom ; Góbi, Attila ; Kozsik, Tamás: Low Level


Conditional Move Optimization. In: Acta Cybern. 21 (2013), Nr. 1, S.
5–20. http://dx.doi.org/10.14232/actacyb.21.1.2013.2. – DOI
10.14232/actacyb.21.1.2013.2

[ALSU06] Aho, Alfred V. ; Lam, Monica S. ; Sethi, Ravi ; Ullman, Jeffrey D.:
Compilers: Principles, Techniques, and Tools (2Nd Edition). Boston,
MA, USA : Addison-Wesley Longman Publishing Co., Inc., 2006. –
ISBN 0321486811

[ANP16] Auger, Nicolas ; Nicaud, Cyril ; Pivoteau, Carine: Good Predic-


tions Are Worth a Few Comparisons. In: 33rd Symposium on Theoret-
ical Aspects of Computer Science, STACS 2016, February 17-20, 2016,
Orléans, France, 2016, S. 12:1–12:14

[BM05] Brodal, Gerth S. ; Moruz, Gabriel: Tradeoffs Between Branch Mis-


predictions and Comparisons for Sorting Algorithms. In: Algorithms
and Data Structures, 9th International Workshop, WADS 2005, Wa-
terloo, Canada, August 15-17, 2005, Proceedings, 2005, S. 385–395

[BNWG08] Biggar, Paul ; Nash, Nicholas ; Williams, Kevin ; Gregg,


David: An experimental study of sorting and branch prediction.
In: ACM Journal of Experimental Algorithmics 12 (2008), S. 1.8:1–
1.8:39. http://dx.doi.org/10.1145/1227161.1370599. – DOI
10.1145/1227161.1370599

[CLRS09] Cormen, Thomas H. ; Leiserson, Charles E. ; Rivest, Ronald L. ;


Stein, Clifford: Introduction to Algorithms, Third Edition. 3rd. The
MIT Press, 2009. – ISBN 0262033844, 9780262033848

[Cor16] Corporation, Intel: Intel(R) 64 and IA-32 Architectures Optimization Reference Manual. Intel Corporation, 2016

[EK12] Elmasry, Amr ; Katajainen, Jyrki: Lean Programs, Branch Mis-


predictions, and Sorting. In: Fun with Algorithms - 6th International
Conference, FUN 2012, Venice, Italy, June 4-6, 2012. Proceedings,
2012, S. 119–130

[EKS12] Elmasry, Amr ; Katajainen, Jyrki ; Stenmark, Max: Branch


Mispredictions Don’t Affect Mergesort. In: Experimental Algorithms
- 11th International Symposium, SEA 2012, Bordeaux, France, June
7-9, 2012. Proceedings, 2012, S. 160–171

[EW16] Edelkamp, Stefan ; Weiß, Armin: BlockQuicksort: Avoiding


Branch Mispredictions in Quicksort. In: 24th Annual European Sym-
posium on Algorithms, ESA 2016, August 22-24, 2016, Aarhus, Den-
mark, 2016, S. 38:1–38:16

[Fog17a] Fog, Agner: Lists of instruction latencies, throughputs and micro-


operation breakdowns for Intel, AMD and VIA CPUs. Technical Uni-
versity of Denmark, 2017

[Fog17b] Fog, Agner: Optimizing software in C++: An optimization guide for


Windows, Linux and Mac platforms. Technical University of Denmark,
2017

[Kat12] Katajainen, Jyrki: Branch mispredictions don’t affect mergesort


(Talk). (2012)

[Kat14] Katajainen, Jyrki: Sorting programs executing fewer branches.


(2014). – CPH STL report 2014-1, Department of Computer Science,
University of Copenhagen, 51 pp.

[Ker88] Kernighan, Brian W. ; Ritchie, Dennis M. (Hrsg.): The C Program-


ming Language. 2nd. Prentice Hall Professional Technical Reference,
1988. – ISBN 0131103709

[KS06] Kaligosi, Kanela ; Sanders, Peter: How Branch Mispredictions


Affect Quicksort. In: Algorithms - ESA 2006, 14th Annual European
Symposium, Zurich, Switzerland, September 11-13, 2006, Proceedings,
2006, S. 780–791

[MNW15] Martínez, Conrado ; Nebel, Markus E. ; Wild, Sebastian: Anal-


ysis of Branch Misses in Quicksort. In: Proceedings of the Meeting
on Analytic Algorithmics and Combinatorics. Philadelphia, PA, USA
: Society for Industrial and Applied Mathematics, 2015, 114–128

[Mor01] Mortensen, Sofus: Refining the pure-C cost model (Master thesis).
2001

[MS08]    Mehlhorn, Kurt ; Sanders, Peter: Algorithms and Data Structures: The Basic Toolbox. Springer-Verlag Berlin Heidelberg, 2008. – ISBN 978-3-540-77977-3, 978-3-540-77978-0

[PH13]    Patterson, David A. ; Hennessy, John L.: Computer Organization and Design, Fifth Edition: The Hardware/Software Interface. 5th. San Francisco, CA, USA : Morgan Kaufmann Publishers Inc., 2013. – ISBN 0124077269, 9780124077263

[Sta03]   Stallman, Richard M.: Using the GNU Compiler Collection. GNU Press, 2003

[SW04]    Sanders, Peter ; Winkel, Sebastian: Super Scalar Sample Sort. In: Algorithms - ESA 2004, 12th Annual European Symposium, Bergen, Norway, September 14-17, 2004, Proceedings, 2004, S. 784–796

[Wil13]   Wild, Sebastian: Java 7's Dual Pivot Quicksort (Master thesis). 2013

[Wil16]   Wild, Sebastian: Dual-Pivot Quicksort and Beyond: Analysis of Multiway Partitioning and Its Practical Potential (PhD dissertation). 2016

[Zar96]   Zargham, Mehdi R.: Computer Architecture: Single and Parallel Systems. Upper Saddle River, NJ, USA : Prentice-Hall, Inc., 1996. – ISBN 0-13-010661-5
Declaration of Authorship (Eidesstattliche Erklärung)

I hereby declare in lieu of oath that I have written this thesis independently and without the use of any aids other than those stated. All passages taken verbatim or in substance from published or unpublished sources have been marked as such. This thesis has not been submitted in the same or a similar form to any other examination authority.

Place, Date                                                      Signature
