
GPU acceleration for the pricing of the CMS spread option

Qasim Nasar-Ullah
University College London
Gower Street
London, United Kingdom
q.nasar-ullah@ucl.ac.uk
ABSTRACT
This paper presents a study on the pricing of a financial derivative using parallel algorithms which are optimised to run on a GPU. Our chosen financial derivative, the constant maturity swap (CMS) spread option, has an associated pricing model which incorporates several algorithmic steps, including: evaluation of probability distributions, implied volatility root-finding, integration and copula simulation. The novel aspects of the analysis are: (1) a fast, accurate new double precision normal distribution approximation for the GPU (based on the work of Ooura), (2) a parallel grid search algorithm for calculating implied volatility and (3) an optimised data and instruction workflow for the pricing of the CMS spread option. The study is focused on 91.5% of the runtime of a benchmark (CPU based) model and results in a speed-up factor of 10.3 when compared to our single threaded benchmark model. Our work is implemented in double precision using the NVIDIA GF100 architecture.
Categories and Subject Descriptors
D.1.3 [Concurrent Programming]: Parallel Programming;
G.1.2 [Approximation]: Special function approximations;
G.3 [Probability and Statistics]: Probabilistic algorithms (including Monte Carlo)
General Terms
Algorithms, Performance
Keywords
GPU, Derivative pricing, CMS spread option, Normal distribution, Parallel grid search
1. INTRODUCTION
Modern graphics processing units (GPUs) are high throughput devices with hundreds of processor cores. GPUs are able to launch thousands of threads in parallel and can be configured to minimise the effect of memory and instruction latency by an optimal saturation of the memory bus and arithmetic pipelines. Certain algorithms configured for the GPU are thought to offer speed performance improvements over existing architectures. In this study we examine the application of GPUs for pricing a constant maturity swap (CMS) spread option.
The CMS spread option, a commonly traded fixed income derivative, makes payments to the option holder based on the difference (spread) between two CMS rates C1, C2 (e.g. the ten and two year CMS rates). Given a strike value K, a CMS spread option payoff can be given as [C1 - C2 - K]^+, where [.]^+ = max[., 0]. The product makes payments, based on the payoff equation, to the holder at regular intervals (e.g. three months) over the duration of the contract (e.g. 20 years). The CMS rates C1, C2 are recalculated at the start of each interval and the payoff is made at the end of each interval.
Prior to discussing our GPU based CMS spread option model in Sections 4 and 5, we use Sections 2 and 3 to present two algorithms that are used within our model. Section 2 presents an implementation of the standard normal cumulative distribution function based on the work of Ooura [15]. The evaluation of this function is central to numerous problems within computational finance and dominates the calculation time of the seminal Black-Scholes formula [4]. We compare our algorithm to other implementations, discuss sources of performance gain and also comment on the accuracy of our algorithm. In Section 3 we present a GPU algorithm that evaluates implied volatility through a parallel grid search. The calculation of implied volatility is regarded as one of the most common tasks within computational finance [12]. Our method is shown to be robust and is suited to the GPU when the number of implied volatility evaluations is of order 100 or less. In Section 4 we present a short mathematical model for the pricing of CMS spread options. In Section 5 we present our GPU based implementation, providing various example optimisations alongside a set of performance results.
2. NORMAL DISTRIBUTION FUNCTION ON THE GPU
The calculation of the standard normal cumulative distribution function, or normal CDF, occurs widely in computational finance. The normal CDF, Phi(x), can be expressed as:

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2} \, dy,   (1)

and is typically calculated from a numerical approximation of the error function erf(x) and the complementary error function erfc(x), which are shown in Figure 1. Approximations for erf(x) and erfc(x) are often restricted to positive values of x, which are related to Phi(x) by:
Figure 1: Overlaying the functions erf(x), erfc(x) alongside the normal CDF Phi(x). Due to the symmetry of these functions, algorithms typically restrict actual evaluations to positive values of x.
\Phi(+x) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right] = \frac{1}{2}\left[2 - \mathrm{erfc}\left(\frac{x}{\sqrt{2}}\right)\right],   (2)

\Phi(-x) = \frac{1}{2}\left[1 - \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right] = \frac{1}{2}\,\mathrm{erfc}\left(\frac{x}{\sqrt{2}}\right).   (3)
The majority of normal CDF algorithms we surveyed approximate the inner region of x (close to x = 0) using erf(x) and approximate the outer region of x using erfc(x); this minimises cancellation error. The optimum branch point separating the inner and outer regions and minimising cancellation error is x ~= 0.47; this point is the intersection of erf(x) and erfc(x) shown in Figure 1. The algorithms implemented within our study are listed in Table 1, and their accuracy is examined at the end of this section.
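To make the relations in (2) and (3) concrete, the following sketch (ours, not one of the benchmarked implementations) shows how a double precision Phi(x) can be assembled from CUDA's built-in erf and erfc, branching at the erf/erfc intersection of roughly 0.477 discussed above:

__device__ double phi_from_erf_erfc(double x)
{
    const double w = fabs(x) * 0.7071067811865475;   /* |x| / sqrt(2) */
    /* inner region: erf-based form of (2)/(3); no cancellation since erf(w) < 0.5 */
    if (w < 0.4769362762044697)
        return 0.5 + copysign(0.5 * erf(w), x);
    /* outer region: erfc-based form, evaluated on the side that
       avoids computing 1 - (a small number) */
    double q = 0.5 * erfc(w);                        /* = Phi(-|x|) */
    return (x > 0.0) ? 1.0 - q : q;
}

This sketch is intended only to illustrate the branch structure; the ONORM algorithm of this section replaces the built-in functions with the Ooura polynomials for performance.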
An algorithmic analysis of various common approximations [1, 5, 10, 11] highlights areas of expected performance loss when they are implemented on GPUs. Firstly, the approximations are rational and so utilise at least one instance of the division operator (which has low throughput on the current generation of GPUs). The common presence of rational approximations, for example Pade approximants, stems from their superior numerical efficiency on traditional architectures [7]. Secondly, due to separate approximations over the range of x, the GPU may sequentially evaluate each approximation (known as branching) if the thread execution vector (known as a 'warp' of 32 threads in current NVIDIA architectures) contains values of x in different approximation regions. An algorithm falling into this class is the Cody algorithm [5], which is also considered a standard implementation within financial institutions and is used to benchmark our results.
Within our survey we identify the Ooura error function derf and complementary error function derfc [15] as being particularly suited to the GPU.
The Ooura error function derf is based on polynomial (as opposed to rational) approximations, where each approximation utilises high throughput multiplication and addition arithmetic only. The algorithm uses two explicit if branches, each having access to five sets of coefficients. As a result the algorithm consists of ten separate polynomial approximations operating on ten distinct regions of x; the ten regions can be seen in Figure 2. Having hard-coded polynomial coefficients, as opposed to storage in another memory type, offered the best performance. It is worthwhile to note that the addition of a single exponential function exp or a single division operation noticeably increased the execution time of a single derf approximation.

Figure 2: The different polynomial approximations within the Ooura error function derf over x. We observe two explicit branches (each having five sub-branches). The domain of x will be expanded by a factor of sqrt(2) ~= 1.4 when evaluating the normal CDF due to the transformation in (2) and (3).
In contrast, the Ooura complementary error function derfc uses a single polynomial approximation across the entire range of x, whilst utilising two instances of low throughput operations (namely exp and a division). We were unable to find a more parsimonious representation (in terms of low throughput operations) of the complementary error function within our survey.
We formulate a hybrid algorithm called ONORM to calculate the normal CDF (listed in Appendix A). The algorithm uses the innermost approximation of derf to cover the inner region |x| <= 1.4 (where -1.4 < x < 1.4) and derfc for the remaining outer region. The resulting branch point of x = 1.4 is greater than the optimum branch point of x = 0.47 and was chosen to maximise the interval evaluated by the higher performance derf approximation.
Our results are shown in Table 1, within which we compare the effects of uniform inputs of x, which increment gradually to minimise potential branching, against random inputs of x, which are randomised for increased branching. Our results show that ONORM offers its peak performance when x is in the inner range |x| <= 1.4. In this range it slightly outperforms derf (upon which ONORM is based) due to fewer control flow operations. Within our test samples we observe ONORM outperforming the Cody algorithm by factors ranging from 1.09 to 3.6.
Focusing on the random access results we see that when x is in |x| <= 10 ONORM performs slower than derfc due to each 'warp' executing both the inner derf and outer derfc approximations with probability 0.99. The performance difference is however limited since the inner derf approximation has around 46% of the cost of the derfc approximation. The use of random (against uniform) inputs significantly reduces the performance of Cody in the ranges |x| <= 1.4 and |x| <= 10 and of derf in the range |x| <= 10, highlighting the performance implications of branching.

Range of x                       |x| <= 0.2          |x| <= 1.4          |x| <= 10
Access                           Random   Uniform    Random   Uniform    Random   Uniform
derf                             4.72     4.7        4.81     4.82       1.0      4.
derfc                            2.22     2.24       2.23     2.24       2.23     2.24
Phi                              2.40     2.7        1.34     1.84       1.08     1.21
NV                               5.33     5.34       5.3      5.37       1.4      2.1
Cody                             4.3      4.3        1.34     2.39       0.96     1.63
ONORM                            4.77     4.81       4.88     4.83       1.71     2.28
ONORM speed-up vs. Cody (x)      1.09     1.10       3.6      2.02       1.78     1.40

Table 1: Calculations per second (x10^9) for the Ooura derf, Ooura derfc, Marsaglia Phi, NV, Cody and our ONORM algorithms. Active threads per possible active threads, or 'occupancy', fixed at 0.5. Uniform access attempts to minimise possible branching; random access attempts to maximise possible branching. GPU used: NVIDIA M2070.

Figure 3: Absolute error AE of the derf algorithm for the range -2 <= x <= 2.
We also comment on the Marsaglia Phi algorithm, which is based on an error function approximation [13]. It is a branchless algorithm which involves the evaluation of a Taylor series about the origin. A conditional while loop adds additional polynomial terms as x moves away from the origin. Within our GPU implementation we add a precalculated array of Taylor coefficients to eliminate all division operations. The algorithm performs single digit numbers of iterations close to the origin, and the iteration count grows exponentially as x moves towards the tails. We found that despite having extremely few iterations close to the origin, performance is limited by the presence of a single exp function. Our results also indicate that the Marsaglia algorithm is always dominated by the derf function (unless extensive branching occurs) and can perform at least as fast as the derfc function when x is within |x| <= 0.2.
A comparison is also made against the latest NVIDIA CUDA 4.1 implementation [14] of the error and complementary error functions: NV-Erf and NV-Erfc. As per the ONORM algorithm we can craft a hybrid algorithm NV which uses the innermost NV-Erf approximation for the inner region |x| <= 1.4 (consisting of multiplication and addition arithmetic only) and the branchless NV-Erfc approximation (consisting of three low throughput functions: an exp and two divisions) in the outer region. The inner approximation is more efficient than ONORM due to a smaller polynomial, whereas the outer approximation uses an additional division operation, yielding a loss in performance.

Figure 4: Absolute error AE of the derfc algorithm for the range -2 <= x <= 2.
The accuracy of our GPU approximations can be measured against an arbitrary precision normal CDF implemented within Mathematica (CDF[NormalDistribution]), referred to as Phi_actual(x). We measure absolute accuracy as:

AE = \Phi(x) - \Phi_{actual}(x),   (4)

and relative accuracy (which is amplified for increasingly negative values of x) as:

RE = \frac{\Phi(x)}{\Phi_{actual}(x)} - 1.   (5)
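As an illustration of how (4) and (5) might be evaluated outside Mathematica, the following host-side sketch (ours) substitutes a long double erfcl-based reference for Phi_actual; approx is assumed to be a value computed by the GPU approximation and copied back to the host:

#include <math.h>
#include <stdio.h>

/* reference Phi using long double erfc; a stand-in for the arbitrary
   precision Mathematica values used in the paper */
static long double phi_ref(long double x)
{
    return 0.5L * erfcl(-x * 0.707106781186547524L);
}

/* report absolute and relative error per (4) and (5) */
static void report_error(double approx, double x)
{
    long double ref = phi_ref((long double)x);
    printf("x=% .4f  AE=% .3Le  RE=% .3Le\n", x,
           (long double)approx - ref, (long double)approx / ref - 1.0L);
}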
The ONORM branches combine to reduce cancellation error by using the inner region of derf and the outer region of derfc; this can be observed in Figures 3 and 4. However, having chosen our branch point as x = 1.4 rather than x = 0.47, we observe in the range -1.4 < x < -0.47 a small increase in the ONORM maximum relative error. We also compare the accuracy of ONORM against the NV and Cody algorithms. Comparative relative error plots are shown in Figures 5 and 6. Over the range -22 < x < -9 the NV, Cody and ONORM algorithms exhibit comparable relative error.
As seen in Figure 6, within the inner region |x| <= 1.4, ONORM was inferior to both NV and Cody. It is apparent, therefore, that the inner region of ONORM can be improved in terms of speed and accuracy by utilising the error function approximation within NV-Erf. The maximum absolute errors of NV, Cody and ONORM were all of the order of the double precision machine epsilon.

Figure 5: Maximum bound of relative error RE for the NV, Cody and ONORM algorithms for the range -22 < x < -9. Lower values depict higher relative error.

Figure 6: Maximum bound of relative error RE for the NV, Cody and ONORM algorithms for the range -2 <= x <= 2. Lower values depict higher relative error.
3. IMPLIED VOLATILITY ON THE GPU
The evaluation of Black [3] style implied volatility occurs in many branches of computational finance. The Black formula (which is closely related to the seminal contribution in derivative pricing, the Black-Scholes formula [4]) calculates an option price V as a function V(S, K, sigma, T), where S is the underlying asset value, K is the strike value, sigma is the underlying asset volatility and T is the time to maturity. The implied volatility calculation is based on a simple formula inversion where the implied volatility sigma_i is now a function sigma_i(V_m, S, K, T), where V_m is an observed option price. Due to the absence of analytic methods to calculate implied volatility, an iterative root-finding method is typically employed to find sigma_i such that:

V(S, K, \sigma_i, T) - V_m = 0.   (6)
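For reference, a minimal sketch (ours) of a Black [3] style call price of the form V(S, K, sigma, T), with discounting omitted and onorm() standing in for the normal CDF of Section 2, is:

__device__ double black_call(double S, double K, double sigma, double T)
{
    double sig_sqrt_T = sigma * sqrt(T);
    double d1 = (log(S / K) + 0.5 * sigma * sigma * T) / sig_sqrt_T;  /* cf. (16) */
    double d2 = d1 - sig_sqrt_T;
    return S * onorm(d1) - K * onorm(d2);   /* undiscounted call value */
}

The price() calls in the listings below are assumed to be of this general form.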
The function V(.) appears well suited to efficient root-finding based on Newton's method.
                        Thread block size Q
delta      BS     16     32     64     128    512
10^-6      27      7      6      5       4      3
10^-8      34      9      7      6       5      4
10^-10     40     10      8      7       6      5
10^-12     47     12     10      8       7      6
10^-14     54     14     11      9       8      6

Table 2: Iterations needed, I, to calculate implied volatility based on a parallel grid search. Domain size d = 100; delta represents the search accuracy. BS represents binary search.
For example, the function is monotonically increasing in sigma, it has a single analytic inflexion point with respect to sigma and it has an analytic expression for its derivative with respect to sigma (dV/dsigma). However, within the context of implied volatility, dV/dsigma can tend to 0, resulting in non-convergence [12]. Predominant methods to evaluate implied volatility are therefore typically based on Newton with bisection or on Brent-Dekker algorithms [16], the latter being preferred.
GPU based evaluation of these functions (particularly Brent-Dekker) can result in a loss of performance. Firstly, the high register usage of these functions is generally suboptimal on the light-weight thread architecture of GPUs. Secondly, the algorithms may execute in a substantially increased runtime due to conditional branch points coupled with an unknown number of iterations to convergence. Finally, numerous contexts in computational finance (including the CMS spread option model) are concerned with obtaining the implied volatility of small groups of options, hence single-thread algorithms such as Newton and Brent-Dekker can result in severe GPU under-utilisation (assuming sequential kernel launching).
We therefore develop a parallel grid search algorithm that has the following properties: it uses a minimal amount of low throughput functions, it is branchless and executes in a fixed time frame, and it can be used to target an optimum amount of processor utilisation when the number of implied volatility evaluations is low.
The parallel grid search algorithm operates on the following principles: we assume the domain of sigma_i is of size d and the required accuracy of sigma_i is delta. The required accuracy can be guaranteed by searching over U units, where

U = \frac{d}{\delta}.   (7)

Using a binary search method (which halves the search interval with each iteration) the number of required iterations is given by:

I_{BS} = \lceil \log_2 U \rceil,   (8)

where the ceiling i = \lceil x \rceil is the smallest integer i such that i >= x.
Alternately, a parallel grid search can be employed using a GPU 'thread block' with Q threads and Q search areas (a thread block permits groups of threads to communicate via a 'shared memory' space). Using a parallel grid search, the number of required iterations I is given by:

I = \lceil \log_Q U \rceil.   (9)
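A small host-side sketch (ours) of equations (7)-(9), reproducing one row of Table 2:

#include <math.h>
#include <stdio.h>

/* iterations for a binary search, equation (8); U = d / delta per (7) */
static int iters_binary(double d, double delta)
{
    return (int)ceil(log2(d / delta));
}

/* iterations for a parallel grid search over Q threads, equation (9) */
static int iters_grid(double d, double delta, int Q)
{
    return (int)ceil(log2(d / delta) / log2((double)Q));
}

int main(void)
{
    /* d = 100, delta = 1e-10 as in Table 2: expect 40 (binary) and 8 (Q = 32) */
    printf("%d %d\n", iters_binary(100.0, 1e-10), iters_grid(100.0, 1e-10, 32));
    return 0;
}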

Figure 7: Calculations per second of the parallel grid search implied volatility for various thread block sizes. The number of implied volatility calculations is equal to the number of thread blocks.
Figure 8: Time taken for the evaluation of the parallel grid search implied volatility for various thread block sizes. The number of implied volatility calculations is equal to the number of thread blocks.
The number of iterations needed for various thread block sizes to guarantee a given accuracy delta over a given domain size d can be estimated a priori; an example is shown in Table 2.
The parallel grid search algorithm will thus calculate the implied volatility of B options by launching B thread blocks, with each thread block having Q threads. It is noted that while numerous GPU algorithms seek to maximise thread execution efficiency (since the number of threads is fixed), we are primarily concerned with maximising thread block execution efficiency (since the number of B thread blocks or B options is fixed). We offer a brief outline of the key steps within a CUDA implementation:
1. Within our listing we use the following example parameters, which can be passed directly from the CPU, where base is a preprocessed variable to enable the direct use of high throughput bitshifting on the GPU and where the thread block size blocksize is assumed to be a power of two:

left = 1.0e-9;    //minimum volatility
delta = 1.0e-12;  //desired accuracy
base = log2f(blocksize);
2. We begin a for loop over the total iterations iter as calculated in (9). The integer mult is used to collapse the grid size with each iteration; its successive values are Q^(iter-1), Q^(iter-2), ..., Q, 1. Subsequently we calculate our volatility guess vol (sigma) and the error err, given by the left hand side of (6), by:

for (int i = 0; i < iter; i++)
{
    mult = 1 << (base * (iter - 1 - i));
    //vol guess over the entire interval
    vol = left + delta*mult*threadIdx.x;
    //calculation of error: V(vol, ...) - V_m
    err = price(vol, ...) - price_m;
The volatility guess now covers the entire search interval. If, in (9), U is not a power of Q, our first iteration overstates our initial search size d (which extends out to the right of the left interval left). This does not affect the algorithm's accuracy and minimally affects performance due to a partially redundant additional iteration. Special care must be taken to ensure mult does not overflow due to excessive bitshifts; in our final implementation we avoided this problem by using additional integer multiples (e.g. mult2, mult3).
3. We found it optimal to declare separate shared memory arrays (prefixed sh_) to store the absolute error and the sign of the error. This prevents excessive usage of the absolute value function fabs within the reduction. A static index is also populated to provide the left interval location for the next iteration:

//absolute error
sh_err[threadIdx.x] = fabs(err);
//sign of error
sh_sign[threadIdx.x] = signbit(err);
//static index
sh_index[threadIdx.x] = threadIdx.x;
4. After a parallel reduction to compute the index of the minimum absolute error (stored in sh_index[0]), the left bracket is computed by checking the sign of the minimum error location using properties of the signbit function:

//V(vol, ...) - V_m >= 0
if (!sh_sign[sh_index[0]])
    left = left + (sh_index[0] - 1)*delta*mult;
//V(vol, ...) - V_m < 0
else left = left + sh_index[0]*delta*mult;
}
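The parallel reduction referred to in step 4 is not listed above; one possible shared-memory argmin reduction (a sketch of ours, assuming blockDim.x is a power of two) that would sit between the step 3 and step 4 fragments is:

__syncthreads();
for (int s = blockDim.x / 2; s > 0; s >>= 1)
{
    if (threadIdx.x < s &&
        sh_err[sh_index[threadIdx.x + s]] < sh_err[sh_index[threadIdx.x]])
        sh_index[threadIdx.x] = sh_index[threadIdx.x + s];
    __syncthreads();
}
//sh_index[0] now holds the index of the minimum absolute error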
Our results showing the calculations per second for various thread block sizes are shown in Figure 7, where a number of effects are visible. Firstly, consider thread block sizes of 64, 96 and 128 threads. As we increase the number of options for which we compute implied volatility, the calculations per second become a strong linear function of the number of thread blocks launched, exhibited in the plateauing effect to the right of Figure 7. Secondly, consider thread block sizes of 16 and 32 threads. It is observed that these kernels maintain load imbalances where additional calculations can be undertaken at no additional cost. In our example peak performance was achieved by thread blocks of size 32. The lower peaks associated with 16 threads (as opposed to 32 threads) are a consequence of 16 threads utilising an additional two iterations as given by (9).
By studying Figure 8 we observe how load imbalances are linked to the 'number of passes' taken through the GPU. We introduce this phenomenon as follows: the GPU used in this study consisted of 14 multiprocessors, each accommodating up to 8 thread blocks, thus a maximum of 14 x 8 = 112 thread blocks are active on the GPU at any given instance. In our implementation this maximum was achieved with 16, 32 and 64 threads per block, where we observe an approximate doubling of execution time as we vary from 112 to 113 total thread blocks. This is due to the algorithm scheduling an additional 'pass' through the GPU. Focusing on larger thread block sizes we see that a similar 'pass' effect is observed, limited to the left hand side of Figure 8. Due to the large differences in time, the number of passes should be carefully studied before implementing this algorithm. The number of passes can be estimated as follows:

\text{Number of passes} = \left\lceil \frac{B}{R \, N_M} \right\rceil,   (10)

where B is the total number of thread blocks requiring evaluation, R is the total number of thread blocks scheduled on each multiprocessor and N_M is the total number of multiprocessors on the GPU. R should be obtained from hardware profiling tools, as the hardware may schedule different numbers of thread blocks per multiprocessor than would be expected by a static analysis of processor resources.
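As a worked example of (10), the following small helper (ours) reproduces the doubling described above: with R = 8 blocks per multiprocessor and N_M = 14 multiprocessors, 112 thread blocks need one pass and 113 need two:

static int passes(int total_blocks, int blocks_per_sm, int num_sm)
{
    int active = blocks_per_sm * num_sm;          /* blocks resident per pass */
    return (total_blocks + active - 1) / active;  /* ceiling division, eq. (10) */
}
/* passes(112, 8, 14) == 1, passes(113, 8, 14) == 2 */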
In order to provide a comparison against a single-thread algorithm on the GPU (that is, where the number of implied volatility calculations is equal to the number of GPU threads launched) we implement a parsimonious representation of Newton's method (we preprocess input data to ensure convergence). Although the number of Newton iterations may vary substantially, by comparing Figures 8 and 9 we conclude that the parallel grid search algorithm is likely to offer a comparable runtime when the number of implied volatility evaluations is of order 100.
Within our parallel grid search algorithm the size of a thread block is equal to the size of the parallel grid Q. An alternate algorithm would instead accommodate multiple parallel sub-grids within a single thread block (with each sub-grid evaluating a single implied volatility). An increase in the number of sub-grids per thread block would result in a decrease in the number of blocks needed for evaluation whilst increasing the number of iterations and the control flow needed. Such an approach is advantageous when dealing with large numbers of implied volatility evaluations. For instance, using Figure 8, when the number of evaluations is between 113 and 224, a thread block of 64 threads makes two passes; however, if the thread block were split into two parallel sub-grids of 32 threads, only one pass would be needed. Thus, in this instance, the total execution time would approximately halve. The use of a single-thread algorithm is ultimately an idealised version of this effect, where each sub-grid is effectively reduced to a single thread, drastically reducing the number of passes, offset against the increased time complexity of evaluating on a single thread.

Figure 9: Time taken for the evaluation of idealised Newton gradient descent implied volatility; the lowest line represents five iterations, lines increment by one iteration and the highest line represents 28 iterations. Results are obtained from the execution of a single thread block.
4. MATHEMATICAL MODEL FOR CMS
SPREAD OPTION PRICING
The CMS spread option price is calculated by firstly estimating stochastic processes to describe the two underlying CMS rates C1, C2 and secondly using such stochastic processes to obtain (via a copula simulation) the expected final payoff [C1 - C2 - K]^+. These two steps are repeated for each interval start date and the option price is subsequently evaluated by summing the discounted expected payoffs relating to each interval payment date. We describe the model evaluation in more detail using the following steps:
1. As stated, we first require a stochastic process to describe the underlying CMS rate. The first step to obtaining this process is to calculate the CMS rate C itself using the put-call parity rule. This calculates the CMS rate C as a function of the price of a CMS call option (Call_K), a CMS put option (Put_K) and a strike value K. We set K equal to an observable forward swap rate. Put-call parity results in the value of C as:

C = Call_K - Put_K + K.   (11)

In order to calculate the price of the CMS options (Call_K, Put_K) we follow a replication argument [8] whereby we decompose the CMS option price into a portfolio of swaptions R(k) which are evaluated at different strike values k (swaptions are options on a swap rate, for which we have direct analytic expressions). The portfolio of swaptions can be approximated by an integral [8], of which the main term is:

Call_K \approx \int_K R(k)\, dk,   (12)

with a corresponding integral expression (13) for Put_K.
2. With the CMS rate C captured we also require information regarding the volatility smile effect [9]. The volatility smile describes changing volatilities as a result of changes in an option's strike value K. We therefore evaluate CMS call options (Call_K) at various strikes surrounding C and calculate the corresponding option implied volatilities.
3. We calibrate a stochastic process incorporating the above strikes, prices and implied volatilities. We thus obtain unique stochastic processes, expressing the volatility smile effect, for each of the underlying rates C1, C2. The stochastic process is typically based on a SABR class model [9].
4. The price of a spread option contract based on two underlyings C1, C2 with the payoff [C1 - C2 - K]^+ is:

\int\!\!\int_A [C_1 - C_2 - K]^+ \, f(C_1, C_2)\, dC_1\, dC_2,   (14)

where f(C1, C2) is a bivariate density function of both underlyings and A is the range of the given density function. Obtaining the bivariate density function is non-trivial and a standard approach is to instead calculate (14) using copula methods [6]. The copula method allows us to estimate a bivariate density function f(C1, C2) through a copula C. The copula is a function of the component univariate marginal distributions F1, F2 (which can be directly ascertained from our stochastic processes for C1, C2) and a dependency structure (for example a historical correlation between C1 and C2). The price of a spread option can thus be given as:

\int_0^1\!\!\int_0^1 [F_1^{-1}(u_1) - F_2^{-1}(u_2) - K]^+ \, c(u_1, u_2)\, du_1\, du_2,   (15)

where (u1, u2) are uniformly distributed random numbers on the unit square and c is the copula density.
The integral is subsequently approximated by a copula simulation. This involves firstly obtaining a set of N two-dimensional uniformly distributed random numbers (u1, u2), secondly incorporating a dependency structure between (u1, u2) and finally obtaining a payoff by applying the inverse marginal distributions F1^{-1}, F2^{-1} to (u1, u2). The final result will be the average of the N simulated payoffs.
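A sketch (ours, not the production kernel) of the payoff step of this simulation, assuming the correlated uniforms u1, u2 and the discretised inverse marginal tables invF1, invF2 have already been prepared on the device, is:

__global__ void copula_payoff(const double* u1, const double* u2,
                              const double* invF1, const double* invF2,
                              int table_size, double K, double* payoff, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    //linear interpolation into the discretised inverse marginal tables
    double p1 = u1[i] * (table_size - 1);
    double p2 = u2[i] * (table_size - 1);
    int j1 = min((int)p1, table_size - 2);
    int j2 = min((int)p2, table_size - 2);
    double c1 = invF1[j1] + (p1 - j1) * (invF1[j1 + 1] - invF1[j1]);
    double c2 = invF2[j2] + (p2 - j2) * (invF2[j2 + 1] - invF2[j2]);
    double s = c1 - c2 - K;
    payoff[i] = s > 0.0 ? s : 0.0;   //[C1 - C2 - K]+
}

The N payoffs would then be averaged (for example with a standard reduction) to approximate (15).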
5. GPU MODEL IMPLEMENTATION
The mathematical steps of the previous section consist of four dominant computational tasks (in terms of the time taken for computation). Based on these tasks it is instructive to relabel our model into the following stages:
Integration An integration as shown in (12) and (13), relating to steps 1 and 2 in Section 4.
Figure 10: Model flowchart for the CMS spread option (distinguishing CPU operations, such as Load Data and Calibration, from the GPU stages).
Calibration A calibration to obtain CMS rate processes, relating to step 3 in Section 4. This stage is not implemented within our GPU model.
Marginals The creation of lookup tables that represent discretised univariate marginal distributions F1, F2 for an input range of C1, C2. This allows us to evaluate the inverse marginal distributions F1^{-1}, F2^{-1} on (u1, u2) through an interpolation method, relating to step 4 in Section 4.
Copula Simulation of the copula based on (15), relating to step 4 in Section 4.
When presenting the timing results of each of the above stages we include the cost of CPU-GPU memory transfers. A flowchart describing the evaluation of the CMS spread option model is shown in Figure 10, within which we have an additional CPU operation (Load Data) which loads a set of market data for the calibration stage. This function is overlapped with part 2 of the integration stage, hence there is no time penalty associated with this operation. As a result of our GPU implementation we obtain the performance results shown in Table 3. Within our results we set the number of intervals or start dates t = 96; more generally t is in the range [40, 200]. The speed-up impact upon changing t is minimal since our benchmark model is a strong linear function of t, as are the calibration and copula stages, which account for 96.9% of the execution time within our final implementation. Results are based on an M2070 GPU and an Intel Xeon L5640 CPU with clock speed 2.26 GHz running on a single core. Compilation is under Microsoft Visual Studio with compiler flags for debugging turned on for both the CPU and GPU implementations. Preliminary further work suggests that the use of non-debug compiled versions results in a significantly larger proportional time reduction for the GPU implementation. This indicates that the stated final speed-up results are an underestimate.
        Time (ms)   Speed-up (x)   Main kernel time (%)   Replays (%)   L1 Hit (%)
V1      15.41       22.87          38.01                  .47           4.12
V2       9.50       37.10          41.78                  .78           3.18
V3       7.25       48.59          21.43                  0.13          92.80

Table 4: Integration results: V1 = separate underlying evaluation, V2 = combined underlyings, V3 = preprocessing stage. 'Replays' and 'L1 Hit' refer to local memory (defined in Section 5.1).
        Time (ms)   Speed-up (x)   Main kernel time (%)   Replays (%)   L1 Hit (%)
V1       5.82        48.38         01.8                   10.33         8.73
V2       4.76        59.17          7.70                  10.33         8.73
V3       1.43       197.30         43.3                   0.00          99.97
V4       1.35       207.84         40.33                  0.04          90.77

Table 5: Marginals results: V1 = separate underlying evaluation, V2 = combined underlyings, V3 = preprocessing stage, V4 = optimum thread block size. 'Replays' and 'L1 Hit' refer to local memory (defined in Section 5.1).
Within the context of our CMS model, the underestimate is considered negligible since the final speed-up is sufficiently close to Amdahl's [2] theoretical maximum. Our GPU model targets the integration, marginals and copula stages, accounting for 91.5% of the benchmark model runtime.
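For completeness, with a fraction p = 0.915 of the benchmark runtime targeted for acceleration, Amdahl's law gives the maximum theoretical speed-up quoted at the end of Section 5.1:

S_{max} = \frac{1}{1 - p} = \frac{1}{0.085} \approx 11.76.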
5.1 Optimisation examples and results
Our analysis focuses primarily on the integration and marginals stages, as the copula stage offered fewer avenues of optimisation due to its relative simplicity. The 'main kernel', or the GPU function that dominates runtime, is similar for both the integration and marginals stages. The main kernel is respectively used for the pricing of swaptions and the calculation of marginal distributions, both of which require the evaluation of SABR [9] type formulae. The performance bottleneck within our main kernel was the high number of arithmetic instructions; this was determined through a profiling analysis and timing based on code removal.
Within the integration and marginals stages we identify a grid of size t x 2 x n, where t is the number of start dates, 2 is the number of underlyings and n is the size of the integration grid or the number of points within the discretised marginal distribution F. Within the grid of t x 2 x n we observe significant data parallelism and target this grid as a base for our GPU computation. For the integration stage we set n = 82 (more generally n is in the range [50, 100]) and for the marginals stage we set n = 512 (more generally n is in the range [250, 1000]). Our choice of n is based on error considerations outside the scope of this paper. We briefly describe a set of optimisation steps that were relevant in the context of our model; corresponding results are shown in Tables 4 and 5:
1. In our first implementation (V1) we parallelise calculations on a grid of size t x n and sequentially evaluate each underlying. We use kernels of two sizes: Type I kernels are of size t (which launch on a single thread block with t threads) and Type II kernels are of size t x n (which launch on t thread blocks, each with n threads). Within the GF100 architecture we are limited to 1024 threads per block, thus we must ensure that n and t are <= 1024.

Figure 11: An illustration of occupancy affecting kernel execution time when the number of passes given by (10) is very low; the plot compares occupancies of 0.33 and 0.5.
2. In our second implementation (V2) we combine both underlyings such that Type I kernels are now of size t x 2 (which launch on a single thread block with t x 2 threads) and Type II kernels are of size t x 2 x n (which launch on t thread blocks, each with 2n threads). For Type I kernels we found the additional underlying evaluation incurred no additional cost due to under-utilisation of the processor, thus reducing Type I kernel execution times by approximately half (assuming sequential kernel launching).
For the main kernel in the integration stage (a Type II kernel) we found that a similar doubling of thread block sizing (from 82 to 164 threads) led to changed multiprocessor occupancy. As shown in Figure 11, small changes in processor occupancy can amplify performance differences when the number of passes given by (10) is small (this effect was also described in Section 3). As a consequence, we see in Table 6 a significant reduction in the execution time of our main kernel. Type II kernels in the marginals stage were not combined as this led to unfeasibly large thread block sizes and a loss of performance.
3. In our third implementation (V3) we undertook a preprocessing step to simplify the SABR type evaluations used by the main kernels. Although formulae will vary based on the particular SABR class model being employed, the inner-most code will typically involve several interdependent low throughput transformations of an underlying asset value or rate S, strike K, volatility sigma and time to maturity T, in order to calculate a set of expressions of the form (as found in the Black [3] formula):

\frac{\ln(S/K) + (\sigma^2/2)\,T}{\sigma\sqrt{T}}.   (16)
Stage         Benchmark Time (ms)   (%)      Final Time (ms)   (%)      Speed-up (x)   Kernels Launched
Integration   352.45                12.30    7.25              2.60     48.59          4
Calibration   243.74                 8.51    243.74            87.34    1.00           N/A
Marginals     281.49                 9.82    1.35              0.49     207.84         0
Copula        1,987.40              69.37    26.73             9.58     74.34          1
Overall       2,865.08              100      279.08            100      10.27          2

Table 3: Benchmark and final performance results of the CMS spread option pricing model, where the number of intervals or start dates t = 96.
                                    V1       V2
Total evaluations, t x 2 x n        15744    15744
Thread block size                   82       164
Total blocks, B                     192      96
Occupancy                           0.313    0.375
Possible active blocks, R x N_M     82       49
Passes needed by (10)               3        2
Measured time reduction             N/A      32.2%

Table 6: Time reduction of the integration stage 'main kernel' through changes in the number of passes.
Within each grid of size n, parameter changes are strictly driven by changes in a single variable K. Within our optimisation efforts we therefore move the interdependent transformations from the grid of size t x 2 x n to a smaller preprocessing kernel with grid size t x 2. Hence, assuming we were to evaluate (16), our preprocessing kernel would calculate the terms alpha = ln(S) and beta = sigma*sqrt(T). This would result in the idealised computational implementation of (16) as:

normalcdf((alpha - log(k))/beta + 0.5*beta).   (17)

As such the inner-most code conducts significantly fewer low throughput operations, resulting in a performance gain (a sketch of this preprocessing split is given after this list of optimisation steps). Such a preprocessing step can also be used to improve CPU performance. Within the GPU implementation, the preprocessing approach increased the amount of data needed by kernels from high latency global memory. However, since our main kernels have arithmetic bottlenecks, the additional memory transactions had little effect on computation time. This is in contrast to numerous GPU algorithms which have memory bottlenecks, where additional memory transactions are likely to affect computation time. As a result of our preprocessing we observed a large reduction in the local memory replays and an increase in the L1 local hit ratio, which we define below:
High levels of complex computation can often result in a single GPU thread running out of allocated registers used to store intermediate values. An overflow of registers is assigned to the local memory space, which has a small L1 cache; misses to this cache result in accesses to higher latency memory. Within our results we therefore wish to maximise the L1 local hit ratio, that is, the proportion of loads and stores to local memory that reside within the L1 cache. However, if the number of instructions associated with local memory cache misses is insignificant in comparison to the total number of instructions issued, we can somewhat ignore the effect of a low L1 local hit ratio. Therefore we also measure the local memory replays, that is, the number of local memory instructions that were caused by misses to the L1 cache as a percentage of the total instructions issued.

          Time (ms)   Speed-up (x)
Ver 1      4.62       430.27
Ver 2     18.51       107.39
Ver 3     12.03       165.22
Ver 4     26.73        74.34

Table 7: Copula results: Ver 1 = inverse normal CDF only, Ver 2 = Ver 1 + normal CDF, Ver 3 = Ver 1 + interpolation, Ver 4 = all components.
4. In our fourth implementation (V4) of the marginals stage we experimented with different sizes of thread blocks; as a result we obtained a small reduction in main kernel execution times (to 88% of V3). Integration stage kernels did not benefit from further optimisation.
As a result of our integration and marginals stage optimisations we observed speed-ups of 48.59 and 207.84 respectively against our benchmark implementation.
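As an illustration of the preprocessing split behind (16) and (17), the following sketch uses our own kernel and variable names and is not the production code:

__global__ void preprocess(const double* S, const double* sigma, const double* T,
                           double* alpha, double* beta, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m)
    {
        alpha[i] = log(S[i]);            //computed once per (start date, underlying)
        beta[i]  = sigma[i] * sqrt(T[i]);
    }
}

//inner-most expression of the main kernel, per (17)
__device__ double d1_preprocessed(double alpha, double beta, double k)
{
    return (alpha - log(k)) / beta + 0.5 * beta;
}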
Within the copula stage we targeted a parallel grid consisting of the number of simulations N, which we launched sequentially for each start date t. The sequential evaluation was justified as N is a very large multiple of the total number of possible parallel threads per GPU. Within this stage we were limited by two numerical tasks: (1) the evaluation of the normal CDF and (2) a linear interpolation. The extent of these limitations is shown in Table 7, which presents timing results for different versions that implement only part of the algorithm. The evaluation of the normal CDF dominated and was conducted by the ONORM algorithm presented in Section 2. In regards to the interpolation, we were unable to develop algorithms that suitably improved performance. In particular we were unable to benefit from hardware texture interpolation, which is optimised for single precision contexts. Our final copula implementation resulted in a speed-up of 74.34. The final GPU based model obtained an overall 10.27x speed-up, which is 87.3% of Amdahl's [2] maximum theoretical speed-up of 11.76.
6. CONCLUSIONS
Calculation of the normal CDF through our proposed ONORM algorithm is well suited to the GPU architecture. ONORM exhibits comparable accuracy against the widely adopted Cody algorithm whilst being faster; thus it is likely to be the algorithm of choice for double precision GPU based evaluation of the normal CDF. The algorithm can be further improved by using the NV-Erf algorithm in the inner range.
Our parallel grid search implied volatility algorithm is applicable to GPUs when dealing with small numbers of implied volatility evaluations. The algorithm is robust, guarantees a specific accuracy and executes in a fixed time frame. For larger groups of options, the algorithm is unsuitable as computation time will grow linearly at a much faster rate than GPU alternatives which use a single thread per implied volatility calculation.
Within our GPU based CMS spread option model we highlighted the importance of managing occupancy for kernels with a low number of passes, whilst also obtaining a particular performance improvement through the use of preprocessing kernels. In our experience industrial models do not preprocess functions due to issues such as enabling maintenance and reducing obfuscation, an idea which needs to be challenged for GPU performance.
Further work will consider calibration strategies, traditionally problematic on GPUs due to the sequential nature of calibration algorithms (consisting of multi-dimensional optimisation). Further work will also consider the wider performance implications of GPU algorithms within large pricing infrastructures found within the financial industry.
7. ACKNOWLEDGEMENTS
I acknowledge the assistance of Ian Eames, Graham Barrett and anonymous reviewers. The study was supported by the UK PhD Centre in Financial Computing, as part of the Research Councils UK Digital Economy Programme, and BNP Paribas.
8. REFERENCES
[1] A. Adams. Algorithm 39: Areas under the normal curve. Computer Journal, 12(2):197-198, 1969.
[2] G. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, pages 483-485, 1967.
[3] F. Black. The pricing of commodity contracts. Journal of Financial Economics, 3(1-2):167-179, 1976.
[4] F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637-654, 1973.
[5] W. Cody. Rational Chebyshev approximations for the error function. Mathematics of Computation, 23(107):631-637, 1969.
[6] S. Galluccio and O. Scaillet. CMS spread products. In R. Cont, editor, Encyclopedia of Quantitative Finance, volume 1, pages 269-273. John Wiley & Sons, Chichester, UK, 2010.
[7] C. Gerald and P. Wheatley. Applied Numerical Methods. Addison-Wesley, Reading, MA, 2004.
[8] P. Hagan. Convexity conundrums: pricing CMS swaps, caps and floors. Wilmott Magazine, 4:38-44, 2003.
[9] P. Hagan, D. Kumar, and A. Lesniewski. Managing smile risk. Wilmott Magazine, 1:84-108, 2002.
[10] J. Hart, E. Cheney, C. Lawson, H. Maehly, C. Mesztenyi, J. Rice, H. Thacher Jr, and C. Witzgall. Computer Approximations. John Wiley & Sons, New York, NY, 1968.
[11] I. Hill. Algorithm AS 66: The normal integral. Applied Statistics, 22(3):424-427, 1973.
[12] P. Jäckel. By implication. Wilmott Magazine, 26:60-66, 2006.
[13] G. Marsaglia. Evaluating the normal distribution. Journal of Statistical Software, 11(5):1-11, 2004.
[14] NVIDIA Corp. CUDA Toolkit 4.1. [Online]. Available: http://developer.nvidia.com/cuda-toolkit-41, 2012.
[15] T. Ooura. Gamma / error functions. [Online]. Available: http://www.kurims.kyoto-u.ac.jp/~ooura/gamerf.html, 1996.
[16] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling. Numerical Recipes. Cambridge University Press, New York, NY, 2007.
APPENDIX
A. LISTING OF THE ONORM ALGORITHM
/* Based on the derf and derfc functions of
   Takuya Ooura (email: ooura@mmm.t.u-tokyo.ac.jp)
   http://www.kurims.kyoto-u.ac.jp/~ooura/gamerf.html */
__device__ inline double onorm(double x)
{
    double t, y, u, w;
    x *= 0.7071067811865475244008;       /* x / sqrt(2) */
    w = x > 0 ? x : -x;
    if (w < 1)
    {
        /* inner region: Ooura derf polynomial, giving erf(x) */
        t = x * x;
        y = ((((((((((((5.958930743e-11 * t - 1.13739022964e-9) * t
            + 1.466005199839e-8) * t - 1.635035446196e-7) * t
            + 1.6461004480962e-6) * t - 1.492559551950604e-5) * t
            + 1.2055331122299265e-4) * t - 8.548326981129666e-4) * t
            + 0.00522397762482322257) * t - 0.0268661706450773342) * t
            + 0.11283791670954881569) * t - 0.37612638903183748117) * t
            + 1.12837916709551257377) * x;
        y = 0.5 + 0.5 * y;               /* Phi via (2)/(3) */
    }
    else
    {
        /* outer region: Ooura derfc polynomial, giving 0.5 * erfc(w) */
        t = 3.97886080735226 / (w + 3.97886080735226);
        u = t - 0.5;
        y = (((((((((0.00127109764952614092 * u + 1.19314022838340944e-4) * u
            - 0.003963850973605135) * u - 8.70779635317295828e-4) * u
            + 0.00773672528313526668) * u + 0.00383335126264887303) * u
            - 0.0127223813782122755) * u - 0.0133823644533460069) * u
            + 0.0161315329733252248) * u + 0.0390976845588484035) * u
            + 0.00249367200053503304;
        y = ((((((((((((y * u - 0.0838864557023001992) * u
            - 0.119463959964325415) * u + 0.0166207924969367356) * u
            + 0.357524274449531043) * u + 0.805276408752910567) * u
            + 1.18902982909273333) * u + 1.37040217682338167) * u
            + 1.31314653831023098) * u + 1.07925515155856677) * u
            + 0.774368199119538609) * u + 0.490165080585318424) * u
            + 0.275374741597376782) * t * 0.5 * exp(-x * x);
        y = x < 0 ? y : 1 - y;           /* Phi via (2)/(3) */
    }
    return y;
}
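A minimal usage sketch (ours) that exercises the listing above from a kernel:

__global__ void onorm_test(const double* x, double* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = onorm(x[i]);
}
//e.g. onorm(0.0) returns 0.5 and onorm(-1.96) is approximately 0.025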
