Enhancing machine learning optimization algorithms
by leveraging memory caching


Imen Chakroun, Tom Vander Aa and Thomas Ashby
IMEC / ExaScience Life Lab, Kapeldreef 75, Leuven, Belgium

Introduction

Stochastic Gradient Descent (SGD) is probably the most popular family of optimisation algorithms used in machine learning.

It iteratively searches for the best model parameters by taking small steps in the direction of the negative gradient of the cost (loss) function until it reaches its minimum.

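The update rule described above can be sketched as follows; this is a minimal Python/NumPy illustration, not the authors' code, and `grad`, `theta0` and `data` are placeholder names for the loss gradient, the initial parameters and the training set.

```python
import numpy as np

def sgd(grad, theta0, data, lr=0.01, epochs=10, seed=0):
    """Plain one-point SGD: each step moves the parameters a small
    distance along the negative gradient of the loss evaluated on a
    single randomly chosen training point."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            theta -= lr * grad(theta, data[i])  # small step against the gradient
    return theta
```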

Data access in SGD has extremely low temporal locality (randomness), so it makes minimal use of the small/fast layers of memory in an HPC memory hierarchy.

SW-SGD is an algorithm that exploits temporal locality of data access while combining the advantages of SGD with those of mini-batch SGD by leveraging the memory hierarchy.

Sliding window SGD

SW-SGD computes each gradient using new training points fetched from the large main memory plus a number of previously visited training points that are still in cache.

The extra training points in the cache are "free" due to:
- the cache effect
- accessing them uses dead time that would otherwise be spent waiting for new points from the main memory

Compared to one-point SGD and mini-batch SGD, the extra points in each gradient calculation give lower noise, smoothness and easier identification of convergence.

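A minimal sketch of the sliding-window idea, assuming a Python/NumPy setting: the deque stands in for the points that are still resident in cache, and the names (`sw_sgd`, `new_per_step`, `window_size`) are illustrative rather than taken from the authors' implementation.

```python
import numpy as np
from collections import deque

def sw_sgd(grad, theta0, data, lr=0.01, new_per_step=256, window_size=128,
           epochs=10, seed=0):
    """Sliding-window SGD sketch: each update averages the gradient over
    freshly fetched points plus recently visited points kept in a window
    (a software stand-in for points still resident in cache)."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    window = deque(maxlen=window_size)  # recently visited ("cached") points
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, new_per_step):
            new_idx = list(order[start:start + new_per_step])  # fetched from main memory
            batch_idx = new_idx + list(window)                 # new + "free" cached points
            g = np.mean([grad(theta, data[i]) for i in batch_idx], axis=0)
            theta -= lr * g
            window.extend(new_idx)  # the newest points become the cached window
    return theta
```

With new_per_step=256 and window_size=128 this mirrors the 256 new + 128 cached configuration reported for Adam in the results, except that the sketch uses plain gradient steps rather than the Adam update.
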
Experiments

A classification problem on the MNIST dataset and a regression problem on the ChEMBL dataset.

Tested with several gradient descent optimisation algorithms: Momentum, Adam, Adagrad...

We used a neural network with 3 layers of 100 hidden units each. Results are averaged over 5-fold cross-validation.

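The poster does not give the exact model definition, so the Keras snippet below is only a plausible reconstruction of the MNIST classifier (3 layers of 100 hidden units); the activations, output layer and loss are assumptions.

```python
import tensorflow as tf

# Hypothetical reconstruction of the experimental network; only the
# "3 layers of 100 hidden units" part comes from the poster.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),             # flattened 28x28 MNIST images
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"), # 10 digit classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```
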
Results

The fundamental idea of SW-SGD is valid for many GD variants without any change to the algorithm.

For Adam, using 256 new + 128 cached points shows less fluctuation and converges slightly earlier. However, plain mini-batches of size 256 or 512 are less efficient than size 128 (preliminary experiments), so the added value comes from reusing visited points, not from a bigger batch size.

SW-SGD incurs extra compute time (more points per gradient calculation).

The bigger the problem size and the batch size, the more SW-SGD reduces cache misses.

Conclusion

SW-SGD adapts SGD to the access characteristics of modern HPC memory to gain extra gradient noise smoothing.

It combines the epoch efficiency of SGD with the lower noise and easier-to-spot convergence of mini-batch SGD.

SW-SGD is applicable to different variants of SGD.

Acknowledgment

This work is funded by the European project ExCAPE, Grant Agreement no. 671555.
