Imen Chakroun, Tom Vander Aa and Thomas Ashby
IMEC / ExaScience Life Lab, Kapeldreef 75, Leuven, Belgium
Introduction

Stochastic Gradient Descent (SGD) is probably the most popular family of optimisation algorithms used in machine learning. It iteratively searches for the best model parameters by taking small steps in the direction of the negative gradient of the cost (loss) function until it reaches its minimum. Data access in SGD has extremely low temporal locality (randomness), so it makes minimal use of the small/fast layers of memory in an HPC memory hierarchy.

Sliding window SGD

Each SW-SGD update is computed using new training points fetched from main memory plus a number of previously visited training points that are still in cache.
The extra training points taken from the cache are essentially "free" because:
- of the cache effect;
- accessing them uses dead time whilst waiting for new points to arrive from main memory.

The added value is brought by re-using visited points, not by a bigger batch size.
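The poster does not give an explicit formula; as a sketch in our own notation, with a fresh batch B_t read from main memory, a window W_t of cached points, learning rate \eta and per-point loss \ell, one SW-SGD step could be written as

\[
\theta_{t+1} = \theta_t - \frac{\eta}{|B_t| + |W_t|} \sum_{x \in B_t \cup W_t} \nabla_\theta\, \ell(x;\, \theta_t),
\]

where giving new and cached points equal weight is our assumption rather than something stated in the poster.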
Compared to one-point SGD and mini-batch SGD, SW-SGD incurs extra compute time because each gradient calculation involves more points. The bigger the problem size and the batch size are, the more SW-SGD reduces cache misses.
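For illustration only, a minimal sketch of the idea in Python/NumPy; this is not the authors' implementation, and the gradient function grad, the parameter names and the window handling are our own assumptions:

```python
import numpy as np
from collections import deque

def sw_sgd(grad, theta, X, y, lr=0.01, batch_size=128, window_size=128, epochs=10, seed=0):
    """Sliding-window SGD sketch: each update uses `batch_size` newly fetched
    points plus up to `window_size` recently visited (cached) points."""
    rng = np.random.default_rng(seed)
    window = deque(maxlen=window_size)   # stands in for the points still resident in cache
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)       # one random pass over the data (the "new" points)
        for start in range(0, n, batch_size):
            new_idx = order[start:start + batch_size]
            cached_idx = np.array(window, dtype=int)      # indices of recently visited points
            idx = np.concatenate([new_idx, cached_idx])
            # One gradient step over new + cached points; the cached points are
            # re-used "for free" instead of being fetched again from main memory.
            theta = theta - lr * grad(theta, X[idx], y[idx])
            window.extend(new_idx)       # the newly visited points become the cached window
    return theta
```

In a real SW-SGD run the cached points would be whatever happens to still sit in the fast memory layers, and their gradient contribution can be evaluated during the dead time while the next batch is streamed in from main memory.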
Experiments

We used a classification problem on the MNIST dataset and a regression problem on the ChEMBL dataset, tested with several gradient descent optimisation algorithms: Momentum, Adam, Adagrad... The model is a neural network with 3 layers of 100 hidden units each. Results are averaged over 5-fold cross-validation.

Results

For Adam, using 256 new + 128 cached points shows less fluctuation and converges slightly earlier. However, according to preliminary experiments, using a batch size of 256 or 512 is less efficient than using 128. The fundamental idea of SW-SGD is valid for many gradient descent variants without any change to the algorithm.

Conclusion

SW-SGD adapts SGD to the access characteristics of modern HPC memory to gain extra gradient noise smoothing. By exploiting the temporal locality of data access and the memory hierarchy, it combines the epoch efficiency of SGD with the lower noise, smoothness and easier-to-spot convergence of mini-batch SGD. SW-SGD is applicable to different variants of SGD.

Acknowledgment

This work is funded by the European project ExCAPE, Grant Agreement no. 671555.