Reproducible Parallel Stochastic Gradient Descent
Stochastic gradient descent (SGD) is one of the most successful techniques ever devised for both machine learning and mathematical optimization. Lokad has been extensively exploiting SGD for years for supply chain purposes, mostly through differentiable programming. Most of our clients have at least one SGD somewhere in their data pipeline.
In software, speed is a feature. We wanted to make our SGD implementation faster - but without sacrificing its deterministic behavior. Indeed, determinism is essential to deliver production-grade supply chain setups. In short, running SGD twice on the same data should give exactly the same result.
This summer we had the pleasure of welcoming Ziyad Benomar for an internship dedicated to the reproducible speed-up of our SGD implementation. Ziyad delivered a remarkable piece of work that we are sharing below.
The short version is that we achieve a 5x speed-up for a 6x compute cost compared to a baseline SGD. Beyond that, the bottleneck - in the Lokad architecture - becomes loading the data. Performance optimization is an endless game of Whac-A-Mole: as soon as one bottleneck is addressed, a new one emerges. Solving this I/O problem is the obvious next step for us. Stay tuned!
Title: Reproducible Parallel Stochastic Gradient Descent for Very High Dimensional Data
Author: Ziyad Benomar, École polytechnique
Abstract: We study distributed stochastic gradient descent algorithms and their implementation on a machine with several processors. We approach the problem from a performance and computational-speed point of view, and we additionally require our solutions to be reproducible, i.e. to give the same results when run on the same data set. We focus mainly on the PR-SGD algorithm, and we give some adaptations that improve the execution speed when the time needed to load the data into the computer's RAM is very long. For each proposed algorithm, we give a proof of convergence and an estimate of its runtime. Finally, we compare our algorithms using a notion of ε-performance that we define.
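To give a flavor of the approach, here is a minimal single-process sketch of PR-SGD (parallel restarted SGD): each worker runs local SGD steps on its data shard, then the workers' models are averaged ("restarted") at each round. Reproducibility is obtained by deriving a deterministic random stream per (round, worker) pair. All names, hyperparameters, and the least-squares objective are illustrative assumptions, not Lokad's actual implementation.

```python
import numpy as np

def pr_sgd(X, y, n_workers=4, local_steps=10, rounds=20, lr=0.1, seed=0):
    """Illustrative PR-SGD on a least-squares objective.

    Each round, every worker starts from the shared model `w`, performs
    `local_steps` SGD updates on its own shard, and the resulting local
    models are averaged to form the next shared model. Determinism comes
    from seeding one RNG stream per (round, worker) pair, so the result
    is identical across runs regardless of worker scheduling order.
    """
    n, d = X.shape
    shards = np.array_split(np.arange(n), n_workers)
    w = np.zeros(d)
    for r in range(rounds):
        local_models = []
        for k, idx in enumerate(shards):
            # Deterministic per-(round, worker) stream: reproducible by design.
            rng = np.random.default_rng(seed + 1000 * r + k)
            wk = w.copy()
            for _ in range(local_steps):
                i = idx[rng.integers(len(idx))]
                grad = (X[i] @ wk - y[i]) * X[i]  # gradient of 0.5*(x.w - y)^2
                wk -= lr * grad
            local_models.append(wk)
        w = np.mean(local_models, axis=0)  # the "restart": average the workers
    return w
```

In a real multi-processor implementation the inner loop over workers runs in parallel; because each worker's randomness depends only on its (round, worker) indices, the averaged model is bit-for-bit independent of execution order, which is what makes the parallel run reproducible.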