Abstract

Automatic differentiation (AD) is a technique for computing the derivative of a function F : ℝⁿ → ℝᵐ defined by a computer program. Modern applications of AD, such as machine learning, typically use AD to facilitate gradient-based optimization of an objective function for which m ≪ n (often m = 1). As a result, these applications typically use reverse-mode (or adjoint-mode) AD to compute the gradient of F efficiently, in time Θ(m · T₁(F)), where T₁(F) is the work (serial running time) of F. Although the serial running time of reverse-mode AD has a well-known relationship to the total work of F, general-purpose reverse-mode AD has proven challenging to parallelize in a work-efficient and scalable fashion, and simple approaches tend to yield poor performance or scalability. This paper introduces PARAD, a work-efficient parallel algorithm for reverse-mode AD of determinacy-race-free recursive fork-join programs. We analyze the performance of PARAD using work/span analysis. Given a program F with work T₁(F) and span (critical-path length) T∞(F), PARAD performs reverse-mode AD of F in O(m · T₁(F)) work and O(log m + log(T₁(F)) · T∞(F)) span. To the best of our knowledge, PARAD is the first parallel algorithm for reverse-mode AD that is both provably work-efficient and has span within a polylogarithmic factor of that of the original program F. We implemented PARAD as an extension of Adept, a C++ library for performing reverse-mode AD of serial programs that is known for its efficiency. Our implementation supports Cilk fork-join parallelism and requires no programmer annotations of parallel control flow. Instead, it uses compiler instrumentation to dynamically trace a program's series-parallel structure, which is used to automatically parallelize the gradient computation via reverse-mode AD. On eight machine-learning benchmarks, our implementation of PARAD achieves a 1.5× geometric-mean multiplicative work overhead relative to the serial Adept tool and an 8.9× geometric-mean self-relative speedup on 18 cores.
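
To make the tape-based mechanism referenced above concrete, the following self-contained C++ sketch records each elementary operation on a tape during a forward pass and then propagates adjoints in a single reverse sweep, so computing the gradient costs time proportional to the recorded work. It is an illustrative toy under simplifying assumptions (scalar values, only + and *); the names Tape, Var, add, and mul are hypothetical, and this is not the Adept or PARAD implementation.

#include <cassert>
#include <cstddef>
#include <vector>

// Toy tape-based reverse-mode AD for scalar expressions built from + and *.
// Each recorded node stores the tape indices of its two inputs and the
// partial derivatives of its output with respect to them, so the reverse
// sweep can propagate adjoints in time proportional to the tape length.
struct Tape {
    struct Node {
        std::size_t in0, in1;   // indices of the input variables
        double d0, d1;          // partial derivatives w.r.t. those inputs
    };
    std::vector<Node> nodes;

    std::size_t new_variable() {
        // Independent variable: depends on nothing (zero partials).
        nodes.push_back({0, 0, 0.0, 0.0});
        return nodes.size() - 1;
    }
    std::size_t record(std::size_t a, std::size_t b, double da, double db) {
        nodes.push_back({a, b, da, db});
        return nodes.size() - 1;
    }
    // Reverse sweep: seed the adjoint of `output` with 1 and walk the tape
    // backwards, accumulating the adjoint of every recorded variable.
    std::vector<double> gradient(std::size_t output) const {
        std::vector<double> adj(nodes.size(), 0.0);
        adj[output] = 1.0;
        for (std::size_t i = nodes.size(); i-- > 0;) {
            const Node& n = nodes[i];
            adj[n.in0] += n.d0 * adj[i];
            adj[n.in1] += n.d1 * adj[i];
        }
        return adj;
    }
};

// Differentiable value: a primal value plus its index on the tape.
struct Var {
    double value;
    std::size_t idx;
};

Var make_var(Tape& t, double v) { return {v, t.new_variable()}; }

Var add(Tape& t, const Var& a, const Var& b) {
    return {a.value + b.value, t.record(a.idx, b.idx, 1.0, 1.0)};
}
Var mul(Tape& t, const Var& a, const Var& b) {
    return {a.value * b.value, t.record(a.idx, b.idx, b.value, a.value)};
}

int main() {
    // f(x, y) = x*y + x, so df/dx = y + 1 = 3 and df/dy = x = 2 at (2, 2).
    Tape tape;
    Var x = make_var(tape, 2.0), y = make_var(tape, 2.0);
    Var f = add(tape, mul(tape, x, y), x);
    std::vector<double> adj = tape.gradient(f.idx);
    assert(adj[x.idx] == 3.0 && adj[y.idx] == 2.0);
    return 0;
}

Running main() evaluates f(x, y) = x·y + x at (2, 2) and recovers both partial derivatives from one reverse sweep; reverse mode yields all n partials of a single output in one pass over the tape, which is why the m ≪ n regime favors it.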

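The programs PARAD targets are determinacy-race-free recursive fork-join programs. The sketch below, written with the Cilk keywords cilk_spawn and cilk_sync, is only meant to illustrate that program class and the series-parallel structure such a program unfolds at runtime; parallel_sum is a hypothetical example, and the snippet does not use the Adept or PARAD API.

#include <cilk/cilk.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// Recursive fork-join summation: the left half is spawned while the right
// half runs in the continuation, and cilk_sync joins the two strands.
// The two recursive calls touch disjoint data, so the program is free of
// determinacy races.
double parallel_sum(const double* a, std::size_t n) {
    if (n <= 1024) {  // serial base case
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    std::size_t half = n / 2;
    double left = cilk_spawn parallel_sum(a, half);
    double right = parallel_sum(a + half, n - half);
    cilk_sync;
    return left + right;
}

int main() {
    std::vector<double> data(1 << 20, 0.5);
    std::printf("sum = %f\n", parallel_sum(data.data(), data.size()));
    return 0;
}

A program of this form has work T₁ equal to its serial running time and span T∞ equal to its critical-path length; the work and span bounds stated in the abstract are expressed in terms of these two quantities.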