Abstract

Adjoint methods are used in both control theory and machine learning (ML) to efficiently compute gradients of functionals. In ML, the adjoint method is a popular approach for training multilayer neural networks and is commonly referred to as backpropagation. Despite its importance in ML, the adjoint method suffers from two well-documented shortcomings: (i) gradient decay/explosion and (ii) excessive training time. Until now, the gradient decay problem has primarily been addressed through modifications to the network architecture with gating units that add additional parameters. This incurs additional computational cost during evaluation and training, which further exacerbates the excessive training time. In this letter, we introduce a powerful framework for addressing the gradient decay problem based on second-order sensitivity concepts from control theory. As a result, we are able to robustly train arbitrary network architectures without suffering from gradient decay. Furthermore, we demonstrate that this method speeds up training with respect to both wall-clock time and data efficiency. We evaluate our method on a synthetic long-time-gap task, as well as on three sequential modeling benchmarks, using a simple recurrent neural network architecture.
