Abstract

We describe a theoretical method of determining optimal learning rates for on-line gradient descent training of a multilayer neural network (a soft committee machine). A variational approach is used to determine the time-dependent learning rate which maximizes the total decrease in generalization error over a fixed time window, using a statistical mechanics description of the learning process which is exact in the limit of large input dimension. A linear analysis around transient and asymptotic fixed points of the dynamics provides insight into the optimization process and explains the excellent agreement between our results and independent results for isotropic, realizable tasks. This allows a rather general characterization of the optimal learning rate dynamics within each phase of learning (we discuss scaling laws with respect to task complexity in particular). Our method can also be used to optimize other parameters and learning rules, and we briefly consider a generalized algorithm in which weights associated with different hidden nodes can be assigned different learning rates. The optimal settings in this case suggest that such an algorithm can significantly outperform standard gradient descent.
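To make the setting concrete, the following is a minimal sketch (not the paper's variational analysis) of on-line gradient descent for a soft committee machine in the standard teacher-student scenario, using the common choice g(x) = erf(x/√2) and the 1/N update scaling. The decaying learning-rate schedule and the option of per-hidden-node learning rates are illustrative placeholders standing in for the optimized schedules discussed in the abstract; function names such as `online_gd_step` are assumptions for this sketch only.

```python
import numpy as np
from scipy.special import erf

def g(x):
    # Hidden-node transfer function g(x) = erf(x / sqrt(2))
    return erf(x / np.sqrt(2.0))

def dg(x):
    # Derivative of g
    return np.sqrt(2.0 / np.pi) * np.exp(-x**2 / 2.0)

def student_output(J, xi):
    # Soft committee machine: sum of hidden-node activations, unit hidden-to-output weights.
    # J: (K, N) student weights; xi: (N,) input example.
    return g(J @ xi).sum()

def online_gd_step(J, xi, tau, eta):
    """One on-line gradient descent update on the squared error.

    eta may be a scalar learning rate or a length-K vector of per-hidden-node
    learning rates (the generalized rule mentioned in the abstract).
    """
    K, N = J.shape
    fields = J @ xi                                   # student activations x_k = J_k . xi
    delta = (tau - g(fields).sum()) * dg(fields)      # per-node error signal
    return J + (np.asarray(eta).reshape(-1, 1) * np.outer(delta, xi)) / N

# Example run: isotropic, realizable task (teacher is itself a soft committee machine).
rng = np.random.default_rng(0)
N, K, steps = 1000, 3, 20000
B = rng.standard_normal((K, N))            # teacher weights
J = 1e-3 * rng.standard_normal((K, N))     # small random student initialization

for t in range(steps):
    xi = rng.standard_normal(N)            # fresh i.i.d. example (on-line learning)
    tau = g(B @ xi).sum()                  # teacher output
    eta_t = 1.0 / (1.0 + t / N)            # placeholder decaying schedule, not the optimal one
    J = online_gd_step(J, xi, tau, eta_t)
```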
