Abstract

In this thesis we develop a mathematical formulation for the learning dynamics of stochastic, or online, learning algorithms in neural networks. We use this formulation to (1) model the time evolution of the weight-space density during learning, (2) predict convergence regimes with and without momentum, and (3) develop a new, efficient algorithm with few adjustable parameters, which we call adaptive momentum. In stochastic learning, the weights are updated at each iteration based on a single exemplar chosen at random from the training set.

Treating the learning dynamics as a Markov process, we show that the weight-space probability density $P(w,t)$ satisfies a Kramers-Moyal equation
$$\frac{\partial P(w,t)}{\partial t} = L_{\rm KM}\, P(w,t), \eqno(0.1)$$
where $L_{\rm KM}$ is an infinite-order linear differential operator whose terms involve powers of the learning rate $\mu$. We present several approaches for truncating this series so that approximate solutions can be obtained. One approach is the small-noise expansion, in which the weights are modeled as the sum of a deterministic component and a noise component. To obtain more accurate solutions, we also develop a perturbation expansion in $\mu$ and demonstrate the technique on equilibrium weight-space densities.

Unlike batch updates, stochastic updates are noisy but fast to compute. The speed-up can be dramatic when the training set is highly redundant, and the noise can reduce the likelihood of becoming trapped in poor local minima. However, acceleration techniques based on estimating the local curvature of the cost surface cannot be implemented stochastically, because stochastic estimates of second-order effects are far too noisy. Ignoring these effects can greatly hinder learning on problems where the condition number of the Hessian is large. What is needed is a matrix of learning rates (the inverse Hessian) that scales the step size according to the curvature along the different eigendirections of the Hessian. We propose adaptive momentum as a solution. It yields an effective learning-rate matrix that approximates the inverse Hessian, with no explicit calculation of the Hessian or its inverse. The algorithm is only $\mathcal{O}(n)$ in both space and time, where $n$ is the dimension of the weight vector.
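For readers unfamiliar with the Kramers-Moyal form of equation (0.1), the display below is a sketch of the standard one-dimensional expansion of such an operator; the jump-moment coefficients $D^{(k)}(w)$ are not specified in this abstract, and in the setting above they are the terms that carry the powers of the learning rate $\mu$:
$$\frac{\partial P(w,t)}{\partial t} = \sum_{k=1}^{\infty} \left(-\frac{\partial}{\partial w}\right)^{k} \left[\, D^{(k)}(w)\, P(w,t) \,\right].$$
Truncating this series after $k=2$ yields the familiar Fokker-Planck (drift-diffusion) approximation, which is one example of the kind of truncation discussed above.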
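The adaptive-momentum update rule itself is developed in the body of the thesis and is not reproduced in this abstract. As context, the sketch below shows a generic online (per-exemplar) gradient update with an ordinary fixed momentum term, i.e. the baseline that adaptive momentum modifies; the names grad_fn, mu, and beta are illustrative assumptions, not the thesis' notation.

    import numpy as np

    def online_momentum_sgd(w, grad_fn, data, mu=0.01, beta=0.9, epochs=10, rng=None):
        """Generic stochastic (online) learning with a fixed momentum term.

        At each iteration the gradient is estimated from a single exemplar
        drawn at random from the training set, so every update costs O(n)
        in the dimension n of the weight vector.
        """
        rng = rng or np.random.default_rng()
        v = np.zeros_like(w)                      # momentum (velocity) accumulator
        for _ in range(epochs):
            for i in rng.permutation(len(data)):  # one randomly chosen exemplar at a time
                g = grad_fn(w, data[i])           # noisy single-exemplar gradient
                v = beta * v - mu * g             # accumulate momentum
                w = w + v                         # O(n) weight update
        return w

With a fixed scalar momentum parameter beta, every weight direction is treated alike; the adaptive-momentum algorithm proposed in the thesis instead produces an effective learning-rate matrix that approximates the inverse Hessian while keeping the per-update cost at $\mathcal{O}(n)$.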
