Abstract

With the massive resurgence of artificial intelligence, statistical learning theory and information science, the core technologies of AI, are receiving growing attention. To deal with massive data, efficient learning algorithms are required in statistical learning. In deep learning, natural gradient algorithms such as AdaGrad and Adam are widely used, motivated by Newton's approach of applying second-order derivatives to rescale gradients. By approximating the second-order geometry of the empirical loss with the empirical Fisher information matrix (FIM), natural gradient methods are expected to gain additional learning efficiency. However, the exact curvature of the empirical loss is described by the Hessian matrix, not the FIM, and a bias between the empirical FIM and the Hessian persists before convergence, which affects the expected efficiency. In this paper, we present a new stochastic optimization algorithm, diagSG (diagonal Hessian stochastic gradient), for the deep learning setting. As a second-order algorithm, diagSG estimates the diagonal entries of the Hessian matrix at each iteration through simultaneous perturbation stochastic approximation (SPSA) and uses these entries to set adaptive per-parameter learning rates. By comparing the rescaling matrices in diagSG and in natural gradient methods, we argue that diagSG has an advantage in characterizing the loss curvature because it approximates the Hessian diagonal more accurately. In the experimental part, we provide an experiment to support our argument.
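
To make the abstract's description concrete, the sketch below illustrates one common way an SPSA-style, gradient-based diagonal Hessian estimate can drive a rescaled stochastic update. It is a minimal illustration, not the paper's reference implementation: the function names, the smoothing factor `beta`, the damping constant `eps`, and the toy quadratic loss are all assumptions made for the example.

```python
# Sketch: diagonal-Hessian rescaled stochastic gradient step (illustrative only).
# Idea: with a Rademacher perturbation z, (g(theta + c z) - g(theta - c z)) / (2c)
# approximates H z, and z * (H z) has expectation diag(H), since the off-diagonal
# contributions cancel on average.
import numpy as np

def spsa_hessian_diag(grad_fn, theta, c=1e-3, rng=None):
    """Noisy estimate of diag(H) at `theta` from two gradient evaluations."""
    rng = rng or np.random.default_rng()
    z = rng.choice([-1.0, 1.0], size=theta.shape)      # simultaneous perturbation direction
    g_plus = grad_fn(theta + c * z)
    g_minus = grad_fn(theta - c * z)
    return z * (g_plus - g_minus) / (2.0 * c)           # elementwise estimate of diag(H)

def diag_second_order_step(grad_fn, theta, h_avg, lr=0.1, beta=0.9, eps=1e-8):
    """One stochastic step that rescales the gradient by a smoothed |diag(H)|."""
    g = grad_fn(theta)
    h_hat = spsa_hessian_diag(grad_fn, theta)
    h_avg = beta * h_avg + (1.0 - beta) * h_hat          # exponential smoothing of the estimate
    theta = theta - lr * g / (np.abs(h_avg) + eps)       # curvature-aware, per-parameter step
    return theta, h_avg

if __name__ == "__main__":
    # Toy quadratic loss with very different curvature per coordinate.
    A = np.diag([100.0, 1.0])
    grad_fn = lambda th: A @ th
    theta, h_avg = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(200):
        theta, h_avg = diag_second_order_step(grad_fn, theta, h_avg)
    print(theta)  # both coordinates shrink at comparable rates despite unequal curvature
```

On the toy quadratic, the estimated diagonal recovers the true curvatures (100 and 1), so the rescaled step behaves like a coordinate-wise Newton step; this is the kind of curvature characterization the abstract contrasts with empirical-FIM rescaling.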
