Abstract

We cast Amari’s natural gradient in statistical learning as a specific case of Kalman filtering. Namely, applying an extended Kalman filter to estimate a fixed unknown parameter of a probabilistic model from a series of observations is rigorously equivalent to estimating this parameter via an online stochastic natural gradient descent on the log-likelihood of the observations. In the i.i.d. case, this relation is a consequence of the “information filter” phrasing of the extended Kalman filter. In the recurrent (state space, non-i.i.d.) case, we prove that the joint Kalman filter over states and parameters is a natural gradient on top of real-time recurrent learning (RTRL), a classical algorithm to train recurrent models. This exact algebraic correspondence provides relevant interpretations for natural gradient hyperparameters such as learning rates or initialization and regularization of the Fisher information matrix.

Highlights

  • This exact algebraic correspondence provides relevant interpretations for natural gradient hyperparameters such as learning rates or initialization and regularization of the Fisher information matrix

  • The natural gradient modifies the ordinary gradient by using the information geometry of the statistical model, via the Fisher information matrix (see the update formula after this list)

  • The extended Kalman filter can be used to estimate the parameters of a statistical model, by viewing the parameters as the hidden state of a “static” dynamical system, and viewing i.i.d. samples as noisy observations depending on the parameters (a numerical sketch follows this list)
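
In formulas, the natural gradient update reads as follows (a standard statement of the update, written here with the paper’s ⊗2 notation; ηt denotes a learning rate):

```latex
\theta \leftarrow \theta - \eta_t \, J(\theta)^{-1} \nabla_{\theta}\, \ell_t(\theta),
\qquad
J(\theta) := \mathbb{E}_{y \sim p(y \mid \theta)}\left[\left(\nabla_{\theta} \ln p(y \mid \theta)\right)^{\otimes 2}\right]
```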
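The correspondence announced in the abstract can be checked numerically. The following self-contained NumPy sketch is our own illustration, not code from the paper: for a linear model with Gaussian noise, an extended Kalman filter with static dynamics θt = θt−1 and an online natural gradient with accumulated Fisher matrix produce identical parameter iterates, with the filter’s initial covariance P0 playing the role of the inverse of the Fisher-matrix initialization J0.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, R = 3, 50, 0.5                 # parameter dimension, steps, noise variance
theta_true = rng.normal(size=d)

P = np.eye(d)                        # EKF covariance; P0^{-1} = J0 below
J = np.linalg.inv(P)                 # accumulated Fisher information matrix
theta_ekf = np.zeros(d)
theta_ng = np.zeros(d)

for t in range(T):
    u = rng.normal(size=d)
    y = theta_true @ u + np.sqrt(R) * rng.normal()
    H = u[None, :]                   # Jacobian of y_hat = u^T theta  (1 x d)

    # Extended Kalman filter with static dynamics theta_t = theta_{t-1}
    innovation = y - theta_ekf @ u
    S = (H @ P @ H.T + R).item()     # innovation variance
    K = (P @ H.T / S).ravel()        # Kalman gain  (d,)
    theta_ekf = theta_ekf + K * innovation
    P = P - np.outer(K, H @ P)

    # Online natural gradient on the loss ell_t = -ln p(y_t | y_hat_t)
    grad = -(y - theta_ng @ u) / R * u   # ordinary gradient of ell_t
    J = J + np.outer(u, u) / R           # add this step's Fisher information
    theta_ng = theta_ng - np.linalg.solve(J, grad)

print(np.allclose(theta_ekf, theta_ng))  # True: identical iterates
```

The final assertion holds up to floating-point error because Pt⁻¹ = Jt at every step, which is exactly the “information filter” form of the equivalence in the i.i.d. case.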

Summary

Problem setting

We have a series of observation pairs (u1, y1), (u2, y2), . . . , with inputs ut and observations yt. We are given an exponential family (output noise model) p(y|ŷ) on y, with mean parameter ŷ and sufficient statistics T(y) (see the Appendix), and we define the loss function ℓt := − ln p(yt|ŷt). This convention allows us to extend the definition of the Kalman filter to such a setting (Def. 5) in a natural way, just by replacing the measurement error yt − ŷt with T(yt) − ŷt, with T the sufficient statistics of the exponential family. (For instance, for categorical observations over K classes with probabilities p1, . . . , pK, the last probability pK is determined by the others via Σk pk = 1 and has to be excluded to obtain a non-degenerate parameterization and an invertible covariance matrix Rt.) For a column vector u, u⊗2 is synonymous with uu⊤, and with u⊤u for a row vector.
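To make the categorical convention concrete, here is a minimal NumPy sketch. It is our own illustration: the names y_hat and R_t are ours, and we take Rt to be the covariance of T(y) under the model, a standard choice for exponential families.

```python
import numpy as np

K = 4
p = np.array([0.1, 0.2, 0.3, 0.4])   # model class probabilities, summing to 1
y_hat = p[:-1]                       # mean parameter: p_K is excluded

def T(y):
    # Sufficient statistics: one-hot encoding of the class, last class dropped
    return np.eye(K)[y][:-1]

y = 2                                          # an observed class label
innovation = T(y) - y_hat                      # replaces y_t - y_hat_t in the filter
R_t = np.diag(y_hat) - np.outer(y_hat, y_hat)  # covariance of T(y); invertible
print(innovation, np.linalg.cond(R_t))
```

Dropping the last class is what makes R_t invertible: the full K-dimensional one-hot covariance is singular, since the coordinates always sum to 1.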

Natural gradient descent
Kalman filtering for parameter estimation
Natural gradient as a Kalman filter: heuristics
Proofs for the static case
Proofs for the recurrent case