Abstract

We cast Amari’s natural gradient in statistical learning as a specific case of Kalman filtering. Namely, applying an extended Kalman filter to estimate a fixed unknown parameter of a probabilistic model from a series of observations is rigorously equivalent to estimating this parameter via an online stochastic natural gradient descent on the log-likelihood of the observations. In the i.i.d. case, this relation is a consequence of the “information filter” phrasing of the extended Kalman filter. In the recurrent (state space, non-i.i.d.) case, we prove that the joint Kalman filter over states and parameters is a natural gradient on top of real-time recurrent learning (RTRL), a classical algorithm to train recurrent models. This exact algebraic correspondence provides relevant interpretations for natural gradient hyperparameters such as learning rates or initialization and regularization of the Fisher information matrix.
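
To make the claimed equivalence concrete in the i.i.d. case, here is a minimal numerical sketch (our illustration, not code from the paper): for a toy model with Gaussian output noise, one extended Kalman filter measurement step on the static parameter θ coincides exactly with a natural-gradient step preconditioned by the inverse of the accumulated online Fisher information, i.e. the “information filter” form. The model h and all variable names are assumptions made for this example.

    import numpy as np

    # Toy static model: y_t = h(theta, u_t) + Gaussian noise; theta is a
    # fixed unknown parameter treated as the hidden state of the EKF.
    rng = np.random.default_rng(0)
    dim = 3

    def h(theta, u):            # model prediction y_hat
        return np.tanh(u @ theta)

    def jacobian(theta, u):     # H_t = dh/dtheta, shape (1, dim)
        return ((1 - np.tanh(u @ theta) ** 2) * u).reshape(1, -1)

    theta = rng.normal(size=dim)    # current parameter estimate
    P = np.eye(dim)                 # posterior covariance
    R = np.array([[0.5]])           # output noise variance
    u_t = rng.normal(size=dim)
    y_t = np.array([0.7])           # one observation

    H = jacobian(theta, u_t)
    err = y_t - h(theta, u_t)       # measurement error y_t - y_hat_t

    # Extended Kalman filter measurement update (static state: no transition).
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    theta_ekf = theta + (K @ err).ravel()
    P_new = (np.eye(dim) - K @ H) @ P

    # Same step in "information filter" form = online natural gradient:
    # P_new^{-1} = P^{-1} + H^T R^{-1} H accumulates Fisher information, and
    # the gradient of l_t = -ln p(y_t | y_hat_t) is -H^T R^{-1} (y_t - y_hat_t).
    J_new = np.linalg.inv(P) + H.T @ np.linalg.inv(R) @ H
    grad = -(H.T @ np.linalg.inv(R) @ err.reshape(-1, 1)).ravel()
    theta_ng = theta - np.linalg.solve(J_new, grad)

    print(np.allclose(theta_ekf, theta_ng))          # True
    print(np.allclose(np.linalg.inv(P_new), J_new))  # True

The two printed checks confirm, on this toy example, the algebraic identity behind the static case: the Kalman gain applied to the measurement error equals a gradient step preconditioned by the updated inverse Fisher matrix.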

Highlights

  • This exact algebraic correspondence provides relevant interpretations for natural gradient hyperparameters such as learning rates or initialization and regularization of the Fisher information matrix

  • The natural gradient modifies the ordinary gradient by using the information geometry of the statistical model, via the Fisher information matrix (written out in the equations after this list)

  • The extended Kalman filter can be used to estimate the parameters of a statistical model, by viewing the parameters as the hidden state of a “static” dynamical system, and viewing i.i.d. samples as noisy observations depending on the parameters (see the sketch after this list)
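
In formulas, the two highlights above read as follows (a notational sketch: ηt denotes a learning rate and h the model’s prediction function, neither of which is specified in the highlights themselves):

    % Natural-gradient step, preconditioned by the Fisher matrix J(theta):
    \theta_{t+1} = \theta_t - \eta_t \, J(\theta_t)^{-1} \nabla_\theta \ell_t,
    \qquad
    J(\theta) = \mathbb{E}_{y \sim p(y \mid \theta)}\left[ \big( \nabla_\theta \ln p(y \mid \theta) \big)^{\otimes 2} \right]

    % "Static" dynamical system whose hidden state is the parameter itself:
    \theta_t = \theta_{t-1}, \qquad
    y_t \sim p\left(y \mid \hat{y}_t\right), \quad \hat{y}_t = h(\theta_t, u_t)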


Summary

Problem setting

We have a series of observation pairs (u1, y1), . . . , (ut, yt), . . . , where the ut are inputs and the yt are observations; given ut, the model produces a prediction ŷt of yt that depends on the parameter θ. We are given an exponential family (output noise model) p(y|ŷ) on y with mean parameter ŷ and sufficient statistics T (y) (see the Appendix), and we define the loss function ℓt := − ln p(yt|ŷt). For instance, for a classification problem with K classes, ŷ may be the vector of probabilities (p1, . . . , pK−1) of classes 1 to K − 1. (The last probability pK is determined by the others via Σk pk = 1 and has to be excluded to obtain a non-degenerate parameterization and an invertible covariance matrix Rt.) This convention allows us to extend the definition of the Kalman filter to such a setting (Def. 5) in a natural way, just by replacing the measurement error yt − ŷt with T (yt) − ŷt, with T the sufficient statistics for the exponential family. For a column vector u, u⊗2 is synonymous with uu⊤, and with u⊤u for a row vector.
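
As a small worked example of these conventions (our own sketch; the variable names are illustrative), take a K-class categorical output model: T (y) is the one-hot encoding over the first K − 1 classes, the mean parameter ŷ is (p1, . . . , pK−1), and the generalized measurement error is T (yt) − ŷt:

    import numpy as np

    # K-class categorical output model; the K-th probability is dropped so
    # the parameterization is non-degenerate and R_t stays invertible.
    K = 4
    p = np.array([0.1, 0.2, 0.3, 0.4])   # model probabilities, sum to 1
    y_hat = p[:-1]                        # mean parameter (p_1, ..., p_{K-1})

    def T(y):
        """Sufficient statistics: one-hot over the first K-1 classes."""
        t = np.zeros(K - 1)
        if y < K - 1:
            t[y] = 1.0
        return t

    y_t = 2                               # observed class (0-indexed)
    loss = -np.log(p[y_t])                # l_t = -ln p(y_t | y_hat_t)
    error = T(y_t) - y_hat                # replaces y_t - y_hat_t in the EKF
    print(loss, error)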

Natural gradient descent
Kalman filtering for parameter estimation
Natural gradient as a Kalman filter: heuristics
Proofs for the static case
Proofs for the recurrent case

