Abstract

First-order methods based on stochastic gradient descent and its variants are widely used to train neural networks. The high dimensionality of the parameter space prevents the use of second-order methods in current practice. The empirical Fisher information matrix is a readily available estimate of the Hessian matrix that has recently been used to guide informative dropout approaches in deep learning. In this paper, we propose efficient ways to dynamically estimate the empirical Fisher information matrix to speed up the optimization of deep learning loss functions. We propose two different methods, both based on rank-1 updates of the empirical Fisher information matrix. The first, FisherExp, is based on exponential smoothing using the Sherman-Morrison-Woodbury matrix inversion formula. The second, FisherFIFO, maintains a circular gradient buffer and applies the Sherman-Morrison-Woodbury formula twice each time a gradient is replaced. We found that FisherFIFO scales better, and we further improve its scaling by proposing a partitioning strategy for the empirical Fisher information matrix. Our methods can be used in conjunction with existing momentum-based optimizers to improve them. We compared the performance of our methods with alternative baselines on image classification problems and found that they produce better results. Despite the overhead incurred by using second-order information, the partitioning strategy combined with parallel block updates allows us to reduce the total training time of FisherFIFO relative to the baselines.
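
The abstract describes both methods in terms of rank-1 Sherman-Morrison-Woodbury updates of the empirical Fisher inverse. The following is a minimal NumPy sketch of how such updates could be arranged: the class names, the decay/damping/buffer-size hyperparameters, and the exact update rules are illustrative assumptions, not the paper's implementation, and the partitioning strategy and the coupling with momentum-based optimizers are not shown.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, sign=+1.0):
    """Return the inverse of (A + sign * u u^T), given A_inv (symmetric) and vector u."""
    Au = A_inv @ u
    denom = 1.0 + sign * (u @ Au)
    return A_inv - sign * np.outer(Au, Au) / denom

class FisherExpSketch:
    """Illustrative exponentially smoothed Fisher inverse: F_t = decay * F_{t-1} + g g^T."""
    def __init__(self, dim, decay=0.95, damping=1e-3):
        self.decay = decay
        self.F_inv = np.eye(dim) / damping  # inverse of the damping * I initialization

    def update(self, g):
        # Inverse of (decay * F_{t-1}) is F_inv / decay; then add the rank-1 term g g^T.
        self.F_inv = sherman_morrison_update(self.F_inv / self.decay, g, sign=+1.0)

    def precondition(self, g):
        return self.F_inv @ g  # approximate natural-gradient direction

class FisherFIFOSketch:
    """Illustrative circular gradient buffer: two Sherman-Morrison steps per replacement."""
    def __init__(self, dim, buffer_size=32, damping=1e-3):
        self.buffer = [np.zeros(dim) for _ in range(buffer_size)]
        self.pos = 0
        self.F_inv = np.eye(dim) / damping  # inverse of damping*I + sum_i g_i g_i^T

    def update(self, g):
        old = self.buffer[self.pos]
        if old.any():
            # Downdate: remove the gradient being overwritten (damping keeps this well posed).
            self.F_inv = sherman_morrison_update(self.F_inv, old, sign=-1.0)
        # Update: add the incoming gradient.
        self.F_inv = sherman_morrison_update(self.F_inv, g, sign=+1.0)
        self.buffer[self.pos] = g.copy()
        self.pos = (self.pos + 1) % len(self.buffer)

    def precondition(self, g):
        return self.F_inv @ g
```

In this sketch, storing and updating the inverse directly avoids any explicit matrix inversion; each gradient costs two rank-1 inverse updates for the FIFO variant (one downdate, one update) and one for the exponentially smoothed variant.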
