Abstract

Contrastive Divergence Learning May Diverge When Training Restricted Boltzmann Machines

Asja Fischer 1,2 and Christian Igel 1,2*

1 Ruhr-Universität Bochum, Bernstein Center for Computational Neuroscience, Germany
2 Ruhr-Universität Bochum, Institut für Neuroinformatik, Germany

Understanding and modeling how brains learn higher-level representations from sensory input is one of the key challenges in computational neuroscience and machine learning. Layered generative models such as deep belief networks (DBNs) are promising for unsupervised learning of such representations, and new algorithms that operate in a layer-wise fashion make learning these models computationally tractable [1-5]. Restricted Boltzmann Machines (RBMs) are the typical building blocks for DBN layers. They are undirected graphical models whose structure is a bipartite graph connecting input (visible) and hidden neurons.

Training large undirected graphical models by likelihood maximization in general involves averages over an exponential number of terms, and obtaining unbiased estimates of these averages by Markov chain Monte Carlo methods typically requires many sampling steps. However, it was recently shown that estimates obtained after running the chain for just a few steps can be sufficient for model training [3]. In particular, gradient ascent on the k-step Contrastive Divergence (CD-k), a biased estimator of the log-likelihood gradient based on k steps of Gibbs sampling, has become the most common way to train RBMs [1-5].

Contrastive Divergence learning does not necessarily reach the maximum likelihood estimate of the parameters (e.g., because of the bias). However, we show that the situation is much worse. We demonstrate empirically that for some benchmark problems taken from the literature [6], CD learning systematically leads to a steady decrease of the log-likelihood after an initial increase (see supplementary Figure 1). This seems to happen especially when trying to learn more complex distributions, which are the targets if RBMs are used within DBNs.

The reason for the decreasing log-likelihood is an increase in the magnitude of the model parameters. The estimation bias depends on the mixing rate of the Markov chain, and it is well known that mixing slows down as the parameter magnitudes grow [1,3]. Weight decay can therefore solve the problem if the strength of the regularization term is adjusted correctly: if chosen too large, learning is not accurate enough; if chosen too small, learning still diverges. For large k, the effect is less pronounced. Increasing k, as suggested in [1] for finding parameters with higher likelihood, may therefore prevent divergence. However, divergence occurs even for values of k too large to be computationally tractable for large models. Thus, a dynamic schedule to control k is needed.
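Since the abstract describes the CD-k procedure and the weight-decay remedy only in words, the following minimal sketch illustrates the idea for a binary RBM in NumPy. It is not the authors' implementation or the experimental setup from [6]; the function names, hyperparameter defaults (learning rate, weight decay, k), and the brute-force likelihood routine are assumptions chosen for illustration only.

```python
# Illustrative sketch only: one CD-k update for a binary RBM, plus exact
# log-likelihood evaluation for small models. All names and default values
# are assumptions for illustration, not taken from the abstract or its references.
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_bernoulli(p):
    # Draw binary samples with success probabilities p.
    return (rng.random(p.shape) < p).astype(float)


def cd_k_update(W, b, c, v0, k=1, lr=0.05, weight_decay=0.0):
    """Apply one CD-k gradient step in place.

    W: (n_visible, n_hidden) weights, b: visible biases, c: hidden biases,
    v0: mini-batch of binary visible vectors, shape (batch, n_visible).
    """
    # Positive phase: hidden probabilities driven by the data.
    ph0 = sigmoid(v0 @ W + c)
    vk = v0
    # Negative phase: k steps of block Gibbs sampling starting from the data.
    for _ in range(k):
        hk = sample_bernoulli(sigmoid(vk @ W + c))
        vk = sample_bernoulli(sigmoid(hk @ W.T + b))
    phk = sigmoid(vk @ W + c)
    batch = v0.shape[0]
    # CD-k estimate: data statistics minus k-step sample statistics,
    # with an optional weight-decay term as discussed above.
    W += lr * ((v0.T @ ph0 - vk.T @ phk) / batch - weight_decay * W)
    b += lr * (v0 - vk).mean(axis=0)
    c += lr * (ph0 - phk).mean(axis=0)


def exact_log_likelihood(W, b, c, data):
    """Average log-likelihood of the data; tractable only for small n_visible,
    since the partition function is computed by enumerating all visible states."""
    n_visible = W.shape[0]
    all_v = np.array([[(i >> j) & 1 for j in range(n_visible)]
                      for i in range(2 ** n_visible)], dtype=float)

    def free_energy(v):
        # F(v) = -b.v - sum_j softplus(c_j + v.W_j)
        return -v @ b - np.logaddexp(0.0, v @ W + c).sum(axis=1)

    log_z = np.logaddexp.reduce(-free_energy(all_v))
    return float(-free_energy(data).mean() - log_z)
```

On a small toy data set one could track exact_log_likelihood over repeated cd_k_update calls to check for the qualitative behavior described above: an initial increase of the log-likelihood followed by a steady decrease as the weight magnitudes grow, mitigated or not depending on the chosen weight_decay and k.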
