Abstract
Language modeling (LM) involves determining the joint probability of words in a sentence. The conditional approach is dominant, representing the joint probability in terms of conditionals. Examples include n-gram LMs and neural network LMs. An alternative approach, called the random field (RF) approach, is used in whole-sentence maximum entropy (WSME) LMs. Although the RF approach has potential benefits, the empirical results of previous WSME models are not satisfactory. In this paper, we revisit the RF approach for language modeling, with a number of innovations. We propose a trans-dimensional RF (TDRF) model and develop a training algorithm using joint stochastic approximation and trans-dimensional mixture sampling. We perform speech recognition experiments on Wall Street Journal data and find that our TDRF models perform as well as recurrent neural network LMs while being computationally more efficient at computing sentence probabilities.
Highlights
Language modeling is crucial for a variety of computational linguistic applications, such as speech recognition, machine translation, handwriting recognition, and information retrieval
We explore the use of a variety of features based on word and class information in trans-dimensional random field (TDRF) language modeling (LM)
We describe a trans-dimensional mixture sampling algorithm to simulate from the joint distribution $p(l, x^l; \lambda, \zeta)$, which is used with $(\lambda, \zeta) = (\lambda^{(t-1)}, \zeta^{(t-1)})$ at time $t$ for Markov chain Monte Carlo (MCMC) sampling in the joint stochastic approximation (SA) algorithm
Summary
Language modeling is crucial for a variety of computational linguistic applications, such as speech recognition, machine translation, handwriting recognition, and information retrieval. It involves determining the joint probability $p(x)$ of a sentence $x$, which can be denoted as a pair $x = (l, x^l)$, where $l$ is the length and $x^l = (x_1, \ldots, x_l)$ is a sequence of $l$ words. Neural network LMs, which have begun to surpass the traditional n-gram LMs, follow the conditional modeling approach, with $\phi(h_i)$ determined by a neural network (NN), which can be either a feedforward NN (Schwenk, 2007) or a recurrent NN (Mikolov et al., 2011).
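As a minimal sketch of the conditional modeling approach, the Python snippet below scores a sentence by summing the log conditionals of its words, with the history truncated to the previous word as in a bigram LM. The table `COND`, its probabilities, and the function name `sentence_log_prob` are hypothetical and serve only as an illustration; they are not taken from the paper. The end-of-sentence token is what lets a conditional model assign probability to the sentence length implicitly.

```python
import math

# Hypothetical toy bigram table: COND[prev][word] = p(word | prev).
# "<s>" marks the sentence start and "</s>" the sentence end.
# All probabilities are made up for illustration.
COND = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "</s>": 0.3},
    "dog": {"sleeps": 0.4, "</s>": 0.6},
    "sleeps": {"</s>": 1.0},
}


def sentence_log_prob(words):
    """Chain-rule log-probability of a sentence under the toy bigram model."""
    log_p = 0.0
    prev = "<s>"
    # Append the end token so the sentence length is part of the event scored.
    for word in list(words) + ["</s>"]:
        p = COND.get(prev, {}).get(word, 0.0)
        if p == 0.0:
            return float("-inf")  # unseen transition: zero probability
        log_p += math.log(p)
        prev = word
    return log_p


if __name__ == "__main__":
    print(sentence_log_prob(["the", "cat", "sleeps"]))
```

Scoring a sentence this way takes one conditional per word. A random field model would instead score the whole sentence at once, which is the contrast the abstract draws.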