Abstract

We apply a trigram language model to an 86 000-word-vocabulary speech recognition task. The recognition task consists of paragraphs chosen arbitrarily from a variety of sources, including newspapers, books, and magazines. The trigram language model parameters are the probabilities of words conditioned on the previous two words. The number of parameters to be estimated is enormous: 86 000³ in our case. Even a training set of 60 million words is too small to estimate these parameters reliably; relative-frequency estimates would assign a value of zero to a large fraction of them. Many algorithms have been proposed to estimate the probabilities of events not observed in the training text. We propose here a simple algorithm for estimating the probabilities of such events using Turing's formula. The resulting trigram language model reduces the acoustic recognition errors by 60%. We also show that the effectiveness of the trigram language model for correcting an acoustic word recognition error depends on whether or not the neighbouring word contexts occur in the training text corpus for the language model.
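The core of Turing's formula is to replace a raw count c with an adjusted count c* = (c + 1) N_{c+1} / N_c, where N_c is the number of distinct events observed exactly c times; the total probability mass reserved for unseen events is then N_1 / N. A minimal sketch of this idea, assuming events are held in a simple dictionary of counts (the function names here are illustrative, not from the paper):

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Adjust raw counts with Turing's formula:
    c* = (c + 1) * N_{c+1} / N_c,
    where N_c is the number of distinct events seen exactly c times.
    Falls back to the raw count when N_{c+1} is zero (a practical
    simplification; real implementations smooth the N_c values)."""
    freq_of_freq = Counter(counts.values())   # c -> N_c
    adjusted = {}
    for event, c in counts.items():
        n_c1 = freq_of_freq.get(c + 1, 0)
        adjusted[event] = (c + 1) * n_c1 / freq_of_freq[c] if n_c1 else float(c)
    return adjusted

def unseen_mass(counts):
    """Probability mass reserved for unseen events: N_1 / N."""
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / sum(counts.values())
```

For example, with counts {"a": 1, "b": 1, "c": 2}, two events occur exactly once (N_1 = 2) out of N = 4 total observations, so a probability mass of 2/4 = 0.5 is set aside for all trigrams never seen in training.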
