Abstract
For many pattern recognition applications, including speech recognition and optical character recognition, prior models of language are used to disambiguate otherwise equally probable outputs. It is common practice to use tables of probabilities of single words, pairs of words, and triples of words (n-grams) as a prior model. Our research is directed at 'backing-off' methods, that is, methods that build an (n+1)-gram model from an n-gram model.

In principle, n-gram probabilities can be estimated from a large sample of text by counting the number of occurrences of each n-gram of interest and dividing by the size of the training sample. Unfortunately, this simple method, known as the maximum likelihood estimator (MLE), is unsuitable because n-grams which do not occur in the training text are assigned zero probability. In addition, the MLE does not distinguish among bigrams with the same frequency.

We study two alternative methods for estimating the frequency of a given bigram in a test corpus, given a training corpus. The first method is an enhanced version of the method due to Good and Turing (Good, 1953). Under the modest assumption that the distribution of each bigram is binomial, Good provided a theoretical result that increases estimation accuracy. The second method assumes even less, merely that the training and test corpora are generated by the same process. We refer to this purely empirical method as the Categorize-Calibrate (or Cat-Cal) method.

We emphasize three points about these methods. First, by using a second predictor of the probability in addition to the observed frequency, it is possible to estimate different probabilities for bigrams with the same frequency. We refer to this use of a second predictor as enhancement. With enhancement, we find 1200 significantly different probabilities (spanning five orders of magnitude) for the group of bigrams not observed in the training text; the MLE would not be able to distinguish any one of these bigrams from any other. Second, both methods provide estimated variances for the errors in estimating the n-gram probabilities. Third, the variances are used in a refined testing method that enables us to study small differences between methods. We find that Cat-Cal should be used when counts are very small; otherwise, the Good-Turing (GT) estimator is the method of choice.
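The abstract contrasts the maximum likelihood estimate (observed count divided by the size of the training sample) with the Good-Turing approach. The sketch below is a minimal, illustrative Python rendering of those two baseline ideas only: the MLE and the basic Good-Turing adjusted count r* = (r+1) N_{r+1} / N_r, where N_r is the number of distinct bigrams seen exactly r times. The toy corpus, function names, and the absence of any smoothing of the N_r counts are assumptions for illustration; this is not an implementation of the paper's enhanced Good-Turing or Cat-Cal methods.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def mle_probability(bigram, counts, total):
    """Maximum likelihood estimate: observed frequency / training-sample size.
    Assigns zero probability to any bigram not seen in training."""
    return counts.get(bigram, 0) / total

def good_turing_adjusted_count(r, freq_of_freqs):
    """Basic (unsmoothed) Good-Turing adjusted count: r* = (r+1) * N_{r+1} / N_r.
    Practical implementations smooth the N_r values before applying this,
    since N_{r+1} is often zero for larger r."""
    n_r = freq_of_freqs.get(r, 0)
    n_r_plus_1 = freq_of_freqs.get(r + 1, 0)
    if n_r == 0:
        return 0.0
    return (r + 1) * n_r_plus_1 / n_r

# Toy training sample (hypothetical, for illustration only).
training = "the cat sat on the mat the cat sat".split()
counts = bigram_counts(training)
total = sum(counts.values())

# N_r: how many distinct bigram types were observed exactly r times.
freq_of_freqs = Counter(counts.values())

# MLE: a seen bigram gets count/total; an unseen bigram gets exactly zero.
print(mle_probability(("sat", "on"), counts, total))   # 1/8
print(mle_probability(("cat", "mat"), counts, total))  # 0.0

# Good-Turing: a bigram seen r times gets probability r*/total, and the
# total mass reserved for all unseen bigrams is roughly N_1 / total.
r = counts[("sat", "on")]
print(good_turing_adjusted_count(r, freq_of_freqs) / total)
print(freq_of_freqs[1] / total)  # mass set aside for unseen bigrams
```

The key contrast the abstract draws is visible here: the MLE collapses every unseen bigram to zero and every equal-count bigram to the same value, whereas Good-Turing reassigns probability mass away from observed counts toward unseen events.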