In principle, n-gram probabilities can be estimated from a large sample of text by counting the number of occurrences of each n-gram of interest and dividing by the size of the training sample. This method, known as the maximum likelihood estimator (MLE), is very simple. However, it is unsuitable because n-grams that do not occur in the training sample are assigned zero probability. This is qualitatively wrong for use as a prior model, because it would never allow such an n-gram, while clearly some of the unseen n-grams will occur in other texts. For non-zero frequencies, the MLE is quantitatively wrong. Moreover, at all frequencies, the MLE does not distinguish between bigrams that have the same frequency. We study two alternative methods. The first is an enhanced version of the method due to Good and Turing (I. J. Good [1953]. Biometrika, 40, 237–264). Under the modest assumption that the distribution of each bigram is binomial, Good provided a theoretical result that increases estimation accuracy. The second is an enhanced version of the deleted estimation method (F. Jelinek & R. Mercer [1985]. IBM Technical Disclosure Bulletin, 28, 2591–2594). It assumes even less, merely that the training and test corpora are generated by the same process. We emphasize three points about these methods. First, by using a second predictor of the probability in addition to the observed frequency, it is possible to estimate different probabilities for bigrams with the same frequency. We refer to this use of a second predictor as “enhancement.” With enhancement, we find 1200 significantly different probabilities (with a range of five orders of magnitude) for the group of bigrams not observed in the training text; the MLE method would not be able to distinguish any one of these bigrams from any other. The probabilities found by the enhanced methods agree quite closely, in qualitative comparisons, with the standard calculated from the test corpus. Second, the enhanced Good-Turing method provides accurate predictions of the variances of the standard probabilities estimated from the test corpus. Third, we introduce a refined testing method that enables us to measure the prediction errors directly and accurately and thus to study small differences between methods. We find that while the errors of both methods are small, owing to the large amount of data that we use, the enhanced Good-Turing method is three to four times as efficient in its use of data as the enhanced deleted estimation method and is therefore preferable. Both methods are much better than the MLE.
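To make the three estimators concrete, the sketch below shows the MLE, the basic (unenhanced, unsmoothed) Good-Turing adjusted count r* = (r + 1) N_{r+1} / N_r, and a simple one-direction held-out form of deleted estimation. It is a minimal illustration of the textbook versions of these techniques, not the enhanced estimators developed in the paper; all function and variable names are illustrative.

```python
from collections import Counter

def mle_bigram_probs(tokens):
    """MLE: count each bigram and divide by the total number of bigrams.
    Any bigram not seen in training implicitly gets probability zero."""
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def good_turing_adjusted_counts(counts):
    """Basic Good-Turing adjusted counts: r* = (r + 1) * N_{r+1} / N_r,
    where N_r is the number of distinct bigrams seen exactly r times.
    This is the textbook form of Good's result, without the smoothing of
    the N_r sequence or the 'enhancement' used in the paper."""
    freq_of_freq = Counter(counts.values())          # N_r
    adjusted = {}
    for bigram, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        # Fall back to the raw count when N_{r+1} is zero (sparse high r).
        adjusted[bigram] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted

def held_out_probs_by_frequency(tokens_a, tokens_b):
    """One direction of deleted (held-out) estimation: group bigrams by
    their frequency r in half A, then assign each bigram in a group the
    total count the group receives in half B, divided by N_r * |B|.
    Full deleted estimation averages the A->B and B->A directions."""
    counts_a = Counter(zip(tokens_a, tokens_a[1:]))
    counts_b = Counter(zip(tokens_b, tokens_b[1:]))
    group_sizes = Counter(counts_a.values())         # N_r in half A
    held_out_mass = Counter()
    for bigram, r in counts_a.items():
        held_out_mass[r] += counts_b.get(bigram, 0)  # mass seen in half B
    total_b = sum(counts_b.values())
    return {r: held_out_mass[r] / (group_sizes[r] * total_b)
            for r in group_sizes}
```

Unlike the MLE, both alternatives reserve probability mass for unseen bigrams: Good-Turing does so through the adjusted counts, and held-out estimation does so by observing how much mass the zero-frequency group actually receives in the second half of the data.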