Manual sorting of numerals in an inflective language for language modelling

Gregor Donaj, Zdravko Kačič

doi:10.1007/s10772-014-9231-y

Open Access

https://doi.org/10.1007/s10772-014-9231-y

Abstract

In speech recognition systems, language models are used to estimate the probabilities of word sequences. This paper gives special emphasis to numerals, i.e. words that express numbers. One reason is that, in a practical application, a falsely recognized numeral can change important content information in a sentence more than other types of errors. Standard \(n\)-gram language models can sometimes assign very different probabilities to different numerals, according to their relative frequencies in the training corpus. Based on the assumption that some numbers are more nearly equally likely to occur than a standard \(n\)-gram language model estimates, this paper proposes several methods for sorting numerals into classes in an inflective language, together with language models based on these sorting techniques. We treat these classes as basic vocabulary units for the language model. We also point out the differences between the proposed language models and well-known class-based language models. The presented approach is also transferable to other classes of words with similar properties, e.g. proper nouns. Experimental results show that significant improvements are obtained on numeral-rich domains. Although numerals represent only a small portion of the words in the test set, a relative reduction in word error rate of 1.4 % was achieved, and statistical significance tests showed that this improvement is statistically significant. We also show that, depending on the proportion of numerals in a target domain, the relative improvement in performance can reach up to 16 %.
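The idea of treating numeral classes as vocabulary units can be illustrated with a minimal sketch. It is not the authors' implementation: the class labels (`<NUM_CARD>`, `<NUM_ORD>`), the word-to-class table, and the toy corpus are all hypothetical, and English words are used only for readability, whereas the paper targets an inflective language where the sorting criteria would also have to account for inflected forms. The sketch only shows how replacing numerals with class tokens pools their \(n\)-gram counts, so that a numeral unseen in a given context no longer gets a zero count there.

```python
from collections import Counter

# Hypothetical numeral classes; the actual sorting criteria used in the
# paper are not given in the abstract.
NUMERAL_CLASSES = {
    "two": "<NUM_CARD>",
    "three": "<NUM_CARD>",
    "seven": "<NUM_CARD>",
    "second": "<NUM_ORD>",
    "third": "<NUM_ORD>",
}


def map_to_classes(tokens):
    """Replace each numeral with its class token; keep other words as-is."""
    return [NUMERAL_CLASSES.get(token, token) for token in tokens]


def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


# Toy training text (an assumption made purely for illustration).
corpus = "we need two tickets and two seats and seven bags".split()

word_counts = bigram_counts(corpus)
class_counts = bigram_counts(map_to_classes(corpus))

# Word-level counts differ per numeral, and "three" was never seen after "and".
print(word_counts[("and", "two")], word_counts[("and", "three")])
# Class-level counts pool the evidence from all cardinal numerals.
print(class_counts[("and", "<NUM_CARD>")])
```

In a complete model, the probability of a specific numeral would still have to be recovered from the class probability, for example by distributing it over the class members; how the paper does this is not stated in the abstract.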

Full Text