Abstract

In speech recognition systems language models are used to estimate the probabilities of word sequences. In this paper special emphasis is given to numerals–words that express numbers. One reason for this is the fact that in a practical application a falsely recognized numeral can change important content information inside the sentence more than other types of errors. Standard \(n\)-gram language models can sometimes assign very different probabilities to different numerals, according to their relative frequencies in training corpus. Based on the assumption that some different numbers are more equally likely to occur, than what a standard \(n\)-gram language model estimates, this paper proposes several methods for sorting numerals into classes in an inflective language and language models based on these sorting techniques. We treat these classes as basic vocabulary units for the language model. We also expose the differences between the proposed language models and well known class-based language models. The presented approach is also transferable to other classes of words with similar properties, e.g. proper nouns. Results of experiments show that significant improvements are obtained on numeral-rich domains. Although numerals represent only a small portion of words in the test set, a relative reduction in word error rate of 1.4 % was achieved. Statistical significance tests were performed, which showed that these improvements are statistically significant. We also show that depending on the amount of numerals in a target domain the improvement in performance can grow up to 16 % relative.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.